Mining Mail

Simon Cozens

In one of my earlier articles, “Filtering Mail with Mail::Audit and News::Gateway” (TPJ 5.2), we discussed a relatively simple way to help manage email: filtering into mail folders and gatewaying to private newsgroups. This article discusses the “next generation” of mail handling, mail mining, and demonstrates the utility of a Mail::Miner setup.

Data Mining

The “formal” definition of data mining from the Free Online Dictionary of Computing states that it is “analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data”. However, when we use it in this article, we use it in a much looser sense - data mining is the automated extraction of core pieces of information from a mass of data, and the filing of that information such that querying and retrieval are made relatively easy.

One convenient source of a massive corpus of data suitable for data mining is the mass of email that arrives at our system every day.

Email is a surprisingly interesting data format. It contains a lot of structured, regular data, in the form of mail headers, which is easy for a computer to parse. Unfortunately, the utility of the mail headers is pretty hit and miss. While things like “To” and “Subject” are always going to be important, in the vast majority of cases, many of the headers are almost useless when the message has been delivered, filed and read.

The structure of the body is also relatively easy to parse; there may be binary or textual attachments, which may or may not contain useful data. Finally, there’s usually one reasonably large unstructured part, the textual body of the message.

Just like the mail headers, this is pretty hit and miss too. There could be things that we will want to remember later: phone numbers, dates, place names, addresses, snippets of code, and so on. But interspersed with that, we find line after line of small talk, signature files, flames, and all kinds of other non-information which is almost useless when the message has been delivered, filed and read.

The idea behind mail mining is to provide a means for separating the wheat from the chaff. I want to be able to find out where and when I’m meeting someone, without much concern for the state of the weather in western Japan one particular Friday afternoon. So our goal, then, is to produce a mechanism for extracting useful information from both the structured and unstructured portions of a mail message, preferably without human intervention, and provide a means for retrieving that information quickly and easily.

To put it in extremely human terms, I want a tool which lets me say “Show me the mail I got around three weeks ago from Nat which was something to do with web services and had an interesting snippet of code in it”.

And this is precisely what the Mail::Miner module does.

The Mail::Miner Method

Mail::Miner, as its name implies, is a module, rather than a complete application. To be precise, it’s a collection of modules, arranged in the following framework:

[Figure: the Mail::Miner framework diagram, http://ddtm.simon-cozens.org/~simon/mailminer.png]

The mail comes in at the top of the diagram, and is converted by your Mail::Audit filter into a MIME::Entity, which is handed to Mail::Miner::Message by whatever your delivery process happens to be. Right at that moment, an entry is created for the email in a relational database, storing the from address, subject, and other useful but trivial metadata. Any attachments are stripped off and filed separately in the database, associated with the mail message in question. A notification is added to the body of the email, of the form:

	[ text/x-perl attachment test.pl detached - use
	        mm --detach 821
	 to recover ]

Notice the format of this text: to retrieve the attachment, I just cut and paste the middle line onto a shell prompt, and the attachment will be dumped into the current directory. (If there’s already a test.pl there, mm, the Mail::Miner command line utility, will prompt before overwriting.)
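To make the notice format concrete, here is a minimal sketch of how such a placeholder might be built. The helper name detach_notice and its arguments are hypothetical and are not taken from Mail::Miner itself; only the layout of the text follows the example above.

```perl
use strict;
use warnings;

# Hypothetical helper (not Mail::Miner's own code): build the notice that
# stands in for a detached attachment. $type is the MIME type, $name the
# original filename and $id the number the attachment was filed under.
sub detach_notice {
    my ($type, $name, $id) = @_;
    return "[ $type attachment $name detached - use\n"
         . "        mm --detach $id\n"      # the line meant for cut and paste
         . " to recover ]\n";
}

print detach_notice('text/x-perl', 'test.pl', 821);
```

Keeping the command on a line of its own is what makes the cut-and-paste trick work.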

Then Mail::Miner locates and calls any Mail Miner Recognisers. We’ll come back to those in a second.

After this point, the mail, with its newly flattened body, can be filed into the database. Once that’s done, the message can leave the Mail Miner system, ready for delivery to the user’s inbox.

Notice that here, even with no cleverness, we have a system for managing attachments, plus a searchable database of old email. However, the real power of Mail Miner comes in its recognisers.

Recognisers and Queries

Recognisers are, simply, modules that look for things that may be considered interesting in an incoming message (“assets”), file them away for later, and provide an interface to query for them.

Let’s take a tour of the currently implemented Mail::Miner recognisers.

The first recogniser isn’t really a recogniser at all, but it does provide a query mechanism - the Mail::Miner::Message module itself can be used to query the From address of messages in the database. Recognisers declare command line options that they can fulfill, so I can now say

 % mm --from ddj.com --summary
 26 matched
 766:2002-03-29:       Kevin Carlson <kcarlson@ddj.com>:Dr. Dobb’s Journal
 768:2002-03-29:       Kevin Carlson <kcarlson@ddj.com>:Re: Dr. Dobb’s Journal 
 769:2002-03-29:       Kevin Carlson <kcarlson@ddj.com>:Re: Dr. Dobb’s Journal 
 770:2002-03-29:       Kevin Carlson <kcarlson@ddj.com>:Re: Dr. Dobb’s Journal 
 783:2002-04-03:       Kevin Carlson <kcarlson@ddj.com>:Re: Dr. Dobb’s Journal 
 825:2002-04-17:             Rosalyn Lum <rlum@ddj.com>:Dr. Dobb’s Journal
 …

If I hadn’t specified the summary option, mm would have returned a dump of all of the above messages in a Unix mailbox format - this gives us virtual folders, a la Evolution. (http://www.ximian.com/products/evolution/)

Let’s add some more intelligent recognisers. Now, I’m a firm believer in 80% solutions; most of the time, the effort required to make an algorithm “perfect” isn’t worth it. Edge cases are, well, edge cases, and if you don’t expect the algorithm to get everything right, getting it right 80% of the time is just fine.

So when I say “intelligent” recognisers, I’m not referring to artificial intelligence; in fact, what the recognisers do is more along the lines of artificial stupidity, leaving the human (who is supposed to have some sort of natural intelligence) to do some elementary top-level filtering. Computers are good at grinding data, so we’ll leave them to do that, and humans are good at top-level filtering, so we’ll leave them to do that.

Think of the recognisers, then, as a production line of trained monkeys. When these monkeys find something interesting or shiny, they file it away in the database, as an “asset”. Assets know which mail message they were found in, and which monkey discovered them.

For instance, when a recogniser attempts to discover any phone numbers in a message, it throws up anything that it can find that looks even remotely like a phone number. This naturally produces one or two false positives - although not that many, since most of the long sequences of numbers, parentheses and hyphens found in mail messages turn out to be phone numbers anyway - but that’s Officially OK by the Mail Miner design philosophy. After all, if you can very quickly scan through 500 megabytes of email and produce three candidate phone numbers for someone, two of which are obviously bogus, I’d call that a sufficiently big win.
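A deliberately stupid monkey of this kind might look something like the following sketch. The function name and the "at least seven digits" rule are assumptions made for illustration; this is not the pattern the real phone number recogniser uses.

```perl
use strict;
use warnings;

# Illustrative only: anything built from digits, spaces, parentheses, dots
# and hyphens that contains at least seven digits is treated as a candidate.
sub find_phone_candidates {
    my ($text) = @_;
    my @found;
    while ($text =~ /(\+?[\d(][\d ().-]{5,}\d)/g) {
        my $candidate = $1;
        my $digits = () = $candidate =~ /\d/g;   # count the digits
        push @found, $candidate if $digits >= 7;
    }
    return @found;
}

my $body = <<'END';
Give me a call on (555) 123-4567 when you land.
The build number is 20021031-1858, by the way.
END

print "$_\n" for find_phone_candidates($body);
```

Run on this sample, the sketch reports both the real number and the build number: exactly the sort of false positive that a human can dismiss at a glance.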

The phone number recogniser is actually a slightly interesting example, because using it alters the output format of mm. Essentially, there are two types of recogniser: those which help to find particular messages, and those which actually store “hard” information. The former type of recogniser produces a mailbox full of messages that match; the latter type of recogniser just dumps out the “asset” in question.

What does this mean in practice? Well, when you’re asking mm about phone numbers, it’s more than likely that you don’t want messages containing phone numbers, but you actually want to get at the numbers themselves. So, for instance, if I ask Mail::Miner for Tim O’Reilly’s phone number:

  % mm --from "Tim O'Reilly" --phone
 Phone numbers found in message 2863 from “Tim O’Reilly” <tim@oreilly.com>:
 (555) 123-4567

(Notice there that I’ve used both the simple “From” query tool and the query tool provided by the phone number recogniser to form an additive filter.)

Of course, if I want to check this out, I can get a copy of the message in question:

  % mm --id 2863
  From mail-miner-2863@localhost Thu Oct 31 18:58:53 2002
  Received: from rock.oreilly.com ([209.204.146.34] helo=smtp.oreilly.com)
  …

Or just find out what the mail was actually about:

 % mm --id 2863 --summary
 1 matched
 2863:2002-08-31:       “Tim O’Reilly” <tim@oreilly.com>:Re: Mail::Miner update

And speaking of what mails are about, let’s move on to another recogniser, the keywords recogniser.

One interesting aspect of Mail::Miner from my point of view is that it’s sparked the development of a few other neat little modules. The first came from noticing that, as discussions tend to drift, a once-relevant Subject: line can become completely irrelevant to a particular email. How do I find important details about a forthcoming business trip if they’re hidden in a thread entitled “Return from caller”?

To solve this problem, I came up with the amazingly simple Lingua::EN::Keywords module to extract a set of salient keywords from a block of text. The keywords recogniser runs this module over a mail message, files each keyword returned from Lingua::EN::Keywords as an asset attached to the message, and provides the (synonymous) --about and --keyword command line query functions.
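The shape of that recogniser can be sketched without the real module. In the example below, naive_keywords is a crude stand-in (word frequency minus a small stopword list) and is not the algorithm Lingua::EN::Keywords uses. The function names and the stopword list are assumptions; the asset fields follow the message, recogniser and asset columns shown in the database section later in this article.

```perl
use strict;
use warnings;

# Stand-in for Lingua::EN::Keywords: rank words by how often they occur,
# ignoring short words and a small, made-up stopword list.
my %stop = map { $_ => 1 } qw(
    the and for are was this that with you your our from have has will not
);

sub naive_keywords {
    my ($text, $limit) = @_;
    my %count;
    for my $word (map { lc($_) } $text =~ /([A-Za-z']+)/g) {
        next if length($word) < 3 or $stop{$word};
        $count{$word}++;
    }
    # Most frequent first; ties broken alphabetically for stable output.
    my @ranked = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
    splice @ranked, $limit if @ranked > $limit;
    return @ranked;
}

# File each keyword as an asset that remembers which message it came from
# and which recogniser found it.
sub keyword_assets {
    my ($message_id, $text) = @_;
    return map { +{ message => $message_id, recogniser => 'keywords', asset => $_ } }
           naive_keywords($text, 5);
}

my @assets = keyword_assets(2863, "Flights to Tokyo are booked; the Tokyo hotel is not.");
print "$_->{message}: $_->{asset}\n" for @assets;
```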

The other recogniser which is currently implemented is Mail::Miner::Recogniser::Address; this recognises physical addresses, by looking for something that looks a bit like a postcode or US zip code and state, and filing the entire paragraph.
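As a rough illustration of the approach, the sketch below splits a message body into paragraphs and keeps any paragraph containing something shaped like a UK postcode or a US state abbreviation followed by a zip code. The two patterns and the function name are assumptions for the example, not the ones Mail::Miner::Recogniser::Address actually uses.

```perl
use strict;
use warnings;

# Loose, illustrative patterns: a UK-style postcode such as "OX1 2JD", or
# two capital letters followed by a five-digit (or ZIP+4) code.
my $uk_postcode  = qr/\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b/;
my $us_state_zip = qr/\b[A-Z]{2}\s+\d{5}(?:-\d{4})?\b/;

# Return every paragraph (text separated by blank lines) that matches.
sub find_address_paragraphs {
    my ($body) = @_;
    my @paragraphs = split /\n\s*\n/, $body;
    return grep { /$uk_postcode|$us_state_zip/ } @paragraphs;
}

my $body = <<'END';
Hi Simon,

Send the contract to:
Example Press
123 Main Street
Springfield, IL 62701

See you soon.
END

print "$_\n" for find_address_paragraphs($body);
```

Filing the whole paragraph rather than trying to parse the address is the 80% solution again: the human reading the result can work out which lines matter.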

Database Digression

There now follows a technical digression that you may skip if you’re not particularly interested in reading a rant about relational databases.

As we’ve seen above with --from … --address, queries can be combined. As an interesting technical point, I wanted each query to be represented by a single SQL statement for efficiency. How this works in practice is that we generate SELECT * FROM messages WHERE and each recogniser takes its argument, generates a suitable WHERE clause, and all the WHERE clauses get ANDed together.

When the SELECT statement returns a set of messages, another function in each of the recognisers is called to allow them to post-process and filter out inapplicable messages on a more fine grained basis than can be achieved with raw SQL alone.

This is an elegant and efficient design.
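The two-stage scheme just described, one combined SELECT followed by per-recogniser post-processing, might be sketched like this. Everything here is hypothetical: the recogniser table, the from_address and subject column names, and the build_query and post_process functions are invented for illustration and do not reflect Mail::Miner's real internals.

```perl
use strict;
use warnings;

# Hypothetical recognisers: each turns its command line argument into a WHERE
# fragment plus bind values, and may offer a finer-grained post-filter.
my %recogniser = (
    from => {
        where => sub { ('from_address LIKE ?', '%' . $_[0] . '%') },
    },
    subject => {
        where       => sub { ('subject LIKE ?', '%' . $_[0] . '%') },
        # SQL LIKE cannot insist on whole words, so check that afterwards.
        post_filter => sub { my ($arg, $msg) = @_; $msg->{subject} =~ /\b\Q$arg\E\b/i },
    },
);

# AND together the WHERE clauses of every recogniser that was invoked.
sub build_query {
    my (%options) = @_;
    my (@clauses, @binds);
    for my $name (sort keys %options) {
        my ($clause, @values) = $recogniser{$name}{where}->($options{$name});
        push @clauses, "($clause)";
        push @binds, @values;
    }
    my $sql = 'SELECT * FROM messages';
    $sql .= ' WHERE ' . join(' AND ', @clauses) if @clauses;
    return ($sql, @binds);
}

# Let each recogniser weed out rows that raw SQL could not exclude.
sub post_process {
    my ($options, @messages) = @_;
    for my $name (sort keys %$options) {
        my $filter = $recogniser{$name}{post_filter} or next;
        @messages = grep { $filter->($options->{$name}, $_) } @messages;
    }
    return @messages;
}

my ($sql, @binds) = build_query(from => 'ddj.com', subject => 'journal');
print "$sql\n";
```

In a real program the statement and its bind values would be handed to DBI's prepare and execute, and the rows that come back would then be passed through post_process.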

Unfortunately, most of the interesting recognisers are looking for messages which contain assets of a particular form. Hence, the WHERE clause that they return is actually a subselect along the lines of

  EXISTS (
     SELECT * FROM assets
     WHERE message = messages.id
       AND recogniser = 'me'
       AND asset LIKE '%something%'
  )

Of course, the most popular open source relational database does not support subselects. Hence, for the moment, Mail::Miner requires the second most popular open source relational database, PostgreSQL, until either MySQL gets its act together or I am persuaded that there’s an equally elegant design that doesn’t use subselects.

It’s about the user, stupid

In the grand tradition of this sort of article, I shall talk at length about currently unimplemented features as though they were up and running.

The other module sparked by the Mail::Miner project came out of the need to express timeframes in a human manner. As well as being a big fan of 80% solutions, I’m also a big fan of fuzzy input. Fuzzy input means that the computer, which has an awful lot of processing power when it comes to precise operations, tries its best to understand the human, who isn’t all that great at precise operations. This requires admitting that computer programs are there for the user’s benefit and not the programmer’s, but that’s a different story and will be told a different time.

So, given that the whole premise of Mail::Miner is that I only vaguely know what I’m looking for, I don’t want to have to specify explicit dates and times in order to narrow down searches. If I knew the date and time of the email, I wouldn’t need Mail::Miner in the first place!

Instead, I want to be able to say “find me the email I got from Adam sometime around a week ago”. The Date::PeriodParser module was written to solve this problem: given a “fuzzy” date expressed in English, produce a pair of Unix time values which are likely to bracket the date.

For instance, as I write this on a Thursday night:

  % perl -MDate::PeriodParser -le 'print scalar localtime $_ for
    parse_period("around the morning of the day before yesterday")'

  Mon Oct 28 22:00:00 2002
  Tue Oct 29 14:00:00 2002

“Around” the morning of the day before yesterday translates to a period between very late on Monday evening and early Tuesday afternoon. Mail Miner’s --date option will give an interface to this.

It’s not just about the user

So far in our explorations of Mail::Miner, we’ve seen how it can be used to extract salient information from the incoming email of a single user. However, if the system were to be deployed across an organisation with multiple users, this would require each user to have their own database.

Or would it? The natural extension to filing information about how you communicate by email is to file information about how an organisation communicates internally and with its clients.

A planned future phase of Mail::Miner is to have it sit on the main mail gateway to a company and file every single incoming and outgoing email. From this simple idea, we can develop a customer relationship management utility - now you know where your clients live, what they do, and, more importantly, who’s been speaking to them about what.

So what next?

The future of Mail::Miner lies in three distinct developments: the first is the development of more specific recognisers; the second, in developing this idea of Mail::Miner as an organisation-wide tool; thirdly, the extension of Mail::Miner from being purely a search and retrieval tool to being an integrated part of an email client.

In the first category, I intend to work on recognisers that detect and extract place names, dates and times, code snippets, human languages used in an email, and much more.

In the second, I see far more possibilities. Adding a user-defined asset recogniser and asset management tool will allow you to specify more accurately the details that you want to record from an email; combine this with the idea of restructuring the assets system so that it can be applied either to a particular email or a particular recipient, and you have a system which can store the fact that a particular client has expressed an interest in playing golf, or is going on holiday for the next two weeks, or many other things which a computer cannot pick out, no matter how many monkeys are involved.

Similarly, there could be recognisers which attempt to divine the relationships between correspondents on an email message - if I always Cc Joe when I email Amy, then Mail::Miner should take notice of this.
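Since no such recogniser exists yet, the following is purely a speculative sketch of one way the Cc habit could be spotted: count how often each address is copied in on mail to each recipient, and report the pairings that hold every time. The function name, the message structure and the minimum-message threshold are all assumptions.

```perl
use strict;
use warnings;

# Speculative sketch: messages are hashes with 'to' and 'cc' array
# references. Report, for each recipient with at least $min_messages
# messages, the addresses that were copied in on every one of them.
sub habitual_cc {
    my ($min_messages, @messages) = @_;
    my (%sent, %copied);
    for my $msg (@messages) {
        for my $to (@{ $msg->{to} }) {
            $sent{$to}++;
            $copied{$to}{$_}++ for @{ $msg->{cc} || [] };
        }
    }
    my %habit;
    for my $to (keys %copied) {
        next if $sent{$to} < $min_messages;
        my @always = grep { $copied{$to}{$_} == $sent{$to} } sort keys %{ $copied{$to} };
        $habit{$to} = \@always if @always;
    }
    return %habit;
}

my %habit = habitual_cc(2,
    { to => ['amy@example.com'], cc => ['joe@example.com'] },
    { to => ['amy@example.com'], cc => ['joe@example.com', 'bob@example.com'] },
    { to => ['bob@example.com'], cc => [] },
);
printf "%s: %s\n", $_, join(', ', @{ $habit{$_} }) for sort keys %habit;
```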

This ties in with the third category, which combines the data retrieval capabilities of Mail::Miner with the ordinary email client. I’ve already intimated that Mail::Miner can be used to generate virtual mail folders - imagine if your favourite email client could search read mail based on language or a vague description of when you read it! On this front, I expect to work on IMAP proxies which can use an ordinary mail client to perform Mail::Miner searches, as well as integration with some of the more common Unix mail clients.

However, for me, Mail::Miner is and always has been a way to make sure I never need to remember anything again - now, who do I have to send this article to? I’m sure I got an email from someone at ddj.com a week or so ago…

