Simon Cozens
In one of my earlier articles, "Filtering Mail with Mail::Audit and
News::Gateway" (TPJ 5.2), we discussed a relatively simple way to
help manage email, by filtering into mail folders and gatewaying to
private news groups. This article discusses the "next generation" of
mail handling, mail mining, and demonstrates the utility of a
Mail::Miner setup.
Data Mining
The "formal" definition of data mining from the Free Online Dictionary of Computing states that it is "analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data". However, when we use it in this article, we use it in a much looser sense - data mining is the automated extraction of core pieces of information from a mass of data, and its filing such that querying and retrieval are made relatively easy.
One convenient source of a massive corpus of data suitable for data mining is the mass of email that arrives at our system every day.
Email is a surprisingly interesting data format. It contains a lot of structured, regular data, in the form of mail headers, which is easy for a computer to parse. Unfortunately, the utility of the mail headers is pretty hit and miss. While things like "To" and "Subject" are always going to be important, in the vast majority of cases, many of the headers are almost useless when the message has been delivered, filed and read.
The structure of the body is also relatively easy to parse; there may be binary or textual attachments, which may or may not contain useful data. Finally, there’s usually one reasonably large unstructured part, the textual body of the message.
Just like the mail headers, this is pretty hit and miss too. There could be things that we will want to remember later: phone numbers, dates, place names, addresses, snippets of code, and so on. But interspersed with that, we find line after line of small talk, signature files, flames, and all kinds of other non-information which is almost useless when the message has been delivered, filed and read.
The idea behind mail mining is to provide a means for separating the wheat from the chaff. I want to be able to find out where and when I’m meeting someone, without much concern for the state of the weather in western Japan one particular Friday afternoon. So our goal, then, is to produce a mechanism for extracting useful information from both the structured and unstructured portions of a mail message, preferably without human intervention, and provide a means for retrieving that information quickly and easily.
To put it in extremely human terms, I want a tool which lets me say "Show me the mail I got around three weeks ago from Nat which was something to do with web services and had an interesting snippet of code in it".
And this is precisely what the Mail::Miner module does.
The Mail::Miner Method
Mail::Miner, as its name implies, is a module, rather than a
complete application. To be precise, it’s a collection of modules,
arranged in the following framework:
[Figure: the Mail::Miner framework, http://ddtm.simon-cozens.org/~simon/mailminer.png]
The mail comes in at the top of the diagram, and is converted by your
Mail::Audit filter into a MIME::Entity, which is handed to
Mail::Miner::Message by whatever your delivery process happens to
be. Right at that moment, an entry is created for the email in a
relational database, storing the from address, subject, and other
useful but trivial metadata. Any attachments are stripped off and
filed separately in the database, associated with the mail message in
question. A notification is added to the body of the email, of the
form:
[ text/x-perl attachment test.pl detached - use
mm --detach 821
to recover ]
Notice the format of this text: to retrieve the attachment, I just cut
and paste the middle line onto a shell prompt, and the attachment will
be dumped into the current directory. (If there’s already a test.pl
there, mm, the Mail::Miner command line utility, will prompt
before overwriting.)
Then Mail::Miner locates and calls any Mail Miner
Recognisers. We’ll come back to those in a second.
After this point, the mail, with its newly flattened body, can be filed into the database. Once that’s done, the message can leave the Mail Miner system, ready for delivery to the user’s inbox.
Notice that here, even with no cleverness, we have a system for managing attachments, plus a searchable database of old email. However, the real power of Mail Miner comes in its recognisers.
Recognisers and Queries
Recognisers are, simply, modules that look for things that may be considered interesting in an incoming message ("assets"), file them away for later, and provide an interface to query for them.
Let’s take a tour of the currently implemented Mail::Miner
recognisers.
The first recogniser isn’t really a recogniser at all, but it does
provide a query mechanism - the Mail::Miner::Message module itself
can be used to query the From address of messages in the
database. Recognisers declare command line options that they can
fulfill, so I can now say
% mm --from ddj.com --summary
26 matched
766:2002-03-29: Kevin Carlson:Dr. Dobb's Journal
768:2002-03-29: Kevin Carlson :Re: Dr. Dobb's Journal
769:2002-03-29: Kevin Carlson :Re: Dr. Dobb's Journal
770:2002-03-29: Kevin Carlson :Re: Dr. Dobb's Journal
783:2002-04-03: Kevin Carlson :Re: Dr. Dobb's Journal
825:2002-04-17: Rosalyn Lum :Dr. Dobb's Journal
…
If I hadn’t specified the summary option, mm would have returned a
dump of all of the above messages in a Unix mailbox format - this
gives us virtual folders, a la Evolution.
(http://www.ximian.com/products/evolution/)
Let’s add some more intelligent recognisers. Now, I’m a firm believer in 80% solutions; most of the time, the effort required to make an algorithm "perfect" isn’t worth it. Edge cases are, well, edge cases, and if you don’t expect the algorithm to get everything right, 80% of the time is just fine.
So when I say "intelligent" recognisers, I’m not referring to artificial intelligence; in fact, what the recognisers do is more along the lines of artificial stupidity, leaving the human (who is supposed to have some sort of natural intelligence) to do some elementary top-level filtering. Computers are good at grinding data, so we’ll leave them to do that, and humans are good at top-level filtering, so we’ll leave them to do that.
Think of the recognisers, then, as a production line of trained monkeys. When these monkeys find something interesting or shiny, they file it away in the database, as an "asset". Assets know which mail message they were found in, and which monkey discovered them.
For instance, when a recogniser attempts to discover any phone numbers in a message, it throws up anything that it can find that looks even remotely like a phone number. This naturally produces one or two false positives - although not that many, since most of the long sequences of numbers, parentheses and hyphens found in mail messages turn out to be phone numbers anyway - but that’s Officially OK by the Mail Miner design philosophy. After all, if you can very quickly scan through 500 megabytes of email and produce three candidate phone numbers for someone, two of which are obviously bogus, I’d call that a sufficiently big win.
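To give a flavour of how little cleverness this takes, here is a minimal sketch of such a deliberately loose matcher. It is not the code that Mail::Miner actually ships; the function name find_phone_candidates and the seven-digit threshold are assumptions made purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not Mail::Miner's own code: return every run of
# digits, spaces, dots, hyphens and parentheses that contains at least
# seven digits. It is loose on purpose, so false positives are expected.
sub find_phone_candidates {
    my ($text) = @_;
    my @found;
    while ($text =~ /(\+?[\d(][\d\s().-]{5,}\d)/g) {
        my $candidate = $1;
        my $digits = () = $candidate =~ /\d/g;    # count the digits
        push @found, $candidate if $digits >= 7;
    }
    return @found;
}

my $body = <<'END';
Call me on (555) 123-4567 before the 12th.
The order number is 42-17.
END

print "$_\n" for find_phone_candidates($body);
```

Run against the sample body, this picks out (555) 123-4567 and ignores the short order number. Anything stricter would start to miss real numbers written in unusual styles, which is exactly the trade that the 80% philosophy prefers to avoid.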
The phone number recogniser is actually a slightly interesting
example, because using it alters the output format of
mm. Essentially, there are two types of recogniser: those which
help to find particular messages, and those which actually store
"hard" information. The former type of recogniser produces a mailbox
full of messages that match; the latter type of recogniser just dumps
out the "asset" in question.
What does this mean in practice? Well, when you’re asking mm about
phone numbers, it’s more than likely that you don’t want
messages containing phone numbers, but you actually want to get at
the numbers themselves. So, for instance, if I ask Mail::Miner for Tim
O’Reilly’s phone number:
% mm --from "Tim O'Reilly" --phone
Phone numbers found in message 2863 from "Tim O'Reilly":
(555) 123-4567
(Notice there that I’ve used both the simple "From" query tool and the query tool provided by the phone number recogniser to form an additive filter.)
Of course, if I want to check this out, I can get a copy of the message in question:
% mm --id 2863
From mail-miner-2863@localhost Thu Oct 31 18:58:53 2002
Received: from rock.oreilly.com ([209.204.146.34] helo=smtp.oreilly.com)
…
Or just find out what the mail was actually about:
% mm --id 2863 --summary
1 matched
2863:2002-08-31: "Tim O'Reilly":Re: Mail::Miner update
And speaking of what mails are about, let’s move on to another recogniser, the keywords recogniser.
One interesting aspect of Mail::Miner from my point of view is that
it’s sparked the development of a few other neat little modules. The
first came from noticing that, as discussions tend to drift, a
once-relevant Subject: line can become completely irrelevant to a
particular email. How do I find important details about a forthcoming
business trip if they’re hidden in a thread entitled "Return from caller"?
To solve this problem, I came up with the amazingly simple
Lingua::EN::Keywords module to extract a set of salient keywords
from a block of text. The keywords recogniser runs this module over a
mail message, files each keyword returned from Lingua::EN::Keywords
as an asset attached to the message, and provides the (synonymous)
--about and --keyword command line query functions.
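The shape of that recogniser can be sketched as follows. Note that this is a stand-in, not Lingua::EN::Keywords: it simply counts non-stop words, which is far cruder than whatever that module does internally. The function names, the stop list, and the hash layout of an asset are all assumptions for the sake of the example.

```perl
use strict;
use warnings;

# A tiny stop list; a real one would be far longer.
my %stop = map { $_ => 1 } qw(
    the a an and or but of to in on for with is are was were be been
    it this that you we they my your our at from as by not have has
);

# Hypothetical stand-in for a keyword extractor: return up to $max of the
# most frequent words of three or more letters that are not stop words.
sub naive_keywords {
    my ($text, $max) = @_;
    my %count;
    for my $word ($text =~ /\b([A-Za-z]{3,})\b/g) {
        my $lower = lc $word;
        next if $stop{$lower};
        $count{$lower}++;
    }
    # Most frequent first; ties broken alphabetically for stable output.
    my @ranked = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    $max = @ranked if $max > @ranked;
    return @ranked[0 .. $max - 1];
}

# File each keyword as an asset record tied to a message ID.
sub keyword_assets {
    my ($message_id, $text, $max) = @_;
    return map { +{ message => $message_id, recogniser => 'keywords', asset => $_ } }
           naive_keywords($text, $max);
}

my $text = 'The flight to Tokyo leaves Friday. Hotel near the Tokyo office '
         . 'is booked; flight details attached.';
print join(', ', naive_keywords($text, 2)), "\n";
```

For the sample text the two top keywords are flight and tokyo, whatever the Subject: line happened to say.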
The other recogniser which is currently implemented is
Mail::Miner::Recogniser::Address; this recognises physical
addresses, by looking for something that looks a bit like a postcode
or US zip code and state, and filing the entire paragraph.
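As a rough sketch of that idea, the following splits a message body into paragraphs and keeps any paragraph containing something postcode-shaped. The two patterns and the function name find_address_paragraphs are my own assumptions here, not the ones inside Mail::Miner::Recogniser::Address, and they are loose by design.

```perl
use strict;
use warnings;

# Loose patterns, assumed for illustration only.
my $uk_postcode  = qr/\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b/;   # e.g. OX1 2JD
my $us_state_zip = qr/\b[A-Z]{2}\s+\d{5}(?:-\d{4})?\b/;         # e.g. IL 62701

# Hypothetical helper: return every paragraph that contains something
# resembling a UK postcode or a US state abbreviation followed by a zip code.
sub find_address_paragraphs {
    my ($body) = @_;
    my @paragraphs = split /\n\s*\n/, $body;
    return grep { /$uk_postcode/ || /$us_state_zip/ } @paragraphs;
}

my $body = <<'END';
Hi Simon,

Send the proofs to:
12 Example Street
Springfield, IL 62701

See you at the conference.
END

print "$_\n" for find_address_paragraphs($body);
```

Filing the whole paragraph rather than trying to parse out street, town and code keeps the recogniser simple, and leaves the human to read the three or four lines that come back.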
Database Digression
There now follows a technical digression that you may skip if you’re not particularly interested in reading a rant about relational databases.
As we’ve seen above with --from … --phone, queries can be
combined. As an interesting technical point, I wanted each query to be
represented by a single SQL statement for efficiency. How this works
in practice is that we generate SELECT * FROM messages WHERE and
each recogniser takes its argument, generates a suitable WHERE
clause, and all the WHERE clauses get ANDed together.
When the SELECT statement returns a set of messages, another
function in each of the recognisers is called to allow them to
post-process and filter out inapplicable messages on a more fine
grained basis than can be achieved with raw SQL alone.
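The first half of that process, assembling one statement from several recognisers, can be sketched like this. The %recognisers table, the build_query function and the from_address column are hypothetical names chosen for the example; only the messages and assets tables and the ANDing of clauses come from the description above.

```perl
use strict;
use warnings;

# Hypothetical recognisers: each command line option maps to a code ref
# that returns a WHERE fragment followed by any bind values it needs.
my %recognisers = (
    from => sub {
        my ($arg) = @_;
        return ('from_address LIKE ?', "%$arg%");
    },
    phone => sub {
        return 'EXISTS (SELECT 1 FROM assets'
             . ' WHERE assets.message = messages.id'
             . " AND assets.recogniser = 'phone')";
    },
);

# Build a single SELECT by ANDing together every recogniser's clause.
sub build_query {
    my (%options) = @_;
    my (@clauses, @binds);
    for my $name (sort keys %options) {
        my $maker = $recognisers{$name}
            or die "No recogniser handles --$name\n";
        my ($clause, @values) = $maker->($options{$name});
        push @clauses, "($clause)";
        push @binds, @values;
    }
    @clauses = ('TRUE') unless @clauses;    # no options: match everything
    my $sql = 'SELECT * FROM messages WHERE ' . join(' AND ', @clauses);
    return ($sql, @binds);
}

my ($sql, @binds) = build_query(from => 'ddj.com', phone => 1);
print "$sql\n";
print "bind: $_\n" for @binds;
```

Passing the user's input as bind values rather than interpolating it into the SQL is a choice made in this sketch; the post-processing pass would be a second code ref per recogniser, run over the rows that come back.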
This is an elegant and efficient design.
Unfortunately, most of the interesting recognisers are looking for
messages which contain assets of a particular form. Hence, the
WHERE clause that they return is actually a subselect along the
lines of
EXISTS (
    SELECT * FROM assets
    WHERE message = messages.id
    AND recogniser = 'me'
    AND asset LIKE '%something%'
)
Of course, the most popular open source relational database does not
support subselects. Hence, for the moment, Mail::Miner requires the
second most popular open source relational database, PostgreSQL, until
either MySQL gets its act together or I am persuaded that there’s an
equally elegant design that doesn’t use subselects.
It’s about the user, stupid
In the grand tradition of this sort of article, I shall talk at length about currently unimplemented features as though they were up and running.
The other module sparked by the Mail::Miner project came out of the
need to express timeframes in a human manner. As well as being a big
fan of 80% solutions, I’m also a big fan of fuzzy input. Fuzzy input
means that the computer, which has an awful lot of processing power
when it comes to precise operations, tries its best to understand the
human, who isn’t all that great at precise operations. This requires
admitting that computer programs are there for the user’s benefit and
not the programmer’s, but that’s a different story and will be told at
a different time.
So, given that the whole premise of Mail::Miner is that I only vaguely
know what I’m looking for, I don’t want to have to specify explicit
dates and times in order to narrow down searches. If I knew the date
and time of the email, I wouldn’t need Mail::Miner in the first
place!
Instead, I want to be able to say "find me the email I got from Adam
sometime around a week ago". The Date::PeriodParser module was
written to solve this problem: given a "fuzzy" date expressed in
English, produce a pair of Unix time values which are likely to
bracket the date.
For instance, as I write this on a Thursday night:
% perl -MDate::PeriodParser -le 'print scalar localtime $_ for
parse_period("around the morning of the day before yesterday")'
Mon Oct 28 22:00:00 2002
Tue Oct 29 14:00:00 2002
"Around" the morning of the day before yesterday translates to between
very late on Monday evening and early Tuesday afternoon. Mail Miner’s
--date option will give an interface to this.
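The underlying trick, turning a vague phrase into a pair of bracketing times, is easy to sketch. What follows is not how Date::PeriodParser works internally; it is a toy that understands only phrases of the form "around N days ago" or "around a week ago", and the function name fuzzy_period and the half-unit margin are assumptions of mine.

```perl
use strict;
use warnings;

my %seconds_in = (day => 86_400, week => 7 * 86_400);

# Hypothetical helper: accepts "around a week ago", "around 3 days ago" and
# similar. Returns (start, end) as Unix times, or an empty list if the
# phrase is not understood. $now is optional and defaults to the clock.
sub fuzzy_period {
    my ($phrase, $now) = @_;
    $now = time unless defined $now;
    my ($count, $unit) = $phrase =~ /^around\s+(a|\d+)\s+(day|week)s?\s+ago$/i
        or return;
    $count = 1 if lc $count eq 'a';
    my $length = $seconds_in{ lc $unit };
    my $centre = $now - $count * $length;
    my $slack  = $length / 2;              # half a unit either side
    return ($centre - $slack, $centre + $slack);
}

print scalar(localtime $_), "\n" for fuzzy_period('around a week ago');
```

The wider the unit, the wider the margin: "around a week ago" brackets three and a half days either side of the target, whereas "around 3 days ago" allows only twelve hours each way.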
It’s not just about the user
So far in our explorations of Mail::Miner, we’ve seen how it can be
used to extract salient information from the incoming email of a
single user. However, if the system were to be deployed across an
organisation with multiple users, this would require each user to have
their own database.
Or would it? The natural extension to filing information about how you communicate by email is to file information about how an organisation communicates internally and with its clients.
A planned future phase of Mail::Miner is to have it sit on the main
mail gateway to a company and file every single incoming and outgoing
email. From this simple idea, we can develop a customer relationship
management utility - now you know where your clients live, what they
do, and, more importantly, who’s been speaking to them about what.
So what next?
The future of Mail::Miner lies in three distinct developments: the
first is the development of more specific recognisers; the second is
developing this idea of Mail::Miner as an organisation-wide tool;
the third is the extension of Mail::Miner from being purely a search
and retrieval tool to being an integrated part of an email client.
In the first category, I intend to work on recognisers that detect and extract place names, dates and times, code snippets, human languages used in an email, and much more.
In the second, I see far more possibilities. Adding a user-defined asset recogniser and asset management tool will allow you to specify more accurately the details that you want to record from an email; combine this with the idea of restructuring the assets system so that it can be applied either to a particular email or a particular recipient, and you have a system which can store the fact that a particular client has expressed an interest in playing golf, or is going on holiday for the next two weeks, or many other things which a computer cannot pick out, no matter how many monkeys are involved.
Similarly, there could be recognisers which attempt to divine the
relationships between correspondents on an email message - if I always
Cc Joe when I email Amy, then Mail::Miner should take notice of this.
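Since none of this exists yet, the following is only a guess at how such a recogniser might start out: count how often each address is Cc'd alongside each recipient, and report the ones that turn up nearly every time. Messages are plain hashes here rather than Mail::Miner objects, and the function names and the ratio threshold are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: tally messages per To address, and how often each
# Cc address accompanied it.
sub cc_habits {
    my (@messages) = @_;
    my (%sent, %cc_with);
    for my $msg (@messages) {
        for my $to (@{ $msg->{to} }) {
            $sent{$to}++;
            $cc_with{$to}{$_}++ for @{ $msg->{cc} || [] };
        }
    }
    return (\%sent, \%cc_with);
}

# Return the addresses Cc'd on at least $ratio of the mail sent to $to.
sub usual_cc {
    my ($sent, $cc_with, $to, $ratio) = @_;
    my $total = $sent->{$to} or return;
    my @usual = grep { $cc_with->{$to}{$_} / $total >= $ratio }
                sort keys %{ $cc_with->{$to} || {} };
    return @usual;
}

my @mail = (
    { to => ['amy@example.com'], cc => ['joe@example.com'] },
    { to => ['amy@example.com'], cc => ['joe@example.com', 'bob@example.com'] },
    { to => ['amy@example.com'], cc => ['joe@example.com'] },
);
my ($sent, $cc_with) = cc_habits(@mail);
print "$_\n" for usual_cc($sent, $cc_with, 'amy@example.com', 0.8);
```

With the three sample messages, Joe is Cc'd every time and Bob only once, so only Joe clears the 80% bar.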
This ties in with the third category, which combines the data
retrieval capabilities of Mail::Miner with the ordinary email
client. I’ve already intimated that Mail::Miner can be used to
generate virtual mail folders - imagine if your favourite email client
could search read mail based on language or a vague description of
when you read it! On this front, I expect to work on IMAP proxies
which can use an ordinary mail client to perform Mail::Miner
searches, as well as integration with some of the more common Unix
mail clients.
However, for me, Mail::Miner is and always has been a way to make
sure I never need to remember anything again - now, who do I have to
send this article to? I’m sure I got an email from someone at
ddj.com a week or so ago…





