Mail learning: the what and the how


Since about 2002 I’ve had a passion for mining data and relationship information from email. I organise my life around my email, as I’m sure that many people do, treating it as a big datastore. I’m convinced that your mail contains everything you need to know - appointments, addresses, phone numbers, URLs, documents, relationships… The trouble is, how to find it all.

(As an aside, since 2002 I’ve called all these things “assets”, so I use the same terminology here.)

I worked for a year for a company in Belfast with the same vision - build something to index, search and retrieve key data from email. Things never took off, and then along came Google Mail and we were sure they were going to solve the problem. It’s now seven years later, an age in Internet time, and they haven’t. There are now - finally - a few players in the arena but (once again) none of them quite gets it. Let’s look at what they’re doing, what needs to be done and, much more importantly, how to do it.

First, let’s survey the state of the art. The closest out there is Kwaga. This is a plugin on top of Google Mail - good start, because it gives you searching for free - and it does asset detection and retrieval. The idea’s great, but last I looked it had all the functionality of a rigged demo. I just couldn’t get the bugger to work. It also requires me to give it my mail username and password, as do many of its competitors. I don’t know when people stopped getting the stern talk from sysadmins about not telling their password to anyone, but I still give it. My wife doesn’t know my mail password, so J Random Company won’t have it either.

Next up is Mailana. (I’m indebted to gnat for the list of players, by the way.)

It’s an Exchange plugin as far as I can see, which gives me the heaves. (To be fair, I’ve never used Outlook, so I shouldn’t judge.) It seems to do keyword analysis and clustering, and show you who is connected to what keywords. This is a slightly different problem from the information retrieval problem, but it’s an important one for business. So what they do, they seem to do well, albeit in a Windows-only, desktop client way, but what they don’t do is help you find useful stuff in your email.

Gist tackles the same problem - finding out more about people and what they’re doing - although as a (horribly busy) web-based desktop. And the approach it takes is interesting - combining multiple information sources such as blogs, newswires, LinkedIn and so on to build up a fuller picture of information about a person. While I like the idea of building this on top of my own conviction - your mail contains everything you need to know - it does actually seem to ignore a lot of the important information that is sitting in your mail. It extracted attachments, but I couldn’t get it to extract contact data or anything else. Also it is people-centered, and what I want is asset-centered. Finally, a lot of my friends are Japanese and their names turned into UTF-8 gobbledygook. Urgh.

I haven’t got onto threadsy but that seems also to take a combining approach, pulling together disparate sources of information to build up a fuller picture. Again, I want to see more done with the information.

Trampoline Systems’ SONAR appears to be running Graphviz on an email database. I remember writing a program in June 2001 to do this - this was the output.

And that’s everyone. Apart from… well, me, actually.

And now I will show you a more excellent way

Two weeks ago, I had a conversation with someone where we got talking about what I used to do in Ireland. And he was very enthused about it, saying that it would be invaluable for his business, and so on. Very sadly, I had to tell him that our company didn’t make it and still, seven years on, there was just nothing really working in this space. In fact, four years ago I ranted about this very subject.

But that conversation got me energised and got my passion back. I picked up some tools I’d written with Simon Wistow and got coding again. Of course since then there’ve been new technologies in semantic analysis, search engines, named entity extraction, so I threw them all into the mix. The result is mearch. I’d buy the obvious domain name but to be honest I can’t afford it.

And this is the problem: I really can’t afford to be doing this. Not in terms of money, not in terms of time. It’s already wrecking my degree, and my wife groans every time I mention I’ve been working on “the email thing”. I want this to exist but I’m a missionary, not a programmer. I shouldn’t be working on this. So I’m not entirely sure what to do with it.

But I’ll tell you how it works.

There are a few things that a really good mail analysis tool needs to do:

Indexing and search


This should be a “needless to say” but nobody’s really doing this well. Index everything: content, headers, dates, assets… The more things you can give me a handle on, the more specific a search I can construct and the more time I can save; the more I can use half-remembered facts (I wrote PeriodParser so I could look for emails “about three weeks ago”) to find what I need, the happier I am.
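To make the “about three weeks ago” idea concrete, here is a minimal sketch of turning a half-remembered phrase into a date window. This is not PeriodParser: the function name `fuzzy_period`, the tiny grammar and the rule for how wide the window should be are all assumptions made for illustration.

```python
import re
from datetime import date, timedelta

# Days per unit; months and years are deliberately approximate.
UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}
WORD_NUMBERS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4,
                "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

PATTERN = re.compile(
    r"^(?P<fuzzy>about |around )?(?P<n>\d+|[a-z]+) "
    r"(?P<unit>day|week|month|year)s? ago$"
)

def fuzzy_period(phrase, today):
    """Return (start, end) dates for phrases like 'about three weeks ago'.

    Assumed rule: an exact phrase allows half a unit either side of the
    target date; 'about' or 'around' allows a whole unit either side.
    """
    m = PATTERN.match(" ".join(phrase.lower().split()))
    if not m:
        raise ValueError(f"cannot parse period: {phrase!r}")
    raw = m.group("n")
    n = int(raw) if raw.isdigit() else WORD_NUMBERS.get(raw)
    if n is None:
        raise ValueError(f"unknown number word: {raw!r}")
    unit = UNIT_DAYS[m.group("unit")]
    target = today - timedelta(days=n * unit)
    slack = unit if m.group("fuzzy") else unit // 2
    return target - timedelta(days=slack), target + timedelta(days=slack)

# Usage: the window can then be fed to the index as a date-range filter.
print(fuzzy_period("about three weeks ago", date(2009, 6, 30)))
```

The point of returning a window rather than a single date is that the search engine can treat it as just another range constraint alongside sender, attachment type and so on.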

So for instance, when there’s an attachment to an email, that’s three pieces of information you should give me a handle on: attachment content, attachment type and attachment filename. That way I can say “give me all the emails from Brad with a PDF attachment”.
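As a rough sketch of those three handles, the following uses Python’s standard `email` package to pull filename, type and content out of each attachment. The field names (`attachment_filename` and friends) and the helper `from_with_attachment_type` are made up for this example and are not mearch’s actual schema.

```python
from email.message import EmailMessage

def attachment_fields(msg):
    """Yield one dict per attachment with the three indexable handles."""
    for part in msg.walk():
        if part.get_content_disposition() != "attachment":
            continue
        yield {
            "attachment_filename": part.get_filename(),
            "attachment_type": part.get_content_type(),
            "attachment_content": part.get_payload(decode=True),
        }

def from_with_attachment_type(msg, sender, content_type):
    """True if the message is from `sender` and carries such an attachment."""
    is_from = sender.lower() in str(msg["From"]).lower()
    return is_from and any(
        f["attachment_type"] == content_type for f in attachment_fields(msg)
    )

def demo_message():
    """Build a small message with one fake PDF attachment."""
    msg = EmailMessage()
    msg["From"] = "brad@example.com"
    msg["Subject"] = "Report"
    msg.set_content("See attached.")
    msg.add_attachment(b"%PDF-1.4 fake", maintype="application",
                       subtype="pdf", filename="report.pdf")
    return msg

# Usage: "all the emails from Brad with a PDF attachment", one message at a time.
print(from_with_attachment_type(demo_message(), "brad", "application/pdf"))
```

In a real system these fields would go into the search index rather than being checked message by message; the per-message check is only here to keep the example self-contained.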

Saved searches are not something that mearch does yet, but it really should.

Asset extraction

Assets, as I’ve mentioned above, are names, URLs, phone numbers, addresses, and any other pieces of key information that you can pull out of an email. Mearch calls them “Essentials” in the user interface.

When I initially did Twingle and then Email::Store, we had to pull these things out manually. I wrote a seriously half-assed named entity extraction system, and we used heuristics to pull out phone numbers and other stuff.
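For a flavour of what that kind of heuristic looks like, here is a small sketch using regular expressions. It is not the code from Twingle or Email::Store; the patterns and the `extract_assets` name are assumptions, and the phone pattern will happily produce false positives (a date like 2009-06-30 looks like a phone number to it).

```python
import re

URL_RE = re.compile(r"https?://[^\s<>\"')]+")
# Loose phone pattern: optional +, then digits separated by spaces, dots,
# hyphens or brackets. Deliberately sloppy.
PHONE_RE = re.compile(r"(?<![\w.])\+?\d[\d .\-()]{6,}\d(?!\w)")

def extract_assets(text):
    """Return URLs and phone-number-like strings found in `text`."""
    # Trailing punctuation is usually sentence punctuation, not part of the URL.
    urls = [u.rstrip(".,;:!?") for u in URL_RE.findall(text)]
    # Blank out URLs first so digits inside them are not taken for phones.
    remainder = URL_RE.sub(" ", text)
    phones = []
    for match in PHONE_RE.finditer(remainder):
        digits = re.sub(r"\D", "", match.group())
        if 7 <= len(digits) <= 15:  # plausible length for a phone number
            phones.append(match.group())
    return {"url": urls, "phonenumber": phones}

print(extract_assets("Call me on +44 1452 555 0123 or see http://example.com/map."))
```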

Now there’s a brilliant solution to the problem: throw a chunk of data at OpenCalais and it’ll send you back all the named entities - company names, technologies, personal names, fax numbers, phone numbers and the like. It won’t recognise addresses, so you still need to use half-assed heuristics for that, but then you can use Google to geocode them, which also cleans the data for you, making the heuristics seem a bit less half-assed.

The key is that you then index these assets so that you can pull out emails which have a given asset. If you want to find Person X’s phone number, then internally to your system you run the search “person_id:1234 has:phonenumber” and look at the “phonenumber” field of the documents returned.
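Here is a toy sketch of that internal search, assuming a simple in-memory inverted index. The `AssetIndex` class and its methods are invented for illustration; a real system would sit on top of a proper search engine, and only the `field:value` and `has:field` query shape comes from the description above.

```python
from collections import defaultdict

class AssetIndex:
    """Tiny inverted index over per-email asset fields."""

    def __init__(self):
        self.docs = {}                    # doc_id -> {field: [values]}
        self.postings = defaultdict(set)  # "field:value" or "has:field" -> doc ids

    def add(self, doc_id, fields):
        """Index one email; `fields` maps a field name to a list of values."""
        self.docs[doc_id] = fields
        for name, values in fields.items():
            if values:
                self.postings[f"has:{name}"].add(doc_id)
            for value in values:
                self.postings[f"{name}:{value}"].add(doc_id)

    def search(self, query):
        """Return ids of documents matching every whitespace-separated term."""
        terms = query.split()
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)

    def field_values(self, query, field):
        """Collect the distinct values of `field` across the matching documents."""
        found = []
        for doc_id in self.search(query):
            for value in self.docs[doc_id].get(field, []):
                if value not in found:
                    found.append(value)
        return found

index = AssetIndex()
index.add("mail-1", {"person_id": ["1234"], "phonenumber": ["+44 1452 555 0123"]})
index.add("mail-2", {"person_id": ["1234"], "url": ["http://example.com/"]})
index.add("mail-3", {"person_id": ["9999"], "phonenumber": ["+44 28 5555 0100"]})
print(index.field_values("person_id:1234 has:phonenumber", "phonenumber"))
```

Note that this only tells you which phone numbers appear in emails involving that person; deciding whose number it actually is remains a separate problem.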


Notes and annotations

Let people write yellow stickies on emails, and write their own notes on any of the assets that we’ve found, and index those too.

The solution here is very simple: wikis. I automatically seed the data on assets from Wikipedia, but doing something similar to what gist does and finding blogs, news sources and so on wouldn’t be bad either.

Mail similarity

You want to be able to cluster emails around subjects, tags and so on. There are two ways to do this, and you want to do both.

The easy way is to take the keyword data returned by Calais (not the named entity data - if you ask it, it’ll send you important keywords from the input as well) and use that as tags for a tag cloud.

The second is to use spreading activation graphing on the body of the email to find clusters of similar documents.
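As an illustration of the idea, here is one simple form of spreading activation over a graph linking documents to the words they contain. Energy starts at one email, flows out to its words and back to other emails sharing those words. The tokenizer, stopword list, decay factor and function names are all assumptions for this sketch, not what mearch actually does.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "and", "for", "with", "this", "that"}

def tokenize(text):
    """Lowercased content words of three or more letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}

def build_graph(docs):
    """Bipartite graph: ('doc', id) <-> ('term', word)."""
    graph = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            graph[("doc", doc_id)].add(("term", term))
            graph[("term", term)].add(("doc", doc_id))
    return graph

def spread(graph, source, steps=2, decay=0.5):
    """Push activation out from `source`; each node splits what it received
    evenly among its neighbours, attenuated by `decay` at every hop."""
    total = defaultdict(float)
    total[source] = 1.0
    frontier = {source: 1.0}
    for _ in range(steps):
        incoming = defaultdict(float)
        for node, energy in frontier.items():
            neighbours = graph.get(node, ())
            for neighbour in neighbours:
                incoming[neighbour] += decay * energy / len(neighbours)
        for node, energy in incoming.items():
            total[node] += energy
        frontier = incoming
    return total

def similar_documents(docs, doc_id, steps=2):
    """Other documents reached by activation from `doc_id`, strongest first."""
    scores = spread(build_graph(docs), ("doc", doc_id), steps=steps)
    ranked = [(node[1], score) for node, score in scores.items()
              if node[0] == "doc" and node[1] != doc_id and score > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))

docs = {
    1: "Meeting at the church hall on Friday",
    2: "Church hall booking for Friday meeting",
    3: "Invoice for Perl training course",
}
print(similar_documents(docs, 1))
```

With more than two steps the activation carries on through second-order neighbours, which is where the fuzziness comes from: two emails can end up related without sharing a single word directly.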

And needless to say, you’ve got to do threading and thread visualization. mearch uses this module which provides three thread views, including one called mail arcs, an interesting visualization from IBM’s remail project.
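For the threading half, a bare-bones sketch follows. It links each message to a parent using the In-Reply-To and References headers and prints an indented tree. This is a simplification and not the algorithm used by the module mentioned above; the dictionary layout and function names are assumptions, and it assumes the headers never form a loop.

```python
from collections import defaultdict

def build_threads(messages):
    """Link messages into threads.

    `messages` is a list of dicts with an 'id', and optionally 'in_reply_to'
    and 'references' (oldest first). Returns (roots, children).
    """
    known = {m["id"] for m in messages}
    children = defaultdict(list)
    roots = []
    for m in messages:
        # Prefer In-Reply-To; otherwise take the nearest ancestor we hold.
        candidates = [m.get("in_reply_to")] + list(reversed(m.get("references", [])))
        parent = next((c for c in candidates if c in known and c != m["id"]), None)
        if parent is None:
            roots.append(m["id"])
        else:
            children[parent].append(m["id"])
    return roots, dict(children)

def render(ids, children, depth=0):
    """Return the thread as indented lines, two spaces per reply level."""
    lines = []
    for message_id in ids:
        lines.append("  " * depth + message_id)
        lines.extend(render(children.get(message_id, []), children, depth + 1))
    return lines

messages = [
    {"id": "a"},
    {"id": "b", "in_reply_to": "a", "references": ["a"]},
    {"id": "c", "in_reply_to": "b", "references": ["a", "b"]},
    # Replies to a message we never received; falls back to its ancestor "a".
    {"id": "d", "in_reply_to": "x", "references": ["a", "x"]},
]
roots, children = build_threads(messages)
print("\n".join(render(roots, children)))
```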

Organic Groups / Networks

This is the killer feature for me of mearch, and it’s also what a lot of the other players in the field are concentrating on. Let me tell you the original problem that started me playing with this, and then I’ll tell you how I resolved it.

I belong to a church group here in Gloucester called feig. We do all of our organising by email, but we don’t have a decent mailing list, just a fairly big Cc list. To find out where we’re meeting this week I search for “has:address group:feig”, but of course there is no entity called “feig” in the email - just a list of people. In fact, the emails going out might not mention “feig”. I know as a human that this particular group of fifteen people is called “feig”, but a computer can’t know that automatically.

It especially can’t know that automatically when people start joining the list, dropping off the list, when people forget some names from the Cc list, or when we write to feig but also Cc one or two other random people who aren’t members. It all happens. But I still want to search for emails to “feig” and have the computer just know.

Now think about a large business. You may think you know what the main collaborative groups in the business are, but you may be surprised to find that people are working together - especially in a more networked age - across departments and even with people outside the organisation. There are “groups” out there but I bet most people aren’t aware of them.

The solution again is spreading activation graphing but this time instead of the body of the email you graph the “addressings” (From, To, CC). This helps you to identify clusters of people who email each other often, with the kind of fuzziness you need to deal with groups which form, change and disband organically.
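A minimal sketch of the idea, under several assumptions: every message is reduced to the set of addresses on it, the edge weight between two people is the number of messages they share, and a group is whoever ends up with enough activation when you start from a known member. The function names, decay factor and threshold are illustrative choices, not mearch’s.

```python
from collections import defaultdict
from itertools import combinations

def co_addressing_graph(messages):
    """Weighted graph: edge weight = number of messages two addresses share.

    `messages` is an iterable of address collections (From, To and Cc together).
    """
    graph = defaultdict(lambda: defaultdict(int))
    for addresses in messages:
        for a, b in combinations(sorted(set(addresses)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

def activate(graph, seeds, steps=3, decay=0.5):
    """Spread activation from `seeds`, dividing each node's energy among its
    neighbours in proportion to edge weight, attenuated by `decay` per hop."""
    total = defaultdict(float)
    frontier = {seed: 1.0 for seed in seeds}
    for seed in seeds:
        total[seed] = 1.0
    for _ in range(steps):
        incoming = defaultdict(float)
        for node, energy in frontier.items():
            edges = graph.get(node, {})
            weight_sum = sum(edges.values())
            for neighbour, weight in edges.items():
                incoming[neighbour] += decay * energy * weight / weight_sum
        for node, energy in incoming.items():
            total[node] += energy
        frontier = incoming
    return total

def organic_group(graph, seeds, threshold=0.2, **kwargs):
    """Everyone whose accumulated activation reaches `threshold`."""
    scores = activate(graph, seeds, **kwargs)
    return {person for person, score in scores.items() if score >= threshold}

mails = (
    [{"alice", "bob", "carol"}] * 3
    + [{"alice", "bob", "carol", "dave"}]  # dave was Cc'd once, at random
    + [{"erin", "frank"}] * 2              # an unrelated pair
)
graph = co_addressing_graph(mails)
print(sorted(organic_group(graph, {"alice"})))
```

Because membership is a matter of degree rather than an exact match on the Cc list, someone who was copied in once stays below the threshold, while a regular who was left off a message or two would still make it.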

To do this well, however, you need - and anyone playing in the email space really needs - really good

Entity correlation

I’ll be straight-up honest and tell you that this is a hard problem I haven’t managed to solve very well yet. But I have made some advances.

What I mean by entity correlation is this: suppose you have a cluster of emails addressed to “S Cozens” and “H Cozens”. Now an email comes along which is addressed to “Simon Cozens” and “Henrietta Cozens”. It should go in the cluster, but the names are different, so it won’t. “Tom Evans” and “Thomas Evans” might be the same person, so they should be clustered together. But of course there might be two people in the organisation called “Thomas Evans”, so they should be treated as separate.
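To show why this can only ever produce a “might be”, here is a sketch of a friendly-name comparison that returns a score rather than a yes or no. The score values and the little nickname table are invented for illustration; even an exact match stays below 1.0 because of the two-Thomas-Evans problem.

```python
# Illustrative nickname table; a real one would be much larger.
NICKNAMES = {"tom": "thomas", "bill": "william", "bob": "robert"}

def name_match_score(a, b):
    """Score how plausibly two friendly names refer to the same person.

    0.0 means incompatible; higher means more plausible. The numbers are
    arbitrary and no score is ever a certainty.
    """
    parts_a = a.replace(".", " ").lower().split()
    parts_b = b.replace(".", " ").lower().split()
    if not parts_a or not parts_b or parts_a[-1] != parts_b[-1]:
        return 0.0  # different (or missing) surnames
    if len(parts_a) == 1 or len(parts_b) == 1:
        return 0.3  # only a surname to go on
    first_a, first_b = parts_a[0], parts_b[0]
    if first_a == first_b:
        return 0.9  # same name, but there may be two such people
    if NICKNAMES.get(first_a, first_a) == NICKNAMES.get(first_b, first_b):
        return 0.7  # "Tom" and "Thomas"
    if (len(first_a) == 1 or len(first_b) == 1) and first_a[0] == first_b[0]:
        return 0.5  # "S" is compatible with "Simon"
    return 0.0

for pair in [("S Cozens", "Simon Cozens"), ("Tom Evans", "Thomas Evans"),
             ("H Cozens", "Simon Cozens")]:
    print(pair, name_match_score(*pair))
```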

Now the lazy answer is to do this by email address instead of by friendly name, since people can muck up the friendly names but if they muck up the email addresses then the mail won’t arrive. And the lazy answer is good enough for a lot of cases. But there are a similarly large number of cases for which it won’t work. Two quite different addresses can be the same person, sometimes in ways you wouldn’t guess. But “danandruth@” something are two people, and a single shared address could be ten or twenty people.

Smart software - and I’m taking it as read that we’re trying to develop smart software - has to at least make the effort to do some disambiguation between and correlation of people.

One of the key questions in email processing is what is a person. Now that blog post also posits a solution, which - five years later - I’m working on again. The idea is that a number of “correlators” work together as software agents, scanning the email database and finding interesting things which can help to work out which “entity” (distinct person) in the database a given friendly name / email address pair relates to. These agents pass messages to each other and give a confidence score for their own results. There are all kinds of correlators you can write:

  • Use explicit data - we’ve asked the user if two people are the same and they’ve said yes. (Incidentally, asking the user to help us with things we’re not sure of is another key innovation in mearch.)
  • Look in the signoff and signature to find someone’s name to disambiguate multiple users of an account.
  • If an email is sent to one address and a reply comes from another, it may be the same person behind both.
  • Domain names which are pretty similar and held by the same organisation could refer to the same person.
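A heavily simplified sketch of the correlator idea follows. Instead of agents passing messages, each correlator here is just a function that looks at two identities and returns a confidence between 0 and 1, and the results are merged with a noisy-OR. That combination rule, the confidence values and all the names are assumptions for this example, not the design described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Identity:
    """A friendly name / email address pair as seen in a header."""
    name: str
    address: str

def explicit_correlator(confirmed_pairs):
    """The user has told us these address pairs are the same person."""
    def correlate(a, b):
        return 1.0 if frozenset((a.address, b.address)) in confirmed_pairs else 0.0
    return correlate

def reply_correlator(reply_pairs):
    """Mail sent to one address drew a reply from the other.

    `reply_pairs` is a set of (sent_to_address, reply_from_address) tuples.
    """
    def correlate(a, b):
        seen = ((a.address, b.address) in reply_pairs
                or (b.address, a.address) in reply_pairs)
        return 0.4 if seen else 0.0
    return correlate

def same_name_correlator(a, b):
    """Identical friendly names are weak evidence on their own."""
    return 0.5 if a.name.lower() == b.name.lower() else 0.0

def combined_confidence(a, b, correlators):
    """Noisy-OR: the chance that at least one piece of evidence is right,
    treating the correlators as independent."""
    miss = 1.0
    for correlate in correlators:
        miss *= 1.0 - correlate(a, b)
    return 1.0 - miss

a = Identity("Simon Cozens", "simon@example.com")
b = Identity("Simon Cozens", "simon@example.org")
replies = {("simon@example.com", "simon@example.org")}
print(combined_confidence(a, b, [reply_correlator(replies), same_name_correlator]))
```

Anything that lands in the uncertain middle is a natural candidate for the “ask the user” step, whose answer then feeds back in through the explicit correlator.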


Nobody’s doing email search and handling well, apart from me, and I’m not doing it that well either, but at least I know how to do it better.