I love when my brain has some spare cycles. I love this feeling of being bored. And I love this moment when one of my brain’s cells wake up and shouts out loud, “hey, let’s do something”. In these moments everything looks like a fun thing to do, the next most important project that is going to change the world. I have so many such projects, not finished, never actually fully implemented, the next big thing. Because life is too short to finish projects :).
This time, one of my projects turned into something that works, and I have been using it for the last two years more than 20 times, a Neo4j workshop.
It was a dark, snowy night in Cracow, Poland when said to myself, “hey, what if, I put all my emails into Neo4j”. Doesn’t this sound like a fun project? Of course, it does. You may ask, so what’s the business value of such project. Let me explain you, my business value minded people, fun is the ultimate value, together with exploration and finding people I can go for a drink. That’s what I call business value.
So let me welcome you to a journey on how to find your “true” friends by analysing your inbox in a graph.
This post assumes, at least, basic knowledge about Neo4j and Cypher. It is for people who already started their exploration of this amazing land. This is just one of many ideas you can play with, to get better feel of graphs.
In case you are not familiar with Neo4j, a my favourite graph database, it should be fine as well, as long as you follow my instructions, have downloaded neo4j copy from http://neo4j.com/download/ and read tutorial how to run neo4j shell :).
Enjoy the ride!
Feed your graph
First of all we will need a way to import all of your emails into graph. This is kind of boring task. So I will not go into much details about implementation, as you can find it at http://bitbucket.org/kcrimson/recograph.
Let’s feed the monster
hg clone http://bitbucket.org/kcrimson/recograph cd recograph mvn package cd target/appassembler/bin ./import-gmail -u <email@example.com> -password <your.password> -f Inbox
`-f` option allows you to set folders you want to fetch. It also supports sub folders, so you can pass for example `-f “[Gmail]/AnotherFolder”`. In order to import all of your emails you can just fetch single folder, called “[Gmail]/Wszystkie”, if you use GMail with Polish language :).
GMail users need to enable IMAP protocol.
It will take a while. It will probably fail few times. But don’t worry you can re-run it as many times as you wish, as it will import every email just once, based of message identifier. You can also check file `import-gmail.log` to check for status of the import and lookup for errors. If you have enough patience, after an hour or so (depending on the size of your inbox) you will end-up with `mails` directory with Neo4j graph in it. With a structure which looks more or less like this.
In Reply To
So what’s in it? All your mails as nodes with label Message and properties `Subject`, `Message-Id` and `In-Reply-To` and all of your recipients as nodes with label InternetAddress and properties `address` and `personal`. You also find relationships of types FROM,TO,CC and BCC, between your Messages and InternetAddreesses.
This is a simple graph, which doesn’t tell you much about your social life, let’s tune it a little bit.
First thing that came to my mind was, emails threads, how can I find all emails threads in my inbox. After few “I am feeling lucky” searches I found that there are actually two headers in emails, which I can use to explore threads. They are called `Message-Id` and `In-Reply-To`. So it was a time to materialize few relationships.
CREATE INDEX ON :Message(`Message-Id`); CREATE INDEX ON :Message(`In-Reply-To`);
MATCH (n:Message) MATCH (m:Message) WHERE n.`Message-Id`=m.`In-Reply-To` CREATE n<-[:IN_REPLY_TO]-m;
What happened here is, first I created two indexes, just to spice up things, sorry, speed up, when I will be marching through the whole graph, to find nodes (mails) which are related. Next is a lovely Cypher query which matches nodes where Message-Id equals In-Reply-To and creates a new relationship. This is the thing I love in graphs, exploring, building layers of knowledge from simple facts, layer after layer, till the moment self-consciousness of graph awakens 🙂
The longest thread
So as you can see it is no longer a simple graph. We have enriched it with `IN_REPLY_TO` relationship, which points to mail which was a response to another mail.
Now, let’s look for a longest email thread in our mailbox.
MATCH thread=((firstMail:Message)<-[:IN_REPLY_TO*]-(lastMail:Message)) WHERE NOT ()<-[:IN_REPLY_TO]-firstMail AND NOT (lastMail)<-[:IN_REPLY_TO]-() WITH firstMail, length(nodes(thread)) AS threadLength RETURN firstMail.Subject, threadLength ORDER BY threadLength DESC LIMIT 5;
Isn’t it lovely? I will print this query, someday, on my t-shirt. It is so beautiful.
To be honest it took me some time to make this query use the power of Cyhper, in its pure form. The trick is in the question I asked myself. What is email thread? For the sake of simplicity, something that has beginning and end. I know it is wrong, because email threads are actually trees, not lists, but this is something for another post, to explore. I had to find first and last email in the thread. Once you draw it on a piece of paper it is easy. First email in thread is the mail which doesn’t have outgoing `IN_REPLY_TO` relationships, NOT ()<-[:IN_REPLY_TO]-firstMail, and last email is the email which doesn’t have incoming `IN_REPLY_TO` relationship, NOT (lastMail)<-[:IN_REPLY_TO]-(). Just few ASCII arrows away.
Who is my friend
This is a place when things get even more interesting. Once I have email threads I can try to explore clusters of friends. But let’s come up with definition of friend. If I have your email address, it means I am your friend (it is not so straightforward in Poland :)). It leads to interesting observation, and another layer of facts. If I send email to Tomasz and Kuba, it means they are my friends. But when Kuba responds to all, adding Wiktor, it means he knows Wiktor, but Wiktor not necessarily knows Tomasz and me. And here is another crazy query, with even more arrows.
OPTIONAL MATCH (m:Message)<-[:IN_REPLY_TO]-(n:Message)-[:TO|CC|BCC]->r WHERE NOT m-[:TO|CC|BCC]->r WITH n,r MATCH n-[:FROM]->f MERGE r-[:KNOWS]-f;
So I am looking for all recipients of the email, and check if there are relationships to these recipients in a previous email in the thread. For these recipients I create `KNOWS` relationship.
Where did my cluster go
And last but not least. Clusters of friends. This time we will use well-known algorithm, called local clustering coefficient. I am not going to bore you with details, but in short, it counts number of relationships between your direct friends versus number of potential relationships between them. Which in short, looks like this
MATCH (a)-[:KNOWS]-(b) WITH a, count(distinct b) as neighbours MATCH (a)-[:KNOWS]-()-[r:KNOWS]-()-[:KNOWS]-(a) WHERE exists(a.name) WITH a, neighbours, count(distinct r) AS connected_neighbours WHERE neighbours>1 RETURN a.name, (2*toFloat(connected_neighbours))/(neighbours*(neighbours-1))
In short clustering coefficient equal 1, means that friends of this person know each other very well, really close circle of friends. We could also explore complex world of cliques, for example Bron–Kerbosch algorithm. One interesting fact, finding maximal clique in a graph is on the list of the most complex computational problems, as listed on Wikipedia.
And that’s all folks, I hope you enjoyed it, especially the moments when out of simple facts, we were able to build knowledge.
There is of course a lots of things we can, like adding time dimension to our graph. We can use `Date` header and spread time tree and look at who responded to our emails within minutes or days :).
I hope also I have shown you that Cypher is a great language, but as with every abstraction you have to unlearn old tricks.
nice post, but I am not sure if the clustering coefficient is calculated properly. It should include the number of all possible connections between the direct neighbors of a.
with a, count(b) as neighbors,
return a, neighbors*(neighbors-1)
I might be missing something here, can you elaborate on this?
I think it is right, based on definition from Wikipedia https://en.wikipedia.org/wiki/Clustering_coefficient#Local_clustering_coefficient, it says “and therefore for each neighbourhood N_i there are k_i(k_i-1) links that could exist”, but I may read it wrong 🙂
exactly but ki is is the number of neighbors of a vertex (i.e. a) you are not calculating it. You are only calculating the number of relations between the neighbors of the vertex.
the definition is ambiguous, one line later, “The local clustering coefficient C_i for a vertex v_i is then given by the proportion of links between the vertices within its neighbourhood divided by the number of links that could possibly exist between them”, especially “… of links between vertices with its neighbourhood” :), time for an expert call 🙂
I just played around with it, trying to calculate the clustering coefficient as for the pictures examples in the wiki. The results are “wrong”. You can reproduce it:
Ok my bad, this is undirected graph in this case, which changes things a little bit, plus yes, I should count number of direct neighbours, please check if this corrected query works as expect,
WITH a, count(distinct b) as neighbours
WITH a, neighbours, count(distinct r) AS connected_neighbours
RETURN a.name, (2*toFloat(connected_neighbours))/(neighbours*(neighbours-1));