PDF version of this article is available right here.
Part I
What is Hadoop?
No, really – is there any point in writing this? This theme has been discussed many many times, an incredible amount of blogposts and tutorial has been written – about Hadoop, HDFS, HBase and so on and so forth. Many of them ended up with “To be continued”. So – this is how it’s going to be continued, here and now.
Someone can say, that the topic of this article is totally trivial and worth no time for reading let alone writing – may be. However, for some of us it may save some time trying to get into Map/Reduce world, by solving a complex task one part by another (we shall return more than once to this statement – hand on with me), and to certain degree I am trying to write an article which I personally missed so much when I just started learning Hadoop.
Worth mentioning, that I will try to write as little code here as possible (no code at all is the best) – an incredible lot of examples are available on the Internet, but since Hadoop’s interface changes every so often, not much use for us of those of them which were written for Hadoop 0.15 when the current is Hadoop 0.20.1 (and it indeed is – at the moment of writing, though it may change when you read this article).
When I first seen the Hadoop, the huge problem for me was to understand – what the hell Map/Reduce is and how can I benefit of it (I will write Map/Reduce like this – via slash – to point out the difference of the approach from the product). The value of this approach will get clear further on – after we consider a simple task.
A simple task
So, a small example: say we have a corpus of text – of a number of documents (not necessarily many documents to start with) and we want to calculate such a popular metric as tf-idf for every document and every token. What do we need to know in order to get these figures?
- Number of tokens in corpus
- Number of unique terms (or just terms) in corpus
- Number of documents in corpus
- Number of occurrences of every term in every document
- Number of documents containing every term
It certainly isn’t a complex task – you can write a program to do it in the most straightforward way and it unlikely will take you more than half an hour. However, things start get complicated when number of documents grows as well as number of tokens, number of terms get closer to the total number of terms in the language of corpus, and finally, dictionaries stop to fit into the memory. What shall you do now? This is probably a time when one must remember, that complex problems can often be solved one part by another – and this indeed can be solved this way. So, lets tear our problem apart – and define a number of sub-tasks.
First of all, let’s make an assumption regarding the format of the input data. We assume, that input data has a form of a flat file:
<w11> <w12> <w13> ... <w1N>
<w21> <w22> <w23> ... <w2M>
...
In other words, every document is a set of (possibly and quite probably) repeating words, and every such set is located on one line of a text file. This isn’t really huge assumptions as pretty much every text document can be represented in such a form.
Counting the number of tokens and terms in corpus is an easy task, and can be solved in linear time. However, it may happen we won’t even need to solve it separately as these values will get calculated automagically. The most interesting subtask on this stage is to calculate, how many times every term occurs in corpus.
No, you’re not mistaken – this is the most popular, described in every Hadoop tutorial, program – an infamous WordCount. It is so widely recognised I see no point producing it’s text here – you can find it right in the “Quick Start” of the official Hadoop tutorial, and it will work for us nicely. How does it work however? To make a long story short, on the map stage the each map task of the WordCount tokenises it’s input and creates following pairs:
map1:
<term1> 1
<term2> 1
<term3> 1
map2:
<term2> 1
<term3> 1
<term3> 1
map3:
<term1> 1
...
You see, that every term can appear more than once and every of them is associated with a value 1. Every term here is a key of the map tasks’ output and “1” is a value. In it’s turn, reduce tasks receives this key (i.e. <term1> and so on) and a list of all values from all map tasks associated with this key (here it’ll be the list of 1’s). This will look like:
<term1> 1,1
<term2> 1,1
<term3> 1,1,1
Adding up this 1s (and practically – simply counting the number of elements in the list) we get a number of occurrences of the every term in corpus:
the 19283
to 3432
from 343
...
...
london 14
Now, that sounds like we got something – however, the value of this result is not entirely obvious. Hang on – we can count the number of lines in this file and it’ll produce the total number of unique terms, and summing up the value from the second column we get the total number of tokens in the corpus! We haven’t really done anything yet we have two fundamental features of the corpus – not bad at all!
From now on this turns to be a classical information retrieval problem. We have a dictionary – a complete list of all terms which may occur in the corpus. Now, we need to find out, how often and which exactly terms can occur in every document. HEre we can use a slightly modified version of our old buddy WordCount, which shall calculate the number of term occurrences applied to every concrete document. Arguably, the easiest way to produce it is to change the output key of the map task to “docID-term” pair, so the mapper shall produce:
map1:
1_the 1
1_the 1
1_the 1
1_to 1
...
map2:
2_the 1
2_the 1
2_from 1
...
map3:
37_london 1
...
Reducer tasks for this case shall remain identical – it simply needs to add up all the values with the same key – not much work, really. By the end of the day we should get:
1_the 3
1_to 1
...
2_the 2
2_from 1
...
37_london 1
But what is it? In fact, what we’ve just produced is a well known term frequency, often abbreviated as tf(t,d) – which means that it is a function of a term and document. For example, an article about London will likely have higher tf for london rather than article about that weird thingie they call a football on that side of the pond (where it shall probably be equal to zero but zero is a frequency nevertheless!).
In this example we have designed and implemented an algorithm for calculating one of the most fundamental and popular statistical features in information retrieval – the term frequency. The value of this approach is that it can be extended on the corpus of any imaginable size – yet it’ll stay usable and one can run it even on only one machine (while it could be paralleled on several hundreds nodes at no extra cost). Hence, the answer to the question “How can I benefit of Map/Reduce” probably can now be answered: using Map/Reduce one can split one computationally complex task into a sequence of smaller ones, each of them can be easily computed on your home computer. These small tasks can be executed either sequentially or in parallel, after which it’s results shall be combined to produce the final answer.
But wait, probably this is exactly the place where I shall say that sacred phrase: “To Be Continued”. In the next article we would finish calculating tf-idf – namely we’ll calculate idf part of it and will combine the results. Finally, we shall move to solving this problem for the real, production-size set of data. Stay tuned.
Part II.
Calculating IDF
In this part we’re going to continue calculation of TF-IDF, and for now we need to calculate the second part of it – namely idf(t). Worth saying, that it does not depend on the document and is calculated on per-term basis over the whole corpus. This can be easily seen by looking at the formula:

That is, |D| is the number of documents in the corpus and the divider of the logarithm is the number of such documents d which contain the token t (for which IDF value is calculated).
So, how can we calculate it? There’re quite a few ways out there. One of them to use as the input data, calculated as an output for TF stage. Just to remind you, it looked like:
1_the 3
1_to 1
...
2_the 2
2_from 1
...
37_london 1
What’s our task here? We want to find out, how many documents contain the word, say, the. The solution is quite intuitive: we take every key from the input data, split it to (2, the) and write to the map task’s output that we have at least one document, containing the word the:
the 1
Repeating the steps above for every value in the input file, we get:
the 1
to 1
the 1
from 1
london 1
The rest is more or less obvious: reduce task receives these data (here - the term and list of ‘1’s), computes the length of the list and outputs:
the 2
to 1
from 1
london 1
Q.E.D. The problem of calculating IDF is now trivial (although you could also do it on grid, as the produced list can be quite large).
On the final step we calculate the TF-IDF value. That’s where problems start. We have two files: one with TF values, another with IDF. Neither of them fit into memory in most general case (although in the most realistic case too). What shall we do now? Looks like a dead-end!
Not quite. Here we can use a very popular (though a bit strange) technique, which is widely adopted in complex Map/Reduce programs as the data passthrough. To illustrate it, imagine that when we processed file with TF values (see above), instead of “1” we write into map task output something like docID_tfValue? Then we’ll get an output:
the 1_3
to 1_3
the 2_2
from 2_1
...
london 37_1
We still can compute the number of documents containing each term in reduce task (as we didn’t sum up these ones, but simply calculated the length of the list). On the other hand, now we have some additional data, which means we can modify reduce’s output like:
the 2_1_3
the 2_2_2
to 1_1_3
from 1_2_1
...
london 1_37_1
The key/value the -> 2_1_3 shall be read as:
There’re 2 documents which contains term
the, and the document with ID1contains 3 such terms.
It’s as simple as that. Now what we got? Reducer didn’t actually reduced amount of the data much, but added, if you want, a new dimension to it. What can we do with this data? Assuming that we now the number of documents and total number of tokens in corpus, we can write a trivial Map/Reduce program (which won’t actually need any Reduce stage at all). Each incoming string it’ll represent as:
term = key
(docCount, docId, countInDoc) = value.split("_")
(I hope you understand, that this is just a pseudo-code in pseudo-language which just by some strange accident looks similar to Python. The underscore delimiter is also just an example - you can use whatever you want or need). Now we can do:
tf = countInDoc / TOTAL_TOKEN_COUNT
idf = log(TOTAL_DOC_COUNT / docCount)
result = tf * idf
output(key = docId + "_" + term, value = result)
What did we get? We got TF-IDF value for a given document and term.
In this article we’ve computed the TF-IDF value for a corpus of documents. Important to mention, that increasing the size of corpus will not increase the amount of required memory – the computations will simply take longer (which can be easily fixed by adding extra nodes to the cluster). We have also studied one of the most popular techniques which is widely used for complex Map/Reduce jobs. Finally we’ve studied some fundamental Map/Reduce design patterns. In future I’m planning to consider solving this and other problems for real production-size data, with some code examples.
Reading list
- Yahoo! Hadoop Tutorial – first thing to read, the best manual currently available.
- Hadoop QuickStart Guide
- Hadoop Map/Reduce Tutorial
- Hadoop and Distributed Computing at Yahoo!
- Term frequency-inverse document frequency.