In the previous article, we installed Hadoop; now it is time to play with it. In this article, we are going to take some example data, store it in the Hadoop file system, and see how we can run a simple MapReduce job.
The data we are going to use is the MovieLens dataset: a list of 100,000 movie ratings from 1,000 users on 1,700 movies. The dataset is quite old, but it has been tested many times and is very stable, which prevents any discrepancies when following this article.
The dataset can be found here. We just need to download the zip file and extract its contents; we are interested in the file called u.data.
Using the file system
Let’s check what our file system looks like. We can do this using the UI or the console. In this article, I am going to focus on working with the console, but I will point to the UI whenever something is interesting or can be easily checked there in addition to the console command.
As I said, let’s check the file system. To use the UI, we can open the NameNode Web UI, probably at http://localhost:9870, and go to “Menu -> Utilities -> Browse the file system“. Since our file system is empty, we should see something like the following screenshot:
To achieve the same on the console, we can execute the next command:
hadoop fs -ls /
It is important to notice that we need to specify the path; if it is not specified, the command will throw an error. For more information about the error, we can read this link.
After executing the command, we will see that nothing is listed, but we know it is working. Let’s now upload our data: first we will create a folder, and then we will upload the data file:
$ hadoop fs -mkdir /ml-100k
$ hadoop fs -copyFromLocal u.data /ml-100k/u.data
$ hadoop fs -ls /
$ hadoop fs -ls /ml-100k
Easy and simple. As we can see it works as we would expect from any file system. We can remove the file and folder we have just added:
$ hadoop fs -rm /ml-100k/u.data
$ hadoop fs -rmdir /ml-100k

Keep in mind that we will need the file in HDFS again later, so remember to upload it once more before running the final example.
The file we are using has the format:
user id | item id | rating | timestamp.
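As a quick sketch of what one of those records looks like, here is how a single tab-separated line can be split in Python (the sample line below is only an illustration of the format):

```python
# A record in u.data is four tab-separated fields:
# user id, item id, rating, timestamp.
# This sample line is just an illustration of the format.
sample_line = "196\t242\t3\t881250949"

user_id, item_id, rating, timestamp = sample_line.split("\t")
print(user_id, item_id, rating, timestamp)  # every field arrives as a string
```

Note that all fields come out as strings; any numeric work on them would require an explicit conversion.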
To process it, we are going to be using Python and a library called MRJob. More information can be found here.
We need to install it to be able to use it:
pip3 install mrjob
And, we need to build a very basic MapReduce job, for example, one that lists the count of given ratings on the data.
The code should look like this:
from mrjob.job import MRJob
from mrjob.step import MRStep


class RatingsBreakdown(MRJob):
    def mapper(self, _, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    RatingsBreakdown.run()
The code is pretty simple and easy to understand. A job is defined by a class that inherits from MRJob. This class contains methods that define the steps of the job. A “step” consists of a mapper, a combiner, and a reducer; all of them are optional, though we must have at least one. The mapper() method takes a key and a value as arguments (in this case, the key is ignored and a single line of text input is the value) and yields as many key-value pairs as it likes. The reducer() method takes a key and an iterator of values and also yields as many key-value pairs as it likes (in this case, it sums the values for each key). The final required component of a job file is the block at the bottom that invokes the run() method when the file is executed as a script.
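To make the mapper-shuffle-reducer flow more concrete, here is a plain-Python sketch that simulates the same logic without Hadoop or mrjob (the sample records are made up for illustration):

```python
from collections import defaultdict

# Hypothetical sample records in the u.data format (tab-separated).
lines = [
    "1\t50\t5\t881250949",
    "2\t50\t4\t881250950",
    "3\t60\t5\t881250951",
]

# Mapper: emit a (rating, 1) pair for every input line.
mapped = []
for line in lines:
    user_id, movie_id, rating, timestamp = line.split("\t")
    mapped.append((rating, 1))

# Shuffle: group values by key, as Hadoop does between map and reduce.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reducer: sum the values for each rating.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'5': 2, '4': 1}
```

This is only a mental model of what the framework does; in a real cluster the map, shuffle, and reduce phases run distributed across nodes.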
We can run the job in different ways, depending on our goal. If we have just a small file with a sample of the data and we only want to test the code, we can run the Python script against a local file. If we want to use similar sample data but see how it behaves in Hadoop, we can upload a local file on demand while executing the script. Finally, if we want to execute the whole thing, we can run the script against the big file we have in the file system. Since the file we are using here is small, we can run the three types of execution with the same file:
$ python3 ratings-breakdown.py u.data
$ python3 ratings-breakdown.py -r hadoop u.data
$ python3 ratings-breakdown.py -r hadoop hdfs:///ml-100k/u.data
The result, whichever execution we decide to use, would be:
"1"	6110
"2"	11370
"3"	27145
"4"	34174
"5"	21201
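As a quick sanity check, the per-rating counts should add up to the 100,000 ratings in the dataset, which we can verify in a couple of lines of Python:

```python
# Counts taken from the job output above.
counts = {"1": 6110, "2": 11370, "3": 27145, "4": 34174, "5": 21201}

total = sum(counts.values())
print(total)  # 100000, the full MovieLens 100k dataset
```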
As a curiosity, if we now take a look at the ResourceManager Web UI, probably at http://localhost:8088, we will see the jobs executed for the last two commands:
That is all for today. We have stored something in our Hadoop file system and executed our first MapReduce job.