data science tutorials and snippets prepared by tomis9
hadoop was the first widely popular big data tool;
it lets you compute on huge amounts of data “quickly”, by dividing the computation among many machines (a cluster of machines); “quickly” compared with a standard, single-machine approach;
you can store and easily access huge amounts of data thanks to hadoop’s distributed file system (hdfs);
in general, hadoop is a cornerstone of big data.
In a production environment Hadoop should be installed on several interconnected machines, called a cluster. I’m not going to explain how to do this, as it is usually a task for sysadmins/DevOps guys, not for data scientists. However, it is very useful to have your own one-node Hadoop installation on your laptop for practising basic big data solutions. This is what we’re going to install in this section.
The script you have to run to install Hadoop is available here. You can run the whole file with
sudo bash hadoop.sh
but I recommend running it line by line, so that you understand what you are doing and what is needed to get Hadoop running.
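If you are curious what such a script typically contains, a single-node setup roughly follows the steps below. This is a sketch of the usual procedure, not the exact contents of hadoop.sh; the Hadoop version, download URL and paths are assumptions:

```shell
# Sketch of a typical single-node Hadoop setup (version/paths are examples).
sudo apt-get install -y openjdk-8-jdk        # Hadoop runs on the JVM

# download and unpack a Hadoop release
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
sudo tar -xzf hadoop-3.3.6.tar.gz -C /opt

# point Hadoop at Java and add its binaries to PATH
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/opt/hadoop-3.3.6
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"

# after configuring core-site.xml and passwordless ssh to localhost:
hdfs namenode -format   # initialize the distributed file system
start-dfs.sh            # start the namenode and datanode daemons
```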
You probably expect that knowing Hadoop will let you do Big Data. It would, but in reality nobody runs data processing with Hadoop’s MapReduce anymore. Spark and Hive have much nicer APIs and do the same tasks faster.
But there is still one part of Hadoop that has not been replaced: its file system, the hadoop distributed file system or hdfs, which spreads data across the cluster, so the disk capacity of a single machine is no longer a limit.
There are a couple of useful commands to remember. We can clearly see their resemblance to bash commands.
ls - list files in a given directory
hdfs dfs -ls <dir>
mkdir - create a directory
hdfs dfs -mkdir <path>
put - upload file to hdfs
hdfs dfs -put <localSrc> <dest>
copyFromLocal - copy a file from local to hdfs, obviously
hdfs dfs -copyFromLocal <localSrc> <dest>
get - download file from hdfs
hdfs dfs -get <src> <localDest>
copyToLocal - copy a file from hdfs to local, obviously
hdfs dfs -copyToLocal <src> <localDest>
mv - move file from one directory to another or rename a file
hdfs dfs -mv <src> <dest>
cp - copy file from one directory to another
hdfs dfs -cp <src> <dest>
rm - remove a file from hdfs (-rm -r - remove recursively)
hdfs dfs -rm <file>
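To see the whole workflow end to end without a cluster, here is the same sequence of operations run against the local filesystem, with the equivalent hdfs command shown in each comment. The bash lines actually run; the hdfs twins in the comments use hypothetical paths (books, /tmp/moby.txt) purely for illustration:

```shell
# Each line runs locally; the comment shows its hdfs twin (paths are examples).
mkdir -p /tmp/hdfs_demo/books                 # hdfs dfs -mkdir books
echo "call me ishmael" > /tmp/moby.txt
cp /tmp/moby.txt /tmp/hdfs_demo/books/        # hdfs dfs -put /tmp/moby.txt books
ls /tmp/hdfs_demo/books                       # hdfs dfs -ls books
mv /tmp/hdfs_demo/books/moby.txt \
   /tmp/hdfs_demo/books/moby_dick.txt         # hdfs dfs -mv books/moby.txt books/moby_dick.txt
cp /tmp/hdfs_demo/books/moby_dick.txt \
   /tmp/moby_back.txt                         # hdfs dfs -get books/moby_dick.txt /tmp/moby_back.txt
rm -r /tmp/hdfs_demo                          # hdfs dfs -rm -r books
```

The one-to-one mapping is the point: once you know the bash command, you usually already know the hdfs one.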
More commands are available here.
Hadoop GUI (TODO)
example of MapReduce: wordcount (TODO)
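Until the full Hadoop example lands here, the idea behind MapReduce wordcount can be sketched as a plain shell pipeline: tr plays the map phase (emit one word per line), sort plays the shuffle (bring identical keys together), and uniq -c plays the reduce (count each group). This runs on one machine only; a real Hadoop job executes the same three phases distributed across the cluster. The input text is made up:

```shell
# map: split lines into words | shuffle: sort the words | reduce: count groups
printf 'hello hadoop\nhello big data\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
# "hello" appears twice, so it comes out on top
```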
RHadoop (TODO)