Data engineering
Hadoop
Dec 4, 2018 · 2 minute read

1. What is Hadoop and why would you use it?

2. Installation

In a production environment, Hadoop is installed on several interconnected machines, called a cluster. I’m not going to explain how to do that, as it is usually a task for sysadmins/DevOps engineers, not for data scientists. However, it is very useful to have your own single-node Hadoop installation on your laptop for practising basic big data solutions. This is what we’re going to install in this section.

The script you have to run to install Hadoop is available here. You can run the whole file with

sudo bash hadoop.sh

but I recommend running it line by line, so you understand what you are doing and what is needed to get Hadoop running.
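If you want a sense of what such a script does before running it, the steps below are a rough sketch of a typical single-node setup. The Hadoop version, the mirror URL, and the Java package name are assumptions for illustration, not the contents of the actual script:

# install Java, which Hadoop requires (package name is an example)
sudo apt-get update && sudo apt-get install -y openjdk-8-jdk

# download and unpack a Hadoop release (version and mirror are examples)
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
tar -xzf hadoop-3.1.1.tar.gz
sudo mv hadoop-3.1.1 /usr/local/hadoop

# tell Hadoop where Java lives and put its binaries on the PATH
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"

# after configuring core-site.xml and hdfs-site.xml for a single node,
# format the namenode and start HDFS
hdfs namenode -format
start-dfs.sh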

3. “Hello World” examples

You probably expect that knowing Hadoop will let you do Big Data. It would, but in reality hardly anyone runs data processing with Hadoop’s own MapReduce anymore. Spark and Hive have much nicer APIs and do the same tasks faster.

But there is one piece of functionality that has not been replaced: the file system, called the Hadoop Distributed File System, or HDFS, which spreads data across the disks of the whole cluster, so the limited disk capacity of a single machine is no longer a problem.
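Once HDFS is up, you can see how much combined capacity the file system offers with a single command; the trailing slash is simply the root of HDFS, and -h prints human-readable sizes:

hdfs dfs -df -h /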

There are a handful of useful commands worth remembering, and their resemblance to bash commands is easy to see. A short session combining several of them follows the list.

ls - list files in a given directory

hdfs dfs -ls <dir>

mkdir - create a directory

hdfs dfs -mkdir <path>

put - upload a file to HDFS

hdfs dfs -put <localSrc> <dest>

copyFromLocal - copy a file from the local file system to HDFS, as the name suggests

hdfs dfs -copyFromLocal <localSrc> <dest>

get - download a file from HDFS

hdfs dfs -get <src> <localDest>

copyToLocal - copy a file from HDFS to the local file system, as the name suggests

hdfs dfs -copyToLocal <src> <localDest>

mv - move a file from one directory to another, or rename it

hdfs dfs -mv <src> <dest>

cp - copy a file from one directory to another

hdfs dfs -cp <src> <dest>

rm - remove a file from HDFS (add -r to remove a directory recursively)

hdfs dfs -rm <path>
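Putting a few of these together, a typical session looks like this; the directory and file names are just examples:

# create a directory on HDFS (-p creates missing parent directories)
hdfs dfs -mkdir -p /user/me/data

# upload a local file into it
hdfs dfs -put sales.csv /user/me/data

# check that the file arrived
hdfs dfs -ls /user/me/data

# download it back under a different local name
hdfs dfs -get /user/me/data/sales.csv ./sales_copy.csv

# clean up
hdfs dfs -rm /user/me/data/sales.csv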

More commands are available here.

4. Subjects still to cover