Elastic Map Reduce with Amazon S3, AWS, EMR, Python, MrJob and Ubuntu 14.04

This tutorial is about setting up an environment with scripts to work on huge datasets via Amazon's Hadoop offering, EMR. By huge I mean datasets so large that a simple yet powerful grep does not cut it any more for you. What you need then is Hadoop.

Setting up Hadoop for the first time, or scaling it, can be too much of an effort. This is why we switched to Amazon Elastic MapReduce, or EMR: Amazon's hosted Hadoop, the Apache project born at Yahoo! that implements Google's MapReduce paper. Amazon's EMR takes care of the Hadoop architecture and of scaling, for the likely case that a single machine is not enough for you.

So let me outline the architecture of the tools and services I have in mind to get our environment going.

[Figure: emr-appstack, the tools and services in this setup]

AWS & S3

You will need your Access Key, your Secret Key and usually a private key file (EC2 key pair) to access AWS programmatically.
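If you prefer not to hard-code credentials, boto and mrjob, which we set up further down, will also pick them up from environment variables, for example in your ~/.bashrc. The values here are of course placeholders:

export AWS_ACCESS_KEY_ID=YOURACCESSKEY
export AWS_SECRET_ACCESS_KEY=YOURSECRETKEY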

cntlm

sudo apt-get install cntlm

Edit /etc/cntlm.conf:
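What exactly goes in there depends on your corporate proxy; a minimal sketch (username, domain, proxy host and ports are placeholders) looks like this:

Username    your_user
Domain      your_domain
Proxy       proxy.example.com:8080
NoProxy     localhost, 127.0.0.*
Listen      3128
# The Pass* hash lines go here after running cntlm -H below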

To create the hashes:

cntlm -H -c /etc/cntlm.conf

Enter them into your cntlm config. Start cntlm with

sudo service cntlm start

To make sure it also starts at boot, issue a

sudo update-rc.d cntlm defaults
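To check that the proxy actually relays traffic (assuming you kept the Listen port 3128 from the sketch above), a quick test is:

http_proxy=http://localhost:3128 wget -q -O /dev/null http://example.com/ && echo "proxy OK"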

Fetch the Logs with wget

In our case, the logs sit on an SSL website behind a login. I use wget to get past the login and fetch all the logs. The wget that ships with Ubuntu 14.04 does not work with SSL and can therefore be considered broken. Software that can't talk over encryption, in light of the global spying apparatus, is no good software.

sudo apt-get remove wget
sudo apt-get install build-essential gnutls-bin libgnutls-dev
mkdir -p ~/Apps/Linux
cd ~/Apps/Linux
wget http://ftp.gnu.org/gnu/wget/wget-1.15.tar.gz
tar xvf wget-1.15.tar.gz && cd wget-1.15/
./configure --prefix=/usr --with-ssl && make && sudo make install
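A quick sanity check that the freshly built binary is the one on your PATH and that it can talk TLS now:

which wget
wget --version | head -n 1
wget -q -O /dev/null https://www.google.com/ && echo "SSL OK"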

Now we can finally create a fetch.sh file:
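Its contents depend entirely on how your log host handles authentication, so the host, credentials and accept pattern below are placeholders; a sketch that logs in over HTTPS and mirrors all .gz logs into logs/property/ could look like this:

#!/bin/bash
# Placeholders: adjust host, credentials and target directory for your setup.
mkdir -p logs/property
wget --user="$LOG_USER" --password="$LOG_PASS" \
     --recursive --no-parent --no-directories \
     --accept '*.gz' \
     --directory-prefix=logs/property/ \
     https://logs.example.com/property/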

Sync to S3 with s3cmd

The s3cmd packaged for Ubuntu, Fedora, RedHat and Debian is broken for files that have UTF-8 characters in their names. You need to install it from the upstream repository to get it to work:

git clone https://github.com/s3tools/s3cmd/
cd s3cmd
sudo python setup.py install
sudo apt-get install python-dateutil
sudo cp s3cmd /usr/bin/
sudo cp -r S3 /usr/bin/
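Before the first sync, s3cmd needs your AWS credentials. If you have not set it up yet, the interactive configuration walks you through it and writes ~/.s3cfg:

s3cmd --configure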

Test with s3cmd. If it works, finally push the files to your S3 bucket:

s3cmd sync --skip-existing logs/property/*.gz s3://yourbucket/property/input/

--skip-existing makes sure that files which are already present in the bucket are not uploaded a second time.

Verify Number of Files

s3cmd ls s3://yourbucket/property/input/ | wc -l

Compare this number to the file count in the directory fetch.sh downloaded into, i.e. the relative directory logs/property/.
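A quick way to get that local count (assuming fetch.sh only downloaded .gz files):

ls logs/property/*.gz | wc -l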

Boto

Check which Boto you have installed with:

>>> import boto
>>> boto.Version
'2.29.1'
>>> exit()

2.29.1 is the latest version at the time of writing.

This is how you can get the latest version of Boto for your system.

git clone git://github.com/boto/boto.git
cd boto
sudo python setup.py install

Create ~/.boto

Create a ~/.boto file with these contents:

[Credentials]
aws_access_key_id = YOURACCESSKEY
aws_secret_access_key = YOURSECRETKEY

More info on Boto: http://boto.readthedocs.org/en/latest/getting_started.html
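To make sure boto actually picks up the credentials, a small Python 2 snippet that lists your buckets is enough; connect_s3() reads ~/.boto (or the environment variables from above):

import boto

# Reads the credentials from ~/.boto or the environment
conn = boto.connect_s3()
for bucket in conn.get_all_buckets():
    print bucket.name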

MapReduce with MrJob

sudo apt-get install python-yaml
mkdir -p ~/Apps && cd ~/Apps
git clone https://github.com/Yelp/mrjob
cd mrjob
sudo python setup.py install

Create mrjob.conf

emacs ~/.mrjob.conf
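A minimal sketch for running on EMR could look like the following; the credentials, region, key pair and instance sizing are placeholders, and the option names are the ones documented for the mrjob release current at the time of writing, so check the docs of your installed version:

runners:
  emr:
    aws_access_key_id: YOURACCESSKEY
    aws_secret_access_key: YOURSECRETKEY
    aws_region: us-east-1
    ec2_key_pair: your-keypair
    ec2_key_pair_file: ~/your-keypair.pem
    ec2_instance_type: m1.small
    num_ec2_instances: 4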

Perform First MapReduce

python mr_word_freq_count.py -r emr s3://yourbucket/property/input/sample.gz --output-dir=s3://yourbucket/property/output/

The --output-dir switch makes sure that the results are not streamed back to your local terminal, but written to your S3 bucket.
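For reference, mr_word_freq_count.py is one of the examples that ship with mrjob; a minimal job along those lines looks roughly like this:

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        # Pre-aggregate on the mapper side to save bandwidth
        yield word, sum(counts)

    def reducer(self, word, counts):
        # Final count per word
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFreqCount.run()

The same script runs locally with -r local, which is handy for testing on a small sample before paying for a cluster.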

While the job runs on EMR, mrjob prints its progress; you should see something like this:

We can then further reduce the output or fetch the file directly with:

s3cmd ls s3://yourbucket/property/output/
s3cmd get s3://yourbucket/property/output/part-00000
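For a quick look at the most frequent words in the local copy (mrjob writes tab-separated key/value lines), something like this does the job:

sort -t$'\t' -k2 -nr part-00000 | head -n 20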

This should serve as an excellent example to get you implementing a solution with Python right away.

Read more here: https://pythonhosted.org/mrjob/guides/emr-quickstart.html

TBC....