This tutorial is about setting up an environment with scripts to work on huge datasets via Amazon's Hadoop implementation, EMR. By huge I mean datasets so large that a simple yet powerful grep no longer cuts it for you. What you need is Hadoop.
Setting up Hadoop for the first time, or scaling it, can be too much of an effort, which is why we switched to Amazon Elastic MapReduce (EMR), Amazon's hosted version of Hadoop (originally developed at Yahoo!), which itself is an implementation of Google's MapReduce paper. Amazon's EMR will take care of the Hadoop architecture and of scalability, in the likely case that one cluster is not enough for you.
So let me outline the architecture of the tools and services I have in mind to get our environment going.
AWS & S3
You will need your Access Key, your Secret Key and usually a private key file to access AWS programmatically.
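If you want to keep the keys out of individual tool configs, the standard AWS environment variables can hold them; a minimal sketch (the values are placeholders, not real keys):

```shell
# Standard AWS credential environment variables; boto and many other
# tools pick these up automatically. Replace the placeholder values.
export AWS_ACCESS_KEY_ID=YOURACCESSKEY
export AWS_SECRET_ACCESS_KEY=YOURSECRETKEY
```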
If you sit behind a corporate NTLM proxy, use cntlm to tunnel through it:
sudo apt-get install cntlm
To create the hashes:
cntlm -H -c /etc/cntlm.conf
Enter them into your cntlm config. Start cntlm with
sudo service cntlm start
To make sure it also starts at boot, issue a
sudo update-rc.d cntlm defaults
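For reference, a minimal /etc/cntlm.conf might look like the sketch below; domain, user, proxy host and the hash value are placeholders (use the hashes that cntlm -H printed for you):

```
# /etc/cntlm.conf — minimal sketch, all values are placeholders
Username    youruser
Domain      YOURDOMAIN
PassNTLMv2  YOURHASHFROMCNTLM
Proxy       proxy.yourcompany.com:8080
Listen      3128
```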
Fetch from FTP with wget
In our case, the logs lie on an SSL website. I use wget to get past the login and fetch all the logs. wget on Ubuntu 14.04 does not work with SSL and can therefore be considered broken. Software that can't talk over encryption, in light of the global spying apparatus, is no good software.
sudo apt-get remove wget
sudo apt-get install gnutls-bin libgnutls-dev
mkdir -p ~/Apps/Linux && cd ~/Apps/Linux
Download the wget 1.15 source tarball from the GNU mirror (e.g. with curl -O https://ftp.gnu.org/gnu/wget/wget-1.15.tar.gz), then unpack it:
tar xvf wget-1.15.tar.gz && cd wget-1.15/
./configure --prefix=/usr --with-ssl && make && sudo make install
Now we can finally create a fetch.sh file:
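The exact contents depend on your log host; below is a sketch that writes such a fetch.sh, with the URL and the credential variables as placeholders to adapt:

```shell
# Write a sketch of fetch.sh; host and credentials are placeholders.
cat > fetch.sh <<'EOF'
#!/bin/sh
set -e
mkdir -p logs/property
# --user/--password get us past the login; --accept limits the
# download to the gzipped log files. Adjust URL and variables.
wget --user="$LOG_USER" --password="$LOG_PASS" \
     --recursive --no-parent --level=1 --accept '*.gz' \
     --no-directories --directory-prefix=logs/property \
     https://logs.example.com/property/
EOF
chmod +x fetch.sh
```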
Sync to S3 with s3cmd
The s3cmd packaged for Ubuntu, Fedora, RedHat and Debian is broken for files that have UTF-8 characters in their names. You need to install from the upstream repo to get it to work:
git clone https://github.com/s3tools/s3cmd/
cd s3cmd/
sudo python setup.py install
sudo apt-get install python-dateutil
sudo cp s3cmd /usr/bin/
sudo cp -r S3 /usr/bin/
Run s3cmd --configure to set up your keys and test the connection. If it works, finally push the files to your S3 bucket:
s3cmd sync --skip-existing logs/property/*.gz s3://YourBucket/property/input/
The --skip-existing switch makes sure that files which are already present are not uploaded twice.
Verify Number of Files
s3cmd ls s3://yourbucket/property/input/ | wc -l
Compare this number to the number of files in the local directory fetch.sh downloaded into.
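A small helper makes the local side of the comparison explicit; a sketch, assuming the logs landed in logs/property/ (bucket path in the usage line is a placeholder):

```shell
# count_gz DIR — count the gzipped logs in DIR, so the number can be
# compared against the `s3cmd ls ... | wc -l` output from above
count_gz() {
  ls "$1"/*.gz 2>/dev/null | wc -l
}
# e.g.: test "$(count_gz logs/property)" -eq "$(s3cmd ls s3://yourbucket/property/input/ | wc -l)"
```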
Boto
Check which version of Boto you have installed with:
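One way to check, assuming boto is importable (boto 2 exposes its version as boto.__version__):

```shell
# Print the installed boto version, or a note if it is missing.
python -c 'import boto; print(boto.__version__)' 2>/dev/null || echo "boto not installed"
```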
2.29.1 is the latest version at the time of writing.
This is how you can get the latest version of Boto for your system.
git clone git://github.com/boto/boto.git
cd boto/
sudo python setup.py install
Create a ~/.boto file with these contents:
[Credentials]
aws_access_key_id = YOURACCESSKEY
aws_secret_access_key = YOURSECRETKEY
More info on Boto: http://boto.readthedocs.org/en/latest/getting_started.html
MapReduce with mrjob
sudo apt-get install python-yaml
mkdir -p ~/Apps && cd ~/Apps
git clone https://github.com/Yelp/mrjob
cd mrjob/
sudo python setup.py install
emacs ~/.mrjob.conf
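A minimal example of what ~/.mrjob.conf could contain for the emr runner; keys, key pair name and region are placeholders, and the mrjob docs list the full set of options:

```yaml
# ~/.mrjob.conf — sketch; all values are placeholders
runners:
  emr:
    aws_access_key_id: YOURACCESSKEY
    aws_secret_access_key: YOURSECRETKEY
    ec2_key_pair: your-keypair
    ec2_key_pair_file: ~/your-keypair.pem
    aws_region: us-east-1
```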
Perform First MapReduce
python mr_word_freq_count.py -r emr s3://yourbucket/property/input/sample.gz --output-dir=s3://yourbucket/property/output/
The --output-dir switch makes sure that we don't print the results to the local CLI, but write them to your S3 bucket.
You should see something like this:
We can then further reduce the output or fetch the file directly with:
s3cmd ls s3://yourbucket/property/output/
s3cmd get s3://yourbucket/property/output/part-00000
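Once the part files are local, they can be merged and ranked; a sketch, assuming mrjob's default tab-separated output of the form "word"<TAB>count:

```shell
# top_words N FILE... — merge mrjob part files and print the N most
# frequent words; expects lines of the form "word"<TAB>count
top_words() {
  n="$1"; shift
  cat "$@" | sort -t "$(printf '\t')" -k2,2 -rn | head -n "$n"
}
# usage: top_words 10 part-*
```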
This should serve as an excellent example to get you implementing a solution right away with Python.
Continue reading here: Elastic MapReduce with mrJob