Christopher M. Judd

Tutorial Christopher M. Judd

Christopher M. Judd

CTO and Partner; leader of the Columbus Developer User Group (CIDUG)

Marc Peabody @marcpeabody

Introduction

http://hadoop.apache.org/

Scale up


Scale-up

Scale-out

Hadoop Approach

• scale-out

• share nothing

• expect failure

• smart software, dumb hardware

• move processing, not data

• build applications, not infrastructure

What is Hadoop good for?

"Don't use Hadoop - your data isn't that big"

• 10 GB - add memory and use Pandas

• 100 GB to 1 TB - buy a big hard drive and use Postgres

• > 5 TB - life sucks, consider Hadoop

http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

Hadoop is an evolving project

• old API: org.apache.hadoop.mapred

• new API: org.apache.hadoop.mapreduce

• MapReduce 1: Classic MapReduce

• MapReduce 2: YARN

Setup

Hadoop Tutorial

• user/password: user/fun4all

• data directory: /opt/data

Configure SSH

Add a second Host-only Adapter


sudo /Library/StartupItems/VirtualB

sudo /Library/StartupItems/Virtual

SSH’ing

$ ssh user@192.168.56.101
The authenticity of host '192.168.56.101 (192.168.56.101)' can't be established.
RSA key fingerprint is 40:60:72:ce:48:03:ac:c7:4c:23:9a:4f:1e:a5:ae:b9.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.56.101' (RSA) to the list of known hosts.
user@192.168.56.101's password:
Welcome to Ubuntu 12.04.3 LTS (GNU/Linux 3.8.0-29-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

Last login: Tue Jan 7 08:49:28 2014 from user-virtualbox.local
user@user-VirtualBox:~$

Hadoop

http://www.cloudera.com/

http://hortonworks.com/

Add Hadoop User and Group

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
$ sudo adduser hduser sudo

$ su hduser
$ cd ~

Install Hadoop

$ sudo mkdir -p /opt/hadoop
$ sudo tar vxzf /opt/data/hadoop-2.2.0.tar.gz -C /opt/hadoop
$ sudo chown -R hduser:hadoop /opt/hadoop/hadoop-2.2.0
$ vim .bashrc

# other stuff

# java variables
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

# hadoop variables
export HADOOP_HOME=/opt/hadoop/hadoop-2.2.0
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME

$ source .bashrc
$ hadoop version

Run Hadoop Job

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 1000

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar

aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.

aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.

bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.

dbcount: An example job that count the pageview counts from a database.

distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.

grep: A map/reduce program that counts the matches of a regex in the input.

join: A job that effects a join over sorted, equally partitioned datasets

multifilewc: A job that counts words from several files.

pentomino: A map/reduce tile laying program to find solutions to pentomino problems.

pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.

randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.

randomwriter: A map/reduce program that writes 10GB of random data per node.

secondarysort: An example defining a secondary sort to the reduce.

sort: A map/reduce program that sorts the data written by the random writer.

sudoku: A sudoku solver.

teragen: Generate data for the terasort

terasort: Run the terasort

teravalidate: Checking results of terasort

wordcount: A map/reduce program that counts the words in the input files.

wordmean: A map/reduce program that counts the average length of the words in the input files.

wordmedian: A map/reduce program that counts the median length of the words in the input files.

wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi
Usage: org.apache.hadoop.examples.QuasiMonteCarlo <nMaps> <nSamples>
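The pi example estimates Pi with a quasi-Monte Carlo method, and `pi 4 1000` asks for 4 map tasks with 1000 samples each. The idea can be sketched in plain Java without a cluster. This is only an illustration under stated assumptions: the sample points come from a Halton sequence with bases 2 and 3, each "map task" is an ordinary method call, and the class and method names are made up for this sketch rather than taken from Hadoop.

```java
// Sketch of quasi-Monte Carlo estimation of Pi, split into "map task" sized chunks.
// Assumption: Halton sequence (bases 2 and 3) for the sample points; names are illustrative.
class PiEstimateSketch {

    // Radical inverse of index in the given base: the index-th Halton value in [0, 1).
    static double halton(long index, int base) {
        double result = 0.0;
        double fraction = 1.0 / base;
        while (index > 0) {
            result += (index % base) * fraction;
            index /= base;
            fraction /= base;
        }
        return result;
    }

    // Stands in for one map task: counts the samples offset+1 .. offset+numSamples
    // that fall inside the circle of radius 0.5 centered in the unit square.
    static long countInside(long offset, long numSamples) {
        long inside = 0;
        for (long i = 1; i <= numSamples; i++) {
            double x = halton(offset + i, 2) - 0.5;
            double y = halton(offset + i, 3) - 0.5;
            if (x * x + y * y <= 0.25) {
                inside++;
            }
        }
        return inside;
    }

    // Stands in for the reduce step: adds the per-task counts and scales by 4,
    // because (circle area) / (square area) = Pi / 4.
    static double estimatePi(int numMaps, long samplesPerMap) {
        long inside = 0;
        for (int m = 0; m < numMaps; m++) {
            inside += countInside(m * samplesPerMap, samplesPerMap);
        }
        return 4.0 * inside / (numMaps * samplesPerMap);
    }

    public static void main(String[] args) {
        // Same shape as "pi 4 1000": 4 chunks of 1000 samples.
        System.out.println(estimatePi(4, 1000));
    }
}
```

Each chunk works on its own range of sample indices and shares nothing with the others, which is what lets the real job spread the work across map tasks.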

Lab 1

1. Create Hadoop user and group

2. Install Hadoop

3. Run example Hadoop job such as pi

• Local Standalone mode

• Pseudo-distributed mode

• Fully distributed mode

HDFS

POSIX: Portable Operating System Interface

Reading Data

Writing Data

Configure passwordless login

$ ssh-keygen -t rsa -P ''
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
$ exit

Configure HDFS

$ sudo mkdir -p /opt/hdfs/namenode
$ sudo mkdir -p /opt/hdfs/datanode
$ sudo chmod -R 777 /opt/hdfs
$ sudo chown -R hduser:hadoop /opt/hdfs
$ cd /opt/hadoop/hadoop-2.2.0
$ sudo vim etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hdfs/datanode</value>
  </property>
</configuration>

Format HDFS

$ hdfs namenode -format

Configure JAVA_HOME

$ vim etc/hadoop/hadoop-env.sh

# other stuff

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

# more stuff

Configure Core

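$ sudo vim etc/hadoop/core-site.xml

<configuration>
  <!-- fs.default.name still works in Hadoop 2.x but is deprecated in favor of fs.defaultFS -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>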

Start HDFS

$ start-dfs.sh

$ jps
6433 DataNode
6844 Jps
6206 NameNode
6714 SecondaryNameNode

Use HDFS commands

$ hdfs dfs -ls /
$ hdfs dfs -mkdir /books
$ hdfs dfs -ls /
$ hdfs dfs -ls /books
$ hdfs dfs -copyFromLocal /opt/data/moby_dick.txt /books
$ hdfs dfs -cat /books/moby_dick.txt

• appendToFile
• cat
• chgrp
• chmod
• chown
• copyFromLocal
• copyToLocal
• count
• cp
• du
• get
• ls
• lsr
• mkdir
• moveFromLocal
• moveToLocal
• mv
• put
• rm
• rmr
• stat
• tail
• test
• text
• touchz

http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/FileSystemShell.html

http://192.168.56.101:50070/dfshealth.jsp

Lab 2

1. Configure passwordless login

2. Configure HDFS

3. Format HDFS

4. Configure JAVA_HOME

5. Configure Core

6. Start HDFS

7. Experiment with HDFS commands (ls, mkdir, copyFromLocal, cat)

HADOOP Pseudo-Distributed

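Configure YARN

$ sudo vim etc/hadoop/yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>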



Configure Map Reduce

$ sudo mv etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
$ sudo vim etc/hadoop/mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Start YARN

$ start-yarn.sh

$ jps
6433 DataNode
8355 Jps
8318 NodeManager
6206 NameNode
6714 SecondaryNameNode
8090 ResourceManager
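Reading the jps listing by eye is enough for a lab, but the same check can be written down as a small helper. The sketch below is a hypothetical utility, not part of Hadoop: it assumes the jps output is available as a string of "pid name" lines and reports which of the five daemons named in this tutorial's listings are absent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: checks a captured jps listing for the daemons this tutorial starts.
class DaemonCheckSketch {

    // Daemon names taken from the jps listings shown in this tutorial.
    static final List<String> EXPECTED = Arrays.asList(
            "NameNode", "SecondaryNameNode", "DataNode", "ResourceManager", "NodeManager");

    // Returns the expected daemons that do not appear in the given jps output.
    static List<String> missingDaemons(String jpsOutput) {
        Set<String> running = new HashSet<>();
        for (String line : jpsOutput.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                running.add(parts[1]); // parts[0] is the process id
            }
        }
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED) {
            if (!running.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String afterStartYarn = "6433 DataNode\n8355 Jps\n8318 NodeManager\n6206 NameNode\n"
                + "6714 SecondaryNameNode\n8090 ResourceManager\n";
        // An empty list means every expected daemon is present.
        System.out.println(missingDaemons(afterStartYarn));
    }
}
```

Feeding it the listing taken after start-dfs.sh alone would report ResourceManager and NodeManager as missing, which matches the state before start-yarn.sh is run.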

Run Hadoop Job

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 1000

http://192.168.56.101:8042/node

Lab 3

1. Configure YARN

2. Configure Map Reduce

3. Start YARN

4. Run pi job

Combine Hadoop & HDFS

Run Hadoop Job

$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /books out
$ hdfs dfs -ls out
$ hdfs dfs -cat out/_SUCCESS
$ hdfs dfs -cat out/part-r-00000
young-armed     1
young;          2
younger         2
youngest        1
youngish        1
your            251
your@login      1
yours           5
yours?          1
yourselbs       1
yourself        14
yourself,       5
yourself,"      1
yourself.'      1
yourself;       4
yourself?       1
yourselves      1
yourselves!     3
yourselves,     1
yourselves,"    1
yourselves;     1
youth           5
youth,          2
youth.          1
youth;          1
youthful        1

Lab 4

1. Run wordcount job

2. Review output

3. Run wordcount job again with same parameters

Writing Map Reduce Jobs

{K1,V1} → map (we write) → optional step (optionally write) → {K2, List<V2>} → reduce (we write) → {K3,V3}

MOBY DICK; OR THE WHALE

By Herman Melville

CHAPTER 1. Loomings.

Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off--then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.


There now is your insular city of the Manhattoes, belted round by wharves as Indian isles by coral reefs--commerce surrounds it with her surf. Right and left, the streets take you waterward. Its extreme downtown is the battery, where that noble mole is washed by waves, and cooled by breezes, which a few hours previous were out of sight of land. Look at the crowds of water-gazers there.

K  V
1  Call me Ishmael. Some years ago--never mind how long precisely--having
2  little or no money in my purse, and nothing particular to interest me on
3  shore, I thought I would sail about a little and see the watery part of
4  the world. It is a way I have of driving off the spleen and regulating
5  the circulation.

{K1,V1} → {K2, List<V2>} → {K3,V3}
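The three shapes above, {K1,V1}, {K2, List<V2>} and {K3,V3}, can be traced in a few lines of ordinary Java before any Hadoop classes are involved. The sketch below is an in-memory stand-in under stated assumptions: plain collections play the role of the framework, a TreeMap does the sorting and grouping, and the class and method names are invented for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of map -> sort/group -> reduce for word count.
// Assumption: plain Java collections stand in for the framework; no Hadoop classes are used.
class WordCountFlowSketch {

    // map: one {K1,V1} record (line number, line text) -> many {K2,V2} pairs (word, 1).
    // The key is ignored, as is usual for word count.
    static List<Map.Entry<String, Integer>> map(int lineNumber, String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
        }
        return pairs;
    }

    // sort and group: {K2,V2} pairs -> {K2, List<V2>}, with keys in sorted order.
    static TreeMap<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // reduce: one {K2, List<V2>} group -> the V3 total for that word.
    static int reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    // Runs the three stages in order and returns the {K3,V3} pairs.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            mapped.addAll(map(i + 1, lines.get(i)));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : group(mapped).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Call me Ishmael. Some years ago--never mind how",
                "of long little of or of");
        System.out.println(run(lines));
    }
}
```

With these two sample lines the word of ends up with a total of 3 and every other word with 1. TreeMap orders keys by plain string comparison, so capitalized words sort ahead of lowercase ones.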

Mapper

package com.manifestcorp.hadoop.wc;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Type parameters: K1 = Object, V1 = Text (input); K2 = Text, V2 = IntWritable (output)
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final String SPACE = " ";
    private static final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

        // Split the line on single spaces; punctuation stays attached to each word.
        String[] words = value.toString().split(SPACE);

        for (String str : words) {
            word.set(str);
            context.write(word, ONE); // emit {K2,V2} = (word, 1)
        }
    }
}
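The mapper's `split(" ")` treats a single space as the only delimiter, so punctuation stays attached to each token, much like the separate `yourself,` and `yourself;` keys in the wordcount output earlier. The sketch below compares that rule with one possible alternative. The alternative (lower-casing and splitting on anything that is not a letter a-z) is an assumption made for illustration, not something the tutorial prescribes, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Compares the mapper's splitting rule with a hypothetical normalizing tokenizer.
class TokenizeSketch {

    // Same rule as the mapper: a single space is the only delimiter.
    static List<String> splitOnSpace(String line) {
        return Arrays.asList(line.split(" "));
    }

    // Hypothetical alternative: lower-case the line, then split on runs of
    // characters that are not the letters a-z, dropping empty tokens.
    static List<String> normalizedWords(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "yourself, yourself; ago--never";
        System.out.println(splitOnSpace(line));    // punctuation kept on each token
        System.out.println(normalizedWords(line)); // punctuation removed, "ago--never" split in two
    }
}
```

Note that two spaces in a row make `split(" ")` return an empty token, which the mapper would emit as a word of its own; the normalizing variant filters those out.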

Input {K1,V1}: the five lines above, keyed by line number.

map → {K2,V2}:
(Call, 1) (me, 1) (Ishmael., 1) (Some, 1) (years, 1) (ago--never, 1) (mind, 1) (how, 1) (of, 1) (long, 1) (little, 1) (of, 1) (or, 1) (of, 1)

sort:
(ago--never, 1) (Call, 1) (how, 1) (Ishmael., 1) (me, 1) (little, 1) (long, 1) (mind, 1) (of, 1) (of, 1) (of, 1) (or, 1) (Some, 1) (years, 1)

group → {K2, List<V2>}:
(ago--never, [1]) (Call, [1]) (how, [1]) (Ishmael., [1]) (me, [1]) (little, [1]) (long, [1]) (mind, [1]) (of, [1,1,1]) (or, [1]) (Some, [1]) (years, [1])

Reducer

package com.manifestcorp.hadoop.wc;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Type parameters: K2 = Text, V2 = IntWritable (input); K3 = Text, V3 = IntWritable (output)
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;

        // Sum the values rather than counting them, so the total stays correct
        // even when a value is greater than 1 (for example, after a combiner runs).
        for (IntWritable value : values) {
            total += value.get();
        }

        context.write(key, new IntWritable(total)); // emit {K3,V3} = (word, total)
    }
}
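A reducer can either count its values or add them up. The two agree only while every incoming value is 1. The sketch below shows the difference on plain lists, assuming a combiner has pre-aggregated some of the ones on the map side; this tutorial's driver does not configure a combiner, so that scenario and the names used here are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Contrasts counting values with summing them in a word-count reduce step.
class ReduceSumSketch {

    // Counts how many values arrived for a key.
    static int countValues(List<Integer> values) {
        int total = 0;
        for (int ignored : values) {
            total++;
        }
        return total;
    }

    // Adds up the values that arrived for a key.
    static int sumValues(List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> raw = Arrays.asList(1, 1, 1);   // three occurrences, straight from the mapper
        List<Integer> combined = Arrays.asList(2, 1); // same occurrences after a hypothetical combiner

        System.out.println(countValues(raw) + " " + sumValues(raw));
        System.out.println(countValues(combined) + " " + sumValues(combined));
    }
}
```

Summing gives 3 for both lists, while counting drops to 2 once values have been pre-aggregated, so summing is the safer choice if a combiner is ever added.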

reduce → {K3,V3}:
(ago--never, 1) (Call, 1) (how, 1) (Ishmael., 1) (me, 1) (little, 1) (long, 1) (mind, 1) (of, 3) (or, 1) (Some, 1) (years, 1)

Driver

package com.manifestcorp.hadoop.wc;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCount {

    public static void main(String[] args) throws Exception {

        // new Job() is deprecated in Hadoop 2.x; Job.getInstance() is the replacement.
        Job job = new Job();
        job.setJobName("my word count");
        job.setJarByClass(MyWordCount.class);

        // args[0] = input path, args[1] = output path (must not already exist)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

pom.xml

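<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.manifestcorp.hadoop</groupId>
  <artifactId>hadoop-mywordcount</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <properties>
    <hadoop.version>2.2.0</hadoop.version>
  </properties>

  <build>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3.2</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>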

Run Hadoop Job

$ hadoop jar target/hadoop-mywordcount-0.0.1-SNAPSHOT.jar com.manifestcorp.hadoop.wc.MyWordCount /books out

Lab 5

1. Unzip /opt/data/hadoop-mywordcount-start.zip

2. Write Mapper class

3. Write Reducer class

4. Write Driver class

5. Build (mvn clean package)

6. Run mywordcount job

7. Review output

Hadoop in the Cloud

http://aws.amazon.com/elasticmapreduce/

Making it more Real

http://aws.amazon.com/architecture/


Resources

Christopher M. Judd
CTO and Partner
email: [email protected]
web: www.juddsolutions.com
blog: juddsolutions.blogspot.com
twitter: javajudd