
Commit 6bebb9d

Update for AWS EMR 4.x. Thanks to Kevin Kleinfelter, Bruce Weir, Ashley Engelund!
1 parent ab64536 commit 6bebb9d

3 files changed: +341 -281 lines changed

assignment4/README.txt

Lines changed: 12 additions & 9 deletions
@@ -12,27 +12,30 @@ myudfs.jar from S3, through the line:
 
 register s3n://uw-cse-344-oregon.aws.amazon.com/myudfs.jar
 
-
-OPTION 2: do-it-yourself; run this on your local machine:
+OPTION 2: Do-it-yourself; run this on your local machine:
 
 cd pigtest
-ant -- this should create the file myudfs.jar
+ant
+
+This should create the file myudfs.jar.
 
 Next, modify example.pig to:
 
 register ./myudfs.jar
 
 Next, after you start the AWS cluster, copy myudfs.jar to the AWS
-Master Node (see hw6-awsusage.html).
+Master Node (see awsinstructions.md).
 
 ================================================================
 
-STEP2
-
-Start an AWS Cluster (see hw6-awsusage.html), start pig interactively,
-and cut and paste the content of example.pig. I prefer to do this line by line
+STEP 2
 
+Start an AWS Cluster (see awsinstructions.md), start pig interactively,
+and cut and paste the content of example.pig. I prefer to do this line by
+line.
 
-Note: The program may appear to hang with a 0% completion time... go check the job tracker. Scroll down. You should see a MapReduce job running with some non-zero progress.
+Note: The program may appear to hang with a 0% completion time.
+Go check the Hadoop monitor. You should see a MapReduce job running with
+some non-zero progress.
 
 Also note that the script will generate more than one MapReduce job.
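
For reference, the two register variants from the diff above, collected in one place; example.pig should contain exactly one of them, depending on which option you followed:

    -- Option 1: use the prebuilt UDF jar hosted on S3
    register s3n://uw-cse-344-oregon.aws.amazon.com/myudfs.jar

    -- Option 2: use the jar you built locally with ant and copied to the AWS Master Node
    register ./myudfs.jar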

assignment4/assignment4.md

Lines changed: 17 additions & 15 deletions
@@ -1,32 +1,34 @@
+## **Note**
+
 ### **We cannot reimburse you for any charges**
 
 ### **Terminating an AWS cluster**
 
-When you are done running Pig scripts, make sure to **ALSO** terminate your job flow. This is a step that you need to do **in addition to ** stopping pig and Hadoop (if necessary).
+When you are done running Pig scripts, make sure to **ALSO** terminate your cluster. This is a step that you need to do **in addition to** stopping Pig and Hadoop (if necessary).
 
-1. 1.Go to the [Management Console.](https://console.aws.amazon.com/elasticmapreduce/home)
-2. 2.Select the job in the list.
-3. 3.Click the Terminate button (you may also need to turn off Termination protection).
-4. 4.Wait for a while (may take minutes) and recheck until the job state becomes TERMINATED.
+1. Go to the [Management Console](https://console.aws.amazon.com/elasticmapreduce/home).
+2. Select the cluster in the list.
+3. Click the Terminate button (you may also need to turn off Termination protection).
+4. Wait for a while (may take minutes) and recheck until the cluster state becomes TERMINATED.
 
-### **If you fail to terminate your job and only close the browser, or log off AWS, your AWS will continue to run, and AWS will continue to charge your credit card: for hours, days, and weeks. Make sure you don't leave the console until you have confirmation that the job is terminated.**
+**If you fail to terminate your cluster and only close the browser or log off AWS, your cluster will continue to run, and AWS will continue to charge your credit card: for hours, days, and weeks. Make sure you don't leave the console until you have confirmation that the cluster is terminated.**
 
-## **Notes**
+The quiz should cost no more than 10-20 dollars if you only use medium AWS instances.
 
-This assignment will be very difficult from Windows; the instructions assume you have access to a Linux command line.
+## **Problem 0: Setup your Pig Cluster**
 
-The quiz should cost no more than 5-10 dollars if you only use small aws instances
+1. Follow [these instructions](https://github.com/uwescience/datasci_course_materials/blob/master/assignment4/awsinstructions.md) to set up the cluster. NOTE: It will take you a good **60 minutes** to go through all these instructions without even trying to run example.pig at the end. But they are worth it. You are learning how to use the Amazon cloud, which is by far the most popular cloud platform today. At the end, the instructions will refer to example.pig. This is the name of the sample program that we will run in the next step.
+2. You will find example.pig in the course materials repo at:
 
-## **Problem 0: Setup your Pig Cluster**
+   https://github.com/uwescience/datasci_course_materials/blob/master/assignment4/
 
-1. Follow [these instructions](https://github.com/uwescience/datasci_course_materials/blob/master/assignment4/awsinstructions.md) to setup the cluster. NOTE: It will take you a good **60 minutes** to go through all these instructions without even trying to run example.pig at the end. But they are worth it. You are learning how to use the Amazon cloud, which is by far the most popular cloud platform today. At the end, the instructions will refer to _example.pig_. This is the name of the sample program that we will run in the next step.
-2. You will find example.pig in the [course materials repo](https://github.com/uwescience/datasci_course_materials). example.pig is a Pig Latin script that loads and parses the billion triple dataset that we will use in this assignment into triples: (subject, predicate, object). Then it groups the triples by their object attribute and sorts them in descending order based on the count of tuple in each group.
-3. Follow the README.txt: it provides more information on how to run the sample program called example.pig.
+   example.pig is a Pig Latin script that loads and parses the billion triple dataset that we will use in this assignment into triples: (subject, predicate, object). Then it groups the triples by their object attribute and sorts them in descending order based on the count of tuples in each group (see the Pig sketch after this hunk).
+3. Follow awsinstructions.md: it provides more information on how to run the sample program called example.pig.
 4. There is nothing to turn in for Problem 0.
 
 ## **Useful Links**
 
-[Pig Latin reference](http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html)
+[Pig Latin reference](http://pig.apache.org/docs/r0.15.0/piglatin_ref2.html)
 
 [Counting rows in an alias](http://stackoverflow.com/questions/9900761/pig-how-to-count-a-number-of-rows-in-alias)
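
example.pig itself is not reproduced in this commit, so here is a minimal sketch of the pipeline shape described in item 2 above. The alias names objects and count_by_object echo the hints later in this file, but the LOAD path, the schema, and the parsing step (which the real script performs with a UDF from myudfs.jar) are placeholders, not the actual code:

    -- Sketch only, NOT the actual example.pig: 'input' and 'output' are placeholder
    -- paths, and the real script builds the triples with a UDF from myudfs.jar.
    triples = LOAD 'input' AS (subject:chararray, predicate:chararray, object:chararray);
    objects = GROUP triples BY object;           -- group triples by their object attribute
    count_by_object = FOREACH objects GENERATE
        group AS object, COUNT(triples) AS cnt;  -- count the tuples in each group
    ordered = ORDER count_by_object BY cnt DESC; -- sort by count, descending
    STORE ordered INTO 'output';

The "Counting rows in an alias" link above comes down to the standard GROUP ... ALL idiom, which collapses an entire alias into a single bag so that COUNT can consume it:

    everything = GROUP count_by_object ALL;      -- one group holding every row
    row_count = FOREACH everything GENERATE COUNT(count_by_object);
    DUMP row_count;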

@@ -81,7 +83,7 @@ Modify example.pig to use the file uw-cse-344-oregon.aws.amazon.com/btc-2010-chu
 - After the command objects = ...
 - After the command count\_by\_object = ...
 
-**Hint 1** : [Use the job tracker](https://class.coursera.org/datasci-001/wiki/view?page=awssetup) to see the number of map and reduce tasks for your MapReduce jobs.
+**Hint 1**: Use the Hadoop monitor to see the number of map and reduce tasks for your MapReduce jobs.
 
 **Hint 2:** To see the schema for intermediate results, you can use Pig's interactive command line client grunt, which you can launch by running Pig without specifying an input script on the command line. When using grunt, a command that you may want to know about is [describe](http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#DESCRIBE). To see a list of other commands, type help.
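
As a concrete instance of Hint 2 (a hypothetical grunt exchange; the alias name is borrowed from the illustrative sketch earlier, not from the actual script):

    -- at the grunt> prompt, after entering the line that defines the alias:
    describe count_by_object;
    -- grunt answers with the alias's schema, something of the form:
    --   count_by_object: {object: chararray,cnt: long}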
