
Commit 6bebb9d

Update for AWS EMR 4.x. Thanks to Kevin Kleinfelter, Bruce Weir, Ashley Engelund!
1 parent ab64536 commit 6bebb9d

3 files changed: +341 -281 lines changed

assignment4/README.txt

Lines changed: 12 additions & 9 deletions
@@ -12,27 +12,30 @@ myudfs.jar from S3, through the line:
 
 register s3n://uw-cse-344-oregon.aws.amazon.com/myudfs.jar
 
-
-OPTION 2: do-it-yourself; run this on your local machine:
+OPTION 2: Do-it-yourself; run this on your local machine:
 
 cd pigtest
-ant -- this should create the file myudfs.jar
+ant
+
+This should create the file myudfs.jar.
 
 Next, modify example.pig to:
 
 register ./myudfs.jar
 
 Next, after you start the AWS cluster, copy myudfs.jar to the AWS
-Master Node (see hw6-awsusage.html).
+Master Node (see awsinstructions.md).
 
 ================================================================
 
-STEP2
-
-Start an AWS Cluster (see hw6-awsusage.html), start pig interactively,
-and cut and paste the content of example.pig. I prefer to do this line by line
+STEP 2
 
+Start an AWS Cluster (see awsinstructions.md), start pig interactively,
+and cut and paste the content of example.pig. I prefer to do this line by
+line.
 
-Note: The program may appear to hang with a 0% completion time... go check the job tracker. Scroll down. You should see a MapReduce job running with some non-zero progress.
+Note: The program may appear to hang with a 0% completion time.
+Go check the Hadoop monitor. You should see a MapReduce job running with
+some non-zero progress.
 
 Also note that the script will generate more than one MapReduce job.
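
For reference, the two register variants from the diff above, collected in one place; example.pig should contain exactly one of them, depending on which option you followed:

    -- Option 1: use the prebuilt UDF jar hosted on S3
    register s3n://uw-cse-344-oregon.aws.amazon.com/myudfs.jar

    -- Option 2: use the jar you built locally with ant and copied to the AWS Master Node
    register ./myudfs.jar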

assignment4/assignment4.md

Lines changed: 17 additions & 15 deletions
@@ -1,32 +1,34 @@
+## **Note**
+
 ### **We cannot reimburse you for any charges**
 
 ### **Terminating an AWS cluster**
 
-When you are done running Pig scripts, make sure to **ALSO** terminate your job flow. This is a step that you need to do **in addition to ** stopping pig and Hadoop (if necessary).
+When you are done running Pig scripts, make sure to **ALSO** terminate your cluster. This is a step that you need to do **in addition to** stopping Pig and Hadoop (if necessary).
 
-1. 1.Go to the [Management Console.](https://console.aws.amazon.com/elasticmapreduce/home)
-2. 2.Select the job in the list.
-3. 3.Click the Terminate button (you may also need to turn off Termination protection).
-4. 4.Wait for a while (may take minutes) and recheck until the job state becomes TERMINATED.
+1. Go to the [Management Console](https://console.aws.amazon.com/elasticmapreduce/home).
+2. Select the cluster in the list.
+3. Click the Terminate button (you may also need to turn off Termination protection).
+4. Wait for a while (may take minutes) and recheck until the cluster state becomes TERMINATED.
 
-### **If you fail to terminate your job and only close the browser, or log off AWS, your AWS will continue to run, and AWS will continue to charge your credit card: for hours, days, and weeks. Make sure you don't leave the console until you have confirmation that the job is terminated.**
+**If you fail to terminate your cluster and only close the browser or log off AWS, your cluster will continue to run, and AWS will continue to charge your credit card: for hours, days, and weeks. Make sure you don't leave the console until you have confirmation that the cluster is terminated.**
 
-## **Notes**
+The quiz should cost no more than 10-20 dollars if you only use medium AWS instances.
 
-This assignment will be very difficult from Windows; the instructions assume you have access to a Linux command line.
+## **Problem 0: Setup your Pig Cluster**
 
-The quiz should cost no more than 5-10 dollars if you only use small aws instances
+1. Follow [these instructions](https://github.com/uwescience/datasci_course_materials/blob/master/assignment4/awsinstructions.md) to set up the cluster. NOTE: It will take you a good **60 minutes** to go through all these instructions without even trying to run example.pig at the end. But they are worth it. You are learning how to use the Amazon cloud, which is by far the most popular cloud platform today. At the end, the instructions will refer to example.pig. This is the name of the sample program that we will run in the next step.
+2. You will find example.pig in the course materials repo at:
 
-## **Problem 0: Setup your Pig Cluster**
+   https://github.com/uwescience/datasci_course_materials/blob/master/assignment4/
 
-1. Follow [these instructions](https://github.com/uwescience/datasci_course_materials/blob/master/assignment4/awsinstructions.md) to setup the cluster. NOTE: It will take you a good **60 minutes** to go through all these instructions without even trying to run example.pig at the end. But they are worth it. You are learning how to use the Amazon cloud, which is by far the most popular cloud platform today. At the end, the instructions will refer to _example.pig_. This is the name of the sample program that we will run in the next step.
-2. You will find example.pig in the [course materials repo](https://github.com/uwescience/datasci_course_materials). example.pig is a Pig Latin script that loads and parses the billion triple dataset that we will use in this assignment into triples: (subject, predicate, object). Then it groups the triples by their object attribute and sorts them in descending order based on the count of tuple in each group.
-3. Follow the README.txt: it provides more information on how to run the sample program called example.pig.
+   example.pig is a Pig Latin script that loads and parses the billion triple dataset that we will use in this assignment into triples: (subject, predicate, object). Then it groups the triples by their object attribute and sorts them in descending order based on the count of tuples in each group (see the Pig sketch after this hunk).
+3. Follow awsinstructions.md: it provides more information on how to run the sample program called example.pig.
 4. There is nothing to turn in for Problem 0.
 
 ## **Useful Links**
 
-[Pig Latin reference](http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html)
+[Pig Latin reference](http://pig.apache.org/docs/r0.15.0/piglatin_ref2.html)
 
 [Counting rows in an alias](http://stackoverflow.com/questions/9900761/pig-how-to-count-a-number-of-rows-in-alias)
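
example.pig itself is not reproduced in this commit, so here is a minimal sketch of the pipeline shape described in item 2 above. The alias names objects and count_by_object echo the hints later in this file, but the LOAD path, the schema, and the parsing step (which the real script performs with a UDF from myudfs.jar) are placeholders, not the actual code:

    -- Sketch only, NOT the actual example.pig: 'input' and 'output' are placeholder
    -- paths, and the real script builds the triples with a UDF from myudfs.jar.
    triples = LOAD 'input' AS (subject:chararray, predicate:chararray, object:chararray);
    objects = GROUP triples BY object;           -- group triples by their object attribute
    count_by_object = FOREACH objects GENERATE
        group AS object, COUNT(triples) AS cnt;  -- count the tuples in each group
    ordered = ORDER count_by_object BY cnt DESC; -- sort by count, descending
    STORE ordered INTO 'output';

The "Counting rows in an alias" link above comes down to the standard GROUP ... ALL idiom, which collapses an entire alias into a single bag so that COUNT can consume it:

    everything = GROUP count_by_object ALL;      -- one group holding every row
    row_count = FOREACH everything GENERATE COUNT(count_by_object);
    DUMP row_count;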

@@ -81,7 +83,7 @@ Modify example.pig to use the file uw-cse-344-oregon.aws.amazon.com/btc-2010-chu
 - After the command objects = ...
 - After the command count\_by\_object = ...
 
-**Hint 1** : [Use the job tracker](https://class.coursera.org/datasci-001/wiki/view?page=awssetup) to see the number of map and reduce tasks for your MapReduce jobs.
+**Hint 1**: Use the Hadoop monitor to see the number of map and reduce tasks for your MapReduce jobs.
 
 **Hint 2:** To see the schema for intermediate results, you can use Pig's interactive command line client grunt, which you can launch by running Pig without specifying an input script on the command line. When using grunt, a command that you may want to know about is [describe](http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#DESCRIBE). To see a list of other commands, type help.
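
As a concrete instance of Hint 2 (a hypothetical grunt exchange; the alias name is borrowed from the illustrative sketch earlier, not from the actual script):

    -- at the grunt> prompt, after entering the line that defines the alias:
    describe count_by_object;
    -- grunt answers with the alias's schema, something of the form:
    --   count_by_object: {object: chararray,cnt: long}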
