
Update instructions for EMR 4, add instructions to prevent observed issues. #39


Open

Wants to merge 2 commits into base: directory_rename
25 changes: 10 additions & 15 deletions 7_graph_aws/README.md
@@ -2,34 +2,29 @@ Instructions on how to run example.pig.

== STEP 1

-Importing the myudfs.jar file in pig. You need this because
-example.pig uses the function RDFSplit3(...) which is defined in myudfs.jar:
+Importing the myudfs.jar file in pig. You need this because example.pig uses the function RDFSplit3(...) which is defined in myudfs.jar:

-OPTION 1: Do nothing. example.pig is already configured to read
-myudfs.jar from S3, through the line:
+OPTION 1: Do nothing. example.pig is already configured to read myudfs.jar from S3, through the line:

register s3n://uw-cse-344-oregon.aws.amazon.com/myudfs.jar


-OPTION 2: do-it-yourself; run this on your local machine:
+OPTION 2: Do-it-yourself; run this on your local machine:

cd pigtest
-ant -- this should create the file myudfs.jar
+ant
+
+This should create the file myudfs.jar
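A quick way to confirm the build produced the UDF is to list the jar contents; a minimal sketch (the class name is assumed from the function that example.pig calls):

    jar tf myudfs.jar | grep RDFSplit3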

Next, modify example.pig to:

register ./myudfs.jar
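For orientation, a registered UDF is invoked by its package-qualified name inside the script. A rough sketch of the load-and-split step, assuming a placeholder input path and the (subject, predicate, object) schema described in the assignment; the authoritative version is example.pig itself:

    register ./myudfs.jar
    raw = LOAD 's3n://uw-cse-344-oregon.aws.amazon.com/<input-file>' USING TextLoader AS (line:chararray);
    ntriples = FOREACH raw GENERATE FLATTEN(myudfs.RDFSplit3(line)) AS (subject:chararray, predicate:chararray, object:chararray);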

-Next, after you start the AWS cluster, copy myudfs.jar to the AWS
-Master Node (see hw6-awsusage.html).
+Next, after you start the AWS cluster, copy myudfs.jar to the AWS Master Node (see hw6-awsusage.html).


-== STEP2
+== STEP 2

-Start an AWS Cluster (see hw6-awsusage.html), start pig interactively,
-and cut and paste the content of example.pig. I prefer to do this line by line
+Start an AWS Cluster (see awsinstructions.md), start pig interactively, and cut and paste the content of example.pig. I prefer to do this line by line
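One way to copy the jar and start an interactive session is over SSH; a hedged sketch, assuming the default EMR login user hadoop and placeholder key and host names:

    scp -i ~/MyKeyPair.pem myudfs.jar hadoop@<master-public-dns>:~
    ssh -i ~/MyKeyPair.pem hadoop@<master-public-dns>
    pig        # opens the grunt> prompt; paste example.pig here line by line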

-Note: The program may appear to hang with a 0% completion time... go check the job tracker. Scroll down. You should see a MapReduce job running with some non-zero progress.
+Note: The program may appear to hang with a 0% completion time... Check the Hadoop monitor or cluster details page. You should see a MapReduce job running with some non-zero progress.
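On EMR 4 the MapReduce progress shows up in the YARN ResourceManager web UI, which usually listens on port 8088 of the master node. One commonly documented way to reach it is an SSH tunnel with dynamic port forwarding plus a browser SOCKS proxy; a sketch with placeholder key, host, and proxy port:

    ssh -i ~/MyKeyPair.pem -N -D 8157 hadoop@<master-public-dns>
    # with the browser set to use localhost:8157 as a SOCKS proxy,
    # open http://<master-public-dns>:8088/ and look for a RUNNING application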

Also note that the script will generate more than one MapReduce job.
32 changes: 17 additions & 15 deletions 7_graph_aws/assignment_instructions.md
@@ -1,32 +1,34 @@
## **Note**

### **We cannot reimburse you for any charges**

### **Terminating an AWS cluster**

-When you are done running Pig scripts, make sure to **ALSO** terminate your job flow. This is a step that you need to do **in addition to ** stopping pig and Hadoop (if necessary).
+When you are done running Pig scripts, make sure to **ALSO** terminate your cluster. This is a step that you need to do **in addition to ** stopping pig and Hadoop (if necessary).

-1. 1.Go to the [Management Console.](https://console.aws.amazon.com/elasticmapreduce/home)
-2. 2.Select the job in the list.
-3. 3.Click the Terminate button (you may also need to turn off Termination protection).
-4. 4.Wait for a while (may take minutes) and recheck until the job state becomes TERMINATED.
+1. Go to the [Management Console.](https://console.aws.amazon.com/elasticmapreduce/home)
+2. Select the cluster in the list.
+3. Click the Terminate button (you may also need to turn off Termination protection).
+4. Wait for a while (may take minutes) and recheck until the cluster state becomes TERMINATED.

-### **If you fail to terminate your job and only close the browser, or log off AWS, your AWS will continue to run, and AWS will continue to charge your credit card: for hours, days, and weeks. Make sure you don't leave the console until you have confirmation that the job is terminated.**
+**If you fail to terminate your job and only close the browser, or log off AWS, your AWS will continue to run, and AWS will continue to charge your credit card: for hours, days, and weeks. Make sure you don't leave the console until you have confirmation that the job is terminated.**
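The termination steps above can also be double-checked from a terminal with the AWS CLI; a sketch in which the cluster id is a placeholder:

    aws emr list-clusters --active
    aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX
    aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX --query 'Cluster.Status.State'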

## **Notes**
-The quiz should cost no more than 5-10 dollars if you only use small aws instances
+The quiz should cost no more than 10-20 dollars if you only use medium aws instances.

This assignment will be very difficult from Windows; the instructions assume you have access to a Linux command line.

## **Problem 0: Setup your Pig Cluster**

-1. Follow [these instructions](https://github.com/uwescience/datasci_course_materials/blob/master/assignment4/awsinstructions.md) to setup the cluster. NOTE: It will take you a good **60 minutes** to go through all these instructions without even trying to run example.pig at the end. But they are worth it. You are learning how to use the Amazon cloud, which is by far the most popular cloud platform today. At the end, the instructions will refer to _example.pig_. This is the name of the sample program that we will run in the next step.
-2. You will find example.pig in the [course materials repo](https://github.com/uwescience/datasci_course_materials). example.pig is a Pig Latin script that loads and parses the billion triple dataset that we will use in this assignment into triples: (subject, predicate, object). Then it groups the triples by their object attribute and sorts them in descending order based on the count of tuple in each group.
-3. Follow the README.txt: it provides more information on how to run the sample program called example.pig.
+1. Follow [these instructions](https://github.com/uwescience/datasci_course_materials/blob/master/assignment4/awsinstructions.md) to setup the cluster. NOTE: It will take you a good **60 minutes** to go through all these instructions without even trying to run example.pig at the end. But they are worth it. You are learning how to use the Amazon cloud, which is by far the most popular cloud platform today. At the end, the instructions will refer to example.pig. This is the name of the sample program that we will run in the next step.
+2. You will find example.pig in the course materials repo at:
+
+https://github.com/uwescience/datasci_course_materials/blob/master/assignment4/
+
+example.pig is a Pig Latin script that loads and parses the billion triple dataset that we will use in this assignment into triples: (subject, predicate, object). Then it groups the triples by their object attribute and sorts them in descending order based on the count of tuple in each group.
+3. Follow awsinstructions.md: it provides more information on how to run the sample program called example.pig.
4. There is nothing to turn in for Problem 0
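The grouping-and-counting behaviour described in step 2 corresponds to a Pig Latin pipeline along these lines; the aliases objects and count_by_object are the ones referenced later in these instructions, but the exact statements are an illustrative sketch rather than the contents of example.pig:

    objects = GROUP ntriples BY object;
    count_by_object = FOREACH objects GENERATE group AS object, COUNT(ntriples) AS cnt;
    sorted = ORDER count_by_object BY cnt DESC;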

## **Useful Links**

-[Pig Latin reference](http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html)
+[Pig Latin reference](http://pig.apache.org/docs/r0.15.0/piglatin_ref2.html)

[Counting rows in an alias](http://stackoverflow.com/questions/9900761/pig-how-to-count-a-number-of-rows-in-alias)

@@ -81,7 +83,7 @@ Modify example.pig to use the file uw-cse-344-oregon.aws.amazon.com/btc-2010-chu
- After the command objects = ...
- After the command count\_by\_object = ...

-**Hint 1** : [Use the job tracker](https://class.coursera.org/datasci-001/wiki/view?page=awssetup) to see the number of map and reduce tasks for your MapReduce jobs.
+**Hint 1** : Use the Hadoop monitor to see the number of map and reduce tasks for your MapReduce jobs.

**Hint 2:** To see the schema for intermediate results, you can use Pig's interactive command line client grunt, which you can launch by running Pig without specifying an input script on the command line. When using grunt, a command that you may want to know about is [describe](http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#DESCRIBE) . To see a list of other commands, type help.
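As an example of Hint 2, after loading a relation in grunt, describe prints its schema; a sketch with a placeholder input path, which should print something like the last line shown:

    grunt> raw = LOAD 's3n://uw-cse-344-oregon.aws.amazon.com/<input-file>' USING TextLoader AS (line:chararray);
    grunt> describe raw;
    raw: {line: chararray}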
