From 6bebb9dbab3c0d9ed7c4c980c4ce114edabc443a Mon Sep 17 00:00:00 2001 From: Pat Tressel Date: Fri, 5 Feb 2016 12:37:51 -0800 Subject: [PATCH 1/2] Update for AWS EMR 4.x. Thanks to Kevin Kleinfelter, Bruce Weir, Ashley Engelund! --- assignment4/README.txt | 21 +- assignment4/assignment4.md | 32 +- assignment4/awsinstructions.md | 569 ++++++++++++++++++--------------- 3 files changed, 341 insertions(+), 281 deletions(-) diff --git a/assignment4/README.txt b/assignment4/README.txt index 1ba02bc0..2d1d77cf 100644 --- a/assignment4/README.txt +++ b/assignment4/README.txt @@ -12,27 +12,30 @@ myudfs.jar from S3, through the line: register s3n://uw-cse-344-oregon.aws.amazon.com/myudfs.jar - -OPTION 2: do-it-yourself; run this on your local machine: +OPTION 2: Do-it-yourself; run this on your local machine: cd pigtest -ant -- this should create the file myudfs.jar +ant + +This should create the file myudfs.jar. Next, modify example.pig to: register ./myudfs.jar Next, after you start the AWS cluster, copy myudfs.jar to the AWS -Master Node (see hw6-awsusage.html). +Master Node (see awsinstructions.md). ================================================================ -STEP2 - -Start an AWS Cluster (see hw6-awsusage.html), start pig interactively, -and cut and paste the content of example.pig. I prefer to do this line by line +STEP 2 +Start an AWS Cluster (see awsinstructions.md), start pig interactively, +and cut and paste the content of example.pig. I prefer to do this line by +line. -Note: The program may appear to hang with a 0% completion time... go check the job tracker. Scroll down. You should see a MapReduce job running with some non-zero progress. +Note: The program may appear to hang with a 0% completion time. +Go check the Hadoop monitor. You should see a MapReduce job running with +some non-zero progress. Also note that the script will generate more than one MapReduce job. diff --git a/assignment4/assignment4.md b/assignment4/assignment4.md index f9bec523..89100a3b 100644 --- a/assignment4/assignment4.md +++ b/assignment4/assignment4.md @@ -1,32 +1,34 @@ +## **Note** + ### **We cannot reimburse you for any charges** ### **Terminating an AWS cluster** -When you are done running Pig scripts, make sure to **ALSO** terminate your job flow. This is a step that you need to do **in addition to ** stopping pig and Hadoop (if necessary). +When you are done running Pig scripts, make sure to **ALSO** terminate your cluster. This is a step that you need to do **in addition to ** stopping pig and Hadoop (if necessary). -1. 1.Go to the [Management Console.](https://console.aws.amazon.com/elasticmapreduce/home) -2. 2.Select the job in the list. -3. 3.Click the Terminate button (you may also need to turn off Termination protection). -4. 4.Wait for a while (may take minutes) and recheck until the job state becomes TERMINATED. +1. Go to the [Management Console.](https://console.aws.amazon.com/elasticmapreduce/home) +2. Select the cluster in the list. +3. Click the Terminate button (you may also need to turn off Termination protection). +4. Wait for a while (may take minutes) and recheck until the cluster state becomes TERMINATED. -### **If you fail to terminate your job and only close the browser, or log off AWS, your AWS will continue to run, and AWS will continue to charge your credit card: for hours, days, and weeks. 
Make sure you don't leave the console until you have confirmation that the job is terminated.** +**If you fail to terminate your job and only close the browser, or log off AWS, your AWS will continue to run, and AWS will continue to charge your credit card: for hours, days, and weeks. Make sure you don't leave the console until you have confirmation that the job is terminated.** -## **Notes** +The quiz should cost no more than 10-20 dollars if you only use medium aws instances. -This assignment will be very difficult from Windows; the instructions assume you have access to a Linux command line. +## **Problem 0: Setup your Pig Cluster** -The quiz should cost no more than 5-10 dollars if you only use small aws instances +1. Follow [these instructions](https://github.com/uwescience/datasci_course_materials/blob/master/assignment4/awsinstructions.md) to setup the cluster. NOTE: It will take you a good **60 minutes** to go through all these instructions without even trying to run example.pig at the end. But they are worth it. You are learning how to use the Amazon cloud, which is by far the most popular cloud platform today. At the end, the instructions will refer to example.pig. This is the name of the sample program that we will run in the next step. +2. You will find example.pig in the course materials repo at: -## **Problem 0: Setup your Pig Cluster** + https://github.com/uwescience/datasci_course_materials/blob/master/assignment4/ -1. Follow [these instructions](https://github.com/uwescience/datasci_course_materials/blob/master/assignment4/awsinstructions.md) to setup the cluster. NOTE: It will take you a good **60 minutes** to go through all these instructions without even trying to run example.pig at the end. But they are worth it. You are learning how to use the Amazon cloud, which is by far the most popular cloud platform today. At the end, the instructions will refer to _example.pig_. This is the name of the sample program that we will run in the next step. -2. You will find example.pig in the [course materials repo](https://github.com/uwescience/datasci_course_materials). example.pig is a Pig Latin script that loads and parses the billion triple dataset that we will use in this assignment into triples: (subject, predicate, object). Then it groups the triples by their object attribute and sorts them in descending order based on the count of tuple in each group. -3. Follow the README.txt: it provides more information on how to run the sample program called example.pig. + example.pig is a Pig Latin script that loads and parses the billion triple dataset that we will use in this assignment into triples: (subject, predicate, object). Then it groups the triples by their object attribute and sorts them in descending order based on the count of tuple in each group. +3. Follow awsinstructions.md: it provides more information on how to run the sample program called example.pig. 4. There is nothing to turn in for Problem 0 ## **Useful Links** -[Pig Latin reference](http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html) +[Pig Latin reference](http://pig.apache.org/docs/r0.15.0/piglatin_ref2.html) [Counting rows in an alias](http://stackoverflow.com/questions/9900761/pig-how-to-count-a-number-of-rows-in-alias) @@ -81,7 +83,7 @@ Modify example.pig to use the file uw-cse-344-oregon.aws.amazon.com/btc-2010-chu - After the command objects = ... - After the command count\_by\_object = ... 
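If it helps, here is a minimal grunt sketch for collecting that information. It is only a sketch: the alias names below follow the quiz description, so adapt them to your modified script, and read the number of MapReduce jobs off the Hadoop monitor (or the `explain` output) while the script runs.

    -- Sketch only: alias names follow the quiz description above.
    describe objects;          -- prints the schema Pig inferred for the objects alias
    describe count_by_object;  -- prints the schema of the grouped counts
    explain count_by_object;   -- prints the logical, physical, and MapReduce plans,
                               -- including the MapReduce jobs Pig generates for this alias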
-**Hint 1** : [Use the job tracker](https://class.coursera.org/datasci-001/wiki/view?page=awssetup) to see the number of map and reduce tasks for your MapReduce jobs. +**Hint 1** : Use the Hadoop monitor to see the number of map and reduce tasks for your MapReduce jobs. **Hint 2:** To see the schema for intermediate results, you can use Pig's interactive command line client grunt, which you can launch by running Pig without specifying an input script on the command line. When using grunt, a command that you may want to know about is [describe](http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#DESCRIBE) . To see a list of other commands, type help. diff --git a/assignment4/awsinstructions.md b/assignment4/awsinstructions.md index 3df3fccd..8d3b9507 100644 --- a/assignment4/awsinstructions.md +++ b/assignment4/awsinstructions.md @@ -1,235 +1,300 @@ ## Setting up your AWS account Amazon will ask you for your credit card information during the -setup process. You will be charged for using their services. You should not have to spend more than 5-10 dollars. +setup process. You will be charged for using their services. +You should not have to spend more than 10-20 dollars US. -1. Go to [http://aws.amazon.com/](http://aws.amazon.com/ "Link: http://aws.amazon.com/") -and sign -up: - - 1. You may sign in using your existing Amazon account or you can create a -new account by selecting "I am a new user." - 2. Enter your contact information and confirm your acceptance of the AWS +Go to [http://aws.amazon.com/](http://aws.amazon.com/") and sign in or sign up. +You may sign in using your existing Amazon account, or, if you need an account: +1. Select "I am a new user". +2. Enter your contact information and confirm your acceptance of the AWS Customer Agreement. - 3. Once you have created an Amazon Web Services Account, you may need to +3. Once you have created an Amazon Web Services Account, you may need to accept a telephone call to verify your identity. Some students have used -[Google Voice](https://www.google.com/voice "Link: https://www.google.com/voice")successfully if you don't have or don't want to give a -mobile number. You need Access Identifiers to make valid web service requests. -2. Go to [http://aws.amazon.com/](http://aws.amazon.com/ "Link: http://aws.amazon.com/") -and sign -in. You need to double-check that your account is signed up for three of -their services: Simple Storage Service (S3), Elastic Compute Cloud (EC2), -and Amazon Elastic MapReduce by clicking [here](https://aws-portal.amazon.com/gp/aws/manageYourAccount "Link: https://aws-portal.amazon.com/gp/aws/manageYourAccount") -- you should see "Services You're Signed Up For" under "Manage -Your Account". +[Google Voice](https://www.google.com/voice) successfully if you don't have or don't want to give a +mobile number. +4. Return to [http://aws.amazon.com/](http://aws.amazon.com/) and sign in. + +You'll be using three AWS services: Simple Storage Service (S3), +Elastic Compute Cloud (EC2), and Elastic MapReduce (EMR). ## Setting up an EC2 key pair -Note: Some students were having problem running job flows because of no -active key found, go to [AWS security credentials page](https://portal.aws.amazon.com/gp/aws/securityCredentials "Link: https://portal.aws.amazon.com/gp/aws/securityCredentials") and -make sure that you see a key under the access key, if not just click Create -a new Access Key. 
+(Note AWS uses [several types of keys and credentials](http://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html), +for different purposes, including two types of keys: Access keys are used to +make remote API calls to AWS, or to use its command line tool from your own +machine. We won't need access keys for this quiz.) To connect to an Amazon EC2 node, such as the master nodes for the Hadoop clusters you will be creating, you need an SSH key pair. To create and install one, do the following: -1. After setting up your account, follow -[Amazon's instructions](http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/generating-a-keypair.html) to create a key pair. Follow the instructions +1. To create a key pair, follow the instructions at +[Amazon's instructions](http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/generating-a-keypair.html), in section "Having AWS create the key pair for you," subsection "AWS Management Console." (Don't do this in Internet Explorer, or you might not be able -to download the .pem private key file.) -2. Download and save the .pem private key file to disk. We will reference -the .pem file as `` -in -the following instructions. -3. Make sure only you can access the .pem file. If you do not change the -permissions, you will get an error message later: +to download the .pem key file.) +2. Download and save the .pem key file to disk. In commands below, we will +refer to the .pem file as ``. (Replace this entire +string including the brackets with the location of your .pem file in the +following instructions.) +3. The local key setup differs for Linux / MacOS and Windows. + * For Linux / MacOS: Make sure only you can access the .pem file. + (If you do not change the permissions, you will get an error message later.) - $ chmod 600 + chmod 600 -4. Note: This step will NOT work on Windows 7 with cygwin. Windows 7 does -not allow file permissions to be changed through this mechanism, and they -must be changed for ssh to work. So if you must use Windows, you should -use [PuTTY](http://www.chiark.greenend.org.uk/~sgtatham/putty/) as -your ssh client. In this case, you will further have to transform this -key file into PuTTY format. For more information go to [http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html "Link: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html") and -look under "Private Key Format." + * For Windows, you can use + [PuTTY](http://www.chiark.greenend.org.uk/~sgtatham/putty/) as your ssh + client. PuTTy uses a different key file format from .pem, so you'll need + to convert the key file into PuTTY format. Install PuTTy, then go to + [http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html) + and follow the instructions in the section "Converting Your Private Key + Using PuTTYgen". ## Starting an AWS Cluster and running Pig Interactively To run a Pig job on AWS, you need to start up an AWS cluster using the -[Web Management Console](https://console.aws.amazon.com/elasticmapreduce/home "Link: https://console.aws.amazon.com/elasticmapreduce/home") and connect to the Hadoop master node. Follow -the steps below. You may also find [Amazon's interactive Pig tutorial](http://aws.amazon.com/articles/2729 "Link: http://aws.amazon.com/articles/2729") useful, but note that -the screenshots are slightly out of date.To set up and connect to a -pig cluster, perform the following steps: - - 1. 
Go to [http://console.aws.amazon.com/elasticmapreduce/home](http://console.aws.amazon.com/elasticmapreduce/home "Link: http://console.aws.amazon.com/elasticmapreduce/home") and -sign in. - 2. Click the "Amazon Elastic MapReduce" tab. - 3. Click the "Create New Job Flow" button. - 4. In the "Job Flow Name" field type a name such as "Pig Interactive Job -Flow". - 5. Select "Pig Program" from the drop down box, and then click "Continue". -Also select: "Run your own application". - 6. Select the "Start an Interactive Pig Session" radio button and click "Continue". - 7. On the next page, select only **1 small core instance**. In the -last question of the quiz you will need to set your cluster to have 20 small nodes, rather than the 1 node. - 8. On the next page, make sure that the EC2 Key Pair that is selected is -the one you created above - 9. On the last page, you will be asked if you want to configure _Bootstrap Actions_. -You do, because the default configuration can sometimes run into memory -problems. Select "Configure your Bootstrap Actions." Then, under "Action -Type," select "Memory Intensive Configuration." - 10. When you are done setting up your workflow and you come back to your management -console, you may need to refresh the page to see your workflow. It may -take a few minutes for your job flow to launch. If your cluster fails or -takes an extraordinarily long time, Amazon may be near capacity. Try again -later. - - - 11. Now you need to obtain the Master Public DNS Name. You get this by clicking -(highlighting) your job flow, which creates a frame at the bottom of your -window. Scroll down in that frame and you will find the Master Public DNS -at the bottom. We call this Master Public DNS name . - - 12. Now you are ready to connect to your cluster and run Pig jobs. From a -terminal, use the following command: - - -`$ ssh -o "ServerAliveInterval 10" -i hadoop@ -` - - - 13. Once you connect successfully, just type +[Web Management Console](https://console.aws.amazon.com/elasticmapreduce/home) +and connect to the Hadoop master node. + +Since you'll be charged for use of the cluster once it is created, you should +prepare the Pig code for at least the first part of the assignment before +starting your cluster. You can terminate the cluster in between parts of the +assignment, and set up a new one when you're ready for the next part. + +### Creating the cluster + +These instructions show use of the "quick setup". There is also an "advanced" +setup option that may be needed if your job runs out of memory and you need to +set memory options. + +1. Go to +[http://console.aws.amazon.com/elasticmapreduce/home](http://console.aws.amazon.com/elasticmapreduce/home) +and sign in. +2. Click "Create Cluster". +3. Under General Configuration: + * In the "Cluster Name" field, you can enter a name to identify the purpose + of the cluster. + * For Launch mode, Cluster should be selected (this is the default). +4. Under Software Configuration: + * Select "Core Hadoop". +5. Under Hardware Configuration: + * Select the instance type. For most parts of this quiz, c1.medium will be + fine. For the last quiz question, a larger instance size like m2.xlarge + or m3.xlarge may be appropriate. + * For number of instances, select 1 for now. For the last quiz question, + you can select up to 20. +6. Security and access: + * Select the name of the key pair you created earlier. +7. When you're ready, click Create cluster. +8. This will open the Cluster Details page. 
You can see the requested instances +being acquired and provisioned toward the right side of the form. The state of +the cluster overall is shown near the top of the page. +9. Now you need to obtain the Master Public DNS Name. After the cluster has +started this will be shown near the top of the Cluster Details page. In the +following instructions, we call this Master Public DNS name ``. + +Now you are ready to connect to your cluster and run Pig jobs. + +### Connecting to the master node from Linux or MacOS + +From a terminal, use the following command (replace `` +and `` with your values): + + ssh -o "ServerAliveInterval 10" -i hadoop@ + +### Connecting to the master node from Windows + +1. Start Pageant (the PuTTy key manager). +2. Find the Pageant icon in your System Tray, right-click, select Add Key. +3. Select the .ppk file you created earlier, Open, enter your pass-phrase. +4. Start PuTTy. +5. In the Host Name field, enter `hadoop@` + (substituting the Master Public DNS name for your cluster). +6. In the Port field, enter 22. +7. For Connection type, select SSH. +8. Click Open. + +### Starting the pig shell + +Once you connect successfully, just type - **$ pig** + pig - 14. Now you should have a Pig prompt: - +Now you should have a Pig prompt: - **grunt>** + grunt> - In this quiz we will use pig only interactively. (The alternative is to have pig read the program from a file.) This is the interactive mode where you type in pig queries. Here you will cut and paste `example.pig`. You are now ready to return to the quiz. - - + Other useful information: - * For the first job you run, Hadoop will create the output directory for +* Hadoop will create the output directory for you automatically. But Hadoop refuses to overwrite existing results. So you will need to move your prior results to a different directory before re-running your script, specify a different output directory in the script, or delete the prior results altogether. - -To see how to perform these tasks and more, see ["Managing the results of your Pig queries"](#managingresults "Link: #managingresults") below. - * To exit pig, type `quit` at the `grunt>` promt. To -terminate the ssh session, type `exit` at the unix prompt: after -that you must terminate the AWS cluster (see next). - * To kill a pig job type CTRL/C while pig is running.This kills pig only: +* To exit pig, type `quit` at the `grunt>` prompt. To +terminate the ssh session, type `exit` at the Linux shell prompt. After +that you must terminate the AWS cluster +(see ["Terminating an AWS cluster"](#terminating-an-aws-cluster)). +* To kill a pig job type CTRL/C while pig is running.This kills pig only: after that you need to kill the hadoop job. We show you how to do this below. - +* After a hadoop job completes, you will need to exit from pig and re-start +it before you can run another job. +* To see how to perform these tasks and more, see +["Managing the results of your Pig queries"](#managing-the-results-of-your-pig-queries) +below. ## Monitoring Hadoop jobs -### Easy Way: SSH Tunneling +The master node provides web pages for monitoring Hadoop jobs and HDFS use, +but they are only accessible locally on the master. You can view them using +a text browser while connected to the master via ssh, but to view them from +your local browser, you'll either need to set up an SSH tunnel or SOCKS proxy. +An SSH tunnel is likely the easiest option. -By far the easiest way to do this from linux or a mac is to use ssh tunneling. - 1. 
Run this command +The web pages are provided on two local ports on the master: +* Hadoop monitor: 8088 +* HDFS monitor: 50070 - ssh -L 9100:localhost:9100 -L 9101:localhost:9101 -i ~/.ssh/ hadoop@ +### Easy Way: SSH Tunneling - 2. Open your browser to [http://localhost:9100](http://localhost:9100 "Link: http://localhost:9100") +Note you'll need two free local ports to use for the local end of the +SSH tunnel. Ports 8081 and 8082 are often free and not blocked by the firewall. +These are used in the examples. If these aren't available, instructions are +shown for checking which ports are in use. + +#### SSH Tunneling from Linux or MacOS + +1. Check that ports 8081 and 8082 are open, or find two free local ports: + * On Linux, open a terminal and type `netstat -antu` to see which ports are + in use. + * On MacOS, follow the instructions + [here](https://support.apple.com/kb/PH18539). +2. Run this command (substitute your free local ports if they are not 8081 +and 8082). + ssh -L 8081:localhost:8088 -L 8082:localhost:50070 -i hadoop@ +3. Open your browser to: + * Hadoop monitor: [http://localhost:8081](http://localhost:8081) + * HDFS monitor: [http://localhost:8082](http://localhost:8082) From there, you can monitor your jobs' progress using the UI. -### Hard Way 1: Lynx -There are two other ways to do this: using [lynx](http://lynx.isc.org/) or using your own browser with a SOCKS proxy. +#### SSH Tunneling from Windows using PuTTy + +(This assumes you have started Pageant and added your keys, as shown in +"Connecting to the master node from Windows".) + +1. Check that ports 8081 and 8082 are open, or find two free local ports: + * Open a command prompt window. + * Enter: `netstat -an` + * The ports shown in the Local Address column are in use -- if 8081 or 8082 + are in use pick something that is not in use. +1. Start PuTTy. +1. Go to Connection -> SSH -> Tunnels +1. In the "Add new forwarded port" section, fill in: + * Source port: 8081 (or the free port you found) + * Destination: `:8088` (recall this is the Master Public DNS + for your cluster) + * Check that Local and Auto are selected. + * Click Add. +1. Now go to Session. +1. In the Host Name field, enter `hadoop@` +1. In the Port field, enter 22. +1. For Connection type, select SSH. +1. Click Open. +1. Accept the host key (if you haven't already done so). +1. You can minimize the window. +1. Repeat steps above starting with "Start PuTTy", except change the source and +destination ports: + * Source port: 8082 + * Destination: `:50070` +1. Open these URLs in your browser: + * Hadoop monitor: [http://localhost:8081](http://localhost:8081) + * HDFS monitor: [http://localhost:8082](http://localhost:8082) + +### Text browser: Lynx + +[Lynx](http://lynx.isc.org/) is a text browser -- it shows only the text from +web pages -- in a terminal. This option is very easy. Open a separate `ssh` +connection to the AWS master node and type: + + lynx http://localhost:8088/ + +Navigate as follows: +* up/down arrows = move through the links (current link is highlighted) +* enter = follows a link +* left arrow = return to previous page - Using LYNX. Very easy, you don't need to download anything. Open a separate `ssh` connection -to the AWS master node and type: - - -`% lynx http://localhost:9100/ ` - - -Lynx is a text browser. Navigate as follows: `up/down arrows `= -move through the links (current link is highlighted); `enter` = -follows a link; `left arrow` = return to previous page. - - Examine the webpage carefully, while your pig program is running. 
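If you prefer to stay on the command line, roughly the same status information is available from the Hadoop CLI on the master node. This is a hedged sketch using standard Hadoop 2.x commands (they should be on the master's path on EMR 4.x, but verify on your own cluster; `<application_id>` and `<job_id>` are placeholders taken from the list output):

    # List the YARN applications on the cluster; a running Pig script shows up
    # as one or more MapReduce applications.
    yarn application -list
    # Show the state and tracking URL of a single application.
    yarn application -status <application_id>
    # List MapReduce jobs, then show map/reduce completion for one of them.
    mapred job -list
    mapred job -status <job_id>

The web interface is still the easiest place to see the whole picture.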
You should find information about the map tasks, the reduce tasks, you should be able to drill down into each map task (for example to monitor its progress); you should be able to look at the log files of the map tasks (if there are runtime errors, you will see them only in these log files). -### Hard Way 2: Proxy +### SOCKS Proxy -Using SOCKS proxy, and your own browser. This requires more work, but the nicer interface makes it worth the extra work over using Lynx +Using SOCKS proxy, and your own browser. (This shows using local port 8888. +See the SSH Tunneling sections for how to check for unused local ports, +and substitute a different port if 8888 is in use.) - 1. Set up your browser to use a proxy when connecting to the master node. _Note: If the instructions fail for one browser, try the other browser_. +1. Set up your browser to use a proxy when connecting to the master node. + (Note: Instructions are for Firefox and Chrome. If the instructions fail for + one browser, try the other browser. In particular, it seems like people are having problems with Chrome but -Firefox, especially following Amazon's instructions, works well. - * Firefox: - 1. Install the [FoxyProxy extension](https://addons.mozilla.org/en-US/firefox/addon/2464/) for Firefox.li\> +Firefox, especially following Amazon's instructions, works well.) + * Firefox: + 1. Install the [FoxyProxy extension](https://addons.mozilla.org/en-US/firefox/addon/2464/) for Firefox. 2. Copy the `foxyproxy.xml` configuration file from the course materials repo into your [Firefox profile folder](http://support.mozilla.com/kb/profiles). 3. If the previous step doesn't work for you, try deleting the `foxyproxy.xml` you -copied into your profile, and using [Amazon's instructions](http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingtheHadoopUserInterface.html#AccessingtheHadoopUserInterfacetoMonitorJobStatus2) to set up FoxyProxy manually. +copied into your profile, and using +[Amazon's instructions](http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingtheHadoopUserInterface.html#AccessingtheHadoopUserInterfacetoMonitorJobStatus2) to set up FoxyProxy manually. If you use Amazon's instructions, be careful to use port 8888 instead of the port in the instructions. - * Chrome: - 1. Option 1: FoxyProxy is [now available for Chrome](http://getfoxyproxy.org/downloads.html) as + * Chrome: + * Option 1: FoxyProxy is [now available for Chrome](http://getfoxyproxy.org/downloads.html) as well. - 2. Option 2: You can try [proxy switch!](https://chrome.google.com/webstore/detail/caehdcpeofiiigpdhbabniblemipncjj "Link: https://chrome.google.com/webstore/detail/caehdcpeofiiigpdhbabniblemipncjj") - 3. Click the _Tools_ icon (upper right corner; don't confuse it with -the Developer's Tools !), Go to _Tools, _go to _Extensions_. + * Option 2: You can try [proxy switch!](https://chrome.google.com/webstore/detail/caehdcpeofiiigpdhbabniblemipncjj) + 1. Click the _Tools_ icon (upper right corner; don't confuse it with +the Developer's Tools !), Go to _Tools_, go to _Extensions_. Here you will see the ProxySwitch!: click on _Options_. - 4. Create a new Proxy Profile: Manual Configuration, Profile name = Amazon -Elastic MapReduce (any name you want), SOCKS Host = localhost, Port = 8888 -(you can choose any port you want; another favorite is 8157), + 2. Create a new Proxy Profile: Manual Configuration, Profile name = Amazon +Elastic MapReduce (any name you want), SOCKS Host = localhost, Port = 8888, SOCKS v5\. 
If you don't see "SOCKS", de-select the option to "Use the same proxy server for all protocols". - 5. Create two new switch rules (give them any names, say AWS1 and AWS2). -Rule 1: pattern=\*.amazonaws.com:\*/\*, Rule 2: pattern=\*.ec2.internal:\*/\*. -For both, Type=wildcard, Proxy profile=\[the profile you created at the + 3. Create two new switch rules (give them any names, say AWS1 and AWS2). + * Rule 1: pattern=\*.amazonaws.com:\*/\* + * Rule 2: pattern=\*.ec2.internal:\*/\* + + For both, Type=wildcard, Proxy profile=\[the profile you created at the previous step\]. - 2. Open a new local terminal window and create the SSH SOCKS tunnel to the + +2. Open a new local terminal window and create the SSH SOCKS tunnel to the master node using the following: - $ ssh -o "ServerAliveInterval 10"** **-i -ND 8888 hadoop@ + ssh -o "ServerAliveInterval 10" -i -ND 8888 hadoop@ -(The `-N` option + (The `-N` option tells `ssh` not to start a shell, and the `-D 8888` option tells `ssh` to start the proxy and have it listen on port 8888.) - - -The resulting SSH window will appear to hang, without any output; this + + The resulting SSH window will appear to hang, without any output; this is normal as SSH has not started a shell on the master node, but just created the tunnel over which proxied traffic will run. - - -Keep this window running in the background (minimize it) until you are + + Keep this window running in the background (minimize it) until you are finished with the proxy, then close the window to shut the proxy down. - 3. Open your browser, and type one of the following URLs: - * For the job tracker: `http://:9100/` - * For HDFS management: `http://:9101/` - -> The job tracker enables you to see what MapReduce jobs are executing in -> your cluster and the details on the number of maps and reduces that are -> running or already completed. -> -> Note that, at this point in the instructions, you will not see any MapReduce -> jobs running but you should see that your cluster has the capacity to run -> a couple of maps and reducers on your one instance. -> -> The HDFS manager gives you more low-level details about your cluster and -> all the log files for your jobs. +3. Open these URLs in your browser: + * Hadoop monitor: `http://:8088/` + * HDFS monitor: `http://:50070/` ## Killing a Hadoop Job @@ -240,50 +305,46 @@ long time to run. If you decide that you need to interrupt a job before it completes, here is the way to do it: If you want to kill pig, you first type CTRL/C, which kills pig only. -Next, kill the hadoop job, as follows. From the job tracker interface find +Next, kill the hadoop job, as follows. From the Hadoop monitor find the hadoop `job_id`, then type: -> `% hadoop job -kill job_id` -> +`hadoop job -kill job_id` -You do not need to kill any jobs at this point. - -However, you can now exit pig (just type "quit") and exit your ssh session. -You can also kill the SSH SOCKS tunnel to the master node. +Note this is not the normal way to exit from pig. If your MapReduce completes +successfully, all the hadoop jobs will exit. ## Terminating an AWS cluster When you are done running Pig scripts, make sure to **ALSO** terminate -your job flow. This is a step that you need to do **in addition to **stopping +your cluster. This is a step that you need to do **in addition to **stopping pig and Hadoop (if necessary) above. This step shuts down your AWS cluster: - 1. Go to the [Management Console.](https://console.aws.amazon.com/elasticmapreduce/home) - 2. Select the job in the list. - 3. 
Click the Terminate button (it should be right below "Your Elastic MapReduce -Job Flows"). - 4. Wait for a while (may take minutes) and recheck until the job state becomes -TERMINATED. +1. Go to the [EMR Management Console](https://console.aws.amazon.com/elasticmapreduce/home) +2. Select the job in the list. +3. Click the Terminate button. (If termination protection is on, you will be +prompted to turn it off before you can terminate the cluster.) +4. Wait for a while (may take minutes) and recheck until the job state becomes +TERMINATED. You may need to refresh the cluster details page. **Pay attention to this step**. If you fail to terminate -your job and only close the browser, or log off AWS, your AWS will continue +your job and only close the browser, or log off AWS, your cluster will continue to run, and AWS will continue to charge you: for hours, days, weeks, and -when your credit is exhausted, it chages your creditcard. Make sure you -don't leave the console until you have confirmation that the job is terminated. - -You can now shut down your cluster. +it charges this to your credit card automatically. Make sure you +don't leave the console until you have confirmation that the cluster is terminated. -## +Since this step releases the master node, any connections to it will end. +So you can close any ssh connections. ## Checking your Balance Please check your balance regularly!!! - 1. Go to the [Management Console.](https://console.aws.amazon.com/elasticmapreduce/home) - 2. Click on your name in the top right corner and select "Account Activity". - 3. Now click on "detail" to see any charges < $1\. +1. Go to the [EMR Management Console](https://console.aws.amazon.com/elasticmapreduce/home) +2. Click on your name in the top right corner and select "Account Activity". +3. Now click on "detail" to see any charges < $1\. -To avoid unnecessary charges, terminate your job flows when you are not -using them. +To avoid unnecessary charges, terminate your cluster when you are not +using it. **USEFUL**: AWS customers can now use **billing alerts** to help monitor the charges on their AWS bill. You can get started today by @@ -294,70 +355,45 @@ reach the threshold. ## Managing the results of your Pig queries -For the next step, you need to restart a new cluster as follows. Hopefully, -it should now go very quickly: - * Start a new cluster with one instance. - * Start a new interactive Pig session (through grunt) - * Start a new SSH SOCKS tunnel to the master node (if you are using your -own browser) - -We will now get into more details about running Pig scripts. +Your pig program stores the results in several files in a directory -- there +will be one file for each reducer. You +have two options: +* (1) Store these files in the Hadoop File System (HDFS). +If you use HDFS, the files will be discarded when your cluster is shut down. +* (2) Store these files in S3\. +If you use S3, the files will persist (and you'll be charged for storage) +until you delete them. -Your pig program stores the results in several files in a directory. You -have two options: (1) store these files in the Hadoop File System, or (2) -store these files in S3\. In both cases you need to copy them to your local -machine. - -### 1\. 
Storing Files in the Hadoop File System +### (1) Storing Files in the Hadoop File System This is done through the following pig command (used in `example.pig`): - store count_by_object_ordered into '/user/hadoop/example-results' using PigStorage(); + store count_by_object_ordered into '/user/hadoop/example-results' using PigStorage(); -Before you run the pig query, you need to (A) create the /user/hadoop -directory. After you run the query you need to (B) copy this directory -to the local directory of the AWS master node, then (C) copy this directory -from the AWS master node to your local machine. +Pig will not run any commands until it has to, so it will not start the +MapReduce job until you run the `store` command. -#### 1.A. Create the "/user/hadoop Directory" in the Hadoop Filesystem +After the MapReduce job completes, you will need to copy this directory +to the local directory of the AWS master node, then, if you want, copy this +from the AWS master node to your local machine. -You will need to do this for each new job flow that you create. +#### Check that the /user/hadoop directory is present -To create a `/user/hadoop` directory on the AWS cluster's HDFS -file system run this from the AWS master node: - - % hadoop dfs -mkdir /user/hadoop - +Hadoop reducers write their output to HDFS. In the new release of EMR, an +HDFS directory called `/user/hadoop` should be created automatically. +Check that the directory exists by listing it with this command: -Check that the directory was created by listing it with this command: + hadoop fs -ls /user/hadoop - % hadoop dfs -ls /user/hadoop +If there is an error, create a `/user/hadoop` directory with this command: - -You may see some output from either command, but you should not see any -errors. + hadoop fs -mkdir /user/hadoop You can also do this directly from grunt with the following command. - grunt> fs -mkdir /user/hadoop - -Now you are ready to run your first sample program. Take a look at the -starter code that we provided in the course materials repo. Copy and paste -the content of `example.pig.` - -**Note**: The program may appear to hang with a 0% completion -time... go check the job tracker. Scroll down. You should see a MapReduce -job running with some non-zero progress. - -**Note 2**: Once the first MapReduce job gets to 100%... -if your grunt terminal still appears to be suspended... go back to the -job tracker and make sure that **the reduce phase is also 100% complete**. -It can take some time for the reducers to start making any progress. - -**Note 3**: The example generates more than 1 MapReduce job... -so be patient. + fs -mkdir /user/hadoop -#### 1.B. Copying files from the Hadoop Filesystem +#### Copying files from the Hadoop Filesystem The result of a pig script is stored in the hadoop directory specified by the `store` command. That is, for `example.pig`, @@ -367,7 +403,7 @@ as specified in the script. HDFS is separate from the master node's file system, so before you can copy this to your local machine, you must copy the directory from HDFS to the master node's Linux file system: - % hadoop dfs -copyToLocal /user/hadoop/example-results example-results + hadoop fs -copyToLocal /user/hadoop/example-results example-results This will create a directory `example-results` with `part-*` files in it, which you can copy to your local machine with `scp`. You @@ -376,60 +412,79 @@ file, perhaps sorting the results if you like. 
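For instance, here is a sketch you could run on the master node (or on your own machine after copying the directory down with `scp`). It assumes the default tab-delimited PigStorage output; adjust the sort key to whichever column holds the count, and note that `example.pig` already orders its output, so the sort is optional.

    # Peek at a few result tuples directly in HDFS.
    hadoop fs -cat /user/hadoop/example-results/part-* | head
    # Concatenate the local copy of the part files into one file,
    # sorting numerically (descending) on the second tab-separated field.
    cat example-results/part-* | sort -t$'\t' -k2,2nr > example-results.tsv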
An easier option may be to use - % hadoop fs -getmerge /user/hadoop/example-results example-results + hadoop fs -getmerge /user/hadoop/example-results example-results This command takes a source directory and a destination file as input and concatenates files in src into the destination local file. - -Use `hadoop dfs -help` or see the [`hadoop dfs` guide](http://hadoop.apache.org/docs/stable/file_system_shell.html) -to -learn how to manipulate HDFS. (Note that `hadoop fs` is the same -as `hadoop dfs`.) - +Use `hadoop fs -help` or see the +[`hadoop fs` guide](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html) +to learn how to manipulate HDFS. -#### 1.C. Copying files to or from the AWS master node - * To copy one file from the master node back to your computer, run this +#### Copying files to or from the AWS master node, using Linux or MacOS + +To copy one file from the master node back to your computer, run this command _on the local computer:_ - - - $ scp -o "ServerAliveInterval 10" -i hadoop@: . - + scp -o "ServerAliveInterval 10" -i hadoop@: . where `` can be absolute or relative to the AWS master node's home folder. The file should be copied onto your current directory ('.') on your local computer. - - * Better: copy an entire directory, recursively. Suppose your files are +Better: copy an entire directory, recursively. Suppose your files are in the directory `example-results`. They type the following _on your loal computer_: - $ scp -o "ServerAliveInterval 10" -i -r hadoop@:example-results . + scp -o "ServerAliveInterval 10" -i -r hadoop@:example-results . - * As an alternative, you may run the scp command on the AWS master node, +As an alternative, you may run the scp command on the AWS master node, and connect to your local machine. For that, you need to know your local machine's domain name, or IP address, and your local machine needs to accept ssh connections. -### 2\. Storing Files in S3 +#### Copying files to or from the master node, using Windows + +The simplest method is to use an application designed for this, such as +[WinSCP](http://winscp.net/). This works with PuTTy's Pageant, so it can use +your AWS EC2 ssh keys after you start Pageant. + +### (2) Storing Files in S3 To use this approach, go to your AWS Management Console, click on Create -Bucket, and create a new bucket (=directory). Give it a name that may be -a public name. Do not use any special chatacters, including underscore. +Bucket, and create a new bucket (= directory). Give it a name that may be +a public name. Do not use any special characters, including underscore. Let's say you call it` superman`. Click on Actions, Properties, Permissions. Make sure you have all the permissions. Modify the store command of `example.pig` to: - store count_by_object_ordered into 's3n://superman/example-results'; + store count_by_object_ordered into 's3n://superman/example-results'; -Run your pig program. When it terminates, then in your S3 console you -should see the new directory `example-results`. Click on individual +After your pig program completes, you should see, in your +[S3 console](https://console.aws.amazon.com/s3/home), +the new directory `example-results`. Click on individual files to download. The number of files depends on the number of reduce tasks, and may vary from one to a few dozens. The only disadvantage of using S3 is that you have to click on each file separately to download. -Note that S3 is permanent storage, and you are charged for it. 
You can -safely store all your query answers for several weeks without exceeding -your credit; at some point in the future remember to delete them. +Note that S3 is permanent storage, and you are charged for it. + +## Run `example.pig` + +Now you are ready to run your first sample program. Take a look at the +starter code that we provided in the course materials repo. Copy and paste +the content of `example.pig.` + +Note: +* The program may appear to hang with a 0% completion +time. Go check the Hadoop monitor. You should see a task +running with some non-zero progress. +* Once the first task gets to 100%, +if your grunt terminal still appears to be suspended, go back to the +Hadoop monitor and make sure that *the reduce phase is also 100% complete*. +It can take some time for the reducers to start making any progress. +(There is also a progress bar on the cluster details web page.) +* The example generates more than 1 MapReduce job so be patient. + +As described earlier, monitor your job as it runs. +When it's done, copy your results and *terminate your cluster*. \ No newline at end of file From c337e27f3659e70f6cc58f9b39e3e14317e77816 Mon Sep 17 00:00:00 2001 From: Pat Tressel Date: Sat, 6 Feb 2016 12:59:38 -0800 Subject: [PATCH 2/2] Memory tuning, turn off logging, etc. --- assignment4/awsinstructions.md | 82 +++++++++++++++++++++++++++++----- 1 file changed, 72 insertions(+), 10 deletions(-) diff --git a/assignment4/awsinstructions.md b/assignment4/awsinstructions.md index 8d3b9507..16d78f6e 100644 --- a/assignment4/awsinstructions.md +++ b/assignment4/awsinstructions.md @@ -72,28 +72,39 @@ set memory options. 1. Go to [http://console.aws.amazon.com/elasticmapreduce/home](http://console.aws.amazon.com/elasticmapreduce/home) and sign in. -2. Click "Create Cluster". -3. Under General Configuration: +1. On the top menu bar, at the right, select the US West (Oregon) region -- this +is where the dataset is, so reads will go faster if the cluster is located in +the same datacenter. +1. Click "Create Cluster". +1. Under General Configuration: * In the "Cluster Name" field, you can enter a name to identify the purpose of the cluster. + * Un-check Logging, unless you are certain that you want it. + (This will write a log to S3, which may exceed your S3 "put" quota, and + in any case, you will be charged for S3 usage. Log messages are also + written to the terminal connected to the master node.) * For Launch mode, Cluster should be selected (this is the default). -4. Under Software Configuration: +1. Under Software Configuration: * Select "Core Hadoop". -5. Under Hardware Configuration: +1. Under Hardware Configuration: * Select the instance type. For most parts of this quiz, c1.medium will be fine. For the last quiz question, a larger instance size like m2.xlarge or m3.xlarge may be appropriate. * For number of instances, select 1 for now. For the last quiz question, you can select up to 20. -6. Security and access: +1. Security and access: * Select the name of the key pair you created earlier. -7. When you're ready, click Create cluster. -8. This will open the Cluster Details page. You can see the requested instances +1. When you're ready, click Create cluster. +1. This will open the Cluster Details page. You can see the requested instances being acquired and provisioned toward the right side of the form. The state of the cluster overall is shown near the top of the page. -9. Now you need to obtain the Master Public DNS Name. After the cluster has +1. 
Now you need to obtain the Master Public DNS Name. After the cluster has started this will be shown near the top of the Cluster Details page. In the following instructions, we call this Master Public DNS name ``. +1. Wait until the master node, at least, has finished booting before +connecting, and wait until all nodes have finished booting before running your +pig program. On the cluster details page, under "Network and Hardware", you can +watch the progress of the master and other nodes being set up. Now you are ready to connect to your cluster and run Pig jobs. @@ -469,7 +480,58 @@ using S3 is that you have to click on each file separately to download. Note that S3 is permanent storage, and you are charged for it. -## Run `example.pig` +## Addressing memory problems + +If you encounter out-of-memory errors, such as a "Java heap space" error, +you may need to adjust memory settings, choose machines with more memory, +or use more machines. + +You can control, for instance, how many tasks are allowed to run +simultaneously on each machine, how much memory is given to each task, and, +within that, to the Java Virtual Machine (JVM). The tasks cannot use all the +physical memory on the machine -- there must still be room for other required +processes. + +General memory tuning advice can be found here (how to specify the +parameters is at the very end of the page): +* http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/MemoryTuning.html + +Tuning parameters: +* http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-hadoop-task-config.html + +Physical memory for EC2 machine types: +* http://aws.amazon.com/ec2/instance-types/ +* http://aws.amazon.com/ec2/previous-generation/ + +More specific advice (though dated), showing calculations for parameters: +* http://stackoverflow.com/questions/28742328/how-to-set-the-number-of-parallel-reducers-on-emr +* http://stackoverflow.com/questions/33869593/aws-emr-there-is-insufficient-memory-for-the-java-runtime/33966000 + +Pig tuning advice: +* https://pig.apache.org/docs/r0.15.0/perf.html + +To set memory parameters, on the create cluster form, select "advanced options". +You can choose equivalent settings as in the quick form, except on the software +configuration, form, put your memory settings in the "Edit software settings" +box. Here is an example of memory settings appropriate for a machine with 15GiB +of memory: + +``` +[ + { + "Classification": "mapred-site", + "Properties": { + "mapreduce.map.java.opts": "-Xmx2048m", + "mapreduce.reduce.java.opts": "-Xmx2048m", + "mapreduce.job.reuse.jvm.num.tasks": "1", + "mapreduce.map.memory.mb": "2560", + "mapreduce.reduce.memory.mb": "2560" + } + } +] +``` + +## Run example.pig Now you are ready to run your first sample program. Take a look at the starter code that we provided in the course materials repo. Copy and paste @@ -487,4 +549,4 @@ It can take some time for the reducers to start making any progress. * The example generates more than 1 MapReduce job so be patient. As described earlier, monitor your job as it runs. -When it's done, copy your results and *terminate your cluster*. \ No newline at end of file +When it's done, copy your results and _**terminate your cluster**_. \ No newline at end of file
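As a final recap, here is a compact sketch of a complete `example.pig` session from a Linux or MacOS machine, using only commands shown earlier. The bracketed names are placeholders: substitute your own .pem key file and your cluster's Master Public DNS name.

    # 1. Connect to the master node.
    ssh -o "ServerAliveInterval 10" -i <your-key-file.pem> hadoop@<master-public-dns-name>

    # 2. On the master: start pig, paste the contents of example.pig at the grunt>
    #    prompt, wait for all the MapReduce jobs to finish, then type quit.
    pig

    # 3. Still on the master: merge the HDFS result directory into one local file.
    hadoop fs -getmerge /user/hadoop/example-results example-results

    # 4. Back on your own machine: copy the merged file down, then terminate the
    #    cluster from the EMR console as described above.
    scp -o "ServerAliveInterval 10" -i <your-key-file.pem> hadoop@<master-public-dns-name>:example-results .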