From 431024373e967bb18973bc4413be8a08bf1d9291 Mon Sep 17 00:00:00 2001
From: Vittoria Rezzonico
Date: Tue, 3 May 2016 09:35:25 +0200
Subject: [PATCH 1/2] AWS instructions updated

---
 assignment4/awsinstructions.md | 166 ++++++++++-----------------------
 1 file changed, 48 insertions(+), 118 deletions(-)

diff --git a/assignment4/awsinstructions.md b/assignment4/awsinstructions.md
index 3df3fccd..6ecb9dd2 100644
--- a/assignment4/awsinstructions.md
+++ b/assignment4/awsinstructions.md
@@ -24,8 +24,7 @@ Your Account".
 
 ## Setting up an EC2 key pair
 
-Note: Some students were having problem running job flows because of no
-active key found, go to [AWS security credentials page](https://portal.aws.amazon.com/gp/aws/securityCredentials "Link: https://portal.aws.amazon.com/gp/aws/securityCredentials") and
+Go to the [AWS security credentials page](https://portal.aws.amazon.com/gp/aws/securityCredentials "Link: https://portal.aws.amazon.com/gp/aws/securityCredentials") and
 make sure that you see a key under the access key, if not just click
 Create a new Access Key.
@@ -59,52 +58,20 @@
 
 To run a Pig job on AWS, you need to start up an AWS cluster using the
 [Web Management Console](https://console.aws.amazon.com/elasticmapreduce/home "Link: https://console.aws.amazon.com/elasticmapreduce/home") and connect to the Hadoop master node. Follow
-the steps below. You may also find [Amazon's interactive Pig tutorial](http://aws.amazon.com/articles/2729 "Link: http://aws.amazon.com/articles/2729") useful, but note that
-the screenshots are slightly out of date.To set up and connect to a
-pig cluster, perform the following steps:
+the steps below.
+To set up and connect to a Pig cluster, perform the following steps:
 
  1. Go to [http://console.aws.amazon.com/elasticmapreduce/home](http://console.aws.amazon.com/elasticmapreduce/home "Link: http://console.aws.amazon.com/elasticmapreduce/home") and sign in.
  2. Click the "Amazon Elastic MapReduce" tab.
- 3. Click the "Create New Job Flow" button.
- 4. In the "Job Flow Name" field type a name such as "Pig Interactive Job
-Flow".
- 5. Select "Pig Program" from the drop down box, and then click "Continue".
-Also select: "Run your own application".
- 6. Select the "Start an Interactive Pig Session" radio button and click "Continue".
- 7. On the next page, select only **1 small core instance**. In the
-last question of the quiz you will need to set your cluster to have 20 small nodes, rather than the 1 node.
- 8. On the next page, make sure that the EC2 Key Pair that is selected is
-the one you created above
- 9. On the last page, you will be asked if you want to configure _Bootstrap Actions_.
-You do, because the default configuration can sometimes run into memory
-problems. Select "Configure your Bootstrap Actions." Then, under "Action
-Type," select "Memory Intensive Configuration."
- 10. When you are done setting up your workflow and you come back to your management
-console, you may need to refresh the page to see your workflow. It may
-take a few minutes for your job flow to launch. If your cluster fails or
-takes an extraordinarily long time, Amazon may be near capacity. Try again
-later.
-
-
- 11. Now you need to obtain the Master Public DNS Name. You get this by clicking
-(highlighting) your job flow, which creates a frame at the bottom of your
-window. Scroll down in that frame and you will find the Master Public DNS
-at the bottom. We call this Master Public DNS name .
-
- 12. Now you are ready to connect to your cluster and run Pig jobs. From a
-terminal, use the following command:
-
-
-`$ ssh -o "ServerAliveInterval 10" -i hadoop@
-`
-
-
- 13. Once you connect successfully, just type
+ 3. Click on "Create cluster" in the EMR menu. Stay with the quick options and, under Applications, select "Core Hadoop", which should have everything you need. You can disable logging. Previous-generation instance types such as c1.medium or m1.medium are acceptable for this exercise. Make sure you select your SSH key pair before clicking "Create cluster", otherwise you won't be able to SSH into your cluster.
+ 4. Go to https://console.aws.amazon.com/ec2/v2/, then Network & Security > Security Groups, and add an inbound rule that allows SSH from anywhere to the security group containing the master node.
+ 5. In the cluster details, next to "Master public DNS", click on SSH to see the command to run from a terminal to connect to the cluster.
+ 6. Once you connect successfully, just type
 
 **$ pig**
 
- 14. Now you should have a Pig prompt:
+ 7. Now you should have a Pig prompt:
 
 **grunt>**
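+
+For example, once the cluster is up, a first session might look like the
+minimal sketch below (the key file name and the master public DNS are
+placeholders for your own values):
+
+```
+$ ssh -o "ServerAliveInterval 10" -i ~/mykey.pem hadoop@<master-public-DNS>
+$ pig
+grunt> fs -ls /
+grunt> quit
+```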
@@ -136,89 +103,52 @@ below.
 
 ## Monitoring Hadoop jobs
 
-### Easy Way: SSH Tunneling
+These instructions are also available in the AWS console, on the cluster details page (follow the link next to *Connections*, under *Enable Web Connection*).
 
-By far the easiest way to do this from linux or a mac is to use ssh tunneling.
- 1. Run this command
+In summary:
 
-    ssh -L 9100:localhost:9100 -L 9101:localhost:9101 -i ~/.ssh/ hadoop@
+### Open an SSH Tunnel to the Amazon EMR Master Node
 
- 2. Open your browser to [http://localhost:9100](http://localhost:9100 "Link: http://localhost:9100")
-
-From there, you can monitor your jobs' progress using the UI.
-
-### Hard Way 1: Lynx
-There are two other ways to do this: using [lynx](http://lynx.isc.org/) or using your own browser with a SOCKS proxy.
-
- Using LYNX. Very easy, you don't need to download anything. Open a separate `ssh` connection
-to the AWS master node and type:
-
-`% lynx http://localhost:9100/ `
-
-Lynx is a text browser. Navigate as follows: `up/down arrows `=
-move through the links (current link is highlighted); `enter` =
-follows a link; `left arrow` = return to previous page.
-
-Examine the webpage carefully, while your pig program is running. You
-should find information about the map tasks, the reduce tasks, you should
-be able to drill down into each map task (for example to monitor its progress);
-you should be able to look at the log files of the map tasks (if there
-are runtime errors, you will see them only in these log files).
-
-### Hard Way 2: Proxy
-
-Using SOCKS proxy, and your own browser. This requires more work, but the nicer interface makes it worth the extra work over using Lynx
-
- 1. Set up your browser to use a proxy when connecting to the master node. _Note: If the instructions fail for one browser, try the other browser_.
-In particular, it seems like people are having problems with Chrome but
-Firefox, especially following Amazon's instructions, works well.
- * Firefox:
- 1. Install the [FoxyProxy extension](https://addons.mozilla.org/en-US/firefox/addon/2464/) for Firefox.li\>
- 2. Copy the `foxyproxy.xml` configuration file from the course materials
-repo into your [Firefox profile folder](http://support.mozilla.com/kb/profiles).
- 3. If the previous step doesn't work for you, try deleting the `foxyproxy.xml` you
-copied into your profile, and using [Amazon's instructions](http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingtheHadoopUserInterface.html#AccessingtheHadoopUserInterfacetoMonitorJobStatus2) to set up FoxyProxy manually.
-If you use Amazon's instructions, be careful to use port 8888 instead of
-the port in the instructions.
- * Chrome:
- 1. Option 1: FoxyProxy is [now available for Chrome](http://getfoxyproxy.org/downloads.html) as
-well.
- 2. Option 2: You can try [proxy switch!](https://chrome.google.com/webstore/detail/caehdcpeofiiigpdhbabniblemipncjj "Link: https://chrome.google.com/webstore/detail/caehdcpeofiiigpdhbabniblemipncjj")
- 3. Click the _Tools_ icon (upper right corner; don't confuse it with
-the Developer's Tools !), Go to _Tools, _go to _Extensions_.
-Here you will see the ProxySwitch!: click on _Options_.
- 4. Create a new Proxy Profile: Manual Configuration, Profile name = Amazon
-Elastic MapReduce (any name you want), SOCKS Host = localhost, Port = 8888
-(you can choose any port you want; another favorite is 8157),
-SOCKS v5\. If you don't see "SOCKS", de-select the option to "Use the same
-proxy server for all protocols".
- 5. Create two new switch rules (give them any names, say AWS1 and AWS2).
-Rule 1: pattern=\*.amazonaws.com:\*/\*, Rule 2: pattern=\*.ec2.internal:\*/\*.
-For both, Type=wildcard, Proxy profile=\[the profile you created at the
-previous step\].
- 2. Open a new local terminal window and create the SSH SOCKS tunnel to the
-master node using the following:
-
-    $ ssh -o "ServerAliveInterval 10"** **-i -ND 8888 hadoop@
+From the terminal, type
+
+> `ssh -i <your-key-file> -ND 8157 hadoop@<master-public-DNS>`
 
-(The `-N` option
-tells `ssh` not to start a shell, and the `-D 8888` option
-tells `ssh` to start the proxy and have it listen on port 8888.)
-
-
-The resulting SSH window will appear to hang, without any output; this
+The SSH window will appear to hang, without any output; this
 is normal as SSH has not started a shell on the master node, but just
 created the tunnel over which proxied traffic will run.
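+
+The `-N` option tells `ssh` not to start a shell, and `-D 8157` tells it to
+start a SOCKS proxy listening on local port 8157. As a quick sanity check,
+the sketch below fetches a page through the tunnel from a second terminal;
+the key file, the DNS name, and the ResourceManager port 8088 are
+assumptions to adapt to your cluster:
+
+```
+# Terminal 1: open the tunnel and leave it running while you browse
+ssh -i ~/mykey.pem -N -D 8157 hadoop@<master-public-DNS>
+
+# Terminal 2: request the YARN ResourceManager UI through the SOCKS proxy
+curl --socks5-hostname localhost:8157 http://<master-public-DNS>:8088/
+```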
-
-
-Keep this window running in the background (minimize it) until you are
-finished with the proxy, then close the window to shut the proxy down.
- 3. Open your browser, and type one of the following URLs:
- * For the job tracker: `http://:9100/`
- * For HDFS management: `http://:9101/`
+
+### Install and configure FoxyProxy
+
+1. Download and install FoxyProxy.
+2. Using a text editor, create a file named `foxyproxy-settings.xml` containing the following (a minimal configuration along the lines of the example in the AWS documentation; adjust the port if you chose a different one):
+
+```
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Route browser traffic for EMR hosts through the SOCKS tunnel
+     listening on localhost:8157 -->
+<foxyproxy>
+   <proxies>
+      <proxy name="emr-socks-proxy" enabled="true" mode="manual" lastresort="false" proxyDNS="true">
+         <matches>
+            <match enabled="true" name="emr-console" pattern="*ec2*.amazonaws.com*" isRegEx="false" isBlackList="false" caseSensitive="false" />
+            <match enabled="true" name="emr-internal" pattern="*ec2.internal*" isRegEx="false" isBlackList="false" caseSensitive="false" />
+            <match enabled="true" name="emr-compute" pattern="*.compute.internal*" isRegEx="false" isBlackList="false" caseSensitive="false" />
+         </matches>
+         <manualconf host="localhost" port="8157" socksversion="5" isSocks="true" />
+      </proxy>
+   </proxies>
+</foxyproxy>
+```
+
+3. Click the FoxyProxy icon in the toolbar and select Options. Click File > Import Settings. Browse to foxyproxy-settings.xml, select it, and click Open. To confirm that you wish to overwrite current settings with those stored in the new file, click Yes.
+4. Beside the FoxyProxy Standard add-on, click Options. In the FoxyProxy Standard dialog, for Select Mode, choose Use proxies based on their predefined patterns and priorities.
+
+You can now connect to your instance from your browser by typing its address in the Firefox/Chrome URL bar. The list of ports is available ["Here"](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-web-interfaces.html).
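+
+For example, with the tunnel open and FoxyProxy active, the following
+addresses should work (the ports shown are the usual ones for recent EMR
+releases; verify them against the list linked above):
+
+```
+http://<master-public-DNS>:8088/     # YARN ResourceManager (job tracking)
+http://<master-public-DNS>:50070/    # HDFS NameNode status
+```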
 
 > The job tracker enables you to see what MapReduce jobs are executing in
 > your cluster and the details on the number of maps and reduces that are

From 0ac361ccd041f7d23f77b2cb70bb1a9ea3294dba Mon Sep 17 00:00:00 2001
From: Vittoria Rezzonico
Date: Tue, 3 May 2016 09:36:53 +0200
Subject: [PATCH 2/2] AWS instructions updated

---
 assignment4/awsinstructions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/assignment4/awsinstructions.md b/assignment4/awsinstructions.md
index 6ecb9dd2..c62b4592 100644
--- a/assignment4/awsinstructions.md
+++ b/assignment4/awsinstructions.md
@@ -146,7 +146,7 @@ the tunnel over which proxied traffic will run.
 3. Click the FoxyProxy icon in the toolbar and select Options. Click File > Import Settings. Browse to foxyproxy-settings.xml, select it, and click Open. To confirm that you wish to overwrite current settings with those stored in the new file, click Yes.
 4. Beside the FoxyProxy Standard add-on, click Options. In the FoxyProxy Standard dialog, for Select Mode, choose Use proxies based on their predefined patterns and priorities.
 
-You can now connect to your instance from your browser by typing its address in the Firefox/Chrome URL bar. The list of ports is available ["Here"](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-web-interfaces.html).
+You can now connect to your instance from your browser by typing its address in the Firefox/Chrome URL bar. The list of ports is available [here](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-web-interfaces.html).