Commit dbbcb53

Memory tuning, turn off logging, etc.
1 parent b018076 commit dbbcb53

1 file changed

+72
-10
lines changed

assignment4/awsinstructions.md

Lines changed: 72 additions & 10 deletions
@@ -72,28 +72,39 @@ set memory options.
7272
1. Go to
7373
[http://console.aws.amazon.com/elasticmapreduce/home](http://console.aws.amazon.com/elasticmapreduce/home)
7474
and sign in.
75-
2. Click "Create Cluster".
76-
3. Under General Configuration:
75+
1. On the top menu bar, at the right, select the US West (Oregon) region -- this
76+
is where the dataset is, so reads will go faster if the cluster is located in
77+
the same datacenter.
78+
1. Click "Create Cluster".
79+
1. Under General Configuration:
7780
* In the "Cluster Name" field, you can enter a name to identify the purpose
7881
of the cluster.
82+
* Un-check Logging, unless you are certain that you want it.
83+
(This will write a log to S3, which may exceed your S3 "put" quota, and
84+
in any case, you will be charged for S3 usage. Log messages are also
85+
written to the terminal connected to the master node.)
7986
* For Launch mode, Cluster should be selected (this is the default).
80-
4. Under Software Configuration:
87+
1. Under Software Configuration:
8188
* Select "Core Hadoop".
82-
5. Under Hardware Configuration:
89+
1. Under Hardware Configuration:
8390
* Select the instance type. For most parts of this quiz, c1.medium will be
8491
fine. For the last quiz question, a larger instance size like m2.xlarge
8592
or m3.xlarge may be appropriate.
8693
* For number of instances, select 1 for now. For the last quiz question,
8794
you can select up to 20.
88-
6. Security and access:
95+
1. Security and access:
8996
* Select the name of the key pair you created earlier.
90-
7. When you're ready, click Create cluster.
91-
8. This will open the Cluster Details page. You can see the requested instances
97+
1. When you're ready, click Create cluster.
98+
1. This will open the Cluster Details page. You can see the requested instances
9299
being acquired and provisioned toward the right side of the form. The state of
93100
the cluster overall is shown near the top of the page.
94-
9. Now you need to obtain the Master Public DNS Name. After the cluster has
101+
1. Now you need to obtain the Master Public DNS Name. After the cluster has
95102
started this will be shown near the top of the Cluster Details page. In the
96103
following instructions, we call this Master Public DNS name `<master DNS>`.
104+
1. Wait until the master node, at least, has finished booting before
105+
connecting, and wait until all nodes have finished booting before running your
106+
pig program. On the cluster details page, under "Network and Hardware", you can
107+
watch the progress of the master and other nodes being set up.
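
The wait described in the last step can also be checked from a script. The following is a minimal Python sketch and is not part of the original instructions. It assumes a response dictionary shaped like the one returned by boto3's EMR `describe_cluster` call; the helper names `cluster_is_ready` and `master_dns` are hypothetical, and treating the `RUNNING` and `WAITING` states as "ready" is an assumption.

```python
# Cluster states that this sketch treats as "finished provisioning" (assumption).
READY_STATES = {"RUNNING", "WAITING"}


def cluster_is_ready(response):
    """Return True if a describe_cluster-style response reports a ready cluster."""
    state = response["Cluster"]["Status"]["State"]
    return state in READY_STATES


def master_dns(response):
    """Return the Master Public DNS name, or None if it has not been assigned yet."""
    return response["Cluster"].get("MasterPublicDnsName")


if __name__ == "__main__":
    # Hand-written sample response for illustration only; the DNS name is made up.
    sample = {
        "Cluster": {
            "Status": {"State": "WAITING"},
            "MasterPublicDnsName": "ec2-0-0-0-0.us-west-2.compute.amazonaws.com",
        }
    }
    print(cluster_is_ready(sample), master_dns(sample))
```

With boto3 installed and credentials configured, a real response could likely be obtained with `boto3.client("emr", region_name="us-west-2").describe_cluster(ClusterId=...)`, passing your own cluster ID. This only reports cluster-level state; the console view under "Network and Hardware" remains the way to watch individual nodes.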
97108

98109
Now you are ready to connect to your cluster and run Pig jobs.
99110

@@ -469,7 +480,58 @@ using S3 is that you have to click on each file separately to download.
469480

470481
Note that S3 is permanent storage, and you are charged for it.
471482

472-
## Run `example.pig`
483+
## Addressing memory problems
484+
485+
If you encounter out-of-memory errors, such as a "Java heap space" error,
486+
you may need to adjust memory settings, choose machines with more memory,
487+
or use more machines.
488+
489+
You can control, for instance, how many tasks are allowed to run
490+
simultaneously on each machine, how much memory is given to each task, and,
491+
within that, to the Java Virtual Machine (JVM). The tasks cannot use all the
492+
physical memory on the machine -- there must still be room for other required
493+
processes.
494+
495+
General memory tuning advice can be found here (how to specify the
496+
parameters is at the very end of the page):
497+
* http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/MemoryTuning.html
498+
499+
Tuning parameters:
500+
* http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-hadoop-task-config.html
501+
502+
Physical memory for EC2 machine types:
503+
* http://aws.amazon.com/ec2/instance-types/
504+
* http://aws.amazon.com/ec2/previous-generation/
505+
506+
More specific advice (though dated), showing calculations for parameters:
507+
* http://stackoverflow.com/questions/28742328/how-to-set-the-number-of-parallel-reducers-on-emr
508+
* http://stackoverflow.com/questions/33869593/aws-emr-there-is-insufficient-memory-for-the-java-runtime/33966000
509+
510+
Pig tuning advice:
511+
* https://pig.apache.org/docs/r0.15.0/perf.html
512+
513+
To set memory parameters, on the create cluster form, select "advanced options".
514+
You can choose the same settings as in the quick form, except that on the software
515+
configuration form, put your memory settings in the "Edit software settings"
516+
box. Here is an example of memory settings appropriate for a machine with 15GiB
517+
of memory:
518+
519+
```
520+
[
521+
{
522+
"Classification": "mapred-site",
523+
"Properties": {
524+
"mapreduce.map.java.opts": "-Xmx2048m",
525+
"mapreduce.reduce.java.opts": "-Xmx2048m",
526+
"mapreduce.job.reuse.jvm.num.tasks": "1",
527+
"mapreduce.map.memory.mb": "2560",
528+
"mapreduce.reduce.memory.mb": "2560"
529+
}
530+
}
531+
]
532+
```
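
Before pasting settings like these into the form, it can help to sanity-check the arithmetic. The sketch below is illustrative Python, not part of the original instructions. It checks that each JVM heap (`-Xmx`) is smaller than the memory given to its task, and estimates how many tasks of that size fit on one node. The helper names are hypothetical, and the 12288 MiB figure for memory available to YARN on a 15 GiB machine is an assumption; check the actual value for your instance type.

```python
import json
import re

# The example settings from above, as a JSON string.
SETTINGS = """
[
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapreduce.map.java.opts": "-Xmx2048m",
      "mapreduce.reduce.java.opts": "-Xmx2048m",
      "mapreduce.job.reuse.jvm.num.tasks": "1",
      "mapreduce.map.memory.mb": "2560",
      "mapreduce.reduce.memory.mb": "2560"
    }
  }
]
"""


def parse_xmx_mb(java_opts):
    """Return the -Xmx heap size in MiB from a JVM options string, or None."""
    match = re.search(r"-Xmx(\d+)([mMgG])", java_opts)
    if match is None:
        return None
    size = int(match.group(1))
    return size * 1024 if match.group(2) in "gG" else size


def containers_per_node(yarn_memory_mb, container_mb):
    """Estimate how many tasks of container_mb fit in the memory YARN may use."""
    return yarn_memory_mb // container_mb


def check(settings_json):
    """Map each task kind to (heap MiB, task MiB, heap fits inside task)."""
    props = json.loads(settings_json)[0]["Properties"]
    report = {}
    for kind in ("map", "reduce"):
        heap = parse_xmx_mb(props["mapreduce.%s.java.opts" % kind])
        container = int(props["mapreduce.%s.memory.mb" % kind])
        report[kind] = (heap, container, heap < container)
    return report


if __name__ == "__main__":
    for kind, (heap, container, fits) in check(SETTINGS).items():
        print(kind, heap, container, fits)
    # ASSUMPTION: 12288 MiB of the 15 GiB machine is available to YARN.
    print(containers_per_node(12288, 2560))
```

In the example settings the heap is 2048 MiB inside a 2560 MiB task, which leaves 512 MiB per task for non-heap JVM memory. That gap is why the two values are set separately.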
533+
534+
## Run example.pig
473535

474536
Now you are ready to run your first sample program. Take a look at the
475537
starter code that we provided in the course materials repo. Copy and paste
@@ -487,4 +549,4 @@ It can take some time for the reducers to start making any progress.
487549
* The example generates more than 1 MapReduce job so be patient.
488550

489551
As described earlier, monitor your job as it runs.
490-
When it's done, copy your results and *terminate your cluster*.
552+
When it's done, copy your results and _**terminate your cluster**_.
