Benchmarking on EC2
- Connect to the master:
*/ec2/uw-ec2.py -n <num-slaves> connect
- Initialize Hadoop, HDFS, and all systems:
~/benchmark/init-all.sh
If something goes wrong, just run init-all.sh
again. It cleans things up before each run.
Note: If you are not using the default m1.xlarge instances, ensure you modify ~/benchmark/common/get-configs.sh to have the correct settings for your instance type. Then run (or re-run) init-all.sh.
- The master image includes a small amazon0505 dataset. The larger datasets can be downloaded into ~/datasets/ (see Datasets).
- Load the datasets to HDFS:
~/benchmark/datasets/load-files.sh
~/benchmark/datasets/load-splits.sh
(See also: Datasets.)
Note 1: We strongly recommend running this in a screen so that it continues even if the SSH connection is killed. Start or resume a screen using screen -R and detach from it using Ctrl+a d. If your screen is attached to another SSH session, detach it using screen -d.
Note 2: EBS volumes suffer from a first read/write penalty, so it may be worthwhile to dump the larger datasets before loading them to HDFS: e.g., cat ~/datasets/uk0705-adj.txt > /dev/null. If sudo iotop reports reads of >50 MB/s, then there's no need to do this.
Note 3: This is intended for 4, 8, 16, 32, 64, or 128 slave/worker machines. For other cases, you will need to specify a size
argument.
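Putting the steps above together, a typical first-time session might look like the following sketch (the 16-slave count and the uk0705 dataset are only illustrative):

*/ec2/uw-ec2.py -n 16 connect                 # on your local machine: connect to the master
~/benchmark/init-all.sh                       # on the master: initialize Hadoop, HDFS, and all systems
screen -R                                     # do the loading inside a screen
cat ~/datasets/uk0705-adj.txt > /dev/null     # optional: warm up the EBS volume for a large dataset
~/benchmark/datasets/load-files.sh            # load the datasets into HDFS
~/benchmark/datasets/load-splits.sh
# detach from the screen with Ctrl+a d once the load scripts finish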
If the source code for any of the systems is changed on the master, recompile using:
~/benchmark/<system>/recompile-<system>.sh
This recompiles the system and copies the necessary binaries to all slave/worker machines. For example, ~/benchmark/giraph/recompile-giraph.sh
recompiles Giraph.
Note 1: Use at least m1.medium/m3.medium. t1.micro and m1.small instances do not have enough memory to compile the systems.
Note 2: Giraph's recompile script only recompiles giraph-examples. If you change anything in ~/giraph-1.0.0/giraph-core/, modify recompile-giraph.sh to have -pl giraph-core,giraph-examples so that giraph-core is also recompiled.
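The exact Maven invocation inside recompile-giraph.sh may differ from what is shown here; as a rough sketch (goals and flags assumed), the change amounts to:

# before (assumed): only giraph-examples is rebuilt
mvn -pl giraph-examples clean package -DskipTests
# after: rebuild giraph-core as well
mvn -pl giraph-core,giraph-examples clean package -DskipTests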
To run all experiments, use:
~/benchmark/bench-all.sh
Definitely do this in a screen!
NOTE: This only works for 4, 8, 16, 32, 64, or 128 slave/worker machines!
To run specific systems, use ~/benchmark/<system>/benchall.sh. For specific algorithms, use ~/benchmark/<system>/<alg>.sh. See here for details.
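For example, the following runs everything, one system, or one algorithm; the pagerank.sh script name is an assumption based on the log names used later in this page:

screen -R                          # always benchmark inside a screen
~/benchmark/bench-all.sh           # all systems, all algorithms
~/benchmark/giraph/benchall.sh     # only Giraph
~/benchmark/giraph/pagerank.sh     # only PageRank on Giraph (script name assumed)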
Grab the public IP (or public DNS) address for the master machine of interest. This can be found:
- on the master via curl http://instance-data/latest/meta-data/public-ipv4
- by SSHing to the master from your local machine (*/ec2/uw-ec2.py -n <num-slaves> connect) and disconnecting
- or by filtering for the master by name in the EC2 console (e.g., cw0)
The Hadoop web interface (for Giraph) is at http://<public-ip>:50030.
The GPS web interface is at http://<public-ip>:8888/monitoring.html.
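As a small convenience sketch, you can capture the address once on the master and print both URLs:

MASTER_IP=$(curl -s http://instance-data/latest/meta-data/public-ipv4)
echo "Hadoop (Giraph): http://${MASTER_IP}:50030"
echo "GPS:             http://${MASTER_IP}:8888/monitoring.html"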
To quickly look at the time, memory, and network stats for completed runs, use:
~/benchmark/parsers/batch-parser.py <system> <time-log-files>
For example, ./batch-parser.py 0 ~/benchmark/giraph/logs/wcc*time.txt
parses logs for all completed runs on Giraph with WCC. Use --help
to see the options.
For the naming convention of the log files, see Log Files.
System failures can and will happen. Here's a list of the quirks and what to do. Don't panic!
All benchmarking scripts use ~/benchmark/common/bench-init.sh before running a system and ~/benchmark/common/bench-finish.sh after the run completes. These start the sar and free monitoring processes on the master and worker machines.
If a run on any system is cancelled manually (e.g., Ctrl-c), be sure to clean things up by running:
~/benchmark/common/cleanup-bench.sh
Alternatively, if you want to pull the workers' logs for that failed run, cd to the correct system directory and use:
~/benchmark/common/bench-finish.sh <log-prefix>
The cd part is important: bench-finish.sh is fragile! For example:
cd ~/benchmark/giraph/
../common/bench-finish.sh pagerank_orkut-adj.txt_16_20140101-123050
Finally, any cancelled run's log files will not be cleaned up. To find these bad log files, use
~/benchmark/parsers/log-checker.sh <log-dir>/*_time.txt
or
~/benchmark/parsers/log-checker.sh <log-dir>/*_0_mem.txt
For example, ./log-checker.sh ~/benchmark/giraph/logs/*_time.txt.
If a run in Giraph is cancelled, run ~/benchmark/giraph/kill-java-job.sh
to clean up lingering Java instances on the workers (and the master).
Note that this does not fail the pending job in the Hadoop web monitoring page. To clear it, use hadoop job -kill <job-id>
(2 or 3 times). Alternatively, restart Hadoop altogether with ~/benchmark/hadoop/restart-hadoop.sh 1 (the 1 tells the script to wait until HDFS is back up).
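A hedged example of the full cleanup sequence after cancelling a Giraph run (the job id below is illustrative; use whatever hadoop job -list reports):

~/benchmark/giraph/kill-java-job.sh       # kill lingering Java instances on the workers and master
hadoop job -list                          # find the id of the pending job
hadoop job -kill job_201401011230_0001    # repeat 2 or 3 times if the job lingers
~/benchmark/hadoop/restart-hadoop.sh 1    # or restart Hadoop and wait for HDFS to come back up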
In general, failures can be diagnosed by checking the Hadoop web interface at http://<public-ip>:50030. Otherwise, look at the worker machines' Hadoop logs, located in their ~/hadoop-1.0.4/logs/userlogs/.
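To inspect a worker's logs without leaving the master, something like the following works, assuming the workers are reachable by hostname (cw1 is a hypothetical worker name):

ssh cw1 'grep -ri "error\|exception" ~/hadoop-1.0.4/logs/userlogs/ | tail -n 20'   # cw1 is assumed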
In the console, Giraph outputs "INFO mapred.JobClient: map x% reduce 0%". Here are some signs of failure:
- x% never reaches 100% or 99%: usually because the input dataset doesn't exist on HDFS
- x% reaches 100% or 99% but starts decreasing: some machines have failed
If a run in GPS is cancelled, run ~/benchmark/gps/stop-nodes.sh
to clean up GPS tasks. Make sure to sleep 60
(on 4-32 machines) or sleep 80
(on 64-128 machines) before performing another run.
GPS has a bad habit of hanging indefinitely on a failed job---the console will not show any warning signs. Be vigilant and monitor http://<public-ip>:8888/workers.html
every now and then. One common failure is connections not being established with workers: this is indicated by "Latest Status Receive Time" and "Connection Establishment Time" being "null".
Lastly, GPS's *_time.txt log files are created only after a run completes, so search for *_0_mem.txt with log-checker.sh when cleaning up bad log files.
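A hedged example of cleaning up after a cancelled GPS run on 16 machines (the ~/benchmark/gps/logs/ path mirrors Giraph's layout and is an assumption):

~/benchmark/gps/stop-nodes.sh                                         # stop GPS tasks everywhere
sleep 60                                                              # 4-32 machines; use sleep 80 for 64-128
~/benchmark/parsers/log-checker.sh ~/benchmark/gps/logs/*_0_mem.txt   # find leftover bad log files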
Nothing needed! These systems clean up after themselves.
To grab the experimental results (i.e., log files):
- Create a tarball of the contents in ~/benchmark/<system>/logs/ for each system. For example, ~/benchmark/giraph/logs/giraph-32-results.tar.gz.
Note: Compression is highly recommended: tar -zcf giraph-32-results.tar.gz *.txt
, or tar -cf - *.txt | pigz > giraph-32-results.tar.gz
for multicore compression.
- On a local machine, run:
*/ec2/uw-ec2.py -n <num-slaves> get-logs
This will download ~/benchmark/<system>/logs/tarball.tar.gz to */results/<system>/<num-slaves>/tarball.tar.gz. Simply extract with tar -xf tarball.tar.gz and you're done!
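End to end, fetching Giraph results from a hypothetical 32-slave cluster might look like this (names follow the examples above):

# on the master
cd ~/benchmark/giraph/logs/
tar -zcf giraph-32-results.tar.gz *.txt
# on your local machine
*/ec2/uw-ec2.py -n 32 get-logs
cd */results/giraph/32/
tar -xf giraph-32-results.tar.gz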