
Benchmarking on EC2



Running Benchmarks

Initialization

  1. Connect to the master:

     */ec2/uw-ec2.py -n <num-slaves> connect

  2. Initialize Hadoop, HDFS, and all systems:

     ~/benchmark/init-all.sh

If something goes wrong, just run init-all.sh again. It cleans things up before each run.

Note: If you are not using the default m1.xlarge instances, ensure you modify ~/benchmark/common/get-configs.sh to have the correct settings for your instance type. Then run (or re-run) init-all.sh.

  3. The master image includes a small amazon0505 dataset. The larger datasets can be downloaded into ~/datasets/ (see Datasets).

  4. Load the datasets into HDFS (a quick verification sketch follows the notes below):

     ~/benchmark/datasets/load-files.sh
     ~/benchmark/datasets/load-splits.sh

(See also: Datasets.)

Note 1: We strongly recommend running this in a screen so that it continues even if the SSH connection is killed. Start or resume a screen using screen -R and detach from it using Ctrl+a d. If your screen is attached to another SSH session, detach it using screen -d.
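For instance, a typical screen workflow for the load scripts might look like the following (the session name "loading" is arbitrary):

screen -R loading                    # start or resume a session named "loading"
~/benchmark/datasets/load-files.sh   # run the load scripts inside the session
~/benchmark/datasets/load-splits.sh
# Ctrl+a d detaches; "screen -R loading" reattaches; "screen -d loading" force-detaches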

Note 2: EBS volumes suffer from a first read/write penalty, so it may be worthwhile to dump the larger datasets before loading them to HDFS: e.g., cat ~/datasets/uk0705-adj.txt > /dev/null. If sudo iotop reports reads of >50MB/s, then there's no need to do this.
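As a sketch, warming a large dataset while watching disk throughput (run iotop in a second terminal or screen window; -o shows only processes currently doing I/O):

sudo iotop -o                               # watch read throughput in a separate terminal
cat ~/datasets/uk0705-adj.txt > /dev/null   # read the file once to warm the EBS volume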

Note 3: This is intended for 4, 8, 16, 32, 64, or 128 slave/worker machines. For other cases, you will need to specify a size argument.
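After the load scripts finish, a quick sanity check that the files actually landed in HDFS can be done with the standard Hadoop commands below; the exact HDFS paths depend on what the load scripts use, so adjust as needed.

hadoop dfs -ls             # list the HDFS user directory (adjust the path if the scripts load elsewhere)
hadoop dfsadmin -report    # confirm all datanodes are live and see how much space is used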

Recompiling Systems (Optional)

If the source code for any of the systems is changed on the master, recompile using:

~/benchmark/<system>/recompile-<system>.sh

This recompiles the system and copies the necessary binaries to all slave/worker machines. For example, ~/benchmark/giraph/recompile-giraph.sh recompiles Giraph.

Note 1: Use at least m1.medium/m3.medium. t1.micro and m1.small instances do not have enough memory to compile the systems.

Note 2: Giraph's recompile script only recompiles giraph-examples. If you change anything in ~/giraph-1.0.0/giraph-core/, modify recompile-giraph.sh to use -pl giraph-core,giraph-examples so that giraph-core is recompiled as well.
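For reference, the change amounts to passing both modules to Maven's -pl (project list) option. A rough sketch of the underlying Maven invocation, assuming a standard multi-module build (the script's actual flags and any Hadoop profile options may differ):

cd ~/giraph-1.0.0/
# rebuild both modules; -pl limits the build to the listed projects
mvn clean package -DskipTests -pl giraph-core,giraph-examples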

Running Experiments

To run all experiments, use:

~/benchmark/bench-all.sh

Definitely do this in a screen!

NOTE: This only works for 4, 8, 16, 32, 64, or 128 slave/worker machines!

To run specific systems, use ~/benchmark/<system>/benchall.sh. For specific algorithms, ~/benchmark/<system>/<alg>.sh. See here for details.
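For example, a full run inside a screen (session name arbitrary) might look like:

screen -R bench            # start or resume a session named "bench"
~/benchmark/bench-all.sh   # run all experiments inside the session
# detach with Ctrl+a d; reattach later with screen -R bench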

Monitoring Experiments

Monitoring Giraph and GPS

Grab the public IP (or public DNS) address for the master machine of interest. This can be found:

  • on the master via curl http://instance-data/latest/meta-data/public-ipv4
  • by SSHing to the master from your local machine (*/ec2/uw-ec2.py -n <num-slaves> connect) and disconnecting
  • or by filtering for the master by name in the EC2 console (e.g., cw0)

The Hadoop web interface (for Giraph) is at http://<public-ip>:50030.
The GPS web interface is at http://<public-ip>:8888/monitoring.html.
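As a small convenience, both URLs can be printed directly on the master using the metadata query from above (the variable name is just for illustration):

MASTER_IP=$(curl -s http://instance-data/latest/meta-data/public-ipv4)
echo "Hadoop (Giraph) web interface: http://${MASTER_IP}:50030"
echo "GPS web interface:             http://${MASTER_IP}:8888/monitoring.html"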

Results for Completed Runs

To quickly look at the time, memory, and network stats for completed runs, use:

~/benchmark/parsers/batch-parser.py <system> <time-log-files>

For example, ./batch-parser.py 0 ~/benchmark/giraph/logs/wcc*time.txt parses logs for all completed runs on Giraph with WCC. Use --help to see the options.

For the naming convention of the log files, see Log Files.

Dealing with Failures

System failures can and will happen. Here's a list of the quirks and what to do. Don't panic!

General Issues

Cleaning up sar/free

All benchmarking scripts run ~/benchmark/common/bench-init.sh before a system starts and ~/benchmark/common/bench-finish.sh after the run completes. These scripts start and stop the sar and free monitoring processes on the master and worker machines.

If a run on any system is cancelled manually (e.g., Ctrl-c), be sure to clean things up by running:

~/benchmark/common/cleanup-bench.sh

Alternatively, if you want to pull the workers' logs for that failed run, cd to the correct system directory and use:

~/benchmark/common/bench-finish.sh <log-prefix>

The cd part is important---bench-finish.sh is fragile! For example:

cd ~/benchmark/giraph/
../common/bench-finish.sh pagerank_orkut-adj.txt_16_20140101-123050

Cleaning up Logs

Any cancelled run's log files will not be cleaned up automatically. To find these bad log files, use

~/benchmark/parsers/log-checker.sh <log-dir>/*_time.txt

or

~/benchmark/parsers/log-checker.sh <log-dir>/*_0_mem.txt

For example, ./log-checker.sh ~/benchmark/giraph/logs/*_time.txt.

System-specific Issues

Giraph

If a run in Giraph is cancelled, run ~/benchmark/giraph/kill-java-job.sh to clean up lingering Java instances on the workers (and the master).

Note that this does not fail the pending job in the Hadoop web monitoring page. To clear it, use hadoop job -kill <job-id> (2 or 3 times). Alternatively, restart Hadoop altogether with ~/benchmark/hadoop/restart-hadoop.sh 1 (the 1 tells the script to wait until HDFS is back up).
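For reference, a minimal kill sequence from the master (job IDs can be read from hadoop job -list or from the web interface):

hadoop job -list                         # find the stuck job's <job-id>
hadoop job -kill <job-id>                # repeat 2 or 3 times if the job lingers
~/benchmark/hadoop/restart-hadoop.sh 1   # last resort: restart Hadoop and wait for HDFS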

In general, failures can be diagnosed by checking the Hadoop web interface at http://<public-ip>:50030. Otherwise, look at the worker machines' Hadoop logs, located in their ~/hadoop-1.0.4/logs/userlogs/.

In the console, Giraph outputs "INFO mapred.JobClient: map x% reduce 0%". Here are some signs of failure:

  • x% never reaches 100% or 99%: usually because the input dataset doesn't exist on HDFS
  • x% reaches 100% or 99% but starts decreasing: some machines have failed

GPS

If a run in GPS is cancelled, run ~/benchmark/gps/stop-nodes.sh to clean up GPS tasks. Make sure to sleep 60 (on 4-32 machines) or sleep 80 (on 64-128 machines) before performing another run.
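For example, after cancelling a run on a 16-machine cluster:

~/benchmark/gps/stop-nodes.sh   # kill lingering GPS tasks on the master and workers
sleep 60                        # for 4-32 machines; use sleep 80 for 64-128 machines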

GPS has a bad habit of hanging indefinitely on a failed job---the console will not show any warning signs. Be vigilant and monitor http://<public-ip>:8888/workers.html every now and then. One common failure is connections not being established with workers: this is indicated by "Latest Status Receive Time" and "Connection Establishment Time" being "null".

Lastly, GPS creates *_time.txt log files only after a run completes, so when cleaning up bad log files, search for *_0_mem.txt with log-checker.sh instead.
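For example, assuming GPS keeps its logs in ~/benchmark/gps/logs/ (mirroring the Giraph layout above):

./log-checker.sh ~/benchmark/gps/logs/*_0_mem.txt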

Mizan and GraphLab

Nothing needed! These systems clean up after themselves.

Grabbing Results

To grab the experimental results (i.e., log files):

  1. Create a tarball of the contents in ~/benchmark/<system>/logs/ for each system. For example, ~/benchmark/giraph/logs/giraph-32-results.tar.gz.

Note: Compression is highly recommended: tar -zcf giraph-32-results.tar.gz *.txt, or tar -cf - *.txt | pigz > giraph-32-results.tar.gz for multicore compression.

  2. On a local machine, run:

     */ec2/uw-ec2.py -n <num-slaves> get-logs

This will download ~/benchmark/<system>/logs/tarball.tar.gz to */results/<system>/<num-slaves>/tarball.tar.gz. Simply extract with tar -xf tarball.tar.gz and you're done!
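Putting the two steps together for Giraph on a 32-slave cluster (following the example filenames above, with * standing in for the local repository path):

# on the master: bundle Giraph's logs
cd ~/benchmark/giraph/logs/
tar -zcf giraph-32-results.tar.gz *.txt

# on the local machine: fetch and extract
*/ec2/uw-ec2.py -n 32 get-logs
cd */results/giraph/32/
tar -xf giraph-32-results.tar.gz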
