
Dataset Scripts

xvz edited this page Nov 27, 2014 · 27 revisions

Content

Aside from load-files.sh, which loads specific graphs from */datasets/ to HDFS, and load-splits.sh, which splits graphs into parts before loading them to HDFS, all the files in */benchmark/datasets/ convert datasets between different formats.

Dataset Formats

The SNAP format is

src1 dst1
src1 dst2
src1 dst3
...
src2 dst1
src2 dst2
...

while the adj or adjacency format is

src1 dst1 dst2 dst3 ...
src2 dst1 dst2 ...
...

where src and dst are both vertex IDs. In both cases, each edge is a directed edge (i.e., an arc).
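As a rough illustration of how the two layouts relate, the following sketch groups SNAP lines that share a source vertex into a single adj line. The function name snap_to_adj is hypothetical and not part of the repository, and the sketch assumes the input is already grouped by source vertex.

```shell
# Hypothetical helper, not part of the repository.
# Reads unweighted SNAP lines ("src dst") on stdin, assumed grouped by src,
# and writes one adj line ("src dst1 dst2 ...") per source vertex.
snap_to_adj() {
  awk '
    NF < 2 { next }                      # skip blank or malformed lines
    !seen || $1 != cur {                 # a new source vertex starts here
      if (seen) print line               # flush the previous vertex
      cur = $1; line = $1; seen = 1
    }
    { line = line " " $2 }               # append this destination
    END { if (seen) print line }         # flush the last vertex
  '
}

# Example: three SNAP edges become two adj lines.
printf '1 2\n1 3\n2 1\n' | snap_to_adj
```

The example prints `1 2 3` followed by `2 1`.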

If the graph is weighted, the SNAP format becomes

src1 dst1 weight1
src1 dst2 weight2
...

while the adj format becomes

src1 dst1 weight1 dst2 weight2 ...
...
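To make the weighted layouts concrete, here is a small sketch that expands weighted adj lines back into weighted SNAP lines by walking the (dst, weight) pairs. The function name adj_to_snap_weighted is hypothetical and only illustrates the formats described above.

```shell
# Hypothetical helper, not part of the repository.
# Reads weighted adj lines ("src dst1 w1 dst2 w2 ...") on stdin and
# writes one weighted SNAP line ("src dst weight") per edge.
adj_to_snap_weighted() {
  awk '{
    for (i = 2; i + 1 <= NF; i += 2)     # walk the (dst, weight) pairs
      print $1, $i, $(i + 1)
  }'
}

# Example: two adj lines become three SNAP edges.
printf '1 2 10 3 20\n2 1 10\n' | adj_to_snap_weighted
```

The example prints `1 2 10`, `1 3 20`, and `2 1 10`, one per line.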

Mizan uses SNAP, while Giraph, GPS, and GraphLab all use adj.

The datasets in ~/datasets/ on the master machine are named graph.txt for the SNAP version and graph-adj.txt for the adj version. Directed graphs are named graph*, while undirected, weighted graphs (used in DMST) are named graph-mst*. For example:

Filename             Format
orkut.txt            directed graph in SNAP format
orkut-adj.txt        directed graph in adj format
orkut-mst.txt        undirected, weighted graph in SNAP format
orkut-mst-adj.txt    undirected, weighted graph in adj format
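The naming convention can be checked mechanically. The sketch below maps a filename to its description using the suffixes listed above; describe_dataset is a hypothetical helper, not one of the repository's scripts.

```shell
# Hypothetical helper, not part of the repository.
# Prints the graph type and format implied by a dataset filename.
# The most specific suffix is tested first, since *-mst-adj.txt
# also matches *-adj.txt.
describe_dataset() {
  case "$1" in
    *-mst-adj.txt) echo "undirected, weighted graph in adj format" ;;
    *-mst.txt)     echo "undirected, weighted graph in SNAP format" ;;
    *-adj.txt)     echo "directed graph in adj format" ;;
    *.txt)         echo "directed graph in SNAP format" ;;
    *)             echo "unknown"; return 1 ;;
  esac
}

describe_dataset orkut-mst-adj.txt
```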

Format Converters

To use the format converters...

  1. Compile them:
cd */benchmark/datasets/
make
  2. Use the *.sh scripts while in the */datasets/ folder. For example, to get orkut-adj.txt:
cd */datasets/
../benchmark/datasets/convert-adj.sh orkut.txt 0

In more detail, mst-convert.cpp converts a sorted SNAP graph to an unsorted, undirected SNAP graph, with unique random edge weights. snap-convert.cpp converts from SNAP to adj, and snap-revert.cpp converts from adj to SNAP.
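The idea behind the undirected, weighted conversion can be sketched as follows. This is not the logic of mst-convert.cpp, whose details are not shown here. It is a minimal illustration that treats each pair of endpoints as one undirected edge, gives it a unique weight taken from a random permutation of 1..m, and emits both directions. The function name and the choice to drop self-loops are assumptions.

```shell
# Hypothetical helper, not part of the repository and not the algorithm
# used by mst-convert.cpp. Reads SNAP lines ("src dst") on stdin and writes
# an undirected, weighted SNAP graph: each undirected edge appears in both
# directions with the same weight, and weights are unique across edges.
snap_to_undirected_weighted() {
  awk '
    NF >= 2 && $1 != $2 {                # assumption: self-loops are dropped
      a = $1; b = $2
      if (a + 0 > b + 0) { t = a; a = b; b = t }   # canonical order: a < b
      key = a " " b
      if (!(key in seen)) { seen[key] = 1; edges[++m] = key }
    }
    END {
      srand()
      for (i = 1; i <= m; i++) w[i] = i            # weights 1..m
      for (i = m; i > 1; i--) {                    # Fisher-Yates shuffle
        j = int(rand() * i) + 1
        t = w[i]; w[i] = w[j]; w[j] = t
      }
      for (i = 1; i <= m; i++) {
        split(edges[i], e, " ")
        print e[1], e[2], w[i]
        print e[2], e[1], w[i]
      }
    }
  '
}
```

For example, `printf '1 2\n2 1\n1 3\n' | snap_to_undirected_weighted` yields four lines covering the two undirected edges 1-2 and 1-3. The output is not sorted.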

The bash script convert-mst.sh is a wrapper for mst-convert.cpp that sorts the input and output. It uses Unix sort, so large graphs should be converted on machines with SSDs (or enough RAM to sort in memory). The script outputs *-mst.txt.
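A sort of this kind might look like the sketch below. The exact flags used by convert-mst.sh are not shown here, so the keys and the sort_snap name are assumptions. The sketch sorts numerically by source, then by destination.

```shell
# Hypothetical helper, not part of the repository.
# Sorts SNAP lines numerically by source vertex, then by destination.
# LC_ALL=C avoids locale-dependent collation and is usually faster.
# Extra arguments (options or an input file) are passed through to sort.
sort_snap() {
  LC_ALL=C sort -k1,1n -k2,2n "$@"
}

# Example: prints "1 2", "1 3", "2 1", "10 1" (one per line).
printf '2 1\n1 3\n1 2\n10 1\n' | sort_snap
```

With GNU sort, the -S (buffer size) and -T (temporary directory) options can be passed through, for instance to point temporary files at an SSD when a graph does not fit in memory.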

Similarly, convert-adj.sh is a wrapper for snap-convert.cpp, which outputs *-adj.txt.

Input Splits

GraphLab requires input files to be split into parts to utilize parallel input loading (otherwise loading becomes serial). split-inputs.sh splits a given file (at newlines) into a number of contiguous parts. The number of parts should be at least the number of worker machines.
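Splitting at newlines into contiguous parts can be sketched with wc and split. This is an illustration under assumptions, not the contents of split-inputs.sh: the function name and its arguments are hypothetical, and because the line count is rounded up per part, a small file can yield fewer parts than requested.

```shell
# Hypothetical helper, not part of the repository.
# Usage: split_lines <file> <parts> <output-prefix>
# Splits <file> at line boundaries into at most <parts> contiguous pieces
# named <output-prefix>aa, <output-prefix>ab, ...
split_lines() {
  sl_file=$1
  sl_parts=$2
  sl_prefix=$3
  sl_total=$(wc -l < "$sl_file")
  # Lines per part, rounded up so that no more than sl_parts files result.
  sl_per=$(( (sl_total + sl_parts - 1) / sl_parts ))
  [ "$sl_per" -gt 0 ] || sl_per=1
  split -l "$sl_per" "$sl_file" "$sl_prefix"
}
```

For example, `split_lines orkut-adj.txt 16 orkut-adj-part-` would write pieces named orkut-adj-part-aa, orkut-adj-part-ab, and so on. With GNU coreutils, `split -n l/N` is an alternative that produces exactly N parts without breaking lines.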

load-splits.sh uses split-inputs.sh to split graphs and then loads them to HDFS.

Datasets

We obtained our datasets from SNAP and LAW. For EC2, place datasets into ~/datasets/ on the master. For local testing, place them in */datasets/.

Our Datasets

The exact datasets we used can be obtained here. They come as compressed .xz files, which can be extracted with unxz graph.txt.xz.

All graphs are in the adjacency format. The SNAP format (used by Mizan) can be obtained using */benchmark/datasets/snap-revert (see Format Converters).

Other Datasets

If you obtain your own datasets, you'll want to do the following:

  1. For SNAP datasets, remove the header lines at the top of the file (the first four in this example) and, optionally, any carriage returns (\r):
tail -n +5 snap-graph.txt | tr -d '\r' > snap-cleaned.txt

This will give a graph in the SNAP format. Convert it to adj, mst, or mst-adj as needed with the format converters.

  2. For LAW datasets, you'll need to download WebGraph (or here and here) and decompress law-graph.graph using:
cd <webgraph-dir>
java -cp *:./ it.unimi.dsi.webgraph.ArcListASCIIGraph law-graph output-graph.txt

The output will be output-graph.txt in the SNAP format. Again, convert as needed.

  3. On HDFS, all datasets should go into ./input. For example:
hadoop dfs -put <graph-file> input/

For GraphLab, make sure to split the graph file into multiple parts. A directory containing the parts should then go into ./input as well:

hadoop dfs -put <parts-directory>/ input/
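
For the SNAP cleanup in step 1, the number of header lines can differ between files. Assuming the header lines all begin with #, as in the edge lists distributed by SNAP, a filter like the following removes them regardless of how many there are. The clean_snap name is hypothetical.

```shell
# Hypothetical helper, not part of the repository.
# Reads a raw SNAP download on stdin, drops comment lines that start
# with '#', and strips carriage returns.
clean_snap() {
  grep -v '^#' | tr -d '\r'
}

# Example: two header lines and CRLF line endings are removed.
printf '# Directed graph\n# FromNodeId ToNodeId\n1 2\r\n2 3\r\n' | clean_snap
```

A typical call would be `clean_snap < snap-graph.txt > snap-cleaned.txt`.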