Dataset Scripts

Content

Dataset Formats
Format Converters
Input Splits
Datasets

Aside from load-files.sh, which loads specific graphs from */datasets/ to HDFS, and load-splits.sh, which splits graphs into parts before loading them to HDFS, all the files in */benchmark/datasets/ convert datasets between different formats.

Dataset Formats

The SNAP format is

src1 dst1
src1 dst2
src1 dst3
...
src2 dst1
src2 dst2
...

while the adj or adjacency format is

src1 dst1 dst2 dst3 ...
src2 dst1 dst2 ...
...

where src and dst are both vertex IDs. In both cases, each edge is a directed edge (i.e., an arc).
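The relationship between the two layouts can be sketched with a short awk filter. This is only an illustration, not the repository's snap-convert: the function name snap_to_adj is hypothetical, and it assumes the input is unweighted and already grouped by source ID, as in the listing above.

```shell
# Hypothetical helper (not the repo's snap-convert): unweighted SNAP -> adj.
# Assumes all edges of one source vertex appear on consecutive lines.
snap_to_adj() {
  awk '
    # A new source ID starts a new output line; flush the previous one first.
    NR == 1 || $1 != src { if (NR > 1) print line; src = $1; line = $1 }
    # Append the destination of the current edge.
    { line = line " " $2 }
    # Flush the last adjacency list.
    END { if (NR > 0) print line }
  '
}
```

For example, `snap_to_adj < orkut.txt` would write the adjacency lines to standard output, provided orkut.txt is sorted by source.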

If the graph is weighted, the SNAP format becomes

src1 dst1 weight1
src1 dst2 weight2
...

while the adj format becomes

src1 dst1 weight1 dst2 weight2 ...
...
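The reverse direction, including the weighted case, can be sketched the same way. Again this is an illustration rather than the repository's snap-revert: the function name adj_to_snap and its 0/1 argument are hypothetical.

```shell
# Hypothetical helper (not the repo's snap-revert): adj -> SNAP.
# Pass 1 if the adj input carries weights, 0 (the default) otherwise.
adj_to_snap() {
  awk -v weighted="${1:-0}" '
    {
      # Weighted lists hold (dst, weight) pairs, unweighted ones only dst.
      step = weighted ? 2 : 1
      for (i = 2; i <= NF; i += step) {
        if (weighted) print $1, $i, $(i + 1)
        else          print $1, $i
      }
    }
  '
}
```

A vertex whose line lists no neighbours produces no output, since the SNAP format only records edges.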

Mizan uses SNAP, while Giraph, GPS, and GraphLab all use adj.

The datasets in ~/datasets/ on the master machine are named graph.txt for the SNAP version and graph-adj.txt for the adj version. Furthermore, directed graphs are named graph*, while undirected weighted graphs (used in DMST) are named graph-mst*. For example:

Filename	Format
orkut.txt	directed graph in SNAP format
orkut-adj.txt	directed graph in adj format
orkut-mst.txt	undirected, weighted graph in SNAP format
orkut-mst-adj.txt	undirected, weighted graph in adj format

Format Converters

To use the format converters, first compile them:

cd */benchmark/datasets/
make

Then run the *.sh scripts from within the */datasets/ folder. For example, to get orkut-adj.txt:

cd */datasets/
../benchmark/datasets/convert-adj.sh orkut.txt 0

In more detail, mst-convert.cpp converts a sorted SNAP graph to an unsorted, undirected SNAP graph, with unique random edge weights. snap-convert.cpp converts from SNAP to adj, and snap-revert.cpp converts from adj to SNAP.

The bash script convert-mst.sh is a wrapper for mst-convert.cpp that sorts both the input and the output. Because it relies on Unix sort, large graphs should be converted on machines with SSDs (or with enough RAM to keep the sort in memory). The script outputs *-mst.txt.
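To make the transformation concrete, here is a rough shell sketch of what such a conversion involves. It is not the logic of mst-convert.cpp: the function name snap_to_mst is hypothetical, weights are taken from a random permutation of 1..m (one way to get unique random values), self-loops are dropped, vertex IDs are assumed to be numeric, and shuf from GNU coreutils is assumed to be available.

```shell
# Hypothetical sketch (not the repo's mst-convert): directed SNAP graph ->
# undirected SNAP graph with unique random integer weights.
snap_to_mst() {
  local edges m
  edges=$(mktemp)
  # Store each edge once as "min max"; self-loops are dropped (an assumption).
  awk '$1 != $2 { if ($1 + 0 < $2 + 0) print $1, $2; else print $2, $1 }' \
    | sort -u > "$edges"
  m=$(( $(wc -l < "$edges") ))
  if [ "$m" -gt 0 ]; then
    # shuf -i 1-m emits a random permutation, so every weight is unique.
    # Each undirected edge is written in both directions with the same weight.
    shuf -i "1-$m" | paste -d ' ' "$edges" - \
      | awk '{ print $1, $2, $3; print $2, $1, $3 }' \
      | sort -n -k1,1 -k2,2
  fi
  rm -f "$edges"
}
```

Both sort calls work on the whole edge list, which is where the disk and memory pressure mentioned above comes from.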

Similarly, convert-adj.sh is a wrapper for snap-convert.cpp, which outputs *-adj.txt.

Input Splits

GraphLab requires input files to be split into parts to utilize parallel input loading (otherwise loading becomes serial). split-inputs.sh splits a given file (at newlines) into a number of contiguous parts. The number of parts should be at least the number of worker machines.
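As an illustration of splitting at newlines into contiguous parts, the following sketch relies on the `-n l/N` mode of GNU split. It is not the contents of split-inputs.sh: the function name split_into_parts, its argument order, and the part- prefix are all hypothetical.

```shell
# Hypothetical helper (not the repo's split-inputs.sh).
# Usage: split_into_parts <graph-file> <num-parts> <output-dir>
split_into_parts() {
  local input="$1" parts="$2" outdir="$3"
  mkdir -p "$outdir"
  # -n l/N: N contiguous chunks that never break a line (GNU split only).
  # -d: numeric suffixes, giving part-00, part-01, ...
  split -n "l/$parts" -d "$input" "$outdir/part-"
}
```

Concatenating the parts in suffix order reproduces the original file, so each adjacency line stays intact within a single part.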

load-splits.sh uses split-inputs.sh to split graphs and then loads them to HDFS.

Datasets

We obtained our datasets from SNAP and LAW. For EC2, place datasets into ~/datasets/ on the master. For local testing, place them in */datasets/.

Our Datasets

The exact datasets we used can be obtained here. They come as compressed .xz files, which can be extracted with unxz graph.txt.xz.

All graphs are in the adjacency format. The SNAP format (used by Mizan) can be obtained using */benchmark/datasets/snap-revert (see Format Converters).

Other Datasets

If you obtain your own datasets, you'll want to do the following:

For SNAP datasets, remove the header lines (the command below drops the first four) and, optionally, any carriage returns (\r):

tail -n +5 snap-graph.txt | tr -d '\r' > snap-cleaned.txt
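If the number of header lines is not known in advance, a variant that filters on the comment marker may be more robust. This sketch assumes that every header line starts with `#`, which is how SNAP edge lists are usually distributed; the function name clean_snap is hypothetical.

```shell
# Hypothetical helper: drop '#' header lines and carriage returns.
# Assumes every header line begins with '#'.
clean_snap() {
  grep -v '^#' | tr -d '\r'
}
```

It would be used as `clean_snap < snap-graph.txt > snap-cleaned.txt`.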

This will give a graph in the SNAP format. Convert it to adj, mst, or mst-adj as needed with the format converters.

For LAW datasets, you'll need to download WebGraph (or here and here) and decompress law-graph.graph using:

cd <webgraph-dir>
java -cp "*:./" it.unimi.dsi.webgraph.ArcListASCIIGraph law-graph output-graph.txt

The output will be output-graph.txt in the SNAP format. Again, convert as needed.

On HDFS, all datasets should go into ./input. That is:

hadoop dfs -put <graph-file> input/

For GraphLab, make sure to split the graph file into multiple parts. A directory containing the parts should then go into ./input as well:

hadoop dfs -put <parts-directory>/ input/
