Dataset Scripts

Young edited this page May 8, 2014 · 27 revisions

Content

Aside from load-files.sh, which loads specific graphs from */datasets/ to HDFS, all the files in */benchmark/datasets/ convert datasets between different formats.

Dataset Formats

The SNAP format is

src1 dst1
src1 dst2
src1 dst3
...
src2 dst1
src2 dst2
...

while the adj or adjacency format is

src1 dst1 dst2 dst3 ...
src2 dst1 dst2 ...
...

where src and dst are both vertex IDs. In both cases, each edge is a directed edge (i.e., an arc).

If the graph is weighted, the SNAP format becomes

src1 dst1 weight1
src1 dst2 weight2
...

while the adj format becomes

src1 dst1 weight1 dst2 weight2 ...
...

Mizan uses SNAP, while Giraph, GPS, and GraphLab all use adj.
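
To make the relationship between the two formats concrete, here is a minimal sketch of a SNAP-to-adj conversion using awk. It is not the project's snap-convert.cpp; the function name snap_to_adj is hypothetical, and the sketch assumes that all edges of a given source vertex appear on consecutive lines (as in the examples above). Any weight column is carried along unchanged, so the same function covers the weighted case.

```shell
# Hypothetical helper for illustration only; not the project's converter.
# Reads SNAP edges on stdin and writes one adj line per source vertex.
# Assumes edges with the same source are on consecutive lines.
snap_to_adj() {
  awk '
    NF < 2 { next }                     # skip blank or malformed lines
    !started || $1 != prev {            # a new source vertex starts a new line
      if (started) printf "\n"
      printf "%s", $1
      prev = $1
      started = 1
    }
    {                                   # append dst (and weight, if present)
      for (i = 2; i <= NF; i++) printf " %s", $i
    }
    END { if (started) printf "\n" }
  '
}
```

For instance, `printf '1 2\n1 3\n2 1\n' | snap_to_adj` groups the two edges leaving vertex 1 onto a single line.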

The datasets in ~/datasets/ on the master machine are named graph.txt for the SNAP version and graph-adj.txt for the adj version. Furthermore, directed graphs are named graph*, while undirected weighted graphs (used in DMST) are named graph-mst*. For example:

Graph               Format
orkut.txt           directed graph in SNAP format
orkut-adj.txt       directed graph in adj format
orkut-mst.txt       undirected, weighted graph in SNAP format
orkut-mst-adj.txt   undirected, weighted graph in adj format

Format Converters

To use the format converters:

  1. Compile them:
cd */benchmark/datasets/
make
  2. Use the *.sh scripts while in the */datasets/ folder. For example, to get orkut-adj.txt:
cd */datasets/
../benchmark/datasets/convert-adj.sh orkut.txt 0

In more detail, mst-convert.cpp converts a sorted SNAP graph to an unsorted, undirected SNAP graph, with unique random edge weights. snap-convert.cpp converts from SNAP to adj, and snap-revert.cpp converts from adj to SNAP.

The bash script convert-mst.sh is a wrapper for mst-convert.cpp that sorts the input and output. This uses Unix sort, so large graphs should be converted on machines with SSDs (or with enough RAM to keep the sort in memory). The script outputs *-mst.txt.
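
As a rough illustration of what such a conversion involves, the sketch below turns a directed SNAP edge list into an undirected, weighted one using only awk and sort. It is not the actual mst-convert.cpp or convert-mst.sh, and it makes several assumptions of its own: the function name is hypothetical, self-loops are dropped, each undirected edge is written as two arcs sharing one weight, and the unique weights are simply 1..m assigned in a shuffled order.

```shell
# Hypothetical sketch; the real mst-convert.cpp may behave differently.
# stdin:  directed SNAP edges ("src dst")
# stdout: undirected weighted SNAP edges ("src dst weight"), unsorted
snap_to_undirected_weighted() {
  # 1. Canonicalise each arc as "min max" and drop self-loops (assumption).
  awk '$1 != $2 { if ($1 < $2) print $1, $2; else print $2, $1 }' |
    # 2. Remove duplicates so each undirected edge appears once.
    sort -u |
    # 3. Shuffle by prefixing a random integer key and sorting on it.
    awk 'BEGIN { srand() } { printf "%d %s\n", int(rand() * 1000000000), $0 }' |
    sort -n |
    # 4. Use the shuffled position as a unique weight; emit both directions.
    awk '{ print $2, $3, NR; print $3, $2, NR }'
}
```

Both sort steps operate on the whole edge list, which is why the storage and memory advice above matters for large graphs.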

Similarly, convert-adj.sh is a wrapper for snap-convert.cpp, which outputs *-adj.txt.
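
The reverse direction (adj back to SNAP, which snap-revert.cpp handles) can be sketched in a few lines of awk. These helpers are illustrative only and are not the project's code; the function names are hypothetical. Note that a vertex with no outgoing edges produces no output line, so it does not survive the round trip in this sketch.

```shell
# Hypothetical helpers for illustration; not the project's snap-revert.cpp.

# Unweighted adj ("src dst1 dst2 ...") to SNAP ("src dst" per line).
adj_to_snap() {
  awk '{ for (i = 2; i <= NF; i++) print $1, $i }'
}

# Weighted adj ("src dst1 w1 dst2 w2 ...") to SNAP ("src dst weight").
adj_to_snap_weighted() {
  awk '{ for (i = 2; i < NF; i += 2) print $1, $i, $(i + 1) }'
}
```

For example, `printf '1 2 3\n2 1\n' | adj_to_snap` yields one line per edge.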

Datasets

We obtained our datasets from SNAP and LAW. On the master EC2 image, the datasets are in ~/datasets/. For local testing, place them in */datasets/.

If you obtain your own datasets, you'll want to do the following:

  1. For SNAP datasets, strip the header lines at the top of the file (the command below drops the first four) and, optionally, any carriage returns (\r):
tail -n +5 snap-graph.txt | tr -d '\r' > snap-cleaned.txt

This will give a graph in the SNAP format. Convert it to adj, mst, or mst-adj as needed with the format converters.

  2. For LAW datasets, you'll need to download WebGraph and decompress law-graph.graph using:
cd <webgraph-dir>
java -cp *:./ it.unimi.dsi.webgraph.ArcListASCIIGraph law-graph output-graph.txt

The output will be output-graph.txt in the SNAP format. Again, convert as needed.

  3. On HDFS, all datasets should go into ./input, i.e.:
hadoop dfs -put <graph-file> input/
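
For the SNAP cleanup in the first step, an alternative that does not depend on the exact number of header lines is to filter on the comment marker instead. This is a sketch under the assumption that every header line begins with '#', as is typical of files downloaded from SNAP; the function name clean_snap is hypothetical.

```shell
# Hypothetical helper: drop '#' comment lines and carriage returns.
# Assumes every header line starts with '#'.
clean_snap() {
  grep -v '^#' | tr -d '\r'
}
```

It would be used as `clean_snap < snap-graph.txt > snap-cleaned.txt`. Check the top of the downloaded file first to confirm that the header really consists of '#' lines.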