Dataset Scripts

Content

Dataset Formats
Format Converters
Datasets

Aside from load-files.sh, which loads specific graphs from */datasets/ to HDFS, all the files in */benchmark/datasets/ convert datasets between different formats.

Dataset Formats

The SNAP format is

src1 dst1
src1 dst2
src1 dst3
...
src2 dst1
src2 dst2
...

while the adj or adjacency format is

src1 dst1 dst2 dst3 ...
src2 dst1 dst2 ...
...

where src and dst are both vertex IDs. In both cases, each edge is a directed edge (i.e., an arc).
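
To make the relationship between the two formats concrete, here is a minimal sketch that groups a SNAP edge list into adj lines using awk. It is only an illustration, not the repository's snap-convert.cpp. The function name snap_to_adj is hypothetical, and the sketch assumes all edges of a given source vertex are contiguous in the input (e.g., the file is sorted by src).

```shell
# Hypothetical helper (not part of the repo): turn contiguous SNAP edges
# "src dst" into adj lines "src dst1 dst2 ...".
# Assumption: edges sharing a src are adjacent in the input.
snap_to_adj() {
    awk '
        # First line, or src changed: flush the previous vertex, start a new line.
        NR == 1 || $1 != prev { if (NR > 1) print line; line = $1; prev = $1 }
        # Append the destination of the current edge.
        { line = line " " $2 }
        # Flush the last vertex (if there was any input at all).
        END { if (NR > 0) print line }
    '
}

printf '1 2\n1 3\n2 1\n' | snap_to_adj
# 1 2 3
# 2 1
```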

If the graph is weighted, the SNAP format becomes

src1 dst1 weight1
src1 dst2 weight2
...

while the adj format becomes

src1 dst1 weight1 dst2 weight2 ...
...
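
Going the other way, a weighted adj line can be expanded into one SNAP line per edge by reading the (dst, weight) pairs that follow the source. The sketch below assumes whitespace-separated fields. The name adj_to_snap_weighted is hypothetical, and this is not the repository's snap-revert.cpp.

```shell
# Hypothetical helper (not part of the repo): expand weighted adj lines
# "src dst1 w1 dst2 w2 ..." into SNAP lines "src dst w".
adj_to_snap_weighted() {
    awk '{
        # Field 1 is the source; fields 2..NF are (dst, weight) pairs.
        for (i = 2; i + 1 <= NF; i += 2)
            print $1, $i, $(i + 1)
    }'
}

printf '1 2 10 3 20\n2 1 10\n' | adj_to_snap_weighted
# 1 2 10
# 1 3 20
# 2 1 10
```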

Mizan uses SNAP, while Giraph, GPS, and GraphLab all use adj.

The datasets in ~/datasets/ on the master machine are named graph.txt for the SNAP version and graph-adj.txt for the adj version. Furthermore, directed graphs are named graph*, while undirected weighted graphs (used in DMST) are named graph-mst*. For example:

Graph	Format
orkut.txt	directed graph in SNAP format
orkut-adj.txt	directed graph in adj format
orkut-mst.txt	undirected, weighted graph in SNAP format
orkut-mst-adj.txt	undirected, weighted graph in adj format
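
If you script around these files, the naming convention can be checked with a plain case statement. The helper below is a hypothetical sketch (describe_graph is not a script in the repo) that maps a file name to the description used in the table above.

```shell
# Hypothetical helper: describe a dataset file based on its name suffix.
# The most specific suffix is tested first so *-mst-adj.txt is not
# mistaken for a plain *-adj.txt file.
describe_graph() {
    case "$1" in
        *-mst-adj.txt) echo "undirected, weighted graph in adj format" ;;
        *-mst.txt)     echo "undirected, weighted graph in SNAP format" ;;
        *-adj.txt)     echo "directed graph in adj format" ;;
        *.txt)         echo "directed graph in SNAP format" ;;
        *)             echo "unknown" ;;
    esac
}

describe_graph orkut-mst-adj.txt
# undirected, weighted graph in adj format
```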

Format Converters

To use the format converters...

Compile them:

cd */benchmark/datasets/
make

Use the *.sh scripts while in the */datasets/ folder. For example, to get orkut-adj.txt:

cd */datasets/
../benchmark/datasets/convert-adj.sh orkut.txt 0

In more detail, mst-convert.cpp converts a sorted SNAP graph to an unsorted, undirected SNAP graph, with unique random edge weights. snap-convert.cpp converts from SNAP to adj, and snap-revert.cpp converts from adj to SNAP.
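
As a rough illustration of the idea behind the mst conversion (not the actual mst-convert.cpp), the awk sketch below collapses arcs into undirected edges, gives each edge a unique random weight, and emits both directions. Several things here are assumptions: the function name, the optional seed argument, dropping self-loops, using weights 1..m, and representing an undirected edge as two arcs with the same weight.

```shell
# Hypothetical sketch (not the repo implementation): directed SNAP in,
# undirected weighted SNAP out. Optional first argument: random seed.
snap_to_undirected_weighted() {
    awk -v seed="${1:-1}" '
        BEGIN { srand(seed) }
        # Assumption: self-loops are dropped.
        $1 != $2 {
            # Canonical key so that (u,v) and (v,u) count as one undirected edge.
            a = ($1 + 0 < $2 + 0) ? $1 : $2
            b = ($1 + 0 < $2 + 0) ? $2 : $1
            key = a " " b
            if (!(key in seen)) { seen[key] = 1; edges[++m] = key }
        }
        END {
            # A Fisher-Yates shuffle of 1..m yields unique random weights.
            for (i = 1; i <= m; i++) w[i] = i
            for (i = m; i > 1; i--) {
                j = int(rand() * i) + 1
                t = w[i]; w[i] = w[j]; w[j] = t
            }
            # Emit each undirected edge as two arcs sharing one weight.
            for (i = 1; i <= m; i++) {
                split(edges[i], e, " ")
                print e[1], e[2], w[i]
                print e[2], e[1], w[i]
            }
        }
    '
}

printf '1 2\n2 1\n1 3\n' | snap_to_undirected_weighted 42
```

Because the weights are a permutation of 1..m, no two distinct edges share a weight, which keeps the minimum spanning tree unique.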

The bash script convert-mst.sh is a wrapper for mst-convert.cpp that sorts the input and output. This uses Unix sort, so large graphs should be converted on machines with SSDs (or sufficient RAM to keep things in-memory). The script outputs *-mst.txt.
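
For reference, a numeric sort of a SNAP edge list by src and then dst can be sketched as follows. This is not the contents of convert-mst.sh; sort_snap, SORT_MEM, and SORT_TMP are hypothetical names. The -S (buffer size) and -T (temporary directory) options are those of GNU sort, and they are the knobs that matter when a large graph does not fit in memory.

```shell
# Hypothetical helper: numerically sort SNAP edges by src, then dst.
# SORT_MEM and SORT_TMP are illustrative variables, not repo settings.
sort_snap() {
    # LC_ALL=C avoids locale-dependent behaviour and is usually faster.
    # -S caps the in-memory buffer; -T selects where spill files are written
    # (point it at an SSD-backed directory for large graphs).
    LC_ALL=C sort -k1,1n -k2,2n \
        -S "${SORT_MEM:-256M}" -T "${SORT_TMP:-/tmp}" "$@"
}

printf '10 2\n2 30\n2 4\n' | sort_snap
# 2 4
# 2 30
# 10 2
```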

Similarly, convert-adj.sh is a wrapper for snap-convert.cpp, which outputs *-adj.txt.

Datasets

We obtained our datasets from SNAP and LAW. On the master EC2 image, the datasets are in ~/datasets/. For local testing, place them in */datasets/.

If you obtain your own datasets, you'll want to do the following:

For SNAP datasets, prune the header lines at the top of the file and (optionally) strip carriage returns (\r):

cat snap-graph.txt | tail -n +5 | tr -d '\r' > snap-cleaned.txt

This will give a graph in the SNAP format. Convert it to adj, mst, or mst-adj as needed with the format converters.
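
The command above drops a fixed number of lines. If you are unsure how long the header is, an alternative sketch is to filter on the comment marker instead. This assumes the header lines start with '#', as in the text files distributed by SNAP. The name clean_snap is hypothetical.

```shell
# Hypothetical helper: remove SNAP header comments and carriage returns.
# Assumption: every header line starts with the # character.
clean_snap() {
    grep -v '^#' "$@" | tr -d '\r'
}

printf '# Directed graph\r\n# FromNodeId\tToNodeId\r\n1\t2\r\n2\t3\r\n' | clean_snap
```

Usage mirrors the original one-liner, e.g. clean_snap snap-graph.txt > snap-cleaned.txt.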

For LAW datasets, you'll need to download WebGraph and decompress law-graph.graph using:

cd <webgraph-dir>
java -cp *:./ it.unimi.dsi.webgraph.ArcListASCIIGraph law-graph output-graph.txt

The output will be output-graph.txt in the SNAP format. Again, convert as needed.

On HDFS, all datasets should go into ./input. That is:

hadoop dfs -put <graph-file> input/
