Dataset Scripts
Aside from load-files.sh, which loads specific graphs from */datasets/ to HDFS, and load-splits.sh, which splits graphs into parts before loading them to HDFS, all the files in */benchmark/datasets/ convert datasets between different formats.
The SNAP format is
src1 dst1
src1 dst2
src1 dst3
...
src2 dst1
src2 dst2
...
while the adj or adjacency format is
src1 dst1 dst2 dst3 ...
src2 dst1 dst2 ...
...
where src and dst are both vertex IDs. In both cases, each edge is a directed edge (i.e., an arc).
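To make the relationship between the two layouts concrete, here is a minimal sketch that groups SNAP edges into adj lines. The function name snap_to_adj is hypothetical and this is not the repository's snap-convert.cpp; it assumes that all edges of a given source vertex appear on consecutive lines.

```shell
# Hypothetical helper (not the repository's snap-convert.cpp).
# Reads SNAP edges ("src dst" per line) on stdin and prints adj lines.
# Assumes all edges of a given source vertex appear on consecutive lines.
snap_to_adj() {
  awk '
    NF < 2 { next }                 # skip blank or malformed lines
    !seen || $1 != prev {           # a new source vertex begins
      if (seen) print line          # flush the previous vertex
      line = $1; prev = $1; seen = 1
    }
    { line = line " " $2 }          # append this destination
    END { if (seen) print line }    # flush the last vertex
  '
}
```

For example, `printf '1 2\n1 3\n2 1\n' | snap_to_adj` prints the two lines `1 2 3` and `2 1`. Vertices without outgoing edges never appear as a source in SNAP, so this sketch emits no line for them.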
If the graph is weighted, the SNAP format becomes
src1 dst1 weight1
src1 dst2 weight2
...
while the adj format becomes
src1 dst1 weight1 dst2 weight2 ...
...
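The weighted adj layout interleaves destinations and weights, which is easiest to see by expanding it back into SNAP edges. The sketch below does that for both the unweighted and weighted variants. The name adj_to_snap and its single flag argument are hypothetical; whether the repository's snap-revert.cpp handles weights this way is an assumption.

```shell
# Hypothetical helper (not the repository's snap-revert.cpp).
# Reads adj lines on stdin and prints one SNAP edge per line.
# $1 = 0 (default) for "src dst ..." input, 1 for "src dst weight ..." input.
adj_to_snap() {
  awk -v weighted="${1:-0}" '
    BEGIN { w = weighted + 0; step = (w ? 2 : 1) }
    {
      # Walk the fields after the source vertex, one edge per step.
      for (i = 2; i + step - 1 <= NF; i += step) {
        if (w) print $1, $i, $(i + 1)   # src dst weight
        else   print $1, $i             # src dst
      }
    }
  '
}
```

For example, `printf '1 2 10 3 20\n' | adj_to_snap 1` prints `1 2 10` and `1 3 20`.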
Mizan uses SNAP, while Giraph, GPS, and GraphLab all use adj.
The datasets in ~/datasets/ on the master machine are named graph.txt for the SNAP version and graph-adj.txt for the adj version. Furthermore, directed graphs are named graph*, while undirected weighted graphs (used in DMST) are named graph-mst*. For example:
Filename | Format |
---|---|
orkut.txt | directed graph in SNAP format |
orkut-adj.txt | directed graph in adj format |
orkut-mst.txt | undirected, weighted graph in SNAP format |
orkut-mst-adj.txt | undirected, weighted graph in adj format |
To use the format converters...
- Compile them:
cd */benchmark/datasets/
make
- Use the *.sh scripts while in the */datasets/ folder. For example, to get orkut-adj.txt:
cd */datasets/
../benchmark/datasets/convert-adj.sh orkut.txt 0
In more detail, mst-convert.cpp converts a sorted SNAP graph to an unsorted, undirected SNAP graph with unique random edge weights. snap-convert.cpp converts from SNAP to adj, and snap-revert.cpp converts from adj to SNAP.
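As a rough illustration of what "undirected with unique random edge weights" can mean, the sketch below collapses each pair of arcs into one undirected edge, shuffles the edges, and uses the shuffled position as the weight, emitting both directions with the same weight. This is not the algorithm in mst-convert.cpp. The function name, the choice of weights 1..m, the dropping of self-loops, and the reliance on numeric vertex IDs and GNU shuf are all assumptions.

```shell
# Hypothetical sketch (not the repository's mst-convert.cpp).
# Reads directed SNAP edges on stdin; prints an undirected, weighted SNAP graph.
# Assumes numeric vertex IDs and GNU shuf; self-loops are dropped.
snap_to_undirected_weighted() {
  # 1. Canonicalize each arc as "min max" so (u,v) and (v,u) coincide.
  awk 'NF >= 2 && $1 != $2 {
         if ($1 + 0 < $2 + 0) print $1, $2; else print $2, $1
       }' |
    sort -u |   # 2. Keep each undirected edge once.
    shuf |      # 3. Randomize the edge order.
    # 4. The position in the shuffled order is a unique weight;
    #    emit both directions so the result is undirected.
    awk '{ print $1, $2, NR; print $2, $1, NR }'
}
```

The output is unsorted, so a wrapper would still need to sort it before converting to adj.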
The bash script convert-mst.sh is a wrapper for mst-convert.cpp that sorts the input and output. It uses Unix sort, which spills to temporary files on disk when the data does not fit in memory, so large graphs should be converted on machines with SSDs (or with enough RAM to keep the sort in memory). The script outputs *-mst.txt.
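For large graphs, the sort step can be tuned through sort's buffer size and temporary directory. The wrapper below is a hypothetical sketch, not the command used inside convert-mst.sh; the sort keys (numeric by source, then destination) are an assumption, and -S and -T are GNU sort options.

```shell
# Hypothetical helper; the options used by convert-mst.sh may differ.
# Sorts SNAP edges numerically by source vertex, then by destination.
# Extra arguments are passed through to sort, e.g. -S (buffer size),
# -T (temporary directory), or an input file name.
sort_snap() {
  LC_ALL=C sort -k1,1n -k2,2n "$@"
}
```

A call such as `sort_snap -S 50% -T /mnt/ssd/tmp graph.txt > graph-sorted.txt` would let sort use half of the RAM and place its temporary files on an SSD; the paths here are placeholders.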
Similarly, convert-adj.sh is a wrapper for snap-convert.cpp and outputs *-adj.txt.
GraphLab requires input files to be split into parts to utilize parallel input loading (otherwise loading becomes serial). split-inputs.sh splits a given file (at newlines) into a number of contiguous parts. The number of parts should be at least the number of worker machines.
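Splitting at newlines into contiguous parts can be sketched with GNU split, whose -n l/N mode divides a file into N chunks without breaking lines. The function below is a hypothetical stand-in for split-inputs.sh, not its actual contents; the part- prefix and the argument order are assumptions.

```shell
# Hypothetical stand-in for split-inputs.sh (requires GNU split).
# $1 = input file, $2 = number of parts, $3 = output directory.
# Parts are contiguous, never break a line, and are named part-00, part-01, ...
split_at_newlines() {
  mkdir -p "$3" &&
    split -n "l/$2" -d "$1" "$3/part-"
}
```

For instance, `split_at_newlines orkut-adj.txt 16 orkut-adj-parts` would produce 16 files in orkut-adj-parts/ (16 is an arbitrary example). Concatenating the parts in name order reproduces the original file.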
load-splits.sh uses split-inputs.sh to split graphs and then loads them to HDFS.
We obtained our datasets from SNAP and LAW. For EC2, place datasets into ~/datasets/ on the master. For local testing, place them in */datasets/.
The exact datasets we used can be obtained here. They come as compressed .xz files, which can be extracted with unxz graph.txt.xz.
All graphs are in the adjacency format. The SNAP format (used by Mizan) can be obtained using */benchmark/datasets/snap-revert (see Format Converters).
If you obtain your own datasets, you'll want to do the following:
- For SNAP datasets, prune the first few header lines and (optionally) strip \r characters:
cat snap-graph.txt | tail -n +5 | tr -d '\r' > snap-cleaned.txt
This will give a graph in the SNAP format. Convert it to adj, mst, or mst-adj as needed with the format converters.
- For LAW datasets, you'll need to download WebGraph (or here and here) and decompress law-graph.graph using:
cd <webgraph-dir>
java -cp *:./ it.unimi.dsi.webgraph.ArcListASCIIGraph law-graph output-graph.txt
The output will be output-graph.txt in the SNAP format. Again, convert as needed.
- On HDFS, all datasets should go into ./input, i.e.:
hadoop dfs -put <graph-file> input/
For GraphLab, make sure to split the graph file into multiple parts. A directory containing the parts should then go into ./input as well:
hadoop dfs -put <parts-directory>/ input/
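For the SNAP cleaning step above, tail -n +5 assumes exactly four header lines. A variant that does not depend on the header length is sketched below. It assumes that header lines begin with #, which holds for the SNAP edge lists we are aware of but should be checked for each download; the function name clean_snap is hypothetical.

```shell
# Hypothetical helper: removes header/comment lines that start with "#"
# and strips carriage returns, leaving plain "src dst" lines.
clean_snap() {
  grep -v '^#' | tr -d '\r'
}
```

Usage would be `clean_snap < snap-graph.txt > snap-cleaned.txt`, after which the format converters apply as before.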