This README contains the information and specific commands you need to set up a Python virtual environment and reproduce our results for CHAODA and the comparisons against other algorithms.
We developed and tested our implementation on a machine with:
- OS: Manjaro Linux x86_64
- Kernel: 5.13.13-1-MANJARO
- Python: 3.9.6
If you have a Mac, you made a terrible life decision some time ago and will have to adapt some commands for yourself.
For the paper, we benchmarked using a machine with:
- CPU: Intel Xeon E5-2690 v4 (28 cores) @ 3.50GHz
- Memory: 512GiB
- OS: CentOS Linux 7 (Core) x86_64
- Kernel: 3.10.0-1127.13.1.el7.x86_64
- Python: 3.6.8
First, make sure to install the packages required to create and manage Python virtual environments.
Arch:
$ sudo pacman -S python-pip
Ubuntu:
$ sudo apt install python3-dev python3-pip python3-venv
Next, create a new virtual environment.
Arch:
$ python -m venv venv
Ubuntu:
$ python3 -m venv venv
Update pip and install the required Python packages.
$ source venv/bin/activate
$ pip install --upgrade pip setuptools wheel
Now, pip install the following packages, grouped by purpose (a single combined command is sketched after the list):
- CHAODA benchmarks:
  - pyclam
  - numpy
  - scipy
  - scikit-learn
  - h5py
  - pandas
- competing algorithms:
  - pyod
  - tensorflow
  - keras
- (my) sanity:
  - tqdm
- SDSS-APOGEE2 benchmarks:
  - astropy
- generating some neat plots:
  - matplotlib
  - umap-learn
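If you would rather grab everything in one shot, the single command below covers every package listed above (note that this pulls the latest versions from PyPI, not the pinned versions from the requirements.txt mentioned next):
$ pip install pyclam numpy scipy scikit-learn h5py pandas pyod tensorflow keras tqdm astropy matplotlib umap-learn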
We include a requirements.txt file with the exact version numbers we used for each package.
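To install those pinned versions instead:
$ pip install -r requirements.txt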
If you're using Linux and have an old enough kernel, these versions will not be available.
In this case, you're on your own.
If you have a Mac or a Windows computer, re-read the previous sentence.
We provide an easy and helpful CLI through main.py.
For help, run:
$ python main.py --help
We assume write access to the local folder of this repository.
You will need to download the benchmark datasets to reproduce the results for CHAODA and its competitors:
$ python main.py --mode download-datasets
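To sanity-check a download, you can peek inside one of the files. A minimal sketch, assuming the datasets land under a data directory as HDF5-compatible files (the path and file name below are hypothetical):
import h5py

# Hypothetical path; adjust to wherever download-datasets puts its files.
with h5py.File('data/cardio.mat', 'r') as fp:
    print(list(fp.keys()))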
We provide the meta-ml models we trained and used for the reported benchmarks.
These are in chaoda/meta_models.py.
If you're curious as to how these are generated (and how they strictly adhere to PEP8), feel free to dig around in chaoda/train_meta_ml.py.
If you want to retrain the meta-ml models for yourself, run:
$ python main.py --mode train-meta-ml --meta-ml-epochs 10
Five epochs are likely enough. We went with 10 by default.
Running this command will produce a new file, chaoda/custom_meta_models.py, which will override the meta-ml models we provide.
If you wish to revert to the models we provide, just delete this file.
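That is:
$ rm chaoda/custom_meta_models.py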
You can reproduce CHAODA benchmarks with:
$ python main.py --mode bench-chaoda
This will create two files: results/scores.csv and results/times.csv.
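To take a quick look at the results, something like this works (we print the head because the exact column layout isn't described here):
import pandas as pd

# Both files are produced by the bench-chaoda run above.
scores = pd.read_csv('results/scores.csv')
times = pd.read_csv('results/times.csv')
print(scores.head())
print(times.head())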
If you want to use our heuristic for what we call CHAODA_FAST, simply add the --fast flag:
$ python main.py --mode bench-chaoda --fast
If you want to see detailed statistics on each individual algorithm in the ensemble, add the --report-individual-methods flag:
$ python main.py --mode bench-chaoda --report-individual-methods
This will create a giant file called results/individual_scores.csv.
We benchmark many competing algorithms from the PyOD suite. You can reproduce these with:
$ python main.py --mode bench-pyod
We default to allowing each of these algorithms up to 10 hours to finish running.
You can override the default by passing in the number of seconds with the --pyod-time-limit argument; the example below caps each run at 600 seconds, i.e. 10 minutes:
$ python main.py --mode bench-pyod --pyod-time-limit 600
This will add benchmarks to results/scores.csv and results/times.csv.
This part is a bit more involved than the rest.
First, you need to download the APOGEE2 data from the SDSS archives.
Go to this link and follow their instructions.
You will need approximately 1.5TB of space at the time of this writing (likely more in the future).
You may also need to read up on their data model if you run into problems here.
Edit APO_TELESCOPE and APOGEE_PATH in sdss/preparse.py as needed to point the script at the APOGEE2 data.
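For illustration, the edited constants might look something like this; both values are placeholders, not the real identifiers or paths:
# In sdss/preparse.py. Both values below are placeholders.
APO_TELESCOPE = 'apo25m'  # the telescope identifier your download uses
APOGEE_PATH = '/path/to/sdss/apogee2'  # wherever you put the APOGEE2 data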
Run the preparse script to extract the APOGEE spectra that we used:
$ python main.py --mode preparse-apogee
For us, this extracted 528,319 spectra.
There are probably more in this dataset, but you will need to come up with some regex magic to extract them.
The preparse script will need approximately another 20GB of space to store a numpy array.
CHAODA does not load the whole thing into RAM.
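As a sketch of the idea (assuming the array was saved with numpy.save; the path below is hypothetical), a memory-mapped load reads pages from disk on demand instead of pulling the whole ~20GB into memory:
import numpy as np

# mmap_mode='r' maps the file read-only; data is paged in from disk lazily.
spectra = np.load('data/apogee_spectra.npy', mmap_mode='r')  # hypothetical path
print(spectra.shape)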
From here on, you should be able to reproduce the APOGEE scores on a machine with 12GB of RAM.
To score the APOGEE dataset, run:
$ python main.py --mode score-apogee
This will produce a json file in the results directory with the amount of time taken and the anomaly score for each spectrum we extracted. These scores correspond to the fits files named in filenames.csv in the data directory.
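If you want to poke at the output programmatically, here is a minimal sketch; the json file name below is hypothetical, so check the results directory for the real one:
import json
import pandas as pd

# Hypothetical file name; look in results/ for the actual output.
with open('results/apogee_scores.json') as fp:
    results = json.load(fp)

filenames = pd.read_csv('data/filenames.csv')
print(str(results)[:200])  # peek at the structure of the scores
print(filenames.head())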
If you're an expert on the APOGEE data, have fun!