This project is derived from an assignment I did during my bootcamp at Yotta Academy. It aims to classify the morphologies of distant galaxies using deep neural networks.

It is based on the Kaggle Galaxy Zoo Challenge.

Originally posed as a regression problem in the Kaggle challenge, it is formulated here as a multiclass classification problem, since classification is the eventual goal of the project. This also has the benefit of simplifying things a bit.

To better understand the task to be learned by the model, give it a go yourself: try it here.

Project & Results

Check out my experiments and the project's report on Weights & Biases.

Documentation

A few related papers on the topic are available here:

Installation

Step 1

Ensure your GPU driver and CUDA are properly set up for PyTorch to use them (the name of your device should appear):

nvidia-smi

Step 2

If you don't have it already (I highly recommend it!), install poetry:

make setup-poetry

Step 3

Set up the environment with Python 3.10, e.g. using miniconda (easier IMO):

git clone git@github.com:aliberts/galaxy-zoo.git
cd galaxy-zoo
conda create --yes --name gzoo python=3.10
conda activate gzoo
poetry install

or pyenv:

git clone git@github.com:aliberts/galaxy-zoo.git
cd galaxy-zoo
pyenv install 3.10:latest
pyenv local 3.10:latest
poetry install

Step 4

Download the dataset:

make dataset

This will download and extract the archives into dataset/. You'll need to log in with Kaggle's API first, which by default expects your kaggle.json API key inside ~/.kaggle.
You can also download the dataset manually here. In that case, don't forget to set the dataset.dir config option to the directory you put it in.
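Before running make dataset, it can help to confirm that the API key is where it is expected. The snippet below is a minimal sketch, not part of this repo: the helper name kaggle_credentials_ok is hypothetical, and it only checks that the file parses as JSON and carries the username and key fields that a kaggle.json file normally contains.

```python
import json
from pathlib import Path
from typing import Optional


def kaggle_credentials_ok(path: Optional[Path] = None) -> bool:
    """Return True if kaggle.json exists and has non-empty 'username' and 'key'.

    Hypothetical helper for illustration; it does not contact Kaggle.
    """
    if path is None:
        # Default location mentioned above.
        path = Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return False
    try:
        creds = json.loads(path.read_text())
    except (ValueError, OSError):
        # Unreadable file or invalid JSON.
        return False
    return isinstance(creds, dict) and bool(creds.get("username")) and bool(creds.get("key"))


print("Kaggle credentials found:", kaggle_credentials_ok())
```

This only validates the file locally; an expired or revoked key would still pass the check and fail at download time.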

Optional

Make your commands shorter with this alias:

alias py='poetry run python'

If you intend to contribute to this repo, install the pre-commit hooks with:

pre-commit install

You're good to go!
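Once the environment is installed, a quick way to confirm that PyTorch actually sees the GPU checked in Step 1 is sketched below. The function name describe_torch_device is hypothetical; it only relies on torch.cuda.is_available and torch.cuda.get_device_name, and falls back to a message if PyTorch cannot be imported.

```python
def describe_torch_device() -> str:
    """Report which device PyTorch would be able to use (illustrative helper)."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if torch.cuda.is_available():
        # Index 0 is the first CUDA device visible to this process.
        return f"CUDA device: {torch.cuda.get_device_name(0)}"
    return "PyTorch is installed but no CUDA device is visible"


print(describe_torch_device())
```

Run it inside the project environment, e.g. with poetry run python, so that the check uses the same PyTorch install as training.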

Training

Create the training labels for classification

poetry run python -m gzoo.app.make_labels

This will produce the classification_labels.csv file inside dataset/, which is needed for training. These class labels are produced from the original regression labels in training_solutions_rev1.csv.
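To give an idea of what turning regression labels into class labels can look like, here is a minimal sketch. It is not the rule implemented in gzoo.app.make_labels, which is not described here: it assumes the solutions file has a GalaxyID column plus vote-fraction columns named Class1.1, Class1.2 and Class1.3, and it simply picks the answer with the highest fraction. The class names and the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical mapping from vote-fraction columns to class names.
QUESTION_1 = {
    "Class1.1": "smooth",
    "Class1.2": "features_or_disk",
    "Class1.3": "star_or_artifact",
}


def regression_to_class(rows):
    """Map each row of vote fractions to the answer with the highest fraction."""
    labels = {}
    for row in rows:
        best_column = max(QUESTION_1, key=lambda col: float(row[col]))
        labels[row["GalaxyID"]] = QUESTION_1[best_column]
    return labels


# Made-up rows standing in for training_solutions_rev1.csv.
sample = io.StringIO(
    "GalaxyID,Class1.1,Class1.2,Class1.3\n"
    "1,0.38,0.62,0.00\n"
    "2,0.77,0.23,0.00\n"
    "3,0.10,0.10,0.80\n"
)
labels = regression_to_class(csv.DictReader(sample))
print(labels)
```

A real labelling rule would likely combine several questions and apply confidence thresholds, so treat this as a picture of the idea rather than of the actual output file.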

Partition data for training

poetry run python -m gzoo.app.split_data

This will split the dataset into training / validation / testing partitions and write those partitions to a clf_labels_split.csv file. The ratios used for the partitioning are set in the dataset.test_split_ratio and dataset.val_split_ratio config options.
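The effect of the two ratios can be pictured with the sketch below. It is an illustration, not the code of gzoo.app.split_data: the function split_ids is hypothetical, it assumes both ratios are fractions of the full dataset, and it uses a plain random shuffle rather than any stratification the project may apply.

```python
import random


def split_ids(ids, test_split_ratio, val_split_ratio, seed=0):
    """Shuffle ids and cut them into train / val / test partitions.

    Assumption: both ratios are expressed as fractions of the whole dataset.
    """
    ids = list(ids)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_split_ratio)
    n_val = round(len(ids) * val_split_ratio)
    return {
        "test": ids[:n_test],
        "val": ids[n_test:n_test + n_val],
        "train": ids[n_test + n_val:],
    }


parts = split_ids(range(1000), test_split_ratio=0.1, val_split_ratio=0.1)
print({name: len(members) for name, members in parts.items()})
```

Fixing the seed matters here: writing the partitions to a file, as split_data does, serves the same purpose of keeping the test set identical across runs.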

Run the classification pipeline

poetry run python -m gzoo.app.train

script option:

--config_path: specify the .yaml config file to read options from. Every run config option should be listed in this file (the default file is config/train.yaml), and every option in that file can be overridden on the fly at the command line.

For instance, if you are fine with the values in the yaml config file and only want to change the number of epochs, you can either change it in the config file or directly run:

poetry run python -m gzoo.app.train --compute.epochs=50

This will use all config values from config/train.yaml except the number of epochs, which will be set to 50.
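The override behaviour described above can be sketched in a few lines. This is not the project's config parser, whose implementation is not shown here: apply_overrides is a hypothetical helper that takes a nested dict (as one would get from loading a yaml file) and a list of --section.option=value arguments, and the defaults dict only repeats a few of the documented default values.

```python
import copy


def apply_overrides(config, args):
    """Return a copy of `config` with `--section.option=value` arguments applied."""
    merged = copy.deepcopy(config)
    for arg in args:
        dotted_key, _, raw_value = arg.lstrip("-").partition("=")
        *sections, option = dotted_key.split(".")
        node = merged
        for section in sections:
            node = node.setdefault(section, {})
        current = node.get(option)
        if isinstance(current, bool):
            # bool("False") would be True, so booleans are parsed explicitly.
            value = raw_value.lower() in ("1", "true", "yes")
        elif current is not None:
            # Reuse the type of the file value, so "50" becomes the integer 50.
            value = type(current)(raw_value)
        else:
            value = raw_value
        node[option] = value
    return merged


defaults = {
    "compute": {"epochs": 90, "batch-size": 128},
    "model": {"arch": "resnet18", "pretrained": False},
}
config = apply_overrides(defaults, ["--compute.epochs=50"])
print(config["compute"])
```

Only the option named on the command line changes; every other value keeps what the file provided, and the original dict is left untouched.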

main run options:

--compute.seed: seed for deterministic training (default: None)
--compute.epochs: total number of epochs (default: 90)
--compute.batch-size: batch size (default: 128)
--compute.workers: number of data-loading threads (default: 8)
--model.arch: model architecture to be used (default: resnet18)
--model.pretrained: use pre-trained model (default: False)
--optimizer.lr: optimizer learning rate (default: 3.e-4 with Adam)
--optimizer.momentum: optimizer momentum (for SGD only, default: 0.9)
--optimizer.weight-decay: optimizer weights regularization (L2, default: 1.e-4)
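The options above form three groups (compute, model, optimizer). One way to picture that structure is with nested dataclasses, as in the sketch below. The class names are hypothetical and this is not a copy of gzoo.infra.config; only the default values are taken from the list above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ComputeConfig:
    seed: Optional[int] = None  # None means training is not seeded
    epochs: int = 90
    batch_size: int = 128
    workers: int = 8


@dataclass
class ModelConfig:
    arch: str = "resnet18"
    pretrained: bool = False


@dataclass
class OptimizerConfig:
    lr: float = 3e-4
    momentum: float = 0.9  # documented as used by SGD only
    weight_decay: float = 1e-4


@dataclass
class TrainConfig:
    # default_factory gives each TrainConfig its own sub-config instances.
    compute: ComputeConfig = field(default_factory=ComputeConfig)
    model: ModelConfig = field(default_factory=ModelConfig)
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)


cfg = TrainConfig()
print(cfg.compute.epochs, cfg.model.arch, cfg.optimizer.lr)
```

With this layout, a dotted option such as compute.epochs maps directly onto an attribute path of the config object.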

Prediction

poetry run python -m gzoo.app.predict

Config works the same way as for training. The default config is at config/predict.yaml.

A 1-image example is provided which you can run with:

poetry run python -m gzoo.app.predict --dataset.dir=example/

Config

If you make changes in gzoo.infra.config, you should also update the related .yaml config files in config/ with:

make config
