galaxy-zoo

This project is derived from an assignment I did during my bootcamp at Yotta Academy. It aims to classify the morphologies of distant galaxies using deep neural networks.

It is based on the Kaggle Galaxy Zoo Challenge.

The task was originally posed as a regression problem in the Kaggle challenge; I formulate it here as a multiclass classification problem, since classification is ultimately the goal behind the project. This also has the added benefit of simplifying things a bit.

To better understand the task to be learned by the model, give it a go yourself: try it here.

Project & Results

Check out my experiments and the project's report on Weights & Biases.

Documentation

A few related papers on the topic are available here:

Installation

Step 1

Ensure your GPU driver and CUDA are properly set up for PyTorch to use them (the name of your device should appear):

nvidia-smi
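
Once the environment from Step 3 is installed, the same check can be repeated from Python. The snippet below is a minimal sketch; `describe_cuda` is a hypothetical helper and not part of this repository.

```python
def describe_cuda() -> str:
    """Report whether PyTorch can see a CUDA device."""
    try:
        import torch  # only importable once the environment is installed
    except ImportError:
        return "PyTorch is not installed in this environment"
    if torch.cuda.is_available():
        # Index 0 is the first GPU visible to PyTorch.
        return f"CUDA device found: {torch.cuda.get_device_name(0)}"
    return "PyTorch is installed but no CUDA device is visible"


if __name__ == "__main__":
    print(describe_cuda())
```

If the last message is printed while nvidia-smi does list your GPU, the installed PyTorch build is probably a CPU-only one.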

Step 2

If you don't have poetry already, install it (I highly recommend it!):

make setup-poetry

Step 3

Set up the environment with Python 3.10, e.g. using miniconda (easier IMO):

git clone git@github.com:aliberts/galaxy-zoo.git
cd galaxy-zoo
conda create --yes --name gzoo python=3.10
conda activate gzoo
poetry install

or pyenv:

git clone git@github.com:aliberts/galaxy-zoo.git
cd galaxy-zoo
pyenv install 3.10:latest
pyenv local 3.10:latest
poetry install

Step 4

Download the dataset:

make dataset

This will download and extract the archives into dataset/. You'll need to log in with Kaggle's API first and place your kaggle.json API key inside ~/.kaggle (the default location).
You can also do it manually by downloading it here. In that case, don't forget to set the dataset.dir config option to the directory you put it in.
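
Before running make dataset, it can help to confirm that the key file is where it is expected. This is a minimal sketch under the assumption that the default ~/.kaggle location is used; both function names are hypothetical and not part of this repository or of the Kaggle package.

```python
from pathlib import Path


def kaggle_credentials_path(config_dir=None) -> Path:
    """Return where kaggle.json is expected (default: ~/.kaggle)."""
    base = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    return base / "kaggle.json"


def credentials_ready(config_dir=None) -> bool:
    """True if the key file exists and is not empty."""
    path = kaggle_credentials_path(config_dir)
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    status = "found" if credentials_ready() else "missing"
    print(f"{kaggle_credentials_path()}: {status}")
```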

Optional

Make your commands shorter with this alias:

alias py='poetry run python'

If you intend to contribute to this repo, install the pre-commit hooks with:

pre-commit install

You're good to go!

Training

Create the training labels for classification

poetry run python -m gzoo.app.make_labels

This will produce the classification_labels.csv file inside dataset/, which is needed for training. These class labels are produced from the original regression labels in training_solutions_rev1.csv.
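
To give an idea of what turning regression labels into class labels can look like, here is a minimal sketch that picks the answer with the highest vote fraction. It is not the logic of gzoo.app.make_labels: the choice of columns (the three answers to the first Galaxy Zoo question) is an assumption, the function names are hypothetical, and the rows are toy values rather than real data.

```python
import csv
import io

# Assumption: only the three answers to the first question are considered.
CLASS_COLUMNS = ["Class1.1", "Class1.2", "Class1.3"]


def regression_to_class(row, columns=CLASS_COLUMNS):
    """Return the name of the column with the highest vote fraction."""
    return max(columns, key=lambda name: float(row[name]))


def make_labels(csv_text):
    """Map each GalaxyID to a single class label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["GalaxyID"]: regression_to_class(row) for row in reader}


# Toy rows for illustration only (not taken from training_solutions_rev1.csv).
sample = """GalaxyID,Class1.1,Class1.2,Class1.3
1,0.2,0.7,0.1
2,0.8,0.1,0.1
3,0.1,0.2,0.7
"""
labels = make_labels(sample)
print(labels)
```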

Partition data for training

poetry run python -m gzoo.app.split_data

This will split the dataset into the training / validation / testing partitions and write those partitions in a clf_labels_split.csv file. The ratios used for the partitioning are set in the dataset.test_split_ratio and dataset.val_split_ratio config options.
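
The effect of the two ratios can be sketched as follows. This is an illustration rather than the code of gzoo.app.split_data: it assumes both ratios are fractions of the whole dataset, and `split_ids` is a hypothetical name.

```python
import random


def split_ids(ids, test_split_ratio, val_split_ratio, seed=0):
    """Shuffle ids and cut them into test / val / train partitions."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_test = int(len(ids) * test_split_ratio)
    n_val = int(len(ids) * val_split_ratio)
    return {
        "test": ids[:n_test],
        "val": ids[n_test:n_test + n_val],
        "train": ids[n_test + n_val:],  # everything left over
    }


splits = split_ids(range(100), test_split_ratio=0.1, val_split_ratio=0.1)
print({name: len(part) for name, part in splits.items()})
```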

Run the classification pipeline

poetry run python -m gzoo.app.train

script option:

  • --config_path: specify the .yaml config file to read options from. Every run config option should be listed in this file (the default is config/train.yaml), and every option in that file can be overridden on the fly at the command line.

For instance, if you are fine with the values in the yaml config file and only want to change the number of epochs, you can either change it in the config file or directly run:

poetry run python -m gzoo.app.train --compute.epochs=50

This will use all config values from config/train.yaml except the number of epochs which will be set to 50.
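
The override mechanism can be pictured as dotted command-line keys being merged into the nested values read from the yaml file. The sketch below is only an illustration of that idea, not the project's actual config code; `parse_value` and `apply_overrides` are hypothetical names.

```python
import copy


def parse_value(text):
    """Best-effort conversion of a command-line string to bool, int or float."""
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text  # fall back to the raw string


def apply_overrides(config, argv):
    """Return a copy of config with --section.option=value arguments applied."""
    result = copy.deepcopy(config)
    for arg in argv:
        key, _, raw = arg.lstrip("-").partition("=")
        *parents, leaf = key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return result


# Stand-in for values loaded from config/train.yaml.
defaults = {"compute": {"epochs": 90, "workers": 8}}
config = apply_overrides(defaults, ["--compute.epochs=50"])
print(config)
```

Only the overridden key changes; every other value keeps what the file provided.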

main run options:

  • --compute.seed: seed for deterministic training (default: None)
  • --compute.epochs: total number of epochs (default: 90)
  • --compute.batch-size: batch size (default: 128)
  • --compute.workers: number of data-loading threads (default: 8)
  • --model.arch: model architecture to be used (default: resnet18)
  • --model.pretrained: use pre-trained model (default: False)
  • --optimizer.lr: optimizer learning rate (default: 3.e-4 with Adam)
  • --optimizer.momentum: optimizer momentum (for SGD only, default: 0.9)
  • --optimizer.weight-decay: optimizer weights regularization (L2, default: 1.e-4)
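
The options above fall into three groups (compute, model, optimizer), which can be pictured as nested dataclasses carrying the listed defaults. This is a sketch only: the class names are hypothetical, the real definitions live in gzoo.infra.config and may differ, and field names use underscores where the command-line flags use hyphens.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ComputeConfig:
    seed: Optional[int] = None  # None means no fixed seed
    epochs: int = 90
    batch_size: int = 128
    workers: int = 8


@dataclass
class ModelConfig:
    arch: str = "resnet18"
    pretrained: bool = False


@dataclass
class OptimizerConfig:
    lr: float = 3e-4
    momentum: float = 0.9  # only meaningful for SGD
    weight_decay: float = 1e-4


@dataclass
class TrainConfig:
    compute: ComputeConfig = field(default_factory=ComputeConfig)
    model: ModelConfig = field(default_factory=ModelConfig)
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)


config = TrainConfig()
print(asdict(config))
```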

Prediction

poetry run python -m gzoo.app.predict

Config works the same way as for training; the default config is at config/predict.yaml.

A 1-image example is provided which you can run with:

poetry run python -m gzoo.app.predict --dataset.dir=example/

Config

If you make changes in gzoo.infra.config, you should also update the related .yaml config files in config/ with:

make config