galaxy-zoo

This project is derived from an assignment I did during my bootcamp at Yotta Academy. It aims to classify the morphologies of distant galaxies using deep neural networks.

It is based on the Kaggle Galaxy Zoo Challenge.

Originally posed as a regression problem in the Kaggle challenge, the task is formulated here as a multiclass classification problem, since classification is ultimately the goal behind the project. Additionally, this has the benefit of simplifying things a bit.

To better understand the task to be learned by the model, give it a go yourself: try it here.

Project & Results

Check out my experiments and the project's report on Weights & Biases.

Documentation

A few related papers on the topic are available here:

Installation

Step 1

Ensure your GPU driver and CUDA are properly set up for PyTorch to use them (the name of your device should appear):

nvidia-smi
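
Once the environment from the following steps is installed, a similar check can be done from Python to confirm that PyTorch itself sees the GPU. This is a minimal sketch: it relies only on `torch.cuda.is_available()` and `torch.cuda.get_device_name()`, and the helper name `describe_device` is made up for illustration.

```python
def describe_device() -> str:
    """Return a short description of the device PyTorch would use."""
    try:
        import torch  # imported lazily so the check also runs before installation
    except ImportError:
        return "PyTorch is not installed in this environment"
    if torch.cuda.is_available():
        # Index 0 is the first CUDA device visible to this process.
        return f"CUDA device: {torch.cuda.get_device_name(0)}"
    return "No CUDA device visible to PyTorch (CPU only)"


if __name__ == "__main__":
    print(describe_device())
```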

Step 2

If you don't have it already (I highly recommend it!), install poetry:

make setup-poetry

Step 3

Set up the environment with Python 3.10, e.g. using miniconda (easier IMO):

git clone git@github.com:aliberts/galaxy-zoo.git
cd galaxy-zoo
conda create --yes --name gzoo python=3.10
conda activate gzoo
poetry install

or pyenv:

git clone git@github.com:aliberts/galaxy-zoo.git
cd galaxy-zoo
pyenv install 3.10:latest
pyenv local 3.10:latest
poetry install

Step 4

Download the dataset:

make dataset

This will download and extract the archives into dataset/. You'll need to log in with Kaggle's API first and place your kaggle.json API key inside ~/.kaggle (the default location).
You can also do it manually by downloading it here. In that case, don't forget to update the location of the directory you put it in with the dataset.dir config option.
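
If the download fails, it may help to check the credentials file first. The sketch below assumes the usual `kaggle.json` layout (a JSON object with `username` and `key` fields); the function name `has_kaggle_credentials` is hypothetical and not part of this project.

```python
import json
from pathlib import Path


def has_kaggle_credentials(config_dir: Path = Path.home() / ".kaggle") -> bool:
    """Return True if kaggle.json exists and holds a username and a key."""
    key_file = config_dir / "kaggle.json"
    if not key_file.is_file():
        return False
    try:
        creds = json.loads(key_file.read_text())
    except json.JSONDecodeError:
        return False  # the file exists but is not valid JSON
    if not isinstance(creds, dict):
        return False
    return bool(creds.get("username")) and bool(creds.get("key"))


if __name__ == "__main__":
    print("Kaggle credentials found:", has_kaggle_credentials())
```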

Optional

Make your commands shorter with this alias:

alias py='poetry run python'

If you intend to contribute to this repo, install the pre-commit hooks with:

pre-commit install

You're good to go!

Training

Create the training labels for classification

poetry run python -m gzoo.app.make_labels

This will produce the classification_labels.csv file inside dataset/, which is needed for training. These class labels are produced from the original regression labels in training_solutions_rev1.csv.
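
To give an idea of what turning regression labels into class labels can look like, here is a deliberately simplified sketch. It is not the rule used by `gzoo.app.make_labels`: it only assumes that each row holds vote fractions for the first Galaxy Zoo question (columns `Class1.1` to `Class1.3`) and takes the answer with the highest fraction. The class names and the sample rows are made up for illustration.

```python
import csv
import io

# Columns for the first Galaxy Zoo question in training_solutions_rev1.csv.
# The class names on the right are illustrative, not the project's own labels.
QUESTION_1 = {
    "Class1.1": "smooth",
    "Class1.2": "features_or_disk",
    "Class1.3": "star_or_artifact",
}


def to_class_label(row: dict) -> str:
    """Pick the answer with the highest vote fraction as the class label."""
    best_column = max(QUESTION_1, key=lambda col: float(row[col]))
    return QUESTION_1[best_column]


# Toy rows with invented values, standing in for the real CSV file.
sample = io.StringIO(
    "GalaxyID,Class1.1,Class1.2,Class1.3\n"
    "1,0.38,0.61,0.01\n"
    "2,0.90,0.07,0.03\n"
)
labels = {row["GalaxyID"]: to_class_label(row) for row in csv.DictReader(sample)}
print(labels)  # {'1': 'features_or_disk', '2': 'smooth'}
```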

Partition data for training

poetry run python -m gzoo.app.split_data

This will split the dataset into the training / validation / testing partitions and write those partitions to a clf_labels_split.csv file. The ratios used for the partitioning are set in the dataset.test_split_ratio and dataset.val_split_ratio config options.
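
The partitioning idea can be sketched as follows. This is an illustration rather than the code in `gzoo.app.split_data`: it assumes both ratios are fractions of the full dataset and uses a plain seeded shuffle, whereas the project may well stratify by class or interpret the ratios differently. The function name `assign_partitions` is hypothetical.

```python
import random


def assign_partitions(ids, test_split_ratio, val_split_ratio, seed=0):
    """Shuffle ids and assign each one to 'train', 'val' or 'test'."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_test = int(len(ids) * test_split_ratio)
    n_val = int(len(ids) * val_split_ratio)
    partitions = {}
    for position, galaxy_id in enumerate(ids):
        if position < n_test:
            partitions[galaxy_id] = "test"
        elif position < n_test + n_val:
            partitions[galaxy_id] = "val"
        else:
            partitions[galaxy_id] = "train"
    return partitions


split = assign_partitions(range(100), test_split_ratio=0.1, val_split_ratio=0.1)
counts = {name: list(split.values()).count(name) for name in ("train", "val", "test")}
print(counts)  # {'train': 80, 'val': 10, 'test': 10}
```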

Run the classification pipeline

poetry run python -m gzoo.app.train

script option:

  • --config_path: specify the .yaml config file to read options from. Every run config option should be listed in this file (the default file is config/train.yaml), and every option in that file can be overridden on the fly at the command line.

For instance, if you are fine with the values in the yaml config file and only want to change the number of epochs, you can either change it in the config file or directly run:

poetry run python -m gzoo.app.train --compute.epochs=50

This will use all config values from config/train.yaml except the number of epochs which will be set to 50.
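
The override mechanism can be pictured as dotted command-line keys being applied on top of the values loaded from the yaml file. The sketch below is a stand-in for that idea, not the project's actual parser: the defaults are a hand-written dict instead of a parsed yaml file, and `apply_overrides` is a hypothetical helper.

```python
import ast
import copy


def apply_overrides(config: dict, args: list[str]) -> dict:
    """Return a copy of config with '--section.key=value' overrides applied."""
    result = copy.deepcopy(config)
    for arg in args:
        dotted_key, _, raw_value = arg.removeprefix("--").partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node[name]  # KeyError if the section does not exist
        if leaf not in node:
            raise KeyError(f"unknown option: {dotted_key}")
        try:
            node[leaf] = ast.literal_eval(raw_value)  # "50" -> 50, "True" -> True
        except (ValueError, SyntaxError):
            node[leaf] = raw_value  # keep plain strings such as resnet18
    return result


# Stand-in for the values that would be read from config/train.yaml.
defaults = {"compute": {"epochs": 90, "workers": 8}, "model": {"arch": "resnet18"}}
config = apply_overrides(defaults, ["--compute.epochs=50"])
print(config["compute"])  # {'epochs': 50, 'workers': 8}
```

Working on a deep copy keeps the defaults untouched, so the same base config can be reused across several runs.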

main run options:

  • --compute.seed: seed for deterministic training. (default: None)
  • --compute.epochs: total number of epochs (default: 90)
  • --compute.batch-size: batch size (default: 128)
  • --compute.workers: number of data-loading threads (default: 8)
  • --model.arch: model architecture to be used (default: resnet18)
  • --model.pretrained: use pre-trained model (default: False)
  • --optimizer.lr: optimizer learning rate (default: 3.e-4 with Adam)
  • --optimizer.momentum: optimizer momentum (for SGD only, default: 0.9)
  • --optimizer.weight-decay: optimizer weights regularization (L2, default 1.e-4)
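
For reference, the defaults listed above can be written down as nested dataclasses. This is only a sketch of how such a config could be organised: the class names are hypothetical and do not claim to mirror the definitions in `gzoo.infra.config`.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ComputeConfig:
    seed: Optional[int] = None  # None means training is not seeded
    epochs: int = 90
    batch_size: int = 128
    workers: int = 8


@dataclass
class ModelConfig:
    arch: str = "resnet18"
    pretrained: bool = False


@dataclass
class OptimizerConfig:
    lr: float = 3e-4
    momentum: float = 0.9  # only meaningful for SGD
    weight_decay: float = 1e-4


@dataclass
class TrainConfig:
    compute: ComputeConfig = field(default_factory=ComputeConfig)
    model: ModelConfig = field(default_factory=ModelConfig)
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)


# asdict() turns the nested dataclasses into plain dicts, one per section.
print(asdict(TrainConfig())["compute"])
```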

Prediction

poetry run python -m gzoo.app.predict

Config works the same way as for training; the default config is at config/predict.yaml.

A 1-image example is provided which you can run with:

poetry run python -m gzoo.app.predict --dataset.dir=example/

Config

If you make changes in gzoo.infra.config, you should also update the related .yaml config files in config/ with:

make config