A minimal object detection repository.
While reading papers and browsing repos to refresh my computer vision knowledge, i noticed that most object detection repos are complicating and cluttered with code - which makes it difficult to understand how object detection works end to end.
This repo should provide a simple and clear understanding on how to tackle the object detection problem. It's like a minimal template for object detection problems.
The aim was to make it easy to use, understand and customize for your own problems or datasets.
Repo is mostly based on the FCOS architecture.
All training was done from scratch, without pretrained models or additional data.
git clone git@github.com:filipbasara0/simple-object-detection.git
- create virtual environment:
virtualenv -p python3.8 env
- activate virtual environment:
source env/bin/activate
- install requirements:
pip install -r requirements.txt
python train.py --resolution=480 --dataset="pascal_voc_2012" --output_dir="trained_models/model.pth" --train_batch_size=8 --eval_batch_size=8 --num_epochs=81 --learning_rate=1e-3 --save_model_epochs=1 --num_classes=19 --adam_weight_decay=5e-2
from inference.load import load_model, load_image
from datasets import reverse_transform_classes
from utils import draw_bboxes
# load a model
predictor = load_model("path/to/model.pth", num_classes=19)
# load an image
image = load_image("path/to/img.jpg", image_size=480)
# obtain results
preds = predictor(image)
bboxes = preds["predicted_boxes"]
scores = preds["scores"]
classes = reverse_transform_classes(preds["pred_classes"], "pascal_voc_2012")
# optional - visualize predictions
image = image[0].permute(1, 2, 0).detach().cpu().numpy()
draw_bboxes(f"./path/to/visualized.jpg", image, bboxes[0], scores[0], classes[0])
To add a new dataset, create a file datasets/my_dataset.py
. In datasets/my_dataset.py
, you should create a class that contains two methods - get_transforms
for training augmentations (can be None
if you don't need them) and load_data
:
class MyDataset:
def load_data(self, dataset_path, labels):
# load the dataset and return it in the format specified below
...
def get_transforms(self):
# return transforms (just return None if you don't need any)
...
load_data
should return the dataset in the following format:
[
...,
{
"image_path": "path/to/my/image.jpg",
"target": [..., [x1,y1,x2,y2,C]]
}
]
x1, y1 and x2,y2 represent top left and bottom right corners of your target bboxes, while C represents a label encoding of your target class (1,2,...len(C))
. Element 0 is reserved for the __background__
class, which is used to filter negative samples when preparing the training labels.
Finally, in datasets/datasets.py
add a new entry to the DATASETS
dict with thet following fields
-
dataset_path
- path to your dataset metadata (image_path
andtarget
) -
class_name
- class name for you dataset -
labels
- list of labels - first element of the list should be the__background__
class (see Pascal and Carla labels indatasets/datasets.py
)
Training used extensive data augmentation - random horizontal flipping, scaling, translation, rotation, shearing and HSV. Images were resized to maintain the aspect ratio, using the letterbox
method.
Additional augmentation such as noise injection, blurring, cropping, (blocks/center) erasing, ... could result in better overall performance.
Backbone architecture is the same as ConvNext-Tiny
:
- Patch size:
4
- Layer depths:
[3, 3, 9, 3]
- Block dims:
[96, 192, 384, 768]
- Image sizes:
384
,416
and480
- Model resulted in
25M
params
It was trained for 100 epochs and obtained a mAP of 40 on a small eval dataset. Training took ~30 hours on a GTX1070Ti.
Training bigger models for longer would definitely yield better results.
Model with the same specification as above was trained for 50 epochs and obtained a mAP of 60 on a small eval dataset. Training took 3 hours on a GTX1070Ti.
Dataset collected by myself in the CARLA simulator can be found here, annotations can be found here.
Pretrained model can be found here.
Amazingly, the model can even detect IRL traffic lights (although with a lower confidence):
from inference.load import load_model, load_image
from datasets import reverse_transform_classes
from utils import draw_bboxes
# load a model (download from link above - https://drive.google.com/file/d/17mcQ-Ct6bUTS8BEpeDjaZMIFmHS2gptl/view?usp=share_link)
predictor = load_model("/path/to/fcos-carla-v01.pth", num_classes=2)
# load an image
image = load_image("path/to/img.jpg", image_size=480)
# obtain results
preds = predictor(image)
bboxes = preds["predicted_boxes"]
scores = preds["scores"]
classes = reverse_transform_classes(preds["pred_classes"], "carla_traffic_lights")
# optional - visualize predictions
image = image[0].permute(1, 2, 0).detach().cpu().numpy()
draw_bboxes(f"./path/to/visualized.jpg", image, bboxes[0], scores[0], classes[0])
- Add support for segmentation
- Add DETR
- Train on COCO (once i manage to get some better hardware)