May 2020
tl;dr: The ultimate guide to train a fast and accurate detector with limited resource.
The paper has a very nice review of object detection, including one-stage object detectors, two-stage object detectors, anchor-based ones and anchor-free ones.
Yolov4 is highly practical and focuses on training fast object detectors with only one 1080Ti or 2080Ti GPU card. Yolov4 runs twice as fast as EfficientDet.
- Bag of freebies: Improvements can be made in the training process
- Data augmentation: CutMix and Mosaic
- photometric, geometric
- DropBlock regularization: more effective than DropOut for CNN. DropOut was initially proposed for fc layers.
- Class label smoothing: it actually degrades performance
- cIOU loss function
- CmBN
- Cosine Anealing LR
- Dynamic minibatch size. This is similar to [Multigrid]
- Data augmentation: CutMix and Mosaic
- Bag of specials: which impacts the inference time slightly with a good return in performance.
- Plug-in modules such as attention modules
- SPP
- SAM
- PANet
- activation other than ReLU: Mish (better than Swish, also seen here)
- dIoU-NMS (anchor-free are NMS-free)
- Random training shapes
- Innovations
- Mosaic data aug: this is very similar to Sticher and Yolov3 Ultralytics. This is similar to increase the batch size.
- Self-adversarial training: first pass modify original image, then second pass train object detection
- Cross minibatch batch norm: improved version of CBN.
- SAM: Spatial Attention Module (channel wise mean/max pooling) from CBAM modified to point wise attention.
- PANet: concatenation instead of addition.
- Conclusions:
- ResNeXt50 is better for classification, but DarkNet53 is better for detection
- When enough trick is used, accuracy does not depend on batch size too much (4 and 8 similar)
- S: eliminating grid sensitivity. Yolov3 uses a sigmoid to regress the relative position inside the grid. The scale factor in yolov4 is 1.2. Essentially this is using the middle range of sigmoid, say stretching [0.2, 0.8] to [0, 1]. See more details in this PR to openCV. When S >> 1, then scale sigmoid approximates L2 loss.
- See this issue in yolov4 repo for background of this discussion.
- Most effective tricks:
- Mosaic data aug
- GA (genetic algorithm) to use 10% of training to finetune parameters.
- Cosine anealing
- SqueezeNet, MobileNet and ShuffleNet are more friendly for CPU but not GPU.
- SPP: in the original paper Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition TPAMI 2015 converts an image to a fixed length one-dimensional feature vector, and SPP is used for object detection. Yolov3 improves it by concatenating the output of max pooling with stride 1. This helps increases the receptive field fast.
- FLOPS vs inference time: SE on GPU usually increases inference time by 10%, although flops only increases 2%. SAM from CBAM does not increase inference time at all.
- CSPNet: A New Backbone that can Enhance Learning Capability of CNN: Cross stage partial network splits feature channels into two groups: one group passes through conv layers and the rest keeps the same.
- Why not regressing the position inside the grid [0, 1] directly with L1 loss?