
May 2020

tl;dr: The ultimate guide to training a fast and accurate detector with limited resources.

Overall impression

The paper has a very nice review of object detection, covering one-stage and two-stage object detectors, both anchor-based and anchor-free.

Yolov4 is highly practical and focuses on training fast object detectors on a single 1080Ti or 2080Ti GPU. Yolov4 runs twice as fast as EfficientDet with comparable performance.

Key ideas

  • Bag of freebies: improvements to the training process that do not increase inference cost
    • Data augmentation: CutMix and Mosaic
      • photometric, geometric
    • DropBlock regularization: more effective than dropout for CNNs; dropout was initially proposed for fully connected layers.
    • Class label smoothing: it actually degrades performance in the ablations
    • CIoU loss function
    • CmBN (Cross mini-Batch Normalization)
    • Cosine annealing LR schedule
    • Dynamic mini-batch size. This is similar to [Multigrid]
  • Bag of specials: modules and post-processing that impact inference time slightly with a good return in accuracy.
    • Plug-in modules such as attention modules
    • SPP
    • SAM
    • PANet
    • activations other than ReLU: Mish (better than Swish, also seen here)
    • DIoU-NMS (anchor-free detectors are NMS-free)
    • Random training shapes (multi-scale training)
  • Innovations
    • Mosaic data augmentation: very similar to Stitcher and Ultralytics' Yolov3. The effect is similar to increasing the batch size.
    • Self-adversarial training: a first pass perturbs the original image (rather than the weights), then a second pass trains the detector on the perturbed image
    • Cross mini-batch batch norm (CmBN): an improved version of CBN.
    • SAM: the Spatial Attention Module from CBAM (channel-wise mean/max pooling), modified to point-wise attention.
    • PANet: concatenation instead of addition.
  • Conclusions:
    • ResNeXt50 is better for classification, but DarkNet53 is better for detection
    • With enough tricks, accuracy does not depend much on batch size (4 and 8 perform similarly)
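A minimal NumPy sketch of the Mosaic idea (my own illustration, not the paper's implementation; bounding-box label handling is omitted, but in practice boxes must be shifted and clipped the same way):

```python
import numpy as np

def mosaic(images, out_size=416, rng=None):
    """Tile 4 images into one canvas around a random center point,
    cropping a matching window from each source image."""
    rng = rng or np.random.default_rng()
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    # Random center of the 2x2 mosaic, kept away from the borders.
    cx = int(rng.uniform(0.25, 0.75) * out_size)
    cy = int(rng.uniform(0.25, 0.75) * out_size)
    regions = [
        (slice(0, cy), slice(0, cx)),                # top-left
        (slice(0, cy), slice(cx, out_size)),         # top-right
        (slice(cy, out_size), slice(0, cx)),         # bottom-left
        (slice(cy, out_size), slice(cx, out_size)),  # bottom-right
    ]
    for img, (rs, cs) in zip(images, regions):
        h, w = rs.stop - rs.start, cs.stop - cs.start
        # Crop a random window of the target size from the source image.
        y0 = rng.integers(0, img.shape[0] - h + 1)
        x0 = rng.integers(0, img.shape[1] - w + 1)
        canvas[rs, cs] = img[y0:y0 + h, x0:x0 + w]
    return canvas
```

Each output image then contains objects from four source images at once, which is why the effect resembles a larger batch size.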
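For reference, the label smoothing trick mentioned in the bag of freebies is the standard formulation below (generic sketch, not YOLOv4-specific code); the notes above report that it actually hurt detection accuracy:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Soften one-hot targets toward the uniform distribution:
    the true class gets 1 - eps + eps/n, every class gets eps/n."""
    n = one_hot.shape[-1]
    return one_hot * (1 - eps) + eps / n
```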

Technical details

  • S: eliminating grid sensitivity. Yolov3 uses a sigmoid to regress the relative position inside the grid cell. The scale factor S in yolov4 is 1.2. Essentially this uses only the middle range of the sigmoid, stretching roughly [0.08, 0.92] (for S = 1.2) to [0, 1]. See more details in this PR to openCV. When S >> 1, the scaled sigmoid becomes nearly linear, so the regression approximates a direct L2-style offset regression.
  • Most effective tricks:
    • Mosaic data aug
    • GA (genetic algorithm) hyperparameter search on the first 10% of training to fine-tune hyperparameters.
    • Cosine annealing
  • SqueezeNet, MobileNet and ShuffleNet are more CPU-friendly than GPU-friendly.
  • SPP: the original paper (Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, TPAMI 2015) converts an image to a fixed-length one-dimensional feature vector, and SPP is used for object detection. Yolov3 improves on it by concatenating the outputs of stride-1 max pooling. This increases the receptive field quickly.
  • FLOPs vs inference time: SE on GPU usually increases inference time by 10%, although FLOPs increase by only 2%. SAM from CBAM does not increase inference time at all.
  • CSPNet: A New Backbone that can Enhance Learning Capability of CNN: the cross-stage partial network splits feature channels into two groups; one group passes through the conv layers and the other is kept unchanged.
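The grid-sensitivity fix in the first bullet can be sketched as a scaled and re-centered sigmoid, sigma(x) * S - (S - 1) / 2 (my own illustration, with S = 1.2 as stated above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scaled_offset(x, s=1.2):
    """Stretch the sigmoid by s and re-center so that the full [0, 1]
    in-cell offset is covered by the sigmoid's mid-range instead of
    its saturated tails."""
    return sigmoid(x) * s - (s - 1) / 2.0
```

With S = 1.2, an offset of exactly 1.0 is reached at the finite logit x = ln(11) ≈ 2.4, whereas a plain sigmoid only reaches 1.0 as x goes to infinity.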
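The yolov3-style SPP block above can be sketched as follows (a NumPy illustration; the kernel sizes 5/9/13 are the commonly used choice and are an assumption here, not quoted from the paper):

```python
import numpy as np

def max_pool_same(x, k):
    """Stride-1 max pooling with 'same' padding on an HxWxC array."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)),
                constant_values=-np.inf)
    h, w, _ = x.shape
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + k, j:j + k].max(axis=(0, 1))
    return out

def spp(x, kernels=(5, 9, 13)):
    """Concatenate the input with stride-1 max pools of several kernel
    sizes along the channel axis; spatial size is preserved while the
    receptive field grows."""
    return np.concatenate([x] + [max_pool_same(x, k) for k in kernels],
                          axis=-1)
```

Because every pooling branch uses stride 1 with 'same' padding, the output keeps the input's spatial resolution and only the channel count grows (here 4x).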

Notes

  • Why not regress the position inside the grid cell on [0, 1] directly with an L1 loss?
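One piece of context for this question: a plain sigmoid saturates as the target offset approaches the cell border, so its gradient vanishes exactly where the grid-sensitivity trick operates; a linear output with L1 loss would not saturate, but it also would not bound predictions to the cell. A quick numeric check (my own illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The gradient of the sigmoid w.r.t. its logit is sigma * (1 - sigma).
# Targets near the cell border (offset -> 0 or 1) require large logits,
# where this gradient is tiny, slowing down learning.
for logit in [0.0, 3.0, 6.0]:
    s = sigmoid(logit)
    print(f"logit={logit:4.1f}  offset={s:.4f}  grad={s * (1 - s):.4f}")
```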