Skip to content

Latest commit

 

History

History
38 lines (30 loc) · 2.35 KB

File metadata and controls

38 lines (30 loc) · 2.35 KB

June 2020

tl;dr: Transformer used for object detection as direct set prediction .

Overall impression

Formulate the object detection problem as direct set prediction problem. No need for engineering-heavy anchor boxes and NMS.

The attention mechanism from Transformer is similar to Non-local Networks. The attention has perfect memory and has same "distance" between any two points in the image.

Key ideas

  • Training:
    • Two step process. Bipartite matching loss + Prediction loss. (the second term is called "Hungarian loss" which is quite confusing)
    • The bipartite matching forces a 1-to-1 matching, without missing.
    • Null padding of training set to N.
  • DETR infers a fixed size set of N predictions. Predicts normalized center coordinates wrt the input image.
  • Resizing feature map HxWxd to dx(HW), as a sequence of feature dim d and length of HW.
  • Object query can be learned with SGD. It is part of the model's weight. It is better to be learned at attention.
  • Decoding output is non-autoregressive parallel decoding (feed previous output to the decoder to get next output).
  • Need extremely long training (300 epochs) to converge, vs 1x = 12 epochs for Faster RCNN.

Technical details

  • Matching loss + Hungarian loss
    • $L_{matching} = -\mathbb{1}(c_i \neq \varnothing)p(c_i) + \mathbb{1}(c_i \neq \varnothing) L(y_{bbox}, \hat{y}_{bbox})$
    • $L_{Hungarian} = -p(c_i) + \mathbb{1}(c_i \neq \varnothing) L(y_{bbox}, \hat{y}_{bbox})$
    • Note that the class mask in Hungarian loss is gone. This is needed to suppress false positive or near-duplicates.
  • Generalized IoU loss for scale invariant and more robust alternative to IoU or Dice loss.
  • Seems like object detectors usually do not use Adam nor dropout. ("We are not aware of successful applications of ...")
  • Key findings for explainability of DETR:
    • Encoder is able to separate individual objects.
    • Decoder typically attends to object extremities.
    • The visualization of bbox prediction of various output slots in DETR demonstrates that each slot attends to a particular type of objects. This is highly intriguing, similar to the prototype masks in YOLACT.
  • Uses updated AdamW for training.

Notes