DETR3D (CoRL 2021)
6 camera images + calibration
↓
Shared 2D backbone
ResNet / VoVNet
↓
FPN
multi-scale 2D image features
↓
Learned 3D object queries
↓
Predict 3D reference points
↓
Project 3D points into camera views
↓
Sample 2D FPN features
↓
Transformer decoder updates queries
↓
Class head + 3D box head
↓
Hungarian matching during training
↓
Final 3D detections, no NMS
How it works:
- Input: 6 surround-view camera images, plus each camera's intrinsics and extrinsics.
- Feature pyramid extraction: in the encoder stage, every image passes through a shared 2D backbone (ResNet or VoVNet). The backbone features then pass through an FPN (Feature Pyramid Network) to produce 2D feature maps at 4 scales.
- The decoder takes the 4-scale feature maps as input and links the 2D features to 3D positions without building a dense 3D voxel/BEV feature volume. Instead, it uses sparse 3D object queries: each query predicts (or owns) a 3D reference point, projects that point into the camera images, samples 2D features there, and uses them to update the query.
- Generate 3D reference points: the model keeps 100 trainable query embedding vectors as “3D object slots”. At inference these vectors are not inputs; they are fixed parameters of the network. After passing through the decoder layers together with the current image features, the embeddings yield 3D reference points that differ from scene to scene. These 3D locations are then projected back into the camera views using the extrinsics and intrinsics.
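- Project 3D reference points onto the 2D feature maps: each 3D reference point is projected into every camera’s feature maps at all FPN scales, using that camera’s extrinsics and intrinsics. A camera in which the point falls outside the image (or behind the camera) contributes nothing for that point.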
- Sample features: using bilinear interpolation, the model samples image features at the projected locations and uses them to update the query features via cross-attention.
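- Training: there are two heads:
  - a bounding box head (x, y, z, box height, width, length, velocity, and yaw);
  - a classification head that outputs the class label.
- During training, Hungarian matching assigns predictions to ground-truth objects one-to-one, which teaches each query slot to specialize as an object detector.
  - The matcher computes a cost (classification plus box distance) between every prediction and every ground-truth box, then picks the assignment with the lowest total cost. Matched queries are trained toward their assigned box and class; unmatched queries are trained to predict "no object". Because each object is claimed by exactly one query, duplicate detections are discouraged and no NMS is needed.
- Loss: a set-to-set loss. The bounding box term is an L1 loss and the classification term is a focal loss. The overall set-prediction formulation follows DETR.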
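The one-to-one assignment described above can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the DETR3D implementation: the function names, the `box_weight` parameter, and the cost (negative class probability plus weighted L1 box distance) are all chosen here for clarity, boxes are shortened to (x, y, z), and a brute-force search over permutations stands in for the Hungarian algorithm. It reaches the same minimum cost but is only practical for a handful of queries.

```python
from itertools import permutations


def l1_cost(pred_box, gt_box):
    """Sum of absolute differences between two box parameter vectors."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def match_queries(pred_boxes, pred_probs, gt_boxes, gt_labels, box_weight=1.0):
    """Return {gt_index: query_index} for the cheapest one-to-one assignment.

    Brute force over permutations is a stand-in for the Hungarian algorithm
    (assumption for this sketch); it is exponential, so keep inputs tiny.
    """
    num_queries = len(pred_boxes)
    best_cost, best_assignment = float("inf"), None
    for query_ids in permutations(range(num_queries), len(gt_boxes)):
        cost = 0.0
        for gt_idx, q_idx in enumerate(query_ids):
            # Higher probability for the correct class means lower cost.
            class_cost = -pred_probs[q_idx][gt_labels[gt_idx]]
            box_cost = l1_cost(pred_boxes[q_idx], gt_boxes[gt_idx])
            cost += class_cost + box_weight * box_cost
        if cost < best_cost:
            best_cost, best_assignment = cost, query_ids
    return dict(enumerate(best_assignment))


# Toy data: 3 query predictions, 2 ground-truth objects, 2 classes.
pred_boxes = [(10.0, 0.0, 0.0), (1.0, 2.0, 0.5), (5.0, 5.0, 1.0)]
pred_probs = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
gt_boxes = [(1.2, 2.1, 0.5), (9.5, 0.2, 0.0)]
gt_labels = [0, 1]

matches = match_queries(pred_boxes, pred_probs, gt_boxes, gt_labels)
# Queries that match no ground truth are supervised as "no object".
unmatched = sorted(set(range(len(pred_boxes))) - set(matches.values()))
print(matches, unmatched)
```

Here ground truth 0 is claimed by query 1 and ground truth 1 by query 0, leaving query 2 as a "no object" slot. In practice a polynomial-time solver such as SciPy's `linear_sum_assignment` would replace the permutation loop.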
Detailed Explanation
100x256 -> self attention -> 100x256 -> cross attention:
feature map (H/4, W/4, 1) -> (H/4, W/4, 256) -> cross attention
- Each object query is a feature vector that represents one obstacle bounding box at a single 3D location, so the vector effectively carries (x, y, z).
- The query vectors become distinct from one another by interacting through self-attention (this specialization is learned through backpropagation). 100 queries = 100 x 256, with position embeddings added.
- Cross Attention:
  - The feature map gets transformed into `(H/4, W/4, 256)`. This is still an image-space feature: the 256-D vector `f` at `[x0, y0]` represents the feature at that feature-map point.
  - If you pass `f` through `K` and `V`, you get the key and value vectors `k` and `v` for `f`. You can then feed in an object query `q` to get a context vector for `q` that carries the image information at `f`.
  - Full attention over every pixel in every camera at every scale would be expensive. Naive cross-attention would have each object query attend to all image pixels from all cameras and scales, that is, `6 cameras x 4 scales x many (H, W) locations`.
  - Instead, we project the 3D reference locations into the 2D camera views and look only there.
  - Deformable cross-attention does not attend to all feature-map positions. For each query `q`, each attention head samples a small number of image feature locations around the projected reference point (basically, we select a small neighborhood of `f` vectors and get their `k` and `v`), then combines those sampled values with learned attention weights.
  - So the final context vector is $\sum_i \mathrm{MultiHead}(q, k_i, v_i)$, summed over the sampled locations $i$.
- Repeat step 3 (cross-attention) N times.
- The model will predict `(x, y, z, h, w, l, yaw, confidence)`. Over training, this output vector becomes more and more accurate.
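The "project, then look only there" idea from the cross-attention step can be sketched in plain Python. This is an illustrative sketch rather than the DETR3D code: the function names, the dictionary layout of `cameras`, and the `stride` parameter are hypothetical; it assumes a pinhole camera with a 3x4 ego-to-camera extrinsic `[R | t]`, a single feature scale, and a plain average over the cameras that see the point (learned attention weights are omitted).

```python
def project_point(point_xyz, extrinsic, intrinsic):
    """Project a 3D point into one camera; return (u, v, depth) or None.

    extrinsic: 3x4 [R | t] mapping ego coordinates to camera coordinates.
    intrinsic: 3x3 pinhole matrix K.
    """
    x, y, z = point_xyz
    cam = [r[0] * x + r[1] * y + r[2] * z + r[3] for r in extrinsic]
    depth = cam[2]
    if depth <= 1e-6:
        return None  # point is behind the camera
    pix = [k[0] * cam[0] + k[1] * cam[1] + k[2] * cam[2] for k in intrinsic]
    return pix[0] / depth, pix[1] / depth, depth


def bilinear_sample(feat, u, v):
    """Interpolate feat[row][col] (a list of channels) at column u, row v."""
    height, width = len(feat), len(feat[0])
    if not (0 <= u <= width - 1 and 0 <= v <= height - 1):
        return None  # projected location falls outside this feature map
    u0, v0 = min(int(u), width - 2), min(int(v), height - 2)
    du, dv = u - u0, v - v0
    corners = [
        (feat[v0][u0], (1 - du) * (1 - dv)),
        (feat[v0][u0 + 1], du * (1 - dv)),
        (feat[v0 + 1][u0], (1 - du) * dv),
        (feat[v0 + 1][u0 + 1], du * dv),
    ]
    channels = len(feat[0][0])
    return [sum(w * f[c] for f, w in corners) for c in range(channels)]


def sample_query_feature(ref_point, cameras, stride=1.0):
    """Average the features sampled at ref_point over cameras that see it."""
    samples = []
    for cam in cameras:
        proj = project_point(ref_point, cam["extrinsic"], cam["intrinsic"])
        if proj is None:
            continue
        u, v, _ = proj
        # Assumption: feature-map coordinates are pixel coordinates / stride.
        feat = bilinear_sample(cam["feature_map"], u / stride, v / stride)
        if feat is not None:
            samples.append(feat)
    if not samples:
        return None
    return [sum(ch) / len(samples) for ch in zip(*samples)]


# Toy setup: a 3x3, 1-channel feature map whose value at (row, col) is col + 10 * row.
feature_map = [[[float(col + 10 * row)] for col in range(3)] for row in range(3)]
K = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]]
front = {"extrinsic": [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
         "intrinsic": K, "feature_map": feature_map}
rear = {"extrinsic": [[-1, 0, 0, 0], [0, 1, 0, 0], [0, 0, -1, 0]],
        "intrinsic": K, "feature_map": feature_map}

feature = sample_query_feature((0.25, 0.5, 1.0), [front, rear])
print(feature)
```

The reference point lands at column 1.5, row 2.0 in the front camera and behind the rear camera, so only one camera contributes. Each query therefore touches a few feature vectors per camera and scale instead of every `(H, W)` location, which is where the savings over naive cross-attention come from.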
Original Dataset: KITTI, nuScenes (multimodal city traffic datasets), SUN RGB-D (indoor 3D object detection dataset), ScanNet (RGB-D camera objects)