1. The Core Goal of a BEV Network
A Bird’s Eye View (BEV) network aims to reconstruct, in real time using only onboard cameras, a local 3D map of comparable quality to what an offline reconstruction pipeline can produce with full compute budgets and hindsight. The key questions are: (a) what does that offline system produce, and (b) how does the online network learn to approximate it?
2. The Training Problem: Where Does Ground Truth Come From?
Camera-only networks have no LiDAR or other direct depth sensor onboard (or deliberately exclude it at inference time). So where does the depth/occupancy supervision come from during training?
2.1 Offline Auto-Labeling Pipeline
Companies build a heavy offline reconstruction pipeline that runs after data collection, not in real time. It may include:
- SLAM / SfM — simultaneous localization and mapping, structure from motion
- Bundle adjustment — global joint optimization of camera poses and 3D structure
- Multiview stereo (MVS) — dense depth from many overlapping views
- Temporal tracking — associating objects across frames
- Map fusion — aggregating reconstructions across many drives
The full pipeline, simplified:

```
1. Collect: synchronized multi-camera video + calibration + ego-motion (GPS/IMU)
2. Reconstruct: run offline SLAM / SfM / MVS / bundle-adjustment stack
3. Label: generate pseudo-ground-truth targets
   ├── 3D points / surfaces
   ├── object tracks and bounding boxes
   ├── occupancy volumes
   ├── lane geometry
   └── free-space masks
4. Train: supervise online network to predict those targets from raw images alone
```
The offline system may produce depth at varying densities:
| Output | Typical source |
|---|---|
| Sparse depth | SfM / feature matching |
| Semi-dense depth | Direct methods (LSD-SLAM, DSO) |
| Dense depth | Multiview stereo, depth completion |
| Surface estimates | TSDF fusion, mesh reconstruction |
2.2 Voxel Occupancy as Training Target
Rather than supervising metric depth directly, it is often more useful to voxelize the reconstructed scene into discrete states:
- Occupied — a reconstructed surface or tracked object is present
- Free — a camera ray passed through this cell without hitting anything
- Unknown — no ray coverage
This is task-aligned for autonomous driving (collision avoidance, path planning) and avoids the difficulty of regressing a single depth value per pixel.
Key insight: The offline system produces supervisory signals of far higher quality than the real-time car could compute onboard. The online network learns to predict a compatible world representation directly from raw images, leveraging that offline investment only at training time.
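The occupied/free/unknown labeling above can be sketched with a simple ray-casting pass over the reconstructed points. This is a minimal NumPy illustration, not a production labeler: the grid bounds, voxel size, and half-voxel ray step are illustrative choices, and real pipelines use proper voxel traversal (e.g. Amanatides–Woo) and aggregate many cameras and frames.

```python
import numpy as np

# Voxel states, as in the list above.
UNKNOWN, FREE, OCCUPIED = 0, 1, 2

def label_occupancy(points, origin, grid_min, voxel_size, grid_shape):
    """points: (N, 3) reconstructed surface points seen from camera `origin`."""
    grid = np.full(grid_shape, UNKNOWN, dtype=np.uint8)

    def to_idx(p):
        idx = np.floor((p - grid_min) / voxel_size).astype(int)
        return tuple(idx) if np.all((idx >= 0) & (idx < grid_shape)) else None

    for p in points:
        # Cells crossed by the ray from the camera to the hit point are free.
        dist = np.linalg.norm(p - origin)
        for s in np.arange(0.0, dist, voxel_size * 0.5):
            idx = to_idx(origin + (p - origin) * (s / dist))
            if idx is not None and grid[idx] != OCCUPIED:
                grid[idx] = FREE
        # The endpoint itself is occupied (overrides free).
        idx = to_idx(p)
        if idx is not None:
            grid[idx] = OCCUPIED
    return grid
```

Cells never touched by any ray keep the UNKNOWN label, which is exactly the third state the supervision needs.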
2.3 Feature Consistency / Reprojection Loss (Self-Supervised Signal)
An alternative or supplementary signal requires no offline reconstruction. Given calibrated cameras or consecutive video frames:
- Predict depth or 3D feature locations from frame $t$
- Project them into another camera viewpoint or into frame $t+1$ using known/estimated camera pose
- Compare projected features or pixels against actual observations there
This photometric or feature-level reprojection loss provides a self-supervised geometric signal without any annotation. It is used in methods such as Monodepth2, SurroundDepth, and similar works.
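The three steps above can be sketched as a minimal photometric loss between two frames. This is an illustrative NumPy version under simplifying assumptions: depth for frame $t$ is given, the relative pose $(R, \mathbf{t})$ is known, and sampling is nearest-neighbour rather than the bilinear sampling real methods use; function and argument names are invented for the sketch.

```python
import numpy as np

def reprojection_loss(img_t, img_t1, depth_t, K, R, t_vec):
    """Photometric reprojection loss; R, t_vec map frame-t coords to frame t+1."""
    H, W = depth_t.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW

    # Back-project pixels of frame t into 3D using the predicted depth ...
    pts = np.linalg.inv(K) @ pix * depth_t.reshape(1, -1)
    # ... transform into frame t+1 and project back to pixel coordinates.
    proj = K @ (R @ pts + t_vec.reshape(3, 1))
    u1 = np.round(proj[0] / proj[2]).astype(int)
    v1 = np.round(proj[1] / proj[2]).astype(int)

    # Compare intensities where the reprojection lands inside the image.
    valid = (u1 >= 0) & (u1 < W) & (v1 >= 0) & (v1 < H)
    diff = np.abs(img_t.reshape(-1)[valid] - img_t1[v1[valid], u1[valid]])
    return diff.mean()
```

If the predicted depth is wrong, the reprojected pixel lands on the wrong part of the next frame and the photometric error grows, which is the gradient signal these methods train on.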
3. Multiview Geometry: Building the Offline Reconstruction
3.1 Classical Sparse Pipeline
```
1. Detect and match feature points across views / time
   (SIFT, SuperPoint, ORB, ...)
2. Use epipolar geometry + RANSAC to reject bad matches
   (fundamental matrix F, essential matrix E)
3. Recover relative camera poses from E/F decomposition
4. Triangulate matched point pairs into 3D
5. Refine globally with bundle adjustment
   → sparse 3D point cloud
```
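Step 4 (triangulation) is worth seeing concretely. A minimal sketch of the standard linear (DLT) triangulation: given the two 3×4 projection matrices and one matched pixel pair, each observation contributes two linear constraints on the homogeneous 3D point, and the SVD gives the least-squares solution.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixel observations."""
    # Each view contributes two rows of the homogeneous system A X = 0.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize
```

With noise-free matches this recovers the 3D point exactly; with real matches, bundle adjustment (step 5) refines these initial triangulations jointly with the poses.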
3.2 Densification and Semantic Enrichment
A sparse point cloud alone is not sufficient for autonomous driving training. The pipeline goes further:
| Technique | Output |
|---|---|
| Multiview stereo (MVS) | Dense depth / dense point cloud |
| Plane / surface fitting | Ground plane, building facades |
| Temporal fusion | Consistent HD map across many drives |
| Semantic segmentation | Per-voxel class labels |
| Object-level reconstruction | Tracked 3D bounding boxes |
Summary: Epipolar geometry and triangulation give the geometric skeleton; the full offline pipeline densifies, cleans, and semantically organizes it into usable training targets — surface estimates, lane geometry, curb profiles, occupancy volumes, and tracked objects.
4. View Transformation: From Perspective Images to BEV
4.1 The Core Challenge — Depth Ambiguity
A single image pixel $(u, v)$ does not correspond to one unique 3D point. It corresponds to an entire ray through 3D space:
\[
\mathbf{p}_{3D} = \mathbf{o} + d \cdot \hat{\mathbf{r}}_{u,v}, \quad d \in [d_{\min}, d_{\max}]
\]

Without knowing the depth $d$, you cannot place the image feature into a unique BEV cell. This is the central difficulty of perspective-to-BEV lifting.
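The ray direction $\hat{\mathbf{r}}_{u,v}$ comes directly from the calibration. A small sketch, assuming $R$ is the world-to-camera rotation (so $R^\top$ maps camera directions back to world frame):

```python
import numpy as np

def pixel_ray(K, R, u, v):
    """Unit ray direction r_hat through pixel (u, v), in world coordinates."""
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray in camera frame
    d_world = R.T @ d_cam                             # rotate camera -> world
    return d_world / np.linalg.norm(d_world)          # normalize to unit length
```

Every depth $d$ along this ray is equally consistent with the observed pixel, which is exactly the ambiguity the lifting methods below must resolve.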
4.2 IPM — Inverse Perspective Mapping
Inverse Perspective Mapping (IPM) projects image pixels to BEV without depth estimation, by assuming all scene points lie on the ground plane ($Z = 0$ in world coordinates).
Given camera intrinsics $K$ and extrinsics $[R \mid t]$, the ground-plane constraint reduces the problem to a planar homography, giving a closed-form, per-pixel mapping with no learning.
- Strengths: Fast, simple, no supervision needed
- Limitation: Works only for flat ground. Fails for raised objects (vehicles, pedestrians), curbs, overpasses, or any non-planar structure
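The closed-form mapping is compact enough to write out. With world-to-camera extrinsics $[R \mid t]$ and the constraint $Z = 0$, the third column of $R$ drops out and the projection collapses to a 3×3 homography $H = K\,[\mathbf{r}_1\ \mathbf{r}_2\ \mathbf{t}]$. A minimal sketch (function names are illustrative):

```python
import numpy as np

def ipm_homography(K, R, t):
    """Homography mapping ground-plane coords (X, Y, 1) to pixel coords."""
    # Z = 0 kills R's third column: K [r1 r2 r3 | t] [X Y 0 1]^T = K [r1 r2 t] [X Y 1]^T
    return K @ np.column_stack([R[:, 0], R[:, 1], t])

def ground_to_pixel(H, X, Y):
    p = H @ np.array([X, Y, 1.0])
    return p[:2] / p[2]  # perspective divide
```

Inverting $H$ gives the BEV warp: each BEV cell $(X, Y)$ reads its pixel directly, with no learning and no depth estimate — which is exactly why it breaks the moment a point leaves the ground plane.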
4.3 The General Feature-Lifting Approach
For a BEV cell at world position $(x, y)$:
- Assume or predict a height hypothesis $z$
- Project $(x, y, z)$ into image coordinates $(u, v)$ using $K$ and $[R \mid t]$
- Sample the image feature map at $(u, v)$ via bilinear interpolation
- Write or accumulate the sampled feature into the BEV cell
This family of operations is called unprojection, lifting, splatting, or view transformation depending on direction and implementation.
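The four steps above fit in a few lines. A minimal NumPy sketch for a single BEV cell, assuming $[R \mid t]$ are world-to-camera extrinsics and the feature map is channels-last; a real implementation vectorizes this over the whole grid:

```python
import numpy as np

def lift_bev_cell(feat, K, R, t, x, y, z):
    """feat: (H, W, C) image feature map; returns the feature for BEV cell (x, y)."""
    cam = R @ np.array([x, y, z]) + t          # world -> camera coordinates
    if cam[2] <= 0:                            # behind the camera: no feature
        return np.zeros(feat.shape[-1])
    u, v = (K @ cam)[:2] / cam[2]              # perspective projection to pixels
    H, W, _ = feat.shape
    if not (0 <= u < W - 1 and 0 <= v < H - 1):
        return np.zeros(feat.shape[-1])        # projects outside the image
    # Bilinear interpolation over the four neighbouring feature vectors.
    u0, v0 = int(u), int(v)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * feat[v0, u0] + du * (1 - dv) * feat[v0, u0 + 1]
            + (1 - du) * dv * feat[v0 + 1, u0] + du * dv * feat[v0 + 1, u0 + 1])
```

Bilinear interpolation matters here: it keeps the sampling differentiable in $(u, v)$, so gradients can flow back through the geometry to the height/depth hypotheses.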
4.4 Why Lift Features, Not Raw Pixels
Warping the raw image into BEV before the backbone is the wrong order:
| Warp pixels first | Warp features first |
|---|---|
| Heavy distortion, artifacts, missing regions | Features encode semantic meaning across viewpoint changes |
| Backbone receives broken, unrealistic input | Backbone has already extracted edges, object parts, lane markings |
| Sensitive to lighting, texture changes | Backbone-learned invariance carries over |
Correct pipeline order:

```
image → backbone → feature map → geometric lifting / projection → BEV fusion
```
Not:

```
image → hard warp to BEV → backbone ✗
```
4.5 Four Methods for Handling Depth Ambiguity
| Method | Mechanism | Strengths | Weaknesses |
|---|---|---|---|
| A. Flat-ground / IPM | Assume ground plane $z = 0$; closed-form homography | Simple, fast, no learned depth | Fails for non-flat objects |
| B. Depth distribution (LSS) | Predict softmax over depth bins per pixel; lift feature along ray weighted by distribution | Handles arbitrary heights; differentiable; end-to-end | Needs depth supervision or a self-supervised loss |
| C. Attention-based (BEVFormer) | BEV grid queries attend to image features via cross-attention; geometry encoded in positional embeddings | No explicit depth; learns flexible correspondences | Computationally heavier; less geometric interpretability |
| D. Occupancy prediction | Predict occupied / free / unknown per voxel directly; no explicit per-pixel depth | Task-aligned; avoids metric depth regression | Requires 3D occupancy ground truth |
Method B is the basis of Lift-Splat-Shoot (LSS) and BEVDet.
Method C is the basis of BEVFormer, DETR3D, and PETR.
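The core of Method B is a per-pixel outer product: each pixel's feature vector is smeared along its ray, weighted by the predicted depth distribution. A minimal NumPy sketch of that "lift" step (the subsequent "splat" into BEV cells, which scatters these frustum features with the camera geometry and sum-pools them, is omitted here):

```python
import numpy as np

def lss_lift(feat, depth_logits):
    """feat: (H, W, C); depth_logits: (H, W, D) -> (H, W, D, C) frustum features."""
    # Softmax over the D depth bins, numerically stabilized.
    e = np.exp(depth_logits - depth_logits.max(axis=-1, keepdims=True))
    depth_prob = e / e.sum(axis=-1, keepdims=True)
    # Outer product per pixel: a weighted copy of the feature at every depth bin.
    return depth_prob[..., :, None] * feat[..., None, :]
```

Because the depth weights sum to one, each pixel's total feature mass is preserved; a confident (peaked) distribution places the feature at one depth, while an uncertain (flat) one spreads it along the whole ray — and everything stays differentiable end-to-end.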