[BEV] BEV Introduction: Tesla's Challenges and Architecture

Why per-camera detection falls short, and how BEV solves it

Posted by Rico's Nerd Cluster on April 5, 2026

1. Tesla’s Perception Challenges (2021)

Before BEV, Tesla’s pipeline detected objects and lanes independently in each camera view and then tried to fuse the results. This created fundamental problems:

Lane detection in perspective image space was unreliable.

Per-camera detect-then-fuse breaks for large objects. If a truck spans multiple cameras, how do you tell which detections belong to the same object? Recovering the full 3D shape of a large vehicle from disjointed per-camera boxes is hard.

No shared spatial context. Each camera sees its own patch of the world. Questions like “how fast is that truck moving?”, “is it double-parked?”, and “is there a pedestrian behind it?” need a shared spatial frame to answer reliably.

Lane markings were hard to keep consistent across camera views and over time.

Tesla’s solution: Move to a unified top-down (“local map”) representation — a Bird’s Eye View (BEV). BEV provides a single, ego-centric spatial grid where features from all cameras can be fused in a common coordinate frame, and temporal accumulation is straightforward.


2. The Evolution of Tesla’s Vision Stack

2017: Per-image range detection (regression) and classification. One model, one camera, one task at a time.

Later: Multi-camera, multi-task models. Several cameras feed into shared feature extractors, and the outputs include depth, segmentation, and object detection produced simultaneously.

Vector Space / Feature Queue: Features are not just pooled but stored in a feature queue across space and time. This supports:

  • Spatial: merging overlapping camera fields of view into a consistent grid
  • Temporal: accumulating features from past frames to handle occlusion and velocity estimation

The transformer-based architecture works as follows:

  1. Take images from multiple cameras at multiple timesteps; rectify them
  2. Feed each image through an image feature extractor (backbone)
  3. Generate keys and values per image; generate a spatial BEV query over the shared grid
  4. Cross-attention produces BEV-aligned spatial features, all referenced to the same ego frame at time $T$
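
Below is a minimal PyTorch sketch of steps 3–4: learnable BEV queries cross-attend to the concatenated per-camera image features. All sizes and names, and the use of plain full attention (rather than the deformable or sparse attention production systems favor), are illustrative assumptions, not Tesla's implementation.

```python
import torch
import torch.nn as nn

n_cams, tokens_per_cam, d = 6, 240, 128   # cameras, feature tokens per image, channels
bev_h, bev_w = 32, 32                     # toy BEV grid (real grids are much larger)

img_feats = torch.randn(n_cams * tokens_per_cam, 1, d)         # keys/values from all backbones
bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, 1, d))   # one learnable query per BEV cell

# In practice queries and keys also carry 3D positional embeddings encoding camera
# geometry; that detail is omitted here.
attn = nn.MultiheadAttention(embed_dim=d, num_heads=8)
bev_feats, _ = attn(query=bev_queries, key=img_feats, value=img_feats)
bev_feats = bev_feats.permute(1, 2, 0).reshape(1, d, bev_h, bev_w)  # BEV feature map at time T
```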

Tesla is notable for using no HD map: all spatial context is built on the fly from vision alone. Their closed-loop data engine (edge-case triggers, auto-labeling, retraining, redeployment) is a major competitive advantage. The system powers NOA (Navigate on Autopilot) in city driving.


3. Tesla’s Full Perception Pipeline

Target: L2+ (and eventually FSD) perception from vision only.

The pipeline has five stages:

| Stage | What it does |
| --- | --- |
| 1. Feature extraction | Each camera image passes through a backbone (e.g. ResNet, RegNet) to produce a rich feature map |
| 2. View transform | Image features are lifted from perspective views into a shared BEV / 3D vector space using cross-attention or geometric projection |
| 3. Spatial fusion | BEV features from all cameras are merged into a single ego-centric grid (Spatial Transformer, BEV) |
| 4. Temporal fusion | Consecutive BEV frames are aligned (using ego-motion) and fused to aggregate motion cues and reduce occlusion uncertainty |
| 5. Multi-task heads | The fused BEV representation feeds task-specific heads: occupancy grid, free-space, parking, lane geometry, object detection |

Why temporal fusion after spatial fusion? The spatial transform produces a BEV frame tied to a single timestep. Temporal fusion then aligns and merges multiple such BEV frames across time using ego-motion, which is easier and more principled in BEV space than in perspective image space.
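
Here is a toy sketch of that alignment step, assuming a 2D rigid ego-motion and PyTorch's grid-sampling utilities; the motion values and the simple averaging fusion are placeholders, not the production fusion module.

```python
import math
import torch
import torch.nn.functional as F

prev_bev = torch.randn(1, 128, 200, 200)    # BEV features from t-1
curr_bev = torch.randn(1, 128, 200, 200)    # BEV features from t

yaw = 0.05                                  # ego rotation between frames (rad), placeholder
tx, ty = 0.02, 0.00                         # ego translation in normalized grid units, placeholder

cos, sin = math.cos(yaw), math.sin(yaw)
theta = torch.tensor([[[cos, -sin, tx],
                       [sin,  cos, ty]]])   # 2x3 affine mapping output coords to input coords

grid = F.affine_grid(theta, prev_bev.shape, align_corners=False)
prev_aligned = F.grid_sample(prev_bev, grid, align_corners=False)   # warp t-1 into frame t

fused_bev = 0.5 * (prev_aligned + curr_bev)  # real systems use a learned fusion module instead
```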


4. Training a BEV Network: Where Does Ground Truth Come From?

Camera-only networks still need depth or occupancy supervision at training time, but depth sensors are either absent from the vehicle or intentionally excluded. The answer is an offline auto-labeling pipeline.

4.1 Offline Reconstruction Pipeline

Companies run a heavy reconstruction stack offline (after data collection, not in real time):

1. Collect: synchronized multi-camera video + calibration + ego-motion (GPS/IMU)
2. Reconstruct: run offline SLAM / SfM / MVS / bundle-adjustment stack
3. Label:  generate pseudo-ground-truth targets
             ├── 3D points / surfaces
             ├── object tracks and bounding boxes
             ├── occupancy volumes
             ├── lane geometry
             └── free-space masks
4. Train:  supervise online network to predict those targets from raw images alone

Depth output density varies by method:

| Output | Typical source |
| --- | --- |
| Sparse depth | SfM / feature matching |
| Semi-dense depth | Direct methods (LSD-SLAM, DSO) |
| Dense depth | Multiview stereo, depth completion |
| Surface estimates | TSDF fusion, mesh reconstruction |

4.2 Voxel Occupancy as Training Target

Rather than regressing metric depth per pixel, it is more useful to voxelize the scene and label each cell:

  • Occupied — a reconstructed surface or tracked object is present
  • Free — a camera ray passed through without hitting anything
  • Unknown — no ray coverage

This representation is aligned with the driving task and avoids the difficulties of per-pixel metric depth regression.
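
As a concrete (hypothetical) illustration of how such labels can be produced, the sketch below steps along each reconstructed camera ray, marking traversed voxels as free and the hit voxel as occupied; the grid size, resolution, and the `carve_ray` helper are made up for this example.

```python
import numpy as np

UNKNOWN, FREE, OCCUPIED = 0, 1, 2
grid = np.full((100, 100, 20), UNKNOWN, dtype=np.uint8)    # ego-centric voxel grid
voxel_size = 0.5                                           # metres per voxel
origin = np.array([-25.0, -25.0, -2.0])                    # grid origin in the ego frame

def carve_ray(cam_center, hit_point, step=0.25):
    """Mark voxels between the camera and a reconstructed surface point."""
    direction = hit_point - cam_center
    dist = np.linalg.norm(direction)
    direction /= dist
    for d in np.arange(0.0, dist, step):                   # free space along the ray
        idx = np.floor((cam_center + d * direction - origin) / voxel_size).astype(int)
        if (idx >= 0).all() and (idx < grid.shape).all():
            grid[tuple(idx)] = FREE
    idx = np.floor((hit_point - origin) / voxel_size).astype(int)
    if (idx >= 0).all() and (idx < grid.shape).all():
        grid[tuple(idx)] = OCCUPIED                         # the reconstructed surface itself

carve_ray(np.array([0.0, 0.0, 1.5]), np.array([12.0, 3.0, 0.2]))
```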

4.3 Feature Reprojection Loss (Self-Supervised Signal)

An additional training signal requires no offline reconstruction at all:

  1. Predict depth or lifted 3D features from frame $t$
  2. Project them into another camera or frame $t+1$ using known ego-motion
  3. Compare against actual observations there (photometric or feature-level loss)

This is the basis of methods like Monodepth2 and SurroundDepth.
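
Here is a hedged sketch of such a reprojection loss in the spirit of Monodepth2 (not its actual code): synthesize frame $t$ from frame $t+1$ using the predicted depth and known ego-motion, then penalize the photometric difference. The intrinsics, pose, and image sizes are placeholders.

```python
import torch
import torch.nn.functional as F

B, H, W = 1, 96, 320
img_t  = torch.rand(B, 3, H, W)             # target frame t
img_t1 = torch.rand(B, 3, H, W)             # source frame t+1
depth  = torch.rand(B, 1, H, W) * 50 + 1    # predicted depth for frame t
K = torch.tensor([[[160.0, 0, W / 2], [0, 160.0, H / 2], [0, 0, 1]]])  # placeholder intrinsics
T = torch.eye(4).unsqueeze(0)               # ego-motion t -> t+1 (placeholder: 0.5 m shift)
T[:, 0, 3] = 0.5

# Back-project pixels of frame t into 3D camera coordinates.
v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                      torch.arange(W, dtype=torch.float32), indexing="ij")
pix = torch.stack([u, v, torch.ones_like(u)]).reshape(1, 3, -1)   # homogeneous pixels
cam = torch.linalg.inv(K) @ pix * depth.reshape(B, 1, -1)         # 3D points in frame t

# Transform into frame t+1 and project back to pixel coordinates.
cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)
cam_t1 = (T @ cam_h)[:, :3]
proj = K @ cam_t1
uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)

# Normalize to [-1, 1] and sample frame t+1 at the reprojected locations.
uv_norm = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
warped = F.grid_sample(img_t1, uv_norm.reshape(B, H, W, 2), align_corners=True)

loss = (warped - img_t).abs().mean()        # photometric L1; Monodepth2 adds SSIM, masking, etc.
```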


5. Multi-View Geometry: The Offline Reconstruction Stack

The offline pipeline is built on classical multi-view geometry, extended with dense methods:

5.1 Sparse Pipeline

1. Detect and match feature points (SIFT, SuperPoint, ORB, ...)
2. Apply epipolar geometry + RANSAC to filter bad matches
3. Recover relative camera poses from fundamental / essential matrix
4. Triangulate matched point pairs into 3D
5. Refine globally with bundle adjustment → sparse point cloud

Epipolar geometry and triangulation give the geometric skeleton.
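
Below is a minimal two-view version of steps 1–4 using OpenCV, as a sketch rather than a full SfM system (no multi-view loop, no bundle adjustment); `img1`, `img2`, and the intrinsics `K` are assumed inputs.

```python
import cv2
import numpy as np

def two_view_reconstruct(img1, img2, K):
    """Sparse reconstruction from two grayscale images with known intrinsics K."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)            # 1. detect + describe
    kp2, des2 = orb.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # 2-3. Essential matrix with RANSAC, then relative camera pose.
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

    # 4. Triangulate matches into 3D (up to scale; bundle adjustment would refine this).
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    return (pts4d[:3] / pts4d[3]).T                          # N x 3 sparse point cloud
```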

5.2 Densification and Semantic Enrichment

| Technique | Output |
| --- | --- |
| Multiview stereo (MVS) | Dense depth / point cloud |
| Plane / surface fitting | Ground plane, facades |
| Temporal fusion | Consistent HD map across drives |
| Semantic segmentation | Per-voxel class labels |
| Object-level reconstruction | Tracked 3D bounding boxes |

The full pipeline densifies, cleans, and semantically organizes the skeleton into the rich training targets (lanes, curbs, occupancy volumes, tracked objects) that sparse SfM alone cannot provide.


6. View Transformation: From Images to BEV

6.1 The Depth Ambiguity Problem

A single pixel $(u, v)$ maps not to one 3D point but to an entire ray:

\[\mathbf{p}_{3D} = \mathbf{o} + d \cdot \hat{\mathbf{r}}_{u,v}, \quad d \in [d_{\min}, d_{\max}]\]

Without knowing $d$, you cannot assign that image feature to a unique BEV grid cell. This is the core difficulty of perspective-to-BEV lifting.
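
In code, the same statement looks like this (a NumPy sketch with made-up intrinsics): without a depth value a pixel only yields a direction, and each depth hypothesis gives a different candidate 3D point.

```python
import numpy as np

K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])             # placeholder camera intrinsics

u, v = 800.0, 400.0                         # the pixel in question
ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
ray /= np.linalg.norm(ray)                  # unit direction r_hat for pixel (u, v)

# Every depth hypothesis d gives a different candidate 3D point on that ray.
candidates = [d * ray for d in (2.0, 5.0, 10.0, 40.0)]
```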

6.2 IPM — Inverse Perspective Mapping

IPM resolves the ambiguity by assuming all scene points lie on the ground plane ($Z = 0$). The constraint turns the projection into a planar homography — closed-form, no learning required.

  • Good for: flat road surface, lane markings
  • Bad for: vehicles, pedestrians, curbs, overpasses
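
A sketch of the homography under that assumption: for ground points with $Z = 0$, the projection collapses to $H = K\,[\,\mathbf{r}_1\ \mathbf{r}_2\ \mathbf{t}\,]$, so the road can be warped to a top-down view in closed form. The intrinsics, extrinsics, and BEV canvas scaling below are placeholder values.

```python
import numpy as np
import cv2

K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1.0]])   # intrinsics (placeholder)
R = cv2.Rodrigues(np.array([np.deg2rad(-80.0), 0.0, 0.0]))[0]     # world-to-camera rotation (placeholder)
t = np.array([[0.0], [0.0], [1.5]])                               # world-to-camera translation (placeholder)

# Points on the ground plane (X, Y, 0) project as K [r1 r2 t] [X Y 1]^T.
H_ground_to_img = K @ np.hstack([R[:, :2], t])

# Map ground metres to a 400x400 BEV canvas (10 px per metre), then invert.
S = np.array([[10.0, 0, 200], [0, -10.0, 400], [0, 0, 1.0]])
H_img_to_bev = S @ np.linalg.inv(H_ground_to_img)

# bev = cv2.warpPerspective(img, H_img_to_bev, (400, 400))   # img: the camera frame
```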

6.3 Why Lift Features, Not Raw Pixels

| Warp pixels first | Warp features first |
| --- | --- |
| Heavy distortion and missing regions | Features already encode edges, objects, lanes |
| Backbone sees broken, unrealistic input | Backbone invariances (lighting, viewpoint) carry over |

Correct order:

image → backbone → feature map → geometric lifting → BEV fusion

6.4 Four Lifting Methods

| Method | Mechanism | Papers |
| --- | --- | --- |
| A. IPM / flat-ground | Ground-plane homography; no depth network | Classic |
| B. Depth distribution | Predict softmax over depth bins; lift feature along ray | LSS, BEVDet |
| C. Cross-attention | BEV queries attend to image features; geometry in positional embeddings | BEVFormer, DETR3D, PETR |
| D. Occupancy prediction | Predict voxel occupancy directly; bypass explicit depth | MonoScene, TPVFormer |
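
To make method B concrete, here is a toy PyTorch sketch in the spirit of LSS: a small head predicts a categorical distribution over depth bins, and each pixel's feature is spread along its ray weighted by those probabilities. The shapes and the 1x1-conv depth head are illustrative, and the final splat into the BEV grid is omitted.

```python
import torch
import torch.nn as nn

C, D, H, W = 64, 48, 16, 44            # feature channels, depth bins, feature map size
feat = torch.randn(1, C, H, W)         # backbone features for one camera

depth_head = nn.Conv2d(C, D, kernel_size=1)
depth_prob = depth_head(feat).softmax(dim=1)             # (1, D, H, W) distribution over depth bins

# Outer product: each pixel's feature is placed at every depth bin along its ray,
# weighted by the predicted probability of that depth.
lifted = depth_prob.unsqueeze(1) * feat.unsqueeze(2)     # (1, C, D, H, W)

# Each (depth bin, pixel) cell has a known 3D location from calibration, so `lifted`
# can then be splatted (sum-pooled) into the BEV grid; that step is omitted here.
```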