[BEV] BEV Introduction

Understanding L2 and L2+ Perception: Why the Real Challenge Is System Design, Not Just Better Detection

When people talk about assisted driving, they often jump straight to compute power, perception models, or the latest sensor stack. But in practice, L2 and L2+ systems are defined less by any single model and more by application constraints, feature scope, and how the full perception-to-decision chain is designed.

At a high level, the needs of the application determine the system. A typical L2 system requires less than 20 TOPS of compute and focuses on relatively structured environments such as highways and bridges. L2+ expands that scope, often adding features like automated lane changes and on-ramp or off-ramp assistance. Platforms from vendors like Nvidia offer roughly 200 to 300 TOPS per card and can reach about 1000 TOPS when several cards are combined, but raw compute alone does not solve the core autonomy problem.

What L2 Usually Means in Practice

A basic L2 driving stack is typically built around a small set of core functions:

LKA (Lane Keeping Assist) for keeping the vehicle centered in the lane
ACC (Adaptive Cruise Control) for maintaining speed and following distance
AEB (Automatic Emergency Braking) for emergency stopping when a collision risk is detected

In real deployments, these functions are rarely isolated. They depend heavily on the quality of lane perception, object detection, tracking, free-space understanding, and system-level fusion.

For L2+, the system often extends beyond simple lane centering and longitudinal control. Features like automatic lane changes or highway ramp handling introduce more edge cases, more rules, and more dependence on reliable scene understanding.

One key capability that becomes increasingly important is bird’s-eye-view (BEV) perception. A BEV representation gives the system a unified spatial understanding of lanes, vehicles, obstacles, and map priors, which is especially useful when multiple sensors need to be fused into a single planning-friendly scene representation.

The Typical L2 Perception Workflow

L2 systems fundamentally need a 360-degree view of the vehicle’s surroundings, yet current pipelines are long and multi-stage. A typical perception workflow looks like:

Input sensors: HD map (as a prior), cameras, radar, stereo cameras, ultrasonic sensors
Pre-processing: e.g., Inverse Perspective Mapping (IPM), image rectification and undistortion
Inference: detection and segmentation models
Post-processing: IPM-based transforms, range estimation, non-maximum suppression (NMS)
Multi-sensor fusion: combining outputs into a unified scene representation
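
As an illustration of the post-processing stage, here is a minimal sketch of greedy NMS over 2D boxes. The function names, the box layout (x1, y1, x2, y2), and the 0.5 IoU threshold are assumptions chosen for the example, not values taken from any particular production stack.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels (assumed layout)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_thresh: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


# Two overlapping detections of the same car plus one separate detection.
boxes = [(100, 100, 200, 200), (105, 102, 205, 198), (400, 120, 480, 190)]
scores = [0.90, 0.75, 0.60]
kept = nms(boxes, scores)
```

In this example the second box overlaps the first with an IoU of about 0.87, so it is suppressed and only the first and third boxes survive.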

The pipeline diagrams below illustrate common architectures seen in production L2/L2+ systems:

[Figure: Perception pipeline overview]

[Figure: Detailed pipeline stages]

Around View Monitor (AVM, panoramic surround-view system)

One important pre-processing subsystem is the Around View Monitor (AVM). It provides a seamless top-down surround view by:

Undistorting fisheye images from multiple surround cameras
Rectifying each image to a common projection
Stitching them together into a unified bird’s-eye-view

This is especially valuable for low-speed maneuvers and close-range obstacle detection.
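
The undistortion step can be sketched for a single pixel as follows. This is a simplified illustration that assumes an ideal equidistant fisheye model (image radius r = f * theta) with no extra distortion coefficients; real AVM systems typically use a calibrated polynomial model and build a remap table for the whole image. The function name and the intrinsics below are hypothetical.

```python
import math
from typing import Tuple


def undistort_equidistant(u: float, v: float, f: float,
                          cx: float, cy: float) -> Tuple[float, float]:
    """Map a fisheye pixel to the pixel an ideal pinhole camera would see.

    Assumes the ideal equidistant model r_fish = f * theta and the same
    focal length f and principal point (cx, cy) for both cameras.
    """
    dx, dy = u - cx, v - cy
    r_fish = math.hypot(dx, dy)
    if r_fish == 0.0:
        return (u, v)  # the principal point maps to itself
    theta = r_fish / f  # angle between the incoming ray and the optical axis
    if theta >= math.pi / 2:
        raise ValueError("ray is at or beyond 90 degrees; no pinhole projection")
    r_pin = f * math.tan(theta)  # pinhole model: r = f * tan(theta)
    scale = r_pin / r_fish
    return (cx + dx * scale, cy + dy * scale)


# Hypothetical intrinsics: f = 300 px, principal point at (640, 400).
# A ray at 45 degrees lands at r = 300 * pi / 4 px in the fisheye image.
x, y = undistort_equidistant(640 + 300 * math.pi / 4, 400, 300.0, 640.0, 400.0)
```

The 45-degree ray moves from about 236 px to 300 px away from the image center, which shows how strongly a fisheye lens compresses the periphery. Rectification and stitching would then warp each undistorted image onto a common ground plane.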

The Typical L2 Perception Stack

In many L2 systems, the map serves as a prior, while the live perception stack combines information from cameras, radar, and sometimes stereo cameras.

A simplified perception chain often looks like this:

Sensors → detection / post-processing → fusion → downstream planning and control

Static Elements

For static scene understanding, the system typically needs to recognize:

Lane lines
Traffic lights
Traffic signs

Most of these start in image space and then need to be lifted into 3D space or transformed into a BEV representation. That transformation is not trivial. For example, lanes may be extracted using image-based detection, inverse perspective mapping (IPM), and curve fitting. Traffic lights and signs may require temporal filtering across frames to stabilize the output.
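
The curve-fitting step for lanes can be sketched as a least-squares quadratic fit. This sketch assumes the lane pixels have already been mapped to BEV coordinates by IPM, with y as the forward distance and x as the lateral offset, both in metres. The quadratic model and all names here are illustrative assumptions; production stacks often use cubic polynomials or splines.

```python
from typing import List, Sequence, Tuple


def solve3(a: List[List[float]], b: Sequence[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point set; cannot fit a quadratic")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_lane(points: Sequence[Tuple[float, float]]) -> List[float]:
    """Fit x = c0 + c1*y + c2*y^2 to BEV points (x lateral, y forward)."""
    s = [sum(y ** k for _, y in points) for k in range(5)]      # sums of y^k
    t = [sum(x * y ** k for x, y in points) for k in range(3)]  # sums of x*y^k
    normal = [[s[0], s[1], s[2]],
              [s[1], s[2], s[3]],
              [s[2], s[3], s[4]]]
    return solve3(normal, t)


# Synthetic lane points sampled every 5 m from a gently curving lane line.
true_coeffs = (1.8, 0.01, 0.002)
pts = [(true_coeffs[0] + true_coeffs[1] * y + true_coeffs[2] * y * y, float(y))
       for y in range(5, 45, 5)]
coeffs = fit_lane(pts)
```

With noise-free synthetic points the fit recovers the generating coefficients. With real detections, outlier rejection (for example RANSAC) is usually needed before fitting.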

Dynamic Elements

For dynamic objects such as cars and other road users, the system often detects 2D bounding boxes in camera images first, then estimates position and distance through a combination of methods:

Model-based distance estimation
IPM-based distance estimation
Ground-plane-based reasoning
Radar-assisted localization
Post-processing and tracking
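
Ground-plane-based reasoning, which is also the core idea behind IPM-based distance estimation, can be sketched as follows. The sketch assumes a pinhole camera with zero pitch and roll mounted at a known height above a perfectly flat road, and uses the bottom-center pixel of a 2D bounding box as the ground contact point. All names and numeric values are hypothetical.

```python
from typing import Tuple


def ground_point_from_pixel(u: float, v: float, fx: float, fy: float,
                            cx: float, cy: float,
                            cam_height: float) -> Tuple[float, float]:
    """Back-project a ground-contact pixel to (lateral, forward) metres.

    Assumes a flat road and a camera whose optical axis is parallel to it,
    so the horizon lies on image row cy.
    """
    dv = v - cy
    if dv <= 0:
        raise ValueError("pixel is at or above the horizon; it is not on the ground")
    forward = fy * cam_height / dv  # similar triangles: dv / fy = height / forward
    lateral = (u - cx) * forward / fx
    return lateral, forward


# Hypothetical camera: 1000 px focal length, 1280x720 image, mounted 1.5 m high.
# The bottom-center of a detected car's 2D box is at pixel (740, 410).
lateral, forward = ground_point_from_pixel(740, 410, 1000.0, 1000.0,
                                           640.0, 360.0, 1.5)

# The same box with its bottom edge one pixel higher in the image.
_, forward_shifted = ground_point_from_pixel(740, 409, 1000.0, 1000.0,
                                             640.0, 360.0, 1.5)
```

In this example a one-pixel shift of the box edge moves the estimate from 30 m to about 30.6 m. Under these assumptions the range error per pixel grows with the square of the distance, which is one reason radar and tracking are used to stabilize camera-based ranges.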

In many stacks, the front-view perception is camera-dominant, while rear dynamic object detection may be handled separately and sometimes more conservatively, depending on sensor placement and use case.

The practical outputs of such a system may include:

2D traffic light and sign detections
Lane detections
HD map priors
Radar-based BEV obstacles
Camera-based obstacle detections

These are then fused into a more stable world model.

Sensor Fusion Strategies

A central challenge in any multi-sensor system is how to combine information from cameras, LiDAR, and radar. There are two primary paradigms, along with hybrid approaches.

Early Fusion (Feature-Level Fusion)

Early fusion aligns LiDAR and camera information as early as possible before feeding it into the model. Common approaches include:

Projecting point clouds onto the image plane and appending color or image features to each point
Lifting image features into BEV or 3D space and encoding them alongside point cloud features

The model sees both sensor modalities simultaneously at the intermediate feature stage.

Example: “Attach the camera’s appearance information onto the point cloud first, then jointly decide whether it is a car.”
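
The first approach, projecting points onto the image and attaching color, can be sketched as follows. The sketch assumes the points are already expressed in the camera frame (X right, Y down, Z forward), meaning the LiDAR-to-camera extrinsics have been applied, and that the image is a plain nested list of RGB tuples. The function name and the tiny test image are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Pixel = Tuple[int, int, int]


def paint_points(points_cam: Sequence[Point], image: Sequence[Sequence[Pixel]],
                 fx: float, fy: float, cx: float, cy: float) -> List[tuple]:
    """Append the RGB value under each projected point: (X, Y, Z, r, g, b)."""
    height, width = len(image), len(image[0])
    painted = []
    for X, Y, Z in points_cam:
        if Z <= 0:
            continue  # behind the camera
        u = int(round(fx * X / Z + cx))  # pinhole projection to pixel column
        v = int(round(fy * Y / Z + cy))  # and pixel row
        if 0 <= u < width and 0 <= v < height:
            painted.append((X, Y, Z) + tuple(image[v][u]))
    return painted


# A 4x4 test image whose "color" encodes the pixel position: (u, v, 0).
image = [[(u, v, 0) for u in range(4)] for v in range(4)]
points = [
    (1.0, 0.0, 2.0),   # projects inside the image
    (0.0, 0.0, -1.0),  # behind the camera, dropped
    (10.0, 0.0, 1.0),  # projects outside the image, dropped
]
painted = paint_points(points, image, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

Only the first point survives, now carrying the color of pixel (3, 2). In a learned early-fusion model the appended values would more likely be image feature vectors or semantic scores than raw RGB.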

Key disadvantage: the resulting models are usually too large for edge deployment.

Feature Projection Methods

Feature projection methods such as MMF project 3D features onto 2D. This approach is geometrically lossy: points that are far apart in 3D can appear very close in 2D.
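
The geometric loss can be seen with a small numeric example. The sketch below projects two points with a pinhole model; the intrinsics and point coordinates are made-up values chosen so that both points lie on nearly the same camera ray.

```python
import math
from typing import Tuple


def project(point: Tuple[float, float, float], f: float,
            cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame point (X right, Y down, Z forward)."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)


f, cx, cy = 1000.0, 640.0, 360.0  # hypothetical intrinsics
near = (1.0, 0.0, 10.0)           # 10 m ahead
far = (5.05, 0.0, 50.0)           # 50 m ahead, on almost the same ray

gap_3d = math.dist(near, far)                                          # metres
gap_px = math.dist(project(near, f, cx, cy), project(far, f, cx, cy))  # pixels
```

The two points are about 40 m apart in 3D but land roughly one pixel apart in the image, so features attached to them become nearly indistinguishable after projection.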

[Figure: Geometric loss in feature projection]

Proposal-Based Methods

Methods like AVOD use proposals to find aligned 2D and 3D features and combine them for detection. These can be semantically lossy when key objects are occluded.

Both feature projection and proposal-based approaches tend to require large models, making them less suitable for deployment on resource-constrained platforms.

[Figure: Comparison of fusion method complexity]

Late Fusion (Decision-Level Fusion)

Late fusion runs each sensor’s detector independently and merges the results at the output level through:

Bounding box matching
Non-maximum suppression (NMS)
Confidence-weighted merging
Rule-based or track-level fusion

Example: “The camera says there’s a car here; LiDAR also says there’s a car here. Then decide whether they refer to the same target and merge.”
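
A minimal sketch of this matching-and-merging logic is shown below. It assumes both detectors output axis-aligned boxes in a shared BEV frame, uses greedy IoU matching with an arbitrary 0.3 threshold, and averages box coordinates weighted by confidence. The Detection class, the threshold, and the choice to keep the higher of the two scores are illustrative assumptions, not a description of any specific production system.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in BEV metres (assumed)


@dataclass
class Detection:
    box: Box
    score: float  # detector confidence, assumed to be > 0
    source: str


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def fuse(cam: List[Detection], lidar: List[Detection],
         iou_thresh: float = 0.3) -> List[Detection]:
    """Greedy late fusion: match by IoU, merge matches, keep the rest as-is."""
    fused: List[Detection] = []
    used = set()
    for c in sorted(cam, key=lambda d: d.score, reverse=True):
        best_j, best_iou = -1, iou_thresh
        for j, l in enumerate(lidar):
            overlap = iou(c.box, l.box)
            if j not in used and overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j < 0:
            fused.append(c)  # camera-only detection
            continue
        l = lidar[best_j]
        used.add(best_j)
        w = c.score + l.score
        box = tuple((c.score * cv + l.score * lv) / w
                    for cv, lv in zip(c.box, l.box))
        fused.append(Detection(box, max(c.score, l.score), "camera+lidar"))
    fused.extend(l for j, l in enumerate(lidar) if j not in used)  # LiDAR-only
    return fused


cam_dets = [Detection((10.0, -1.0, 14.0, 1.0), 0.6, "camera")]
lidar_dets = [Detection((10.4, -1.0, 14.4, 1.0), 0.9, "lidar"),
              Detection((30.0, 2.0, 34.0, 4.0), 0.8, "lidar")]
fused = fuse(cam_dets, lidar_dets)
```

Even in this toy version, the threshold, the matching order, and the score rule are all hand-picked, which hints at how the rule set grows once cross-camera objects and edge cases are added.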

A key disadvantage of late fusion is the proliferation of hand-crafted rules, which can become a bottleneck for cross-camera objects and edge cases.

Why Mono3D Is Tempting but Limited

Monocular 3D perception is attractive because it avoids the cost and complexity of extra sensors. But in practice, Mono3D often struggles with accuracy and robustness, especially under long-range, occlusion-heavy, or unusual road conditions.

Data-driven methods for range estimation are preferred in principle, but their capabilities are not yet strong enough for full reliance. Representative approaches include:

FCOS3D: a fully convolutional one-stage monocular 3D object detector with promising results, but it requires large datasets and careful tuning
Mono3D: typically exhibits low accuracy and low robustness in real-world conditions
Pseudo-LiDAR: converts stereo or mono depth estimates into point clouds, then applies 3D detection; accuracy is often still insufficient to fully replace direct sensing
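
The depth-to-point-cloud step behind Pseudo-LiDAR can be sketched as a pinhole back-projection. The sketch assumes a dense depth map in metres stored as a nested list, with non-positive values marking invalid pixels. The function name and the intrinsics are hypothetical, and the 3D detector that would consume the points is not shown.

```python
from typing import List, Sequence, Tuple


def depth_to_points(depth: Sequence[Sequence[float]], fx: float, fy: float,
                    cx: float, cy: float) -> List[Tuple[float, float, float]]:
    """Back-project each valid depth pixel to a camera-frame point (X, Y, Z)."""
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0:
                continue  # invalid or missing depth (assumed convention)
            x = (u - cx) * d / fx
            y = (v - cy) * d / fy
            points.append((x, y, d))
    return points


# A 2x2 depth map with two valid pixels and hypothetical intrinsics.
depth_map = [[0.0, 10.0],
             [20.0, 0.0]]
cloud = depth_to_points(depth_map, fx=100.0, fy=100.0, cx=0.5, cy=0.5)
```

Any error in the estimated depth carries directly into the Z coordinate and is scaled into X and Y as well, which is consistent with the accuracy limits noted above.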

As a result, many production systems still rely on a mix of geometric priors, hand-engineered post-processing, and deep learning rather than betting entirely on end-to-end monocular 3D.

Today, the mainstream is still a hybrid paradigm: geometry plus deep learning, rather than pure deep learning alone.