FEB-YOLOv8

Posted by Rico's Nerd Cluster on April 28, 2026

FEB-YOLOv8 tries to make YOLOv8n small enough for underwater robots, then wins back the lost accuracy with attention and better small-object feature fusion. The three main changes are P-C2f, EMA attention, and an improved BiFPN/P2 feature pyramid. The authors report that this raises mAP while reducing both parameters and GFLOPs relative to YOLOv8n on DUO and URPC2020. ([PLOS][1])


1. Overall mental model

YOLOv8n already has the usual structure:

Input → Backbone → Neck/FPN-PAN → Detection Head

FEB-YOLOv8 changes it like this:

YOLOv8n backbone C2f
        ↓
replace Bottleneck with FasterNet/PConv block
        ↓
lighter backbone, fewer redundant conv operations

then add EMA attention
        ↓
recover feature quality by focusing on useful underwater object regions

then improve feature pyramid with P2 + BiFPN-style fusion
        ↓
better small-object detection

The authors motivate this from underwater detection: objects are often small, clustered, blurred, low-contrast, and hard to separate from the background. At the same time, underwater robots have limited compute/storage, so the model must be lightweight. ([PLOS][1])


2. Improvement 1: P-C2f - make C2f cheaper

YOLOv8’s C2f is good because it improves gradient flow by splitting features, passing some through bottlenecks, and concatenating intermediate outputs. But the bottleneck blocks still apply regular convolutions to every channel, which is computationally expensive.

FEB-YOLOv8 says:

Do not convolve every channel. Convolve only a subset of channels, keep the rest as cheap identity channels, then mix with 1×1 convolutions.

This is based on Partial Convolution / PConv from FasterNet. In this paper, PConv is used to replace the bottleneck inside C2f, forming P-C2f. ([PLOS][1])
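
As a toy illustration of the idea (not the paper's implementation), the sketch below applies a fixed 3-tap mean filter, standing in for a learned 3×3 convolution, to the first quarter of the channels of a 1-D signal and passes the rest through unchanged. The helper names and input values are made up for the example, and the 1×1 channel-mixing step is left out.

```python
def box_filter_1d(signal):
    """Stand-in for a learned convolution: 3-tap mean filter with zero padding."""
    padded = [0.0] + list(signal) + [0.0]
    return [sum(padded[i:i + 3]) / 3 for i in range(len(signal))]


def partial_conv(channels, ratio=0.25):
    """Filter the first c_p channels, pass the remaining ones through untouched."""
    c_p = int(len(channels) * ratio)
    convolved = [box_filter_1d(ch) for ch in channels[:c_p]]
    return convolved + [list(ch) for ch in channels[c_p:]]


# Four channels, three positions each (illustrative values).
channels = [[3.0, 0.0, 3.0], [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
out = partial_conv(channels)
print(out)
```

Only the first channel changes; the other three are identity copies, which is where the saving comes from.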


Math

For a normal convolution with feature map size H × W, input channels C, output channels C, and kernel size k × k:

\[\text{FLOPs}_{conv} = H W k^2 C^2\]

For PConv, only \(C_p\) channels are convolved:

\[\text{FLOPs}_{pconv} = H W k^2 C_p^2\]

If:

\[C_p = \frac{C}{4}\]

then:

\[\text{FLOPs}_{pconv} = H W k^2 \left(\frac{C}{4}\right)^2 = \frac{1}{16} H W k^2 C^2\]

So under this formula the spatial convolution costs 1/16 as much as a full convolution. The paper explicitly states this 1/16 FLOP reduction when \(C_p = C/4\). ([PLOS][1])
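
To make the ratio concrete, here is a small sketch that evaluates both FLOP formulas for a single layer. The feature-map size (80 × 80), the channel count (64), and the function names are illustrative assumptions, not values from the paper.

```python
def conv_flops(h: int, w: int, k: int, c: int) -> int:
    """FLOPs of a k x k convolution with c input and c output channels."""
    return h * w * k * k * c * c


def pconv_flops(h: int, w: int, k: int, c: int, ratio: float = 0.25) -> int:
    """FLOPs of a partial convolution that only convolves c_p = ratio * c channels."""
    c_p = int(c * ratio)
    return h * w * k * k * c_p * c_p


# Illustrative layer: 80 x 80 feature map, 64 channels, 3 x 3 kernel (assumed values).
full = conv_flops(80, 80, 3, 64)
partial = pconv_flops(80, 80, 3, 64)
print(full, partial, full / partial)
```

Keep in mind that the 1×1 convolutions that follow PConv in a FasterNet block still act on all C channels, so the saving for the whole block is smaller than 1/16.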


Pseudocode: normal C2f

class C2f:
    def forward(self, x):
        # expand / project
        y = conv1x1(x)

        # split channels into two parts
        y0, y1 = split(y)

        outputs = [y0, y1]

        # normal YOLOv8 C2f uses Bottleneck blocks
        cur = y1
        for block in bottleneck_blocks:
            cur = block(cur)
            outputs.append(cur)

        # concatenate all partial outputs
        out = concat(outputs, dim="channel")

        # fuse channels
        return conv1x1(out)

Pseudocode: P-C2f

class PartialConv:
    def __init__(self, channels, ratio=0.25):
        self.cp = int(channels * ratio)
        self.conv3x3 = Conv3x3(self.cp, self.cp)

    def forward(self, x):
        x_conv = x[:, :self.cp, :, :]
        x_skip = x[:, self.cp:, :, :]

        x_conv = self.conv3x3(x_conv)

        return concat([x_conv, x_skip], dim="channel")

class FasterNetBlock:
    def forward(self, x):
        shortcut = x

        # spatial mixing on only part of channels
        y = partial_conv(x)

        # channel mixing
        y = conv1x1_expand(y)
        y = activation(y)
        y = conv1x1_project(y)

        return shortcut + y

class PC2f:
    def forward(self, x):
        y = conv1x1(x)
        y0, y1 = split(y)

        outputs = [y0, y1]

        cur = y1
        for block in fasternet_blocks:
            cur = block(cur)
            outputs.append(cur)

        out = concat(outputs, dim="channel")
        return conv1x1(out)

The important point:

C2f:   Bottleneck = full conv on many channels
P-C2f: Bottleneck = PConv + cheap channel mixing
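
A rough weight count shows where the saving comes from. The sketch below assumes a Bottleneck made of two 3×3 convolutions at full width (as in the Ultralytics C2f), a FasterNet-style block with \(C_p = C/4\) and a 1×1 expansion ratio of 2, and it ignores biases and normalization layers. The paper's exact configuration may differ, and the function names are hypothetical.

```python
def bottleneck_params(c: int, k: int = 3) -> int:
    """Weights in two k x k convolutions, each mapping c -> c channels."""
    return 2 * k * k * c * c


def fasternet_block_params(c: int, k: int = 3, ratio: float = 0.25, expand: int = 2) -> int:
    """Weights in PConv (k x k on c_p channels) plus two 1 x 1 convolutions."""
    c_p = int(c * ratio)
    pconv = k * k * c_p * c_p          # spatial mixing on a channel subset
    expand_conv = c * (expand * c)     # 1 x 1, c -> expand * c
    project_conv = (expand * c) * c    # 1 x 1, expand * c -> c
    return pconv + expand_conv + project_conv


c = 64  # illustrative channel width, not taken from the paper
print(bottleneck_params(c), fasternet_block_params(c))
```

Under these assumptions the FasterNet-style block has roughly a quarter of the Bottleneck's weights, and most of what remains sits in the two 1×1 convolutions.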

3. Improvement 2: EMA attention - recover accuracy after making the model lighter

PConv makes the model cheaper, but it may weaken representation quality because fewer channels receive full spatial convolution. EMA (Efficient Multi-Scale Attention) is added to compensate. Underwater images contain noise, blur, low contrast, and small objects; EMA helps the network emphasize useful spatial regions and multi-scale information instead of treating all spatial and channel features equally. According to the paper, EMA uses grouped channel reshaping, horizontal/vertical pooling, 1×1 and 3×3 branches, global average pooling, Softmax, and cross-dimensional interaction. ([PLOS][1])


Math

Let the input feature be:

\[X \in \mathbb{R}^{B \times C \times H \times W}\]

EMA first groups the channels:

\[X_g \in \mathbb{R}^{(B \cdot G) \times (C/G) \times H \times W}\]
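
To see what this reshape does to indices, here is a small sketch assuming a contiguous (row-major) reshape, as the formula suggests. `grouped_shape` and `grouped_index` are hypothetical helper names, and the sizes are illustrative.

```python
def grouped_shape(b: int, c: int, h: int, w: int, groups: int):
    """Shape after folding the group axis into the batch axis."""
    assert c % groups == 0, "channels must be divisible by groups"
    return (b * groups, c // groups, h, w)


def grouped_index(batch: int, channel: int, c: int, groups: int):
    """Where (batch, channel) lands after a row-major reshape."""
    per_group = c // groups
    return (batch * groups + channel // per_group, channel % per_group)


print(grouped_shape(2, 64, 20, 20, groups=8))
print(grouped_index(batch=1, channel=10, c=64, groups=8))
```

Each group is then processed like an independent sample, so the 1×1 and 3×3 convolutions only need C/G input and output channels.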

Then it computes directional pooling.

Horizontal context (average over the width axis, one value per row):

\[z_h(h) = \frac{1}{W} \sum_{w=1}^{W} X_g(:, :, h, w)\]

Vertical context (average over the height axis, one value per column):

\[z_w(w) = \frac{1}{H} \sum_{h=1}^{H} X_g(:, :, h, w)\]

These are concatenated and passed through a 1×1 convolution:

\[q = Conv_{1 \times 1}([z_h, z_w])\]

Then split into height and width gates:

\[a_h, a_w = split(q)\]

Apply sigmoid gates:

\[\hat{X}_1 = X_g \cdot \sigma(a_h) \cdot \sigma(a_w)\]

In parallel, EMA also computes a 3×3 branch:

\[\hat{X}_2 = Conv_{3 \times 3}(X_g)\]

Then it uses global average pooling and Softmax to produce cross-spatial attention weights:

\[s_1 = Softmax(GAP(\hat{X}_1))\]

\[s_2 = Softmax(GAP(\hat{X}_2))\]

A simplified version of the final spatial attention map is:

\[A = \sigma(s_1 \cdot flatten(\hat{X}_2) + s_2 \cdot flatten(\hat{X}_1))\]

Then:

\[Y_g = X_g \cdot A\]

Finally reshape grouped features back:

\[Y \in \mathbb{R}^{B \times C \times H \times W}\]

This is the operational math implied by the paper's EMA description: grouped features, horizontal/vertical pooling, 1×1/3×3 branches, GAP, Softmax, cross-space aggregation, and reweighting. ([PLOS][1])
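
The cross-spatial step (the Softmax weights and the attention map A) can be traced by hand on a toy input. The following is a dependency-free sketch for a single group with two channels on a flattened 2 × 2 map; it omits the batch dimension and any normalization layer, and the function names and numbers are made up for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_spatial_attention(x1, x2):
    """x1, x2: lists of channels, each channel a flat list of H*W values.

    Returns one sigmoid attention weight per spatial position.
    """
    # Global average pooling: one descriptor per channel.
    gap1 = [sum(ch) / len(ch) for ch in x1]
    gap2 = [sum(ch) / len(ch) for ch in x2]

    # Softmax across channels gives s_1 and s_2.
    s1, s2 = softmax(gap1), softmax(gap2)

    # s_1 weights the channels of x2, s_2 weights the channels of x1.
    n_pos = len(x1[0])
    logits = [
        sum(s1[c] * x2[c][p] + s2[c] * x1[c][p] for c in range(len(x1)))
        for p in range(n_pos)
    ]
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]


# Position 3 responds strongly in both branches; the other positions are zero.
x1 = [[0.0, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 1.0]]
x2 = [[0.0, 0.0, 0.0, 3.0], [0.0, 0.0, 0.0, 1.0]]
attention = cross_spatial_attention(x1, x2)
print(attention)
```

Positions with no response get the neutral sigmoid value 0.5, while the position that is active in both branches is pushed towards 1.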


Small mental example

Imagine a feature map where the true object is a tiny starfish in the lower center of the image.

Without EMA:

background texture, sand, water noise, and object features all flow forward similarly

With EMA:

horizontal pooling: "something important appears around this row"
vertical pooling:   "something important appears around this column"
3×3 branch:         "local texture looks object-like"
cross-space fusion: "boost this small spatial region"

So EMA creates a soft attention mask that says:

Pay more attention here.
Ignore noisy water/sand elsewhere.
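
The pooling half of this intuition can be reproduced on a toy activation map. The sketch below uses a single-channel 5 × 5 map with made-up values, where the "starfish" sits in the lower center, and computes only the pooled descriptors \(z_h\) and \(z_w\); the learned 1×1 convolution and the gates are not modeled.

```python
# Toy single-channel 5 x 5 activation map; the "starfish" sits at row 3, column 2.
feature = [
    [0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1],
]

height, width = len(feature), len(feature[0])

# z_h: average over the width axis, one value per row.
z_h = [sum(row) / width for row in feature]

# z_w: average over the height axis, one value per column.
z_w = [sum(feature[r][c] for r in range(height)) / height for c in range(width)]

peak_row = z_h.index(max(z_h))
peak_col = z_w.index(max(z_w))
print(peak_row, peak_col)  # row and column where the pooled response is largest
```

The two 1-D descriptors are enough to localize the object's row and column, which is the information the height and width gates build on.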

Pseudocode: EMA

class EMA:
    def __init__(self, channels, groups):
        self.groups = groups
        self.group_channels = channels // groups

        self.conv1x1 = Conv1x1(self.group_channels, self.group_channels)
        self.conv3x3 = Conv3x3(self.group_channels, self.group_channels)

    def forward(self, x):
        B, C, H, W = x.shape
        G = self.groups

        # reshape channels into grouped batch dimension
        xg = x.reshape(B * G, C // G, H, W)

        # directional pooling
        pool_h = avg_pool_width(xg)   # shape: B*G, C/G, H, 1
        pool_w = avg_pool_height(xg)  # shape: B*G, C/G, 1, W

        # align and concatenate spatial descriptors
        pool_w = transpose_hw(pool_w)
        pooled = concat([pool_h, pool_w], dim="height")

        # 1x1 branch creates coordinate-style gates
        gates = self.conv1x1(pooled)

        gate_h, gate_w = split_height_width(gates, H, W)
        gate_w = transpose_hw(gate_w)

        x1 = xg * sigmoid(gate_h) * sigmoid(gate_w)

        # 3x3 branch captures local spatial context
        x2 = self.conv3x3(xg)

        # cross-space attention
        w1 = softmax(global_avg_pool(x1), dim="channel")
        w2 = softmax(global_avg_pool(x2), dim="channel")

        x1_flat = flatten_spatial(x1)  # B*G, C/G, H*W
        x2_flat = flatten_spatial(x2)

        attn = matmul(w1, x2_flat) + matmul(w2, x1_flat)
        attn = sigmoid(attn.reshape(B * G, 1, H, W))

        out = xg * attn

        # restore original shape
        return out.reshape(B, C, H, W)

4. Improvement 3: BiFPN-style feature pyramid + P2 - improve small-object detection

Mental model

YOLO necks usually combine features from different depths:

deep layers: strong semantics, weak spatial detail
shallow layers: weak semantics, strong spatial detail

For underwater objects, this matters a lot because many targets are small. If you only use deeper feature maps, small objects may disappear after repeated downsampling.

FEB-YOLOv8 modifies the feature pyramid by:

  1. adding P2 feature map information, which is higher-resolution and better for tiny objects;
  2. using cross-scale connections inspired by BiFPN;
  3. using weighted feature fusion instead of simple addition/concat. ([PLOS][1])
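
A quick calculation shows why the P2 level matters for tiny objects. The sketch assumes the usual pyramid convention in which level P_i has stride 2^i, a 640 × 640 input, and a 16-pixel-wide object; the input size and object size are illustrative choices, not figures from the paper.

```python
def grid_size(image_size: int, stride: int) -> int:
    """Side length of the feature map at a given stride."""
    return image_size // stride


image_size = 640   # assumed input resolution
object_size = 16   # assumed object width in pixels

for name, stride in [("P2", 4), ("P3", 8), ("P4", 16), ("P5", 32)]:
    cells = object_size / stride
    print(f"{name}: stride {stride:2d}, grid {grid_size(image_size, stride):3d}, "
          f"object spans {cells:.2f} cells")
```

At P5 the object covers half a cell, while at P2 it still covers four cells per side, so its shape survives long enough to be fused with deeper semantics.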

Math: weighted feature fusion

Standard FPN/PAN often does something like:

\[O = I_1 + I_2\]

or:

\[O = Conv(Concat(I_1, I_2))\]

BiFPN-style fusion learns how much to trust each input:

\[O = \frac{\sum_i w_i I_i}{\epsilon + \sum_i w_i}\]

where:

\[w_i \ge 0\]

Usually this is implemented as:

\[w_i = ReLU(\alpha_i)\]

So the model learns:

For this fusion node, should I trust high-resolution shallow features more?
Or deep semantic features more?

For small underwater objects, the answer is often:

Use more P2/P3 spatial detail, but still inject deeper semantics.

Pseudocode: weighted fusion

class WeightedFusion:
    def __init__(self, n_inputs, eps=1e-4):
        self.raw_weights = Parameter(ones(n_inputs))
        self.eps = eps

    def forward(self, features):
        # make weights non-negative
        weights = relu(self.raw_weights)

        # normalize weights
        weights = weights / (sum(weights) + self.eps)

        out = 0
        for wi, fi in zip(weights, features):
            out = out + wi * fi

        return out
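
For a feel of the numbers, here is a runnable, dependency-free version of the same normalization applied to flat lists instead of tensors. The raw weights and feature values are hypothetical; in a trained model the weights would be learned.

```python
def fuse_weights(raw_weights, eps=1e-4):
    """ReLU followed by normalization, as in the fusion formula."""
    positive = [max(0.0, w) for w in raw_weights]
    total = sum(positive) + eps
    return [w / total for w in positive]


def weighted_fusion(features, raw_weights, eps=1e-4):
    """Fuse same-shaped features (flat lists here) with normalized weights."""
    weights = fuse_weights(raw_weights, eps)
    return [
        sum(w * f[i] for w, f in zip(weights, features))
        for i in range(len(features[0]))
    ]


# Hypothetical learned values: the shallow input is trusted twice as much as
# the deep one, and a third input has been switched off by the ReLU.
raw = [2.0, 1.0, -0.5]
shallow, deep, extra = [1.0, 1.0], [4.0, 0.0], [100.0, 100.0]

weights = fuse_weights(raw)
fused = weighted_fusion([shallow, deep, extra], raw)
print(weights)
print(fused)
```

The negative raw weight is clipped to zero, so the third input contributes nothing regardless of its magnitude, and the remaining weights sum to just under 1 because of the epsilon term.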

Pseudocode: improved P2-BiFPN neck

class FEBNeck:
    def forward(self, P2, P3, P4, P5):
        # P2: high resolution, useful for tiny objects
        # P5: low resolution, strong semantics

        # top-down path
        P5_td = P5
        P4_td = fuse([P4, upsample(P5_td)])
        P3_td = fuse([P3, upsample(P4_td)])
        P2_td = fuse([P2, upsample(P3_td)])

        # bottom-up path
        P3_out = fuse([P3, P3_td, downsample(P2_td)])
        P4_out = fuse([P4, P4_td, downsample(P3_out)])
        P5_out = fuse([P5, downsample(P4_out)])

        return P2_td, P3_out, P4_out, P5_out

The exact graph depends on the implementation, but the principle is:

Add P2 → preserve tiny-object detail
Use bidirectional paths → exchange shallow/detail and deep/semantic information
Use learned weights → avoid treating every scale equally
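
One way to sanity-check such a graph is to track only the feature-map side lengths through it. The sketch below mirrors the FEBNeck pseudocode, assuming 2× upsampling, stride-2 downsampling, and the grid sizes of a 640 × 640 input; it says nothing about channels or the actual fusion weights.

```python
def upsample(size: int) -> int:
    """2x upsampling doubles the side length."""
    return size * 2


def downsample(size: int) -> int:
    """Stride-2 downsampling halves the side length."""
    return size // 2


def fuse(sizes):
    """Weighted fusion needs every input at the same resolution."""
    assert len(set(sizes)) == 1, f"resolution mismatch: {sizes}"
    return sizes[0]


def neck_shapes(p2: int, p3: int, p4: int, p5: int):
    # Top-down path: semantics flow from P5 towards P2.
    p4_td = fuse([p4, upsample(p5)])
    p3_td = fuse([p3, upsample(p4_td)])
    p2_td = fuse([p2, upsample(p3_td)])
    # Bottom-up path: detail flows from P2 back towards P5.
    p3_out = fuse([p3, p3_td, downsample(p2_td)])
    p4_out = fuse([p4, p4_td, downsample(p3_out)])
    p5_out = fuse([p5, downsample(p4_out)])
    return p2_td, p3_out, p4_out, p5_out


print(neck_shapes(160, 80, 40, 20))
```

Every fusion node receives inputs at a single resolution, so the wiring is consistent: each output level keeps the resolution of its input level.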

5. Putting it together

Pseudocode: full model

class FEB_YOLOv8:
    def __init__(self):
        self.stem = Conv()

        # backbone
        self.stage1 = PC2f()
        self.stage2 = PC2f()
        self.stage3 = PC2f()
        self.stage4 = PC2f()

        # attention inserted in backbone
        self.ema = EMA(channels=..., groups=...)

        # SPPF kept from YOLOv8-style design
        self.sppf = SPPF()

        # improved feature pyramid
        self.neck = FEBNeck()

        # YOLOv8 decoupled detection head
        self.head = DetectHead()

    def forward(self, x):
        x = self.stem(x)

        P2 = self.stage1(x)
        P3 = self.stage2(P2)
        P4 = self.stage3(P3)
        P5 = self.stage4(P4)

        P5 = self.ema(P5)
        P5 = self.sppf(P5)

        features = self.neck(P2, P3, P4, P5)

        return self.head(features)

6. Reported results

On DUO, the paper reports that FEB-YOLOv8 reaches 82.9% mAP@0.5, improving over YOLOv8n by 1.2% mAP@0.5 and 1.4% mAP@0.5:0.95. It also reports 1.64M parameters and 6.2 GFLOPs. ([PLOS][1])

On URPC2020, the paper reports 83.5% mAP@0.5 and 48.9% mAP@0.5:0.95, improving over YOLOv8n by 1.3% mAP@0.5 and 1.0% mAP@0.5:0.95. ([PLOS][1])

The ablation is especially informative:

Replace C2f with P-C2f:
    parameters ↓
    GFLOPs ↓
    tiny accuracy drop

Add improved BiFPN:
    accuracy recovers/improves
    model remains lightweight

Add EMA:
    accuracy improves further

The paper reports the final model reduces parameters and GFLOPs versus baseline while improving mAP. ([PLOS][1])

That is the main engineering logic of FEB-YOLOv8.

[1]: https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0311173 “FEB-YOLOv8: A multi-scale lightweight detection model for underwater object detection, PLOS One”