FEB-YOLOv8 slims down YOLOv8n so it fits on underwater robots, then recovers the lost accuracy with attention and better small-object feature fusion. The three main changes are P-C2f, EMA attention, and an improved BiFPN/P2 feature pyramid. The authors report that this raises mAP while reducing parameters and GFLOPs versus YOLOv8n on DUO and URPC2020. ([PLOS][1])
1. Overall mental model
YOLOv8n already has the usual structure:

    Input → Backbone → Neck/FPN-PAN → Detection Head

FEB-YOLOv8 changes it like this:

    YOLOv8n backbone C2f
            ↓
    replace Bottleneck with FasterNet/PConv block
            ↓
    lighter backbone, fewer redundant conv operations

    then add EMA attention
            ↓
    recover feature quality by focusing on useful underwater object regions

    then improve feature pyramid with P2 + BiFPN-style fusion
            ↓
    better small-object detection

The authors motivate this from underwater detection: objects are often small, clustered, blurred, low-contrast, and hard to separate from the background. At the same time, underwater robots have limited compute/storage, so the model must be lightweight. ([PLOS][1])
2. Improvement 1: P-C2f - make C2f cheaper
YOLOv8’s C2f is good because it improves gradient flow by splitting features, passing some through bottlenecks, and concatenating intermediate outputs. But the bottleneck blocks still apply regular convolutions to every channel, which is computationally expensive.
FEB-YOLOv8 says:
Do not convolve every channel. Convolve only a subset of channels, keep the rest as cheap identity channels, then mix with 1×1 convolutions.
This is based on Partial Convolution / PConv from FasterNet. In this paper, PConv is used to replace the bottleneck inside C2f, forming P-C2f. ([PLOS][1])
Math
For a normal convolution with feature map size H × W, input channels C, output channels C, and kernel size k × k:
\[\text{FLOPs}_{conv} = H W k^2 C^2\]
For PConv, only \(C_p\) channels are convolved:
\[\text{FLOPs}_{pconv} = H W k^2 C_p^2\]
If:
\[C_p = \frac{C}{4}\]
then:
\[\text{FLOPs}_{pconv} = H W k^2 \left(\frac{C}{4}\right)^2 = \frac{1}{16} H W k^2 C^2\]
So the spatial convolution part becomes roughly 1/16 the cost of full convolution. The paper explicitly states this 1/16 FLOP reduction when \(C_p = C/4\). ([PLOS][1])
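To make the 1/16 ratio concrete, the short sketch below evaluates both FLOP formulas for a single layer. The feature-map size, channel count, and function names are illustrative assumptions, not values taken from the paper.

```python
def conv_flops(h, w, k, c_in, c_out):
    """Multiply-accumulate count of a dense k x k convolution (bias ignored)."""
    return h * w * k * k * c_in * c_out


def pconv_flops(h, w, k, c, ratio=0.25):
    """PConv convolves only c_p = ratio * c channels; the rest pass through."""
    c_p = int(c * ratio)
    return conv_flops(h, w, k, c_p, c_p)


# Hypothetical layer: 80 x 80 feature map, 64 channels, 3 x 3 kernel.
full = conv_flops(80, 80, 3, 64, 64)
partial = pconv_flops(80, 80, 3, 64)
print(full, partial, partial / full)
```

The ratio depends only on the channel fraction, so any H, W, and k give the same 1/16 as long as the ratio stays at 1/4.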
Pseudocode: normal C2f

    class C2f:
        def forward(self, x):
            # expand / project
            y = conv1x1(x)

            # split channels into two parts
            y0, y1 = split(y)
            outputs = [y0, y1]

            # normal YOLOv8 C2f uses Bottleneck blocks
            cur = y1
            for block in bottleneck_blocks:
                cur = block(cur)
                outputs.append(cur)

            # concatenate all partial outputs
            out = concat(outputs, dim="channel")

            # fuse channels
            return conv1x1(out)

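The channel bookkeeping of the split-and-concatenate pattern above can be sketched in a few lines. The function name and the `hidden_ratio=0.5` default are assumptions made for this illustration; the paper's exact widths are not given in this summary.

```python
def c2f_channels(c_out, n_blocks, hidden_ratio=0.5):
    """Channel widths at each stage of a C2f-style block.

    hidden_ratio is an assumed setting for this sketch.
    """
    c = int(c_out * hidden_ratio)        # width of each split branch
    after_first_conv = 2 * c             # first 1x1 conv output, split into y0 and y1
    after_concat = (2 + n_blocks) * c    # y0, y1, plus one output per block
    return {
        "branch": c,
        "after_first_conv": after_first_conv,
        "after_concat": after_concat,
        "after_fuse": c_out,             # final 1x1 conv projects back to c_out
    }


print(c2f_channels(c_out=64, n_blocks=2))
```

Each extra block adds one more branch-width slice to the concatenation, which is why the cost of the inner blocks matters so much.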
Pseudocode: P-C2f

    class PartialConv:
        def __init__(self, channels, ratio=0.25):
            self.cp = int(channels * ratio)
            self.conv3x3 = Conv3x3(self.cp, self.cp)

        def forward(self, x):
            x_conv = x[:, :self.cp, :, :]
            x_skip = x[:, self.cp:, :, :]

            x_conv = self.conv3x3(x_conv)

            return concat([x_conv, x_skip], dim="channel")

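As a runnable counterpart to the PartialConv pseudocode, here is a minimal pure-Python sketch on nested lists. It is meant only to show that the untouched channels pass through unchanged; the helper names, toy input, and constant weights are all hypothetical, and a real layer would use learned weights in a tensor library.

```python
def conv3x3(x, weight):
    """Zero-padded 3x3 convolution.

    x:      [C_in][H][W] nested lists
    weight: [C_out][C_in][3][3] nested lists
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    out = []
    for kernel in weight:                      # one kernel per output channel
        plane = [[0.0] * w for _ in range(h)]
        for ci in range(c_in):
            for i in range(h):
                for j in range(w):
                    for di in range(3):
                        for dj in range(3):
                            y, xx = i + di - 1, j + dj - 1
                            if 0 <= y < h and 0 <= xx < w:
                                plane[i][j] += kernel[ci][di][dj] * x[ci][y][xx]
        out.append(plane)
    return out


def partial_conv(x, weight, ratio=0.25):
    """Convolve the first c_p channels; copy the remaining channels unchanged."""
    c_p = int(len(x) * ratio)
    return conv3x3(x[:c_p], weight) + [[row[:] for row in ch] for ch in x[c_p:]]


# Toy input: 8 channels of 4 x 4, channel c filled with the value c + 1.
x = [[[float(c + 1)] * 4 for _ in range(4)] for c in range(8)]
c_p = 2  # int(8 * 0.25)
# Hypothetical weights: every tap is 0.1.
weight = [[[[0.1] * 3 for _ in range(3)] for _ in range(c_p)] for _ in range(c_p)]

y = partial_conv(x, weight)
print(len(y), y[2:] == x[2:])
```

Only the first two channels are changed by the convolution; the other six are identical to the input, which is where the savings come from.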

    class FasterNetBlock:
        def forward(self, x):
            shortcut = x

            # spatial mixing on only part of channels
            y = partial_conv(x)

            # channel mixing
            y = conv1x1_expand(y)
            y = activation(y)
            y = conv1x1_project(y)

            return shortcut + y


    class PC2f:
        def forward(self, x):
            y = conv1x1(x)

            y0, y1 = split(y)
            outputs = [y0, y1]

            cur = y1
            for block in fasternet_blocks:
                cur = block(cur)
                outputs.append(cur)

            out = concat(outputs, dim="channel")
            return conv1x1(out)

The important point:

    C2f:   Bottleneck = full conv on all channels
    P-C2f: Bottleneck replaced by PConv + cheap 1×1 channel mixing

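A rough weight count illustrates the difference between the two inner blocks. This is a sketch under stated assumptions: the dense bottleneck is modelled as two 3×3 convolutions of equal width, and the FasterNet-style block as PConv on a quarter of the channels followed by a 1×1 expand (factor 2) and a 1×1 project. The expansion factor, branch width, and function names are illustrative and may differ from the paper's configuration.

```python
def bottleneck_params(c, k=3):
    """Two dense k x k convolutions, c -> c each (weights only, no bias or norm)."""
    return 2 * k * k * c * c


def fasternet_block_params(c, k=3, ratio=0.25, expand=2):
    """PConv on c_p channels, then 1x1 expand (c -> expand*c) and 1x1 project back."""
    c_p = int(c * ratio)
    pconv = k * k * c_p * c_p
    pointwise = c * (expand * c) + (expand * c) * c
    return pconv + pointwise


c = 64  # hypothetical branch width
dense = bottleneck_params(c)
light = fasternet_block_params(c)
print(dense, light, round(light / dense, 3))
```

Under these assumptions the lighter block keeps about a quarter of the weights, and most of what remains sits in the 1×1 convolutions rather than in the spatial one.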
3. Improvement 2: EMA attention - recover accuracy after making the model lighter
PConv makes the model cheaper, but it may weaken representation quality because fewer channels receive full spatial convolution. EMA is added to restore focus. Underwater images contain noise, blur, low contrast, and small objects. EMA helps the network emphasize useful spatial regions and multi-scale information instead of treating all spatial/channel features equally. The paper says EMA uses grouped channel reshaping, horizontal/vertical pooling, 1×1 and 3×3 branches, global average pooling, Softmax, and cross-dimensional interaction. ([PLOS][1])
Math
Let the input feature be:
\[X \in \mathbb{R}^{B \times C \times H \times W}\]
EMA first groups the channels:
\[X_g \in \mathbb{R}^{(B \cdot G) \times (C/G) \times H \times W}\]
Then it computes directional pooling.
Horizontal pooling averages along the width, leaving one descriptor per row:
\[z_h(h) = \frac{1}{W} \sum_{w=1}^{W} X_g(:, :, h, w)\]
Vertical pooling averages along the height, leaving one descriptor per column:
\[z_w(w) = \frac{1}{H} \sum_{h=1}^{H} X_g(:, :, h, w)\]
These are concatenated and passed through a 1×1 convolution:
\[q = \text{Conv}_{1 \times 1}([z_h, z_w])\]
Then split into height and width gates:
\[a_h, a_w = \text{split}(q)\]
Apply sigmoid gates:
\[\hat{X}_1 = X_g \cdot \sigma(a_h) \cdot \sigma(a_w)\]
In parallel, EMA also computes a 3×3 branch:
\[\hat{X}_2 = \text{Conv}_{3 \times 3}(X_g)\]
Then it uses global average pooling and Softmax to produce cross-spatial attention weights:
\[s_1 = \text{Softmax}(\text{GAP}(\hat{X}_1))\]
\[s_2 = \text{Softmax}(\text{GAP}(\hat{X}_2))\]
A simplified version of the final spatial attention map is:
\[A = \sigma(s_1 \cdot \text{flatten}(\hat{X}_2) + s_2 \cdot \text{flatten}(\hat{X}_1))\]
Then:
\[Y_g = X_g \cdot A\]
Finally, reshape the grouped features back:
\[Y \in \mathbb{R}^{B \times C \times H \times W}\]
This is the operational math implied by the paper's EMA description: grouped features, horizontal/vertical pooling, 1×1/3×3 branches, GAP, Softmax, cross-space aggregation, and reweighting. ([PLOS][1])
Small mental example
Imagine a feature map where the true object is a tiny starfish in the lower center of the image.
Without EMA:

    background texture, sand, water noise, and object features all flow forward similarly

With EMA:

    horizontal pooling: "something important appears around this row"
    vertical pooling: "something important appears around this column"
    3×3 branch: "local texture looks object-like"
    cross-space fusion: "boost this small spatial region"

So EMA creates a soft attention mask that says:

    Pay more attention here.
    Ignore noisy water/sand elsewhere.

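The row/column intuition in this example can be checked numerically. The sketch below applies the two directional averages to a toy single-channel map; the map values and the function name are made up for illustration.

```python
def directional_pool(fmap):
    """Return (row_means, col_means) of a single-channel H x W map."""
    h, w = len(fmap), len(fmap[0])
    row_means = [sum(row) / w for row in fmap]                    # one value per row
    col_means = [sum(fmap[i][j] for i in range(h)) / h for j in range(w)]  # one per column
    return row_means, col_means


# Toy 4 x 5 feature map: low background response everywhere, plus one strong
# "starfish" response at row 3, column 2 (lower centre).
fmap = [[0.1] * 5 for _ in range(4)]
fmap[3][2] = 2.0

z_h, z_w = directional_pool(fmap)
row = max(range(len(z_h)), key=z_h.__getitem__)
col = max(range(len(z_w)), key=z_w.__getitem__)
print(row, col)
```

The strongest row descriptor and the strongest column descriptor intersect at the object's position, which is the information the two sigmoid gates then turn into a spatial mask.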
Pseudocode: EMA

    class EMA:
        def __init__(self, channels, groups):
            self.groups = groups
            self.group_channels = channels // groups

            self.conv1x1 = Conv1x1(self.group_channels, self.group_channels)
            self.conv3x3 = Conv3x3(self.group_channels, self.group_channels)

        def forward(self, x):
            B, C, H, W = x.shape
            G = self.groups

            # reshape channels into grouped batch dimension
            xg = x.reshape(B * G, C // G, H, W)

            # directional pooling
            pool_h = avg_pool_width(xg)    # shape: B*G, C/G, H, 1
            pool_w = avg_pool_height(xg)   # shape: B*G, C/G, 1, W

            # align and concatenate spatial descriptors
            pool_w = transpose_hw(pool_w)  # shape: B*G, C/G, W, 1
            pooled = concat([pool_h, pool_w], dim="height")  # shape: B*G, C/G, H+W, 1

            # 1x1 branch creates coordinate-style gates
            gates = self.conv1x1(pooled)
            gate_h, gate_w = split_height_width(gates, H, W)
            gate_w = transpose_hw(gate_w)  # back to shape: B*G, C/G, 1, W

            x1 = xg * sigmoid(gate_h) * sigmoid(gate_w)

            # 3x3 branch captures local spatial context
            x2 = self.conv3x3(xg)

            # cross-space attention: channel weights, viewed as B*G, 1, C/G
            w1 = softmax(global_avg_pool(x1), dim="channel")
            w2 = softmax(global_avg_pool(x2), dim="channel")

            x1_flat = flatten_spatial(x1)  # B*G, C/G, H*W
            x2_flat = flatten_spatial(x2)  # B*G, C/G, H*W

            # (B*G, 1, C/G) @ (B*G, C/G, H*W) -> B*G, 1, H*W
            attn = matmul(w1, x2_flat) + matmul(w2, x1_flat)
            attn = sigmoid(attn.reshape(B * G, 1, H, W))

            out = xg * attn

            # restore original shape
            return out.reshape(B, C, H, W)

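The cross-space step is the least intuitive part of the pseudocode above, so here is a minimal pure-Python sketch of it for a single group. It follows the simplified formulation used in this explanation (softmax over pooled channel means, a weighted sum over channels at each position, then a sigmoid); the function names and toy values are assumptions, and it is not the authors' implementation.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def global_avg_pool(x):
    """x: [C][H*W] -> one mean per channel."""
    return [sum(ch) / len(ch) for ch in x]


def cross_space_attention(x1, x2):
    """x1, x2: [C][H*W] outputs of the 1x1 and 3x3 branches for one group.

    Returns H*W gate values, each in (0, 1).
    """
    w1 = softmax(global_avg_pool(x1))   # channel weights from the 1x1 branch
    w2 = softmax(global_avg_pool(x2))   # channel weights from the 3x3 branch
    n_pos = len(x1[0])
    logits = [
        sum(w1[c] * x2[c][p] for c in range(len(x2)))
        + sum(w2[c] * x1[c][p] for c in range(len(x1)))
        for p in range(n_pos)
    ]
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]


# Toy group: 2 channels, a 2 x 2 map flattened to 4 positions (made-up values).
x1 = [[0.0, 0.0, 0.0, 4.0], [0.0, 0.0, 0.0, 2.0]]
x2 = [[0.0, 0.0, 0.0, 3.0], [0.0, 0.0, 0.0, 1.0]]
gate = cross_space_attention(x1, x2)
print([round(g, 3) for g in gate])
```

Positions where both branches respond get a gate close to 1, while positions with no response stay at the neutral value 0.5, so the final multiplication boosts the former relative to the latter.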
4. Improvement 3: BiFPN-style feature pyramid + P2 - improve small-object detection
Mental model
YOLO necks usually combine features from different depths:

    deep layers: strong semantics, weak spatial detail
    shallow layers: weak semantics, strong spatial detail

For underwater objects, this matters a lot because many targets are small. If you only use deeper feature maps, small objects may disappear after repeated downsampling.
FEB-YOLOv8 modifies the feature pyramid by:
- adding P2 feature map information, which is higher-resolution and better for tiny objects;
- using cross-scale connections inspired by BiFPN;
- using weighted feature fusion instead of simple addition/concat. ([PLOS][1])
Math: weighted feature fusion
Standard FPN/PAN often does something like:
\[O = I_1 + I_2\]
or:
\[O = \text{Conv}(\text{Concat}(I_1, I_2))\]
BiFPN-style fusion learns how much to trust each input:
\[O = \frac{\sum_i w_i I_i}{\epsilon + \sum_i w_i}\]
where:
\[w_i \ge 0\]
Usually this is implemented as:
\[w_i = \text{ReLU}(\alpha_i)\]
So the model learns:

    For this fusion node, should I trust high-resolution shallow features more?
    Or deep semantic features more?

For small underwater objects, the answer is often:

    Use more P2/P3 spatial detail, but still inject deeper semantics.

Pseudocode: weighted fusion

    class WeightedFusion:
        def __init__(self, n_inputs, eps=1e-4):
            self.raw_weights = Parameter(ones(n_inputs))
            self.eps = eps

        def forward(self, features):
            # make weights non-negative
            weights = relu(self.raw_weights)

            # normalize weights
            weights = weights / (sum(weights) + self.eps)

            out = 0
            for wi, fi in zip(weights, features):
                out = out + wi * fi

            return out

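A runnable version of the same fusion rule, written on plain Python lists, makes the normalization easy to inspect. The inputs and raw weights are hypothetical; in a trained model the raw weights would be learned parameters and the features would be tensors of matching shape.

```python
def fast_normalized_fusion(features, raw_weights, eps=1e-4):
    """Fuse same-length feature vectors with non-negative, normalized weights."""
    weights = [max(0.0, w) for w in raw_weights]   # ReLU keeps weights non-negative
    total = sum(weights) + eps                     # eps avoids division by zero
    weights = [w / total for w in weights]
    fused = [0.0] * len(features[0])
    for w, feat in zip(weights, features):
        for i, value in enumerate(feat):
            fused[i] += w * value
    return fused, weights


# Two hypothetical inputs, flattened to short vectors for readability.
shallow = [1.0, 1.0, 1.0]   # stands in for a high-resolution feature
deep = [4.0, 4.0, 4.0]      # stands in for an upsampled semantic feature

fused, weights = fast_normalized_fusion([shallow, deep], raw_weights=[2.0, 1.0])
print([round(w, 3) for w in weights], [round(v, 3) for v in fused])
```

With raw weights 2 and 1, the shallow input contributes about two thirds of the result. A negative raw weight is clamped to zero, which switches that input off for the node.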
Pseudocode: improved P2-BiFPN neck

    class FEBNeck:
        def forward(self, P2, P3, P4, P5):
            # P2: high resolution, useful for tiny objects
            # P5: low resolution, strong semantics

            # top-down path
            P5_td = P5
            P4_td = fuse([P4, upsample(P5_td)])
            P3_td = fuse([P3, upsample(P4_td)])
            P2_td = fuse([P2, upsample(P3_td)])

            # bottom-up path
            P3_out = fuse([P3, P3_td, downsample(P2_td)])
            P4_out = fuse([P4, P4_td, downsample(P3_out)])
            P5_out = fuse([P5, downsample(P4_out)])

            return P2_td, P3_out, P4_out, P5_out

The exact graph depends on the implementation, but the principle is:

    Add P2 → preserve tiny-object detail
    Use bidirectional paths → exchange shallow/detail and deep/semantic information
    Use learned weights → avoid treating every scale equally

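To see why P2 carries more detail, and why 2× upsampling and downsampling line up the fusion inputs, the sketch below traces spatial sizes through the pyramid. It assumes a square 640-pixel input and the usual strides of 4, 8, 16, and 32 for P2 to P5; the function names are hypothetical.

```python
def pyramid_sizes(input_size=640, strides=(4, 8, 16, 32)):
    """Spatial side length of P2..P5 for a square input."""
    return {f"P{level}": input_size // s for level, s in zip(range(2, 6), strides)}


def neck_is_aligned(sizes):
    """True if 2x up/downsampling makes every pair of fusion inputs the same size."""
    top_down = [("P4", "P5"), ("P3", "P4"), ("P2", "P3")]    # (target, upsampled source)
    bottom_up = [("P3", "P2"), ("P4", "P3"), ("P5", "P4")]   # (target, downsampled source)
    ok_up = all(sizes[t] == sizes[s] * 2 for t, s in top_down)
    ok_down = all(sizes[t] == sizes[s] // 2 for t, s in bottom_up)
    return ok_up and ok_down


sizes = pyramid_sizes()
print(sizes, neck_is_aligned(sizes))
```

At 640 pixels the P2 map is 160 × 160, which has 64 times as many positions as the 20 × 20 P5 map. That is the detail small objects need, and it is also why adding P2 is not free in compute.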
5. Putting it together

    class FEB_YOLOv8:
        def __init__(self):
            self.stem = Conv()

            # backbone (downsampling convs between stages omitted for brevity)
            self.stage1 = PC2f()
            self.stage2 = PC2f()
            self.stage3 = PC2f()
            self.stage4 = PC2f()

            # attention inserted in backbone
            self.ema = EMA(channels=..., groups=...)

            # SPPF kept from YOLOv8-style design
            self.sppf = SPPF()

            # improved feature pyramid
            self.neck = FEBNeck()

            # YOLOv8 decoupled detection head
            self.head = DetectHead()

        def forward(self, x):
            x = self.stem(x)

            P2 = self.stage1(x)
            P3 = self.stage2(P2)
            P4 = self.stage3(P3)
            P5 = self.stage4(P4)

            P5 = self.ema(P5)
            P5 = self.sppf(P5)

            features = self.neck(P2, P3, P4, P5)

            return self.head(features)

6. Reported results
On DUO, the paper reports that FEB-YOLOv8 reaches 82.9% mAP@0.5, improving over YOLOv8n by 1.2% mAP@0.5 and 1.4% mAP@0.5:0.95. It also reports 1.64M parameters and 6.2 GFLOPs. ([PLOS][1])
On URPC2020, the paper reports 83.5% mAP@0.5 and 48.9% mAP@0.5:0.95, improving over YOLOv8n by 1.3% mAP@0.5 and 1.0% mAP@0.5:0.95. ([PLOS][1])
The ablation is especially informative:

    Replace C2f with P-C2f:
        parameters ↓
        GFLOPs ↓
        tiny accuracy drop

    Add improved BiFPN:
        accuracy recovers/improves
        model remains lightweight

    Add EMA:
        accuracy improves further

The paper reports the final model reduces parameters and GFLOPs versus baseline while improving mAP. ([PLOS][1])
That is the main engineering logic of FEB-YOLOv8.
[1]: https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0311173 "FEB-YOLOv8: A multi-scale lightweight detection model for underwater object detection | PLOS One"