Computer Vision - Aruco Detection

Posted by Rico's Nerd Cluster on March 15, 2026

Aruco Detection

Aruco detection has three jobs:

1. Detection: find square marker candidates in the image.
2. Decoding: read the binary ID inside the square.
3. Pose estimation: use the four 2D corners + known marker size to solve camera-to-marker pose.

OpenCV’s ArUco module treats markers as square binary fiducial markers; a predefined dictionary has a fixed marker bit size, such as DICT_6X6_250, meaning 250 possible markers with a 6×6 binary code area. The dictionary’s minimum Hamming distance controls how well marker IDs can be distinguished and corrected under bit errors. (OpenCV Documentation)

An ArUco marker is basically:

black outer border
binary bit pattern inside
square shape
known physical size

Example toy marker with a 4×4 inner code and a 1-cell black border:

sampled 6×6 grid

0 0 0 0 0 0
0 1 0 0 1 0
0 0 1 1 0 0
0 1 1 0 0 0
0 0 0 1 1 0
0 0 0 0 0 0

Where:

0 = black cell
1 = white cell

The detector expects the outer border to be black. Then it reads the inner code:

inner 4×4 code

1 0 0 1
0 1 1 0
1 1 0 0
0 0 1 1

The high-level OpenCV-style pipeline is:

RGB image
  -> grayscale + adaptive threshold + find candidate contours
  -> approximate contours as polygons + keep quadrilateral candidates
  -> perspective-warp candidate to square
  -> sample bit grid
  -> rotate marker to find the best rotation
  -> match inner bits to dictionary
  -> return marker ID + 4 corners

OpenCV’s ArUco detector works by detecting candidate square regions, warping them into a canonical square form, thresholding the cells, and comparing the resulting bit pattern against the marker dictionary. (OpenCV Documentation)


Step 1 - Finding square candidates

Suppose your camera image contains this marker:

image

       p0 __________ p1
         /          /
        /  marker  /
     p3/__________/p2

The detector first thresholds the image:

bright pixels -> white
dark pixels   -> black

Then it finds contours and keeps contours that look like quadrilaterals:

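for contour in contours:
    polygon = approximate_polygon(contour)

    if len(polygon) != 4:
        continue  # reject: not a quadrilateral

    if area_too_small(polygon):
        continue  # reject: too few pixels to decode

    if not convex(polygon):
        continue  # reject: a projected square is always convex

    keep_as_marker_candidate(polygon)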
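
The three rejection tests above can be tried out without any library. This is a plain-Python sketch, not OpenCV's implementation; `polygon_area`, `is_convex`, `is_marker_candidate`, the `min_area` default, and the sample quads are all made up for illustration.

```python
def polygon_area(pts):
    """Absolute area of a simple polygon via the shoelace formula."""
    n = len(pts)
    twice_area = 0.0
    for i in range(n):
        x0, y0 = pts[i]
        x1, y1 = pts[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def is_convex(pts):
    """True if every turn along the polygon has the same sign."""
    n = len(pts)
    signs = set()
    for i in range(n):
        ax, ay = pts[i]
        bx, by = pts[(i + 1) % n]
        cx, cy = pts[(i + 2) % n]
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if cross != 0:
            signs.add(cross > 0)
    return len(signs) == 1


def is_marker_candidate(pts, min_area=100.0):
    """Apply the three tests: four corners, large enough, convex."""
    return len(pts) == 4 and polygon_area(pts) >= min_area and is_convex(pts)


tilted_quad = [(120, 80), (260, 95), (240, 210), (100, 190)]  # plausible marker
dart_quad = [(0, 0), (100, 0), (30, 30), (0, 100)]            # one corner pushed inward
tiny_quad = [(0, 0), (5, 0), (5, 5), (0, 5)]                  # only 25 px^2

print(is_marker_candidate(tilted_quad))  # kept
print(is_marker_candidate(dart_quad))    # rejected: not convex
print(is_marker_candidate(tiny_quad))    # rejected: too small
```

Only the tilted quad survives, which is the behavior the loop above describes.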

At this point, the detector does not yet know the marker ID. It only knows:

this is a convex quadrilateral that might be a marker

Step 2 - Perspective warp to a canonical marker image

The candidate in the camera image is tilted:

source image

       p0 __________ p1
         /          /
        /          /
     p3/__________/p2

The detector warps it into a clean square:

canonical marker image

q0 ____________ q1
  |            |
  |            |
q3|____________|q2

This is a homography warp. The four detected corners live in image [u, v] space:

p0 = (u0, v0)
p1 = (u1, v1)
p2 = (u2, v2)
p3 = (u3, v3)

We choose 4 canonical square points:

q0 = (0, 0)
q1 = (N, 0)
q2 = (N, N)
q3 = (0, N)

We can then solve for the 3×3 homography H that maps image points to canonical points. The warp uses H to find, for every canonical pixel, where to sample in the original image:

[ x_h ]   [ h00 h01 h02 ] [ u ]
[ y_h ] = [ h10 h11 h12 ] [ v ]
[ w_h ]   [ h20 h21 h22 ] [ 1 ]

x_q = x_h / w_h
y_q = y_h / w_h

With OpenCV, this is basically:

H = cv2.getPerspectiveTransform(
    src=np.float32([p0, p1, p2, p3]),   # detected image quad
    dst=np.float32([q0, q1, q2, q3]),   # canonical square
)

canonical = cv2.warpPerspective(gray, H, (N, N))

After this, the candidate looks front-facing, so it can be divided into cells.
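
To see what solving for H involves, here is a NumPy-only sketch that recovers the 8 unknowns from the four correspondences by fixing h22 = 1. It is an illustration under that assumption, not what OpenCV does internally; `homography_from_4_points`, `apply_homography`, and the corner values are hypothetical.

```python
import numpy as np


def homography_from_4_points(src, dst):
    """Solve H (with h22 fixed to 1) so that dst ~ H @ src for 4 point pairs."""
    A, b = [], []
    for (u, v), (x, y) in zip(src, dst):
        # x = (h00*u + h01*v + h02) / (h20*u + h21*v + 1), same pattern for y
        A.append([u, v, 1, 0, 0, 0, -x * u, -x * v])
        b.append(x)
        A.append([0, 0, 0, u, v, 1, -y * u, -y * v])
        b.append(y)
    h = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)


def apply_homography(H, point):
    """Map one (u, v) point through H, including the perspective divide."""
    x_h, y_h, w_h = H @ np.array([point[0], point[1], 1.0])
    return x_h / w_h, y_h / w_h


N = 60  # side of the canonical image in pixels (arbitrary choice)
image_quad = [(120, 80), (260, 95), (240, 210), (100, 190)]   # p0..p3
canonical_quad = [(0, 0), (N, 0), (N, N), (0, N)]             # q0..q3

H = homography_from_4_points(image_quad, canonical_quad)
mapped = [apply_homography(H, p) for p in image_quad]
print(np.round(mapped, 6))  # each image corner should land on its canonical corner
```

Each correspondence contributes two linear equations, so four corners give exactly the eight equations needed.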

Step 3 - Sampling the bit grid for Inner Code

For the toy 6×6 marker:

0 0 0 0 0 0
0 1 0 0 1 0
0 0 1 1 0 0
0 1 1 0 0 0
0 0 0 1 1 0
0 0 0 0 0 0

The detector samples each cell. A simple version is:

bits = [[0] * 6 for _ in range(6)]

for cell_y in range(6):
    for cell_x in range(6):
        patch = canonical_marker[
            cell_y * cell_size : (cell_y + 1) * cell_size,
            cell_x * cell_size : (cell_x + 1) * cell_size,
        ]

        mean_intensity = patch.mean()

        # Bright cell -> 1 (white), dark cell -> 0 (black)
        bits[cell_y][cell_x] = 1 if mean_intensity > threshold else 0
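
The same sampling can be done in one shot with a reshape. The sketch below fakes a canonical image from the toy grid and reads it back; the intensities (40 and 200), `cell_size`, the fixed `threshold`, and the helper name `sample_bits` are assumptions for this example. OpenCV's documentation describes Otsu thresholding at this stage, so the fixed threshold is a simplification.

```python
import numpy as np

toy_grid = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
])

cell_size = 10  # pixels per cell in the canonical image (arbitrary)

# Fake a warped marker: dark cells at intensity 40, bright cells at 200.
canonical_marker = np.kron(toy_grid, np.ones((cell_size, cell_size))) * 160 + 40


def sample_bits(canonical, cells=6, threshold=128):
    """Average each cell's pixels and threshold the means into a bit grid."""
    size = canonical.shape[0] // cells
    trimmed = canonical[: size * cells, : size * cells]
    # Axes become (cell_row, y_in_cell, cell_col, x_in_cell).
    means = trimmed.reshape(cells, size, cells, size).mean(axis=(1, 3))
    return (means > threshold).astype(int)


bits = sample_bits(canonical_marker)
print(bits)
```

Averaging over the whole cell makes the read-out tolerant of a few noisy pixels near cell edges.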

Then it checks the border:

top row    must be all 0
bottom row must be all 0
left col   must be all 0
right col  must be all 0

If the border is not black, reject the candidate.

Then remove the border and keep the inner code:

1 0 0 1
0 1 1 0
1 1 0 0
0 0 1 1
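
The border check and the border removal are both a few lines of plain Python. `border_is_black` and `strip_border` are hypothetical helper names for this sketch, and the grid is the toy marker from above.

```python
def border_is_black(grid):
    """True if every cell on the outer ring of a square grid is 0."""
    last = len(grid) - 1
    return all(
        grid[r][c] == 0
        for r in range(len(grid))
        for c in range(len(grid[r]))
        if r in (0, last) or c in (0, last)
    )


def strip_border(grid):
    """Drop the outer ring and return the inner code."""
    return [row[1:-1] for row in grid[1:-1]]


sampled = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

# Reject the candidate (None) unless the whole border is black.
inner = strip_border(sampled) if border_is_black(sampled) else None
print(inner)
```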

Step 4 - Rotation Correction

The camera may see the marker rotated:

original code

1 0 0 1
0 1 1 0
1 1 0 0
0 0 1 1

Rotated 90 degrees clockwise:

0 1 0 1
0 1 1 0
1 0 1 0
1 0 0 1

So the detector tries all four rotations:

candidates = [
    bits,
    rotate90(bits),
    rotate180(bits),
    rotate270(bits),
]

for rotated_bits in candidates:
    compare_to_dictionary(rotated_bits)

The best matching rotation gives:

marker ID
marker orientation
correct corner ordering

The clever part: Hamming distance is not rotation invariant, but it does not need to be. After the perspective warp the candidate is an axis-aligned square, so the marker can only be in one of four orientations, and trying the four 90-degree rotations is enough to find the best match. The winning rotation also tells us which detected corner is the marker's top-left, and that corner ordering is important for pose estimation.
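
Here is a small plain-Python sketch of the rotation search and the corner reordering that follows from it. `rotate90_cw`, `find_rotation`, and the corner labels are hypothetical; the sketch assumes the candidate's corners are listed in the order top-left, top-right, bottom-right, bottom-left of the warped image.

```python
def rotate90_cw(bits):
    """Rotate a square bit matrix 90 degrees clockwise."""
    return [list(row) for row in zip(*bits[::-1])]


def find_rotation(observed, reference):
    """Clockwise quarter turns that map observed onto reference, or None."""
    current = observed
    for turns in range(4):
        if current == reference:
            return turns
        current = rotate90_cw(current)
    return None


reference = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

seen = rotate90_cw(reference)           # what the camera reads if the marker is turned
turns = find_rotation(seen, reference)  # 3: three more quarter turns undo the first

# Corners of the candidate as found in the image (TL, TR, BR, BL of the warp).
corners = ["a", "b", "c", "d"]
# After `turns` clockwise rotations, slot i is filled by the old slot (i - turns).
reordered = [corners[(i - turns) % 4] for i in range(4)]
print(turns, reordered)  # 3 ['b', 'c', 'd', 'a']
```

The marker's true top-left ended up at the warp's top-right corner `b`, so `b` moves to the front of the list before pose estimation.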

Identify Marker ID Using Dictionary matching with Hamming distance

The detector compares the decoded bit matrix to known dictionary entries. Suppose the dictionary has this marker ID 17:

dictionary marker 17

1 0 0 1
0 1 1 0
1 1 0 0
0 0 1 1

Now suppose one bit was misread:

observed with one bad bit

1 0 0 1
0 1 1 0
1 0 0 0   <- one bit changed
0 0 1 1

Compare to dictionary marker 17:

different bits = 1

If one wrong bit is within the number of errors the dictionary is allowed to correct, the detector can still identify the marker as ID 17.

That is why dictionary design matters. If two valid markers are too similar, a noisy observation could be confused. OpenCV’s docs describe the inter-marker distance as the minimum Hamming distance between dictionary markers, and that distance determines error-detection and error-correction capability. (OpenCV Documentation)
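
The comparison itself is tiny. The sketch below counts differing cells for the example above and shows the textbook coding-theory bound that links minimum distance to guaranteed correction. That bound is a general fact about codes, not a statement of OpenCV's exact acceptance rule; `hamming_distance` and `max_correctable_bits` are hypothetical helper names.

```python
def hamming_distance(a, b):
    """Count cells that differ between two equally sized bit matrices."""
    return sum(x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def max_correctable_bits(min_distance):
    """Errors guaranteed correctable for a code with this minimum distance."""
    return (min_distance - 1) // 2


marker_17 = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
observed = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 0],   # one bit flipped
    [0, 0, 1, 1],
]

distance = hamming_distance(marker_17, observed)
print(distance)                 # 1
print(max_correctable_bits(5))  # 2: a min distance of 5 tolerates 2 bad bits
```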


Step 5 - Pose estimation from four corners

After decoding, we know the 2D image corners:

image corners, pixels

u0, v0
u1, v1
u2, v2
u3, v3

And because the printed marker has known physical size, we know the 3D marker-frame corners. For a marker of side length L, define marker coordinates:

P0 = [-L/2,  L/2, 0]
P1 = [ L/2,  L/2, 0]
P2 = [ L/2, -L/2, 0]
P3 = [-L/2, -L/2, 0]

The marker is planar, so all points have:

Z = 0

The camera projection model is:

s [u, v, 1]^T = K [R | t] [X, Y, Z, 1]^T

where:

K = camera intrinsics
R = marker rotation relative to camera
t = marker translation relative to camera

Since the marker lies on a plane, Z = 0, so:

s [u, v, 1]^T = K [r1 r2 t] [X, Y, 1]^T
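
Spelled out, the step from the full projection to the planar form is just the third column of R being multiplied by zero. Here r_1, r_2, r_3 denote the columns of R:

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= K \begin{bmatrix} r_1 & r_2 & r_3 & t \end{bmatrix}
  \begin{bmatrix} X \\ Y \\ 0 \\ 1 \end{bmatrix}
= K \left( X\, r_1 + Y\, r_2 + t \right)
= K \begin{bmatrix} r_1 & r_2 & t \end{bmatrix}
  \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}
```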

This is a homography:

s [u, v, 1]^T = H [X, Y, 1]^T

where:

H = K [r1 r2 t]

So a square marker gives four 2D-3D correspondences, enough to estimate pose. OpenCV’s ArUco docs emphasize that one benefit of binary square fiducial markers is that a single marker provides enough corner correspondences to estimate camera pose. (GitHub)

In practice, OpenCV usually calls a PnP solver:

ok, rvec, tvec = cv2.solvePnP(
    object_points,   # 3D marker corners
    image_points,    # detected 2D corners
    K,
    dist_coeffs,
)

Then:

R, _ = cv2.Rodrigues(rvec)

The result is usually interpreted as:

camera_T_marker

or:

marker pose in camera frame

Meaning:

P_camera = R * P_marker + t
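
If you need the opposite direction (the camera pose in the marker frame), invert the rigid transform. This is a plain-Python sketch with made-up pose values; `transform_point` and `invert_pose` are hypothetical helper names.

```python
def transform_point(R, t, p):
    """Apply p_out = R * p + t for a 3x3 nested-list R and 3-vectors t, p."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def invert_pose(R, t):
    """Invert a rigid transform: R_inv = R^T, t_inv = -R^T * t."""
    R_inv = [[R[j][i] for j in range(3)] for i in range(3)]
    t_inv = [-sum(R_inv[i][j] * t[j] for j in range(3)) for i in range(3)]
    return R_inv, t_inv


# camera_T_marker: marker rotated 90 degrees about the camera Z axis, 0.5 m ahead.
R_cm = [[0.0, -1.0, 0.0],
        [1.0,  0.0, 0.0],
        [0.0,  0.0, 1.0]]
t_cm = [0.1, 0.0, 0.5]

p_marker = [0.05, 0.05, 0.0]                      # one marker corner
p_camera = transform_point(R_cm, t_cm, p_marker)  # P_camera = R * P_marker + t

R_mc, t_mc = invert_pose(R_cm, t_cm)              # marker_T_camera
p_back = transform_point(R_mc, t_mc, p_camera)    # should recover p_marker
print(p_camera, t_mc, p_back)
```

Here `t_mc` is the camera's position expressed in marker coordinates, which is often what a localization system actually wants.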

Small numerical pose example

Say the marker side length is:

L = 0.10  # meters

So marker-frame corners are:

P0 = [-0.05,  0.05, 0]
P1 = [ 0.05,  0.05, 0]
P2 = [ 0.05, -0.05, 0]
P3 = [-0.05, -0.05, 0]

Suppose camera intrinsics are:

fx = 600
fy = 600
cx = 320
cy = 240

So:

K = [
    [600,   0, 320],
    [  0, 600, 240],
    [  0,   0,   1],
]

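If the marker plane is parallel to the image plane, with:

t = [0, 0, 0.5]  # 50 cm in front of camera
R = identity

R = identity keeps the arithmetic simple, but it maps the marker's +Y onto the camera's +Y, which points down in the image. A marker that faces the camera upright would instead have R = diag(1, -1, -1), which swaps the two v values computed below.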

Then projection is:

u = fx * Xc/Zc + cx
v = fy * Yc/Zc + cy

For corner P0 = [-0.05, 0.05, 0], camera point is:

Xc = -0.05
Yc =  0.05
Zc =  0.50

So:

u = 600 * (-0.05 / 0.50) + 320 = 260
v = 600 * ( 0.05 / 0.50) + 240 = 300

Do the same for all corners:

P0 -> (260, 300)
P1 -> (380, 300)
P2 -> (380, 180)
P3 -> (260, 180)

So a 10 cm marker at 50 cm distance appears as:

width in image = 120 pixels

because:

pixel_width = fx * physical_width / depth
            = 600 * 0.10 / 0.50
            = 120 px
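
The whole worked example fits in a few lines of plain Python, which makes it easy to try other distances or marker sizes. `project_point` is a hypothetical helper that assumes a distortion-free pinhole camera.

```python
def project_point(point_marker, R, t, fx, fy, cx, cy):
    """Project a marker-frame 3D point to pixels with a distortion-free pinhole model."""
    # Marker frame -> camera frame: P_camera = R * P_marker + t
    xc, yc, zc = (
        sum(R[i][j] * point_marker[j] for j in range(3)) + t[i] for i in range(3)
    )
    return fx * xc / zc + cx, fy * yc / zc + cy


L = 0.10          # marker side length in meters
half = L / 2
corners_marker = [
    [-half,  half, 0.0],   # P0
    [ half,  half, 0.0],   # P1
    [ half, -half, 0.0],   # P2
    [-half, -half, 0.0],   # P3
]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.5]

pixels = [project_point(p, R, t, 600, 600, 320, 240) for p in corners_marker]
pixel_width = pixels[1][0] - pixels[0][0]
print(pixels)       # approximately (260, 300), (380, 300), (380, 180), (260, 180)
print(pixel_width)  # approximately 120
```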

PnP solves the inverse problem:

Given:
    3D marker corners
    detected 2D image corners
    camera intrinsics

Find:
    R, t
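
For a planar marker, one way to see that this inverse problem is solvable is to decompose the homography H = K [r1 r2 t] from earlier. The sketch below does that with NumPy on noise-free data. It is a swapped-in illustration of the idea, not a description of what `cv2.solvePnP` does internally, and `pose_from_planar_homography` is a hypothetical helper.

```python
import numpy as np


def pose_from_planar_homography(H, K):
    """Recover R, t from H ~ K [r1 r2 t], where H is known only up to scale."""
    A = np.linalg.inv(K) @ H
    scale = 1.0 / np.linalg.norm(A[:, 0])   # r1 must have unit length
    if A[2, 2] * scale < 0:                 # keep the marker in front of the camera
        scale = -scale
    r1, r2, t = A[:, 0] * scale, A[:, 1] * scale, A[:, 2] * scale
    R_approx = np.column_stack([r1, r2, np.cross(r1, r2)])
    U, _, Vt = np.linalg.svd(R_approx)      # snap to the nearest rotation matrix
    return U @ Vt, t


K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
R_true = np.eye(3)
t_true = np.array([0.0, 0.0, 0.5])

# Forward model from the text, then an arbitrary rescale to mimic an estimated H.
H = -3.0 * K @ np.column_stack([R_true[:, 0], R_true[:, 1], t_true])
R_est, t_est = pose_from_planar_homography(H, K)
print(np.round(R_est, 6))
print(np.round(t_est, 6))
```

With real, noisy corners the recovered r1 and r2 are not exactly orthonormal, which is why the SVD step is there and why a dedicated PnP solver is usually the better choice.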

Compact pseudocode

def detect_aruco_and_pose(image, K, dist_coeffs, marker_length, dictionary):
    gray = to_grayscale(image)

    # 1. Find square candidates
    binary = adaptive_threshold(gray)
    contours = find_contours(binary)

    candidates = []
    for contour in contours:
        poly = approximate_polygon(contour)

        if len(poly) == 4 and is_convex(poly) and area(poly) > min_area:
            candidates.append(order_corners(poly))

    detections = []

    # 2. Decode each candidate
    for corners in candidates:
        canonical = perspective_warp(gray, corners, output_size=(N, N))

        bits_with_border = sample_cells(canonical)

        if not black_border_is_valid(bits_with_border):
            continue

        inner_bits = remove_border(bits_with_border)

        best_id = None
        best_rotation = None
        best_distance = infinity

        for rotation in [0, 90, 180, 270]:
            rotated = rotate_bits(inner_bits, rotation)

            for marker_id, dict_bits in dictionary:
                d = hamming_distance(rotated, dict_bits)

                if d < best_distance:
                    best_distance = d
                    best_id = marker_id
                    best_rotation = rotation

        if best_distance <= allowed_error:
            corrected_corners = rotate_corner_order(corners, best_rotation)
            detections.append((best_id, corrected_corners))

    # 3. Estimate pose for each decoded marker
    poses = []
    object_points = marker_3d_corners(marker_length)

    for marker_id, image_corners in detections:
        ok, rvec, tvec = cv2.solvePnP(
            object_points,
            image_corners,
            K,
            dist_coeffs,
        )

        if ok:
            poses.append((marker_id, rvec, tvec))

    return poses