60 FPS Object Detection on Android using YOLOv8

I built a real-time object detection system that runs entirely on an Android phone: screen capture, YOLO inference, tracking, and control output, all in under 25ms. This post is drawn from my internal notes and walks through how I got here and what I learned along the way.

IMPORTANT

All testing and development for this project was done in training mode and custom lobbies, with no real players affected.

The latency budget

At 60 FPS you get about 16.7ms per frame. I aimed for 25ms total latency, which leaves some room for variance. On a Snapdragon 888:

Stage	Time
Capture	~4ms
YOLO inference	~18ms
Tracking	~0.5ms
Control	~0.3ms
Output	~0.2ms
Total	~23ms

Roughly three quarters of that goes to inference. Neural networks on mobile are just slow and there’s no way around it, so everything else needs to be as cheap as possible.

Capture

MediaProjection gives you screen frames through ImageReader. Frames arrive as HardwareBuffer which you can pass straight to native code without copying anything.

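void* data = nullptr;
// fence = -1: nothing to wait on; rect = nullptr: lock the whole buffer
int err = AHardwareBuffer_lock(buffer, AHARDWAREBUFFER_USAGE_CPU_READ_OFTEN, -1, nullptr, &data);
// err == 0 means data now points to the raw pixels
// ... read the frame, then release it:
AHardwareBuffer_unlock(buffer, nullptr);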

I capture at half resolution, so 1200x540 on a 2400x1080 screen. The model resizes to 320x320 anyway so you’re not losing much.

Capture takes about 4ms, mostly just waiting for the next frame. Double buffering could hide some of that but didn’t feel worth the added complexity.

NCNN and Vulkan

NCNN is Tencent’s mobile inference framework with Vulkan support. It lets you offload to the GPU instead of frying your CPU.

ncnn::Net net;
net.opt.use_vulkan_compute = true;
net.opt.lightmode = true;
net.opt.num_threads = 4;  // fallback for CPU ops

YOLOv8 nano is already pretty small but INT8 quantization makes it even faster. Inference goes from 35ms down to 18ms and the model shrinks from 6MB to 2MB.

Preprocessing is standard YOLO:

Center crop the frame to focus on the active region
Resize to 320x320
Convert RGBA to RGB
Normalize pixel values to [0, 1]

const float normVals[3] = {1/255.f, 1/255.f, 1/255.f};
in.substract_mean_normalize(nullptr, normVals);

INT8 does cost some recall. But in this case false positives hurt more than missed detections since they trigger reactions to nothing.

Training

51 images over 100 epochs. Transfer learning does the heavy lifting here since YOLOv8n already knows what people look like from COCO. The custom images just fine-tune it for the target environment.

I pulled frames from gameplay recordings, auto-labeled them with a pretrained detector, then went through manually to fix mistakes. The whole thing took maybe an hour.

Metric	Value
mAP@0.5	0.94
Precision	0.98
Recall	0.88

High precision at the cost of some recall was the goal. Missed detections are fine: the system just does nothing that frame. False positives are bad because then it reacts to something that isn’t there.

Figure: Training loss and metrics over 100 epochs. Training converged cleanly with no overfitting.

Figure: Precision-recall curve, showing 0.94 mAP at an IOU threshold of 0.5.

Figure: Validation batch predictions, showing detection across different poses and occlusion levels.

Non-Maximum Suppression

YOLO spits out thousands of boxes, most overlapping because it predicts at multiple scales. NMS filters them down to just the good ones.

Overlap is measured with IOU:

$$\text{IOU}(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

In code:

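// Intersection over union of two axis-aligned boxes, in [0, 1].
float iou(const BBox& a, const BBox& b) {
    // Corners of the overlap rectangle
    float x1 = std::max(a.left(), b.left());
    float y1 = std::max(a.top(), b.top());
    float x2 = std::min(a.right(), b.right());
    float y2 = std::min(a.bottom(), b.bottom());

    // Width and height clamp to zero when the boxes do not overlap
    float inter = std::max(0.f, x2 - x1) * std::max(0.f, y2 - y1);
    float uni = a.area() + b.area() - inter;
    return uni > 0.f ? inter / uni : 0.f;  // guard against two empty boxes
}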

The algorithm is simple:

Sort by confidence -> Keep the best box -> Kill anything that overlaps too much -> Repeat
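Those four steps fit in a few lines. The sketch below is one way to write greedy NMS in place so it never allocates. The `BBox` layout (top-left corner plus size), the `Detection` struct and the threshold you pass in are assumptions for the example, not the exact types from the project.

```cpp
#include <algorithm>

struct BBox {
    float x, y, w, h;  // assumed layout: top-left corner plus size, in pixels
    float left() const { return x; }
    float top() const { return y; }
    float right() const { return x + w; }
    float bottom() const { return y + h; }
    float area() const { return w * h; }
};

struct Detection {
    BBox bbox;
    float score;  // confidence
};

float iou(const BBox& a, const BBox& b) {
    float x1 = std::max(a.left(), b.left());
    float y1 = std::max(a.top(), b.top());
    float x2 = std::min(a.right(), b.right());
    float y2 = std::min(a.bottom(), b.bottom());
    float inter = std::max(0.f, x2 - x1) * std::max(0.f, y2 - y1);
    float uni = a.area() + b.area() - inter;
    return uni > 0.f ? inter / uni : 0.f;
}

// In-place greedy NMS. Returns how many detections survive; the survivors are
// moved to the front of dets, ordered by descending score.
int nms(Detection* dets, int count, float iouThreshold) {
    std::sort(dets, dets + count, [](const Detection& a, const Detection& b) {
        return a.score > b.score;
    });

    int kept = 0;
    for (int i = 0; i < count; ++i) {
        bool suppressed = false;
        // Compare only against boxes already kept, which all score higher.
        for (int k = 0; k < kept; ++k) {
            if (iou(dets[i].bbox, dets[k].bbox) > iouThreshold) {
                suppressed = true;
                break;
            }
        }
        if (!suppressed) dets[kept++] = dets[i];  // kept <= i, so this is safe
    }
    return kept;
}
```

This is quadratic in the worst case, which is fine once a confidence threshold has already thrown out most of the raw boxes.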

IOU based tracking

Raw detections are noisy. Boxes jump around a few pixels each frame and sometimes disappear entirely. If you just react to whatever YOLO gives you, the output looks jittery and unstable.

I use IOU matching to track targets across frames. If a new detection overlaps enough with an existing track they’re probably the same target:

for (int t = 0; t < numTracks; t++) {
    float bestIou = iouThreshold;  // typically 0.3
    int bestDet = -1;

    for (int d = 0; d < numDetections; d++) {
        if (matched[d]) continue;
        float iou = tracks[t].bbox.iou(detections[d].bbox);
        if (iou > bestIou) {
            bestIou = iou;
            bestDet = d;
        }
    }

    if (bestDet >= 0) {
        // update track with detection
        tracks[t].update(detections[bestDet]);
        matched[bestDet] = true;
    }
}

When a track goes unmatched I predict where it should be using velocity:

$$P_{t+1} = P_t + v_t \cdot \Delta t$$

Velocity itself is smoothed with EMA so it doesn’t freak out from noisy detections:

$$v_{t+1} = (1 - \alpha) \cdot v_t + \alpha \cdot \frac{\Delta P}{\Delta t}$$

Tracks that go unmatched for more than 5 frames get dropped. That’s about 80ms at 60 FPS, long enough to survive brief detection failures but short enough not to leave ghost tracks lingering.
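Put together, the per-track state is tiny. Here is a rough sketch of the matched and unmatched paths, following the two formulas above. The `Vec2` and `Track` layouts, the method names and the idea of passing `alpha` and `maxMissed` as parameters are assumptions for the example. I have not listed the actual smoothing factor here, so treat any value you plug in as a starting point.

```cpp
struct Vec2 { float x, y; };

struct Track {
    Vec2 pos{0.f, 0.f};  // last known position, in pixels
    Vec2 vel{0.f, 0.f};  // EMA-smoothed velocity, in pixels per unit of dt
    int missed = 0;      // consecutive frames without a matching detection

    // Matched this frame: blend the measured velocity into the EMA,
    // then snap the position to the detection.
    void update(Vec2 measured, float dt, float alpha) {
        if (dt > 0.f) {
            vel.x = (1.f - alpha) * vel.x + alpha * (measured.x - pos.x) / dt;
            vel.y = (1.f - alpha) * vel.y + alpha * (measured.y - pos.y) / dt;
        }
        pos = measured;
        missed = 0;
    }

    // Unmatched this frame: coast along the smoothed velocity.
    // Returns false once the track has been missing for more than maxMissed frames.
    bool coast(float dt, int maxMissed) {
        pos.x += vel.x * dt;
        pos.y += vel.y * dt;
        ++missed;
        return missed <= maxMissed;
    }
};
```

With `maxMissed = 5`, a track survives five coasted frames and is reported as dead on the sixth.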

Distance adaptive PID

Turning target position into cursor movement sounds easy until you try it. A basic P controller oscillates when you get close because the error keeps flipping sign.

Different distances need different strategies:

Distance	Controller
< 30px	PID
30-150px	Proportional + EMA
> 150px	Proportional + clamp

PID is textbook:

$$u = K_p e + K_i \int e \, dt + K_d \frac{de}{dt}$$

Discretized:

$$u[n] = K_p \cdot e[n] + K_i \sum_{i=0}^{n} e[i] \cdot \Delta t + K_d \cdot \frac{e[n] - e[n-1]}{\Delta t}$$

float PID::update(float error, float dt) {
    integral += error * dt;
    float derivative = (error - lastError) / dt;
    lastError = error;
    return Kp * error + Ki * integral + Kd * derivative;
}

Tuned values are $K_p = 0.45$, $K_i = 0$, $K_d = 0.12$. I disabled the integral term entirely because it causes windup: if the target is briefly hidden, the integral keeps accumulating error, and when the target reappears you overshoot. With $K_i = 0$ this is effectively a PD controller.

Derivative is what prevents oscillation near the target. It dampens response when error changes fast.

The control loop in action showing smooth tracking across distance transitions.

Zero allocation hot path

Android’s garbage collector can pause for 50ms+ which wrecks any latency gains. I kept the hot path allocation free to avoid that.

Everything uses fixed arrays allocated once at startup:

template <typename T, int N>
class FixedArray {
    T data[N];
    int size = 0;
public:
    bool push(const T& v) {
        if (size >= N) return false;
        data[size++] = v;
        return true;
    }
    void removeAt(int i) {
        data[i] = data[size-1];
        size--;
    }
};

removeAt does a swap-remove. Copy the last element into the hole, decrement size. O(1) and order doesn’t matter for this use case.

I use FixedArray<Detection, 100> for detections and FixedArray<Track, 50> for tracks. In practice you would rarely see more than 5-10 detections per frame so the limits I use are really generous.
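Here is a slightly fuller sketch of the same container with accessors and a bounds check, to show what swap-remove does to element order. The `size()`, `operator[]` and `clear()` members and the `bool` return from `removeAt` are additions for the example and are not part of the snippet above.

```cpp
template <typename T, int N>
class FixedArray {
    T data_[N];
    int size_ = 0;

public:
    // Returns false instead of growing when the array is full.
    bool push(const T& v) {
        if (size_ >= N) return false;
        data_[size_++] = v;
        return true;
    }

    // Swap-remove: overwrite slot i with the last element, then shrink.
    // O(1), but the last element changes position.
    bool removeAt(int i) {
        if (i < 0 || i >= size_) return false;
        data_[i] = data_[size_ - 1];
        --size_;
        return true;
    }

    int size() const { return size_; }
    T& operator[](int i) { return data_[i]; }
    const T& operator[](int i) const { return data_[i]; }

    // Reuse the same storage for the next frame.
    void clear() { size_ = 0; }
};
```

Pushing 10, 20, 30, 40 and then calling `removeAt(1)` leaves 10, 40, 30. When removing inside a loop, stay on the same index after a removal, since a different element now sits there.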

Realtime detection and overlay running at 60 FPS.

Performance

This was tested on a Realme GT 5G with a Snapdragon 888 and 8GB RAM:

Average latency: 23ms
P99 latency: 28ms
Sustained framerate: 60 FPS
Memory usage: ~80MB

Inference is the main bottleneck but for this project 23ms was good enough.
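If you want to collect the same two latency numbers on your own device, a minimal sketch looks like this. It assumes you have already recorded per-frame latencies into a preallocated buffer, and it uses the nearest-rank definition of P99. The `LatencyStats` struct and `summarize` function are hypothetical names, not what the project uses.

```cpp
#include <algorithm>

struct LatencyStats {
    float mean;
    float p99;
};

// Mean and nearest-rank P99 over n samples (n > 0).
// Reorders the samples in place, so no extra buffer is needed.
LatencyStats summarize(float* samplesMs, int n) {
    float sum = 0.f;
    for (int i = 0; i < n; ++i) sum += samplesMs[i];

    // Nearest rank: rank = ceil(0.99 * n), computed in integer math.
    const int rank = (99 * n + 99) / 100;
    std::nth_element(samplesMs, samplesMs + (rank - 1), samplesMs + n);

    return {sum / n, samplesMs[rank - 1]};
}
```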

PS: This is still a prototype and there are lots of ways to improve it further: model distillation, better tracking, training with hard negatives, fancier control strategies, etc. But those are topics for some other day :)
