How I built a realtime Android vision loop with YOLO + NCNN, IOU tracking, and distance-adaptive PID control all running at 60 FPS.
I built a realtime object detection system that runs entirely on an Android phone: screen capture, YOLO inference, tracking, and control output, all in under 25ms. This post is based on my internal notes and goes through how I got there and what I learned along the way.
Testing and development for this project were done in training mode and custom lobbies, with no real players affected.
## The latency budget
At 60 FPS you get about 16ms per frame. I aimed for 25ms total latency, which leaves some room for variance. On a Snapdragon 888:
| Stage | Time |
|---|---|
| Capture | ~4ms |
| YOLO inference | ~18ms |
| Tracking | ~0.5ms |
| Control | ~0.3ms |
| Output | ~0.2ms |
| Total | ~23ms |
75% of that budget goes to inference. Neural networks on mobile are just slow and there’s no way around it. So everything else needs to be as cheap as possible.
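For reference, numbers like these can be collected with simple scoped timers around each stage. Here's a minimal sketch of what I mean; it's illustrative, not the project's actual profiling code:

```cpp
#include <chrono>
#include <cstdio>

// Minimal scoped timer: prints the elapsed time of the enclosing block.
struct ScopedTimer {
    const char* name;
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    ~ScopedTimer() {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s: %.2f ms\n", name, us / 1000.0);
    }
};

// Usage inside the frame loop (stage functions are placeholders):
// { ScopedTimer t{"capture"};   grabFrame(); }
// { ScopedTimer t{"inference"}; runYolo();   }
```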

## Capture
MediaProjection gives you screen frames through ImageReader. Frames arrive as a HardwareBuffer, which you can pass straight to native code without copying anything.
```cpp
void* data = nullptr;
AHardwareBuffer_lock(buffer, AHARDWAREBUFFER_USAGE_CPU_READ_OFTEN, -1, nullptr, &data);
// data now points to the raw pixels
```

I capture at half resolution, so 1200x540 on a 2400x1080 screen. The model resizes to 320x320 anyway so you're not losing much.
Capture takes about 4ms, mostly just waiting for the next frame. Double buffering could hide some of that but didn’t feel worth the added complexity.
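For context, here's a sketch of what the full acquire/lock/release cycle can look like if you drive it from the native side with the NDK's AImageReader. This is illustrative, not the exact capture path in this project; it assumes the reader's Surface was already handed to the MediaProjection virtual display on the Java side:

```cpp
#include <media/NdkImageReader.h>
#include <android/hardware_buffer.h>

// Sketch only: pull the latest frame, lock its buffer for CPU reads,
// process it, then hand the image back to the reader.
bool grabFrame(AImageReader* reader) {
    AImage* image = nullptr;
    if (AImageReader_acquireLatestImage(reader, &image) != AMEDIA_OK) return false;

    AHardwareBuffer* buffer = nullptr;
    AImage_getHardwareBuffer(image, &buffer);  // a handle, no pixel copy

    void* data = nullptr;
    AHardwareBuffer_lock(buffer, AHARDWAREBUFFER_USAGE_CPU_READ_OFTEN, -1, nullptr, &data);
    // ... preprocessing runs on 'data' here ...
    AHardwareBuffer_unlock(buffer, nullptr);

    AImage_delete(image);  // release the frame back to the reader
    return true;
}
```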
## NCNN and Vulkan
NCNN is Tencent’s mobile inference framework with Vulkan support. It lets you offload to the GPU instead of frying your CPU.
```cpp
ncnn::Net net;
net.opt.use_vulkan_compute = true;
net.opt.lightmode = true;
net.opt.num_threads = 4; // fallback for CPU ops
```

YOLOv8 nano is already pretty small, but INT8 quantization makes it even faster. Inference goes from 35ms down to 18ms and the model shrinks from 6MB to 2MB.
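Loading the converted model and running a pass looks roughly like this. It's a sketch: the file names and blob names (`in0`, `out0`) are assumptions based on a typical YOLOv8-to-NCNN export, not necessarily what this project uses:

```cpp
#include <net.h>  // ncnn

// Sketch: load the quantized model once at startup.
bool loadModel(ncnn::Net& net) {
    net.opt.use_vulkan_compute = true;
    net.opt.lightmode = true;
    return net.load_param("yolov8n-int8.param") == 0 &&
           net.load_model("yolov8n-int8.bin") == 0;
}

// Sketch: one inference pass per frame.
ncnn::Mat runYolo(ncnn::Net& net, const ncnn::Mat& in) {
    ncnn::Extractor ex = net.create_extractor();
    ex.input("in0", in);      // 320x320x3 input from the preprocessing step
    ncnn::Mat out;
    ex.extract("out0", out);  // raw detection tensor, decoded into boxes downstream
    return out;
}
```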
Preprocessing is standard YOLO:
- Center crop the frame to focus on the active region
- Resize to 320x320
- Convert RGBA to RGB
- Normalize pixel values to [0, 1]
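Roughly how those steps map onto NCNN's Mat helpers. This is a sketch: the center-crop step is omitted here, and in practice the HardwareBuffer's row stride has to be respected too (NCNN has stride-aware overloads for that):

```cpp
#include <mat.h>  // ncnn

// Sketch: convert an RGBA frame to a 320x320 RGB ncnn::Mat.
ncnn::Mat preprocess(const unsigned char* rgba, int width, int height) {
    // RGBA -> RGB conversion and resize to 320x320 in one call
    return ncnn::Mat::from_pixels_resize(
        rgba, ncnn::Mat::PIXEL_RGBA2RGB, width, height, 320, 320);
}
```

Normalization then happens in place on the resulting Mat: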
```cpp
const float normVals[3] = {1/255.f, 1/255.f, 1/255.f};
in.substract_mean_normalize(nullptr, normVals);
```

INT8 does cost some recall. But in this case false positives hurt more than missed detections since they trigger reactions to nothing.
## Training
51 images over 100 epochs. Transfer learning does the heavy lifting here since YOLOv8n already knows what people look like from COCO. The custom images just fine tune it for the target environment.
I pulled frames from gameplay recordings, auto-labeled them with a pretrained detector, then went through manually to fix mistakes. The whole thing took maybe an hour.
| Metric | Value |
|---|---|
| mAP@0.5 | 0.94 |
| Precision | 0.98 |
| Recall | 0.88 |
High precision at the expense of some recall was the goal. Missed detections are fine; the system just does nothing that frame. False positives are bad because then it reacts to something that isn't there.



## Non-Maximum Suppression
YOLO spits out thousands of boxes, most overlapping because it predicts at multiple scales. NMS filters them down to just the good ones.
Overlap is measured with IOU:

$$ \text{IOU}(A, B) = \frac{\text{area}(A \cap B)}{\text{area}(A) + \text{area}(B) - \text{area}(A \cap B)} $$

In code:
```cpp
float iou(const BBox& a, const BBox& b) {
    float x1 = std::max(a.left(), b.left());
    float y1 = std::max(a.top(), b.top());
    float x2 = std::min(a.right(), b.right());
    float y2 = std::min(a.bottom(), b.bottom());

    float inter = std::max(0.f, x2 - x1) * std::max(0.f, y2 - y1);
    return inter / (a.area() + b.area() - inter);
}
```

The algorithm is simple:
Sort by confidence -> Keep the best box -> Kill anything that overlaps too much -> Repeat
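In code, that greedy loop looks roughly like this. A sketch that assumes the `iou()` helper above and a `Detection` type with `bbox` and `confidence` fields (illustrative, not the project's exact types):

```cpp
#include <algorithm>
#include <vector>

// Greedy NMS sketch: keep the best-scoring box, suppress everything that overlaps it.
std::vector<Detection> nms(std::vector<Detection> dets, float iouThreshold) {
    std::sort(dets.begin(), dets.end(), [](const Detection& a, const Detection& b) {
        return a.confidence > b.confidence;   // best boxes first
    });

    std::vector<Detection> kept;
    for (const Detection& d : dets) {
        bool suppressed = false;
        for (const Detection& k : kept) {
            if (iou(d.bbox, k.bbox) > iouThreshold) { suppressed = true; break; }
        }
        if (!suppressed) kept.push_back(d);   // survives: nothing better overlaps it
    }
    return kept;
}
```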
## IOU based tracking
Raw detections are noisy: boxes jump around a few pixels each frame and sometimes disappear entirely. If you just react to whatever YOLO gives you, the output looks jittery and unstable.
I use IOU matching to track targets across frames. If a new detection overlaps enough with an existing track they’re probably the same target:
```cpp
for (int t = 0; t < numTracks; t++) {
    float bestIou = iouThreshold; // typically 0.3
    int bestDet = -1;

    for (int d = 0; d < numDetections; d++) {
        if (matched[d]) continue;
        float iou = tracks[t].bbox.iou(detections[d].bbox);
        if (iou > bestIou) {
            bestIou = iou;
            bestDet = d;
        }
    }

    if (bestDet >= 0) {
        // update track with detection
        tracks[t].update(detections[bestDet]);
        matched[bestDet] = true;
    }
}
```

When a track goes unmatched I predict where it should be using velocity:

$$ \hat{x}_{t+1} = x_t + v_t \, \Delta t $$
Velocity itself is smoothed with an EMA so it doesn't freak out from noisy detections:

$$ v_t = \alpha \, v_{\text{raw}} + (1 - \alpha) \, v_{t-1} $$
Tracks that go unmatched for more than 5 frames get dropped. That's about 80ms at 60 FPS: long enough to survive brief detection failures, but short enough that ghost tracks don't linger.
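Putting the coast-and-expire step together, a rough sketch. The member names (`cx`, `cy`, `vx`, `vy`, `missedFrames`, `matchedThisFrame`) are mine for illustration, not taken from the real Track struct:

```cpp
// Per-frame track maintenance: coast unmatched tracks on their smoothed
// velocity, and drop anything that has been unmatched too long.
void coastOrExpire(Track* tracks, int& numTracks, float dt) {
    for (int t = 0; t < numTracks; ) {
        if (tracks[t].matchedThisFrame) {
            tracks[t].missedFrames = 0;
            t++;
            continue;
        }
        if (++tracks[t].missedFrames > 5) {     // ~80ms at 60 FPS: drop the ghost
            tracks[t] = tracks[numTracks - 1];  // swap-remove
            numTracks--;
            continue;                           // re-check the swapped-in track
        }
        tracks[t].cx += tracks[t].vx * dt;      // coast on EMA-smoothed velocity
        tracks[t].cy += tracks[t].vy * dt;
        t++;
    }
}
```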
## Distance adaptive PID
Turning target position into cursor movement sounds easy until you try it. A basic P controller oscillates when you get close because the error keeps flipping sign.
Different distances need different strategies:
| Distance | Controller |
|---|---|
| < 30px | PID |
| 30-150px | Proportional + EMA |
| > 150px | Proportional + clamp |
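Roughly how that switching can be wired up, treating one axis at a time. This is a sketch: the thresholds come from the table, but the gain, smoothing, and clamp values are placeholders, and `PID` is the controller shown below.

```cpp
#include <algorithm>
#include <cmath>

// Sketch of the per-axis controller switch. kpMid, kpFar, alpha, and maxStep
// are placeholder values, not the project's tuned numbers.
struct AdaptiveController {
    PID pid;                 // full PID, used under 30px
    float kpMid = 0.5f;      // proportional gain for the mid band
    float kpFar = 0.8f;      // proportional gain for long range
    float alpha = 0.3f;      // EMA smoothing factor
    float maxStep = 40.f;    // clamp on a single correction (px)
    float smoothed = 0.f;    // EMA state

    float update(float error, float dt) {
        float dist = std::fabs(error);
        if (dist < 30.f)
            return pid.update(error, dt);                    // PID
        if (dist < 150.f) {                                  // proportional + EMA
            smoothed = alpha * (kpMid * error) + (1.f - alpha) * smoothed;
            return smoothed;
        }
        return std::clamp(kpFar * error, -maxStep, maxStep); // proportional + clamp
    }
};
```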

PID is textbook:

$$ u(t) = K_p \, e(t) + K_i \int_0^t e(\tau)\, d\tau + K_d \, \frac{de(t)}{dt} $$

Discretized:

$$ u_k = K_p \, e_k + K_i \sum_{i=0}^{k} e_i \, \Delta t + K_d \, \frac{e_k - e_{k-1}}{\Delta t} $$
```cpp
float PID::update(float error, float dt) {
    integral += error * dt;                      // accumulate error over time
    float derivative = (error - lastError) / dt; // rate of change of the error
    lastError = error;
    return Kp * error + Ki * integral + Kd * derivative;
}
```

The gains came from hand-tuning. I disabled the integral term entirely because it causes windup: if the target is briefly hidden the integral keeps accumulating error, and when the target reappears you overshoot.
Derivative is what prevents oscillation near the target. It dampens response when error changes fast.
## Zero allocation hot path
Android’s garbage collector can pause for 50ms+ which wrecks any latency gains. I kept the hot path allocation free to avoid that.
Everything uses fixed arrays allocated once at startup:
```cpp
template <typename T, int N>
class FixedArray {
    T data[N];
    int size = 0;

public:
    bool push(const T& v) {
        if (size >= N) return false;  // full: drop instead of allocating
        data[size++] = v;
        return true;
    }

    void removeAt(int i) {
        data[i] = data[size - 1];     // swap-remove: last element fills the hole
        size--;
    }
};
```

removeAt does a swap-remove: copy the last element into the hole and decrement size. It's O(1), and order doesn't matter for this use case.
I use FixedArray<Detection, 100> for detections and FixedArray<Track, 50> for tracks. In practice you rarely see more than 5-10 detections per frame, so these limits are generous.
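For illustration, per-frame usage is roughly this. `reset()`, `parseDetections`, and `updateTracks` are stand-in names, not members or functions from the code above:

```cpp
// Stand-in declarations for illustration; the real functions live elsewhere.
void parseDetections(const ncnn::Mat& out, FixedArray<Detection, 100>& dets);
void updateTracks(FixedArray<Track, 50>& tracks, const FixedArray<Detection, 100>& dets);

// Allocated once at startup; nothing in the frame loop touches the heap.
static FixedArray<Detection, 100> detections;
static FixedArray<Track, 50> tracks;

void onFrame(const ncnn::Mat& out) {
    detections.reset();               // assumed helper that just sets size back to 0
    parseDetections(out, detections); // push() quietly drops anything past 100
    updateTracks(tracks, detections); // IOU matching from the tracking section
}
```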
## Performance
This was tested on a Realme GT 5G with a Snapdragon 888 and 8GB of RAM:
- Average latency: 23ms
- P99 latency: 28ms
- Sustained framerate: 60 FPS
- Memory usage: ~80MB
Inference is the main bottleneck but for this project 23ms was good enough.
PS: This is still a prototype and there are plenty of ways to improve it further: model distillation, better tracking, training with hard negatives, fancier control strategies, etc. But those are topics for another day :)
## Further reading
- NCNN Vulkan notes - Official docs for enabling Vulkan compute in NCNN.
- Android AHardwareBuffer reference - NDK documentation for hardware buffer access.
- YOLOv8 NCNN export - Ultralytics guide for exporting models to NCNN format.
- PID control tutorial - Wikipedia article explaining PID control theory and implementation.
- Learning non-maximum suppression - Research paper on NMS techniques and improvements.
This project was built for educational purposes only. It is not intended for use in real competitive scenarios. Please respect game terms of service and community guidelines.