Two articles describe how on-device AI performance on Android can degrade over the course of a real user session because of physical and system-level limits. Both attribute the familiar “it was fast, then it slowed down” experience to thermal throttling: as the device heats up, the Android kernel lowers the operating frequency and voltage of the CPU, GPU, and NPU through dynamic voltage and frequency scaling (DVFS). They describe the result as a non-linear “performance cliff” rather than a gradual slowdown.
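Lowering voltage together with frequency is such an effective way to shed heat because of a standard first-order model of CMOS dynamic power. This model is not taken from the articles. It ignores leakage power and assumes that supply voltage scales roughly in proportion to frequency across the DVFS range. Here α is the switching activity factor and C the switched capacitance:

```latex
\begin{aligned}
P_{\text{dyn}} &\approx \alpha\, C\, V^{2} f,
  \qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^{3} \\
\frac{P_{\text{dyn}}(0.8\,f)}{P_{\text{dyn}}(f)} &\approx 0.8^{3} \approx 0.51
\end{aligned}
```

Under these assumptions, a 20% frequency cut roughly halves dynamic power, while the throughput of compute-bound work drops by only about 20%. This is the trade the kernel makes when it throttles.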

They also emphasize that performance problems are often tied to power and heat, not to raw model compute alone. Both discuss the energy cost of moving data between memory and accelerator hardware. This traffic can leave a model bottlenecked on memory bandwidth rather than arithmetic, and the extra energy it consumes pushes the device toward its thermal limits sooner.
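The articles give no figures. A roofline-style check, a standard technique that the source does not name, can still show when a layer is limited by data movement rather than compute. The sketch below assumes a dense layer at batch size 1. Each weight is fetched once and used in one multiply-accumulate (2 ops), so arithmetic intensity is 2 / bytesPerWeight ops per byte. Activation traffic and caching are ignored. The peak-compute and bandwidth numbers are hypothetical.

```java
// Roofline-style sketch: decides whether a dense layer is limited by memory
// bandwidth or by compute. Hardware numbers in main() are hypothetical.
final class RooflineSketch {

    /** Attainable throughput (GOP/s): the lower of peak compute and bandwidth x intensity. */
    static double attainableGops(double peakGops, double bandwidthGBps, double opsPerByte) {
        return Math.min(peakGops, bandwidthGBps * opsPerByte);
    }

    /**
     * Arithmetic intensity of a dense layer at batch size 1: each weight is
     * fetched once and used in one multiply-accumulate (2 ops). Activation
     * traffic and caching are ignored.
     */
    static double denseIntensity(int bytesPerWeight) {
        return 2.0 / bytesPerWeight;
    }

    /** True when memory traffic, not arithmetic, caps throughput. */
    static boolean isMemoryBound(double peakGops, double bandwidthGBps, double opsPerByte) {
        return bandwidthGBps * opsPerByte < peakGops;
    }

    public static void main(String[] args) {
        double peakGops = 2000.0;    // hypothetical accelerator peak, GOP/s
        double bandwidthGBps = 50.0; // hypothetical DRAM bandwidth, GB/s
        String[] names = {"FP16", "INT8"};
        int[] bytesPerWeight = {2, 1};
        for (int i = 0; i < names.length; i++) {
            double ai = denseIntensity(bytesPerWeight[i]);
            System.out.println(names[i]
                    + ": intensity=" + ai + " ops/byte"
                    + ", attainable=" + attainableGops(peakGops, bandwidthGBps, ai) + " GOP/s"
                    + ", memoryBound=" + isMemoryBound(peakGops, bandwidthGBps, ai));
        }
    }
}
```

With these hypothetical figures, both variants are memory-bound. Halving the bytes per weight (FP16 to INT8) doubles attainable throughput and halves the weight traffic from DRAM, which is where the energy and heat concern from the articles applies.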

The articles recommend mitigation through profiling and adaptive execution. One focuses on thermal-aware application logic that reads Android’s PowerManager thermal status signals and responds by switching model precision (for example, from FP16 to INT8), reducing the workload (for example, by skipping frames), and stopping inference once the status reaches severe levels. The other stresses using Android Studio’s Power Profiler to correlate power-rail energy consumption, hardware utilization (CPU vs. GPU vs. NPU), and thermal throttling with inference latency. Those measurements then guide configuration choices such as quantization, pruning, and which accelerator the model should run on.
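The adaptive approach can be sketched as a pure mapping from thermal status to execution settings. The integer constants copy the values of `android.os.PowerManager.THERMAL_STATUS_*` (API 29+), so the logic can be tested off-device. The tier boundaries, the `Plan` type, and the frame stride are hypothetical choices and are not taken from either article. The ordering used here is to drop precision first, then frame rate, then stop.

```java
// Sketch of thermal-aware inference settings. The status constants copy the
// values of android.os.PowerManager.THERMAL_STATUS_* (API 29+) so this logic
// can be tested off-device; the tier boundaries below are hypothetical.
final class ThermalPolicy {
    static final int THERMAL_STATUS_NONE = 0;
    static final int THERMAL_STATUS_LIGHT = 1;
    static final int THERMAL_STATUS_MODERATE = 2;
    static final int THERMAL_STATUS_SEVERE = 3; // CRITICAL, EMERGENCY, SHUTDOWN are 4..6

    enum Precision { FP16, INT8 }

    /** Execution settings for one thermal tier (hypothetical type). */
    static final class Plan {
        final boolean runInference;
        final Precision precision; // ignored when runInference is false
        final int frameStride;     // run the model on every Nth frame; 1 = every frame

        Plan(boolean runInference, Precision precision, int frameStride) {
            this.runInference = runInference;
            this.precision = precision;
            this.frameStride = frameStride;
        }
    }

    /** Maps a thermal status to settings: degrade precision first, then rate, then stop. */
    static Plan planFor(int thermalStatus) {
        if (thermalStatus >= THERMAL_STATUS_SEVERE) {
            return new Plan(false, Precision.INT8, 1); // severe or worse: stop inference
        }
        if (thermalStatus == THERMAL_STATUS_MODERATE) {
            return new Plan(true, Precision.INT8, 2);  // INT8 and skip every other frame
        }
        if (thermalStatus == THERMAL_STATUS_LIGHT) {
            return new Plan(true, Precision.INT8, 1);  // cheaper INT8 model, full frame rate
        }
        return new Plan(true, Precision.FP16, 1);      // no throttling: full quality
    }

    /** Frame-skipping gate: true if the given frame should be sent to the model. */
    static boolean shouldProcessFrame(Plan plan, long frameIndex) {
        return plan.runInference && frameIndex % plan.frameStride == 0;
    }

    public static void main(String[] args) {
        for (int status = THERMAL_STATUS_NONE; status <= THERMAL_STATUS_SEVERE; status++) {
            Plan plan = planFor(status);
            int processed = 0;
            for (long frame = 0; frame < 30; frame++) {
                if (shouldProcessFrame(plan, frame)) processed++;
            }
            System.out.println("status=" + status + " run=" + plan.runInference
                    + " precision=" + plan.precision + " framesProcessed=" + processed + "/30");
        }
    }
}
```

On a device, the status would come from `PowerManager.getCurrentThermalStatus()`, with updates arriving through `PowerManager.addThermalStatusListener(...)` (both API 29+). Switching precision in practice usually means keeping two pre-converted model variants and swapping between them, which this sketch assumes but does not show.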

Both also mention AICore, a system-level service intended to manage on-device AI models shared across apps and to abstract away the details of hardware acceleration, which improves updateability and memory efficiency.