Machine learning is no longer confined to the cloud. Advances in model compression, quantisation, and dedicated neural processing hardware have made it practical to run inference on microcontrollers with as little as 256 KB of flash. This is reshaping both what embedded systems can do and how they are designed.
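To make the quantisation step concrete, here is a minimal sketch of affine int8 quantisation in plain Python. It assumes a single per-tensor scale and a value range forced to include zero. It is an illustration of the idea, not the exact scheme of any particular framework, and the function names are hypothetical.

```python
def quantise_int8(values):
    """Map floats to int8 using one scale and zero-point for the whole tensor."""
    # Force the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    span = hi - lo
    scale = span / 255.0 if span > 0 else 1.0  # guard against all-zero input
    zero_point = round(-128 - lo / scale)
    quantised = [
        max(-128, min(127, round(v / scale) + zero_point))  # clamp to int8
        for v in values
    ]
    return quantised, scale, zero_point


def dequantise(quantised, scale, zero_point):
    """Recover approximate float values from the int8 representation."""
    return [(q - zero_point) * scale for q in quantised]


if __name__ == "__main__":
    weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # illustrative values only
    q, scale, zp = quantise_int8(weights)
    print(q, scale, zp)
    print(dequantise(q, scale, zp))
```

Each weight is stored in one byte instead of the four bytes of a float32, at the cost of a rounding error of roughly one quantisation step or less per value.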
From Cloud to Edge: The Shift in AI Architecture
Early IoT architectures sent raw sensor data to the cloud for processing. This model has significant drawbacks: it requires reliable connectivity, introduces latency, increases bandwidth cost, and exposes potentially sensitive data in transit. Edge inference addresses all four. Running the model on the device removes the network round trip, so response time is bounded by on-device inference time (typically milliseconds to hundreds of milliseconds on a microcontroller, depending on the model). The device keeps working offline, and raw data never has to leave it.

Hardware Implications
Running ML models on the device directly shapes hardware design. Key considerations include:
- Processing: MCUs with DSP extensions (Arm Cortex-M4/M7/M33, using optimised kernels such as CMSIS-DSP and CMSIS-NN) handle quantised inference efficiently. For more demanding models, look at Arm Cortex-M55 with Helium (MVE) or dedicated NPUs like the Arm Ethos-U55
- Memory: Even small models need RAM for activations. Budget 64–256KB SRAM for TinyML workloads; more complex models may require external PSRAM
- Power: Inference at full speed on an M4 @ 168MHz can draw roughly 30–50mA, depending on the part and the peripherals enabled. Profile your model and consider duty-cycling or low-power inference modes
- Thermal: Continuous inference on high-performance MCUs can raise die temperature significantly — factor this into your thermal design
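The effect of duty-cycling on the power budget can be estimated with simple arithmetic. The sketch below is a rough model, assuming the device alternates between one active current and one sleep current and ignoring wake-up transients. The 40 mA active figure sits in the 30–50 mA range quoted above, while the sleep current, inference time, period, and battery capacity are illustrative assumptions, not measurements.

```python
def average_current_ma(active_ma, sleep_ma, inference_ms, period_ms):
    """Time-weighted average current for one inference per period."""
    if not 0 < inference_ms <= period_ms:
        raise ValueError("inference_ms must be in (0, period_ms]")
    duty = inference_ms / period_ms
    return duty * active_ma + (1.0 - duty) * sleep_ma


def battery_life_hours(capacity_mah, avg_ma):
    """Idealised battery life; ignores self-discharge and derating."""
    return capacity_mah / avg_ma


if __name__ == "__main__":
    # Assumed values: 40 mA active, 0.01 mA asleep, 20 ms inference every second.
    avg = average_current_ma(active_ma=40.0, sleep_ma=0.01,
                             inference_ms=20.0, period_ms=1000.0)
    print(f"average current: {avg:.4f} mA")
    print(f"battery life on 1000 mAh: {battery_life_hours(1000.0, avg):.0f} h")
```

With these assumed numbers the average current is dominated by the 2% of time spent on inference, which is why shortening inference time or lengthening the period usually matters more than shaving sleep current.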
Frameworks and Toolchain
The embedded ML toolchain has matured significantly. The leading frameworks for edge inference are:
- TensorFlow Lite for Microcontrollers (TFLM): the most widely used, with broad hardware support and a growing operator library
- Edge Impulse: end-to-end platform from data collection to deployment, excellent for rapid prototyping
- STM32Cube.AI: STMicroelectronics' tool for optimising and deploying models onto STM32 devices
- ONNX Runtime for embedded: support is growing, and it is useful if your training pipeline outputs ONNX models. Check that your target is supported, since it mainly targets application-class processors rather than bare-metal MCUs
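On a microcontroller without a filesystem, a common deployment pattern (used in TFLM workflows, typically via `xxd -i`) is to compile the model file into the firmware as a constant byte array. The sketch below shows that conversion in Python. The function name, the array name `g_model`, and the 16-byte alignment are assumptions for illustration, and the output is intended for a C++ source file.

```python
def bytes_to_c_array(data: bytes, name: str = "g_model", per_line: int = 12) -> str:
    """Render raw model bytes as a C++ array definition plus a length constant."""
    if not data:
        raise ValueError("model data is empty")
    lines = []
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(lines)
    return (
        f"alignas(16) const unsigned char {name}[] = {{\n{body}\n}};\n"
        f"const unsigned int {name}_len = {len(data)};\n"
    )


if __name__ == "__main__":
    # Placeholder bytes; in practice read the exported model file in binary mode.
    print(bytes_to_c_array(bytes(range(4))))
```

Because the array is `const`, the toolchain can usually place it in flash rather than SRAM, which leaves RAM free for activations.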