There's a version of your IoT platform that works perfectly. Devices connect, telemetry flows, the dashboard updates, and everyone agrees the MVP was a success.
Then the fleet grows. And the architecture that handled 200 devices without a single incident starts producing problems that don't show up in staging, don't trigger clean error messages, and don't have obvious fixes. Latency drifts up. State data goes stale. Alerts fire late or not at all. The back office shows devices as online when they've been silent for hours.
These aren't bugs. They're the predictable consequences of an architecture that was designed for validation, not for load. And the painful part is that by the time you see them, you're already running production traffic on a system that needs structural changes.
Here's what typically breaks, in what order, and what you can do about it before it costs you.
First to break: ingest and processing are the same thing
Most MVP platforms use a single API to do everything. A device sends a message over HTTPS. The API authenticates it, validates the payload, writes to the database, updates device state, and returns a response. One endpoint, one flow.
This is the right choice early on. It's fast to build, easy to debug, and cheap to run. But it means your ingestion layer and your processing layer are the same service. When the fleet was small, that didn't matter. When the fleet grows, it means every spike in device traffic directly increases your API latency and database pressure.
A fleet of asset trackers reporting on 60-second intervals doesn't send traffic evenly. Devices on timed intervals tend to cluster. One minute, your API handles 50 requests. The next, it handles 800. There's no buffer between "message received" and "message processed." The database takes every hit in real time.
What this looks like from the business side: P95 ingest latency creeps up. Some messages start failing. Devices retry, which adds more load. The dashboard shows intermittent staleness. Your ops team can't tell if a device is genuinely offline or if the platform just hasn't processed its last message yet.
What fixes it: Decoupling ingest from processing. A queue between the two means the ingestion layer's only job is to accept and enqueue. Processing happens asynchronously, at whatever rate your workers can sustain. Burst traffic fills the queue instead of hammering the database. This is the single most impactful architectural change you can make when scaling an IoT platform.
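The shape of that decoupling can be sketched in a few lines. This is an illustrative in-process version, with a hypothetical `handle_ingest` endpoint and a plain `queue.Queue` standing in for whatever broker you'd actually use (SQS, Kafka, RabbitMQ):

```python
import json
import queue
import threading

# In production this would be a managed broker; an in-process queue
# stands in here just to show the shape of the decoupling.
ingest_queue: "queue.Queue" = queue.Queue()
processed: list = []

def handle_ingest(raw_body: str) -> int:
    """Ingestion layer: authenticate, validate shape, enqueue. Nothing else."""
    msg = json.loads(raw_body)
    if "device_id" not in msg or "payload" not in msg:
        return 400  # reject malformed messages at the edge
    ingest_queue.put(msg)
    return 202  # accepted, not yet processed

def worker() -> None:
    """Processing layer: drains the queue at whatever rate it can sustain."""
    while True:
        msg = ingest_queue.get()
        if msg is None:  # shutdown sentinel
            break
        processed.append(msg)  # stand-in for the DB write + state update
        ingest_queue.task_done()

t = threading.Thread(target=worker)
t.start()
status = handle_ingest('{"device_id": "dev-1", "payload": {"temp": 21.5}}')
ingest_queue.put(None)
t.join()
```

The key property is the 202: the ingest layer acknowledges receipt without waiting for processing, so a burst fills the queue instead of the database's connection pool.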
Second to break: failure handling that depends on the device
In a synchronous architecture, if the API fails to process a message, the device has to retry. That sounds reasonable until you think about what "retry" actually means for a field device running on a cellular connection with firmware you can't easily update.
Most early firmware handles this with basic retry logic. But the platform has no visibility into what was lost. There's no dead-letter queue, no replay mechanism, no way to distinguish "message never received" from "message received but processing failed."
What this looks like from the business side: Gaps in telemetry history. Missing data points that only surface weeks later during analytics. Devices that appear to have gone offline but actually sent data that the platform dropped.
What fixes it: Queue-native retry policies with dead-letter handling. Messages that fail processing get retried with backoff. Messages that fail repeatedly land in a dead-letter queue where they can be inspected and replayed. The platform owns failure recovery, not the device.
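A minimal sketch of that policy, assuming a worker-side wrapper (the function names and the three-attempt limit are illustrative, and the backoff delay is left as a comment rather than a real sleep):

```python
MAX_ATTEMPTS = 3
dead_letter_queue: list = []

def process_with_retry(msg: dict, process) -> bool:
    """Retry a failing message with backoff; park it in the DLQ after
    MAX_ATTEMPTS so it can be inspected and replayed later."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            process(msg)
            return True
        except Exception as exc:
            backoff_s = 2 ** attempt  # exponential backoff (seconds)
            # in a real system: re-enqueue with a visibility delay of backoff_s
            last_error = str(exc)
    dead_letter_queue.append(
        {"message": msg, "error": last_error, "attempts": MAX_ATTEMPTS}
    )
    return False

def replay_dead_letters(process) -> int:
    """Replay DLQ entries once the underlying fault has been fixed."""
    replayed = 0
    for entry in list(dead_letter_queue):
        if process_with_retry(entry["message"], process):
            dead_letter_queue.remove(entry)
            replayed += 1
    return replayed

# Simulate a processing stage that keeps failing (e.g. DB unavailable):
calls = {"n": 0}
def flaky(msg):
    calls["n"] += 1
    raise RuntimeError("db unavailable")

ok = process_with_retry({"device_id": "dev-2"}, flaky)
```

Because the DLQ entry carries the original message and the final error, "what did we lose?" becomes a query instead of a forensic exercise.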
Third to break: one protocol for everything
HTTPS works for sending telemetry. But as use cases expand beyond basic reporting, you need the platform to reach devices too. Firmware updates, configuration changes, and diagnostic reads all require the platform to push data to a specific device. With HTTPS only, all of that is device-initiated. The device has to poll for updates. The platform can't push anything.
What this looks like from the business side: Firmware rollouts that take days instead of hours. Configuration changes that only take effect on the next reporting cycle. No ability to quickly diagnose a specific device without waiting for it to check in.
What fixes it: Adding MQTT alongside HTTPS. Persistent connections, topic-based routing, and bidirectional communication. Each device subscribes to its own namespace. The platform pushes when it needs to. HTTPS stays available where it makes more sense.
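The per-device namespace might look like the sketch below. The topic layout is a hypothetical convention, and the matcher is a minimal reimplementation of MQTT's `+`/`#` wildcard rules for illustration, not a broker:

```python
def device_topics(device_id: str) -> dict:
    """Illustrative per-device namespace: each device publishes telemetry
    into its own subtree and subscribes to its own command topics."""
    return {
        "telemetry": f"devices/{device_id}/telemetry",
        "commands": f"devices/{device_id}/commands/#",
    }

def topic_matches(pattern: str, topic: str) -> bool:
    """Minimal MQTT-style topic matcher supporting '+' and '#' wildcards."""
    p_parts, t_parts = pattern.split("/"), topic.split("/")
    for i, p in enumerate(p_parts):
        if p == "#":
            return True  # '#' matches the remainder of the topic
        if i >= len(t_parts):
            return False
        if p != "+" and p != t_parts[i]:
            return False
    return len(p_parts) == len(t_parts)

topics = device_topics("dev-42")
# The platform can now push to this device without waiting for a poll:
can_push = topic_matches(topics["commands"], "devices/dev-42/commands/firmware")
cross_device = topic_matches(topics["commands"], "devices/dev-43/commands/firmware")
```

The point of the convention is that a device's subscription pattern never matches another device's topics, which is also what makes topic-scoped authorization possible later.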
Fourth to break: the "scale everything" problem
When ingest and processing are coupled, scaling means scaling everything together. If processing gets slow (more complex validation, heavier writes, more device types with different schemas), you have to add more API instances even though the bottleneck is only the write path. You're paying for ingest capacity you don't need just to get more processing throughput.
What this looks like from the business side: Infrastructure costs that scale faster than your device count. Capacity planning that's impossible because you can't isolate which part of the system is saturated.
What fixes it: Independent scaling of ingestion and processing. When these are separate services, you scale workers based on queue depth and ingest capacity based on traffic. If you add a new device type with heavier processing requirements, you add workers. The ingestion layer doesn't change. Your costs map to actual demand.
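The worker-scaling decision itself is simple arithmetic once queue depth is observable. A sketch, with all rates and the drain target as illustrative parameters:

```python
import math

def workers_needed(queue_depth: int, arrival_rate: float,
                   per_worker_rate: float, drain_target_s: float = 60.0,
                   min_workers: int = 1, max_workers: int = 50) -> int:
    """Size the processing tier from queue depth, independently of ingest.

    Workers must keep up with steady-state arrivals AND drain the current
    backlog within drain_target_s. All parameter values are illustrative.
    """
    steady_state = arrival_rate / per_worker_rate
    backlog = queue_depth / (per_worker_rate * drain_target_s)
    needed = math.ceil(steady_state + backlog)
    return max(min_workers, min(max_workers, needed))

# e.g. 6,000 queued messages, 120 msg/s arriving, 20 msg/s per worker:
# 6 workers for steady state + 5 to drain the backlog inside a minute.
n = workers_needed(queue_depth=6000, arrival_rate=120, per_worker_rate=20)
```

Note that nothing in this calculation mentions ingest capacity. That's the whole point: the two tiers scale on different signals.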
What's worth investing in before you need it
Not all of this needs to be built on day one. But a few decisions get dramatically more expensive to change later:
Design your ingest endpoint as a clean surface. Even in a synchronous architecture, keep the "receive, validate, process, store" pipeline explicit and separable. When it's time to add a queue, you want a clean cut point, not a tangle of interleaved logic.
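"Explicit and separable" can be as simple as four named stages composed in one place. A sketch (the stage names and payload shape are hypothetical):

```python
import json
import time

def receive(raw_body: str) -> dict:
    """Parse the transport-level message."""
    return json.loads(raw_body)

def validate(msg: dict) -> dict:
    """Reject malformed payloads before they reach storage."""
    if "device_id" not in msg or "payload" not in msg:
        raise ValueError("missing device_id or payload")
    return msg

def process(msg: dict) -> dict:
    """Derive state; a future worker can run this stage asynchronously."""
    return {**msg, "processed_at": time.time()}

def store(msg: dict, db: list) -> None:
    """Persist; the only stage that touches the database."""
    db.append(msg)

def handle(raw_body: str, db: list) -> None:
    # Synchronous today, but the seam after validate() is exactly
    # where a queue slots in later.
    store(process(validate(receive(raw_body))), db)

db: list = []
handle('{"device_id": "dev-1", "payload": {"temp": 20}}', db)
```

When the queue arrives, `receive` and `validate` stay in the ingest service, `process` and `store` move to the workers, and nothing else has to change.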
Model device state with timestamps from the start. Every piece of state should carry the timestamp of the message that set it. This is trivial to implement early and nearly impossible to retrofit. Without it, you can't handle out-of-order messages, and you can't tell if a state update is fresh or stale.
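A sketch of what "every piece of state carries its message timestamp" buys you, using a hypothetical in-memory state store:

```python
device_state: dict = {}

def apply_state_update(device_id: str, field: str, value, msg_ts: float) -> bool:
    """Apply a state update only if the message timestamp is newer than
    what we hold, so a late-arriving older message can't clobber fresh state."""
    current = device_state.setdefault(device_id, {})
    existing = current.get(field)
    if existing is not None and existing["ts"] >= msg_ts:
        return False  # stale: an older message arrived out of order
    current[field] = {"value": value, "ts": msg_ts}
    return True

# A fresh reading lands first; an older one arrives late (e.g. a device
# flushing its offline buffer) and is correctly ignored.
applied_new = apply_state_update("dev-1", "temp", 22.0, msg_ts=1000.0)
applied_stale = apply_state_update("dev-1", "temp", 19.0, msg_ts=900.0)
```

Without the `ts` field on each entry, both writes would succeed and the dashboard would show the older value as current.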
Separate device identity from the ingest path. Per-device authentication and topic-scoped authorization aren't just security features. They're the foundation for fleet management, targeted updates, and incident response at scale. Building these after you've shipped devices with shared credentials is a recall-level problem.
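Topic-scoped authorization is a one-line policy once each device has its own identity. A sketch, assuming the `devices/<id>/...` namespace convention from the MQTT section:

```python
def authorize_publish(device_id: str, topic: str) -> bool:
    """A device may only publish inside its own namespace, so a single
    leaked credential can't impersonate or poison the rest of the fleet."""
    return topic.startswith(f"devices/{device_id}/")

ok = authorize_publish("dev-7", "devices/dev-7/telemetry")
blocked = authorize_publish("dev-7", "devices/dev-8/telemetry")
```

With shared credentials there is no `device_id` to check against, which is why this is so hard to add after devices have shipped.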
Instrument before you need to. Ingest success rate, processing latency at P95 and P99, device online ratio, inactive-device rate. These metrics are cheap to collect and invaluable when something starts degrading. The teams that catch scaling problems early are the ones that were already watching.
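None of these metrics needs heavyweight tooling to start with. A sketch of two of them, using a nearest-rank percentile and a hypothetical `last_seen` map keyed by device ID:

```python
import math

def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile; enough for dashboard-grade latency metrics."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

def online_ratio(last_seen: dict, now: float,
                 threshold_s: float = 300.0) -> float:
    """Fraction of the fleet heard from within threshold_s (illustrative)."""
    online = sum(1 for ts in last_seen.values() if now - ts <= threshold_s)
    return online / len(last_seen)

# One slow outlier dominates the tail even when the median looks healthy:
latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 14, 13]
p95 = percentile(latencies_ms, 95)

ratio = online_ratio({"dev-a": 100.0, "dev-b": 500.0}, now=600.0)
```

The P95 here is 250 ms while most requests take ~13 ms, which is exactly the kind of drift that a simple average would hide.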
The pattern is always the same. The architecture that worked at 200 devices starts showing cracks at 1,000 and fails structurally at 5,000. The difference between a platform that scales and one that needs a rewrite isn't more engineering time. It's knowing which decisions lock you in and making those carefully, even when everything else is moving fast.