FLEET MANAGEMENT
80,000 MikroTik devices in 6 countries: what we learned running a fleet
Provisioning, drift, asymmetric routing, mass rollouts. Four years of lessons from the field, condensed for ISPs, WISPs and MSPs who still configure routers "by hand".
OptiWize runs over 80,000 MikroTik devices in production, across six European and African countries, for customers ranging from regional ISPs to national carriers. This is not marketing: it is the current state of our core, measured by the internal dashboard as we write. Numbers like these force you to learn the difference between a good idea and an idea that survives scale. We collected the hardest lessons in this article, hoping they will save a few night shifts for anyone scaling up right now.
Lesson 1: manual provisioning is not an MVP, it's technical debt
In OptiWize's first phase, every new device was configured by hand by the NOC: SSH in, paste the config, tag in Winbox. It worked. At 100 devices it was tolerable, at 1,000 it was tiring, at 5,000 it was a blocker. Not because the NOC was incapable: because each operator introduced micro-variants. An extra route, a missing firewall rule, a slightly different QoS parameter. The network was "fine" on every individual router but, at fleet level, it was unmanageable.
The shift came with provisioning templates. A template is a parameterised configuration: you define the variables (WAN IP, VLAN, bandwidth, QoS tier), and the system generates a valid config, applies it and verifies the outcome. Today OptiWize runs with 55+ active templates, from residential CPE to enterprise gateway to branch firewall. The NOC no longer writes configurations: it picks a template and fills in the form. Mean provisioning time is down 70% compared with manual setup and, most importantly, drift has been driven to zero.
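To make the idea of a parameterised configuration concrete, here is a minimal sketch in Python. It is not OptiWize's template engine: the template text, the variable names (`vlan_id`, `wan_ip`, `lan_subnet`, `up_mbps`, `down_mbps`) and the interface name `ether1` are assumptions chosen for illustration. The point is only that variables are validated before a config is ever generated.

```python
import ipaddress
from string import Template

# Hypothetical residential CPE template; names and values are illustrative.
CPE_TEMPLATE = Template("""\
/interface vlan add name=vlan$vlan_id vlan-id=$vlan_id interface=ether1
/ip address add address=$wan_ip interface=vlan$vlan_id
/queue simple add name=customer target=$lan_subnet max-limit=${up_mbps}M/${down_mbps}M
""")

REQUIRED = {"vlan_id", "wan_ip", "lan_subnet", "up_mbps", "down_mbps"}


def render_config(params: dict) -> str:
    """Validate the variables, then return the rendered configuration text."""
    missing = REQUIRED - params.keys()
    if missing:
        raise ValueError(f"missing variables: {sorted(missing)}")
    if not 1 <= int(params["vlan_id"]) <= 4094:
        raise ValueError("vlan_id must be between 1 and 4094")
    # Both calls raise ValueError on malformed input.
    ipaddress.ip_interface(params["wan_ip"])
    ipaddress.ip_network(params["lan_subnet"])
    return CPE_TEMPLATE.substitute(params)


if __name__ == "__main__":
    print(render_config({
        "vlan_id": 120, "wan_ip": "203.0.113.10/30",
        "lan_subnet": "192.168.88.0/24", "up_mbps": 20, "down_mbps": 100,
    }))
```

Because every operator fills the same form, two devices provisioned with the same values get byte-identical configs, which is what removes the micro-variants.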
Lesson 2: drift is a cancer, not a minor annoyance
"Drift" is the polite word for "the router is not the way we left it any more". Someone logged into Winbox, added a route to troubleshoot something quickly, never removed it. A week later, that router behaves differently from the others. Six months later, nobody remembers why.
OptiWize treats drift detection like an industrial control loop, not as a nice-to-have. Every device has an expected configuration (the "declared state") and an observed one (the "read state"). The system continuously compares the two. If they diverge, even by an extra comment in a rule, we raise an exception. The operator decides: either promote the drift to declared state (because it was intentional) or revert it.
It sounds obvious written like this. In practice, when you support tens of thousands of devices, every micro-variant is a mine that can blow up six months later during an upgrade. We built drift detection because we spent entire weeks chasing "weird" bugs that were simply routers no longer aligned to the baseline.
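The comparison between declared state and read state can be sketched in a few lines of Python. This is an illustration under simple assumptions, not the OptiWize implementation: both states are taken to be plain-text config exports, and the function names `normalise` and `detect_drift` are hypothetical.

```python
import difflib


def normalise(config: str) -> list[str]:
    """Drop blank lines and trailing whitespace; comments are kept on purpose."""
    return [line.rstrip() for line in config.splitlines() if line.strip()]


def detect_drift(declared: str, observed: str) -> list[str]:
    """Return the added ('+') and removed ('-') lines; an empty list means aligned."""
    diff = list(difflib.unified_diff(
        normalise(declared), normalise(observed),
        fromfile="declared", tofile="observed", lineterm="", n=0,
    ))
    # The first two lines are the file headers; '@@' lines are hunk markers.
    return [line for line in diff[2:] if not line.startswith("@@")]


if __name__ == "__main__":
    declared = "/ip route add dst-address=0.0.0.0/0 gateway=10.0.0.1\n"
    observed = declared + "/ip route add dst-address=10.9.0.0/16 gateway=10.0.0.254 comment=debug\n"
    for line in detect_drift(declared, observed):
        print(line)
```

A non-empty result is what would be surfaced to the operator, who then either promotes the observed text to the new declared state or reverts the device.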
Lesson 3: asymmetric routing is the most frequent bug, and the worst diagnosed
When we migrated OptiWize to Kubernetes, with ovpn-server pods handling the management VPN towards devices, we hit a classic of the trade: asymmetric routing. Traffic from the K8s cluster reached MikroTik devices with the node's source IP (e.g. 192.168.122.51). The MikroTik did not know how to route the reply back: the return path went through the default route, not the VPN. The result: cascading TCP timeouts and sessions dropping within 30 seconds.
Diagnosis took days. The fix was trivial once we understood the problem: apply MASQUERADE on the OpenVPN tun interfaces inside the cluster (`iptables -t nat -A POSTROUTING -o tun10x -j MASQUERADE`). From that moment, every request to a MikroTik appears to come from the VPN server tun IP, which the device knows how to route back. Asymmetry solved.
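Why the source address matters can be shown with a toy longest-prefix-match lookup from the device's point of view. The routing table below is an assumption made for illustration (a `10.8.0.0/24` management VPN subnet and the interface names `ovpn-out1` and `ether1-wan` are hypothetical); only the node IP `192.168.122.51` comes from the incident described above.

```python
import ipaddress

# Hypothetical device routing table: (prefix, outgoing interface).
DEVICE_ROUTES = [
    (ipaddress.ip_network("10.8.0.0/24"), "ovpn-out1"),   # management VPN subnet
    (ipaddress.ip_network("0.0.0.0/0"), "ether1-wan"),    # default route
]


def reply_interface(source_ip: str) -> str:
    """Longest-prefix match: the interface the device uses to answer source_ip."""
    addr = ipaddress.ip_address(source_ip)
    matches = [(net, iface) for net, iface in DEVICE_ROUTES if addr in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]


if __name__ == "__main__":
    # Without MASQUERADE the device sees the K8s node IP and replies via WAN.
    print(reply_interface("192.168.122.51"))
    # With MASQUERADE it sees the VPN server's tun IP and replies via the VPN.
    print(reply_interface("10.8.0.1"))
```

The request arrives through the tunnel in both cases; only the masqueraded one gets its reply sent back through the same tunnel.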
Lesson 4: load balancing is done with explicit rules, not with luck
Still in the K8s phase, we had to distribute OpenVPN traffic across three instances behind a single port (1188 → 1181, 1182, 1183). The first implementation used iptables DNAT rules without explicitly specifying the destination port. Result: the rules intercepted ALL TCP traffic on the node, including SSH, kubelet and the API server. The cluster was broken within minutes.
The correct rule is `iptables -t nat -A PREROUTING -p tcp --dport 1188 -m statistic --mode nth --every 3 --packet 0 -j DNAT --to-destination :1181`. The `--dport` match is mandatory: without it, every TCP packet matches the rule. It's the kind of mistake you don't see in the lab and that shreds your production. We mention it because we have seen the same oversight in three different customer setups.
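The rule quoted above covers the first backend only. A common way to complete the chain, assumed here rather than taken from the original setup, is to decrease `--every` for each following rule (3, then 2) and leave the last rule unconditional, because each rule only sees the connections the previous ones did not match. In the nat table, rules are evaluated for the first packet of a connection, so the counter effectively counts new connections. The Python sketch below generates such a chain and simulates the resulting distribution; `dnat_rules` and `simulate` are hypothetical helper names.

```python
def dnat_rules(front_port: int, backends: list[int]) -> list[str]:
    """Build one DNAT rule per backend port, always pinned to --dport."""
    rules = []
    for i, port in enumerate(backends):
        remaining = len(backends) - i
        match = f"-p tcp --dport {front_port}"
        if remaining > 1:  # the last rule catches everything left over
            match += f" -m statistic --mode nth --every {remaining} --packet 0"
        rules.append(
            f"iptables -t nat -A PREROUTING {match} -j DNAT --to-destination :{port}"
        )
    return rules


def simulate(backends: list[int], connections: int) -> dict[int, int]:
    """Count how many new connections each backend receives."""
    counters = [0] * len(backends)          # one nth counter per rule
    hits = {port: 0 for port in backends}
    for _ in range(connections):
        for i, port in enumerate(backends):
            every = len(backends) - i
            matched = counters[i] % every == 0
            counters[i] += 1
            if matched:
                hits[port] += 1
                break                       # DNAT is a terminating target
    return hits


if __name__ == "__main__":
    for rule in dnat_rules(1188, [1181, 1182, 1183]):
        print(rule)
    print(simulate([1181, 1182, 1183], 300))
```

Generating the rules from code rather than typing them has a side benefit: the `--dport` match cannot be forgotten on one of them.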
Lesson 5: mass rollouts don't happen at night, they happen with canary releases
For years the informal rule was "big rollouts happen at 03:00, when traffic is lower". It is a first-generation NOC rule, from the time when rollback required a manual intervention. Not any more. A mass rollout on a MikroTik fleet is done with canary releases: you apply the change to a subset (5%-10%), watch the KPIs (packet loss, latency, ticket volume), and proceed only if the numbers hold.
OptiWize supports canary releases in the template engine: every policy rollout can be targeted by group, region or customer tag. The system applies progressively, watches Zabbix metrics in real time, and if it detects degradation beyond a threshold it stops automatically. Mass rollouts moved from night operations to daytime supervised processes. The NOC difference: sleeping at night.
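The control flow of a staged rollout with an automatic stop can be sketched as follows. This is an illustration, not the OptiWize engine: the stage fractions are assumptions, and `apply` and `healthy` are hypothetical callbacks standing in for "push the change to one device" and "query the monitoring system and compare KPIs against a threshold".

```python
from typing import Callable, Sequence


def canary_rollout(
    devices: Sequence[str],
    apply: Callable[[str], None],
    healthy: Callable[[Sequence[str]], bool],
    stages: Sequence[float] = (0.05, 0.25, 1.0),
) -> tuple[list[str], bool]:
    """Apply a change in growing stages; stop at the first unhealthy stage.

    Returns the devices that received the change and whether the rollout
    completed.
    """
    done = 0
    for fraction in stages:
        cutoff = max(1, round(len(devices) * fraction))
        for device in devices[done:cutoff]:
            apply(device)
        done = max(done, cutoff)
        # KPI check (packet loss, latency, ticket volume) on everything updated so far.
        if not healthy(devices[:done]):
            return list(devices[:done]), False
    return list(devices[:done]), True


if __name__ == "__main__":
    fleet = [f"cpe-{i}" for i in range(100)]
    updated, completed = canary_rollout(fleet, apply=print, healthy=lambda batch: True)
    print(len(updated), completed)
```

The important property is that a bad change reaches at most one stage beyond the last healthy one, which is what makes a daytime rollout acceptable.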
Lesson 6: treat every device as if you could never touch it again
The most common mistake in fleet management is to think of an individual router as an accessible asset. It is today. It won't be tomorrow. A CPE in a customer's home, a PE in a remote POP, a firewall on the third floor of a branch warehouse: every device carries a logistical cost whenever someone has to go back and lay hands on it. The golden rule we learned is: every operation you perform on a device must be idempotent, rollback-friendly and observable remotely. Always. No exceptions.
OptiWize implements this rule through three mechanisms: backup channel polling (if the device loses the primary VPN, there is a backup channel via cellular or secondary L2TP), configuration watchdog (if you push a config that locks you out, the device automatically rolls back after 60 seconds if you don't confirm), and safe-mode scheduling (scheduled operations cancel themselves if there is no active management connection).
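The configuration watchdog is a commit-confirm pattern: the new config is applied provisionally and reverts unless it is confirmed before a deadline. Below is a minimal sketch of that logic, with time passed in explicitly so the behaviour is deterministic. The class name `ConfigWatchdog` and its methods are hypothetical; only the 60-second timeout comes from the text above.

```python
class ConfigWatchdog:
    """Commit-confirm sketch: unconfirmed changes revert after `timeout` seconds."""

    def __init__(self, config: dict, timeout: float = 60.0):
        self.config = dict(config)
        self.timeout = timeout
        self._previous = None   # config to restore if nobody confirms
        self._deadline = None   # time at which the rollback fires

    def push(self, new_config: dict, now: float) -> None:
        """Apply a config provisionally and arm the rollback timer."""
        self._previous = dict(self.config)
        self.config = dict(new_config)
        self._deadline = now + self.timeout

    def tick(self, now: float) -> None:
        """Called periodically on the device; rolls back once the deadline passes."""
        if self._deadline is not None and now >= self._deadline:
            self.config = self._previous
            self._previous = self._deadline = None

    def confirm(self, now: float) -> bool:
        """Make the pending config permanent; False if it was already rolled back."""
        self.tick(now)
        if self._deadline is None:
            return False
        self._previous = self._deadline = None
        return True


if __name__ == "__main__":
    device = ConfigWatchdog({"firewall": "allow-mgmt"})
    device.push({"firewall": "drop-all"}, now=0.0)   # this change locks us out
    device.tick(now=60.0)                            # no confirmation arrived
    print(device.config)
```

If the pushed config cuts the management path, the confirmation can never arrive, so the device restores the last known-good state on its own.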
Three things to do if you are mid-stream
- Measure drift. If you don't know how many of your devices are actually aligned to the baseline you think they follow, start there. Even a quarterly manual check is better than nothing. When the number turns out ugly (and it will), you have the basis to justify investing in a template engine.
- Identify the three most-repeated configurations. For every customer we talk to, 80% of the provisioning work is repeated on three or four patterns. Turn them into templates, even if managed manually via a Python script. The productivity jump is immediate.
- Stop doing night rollouts. Not because they are wrong, but because they keep you in a world where rollback is manual. Invest two weeks of work into canary releases with automatic monitoring. It changes your operational life.
If you want to see it live
OptiWize runs today on 80,000+ devices, 55+ active templates, continuous drift detection, canary releases enabled. If you operate a fleet above 1,000 routers and recognised yourself in any of the points above, the fastest way to find out if we're a fit is a half-day audit: we take one of your real networks, compare it against our benchmarks, give you a concrete number on provisioning time, drift and MTTR. No commitments, no pitch.