A small change that helped us maintain old clusters
In my current job, I manage several clusters that are already quite old. Some of them have been running for years and host important workloads.
Because the clusters are old, doing a big upgrade is not always easy. Most of the time we only do small maintenance tasks, like restarting services or doing rolling upgrades.
One important component in these clusters is Open vSwitch (OVS). It is used as the dataplane for our networking.
At first everything looked normal, until I noticed something strange during maintenance.
The Problem: Packet Loss When Restarting OVS
Whenever we restarted or upgraded OVS, the network always showed the same behavior:
- traffic stopped for a few seconds
- packet loss happened
- after a short time the network recovered
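To put a number on the loss rather than eyeballing it, one option is to keep a ping running against a workload while OVS restarts and read the loss percentage from the summary. The sketch below assumes the iputils ping summary format; the function name ping_loss_percent and the address 192.0.2.10 are placeholders, not something from our setup.

```shell
#!/bin/sh
# Extract the packet-loss percentage from an iputils-style ping summary
# read on stdin. The helper name is made up for this example.
ping_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage (192.0.2.10 is a placeholder; start this, then restart OVS from
# another terminal while it is still running):
#   ping -i 0.2 -c 150 192.0.2.10 | ping_loss_percent
```

A short interval such as 0.2 seconds makes a gap of a few seconds show up as a clear percentage instead of one or two lost packets.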
This felt strange because OVS separates the control plane and the dataplane.
| Component | Location |
|---|---|
| ovs-vswitchd | userspace |
| datapath | kernel |
In theory, because ovs-vswitchd runs in userspace while the datapath runs in the kernel, restarting the daemon should not interrupt packet forwarding. What we observed in practice was different: when ovs-vswitchd restarted, traffic stopped for a short time while the flows in the datapath were rebuilt, and only then did the network return to normal. This indicated that something was happening to the datapath during the restart.
What Actually Happens
After investigating, I found that this behavior exists in older versions of OVS.
In versions like OVS 2.13 and earlier, when ovs-vswitchd exits normally, it deletes all datapath flows.
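The process looks like this: ovs-vswitchd exits and deletes the datapath flows, the daemon starts again, and every packet misses in the empty datapath and goes to userspace as an upcall until the flows are reinstalled.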
This means every restart creates a cold datapath.
In small environments this might not be a big problem. But in large clusters it can cause:
- thousands of flows to be recreated
- CPU spikes
- many upcalls to userspace
- packet loss for several seconds
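A rough way to see the upcall burst is to compare the datapath's "missed" lookup counter before and after a restart, since each miss is a packet that had to go to userspace. This is only a sketch: it assumes the "lookups: hit:N missed:N lost:N" line printed by ovs-dpctl show and a single datapath, and the helper name dp_missed is made up.

```shell
#!/bin/sh
# Print the "missed" counter from `ovs-dpctl show` output read on stdin.
# With several datapaths only the first one is reported.
dp_missed() {
    sed -n 's/.*lookups:.* missed:\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage around a restart:
#   before=$(ovs-dpctl show | dp_missed)
#   ... restart ovs-vswitchd and wait for traffic to recover ...
#   after=$(ovs-dpctl show | dp_missed)
#   echo "datapath misses during restart: $((after - before))"
```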
This explains the behavior we saw in production.
A Change in Newer OVS Versions
Later I found that the OVS community had already fixed this issue.
Starting from OVS 2.14, there is an important change:
ofproto: Do not delete datapath flows on exit by default
With this change, ovs-vswitchd does not delete datapath flows when exiting.
The new behavior is:
| Event | Datapath flows |
|---|---|
| ovs-vswitchd exit | flows stay |
| ovs-vswitchd restart | reconnect to datapath |
| ovs-appctl exit --cleanup | flows deleted |
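Now the restart process becomes: ovs-vswitchd exits and leaves the datapath flows in place, the kernel keeps forwarding with those flows, and the new ovs-vswitchd process reconnects to the existing datapath.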
So forwarding can continue even when the control plane restarts.
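The two exit modes from the table can be wrapped in a small helper so the choice is explicit in maintenance scripts. This is a sketch, not an official tool: stop_vswitchd and the OVS_APPCTL override are names invented here, and the "keep" mode only preserves flows on a build that includes the change.

```shell
#!/bin/sh
# OVS_APPCTL can be overridden, for example with "echo", for a dry run.
OVS_APPCTL=${OVS_APPCTL:-ovs-appctl}

# stop_vswitchd keep|cleanup   (hypothetical helper)
#   keep    - plain exit; datapath flows stay on 2.14+ or a patched build
#   cleanup - exit --cleanup; datapath flows are deleted
stop_vswitchd() {
    case "$1" in
        keep)    $OVS_APPCTL -t ovs-vswitchd exit ;;
        cleanup) $OVS_APPCTL -t ovs-vswitchd exit --cleanup ;;
        *)       echo "usage: stop_vswitchd keep|cleanup" >&2; return 2 ;;
    esac
}

# Usage:
#   stop_vswitchd keep
```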
The Challenge With Old Clusters
The clusters I manage still run OVS 2.13.x or older, and this change only appears in OVS 2.14 and later.
Upgrading directly to a newer version is not always possible in older environments. So the solution was to backport the patch.
Check if Your OVS Version Already Has the Fix
Before applying the patch, it is a good idea to check if the commit is already included in the OVS version you are using.
The commit we are talking about is:
79eadafeb1b47a3871cb792aa972f6e4d89d1a0b
ofproto: Do not delete datapath flows on exit by default
If you have the OVS source code locally, you can check it using Git.
Check which versions contain the commit
Run:
git tag --contains 79eadafeb1b47a3871cb792aa972f6e4d89d1a0b
Example output:
v2.14.0
v2.14.1
v2.14.2
...
This means the commit first appears in OVS 2.14.
If your version is 2.13.x, the patch is not included yet.
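On hosts where the source tree is not available, a quick first check is to compare the installed version against 2.14.0. The sketch below relies on GNU sort -V, and version_has_fix is a name made up for this example. Note that it only looks at the version number: a 2.13.x build with the patch backported still reports 2.13.x, so a "no" here is not proof that the fix is missing.

```shell
#!/bin/sh
# Succeeds when the given version string is 2.14.0 or newer.
version_has_fix() {
    lowest=$(printf '%s\n' "2.14.0" "$1" | sort -V | head -n 1)
    [ "$lowest" = "2.14.0" ]
}

# Usage, assuming the first line of `ovs-vswitchd --version` ends with the
# version number, e.g. "ovs-vswitchd (Open vSwitch) 2.13.8":
#   ver=$(ovs-vswitchd --version | awk 'NR==1 {print $NF}')
#   if version_has_fix "$ver"; then echo "fix included"; else echo "check for a backport"; fi
```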
Backporting the Patch
First, download the OVS source code.
git clone https://github.com/openvswitch/ovs.git
cd ovs
git checkout v2.13.8
Then apply the patch using git cherry-pick.
git cherry-pick 79eadafeb1b47a3871cb792aa972f6e4d89d1a0b
This commit introduces the change that keeps datapath flows when ovs-vswitchd exits.
Sometimes there will be a small conflict in the NEWS file, but this is only a changelog and can be resolved easily.
Small Adjustment for Older Code
In OVS 2.13 some call sites still use the old function signature.
The patch changes the function to take a second argument:
udpif_stop_threads(struct udpif *udpif, bool delete_flows)
But the older code still calls it like this:
udpif_stop_threads(udpif);
So we need to update it to:
udpif_stop_threads(udpif, false);
Setting false means we stop the threads but keep the datapath flows.
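To find the calls that still need the second argument, a grep for single-argument invocations is usually enough. This is a sketch: find_old_stop_calls is a hypothetical helper, and the path in the usage line assumes the function lives in ofproto/ofproto-dpif-upcall.c in your tree.

```shell
#!/bin/sh
# List lines in file $1 that call udpif_stop_threads() with one argument
# (no comma inside the parentheses, followed by a semicolon).
find_old_stop_calls() {
    grep -nE 'udpif_stop_threads\([^,()]*\);' "$1"
}

# Usage from the top of the OVS source tree:
#   find_old_stop_calls ofproto/ofproto-dpif-upcall.c
```

The compiler will also reject the old calls, so this mainly saves a build cycle.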
Building OVS
After applying the patch, build OVS normally.
./boot.sh
./configure
make -j$(nproc)
The main binaries will be:
vswitchd/ovs-vswitchd
ovsdb/ovsdb-server
utilities/ovs-vsctl
utilities/ovs-appctl
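Before copying anything to a node, it can help to confirm that the build actually produced all four binaries. A minimal sketch, with check_ovs_binaries as a made-up helper name:

```shell
#!/bin/sh
# Report which of the expected binaries are missing under build tree $1.
# Returns 0 when all are present and executable, 1 otherwise.
check_ovs_binaries() {
    missing=0
    for bin in vswitchd/ovs-vswitchd ovsdb/ovsdb-server \
               utilities/ovs-vsctl utilities/ovs-appctl; do
        if [ ! -x "$1/$bin" ]; then
            echo "missing: $bin"
            missing=1
        fi
    done
    return "$missing"
}

# Usage from the top of the source tree:
#   check_ovs_binaries . && ./vswitchd/ovs-vswitchd --version
```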
Testing the Behavior
First generate some traffic so the datapath has flows, then list them:
ovs-dpctl dump-flows
Then stop ovs-vswitchd.
ovs-appctl -t ovs-vswitchd exit
In the old version you will see:
flows: 0
The datapath becomes empty.
After applying the patch:
ovs-appctl -t ovs-vswitchd exit
ovs-dpctl dump-flows
The flows remain in the datapath.
This means forwarding can continue even when the control plane stops.
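Comparing flow counts is easier than reading the full dump. The sketch below counts one flow per non-empty line of ovs-dpctl dump-flows output and assumes a single datapath; count_dp_flows and the OVS_DPCTL override are names invented for this example.

```shell
#!/bin/sh
# OVS_DPCTL can be overridden, for example with a stub, for testing.
OVS_DPCTL=${OVS_DPCTL:-ovs-dpctl}

# Print the number of flows currently installed in the datapath.
count_dp_flows() {
    $OVS_DPCTL dump-flows | awk 'NF { n++ } END { print n + 0 }'
}

# Usage:
#   before=$(count_dp_flows)
#   ovs-appctl -t ovs-vswitchd exit
#   after=$(count_dp_flows)
#   echo "flows before exit: $before, after exit: $after"
```

On an unpatched build the second number should drop to 0; on a patched build it should stay close to the first, minus any flows that expired in between.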
Impact in Real Clusters
This small change made a big difference in maintenance.
Restarting OVS is now:
- much faster
- almost free of packet loss
- safer for rolling upgrades
The difference is most visible in clusters with many flows.
Lessons Learned
Old clusters sometimes contain behaviors that are easy to miss.
In this case the problem was not:
- network configuration
- kernel issues
- or infrastructure problems
It was simply how older OVS versions handled shutdown.
Sometimes solving a big operational issue only requires finding the right small patch.
Conclusion
If you are running OVS in production with older versions, you might experience the same issue.
There are two possible solutions:
- upgrade to OVS 2.14 or newer
- or backport the patch to older versions
