Restarting Open vSwitch Without Packet Loss

A small change that helped us maintain old clusters

In my current job, I manage several clusters that are already quite old. Some of them have been running for years and host important workloads.

Because the clusters are old, doing a big upgrade is not always easy. Most of the time we only do small maintenance tasks, like restarting services or doing rolling upgrades.

One important component in these clusters is Open vSwitch (OVS). It is used as the dataplane for our networking.

At first everything looked normal, until I noticed something strange during maintenance.

The Problem: Packet Loss When Restarting OVS

Whenever we restarted or upgraded OVS, the network always showed the same behavior:

traffic stopped for a few seconds
packet loss happened
after a short time the network recovered
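To put a number on this instead of relying on impressions, one option is to keep a ping running across the bridge while OVS restarts and read the loss percentage from the summary. The following is only a minimal sketch: the helper name loss_percent and the peer address are made up for illustration, and it assumes the summary line format printed by the iputils ping.

```shell
# Print the packet loss percentage from a ping summary read on stdin.
# Assumes the iputils summary format, for example:
#   "10 packets transmitted, 8 received, 20% packet loss, time 9013ms"
# loss_percent is a hypothetical helper name, not part of any tool.
loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical usage (10.0.0.2 is a placeholder for a peer reached through OVS):
#   ping -i 0.2 -w 30 10.0.0.2 | loss_percent
# Restart OVS from another terminal while the ping is running.
```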

This felt strange because OVS separates the control plane and the dataplane.

Component	Location
ovs-vswitchd	userspace
datapath	kernel

In theory, restarting the control plane should not interrupt forwarding: ovs-vswitchd runs in userspace, while the datapath that actually forwards packets runs in the kernel. What we observed in practice was different. When ovs-vswitchd restarted, traffic stopped for a short time while the flows in the datapath were rebuilt, and only then did the network return to normal. This indicated that the restart itself was doing something to the datapath.

What Actually Happens

After investigating, I found that this behavior exists in older versions of OVS.

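In OVS 2.13 and earlier, when ovs-vswitchd exits normally, it deletes all datapath flows.

The process looks like this: ovs-vswitchd exits and deletes the datapath flows, the daemon starts again with an empty datapath, and every flow has to be reinstalled from userspace as packets arrive and miss in the kernel.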

This means every restart creates a cold datapath.

In small environments this might not be a big problem. But in large clusters it can cause:

thousands of flows to be recreated
CPU spikes
many upcalls to userspace
packet loss for several seconds

This explains the behavior we saw in production.
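One way to see the burst of upcalls is to sample the datapath lookup counters before and after a restart. This is a rough sketch, not a monitoring tool: missed_lookups is a hypothetical helper name, and it assumes the statistics line printed by ovs-dpctl show has the form "lookups: hit:N missed:N lost:N". Each miss is a packet that found no matching flow in the kernel datapath and had to be handed to userspace.

```shell
# Print the "missed" lookup counter from `ovs-dpctl show` output on stdin.
# Assumes a statistics line such as "lookups: hit:120 missed:45 lost:0".
# Prints one number per datapath if several datapaths are listed.
# missed_lookups is a hypothetical helper name.
missed_lookups() {
    sed -n 's/.*lookups: hit:[0-9]* missed:\([0-9]*\) .*/\1/p'
}

# Possible usage around a restart:
#   before=$(ovs-dpctl show | missed_lookups)
#   ... restart ovs-vswitchd ...
#   after=$(ovs-dpctl show | missed_lookups)
#   echo "misses during restart: $((after - before))"
```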

A Change in Newer OVS Versions

Later I found that the OVS community had already fixed this issue.

Starting from OVS 2.14, there is an important change:

ofproto: Do not delete datapath flows on exit by default

With this change, ovs-vswitchd does not delete datapath flows when exiting.

The new behavior is:

Event	Datapath flows
ovs-vswitchd exit	flows stay
ovs-vswitchd restart	reconnect to datapath
ovs-appctl exit --cleanup	flows deleted

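Now the restart process becomes: ovs-vswitchd exits and leaves the flows in the kernel datapath, the new process starts and reconnects to the existing datapath, and the flows that are already installed keep matching packets in the meantime.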

So forwarding can continue even when the control plane restarts.

The Challenge With Old Clusters

The clusters I manage still run OVS 2.13.x or older, and this change only appears in OVS 2.14 and later.

Upgrading directly to a newer version is not always possible in older environments. So the solution was to backport the patch.

Check if Your OVS Version Already Has the Fix

Before applying the patch, it is a good idea to check if the commit is already included in the OVS version you are using.

The commit we are talking about is:

79eadafeb1b47a3871cb792aa972f6e4d89d1a0b
ofproto: Do not delete datapath flows on exit by default

If you have the OVS source code locally, you can check it using Git.

Check which versions contain the commit

Run:

git tag --contains 79eadafeb1b47a3871cb792aa972f6e4d89d1a0b

Example output:

v2.14.0
v2.14.1
v2.14.2
...

This means the commit first appears in OVS 2.14.

If your version is 2.13.x, the patch is not included yet.
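On hosts where only the installed binaries are available, a quick heuristic is to compare the reported version with 2.14.0. The sketch below assumes GNU sort (for the -V version sort) and a version string in major.minor.patch form; version_has_fix is a hypothetical helper name. Keep in mind that a distribution package may carry its own backports, so the version number alone is only a hint.

```shell
# Succeed (exit status 0) if the given OVS version is 2.14.0 or newer.
# Relies on `sort -V` (version sort) from GNU coreutils.
# Expects a full major.minor.patch string such as "2.13.8".
# version_has_fix is a hypothetical helper name.
version_has_fix() {
    lowest=$(printf '%s\n%s\n' "2.14.0" "$1" | sort -V | head -n 1)
    [ "$lowest" = "2.14.0" ]
}

# Possible usage (takes the last field of the first line of the version output):
#   ver=$(ovs-vswitchd --version | awk 'NR==1 {print $NF}')
#   if version_has_fix "$ver"; then echo "flows are kept on exit"; fi
```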

Backporting the Patch

First, download the OVS source code.

git clone https://github.com/openvswitch/ovs.git
cd ovs
git checkout v2.13.8

Then apply the patch using git cherry-pick.

git cherry-pick 79eadafeb1b47a3871cb792aa972f6e4d89d1a0b

This commit introduces the change that keeps datapath flows when ovs-vswitchd exits.

Sometimes there will be a small conflict in the NEWS file, but this is only a changelog and can be resolved easily.

Small Adjustment for Older Code

In OVS 2.13 the function signature is slightly different.

The new patch introduces this function:

udpif_stop_threads(struct udpif *udpif, bool delete_flows)

But the older code still calls it like this:

udpif_stop_threads(udpif);

So we need to update it to:

udpif_stop_threads(udpif, false);

Setting false means we stop the threads but keep the datapath flows.
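To make sure no one-argument call is left behind after the cherry-pick, a simple grep over the source can help. This is just a sketch: find_old_stop_calls is a hypothetical helper name, and the file path in the usage comment is where I would expect the function to live, so adjust it if your tree differs.

```shell
# List lines that still call udpif_stop_threads() with a single argument.
# Arguments: one or more C source files.
# Exits non-zero when nothing matches, which is the desired end state.
# find_old_stop_calls is a hypothetical helper name.
find_old_stop_calls() {
    grep -n 'udpif_stop_threads([^,()]*)' "$@"
}

# Possible usage from the top of the OVS tree (path is an assumption):
#   find_old_stop_calls ofproto/ofproto-dpif-upcall.c
```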

Building OVS

After applying the patch, build OVS normally.

./boot.sh
./configure
make -j$(nproc)

The main binaries will be:

vswitchd/ovs-vswitchd
ovsdb/ovsdb-server
utilities/ovs-vsctl
utilities/ovs-appctl

Testing the Behavior

First generate some traffic so the datapath has flows, then list them:

ovs-dpctl dump-flows

Then stop ovs-vswitchd.

ovs-appctl -t ovs-vswitchd exit

In the old version, checking the datapath afterwards (for example with ovs-dpctl show) gives:

flows: 0

The datapath becomes empty.

After applying the patch:

ovs-appctl -t ovs-vswitchd exit
ovs-dpctl dump-flows

The flows remain in the datapath.

This means traffic that matches the installed flows keeps being forwarded even while the control plane is stopped.
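The before/after comparison can also be scripted by counting the lines of the dump, since ovs-dpctl dump-flows prints one flow per line. A minimal sketch, where count_flows is a hypothetical helper name:

```shell
# Count datapath flows from `ovs-dpctl dump-flows` output on stdin.
# Every non-empty line is treated as one flow; prints 0 for empty input.
# count_flows is a hypothetical helper name.
count_flows() {
    awk 'NF { n++ } END { print n + 0 }'
}

# Possible usage on a test node:
#   before=$(ovs-dpctl dump-flows | count_flows)
#   ovs-appctl -t ovs-vswitchd exit
#   after=$(ovs-dpctl dump-flows | count_flows)
#   echo "flows before exit: $before, after exit: $after"
```

With the old behavior the second number should drop to 0; with the patch it should stay close to the first.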

Impact in Real Clusters

This small change made a big difference in maintenance.

Restarting OVS is now:

much faster
almost free of packet loss
safer for rolling upgrades

The difference is most visible in clusters with many flows.

Lessons Learned

Old clusters sometimes contain behaviors that are easy to miss.

In this case the problem was not:

network configuration
kernel issues
or infrastructure problems

It was simply how older OVS versions handled shutdown.

Sometimes solving a big operational issue only requires finding the right small patch.

Conclusion

If you are running an older OVS version in production, you might experience the same issue.

There are two possible solutions:

upgrade to OVS 2.14 or newer
or backport the patch to older versions