As I mentioned in the introduction, route
flap damping was considered one of the BGP scaling techniques.
It was introduced in the mid 90s to solve a particular problem that the major network operators
were facing at the time.
And indeed through the mid and late 90s it was a strongly recommended practice,
as it provided significant savings on the
early router CPUs.
But today it is strongly discouraged as it causes far
greater network instability
than it actually cures.
Indeed, route flap damping
should only be implemented following
the recommendations that are mentioned at the end
of this piece.
But before we get to that, let's have a look at the theory, so that we understand how route flap damping works.
A route flap is basically the
going up and down of a path,
or a change in a BGP attribute.
So a BGP withdraw
followed by an update is a flap,
in other words the prefix is withdrawn from the
network
and then it's reannounced to the network and that is considered a single flap.
The physical going up and down of an eBGP neighbor session is not considered a flap;
it's the withdraw of a prefix followed by the reannouncement of the prefix.
And one thing that many operators were not aware of in the earlier years of route flap damping
was that a change in a BGP attribute
also counted as a route flap.
Now these route flaps ripple through the entire internet.
So it's not just a local feature, a withdraw of a prefix is seen globally unless policy gets in the way.
And the update is seen globally as well.
And this wastes CPU. Suddenly operators far away from where the prefixes are flapping
see CPU being used for reasons that they cannot control and are really not interesting to them either.
And so route flap damping was
introduced to try and reduce the scope of route flap propagation.
If prefixes were coming and going all the time, obviously there's not much connectivity that works,
so why use up router CPU running the path selection
process and so forth when there's no real connectivity possible.
So the requirements for route flap damping were that we would have fast convergence for normal route changes.
You want a brand new prefix to be introduced and be
immediately visible.
You want a prefix that is withdrawn for good to stay withdrawn, not to hang around.
And really this means we're going to use history of
the prefix to predict its future behavior.
The idea being to suppress
oscillating routes and advertise stable ones.
The relevant RFC is RFC 2439,
which describes the full implementation of route flap damping.
How it works is quite straightforward. In Cisco IOS,
each flap attracts a penalty,
and this penalty value is 1000.
A change in attribute attracts a penalty of 500.
So if any of the BGP attributes we've mentioned already
changes, we increment
the penalty by 500.
The penalty is exponentially decayed, where a half-life time determines the decay rate.
If the penalty value goes above a
suppress-limit, a predefined limit,
then we don't advertise the route to BGP
peers.
This is where the prefix has been considered to flap so much that it's no
longer worth advertising to a peer.
Once the prefix is suppressed, the penalty
carries on decaying
using the half-life time that's been configured.
And once this penalty
value has gone below a reuse-limit,
the prefix is readvertised to the BGP peers.
And the penalty value gets reset to zero
when it is half
of the reuse-limit.
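To make the mechanism concrete, here is a minimal Python sketch of the penalty bookkeeping just described. It is an illustration only, not any vendor's implementation: the class name `DampedRoute`, its method names, and the 15-minute half-life are assumptions made for the example, while the penalty values of 1000 and 500 and the limits of 2000 and 750 are the ones quoted in this talk.

```python
FLAP_PENALTY = 1000              # withdraw followed by re-announcement
ATTRIBUTE_CHANGE_PENALTY = 500   # change in a BGP attribute


class DampedRoute:
    """Hypothetical per-path damping state; times are in minutes."""

    def __init__(self, half_life=15.0, suppress_limit=2000.0, reuse_limit=750.0):
        self.half_life = half_life            # assumed value for illustration
        self.suppress_limit = suppress_limit
        self.reuse_limit = reuse_limit
        self.penalty = 0.0
        self.suppressed = False
        self.last_update = 0.0

    def advance(self, now):
        """Decay the penalty up to time `now` and update the suppressed state."""
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now
        # Below the reuse-limit the path is advertised again.
        if self.suppressed and self.penalty < self.reuse_limit:
            self.suppressed = False
        # Below half of the reuse-limit the penalty is reset to zero.
        if self.penalty < self.reuse_limit / 2:
            self.penalty = 0.0

    def add_penalty(self, now, amount):
        """Record a flap or attribute change at time `now`."""
        self.advance(now)
        self.penalty += amount
        if self.penalty > self.suppress_limit:
            self.suppressed = True


if __name__ == "__main__":
    route = DampedRoute()
    for minute in (0, 2, 4):                  # three flaps, two minutes apart
        route.add_penalty(minute, FLAP_PENALTY)
        print(minute, round(route.penalty), route.suppressed)
    for minute in (30, 35, 60):               # no more flaps, just decay
        route.advance(minute)
        print(minute, round(route.penalty), route.suppressed)
```

With these assumed values the third flap takes the penalty over the suppress-limit, and the path stays suppressed for roughly half an hour after the last flap.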
The diagram shows this more straightforwardly.
You can see
in the diagram the vertical axis is the penalty value, and
the horizontal axis is a
nominal time value.
The prefix flaps every few minutes, say, and each flap
attracts a penalty of 1000; the penalty value decays exponentially.
The penalty at some point
crosses over a predefined suppress-limit, 2000 in the diagram.
And after that the penalty decays but the prefix still keeps flapping,
decay, flap, decay, flap, until obviously somebody fixes the problem
or notices there was an issue and takes care of it.
After that, the decay carries on, the
prefix is no longer flapping,
and the decay continues until it reaches the
reuse-limit,
which is 750 in this diagram.
And that is many, many minutes later.
The time depends on the vendor defaults or the configured values.
So in this example we have flapped three times and then we disappear for a substantial amount of time.
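How long the prefix stays away follows directly from the exponential decay. As a rough sketch, assuming no further flaps, the waiting time is the half-life multiplied by the base-2 logarithm of the ratio of the current penalty to the reuse-limit. The function name `minutes_until_reuse` and the 15-minute half-life are assumptions for illustration; the reuse-limit of 750 is the value from the diagram.

```python
import math


def minutes_until_reuse(penalty, reuse_limit=750.0, half_life=15.0):
    """Minutes for `penalty` to decay down to `reuse_limit` with no more flaps.

    Solves penalty * 0.5 ** (t / half_life) == reuse_limit for t.
    """
    if penalty <= reuse_limit:
        return 0.0
    return half_life * math.log2(penalty / reuse_limit)


if __name__ == "__main__":
    for penalty in (2000, 3000, 6000):
        print(penalty, round(minutes_until_reuse(penalty), 1))
```

Every doubling of the penalty adds one more half-life of waiting, which is why a prefix that keeps flapping while suppressed disappears for so much longer.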
And certainly with the early implementations it was two flaps and gone for 30 minutes, three flaps and gone for 60 minutes,
very, very typically in most
networks.
So with network operators implementing flap damping, it took two or three flaps, depending on vendor implementation,
before the destination
that was flapping would disappear from the internet
for a good half hour to an
hour.
Certainly for users who were, I suppose,
surprised by this, unaccustomed to it,
it was an interesting troubleshooting
experience when dealing with route flap damping implemented by the major operators.
But this was very, very necessary in that period because of some really unstable prefixes being announced to
the internet.
Now route flap damping only applies to inbound announcements from eBGP peers.
And it only applies to the path that was flapping.
So that meant alternative paths were still usable.
And again by vendor implementation, the
half-life time, the reuse-limit, the suppress-limit,
and the maximum suppress time could all be controlled at the command-line interface.
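These four parameters interact. The maximum suppress time is usually enforced as a ceiling on the penalty, which RFC 2439 describes as the reuse-limit scaled up by the number of half-lives that fit into the maximum suppress time. The sketch below is a hedged illustration of that relationship: the `DampingParams` class and its method names are hypothetical, and the 15-minute half-life and 60-minute maximum suppress time are assumed values, not something stated in this talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DampingParams:
    """Hypothetical container for the four tunable values; times in minutes."""
    half_life: float = 15.0       # assumed for illustration
    reuse_limit: float = 750.0    # value from the diagram
    suppress_limit: float = 2000.0
    max_suppress: float = 60.0    # assumed for illustration

    def penalty_ceiling(self) -> float:
        """Largest penalty that still decays to the reuse-limit within max_suppress."""
        return self.reuse_limit * 2 ** (self.max_suppress / self.half_life)

    def can_suppress(self) -> bool:
        """False if the ceiling sits below the suppress-limit, so nothing is ever damped."""
        return self.penalty_ceiling() > self.suppress_limit


if __name__ == "__main__":
    defaults = DampingParams()
    print(defaults.penalty_ceiling(), defaults.can_suppress())
    short = DampingParams(max_suppress=20.0)
    print(round(short.penalty_ceiling()), short.can_suppress())
```

The second case shows why the values cannot be chosen independently: with a maximum suppress time that is too short relative to the half-life, the penalty can never reach the suppress-limit.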
Now, if we
have a look at the history of flap damping,
it was introduced in the
internet around 1995.
And we very, very quickly learned in the
operations industry, that the vendor defaults were somewhat too severe.
In their defense the vendors really didn't know what default values to use,
and the defaults that were introduced seemed reasonable at the time.
In Europe, the RIPE Routing Working Group studied the impact of route flap damping in some of
the networks in Europe,
with contributions from some of the North
American network operators as well.
And this resulted in a document series, ripe-178, followed by ripe-210 and ripe-229,
which introduced recommendations for route flap damping.
These recommendations applied different levels of damping to /22s, /23s, and /24s,
and excluded the root name servers from flap damping as well.
Unfortunately we found over the years, that many network operators simply switched on the vendor default values without thinking.
The manual said to do
this, so it was done, whether it was good for the network or not.
And as I said earlier route flap damping was introduced to solve a specific problem in the mid 90s
that had by and large disappeared by the early 2000s.
Researchers in the early 2000s had started to look at routing convergence,
and several works, some examples included in the slides here,
discovered that route
flap damping was actually causing more problems
with the convergence of the
internet routing system than it was actually solving.
One problem is this:
when a path flaps, the BGP speakers pick the next best path,
and they announce
that next best path to all peers, and the flap counter is incremented,
because a new best path means the AS path has changed,
and a change in an attribute attracts
a penalty of 500.
Many operators weren't aware of the change of attribute actually contributing to the flap counter.
And because of that these peers
then see a change in best path, their flap counter is incremented, and after a few
hops
peers see multiple changes simply caused by a single flap,
and the prefix a
few hops away ends up being suppressed.
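A small worked example shows how quickly this adds up. The sketch below simply sums the penalties for a burst of events that arrive within seconds of each other, so decay is ignored. The event lists are hypothetical, chosen only to illustrate the effect, and the function name is an assumption; the penalty values and the suppress-limit of 2000 are the ones used earlier in this talk.

```python
FLAP_PENALTY = 1000
ATTRIBUTE_CHANGE_PENALTY = 500
SUPPRESS_LIMIT = 2000


def penalty_from_burst(events):
    """Total penalty for events close enough together that decay is negligible."""
    weights = {
        "flap": FLAP_PENALTY,
        "attribute_change": ATTRIBUTE_CHANGE_PENALTY,
    }
    return sum(weights[event] for event in events)


# Hypothetical view next to the origin: one withdraw and re-announcement.
near_origin = ["flap"]

# Hypothetical view a few hops away: the same single flap, plus three
# best-path changes as upstream routers work through alternative paths.
few_hops_away = ["flap", "attribute_change", "attribute_change", "attribute_change"]

if __name__ == "__main__":
    for name, events in (("near", near_origin), ("far", few_hops_away)):
        total = penalty_from_burst(events)
        print(name, total, total > SUPPRESS_LIMIT)
```

In this made-up case the router next to the origin records 1000 and carries on, while the distant router records 2500 and suppresses the prefix after what was only a single flap.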
The second problem is that different BGP
implementations have different transit times for prefixes,
this is the minimum
advertisement interval.
Some hold onto the prefix for some time before
advertising it.
Others advertise the prefix as soon as it's been received and the best path selection process has been run.
And so if you have
networks using different BGP implementations, that peer with each other,
you get this race to the finish line causing the appearance of flapping
when in fact there's been no flap at all.
There's just been a brand new prefix
announcement or even just a simple path change,
and that causes the prefix to be
suppressed as well.
So really what we found, in recent years,
is that
misconfigured route flap damping will seriously impact access
both to the
local network and to the whole internet.
There's more background about this in
the RIPE Routing Working Group document ripe-378,
and indeed more recent work has
been carried out by the RIPE Routing Working Group
which can be read about in the ripe-580 document.
And the IETF has produced an RFC,
RFC 7196,
which now provides recommendations for properly configured route flap damping.
So if you feel that route flap damping is necessary for your network,
I strongly encourage you to look at RFC 7196
and ripe-580 before you
proceed by simply turning on the vendor defaults.
© Produced by Philip Smith and the Network Startup Resource Center, through the University of Oregon.
Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)