As I mentioned in the introduction, route flap damping was considered one of the BGP scaling techniques. It was introduced in the mid 90s to solve a particular problem that the major network operators were facing at the time. Through the mid and late 90s it was a strongly recommended practice, as it provided significant savings on the early router CPUs. But today it is strongly discouraged, as it causes far more network instability than it cures. Route flap damping should only be implemented following the recommendations mentioned at the end of this piece. But before we get to those, let's have a look at the theory, so that we understand how route flap damping works.
A route flap is basically the going up and down of a path, or a change in a BGP attribute. So a BGP withdraw followed by an update is a flap: the prefix is withdrawn from the network and then reannounced to the network, and that is considered a single flap. The physical going up and down of an eBGP neighbor session is not considered a flap; it's the withdraw of a prefix followed by the reannouncement of the prefix. And one thing that many operators were not aware of in the earlier years of route flap damping was that a change in a BGP attribute also counted as a route flap.
Now these route flaps ripple through the entire internet. So it's not just a local event: a withdraw of a prefix is seen globally unless policy gets in the way, and the update is seen globally as well. And this wastes CPU. Operators far away from where the prefixes are flapping see CPU being used for reasons that they cannot control and that are really of no interest to them either. So route flap damping was introduced to try and reduce the scope of route flap propagation.
If prefixes are coming and going all the time, obviously there's not much connectivity that works, so why use up router CPU running the path selection process and so forth when there's no real connectivity possible? So the requirements for route flap damping were that we would have fast convergence for normal route changes. You want a brand new prefix to be introduced and be immediately visible. You want a prefix that is withdrawn for good to disappear straight away, not to hang around. And really this means we're going to use the history of the prefix to predict its future behavior, the idea being to suppress oscillating routes and advertise stable ones. The Internet RFC describing this is RFC 2439, which describes the full implementation of route flap damping.
How it works is quite straightforward. In Cisco IOS, each flap attracts a penalty, and this penalty value is 1000. A change in attribute attracts a penalty of 500. So if any of the BGP attributes we've mentioned already changes, we increment the penalty by 500. The penalty is exponentially decayed, where a half-life time determines the decay rate. If the penalty value goes above a predefined suppress-limit, then we don't advertise the route to BGP peers. This is where the prefix is considered to have flapped so much that it's no longer worth advertising to a peer. Once the prefix has been suppressed, the penalty continues to decay using the half-life time that's been configured. And once this penalty value has gone below a reuse-limit, the prefix is readvertised to the BGP peers. And the penalty value gets reset to zero when it falls to half of the reuse-limit.
The diagram shows this more straightforwardly. In the diagram the vertical axis is the penalty value and the horizontal axis is a nominal time value. The prefix flaps every few minutes, say, and attracts a penalty of 1000 each time; the penalty value decays exponentially; and at some point the penalty crosses over a predefined suppress-limit, 2000 in the diagram.
And after that the penalty decays, the prefix still keeps flapping, decay, flap, decay, flap, until somebody notices there is an issue and fixes the problem. After that, the prefix is no longer flapping, and the decay continues until the penalty reaches the reuse-limit, which is 750 in this diagram. And that is many, many minutes later. The time depends on the vendor defaults or the configured values. So in this example we have flapped three times and then we disappear for a substantial amount of time. And certainly the early implementations were: two flaps, gone for 30 minutes; three flaps, gone for 60 minutes, very typically in most networks. So for network operators implementing flap damping, it took two or three flaps, depending on vendor implementation, before the destination that was flapping would disappear from the internet for a good half hour to an hour. Certainly for users who were surprised by this and unaccustomed to it, it was an interesting troubleshooting experience when dealing with route flap damping implemented by the major operators. But this was very necessary in that period because of some really unstable prefixes being announced to the internet.
Now route flap damping only applies to inbound announcements from eBGP peers. And it only applies to the path that was flapping, so alternative paths were still usable. And, again depending on vendor implementation, the half-life time, the reuse-limit, the suppress-limit, and the maximum suppress time could all be controlled at the command-line interface.
Now, if we have a look at the history of flap damping, it was introduced in the internet around 1995. And we very quickly learned in the operations industry that the vendor defaults were somewhat too severe. In their defense, the vendors really didn't know what default values to use, and the defaults that were introduced seemed reasonable at the time.
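The penalty mechanics described above, and why the early defaults kept a prefix away for so long, can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation. The 1000 and 500 penalties and the 2000 and 750 limits come from the talk; the 15-minute half-life, the 60-minute maximum suppress time, the one-minute time step, and all function names are assumptions made for the example.

```python
import math

# Values quoted in the talk.
FLAP_PENALTY = 1000
ATTRIBUTE_CHANGE_PENALTY = 500
SUPPRESS_LIMIT = 2000
REUSE_LIMIT = 750
# Assumed values, chosen only for illustration.
HALF_LIFE_MIN = 15.0
MAX_SUPPRESS_MIN = 60.0


def decay(penalty, elapsed_min, half_life_min=HALF_LIFE_MIN):
    """Exponential decay: the penalty halves every half_life_min minutes."""
    return penalty * 0.5 ** (elapsed_min / half_life_min)


def simulate(events, end_min):
    """Track one prefix in one-minute steps.

    events maps an integer minute to the penalty added at that minute.
    Returns a list of (minute, penalty, suppressed) tuples.
    """
    penalty = 0.0
    suppressed = False
    samples = []
    for minute in range(end_min + 1):
        if minute > 0:
            penalty = decay(penalty, 1.0)
        penalty += events.get(minute, 0)
        if penalty > SUPPRESS_LIMIT:
            suppressed = True            # stop advertising to peers
        elif suppressed and penalty < REUSE_LIMIT:
            suppressed = False           # readvertise to peers
        if penalty < REUSE_LIMIT / 2:
            penalty = 0.0                # history is cleared
        samples.append((minute, penalty, suppressed))
    return samples


def minutes_until_reuse(penalty, reuse_limit=REUSE_LIMIT,
                        half_life_min=HALF_LIFE_MIN):
    """Minutes of stability needed for penalty to decay to reuse_limit."""
    if penalty <= reuse_limit:
        return 0.0
    return half_life_min * math.log2(penalty / reuse_limit)


def max_penalty(reuse_limit=REUSE_LIMIT, half_life_min=HALF_LIFE_MIN,
                max_suppress_min=MAX_SUPPRESS_MIN):
    """Penalty ceiling, assuming decay from the ceiling down to
    reuse_limit should take exactly max_suppress_min minutes."""
    return reuse_limit * 2 ** (max_suppress_min / half_life_min)
```

For example, `simulate({0: FLAP_PENALTY, 5: FLAP_PENALTY, 10: FLAP_PENALTY}, 60)` suppresses the prefix at the third flap (minute 10) and, under these assumed values, readvertises it at minute 36. `minutes_until_reuse(3000)` gives 30.0, and `max_penalty()` gives 12000.0, which is the penalty that takes the full 60 minutes to decay back to the reuse-limit.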
In Europe, the RIPE Routing Working Group studied the impact of route flap damping in some of the networks in Europe, with contributions from some of the North American network operators as well. This resulted in a document series, ripe-178, followed by ripe-210 and ripe-229, which introduced recommendations for route flap damping. These recommendations applied different levels of damping to /22s, /23s, and /24s, and excluded the root name servers from flap damping as well. Unfortunately we found over the years that many network operators simply switched on the vendor default values without thinking. The manual said to do this, so it was done, whether it was good for the network or not. And as I said earlier, route flap damping was introduced to solve a specific problem in the mid 90s that had by and large disappeared by the early 2000s.
Researchers in the early 2000s had started to look at routing convergence, and several works, some examples of which are included in the slides here, discovered that route flap damping was actually causing more problems with the convergence of the internet routing system than it was solving.
One problem is this. When a path flaps, the BGP speakers pick the next best path, and they announce that next best path to all peers, and the flap counter is incremented, because the next best path means the AS path has changed. A change in an attribute attracts a penalty of 500. Many operators weren't aware that a change of attribute actually contributes to the flap counter. And because of that, these peers then see a change in best path, their flap counter is incremented, and after a few hops peers see multiple changes simply caused by a single flap, and the prefix a few hops away ends up being suppressed.
The second problem is that different BGP implementations have different transit times for prefixes; this is the minimum advertisement interval. Some hold onto the prefix for some time before advertising it.
Others advertise the prefix as soon as it's been received and the best path selection process has been run. And so if you have networks using different BGP implementations that peer with each other, you get this race to the finish line, causing the appearance of flapping when in fact there's been no flap at all. There's just been a brand new prefix announcement or even just a simple path change, and that causes the prefix to be suppressed as well.
So really what we have found in recent years is that misconfigured route flap damping will seriously impact access both to the local network and to the whole internet. There's more background about this in the RIPE Routing Working Group document ripe-378, and more recent work carried out by the RIPE Routing Working Group can be read about in the ripe-580 document. And the IETF has produced an RFC, RFC 7196, which gives recommendations for properly configured route flap damping. So if you feel that route flap damping is necessary for your network, I strongly encourage you to look at RFC 7196 and ripe-580 before you proceed, rather than simply turning on the vendor defaults.
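The first problem described above, a single flap being seen several hops away as a burst of best-path changes, comes down to simple arithmetic. The sketch below counts how many attribute-change updates, each worth 500 as in the talk, it takes to push a prefix over a suppress-limit. The function name, the evenly spaced updates, the 15-minute half-life, and the 6000 threshold used for comparison are assumptions for illustration only; for real values, consult RFC 7196 and ripe-580.

```python
def updates_until_suppressed(suppress_limit, gap_min=0.0, half_life_min=15.0,
                             penalty_per_update=500, max_updates=1000):
    """Count the evenly spaced attribute-change updates needed before the
    accumulated penalty rises above suppress_limit.

    gap_min is the assumed time between consecutive updates, in minutes.
    Returns None if the limit is not exceeded within max_updates updates,
    which happens when decay between updates outpaces the added penalty.
    """
    penalty = 0.0
    for count in range(1, max_updates + 1):
        # Decay over the gap since the previous update, then add the
        # penalty for this attribute change.
        penalty = penalty * 0.5 ** (gap_min / half_life_min) + penalty_per_update
        if penalty > suppress_limit:
            return count
    return None
```

With back-to-back updates and no decay in between, `updates_until_suppressed(2000)` returns 5: five best-path changes triggered by one distant flap are enough to suppress the prefix, even though it flapped only once at the source. With the illustrative higher threshold, `updates_until_suppressed(6000)` returns 13, which shows why raising the suppress-limit makes damping much less sensitive to this kind of amplification.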
© Produced by Philip Smith and the Network Startup Resource Center, through the University of Oregon.
Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)