Frame Relay EEK versus Traffic Shaping

Frame Relay traffic shaping and QoS appear to be incompatible with Frame Relay end-to-end keepalives. I need traffic shaping to give production traffic guaranteed bandwidth in the presence of other traffic, and I need end-to-end keepalives (EEK) to reliably detect loss of Frame Relay connectivity at the link level (for policy routing).

The configuration which follows works fine as long as there is not too much production traffic. There are no problems (except for web surfers) when background traffic exceeds the available bandwidth, but there are major problems when production traffic exceeds its allocated bandwidth: the PVC is forced down due to loss of end-to-end keepalives. It appears that end-to-end keepalives have lower priority than reserved-bandwidth traffic, unlike routing keepalives, which have their own high-priority queue.
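
If the keepalives are simply being starved during congestion, one possible mitigation (a sketch only; the timer and threshold values here are illustrative, not tested) is to make EEK more tolerant of a few lost exchanges before it declares the PVC down:

map-class frame-relay FrameShape256
 frame-relay end-to-end keepalive mode bidirectional
 ! send less often, and require more failures within a wider
 ! sampling window before the PVC is taken down
 frame-relay end-to-end keepalive timer send 15
 frame-relay end-to-end keepalive event-window send 5
 frame-relay end-to-end keepalive error-threshold send 4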

Side note: the background traffic is policed because, if left to contend for available bandwidth, frame relay traffic shaping kicks in before the background traffic is rate limited, boosting the queuing delays for production traffic to unacceptable levels (approximately five seconds one way). As configured, production traffic delay still takes a major, but not fatal, hit, rising from 60 ms at no load to around 150 ms. Using "priority" rather than "bandwidth" does not improve the delay hit, implying the queuing occurs after the QoS queuing has already been applied.

Another hint: with both production and background traffic sources running at overload, production traffic is not limited to its allocation, behaving more like priority queuing than bandwidth-based QoS queuing.

Serial0/0 is a full T1 at this end, a 256K fractional T1 at the destination end.

Anyone have any ideas what is going on or how to fix it? Ideal would be getting QoS to work the way it does on a leased line, but just getting frame EEK to queue at a higher priority would be acceptable.

Configuration excerpts:

Cisco Internetwork Operating System Software
IOS (tm) C1700 Software (C1700-SY7-M), Version 12.3(10b), RELEASE SOFTWARE (fc3)
cisco 1760 (MPC860P) processor (revision 0x500) with 56320K/9216K bytes of memory.
Processor board ID FOC08091V2T (923892838), with hardware revision 0000
MPC860P processor: part number 5, mask 2

!
class-map match-all APPL-priority
 match access-group name PRIORITY-TRAFFIC
!
policy-map FRAME256policy
 class APPL-priority
  bandwidth 100
 class class-default
  fair-queue
  police cir 120000 conform-action transmit exceed-action drop violate-action drop
!
interface Serial0/0
 description DLCI 100
 no ip address
 encapsulation frame-relay IETF
 logging event dlci-status-change
 serial restart-delay 0
 no fair-queue
 frame-relay traffic-shaping
!
interface Serial0/0.150 point-to-point
 description TestLink
 bandwidth 240
 ip address 206.208.93.37 255.255.255.252
 delay 1000
 frame-relay class FrameShape256
 frame-relay interface-dlci 150
!
ip access-list extended PRIORITY-TRAFFIC
 permit icmp host 192.168.100.131 any
!
map-class frame-relay FrameShape256
 frame-relay end-to-end keepalive mode bidirectional
 frame-relay cir 256000
 frame-relay mincir 256000
 frame-relay traffic-rate 240000 240000
 service-policy output FRAME256policy
!
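
For anyone reproducing this, the following show commands should confirm whether the shaping parameters and the attached service policy are actually in effect on the PVC:

show frame-relay pvc 150
show traffic-shape
show policy-map interface Serial0/0.150
show frame-relay end-to-end keepalive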

Thanks in advance...

Reply to
Vincent C Jones

If we assume that FR end-to-end keepalives fall into the default class and carry no QoS marking, then maybe your background traffic consists of a huge number of small packets, causing the keepalives to compete with the background... I assume you use a route-map with the tracking option, based on FR IP SLA monitoring, to detect link faults... If so, why do you have to rely on the FR mechanism? Is it possible for you to use ICMP echo probes or TCP-based SLA monitoring? That way you can tag those control packets to ensure the highest priority for them...
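
For illustration, a minimal probe-and-track setup along those lines might look like this (a sketch: a 12.3 mainline image uses the older "rtr" syntax rather than "ip sla monitor", the release must support object tracking and tracked static routes, and the probe target, ToS value, and route are assumptions):

! probe the assumed far end of the /30, marking the probes
! with ToS 184 (DSCP EF) so a classifier can give them priority
rtr 1
 type echo protocol ipIcmpEcho 206.208.93.38 source-ipaddr 206.208.93.37
 tos 184
 frequency 10
rtr schedule 1 life forever start-time now
!
! track probe reachability and tie a (hypothetical) static
! route to it for the policy routing decision
track 10 rtr 1 reachability
ip route 10.0.0.0 255.0.0.0 206.208.93.38 track 10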

B.R. Igor

Reply to
Igor Mamuzic

Thanks for the response. I tracked the EEK drops down to IOS getting confused by the many test changes and not executing frame relay traffic shaping despite the configuration... After a "copy run start" followed by a "reload", frame relay traffic shaping started working correctly and EEK no longer had problems, even under extreme overload, at least with a CIR/bandwidth of 256 Kbps and traffic shaped to 240 Kbps.

However, I still have a problem with queueing delays in the reserved-bandwidth traffic when the background traffic goes into overload. For example, applying 500 Kbps of background traffic brings the delays and packet losses for the background traffic up to the tens-of-seconds and 60-80% range, but it also hits the foreground (using 130 Kbps of 179 Kbps reserved) with 2-second delays and 25% packet loss, which is unacceptable. I can cure it by applying policing to the background traffic, but then the background traffic is policed even when foreground traffic is not present. Foreground delay and loss are acceptable (under 200 ms and negligible) as long as foreground traffic is around 90 Kbps or less, regardless of background traffic.
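
One way to avoid policing the background when the foreground is idle, assuming the IOS release supports "bandwidth remaining percent" (a sketch, not tested here), is to give the foreground strict priority, whose built-in policer only acts during congestion, and let class-default take whatever is left over:

policy-map FRAME256policy
 class APPL-priority
  ! strict priority; the implicit 179 kbps policer engages
  ! only when the interface is congested
  priority 179
 class class-default
  ! background gets all unused bandwidth, unpoliced
  bandwidth remaining percent 100
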
Reply to
Vincent C Jones

Would you need to implement frame-relay fragmentation on the 256K PVC?
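
If serialization delay were the issue, FRF.12 fragmentation would go in the existing map-class, sized to roughly 10 ms at the 256K port speed: 256000 b/s x 0.01 s / 8 = 320 bytes (a sketch; the fragment size is illustrative):

map-class frame-relay FrameShape256
 ! keep any single frame from occupying the 256K line
 ! for more than about 10 ms
 frame-relay fragment 320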

Reply to
Merv

I have seen this once before. I needed TAC to fix it, but they thought it quite common and suggested the fix right away when presented with whatever the anomaly was.

Warning: I am speculating here (as usual?). It also seems rather likely that you will have considered these possibilities already.

Can you reduce the queue sizes? Is RED an option (weighted or not)?
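
In MQC terms that would mean something like the following in the existing policy (a sketch; which of these combinations a given IOS release accepts varies):

policy-map FRAME256policy
 class class-default
  fair-queue
  ! (W)RED starts dropping early, before multi-second
  ! queues can build up
  random-detect
  ! or, without fair-queue, cap the FIFO depth instead:
  ! queue-limit 16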

Reading between the lines a bit, I guess that you are using synthetic test traffic. In practice (with TCP, anyway) the delays will possibly be lower, since the transmitters will back off; unless, of course, the number of 'sessions' is large.

Good luck.

Reply to
anybody43

Vincent,

Maybe you should consider implementing LLQ. It is normal for your foreground traffic to experience some slight delay and loss (OK, I admit that 25% isn't "some loss" :) it's a loss that can seriously impact net performance), since the router dequeues background traffic too, just several times less often than the foreground. If you want to ensure absolute priority, then you need to implement LLQ: in that case the router will not forward packets from the background queue until the foreground queue is empty. Of course, this may lead to background traffic starvation, so it's important to police the foreground traffic to keep it from occupying all the available bandwidth. LLQ ensures just that: when the foreground exceeds its policed limit, LLQ ensures it gets the appropriate WFQ priority. It's Cisco's recommendation for VoIP and similar delay- and loss-sensitive traffic. If LLQ isn't appropriate for your traffic (as it isn't in my case, either), then you have to get down on your knees and pray to your management for extra bandwidth, I'm afraid :)
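
In the policy from the original post, that amounts to swapping "bandwidth" for "priority" (a sketch; the 179 Kbps figure is taken from the earlier follow-up):

policy-map FRAME256policy
 class APPL-priority
  ! strict-priority (LLQ) queue with an implicit policer
  priority 179
 class class-default
  fair-queue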

B.R. Igor

Reply to
Igor Mamuzic

Maybe I'm missing something, but in my original post I mentioned that I get the same behavior using either the "priority" or the "bandwidth" keyword. I did not try fragmentation, but I do not see how it would help when the delays were tens of packets in duration and the packet loss rate for priority packets was over 20%, even though the priority allocation was not even close to being consumed.

And yes, for my testing, both foreground and background traffic were artificially generated, and well-behaved background traffic would be throttled by the dropped packets. But the requirement was for the foreground application to work well even in the presence of hostile background traffic (remember Nachi/Welchia?).

Reply to
Vincent C Jones

IOS bug CSCsa65035?

Reply to
Merv

It's almost comical that you claim to need EEKs to "ensure reliability" when they are causing you to lose traffic by falsely determining the link to be down. Another one of Cisco's great "enhancements". How the world survived for so many years without EEKs is a mystery to me. EEKs are a waste of bandwidth. They don't ensure anything. If the link is down, the FR LMI will notify the router as per the FR spec. You don't need extra crap to do it. It's just Cisco taking advantage of people who don't know how FR works.

I'm not sure what the mumbo-jumbo in your FR settings means (police cir, etc.), but surely you aren't trying to force or limit your traffic to always be below the CIR, are you? You don't need to throttle below your CIR unless you get congestion notifications from the network. CIR doesn't mean that's all you can get (it actually doesn't MEAN anything, really); it's supposed to mean that anything below that rate will not be discarded by the network unless absolutely necessary. I've had links with a 256K CIR that could send at full T1 all day long without ever losing a packet.
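
Shaping only in response to congestion notifications, as described above, would look something like this in a map-class (a sketch; the map-class name and rates are illustrative):

map-class frame-relay FrameShapeAdaptive
 ! run at port speed, throttling back toward mincir
 ! when BECNs arrive
 frame-relay cir 1536000
 frame-relay mincir 256000
 frame-relay adaptive-shaping becn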

Are you trying to tell me that Ciscos, with all of their "features", can't properly react to network congestion notifications?

I'd certainly suggest trying to get rid of your "exceed-action drop" directive, because all you're doing is discarding traffic that likely doesn't need to be dropped. It's completely ridiculous for your router to purposely drop traffic when the entire point of shaping is to avoid drops in the first place.

Dennis


Reply to
dennis

Hence my assumption that the behavior observed is due to a defect. My posting was an attempt to solicit others' opinions as to whether the defect was in my configuration or in the IOS. As I posted in a followup, a reboot of the router cured the situation, strongly indicating that it was due to an IOS defect.

Primarily by using routing protocols over the links, which detect unreported failures through network layer keepalive exchanges. Unfortunately, there are others of Cisco's great "enhancements" that require the failure of a path to be detected at the link layer, which routing protocols do not provide.

If properly implemented and configured, they should ensure that any frame relay network error which blocks a PVC but is not reported by the LMI will cause the router to declare the subinterface defined for that DLCI to be down.

Nice concept, if it were true. Have you read the FR spec? Can you refer us to where it states that the LMI must detect and report all failures which render a DLCI unusable? I believe you will find that the LMI is only REQUIRED to be locally meaningful and that any reflection of end-to-end status is OPTIONAL. Regardless, what it says in the FR spec (and technically, you would also need to specify which one) is moot: there are real-world frame relay service providers who do not always signal PVC connectivity problems in the LMI exchanges between the service provider and the service user.

Au contraire, mon ami. It is just cisco responding to the needs of practitioners whose networks must work reliably in the real world.

Yes, I am, and for good reason. At the other side of the frame relay network, the physical link servicing this PVC is a fractional T1 with a physical data rate which matches the CIR.

Frequently, but not always, true, as already explained. You need to understand the specifics of an implementation before attempting to apply blanket generalities. In this application, by the time BECNs were received, priority packets would already be queuing up at the other side of the network awaiting delivery out the fractional T1, and the delays would engender unacceptable application performance.

Yes. This was actually a common problem when frame relay first came out, twenty years ago. Prototypes would perform beautifully, and sometimes even the production rollout would be flawless, but eventually the overall traffic levels supported by the service provider would grow to reach their design levels, CIR would start to be enforced, and the applications which depended on "free bandwidth" would fall flat on their faces.

As stated above, BECN and FECN are irrelevant in this application and were not tested.

There is traffic which pays for the network and there is traffic which is along for a free ride. I am inclined to agree with your definition of the "entire point of shaping", but remember that I was doing more than shaping. I was also providing Quality of Service to ensure that the paying traffic got the service it required in order to be willing to continue paying the bills. The traffic policing applied to freeloader traffic was an attempt to get around faults uncovered in the traffic shaping which allowed the freeloader traffic to slow down the paying traffic long before the paying traffic reached the level it was paying for. I included it to reinforce the fact that the problem was in the traffic shaping, not an inability of the router to cope with gross overload in the freeloader traffic.

P.S. I normally ignore flames, but after reading the white paper "Bandwidth Management for ISPs and Universities" on your web site, I thought I would give you the benefit of the doubt. Just be careful you don't fall into the same trap you warn your customers against: "The worst case is if you've used some other bandwidth management product and you think you know everything. Unfortunately, what you know is defined by terms that likely only apply to the product you have been using." Been there, done that, been burnt... it's a philosophy that applies to far more than just bandwidth management.

Reply to
Vincent C Jones

I guess you didn't find the frame relay FAQ, once a top-rated page and now a bit out of date but entertaining nonetheless.

I wrote what was once the de facto standard frame relay implementation for Linux/BSD (in fact, our bandwidth management product was originally written for use on frame relay networks, to minimize drops), and thus I've read the specs many, many times.

I may have forgotten that the assumption that FR switches are implemented according to spec is a dubious one much of the time, but LMI reporting is SUPPOSED to reflect end-to-end status. However, I would generally equate an FR keepalive to a PPP keepalive: they're much more likely to generate a false negative than they are to save the day. In most cases, knowing the DLCI isn't working (when the switch isn't reporting it down) isn't very useful information, since you will probably learn it isn't working just as quickly from someone complaining about it. Your routing protocols (if you need to switch something on a down link) have keepalives of their own, so the extra few seconds you may save once in a blue moon don't seem worth it.

I think when you use "extras" you have to weigh the chances that they will "save the day" against the damage they may cause by failing themselves. Just because something seems like a good idea doesn't mean it is (spanning tree comes to mind). In this case, are the EEKs going to save you tremendous time and/or money more often than they're going to cause a perfectly good frame relay connection to be mistakenly declared down? Another prime example is VLAN trunking: you make an utter mess of your network in order to secure against some unlikely event, so you suffer 100% of the time in order to avoid a 0.1% chance of a problem. You make your network unmanageable by the average technician, and you limit your choice of equipment. And in the end, the VLAN trunking itself causes problems much more often than the events you are trying to avoid. It's crazy.

DB

Reply to
dennis

Also of note is that EEKs will not detect "all failures which render a DLCI unusable". If they did, there would be a better case for them.

Reply to
dennis

Actually, I was looking for a reference to an accepted standard such as one from the Frame Relay Forum or ITU-T. You might want to go back and read your FAQ yourself, as even there the only comment re: LMI meaning is the statement: "DLCIs are marked "ACTIVE" if there is a valid connection set up for that particular DLCI. If you do not get an ACTIVE response from the switch, then the frame relay network provider probably does not have the connection set up properly." Being configured correctly with successful handshaking over the local loop does not equate to end-to-end connectivity, at least not in the real world of commercial frame relay networks.

Excuse me, but I can't go to Sprint or Verizon and say "Dennis claims your LMI should be more meaningful." I need a citation to a specific page and paragraph number in a binding standard, which neither your Linux/BSD implementation nor your frame relay FAQ is. And no, I am not about to waste an hour or two going over the standards just to prove you wrong, because in the end I have to deal with the LMI as provided by my clients' suppliers.

As I have said multiple times, I would love to have a citation to an accepted standard that specifies this requirement for LMI.

Good analogy.

Hmmm. I have not had this problem with PPP. I have had problems with brain dead implementations of PPP LQM, but that is a different story.

My clients pay a lot of money to minimize the probability that their users will ever notice, let alone complain about, a down link. There is no way to achieve even two or three nines of availability if you have to wait for users to complain before taking action. It can be scary how many people have to discover the hard way that redundancy without proper attention to design is a good way to spend money with no improvement in availability.

As I have stated previously, some Cisco features are not integrated with the routing protocols and will only work if a link goes down at the link layer. Not my choice.

Agree. You are preaching to the choir on this one.

Spanning tree is OK for what it was designed to achieve. How it is used is another story. When you try to make something idiot proof, along comes a bigger idiot.

No, hence the original posting in an attempt to determine why EEKs were failing in my test configuration.

Agree that it is scary how many networks are designed by people who have no idea of how networking protocols work, and consequently no idea of their limitations.

I left my original response intact just in case you want to take the time to read it more carefully, because you appear to be reacting to key words rather than to the intent of my posting. Instead of assuming I'm a wet-behind-the-ears newbie, go back and re-read my postings assuming competence on my part.

Reply to
Vincent C Jones

Very true, nor will OSPF or EIGRP hello exchanges. But all of them will detect common failures which are not reported by LMI on major provider networks, and all should be able to detect any total loss of communications capability. Detecting excessive BER is much trickier: PPP LQM is designed to, but Cisco's implementation is not useful; IS-IS allows configuring large hello packets, but the cost of the IOS upgrades required to get support makes it difficult to justify. The money is better spent on a good network management system which tracks and reports rising BER on a link.

Reply to
Vincent C Jones
