HSRP problems

Hi,

We currently have Internet access through two ISPs, using two 2821 routers (Version 12.4(11)T) running BGP (we have our own ASN).

We are running two HSRP groups on the pair of routers.

Once in a while we lose internet access for a couple of minutes and the problem seems to be related to HSRP.

**First, MRTG shows that router 1 generates a lot of traffic (80 Mb/s) for 5 minutes.

**Router 1 traps show that HSRP is toggling between active and standby for group 2 only

2007-06-02 18:34:50 Local7.Notice x.x.56.3 261: Jun 2 2007 18:34:49.249 EDT: %HSRP-5-STATECHANGE: GigabitEthernet0/1 Grp 2 state Standby -> Active
2007-06-02 18:34:53 Local7.Notice x.x.56.3 262: Jun 2 2007 18:34:52.301 EDT: %HSRP-5-STATECHANGE: GigabitEthernet0/1 Grp 2 state Active -> Speak
2007-06-02 18:35:03 Local7.Notice x.x.56.3 263: Jun 2 2007 18:35:02.301 EDT: %HSRP-5-STATECHANGE: GigabitEthernet0/1 Grp 2 state Speak -> Standby
2007-06-02 18:35:13 Local7.Notice x.x.56.3 264: Jun 2 2007 18:35:12.301 EDT: %HSRP-5-STATECHANGE: GigabitEthernet0/1 Grp 2 state Standby -> Active
2007-06-02 18:35:22 Local7.Notice x.x.56.3 265: Jun 2 2007 18:35:21.273 EDT: %HSRP-5-STATECHANGE: GigabitEthernet0/1 Grp 2 state Active -> Speak
2007-06-02 18:35:32 Local7.Notice x.x.56.3 266: Jun 2 2007 18:35:31.278 EDT: %HSRP-5-STATECHANGE: GigabitEthernet0/1 Grp 2 state Speak -> Standby
2007-06-02 18:35:35 Local7.Notice x.x.56.3 267: Jun 2 2007 18:35:34.262 EDT: %HSRP-5-STATECHANGE: GigabitEthernet0/1 Grp 2 state Standby -> Active
2007-06-02 18:36:02 Local7.Notice x.x.56.3 268: Jun 2 2007 18:36:01.386 EDT: %HSRP-5-STATECHANGE: GigabitEthernet0/1 Grp 2 state Active -> Speak
2007-06-02 18:36:12 Local7.Notice x.x.56.3 269: Jun 2 2007 18:36:11.398 EDT: %HSRP-5-STATECHANGE: GigabitEthernet0/1 Grp 2 state Speak -> Standby
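
The flapping in these traps can be tallied with a short script. This is only a minimal sketch in Python, assuming the syslog lines keep the %HSRP-5-STATECHANGE format shown above; the function name count_transitions is made up for illustration, and the four sample lines are copied from the log.

```python
import re
from collections import Counter

# Matches the IOS %HSRP-5-STATECHANGE message format seen in the log above.
STATECHANGE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) \w+: %HSRP-5-STATECHANGE: "
    r"(?P<intf>\S+) Grp (?P<group>\d+) state (?P<old>\w+) -> (?P<new>\w+)"
)

def count_transitions(lines):
    """Count HSRP transitions, keyed by (group, old state, new state)."""
    counts = Counter()
    for line in lines:
        m = STATECHANGE.search(line)
        if m:
            counts[(int(m["group"]), m["old"], m["new"])] += 1
    return counts

# Sample lines copied from the Router 1 traps.
sample = [
    "2007-06-02 18:34:50 Local7.Notice x.x.56.3 261: Jun 2 2007 18:34:49.249 EDT: "
    "%HSRP-5-STATECHANGE: GigabitEthernet0/1 Grp 2 state Standby -> Active",
    "2007-06-02 18:34:53 Local7.Notice x.x.56.3 262: Jun 2 2007 18:34:52.301 EDT: "
    "%HSRP-5-STATECHANGE: GigabitEthernet0/1 Grp 2 state Active -> Speak",
    "2007-06-02 18:35:03 Local7.Notice x.x.56.3 263: Jun 2 2007 18:35:02.301 EDT: "
    "%HSRP-5-STATECHANGE: GigabitEthernet0/1 Grp 2 state Speak -> Standby",
    "2007-06-02 18:35:13 Local7.Notice x.x.56.3 264: Jun 2 2007 18:35:12.301 EDT: "
    "%HSRP-5-STATECHANGE: GigabitEthernet0/1 Grp 2 state Standby -> Active",
]

for (group, old, new), n in sorted(count_transitions(sample).items()):
    print(f"group {group}: {old} -> {new} x{n}")
```

Run over the full syslog file, a tally like this would show whether any other group or interface ever flaps, or only group 2 on GigabitEthernet0/1.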

***Router 1 - There are no state changes on Group 1 but 514 on Group 2

GigabitEthernet0/1 - Group 1
  State is Active
    1 state change, last state change 16w5d
  Virtual IP address is x.x.56.1
  Active virtual MAC address is 0000.0c07.ac01
    Local virtual MAC address is 0000.0c07.ac01 (v1 default)
  Hello time 3 sec, hold time 10 sec
    Next hello sent in 2.700 secs
  Preemption enabled
  Active router is local
  Standby router is x.x.56.4, priority 100 (expires in 8.476 sec)
  Priority 140 (configured 140)
    Track interface GigabitEthernet0/0 state Up decrement 50
  IP redundancy name is "hsrp-Gi0/1-1" (default)
GigabitEthernet0/1 - Group 2
  State is Standby
    514 state changes, last state change 1d21h
  Virtual IP address is x.x.56.2
  Active virtual MAC address is 0000.0c07.ac02
    Local virtual MAC address is 0000.0c07.ac02 (v1 default)
  Hello time 3 sec, hold time 10 sec
    Next hello sent in 1.460 secs
  Preemption enabled
  Active router is x.x.4, priority 140 (expires in 7.644 sec)
  Standby router is local
  Priority 100 (default 100)
  IP redundancy name is "hsrp-Gi0/1-2" (default)

***Router 2 - No state changes on either group

GigabitEthernet0/1 - Group 1
  State is Standby
    4 state changes, last state change 14w3d
  Virtual IP address is x.x.56.1
  Active virtual MAC address is 0000.0c07.ac01
    Local virtual MAC address is 0000.0c07.ac01 (v1 default)
  Hello time 3 sec, hold time 10 sec
    Next hello sent in 2.592 secs
  Preemption enabled
  Active router is x.x.56.3, priority 140 (expires in 9.800 sec)
  Standby router is local
  Priority 100 (default 100)
  IP redundancy name is "hsrp-Gi0/1-1" (default)
GigabitEthernet0/1 - Group 2
  State is Active
    2 state changes, last state change 16w5d
  Virtual IP address is x.x.56.2
  Active virtual MAC address is 0000.0c07.ac02
    Local virtual MAC address is 0000.0c07.ac02 (v1 default)
  Hello time 3 sec, hold time 10 sec
    Next hello sent in 1.764 secs
  Preemption enabled
  Active router is local
  Standby router is x.x.56.3, priority 100 (expires in 9.420 sec)
  Priority 140 (configured 140)
    Track interface Serial1/0.500 state Up decrement 50
  IP redundancy name is "hsrp-Gi0/1-2" (default)

*** Here is the Router 1 config
standby 1 ip x.x.56.1
standby 1 priority 140
standby 1 preempt
standby 1 track GigabitEthernet0/0 50 (WAN ISP1)
standby 2 ip x.x.56.2
standby 2 preempt

***And Router 2 config
standby 1 ip x.x.56.1
standby 1 preempt
standby 2 ip x.x.56.2
standby 2 priority 140
standby 2 preempt
standby 2 track Serial1/0.500 50 (WAN ISP2)

Does anyone have an idea why Router 1 HSRP is going crazy ? It seems to lose contact with Router 2 but only for group 2 , group 1 still works.

And it happens like once every two weeks. always at different times , days or weekends...

Any hints ? Thanks

mcaissie

What does router 2 show at this time? Why do you have two HSRP addresses for the same network? Shouldn't you just have one and use the track command to monitor whether the line protocol is up, and so whether that router should be the 'man' or not? Even without the track command, it should still lose its routing relationship (unless of course you are using statics), and traffic will still just hop over to the other core to route out.

Trendkill

As you can see in the "sh standby" output of router 2, there were no state changes at the time they occurred on router 1. The last state change on router 2 was 16 weeks ago. So when router 1 goes crazy, we have the same IP active on both routers.

We have two HSRP groups to manually load balance outgoing traffic between both ISPs. We have a couple of LANs on the inside, separated by PIX firewalls. Half of them have .1 as the default gateway and the others have .2. This way we make better use of the available bandwidth.

mcaissie

Why do you not use a routing protocol instead and let your routers do their job, rather than splitting this back into each network via HSRP? I'm not saying that what you are doing is 'wrong' in any way, just that it is a bit overcomplicated, considering you could easily turn up a routing protocol, dual-home your WAN routers, and let it happen automagically. Your clients would all go to .1, and while you may need to turn up an in-between 'routing network' for core to Internet, you wouldn't have to separate gateways and do all kinds of manual stuff.

To your problem, I have no idea how it could be going crazy only for router 1 when it's the same VLAN and the same IP address for both groups. If it can't talk to its peer, then it should lose the HSRP pairing for both groups, since they rely on the same trunk or whatever you have going across. Granted, I have never run two standby IPs for the same network, so I suppose there could easily be problems that I am unaware of.

I have seen 4 routers (.2, .3, .4, and .5) in the same network, where two share one HSRP address and the other two pair for the other, but I have never seen the same IP allocated to two different standby groups... I mean, you have to think of this logically. If I am .3 and my peer is .4, and our HSRP addresses are .1 and .2, how do I know which frame is for which peer group? I see source and destination, but which group is it heartbeating? Since I have never done this, it could quite possibly be completely legitimate, but I think there are more scalable solutions out there for load balancing Internet/WAN.

Trendkill

Using two HSRP groups to load balance traffic is not unusual

formatting link
Ideally, if we had routers instead of PIXes, we could configure two default routes and have the same config in all routers. But the PIX doesn't support multiple default routes, which is why we do it manually.

Incoming traffic is load balanced through BGP.

In normal mode , R1 is .1 and R2 is .2 . If R1 goes down then R2 takes both addresses and all the traffic. If R2 goes down R1 takes both addresses and all the traffic . So there is no conflict here. And it works fine when we have to do maintenance on one router.

But for some reason, every couple of days (though sometimes it can work fine for a whole month), R1 starts acting strangely, going from active to standby to active, etc.

mcaissie

Appreciate the link, very interesting. My experience is in very large enterprises, so sometimes I fail to use some of the smaller or more flexible solutions when we have the traffic requirements/budget to use full sets of hardware. Anything in your switch logs about trunks dropping or such? I mean, if you are losing layer 3, we should see if there is anything about losing layer 2 as well, although then I would think you would lose both groups and not just one. I would also think you would have an issue on your 2nd router, or at least show state changes. How about IOS bugs or the like? Same code on both? Have you done a bug search?

Trendkill

Does anyone have an idea why Router 1 HSRP is going crazy ?

As I understand it:

Router 1 is active for Group 1, standby for Group 2
Router 2 is standby for Group 1, active for Group 2

Both routers are set to pre-empt & therefore rely on the HSRP hello packets with the other router's priority for each group to make the decision as to whether they should go active or not. Could the following explain the situation you are seeing? -

The hello packets are being sent both ways but only received one way (Router 1 -> Router 2 arrives; Router 2 -> Router 1 is lost), so Router 1 thinks Router 2 is down and Group 2 transitions to active automatically. Group 1 is already active on Router 1, so it won't change state. Router 2 is still receiving hellos, so it doesn't change either. Could you be having problems on the link between these devices causing (sustained) packet loss?
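
To make the timing concrete: with the 3-second hello and 10-second hold time shown in the show standby output, the standby router would declare the active router down once no hello has arrived for 10 seconds. The following is a minimal sketch in Python under stated assumptions: the arrival times are made up, the function name holdtime_expiries is hypothetical, and the rest of the HSRP state machine (Speak, preemption, and so on) is ignored.

```python
def holdtime_expiries(hello_arrivals, hold_time=10.0):
    """Return the times at which the hold timer would expire.

    hello_arrivals: sorted arrival times (seconds) of hellos from the
    active router, as seen by the standby router.
    hold_time: HSRP hold time in seconds (10 in the posted output).
    """
    expiries = []
    for prev, nxt in zip(hello_arrivals, hello_arrivals[1:]):
        # The timer restarts on every hello; it only fires if the next
        # hello takes longer than hold_time to arrive.
        if nxt - prev > hold_time:
            expiries.append(prev + hold_time)
    return expiries

# Made-up example: hellos every 3 s, then three consecutive hellos lost.
arrivals = [0.0, 3.0, 6.0, 18.0, 21.0]
print(holdtime_expiries(arrivals))  # [16.0]
```

In this example the standby router would go active at t=16 and then see a higher-priority hello again at t=18, which resembles the short Active -> Speak -> Standby cycles in the traps. Losing three hellos in a row is enough to trigger it.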

Are you able to debug standby to try & find the cause?

Cheers,

Al

Al

Trendkill

thanks for your hints,

We are looking in all possible directions, but it's so intermittent that it's hard to troubleshoot. At least now we have logs pointing at something. Our conclusion for the day is that the HSRP configuration is OK, so we are going to investigate L1 first. I guess that hello packets must be getting dropped somewhere. So we'll first change the inside cable on R1, change its switch port, validate speed/duplex, and see over the next few days whether it occurs again.

thanks

mcaissie

Al ,

just saw your post after sending my last one.

It's exactly what we are now thinking. So, as I said, we are going to make some L1 changes. I am also trying to get a laptop to set up a permanent sniffer and capture all the HSRP communications.

thanks

mcaissie

I do not see any issue with the HSRP configuration.

Q1. any particular reason for using IOS 12.4(11)T ???

Q2. what Ethernet switch is being used to connect the two routers - make & model & software version.

Can you post the output of:
show interface
show interface accounting
show ip traffic

Merv

Merv, see below

No other reason than to have the latest version at the time of the installation. But I did see on the IOS planner that there is a software issue, so we are planning to move to 12.4.13b.

A pair of 2950 IOS (tm) C2950 Software (C2950-I6Q4L2-M), Version 12.1(22)EA4, RELEASE SOFTWARE (fc1)

planning to upgrade to 12.1(22)EA10

This is the R1 switch interface:
FastEthernet0/1 is up, line protocol is up (connected)
  Hardware is Fast Ethernet, address is 0014.6954.8f41 (bia 0014.6954.8f41)
  MTU 1500 bytes, BW 100000 Kbit, DLY 100 usec,
     reliability 255/255, txload 1/255, rxload 4/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 100Mb/s, media type is 100BaseTX
  input flow-control is unsupported output flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:44, output 00:00:00, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 1724000 bits/sec, 281 packets/sec
  5 minute output rate 783000 bits/sec, 298 packets/sec
     1506325281 packets input, 2875625539 bytes, 0 no buffer
     Received 18373744 broadcasts (0 multicast)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 17427606 multicast, 4115830 pause input
     0 input packets with dribble condition detected
     1714959084 packets output, 4235514096 bytes, 0 underruns
     0 output errors, 0 collisions, 2 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out

This is for the last 18 hours; I reset the counters yesterday. But there were some input errors and a couple of hundred CRC errors accumulated over the past months.

GigabitEthernet0/1 is up, line protocol is up
  Hardware is MV96340 Ethernet, address is 001a.6df2.9661 (bia 001a.6df2.9661)
  Internet address is x.x.56.3/25
  MTU 1500 bytes, BW 100000 Kbit, DLY 100 usec,
     reliability 255/255, txload 3/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 100Mb/s, media type is T
  output flow-control is XON, input flow-control is XON
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:00, output 00:00:00, output hang never
  Last clearing of "show interface" counters 17:41:11
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 234000 bits/sec, 153 packets/sec
  5 minute output rate 1420000 bits/sec, 229 packets/sec
     4597539 packets input, 1104191945 bytes, 0 no buffer
     Received 107766 broadcasts, 0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 0 multicast, 0 pause input
     0 input packets with dribble condition detected
     5836985 packets output, 576658579 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 pause output
     0 output buffer failures, 0 output buffers swapped out

RE3A01# sh interface account
GigabitEthernet0/0
    Protocol     Pkts In    Chars In    Pkts Out   Chars Out
       Other        2137      102576        6413      384780
          IP     5857236   707979083     4652826  1158323678
     DEC MOP           0           0         108        8316
         ARP        2306      138360           4         240
         CDP        1069      415841        1069      393392
GigabitEthernet0/1
    Protocol     Pkts In    Chars In    Pkts Out   Chars Out
       Other           0           0        6413      384780
          IP     4704965  1176902406     5973163   679047591
     DEC MOP           0           0         108        8316
         ARP       52620     3157200        3935      236100
         CDP           0           0        1069      403013

RE3A01#sh ip traffic
IP statistics:
  Rcvd:  3702050182 total, 12053658 local destination
         0 format errors, 0 checksum errors, 284229 bad hop count
         0 unknown protocol, 0 not a gateway
         0 security failures, 0 bad options, 0 with options
  Opts:  0 end, 0 nop, 0 basic security, 0 loose source route
         0 timestamp, 0 extended security, 0 record route
         0 stream ID, 0 strict source route, 0 alert, 0 cipso, 0 ump
         0 other
  Frags: 0 reassembled, 0 timeouts, 0 couldn't reassemble
         1 fragmented, 2 fragments, 0 couldn't fragment
  Bcast: 165699 received, 0 sent
  Mcast: 8733748 received, 9134333 sent
  Sent:  12599495 generated, 1739484146 forwarded
  Drop:  950969 encapsulation failed, 0 unresolved, 0 no adjacency
         747 no route, 0 unicast RPF, 0 forced drop
         0 options denied
  Drop:  0 packets with source IP address zero
  Drop:  0 packets with internal loop back IP address
         0 physical broadcast

ICMP statistics:
  Rcvd: 4 format errors, 16 checksum errors, 0 redirects, 1812 unreachable
        1350685 echo, 17 echo reply, 0 mask requests, 0 mask replies, 0 quench
        0 parameter, 0 timestamp, 0 timestamp replies, 0 info request, 0 other
        0 irdp solicitations, 0 irdp advertisements
  Sent: 0 redirects, 84939 unreachable, 20 echo, 1350685 echo reply
        0 mask requests, 0 mask replies, 0 quench, 0 timestamp, 0 timestamp replies
        0 info reply, 284229 time exceeded, 0 parameter problem
        0 irdp solicitations, 0 irdp advertisements

TCP statistics:
  Rcvd: 788413 total, 102 checksum errors, 1372 no port
  Sent: 776976 total

BGP statistics:
  Rcvd: 344283 total, 14 opens, 0 notifications, 29 updates
        344240 keepalives, 0 route-refresh, 0 unrecognized
  Sent: 344340 total, 16 opens, 14 notifications, 27 updates
        344283 keepalives, 0 route-refresh

IP-EIGRP statistics:
  Rcvd: 0 total
  Sent: 0 total

PIMv2 statistics: Sent/Received
  Total: 0/0, 0 checksum errors, 0 format errors
  Registers: 0/0 (0 non-rp, 0 non-sm-group), Register Stops: 0/0, Hellos: 0/0
  Join/Prunes: 0/0, Asserts: 0/0, grafts: 0/0
  Bootstraps: 0/0, Candidate_RP_Advertisements: 0/0
  Queue drops: 0
  State-Refresh: 0/0

IGMP statistics: Sent/Received
  Total: 0/0, Format errors: 0/0, Checksum errors: 0/0
  Host Queries: 0/0, Host Reports: 0/0, Host Leaves: 0/0
  DVMRP: 0/0, PIM: 0/0
  Queue drops: 0

UDP statistics:
  Rcvd: 9912241 total, 1 checksum errors, 215307 no port
  Sent: 10102679 total, 0 forwarded broadcasts

OSPF statistics:
  Rcvd: 0 total, 0 checksum errors
        0 hello, 0 database desc, 0 link state req
        0 link state updates, 0 link state acks

  Sent: 0 total
        0 hello, 0 database desc, 0 link state req
        0 link state updates, 0 link state acks

ARP statistics:
  Rcvd: 8185832 requests, 56838 replies, 0 reverse, 0 other
  Sent: 650822 requests, 68827 replies (65277 proxy), 0 reverse
  Drop due to input queue full: 0

mcaissie

If I were you I would downgrade to 12.3.

I would also try increasing the HSRP hold time (at least for troubleshooting) to 20 or 30 seconds.

I would set up a monitoring port on your switch to capture all of the HSRP hello messages, so that you can determine whether the problem is router 2 not sending hellos, or router 1 not receiving or not processing them. I would not change the physical topology until that was accomplished.

Merv

Connect a PC running Ethereal to the monitoring port. Capture all multicast traffic to begin with. Determine if there is any unnecessary or unwanted multicast traffic. If not, filter for HSRP traffic (UDP port 1985).
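
When going through the capture, it may help to decode the HSRP payload itself rather than only counting packets. Below is a minimal sketch in Python, assuming HSRP version 1 (the "v1 default" in the show standby output) and the 20-byte message layout from RFC 2281. The function name parse_hsrp_v1, the sample packet, and the 192.0.2.2 virtual IP are made up for illustration and are not taken from this network.

```python
import ipaddress
import struct

# HSRPv1 message layout per RFC 2281 (20 bytes, network byte order):
# version, opcode, state, hellotime, holdtime, priority, group, reserved,
# then 8 bytes of authentication data and 4 bytes of virtual IP address.
HSRP_V1 = struct.Struct("!8B8s4s")

OPCODES = {0: "Hello", 1: "Coup", 2: "Resign"}
STATES = {0: "Initial", 1: "Learn", 2: "Listen",
          4: "Speak", 8: "Standby", 16: "Active"}

def parse_hsrp_v1(payload):
    """Decode the UDP payload of an HSRPv1 packet (UDP port 1985)."""
    (version, opcode, state, hello, hold, priority, group, _reserved,
     auth, vip) = HSRP_V1.unpack(payload[:HSRP_V1.size])
    return {
        "version": version,
        "opcode": OPCODES.get(opcode, opcode),
        "state": STATES.get(state, state),
        "hellotime": hello,
        "holdtime": hold,
        "priority": priority,
        "group": group,
        "auth": auth.rstrip(b"\x00").decode("ascii", "replace"),
        "virtual_ip": str(ipaddress.IPv4Address(vip)),
    }

# Made-up hello from an active router: group 2, priority 140, 3 s / 10 s timers.
sample = HSRP_V1.pack(0, 0, 16, 3, 10, 140, 2, 0, b"cisco",
                      ipaddress.IPv4Address("192.0.2.2").packed)
print(parse_hsrp_v1(sample))
```

Decoding the group and state fields would show whether router 2 really keeps sending Active hellos for group 2 during an incident, which is the question the capture is meant to answer.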

Set up an ACL that matches HSRP traffic (ingress and egress) and permits it, along with a permit for all IP traffic, i.e. two ACL lines (ACEs). You can then see how many HSRP packets are being sent and received by using the show access-list command and looking at the match counts.

Merv

I would suggest you disable CDP on interfaces facing your ISP on R1 and R2.

If you want to see somewhat more instantaneous traffic rates, then configure "load-interval 30" on all of your router and switch ports.

Merv

Thanks Merv for your comments, I'll give it a try.

We also created 2 other bogus HSRP groups , to see if we get the same behavior with those two groups.

Regarding physical connections, we will first just swap the cables between R1 and R2. We will see whether the problem follows the cable/switch port or stays on R1.

We will have a sniffer for HSRP traffic, and will add MRTG on the WAN interfaces of the routers and other spots in our network. We are still unsure if the HSRP behavior is the cause or the consequence of the high traffic we get on the inside interface of the router.

We will see in the next days/weeks how it goes. I'll post a new thread with the results

thanks again

mcaissie
