Class B subnet 172.16.0.0/16 with about 500 hosts.
Cisco 3640 core router as default gateway (dgw) of the network, address 172.16.1.116
Cisco 1712 VPN gateway, address 172.16.1.108
Cisco PIX 506 VPN gateway, address 172.16.1.107
Other networking gear not related (I hope) to this issue
EIGRP running on all the devices except the PIX.
Recently the customer has been having problems reaching the subnets at the
other end of the tunnels terminated by the PIX. Here are some tests I've
done and some details I've collected:
If I ping the address of the inside interface of the PIX (172.16.1.107) from
the 3640 (remember, this is the default gateway of the subnet), all seems OK.
If I do a trace from the 3640 to the same address, I surprisingly see the
packets going to the 1712 (172.16.1.108) and back to the 3640 (due to the
EIGRP table), and so on until the TTL expires.
Routing tables are all OK
ARP tables are all OK
No CEF or NetFlow running
No policy routing
No proxy ARP enabled on any device
CPU usage on all the devices is normal
The 1712 is injecting routes into the internal LAN in a fairly controlled
fashion, using distribute lists.
The same devices have been running for quite a while, with no significant
changes to the images or configs lately.
Strangely enough, if I change the IP of the PIX to an address near the old
one, the problem remains; if I change the address, and only that, to one
quite far from the old one (it is now 172.16.1.5), the problem suddenly
disappears. None of these addresses belongs to any other host or device
(I tried pinging them after the change).
Has anyone experienced something similar in the past?
It sounds like the 1712 is advertising a route to 172.16.1.107 to the
3640 for some reason, and this is overriding the connected route for the
network. But this doesn't explain why the first ping works and the
traceroute fails when they're going to the same address. What does
"show ip route 172.16.1.107" say?
3640#sh ip ro 172.16.204.107
Routing entry for 172.16.0.0/16
  Known via "connected", distance 0, metric 0 (connected, via interface)
  Redistributing via eigrp 100
  Routing Descriptor Blocks:
  * directly connected, via Ethernet0/0
      Route metric is 0, traffic share count is 1
As you can see, the routing table is OK.
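For anyone following along, here is why this output matters: a router
forwards on the longest matching prefix, so if the 1712 had been advertising
a more specific route for the PIX (a /32, say), that entry would have shown
up here instead of the connected /16. Below is a minimal Python sketch of
that lookup. The /32 entry and the best_route helper are hypothetical, made
up only to illustrate the suspicion; they are not taken from the real tables.

```python
import ipaddress


def best_route(routes, destination):
    """Return the most specific (longest-prefix) route covering destination."""
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r["prefix"])]
    if not candidates:
        return None
    return max(candidates, key=lambda r: ipaddress.ip_network(r["prefix"]).prefixlen)


# What the 3640 actually reports: only the connected /16.
connected_only = [
    {"prefix": "172.16.0.0/16", "source": "connected", "next_hop": "Ethernet0/0"},
]

# HYPOTHETICAL table: the same, plus a host route learned from the 1712.
with_host_route = connected_only + [
    {"prefix": "172.16.1.107/32", "source": "eigrp", "next_hop": "172.16.1.108"},
]

for name, table in (("connected only", connected_only),
                    ("with host route", with_host_route)):
    route = best_route(table, "172.16.1.107")
    print(f"{name}: {route['prefix']} via {route['next_hop']} ({route['source']})")
```

With only the connected route the lookup stays on Ethernet0/0. With the
made-up /32 added, it would go to 172.16.1.108, which is the behaviour the
trace suggested but the real table does not show.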
It looks like a strange packet duplication by the 3640: for some mysterious
reason the 3640 sent packets destined for the PIX to the 1712 as well, which
may explain why ping worked and trace behaved like that.
In fact, if I pinged or traced a host behind the PIX (not only from the 3640
but from any local or remote host in the network), I saw the packet counters
of the associated crypto map incrementing by hundreds, not by 5 as expected.
This suggests to me that packets looped between the 3640 and the 1712 until
TTL expiration and were also sent to the PIX each time.
Maybe some sniffing would have revealed something more, but now the customer
is happy after the IP change and I'm not foolish enough to switch back just
for the sake of forensics... but curiosity is very strong :-)
It was the first thing I looked for, but unfortunately the ARP tables of all
the devices involved in the story, especially the 3640, reference the
correct MAC addresses, and proxy ARP is disabled on all devices.
Moreover, if it were a proxy ARP issue I should see only one hop when I
trace to the PIX (albeit from the wrong address); that explains neither the
loop nor the apparent duplication of the packets destined for the PIX.
I really appreciate your help, but I feel this story will remain a mystery
for a long time, unless I decide to go to the customer site one Sunday
morning, switch the PIX back to its original address, and do some more
testing and sniffing.