Hi all. I have what seems to me to be a very strange problem, and any thoughts on fixing it would be appreciated.
Briefly, we have a data acquisition system consisting of a number of DSPs with 100 Mbit Ethernet that send out data to Linux PCs. Each DSP sends a 1 KB UDP packet about every 1.4 ms. The DSPs and the computers are on a private internal network (10.1.2.x) with an unmanaged 10/100/1000 switch (the Linux boxes have gigabit Ethernet).
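For a sense of scale, here is a rough back-of-the-envelope calculation of the traffic each DSP generates. It assumes that "1 KB" means 1024 bytes of payload and ignores UDP/IP/Ethernet header overhead; the function name dsp_rate is purely illustrative.

```python
# Rough per-DSP traffic estimate from the figures quoted above.
# Assumptions: "1 KB" = 1024 payload bytes, one packet every 1.4 ms,
# header overhead ignored.
PAYLOAD_BYTES = 1024
PERIOD_S = 1.4e-3


def dsp_rate(payload_bytes=PAYLOAD_BYTES, period_s=PERIOD_S):
    """Return (packets per second, payload bits per second) for one DSP."""
    packets_per_s = 1.0 / period_s
    bits_per_s = payload_bytes * 8 * packets_per_s
    return packets_per_s, bits_per_s


pps, bps = dsp_rate()
print(f"{pps:.0f} packets/s, {bps / 1e6:.2f} Mbit/s per DSP")
# prints: 714 packets/s, 5.85 Mbit/s per DSP
```

Under these assumptions, two DSPs together produce under 12 Mbit/s, well below the nominal capacity of a 100 Mbit link.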
When we begin data acquisition with, for example, two DSPs communicating with one computer, everything runs perfectly for 10-15 minutes, and then, for no apparent reason, the switch seems to become a hub. Initially, other computers on the same network do not see any of this UDP traffic, but after the change they start seeing all of the packets, even though those packets are not addressed to them.
The switch shows this change: where initially only the link lights for the DSPs and the single target computer flash, once this change occurs, all of the previously lit (but not active) link LEDs start to flash. The result is that the DSPs become overloaded, and we start losing data. The same thing happens with only one DSP sending data, but in all cases most of the data is still received by the target system.
A quick stop and then restart of data acquisition appears to reset the system, such that it takes another 10 - 15 minutes for the same problem to reoccur.
I've tried three different unmanaged switches, all of which show exactly the same behavior, and this does not occur when I send packets between computers, so it must be something in the DSP Ethernet. The only thing we can think of is that there is some error that appears in the packet header, but how that can cause the broadcast of all packets while still leaving the data in each packet intact is beyond us. Does anyone have any ideas?
As a demonstration of the effect I've appended the output of ifconfig and a tcpdump from one of the computers that was on the network but not involved in data acquisition. Here the computer with IP address 10.1.2.2 was receiving data from dsp1 (10.1.2.11) and dsp2 (10.1.2.12).

Output of ifconfig eth2:

eth2      Link encap:Ethernet  HWaddr 00:02:A5:E7:22:EF
          inet addr:10.1.2.1  Bcast:10.1.2.255  Mask:255.255.255.0
          inet6 addr: fe80::202:a5ff:fee7:22ef/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:17889469 errors:0 dropped:0 overruns:0 frame:0
          TX packets:12653 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:809749514 (772.2 MiB)  TX bytes:1074460 (1.0 MiB)
          Interrupt:5

Output of tcpdump -vvv -i eth2:

tcpdump: listening on eth2, link-type EN10MB (Ethernet), capture size 96 bytes
12:31:19.342875 IP (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto: UDP (17), length: 962) dsp2.4003 > 10.1.2.2.4003: UDP, length 934
12:31:19.343360 IP (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto: UDP (17), length: 962) dsp1.4003 > 10.1.2.2.4003: UDP, length 934
12:31:19.343631 IP (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto: UDP (17), length: 962) dsp2.4003 > 10.1.2.2.4003: UDP, length 934
12:31:19.344197 IP (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto: UDP (17), length: 962) ds

and so on. These packets appear at exactly the same time that the switch starts broadcasting everything.
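In case it helps anyone reproduce the check, here is a minimal Python sketch of how the stray packets could be counted from saved tcpdump text output. It assumes one packet per line in the format seen in this capture, and that the observing machine's own addresses are 10.1.2.1 and the broadcast address 10.1.2.255, as in the ifconfig output. The function name count_stray_packets and the regular expression are illustrative, not part of any existing tool.

```python
import re
from collections import Counter

# Matches the "src.port > a.b.c.d.port:" part of a tcpdump text line.
# Assumption: IPv4 lines shaped like the capture in this post.
FLOW_RE = re.compile(r"(\S+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):")


def count_stray_packets(lines, local_ips):
    """Count packets per (source, destination) not addressed to local_ips."""
    stray = Counter()
    for line in lines:
        match = FLOW_RE.search(line)
        if match is None:
            continue  # not a packet line we recognise (e.g. the banner)
        src, dst = match.group(1), match.group(3)
        if dst not in local_ips:
            stray[(src, dst)] += 1
    return stray


if __name__ == "__main__":
    sample = [
        "12:31:19.342875 IP (tos 0x0, ttl 64, id 0, offset 0, flags [none], "
        "proto: UDP (17), length: 962) dsp2.4003 > 10.1.2.2.4003: UDP, length 934",
        "12:31:19.343360 IP (tos 0x0, ttl 64, id 0, offset 0, flags [none], "
        "proto: UDP (17), length: 962) dsp1.4003 > 10.1.2.2.4003: UDP, length 934",
    ]
    for flow, n in count_stray_packets(sample, {"10.1.2.1", "10.1.2.255"}).items():
        print(flow, n)
```

On a correctly switched network this count should stay at zero for unicast traffic between other hosts, so a non-zero count marks the moment the switch starts forwarding everything to every port.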
Thanks much,
Loren