Anyone used Iperf or Netperf w/GigE?

In article , ohaya wrote:
:I may have been unclear by what I meant by a "manual copy" test. What
:they are suggesting that I do is create a 36GB file on one server, then:

:- manually time a file copy from that server to the other server, and
:- manually time a file copy from that server to itself, and
:- subtract the times and divide the result by 36GB.

That test is dubious.

- The time to copy a file is dependent on the OS and drive maximum write rate, and the write rates are not necessarily going to be the same between the two servers [unless they are the same hardware through and through.]

- A copy of a file from a server to itself can potentially be substantially decreased by DMA. Depends how smart the copy program is. There is the advantage of knowing that one is going to be starting the read and write on nice boundaries, so one could potentially have the copy program keep the data in system space or maybe even in hardware space.

- When the file is being copied locally, if it is being copied to the same drive, then the reads and writes are going to be in contention, whereas when copying a file to a remote server, the reads and writes happen in parallel. The larger the memory buffer that the system can [with hardware cooperation] allocate to a single disk I/O, the fewer times the drive has to move its head... if, that is, the file is allocated into contiguous blocks and is being written into contiguous blocks, though this need would be mitigated if the drive controller supports scatter-gather or CTQ.

- When the file is being copied locally, if it is being copied to the same controller, then there can be bus contention that would prevent the reads from operating in parallel with the writes. But again system buffering and drive controller cache and CTQ can mitigate this: some SCSI drives do permit incoming writes to be buffered while they are seeking and reading for a previous read request.

- The first copy is going to require that the OS find the directory entry and locate the file on disk and start reading. But at the time of the second copy, the directory and block information might be cached by the OS, reducing the copy time. Also, if the file fits entirely within available memory, then the OS may still have the file in its I/O buffers and might skip the read. (Okay, that last is unlikely to happen with a 30 Gb file on the average system, but it is not out of the question for High Performance Computing systems.)

- In either copy scenario, one has to know what it means for the last write() to have returned: does it mean that the data is flushed to disk, or does it mean that the last buffer of data has been sent to the filesystem cache for later dispatch when convenient? Especially when you are doing the copy to the remote system, are you measuring the time until the last TCP packet hits the remote NIC and the ACK for it gets back, or are you measuring the time until the OS gets around to scheduling a flush? The difference could be substantial if you have large I/O buffers on the receiving side! Is the copy daemon using synchronous I/O or asynchronous I/O? (See the rough timing sketch below.)

- A test that would more closely simulate the source server's copy out to the network would be to time a copy to the null device instead of to a file on the server. But to measure the network timing you still need to know how the destination server handles flushing the last buffer when a close() is issued. Ah, but you also have to know how the TCP stack and copy daemons work together.

When the copy-out daemon detects the end of the source file, it will close the connection and the I/O library will translate that into needing to send a FIN. But will the FIN flag ride on the last data segment, or will it go out as a separate packet? And when the remote system receives the FIN, does the TCP layer ACK it immediately, or does it wait until the copy-in daemon closes the input connection? If it waits, then does the copy-in daemon close the input connection as soon as it detects EOF, or does it wait until the write() on the final buffer returns?

When the copy-out daemon close()'s the connection, does the OS note that and return immediately, possibly dealing with the TCP details on a different CPU or in hardware, or does it wait until the TCP ACK is received before returning to the program? Are POSIX.1 calls being used by the copy daemons, and if so, what does POSIX.1 say is the proper behaviour, considering that until the ACK of the last output packet arrives, the write associated with the implicit flush() might fail? If the last packet gets dropped [and all TCP retries are exhausted], then the return from close() is perhaps different than if the last packet makes it. Or maybe not, and one has to explicitly flush() if one wants to distinguish the cases. Unfortunately I don't have my copy of POSIX.1 with me to check.
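For what it's worth, here is an untested sketch of how one might time the sending side so that the clock stops only after the data has actually been pushed out: SO_LINGER is set so that close() blocks until the queued data has been sent (behaviour varies somewhat between stacks), and the host address, port, and file name are just placeholders. For the purely local copy, the analogous trick would be to fsync() the destination file before stopping the timer.

    /* time_send.c -- rough timing sketch, not a benchmark tool.
     * The destination address, port, and source file are placeholders.
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <time.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int fd = open("/tmp/testfile", O_RDONLY);   /* placeholder path */
        int s  = socket(AF_INET, SOCK_STREAM, 0);
        struct linger lg = { 1, 300 };              /* l_onoff=1, l_linger=300 s */
        struct sockaddr_in sin;
        char buf[24 * 1024];                        /* arbitrary 24K read/write buffer */
        ssize_t n;
        time_t t0, t1;

        if (fd < 0 || s < 0) { perror("open/socket"); return 1; }

        /* make close() wait (up to 300 s) for queued data to go out */
        setsockopt(s, SOL_SOCKET, SO_LINGER, &lg, sizeof lg);

        memset(&sin, 0, sizeof sin);
        sin.sin_family = AF_INET;
        sin.sin_port   = htons(5001);               /* placeholder port */
        sin.sin_addr.s_addr = inet_addr("10.1.1.24");
        if (connect(s, (struct sockaddr *)&sin, sizeof sin) < 0) {
            perror("connect"); return 1;
        }

        t0 = time(NULL);
        while ((n = read(fd, buf, sizeof buf)) > 0)
            if (write(s, buf, (size_t)n) != n) { perror("write"); return 1; }
        close(s);   /* with SO_LINGER set, this should not return until the data is out */
        t1 = time(NULL);

        printf("elapsed: %ld seconds\n", (long)(t1 - t0));
        return 0;
    }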

I bet the company didn't think of these problems when they asked you to do the test. Or if they did, then they are probably assuming that the boundary conditions will not make a significant contribution to the final bandwidth calculation when they are amortized over 30 Gb. But there are just too many possibilities that could throw the calculation off significantly, especially the drive head contention and the accounting of the time to flush the final write buffer when one has large I/O buffers.

Reply to
Walter Roberson

A good point. Worth trying a set of attenuators made just for this purpose.

-- Robert

Reply to
Robert Redelmeier

Rick,

I spent a few more hours testing this weekend, including various different sizes for the "RWIN". Increasing it up to 64KB or so made no noticeable difference.

I also tried enabling "TCP Scaling" (Tcp1323Opts), which should allow the RWIN to be set to greater than 64KB, and then tried various sizes for RWIN. Again, no difference.
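For what it's worth, a back-of-the-envelope check (my own arithmetic, assuming a generously high LAN round-trip time of about half a millisecond) suggests why a larger RWIN would not be expected to help here: the bandwidth-delay product is roughly

    1 Gbit/s x 0.0005 s = 500,000 bits, i.e. about 62 KB

so a 64KB window can already keep the pipe full on a short GigE path, and window scaling only starts to matter once the RTT (or a WAN path) pushes that product past 64KB.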

I began running Windows Performance Monitor, monitoring "Total Bytes/sec" on the sending machine, and what I was seeing was that:

- There was very low CPU utilization throughout the test (

Reply to
ohaya

Hi,

After much testing, I was finally able to get some reasonable results from all three of the network test tools that I had been working with: Iperf, Netperf, and PCATTCP.

What I had to do was to include command line parameters for the following:

MSS: 100000
TcpWindowSize: 64K
Buffer Size: 24K

For example, for Iperf sending end, I used:

iperf -c 10.1.1.24 -M 100000 -w 64K -l 24K -t 30

and for Netperf, I used:

netperf-2.1pl1 -H 10.1.1.24 -l 30 -- -s 24000,24000 -m 100000 -M 100000

With these command line parameters, I am now getting results in the 900+ Mbits/sec range, both via the GigE switch and via a cross-over cable.
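As a sanity check (back-of-the-envelope arithmetic, not a measurement): with a standard 1500-byte MTU, each 1538 bytes on the wire (preamble + Ethernet header + 1500-byte payload + FCS + inter-frame gap) carries at most 1460 bytes of TCP payload, so the ceiling is roughly

    1000 Mbits/sec x 1460/1538 = ~949 Mbits/sec

which means 900+ Mbits/sec is about as good as standard-frame GigE gets.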

I'm posting this in case anyone needs this info, and to close off this thread. I'll be posting another msg to start a thread re. "What now?", i.e., what are the implications of these test results.

Thanks for all those who replied to this thread!!

Yours, Jim Lum

Reply to
ohaya

Those are truly odd. FWIW, and I suspect the same is true for iperf, what you are calling the MSS is really the size of the buffer being presented to the transport at one time. TCP then breaks that up into MSS-sized segments.

Windows TCP has a bit of a disconnect between SO_SNDBUF/SO_RCVBUF (what netperf sets with -s and -S on either side) and the TCP window doesn't it.

rick jones

Reply to
Rick Jones

Rick,

What did you mean in your last sentence when you said "and the TCP window doesn't it"?

Re. your comments, I'm a bit confused (not by your comments, but just in general).

I was able to get the higher speed results with Iperf first. I found these parameters at:

formatting link

Then, I proceeded to try to duplicate these results with netperf and PCATTCP, i.e., I did the best that I could to try to use the equivalent parameters to the ones that I used with Iperf. Granted, now that I go back and review the parameters, some of the "translations" were somewhat unclear.

According to the Iperf docs, the "-M" parameter is:

"Attempt to set the TCP maximum segment size (MSS) via the TCP_MAXSEG option. The MSS is usually the MTU - 40 bytes for the TCP/IP header. For ethernet, the MSS is 1460 bytes (1500 byte MTU). This option is not implemented on many OSes."

The "-l" parameter is:

"The length of buffers to read or write. Iperf works by writing an array of len bytes a number of times. Default is 8 KB for TCP, 1470 bytes for UDP. Note for UDP, this is the datagram size and needs to be lowered when using IPv6 addressing to 1450 or less to avoid fragmentation. See also the -n and -t options."

The "-w" parameter is: "Sets the socket buffer sizes to the specified value. For TCP, this sets the TCP window size. For UDP it is just the buffer which datagrams are received in, and so limits the largest receivable datagram size."

It sounds like what you describe in your post as "really the size of the buffer presented to the transport at one time" corresponds to the Iperf "-l" parameter, rather than the "-M", which the Iperf docs say is for "attempting" to set the MSS using TCP_MAXSEG?
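If it helps to see the three knobs side by side, here is a rough C-sockets sketch (my own illustration, not iperf's actual code; the address and port are placeholders and the values are just the ones from the command lines above):

    /* Illustrative only: the three different "sizes" being discussed. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        int win = 64 * 1024;      /* -w: socket buffer / offered TCP window   */
        int mss = 100000;         /* -M: requested MSS via TCP_MAXSEG         */
        char buf[24 * 1024];      /* -l: how much each write() hands to TCP   */
        struct sockaddr_in sin;

        /* -w: set before connect() so the window is offered at this size */
        setsockopt(s, SOL_SOCKET, SO_RCVBUF, &win, sizeof win);
        setsockopt(s, SOL_SOCKET, SO_SNDBUF, &win, sizeof win);

        /* -M: on Ethernet the MSS cannot exceed about 1460, so asking for
         * 100000 is clamped or rejected on most stacks */
        setsockopt(s, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof mss);

        memset(&sin, 0, sizeof sin);
        sin.sin_family = AF_INET;
        sin.sin_port = htons(5001);                   /* placeholder port */
        sin.sin_addr.s_addr = inet_addr("10.1.1.24");
        if (connect(s, (struct sockaddr *)&sin, sizeof sin) < 0) {
            perror("connect");
            return 1;
        }

        /* -l: one 24K buffer per write(); TCP then chops it into
         * MSS-sized segments on the wire */
        memset(buf, 0, sizeof buf);
        write(s, buf, sizeof buf);

        close(s);
        return 0;
    }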

Jim

Reply to
ohaya

:> Windows TCP has a bit of a disconnect between SO_SNDBUF/SO_RCVBUF
:> (what netperf sets with -s and -S on either side) and the TCP window
:> doesn't it.

:What did you mean in your last sentence when you said "and the TCP
:window doesn't it"?

That confused me a moment too, but I then re-parsed it and understood.

"A has a bit of a disconnect between B and C, does it not?"

In other words,

"I think you will agree that in system A, element B and element C are not related as strongly as you would normally think they would be."

Reply to
Walter Roberson

Rick,

I haven't had a chance to try adjusting the "DefaultReceiveWindow" AFD parameter yet. Its description reads: "The number of receive bytes that AFD buffers on a connection before imposing flow control. For some applications, a larger value here gives slightly better performance at the expense of increased resource utilization. Applications can modify this value on a per-socket basis with the SO_RCVBUF socket option."

There's also a "DefaultSendWindow" just below that.

It looks like, from this description, that the SO_RCVBUF is equivalent to the DefaultReceiveWindow and the SO_SNDBUF is equivalent to the DefaultSendWindow?

Jim

Reply to
ohaya

On a different note, but related to Iperf....

I found a Linux distribution called Knoppix-STD (security tool distribution). The CD is bootable and it contains the Linux version of Iperf. The host does not even need a hard drive. The CD can breathe life into old workstations and servers by transforming them into traffic generators.

formatting link

-mike


Reply to
Michael Roberts

Yeah, what he said :) Basically, I am accustomed to having setsockopt() calls for SO_RCVBUF, when made before the call to connect() or listen(), controlling the size of the offered TCP window. My understanding is that is not _particularly_ the case under Windows.
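In rough sketch form (my illustration, with a made-up port), that ordering looks like this on the listen side, where connections accepted from the listening socket pick up whatever buffer was set before listen() was called (again, on stacks where the socket buffer and the offered window are tied together):

    /* Set SO_RCVBUF on the listening socket *before* listen(), so the
     * window can be offered at that size on each accepted connection.
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        int ls  = socket(AF_INET, SOCK_STREAM, 0);
        int win = 256 * 1024;                    /* arbitrary large buffer */
        struct sockaddr_in sin;
        int s;

        setsockopt(ls, SOL_SOCKET, SO_RCVBUF, &win, sizeof win);

        memset(&sin, 0, sizeof sin);
        sin.sin_family = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        sin.sin_port = htons(5001);              /* placeholder port */

        if (bind(ls, (struct sockaddr *)&sin, sizeof sin) < 0) {
            perror("bind");
            return 1;
        }
        listen(ls, 5);

        s = accept(ls, NULL, NULL);              /* inherits the 256K buffer */
        /* ... read the test data here ... */
        close(s);
        close(ls);
        return 0;
    }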

rick jones

Reply to
Rick Jones

formatting link
"Description: The number of receive bytes that AFD buffers on a
connection before imposing flow control. [...]"

It might - _if_ the "flow control" being mentioned is between TCP endpoints (at least in the receive case). However, if it is an intra-stack flow control, it would be different. One possible experiment is to take the code, hack it a bit to open a socket, connect it to chargen on some other system, and see in tcpdump just how many bytes flow before the zero-window advertisements start.
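Something like this untested sketch (the address is a placeholder) would do for the client end of that experiment; the interesting part is what tcpdump shows, not what the program prints:

    /* Connect to chargen, never read, and watch with tcpdump to see how
     * many bytes arrive before the zero-window advertisements start.
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sin;

        memset(&sin, 0, sizeof sin);
        sin.sin_family = AF_INET;
        sin.sin_port = htons(19);                       /* chargen */
        sin.sin_addr.s_addr = inet_addr("10.1.1.24");   /* placeholder host */

        if (s < 0 || connect(s, (struct sockaddr *)&sin, sizeof sin) < 0) {
            perror("socket/connect");
            return 1;
        }

        /* Never call recv(): the receive buffer fills, the advertised
         * window drops to zero, and the trace shows how many bytes the
         * chargen server was allowed to push first. */
        sleep(60);

        close(s);
        return 0;
    }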

rick jones

Reply to
Rick Jones
