In article , ohaya wrote:
:I may have been unclear by what I meant by a "manual copy" test. What
:they are suggesting that I do is create a 36GB file on one server, then:
:
:- manually time a file copy from that server to the other server, and
:- manually time a file copy from that server to itself, and
:- subtract the times and divide the result by 36GB.
That test is dubious.
- The time to copy a file is dependent on the OS and the drive's maximum write rate, and the write rates are not necessarily going to be the same between the two servers [unless they are the same hardware through and through.]
- A copy of a file from a server to itself can potentially be sped up substantially by DMA. It depends on how smart the copy program is. There is the advantage of knowing that the reads and writes will start on nice boundaries, so one could potentially have the copy program keep the data in system space or maybe even in hardware space (see the first sketch after this list).
- When the file is being copied locally, if it is being copied to the same drive, then the reads and writes are going to be in contention, whereas when copying a file to a remote server, the reads and writes happen in parallel. The larger the memory buffer that the system can [with hardware cooperation] allocate to a single disk I/O, the fewer times the drive has to move its head... if, that is, the file is allocated in contiguous blocks and is being written into contiguous blocks, though this need would be mitigated if the drive controller supports scatter-gather or CTQ (command tag queueing).
- When the file is being copied locally, if it is being copied to the same controller, then there can be bus contention that would prevent the reads from operating in parallel with the writes. But again system buffering and drive controller cache and CTQ can mitigate this: some SCSI drives do permit incoming writes to be buffered while they are seeking and reading for a previous read request.
- The first copy is going to require that the OS find the directory entry, locate the file on disk, and start reading. But at the time of the second copy, the directory and block information might be cached by the OS, reducing the copy time. Also, if the file fits entirely within available memory, then the OS may still have the file in its I/O buffers and might skip the read. (Okay, that last is unlikely to happen with a 36 GB file on the average system, but it is not out of the question for High Performance Computing systems.) The second sketch after this list shows one way to nudge the cache back toward a cold state between runs.
- In either copy scenario, one has to know what it means for the last write() to have returned: does it mean that the data is flushed to disk, or does it mean that the last buffer of data has been handed to the filesystem cache for later dispatch when convenient? Especially when you are doing the copy to the remote system: are you measuring the time until the last TCP packet hits the remote NIC and the ACK for it gets back, or are you measuring the time until the OS gets around to scheduling a flush? The difference could be substantial if you have large I/O buffers on the receiving side! Is the copy daemon using synchronous I/O or async I/O? (The third sketch after this list shows how to time against a forced flush.)
- A test that would more closely simulate the source server's copy out to the network would be to time a copy to the null device instead of to a file on the server (the fourth sketch after this list times such a read). But to measure the network timing you still need to know how the destination server handles flushing the last buffer when a close() is issued. Ah, but you also have to know how the TCP stack and the copy daemons work together.
When the copy-out daemon detects the end of the source file, it will close the connection and the I/O library will translate that into needing to send a FIN packet. But will that FIN get sent in the header of the last buffer, or will it be a separate packet? And when the remote system receives the FIN, does the TCP layer ACK the FIN immediately, or does it wait until the copy-in daemon closes the input connection? If it waits, then does the copy-in daemon close the input connection as soon as it detects EOF, or does it wait until the write() on the final buffer returns? When the copy-out daemon close()'s the connection, does the OS note that and return immediately, possibly dealing with the TCP details on a different CPU or in hardware, or does it wait until the TCP ACK is received before returning to the program? Are POSIX.1 calls being used by the copy daemons, and if so, what does POSIX.1 say is the proper behaviour, considering that until the ACK of the last output packet arrives, the write associated with the implicit flush might fail: if the last packet gets dropped [and all TCP retries are exhausted], then the return from close() is perhaps different than if the last packet makes it. Or maybe not, and one has to flush explicitly if one wants to distinguish the cases. Unfortunately I don't have my copy of POSIX.1 with me to check. (The last sketch after this list shows one way a sender can sidestep some of this guesswork.)
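A few illustrative sketches, in rough order of the points above. First, the "keep the data in system space" idea: a smart copy program can ask the kernel to move the bytes between descriptors without staging them in a user buffer. A minimal sketch using Linux's sendfile(2); the file names are hypothetical, error handling is trimmed, and note that before Linux 2.6.33 the destination had to be a socket rather than a regular file:

    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        int in  = open("/data/big.src", O_RDONLY);    /* hypothetical paths */
        int out = open("/data/big.dst", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        struct stat st;

        if (in < 0 || out < 0 || fstat(in, &st) < 0) {
            perror("setup");
            return 1;
        }
        off_t off = 0;
        while (off < st.st_size)                      /* in-kernel copy loop */
            if (sendfile(out, in, &off, st.st_size - off) <= 0) {
                perror("sendfile");
                return 1;
            }
        return 0;
    }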
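Second, the cache-warmth problem: between timing runs one can at least ask the kernel to forget its cached copy of the file. posix_fadvise() is advisory only (the kernel is free to ignore it), so treat this as best effort, not a guarantee:

    #include <fcntl.h>
    #include <unistd.h>

    /* Hint to the kernel that the file's cached pages can be dropped,
     * so a second timed read starts roughly as cold as the first. */
    int drop_cached_pages(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        /* offset 0, length 0 means "the whole file" */
        int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        close(fd);
        return rc;
    }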
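Third, the question of what the last write() means: if you bracket the timer around the writes *and* an fsync(), then "done" at least means "out of the filesystem cache" rather than "parked in it". A sketch with a hypothetical path and made-up sizes:

    #include <sys/time.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[1 << 20];                    /* 1 MB of dummy data */
        struct timeval t0, t1;
        int fd = open("/data/testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) { perror("open"); return 1; }
        memset(buf, 0xAA, sizeof buf);

        gettimeofday(&t0, NULL);
        for (int i = 0; i < 1024; i++)               /* 1 GB total */
            if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) {
                perror("write"); return 1;
            }
        if (fsync(fd) < 0)                           /* force it out of the cache */
            perror("fsync");
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.1f MB/s including the flush\n", 1024.0 / secs);
        return 0;
    }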
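Fourth, the null-device test: reading the file flat out and discarding the bytes isolates the source server's read path from every write-side effect discussed above (again, the path is hypothetical):

    #include <sys/time.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[1 << 20];
        struct timeval t0, t1;
        long long total = 0;
        ssize_t n;
        int fd = open("/data/big.src", O_RDONLY);

        if (fd < 0) { perror("open"); return 1; }

        gettimeofday(&t0, NULL);
        while ((n = read(fd, buf, sizeof buf)) > 0)
            total += n;                   /* discarded: the null device in spirit */
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("read %lld bytes in %.2f s\n", total, secs);
        return 0;
    }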
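Finally, one way a sending program can sidestep some of the close()/FIN guesswork: half-close the socket with shutdown() and then block in read() until the peer closes its side. This only marks the end of the transfer if the receiving daemon closes its socket after its final write() completes; that is an assumption about the peer's behaviour, not something TCP guarantees:

    #include <sys/socket.h>
    #include <unistd.h>

    /* 'sock' is an already-connected TCP socket whose payload has been
     * written.  Returns 0 once the peer has closed its side; stop the
     * timing clock when this returns. */
    int wait_for_peer_close(int sock)
    {
        char junk[256];
        ssize_t n;

        if (shutdown(sock, SHUT_WR) < 0)      /* our FIN: no more data from us */
            return -1;
        while ((n = read(sock, junk, sizeof junk)) > 0)
            ;                                 /* drain anything still in flight */
        return (n == 0) ? 0 : -1;             /* 0 == EOF: peer's FIN arrived */
    }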
I bet the company didn't think of these problems when they asked you to do the test. Or if they did, then they are probably assuming that the boundary conditions will not make a significant contribution to the final bandwidth calculation when they are amortized over 36 GB. But there are just too many possibilities that could throw the calculation off significantly, especially the drive head contention and the accounting of the time to flush the final write buffer when one has large I/O buffers.