On Thu, Jan 22, 2009 at 1:43 PM, Mike Miller <mbmiller at taxa.epi.umn.edu> wrote:

> If every file has to be moved, the comparing would be wasted time, but if
> files are large and most do not have to be moved, the comparison may
> massively save time, especially if the network is slow.
>
> It happens that I started to write the info below a couple of months ago
> to share with this list and did not finish it, but I'm finishing it now.
> My problem was to copy many files from one machine to another, but none
> of the files existed on the target machine. I really just wanted to make
> a gzipped tar file (.tgz) and send it to another machine. I didn't have
> much free disk space on the source machine, so I had to do a little work
> to figure out the tricks. Read on:
>
> I want to move files from one GNU/Linux box to another. The disks are
> nearly full on the box with the files currently on it, so I can't write
> the .tgz file on the source machine and then send it. The data are about
> 13GB uncompressed and about 3.7GB in .tgz format. This is how I get the
> latter number:
>
> tar zpcf - directory | wc -c
>
> That sends the zipped tar to stdout, where the bytes are counted by wc.
> I have about 210,000 files and directories.
>
> There are some good suggestions here on how to proceed:
>
> http://happygiraffe.net/copy-net
>
> I wanted to have the .tgz file on the other side instead of having tar
> unpack it automatically, so I found out I could do this on the old
> machine to send files to the new machine...
>
> tar zpcf - directory | ssh user@target.machine "cat > backup.tgz"
>
> ...and it packs "directory" from the old machine into the backup.tgz
> file on the new machine. Nice.
>
> One small problem: I didn't have a way to be sure that there were no
> errors in file transmission. First, some things that did not work:
>
> tar zpcf - directory | md5sum
>
> Testing that on a small directory gave me, to my surprise, different
> results every time. What was changing? I didn't get it. I could tell
> that it was probably caused by gzip because...
>
> $ echo "x" | gzip - > test1.gz
>
> $ echo "x" | gzip - > test2.gz
>
> $ md5sum test?.gz
> 358cc3d6fe5d929cacd00ae4c2912bf2  test1.gz
> 601a8e99e56741d5d8bf42250efa7d26  test2.gz
>
> So gzip must have a random seed in it, or it is incorporating the
> timestamp into the file somehow -- something is changing. Then I
> realized that I just had to use this method of checking md5sums...
>
> On the source machine:
> tar pcf - directory | md5sum
>
> Then do this to transfer the data:
> tar zpcf - directory | ssh user@target.machine "cat > backup.tgz"
>
> After transferring, do this on the target machine:
> gunzip -c backup.tgz | md5sum
>
> The two md5sums are created without making new files on either side, and
> they will match if there are no errors. I moved about 30GB of compressed
> data this way in three large .tgz files and found no errors -- the
> md5sums always matched.

To me, the file comparison isn't that big of a deal, and I'd only be
concerned about the time it took if it was a cronjob scheduled to run in a
tight amount of time (say every 10 minutes for a 3GB FS). If it's to
populate a new system, it wouldn't bother me. I would say if it's that
much of a concern on the initial load, then you haven't given yourself
enough time to do the work. Remember the 6 P's...

While I admire the thought you put into your process above, IMO it's not
efficient enough for my tastes, and it leaves too many chances for errors.
Here's how I would have done it:
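
As an aside on the differing gzip checksums in the quoted message: gzip
writes a modification time (and, when compressing a named file, the original
filename) into its header, so identical input compressed at different moments
produces different bytes. A quick way to confirm this, assuming your gzip
supports the standard -n/--no-name option, is to repeat the quoted test with
the header timestamp suppressed:

# With -n, gzip omits the timestamp and original name from the header,
# so identical input should now compress to identical bytes.
echo "x" | gzip -n - > test1.gz
echo "x" | gzip -n - > test2.gz
md5sum test?.gz    # the two sums should now match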
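
For what it's worth, the two tar passes over the directory in the quoted
procedure could be collapsed into one by splitting the tar stream with tee
and bash process substitution. A rough sketch, assuming bash on the source
machine; "directory", user@target.machine, and backup.tgz are the placeholders
from the quoted commands, and /tmp/source.md5 is just an arbitrary scratch
file:

# Read the directory once: tee hands the uncompressed tar stream to
# md5sum while the same bytes continue on to gzip and over ssh.
tar pcf - directory \
  | tee >(md5sum > /tmp/source.md5) \
  | gzip \
  | ssh user@target.machine "cat > backup.tgz"

# Then, as in the quoted message, on the target machine:
gunzip -c backup.tgz | md5sum    # compare against /tmp/source.md5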