On Thu, 14 Oct 2010, Scott Raun wrote:

> On Thu, Oct 14, 2010 at 05:01:29PM -0500, Florin Iucha wrote:
>> On Thu, Oct 14, 2010 at 04:35:10PM -0500, Mike Miller wrote:
>>> An example would be that I have an mbox file (email messages) of 300 MB
>>> and containing 50,000 messages and I want to break it into 10 sections of
>>> at least 30 MB each (the tenth section would have to be a little smaller
>>> because there wouldn't be enough file left).
>>>
>>> I can do stuff like this to divide the file "mbox" into individual email
>>> messages, one per file...
>>>
>>> csplit -ksz mbox '/^From /' {*}
>>
>> I don't have an answer to your general question, but in this particular
>> instance csplit would not necessarily do what you want, as there might
>> be a paragraph starting with 'From' at the beginning of the line (which
>> vim e-mail syntax highlighting merrily bolds and colors) that would
>> result in a message split in two. Use 'formail' for this kind of
>> processing.
>
> When I've edited my mbox files with Emacs, anything that would match
> ^From that wasn't actually an e-mail delimiter was actually turned into
> ^>From. My understanding is that this is part of some spec somewhere.

It is, but I don't really do what I said I do. I wrote the regexp as
"^From " because I didn't want to type out the long one. This is the one
I would really use if I didn't want to screw up:

^From \S+\s+\S+\s+\S+\s+\d+\s+\d+:\d+:\d+\s+\d+

The thing is, that would probably slow things down and the other method is
very unlikely to split a message unless I'm dividing a file into many
pieces. I could check that it worked.

I think the "^>From " substitution is not always used, but formail is
supposed to use it:

http://linuxcommand.org/man_pages/formail1.html

I'm not sure of how it decides unless it uses a regexp like the one I show
above.

Mike
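
For what it's worth, here is a rough sketch of how the long regexp could be used to cut an mbox into sections of at least a given size, breaking only at message boundaries. It is in Python rather than csplit, purely as an illustration. The names FROM_LINE and split_mbox are made up for this example, and it assumes the whole file fits in memory and that every separator line has the shape the regexp expects.

```python
import re

# The strict separator pattern from the message above, as a bytes pattern.
# re.MULTILINE makes ^ match at the start of every line, not just the file.
FROM_LINE = re.compile(
    rb"^From \S+\s+\S+\s+\S+\s+\d+\s+\d+:\d+:\d+\s+\d+", re.MULTILINE
)


def split_mbox(data: bytes, min_size: int) -> list:
    """Cut data into chunks of at least min_size bytes.

    Cuts happen only where a line matches FROM_LINE, so no message is
    split in two. The last chunk may be smaller than min_size.
    (Hypothetical helper for illustration, not part of any library.)
    """
    chunks = []
    start = 0
    for match in FROM_LINE.finditer(data):
        pos = match.start()
        # Cut at the first separator that leaves the current chunk big enough.
        if pos - start >= min_size:
            chunks.append(data[start:pos])
            start = pos
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

For the 300 MB case, something like split_mbox(open("mbox", "rb").read(), 30 * 1024 * 1024) would return the sections as a list of byte strings, which could then be written out one per file. A body line such as "From here on..." does not match the pattern, so it would not trigger a cut.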