[tclug-list] csplit

Thu Oct 14 18:12:46 CDT 2010

On Thu, 14 Oct 2010, Scott Raun wrote:

> On Thu, Oct 14, 2010 at 05:01:29PM -0500, Florin Iucha wrote:
>> On Thu, Oct 14, 2010 at 04:35:10PM -0500, Mike Miller wrote:
>>> An example would be that I have an mbox file (email messages) of 300 MB
>>> and containing 50,000 messages and I want to break it into 10 sections of
>>> at least 30 MB each (the tenth section would have to be a little smaller
>>> because there wouldn't be enough file left).
>>>
>>> I can do stuff like this to divide the file "mbox" into individual email
>>> messages, one per file...
>>>
>>> csplit -ksz mbox '/^From /' {*}
>>
>> I don't have an answer to your general question, but in this particular 
>> instance csplit would not necessarily do what you want, as there might 
>> be a paragraph starting with 'From' at the beginning of the line (which 
>> vim e-mail syntax highlighting merrily bolds and colors) that would 
>> result in a message split in two.  Use 'formail' for this kind of 
>> processing.
>
> When I've edited my mbox files with Emacs, anything that would match 
> ^From that wasn't actually an e-mail delimiter was actually turned into 
> ^>From. My understanding is that this is part of some spec somewhere.

It is, but I don't really do what I said I do.  I wrote the regexp as 
"^From " because I didn't want to type out the long one.  This is the one 
I would really use if I didn't want to screw up:

^From \S+\s+\S+\s+\S+\s+\d+\s+\d+:\d+:\d+\s+\d+

The thing is, the long regexp would probably slow things down, and the short 
one is very unlikely to split a message unless I'm dividing a file into many 
pieces.  I could always check afterward that it worked.
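
One way to do that check, sketched in Python rather than with csplit: count 
the lines matched by the short pattern and by the long one, and see whether 
the counts agree.  The sample lines and addresses below are made up for 
illustration, and Python's re module is used only because it accepts the 
Perl-style \S, \s and \d; csplit's own pattern syntax may differ.

```python
import re

# Short pattern from the csplit example, and the long one shown above.
LOOSE = re.compile(r"^From ")
STRICT = re.compile(r"^From \S+\s+\S+\s+\S+\s+\d+\s+\d+:\d+:\d+\s+\d+")


def count_separators(lines, pattern):
    """Count the lines that the given pattern treats as message separators."""
    return sum(1 for line in lines if pattern.match(line))


# Hypothetical mbox excerpt: two real separators plus one body line
# that happens to start with "From ".
sample = [
    "From mike@example.org Thu Oct 14 18:12:46 2010",
    "Subject: csplit",
    "",
    "From what I can tell, this body line is not a separator.",
    "",
    "From scott@example.org Thu Oct 14 17:30:00 2010",
    "Subject: Re: csplit",
]

loose_hits = count_separators(sample, LOOSE)
strict_hits = count_separators(sample, STRICT)
print("short pattern:", loose_hits, "long pattern:", strict_hits)
```

If the two counts differ on a real mbox, some body line starts with "From " 
and the short pattern would have cut a message in two at that point.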

I think the "^>From " substitution is not always used, but formail is 
supposed to use it:

http://linuxcommand.org/man_pages/formail1.html

I'm not sure how it decides, unless it uses a regexp like the one I show 
above.
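
Coming back to the original question (pieces of at least 30 MB, cut only 
between messages), here is a rough Python sketch of the idea.  It is not what 
csplit or formail does; split_mbox and min_bytes are names invented for this 
example, and the tiny mbox is generated test data.  It assumes the long 
regexp above is good enough to recognize a separator line.

```python
import re

# Same long pattern as above, as a bytes regexp so the file can be read
# in binary mode without worrying about encodings.
FROM_LINE = re.compile(rb"^From \S+\s+\S+\s+\S+\s+\d+\s+\d+:\d+:\d+\s+\d+")


def split_mbox(lines, min_bytes):
    """Group lines into chunks, cutting only at a separator line and only
    once the current chunk already holds at least min_bytes."""
    chunks, current, size = [], [], 0
    for line in lines:
        if current and size >= min_bytes and FROM_LINE.match(line):
            chunks.append(b"".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        # The last chunk may be smaller than min_bytes.
        chunks.append(b"".join(current))
    return chunks


# Made-up mbox with five short messages.
messages = [
    b"From user%d@example.org Thu Oct 14 18:12:46 2010\n"
    b"Subject: test %d\n\nbody line\n\n" % (i, i)
    for i in range(5)
]
mbox_bytes = b"".join(messages)
chunks = split_mbox(mbox_bytes.splitlines(keepends=True), min_bytes=100)
print([len(c) for c in chunks])
```

For a real file, iterating over open("mbox", "rb") yields the lines as bytes, 
and min_bytes would be 30 * 1024 * 1024.  Each chunk is held in memory here, 
so writing lines straight to an output file would be kinder to a 300 MB mbox.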

Mike