On Sun, 2 Jun 2013, Mike Miller wrote:

> Pretty close, but it's sending out an extra pair of newlines for every
> space in the format string. It does seem to be super fast, though,
> probably faster than the perl scripts I'm testing, which is great.

I was able to fix that newline problem. All I had to do was delete one
line:

26d25
< print "\n"

Another issue is with the way awk and sed handle the data stream -- if
they only need to work with the first few lines, they still process the
entire file. It would be great if it were possible to tell awk to stop
at the last requested line.

Here's an example where I send 10 million lines to the script. This is
how much time it takes just to make all those lines:

$ time -p seq 10000000 >/dev/null
real 5.11
user 5.11
sys 0.01

Here's how long it takes to process those 10 million lines when only the
first 55 lines are needed:

$ time -p seq 10000000 | print_ranges.awk - "1-5 55-55 27-27" >/dev/null
real 27.01
user 20.36
sys 1.25

But here's how long it takes when I add "head -55" to the pipe to drop
the unused lines before piping to the awk script:

$ time -p seq 10000000 | head -55 | print_ranges.awk - "1-5 55-55 27-27" >/dev/null
real 0.05
user 0.01
sys 0.00

My friend's perl script doesn't reorder the lines and it is much slower,
but my friend is working on making it stop after the last processed
line, and if that succeeds it will be much faster.

$ time -p seq 10000000 | ./cutrows_King_1999.pl 1-5,55,27
1
2
3
4
5
27
55
real 56.59
user 59.56
sys 0.07

I probably should try to learn enough perl and awk to understand these
scripts more completely. With what I know now, I think I could write a
wrapper that reads in a string like -5,55,27, puts out "1-5 55-55 27-27",
and also uses "head -55" to reduce the workload on awk.

One thing I don't know how to do is to make something like 92- get
interpreted like this...

92-$(wc -l file)

...but without having to run wc -l, which could take forever on a very
big file, possibly doubling the processing time.

I also have to test these things for how well they deal with stuff like
1-10,12-10000000. That is, just dropping one line out of a 10 million
line file.

Mike
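
P.S. Here is roughly what I imagine the early exit looking like inside the
awk script itself -- I'm only guessing at how print_ranges.awk stores the
ranges, so assume it keeps the largest requested line in a variable that
I'm calling maxline here:

NR > maxline { exit }   # quit once we're past the last requested line

That should give about the same speedup as the "head -55" trick in the
example above, without the extra process in the pipe.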
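
P.P.S. And a rough, untested sketch of the wrapper I described -- it turns
-5,55,27 into "1-5 55-55 27-27", figures out the largest line needed for
head, and falls back to reading the whole file when there is an open-ended
range like 92- (the 999999999 is just a stand-in for "last line", since I
still don't know how to get the real number without running wc -l):

#!/bin/sh
# usage: cutrows_wrapper.sh "RANGESPEC" FILE
spec=$1
file=$2

ranges=""
max=0
open=0

for r in $(printf '%s\n' "$spec" | tr ',' ' '); do
    case $r in
        *-*) lo=${r%%-*}; hi=${r#*-} ;;   # "1-5", "-5", "92-"
        *)   lo=$r;       hi=$r      ;;   # "27" becomes "27-27"
    esac
    [ -z "$lo" ] && lo=1                  # "-5"  means "1-5"
    if [ -z "$hi" ]; then                 # "92-" means 92 to end of file
        open=1
        hi=999999999
    fi
    ranges="$ranges $lo-$hi"
    [ "$hi" -gt "$max" ] && max=$hi
done
ranges=${ranges# }                        # drop the leading space

if [ "$open" -eq 1 ]; then
    # an open-ended range means awk has to read the whole file anyway
    print_ranges.awk - "$ranges" < "$file"
else
    head -n "$max" "$file" | print_ranges.awk - "$ranges"
fi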