[Novalug] Fwd: perl help | filename spaces

Jon LaBadie novalugml at jgcomp.com
Thu Mar 18 12:36:51 EDT 2010


On Thu, Mar 18, 2010 at 06:39:20AM -0400, Michael Henry wrote:
> On 03/17/2010 10:42 PM, Jon LaBadie wrote:
> > On Wed, Mar 17, 2010 at 09:18:18PM -0400, Ken Kauffman wrote:
> >> find . \( -name cache -prune \) -o  -name "*.htm*" -print0 | xargs -0 perl
> >> -i.save -pe "s/news.html/news\//g;"
> >
> > You can get the same effect without the -print0, pipe, and extra command.
> >
> > find . \( -name cache -prune \) -o  -name "*.htm*" \
> >      -exec perl -i.save -pe "s/news.html/news\//g;" {} +
> 
> In the past, the ``-exec`` option had only one syntax,
> ``-exec command {} ;``.  That form was discouraged in cases where
> the associated command could accept multiple filenames because it
> spawned a new invocation of the command for every filename.  The
> ``xargs``-based solution optimized for this case by aggregating
> the maximum number of filenames together per invocation of the
> command, minimizing the number of spawned processes.  But that
> downside was removed with the addition of the "+" delimiter for
> ``-exec`` (I'm not sure how long ago). 

Quite a long time ago.
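
For anyone who wants to see the difference in invocation counts, here
is a minimal sketch.  The scratch directory and file names are made up
for illustration, and -print0 / -0 assume GNU or BSD find and xargs.

```shell
#!/bin/sh
# Scratch directory with a few files (hypothetical names; one contains a space).
tmp=$(mktemp -d)
touch "$tmp/a.htm" "$tmp/b.html" "$tmp/c d.htm"

# -exec ... {} \;  starts the command once per matching file.
per_file=$(find "$tmp" -name '*.htm*' -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')

# -exec ... {} +  hands the command a batch of filenames per invocation.
batched=$(find "$tmp" -name '*.htm*' -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')

# The find | xargs pipeline, for comparison.
piped=$(find "$tmp" -name '*.htm*' -print0 | xargs -0 sh -c 'echo run' sh | wc -l | tr -d ' ')

echo "per-file: $per_file  batched: $batched  xargs: $piped"
rm -rf "$tmp"
```

With only three short filenames, both batching forms should need a
single invocation, while the \; form needs one per file.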

> When using ``-exec {} +`` as Jon has suggested, ``find`` will
> aggregate filenames and minimize the number of spawned
> processes.  In most ways, it's equivalent to the ``xargs``-based
> solution, but there are slight performance differences.
> 
> I'd done some benchmarking in the past and got a result that
> seems counter-intuitive at first: spawning the extra ``xargs``
> process actually saves time overall.  I believe
> this is due to the multiprocessing going on between ``find``
> (which can continue to find files in the background) and
> ``xargs`` (which spawns the desired command on batches of
> already-found filenames).
> 
> Here's a quick benchmark that's repeatable on my box:
> 
>   $ cd ~/projects
>   $ time find -type f -exec grep bigteststring {} +
> 
>   real    0m0.210s
>   user    0m0.073s
>   sys     0m0.137s
> 
>   $ time find -type f -print0 | xargs -0 grep bigteststring
> 
>   real    0m0.172s
>   user    0m0.080s
>   sys     0m0.140s
> 
> It's not a huge difference, but it's repeatable.  It's certainly
> not enough to change what you write interactively at the prompt,
> but in a script it may be worth considering.  Either way, I find
> it an interesting result.

Timings of such a short duration are nearly meaningless.
The order of the tests could also have an effect, as the
first command line may pull things into cache that the second
then does not have to read from disk.  As an example, in my
large home dir, I ran a simple find . -type f > /dev/null 2>&1.
The first run took 16.5 seconds, the second 6.1, and subsequent
runs took only 1.3 - 1.4 seconds.

Your suggestion that multi-processing can have an effect
is valid.  Note your cpu usage (user+sys) was 0.22 seconds
in the xargs version, higher than its clock time of 0.17 sec
and higher even than the clock time of the no-xargs version.
But find could also multi-process and continue "finding"
after "exec'ing" a grep.
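
The arithmetic behind that observation, as a small awk sketch that
uses only the figures from the quoted benchmark (the variable names
are mine):

```shell
#!/bin/sh
# Compare cpu time (user + sys) of the xargs run against both clock times.
cpu_check=$(awk 'BEGIN {
    xargs_cpu  = 0.080 + 0.140   # user + sys, xargs version
    xargs_real = 0.172           # real, xargs version
    exec_real  = 0.210           # real, -exec {} + version
    printf "xargs cpu = %.3f\n", xargs_cpu
    printf "cpu > xargs real:    %s\n", (xargs_cpu > xargs_real) ? "yes" : "no"
    printf "cpu > no-xargs real: %s\n", (xargs_cpu > exec_real) ? "yes" : "no"
}')
printf '%s\n' "$cpu_check"
```

Cpu time above clock time is only possible when work overlaps on
more than one core, which is what the comparison is meant to show.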

I repeated your test on a 2 core laptop from a directory
that gave about a full minute of clock time.  First I
saturated the caches with the inodes and directory blocks
using a simple find . -type f, then I ran your tests two
times each.  Order was no-xargs, xargs, xargs, no-xargs.
Average timings were:

  clock (real)          67.34       67.30
  cpu (user + sys)       7.33        7.41

The timings are similar enough that I didn't bother
putting column headings on them.
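
A rough sketch of that procedure, for anyone who wants to repeat it.
This is not the exact script I ran: BENCH_DIR, the function names,
and the whole-second timing via date +%s are assumptions, and the
latter only makes sense for runs that last many seconds.

```shell
#!/bin/sh
# Tree to search; BENCH_DIR is a hypothetical variable. Without it, a tiny
# scratch directory is used so the sketch is harmless to run as-is.
scratch=""
dir=${BENCH_DIR:-}
if [ -z "$dir" ]; then
    scratch=$(mktemp -d)
    dir=$scratch
    echo "no match here" > "$dir/sample.txt"
fi
pattern=bigteststring

no_xargs()   { find "$dir" -type f -exec grep "$pattern" {} + >/dev/null 2>&1; }
with_xargs() { find "$dir" -type f -print0 | xargs -0 grep "$pattern" >/dev/null 2>&1; }

# Warm the inode and directory caches so the first timed run is not penalized.
find "$dir" -type f >/dev/null 2>&1

# Symmetric order, so neither variant always runs after the other.
results=""
for variant in no_xargs with_xargs with_xargs no_xargs; do
    start=$(date +%s)
    "$variant" || true    # grep exits non-zero when nothing matches
    end=$(date +%s)
    results="$results$variant $((end - start))
"
done
printf '%s' "$results"

# Remove the scratch directory only if this script created it.
if [ -n "$scratch" ]; then rm -rf "$scratch"; fi
```

Point BENCH_DIR at a tree large enough to take a minute or so, and
average the two runs of each variant.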

jl
-- 
Jon H. LaBadie                  jon at jgcomp.com
 JG Computing
 12027 Creekbend Drive		(703) 787-0884
 Reston, VA  20194		(703) 787-0922 (fax)


