[Novalug] Fwd: perl help | filename spaces
Jon LaBadie
novalugml at jgcomp.com
Fri Mar 19 11:32:50 EDT 2010
On Fri, Mar 19, 2010 at 04:52:56AM -0400, Michael Henry wrote:
> On 03/18/2010 12:36 PM, Jon LaBadie wrote:
> > On Thu, Mar 18, 2010 at 06:39:20AM -0400, Michael Henry wrote:
> >> But that
> >> downside was removed with the addition of the "+" delimiter for
> >> ``-exec`` (I'm not sure how long ago).
> >
> > Quite a long time ago.
>
> After a quick look, I think it was adopted into POSIX around
> 2004, and GNU find picked it up around 2005. From the NEWS file
> in GNU findutils::
Agreed. I'm colored by having used it since the late 80's or early
90's when it was adopted in SVR4 as you note. Quite a few systems
implemented the feature but did not document it. I was unaware
of this until I checked with my UNIX history guru, Sven Mascheck.
His page on the '+' delimiter is:
http://www.in-ulm.de/~mascheck/various/find/#xargs
> >> Here's a quick benchmark that's repeatable on my box:
> >
> > Timings of such a short duration are nearly meaningless.
>
> As I mentioned, the test results are fully repeatable.
> Benchmarking can take a lot of time, and I wasn't trying to give
> numbers that demonstrate expected average time ratios.
When I saw your results, my (likely biased) view was different
from yours. IIRC you suggested that with the + delimiter,
find, being single-threaded, had to work in a linear fashion,
while the pipelined version with xargs naturally allowed
for multi-processing efficiencies.
Perhaps you are correct, and there are reasons why find must
wait for the exec'ed process to finish before continuing to
execute its find activity, but if I were writing find, I'd
strive very hard to fork off that process and continue finding.
So I looked for alternative explanations of longer clock (real)
times you noted in your test. My hypothesis is the kernel's
cpu (core) allocation algorithm on a multi-cpu system like yours.
Note, all your timings of the xargs version are less "efficient"
than the + versions as they show higher cpu usage (user + sys)
than the corresponding + versions. It is clear that xargs and
find were often running on different cpus as their real time
is less than their cpu usage. In contrast, the + versions
basically have identical cpu usage and clock time indicating
a saturation of a single cpu.
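
The reasoning above comes down to comparing (user + sys) with real.
A minimal sketch of that arithmetic follows. The numbers are invented
for illustration and are not the timings from the quoted benchmark,
and cpu_ratio is a hypothetical helper name. A ratio near 1 suggests
one saturated cpu; a ratio above 1 suggests the work was spread over
more than one.

```shell
#!/bin/sh
# cpu_ratio REAL USER SYS: print (user + sys) / real to two decimals.
cpu_ratio() {
    awk -v r="$1" -v u="$2" -v s="$3" \
        'BEGIN { printf "%.2f\n", (u + s) / r }'
}

# Invented sample values, in seconds (not measured data).
pipeline_ratio=$(cpu_ratio 0.50 0.40 0.30)   # find | xargs style run
plus_ratio=$(cpu_ratio 0.60 0.35 0.25)       # find -exec ... {} + style run

echo "pipeline: $pipeline_ratio  plus: $plus_ratio"
```

The real, user and sys figures would come from the shell's time
keyword or from time -p, run around each form of the command.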
I wonder if the cpu allocation algorithm initially allocates
separate cpus to processes started at essentially the same
time (as in a pipeline) but for a single fork uses the current
cpu, as some metadata is already there and does not need to
be transferred to the other cpu's cache. For similar reasons,
a process may be partially constrained to the same cpu so
as to avoid the inefficiency of moving/duplicating the
process metadata.
Whether my analysis is reasonable or not, I agree with you
that whatever activity is exec'ed or xarg'ed is likely to
dominate the timings.
BTW I tried your "chmod" example on a mono-core system and
saw no repeatable differences between the + version and
the xargs version. Not a surprise.
I'll still recommend the -exec + versions. I think it is
sufficiently widespread, probably more so than -print0 | xargs -0
if you also consider UNIX systems. Xargs also has some
problems dealing with multi-byte character sets and commands
where the filenames are not the last items on the command line.
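
For reference, here is a small sketch of the two forms side by side
on file names containing spaces, plus one common way to handle a
command that wants the file names somewhere other than last. It
assumes a find and xargs that support -exec ... {} +, -print0 and -0.
The scratch directories and file names are made up. The sh -c wrapper
is just one idiom, not the only option; find appends the batch at the
end too, so the wrapper is what moves the names to the middle.

```shell
#!/bin/sh
# Scratch directories; the file names are made up for illustration.
src=$(mktemp -d)
dest=$(mktemp -d)
touch "$src/plain.txt" "$src/with space.txt" "$src/two  spaces.txt"

# Count the arguments each form hands to the command in one batch.
plus_count=$(find "$src" -type f -exec sh -c 'echo "$#"' sh {} +)
xargs_count=$(find "$src" -type f -print0 |
    xargs -0 sh -c 'echo "$#"' sh)

# cp wants the destination last, so a small sh -c wrapper takes the
# destination as its first argument and places the batch of names
# ("$@") before it on the cp command line.
find "$src" -type f \
    -exec sh -c 'd=$1; shift; cp "$@" "$d"' sh "$dest" {} +
copied=$(find "$dest" -type f | wc -l | tr -d ' ')

echo "plus=$plus_count xargs=$xargs_count copied=$copied"
rm -rf "$src" "$dest"
```

With only three files, each form should run the command once, so
both counts and the number of copied files should all be 3.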
--
Jon H. LaBadie jon at jgcomp.com
JG Computing
12027 Creekbend Drive (703) 787-0884
Reston, VA 20194 (703) 787-0922 (fax)