>>>>> "Hrvoje" == Hrvoje Niksic <hniksic(a)srce.hr> writes:
Hrvoje> Now that we have various optimizations, it looks a bit funny that
Hrvoje> compiled functions parse their arglist on every funcall. Shouldn't it
Hrvoje> be possible to have an ->optimized_arglist element in the compiled
Hrvoje> function, filled out by optimize_compiled_function() (or by something
Hrvoje> else)?
Hrvoje> The element could be a structure, like this:
Hrvoje> struct optimized_args {
Hrvoje>   int argcount, optcount, restcount;
Hrvoje>   Lisp_Object *argvector;
Hrvoje> };
Hrvoje> This would allow for the removal of LIST_LOOP_3 in
Hrvoje> funcall_compiled_function(). At a later date, the code could be
Hrvoje> easily expanded to cleanly process keyword arguments.
Hrvoje> Have you perhaps already tried that line of optimization and concluded
Hrvoje> it was not worth it? Or have you not thought of that?
I've actually done a fair amount of thinking about this.
Particularly painful from a performance point of view is the fact
that almost no functions have &optional or &rest parameters; most
functions take zero, one, or two ordinary parameters. Perhaps we
could special-case those.
For best optimization, you would store the arglist directly inside
the opaque object, alongside the instructions, for better locality of
reference; same with the constant vector. This would also speed up
garbage collection, although bytecode.c would get a little uglier.
It won't buy as much as you'd like, though.
Here is the head of gprof output for (byte-compile-file "simple.el") on Sun:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
26.8 265.02 265.02 24293397 0.01 0.02 execute_optimized_program <cycle 3> [5]
9.0 353.63 88.61 _mcount (6644)
5.7 409.88 56.25 107146042 0.00 0.00 bufpos_to_bytind_func [21]
5.4 463.45 53.57 32144818 0.00 0.00 mark_object <cycle 2> [23]
4.6 508.51 45.06 172449465 0.00 0.00 set_buffer_point [25]
4.2 550.16 41.65 oldarc [26]
3.9 588.82 38.66 2042 18.93 19.33 decode_coding_iso2022 [30]
3.8 626.17 37.35 44511947 0.00 0.00 Ffuncall <cycle 3> [7]
2.7 652.96 26.79 24285878 0.00 0.00 funcall_compiled_function <cycle 3> [39]
2.6 678.27 25.31 83869017 0.00 0.00 readchar [22]
2.2 699.82 21.55 62705330 0.00 0.00 Fmemq [43]
1.6 715.51 15.69 5822540 0.00 0.02 buffer_insert_string_1 [9]
1.6 731.12 15.61 320510 0.05 0.24 skip_chars [15]
1.4 745.24 14.12 7610693 0.00 0.00 Fassq [54]
1.1 756.49 11.25 next [56]
1.0 766.47 9.98 4985 2.00 14.57 garbage_collect_1 <cycle 3> [16]
_mcount and oldarc are measurement overhead.
Here is gprof on Linux:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
46.79 409.06 409.06 24295909 0.02 0.02 execute_optimized_program
9.25 489.94 80.88 22500420 0.00 0.00 mark_object
3.90 524.05 34.11 24288192 0.00 0.00 funcall_compiled_function
3.49 554.54 30.49 44373981 0.00 0.00 Ffuncall
2.94 580.28 25.74 62702820 0.00 0.00 Fmemq
2.58 602.86 22.58 928 24.33 24.33 sweep_conses
2.55 625.15 22.29 95129330 0.00 0.00 set_buffer_point
2.37 645.86 20.71 83653640 0.00 0.00 readchar
1.41 658.16 12.30 5822327 0.00 0.01 buffer_insert_string_1
1.21 668.72 10.56 7614031 0.00 0.00 Fassq
1.17 678.96 10.24 26409642 0.00 0.00 map_extents_bytind
0.92 686.97 8.01 29062131 0.00 0.00 find_symbol_value_1
0.91 694.94 7.97 928 8.59 8.60 compact_string_chars
0.86 702.43 7.49 3187928 0.00 0.01 read_atom_0
0.76 709.04 6.61 24361635 0.00 0.00 symbol_value_in_buffer
0.74 715.53 6.49 5824902 0.00 0.00 signal_after_change
0.70 721.65 6.12 5313527 0.00 0.01 read1
0.66 727.44 5.79 9362082 0.00 0.00 advance_plist_pointers
It seems that even if you make funcall_compiled_function infinitely
fast, this benchmark will only be 5-10% faster, at best.
Hmmmmmmmmmmmmmmmmmmmmmmmmm.
On second thought, the obvious way to exercise funcall is using
mapc.
Here's a gprof output for (mapc (lambda (x) (incf z x)) long-list):
% cumulative self self total
time seconds seconds calls ms/call ms/call name
35.3 226.20 226.20 300015011 0.00 0.00 execute_optimized_program <cycle 3> [6]
23.3 375.60 149.40 300013990 0.00 0.00 funcall_compiled_function <cycle 3> [7]
19.4 500.16 124.56 300364373 0.00 0.00 Ffuncall <cycle 3> [8]
12.2 578.34 78.18 _mcount (6644)
4.6 607.68 29.34 oldarc [14]
1.7 618.80 11.12 300288 0.04 0.04 mapcar1 <cycle 3> [19]
1.4 627.96 9.16 320473 0.03 0.03 Flength [21]
1.0 634.22 6.26 done [24]
0.4 636.56 2.34 3692039 0.00 0.00 mark_object <cycle 2> [29]
0.1 637.00 0.44 2425273 0.00 0.00 readchar [38]
0.0 637.32 0.32 1565358 0.00 0.00 put_char_table <cycle 3> [42]
0.0 637.59 0.27 882 0.31 0.35 decode_coding_no_conversion [43]
0.0 637.85 0.26 46702 0.01 0.01 mark_hash_table <cycle 2> [45]
0.0 638.08 0.23 3414 0.07 0.95 garbage_collect_1 <cycle 3> [26]
0.0 638.30 0.22 1891385 0.00 0.00 cmst_mapfun <cycle 3> [47]
0.0 638.52 0.22 183 1.20 1.20 compact_string_chars [46]
0.0 638.71 0.19 382545 0.00 0.00 map_over_other_charset <cycle 3> [50]
0.0 638.89 0.18 131826 0.00 0.01 read1 <cycle 3> [36]
Hmmmmmmmmmmmmmmmmmmmmmmmmmmm.
Better yet, if you want to measure pure Ffuncall, do (mapc #'identity long-list):
Sun:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
60.2 325.03 325.03 600664277 0.00 0.00 Ffuncall <cycle 3> [5]
17.3 418.21 93.18 _mcount (6644)
7.1 456.57 38.36 oldarc [10]
5.2 484.42 27.85 600000160 0.00 0.00 Fidentity [13]
3.8 504.99 20.57 600284 0.03 0.03 mapcar1 <cycle 3> [14]
3.3 522.66 17.67 620458 0.03 0.03 Flength [16]
1.5 530.71 8.05 done [21]
0.5 533.21 2.50 3692036 0.00 0.00 mark_object <cycle 2> [27]
0.3 534.95 1.74 14960 0.12 0.13 execute_optimized_program <cycle 3> [29]
0.1 535.37 0.42 1565358 0.00 0.00 put_char_table <cycle 3> [39]
0.1 535.74 0.37 2425246 0.00 0.00 readchar [37]
0.1 536.09 0.35 882 0.40 0.42 decode_coding_no_conversion [42]
0.0 536.35 0.26 3414 0.08 1.00 garbage_collect_1 <cycle 3> [25]
0.0 536.57 0.22 46702 0.00 0.00 mark_hash_table <cycle 2> [45]
0.0 536.75 0.18 382545 0.00 0.00 map_over_other_charset <cycle 3> [50]
Linux:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
70.43 55.82 55.82 120162804 0.00 0.00 Ffuncall
9.55 63.39 7.57 120169 0.06 0.06 mapcar1
6.12 68.24 4.85 120000057 0.00 0.00 Fidentity
5.41 72.53 4.29 137347 0.03 0.03 Flength
3.33 75.17 2.64 2315065 0.00 0.00 mark_object
0.44 75.52 0.35 148 2.36 2.36 compact_string_chars
0.40 75.84 0.32 2209869 0.00 0.00 readchar
0.35 76.12 0.28 9975 0.03 0.04 execute_optimized_program
0.34 76.39 0.27 148 1.82 1.82 sweep_symbols
0.32 76.64 0.25 32503 0.01 0.01 mark_hash_table
0.23 76.82 0.18 122237 0.00 0.00 hash_string
0.21 76.99 0.17 87471 0.00 0.00 oblookup
0.18 77.13 0.14 148 0.95 0.95 sweep_conses
0.15 77.25 0.12 148 0.81 0.81 sweep_strings
0.14 77.36 0.11 111560 0.00 0.01 read1
0.13 77.46 0.10 67448 0.00 0.00 read_atom_0
0.10 77.54 0.08 13203 0.01 0.01 re_match_2_internal
Of course, mapc and friends can be sped up by putting Ffuncall-style
smarts into mapcar1(). But offhand I don't see how to do this
elegantly.
Martin
P.S. Caveat benchmarker: gprof numbers are notoriously unreliable.