>>>>> "Hrvoje" == Hrvoje Niksic <hniksic(a)srce.hr> writes:
Hrvoje> Now that we have various optimizations, it looks a bit funny that
Hrvoje> compiled functions parse their arglist on every funcall. Shouldn't it
Hrvoje> be possible to have an ->optimized_arglist element in the compiled
Hrvoje> function, filled out by optimize_compiled_function() (or by something
Hrvoje> else)?
Hrvoje> The element could be a structure, like this:
Hrvoje> struct optimized_args {
Hrvoje>   int argcount, optcount, restcount;
Hrvoje>   Lisp_Object *argvector;
Hrvoje> };
Hrvoje> This would allow for the removal of LIST_LOOP_3 in
Hrvoje> funcall_compiled_function(). At a later date, the code could be
Hrvoje> easily expanded to cleanly process keyword arguments.
Hrvoje> Have you perhaps already tried that line of optimization and concluded
Hrvoje> it was not worth it? Or have you not thought of that?
I've actually done a fair amount of thinking about this.
Particularly painful from a performance point of view is the fact
that almost no functions have &optional or &rest parameters; most
functions take zero, one, or two ordinary parameters. Perhaps we
could special-case those.
For best optimization, you would store the arglist directly inside
the opaque object, alongside the instructions, for better locality of
reference; same with the constant vector. This would also speed up
garbage collection, although bytecode.c would get a little uglier.
It won't buy as much as you'd like, though.
Here is the head of gprof output for (byte-compile-file "simple.el") on Sun:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
26.8 265.02 265.02 24293397 0.01 0.02 execute_optimized_program <cycle 3> [5]
9.0 353.63 88.61 _mcount (6644)
5.7 409.88 56.25 107146042 0.00 0.00 bufpos_to_bytind_func [21]
5.4 463.45 53.57 32144818 0.00 0.00 mark_object <cycle 2> [23]
4.6 508.51 45.06 172449465 0.00 0.00 set_buffer_point [25]
4.2 550.16 41.65 oldarc [26]
3.9 588.82 38.66 2042 18.93 19.33 decode_coding_iso2022 [30]
3.8 626.17 37.35 44511947 0.00 0.00 Ffuncall <cycle 3> [7]
2.7 652.96 26.79 24285878 0.00 0.00 funcall_compiled_function <cycle 3> [39]
2.6 678.27 25.31 83869017 0.00 0.00 readchar [22]
2.2 699.82 21.55 62705330 0.00 0.00 Fmemq [43]
1.6 715.51 15.69 5822540 0.00 0.02 buffer_insert_string_1 [9]
1.6 731.12 15.61 320510 0.05 0.24 skip_chars [15]
1.4 745.24 14.12 7610693 0.00 0.00 Fassq [54]
1.1 756.49 11.25 next [56]
1.0 766.47 9.98 4985 2.00 14.57 garbage_collect_1 <cycle 3> [16]
_mcount and oldarc are measurement overhead.
Here is gprof on Linux:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
46.79 409.06 409.06 24295909 0.02 0.02 execute_optimized_program
9.25 489.94 80.88 22500420 0.00 0.00 mark_object
3.90 524.05 34.11 24288192 0.00 0.00 funcall_compiled_function
3.49 554.54 30.49 44373981 0.00 0.00 Ffuncall
2.94 580.28 25.74 62702820 0.00 0.00 Fmemq
2.58 602.86 22.58 928 24.33 24.33 sweep_conses
2.55 625.15 22.29 95129330 0.00 0.00 set_buffer_point
2.37 645.86 20.71 83653640 0.00 0.00 readchar
1.41 658.16 12.30 5822327 0.00 0.01 buffer_insert_string_1
1.21 668.72 10.56 7614031 0.00 0.00 Fassq
1.17 678.96 10.24 26409642 0.00 0.00 map_extents_bytind
0.92 686.97 8.01 29062131 0.00 0.00 find_symbol_value_1
0.91 694.94 7.97 928 8.59 8.60 compact_string_chars
0.86 702.43 7.49 3187928 0.00 0.01 read_atom_0
0.76 709.04 6.61 24361635 0.00 0.00 symbol_value_in_buffer
0.74 715.53 6.49 5824902 0.00 0.00 signal_after_change
0.70 721.65 6.12 5313527 0.00 0.01 read1
0.66 727.44 5.79 9362082 0.00 0.00 advance_plist_pointers
It seems that even if you make funcall_compiled_function infinitely
fast, this benchmark will only be 5-10% faster, at best.
Hmmmmmmmmmmmmmmmmmmmmmmmmm.
On second thought, the obvious way to exercise funcall is using
mapc.
Here's a gprof output for (mapc (lambda (x) (incf z x)) long-list):
% cumulative self self total
time seconds seconds calls ms/call ms/call name
35.3 226.20 226.20 300015011 0.00 0.00 execute_optimized_program <cycle 3> [6]
23.3 375.60 149.40 300013990 0.00 0.00 funcall_compiled_function <cycle 3> [7]
19.4 500.16 124.56 300364373 0.00 0.00 Ffuncall <cycle 3> [8]
12.2 578.34 78.18 _mcount (6644)
4.6 607.68 29.34 oldarc [14]
1.7 618.80 11.12 300288 0.04 0.04 mapcar1 <cycle 3> [19]
1.4 627.96 9.16 320473 0.03 0.03 Flength [21]
1.0 634.22 6.26 done [24]
0.4 636.56 2.34 3692039 0.00 0.00 mark_object <cycle 2> [29]
0.1 637.00 0.44 2425273 0.00 0.00 readchar [38]
0.0 637.32 0.32 1565358 0.00 0.00 put_char_table <cycle 3> [42]
0.0 637.59 0.27 882 0.31 0.35 decode_coding_no_conversion [43]
0.0 637.85 0.26 46702 0.01 0.01 mark_hash_table <cycle 2> [45]
0.0 638.08 0.23 3414 0.07 0.95 garbage_collect_1 <cycle 3> [26]
0.0 638.30 0.22 1891385 0.00 0.00 cmst_mapfun <cycle 3> [47]
0.0 638.52 0.22 183 1.20 1.20 compact_string_chars [46]
0.0 638.71 0.19 382545 0.00 0.00 map_over_other_charset <cycle 3> [50]
0.0 638.89 0.18 131826 0.00 0.01 read1 <cycle 3> [36]
Hmmmmmmmmmmmmmmmmmmmmmmmmmmm.
Better yet, if you want to measure pure Ffuncall, do (mapc #'identity long-list):
Sun:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
60.2 325.03 325.03 600664277 0.00 0.00 Ffuncall <cycle 3> [5]
17.3 418.21 93.18 _mcount (6644)
7.1 456.57 38.36 oldarc [10]
5.2 484.42 27.85 600000160 0.00 0.00 Fidentity [13]
3.8 504.99 20.57 600284 0.03 0.03 mapcar1 <cycle 3> [14]
3.3 522.66 17.67 620458 0.03 0.03 Flength [16]
1.5 530.71 8.05 done [21]
0.5 533.21 2.50 3692036 0.00 0.00 mark_object <cycle 2> [27]
0.3 534.95 1.74 14960 0.12 0.13 execute_optimized_program <cycle 3> [29]
0.1 535.37 0.42 1565358 0.00 0.00 put_char_table <cycle 3> [39]
0.1 535.74 0.37 2425246 0.00 0.00 readchar [37]
0.1 536.09 0.35 882 0.40 0.42 decode_coding_no_conversion [42]
0.0 536.35 0.26 3414 0.08 1.00 garbage_collect_1 <cycle 3> [25]
0.0 536.57 0.22 46702 0.00 0.00 mark_hash_table <cycle 2> [45]
0.0 536.75 0.18 382545 0.00 0.00 map_over_other_charset <cycle 3> [50]
Linux:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
70.43 55.82 55.82 120162804 0.00 0.00 Ffuncall
9.55 63.39 7.57 120169 0.06 0.06 mapcar1
6.12 68.24 4.85 120000057 0.00 0.00 Fidentity
5.41 72.53 4.29 137347 0.03 0.03 Flength
3.33 75.17 2.64 2315065 0.00 0.00 mark_object
0.44 75.52 0.35 148 2.36 2.36 compact_string_chars
0.40 75.84 0.32 2209869 0.00 0.00 readchar
0.35 76.12 0.28 9975 0.03 0.04 execute_optimized_program
0.34 76.39 0.27 148 1.82 1.82 sweep_symbols
0.32 76.64 0.25 32503 0.01 0.01 mark_hash_table
0.23 76.82 0.18 122237 0.00 0.00 hash_string
0.21 76.99 0.17 87471 0.00 0.00 oblookup
0.18 77.13 0.14 148 0.95 0.95 sweep_conses
0.15 77.25 0.12 148 0.81 0.81 sweep_strings
0.14 77.36 0.11 111560 0.00 0.01 read1
0.13 77.46 0.10 67448 0.00 0.00 read_atom_0
0.10 77.54 0.08 13203 0.01 0.01 re_match_2_internal
Of course, mapc and friends can be sped up by putting Ffuncall-style
smarts into mapcar1(). But offhand I don't see how to do this
elegantly.
Martin
P.S. Caveat benchmarker: gprof numbers are notoriously unreliable.