Thank you so much for doing this! We need WAY more profiling done of XEmacs.
Why don't you post exactly the steps you took in order to get the profiling output, as
well as the relevant part of the output itself? I think people are scared off of
profiling because XEmacs is big and does nonstandard things with its executable, and they
don't want to figure out how to do it.
I really really think you should implement your comment about memcmp(). Do something like
this:
1. Move the non-compared fields to the end (Bufpos and Endpos, I think) and create an
extra substructure that includes everything else. This means you have to change all the
places in the code that access the fields, but there will be no slowdown as a result of
this.
2. Give the union a name, and make sure it's bzero()d whenever it's changed. (This should
cause negligible slowdown, right? The structure is compared *much* more often than
changed, I presume.)
3. Replace compare_runes() with a memcmp() of the substructure and a call to
WINDOW_FACE_CACHEL_DIRTY().
ben
Jan Vroonhof wrote:
> ...and for free too! If I count correctly the change to struct rune
> does not change its size at all.
>
> A while ago I ran XEmacs under C level profiling and this is the result
>
> time xemacs.old -vanilla -eval '(progn (loop repeat 10 do (hanoi 6)) (kill-emacs))'
> 93.300u 2.080s 1:41.36 94.1% 0+0k 0+0io 1387pf+0w
> time xemacs.new -vanilla -eval '(progn (loop repeat 10 do (hanoi 6)) (kill-emacs))'
> 82.100u 2.030s 1:27.44 96.2% 0+0k 0+0io 1488pf+0w
> time xemacs.new -vanilla -eval '(progn (loop repeat 10 do (hanoi 6)) (kill-emacs))'
> 80.910u 2.280s 1:28.60 93.8% 0+0k 0+0io 1386pf+0w
> time xemacs.old -vanilla -eval '(progn (loop repeat 10 do (hanoi 6)) (kill-emacs))'
> 94.700u 2.030s 1:41.31 95.4% 0+0k 0+0io 1387pf+0w
>
> 1999-11-16 Jan Vroonhof <vroonhof(a)math.ethz.ch>
>
> * src/redisplay-output.c (compare_runes): Add comments about
> results from profiling.
>
> * src/redisplay.h (struct rune): Do not use bitfields for members.
> (struct rune): Add various comments about further optimizations.
>
> Index: src/redisplay-output.c
> ===================================================================
> RCS file: /usr/CVSroot/XEmacs/xemacs/src/redisplay-output.c,v
> retrieving revision 1.11.2.19
> diff -u -u -r1.11.2.19 redisplay-output.c
> --- redisplay-output.c 1999/10/15 10:11:50 1.11.2.19
> +++ redisplay-output.c 1999/11/16 21:05:23
> @@ -190,6 +190,27 @@
> /* Do not compare the values of bufpos and endpos. They do not
> affect the display characteristics. */
>
> + /* Note: (hanoi 6) spends 95% of its time in redisplay, and about
> + 30% here. Not using bitfields for rune.type alone gives a redisplay
> + speed up of 10%.
> +
> + #### In profile arcs run of a normal Gnus session this function
> + is run 6.76 million times, only to return 1 in 6.73 million of
> + those.
> +
> + In addition a quick look GCC sparc assembly shows that GCC is not
> + doing a good job here.
> + 1. The function is not inlined (too complicated?)
> + 2. It seems to be reloading the crb and drb variables all the
> + time.
> + 3. It doesn't seem to notice that the second half of these if's
> + are really a switch statement.
> +
> + So I (JV) conjecture
> +
> + #### It would really be worth it to arrange for this function to
> + be (almost) a single call to memcmp. */
> +
> if ((crb->findex != drb->findex) ||
> (WINDOW_FACE_CACHEL_DIRTY (w, drb->findex)))
> return 0;
> Index: src/redisplay.h
> ===================================================================
> RCS file: /usr/CVSroot/XEmacs/xemacs/src/redisplay.h,v
> retrieving revision 1.7.2.13
> diff -u -u -r1.7.2.13 redisplay.h
> --- redisplay.h 1999/10/26 05:48:53 1.7.2.13
> +++ redisplay.h 1999/11/16 21:05:46
> @@ -91,6 +91,13 @@
> but control characters have two -- a ^ and a letter -- and other
> non-printing characters (those displayed in octal) have four. */
>
> +/* WARNING! In compare_runes (one of the most heavily used functions)
> + two runes are compared. So please be careful with changes to this
> + structure. See comments in compare_runes.
> +
> + #### This should really be made smaller.
> +*/
> +
> typedef struct rune rune;
> struct rune
> {
> @@ -105,10 +112,6 @@
> each of the face properties in this
> particular window. */
>
> - short xpos; /* horizontal starting position in pixels */
> - short width; /* pixel width of rune */
> -
> -
> Bufpos bufpos; /* buffer position this rune is displaying;
> for the modeline, the value here is a
> Charcount, but who's looking? */
> @@ -116,11 +119,26 @@
> /* #### Chuck, what does it mean for a rune
> to cover a range of pos? I don't get
> this. */
> - unsigned int cursor_type :3; /* is this rune covered by the cursor? */
> - unsigned int type :3; /* type of rune object */
> + /* #### This isn't used as an rvalue anywhere!
> + remove! */
> +
> +
> + short xpos; /* horizontal starting position in pixels */
> + short width; /* pixel width of rune */
> +
> +
> + unsigned char cursor_type; /* is this rune covered by the cursor? */
> + unsigned char type; /* type of rune object */
> + /* We used to do bitfields here, but if I
> + (JV) count correctly that doesn't matter
> + for the size of the structure. All the bit
> + fiddling _does_ slow down redisplay by
> + about 10%. So don't do that */
>
> union /* Information specific to the type of rune */
> {
> + /* #### GLyps are are. Is it really necessary to waste 8 bytes on every
> + rune for that?! */
> /* DGLYPH */
> struct
> {