Executive summary: I withdraw my recent patches for elhash.c,
and propose that all hash table mapping functions operate on copies of
the hash table, partly in response to Jan's and Hrvoje's analysis.
>>>> "Jan" == jan <jan(a)xemacs.org>
writes:
Jan> [Please forgive me that I went overboard on the abstractions. I
Jan> am a Mathematician after all and spent the last two weeks rewriting
Jan> a PCI ethernet driver...]
Jan> Guys, I came to this discussion late, so let me please review what I
Jan> think I understand of the situation. <envy> I am discussing this now
Jan> because I want to have Martin's views on things before he goes off to roam
Jan> around Japan's nature. </envy> Direct responses to me allowed if you
Jan> so prefer. I'll summarize.
Thanks, Jan.
Jan> First define a few terms (I hope we can agree on these), limited to
Jan> the purpose of this mail:
Jan> The Rule: In using maphash, FUNCTION may not modify HASH-TABLE, with
Jan> the one exception that FUNCTION may remhash or puthash the entry
Jan> currently being processed by FUNCTION.
Jan> Correct Code: Code that abides by The Rule when it uses maphash. [It
Jan> is not clear what should be done for map-keymap]
Jan> Incorrect Code: Code that violates The Rule.
Jan> A hash table implementation can be:
Jan> Safe: XEmacs should not crash or infloop when running either Correct
Jan> or Incorrect code.
Jan> Correct: When running Correct Code, maphash should map over all elements
Jan> that are present in the hashtable at the start of the function exactly
Jan> once. When running Incorrect code, elements may be missed.
Jan> Robust: Incorrect code should be terminated by an error.
Jan> Agreed?
Jan> Is the current code "Safe"?
Jan> If I understand things correctly the problem is:
Jan> The current code is demonstrably not Safe when the FUNCTION uses
Jan> puthash, i.e. it crashes. Resizing the hasharray may cause the entries
Jan> array to be replaced by a new one and the old one (still in use by
Jan> maphash) to be freed -> segfault.
Jan> The remhash case is Safe because it only shuffles things around in the
Jan> array.
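For concreteness, here is a minimal sketch of that shuffling, using
illustrative int-keyed entries and made-up names (struct entry,
reinsert, remove_and_reenter) rather than the real hentry machinery in
elhash.c.  Deleting an entry clears its slot and then re-enters every
entry in the cluster that follows it:

#include <stddef.h>

/* Illustrative stand-ins for the real hentry machinery; a zero
   in_use field means the slot is empty. */
struct entry { int key; int value; int in_use; };

/* Toy hash: the home position of a key. */
static size_t
home_position (int key, size_t size)
{
  return (size_t) key % size;
}

/* Linear-probing insertion: take the first free slot at or after the
   home position, wrapping around.  Assumes at least one free slot. */
static void
reinsert (struct entry *table, size_t size, struct entry e)
{
  size_t i = home_position (e.key, size);
  while (table[i].in_use)
    i = (i + 1) % size;
  table[i] = e;
}

/* Delete the entry at `pos', then re-enter every entry in the cluster
   that follows it, stopping at the first empty slot.  A re-entered
   entry can only move closer to its home position along its probe
   sequence, possibly into the slot that was just freed. */
static void
remove_and_reenter (struct entry *table, size_t size, size_t pos)
{
  size_t i;

  table[pos].in_use = 0;
  for (i = (pos + 1) % size; table[i].in_use; i = (i + 1) % size)
    {
      struct entry moved = table[i];
      table[i].in_use = 0;
      reinsert (table, size, moved);
    }
}

Run on Jan's example below, this shuffle produces exactly the
before/after tables shown there once b3 is deleted.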
Jan> Is the current code "Correct"?
Jan> Ben has voiced his doubts about the remhash detection in maphash. As
Jan> stated, I think it is Safe, however I have my doubts about Correctness.
Jan> The test uses the fact that linear probing pushes any colliding
Jan> elements forwards (i.e. from H(item) to H(item)+1, H(item)+2, etc.) in
Jan> the array. Therefore the rehashing done by remhash can only cause
Jan> elements to move backwards. Since we removed only one element
Jan> this can only be done by occupying the now empty spot, which, for
Jan> Correct Code, is the one maphash is currently at. This is what the
Jan> test checks for, and in that case it redoes that spot. This ensures
Jan> that each element is always visited once.
Jan> However I think there is a corner case where this could lead to elements
Jan> being visited TWICE:
Jan> Define H(xn) = n and a table currently populated as
Jan> 0 1 2 3 4 5
Jan> e3 . a2 b3 c3 d3
Jan> *
Jan> Suppose the maphash is position 3 (* in the picture). It has already
Jan> visited e3 and a2. Now FUNCTION deletes b3: if I am not mistaken, the
Jan> table then looks like
Jan> 0 1 2 3 4 5
Jan> .. . a2 c3 d3 e3
Jan> *
Jan> This will cause "e3" to be visited multiple times.
Interesting.
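For reference, a rough reconstruction of the detection being discussed,
in the same illustrative style (not the actual elhash.c loop): snapshot
the slot before calling FUNCTION, and if the slot has been emptied and
refilled by remhash's re-entering, stay on it so the moved entry is
still visited.  Jan's wraparound example defeats this only by visiting
e3 twice, never by skipping an entry:

#include <stddef.h>

/* Same illustrative entry type as in the sketch above. */
struct entry { int key; int value; int in_use; };

/* Map FN over every occupied slot.  If FN removed the entry we just
   visited and a following entry slid back into the freed slot, stay
   on the slot so the moved entry is visited too ("redo that spot"). */
static void
map_over_table (struct entry *table, size_t size,
                void (*fn) (int key, int value))
{
  size_t i = 0;

  while (i < size)
    {
      if (table[i].in_use)
        {
          struct entry seen = table[i];   /* snapshot before the call */

          fn (seen.key, seen.value);

          if (table[i].in_use && table[i].key != seen.key)
            continue;                     /* slot was refilled; redo it */
        }
      i++;
    }
}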
Jan> Is the current code "Robust"?
Jan> No
Jan> Solutions currently Proposed:
Jan> 1. Take a copy of the entries array before mapping over it. Obviously
Jan> Safe and Correct. Disadvantages: possibly slow & produces more
Jan> garbage.
Obviously Safe and Correct. But we want to avoid consing, because of
our current weak GC implementation. This can be done as well (the copy
below lives in plain xmalloc'd memory, not in Lisp objects).
Basically, multiple threads of control accessing the same data
structure is asking for trouble, especially when the data structure is
dynamically modified by some operations. Jan, you may well have found
a bug causing an entry to be processed twice if the mapped FUNCTION
calls remhash.
Another bug: If the hash table is weak, the mapping function might
call gc, which will cause remhash to be called, causing entries in the
hash table to slide around, thereby causing some entries in the hash
table to *never* be processed. Probably no one currently maps over
weak hash tables, but this should also be robust.
All of these problems are solved by making a copy of all entries
before mapping. Below is an implementation of maphash that I
believe to be Safe and Correct.
If we agree that this is the correct approach, I will submit a patch
to make similar changes for elisp_maphash and elisp_map_remhash.
/************************************************************************/
/* Mapping Functions */
/************************************************************************/
/* Unwind handler: free the temporary copy of the hash entries. */
static Lisp_Object
free_hentries_unwind (Lisp_Object unwind_obj)
{
  void *ptr = (void *) get_opaque_ptr (unwind_obj);
  xfree (ptr);
  free_opaque_ptr (unwind_obj);
  return Qnil;
}
DEFUN ("maphash", Fmaphash, 2, 2, 0, /*
Map FUNCTION over entries in HASH-TABLE, calling it with two args,
each key and value in HASH-TABLE.
FUNCTION must not modify HASH-TABLE, with the one exception that FUNCTION
may remhash or puthash the entry currently being processed by FUNCTION.
*/
       (function, hash_table))
{
  const Lisp_Hash_Table * const ht = xhash_table (hash_table);
  /* Map over a copy of the occupied entries, so that FUNCTION may
     freely puthash and remhash without disturbing the iteration. */
  hentry * const hentries = xnew_array (hentry, ht->count);

  {
    const hentry *e1, *sentinel;
    hentry *e2;

    for (e1 = ht->hentries, sentinel = e1 + ht->size, e2 = hentries;
         e1 < sentinel;
         e1++)
      if (!HENTRY_CLEAR_P (e1))
        *(e2++) = *e1;
    assert (e2 == hentries + ht->count);
  }

  {
    Lisp_Object args[3];
    const hentry *e, *sentinel;
    int count = specpdl_depth ();
    struct gcpro gcpro1;

    /* Free the copy even if FUNCTION throws, and protect the copied
       keys and values from garbage collection. */
    record_unwind_protect (free_hentries_unwind,
                           make_opaque_ptr ((void *) hentries));
    GCPRO1 (hentries[0].key);
    gcpro1.nvars = 2 * ht->count;

    args[0] = function;
    for (e = hentries, sentinel = e + ht->count; e < sentinel; e++)
      {
        args[1] = e->key;
        args[2] = e->value;
        Ffuncall (countof (args), args);
      }

    unbind_to (count, Qnil);
    UNGCPRO;
  }
  return Qnil;
}
Jan> 2. Make the implementation Robust. This then makes it easy to be Safe.
Jan> However Martin[1] thinks that doing this is nontrivial and not worth
Jan> it and frankly from what I have seen about it [reference counts,
Jan> dealing properly with recursion and exceptions] I tend to agree that
Jan> it is not worth doing unless absolutely necessary...
Jan> It is also not clear that strict Robustness can be achieved for weak tables.
Agreed.
Jan> The Third Way:
Jan> However I think there is another solution, consisting of two parts:
Jan> a. [Safe] The whole problem is a traditional C resource tracking one.
Jan> If puthash (by way of resize_hash_table) did not free the array until
Jan> maphash wasn't done there was no crash. Thus we have to determine when
Jan> it is safe to do so. This leads to so much the same problems as with
Jan> that "hash being mapped" flag needed for Robustness. However there is
Jan> an easy way out: Have the entries array be garbage collected! Maphash
Jan> would then GCPRO it. I think this only leads to negligible overhead
Jan> and keeps the current simple structure (In fact it does almost all the things
Jan> the Robustness test would need to do but reuses the existing GCPRO
Jan> mechanism).
Jan> b. [Correct] The problem with duplicated items in the remhash case is
Jan> caused by the fact that wraparound can still cause a colliding item
Jan> i to come before position H(i). However we know that all such items
Jan> must come before the first empty position in the array.
Jan> Therefore I propose to keep the test in remhash as it is now, but
Jan> start the loop at the first free position (and stop there again).
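If I understand part (b) correctly, it would look roughly like this in
the same illustrative style (the function name and types are mine, not
Jan's): find the first empty slot, which always exists, scan starting
just past it, and stop when the scan comes back around to it.  Since no
probe cluster straddles the scan's starting point, a re-entered entry
cannot slide from the not-yet-visited part of the scan into the
already-visited part:

#include <stddef.h>

/* Same illustrative entry type as in the earlier sketches. */
struct entry { int key; int value; int in_use; };

/* Scan starting just after the first empty slot and wrap around back
   to it, keeping the redo-the-slot test from the previous sketch. */
static void
map_from_first_hole (struct entry *table, size_t size,
                     void (*fn) (int key, int value))
{
  size_t start, n;

  /* At least one slot is always empty, so this search terminates. */
  for (start = 0; table[start].in_use; start++)
    ;

  for (n = 1; n < size; n++)
    {
      size_t i = (start + n) % size;

      if (table[i].in_use)
        {
          struct entry seen = table[i];

          fn (seen.key, seen.value);

          if (table[i].in_use && table[i].key != seen.key)
            n--;                          /* slot was refilled; redo it */
        }
    }
}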
Tricky.
Weak hash tables are still a problem.
Jan> I think this solution works sanely for Incorrect code as well (in fact it
Jan> might be possible to drop "The Rule" altogether, but I haven't checked
Jan> that), therefore removing the need for complex error detection code.
Jan> Any comments? I think implementing this should be fairly easy and clean,
Jan> possibly using OPAQUE data. In fact the only reason I haven't actually
Jan> tried to do it yet is that I am unsure about implementation details:
Jan> Is OPAQUE the right thing or do we need a new (private) lrecord? I
Jan> also vaguely remember something about bad interactions between opaque
Jan> data and pdumping.
My implementation in this message is my first use of opaque_ptr, which
is the thing to use here to avoid consing.
Jan> If you all agree, I'll write the patch.
Disagree, and hope to persuade all of you that the above is the way to
go.
Jan> Jan
Jan> P.S. I don't think the code in elhash.c is difficult to understand,
Jan> quite the opposite. Linear probing was confirmed to be what I thought
Jan> it was by a simple google search. First hit
Jan> -----------
Jan> Definition: A hash table in which a collision is resolved by putting
Jan> the item in the next empty place in the array following the occupied
Jan> place. Even with a moderate load factor, primary clustering tends to
Jan> slow retrieval.
Jan> [Martin's code does have a comment why he chose linear probing despite
Jan> this]
Jan> See also double hashing, quadratic probing.
Jan> Note: Deletion may be hard because finding collisions again relies on
Jan> not creating empty spots. One solution is to mark an entry as deleted
Jan> so it can be reused for insertion, but the search list is still intact.
Jan> [Which is what I guess is what Martin calls "tombstones", presumably
Jan> also a standard term]
Jan> -----------
Jan> I think using standard terms for well established fields like hashing
Jan> is a good thing.
Jan> I think a few more subtle things deserve commenting (because they are
Jan> important and non-localized).
Agreed. Here's a comment to add to the top of elhash.c.
/* This file implements the hash table lisp object type.
The hash table technique used is "linear probing". Collisions are
resolved by putting the item in the next empty place in the array
following the occupied place. Finding a hash entry performs a
linear search in the cluster starting at the hash value.
On deletions from the hash table, the entries immediately following
the deleted entry are re-entered in the hash table. We do not have
a special way to mark deleted entries (known as "tombstones").
The traditional literature on hash table implementation
(e.g. Knuth) suggests that too much "primary clustering" occurs
with linear probing. However, this literature was written when
locality of reference was not a factor. In a world where a random
access to memory is up to 20 times as expensive as access to the
nearest address (and getting worse), linear probing makes sense.
But the representation doesn't actually matter that much with the
current elisp engine. Funcall is sufficiently slow that the choice
of hash table implementation is noise. */
Jan> 1. The remhash case detection makes use of the special properties of linear
Jan> probing.
Jan> 2. Throughout the code, the fact that there is always at least one empty
Jan> bucket is used as a builtin sentinel to avoid unnecessary tests.
Jan> Example: Prove that find_hentry always terminates in case the item is not
Jan> in the table.
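To illustrate point 2, here is a find_hentry-style lookup, again with
illustrative types rather than the real elhash.c code.  Because the
table is never allowed to become completely full, the probe sequence
must eventually hit either the key or an empty slot, so the loop needs
no explicit "wrapped all the way around" test:

#include <stddef.h>

/* Same illustrative entry type as in the earlier sketches. */
struct entry { int key; int value; int in_use; };

/* Linear-probing lookup.  The guaranteed empty slot is a built-in
   sentinel: if the key is absent, probing stops at the first empty
   slot in the cluster, so the loop always terminates. */
static struct entry *
find_slot (struct entry *table, size_t size, int key)
{
  size_t i = (size_t) key % size;          /* home position */

  while (table[i].in_use && table[i].key != key)
    i = (i + 1) % size;

  /* Either the matching entry, or the empty slot where the key
     would be inserted. */
  return &table[i];
}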
Jan> Footnotes:
Jan> [1] I hope Martin doesn't mind me putting words in his mouth trying
Jan> to "summarize" like this.
Not at all.