I'm not sure why you would expect memcmp to be faster. When
I spent time optimizing Feval and related things to make the
interpreter faster, the biggest gains came from inlining
things that were previously function calls. Function call
overhead can be a real killer. You have to do register
save/restore plus pushing and popping args off the stack.
And you still have to compare the same bits that the old
inlined code did. If there's a machine instruction that lets you
do memcmp in an instruction or two, you will save some time, but
given chip speeds today, most of the time is probably spent in
the memory fetches. You can't get around the fetches.
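
To make the comparison concrete, here is a minimal sketch of the two
approaches. The function names and the word-sized element type are my
own assumptions for illustration, not the actual interpreter code. Both
versions have to read every word they compare; the second one adds a
call into the library unless the compiler expands memcmp itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compare N words with a loop the compiler can
   inline at the call site, so no call overhead is paid.  */
static inline int
words_equal_inline (const unsigned long *a, const unsigned long *b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (a[i] != b[i])
      return 0;
  return 1;
}

/* Hypothetical helper: the same check through memcmp.  The same bytes
   still have to be fetched from memory.  */
static int
words_equal_memcmp (const unsigned long *a, const unsigned long *b, size_t n)
{
  return memcmp (a, b, n * sizeof *a) == 0;
}
```

Both functions return 1 when the two arrays hold the same N words and 0
otherwise, so any difference between them is purely one of cost, and
that would have to be measured on the real code.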