Re: large el files and performance: replace words from lists

Sunday, 10 November 2013

Hello --

 On the seventh of November, Uwe Brauer wrote:

...
 I would like to have a lisp pkg which would replace certain words in
a text.[1]

 For this I need two things, 

     - a function which does the replacement and 

     - a list containing words with and without niqqud.

 Concerning the first, I already have a function, which is loosely
 based on iso-accentuate, or a function
 (TeX-to-char) which was provided by Aidan
 some years ago, which replaces LaTeX symbols by their UTF-8
 equivalents. I am not sure which code is more efficient; I'll
 post the central part of the code later.

 However what bothers me more is the second part. I obtained the
 Hebrew Bible in UTF-8 format and could then generate the desired
 list. However it seems to me that this list would be huge, at
 least 2000 to 3000 words if not more.

 What is a reasonable size limit for such a list???

 Is 2000 words too big? Or must I divide the list into several parts
 (and files) and write corresponding functions?

In your position I’d:

-- Take the list of words without niqqud, feed it to regexp-opt, and
surround the result like so: (concat "\\Sw\\(" REGEXP-OPT-RESULT
"\\)\\Sw").
This will construct a regexp that looks for the Hebrew words specifically
as whole words, not as part of other words. It also creates a group around
the word itself, which the replacement loop further down picks up with
(match-string 1).

The interesting question about this approach is how big this regexp will be
and how slow searching with it will be. It will be cached, on the upside.
It’s something you’d need to experiment with; maybe creating several regexps
with regexp-opt is the way to go.
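
A minimal sketch of the regexp construction, to make the experiment
concrete. The function names (my-niqqud-make-regexp,
my-niqqud-make-regexps) and the chunk size are made up for illustration,
and it assumes regexp-opt adds no numbered group of its own when called
without a PAREN argument, so that the explicit group is group 1:

```elisp
;; Hypothetical helpers; only regexp-opt and concat come from the mail.

(defun my-niqqud-make-regexp (words)
  "Return a regexp matching any of WORDS between two non-word characters.
Group 1 holds the matched word itself."
  (concat "\\Sw\\(" (regexp-opt words) "\\)\\Sw"))

(defun my-niqqud-make-regexps (words chunk-size)
  "Split WORDS into chunks of CHUNK-SIZE and return one regexp per chunk.
This is the several-regexps variant, for when a single regexp is too slow."
  (let ((regexps nil)
        (chunk nil)
        (n 0))
    (dolist (w words)
      (push w chunk)
      (setq n (1+ n))
      (when (= n chunk-size)
        (push (my-niqqud-make-regexp (nreverse chunk)) regexps)
        (setq chunk nil
              n 0)))
    ;; Whatever is left over forms the last, shorter chunk.
    (when chunk
      (push (my-niqqud-make-regexp (nreverse chunk)) regexps))
    (nreverse regexps)))
```

Timing (my-niqqud-make-regexp WORDS) against
(my-niqqud-make-regexps WORDS 500) on a real text would show whether
splitting is worth it; 500 is an arbitrary starting point, not a
recommendation.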

-- Save the map from the words without niqqud to the words with niqqud in a
Berkeley or DBM database; see #'open-database, #'put-database, and the code
that uses them in descr-text.el in XEmacs 21.5. Be careful about the CODESYS
argument to #'open-database.

-- Then, the function to do the replacement opens the database you’ve
created, and its inner loop looks like the following:

(while (re-search-forward REGEXP-OPT-RESULT-WITH-\S nil t)
  (replace-match (get-database (match-string 1) DATABASE-HANDLE)
		 t t nil 1))
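
Spelled out a bit more, as a sketch rather than tested XEmacs code: the
version below takes the lookup as a function, so it can be tried with an
in-memory hash table before bothering with a database. All the my-niqqud-
names are made up for illustration. With a database, LOOKUP would be
something like (lambda (word) (get-database word DATABASE-HANDLE)).

```elisp
;; Hypothetical helpers; the hash table stands in for the database.

(defun my-niqqud-table (pairs)
  "Build a hash table from PAIRS, a list of (PLAIN . POINTED) conses."
  (let ((table (make-hash-table :test 'equal)))
    (dolist (pair pairs)
      (puthash (car pair) (cdr pair) table))
    table))

(defun my-niqqud-replace-buffer (regexp lookup)
  "Replace each word matched by group 1 of REGEXP in the current buffer.
LOOKUP is called with the matched word and returns its replacement, or
nil to leave the word alone.  Return the number of replacements made."
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward regexp nil t)
        (let* ((beg (match-beginning 1))
               (end (match-end 1))
               (pointed (funcall lookup (match-string 1))))
          (if pointed
              (progn
                (delete-region beg end)
                (goto-char beg)
                ;; Point ends up just after the new word, so the
                ;; separator that follows can open the next match.
                (insert pointed)
                (setq count (1+ count)))
            (goto-char end)))))
    count))

(defun my-niqqud-demo (text pairs)
  "Return TEXT with the words listed in PAIRS replaced, via a temp buffer."
  (let ((table (my-niqqud-table pairs)))
    (with-temp-buffer
      (insert text)
      (my-niqqud-replace-buffer
       (concat "\\Sw\\(" (regexp-opt (mapcar 'car pairs)) "\\)\\Sw")
       (lambda (word) (gethash word table)))
      (buffer-string))))

;; Example call, with a pair chosen for illustration:
;; (my-niqqud-demo " שלום " '(("שלום" . "שָׁלוֹם")))
```

Two things this makes visible: a word missing from the table is skipped
rather than passed to the replacement as nil, and a word at the very
start or end of the buffer is never matched, because \\Sw needs an
actual character on each side.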

Now, I haven’t paid that much attention to Hebrew, but in Perso-Arabic
script, which often has similar problems, the difficulty with something like
this is that many words have multiple possible vowels, giving different
meanings. E.g. depending on context کشتی can be /keʃti/, ship, or /koʃti/,
wrestling, and the corresponding versions with harakat (the equivalent of
niqqud) would be کِشتی and کُشتی; there’s no way to decide on which without
involving a human who can read the language. So your automatic approach
would be confusing in these contexts.
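
One way to keep the automatic approach out of trouble, sketched under the
assumption that the word list records every reading it has seen: store a
list of pointed candidates per unpointed word and only replace when there
is exactly one. The function names are hypothetical.

```elisp
;; Hypothetical helpers for spotting ambiguous words.

(defun my-niqqud-candidates-table (pairs)
  "Build a hash table mapping each plain word to a list of its readings.
PAIRS is a list of (PLAIN . POINTED) conses; duplicates are dropped."
  (let ((table (make-hash-table :test 'equal)))
    (dolist (pair pairs)
      (let ((seen (gethash (car pair) table)))
        (unless (member (cdr pair) seen)
          (puthash (car pair) (cons (cdr pair) seen) table))))
    table))

(defun my-niqqud-unique-reading (word table)
  "Return the only reading of WORD in TABLE, or nil if none or several."
  (let ((candidates (gethash word table)))
    (and candidates
         (null (cdr candidates))
         (car candidates))))

;; With the example above, both readings land under one key, so the
;; word is reported as ambiguous:
;; (my-niqqud-unique-reading
;;  "کشتی"
;;  (my-niqqud-candidates-table '(("کشتی" . "کِشتی") ("کشتی" . "کُشتی"))))
;; returns nil
```

A replacement loop would then leave the word untouched whenever this
returns nil, and the ambiguous cases stay with a human reader.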

...
 Footnotes:
 [1]  To be precise, to substitute Hebrew words by Hebrew words with
      vowels, so-called niqqud
-- 
‘Liston operated so fast that he once accidentally amputated an assistant’s
fingers along with a patient’s leg, […] The patient and the assistant both
died of sepsis, and a spectator reportedly died of shock, resulting in the
only known procedure with a 300% mortality.’ (Atul Gawande, NEJM, 2012)

_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://lists.xemacs.org/mailman/listinfo/xemacs-beta
