How to find offsets in Unicode Text fast

Geoff Canyon gcanyon at gmail.com
Mon Nov 12 18:20:37 EST 2018


On Mon, Nov 12, 2018 at 11:36 AM Ben Rubinstein via use-livecode <
use-livecode at lists.runrev.com> wrote:

>
> I'm really confused that case-insensitive should work at all for UTF-16 or
> UTF-32; at this point, as far as I understand it, LC has no idea how to
> correctly interpret the value of the variable as text.
>
> Mr Very Picky would also suggest that to be really correct, the code in
> this case should also check that the offset found was on a four-byte
> boundary (tPos mod 4 = 1); it's probably a purely theoretical
> consideration, but I think that the four-byte sequence (representing the
> character you're searching for) could in theory be incorrectly matched
> across two other characters.
>

I also thought of the four-byte boundary consideration. The code below is
available at: https://github.com/gcanyon/alloffsets

For example, previous UTF-32 versions will fail on characters like 𐀁
(U+10001), whose UTF-32 encoding is 00010001; searching for 𐀁 in 𐀁𐀁𐀁
would therefore return 1,1,2,2,3. I don't know how many other such cases
there are; given the code points currently assigned in Unicode, probably a
few, but not many. The failure searching for "Reykjavík" in "Reykjavík er
höfuðborg" is weirder and worse, obviously.
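
A minimal sketch of where the unaligned hits come from. It assumes
big-endian byte order, so that 𐀁 encodes as the bytes 00 01 00 01 (plain
"UTF-32" may use a different byte order), and the handler name
showByteHits is made up for illustration:

```livecode
on showByteHits
   local tFind, tString, tPos, tNewPos, tHits
   -- U+10001 is 65537 in decimal; UTF-32BE gives the bytes 00 01 00 01
   put textEncode(numToCodepoint(65537), "UTF-32BE") into tFind
   put tFind & tFind & tFind into tString
   put 0 into tPos
   repeat forever
      put offset(tFind, tString, tPos) into tNewPos
      if tNewPos = 0 then exit repeat
      add tNewPos to tPos
      -- record the byte offset and whether it starts on a four-byte boundary
      put tPos & ":" & (tPos mod 4 = 1) & comma after tHits
   end repeat
   put char 1 to -2 of tHits
end showByteHits
```

Going by the byte layout alone, the pattern 00 01 00 01 occurs at bytes 1,
3, 5, 7 and 9 of the twelve-byte string. Only 1, 5 and 9 satisfy
tPos mod 4 = 1, and tPos div 4 + 1 maps the five hits to 1,1,2,2,3.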

I was puzzled at first by the case-insensitive functionality in
UTF-32-encoded strings, but I realized that standard case-insensitive
searches are presumably just implemented as a set of exceptions at a low
level. For example, the engine isn't looking at "a" and "A" and
saying, "those are the same." Instead, it's looking at raw ASCII and
mapping 97 to 65 if case-insensitive is requested. The same must be true of
UTF-32: the engine isn't looking at "Ѡ" and "ѡ", it's mapping 00000460
to 00000461. I agree that it seems a little odd that LC knows the string of
binary data is "text", but maybe there's some trick to that?

Anyway, here's the boundary-respecting, but still-flawed version of
UTF-32-based allOffsets, with a documented bad example in a comment:

function allOffsetsUTF32 pFind,pString,pCaseSensitive,pNoOverlaps
   -- returns a comma-delimited list of the offsets of pFind in pString
   -- note, this seems to fail on some searches, for example:
   -- searching for "Reykjavík" in "Reykjavík er höfuðborg"
   -- It is uncertain why.
   -- See thread here:
   -- http://lists.runrev.com/pipermail/use-livecode/2018-November/251357.html
   local tNewPos, tPos, tResult, tSkip
   put textEncode(pFind,"UTF-32") into pFind
   put textEncode(pString,"UTF-32") into pString
   if pNoOverlaps then put length(pFind) - 1 into tSkip

   set the caseSensitive to pCaseSensitive is true
   put 0 into tPos
   repeat forever
      put offset(pFind, pString, tPos) into tNewPos
      if tNewPos = 0 then exit repeat
      add tNewPos to tPos
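      if tPos mod 4 = 1 then
         -- only a hit that starts on a four-byte boundary is a real match;
         -- skipping after an unaligned hit could jump over a real one
         put (tPos div 4 + 1) & comma after tResult
         if pNoOverlaps then add tSkip to tPos
      end if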
   end repeat
   if tResult is empty then return 0
   else return char 1 to -2 of tResult
end allOffsetsUTF32
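
A short usage sketch, assuming the function above is in the message path
and that textEncode adds no byte order mark. The handler name
testAllOffsets is made up for illustration:

```livecode
on testAllOffsets
   local tOffsets
   -- third parameter: case-sensitive search
   -- fourth parameter: skip overlapping matches
   put allOffsetsUTF32("ab", "abcabcab", true, false) into tOffsets
   put tOffsets
end testAllOffsets
```

On paper, "ab" encodes to 8 bytes and the search string to 32, with
matches starting at bytes 1, 13 and 25; tPos div 4 + 1 turns those into
character offsets 1, 4 and 7.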


