How to find offsets in Unicode Text fast
Geoff Canyon
gcanyon at gmail.com
Mon Nov 12 18:20:37 EST 2018
On Mon, Nov 12, 2018 at 11:36 AM Ben Rubinstein via use-livecode <
use-livecode at lists.runrev.com> wrote:
>
> I'm really confused that case-insensitive should work at all for UTF-16 or
> UTF-32; at this point, as far as I understand it, LC has no idea how to
> correctly interpret the value of the variable as text.
>
> Mr Very Picky would also suggest that to be really correct, the code in this
> case should also check that the offset found was on a four-byte boundary
> (tPos mod 4 = 1); it's probably a purely theoretical consideration, but I
> think that the four-byte sequence (representing the character you're
> searching for) could in theory be incorrectly matched across two other
> characters.
>
I also thought of the four-byte boundary consideration. The code below is
available at: https://github.com/gcanyon/alloffsets
For example, previous UTF-32 versions fail on characters like 𐀁 (U+10001),
which encodes to the bytes 00 01 00 01. Because that pattern repeats every
two bytes, finding 𐀁 in 𐀁𐀁𐀁 matches at byte offsets 1, 3, 5, 7, and 9, and
so returns 1,1,2,2,3 instead of 1,2,3. I don't know how many other
characters are affected; given the current Unicode character set there are
probably a few, but likely not many. The failure searching for "Reykjavík"
in "Reykjavík er höfuðborg" is weirder and worse, obviously.
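To make the boundary problem concrete, here is a rough sketch in Python rather than LiveCode, purely so the byte arithmetic is easy to check. The function name is made up for illustration, and it assumes big-endian UTF-32; it models a plain byte search with no boundary check, not what the LC engine does internally.

```python
def naive_utf32_offsets(find: str, text: str) -> list[int]:
    """Byte-level search over UTF-32 data with no four-byte boundary check."""
    needle = find.encode("utf-32-be")    # assumption: big-endian, no BOM
    haystack = text.encode("utf-32-be")
    hits = []
    pos = haystack.find(needle)
    while pos != -1:
        # pos is a 0-based byte index; report it as a 1-based character number
        hits.append(pos // 4 + 1)
        pos = haystack.find(needle, pos + 1)
    return hits


# U+10001 encodes as 00 01 00 01, so it also matches two bytes into itself
print(naive_utf32_offsets("\U00010001", "\U00010001" * 3))  # [1, 1, 2, 2, 3]
```

The duplicate entries come from the matches that start halfway through a character, which is exactly what a "mod 4" test is meant to reject.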
I was puzzled at first by the case-insensitive functionality in
UTF-32-encoded strings, but I realized that standard case-insensitive
searches are presumably just implemented as a set of exceptions at a low
level. For example, the engine isn't looking at "a" and "A" and
saying, "those are the same." Instead, it's looking at raw ASCII and
mapping 97 to 65 if case-insensitive is requested. The same must be true of
UTF-32: the engine isn't looking at "Ѡ" and "ѡ", it's mapping 00000460
to 00000461. I agree that it seems a little odd that LC knows the string of
binary data is "text", but maybe there's some trick to that?
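Here is a small Python sketch of that guess, just to make the numbers concrete. It is not LiveCode and says nothing about how the engine really folds case: the two-entry table and the function name are hypothetical, and big-endian UTF-32 is assumed.

```python
# Hypothetical fold table holding only the two pairs mentioned above
FOLD = {97: 65, 0x00000460: 0x00000461}


def fold_utf32(data: bytes) -> bytes:
    """Map each four-byte unit through FOLD without treating the data as text."""
    out = bytearray()
    for i in range(0, len(data), 4):
        unit = int.from_bytes(data[i:i + 4], "big")
        out += FOLD.get(unit, unit).to_bytes(4, "big")
    return bytes(out)


upper = "\u0460".encode("utf-32-be")  # Ѡ
lower = "\u0461".encode("utf-32-be")  # ѡ
print(upper.hex(), lower.hex())       # 00000460 00000461
print(fold_utf32(upper) == fold_utf32(lower))  # True
```

The point is only that a number-to-number mapping is enough to make two different byte sequences compare equal; nothing in it needs to know that the bytes are "text".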
Anyway, here's the boundary-respecting, but still-flawed version of
UTF-32-based allOffsets, with a documented bad example in a comment:
function allOffsetsUTF32 pFind,pString,pCaseSensitive,pNoOverlaps
   -- returns a comma-delimited list of the offsets of pFind in pString
   -- note, this seems to fail on some searches, for example:
   -- searching for "Reykjavík" in "Reykjavík er höfuðborg"
   -- It is uncertain why.
   -- See thread here:
   -- http://lists.runrev.com/pipermail/use-livecode/2018-November/251357.html
   local tNewPos, tPos, tResult, tSkip
   -- work on the raw UTF-32 bytes: four bytes per character
   put textEncode(pFind,"UTF-32") into pFind
   put textEncode(pString,"UTF-32") into pString
   if pNoOverlaps then put length(pFind) - 1 into tSkip
   set the caseSensitive to pCaseSensitive is true
   put 0 into tPos
   repeat forever
      -- offset() returns a position relative to the tPos bytes skipped
      put offset(pFind, pString, tPos) into tNewPos
      if tNewPos = 0 then exit repeat
      add tNewPos to tPos
      -- keep only matches that start on a four-byte character boundary
      if tPos mod 4 = 1 then
         -- the comma operator joins its operands with ",", leaving a trailing comma
         put (tPos div 4 + 1),"" after tResult
         -- skip the rest of an accepted match only, so that a misaligned hit
         -- cannot hide a real match that starts inside it
         if pNoOverlaps then add tSkip to tPos
      end if
   end repeat
   if tResult is empty then return 0
   else return char 1 to -2 of tResult
end allOffsetsUTF32
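For cross-checking expected results, here is a minimal Python model of the same boundary-respecting search. It is a sketch under assumptions, not a port: the name and the no_overlaps parameter are mine, it assumes big-endian UTF-32, it is case-sensitive only, and it skips ahead only after an accepted (aligned) match.

```python
def all_offsets_utf32(find: str, text: str, no_overlaps: bool = False) -> list[int]:
    """1-based character offsets of find in text, searching UTF-32 bytes."""
    needle = find.encode("utf-32-be")    # assumption: big-endian, no BOM
    haystack = text.encode("utf-32-be")
    hits = []
    pos = haystack.find(needle)
    while pos != -1:
        step = 1
        if pos % 4 == 0:                 # match starts on a character boundary
            hits.append(pos // 4 + 1)
            if no_overlaps:
                step = len(needle)       # jump past the accepted match
        pos = haystack.find(needle, pos + step)
    return hits


print(all_offsets_utf32("\U00010001", "\U00010001" * 3))  # [1, 2, 3]
print(all_offsets_utf32("aa", "aaaa"))                    # [1, 2, 3]
print(all_offsets_utf32("aa", "aaaa", no_overlaps=True))  # [1, 3]
```

This model only checks the boundary and overlap logic. Being a byte-exact, case-sensitive search, it sheds no light on the "Reykjavík" failure.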