Text encoding.

Alex Tweedly alex at tweedly.net
Thu Sep 2 14:23:13 EDT 2021


On 02/09/2021 18:34, Mark Waddingham via use-livecode wrote:
> On 2021-09-02 12:12, Alex Tweedly via use-livecode wrote:
>> Sorry to drag us off the interesting topic of licensing :-) and onto a
>> LiveCode question.
>>
>> I know little or nothing about Unicode, text encodings, etc. - so my
>> question is indeed naive.
>>
>> I have a text file (War & Peace from Project Gutenberg), about 3.4Mb.
>> The Mac describes it simply as "Plain text".
>
> Do you have a link to the file handy?

https://www.gutenberg.org/ebooks/2600 and then choose "Plain Text UTF-8"

or directly to https://www.gutenberg.org/files/2600/2600-0.txt

(and then I saved that page to file).

>
>>
>> When I read that into a variable, and then do
>>     replace tChar by SPACE in tWholeText
>> it takes between 1000 and 4000 millisecs - versus the 8-10 msecs I had
>> expected from other samples.
>>
>> If I put in
>>     put textEncode(tWholeText, "UTF8") into tWholeText
>> before the replace then it does indeed take 8-10 msecs.
>
> What exact code are you using in both cases? (including reading in the 
> file, char you are replacing etc.)

    put URL ("file:" & specialFolderPath("home") & "/warpeace.txt") into tText

and then

>    put quote&"!?,.:;[]{}()£$¢%^&≤≥÷<>=+-…“‘”¡™#∞§¶*•ªº\/" into tList
>    -- put textencode(pStr, "UTF8") into pStr
>    repeat for each char tChar in tList
>       --      put the millisecs into t1
>       replace tChar with space in pStr
>       --      put the millisecs - t1 && tChar &CR after msg
>       wait 0 millisecs with messages
>       --      if the shiftkey is down then exit repeat
>    end repeat
Obviously, comment those lines in or out as needed.
(NB yes the times I gave are for *each* char replace, not for the whole 
loop)

> The character itself is the 'undefined/illegal codepoint' which has a 
> different sequence of bytes for each of the main 
> (UTF-8/16LE,BE/32LE,BE) encodings. If you do `hexdump -c | less` on 
> the file, then if it is UTF-8 there will be three bytes before the T, 
> two if it is UTF-16, or four if it is UTF-32.
>
Three bytes, confirming the UTF-8 identification on the original webpage.
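
For anyone who wants to repeat that check without hexdump, here is a minimal sketch in Python (not LiveCode) that looks at the first few bytes of a file and reports which byte-order mark, if any, it starts with. The function names detect_bom and detect_file_bom are made up for illustration, and the file name in the usage line is just the one from my earlier snippet. A UTF-8 mark is the three bytes EF BB BF.

```python
import codecs

# Known byte-order marks. The UTF-32 entries come first because the
# UTF-32LE mark begins with the same two bytes as the UTF-16LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def detect_bom(data):
    """Return the encoding implied by a leading byte-order mark, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def detect_file_bom(path):
    """Read only the first four bytes of the file and check them for a mark."""
    with open(path, "rb") as f:
        return detect_bom(f.read(4))
```

Calling detect_file_bom("warpeace.txt") on a file saved with a UTF-8 mark should return "UTF-8". A result of None only means there is no mark at the start; it does not by itself say what the encoding is.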

Thanks,

Alex.




