Text encoding.

Mark Waddingham mark at livecode.com
Thu Sep 2 13:34:22 EDT 2021


On 2021-09-02 12:12, Alex Tweedly via use-livecode wrote:
> Sorry to drag us off the interesting topic of licensing :-) into a
> LiveCode question.
> 
> I know little or nothing about Unicode, text encodings, etc. - so my
> question is indeed naive.
> 
> I have a text file (War & Peace from Project Gutenberg), about 3.4Mb.
> The Mac describes it simply as "Plain text".

Do you have a link to the file handy?

> 
> When I read that into a variable, and then do
>     replace tChar by SPACE in tWholeText
> it takes between 1000 and 4000 millisecs - versus the 8-10 msecs I had
> expected from other samples.
> 
> If I put in
>     put textEncode(tWholeText, "UTF8") into tWholeText
> before the replace then it does indeed take 8-10 msecs.

What exact code are you using in both cases? (Including how you read in
the file, the char you are replacing, etc.)

> Additional info - I just discovered that according to 'more' command
> line, the file start with :
> 
> <U+FEFF>The Project ....

That suggests the file is Unicode encoded - it is a 'byte order mark'
(BOM).

The character itself is U+FEFF, whose byte-swapped form U+FFFE is a
guaranteed non-character, and it has a different sequence of bytes in
each of the main (UTF-8/16LE,BE/32LE,BE) encodings. If you do
`hexdump -C <file> | less` on the file, then if it is UTF-8 there will
be three bytes (EF BB BF) before the T, two if it is UTF-16 (FF FE or
FE FF), or four if it is UTF-32.
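
For anyone who would rather check this in code than by eye, here is a
minimal sketch of the same test. It is in Python rather than LiveCode,
purely as an illustration of the byte patterns; `sniff_bom` and
`read_text` are hypothetical names, and falling back to UTF-8 when no
BOM is present is an assumption, not something the file itself tells
you.

```python
import codecs

# Check the 4-byte UTF-32 marks first: the UTF-32LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def read_text(path):
    """Read a file as bytes, strip any BOM, and decode the rest."""
    with open(path, "rb") as f:
        data = f.read()
    encoding, skip = sniff_bom(data)
    # Assumption: a file with no BOM is treated as UTF-8.
    return data[skip:].decode(encoding or "utf-8")
```

Calling `sniff_bom` on the first few bytes of the file reports which
of the encodings above the BOM matches and how many bytes it takes up
before the T.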

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



