Text encoding.

David V Glasgow dvglasgow at gmail.com
Fri Sep 3 06:07:56 EDT 2021


Following this with interest, but also a little confusion.  I completely fell into the trap of assuming you encode outgoing and decode incoming.

Alex states that adding put textEncode(tWHoleText, "UTF8") into tWholeText before the replace speeds it up, but David B says LC's internal format is UTF16.  Doesn’t the 8 vs 16 difference matter?  Or does it matter less than with other encodings?
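
A minimal sketch of the two directions, assuming the file really is UTF-8: textDecode turns bytes into text, and textEncode turns text back into bytes. The path and the variable names tPath, tBytes and tOutBytes are made up for illustration, and this says nothing about which form makes replace faster.

```livecode
on mouseUp
   local tPath, tBytes, tWholeText, tOutBytes
   -- hypothetical path, for illustration only
   put "/path/to/book.txt" into tPath
   -- incoming: read the raw bytes, then decode them into text
   -- (assumes the file is UTF-8)
   put URL ("binfile:" & tPath) into tBytes
   put textDecode(tBytes, "UTF-8") into tWholeText
   -- work on the decoded text
   replace "." with space in tWholeText
   -- outgoing: encode the text into UTF-8 bytes before writing
   put textEncode(tWholeText, "UTF-8") into tOutBytes
   put tOutBytes into URL ("binfile:" & tPath & ".out")
end mouseUp
```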

Cheers

David Glasgow


> On 2 Sep 2021, at 1:01 pm, David Bovill via use-livecode <use-livecode at lists.runrev.com> wrote:
> 
> Thanks for the question Alex, I’m wrestling with the same issues - but so far got no responses from encoding gurus here :)
> 
> This is my understanding:
> 
> 1) Yes, it’s recommended to textEncode text that comes from outside into Livecode’s internal native format (which is utf16).  Livecode handles everything internally “transparently” from then on - which I guess means all the usual language and control operations expect this utf16 internal format. My guess is this is why a few things have got slower compared with early versions of Livecode.
> 2) Without doing textEncode the engine tries to guess the encoding (duck-typing?) and does this in a platform-specific way? Again, exactly what is going on there is a bit opaque to me, but the take-home message is that this is slower and less robust. So yes - you lose nothing (assuming the original file is utf8), and yes, it’s the best alternative.
> 
> I think the hard thing to find out is exactly what encoding some files use - it would be great if there was a duck-typing service where we could paste text or upload files and it would say - hey, this looks like utf8 - but that’s asking too much.
> 
> On 2 Sep 2021, 12:12 +0100, Alex Tweedly via use-livecode <use-livecode at lists.runrev.com>, wrote:
>> Sorry to drag us off the interesting topic of licensing :-) into some
>> Livecode question.
>> 
>> I know little or nothing about Unicode, text encodings, etc. - so my
>> question is indeed naive.
>> 
>> I have a text file (War & Peace from Project Gutenberg), about 3.4Mb.
>> The Mac describes it simply as "Plain text".
>> 
>> When I read that into a variable, and then do
>>     replace tChar by SPACE in tWholeText
>> it takes between 1000 and 4000 millisecs - versus the 8-10 msecs I had
>> expected from other samples.
>> 
>> If I put in
>>     put textEncode(tWHoleText, "UTF8") into tWholeText
>> before the replace then it does indeed take 8-10 msecs.
>> 
>> Q1. What (if anything) am I losing by doing that ?
>> 
>> Q2. Is this the best alternative ?
>> 
>> Additional info - I just discovered that, according to the 'more' command
>> line, the file starts with:
>> 
>> <U+FEFF>The Project ....
>> 
>> if that is useful.
>> 
>> Many thanks,
>> 
>> Alex.
>> 
>> 
>> _______________________________________________
>> use-livecode mailing list
>> use-livecode at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode