UTF8 on LC server

Mark Waddingham mark at livecode.com
Fri Jun 1 03:17:41 EDT 2018


On 2018-06-01 02:14, Tim Selander via use-livecode wrote:
> Hi Kee and Alex,
> 
> The original documents I'm working with are UTF8, so that's what I've
> been using. So converting them to UTF16 is recommended? I'll try that.
> 
> Alex, desktop is version 8 something, and the server is the one
> installed on the on-rev host; can't remember what the key in $_Server
> for that info is, and Googling failed me this time...

You should be fine using 'character' on any Unicode text - it uses the 
Unicode grapheme breaking rules to find the boundaries ('grapheme' being 
the specific name for a 'character' as humans think of one).

That being said, I think codepoint (from memory) should also be okay on 
Japanese text, as I don't think the Japanese/Chinese scripts normally 
need multi-codepoint characters (at least when the text is in precomposed 
form) - they just use codepoints with value > 65535 for less used 
ideographs (the 'supplementary planes'). [ Korean script can be encoded 
with conjoining Hangul jamo, which *does* require the use of character, 
as a single Hangul syllable block can be composed of up to three 
codepoints ].
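
As a rough sketch of the codepoint/character difference, here is one Hangul syllable built from three conjoining jamo. The codepoint values and the variable names are my own illustration, not taken from your data:

```livecode
-- U+1112 (4370), U+1161 (4449), U+11AB (4523): three conjoining jamo
-- that together display as a single syllable block
put numToCodepoint(4370) & numToCodepoint(4449) & numToCodepoint(4523) into tSyllable

-- 'codepoint' counts each jamo separately
put the number of codepoints in tSyllable into tCodepointCount

-- 'character' follows grapheme boundaries, so the block counts once
put the number of characters in tSyllable into tCharacterCount
```

I would expect tCodepointCount to come out as 3 and tCharacterCount as 1, which is why 'character' is the safer chunk for text like this.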

The fact it is breaking on Japanese text in the way you suggest makes me 
think you aren't textDecode()'ing your UTF-8 input files:

e.g.
    put textDecode(url ("binfile:<pathtofile>"), "utf-8") into tText

Without decoding as utf-8, the engine will think your file is 'native' 
(single-byte encoded), so each byte of the file will be seen as a 
separate character.
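
A quick way to check whether this is what is happening is to compare the counts before and after decoding. This is only a sketch: tPath is a placeholder for the path to your file, and the other variable names are made up for the example:

```livecode
-- tPath is a placeholder for the path to your UTF-8 file
put url ("binfile:" & tPath) into tRaw

-- Undecoded: the engine sees one 'character' per byte
put the number of characters in tRaw into tRawCount

-- Decoded: chunk expressions now follow Unicode boundaries
put textDecode(tRaw, "utf-8") into tText
put the number of characters in tText into tDecodedCount
```

Most Japanese characters take three bytes in UTF-8, so if tRawCount is roughly three times tDecodedCount, the undecoded data is very likely what your script is chunking.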

Internally the engine uses either single-byte or double-byte encodings 
for strings (the latter being UTF-16). This is not user-visible; you 
just need to make sure that incoming data is decoded correctly.

Can you share the code you are using to read in the text data, and the 
code which is breaking on the server?

Warmest Regards,

Mark.

P.S. 'word' in LC is still any sequence of non-space characters 
separated by spaces, or any sequence of characters delimited by quotes - 
it takes no account of the script of the text, nor actual 
word-boundaries. If you want human-style word boundaries then you should 
use trueWord (which uses the standard Unicode word breaking rules).
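
A small sketch of the two side by side, assuming tText already holds text that has been through textDecode() (the variable names are only for illustration):

```livecode
-- tText is assumed to hold text already decoded with textDecode()

-- 'word' only splits on spaces (and quotes)
put the number of words in tText into tSpaceWordCount

-- 'trueWord' uses the Unicode word breaking rules
put the number of trueWords in tText into tTrueWordCount

-- Collect the trueWords one per line to see where the breaks fall
put empty into tWordList
repeat for each trueWord tWord in tText
   put tWord & return after tWordList
end repeat
```

On Japanese text with no spaces, I would expect 'word' to report very few words per line while trueWord reports considerably more.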

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



