UTF8 on LC server
Mark Waddingham
mark at livecode.com
Fri Jun 1 03:17:41 EDT 2018
On 2018-06-01 02:14, Tim Selander via use-livecode wrote:
> Hi Kee and Alex,
>
> The original documents I'm working with are UTF8, so that's what I've
> been using. So converting them to UTF16 is recommended? I'll try that.
>
> Alex, desktop is version 8 something, and the server is the one
> installed on the on-rev host; can't remember what the key in $_Server
> for that info is, and Googling failed me this time...
You should be fine using 'character' on any Unicode text - it uses the
Unicode grapheme breaking rules to find the boundaries ('grapheme' being
the specific name for a 'character' as humans think of one).
That being said, I think codepoint (from memory) should also be okay on
most Japanese text, as Japanese/Chinese text in its usual precomposed
form rarely contains multi-codepoint characters - less-used ideographs
just use codepoints with value > 65535 (the 'supplementary planes'). The
exceptions are decomposed kana (a base kana followed by a combining
voiced sound mark) and ideographic variation sequences, which do need
'character'. [ Korean is different: Hangul can be encoded with
conjoining jamo, which *does* require the use of character, as a single
Hangul syllable can be composed of up to three codepoints ].
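As a rough sketch of the Hangul case (assuming LiveCode 7 or later, where
numToCodepoint() and the codepoint chunk are available; the three values
are the conjoining jamo U+1112, U+1161 and U+11AB written in decimal, and
are picked purely for illustration):

```livecode
local tSyllable, tCodepointCount, tCharCount
-- Build one Hangul syllable from three conjoining jamo codepoints
-- (illustrative values: U+1112, U+1161, U+11AB in decimal)
put numToCodepoint(4370) & numToCodepoint(4449) & numToCodepoint(4523) into tSyllable
-- 'codepoint' counts each jamo separately
put the number of codepoints in tSyllable into tCodepointCount
-- 'character' applies the grapheme breaking rules to the same string
put the number of characters in tSyllable into tCharCount
put "codepoints:" && tCodepointCount & return
put "characters:" && tCharCount & return
```

If the two counts differ for your own text, that is a sign it contains
multi-codepoint characters and 'character' is the chunk to use.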
The fact it is breaking on Japanese text in the way you suggest makes me
think you aren't textDecode()'ing your UTF-8 input files:
e.g.
   put textDecode(url ("binfile:<pathtofile>"), "utf-8") into tText
Without decoding as UTF-8, the engine will think your file is 'native'
(single-byte encoded), so each byte of the file will be seen as a
separate character.
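A quick way to check whether this is what is happening might look like
the sketch below (tPath, tRaw and tText are just illustrative variable
names, and "<pathtofile>" is a placeholder for the real path):

```livecode
local tPath, tRaw, tText
put "<pathtofile>" into tPath -- placeholder: substitute the real path
-- Read the file as raw bytes; no text encoding is assumed at this point
put url ("binfile:" & tPath) into tRaw
-- Decode the bytes as UTF-8 to get a proper Unicode string
put textDecode(tRaw, "utf-8") into tText
-- Undecoded: every byte is treated as one native character
put "undecoded chars:" && (the number of characters in tRaw) & return
-- Decoded: characters are found using the grapheme breaking rules
put "decoded chars:" && (the number of characters in tText) & return
```

For Japanese text the undecoded count should come out noticeably larger,
since most kana and kanji take three bytes each in UTF-8.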
Internally the engine uses either single-byte or double-byte encodings
for strings (the latter being UTF-16), but that is not user-visible - you
just need to make sure that incoming data is decoded correctly.
Can you share the code you are using to read in the text data and the
code which is breaking on the server?
Warmest Regards,
Mark.
P.S. 'word' in LC is still any sequence of non-space characters
separated by spaces, or any sequence of characters delimited by quotes -
it takes no account of the script of the text, nor actual
word-boundaries. If you want human-style word boundaries then you should
use trueWord (which uses the standard Unicode word breaking rules).
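For example, a minimal sketch comparing the two chunk types (assuming
LiveCode 7 or later for trueWord; "<pathtofile>" is a placeholder and
tText / tWord are illustrative names):

```livecode
local tText
put textDecode(url ("binfile:<pathtofile>"), "utf-8") into tText
-- 'word' only splits on spaces (and quotes), whatever the script of the text
put "words:" && (the number of words in tText) & return
-- 'trueWord' applies the Unicode word breaking rules
put "trueWords:" && (the number of trueWords in tText) & return
-- Walk the text one trueWord at a time
repeat for each trueWord tWord in tText
   put tWord & return
end repeat
```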
--
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps