UTF8 on LC server
Tim Selander
selander at tkf.att.ne.jp
Fri Jun 1 06:53:33 EDT 2018
Hi Mark,
Here is the script. The files I'm using are
bamboobabies.com/getjapanesetext.lc, and the text it is getting
is bamboobabies.com/news.txt.
In the script, there are two lines reading the text file that
I've taken turns commenting out....
If you can give me any hints, it would be greatly appreciated.
Tim Selander
<?lc put header "Content-Type: text/html; charset=UTF-8" ?>
<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=UTF8">
<title>workbench</title>
</head>
<body>
<?lc
--This line loads readable japanese text, but putting char 500 to 550 breaks beginning and ending kanji
put url "http://bamboobabies.com/news.txt" into vText
--When this line is used, none of the put text is readable
--put textDecode(url "binfile:bamboobabies.com/news.txt", "utf-8") into vText
put line 1 of vText
put "<BR><BR><BR><BR>"
put char 500 to 550 of vText
?>
</body>
</html>
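
For anyone reading this in the archive, here is a rough sketch of why slicing undecoded UTF-8 can break the first and last kanji. It is written in Python rather than LiveCode, purely to illustrate the encoding behaviour; the sample string is a made-up stand-in for news.txt, and the slice positions are arbitrary.

```python
# Hypothetical sample text standing in for the contents of news.txt.
text = "日本語のニュースです"
raw = text.encode("utf-8")  # every character in this sample is 3 bytes in UTF-8

# Treating each byte as a "character": this slice starts and ends
# in the middle of a 3-byte sequence, so the edges cannot be decoded.
byte_slice = raw[1:8]
broken = byte_slice.decode("utf-8", errors="replace")

# Decoding first and slicing afterwards keeps whole characters.
decoded = raw.decode("utf-8")
clean = decoded[1:3]

print(len(raw), len(decoded))  # byte count vs. character count
print(broken)                  # middle character survives, edges are replaced
print(clean)
```

The same idea applies on the server: decode the bytes as UTF-8 before asking for a range of characters.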
On 2018.06.01 16:17, Mark Waddingham via use-livecode wrote:
> You should be fine using 'character' on any Unicode text - it
> uses the Unicode grapheme breaking rules to find the boundaries
> ('grapheme' being the specific name for a 'character' as humans
> think of one).
>
> That being said, I think codepoint (from memory) should also be
> okay on Japanese text as I don't think the Japanese/Chinese
> scripts have any multi-codepoint characters - they just use
> codepoints with value > 65535 for less used ideographs (the
> 'supplementary plane'). [ Korean Hangul can be encoded as
> separate jamo, which *does* require the use of character, as a
> single Hangul syllable can be composed of up to three codepoints ].
>
> The fact it is breaking on Japanese text in the way you suggest
> makes me think you aren't textDecode()'ing your UTF-8 input files:
>
> e.g.
> put textDecode(url ("binfile:<pathtofile>"), "utf-8") into tText
>
> Without decoding as utf-8, the engine will think your file is
> 'native' (single-byte encoded), so each byte of the file will be
> seen as a separate character.
>
> Internally the engine uses either single-byte or double-byte
> encodings for strings (the latter being UTF-16) - which is not
> user-visible, you just need to make sure that incoming data is
> decoded correctly.
>
> Can you share the code you are using to read in the text data and
> code which is breaking on server?
>
> Warmest Regards,
>
> Mark.
>
> P.S. 'word' in LC is still any sequence of non-space characters
> separated by spaces, or any sequence of characters delimited by
> quotes - it takes no account of the script of the text, nor
> actual word-boundaries. If you want human-style word boundaries
> then you should use trueWord (which uses the standard Unicode
> word breaking rules).
>
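
A small sketch of the codepoint points made above, again in Python only as an illustration and not as a description of LiveCode's internals. It shows a Hangul syllable decomposing into three codepoints and a supplementary-plane ideograph needing two UTF-16 code units. The two sample characters are my own choices, not taken from the thread.

```python
import unicodedata

# One precomposed Hangul syllable (U+D55C), chosen as an example.
syllable = "한"
# Canonical decomposition (NFD) splits it into its conjoining jamo.
jamo = unicodedata.normalize("NFD", syllable)
codepoints = [f"U+{ord(c):04X}" for c in jamo]

print(len(syllable), len(jamo))  # one codepoint composed, three decomposed
print(codepoints)

# An ideograph outside the Basic Multilingual Plane (U+20BB7),
# also chosen as an example.
rare = "\U00020BB7"
utf16_units = len(rare.encode("utf-16-le")) // 2

print(ord(rare) > 65535)  # True: it lives in a supplementary plane
print(utf16_units)        # number of 16-bit units needed in UTF-16
```

A reader sees one character in both the composed and the decomposed form of the syllable, which is the case where counting by codepoint and counting by character give different answers.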