UTF8 on LC server

Tim Selander selander at tkf.att.ne.jp
Fri Jun 1 06:53:33 EDT 2018


Hi Mark,

Here is the script. It lives at 
bamboobabies.com/getjapanesetext.lc, and the text it reads 
is at bamboobabies.com/news.txt.

The script contains two alternative lines for reading the text 
file; I've tried each in turn with the other commented out.

If you can give me any hints, it would be greatly appreciated.

Tim Selander


<?lc put header "Content-Type: text/html; charset=UTF-8" ?>
<!DOCTYPE HTML>
<html>
     <head>
         <meta http-equiv="Content-type" content="text/html; charset=UTF8">
         <title>workbench</title>
     </head>
<body>

<?lc
--This line loads readable Japanese text, but putting char 500 to 550 breaks beginning and ending kanji
put url "http://bamboobabies.com/news.txt" into vText

--When this line is used, none of the put text is readable
--put textDecode(url "binfile:bamboobabies.com/news.txt", "utf-8") into vText

put line 1 of vText

put "<BR><BR><BR><BR>"

put char 500 to 550 of vText
  ?>
</body>
</html>
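
The symptom noted in the script comments (a character range whose first and last kanji come out broken) is what one would expect if the range were being counted in bytes rather than decoded characters. The following is a minimal sketch of that effect in Python, not LiveCode; the sample string and the slice positions are made up for illustration and are not taken from news.txt.

```python
# Illustration only (Python, not LiveCode). The sample text is hypothetical.
text = "日本語のニュースです"      # 10 characters
raw = text.encode("utf-8")        # each of these characters takes 3 bytes

# Slicing the undecoded bytes can begin and end in the middle of a character,
# so the first and last characters of the slice are damaged.
byte_slice = raw[1:8]
broken = byte_slice.decode("utf-8", errors="replace")

# Decoding first and slicing afterwards always cuts on character boundaries.
clean = raw.decode("utf-8")[1:3]

print(len(text), len(raw))
print(broken)
print(clean)
```

Only the characters that lie wholly inside the byte range survive; the partial ones at either end are replaced with U+FFFD when the slice is decoded.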




On 2018.06.01 16:17, Mark Waddingham via use-livecode wrote:

> You should be fine using 'character' on any Unicode text - it
> uses the Unicode grapheme breaking rules (a grapheme being a
> 'character' as humans think of one) to find the
> boundaries.
>
> That being said, I think codepoint (from memory) should also be
> okay on Japanese text as I don't think the Japanese/Chinese
> scripts have any multi-codepoint characters - they just use
> codepoints with value > 65535 for less used ideographs (the
> 'supplementary plane'). [ Korean script can be encoded with
> Hangul jamo, which *does* require the use of character, as a single
> Korean syllable block can be composed of up to three codepoints ].
>
> The fact it is breaking on Japanese text in the way you suggest
> makes me think you aren't textDecode()'ing your UTF-8 input files:
>
> e.g.
>     put textDecode(url ("binfile:<pathtofile>"), "utf-8") into tText
>
> Without decoding as UTF-8, the engine will think your file is
> 'native' (single-byte encoded), so each byte of the file will be
> seen as a separate character.
>
> Internally the engine uses either single-byte or double-byte
> encodings for strings (the latter being UTF-16) - which is not
> user-visible, you just need to make sure that incoming data is
> decoded correctly.
>
> Can you share the code you are using to read in the text data and
> the code which is breaking on the server?
>
> Warmest Regards,
>
> Mark.
>
> P.S. 'word' in LC is still any sequence of non-space characters
> separated by spaces, or any sequence of characters delimited by
> quotes - it takes no account of the script of the text, nor
> actual word-boundaries. If you want human-style word boundaries
> then you should use trueWord (which uses the standard Unicode
> word breaking rules).
>
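
The difference between codepoints and characters described in the quoted reply can be seen with two small cases. This is a minimal sketch in Python rather than LiveCode, using the standard unicodedata module; the two sample characters are chosen here for illustration and do not come from the thread.

```python
import unicodedata

# Illustration only (Python, not LiveCode). Sample characters are hypothetical.
kanji = "\U00020BB7"    # a supplementary-plane ideograph, value above 65535
syllable = "\uD55C"     # a Korean syllable block as one precomposed codepoint

# The ideograph is a single codepoint but needs two UTF-16 code units.
utf16_units = len(kanji.encode("utf-16-le")) // 2

# Canonical decomposition (NFD) spells the same syllable with separate jamo.
jamo = unicodedata.normalize("NFD", syllable)

print(len(kanji), utf16_units)
print(len(syllable), len(jamo))
```

Both forms of the syllable display as one character to a reader, which is why counting by grapheme rather than by codepoint is the safer choice when the text may contain decomposed sequences.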



