UTF8 on LC server
selander at tkf.att.ne.jp
Fri Jun 1 06:53:33 EDT 2018
Here is the script. The files I'm using are
bamboobabies.com/getjapanesetext.lc, and the text it is getting
In the script, there are two lines reading the text file that
I've taken turns commenting out....
If you can give me any hints, it would be greatly appreciated.
<?lc put header "Content-Type: text/html; charset=UTF-8" ?>
<meta http-equiv="Content-type" content="text/html;
--This line loads readable Japanese text, but putting char 500 to 550 breaks the beginning and ending kanji
put url "http://bamboobabies.com/news.txt" into vText
--When this line is used, none of the put text is readable
--put textDecode(url "binfile:bamboobabies.com/news.txt", "utf-8") into vText
put line 1 of vText
put char 500 to 550 of vText
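
For anyone who wants to see the symptom outside LiveCode, here is a small Python sketch (not LiveCode, and not the server's actual behaviour) of what the first comment in the script describes. The sample string is made up and merely stands in for news.txt, and Latin-1 is only an assumed stand-in for a single-byte 'native' encoding. It shows that a range taken from undecoded UTF-8 counts bytes and so cuts through the kanji at both ends, while the same kind of range taken after decoding keeps whole characters.

```python
# Hypothetical sample standing in for the contents of news.txt.
sample = "日本語のニュース"
raw = sample.encode("utf-8")  # the bytes as stored in a UTF-8 file

# Undecoded: every byte is treated as its own single-byte character.
# Latin-1 is an assumed stand-in for a 'native' encoding.
as_native = raw.decode("latin-1")

# Taking a range of those units and then viewing it as UTF-8 cuts
# through the kanji at both ends of the range.
broken = as_native[1:8].encode("latin-1").decode("utf-8", errors="replace")

# Decoded first: ranges count whole characters.
decoded = raw.decode("utf-8")
clean = decoded[1:4]

print(len(as_native), len(decoded))   # unit counts before and after decoding
print("\ufffd" in broken)             # replacement characters mark the cut kanji
print("\ufffd" in clean)
```

The undecoded string is three times as long as the decoded one because each of these characters takes three bytes in UTF-8, which is why a fixed range such as 500 to 550 rarely lands on character boundaries.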
On 2018.06.01 16:17, Mark Waddingham via use-livecode wrote:
> You should be fine using 'character' on any Unicode text - it
> uses the Unicode grapheme (the specific name for 'characters' as
> humans think of them) breaking rules to find the boundaries.
> That being said, I think codepoint (from memory) should also be
> okay on Japanese text as I don't think the Japanese/Chinese
> scripts have any multi-codepoint characters - they just use
> codepoints with value > 65535 for less used ideographs (the
> 'supplementary plane'). [ Korean script can be encoded with
> Hangul, which *does* require the use of character as a single
> Korean Hangul ideograph can be composed of up to three codepoints ].
> The fact it is breaking on Japanese text in the way you suggest
> makes me think you aren't textDecode()'ing your UTF-8 input files:
> put textDecode(url ("binfile:<pathtofile>"), "utf-8") into tText
> Without decoding as utf-8, the engine will think your file is
> 'native' (single-byte encoded), so each byte of the file will be
> seen as a separate character.
> Internally the engine uses either single-byte or double-byte
> encodings for strings (the latter being UTF-16) - which is not
> user-visible, you just need to make sure that incoming data is
> decoded correctly.
> Can you share the code you are using to read in the text data and
> code which is breaking on server?
> Warmest Regards,
> P.S. 'word' in LC is still any sequence of non-space characters
> separated by spaces, or any sequence of characters delimited by
> quotes - it takes no account of the script of the text, nor
> actual word-boundaries. If you want human-style word boundaries
> then you should use trueWord (which uses the standard Unicode
> word breaking rules).
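
As a side note on the codepoint remarks quoted above, the following Python sketch (again not LiveCode) illustrates the two cases mentioned: an ideograph from the supplementary plane is one codepoint but needs two UTF-16 code units, and a single Hangul syllable can be made up of three codepoints once it is decomposed. The helper name utf16_units and the two sample characters are chosen only for illustration.

```python
import unicodedata


def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


rare_kanji = "\U00020BB7"   # an ideograph outside the Basic Multilingual Plane
composed = "\uD55C"         # one precomposed Hangul syllable
decomposed = unicodedata.normalize("NFD", composed)  # split into its jamo

print(len(rare_kanji), utf16_units(rare_kanji))   # 1 2
print(len(composed), len(decomposed))             # 1 3
```

Python's len() counts codepoints, so this is roughly the 'codepoint' view of the text. Counting what a reader would call one character in the decomposed case needs grapheme breaking rules, which the Python standard library does not provide.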