Decode UTF-8 in variable ?

Dar Scott dsc at swcp.com
Thu Jun 20 11:58:15 EDT 2013


You might be able to work with it as it is.

In UTF-8 the ASCII subset looks just like ... ASCII.  Every character that is not ASCII is represented by a sequence of two to four bytes, each with the high bit set, that is, the byte value of each byte in the sequence is over 127.  All ASCII characters are represented by a single byte with the high bit zero, that is, the byte value is less than 128.
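To make the byte layout concrete, here is a small sketch in Python (not LiveCode; I am only using it to show the byte values).  The function name utf8_byte_values is made up for this example.

```python
def utf8_byte_values(text):
    """Return the UTF-8 encoding of text as a list of byte values (0..255)."""
    return list(text.encode("utf-8"))

# An ASCII character is one byte below 128.
print(utf8_byte_values("A"))        # [65]
# A non-ASCII character is several bytes, every one of them above 127.
print(utf8_byte_values("\u00e9"))   # e-acute: [195, 169]
print(utf8_byte_values("\u20ac"))   # euro sign: [226, 130, 172]
```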

So, if all the characters of the UTF-8 string are in the ASCII subset, it is already "converted" to ASCII.  
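A quick way to check whether you are in that easy case is to look for any byte over 127.  This is only a sketch in Python, and is_pure_ascii is a name I made up for it; the same loop can be written in LiveCode with charToNum().

```python
def is_pure_ascii(data):
    """True if every byte is below 128, so UTF-8 and ASCII agree on the data."""
    return all(b < 128 for b in data)

row = b"42,Dar,2013-06-20"
if is_pure_ascii(row):
    # Decoding as ASCII or as UTF-8 gives the same text.
    assert row.decode("ascii") == row.decode("utf-8")
```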

You are not going to find any interesting LiveCode characters (I think) among the non-ASCII characters of UTF-8.  Tab, new-line (LF), space, comma, quote, digits, decimal point, and so on are all ASCII.  This means that your scripts to work with the db values might still work with UTF-8.  The important thing is to watch out for cases where you assume a character is one char (in the LiveCode script sense), because a non-ASCII character takes up two or more chars.  However, if you are not writing back to the db and you consider the non-ASCII characters unimportant, then you can remove them.  Conversion might remove or try to translate them, I'm not sure.
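Here is a rough Python sketch of both points: the byte count differs from the character count, and dropping every byte over 127 removes the non-ASCII characters without touching the delimiters.  The name strip_non_ascii is hypothetical, and note that this removes characters rather than translating them.

```python
def strip_non_ascii(data):
    """Drop every byte with the high bit set, keeping only the ASCII bytes."""
    return bytes(b for b in data if b < 128)

raw = "Th\u00e9bault,Ludovic".encode("utf-8")

# 16 characters, but 17 bytes: the e-acute takes two bytes.
print(len(raw.decode("utf-8")), len(raw))    # 16 17

# Splitting on an ASCII delimiter is safe on the raw UTF-8 bytes.
print(raw.split(b","))                       # [b'Th\xc3\xa9bault', b'Ludovic']

# Removing the non-ASCII bytes loses the accented letter entirely.
print(strip_non_ascii(raw))                  # b'Thbault,Ludovic'
```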

If you think the encoding is not UTF-8, then it might be UTF-16.  If the text is mostly ASCII characters, then in UTF-16BE (big endian) it will be encoded as NUL-char, NUL-char...  If it is UTF-16LE (little endian) then you will see char-NUL patterns.  So, if you see that the code (that is, charToNum()) is zero a lot, then suspect you have some form of UTF-16.  A db might use UTF-16LE, UTF-16BE, or follow the endianness of the machine's unsigned 16-bit integers.  If you know the byte order, you can decide whether to swap bytes to match the endianness of your machine, and then you can convert to UTF-8.  To find the endianness of your machine, convert an ASCII char to UTF-16 and then look at whether the first byte is NUL; if it is, the machine is big endian.  This paragraph has a lot of info, and I might have skipped some parts, so keep at me until I explain it well.
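As a sketch of that NUL test (in Python, not LiveCode, and with made-up function names), you can count the zero bytes at even and odd positions.  It is only a guess that works for mostly-ASCII text; it ignores byte order marks and will be fooled by other data.

```python
import sys

def guess_utf16_order(data):
    """Guess the UTF-16 byte order from where the NUL bytes fall.

    Returns "utf-16-be", "utf-16-le", or None if there is no clear pattern.
    """
    even_nuls = data[0::2].count(0)   # NUL first: NUL-char, NUL-char...
    odd_nuls = data[1::2].count(0)    # NUL second: char-NUL, char-NUL...
    if even_nuls > odd_nuls:
        return "utf-16-be"
    if odd_nuls > even_nuls:
        return "utf-16-le"
    return None

def utf16_to_utf8(data, order):
    """Re-encode UTF-16 bytes of the given byte order as UTF-8."""
    return data.decode(order).encode("utf-8")

def machine_is_big_endian():
    """Encode one ASCII char in native-order UTF-16 and test the first byte.

    Python's "utf-16" codec puts a 2-byte byte order mark first, so the
    character itself starts at index 2.
    """
    return "A".encode("utf-16")[2] == 0

data = "abc".encode("utf-16-le")
order = guess_utf16_order(data)
if order is not None:
    print(order, utf16_to_utf8(data, order))   # utf-16-le b'abc'
```

If guess_utf16_order returns None, the data is probably not mostly-ASCII UTF-16, and the earlier UTF-8 checks are the better starting point.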

Dar



On Jun 20, 2013, at 12:07 AM, Ludovic Thébault wrote:

> Hello,
> 
> I need to get datas from  sqlite (in UTF-8) and convert it in ASCII for treatment, but i don't need to put it in a field..
> I try unidecode(uniencode(myTXT, "utf8")) and many others solutions with no result.
> 
> We need to pass by a field ? 
> with this command : set the unicodetext of field "xxx" to .. ?
> 
> Thanks




