Unicode and languages

Paul Dupuis paul at researchware.com
Fri Jun 5 13:15:11 EDT 2020


In all the added stuff the LC7 and higher Unicode engine includes, is 
there any way to determine the LANGUAGE of a range of text?

USE-CASE

We have a tool that helps researchers transcribe text from digital 
media. It is used internationally. We have added spell checking using 
lclSpell from Live Code Labs, a LiveCode store add-on.

For lclSpell, we only have Dictionaries for a small set of languages. 
You can build your own Dictionaries for lclSpell, but we'll still only 
have Dictionaries for a small subset of the languages people transcribe 
in. We also have people who do BOTH transcription AND translations.

For example, someone transcribes a Chinese-language media recording, 
typing in the Simplified or Traditional Chinese characters, AND then 
translates it to English, typing the English translation after the 
transcription.

With lclSpell (or I suspect ANY LiveCode compatible spell checker), if 
you try to spell check a reasonably large chunk of text that is NOT in 
the same language as your Dictionary, it ties up LiveCode forever, or at 
least for such a long time that most people would force-quit. It is, 
after all, marking every word as misspelled and trying to do whatever it 
does to determine that.

Now, you could respond that the researcher should just KNOW better than 
to spell check a text in a language that is not in their loaded 
Dictionary! However, people are people, and will do such things and 
expect software to protect them from their own mistakes. Also, with 
mixed transcription and translation, you do want to spell check the 
English part and skip the Chinese (if you do not have a Chinese 
Dictionary).

So, we're looking for a way to detect the LANGUAGE of a range of text, 
in a LiveCode field, to be able to then determine whether it matches the 
current (or any available) dictionary or not and act accordingly.
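
One rough fallback I can picture, short of true language detection, is 
to classify a range by Unicode script block. This is only a minimal 
sketch under stated assumptions: the handler names cjkFraction and 
maybeSpellCheck and the 0.5 threshold are hypothetical, and it detects 
script, not language, so it cannot tell Chinese from Japanese kanji, or 
English from French.

```livecode
-- Hypothetical helper (not part of LiveCode or lclSpell): returns the
-- fraction, from 0 to 1, of codepoints above the space character in
-- pText that fall in the main CJK ideograph blocks.
function cjkFraction pText
   local tTotal, tCJK, tNum
   put 0 into tTotal
   put 0 into tCJK
   repeat for each codepoint tCodepoint in pText
      put codepointToNum(tCodepoint) into tNum
      -- skip spaces, tabs, returns and other control characters
      if tNum <= 32 then next repeat
      add 1 to tTotal
      -- 19968..40959 is U+4E00..U+9FFF (CJK Unified Ideographs)
      -- 13312..19903 is U+3400..U+4DBF (CJK Extension A)
      if (tNum >= 19968 and tNum <= 40959) or (tNum >= 13312 and tNum <= 19903) then
         add 1 to tCJK
      end if
   end repeat
   if tTotal is 0 then return 0
   return tCJK / tTotal
end cjkFraction

-- Hypothetical usage: skip the spell check when the field is mostly CJK
-- and only an English Dictionary is loaded. The 0.5 cut-off is arbitrary.
command maybeSpellCheck
   if cjkFraction(the text of fld 1) > 0.5 then
      answer "This text does not appear to match the loaded Dictionary."
   else
      -- call the spell checker here
   end if
end maybeSpellCheck
```

For mixed transcription and translation, the same function could be 
called per line or per paragraph, so that only the ranges that are not 
mostly CJK get passed to the spell checker.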

There is a "fontLanguage" function in LC, but that seems to predate 
Unicode Everywhere and seems pretty useless now.

For example, in a new stack, with a single scrolling field, we paste in 
a Chinese text and then execute:

put the fontLanguage of (the effective textfont of char 1 to -1 of fld 1)

and get "ansi". Even if you set the range to one (char 2 to 3) that is 
specifically Chinese (no white space), it still returns "ansi". The 
textFont returns empty and the effective textFont returns "Segoe UI".

I don't even know if language exists in the IBM Unicode engine (ICU) as 
some exportable property that a future version of LiveCode could expose.

Any clever ideas or thoughts on this problem are welcome.





