Unicode and languages

David V Glasgow dvglasgow at gmail.com
Sun Jun 7 05:06:33 EDT 2020


Ha! You beat me to it, Alex. The only thing I would add is that Paul might be able to pick out very common but distinctive markers for each language and build a simple detection algorithm around them.
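
For what it's worth, here is a rough sketch of the marker idea, written in Python rather than LiveCode purely so it is easy to run. It assumes the "markers" are Unicode code point ranges for a handful of scripts. The range table and the names SCRIPT_RANGES, script_of and dominant_script are illustrative only, not anything from LiveCode or lclSpell. It only identifies the writing system, so it can tell Chinese from English but not English from French.

```python
# Illustrative marker table: script label -> inclusive code point ranges.
# The selection of scripts and ranges is an assumption and far from complete.
SCRIPT_RANGES = {
    "han": [(0x3400, 0x4DBF), (0x4E00, 0x9FFF)],
    "kana": [(0x3040, 0x30FF)],
    "hangul": [(0xAC00, 0xD7AF)],
    "cyrillic": [(0x0400, 0x04FF)],
    "arabic": [(0x0600, 0x06FF)],
    "latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
}


def script_of(ch):
    """Return the script label for one character, or None if unlisted."""
    cp = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        for lo, hi in ranges:
            if lo <= cp <= hi:
                return script
    return None


def dominant_script(text, sample_size=200):
    """Return the most frequent script in the first sample_size characters.

    Characters outside the table (digits, spaces, punctuation) are ignored.
    Returns None when no character in the sample matches any range.
    """
    counts = {}
    for ch in text[:sample_size]:
        script = script_of(ch)
        if script is not None:
            counts[script] = counts.get(script, 0) + 1
    if not counts:
        return None
    return max(counts, key=counts.get)
```

If dominant_script() reports something other than the script of the loaded dictionary, the spell check could be skipped for that range. The same idea should port to LiveCode using the codepoint chunk and codepointToNum(), though I have not tried it.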

It made me wonder how Google Translate does it when it is set to 'detect language'.

Cheers,

David G 

> On 6 Jun 2020, at 2:11 pm, Alex Tweedly via use-livecode <use-livecode at lists.runrev.com> wrote:
> 
> If you simply need to protect users in the scenario you describe, then you could try a simple heuristic
> 
>  - extract the first 100 (200? - 500?) characters (or first 20 words)
> 
>  - spell check that
> 
>  - if there are more than 10 (20? - 50??) spelling errors then flag it as a likely language mismatch.
>  - and if not, proceed to do the spellcheck.
> 
> Adjust the numbers until it gives protection without too many false positives.
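
A minimal sketch of that heuristic, in Python for ease of testing rather than LiveCode. The is_known_word callback is a hypothetical stand-in for whatever lookup the spell checker offers (I do not know lclSpell's API), and the function names are made up. One deliberate change from the steps above: it compares the proportion of misspelled words to a threshold instead of an absolute count, because unsegmented text such as Chinese yields very few "words" in a short sample.

```python
import re


def sample_words(text, max_words=20):
    """Return up to max_words runs of letters from the start of the text."""
    return re.findall(r"[^\W\d_]+", text)[:max_words]


def likely_language_mismatch(text, is_known_word,
                             max_words=20, max_error_ratio=0.5):
    """Flag text whose leading sample is mostly unknown to the dictionary.

    is_known_word is a caller-supplied function (hypothetical here) that
    takes a lowercased word and returns True if the dictionary contains it.
    """
    words = sample_words(text, max_words)
    if not words:
        return False  # nothing to check, so nothing to flag
    errors = sum(1 for w in words if not is_known_word(w.lower()))
    return errors / len(words) > max_error_ratio
```

Only when this returns False would the full (slow) spell check run. As Alex says, max_words and max_error_ratio would need tuning against real transcripts.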
> 
> Alex.
> 
> On 05/06/2020 18:15, Paul Dupuis via use-livecode wrote:
>> In all the added stuff the LC7 and higher Unicode engine includes, is there any way to determine the LANGUAGE of a range of text?
>> 
>> USE-CASE
>> 
>> We have a tool that helps researchers transcribe text from digital media. It is used internationally. We have added spell checking using lclSpell from Live Code Labs, a LiveCode store add-on.
>> 
>> For lclSpell, we only have Dictionaries for a small set of languages. You can build your own Dictionaries for lclSpell, but we'll still only have Dictionaries for a small subset of the languages people transcribe in. We also have people who do BOTH transcription AND translation.
>> 
>> For example, someone might transcribe a Chinese-language media recording, typing in the Simplified or Traditional Chinese characters, AND then translate it to English, typing the English translation after the transcription.
>> 
>> With lclSpell (or, I suspect, ANY LiveCode-compatible spell checker), if you try to spell check a reasonably large chunk of text that is NOT in the same language as your Dictionary, it ties up LiveCode forever, or at least for so long that most people would force-quit. It is, after all, marking every word as misspelled and trying to do whatever it does to determine that.
>> 
>> Now, you could respond that the researcher should just KNOW better than to spell check a text in a language that does not match their loaded Dictionary! However, people are people, and they will do such things and expect software to protect them from their own mistakes. Also, with mixed transcription and translation, you do want to spell check the English part and skip the Chinese (if you do not have a Chinese Dictionary).
>> 
>> So, we're looking for a way to detect the LANGUAGE of a range of text, in a LiveCode field, to be able to then determine whether it matches the current (or any available) dictionary or not and act accordingly.
>> 
>> There is a "fontLanguage" function in LC, but that seems to predate Unicode Everywhere and seems pretty useless now.
>> 
>> For example, in a new stack with a single scrolling field, we paste in a Chinese text and then execute:
>> 
>> put the fontLanguage of (the effective textfont of char 1 to -1 of fld 1)
>> 
>> and get "ansi". Even if you set the range to one that is specifically Chinese (char 2 to 3, no white space), it still returns "ansi". The textFont returns empty and the effective textFont returns "Segoe UI".
>> 
>> I don't even know if language exists in the IBM Unicode engine as some exportable property a future version of LiveCode could expose.
>> 
>> Any clever ideas or thoughts on this problem are welcome.
>> 
>> 
>> 
>> _______________________________________________
>> use-livecode mailing list
>> use-livecode at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode
> 




