Unicode and languages

Alex Tweedly alex at tweedly.net
Sat Jun 6 09:11:32 EDT 2020


If you simply need to protect users in the scenario you describe, you 
could try a simple heuristic:

  - extract the first 100 (200? 500?) characters (or the first 20 words)

  - spell check just that sample

  - if there are more than 10 (20? 50?) spelling errors, flag it as a 
likely language mismatch

  - if not, proceed with the full spell check.

Adjust the numbers until it gives protection without too many false 
positives.
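
A minimal sketch of the steps above, in LiveCode script. It is only an 
illustration: the handler name looksLikeLanguageMismatch and its 
parameters are made up for this example, and pKnownWords (a 
return-delimited word list) is a stand-in for whatever lookup your spell 
checker provides. I have not used lclSpell's own calls here, so swap in 
the real one.

```livecode
-- Hypothetical helper: returns true when a short sample of pText contains
-- more than pMaxErrors words that are not in the dictionary.
-- pKnownWords is a return-delimited word list standing in for the real
-- spell checker's lookup (an assumption, not lclSpell's API).
function looksLikeLanguageMismatch pText, pKnownWords, pSampleWords, pMaxErrors
   local tChecked, tErrors
   put 0 into tChecked
   put 0 into tErrors
   -- trueWord uses the Unicode word-breaking rules (LC7 and higher)
   repeat for each trueWord tWord in pText
      add 1 to tChecked
      if tWord is not among the lines of pKnownWords then
         add 1 to tErrors
         -- stop early: enough errors to call it a mismatch
         if tErrors > pMaxErrors then return true
      end if
      -- only look at the first pSampleWords words
      if tChecked >= pSampleWords then exit repeat
   end repeat
   return false
end looksLikeLanguageMismatch
```

You would call it before the real spell check, for example with a sample 
of 50 words and a limit of 20 errors, and only run the full check when it 
returns false. The early return means a mismatched text costs at most a 
few dozen lookups instead of one per word in the field.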

Alex.

On 05/06/2020 18:15, Paul Dupuis via use-livecode wrote:
> In all the added stuff the LC7 and higher Unicode engine includes, is 
> there any way to determine the LANGUAGE of a range of text?
>
> USE-CASE
>
> We have a tool that helps researchers transcribe text from digital 
> media. It is used internationally. We have added spell checking using 
> lclSpell from Live Code Labs, a LiveCode store add-on.
>
> For lclSpell, we only have Dictionaries for a small set of languages. 
> You can build your own Dictionaries for lclSpell, but we'll still only 
> have Dictionaries for a small subset of the languages people 
> transcribe in. We also have people who do BOTH transcription AND 
> translations.
>
> For example, transcribing a Chinese language media recording, typing 
> in the Simplified or Traditional Chinese characters AND then 
> translating it to English, typing the English translation after the 
> transcription.
>
> With lclSpell (or I suspect ANY LiveCode compatible spell checker) if 
> you try to spell check a reasonably large chunk of text that is NOT in 
> the same language as your Dictionary, it ties up LiveCode forever, or 
> at least for such a long time that most people would force-quit. It 
> is, after all, marking every word as misspelled and trying to do 
> whatever it does to determine that.
>
> Now, you could respond that the researcher should just KNOW better 
> than to spell check a text in a language that does not match their 
> loaded Dictionary! However, people are people, and will do such things and 
> expect software to protect them from their own mistakes. Also, with 
> mixed transcription and translation, you do want to spell check the 
> English part and skip the Chinese (if you do not have a Chinese 
> Dictionary).
>
> So, we're looking for a way to detect the LANGUAGE of a range of text, 
> in a LiveCode field, to be able to then determine whether it matches 
> the current (or any available) dictionary or not and act accordingly.
>
> There is a "fontLanguage" function in LC, but that seems to predate 
> Unicode Everywhere and seems pretty useless now.
>
> For example, in a new stack, with a single scrolling field, we paste 
> in a Chinese text and then execute:
>
> put the fontLanguage of (the effective textfont of char 1 to -1 of fld 1)
>
> and get "ansi". Even if you set the range (char 2 to 3) that is 
> specifically Chinese (no white space), it still returns "ansi". The 
> textFont returns empty and the effective textFont returns "Segoe UI".
>
> I don't even know if language exists in the IBM Unicode engine as some 
> exportable property a future version of LiveCode could expose.
>
> Any clever ideas or thoughts on this problem are welcome.