Unicode and languages
Alex Tweedly
alex at tweedly.net
Sat Jun 6 09:11:32 EDT 2020
If you simply need to protect users in the scenario you describe, then
you could try a simple heuristic:
- extract the first 100 (200? - 500?) characters (or first 20 words)
- spell check that sample
- if there are more than 10 (20? - 50??) spelling errors, then flag it
as a likely language mismatch
- if not, proceed with the full spell check.
Adjust the numbers until it gives protection without too many false
positives.
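
Roughly, in code, the idea might look like the sketch below. It is
Python rather than LiveCode, purely to show the logic. The function
name, the `is_misspelled` callback and the default numbers are all
made up for illustration and are not part of lclSpell; you would wire
the callback to whatever your spell checker provides. It also uses a
fraction of the sample (0.5, i.e. 10 out of 20 words) instead of an
absolute error count, so that short samples still behave sensibly.

```python
import re
from typing import Callable


def likely_language_mismatch(
    text: str,
    is_misspelled: Callable[[str], bool],  # hypothetical hook into your spell checker
    sample_words: int = 20,                # assumed default; tune as needed
    max_error_ratio: float = 0.5,          # assumed default; tune as needed
) -> bool:
    """Spell check a small sample and report whether most of it fails."""
    # Take only the first few words so the check stays cheap.
    words = re.findall(r"\w+", text)[:sample_words]
    if not words:
        return False  # nothing to check, so nothing to flag
    errors = sum(1 for word in words if is_misspelled(word))
    return errors / len(words) > max_error_ratio
```

If this returns true you would skip (or warn before) the full spell
check. Note that text without spaces, such as Chinese, comes out of
this split as a few very long "words", which is fine for flagging a
mismatch but is one more reason to experiment with the numbers.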
Alex.
On 05/06/2020 18:15, Paul Dupuis via use-livecode wrote:
> In all the added stuff the LC7 and higher Unicode engine includes, is
> there any way to determine the LANGUAGE of a range of text?
>
> USE-CASE
>
> We have a tool that helps researchers transcribe text from digital
> media. It is used internationally. We have added spell checking using
> lclSpell from Live Code Labs, a LiveCode store add-on.
>
> For lclSpell, we only have Dictionaries for a small set of languages.
> You can build your own Dictionaries for lclSpell, but we'll still only
> have Dictionaries for a small subset of the languages people
> transcribe in. We also have people who do BOTH transcription AND
> translations.
>
> For example, transcribing a Chinese language media recording, typing
> in the Simplified or Traditional Chinese characters AND then translating
> it to English, typing the English translation after the transcription.
>
> With lclSpell (or I suspect ANY LiveCode compatible spell checker) if
> you try to spell check a reasonably large chunk of text that is NOT in
> the same language as your Dictionary, it ties up LiveCode forever, or
> at least such a long time that most people would force-quit. It is
> after all marking every word as misspelled and trying to do whatever
> it does to determine that.
>
> Now, you can respond that the researcher should just KNOW better than
> to spell check a text in a language that is not their loaded
> Dictionary! However, people are people, and will do such things and
> expect software to protect them from their own mistakes. Also, with
> mixed transcription and translation, you do want to spell check the
> English part and skip the Chinese (if you do not have a Chinese
> Dictionary)
>
> So, we're looking for a way to detect the LANGUAGE of a range of text,
> in a LiveCode field, to be able to then determine whether it matches
> the current (or any available) dictionary or not and act accordingly.
>
> There is a "fontLanguage" function in LC, but that seems to predate
> Unicode Everywhere and seems pretty useless now.
>
> For example, in a new stack, with a single scrolling field, we paste
> in a Chinese text and then execute:
>
> put the fontLanguage of (the effective textfont of char 1 to -1 of fld 1)
>
> and get "ansi". Even if you set the range (char 2 to 3) that is
> specifically Chinese (no white space), it still returns "ansi". The
> textFont returns empty and the effective textFont returns "Segoe UI"
>
> I don't even know if language exists in the IBM Unicode engine as some
> exportable property a future version of LiveCode could expose.
>
> Any clever ideas or thoughts on this problem are welcome.