New chunks

Fraser Gordon fraser.gordon at runrev.com
Wed Mar 12 06:12:33 EDT 2014


On 11 Mar 2014, at 19:26, Richmond <richmondmathewson at gmail.com> wrote:
> 
> Well; in theory that looks good until you start to think about languages which are
> written (such as Sanskrit) with no obvious word boundaries and both vowel mutation (Sandhi)
> at what would be word boundaries, and consonant fusion.

The library that we use for low-level Unicode handling (ICU) provides a facility called "break iterators". These are functions that break up text according to various rules; variants are provided for graphemes, words, sentences, etc. ICU has a (very large) database of rules and, for some languages, dictionaries, so that it can break words properly even in complex languages. Not all languages are supported, but a large number are.

> 
>> sentence (breaks on unicode sentence boundaries)
> 
> That looks a bit fishy.
> 
> How are you going to work out what marks a sentence boundary in every language that one can write
> with Unicode? And there are languages where the idea of a 'sentence' is absent.

Again, ICU does the hard work. In a language without sentences, the text will simply be treated as a single sentence.

ICU is also smart enough to tell the difference between a decimal point and a full stop/period. Some languages use different marks as sentence separators, and ICU knows about those too.
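
To make the idea concrete, here is a toy sketch in Python. It is not ICU and not what the engine does: the function name toy_sentences, the terminator set, and the "full stop followed by a digit" rule are all simplifying assumptions, chosen only to show the kind of decision a sentence break iterator has to make.

```python
# Toy illustration only. ICU's real sentence rules are far more extensive.
# Hypothetical terminator set: '.', '!', '?', and the Devanagari danda (U+0964).
TERMINATORS = ".!?\u0964"


def toy_sentences(text):
    """Split text into sentences using a deliberately naive rule set."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch not in TERMINATORS:
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # A full stop directly followed by a digit is treated as a
        # decimal point, not as the end of a sentence.
        if ch == "." and nxt.isdigit():
            continue
        # In a run of terminators such as "?!", only the last one
        # ends the sentence.
        if nxt and nxt in TERMINATORS:
            continue
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    # Text with no terminator at all comes back as a single sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

With these assumptions, "It costs 3.50 dollars. Is that ok?" splits into two sentences rather than three, a text containing no terminators at all comes back as one sentence, and a Devanagari text is split at each danda.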

> 
> I'm sorry to be such a "pill", but word and sentence boundaries are such culture-bound concepts
> that they will only be any good for languages that mark word and sentence boundaries.
> 
> This is about the same as stating dogmatically that "all bananas are yellow", when they are not.

Paragraphs are defined in the Unicode standard. They are runs of text terminated by the Paragraph Separator character or (optionally) any other newline character. While it may not make sense linguistically, this is how we delimit paragraphs in LiveCode fields.
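
As a rough sketch of that definition (in Python, not engine code): the function name paragraphs and the exact set of newline characters accepted are assumptions made for the example. Only the Paragraph Separator (U+2029) and the common CR, LF and CRLF sequences are handled here.

```python
import re

# Paragraph Separator (U+2029) plus common newline sequences.
# Which newline characters count is an assumption of this sketch.
# "\r\n" is listed first so that a CRLF pair counts as one terminator.
PARAGRAPH_BREAK = re.compile(r"\r\n|[\n\r\u2029]")


def paragraphs(text):
    """Return the runs of text terminated by a paragraph break."""
    parts = PARAGRAPH_BREAK.split(text)
    # A trailing terminator ends the last paragraph; it does not
    # start a new, empty one.
    if parts and parts[-1] == "":
        parts.pop()
    return parts
```

For example, paragraphs("one\u2029two\nthree\r\n") gives three paragraphs, while an empty line in the middle of the text is kept as an empty paragraph.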


Regards,
Fraser


