the mouseText and Unicode: CONCLUSION

Pierre Sahores psahores at free.fr
Sun Jun 19 11:07:54 EDT 2011


Thanks for your fantastic work, Slava!

I had to hack the LC htmlText features to build recent web apps, and I am now sure that I will be able to replace all of that with your Unicode approach. With your method, it becomes simple to store data in UTF-8 format in PostgreSQL, and this will make a clear and useful difference.

Kind regards,
Pierre

On 19 June 2011 at 07:20, Slava Paperno wrote:

> Thanks to everyone who posted all the fun stuff in this thread today.
> 
> The workaround below combines ideas from all of you.
> 
> I learned two essential things from you guys:
> 
> 1) mouseChunk, mouseText, mouseCharChunk, and a couple of related functions return the positions of bytes, not characters. In a purely Roman or purely non-Roman field this can be easily dealt with, but not in a bilingual text. (Punctuation in a non-Roman text makes it bilingual.)
> 
> 2) a double-byte character whose first byte has the value 32 (that is, byteToNum of that byte evaluates to 32) is taken by "the number of words", and probably by other functions, to be a word boundary, which confuses them; an example is the Russian upper case R (decimal 1056, U+0420).
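
One way to check point 2 is to dump the byte values of a character's UTF-16 form. The function below is only an illustrative sketch: the name utf16ByteList and its parameter are made up for this example, and it assumes that uniEncode returns UTF-16 data in the byte order of the machine it runs on.

```livecode
-- Hypothetical helper (not part of the original post): returns, as a
-- comma-separated list, the value of each byte in the UTF-16 data that
-- uniEncode produces for the given UTF-8 text.
function utf16ByteList pUtf8Text
   local tUtf16, tList, i
   put uniEncode(pUtf8Text, "UTF8") into tUtf16
   repeat with i = 1 to the number of bytes in tUtf16
      put byteToNum(byte i of tUtf16) & comma after tList
   end repeat
   -- remove the trailing comma, if any bytes were listed
   if tList is not empty then delete the last char of tList
   return tList
end utf16ByteList
```

For example, put utf16ByteList(numToByte(208) & numToByte(160)) passes in the two UTF-8 bytes of character 1056. Since 1056 = 4 x 256 + 32, the two values listed should be 4 and 32, in an order that depends on the platform's byte order.
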
> 
> Here is a handler that works, with copious comments. If any of them are wrong, please let me know.
> 
> =====
> 
> on mouseUp
>   --this is attached to field  "TextToClick" that contains bilingual (Russian+English) text;
>   --this field has its lockText set to true;
>   --the purpose of this exercise is to retrieve and display in another field the word that the user has clicked;
>   --NOTE:  the mouseChunk and the mouseText are useless in a Unicode field;
>   --equally useless is the select command when used with these expressions, as in "select the mouseChunk";
> 
>   local locStart, locEnd
>   local locClickedLine
>   local locEntireText
>   local locEscapeCounter
> 
>   if  the mouseCharChunk is empty then
>      set the unicodeText of field "ClickedWord" to uniEncode("You clicked an empty space.", "UTF8")
>      exit mouseUp
>   end if
> 
>   put word 2 of the mouseLine into locClickedLine
>   --"line" is really a paragraph: it is defined by the return character, not by soft wrapping; locClickedLine is not used below, but it is accurate
> 
>   put word 2 of the mouseCharChunk into locStart
>   put word 4 of the mouseCharChunk into locEnd
> 
>   --a strategy based on "the number of words in char 1 to locEnd" bombs when the text before locEnd contains the upper case Russian  R (1056);
>   --this is probably because the first byte in the two-byte representation of 1056 evaluates to 32, and LC takes it for a word delimiter;
> 
>   --relying on the accuracy of the values that are returned by the mouseCharChunk is dubious because these are the positions of bytes, not characters:
>   -- one byte for each Roman character and two bytes for each non-Roman character; this kills a couple of other strategies
> 
>   --the strategy below is based on "the selection" and is not dependent on the accuracy of the mouseCharChunk values: the correct chunk is selected anyway
>   set useUnicode to true
>   put  the unicodeText of field "TextToClick" into locEntireText --this is UTF16
>   put uniDecode(locEntireText, "UTF8") into locEntireText --this is UTF8
> 
>   --look for a word boundary to the left of the click
>   repeat until (locStart < 1)
>      if byteToNum(byte locStart of locEntireText) is among items of 9, 10, 32 then
>         add 1 to locStart
>         exit repeat
>      end if
>      subtract 1 from locStart
>   end repeat
> 
>   --look for a word boundary to the right of the click
>   repeat until (locEnd >= length(locEntireText))
>      if byteToNum(byte locEnd of locEntireText) is among items of 9, 10, 32 then
>         subtract 1 from locEnd
>         exit repeat
>      end if
>      add 1 to locEnd
>   end repeat
> 
>   select char locStart to locEnd of field "TextToClick"
>   set the unicodeText of field "ClickedWord" to the unicodeText of the selection
>   --adjacent punctuation will be displayed as part of the word and can be easily trimmed
>   pass mouseUp
> end mouseUp
> =====
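
The two repeat loops in the handler can also be factored into a function, which makes the boundary search easier to try out away from a mouseUp handler. This is a variant sketch, not the handler's exact logic: wordBoundsAt and its parameter names are hypothetical, the positions are assumed to be byte offsets into UTF-8 text, and the caller is assumed to pass positions that lie inside a word. Scanning UTF-8 for the byte values 9, 10 and 32 is safe because every byte of a multi-byte UTF-8 sequence is 128 or higher.

```livecode
-- Hypothetical helper: given UTF-8 text and a byte range inside a word,
-- widens the range to the nearest tab (9), linefeed (10) or space (32)
-- on each side and returns "start,end" as byte offsets.
function wordBoundsAt pUtf8Text, pStart, pEnd
   -- move left until the byte before pStart is a delimiter
   repeat while pStart > 1
      if byteToNum(byte (pStart - 1) of pUtf8Text) is among the items of "9,10,32" then exit repeat
      subtract 1 from pStart
   end repeat
   -- move right until the byte after pEnd is a delimiter
   repeat while pEnd < the number of bytes in pUtf8Text
      if byteToNum(byte (pEnd + 1) of pUtf8Text) is among the items of "9,10,32" then exit repeat
      add 1 to pEnd
   end repeat
   return pStart & comma & pEnd
end wordBoundsAt
```

For example, put wordBoundsAt("one two three", 6, 6) yields 5,7, the byte range of "two".
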
> 
> Slava

--
Pierre Sahores
mobile : (33) 6 03 95 77 70

www.woooooooords.com
www.sahores-conseil.com