Line Breaks Dropped on Importing Unicode Text

Jim Ault jimaultwins at yahoo.com
Fri Sep 4 20:38:02 EDT 2009


On Sep 4, 2009, at 3:01 PM, Sivakatirswami wrote:

> Aloha, Joe:
>
> I'm not quite sure how your suggestion relates to the problem of  
> endlines.
>
> The unicode.txt file I have is being read OK in Pages on the mac.
> It also loads just fine in Rev, with the exception of the line breaks
>
> I'm not sure where the uniencode/unidecode  could be used to solve  
> the line break issue.
<full quote appears below>

From the dictionary:

     set the useUnicode to true
    Specifies whether the charToNum and numToChar functions assume a  
character is double-byte.

Perhaps the issue for me is:

If you are using UTF-16, that means that every character is
represented by at least two bytes.
If you do a replacement using numtochar(13), it seems to me that the
'replace' scan of the string is only looking at single bytes, since
numtochar returns a single-byte character unless you set the
useUnicode to true.
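
To show what I mean at the byte level, here is a small sketch in Python rather than Rev script. It is only an illustration: the sample string and the two helper names are made up by me, and I am assuming big-endian UTF-16 with no BOM. It compares a byte-by-byte replacement of 13 with 10 against a replacement done after decoding the text.

```python
# Sketch: why a single-byte replace is unsafe on UTF-16 data.
# The sample string and the helper names are hypothetical.

def bytewise_replace(data: bytes) -> bytes:
    """Replace every 0x0D byte with 0x0A, ignoring character boundaries."""
    return data.replace(b"\x0d", b"\x0a")

def charwise_replace(data: bytes, encoding: str = "utf-16-be") -> bytes:
    """Decode first, replace whole characters, then encode again."""
    text = data.decode(encoding)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode(encoding)

# U+010D is stored as the bytes 01 0D in big-endian UTF-16,
# so it contains a byte that looks like a carriage return.
sample = "a\u010d\rb"
raw = sample.encode("utf-16-be")   # 00 61 01 0D 00 0D 00 62

damaged = bytewise_replace(raw).decode("utf-16-be")
correct = charwise_replace(raw).decode("utf-16-be")

print(repr(damaged))  # the U+010D character has been altered too
print(repr(correct))  # only the line ending has changed
```

The byte-wise version does turn the return into a linefeed, but it also changes any other character that happens to contain the byte 13, which is the kind of damage I would worry about.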

Some programs, like BBEdit, try to detect the encoding, but this does
not always work.  I think the BOM (byte order mark) at the start of
the file is supposed to be the flag for the encoding, but it is not
always present.

In BBEdit, one of the File menu commands is "Reopen Using Encoding >"
with six choices:
Unicode(UTF-8)
Unicode(UTF-8, no BOM)
Unicode(UTF-16)
Unicode(UTF-16, no BOM)
Unicode(UTF-16, Little Endian)
Unicode(UTF-16, Little Endian, no BOM)
Thus there must be some trial and error involved, depending on the
source of the Unicode string.
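
As a rough sketch of the BOM check (again in Python, not Rev script), something like the following would cover the cases where a BOM is present. The function names and the big-endian fallback are my own assumptions; when there is no BOM the encoding still has to be guessed or supplied by hand, and UTF-32 is ignored here.

```python
import codecs

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) based on a leading BOM, or (None, 0)."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le", 2
    return None, 0

def decode_with_bom(data: bytes, fallback: str = "utf-16-be") -> str:
    """Strip the BOM if there is one; otherwise use the fallback (a guess)."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)

# Little-endian UTF-16 with a BOM: "A" followed by a carriage return.
print(repr(decode_with_bom(b"\xff\xfeA\x00\r\x00")))
```

If the fallback is wrong the text comes out scrambled, which is where the trial and error with the "no BOM" choices comes in.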

I know this is confusing and I will be diving into the arena very soon.

(I will avoid any hint of a Mac arena pun because I don't want that  
tune bouncing around in my head.)

Jim Ault
Las Vegas

On Sep 4, 2009, at 3:01 PM, Sivakatirswami wrote:

> Aloha, Joe:
>
> I'm not quite sure how your suggestion relates to the problem of  
> endlines.
>
> The unicode.txt file I have is being read OK in Pages on the mac.
> It also loads just fine in Rev, with the exception of the line breaks
>
> I'm not sure where the uniencode/unidecode  could be used to solve  
> the line break issue.
>
> Joe F. wrote:
>> The trick is to use uniencode/unidecode for everything.
>> Three separate examples:
>>
>> ask file "Name new file:" with "NewFile.xml"
>> put "binfile:" & it into theNewFileName
>> get the unicodetext of cd fld 1
>> put unidecode(it,"utf8") into url (theNewFileName)
>> ------------------------------------------
>> set the unicodetext of cd fld 1 to uniencode(tMyUnicode,"utf8")
>> ------------------------------------------
>> put URL (theFTPRequest) into theDownLoadedText
>> put uniencode(theDownLoadedText,"utf8") into theDownLoadedText
>> put unidecode(theDownLoadedText,"ANSI") into cd fld 1 of cd id 4630
>>
>>
>>
>> On Sep 4, 2009, at 1:39 AM, Sivakatirswami wrote:
>>
>>> I have some UTF-16 unicode raw text. If I import this into Pages,  
>>> it displays the font correctly and also the line breaks between  
>>> paragraphs correctly
>>>
>>> But if I use this function:
>>>
>>> on mouseUp
>>> answer file "Choose a unicode file to read in."
>>> if it is empty then exit mouseUp
>>> put "binfile:" & it into urlName
>>> replace numtochar(13) with numtochar(10) in urlName
>>> set the unicodeText of fld "display" to url urlName
>>> end mouseUp
>>>
>>> the line breaks are not appearing in the field in revolution.
>>>
>>> Also if I try to analyze what chars are there, where the line  
>>> break should be, by selecting across a missing line break and then  
>>> use this test:
>>>
>>> on mouseUp
>>> set the useUnicode to true
>>> if the selection is empty then
>>>    answer "No Selection" with "ok"
>>> end if
>>> put the selection into tUnicode
>>> put tUnicode
>>> repeat for each char x in tUnicode
>>>    put chartonum(x) & cr after tOutput
>>> end repeat
>>> put tOutput # returns empty
>>> end mouseUp
>>>
>>> I get nothing in the msg box. If I switch from Tamil Inaimathi  
>>> (Mac unicode font) to Anjal Chittu unicode, The Tamil displays  
>>> correctly and now I can clearly see a two byte block
>>>
>>> space+square-box-symbol (somewhat transparent) where the line
>>> breaks should be. But the test still returns empty when I try to
>>> determine what the bytes are....
>>>
>>> How do we deal with/import correctly, line breaks in unicode text  
>>> in Revolution?
>>>
>>> I plan to create an editing environment as a revlet for online  
>>> work... and unicode will be imported and exported freely for later  
>>> use in InDesign. Obviously CRLF (or whatever it is in Unicode)  
>>> needs to be preserved.



