Line Breaks Dropped on Importing Unicode Text
Jim Ault
jimaultwins at yahoo.com
Fri Sep 4 20:38:02 EDT 2009
On Sep 4, 2009, at 3:01 PM, Sivakatirswami wrote:
> Aloha, Joe:
>
> I'm not quite sure how your suggestion relates to the problem of
> endlines.
>
> The unicode.txt file I have is being read OK in Pages on the mac.
> It also loads just fine in Rev, with the exception of the line breaks
>
> I'm not sure where the uniencode/unidecode could be used to solve
> the line break issue.
<full quote appears below>
From the dictionary
set the useUnicode to true
Specifies whether the charToNum and numToChar functions assume a
character is double-byte.
Perhaps the issue, as I see it, is this:
If you are using UTF-16, that means that every character is
represented by (at least) two bytes.
If you do a replacement using numtochar(13), it seems to me that the
'replace' scan of the string is only looking at single bytes, unless
you set the useUnicode to true.
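To make the single-byte vs. double-byte point concrete, here is a minimal
sketch in Python (not Rev script, and not a claim about how Rev's 'replace'
works internally). It assumes big-endian UTF-16 with no BOM, and the function
names are made up for the illustration. It compares swapping the lone byte 13
with decoding first and then replacing whole characters:

```python
def normalize_bytewise(data: bytes) -> bytes:
    """Naive approach: swap every 0x0D byte for 0x0A, ignoring character boundaries."""
    return data.replace(b"\x0d", b"\x0a")


def normalize_decoded(data: bytes, encoding: str = "utf-16-be") -> str:
    """Decode to characters first, then replace CRLF and bare CR with LF."""
    text = data.decode(encoding)
    return text.replace("\r\n", "\n").replace("\r", "\n")


# 'a', U+010D (c with caron), CR, 'b' as UTF-16 big-endian bytes:
# 00 61 | 01 0D | 00 0D | 00 62
sample = "a\u010d\rb".encode("utf-16-be")

# The byte-wise swap changes the CR, but it also rewrites the low byte
# of U+010D, turning it into U+010A (a different letter).
print(repr(normalize_bytewise(sample).decode("utf-16-be")))

# Decoding first touches only the real line break.
print(repr(normalize_decoded(sample)))
```

The byte-wise version damages any character that happens to contain the byte
13, which is why the replacement is safer once the data is treated as
double-byte characters.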
Some programs, like BBEdit, try to detect the encoding, but this does
not always work. I think the BOM (byte order mark) at the start of the
file is supposed to be the flag for the encoding, but not every file
includes one.
In BBEdit one of the File menu commands is "Reopen Using Encoding >"
with six choices:
Unicode(UTF-8)
Unicode(UTF-8, no BOM)
Unicode(UTF-16)
Unicode(UTF-16, no BOM)
Unicode(UTF-16, Little Endian)
Unicode(UTF-16, Little Endian, no BOM)
Thus there must be some trial and error involved, depending on the
source of the Unicode string.
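As a rough illustration of that trial and error, here is a small Python
sketch that looks for a BOM and, when none is present, falls back to an
encoding the caller has to guess. The function names and the default
fallback are assumptions made for this example, and UTF-32 is deliberately
ignored:

```python
import codecs


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) based on a leading BOM, or (None, 0) if absent."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", len(codecs.BOM_UTF8)
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    return None, 0


def decode_text(data: bytes, fallback: str = "utf-16-be") -> str:
    """Decode using the BOM if there is one; otherwise use the caller's guess."""
    encoding, bom_length = sniff_bom(data)
    return data[bom_length:].decode(encoding or fallback)


# A file with a BOM decodes without guessing.
with_bom = codecs.BOM_UTF16_LE + "a\rb".encode("utf-16-le")
print(sniff_bom(with_bom))
print(repr(decode_text(with_bom)))

# A file without a BOM gives no hint, so the fallback decides the result.
no_bom = "a\rb".encode("utf-16-be")
print(sniff_bom(no_bom))
print(repr(decode_text(no_bom, fallback="utf-16-be")))
```

When there is no BOM, a wrong fallback may still decode without an error but
produce the wrong characters, so the result needs to be checked by eye.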
I know this is confusing and I will be diving into the arena very soon.
(I will avoid any hint of a Mac arena pun because I don't want that
tune bouncing around in my head.)
Jim Ault
Las Vegas
On Sep 4, 2009, at 3:01 PM, Sivakatirswami wrote:
> Aloha, Joe:
>
> I'm not quite sure how your suggestion relates to the problem of
> endlines.
>
> The unicode.txt file I have is being read OK in Pages on the mac.
> It also loads just fine in Rev, with the exception of the line breaks
>
> I'm not sure where the uniencode/unidecode could be used to solve
> the line break issue.
>
> Joe F. wrote:
>> The trick is to use uniencode/unidecode for everything.
>> Three separate examples:
>>
>> ask file "Name new file:" with "NewFile.xml"
>> put "binfile:" & it into theNewFileName
>> get the unicodetext of cd fld 1
>> put unidecode(it,"utf8") into url (theNewFileName)
>> ------------------------------------------
>> set the unicodetext of cd fld 1 to uniencode(tMyUnicode,"utf8")
>> ------------------------------------------
>> put URL (theFTPRequest) into theDownLoadedText
>> put uniencode(theDownLoadedText,"utf8") into theDownLoadedText
>> put unidecode(theDownLoadedText,"ANSI") into cd fld 1 of cd id 4630
>>
>>
>>
>> On Sep 4, 2009, at 1:39 AM, Sivakatirswami wrote:
>>
>>> I have some UTF-16 unicode raw text. If I import this into Pages,
>>> it displays the font correctly and also the line breaks between
>>> paragraphs correctly
>>>
>>> But if I use this function:
>>>
>>> on mouseUp
>>> answer file "Choose a unicode file to read in."
>>> if it is empty then exit mouseUp
>>> put "binfile:" & it into urlName
>>> replace numtochar(13) with numtochar(10) in urlName
>>> set the unicodeText of fld "display" to url urlName
>>> end mouseUp
>>>
>>> the line breaks are not appearing in the field in revolution.
>>>
>>> Also if I try to analyze what chars are there, where the line
>>> break should be, by selecting across a missing line break and then
>>> use this test:
>>>
>>> on mouseUp
>>> set the useUnicode to true
>>> if the selection is empty then
>>> answer "No Selection" with "ok"
>>> end if
>>> put the selection into tUnicode
>>> put tUnicode
>>> repeat for each char x in tUnicode
>>> put chartonum(x) & cr after tOutput
>>> end repeat
>>> put tOutput # returns empty
>>> end mouseUp
>>>
>>> I get nothing in the msg box. If I switch from Tamil Inaimathi
>>> (Mac unicode font) to Anjal Chittu unicode, The Tamil displays
>>> correctly and now I can clearly see a two byte block
>>>
>>> space+square-box-symbol (somewhat transparent) where the line
>>> breaks should be. But it still returns empty on an empty to
>>> determine what the bytes are....
>>>
>>> How do we deal with/import correctly, line breaks in unicode text
>>> in Revolution?
>>>
>>> I plan to create an editing environment as a revlet for online
>>> work... and unicode will be imported and exported freely for later
>>> use in InDesign. Obviously CRLF (or whatever it is in Unicode)
>>> needs to be preserved.