Line Breaks Dropped on Importing Unicode Text

Sivakatirswami katir at hindu.org
Fri Sep 4 23:06:26 EDT 2009


stephen barncard wrote:
> Why are you replacing the CRs with LFs? doesn't the engine's Unicode
> functions handle line endings?
> -------------------------
> Stephen Barncard
> San Francisco
> http://houseofcubes.com/disco.irev
>
>
> 2009/9/4 Sivakatirswami <katir at hindu.org>
>
>   
>> Aloha, Joe:
>>
>> I'm not quite sure how your suggestion relates to the problem of endlines.
>>
>> The unicode.txt file I have is being read OK in Pages on the mac.
>> It also loads just fine in Rev, with the exception of the line breaks
>>
>> I'm not sure where uniencode/unidecode could be used to solve the line
>> break issue
>>     
Sometimes on Kauai it rains for so many days (the max count in my log is
63 days...) that we live in a "mud world".

Somehow my entry into Unicode feels not like a "baptism by fire"
but like a "baptism by mud".

welcome to petroglyph land... (smile)

Stephen: the engine only handles line endings for "file:*" URLs and not
for "binfile:*" URLs.

A note on the source: this is original Tamil typed in MylaiSri, which maps
all chars against 0-127. Muthu Neduraman of Marusu System in Malaysia
(IT Tamil Master, font designer, systems engineer, etc.) wrote me a C++
program to transform the ASCII input into a Unicode.txt file. I really
don't have any specs on what his program outputs. (I would love to take
that thing and turn it into an external if I knew how... that's another
story...)

But that's what I'm loading... and since he works on OS X, I thought, on
a hunch, that he was surely piping CRs from the original ASCII straight
out as char(13).

Joe Ault: OK, we are getting somewhere.
I obviously made a blooper: I was replacing
char(13) with char(10) in the file name and not in the data.

Of course nothing happened... fixed it:

this now works fine! 

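on mouseUp
  answer file "Choose a unicode file to read in."
  if it is empty then exit mouseUp
  -- "binfile:" reads the raw bytes with no line-ending translation
  put "binfile:" & it into urlName
  -- useUnicode tells numtochar()/chartonum() to work with double-byte chars
  set the useUnicode to true
  put url urlName into tTamilUnicodeText
  -- convert CR line endings to LF in the data (not in the file name)
  replace numtochar(13) with numtochar(10) in tTamilUnicodeText
  set the unicodeText of fld "display" to tTamilUnicodeText
end mouseUp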

OK so far so good. I'm getting the same line breaks from the original text.

Richard: thanks for the arcane script from Mark, which I only saw
*after* trying the above... so I did not need it.
But I will keep it as a reference, thank you.

Jim F: thanks for the tip on always encoding... since I have to move
this stuff back and forth to the web server and possibly in and out of
PostgreSQL, I will take your advice.

So, for now it works... Read on if you want to walk into the morass of
trying to actually see what you have as decimal strings:

Ken Kojima, thanks: this now works -- well, appears to, on the surface.

on mouseUp
  set the useUnicode to true
  if the selection is empty then
    answer "No Selection" with "ok"
    exit mouseUp -- nothing to inspect, so stop here
  end if
  put the selection into tUnicode
  -- treat each pair of chars as one double-byte character and list its code
  repeat with i = 1 to the num of chars of tUnicode step 2
    put chartonum(char i to i+1 of tUnicode) & cr after tOutput
  end repeat
  put tOutput
end mouseUp

but I get super irrational results (irrational to me at least)

Tamil lives here:

U+0B80 – U+0BFF   (2944–3071)
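
As a quick sanity check on that range, here is a minimal sketch in Python
(not Rev script, purely for illustration) that tests whether a decimal
value reported by chartonum falls inside the Tamil block. The helper name
in_tamil_block is made up for this note.

```python
# Bounds of the Tamil block as given above: U+0B80 to U+0BFF.
TAMIL_FIRST = 0x0B80  # 2944
TAMIL_LAST = 0x0BFF   # 3071

def in_tamil_block(code):
    """Return True if the decimal code lies in U+0B80..U+0BFF."""
    return TAMIL_FIRST <= code <= TAMIL_LAST

# Show each value in decimal, in U+XXXX form, and whether it is in range.
for code in (2944, 2990, 3071, 3374, 8205):
    print(code, "U+%04X" % code, in_tamil_block(code))
```

Anything that prints False here is outside the block, whatever else it
may turn out to be.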

If I load the text *without* handling the line endings and select across
the last letters of one line and the beginning of the next:

[Note: the editor of this text typically puts two end-of-paragraph marks
(i.e. 1 blank line) between paragraphs, block style.]

2990 - Valid Tamil Character
3021 - Valid Tamil Character
3374 - out of range: should be line break and does show as one in Pages
8205 - out of range: should be line break and does show as one in Pages, 
or in the field if I do the (13) to (10) conversion
2953 - Valid Tamil Character
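
One possible reading of these five numbers, sketched in Python rather
than Rev and resting on an assumption the thread does not confirm: each
value was built from two bytes with the low byte first (little-endian).
Splitting the values back into bytes under that assumption shows what
the selection may actually have contained. The function name
to_le_bytes is hypothetical.

```python
values = [2990, 3021, 3374, 8205, 2953]

def to_le_bytes(codes):
    """Split each 16-bit value into (low byte, high byte), in order."""
    out = bytearray()
    for code in codes:
        out.append(code & 0xFF)  # low byte first (assumed byte order)
        out.append(code >> 8)    # then high byte
    return bytes(out)

stream = to_le_bytes(values)
print(" ".join("%02X" % b for b in stream))
# prints: AE 0B CD 0B 2E 0D 0D 20 89 0B
```

Under that assumption the middle four bytes are 2E 0D 0D 20, which as
single bytes are a period, two CRs, and a space. That would fit a
selection in which the Tamil letters are two bytes each while the
punctuation and line breaks are one byte each, so that pairing
everything two by two produces 3374 and 8205.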

And if I select the same thing a second time... different results!

2990
3021
12576
2570
2992
3007



OK now... if I select another section of text where there is a text/2 
line breaks/text/2 line breaks/text

I get super bizarre results back

3377
45069 # way out of range.
48907
44555
52491
8203
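
A hedged way to look at these "way out of range" values, again in
Python and again assuming low-byte-first pairs (not confirmed anywhere
in the thread): if the pairing is one byte out of step, the same bytes
re-paired starting from the second byte should give sensible numbers.
The names to_le_bytes and repair are made up for this sketch.

```python
values = [3377, 45069, 48907, 44555, 52491, 8203]

def to_le_bytes(codes):
    """Split each 16-bit value into (low byte, high byte), in order."""
    out = bytearray()
    for code in codes:
        out.append(code & 0xFF)
        out.append(code >> 8)
    return bytes(out)

def repair(codes, skip=1):
    """Drop `skip` leading bytes, then pair the rest low byte first."""
    stream = to_le_bytes(codes)[skip:]
    # Stop before a trailing unpaired byte, if there is one.
    return [stream[i] | (stream[i + 1] << 8)
            for i in range(0, len(stream) - 1, 2)]

print(repair(values))
# prints: [3341, 2992, 3007, 2990, 3021]
```

Shifted by one byte, the list becomes 3341 (two CR bytes side by side)
followed by four values inside the Tamil range. That is consistent with
an odd number of single-byte characters pushing every later pair out of
step, though it does not prove that this is what the engine is doing.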

If I lengthen the selection, left and right, I get (almost) completely
different results.

Even the same characters that were in the short selection do not come
out as the same values:

3015
2985
3021
3391
39437
49419
45579
51979
38155
12555
3341
2992
3007


If I put replace numtochar(13) with numtochar(10) in tTamilUnicodeText

back into my import script and then select across the end of the same
line, the 2 CRs, and the beginning of the 4th line, I get different
results again. This time they are so far beyond my ken as to be a black
box. I don't think I will even "go there" in trying to understand what
is happening, what is wrong, and why we get something like:

2985
3021
2623
39434
49419
45579
51979
38155
44555
52491
8203
2609
45066
48907
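
For what it is worth, this last list can be turned back into readable
text under two assumptions that the thread does not confirm: the pairs
are low byte first, and the selection mixes two-byte Tamil characters
(high byte 0B) with one-byte ASCII characters and line breaks. The
Python sketch below (not Rev; all names are hypothetical) walks the
bytes and picks one or two at a time accordingly.

```python
values = [2985, 3021, 2623, 39434, 49419, 45579, 51979,
          38155, 44555, 52491, 8203, 2609, 45066, 48907]

TAMIL_HIGH = 0x0B  # high byte shared by all of U+0B80..U+0BFF

def to_le_bytes(codes):
    """Split each 16-bit value into (low byte, high byte), in order."""
    out = bytearray()
    for code in codes:
        out.append(code & 0xFF)
        out.append(code >> 8)
    return bytes(out)

def decode_mixed(stream):
    """Decode a guessed mix of 1-byte ASCII and 2-byte Tamil units.

    Returns the decoded text and a list of bytes that fit neither case.
    """
    text, stray = [], []
    i = 0
    while i < len(stream):
        b = stream[i]
        if b >= 0x80 and i + 1 < len(stream) and stream[i + 1] == TAMIL_HIGH:
            text.append(chr(b | (TAMIL_HIGH << 8)))  # two-byte Tamil unit
            i += 2
        elif b < 0x80:
            text.append(chr(b))                      # one-byte ASCII or LF
            i += 1
        else:
            stray.append(b)                          # unpaired leftover
            i += 1
    return "".join(text), stray

text, stray = decode_mixed(to_le_bytes(values))
print(repr(text), stray)
```

Read this way, the fourteen numbers become two Tamil characters, a
question mark, two LFs, a seven-character Tamil word followed by " 1",
two more LFs, and the start of the next line, with one byte left over
at the end. That looks a lot like real text from the file, which
suggests the data is intact and only the two-by-two pairing in the
inspection script is out of step.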

There are more bizarre events occurring (selecting text causes characters
to switch places!). Wish me luck in creating an online editor as a revlet!

Sivakatirswami


















More information about the use-livecode mailing list