CSV again.

Alex Tweedly alex at tweedly.net
Fri Oct 16 11:04:55 EDT 2015


Hi Mike,

thanks for that additional info.

I *think* (it's been 3 years) I left them as <GS> (i.e. numtochar(29))
because I had some data including normal TAB characters within the cells
(!!) and thought <GS> was a safer bet - though of course nothing is
completely safe. It's then up to the caller to decide whether to do
"replace numtochar(29) with TAB in ...", or do TAB escaping, or whatever
they want.
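
For example, a caller that wants tab-delimited output might post-process the result along these lines. This is only a sketch: the handler name GSToTab and the use of a literal "\t" as the escape for embedded tabs are assumptions, not something settled in this thread.

```livecode
function GSToTab pData
   -- hypothetical helper, not part of CSVToTab1
   -- escape TABs that occur inside cells first, so they cannot be
   -- confused with the delimiters added on the next line
   -- (LiveCode does not interpret backslashes, so "\t" is two chars)
   replace tab with "\t" in pData
   -- now turn the <GS> cell delimiters into real TABs
   replace numToChar(29) with tab in pData
   return pData
end GSToTab
```

It could be called as "put GSToTab(CSVToTab1(tCSV)) into tTabbed", where tCSV is whatever variable holds the raw CSV text.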

As for the other bigger problem ....   Oh dear - CR vs LF vs CRLF ....

Are you on Mac or Windows or Linux ?
How is the LF delimited data getting into your app ?
Maybe we should just add a "replace numtochar(13) with CR in pData" ?

(I confess to being confused by this - I know that LC does
auto-translation of line delimiters at various places, but I'm not sure
when it is, or isn't, completely safe.) Maybe the easiest thing is to just
do all the translations ....

   replace CRLF with CR in pData
   replace numtochar(10) with CR in pData
   replace numtochar(13) with CR in pData
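
A minimal sketch of that idea as a wrapper, assuming CSVToTab1 (quoted below) is available; the name CSVToTabSafe is hypothetical. In LiveCode the constant CR is numtochar(10), so the numtochar(10) replace changes nothing and is left out here. The wrapper also appends a final line terminator when one is missing, since Mike's sample only came out complete when the input was <LF>-terminated.

```livecode
function CSVToTabSafe pData, pColDelim
   -- hypothetical wrapper around CSVToTab1; not part of the original code
   replace crlf with cr in pData           -- Windows CRLF -> LF
   replace numToChar(13) with cr in pData  -- bare carriage return -> LF
   -- make sure the last record is terminated, so that a final quoted
   -- cell is followed by one more item and gets copied to the output
   if pData is not empty and char -1 of pData is not cr then
      put cr after pData
   end if
   return CSVToTab1(pData, pColDelim)
end CSVToTabSafe
```

Note that the result then always ends with a CR, which the caller may want to strip.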

-- Alex.

On 16/10/2015 12:48, Mike Kerner wrote:
> Richard,
> Yes, I understand it was a Pascal Pun, and then in 2012, when this thread
> originally happened, it became something more, sort of a version pun on a
> pascal pun, if you will.
>
> Rather than posting fixes to the one on your blog, let's go through the
> "state of the art" and work on that, instead, if it needs it.
>
>
> Alex,
> I see at least two issues with this version:
> First of all, you never substitute tab for tNuDelim, so the string you
> return is numtochar(29) delimited, not tab-delimited.
> The last line of your function, before the "return tNuData" line should be
> "replace tNuDelim with tab"
>
> Second of all, I get two different results in my sample, depending on
> whether or not the string is <CR>...ERRRRRRRRRRR <LF>-terminated or not
> After fixing the problem, above,
>
> When I run
> "A","","C"
> I get
> A <HT> <HT>
> i.e. the "C" is missing
>
> NOW, if I send
> "A","","C"<LF>
> I get
> A <HT> <HT> C <LF>
>
> I haven't looked for that bug, yet.
>
> On Thu, Oct 15, 2015 at 10:55 PM, Alex Tweedly <alex at tweedly.net> wrote:
>
>> Hmmmm ... my quick test of what was csv4Tab, but is now called csvToTab1 -
>> see below - gives me
>> (showing results with a colon  ':' for the cell delimiter, i.e. replacing
>> the numtochar(29) used in the previous use-list code)
>>
>> a,b,c   ---> a:b:c
>> "a","","c" ---> a::c
>>
>> Now to me, that's what it should give - so I think it gets it right :-)
>>
>> Question is
>> a. do you get the same result ?
>>      if not, what do you get ?  OR can you try with the code below
>>      if you do, but disagree that this is right, what do you think it
>> should give ?
>>
>> -- Alex
>>
>> function CSVToTab1 pData,pcoldelim
>>     local tNuData -- contains tabbed copy of data
>>     local tReturnPlaceholder -- replaces cr in field data to avoid line
>>     --                       breaks which would be misread as records;
>>     local tNuDelim  -- new character to replace the delimiter
>>     local tStatus, theInsideStringSoFar
>>     --
>>     put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
>>     put numtochar(29) into tNuDelim
>>     --
>>     if pcoldelim is empty then put comma into pcoldelim
>>     -- Normalize line endings:
>>     replace crlf with cr in pData          -- Win to UNIX
>>     replace numtochar(13) with cr in pData -- Mac to UNIX
>>
>>     put "outside" into tStatus
>>     set the itemdel to quote
>>     repeat for each item k in pData
>>        -- put tStatus && k & CR after msg
>>        switch tStatus
>>
>>           case "inside"
>>              put k after theInsideStringSoFar
>>              put "passedquote" into tStatus
>>              next repeat
>>
>>           case "passedquote"
>>              -- decide if it was a duplicated escapedQuote or a closing quote
>>              if k is empty then   -- it's a duplicated quote
>>                 put quote after theInsideStringSoFar
>>                 put "inside" into tStatus
>>                 next repeat
>>              end if
>>              -- not empty - so we remain inside the cell, though we have left the quoted section
>>              -- NB this allows for quoted sub-strings within the cell content !!
>>              replace cr with tReturnPlaceholder in theInsideStringSoFar
>>              put theInsideStringSoFar after tNuData
>>
>>           case "outside"
>>              replace pcoldelim with tNuDelim in k
>>              -- and deal with the "empty trailing item" issue in Livecode
>>              replace (tNuDelim & CR) with tNuDelim & tNuDelim & CR in k
>>              put k after tNuData
>>              put "inside" into tStatus
>>              put empty into theInsideStringSoFar
>>              next repeat
>>           default
>>              put "defaulted"
>>              break
>>        end switch
>>     end repeat
>>     return tNuData
>> end CSVToTab1
>>
>>
>> On 16/10/2015 01:34, Mike Kerner wrote:
>>
>>> csv4 does not handle it, and it comes up with a different result from csv2
>>> (which is also wrong).  I sent Richard proposed changes to csv2 which
>>> address that issue, but I'll wait while we collectively try to remember
>>> what the latest and greatest csv parser algorithm is before I try to come
>>> up with more ways to break or fix it.
>>>
>>> On Thu, Oct 15, 2015 at 8:24 PM, Alex Tweedly <alex at tweedly.net> wrote:
>>>
>>> Richard et al.,
>>>> sometime after that article, there was a further thread on the use-list.
>>>> Pete Haworth found a case not properly covered by the version on the
>>>> article, and I came up with a revised version (cutely called csv4Tab !! -
>>>> csv3Tab was an interim, deeply buggy attempt)
>>>>
>>>> (It's in
>>>> http://lists.runrev.com/pipermail/use-livecode/2012-May/172275.html )
>>>>
>>>> It *looks* from that thread (
>>>> http://lists.runrev.com/pipermail/use-livecode/2012-May/172191.html ) as
>>>> though this case had been discussed, and the re-write should properly
>>>> handle it - but I haven't yet had time to try it. My laptop has been
>>>> replaced in the meantime, and I can't find my test stack, and recreating
>>>> it and finding the test data is a bit too much for after 1am :-)
>>>>
>>>> So I'll try it tomorrow; hopefully csv4Tab() will already work for this
>>>> case. If it doesn't, we can try again :-)
>>>>
>>>> -- Alex.
>>>>
>>>>
>>>> On 16/10/2015 00:34, Richard Gaskin wrote:
>>>>
>>>> Mike Kerner wrote:
>>>>>> Alex, Richard, etc.
>>>>>>
>>>>>> What do we consider the latest version of the csv parser?  I think I
>>>>>> found a bug in Richard's CSV2Text code, and proposed changes, but he
>>>>>> wanted the discussion to go down over here, first.  Then I noticed
>>>>>> that csv4Text is out over here, which makes 2, I guess, a bit long in
>>>>>> the tooth.
>>>>>>
>>>>> The version referred to here as "Richard's" is the famous Tweedly algo,
>>>>> in the middle of this page:
>>>>> <http://www.fourthworld.com/embassy/articles/csv-must-die.html>
>>>>>
>>>>> Alex came up with that after a bunch of us here had a long discussion
>>>>> about the many variants of CSV running around, and how stupidly complex
>>>>> they are to parse (see the details in that article).
>>>>>
>>>>> Mike wrote me this afternoon letting me know that there's yet another
>>>>> exception that doesn't seem to be accounted for there:
>>>>>
>>>>>      "value","","value"
>>>>>
>>>>> I had thought we'd covered that in the earlier discussion, but perhaps
>>>>> not.
>>>>>
>>>>> So this seems like a good time to once again bring together the best
>>>>> minds in our community (are you listening Alex Tweedly, Geoff Canyon,
>>>>> Mark Weider, Dick Kreisel, and others?) to see if we can revisit CSV
>>>>> parsing and come up with a function that can parse it into tabs
>>>>> efficiently, while taking into account all of the really stupid
>>>>> exceptions that have crept into the world since that really stupid
>>>>> format was first popularized.
>>>>>
>>>>> When we're done I'll update the article, and add even more sarcastic
>>>>> comments about what a really dumb idea it was to have encouraged people
>>>>> to delimit text with a character so frequently appearing in text.
>>>>>
>>>>> --
>>>>>    Richard Gaskin
>>>>>    Fourth World Systems
>>>>>    Software Design and Development for the Desktop, Mobile, and the Web
>>>>>    ____________________________________________________________________
>>>>>    Ambassador at FourthWorld.com http://www.FourthWorld.com
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> use-livecode mailing list
>>>>> use-livecode at lists.runrev.com
>>>>> Please visit this url to subscribe, unsubscribe and manage your
>>>>> subscription preferences:
>>>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>>>>
