CSV again.

Peter M. Brigham pmbrig at gmail.com
Sat Oct 17 20:56:51 EDT 2015


Thanks for catching that. Change the if-then structure to:

if howmany(openQuoteChar,thisItem) <> howmany(closeQuoteChar,thisItem) then
    return "This CSV data is not parsable (unclosed quotes in item)."
end if

Revised function:

function CSVtoArray pData
   -- by Peter M. Brigham, pmbrig at gmail.com
   -- requires getDelimiters(), howmany(), offsets()
   put getDelimiters(pData,5) into tDelims
   put line 1 of tDelims into crChar
   put line 2 of tDelims into tabChar
   put line 3 of tDelims into commaChar
   put line 4 of tDelims into openQuoteChar
   put line 5 of tDelims into closeQuoteChar
   
   replace crlf with cr in pData          -- Win to UNIX
   replace numtochar(13) with cr in pData -- Mac to UNIX
   
   if howmany(quote,pData) mod 2 = 1 then
      return "This CSV data is not parsable (unclosed quotes in data)."
   end if
   
   put offsets(quote,pData) into qOffsets
   if qOffsets > 0 then
      put 1 into counter
      repeat for each item q in qOffsets
         if counter mod 2 = 1 then put openQuoteChar into char q of pData
         else put closeQuoteChar into char q of pData
         add 1 to counter
      end repeat
   end if
   
   put offsets(cr,pData) into crOffsets
   repeat for each item r in crOffsets
      put char 1 to r of pData into upToHere
      if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
         -- the cr is within a quoted string
         put crChar into char r of pData
      end if
   end repeat
   put offsets(tab,pData) into tabOffsets
   repeat for each item t in tabOffsets
      put char 1 to t of pData into upToHere
      if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
         -- the tab is within a quoted string
         put tabChar into char t of pData
      end if
   end repeat
   put offsets(comma,pData) into commaOffsets
   repeat for each item c in commaOffsets
      put char 1 to c of pData into upToHere
      if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
         -- the comma is within a quoted string
         put commaChar into char c of pData
      end if
   end repeat
   put 0 into lineCounter
   repeat for each line L in pData
      add 1 to lineCounter
      put 0 into itemCounter
      repeat for each item i in L
         add 1 to itemCounter
         put i into thisItem
         if howmany(openQuoteChar,thisItem) <> howmany(closeQuoteChar,thisItem) then
            return "This CSV data is not parsable (unclosed quotes in item)."
         end if
         replace crChar with cr in thisItem
         replace tabChar with tab in thisItem
         replace commaChar with comma in thisItem
         replace openQuoteChar with quote in thisItem
         replace closeQuoteChar with quote in thisItem
         put thisItem into A[lineCounter][itemCounter]
      end repeat
   end repeat
   return A
end CSVtoArray

----------

-- Peter

Peter M. Brigham
pmbrig at gmail.com
http://home.comcast.net/~pmbrig


On Oct 17, 2015, at 8:47 PM, Alex Tweedly wrote:

> Ummm .... surely at this point
> 
> ....
> 
>      repeat for each item i in L
>         add 1 to itemCounter
>         put i into thisItem
>         if howmany(quote,thisItem) mod 2 = 1 then
>            return "This CSV data is not parsable (unclosed quotes in item)."
>         end if
> 
> ...
> 
> howmany(quote,thisItem) must be 0 - all quotes have been replaced by either openQuoteChar or closeQuoteChar
> 
> Shouldn't this test be
>   if howmany(openQuoteChar, thisItem) <> howmany(closeUqoteChar, thisItem)  then
> 
> 
> Also, I think (i.e. I haven't yet run the code, since I don't have offsets() available) there is another mis-formed case you don't properly detect :
> a,b,c,"def"""g"h",i,j,k
> 
> The quoted cell contains the right number (i.e. a multiple of 2) of quotes, but they are not suitably adjacent, so they can't be properly interpreted as paired 'enclosed quotes'.   (I should say, none of the earlier versions detect this either - their intent was to make the best feasible result from well-formed data, and not to detect all malformed cases - but if this version is going to detect and give error returns for error inputs in some cases, then we should try to do it fully).
> 
> -- Alex.
> 
> 
> On 18/10/2015 00:41, Peter M. Brigham wrote:
>> So here's my attempt. It converts a CVS text to an array. Let's see if there's csv data that can break it.
>> 
>> -- Peter
>> 
>> Peter M. Brigham
>> pmbrig at gmail.com
>> http://home.comcast.net/~pmbrig
>> 
>> -------
>> 
>> function CSVtoArray pData
>>    -- by Peter M. Brigham, pmbrig at gmail.com
>>    -- requires getDelimiters(), howmany()
>>    put getDelimiters(pData,5) into tDelims
>>    put line 1 of tDelims into crChar
>>    put line 2 of tDelims into tabChar
>>    put line 3 of tDelims into commaChar
>>    put line 4 of tDelims into openQuoteChar
>>    put line 5 of tDelims into closeQuoteChar
>>        replace crlf with cr in pData          -- Win to UNIX
>>    replace numtochar(13) with cr in pData -- Mac to UNIX
>>        if howmany(quote,pData) mod 2 = 1 then
>>       return "This CSV data is not parsable (unclosed quotes in data)."
>>    end if
>>        put offsets(quote,pData) into qOffsets
>>    if qOffsets > 0 then
>>       put 1 into counter
>>       repeat for each item q in qOffsets
>>          if counter mod 2 = 1 then put openQuoteChar into char q of pData
>>          else put closeQuoteChar into char q of pData
>>          add 1 to counter
>>       end repeat
>>    end if
>>        put offsets(cr,pData) into crOffsets
>>    repeat for each item r in crOffsets
>>       put char 1 to r of pData into upToHere
>>       if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
>>          -- the cr is within a quoted string
>>          put crChar into char r of pData
>>       end if
>>    end repeat
>>    put offsets(tab,pData) into tabOffsets
>>    repeat for each item t in tabOffsets
>>       put char 1 to t of pData into upToHere
>>       if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
>>          -- the tab is within a quoted string
>>          put tabChar into char t of pData
>>       end if
>>    end repeat
>>    put offsets(comma,pData) into commaOffsets
>>    repeat for each item c in commaOffsets
>>       put char 1 to c of pData into upToHere
>>       if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
>>          -- the comma is within a quoted string
>>          put commaChar into char c of pData
>>       end if
>>    end repeat
>>    put 0 into lineCounter
>>    repeat for each line L in pData
>>       add 1 to lineCounter
>>       put 0 into itemCounter
>>       repeat for each item i in L
>>          add 1 to itemCounter
>>          put i into thisItem
>>          if howmany(quote,thisItem) mod 2 = 1 then
>>             return "This CSV data is not parsable (unclosed quotes in item)."
>>          end if
>>          replace crChar with cr in thisItem
>>          replace tabChar with tab in thisItem
>>          replace commaChar with comma in thisItem
>>          replace openQuoteChar with quote in thisItem
>>          replace closeQuoteChar with quote in thisItem
>>          put thisItem into A[lineCounter][itemCounter]
>>       end repeat
>>    end repeat
>>    return A
>> end CSVtoArray
>> 
>> function getDelimiters pText, nbr
>>    -- returns a cr-delimited list of <nbr> characters
>>    --    not found in the variable pText
>>    -- use for delimiters for, eg, parsing text files, manipulating arrays, etc.
>>    -- usage: put getDelimiters(pText,2) into tDelims
>>    --        if tDelims begins with "Error" then exit to top -- or whatever
>>    --        put line 1 of tDelims into lineDivider
>>    --        put line 2 of tDelims into itemDivider
>>    --             etc.
>>    -- by Peter M. Brigham, pmbrig at gmail.com — freeware
>>        if pText = empty then return "Error: no text specified."
>>    if nbr = empty then put 1 into nbr -- default 1 delimiter
>>    put "2,3,4,5,6,7,8,16,17,18,19,20,21,22,23,24,25,26" into baseList
>>    -- low ASCII values, excluding CR, LF, tab, etc.
>>    put the number of items of baseList into maxNbr
>>    if nbr > maxNbr then return "Error: max" && maxNbr && "delimiters."
>>    repeat with tCount = 1 to nbr
>>       put true into failed
>>       repeat with i = 1 to the number of items of baseList
>>          put item i of baseList into testNbr
>>          put numtochar(testNbr) into testChar
>>          if testChar is not in pText then
>>             -- found one, store and get next delim
>>             put false into failed
>>             put testChar into line tCount of delimList
>>             exit repeat
>>          end if
>>       end repeat
>>       if failed then
>>          if tCount = 0 then
>>             return "Error: cannot get any delimiters."
>>          else if tCount = 1 then
>>             return "Error: can only get one delimiter."
>>          else
>>             return "Error: can only get" && tCount && "delimiters."
>>          end if
>>       end if
>>       delete item i of baseList
>>    end repeat
>>    return delimList
>> end getDelimiters
>> 
>> function howmany pStr, pContainer, pCaseSens
>>    -- how many times pStr occurs in pContainer
>>    -- note that howmany("00","000000") returns 3, not 5
>>    --     ie,  overlapping matches are not counted
>>    -- by Peter M. Brigham, pmbrig at gmail.com — freeware
>>        if pCaseSens = empty then put false into pCaseSens
>>    set the casesensitive to pCaseSens
>>    if pStr is not in pContainer then return 0
>>    put len(pContainer) into origLength
>>    replace pStr with char 2 to -1 of pStr in pContainer
>>    return origLength - len(pContainer)
>> end howmany
>> 
>> 
>> On Oct 17, 2015, at 5:03 AM, Alex Tweedly wrote:
>> 
>>> Naturally it must be removed.
>>> 
>>> But I have a more philosophical issue / question.
>>> 
>>> 
>>> TSV (in and of itself) doesn't have any quotes, and so doesn't handle quoted CRs or TABs.
>>> 
>>> Currently, the 'old' version - as in Richard's published article, doesn't handle TAB characters enclosed within a quoted cell. The 'new' version does - but only by returning the data delimited by <GS> instead of TAB, and leaving enclosed TABs alone - a mistake, IMHO.
>>> 
>>> I believe that what the converter should do is :
>>> - return TSV - i.e. delimited by TABs
>>> - replace quoted CR by <VT> within quoted cells (as it does now)
>>> - replace quoted TABs by <GS> within quoted cells
>>> 
>>> Any comments or suggestions ?
>>> 
>>> Thanks
>>> Alex.
>>> 
>>> On 17/10/2015 02:34, Mike Kerner wrote:
>>>> It's safe as long as you remember to remove it at the end of the function
>>>> 
>>>> On Fri, Oct 16, 2015 at 7:12 PM, Alex Tweedly <alex at tweedly.net> wrote:
>>>> 
>>>>> Duh - replying to myself again :-)
>>>>> 
>>>>> It looks as though that's exactly what you do mean - it certainly
>>>>> generates the problems you described earlier. And my one-line additional
>>>>> test would (does in my testing) solve it properly - without it, we don't
>>>>> get a chance to flush "theInsideStringSoFar" to tNuData, with the extra
>>>>> line we do. And adding it is always safe (AFAICI).
>>>>> 
>>>>> -- Alex.
>>>>> 
>>>>> 
>>>>> On 17/10/2015 00:03, Alex Tweedly wrote:
>>>>> 
>>>>>> Sorry, Mike, but can you describe what you mean by a "naked" line ?
>>>>>> Is it simply one with no line delimiter after it ?
>>>>>> i.e. could only happen on the very last line of a file of input ?
>>>>>> 
>>>>>> Could that be solved by a simple test (after the various 'replace'
>>>>>> statements)
>>>>>>     if the last char of pData <> CR then put CR after pData
>>>>>> before the parsing happens ?
>>>>>> 
>>>>>> -- Alex.
>>>>>> 
>>>>>> 
>>>>>> On 16/10/2015 17:19, Mike Kerner wrote:
>>>>>> 
>>>>>>> No, the problem isn't that LC use LF and CR for ascii(10) and ignores
>>>>>>> ascii(13).  That's just a personal problem.
>>>>>>> 
>>>>>>> The problem, here, is that the csv parser handles a naked line and a
>>>>>>> terminated line differently.  If the line is terminated, it parses it one
>>>>>>> way, and if it is not, it parses it (incorrectly) a different way, which
>>>>>>> makes me wonder if this is the latest version.
>>>>>>> 
>>>>>>> On Fri, Oct 16, 2015 at 11:28 AM, Bob Sneidar <
>>>>>>> bobsneidar at iotecdigital.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> But what if the cr or lf or crlf is inside quoted text, meaning it is not
>>>>>>>> a delimiter? Oh, I'm afraid the deflector shield will be quite
>>>>>>>> operational
>>>>>>>> when your friends arrive.
>>>>>>>> 
>>>>>>>> Bob S
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Oct 16, 2015, at 08:04 , Alex Tweedly <alex at tweedly.net> wrote:
>>>>>>>>> Hi Mike,
>>>>>>>>> 
>>>>>>>>> thanks for that additional info.
>>>>>>>>> 
>>>>>>>>> I *think* (it's been 3 years) I left them as <GS> (i.e. numtochar(29))
>>>>>>>>> 
>>>>>>>> because I had some data including normal TAB characters within the cells
>>>>>>>> (!!( and thought <GS> was a safer bet - though of course nothing is
>>>>>>>> completely safe. It's then up to the caller to decide whether to do
>>>>>>>> "replace numtochar(29) with TAB in ...", or do TAB escaping, or whatever
>>>>>>>> they want.
>>>>>>>> 
>>>>>>>>> As for the other bigger problem .... Oh dear = CR vs LF vs CRLF ....
>>>>>>>>> 
>>>>>>>>> Are you on Mac or Windows or Linux ?
>>>>>>>>> How is the LF delimited data getting into your app ?
>>>>>>>>> Maybe we should just add a "replace chartonum(13) with CR in pData" ?
>>>>>>>>> 
>>>>>>>>> (I confess to being confused by this - I know that LC does
>>>>>>>>> 
>>>>>>>> auto-translation of line delimiters at various places, but I'm not sure
>>>>>>>> when it is, or isn't, completely safe. Maybe the easiest thing is to
>>>>>>>> jst do
>>>>>>>> all the translations ....
>>>>>>>> 
>>>>>>>>>   replace CRLF with CR in pData
>>>>>>>>>   replace numtochar(10) with CR in pData
>>>>>>>>>   replace numtochar(13) with CR in pData
>>>>>>>>> 
>>>>>>>>> -- Alex.
>>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> use-livecode mailing list
>>>>>>>> use-livecode at lists.runrev.com
>>>>>>>> Please visit this url to subscribe, unsubscribe and manage your
>>>>>>>> subscription preferences:
>>>>>>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>>>>>>> 
>>>>>>>> 
>>>>>> _______________________________________________
>>>>>> use-livecode mailing list
>>>>>> use-livecode at lists.runrev.com
>>>>>> Please visit this url to subscribe, unsubscribe and manage your
>>>>>> subscription preferences:
>>>>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>>>>> 
>>>>> _______________________________________________
>>>>> use-livecode mailing list
>>>>> use-livecode at lists.runrev.com
>>>>> Please visit this url to subscribe, unsubscribe and manage your
>>>>> subscription preferences:
>>>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>>>> 
>>>> 
>>> 
>>> _______________________________________________
>>> use-livecode mailing list
>>> use-livecode at lists.runrev.com
>>> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>> 
>> _______________________________________________
>> use-livecode mailing list
>> use-livecode at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode
> 
> 
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode





More information about the use-livecode mailing list