CSV again.

Peter M. Brigham pmbrig at gmail.com
Sat Oct 17 19:41:05 EDT 2015


So here's my attempt. It converts a CVS text to an array. Let's see if there's csv data that can break it.

-- Peter

Peter M. Brigham
pmbrig at gmail.com
http://home.comcast.net/~pmbrig

-------

function CSVtoArray pData
   -- by Peter M. Brigham, pmbrig at gmail.com
   -- requires getDelimiters(), howmany()
   put getDelimiters(pData,5) into tDelims
   put line 1 of tDelims into crChar
   put line 2 of tDelims into tabChar
   put line 3 of tDelims into commaChar
   put line 4 of tDelims into openQuoteChar
   put line 5 of tDelims into closeQuoteChar
   
   replace crlf with cr in pData          -- Win to UNIX
   replace numtochar(13) with cr in pData -- Mac to UNIX
   
   if howmany(quote,pData) mod 2 = 1 then
      return "This CSV data is not parsable (unclosed quotes in data)."
   end if
   
   put offsets(quote,pData) into qOffsets
   if qOffsets > 0 then
      put 1 into counter
      repeat for each item q in qOffsets
         if counter mod 2 = 1 then put openQuoteChar into char q of pData
         else put closeQuoteChar into char q of pData
         add 1 to counter
      end repeat
   end if
   
   put offsets(cr,pData) into crOffsets
   repeat for each item r in crOffsets
      put char 1 to r of pData into upToHere
      if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
         -- the cr is within a quoted string
         put crChar into char r of pData
      end if
   end repeat
   put offsets(tab,pData) into tabOffsets
   repeat for each item t in tabOffsets
      put char 1 to t of pData into upToHere
      if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
         -- the tab is within a quoted string
         put tabChar into char t of pData
      end if
   end repeat
   put offsets(comma,pData) into commaOffsets
   repeat for each item c in commaOffsets
      put char 1 to c of pData into upToHere
      if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
         -- the comma is within a quoted string
         put commaChar into char c of pData
      end if
   end repeat
   put 0 into lineCounter
   repeat for each line L in pData
      add 1 to lineCounter
      put 0 into itemCounter
      repeat for each item i in L
         add 1 to itemCounter
         put i into thisItem
         if howmany(quote,thisItem) mod 2 = 1 then
            return "This CSV data is not parsable (unclosed quotes in item)."
         end if
         replace crChar with cr in thisItem
         replace tabChar with tab in thisItem
         replace commaChar with comma in thisItem
         replace openQuoteChar with quote in thisItem
         replace closeQuoteChar with quote in thisItem
         put thisItem into A[lineCounter][itemCounter]
      end repeat
   end repeat
   return A
end CSVtoArray

function getDelimiters pText, nbr
   -- returns a cr-delimited list of <nbr> characters
   --    not found in the variable pText
   -- use for delimiters for, eg, parsing text files, manipulating arrays, etc.
   -- usage: put getDelimiters(pText,2) into tDelims
   --        if tDelims begins with "Error" then exit to top -- or whatever
   --        put line 1 of tDelims into lineDivider
   --        put line 2 of tDelims into itemDivider
   --             etc.
   -- by Peter M. Brigham, pmbrig at gmail.com — freeware
   
   if pText = empty then return "Error: no text specified."
   if nbr = empty then put 1 into nbr -- default 1 delimiter
   put "2,3,4,5,6,7,8,16,17,18,19,20,21,22,23,24,25,26" into baseList
   -- low ASCII values, excluding CR, LF, tab, etc.
   put the number of items of baseList into maxNbr
   if nbr > maxNbr then return "Error: max" && maxNbr && "delimiters."
   repeat with tCount = 1 to nbr
      put true into failed
      repeat with i = 1 to the number of items of baseList
         put item i of baseList into testNbr
         put numtochar(testNbr) into testChar
         if testChar is not in pText then
            -- found one, store and get next delim
            put false into failed
            put testChar into line tCount of delimList
            exit repeat
         end if
      end repeat
      if failed then
         if tCount = 0 then
            return "Error: cannot get any delimiters."
         else if tCount = 1 then
            return "Error: can only get one delimiter."
         else
            return "Error: can only get" && tCount && "delimiters."
         end if
      end if
      delete item i of baseList
   end repeat
   return delimList
end getDelimiters

function howmany pStr, pContainer, pCaseSens
   -- how many times pStr occurs in pContainer
   -- note that howmany("00","000000") returns 3, not 5
   --     ie,  overlapping matches are not counted
   -- by Peter M. Brigham, pmbrig at gmail.com — freeware
   
   if pCaseSens = empty then put false into pCaseSens
   set the casesensitive to pCaseSens
   if pStr is not in pContainer then return 0
   put len(pContainer) into origLength
   replace pStr with char 2 to -1 of pStr in pContainer
   return origLength - len(pContainer)
end howmany


On Oct 17, 2015, at 5:03 AM, Alex Tweedly wrote:

> Naturally it must be removed.
> 
> But I have a more philosophical issue / question.
> 
> 
> TSV (in and of itself) doesn't have any quotes, and so doesn't handle quoted CRs or TABs.
> 
> Currently, the 'old' version - as in Richard's published article, doesn't handle TAB characters enclosed within a quoted cell. The 'new' version does - but only by returning the data delimited by <GS> instead of TAB, and leaving enclosed TABs alone - a mistake, IMHO.
> 
> I believe that what the converter should do is :
> - return TSV - i.e. delimited by TABs
> - replace quoted CR by <VT> within quoted cells (as it does now)
> - replace quoted TABs by <GS> within quoted cells
> 
> Any comments or suggestions ?
> 
> Thanks
> Alex.
> 
> On 17/10/2015 02:34, Mike Kerner wrote:
>> It's safe as long as you remember to remove it at the end of the function
>> 
>> On Fri, Oct 16, 2015 at 7:12 PM, Alex Tweedly <alex at tweedly.net> wrote:
>> 
>>> Duh - replying to myself again :-)
>>> 
>>> It looks as though that's exactly what you do mean - it certainly
>>> generates the problems you described earlier. And my one-line additional
>>> test would (does in my testing) solve it properly - without it, we don't
>>> get a chance to flush "theInsideStringSoFar" to tNuData, with the extra
>>> line we do. And adding it is always safe (AFAICI).
>>> 
>>> -- Alex.
>>> 
>>> 
>>> On 17/10/2015 00:03, Alex Tweedly wrote:
>>> 
>>>> Sorry, Mike, but can you describe what you mean by a "naked" line ?
>>>> Is it simply one with no line delimiter after it ?
>>>> i.e. could only happen on the very last line of a file of input ?
>>>> 
>>>> Could that be solved by a simple test (after the various 'replace'
>>>> statements)
>>>>     if the last char of pData <> CR then put CR after pData
>>>> before the parsing happens ?
>>>> 
>>>> -- Alex.
>>>> 
>>>> 
>>>> On 16/10/2015 17:19, Mike Kerner wrote:
>>>> 
>>>>> No, the problem isn't that LC use LF and CR for ascii(10) and ignores
>>>>> ascii(13).  That's just a personal problem.
>>>>> 
>>>>> The problem, here, is that the csv parser handles a naked line and a
>>>>> terminated line differently.  If the line is terminated, it parses it one
>>>>> way, and if it is not, it parses it (incorrectly) a different way, which
>>>>> makes me wonder if this is the latest version.
>>>>> 
>>>>> On Fri, Oct 16, 2015 at 11:28 AM, Bob Sneidar <
>>>>> bobsneidar at iotecdigital.com>
>>>>> wrote:
>>>>> 
>>>>> But what if the cr or lf or crlf is inside quoted text, meaning it is not
>>>>>> a delimiter? Oh, I'm afraid the deflector shield will be quite
>>>>>> operational
>>>>>> when your friends arrive.
>>>>>> 
>>>>>> Bob S
>>>>>> 
>>>>>> 
>>>>>> On Oct 16, 2015, at 08:04 , Alex Tweedly <alex at tweedly.net> wrote:
>>>>>>> Hi Mike,
>>>>>>> 
>>>>>>> thanks for that additional info.
>>>>>>> 
>>>>>>> I *think* (it's been 3 years) I left them as <GS> (i.e. numtochar(29))
>>>>>>> 
>>>>>> because I had some data including normal TAB characters within the cells
>>>>>> (!!( and thought <GS> was a safer bet - though of course nothing is
>>>>>> completely safe. It's then up to the caller to decide whether to do
>>>>>> "replace numtochar(29) with TAB in ...", or do TAB escaping, or whatever
>>>>>> they want.
>>>>>> 
>>>>>>> As for the other bigger problem .... Oh dear = CR vs LF vs CRLF ....
>>>>>>> 
>>>>>>> Are you on Mac or Windows or Linux ?
>>>>>>> How is the LF delimited data getting into your app ?
>>>>>>> Maybe we should just add a "replace chartonum(13) with CR in pData" ?
>>>>>>> 
>>>>>>> (I confess to being confused by this - I know that LC does
>>>>>>> 
>>>>>> auto-translation of line delimiters at various places, but I'm not sure
>>>>>> when it is, or isn't, completely safe. Maybe the easiest thing is to
>>>>>> jst do
>>>>>> all the translations ....
>>>>>> 
>>>>>>>   replace CRLF with CR in pData
>>>>>>>   replace numtochar(10) with CR in pData
>>>>>>>   replace numtochar(13) with CR in pData
>>>>>>> 
>>>>>>> -- Alex.
>>>>>>> 
>>>>>> _______________________________________________
>>>>>> use-livecode mailing list
>>>>>> use-livecode at lists.runrev.com
>>>>>> Please visit this url to subscribe, unsubscribe and manage your
>>>>>> subscription preferences:
>>>>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>>>>> 
>>>>>> 
>>>>> 
>>>> _______________________________________________
>>>> use-livecode mailing list
>>>> use-livecode at lists.runrev.com
>>>> Please visit this url to subscribe, unsubscribe and manage your
>>>> subscription preferences:
>>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>>> 
>>> 
>>> _______________________________________________
>>> use-livecode mailing list
>>> use-livecode at lists.runrev.com
>>> Please visit this url to subscribe, unsubscribe and manage your
>>> subscription preferences:
>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>> 
>> 
>> 
> 
> 
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode





More information about the use-livecode mailing list