CSV again.

Alex Tweedly alex at tweedly.net
Mon Oct 19 19:35:25 EDT 2015


On 19/10/2015 02:52, Mike Kerner wrote:
> Well, there goes that idea.  There are tutorials right on Git, but it might
> be easier if you (and anyone else so not-inclined to Git) post here and
> those of us who are at least inclined to try will make do with doing that
> work for you.
>

OK, OK, I know I need to learn Git / github - and I will soon - but just 
not today. I looked at some of the tutorials, and decided they would 
take a small amount of time. But, I have between 1/2 and one hour or so 
to work on my favourite hobby - Livecode - and I'd rather spend it 
updating my CSV script than learning a tutorial that I probably won't 
have time to complete.

Yes - your change for the "no trailing CR' case is better than mine - 
there's no need to test, just change it.

However, I later decided that that wasn't the best approach .... in 
conjunction with another change.

The various version of the script all have some initial replacements, like

       -- Normalize line endings:
       replace crlf with cr in pData          -- Win to UNIX
       replace numtochar(13) with cr in pData -- Mac to UNIX

I put these in initially because I didn't fully understand how Runtime 
Revolution handled these (what can I say, I'd only been a RR user for a 
couple of weeks at the time :-) :-).   I now believe that, so long as 
the data came from sensible place (i.e. a file, or a web site, or a 
database) and was pulled in in some sensible way (i.e. put URL 
"file:....") or equivalent, then this is a non-issue. Otherwise, every 
real script that handled data would have this kind of thing in it - and 
they don't.   So - I think the 'replace' statements can be removed.

Once they are out, then we see that "pData" is a read-only parameter, 
until we add this extra CR. Since a large part of the initial purpose 
was to be efficient (in CPU and in memory usage) so we can handle 
*large* datasets, it would be desirable to keep pData as read-only, 
hence avoiding both a memory copy and the additional memory used. So 
instead of adding a CR, we can instead do that by checking just after 
the loop whether or not the situation exists, and handling it there.

So - given those two ideas, plus the need to parameterize,  I upgraded 
the code to
  - not do the initial replacements
  - be fully parameterized for input delimiters
  - be fully parameterized for TAB or CR characters within quoted cells
  - and do all the quote- replacement, etc.
( see code below )

I then tested three versions of code
  - the earlier csvToTab2 (i.e. adding the CR at the end)
  - this new version (called csvToTab3
  - Peter's csvToArray
against 3 input datasets - the two from Richard's article, plus one 
testing the case of no trailing CR.

Fortunately, all 3 produce equivalent output (not identical, since 
Peter's produces an array, doesn't remove quotes in cells and doesn't do 
the same things with enclosed CR and TABs - but equivalent.

I also added to my test script the ability to choose how many copies of 
the input data file to put into the variable before calling each 
function - to allow simple benchmarking. (All the code for the functions 
and the test button is below)

With that we get (remember - equivalent results)

1 copy of data (sample 1 from Richard - 7 lines, 370 chars)

csvToTab2         0 msecs
csvToTab3         0 msecs
csvToArray   6539 msecs

20,000 copies of the data ( - 140,000 lines, 7.4Mb)
csvToTab2       690 msecs
csvToTab3       566 msecs
csvToArray       not tested


-- Alex.

Code for the test button

on mouseUp
    local tChosenFile

    put empty into msg
    answer file "CSV file to process"
    if the result is not "Cancel" then
       put it into tChosenFile
    else
       exit mouseUp
    end if

    local tmp, t1, tmp1
    put URL ("file:" & tChosenFile) into tmp1
    put the number of chars in tmp1 & CR & tmp1 & CR after msg

    local tTimes
    ask "How many multiples" with 1
    put it into tTimes
    repeat tTimes
       put tmp1 after tmp
    end repeat

    local time1
    put the millisecs into time1
    put csvToTab2(tmp) into t1
    put "Version 2 took" &&  the millisecs - time1 &CR after msg
    if tTimes = 1 then
       replace numtochar(29) with "<GS>" in t1
       replace numtochar(11) with "<VT>" in t1
       replace TAB with "<TAB>" in t1
       put "[" & t1 & "]" & CR & CR  after msg
    end if

    put the millisecs into time1
    put csvToTab3(tmp) into t1
    put "Version 3 took" &&  the millisecs - time1 &CR after msg
    if tTimes = 1 then
       replace numtochar(29) with "<GS>" in t1
       replace numtochar(11) with "<VT>" in t1
       replace TAB with "<TAB>" in t1
       put "[" & t1 & "]" & CR & CR  after msg
    end if


    put empty into tA
    put the millisecs into time1
    if tTimes = 1 then
       put csvToArray(tmp) into tA
       put "Version Array took" &&  the millisecs - time1 &CR after msg
       repeat for each key K in tA
          repeat for each key KK in tA[K]
             put K && KK && tA[K][KK] &CR after msg
          end repeat
       end repeat
    end if
end mouseUp





Code for the functions
function CSVToTab3 pData, pOldLineDelim, pOldItemDelim, pNewCR, pNewTAB
    -- fill in defaults
    if pOldLineDelim is empty then put CR into pOldLineDelim
    if pOldItemDelim is empty then put COMMA into pOldItemDelim
    if pNewCR is empty then put numtochar(11) into pNewCR   -- Use <VT> 
for quoted CRs
    if pNewTAB is empty then put numtochar(29) into pNewTAB      -- Use 
<GS> (group separator) for quoted TABs

    local tNuData                         -- contains tabbed copy of data

    local tStatus, theInsideStringSoFar

    -- Normalize line endings: REMOVED
    -- Will normally be correct already, only binfile: or similar should 
make this necessary
    -- and that exceptional case should be the caller's responsibility

    put "outside" into tStatus
    set the itemdel to quote
    repeat for each item k in pData
       -- put tStatus && k & CR after msg
       switch tStatus

          case "inside"
             put k after theInsideStringSoFar
             put "passedquote" into tStatus
             next repeat

          case "passedquote"
             -- decide if it was a duplicated escapedQuote or a closing 
quote
             if k is empty then   -- it's a duplicated quote
                put quote after theInsideStringSoFar
                put "inside" into tStatus
                next repeat
             end if
             -- not empty - so we remain inside the cell, though we have 
left the quoted section
             -- NB this allows for quoted sub-strings within the cell 
content !!
             replace pOldLineDelim with pNewCR in theInsideStringSoFar
             -- replace pOldItemDelim with pNewTAB in theInsideStringSoFar
             put theInsideStringSoFar after tNuData

          case "outside"
             replace pOldItemDelim with TAB in k
             -- and deal with the "empty trailing item" issue in Livecode
             replace (pNewTAB & pOldLineDelim) with pNewTAB & pNewTAB & 
CR in k
             put k after tNuData
             put "inside" into tStatus
             put empty into theInsideStringSoFar
             next repeat
          default
             put "defaulted"
             break
       end switch
    end repeat

    -- and finally deal with the trailing item isse in input data
    -- i.e. the very last char is a quote, so there is no trigger to 
flush the
    --      last item
    if the last char of pData = quote then
       put theInsideStringSoFar after tNuData
    end if

    return tNuData
end CSVToTab3



function CSVToTab2 pData, pcoldelim
    local tNuData                         -- contains tabbed copy of data
    local tReturnPlaceholder         -- replaces cr in field data to 
avoid line
    --      breaks which would be misread as records;
    local tNuDelim                        -- new character to replace 
the delimiter
    local tNeedToRemoveFinalCR  -- boolean to remove CR if added to 
ensure final token is processed
    local tStatus, theInsideStringSoFar
    --
    put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
    put numtochar(29) into tNuDelim               -- use <GS> internally
    --
    if pcoldelim is empty then put comma into pcoldelim
    -- Normalize line endings:
    replace crlf with cr in pData                   -- Win to UNIX
    replace numtochar(13) with cr in pData  -- Mac to UNIX

    if the last char of pData <> CR then
       put CR after pData
       put TRUE into tNeedToRemoveFinalCR
    end if

    put "outside" into tStatus
    set the itemdel to quote
    repeat for each item k in pData
       -- put tStatus && k & CR after msg
       switch tStatus

          case "inside"
             put k after theInsideStringSoFar
             put "passedquote" into tStatus
             next repeat

          case "passedquote"
             -- decide if it was a duplicated escapedQuote or a closing 
quote
             if k is empty then   -- it's a duplicated quote
                put quote after theInsideStringSoFar
                put "inside" into tStatus
                next repeat
             end if
             -- not empty - so we remain inside the cell, though we have 
left the quoted section
             -- NB this allows for quoted sub-strings within the cell 
content !!
             replace cr with tReturnPlaceholder in theInsideStringSoFar
             put theInsideStringSoFar after tNuData

          case "outside"
             replace pcoldelim with tNuDelim in k
             -- and deal with the "empty trailing item" issue in Livecode
             replace (tNuDelim & CR) with tNuDelim & tNuDelim & CR in k
             put k after tNuData
             put "inside" into tStatus
             put empty into theInsideStringSoFar
             next repeat
          default
             put "defaulted"
             break
       end switch
    end repeat

    if tNeedToRemoveFinalCR and the last char of tNuData = CR then
       delete the last char of tNuData
    end if

    -- OPTIONAL - if the data may contain enclosed TABs,
    --     remove this next line to keep <GS> as separator
    replace numtochar(29) with TAB in tNuData

    return tNuData
end CSVToTab2



function offsets str, pContainer
    -- returns a comma-delimited list of all the offsets of str in 
pContainer
    -- returns 0 if not found
    -- note: offsets("xx","xxxxxx") returns "1,3,5" not "1,2,3,4,5"
    --     ie, overlapping offsets are not counted
    -- note: to get the last occurrence of a string in a container 
(often useful)
    --     use "item -1 of offsets(...)"
    -- by Peter M. Brigham, pmbrig at gmail.com — freeware

    if str is not in pContainer then return 0
    put 0 into startPoint
    repeat
       put offset(str,pContainer,startPoint) into thisOffset
       if thisOffset = 0 then exit repeat
       add thisOffset to startPoint
       put startPoint & comma after offsetList
       add length(str)-1 to startPoint
    end repeat
    return item 1 to -1 of offsetList -- delete trailing comma
end offsets

function CSVtoArray pData
    -- by Peter M. Brigham, pmbrig at gmail.com
    -- requires getDelimiters(), howmany()
    put getDelimiters(pData,5) into tDelims
    put line 1 of tDelims into crChar
    put line 2 of tDelims into tabChar
    put line 3 of tDelims into commaChar
    put line 4 of tDelims into openQuoteChar
    put line 5 of tDelims into closeQuoteChar

    replace crlf with cr in pData          -- Win to UNIX
    replace numtochar(13) with cr in pData -- Mac to UNIX

    if howmany(quote,pData) mod 2 = 1 then
       return "This CSV data is not parsable (unclosed quotes in data)."
    end if

    put offsets(quote,pData) into qOffsets
    if qOffsets > 0 then
       put 1 into counter
       repeat for each item q in qOffsets
          if counter mod 2 = 1 then put openQuoteChar into char q of pData
          else put closeQuoteChar into char q of pData
          add 1 to counter
       end repeat
    end if

    put offsets(cr,pData) into crOffsets
    repeat for each item r in crOffsets
       put char 1 to r of pData into upToHere
       if howmany(openQuoteChar,upToHere) <> 
howmany(closeQuoteChar,upToHere) then
          -- the cr is within a quoted string
          put crChar into char r of pData
       end if
    end repeat
    put offsets(tab,pData) into tabOffsets
    repeat for each item t in tabOffsets
       put char 1 to t of pData into upToHere
       if howmany(openQuoteChar,upToHere) <> 
howmany(closeQuoteChar,upToHere) then
          -- the tab is within a quoted string
          put tabChar into char t of pData
       end if
    end repeat
    put offsets(comma,pData) into commaOffsets
    repeat for each item c in commaOffsets
       put char 1 to c of pData into upToHere
       if howmany(openQuoteChar,upToHere) <> 
howmany(closeQuoteChar,upToHere) then
          -- the comma is within a quoted string
          put commaChar into char c of pData
       end if
    end repeat
    put 0 into lineCounter
    repeat for each line L in pData
       add 1 to lineCounter
       put 0 into itemCounter
       repeat for each item i in L
          add 1 to itemCounter
          put i into thisItem
          if howmany(quote,thisItem) mod 2 = 1 then
             return "This CSV data is not parsable (unclosed quotes in 
item)."
          end if
          replace crChar with cr in thisItem
          replace tabChar with tab in thisItem
          replace commaChar with comma in thisItem
          replace openQuoteChar with quote in thisItem
          replace closeQuoteChar with quote in thisItem
          put thisItem into A[lineCounter][itemCounter]
       end repeat
    end repeat
    return A
end CSVtoArray

function getDelimiters pText, nbr
    -- returns a cr-delimited list of <nbr> characters
    --    not found in the variable pText
    -- use for delimiters for, eg, parsing text files, manipulating 
arrays, etc.
    -- usage: put getDelimiters(pText,2) into tDelims
    --        if tDelims begins with "Error" then exit to top -- or whatever
    --        put line 1 of tDelims into lineDivider
    --        put line 2 of tDelims into itemDivider
    --             etc.
    -- by Peter M. Brigham, pmbrig at gmail.com — freeware

    if pText = empty then return "Error: no text specified."
    if nbr = empty then put 1 into nbr -- default 1 delimiter
    put "2,3,4,5,6,7,8,16,17,18,19,20,21,22,23,24,25,26" into baseList
    -- low ASCII values, excluding CR, LF, tab, etc.
    put the number of items of baseList into maxNbr
    if nbr > maxNbr then return "Error: max" && maxNbr && "delimiters."
    repeat with tCount = 1 to nbr
       put true into failed
       repeat with i = 1 to the number of items of baseList
          put item i of baseList into testNbr
          put numtochar(testNbr) into testChar
          if testChar is not in pText then
             -- found one, store and get next delim
             put false into failed
             put testChar into line tCount of delimList
             exit repeat
          end if
       end repeat
       if failed then
          if tCount = 0 then
             return "Error: cannot get any delimiters."
          else if tCount = 1 then
             return "Error: can only get one delimiter."
          else
             return "Error: can only get" && tCount && "delimiters."
          end if
       end if
       delete item i of baseList
    end repeat
    return delimList
end getDelimiters

function howmany pStr, pContainer, pCaseSens
    -- how many times pStr occurs in pContainer
    -- note that howmany("00","000000") returns 3, not 5
    --     ie,  overlapping matches are not counted
    -- by Peter M. Brigham, pmbrig at gmail.com — freeware

    if pCaseSens = empty then put false into pCaseSens
    set the casesensitive to pCaseSens
    if pStr is not in pContainer then return 0
    put len(pContainer) into origLength
    replace pStr with char 2 to -1 of pStr in pContainer
    return origLength - len(pContainer)
end howmany

end of code





More information about the use-livecode mailing list