CSV again.
Alex Tweedly
alex at tweedly.net
Mon Oct 19 19:35:25 EDT 2015
On 19/10/2015 02:52, Mike Kerner wrote:
> Well, there goes that idea. There are tutorials right on Git, but it might
> be easier if you (and anyone else so not-inclined to Git) post here and
> those of us who are at least inclined to try will make do with doing that
> work for you.
>
OK, OK, I know I need to learn Git / github - and I will soon - but just
not today. I looked at some of the tutorials, and decided they would
take a small amount of time. But, I have between 1/2 and one hour or so
to work on my favourite hobby - Livecode - and I'd rather spend it
updating my CSV script than learning a tutorial that I probably won't
have time to complete.
Yes - your change for the "no trailing CR' case is better than mine -
there's no need to test, just change it.
However, I later decided that that wasn't the best approach .... in
conjunction with another change.
The various version of the script all have some initial replacements, like
-- Normalize line endings:
replace crlf with cr in pData -- Win to UNIX
replace numtochar(13) with cr in pData -- Mac to UNIX
I put these in initially because I didn't fully understand how Runtime
Revolution handled these (what can I say, I'd only been a RR user for a
couple of weeks at the time :-) :-). I now believe that, so long as
the data came from sensible place (i.e. a file, or a web site, or a
database) and was pulled in in some sensible way (i.e. put URL
"file:....") or equivalent, then this is a non-issue. Otherwise, every
real script that handled data would have this kind of thing in it - and
they don't. So - I think the 'replace' statements can be removed.
Once they are out, then we see that "pData" is a read-only parameter,
until we add this extra CR. Since a large part of the initial purpose
was to be efficient (in CPU and in memory usage) so we can handle
*large* datasets, it would be desirable to keep pData as read-only,
hence avoiding both a memory copy and the additional memory used. So
instead of adding a CR, we can instead do that by checking just after
the loop whether or not the situation exists, and handling it there.
So - given those two ideas, plus the need to parameterize, I upgraded
the code to
- not do the initial replacements
- be fully parameterized for input delimiters
- be fully parameterized for TAB or CR characters within quoted cells
- and do all the quote- replacement, etc.
( see code below )
I then tested three versions of code
- the earlier csvToTab2 (i.e. adding the CR at the end)
- this new version (called csvToTab3
- Peter's csvToArray
against 3 input datasets - the two from Richard's article, plus one
testing the case of no trailing CR.
Fortunately, all 3 produce equivalent output (not identical, since
Peter's produces an array, doesn't remove quotes in cells and doesn't do
the same things with enclosed CR and TABs - but equivalent.
I also added to my test script the ability to choose how many copies of
the input data file to put into the variable before calling each
function - to allow simple benchmarking. (All the code for the functions
and the test button is below)
With that we get (remember - equivalent results)
1 copy of data (sample 1 from Richard - 7 lines, 370 chars)
csvToTab2 0 msecs
csvToTab3 0 msecs
csvToArray 6539 msecs
20,000 copies of the data ( - 140,000 lines, 7.4Mb)
csvToTab2 690 msecs
csvToTab3 566 msecs
csvToArray not tested
-- Alex.
Code for the test button
on mouseUp
local tChosenFile
put empty into msg
answer file "CSV file to process"
if the result is not "Cancel" then
put it into tChosenFile
else
exit mouseUp
end if
local tmp, t1, tmp1
put URL ("file:" & tChosenFile) into tmp1
put the number of chars in tmp1 & CR & tmp1 & CR after msg
local tTimes
ask "How many multiples" with 1
put it into tTimes
repeat tTimes
put tmp1 after tmp
end repeat
local time1
put the millisecs into time1
put csvToTab2(tmp) into t1
put "Version 2 took" && the millisecs - time1 &CR after msg
if tTimes = 1 then
replace numtochar(29) with "<GS>" in t1
replace numtochar(11) with "<VT>" in t1
replace TAB with "<TAB>" in t1
put "[" & t1 & "]" & CR & CR after msg
end if
put the millisecs into time1
put csvToTab3(tmp) into t1
put "Version 3 took" && the millisecs - time1 &CR after msg
if tTimes = 1 then
replace numtochar(29) with "<GS>" in t1
replace numtochar(11) with "<VT>" in t1
replace TAB with "<TAB>" in t1
put "[" & t1 & "]" & CR & CR after msg
end if
put empty into tA
put the millisecs into time1
if tTimes = 1 then
put csvToArray(tmp) into tA
put "Version Array took" && the millisecs - time1 &CR after msg
repeat for each key K in tA
repeat for each key KK in tA[K]
put K && KK && tA[K][KK] &CR after msg
end repeat
end repeat
end if
end mouseUp
Code for the functions
function CSVToTab3 pData, pOldLineDelim, pOldItemDelim, pNewCR, pNewTAB
-- fill in defaults
if pOldLineDelim is empty then put CR into pOldLineDelim
if pOldItemDelim is empty then put COMMA into pOldItemDelim
if pNewCR is empty then put numtochar(11) into pNewCR -- Use <VT>
for quoted CRs
if pNewTAB is empty then put numtochar(29) into pNewTAB -- Use
<GS> (group separator) for quoted TABs
local tNuData -- contains tabbed copy of data
local tStatus, theInsideStringSoFar
-- Normalize line endings: REMOVED
-- Will normally be correct already, only binfile: or similar should
make this necessary
-- and that exceptional case should be the caller's responsibility
put "outside" into tStatus
set the itemdel to quote
repeat for each item k in pData
-- put tStatus && k & CR after msg
switch tStatus
case "inside"
put k after theInsideStringSoFar
put "passedquote" into tStatus
next repeat
case "passedquote"
-- decide if it was a duplicated escapedQuote or a closing
quote
if k is empty then -- it's a duplicated quote
put quote after theInsideStringSoFar
put "inside" into tStatus
next repeat
end if
-- not empty - so we remain inside the cell, though we have
left the quoted section
-- NB this allows for quoted sub-strings within the cell
content !!
replace pOldLineDelim with pNewCR in theInsideStringSoFar
-- replace pOldItemDelim with pNewTAB in theInsideStringSoFar
put theInsideStringSoFar after tNuData
case "outside"
replace pOldItemDelim with TAB in k
-- and deal with the "empty trailing item" issue in Livecode
replace (pNewTAB & pOldLineDelim) with pNewTAB & pNewTAB &
CR in k
put k after tNuData
put "inside" into tStatus
put empty into theInsideStringSoFar
next repeat
default
put "defaulted"
break
end switch
end repeat
-- and finally deal with the trailing item isse in input data
-- i.e. the very last char is a quote, so there is no trigger to
flush the
-- last item
if the last char of pData = quote then
put theInsideStringSoFar after tNuData
end if
return tNuData
end CSVToTab3
function CSVToTab2 pData, pcoldelim
local tNuData -- contains tabbed copy of data
local tReturnPlaceholder -- replaces cr in field data to
avoid line
-- breaks which would be misread as records;
local tNuDelim -- new character to replace
the delimiter
local tNeedToRemoveFinalCR -- boolean to remove CR if added to
ensure final token is processed
local tStatus, theInsideStringSoFar
--
put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
put numtochar(29) into tNuDelim -- use <GS> internally
--
if pcoldelim is empty then put comma into pcoldelim
-- Normalize line endings:
replace crlf with cr in pData -- Win to UNIX
replace numtochar(13) with cr in pData -- Mac to UNIX
if the last char of pData <> CR then
put CR after pData
put TRUE into tNeedToRemoveFinalCR
end if
put "outside" into tStatus
set the itemdel to quote
repeat for each item k in pData
-- put tStatus && k & CR after msg
switch tStatus
case "inside"
put k after theInsideStringSoFar
put "passedquote" into tStatus
next repeat
case "passedquote"
-- decide if it was a duplicated escapedQuote or a closing
quote
if k is empty then -- it's a duplicated quote
put quote after theInsideStringSoFar
put "inside" into tStatus
next repeat
end if
-- not empty - so we remain inside the cell, though we have
left the quoted section
-- NB this allows for quoted sub-strings within the cell
content !!
replace cr with tReturnPlaceholder in theInsideStringSoFar
put theInsideStringSoFar after tNuData
case "outside"
replace pcoldelim with tNuDelim in k
-- and deal with the "empty trailing item" issue in Livecode
replace (tNuDelim & CR) with tNuDelim & tNuDelim & CR in k
put k after tNuData
put "inside" into tStatus
put empty into theInsideStringSoFar
next repeat
default
put "defaulted"
break
end switch
end repeat
if tNeedToRemoveFinalCR and the last char of tNuData = CR then
delete the last char of tNuData
end if
-- OPTIONAL - if the data may contain enclosed TABs,
-- remove this next line to keep <GS> as separator
replace numtochar(29) with TAB in tNuData
return tNuData
end CSVToTab2
function offsets str, pContainer
-- returns a comma-delimited list of all the offsets of str in
pContainer
-- returns 0 if not found
-- note: offsets("xx","xxxxxx") returns "1,3,5" not "1,2,3,4,5"
-- ie, overlapping offsets are not counted
-- note: to get the last occurrence of a string in a container
(often useful)
-- use "item -1 of offsets(...)"
-- by Peter M. Brigham, pmbrig at gmail.com — freeware
if str is not in pContainer then return 0
put 0 into startPoint
repeat
put offset(str,pContainer,startPoint) into thisOffset
if thisOffset = 0 then exit repeat
add thisOffset to startPoint
put startPoint & comma after offsetList
add length(str)-1 to startPoint
end repeat
return item 1 to -1 of offsetList -- delete trailing comma
end offsets
function CSVtoArray pData
-- by Peter M. Brigham, pmbrig at gmail.com
-- requires getDelimiters(), howmany()
put getDelimiters(pData,5) into tDelims
put line 1 of tDelims into crChar
put line 2 of tDelims into tabChar
put line 3 of tDelims into commaChar
put line 4 of tDelims into openQuoteChar
put line 5 of tDelims into closeQuoteChar
replace crlf with cr in pData -- Win to UNIX
replace numtochar(13) with cr in pData -- Mac to UNIX
if howmany(quote,pData) mod 2 = 1 then
return "This CSV data is not parsable (unclosed quotes in data)."
end if
put offsets(quote,pData) into qOffsets
if qOffsets > 0 then
put 1 into counter
repeat for each item q in qOffsets
if counter mod 2 = 1 then put openQuoteChar into char q of pData
else put closeQuoteChar into char q of pData
add 1 to counter
end repeat
end if
put offsets(cr,pData) into crOffsets
repeat for each item r in crOffsets
put char 1 to r of pData into upToHere
if howmany(openQuoteChar,upToHere) <>
howmany(closeQuoteChar,upToHere) then
-- the cr is within a quoted string
put crChar into char r of pData
end if
end repeat
put offsets(tab,pData) into tabOffsets
repeat for each item t in tabOffsets
put char 1 to t of pData into upToHere
if howmany(openQuoteChar,upToHere) <>
howmany(closeQuoteChar,upToHere) then
-- the tab is within a quoted string
put tabChar into char t of pData
end if
end repeat
put offsets(comma,pData) into commaOffsets
repeat for each item c in commaOffsets
put char 1 to c of pData into upToHere
if howmany(openQuoteChar,upToHere) <>
howmany(closeQuoteChar,upToHere) then
-- the comma is within a quoted string
put commaChar into char c of pData
end if
end repeat
put 0 into lineCounter
repeat for each line L in pData
add 1 to lineCounter
put 0 into itemCounter
repeat for each item i in L
add 1 to itemCounter
put i into thisItem
if howmany(quote,thisItem) mod 2 = 1 then
return "This CSV data is not parsable (unclosed quotes in
item)."
end if
replace crChar with cr in thisItem
replace tabChar with tab in thisItem
replace commaChar with comma in thisItem
replace openQuoteChar with quote in thisItem
replace closeQuoteChar with quote in thisItem
put thisItem into A[lineCounter][itemCounter]
end repeat
end repeat
return A
end CSVtoArray
function getDelimiters pText, nbr
-- returns a cr-delimited list of <nbr> characters
-- not found in the variable pText
-- use for delimiters for, eg, parsing text files, manipulating
arrays, etc.
-- usage: put getDelimiters(pText,2) into tDelims
-- if tDelims begins with "Error" then exit to top -- or whatever
-- put line 1 of tDelims into lineDivider
-- put line 2 of tDelims into itemDivider
-- etc.
-- by Peter M. Brigham, pmbrig at gmail.com — freeware
if pText = empty then return "Error: no text specified."
if nbr = empty then put 1 into nbr -- default 1 delimiter
put "2,3,4,5,6,7,8,16,17,18,19,20,21,22,23,24,25,26" into baseList
-- low ASCII values, excluding CR, LF, tab, etc.
put the number of items of baseList into maxNbr
if nbr > maxNbr then return "Error: max" && maxNbr && "delimiters."
repeat with tCount = 1 to nbr
put true into failed
repeat with i = 1 to the number of items of baseList
put item i of baseList into testNbr
put numtochar(testNbr) into testChar
if testChar is not in pText then
-- found one, store and get next delim
put false into failed
put testChar into line tCount of delimList
exit repeat
end if
end repeat
if failed then
if tCount = 0 then
return "Error: cannot get any delimiters."
else if tCount = 1 then
return "Error: can only get one delimiter."
else
return "Error: can only get" && tCount && "delimiters."
end if
end if
delete item i of baseList
end repeat
return delimList
end getDelimiters
function howmany pStr, pContainer, pCaseSens
-- how many times pStr occurs in pContainer
-- note that howmany("00","000000") returns 3, not 5
-- ie, overlapping matches are not counted
-- by Peter M. Brigham, pmbrig at gmail.com — freeware
if pCaseSens = empty then put false into pCaseSens
set the casesensitive to pCaseSens
if pStr is not in pContainer then return 0
put len(pContainer) into origLength
replace pStr with char 2 to -1 of pStr in pContainer
return origLength - len(pContainer)
end howmany
end of code
More information about the use-livecode
mailing list