CSV again.
Peter M. Brigham
pmbrig at gmail.com
Sat Oct 17 19:41:05 EDT 2015
So here's my attempt. It converts a CVS text to an array. Let's see if there's csv data that can break it.
-- Peter
Peter M. Brigham
pmbrig at gmail.com
http://home.comcast.net/~pmbrig
-------
function CSVtoArray pData
-- by Peter M. Brigham, pmbrig at gmail.com
-- requires getDelimiters(), howmany()
put getDelimiters(pData,5) into tDelims
put line 1 of tDelims into crChar
put line 2 of tDelims into tabChar
put line 3 of tDelims into commaChar
put line 4 of tDelims into openQuoteChar
put line 5 of tDelims into closeQuoteChar
replace crlf with cr in pData -- Win to UNIX
replace numtochar(13) with cr in pData -- Mac to UNIX
if howmany(quote,pData) mod 2 = 1 then
return "This CSV data is not parsable (unclosed quotes in data)."
end if
put offsets(quote,pData) into qOffsets
if qOffsets > 0 then
put 1 into counter
repeat for each item q in qOffsets
if counter mod 2 = 1 then put openQuoteChar into char q of pData
else put closeQuoteChar into char q of pData
add 1 to counter
end repeat
end if
put offsets(cr,pData) into crOffsets
repeat for each item r in crOffsets
put char 1 to r of pData into upToHere
if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
-- the cr is within a quoted string
put crChar into char r of pData
end if
end repeat
put offsets(tab,pData) into tabOffsets
repeat for each item t in tabOffsets
put char 1 to t of pData into upToHere
if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
-- the tab is within a quoted string
put tabChar into char t of pData
end if
end repeat
put offsets(comma,pData) into commaOffsets
repeat for each item c in commaOffsets
put char 1 to c of pData into upToHere
if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
-- the comma is within a quoted string
put commaChar into char c of pData
end if
end repeat
put 0 into lineCounter
repeat for each line L in pData
add 1 to lineCounter
put 0 into itemCounter
repeat for each item i in L
add 1 to itemCounter
put i into thisItem
if howmany(quote,thisItem) mod 2 = 1 then
return "This CSV data is not parsable (unclosed quotes in item)."
end if
replace crChar with cr in thisItem
replace tabChar with tab in thisItem
replace commaChar with comma in thisItem
replace openQuoteChar with quote in thisItem
replace closeQuoteChar with quote in thisItem
put thisItem into A[lineCounter][itemCounter]
end repeat
end repeat
return A
end CSVtoArray
function getDelimiters pText, nbr
-- returns a cr-delimited list of <nbr> characters
-- not found in the variable pText
-- use for delimiters for, eg, parsing text files, manipulating arrays, etc.
-- usage: put getDelimiters(pText,2) into tDelims
-- if tDelims begins with "Error" then exit to top -- or whatever
-- put line 1 of tDelims into lineDivider
-- put line 2 of tDelims into itemDivider
-- etc.
-- by Peter M. Brigham, pmbrig at gmail.com — freeware
if pText = empty then return "Error: no text specified."
if nbr = empty then put 1 into nbr -- default 1 delimiter
put "2,3,4,5,6,7,8,16,17,18,19,20,21,22,23,24,25,26" into baseList
-- low ASCII values, excluding CR, LF, tab, etc.
put the number of items of baseList into maxNbr
if nbr > maxNbr then return "Error: max" && maxNbr && "delimiters."
repeat with tCount = 1 to nbr
put true into failed
repeat with i = 1 to the number of items of baseList
put item i of baseList into testNbr
put numtochar(testNbr) into testChar
if testChar is not in pText then
-- found one, store and get next delim
put false into failed
put testChar into line tCount of delimList
exit repeat
end if
end repeat
if failed then
if tCount = 0 then
return "Error: cannot get any delimiters."
else if tCount = 1 then
return "Error: can only get one delimiter."
else
return "Error: can only get" && tCount && "delimiters."
end if
end if
delete item i of baseList
end repeat
return delimList
end getDelimiters
function howmany pStr, pContainer, pCaseSens
-- how many times pStr occurs in pContainer
-- note that howmany("00","000000") returns 3, not 5
-- ie, overlapping matches are not counted
-- by Peter M. Brigham, pmbrig at gmail.com — freeware
if pCaseSens = empty then put false into pCaseSens
set the casesensitive to pCaseSens
if pStr is not in pContainer then return 0
put len(pContainer) into origLength
replace pStr with char 2 to -1 of pStr in pContainer
return origLength - len(pContainer)
end howmany
On Oct 17, 2015, at 5:03 AM, Alex Tweedly wrote:
> Naturally it must be removed.
>
> But I have a more philosophical issue / question.
>
>
> TSV (in and of itself) doesn't have any quotes, and so doesn't handle quoted CRs or TABs.
>
> Currently, the 'old' version - as in Richard's published article, doesn't handle TAB characters enclosed within a quoted cell. The 'new' version does - but only by returning the data delimited by <GS> instead of TAB, and leaving enclosed TABs alone - a mistake, IMHO.
>
> I believe that what the converter should do is :
> - return TSV - i.e. delimited by TABs
> - replace quoted CR by <VT> within quoted cells (as it does now)
> - replace quoted TABs by <GS> within quoted cells
>
> Any comments or suggestions ?
>
> Thanks
> Alex.
>
> On 17/10/2015 02:34, Mike Kerner wrote:
>> It's safe as long as you remember to remove it at the end of the function
>>
>> On Fri, Oct 16, 2015 at 7:12 PM, Alex Tweedly <alex at tweedly.net> wrote:
>>
>>> Duh - replying to myself again :-)
>>>
>>> It looks as though that's exactly what you do mean - it certainly
>>> generates the problems you described earlier. And my one-line additional
>>> test would (does in my testing) solve it properly - without it, we don't
>>> get a chance to flush "theInsideStringSoFar" to tNuData, with the extra
>>> line we do. And adding it is always safe (AFAICI).
>>>
>>> -- Alex.
>>>
>>>
>>> On 17/10/2015 00:03, Alex Tweedly wrote:
>>>
>>>> Sorry, Mike, but can you describe what you mean by a "naked" line ?
>>>> Is it simply one with no line delimiter after it ?
>>>> i.e. could only happen on the very last line of a file of input ?
>>>>
>>>> Could that be solved by a simple test (after the various 'replace'
>>>> statements)
>>>> if the last char of pData <> CR then put CR after pData
>>>> before the parsing happens ?
>>>>
>>>> -- Alex.
>>>>
>>>>
>>>> On 16/10/2015 17:19, Mike Kerner wrote:
>>>>
>>>>> No, the problem isn't that LC use LF and CR for ascii(10) and ignores
>>>>> ascii(13). That's just a personal problem.
>>>>>
>>>>> The problem, here, is that the csv parser handles a naked line and a
>>>>> terminated line differently. If the line is terminated, it parses it one
>>>>> way, and if it is not, it parses it (incorrectly) a different way, which
>>>>> makes me wonder if this is the latest version.
>>>>>
>>>>> On Fri, Oct 16, 2015 at 11:28 AM, Bob Sneidar <
>>>>> bobsneidar at iotecdigital.com>
>>>>> wrote:
>>>>>
>>>>> But what if the cr or lf or crlf is inside quoted text, meaning it is not
>>>>>> a delimiter? Oh, I'm afraid the deflector shield will be quite
>>>>>> operational
>>>>>> when your friends arrive.
>>>>>>
>>>>>> Bob S
>>>>>>
>>>>>>
>>>>>> On Oct 16, 2015, at 08:04 , Alex Tweedly <alex at tweedly.net> wrote:
>>>>>>> Hi Mike,
>>>>>>>
>>>>>>> thanks for that additional info.
>>>>>>>
>>>>>>> I *think* (it's been 3 years) I left them as <GS> (i.e. numtochar(29))
>>>>>>>
>>>>>> because I had some data including normal TAB characters within the cells
>>>>>> (!!( and thought <GS> was a safer bet - though of course nothing is
>>>>>> completely safe. It's then up to the caller to decide whether to do
>>>>>> "replace numtochar(29) with TAB in ...", or do TAB escaping, or whatever
>>>>>> they want.
>>>>>>
>>>>>>> As for the other bigger problem .... Oh dear = CR vs LF vs CRLF ....
>>>>>>>
>>>>>>> Are you on Mac or Windows or Linux ?
>>>>>>> How is the LF delimited data getting into your app ?
>>>>>>> Maybe we should just add a "replace chartonum(13) with CR in pData" ?
>>>>>>>
>>>>>>> (I confess to being confused by this - I know that LC does
>>>>>>>
>>>>>> auto-translation of line delimiters at various places, but I'm not sure
>>>>>> when it is, or isn't, completely safe. Maybe the easiest thing is to
>>>>>> jst do
>>>>>> all the translations ....
>>>>>>
>>>>>>> replace CRLF with CR in pData
>>>>>>> replace numtochar(10) with CR in pData
>>>>>>> replace numtochar(13) with CR in pData
>>>>>>>
>>>>>>> -- Alex.
>>>>>>>
>>>>>> _______________________________________________
>>>>>> use-livecode mailing list
>>>>>> use-livecode at lists.runrev.com
>>>>>> Please visit this url to subscribe, unsubscribe and manage your
>>>>>> subscription preferences:
>>>>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>>>>>
>>>>>>
>>>>>
>>>> _______________________________________________
>>>> use-livecode mailing list
>>>> use-livecode at lists.runrev.com
>>>> Please visit this url to subscribe, unsubscribe and manage your
>>>> subscription preferences:
>>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>>>
>>>
>>> _______________________________________________
>>> use-livecode mailing list
>>> use-livecode at lists.runrev.com
>>> Please visit this url to subscribe, unsubscribe and manage your
>>> subscription preferences:
>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>>
>>
>>
>
>
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
More information about the use-livecode
mailing list