CSV again.
Alex Tweedly
alex at tweedly.net
Sat Oct 17 20:47:25 EDT 2015
Ummm .... surely at this point
....
repeat for each item i in L
add 1 to itemCounter
put i into thisItem
if howmany(quote,thisItem) mod 2 = 1 then
return "This CSV data is not parsable (unclosed quotes in item)."
end if
...
howmany(quote,thisItem) must be 0 - all quotes have been replaced by
either openQuoteChar or closeQuoteChar
Shouldn't this test be
if howmany(openQuoteChar, thisItem) <> howmany(closeUqoteChar,
thisItem) then
Also, I think (i.e. I haven't yet run the code, since I don't have
offsets() available) there is another mis-formed case you don't properly
detect :
a,b,c,"def"""g"h",i,j,k
The quoted cell contains the right number (i.e. a multiple of 2) of
quotes, but they are not suitably adjacent, so they can't be properly
interpreted as paired 'enclosed quotes'. (I should say, none of the
earlier versions detect this either - their intent was to make the best
feasible result from well-formed data, and not to detect all malformed
cases - but if this version is going to detect and give error returns
for error inputs in some cases, then we should try to do it fully).
-- Alex.
On 18/10/2015 00:41, Peter M. Brigham wrote:
> So here's my attempt. It converts a CVS text to an array. Let's see if there's csv data that can break it.
>
> -- Peter
>
> Peter M. Brigham
> pmbrig at gmail.com
> http://home.comcast.net/~pmbrig
>
> -------
>
> function CSVtoArray pData
> -- by Peter M. Brigham, pmbrig at gmail.com
> -- requires getDelimiters(), howmany()
> put getDelimiters(pData,5) into tDelims
> put line 1 of tDelims into crChar
> put line 2 of tDelims into tabChar
> put line 3 of tDelims into commaChar
> put line 4 of tDelims into openQuoteChar
> put line 5 of tDelims into closeQuoteChar
>
> replace crlf with cr in pData -- Win to UNIX
> replace numtochar(13) with cr in pData -- Mac to UNIX
>
> if howmany(quote,pData) mod 2 = 1 then
> return "This CSV data is not parsable (unclosed quotes in data)."
> end if
>
> put offsets(quote,pData) into qOffsets
> if qOffsets > 0 then
> put 1 into counter
> repeat for each item q in qOffsets
> if counter mod 2 = 1 then put openQuoteChar into char q of pData
> else put closeQuoteChar into char q of pData
> add 1 to counter
> end repeat
> end if
>
> put offsets(cr,pData) into crOffsets
> repeat for each item r in crOffsets
> put char 1 to r of pData into upToHere
> if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
> -- the cr is within a quoted string
> put crChar into char r of pData
> end if
> end repeat
> put offsets(tab,pData) into tabOffsets
> repeat for each item t in tabOffsets
> put char 1 to t of pData into upToHere
> if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
> -- the tab is within a quoted string
> put tabChar into char t of pData
> end if
> end repeat
> put offsets(comma,pData) into commaOffsets
> repeat for each item c in commaOffsets
> put char 1 to c of pData into upToHere
> if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
> -- the comma is within a quoted string
> put commaChar into char c of pData
> end if
> end repeat
> put 0 into lineCounter
> repeat for each line L in pData
> add 1 to lineCounter
> put 0 into itemCounter
> repeat for each item i in L
> add 1 to itemCounter
> put i into thisItem
> if howmany(quote,thisItem) mod 2 = 1 then
> return "This CSV data is not parsable (unclosed quotes in item)."
> end if
> replace crChar with cr in thisItem
> replace tabChar with tab in thisItem
> replace commaChar with comma in thisItem
> replace openQuoteChar with quote in thisItem
> replace closeQuoteChar with quote in thisItem
> put thisItem into A[lineCounter][itemCounter]
> end repeat
> end repeat
> return A
> end CSVtoArray
>
> function getDelimiters pText, nbr
> -- returns a cr-delimited list of <nbr> characters
> -- not found in the variable pText
> -- use for delimiters for, eg, parsing text files, manipulating arrays, etc.
> -- usage: put getDelimiters(pText,2) into tDelims
> -- if tDelims begins with "Error" then exit to top -- or whatever
> -- put line 1 of tDelims into lineDivider
> -- put line 2 of tDelims into itemDivider
> -- etc.
> -- by Peter M. Brigham, pmbrig at gmail.com — freeware
>
> if pText = empty then return "Error: no text specified."
> if nbr = empty then put 1 into nbr -- default 1 delimiter
> put "2,3,4,5,6,7,8,16,17,18,19,20,21,22,23,24,25,26" into baseList
> -- low ASCII values, excluding CR, LF, tab, etc.
> put the number of items of baseList into maxNbr
> if nbr > maxNbr then return "Error: max" && maxNbr && "delimiters."
> repeat with tCount = 1 to nbr
> put true into failed
> repeat with i = 1 to the number of items of baseList
> put item i of baseList into testNbr
> put numtochar(testNbr) into testChar
> if testChar is not in pText then
> -- found one, store and get next delim
> put false into failed
> put testChar into line tCount of delimList
> exit repeat
> end if
> end repeat
> if failed then
> if tCount = 0 then
> return "Error: cannot get any delimiters."
> else if tCount = 1 then
> return "Error: can only get one delimiter."
> else
> return "Error: can only get" && tCount && "delimiters."
> end if
> end if
> delete item i of baseList
> end repeat
> return delimList
> end getDelimiters
>
> function howmany pStr, pContainer, pCaseSens
> -- how many times pStr occurs in pContainer
> -- note that howmany("00","000000") returns 3, not 5
> -- ie, overlapping matches are not counted
> -- by Peter M. Brigham, pmbrig at gmail.com — freeware
>
> if pCaseSens = empty then put false into pCaseSens
> set the casesensitive to pCaseSens
> if pStr is not in pContainer then return 0
> put len(pContainer) into origLength
> replace pStr with char 2 to -1 of pStr in pContainer
> return origLength - len(pContainer)
> end howmany
>
>
> On Oct 17, 2015, at 5:03 AM, Alex Tweedly wrote:
>
>> Naturally it must be removed.
>>
>> But I have a more philosophical issue / question.
>>
>>
>> TSV (in and of itself) doesn't have any quotes, and so doesn't handle quoted CRs or TABs.
>>
>> Currently, the 'old' version - as in Richard's published article, doesn't handle TAB characters enclosed within a quoted cell. The 'new' version does - but only by returning the data delimited by <GS> instead of TAB, and leaving enclosed TABs alone - a mistake, IMHO.
>>
>> I believe that what the converter should do is :
>> - return TSV - i.e. delimited by TABs
>> - replace quoted CR by <VT> within quoted cells (as it does now)
>> - replace quoted TABs by <GS> within quoted cells
>>
>> Any comments or suggestions ?
>>
>> Thanks
>> Alex.
>>
>> On 17/10/2015 02:34, Mike Kerner wrote:
>>> It's safe as long as you remember to remove it at the end of the function
>>>
>>> On Fri, Oct 16, 2015 at 7:12 PM, Alex Tweedly <alex at tweedly.net> wrote:
>>>
>>>> Duh - replying to myself again :-)
>>>>
>>>> It looks as though that's exactly what you do mean - it certainly
>>>> generates the problems you described earlier. And my one-line additional
>>>> test would (does in my testing) solve it properly - without it, we don't
>>>> get a chance to flush "theInsideStringSoFar" to tNuData, with the extra
>>>> line we do. And adding it is always safe (AFAICI).
>>>>
>>>> -- Alex.
>>>>
>>>>
>>>> On 17/10/2015 00:03, Alex Tweedly wrote:
>>>>
>>>>> Sorry, Mike, but can you describe what you mean by a "naked" line ?
>>>>> Is it simply one with no line delimiter after it ?
>>>>> i.e. could only happen on the very last line of a file of input ?
>>>>>
>>>>> Could that be solved by a simple test (after the various 'replace'
>>>>> statements)
>>>>> if the last char of pData <> CR then put CR after pData
>>>>> before the parsing happens ?
>>>>>
>>>>> -- Alex.
>>>>>
>>>>>
>>>>> On 16/10/2015 17:19, Mike Kerner wrote:
>>>>>
>>>>>> No, the problem isn't that LC use LF and CR for ascii(10) and ignores
>>>>>> ascii(13). That's just a personal problem.
>>>>>>
>>>>>> The problem, here, is that the csv parser handles a naked line and a
>>>>>> terminated line differently. If the line is terminated, it parses it one
>>>>>> way, and if it is not, it parses it (incorrectly) a different way, which
>>>>>> makes me wonder if this is the latest version.
>>>>>>
>>>>>> On Fri, Oct 16, 2015 at 11:28 AM, Bob Sneidar <
>>>>>> bobsneidar at iotecdigital.com>
>>>>>> wrote:
>>>>>>
>>>>>> But what if the cr or lf or crlf is inside quoted text, meaning it is not
>>>>>>> a delimiter? Oh, I'm afraid the deflector shield will be quite
>>>>>>> operational
>>>>>>> when your friends arrive.
>>>>>>>
>>>>>>> Bob S
>>>>>>>
>>>>>>>
>>>>>>> On Oct 16, 2015, at 08:04 , Alex Tweedly <alex at tweedly.net> wrote:
>>>>>>>> Hi Mike,
>>>>>>>>
>>>>>>>> thanks for that additional info.
>>>>>>>>
>>>>>>>> I *think* (it's been 3 years) I left them as <GS> (i.e. numtochar(29))
>>>>>>>>
>>>>>>> because I had some data including normal TAB characters within the cells
>>>>>>> (!!( and thought <GS> was a safer bet - though of course nothing is
>>>>>>> completely safe. It's then up to the caller to decide whether to do
>>>>>>> "replace numtochar(29) with TAB in ...", or do TAB escaping, or whatever
>>>>>>> they want.
>>>>>>>
>>>>>>>> As for the other bigger problem .... Oh dear = CR vs LF vs CRLF ....
>>>>>>>>
>>>>>>>> Are you on Mac or Windows or Linux ?
>>>>>>>> How is the LF delimited data getting into your app ?
>>>>>>>> Maybe we should just add a "replace chartonum(13) with CR in pData" ?
>>>>>>>>
>>>>>>>> (I confess to being confused by this - I know that LC does
>>>>>>>>
>>>>>>> auto-translation of line delimiters at various places, but I'm not sure
>>>>>>> when it is, or isn't, completely safe. Maybe the easiest thing is to
>>>>>>> jst do
>>>>>>> all the translations ....
>>>>>>>
>>>>>>>> replace CRLF with CR in pData
>>>>>>>> replace numtochar(10) with CR in pData
>>>>>>>> replace numtochar(13) with CR in pData
>>>>>>>>
>>>>>>>> -- Alex.
>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> use-livecode mailing list
>>>>>>> use-livecode at lists.runrev.com
>>>>>>> Please visit this url to subscribe, unsubscribe and manage your
>>>>>>> subscription preferences:
>>>>>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>>>>>>
>>>>>>>
>>>>> _______________________________________________
>>>>> use-livecode mailing list
>>>>> use-livecode at lists.runrev.com
>>>>> Please visit this url to subscribe, unsubscribe and manage your
>>>>> subscription preferences:
>>>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>>>>
>>>> _______________________________________________
>>>> use-livecode mailing list
>>>> use-livecode at lists.runrev.com
>>>> Please visit this url to subscribe, unsubscribe and manage your
>>>> subscription preferences:
>>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>>>
>>>
>>
>> _______________________________________________
>> use-livecode mailing list
>> use-livecode at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode
>
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
More information about the use-livecode
mailing list