CSV again.

Peter Haworth pete at lcsql.com
Tue May 15 13:51:06 EDT 2012


Thanks Alex, all good points.

I'm still trying to figure out why the program that created the csv file
used this problematic string since it only happened for one cell - all
other empty cells simply had two consecutive commas. Nevertheless, the
other cases you cited are definitely valid so I guess the function will
need to handle them.

As for performance, it's obviously good to do the parsing as efficiently as
possible but my use of the function is to use its output to issue INSERT
statements against an sqlite database.  So we're talking milliseconds for
the parsing vs seconds (or maybe even minutes depending on how much data is
involved) for the INSERT command.  I'd be fine with the parsing taking
longer to handle more corner cases.

Not really anything to do with the parsing but I'm also facing another
issue in this context and that's csv files that are too large to read
completely into memory in one go.  I have one guy who wants to import a 44
gigabyte file!

I'll probably have to implement some sort of mechanism for reading in a
given number of lines.  BUT..  a carriage return in the middle of a quoted
cell will be taken by the read for x lines command to be the end of a line
so I could end up with a partial line in my read buffer.

I may end up just declaring a maximum file size in the documentation and
leaving it up to the user to break up the file into multiple smaller files.

Thanks for your help on this Alex, much appreciated.


Pete
lcSQL Software <http://www.lcsql.com>



On Tue, May 15, 2012 at 10:02 AM, Alex Tweedly <alex at tweedly.net> wrote:

> Unfortunately, that's not enough to fix it, Peter.
>
> The problem case you have identified is where the CSV exporter has decided
> to quote even empty cells. This wasn't covered in the original samples, or
> in any cases I've had to deal with.
>
> Your workaround uses the sequence <comma & quote & quote & comma> to
> attempt to identify this case - but that only identifies it when it occurs
> in the "interior" cells within a record (line). You'd need to extend it to
> also cover the first cell in the line -
>  i.e. <CR & quote & quote & comma>
> and the last cell on the line
>  i.e. <comma & quote & quote & CR>
> and even the *only* cell on the line
>  i.e. <CR & quote & quote & CR>
>
> and then subsequently un-replace each of those appropriately.
>
> BUT - there's an even worse problem - any of these sequences *can* occur
> within a quoted string - e.g.   abc,"this cell contains an escaped quote
> ,"", within it", another cell
>
> Basically - the original idea ONLY works if the only time two quotes
> appear as consecutive characters is as an escaped quote within a quoted
> cell.    (hmmm - that means there is another nasty corner case - where the
> escaped quote appears as the first character within a quoted cell, e.g.
> abc,"""quoted string""",def !!)
>
> Fixing this is going to require checking for the doubled quote and acting
> differently within the loop that alternates between 'inside' and 'outside'
> quoted cells; and of course that alternation depends on the discovery of
> quotes (and hence needs to look-ahead at subsequent characters to detect
> the doubled cases.
>
> I'll have a go at re-writing it using that method - but it is basically a
> re-write from scratch, so it may take an hour or two to make sure I've got
> all the cases covered (and I don't yet have any prediction about the
> performance).
>
> If you could send me your test data off-list that would be helpful.
>
> Thanks
> -- Alex.
>
>
> On 15/05/2012 02:00, Peter Haworth wrote:
>
>> Hi Alex,
>> Just toi clat=rify, this was two double quotes with a comma right before
>> and right after them, not an escaped double quote in the middle of string.
>>
>> I've made a fix to this which works, subject to your approval
>>
>> I changed the line:
>>
>> *replace* quote&quote with tEscapedQuotePlaceholder in pData
>>
>>
>> to these three lines:
>>
>>
>> *replace* comma&  quote&  quote&  comma with numToChar(31) in pData
>>
>> *replace* quote&quote with tEscapedQuotePlaceholder in pData
>>
>> *replace* numToChar(31) with comma&  quote&  quote&  comma in pData
>>
>>
>>
>> That seems to have fixed it.
>>
>>
>> Pete
>> lcSQL Software<http://www.lcsql.com>
>>
>>
>>
>>
>> On Mon, May 14, 2012 at 2:50 PM, Peter Haworth<pete at lcsql.com>  wrote:
>>
>>  However, I have found another corner case and that is two consecutive
>>> double quote characters with no intervening characters.  I'm still
>>> checking
>>> into it for sure, but it looks like what happens with that after running
>>> it
>>> through your function is a single quote character.  Any thoughts on that?
>>>
>> ______________________________**_________________
>>
>> use-livecode mailing list
>> use-livecode at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your
>> subscription preferences:
>> http://lists.runrev.com/**mailman/listinfo/use-livecode<http://lists.runrev.com/mailman/listinfo/use-livecode>
>>
>>
>
> ______________________________**_________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/**mailman/listinfo/use-livecode<http://lists.runrev.com/mailman/listinfo/use-livecode>
>



More information about the use-livecode mailing list