Stupid CSV tricks
Richard Gaskin
ambassador at FourthWorld.com
Fri Jun 14 00:59:01 EDT 2002
Sjoerd Op 't Land wrote:
> What about (this would be a new feature for MetaCard) multi-char
> itemDelimiters?
>
> set itemDel to quote & comma & quote
> repeat for each line m in csvData
> put char 2 to -2 of m into tRecord
> put item 1 of tRecord into ...
> end repeat
That woud work if CSV were consistent. But as implemented in Microsoft
products, CVS puts quotes around text values only; numeric values are
delimited without quotes, e.g.:
"num_hits","page_requested","comments"
2,"/info/index.htm","This is a comment."
5,"index.htm","This is another comment\,
which ilustrates how a single field in CVS
may also contain escaped quotes
and crlf's."
An example of the other complication with CSV is the last item of the last
row: quoted text elements can contain both returns (crlf on Win) and
commas. To distinguish delimiter commas from commas in data, the latter are
escaped with "\".
I wrote a quickie CSV parser to get the job done (copied below), but I can't
help but think there's some clever way to use the matchText or replaceText
functions to speed it up (it processes a megabyte in about 2.8 secs on my
Mac G4/500MHz/OS 9.2.2, and 4.1 secs on a PII/400MHz/WinME).
Suggestions?
--
Richard Gaskin
Fourth World Media Corporation
Developer of WebMerge 2.0: Publish any Database on Any Site
___________________________________________________________
Ambassador at FourthWorld.com http://www.FourthWorld.com
Tel: 323-225-3717 AIM: FourthWorldInc
--====================================================================--
--
-- CSV2Tab
--
-- Takes any comma-delimited or tab-delimited tabular data in which
-- some values may be contained in quotes and others not, and returns
-- a consistently-formatted string with non-data quotes removed and
-- in-data returns replaced with a placeholder which can be replaced
-- with cr's again later as needed for display (ASCII 11 is borrowed
-- from the FileMaker convention for exported data).
--
-- After some initial setup, it walks through each char of the data,
-- keeping track of whether or not it is inside of a quoted string.
-- If so, it collects the data into tmp, and once the closing quote
-- is reached it restores escaped quotes and replaces returns before
-- copying tmp to the output, tNuData.
-- Characters encountered outside of quotes are simply copied to the
-- output, with the exception of commas which are converted to tabs.
--
function CSV2Tab pData
local tNuData -- contains tabbed copy of data
local tReturnPlaceholder -- replaces cr in field data to avoid line
-- breaks which would be misread as records;
-- replaced later during dislay
local tEscapedQuotePlaceholder -- used for keeping track of quotes in data
local tmp -- collection for data between quotes
local tInQuotedText -- flag set while reading data between quotes
--
put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
put numtochar(2) into tEscapedQuotePlaceholder -- used to simplify
-- distinction between quotes in data and those
-- used in delimiters
--
-- Normalize line endings:
replace crlf with cr in pData -- Win to UNIX
replace numtochar(13) with cr in pData -- Mac to UNIX
--
-- Put placeholder in escaped quote (non-delimiter) chars:
replace ("\""e) with tEscapedQuotePlaceholder in pData
--
put empty into tInQuotedText
repeat for each char k in pData
if (tInQuotedText) then
if k = quote then
replace cr with tReturnPlaceholder in tmp
replace tEscapedQuotePlaceholder with quote in tmp
put tmp after tNuData
put empty into tInQuotedText
else
put k after tmp
end if
else if k = quote then
put true into tInQuotedText
put empty into tmp
else if k = comma then
put tab after tNuData
else
put k after tNuData
end if
end repeat
--
return tNuData
end CSV2Tab
More information about the metacard
mailing list