Stupid CSV tricks

Richard Gaskin ambassador at FourthWorld.com
Fri Jun 14 00:59:01 EDT 2002


Sjoerd Op 't Land wrote:

> What about (this would be a new feature for MetaCard) multi-char
> itemDelimiters?
> 
> set itemDel to quote & comma & quote
> repeat for each line m in csvData
> put char 2 to -2 of m into tRecord
> put item 1 of tRecord into ...
> end repeat

That woud work if CSV were consistent.  But as implemented in Microsoft
products, CVS puts quotes around text values only; numeric values are
delimited without quotes, e.g.:

  "num_hits","page_requested","comments"
  2,"/info/index.htm","This is a comment."
  5,"index.htm","This is another comment\,
  which ilustrates how a single field in CVS
  may also contain escaped quotes
  and crlf's."

An example of the other complication with CSV is the last item of the last
row:  quoted text elements can contain both returns (crlf on Win) and
commas.  To distinguish delimiter commas from commas in data, the latter are
escaped with "\".

I wrote a quickie CSV parser to get the job done (copied below), but I can't
help but think there's some clever way to use the matchText or replaceText
functions to speed it up (it processes a megabyte in about 2.8 secs on my
Mac G4/500MHz/OS 9.2.2, and 4.1 secs on a PII/400MHz/WinME).
  
Suggestions?

-- 
 Richard Gaskin 
 Fourth World Media Corporation
 Developer of WebMerge 2.0: Publish any Database on Any Site
 ___________________________________________________________
 Ambassador at FourthWorld.com       http://www.FourthWorld.com
 Tel: 323-225-3717                       AIM: FourthWorldInc


--====================================================================--
--
-- CSV2Tab
--
-- Takes any comma-delimited or tab-delimited tabular data in which
-- some values may be contained in quotes and others not, and returns
-- a consistently-formatted string with non-data quotes removed and
-- in-data returns replaced with a placeholder which can be replaced
-- with cr's again later as needed for display (ASCII 11 is borrowed
-- from the FileMaker convention for exported data).
--
-- After some initial setup, it walks through each char of the data,
-- keeping track of whether or not it is inside of a quoted string.
-- If so, it collects the data into tmp, and once the closing quote
-- is reached it restores escaped quotes and replaces returns before
-- copying tmp to the output, tNuData.
-- Characters encountered outside of quotes are simply copied to the
-- output, with the exception of commas which are converted to tabs.
--
function CSV2Tab pData
  local tNuData -- contains tabbed copy of data
  local tReturnPlaceholder -- replaces cr in field data to avoid line
  --                          breaks which would be misread as records;
  --                          replaced later during dislay
  local tEscapedQuotePlaceholder -- used for keeping track of quotes in data
  local tmp -- collection for data between quotes
  local tInQuotedText -- flag set while reading data between quotes
  --
  put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
  put numtochar(2)  into tEscapedQuotePlaceholder -- used to simplify
  --                       distinction between quotes in data and those
  --                       used in delimiters
  --
  -- Normalize line endings:
  replace crlf with cr in pData          -- Win to UNIX
  replace numtochar(13) with cr in pData -- Mac to UNIX
  --
  -- Put placeholder in escaped quote (non-delimiter) chars:
  replace ("\"&quote) with tEscapedQuotePlaceholder in pData
  --
  put empty into tInQuotedText
  repeat for each char k in pData
    if (tInQuotedText) then
      if k = quote then
        replace cr with tReturnPlaceholder in tmp
        replace tEscapedQuotePlaceholder with quote in tmp
        put tmp after tNuData
        put empty into tInQuotedText
      else
        put k after tmp
      end if
    else if k = quote then
      put true into tInQuotedText
      put empty into tmp
    else if k = comma then
      put tab after tNuData
    else
      put k after tNuData
    end if
  end repeat
  --
  return tNuData
end CSV2Tab






More information about the metacard mailing list