Tools & techniques for one-off consolidation of multiple 'similar' CSV files?

Ben Rubinstein benr_mc at cogapp.com
Mon Apr 4 17:55:09 EDT 2022


Hi Keith,

I feel your pain. I spend a lot of time doing this kind of manipulation.

I generally avoid CSV, the format of the devil. If necessary, I run 
everything through a library I have that does a reasonable job of converting 
CSV to TSV. Let's just assume for now that you've already done this....
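
That library is more involved than this, but a minimal sketch of the 
conversion might look like the following (csvToTSV is just an illustrative 
name; this version handles quoted fields and doubled quotes, but not returns 
or tabs embedded inside fields):

function csvToTSV pCSV
   local tOut, tInQuotes, tChar, tLen, i
   put false into tInQuotes
   put the number of chars in pCSV into tLen
   put 1 into i
   repeat while i <= tLen
      put char i of pCSV into tChar
      if tChar is quote then
         if tInQuotes and char (i + 1) of pCSV is quote then
            put quote after tOut -- doubled quote inside a field: keep one
            add 1 to i -- and skip the second
         else
            put not tInQuotes into tInQuotes -- entering or leaving quotes
         end if
      else if tChar is comma and not tInQuotes then
         put tab after tOut -- an unquoted comma separates fields
      else
         put tChar after tOut -- everything else passes through unchanged
      end if
      add 1 to i
   end repeat
   return tOut
end csvToTSV

Then it's just a matter of running each raw file through csvToTSV before any 
of what follows.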

My technique is this:

- given that you've got a routine to iterate over a folder or tree of files 
(a minimal sketch follows this list)
- given that you can easily process a row at a time, and within each row can 
easily work through a column at a time (i.e., this is TSV and you've set the 
itemDelimiter to tab)
- given that the first row in each file gives the column names
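
For the first of those givens, the iteration might be as simple as this 
(tFolderPath being a variable holding wherever your converted files live):

put the defaultFolder into tSavedFolder -- remember where we were
set the defaultFolder to tFolderPath
put the files into tFileList -- one file name per line
set the defaultFolder to tSavedFolder
filter tFileList with "*.tsv" -- keep only the TSV files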

1. Maintain an ordered list of output column names

I'd probably keep it in three forms: a string of tab-separated column names, 
an array mapping column name to index, and a variable giving the number of 
columns.

2. For each file, go through the list of column names (the 'items' of the 
first row). For each one, if it's not already in the master list of column 
names (i.e., it's not in the array), then append it, tab-separated, to the 
end of the string, and add it to the array with the next index.

Also keep an array mapping column index in _this_ file to column index in the 
master file.

3. Then for each row after that, start with an empty array. For each non-empty 
item on the row, add it to this 'row array', with the key being the index in 
the master file corresponding to this item's index in this file.

4. When you've got to the end of the row, dump the data from this array; index 
from 1 to number-of-master-columns, adding a tab between each. Then add this 
to the master file accumulator.
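
A concrete example, with made-up data: if file A has columns Name and Age, 
and file B has columns Name and Email, the master columns end up as Name, 
Age, Email. A row from B with Name "Jo" and Email "jo@example.com" is then 
dumped as Jo, tab, tab, jo@example.com: the empty Age slot still gets its 
tab, so every output row has the full set of master columns.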


So the code would look something like this:


local tMasterColumns -- tab-delimited column names for the output 'master' file
local aColumnNameToMasterIndex -- array mapping column name to index in above
local iNumMasterColumns -- number of items in the above two

local tMasterFileData -- will accumulate the rows of data for the output file

local tFolderPath, tFileList -- input folder and file list, as in the sketch above
local tFileName, tFileData -- name and contents of the current input file
local aFileIndexToMasterIndex -- for each file, maps index of column in file
                              -- to index in the master file
local aRowData -- for each row, we first move data into this array...
local tOutRow -- ...then output it into this string

local iFileNumCols -- number of columns in the current input file

local iFileColInx, iMasterColInx -- keep track of input and output col indices


put 0 into iNumMasterColumns -- start the master column count at zero
set the itemDelimiter to tab -- as assumed above

repeat for each line tFileName in tFileList -- tFileList from the sketch above
   put URL ("file:" & tFolderPath & slash & tFileName) into tFileData

   -- map the columns of this file to the (growing) columns of the master file
   put empty into aFileIndexToMasterIndex -- start afresh for each file
   put 0 into iFileColInx
   repeat for each item x in line 1 of tFileData
     add 1 to iFileColInx
     get aColumnNameToMasterIndex[x]
     if it is empty then
        -- a new column name: append it to the master header row
        if tMasterColumns is empty then
           put x into tMasterColumns -- first column gets no leading tab
        else
           put tab & x after tMasterColumns
        end if
        add 1 to iNumMasterColumns
        put iNumMasterColumns into aColumnNameToMasterIndex[x]
        get iNumMasterColumns -- so 'it' holds the master index either way
     end if
     -- now it is the index of this column in the master file
     put it into aFileIndexToMasterIndex[iFileColInx]
   end repeat
   delete line 1 of tFileData
   put iFileColInx into iFileNumCols

   repeat for each line tRowData in tFileData

     -- get data from the row into the proper columns
     put empty into aRowData
     repeat with i = 1 to iFileNumCols -- number of columns in this file
       put aFileIndexToMasterIndex[i] into iMasterColInx
       put item i of tRowData into aRowData[iMasterColInx]
     end repeat

     -- now dump the row
     put empty into tOutRow
     repeat with i = 1 to iNumMasterColumns
       put aRowData[i] & tab after tOutRow
     end repeat
     -- delete the trailing tab, then append the row to the accumulator
     put (char 1 to -2 of tOutRow) & return after tMasterFileData

   end repeat

end repeat -- for each file

-- finally save tMasterColumns & return & tMasterFileData
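
That final save might be as simple as this (tOutPath standing in for whatever 
output path you choose):

put tMasterColumns & return & tMasterFileData into URL ("file:" & tOutPath)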

On 04/04/2022 18:03, Keith Clarke via use-livecode wrote:
> Hi folks,
> I need to consolidate a couple of hundred CSV files of varying sizes (dozens to hundreds of rows) and column structures (some shared columns but many unique), into a single superset dataset that comprises all data from all files.
> 
> There are too many files and columns to attempt a manual process, so I’ve been trying to achieve this with LiveCode, by iterating through the files to parse the column header rows into row template and column mapping arrays used to build the combined table. However, I'm struggling to both compile the superset and render the results.
> 
> I feel I may be designing a very complicated solution for a problem that has probably been solved before. I wonder if I’m missing a simpler method - with LiveCode or perhaps with Excel or ‘drag & drop’ into some kind of self-organising database GUI for SQLite, etc?
> 
> Thanks in advance for any ideas.
> Best,
> Keith