Tools & techniques for one-off consolidation of multiple 'similar' CSV files?
Ben Rubinstein
benr_mc at cogapp.com
Mon Apr 4 17:55:09 EDT 2022
Hi Keith,
I feel your pain. I spend a lot of time doing this kind of manipulation.
I generally avoid CSV, the format of the devil. If necessary I have a library
that does a reasonable job of converting CSV to TSV, and run everything
through that. Let's just assume for now that you've already done this....
My technique is this:
- given that you've got a routine to iterate over a folder or tree of files
- given that you can easily treat a row at a time, and on each row can easily
work through a column at a time (e.g., this is TSV, you've set the
itemdelimiter to tab)
- given that the first row in each file gives the column names
1. Maintain an ordered list of output column names
I'd probably keep it in two formats: a string with tab separated column names,
an array mapping column name to index, and a variable giving the number of
columns.
2. For each file, go through the list of column names (the 'items' of the
first row). For each one, if it's not already in the master list of column
names (e.g., it's not in the array) then append it with a tab to the end of
the string, and add it to the array with the appropriate index.
Also keep an array mapping column index in _this_ file to column index in the
master file.
3. Then for each row after that, start with an empty array. For each non-empty
item on the row, add it to this 'row array', with the key being the index in
the master file corresponding to this item's index in this file.
4. When you've got to the end of the row, dump the data from this array; index
from 1 to number-of-master-columns, adding a tab between each. Then add this
to the master file accumulator.
So code would look something like this
local tMasterColumns -- tab delimited column names for output 'master' file
local aColumnNameToMasterIndex -- array mapping column name to index in above
local iNumMasterColumns -- number of items in the above two
local tMasterFileData -- will accumulate the rows of data for the output file
local aFileIndexToMasterIndex -- for each file, maps index of column in file
to index in master file
local aRowData -- for each row, we first move data into this array...
local tOutRow -- ...then output it into this string
local iFileNumCols -- number of columns in the current input file
local iFileColInx, iMasterColInx -- keep track of input and output col indices
repeat for each file... load it into tFileData.. etc
-- map the columns of this file to the (growing) columns of the masterfile
put 0 into iFileColInx
repeat for each item x in line 1 of tFileData
add 1 to iFileColInx
get aColumnNameToMasterIndex[x]
if it = empty then
put tab & x after tMasterColumns
add 1 to iNumMasterColumns
put iNumMasterColumns into aColumnNameToMasterIndex[x]
get iNumMasterColumns
end if
-- now it is the index of this column in the master file
put it into aFileIndexToMasterIndex[iFileColInx]
end repeat
delete line 1 of tFileData
put iFileColInx into iFileNumCols
repeat for each line tRowData in tFileData
-- get data from the row into the proper columns
put empty into aRowData
repeat with i = 1 to iFileNumCols -- number of columns in this file
put aFileIndexToMasterIndex[i] into iMasterColInx
put item i of tRowData into aRowData[iMasterColInx]
end repeat
-- now dump the row
put empty into tOutRow
repeat with i = 1 to iNumMasterColumns
put aRowData[i] & tab after tOutRow
end repeat
put (char 1 to -2 of tOutRow) \ -- delete last tab
& return after tMasterFileData
end repeat
end repeat -- for each file
-- finally save tMasterColumns & return & tMasterFileData
On 04/04/2022 18:03, Keith Clarke via use-livecode wrote:
> Hi folks,
> I need to consolidate a couple of hundred CSV files of varying sizes (dozens to hundreds of rows) and column structures (some shared columns but many unique), into a single superset dataset that comprises all data from all files.
>
> There are too many files and columns to attempt a manual process, so in I’ve been trying to achieve this with LiveCode - by iterating through the files to parse the column header rows into into row template and column mapping arrays used to build the combined table. However, I'm struggling to both compile the superset and render the results.
>
> I feel I may be designing a very complicated solution for a problem that has probably been solved before. I wonder if I’m missing a simpler method - with LiveCode or perhaps with Excel or ‘drag & drop’ into some kind of self-organising database GUI for SQLite, etc?
>
> Thanks in advance for any ideas.
> Best,
> Keith
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
More information about the use-livecode
mailing list