PDF files handling

Bryan McCormick bryan at deepfoo.com
Fri Jul 6 07:57:59 EDT 2007


Ken, Andre

Thanks for taking the time on this vexing PDF issue. Either solution 
does appear to work at least some of the time. The better-formatted 
papers that follow a standard form (JEL classification, etc.) can easily 
be read by Andre's solution, and sometimes by Ken's as well. Many 
papers, though, were created in other countries and published in 
English; those files seem radically different in structure and honked 
the app in some cases.

On some level the problem is simply not a technical one. Many 
papers that were published "pretty print" don't have any explicit 
structure. So, for example, using the direct read method you'd never find 
a title element. And when read in using the pdftohtml conversion (cool 
trick!) there is nothing, nada, rien du tout that suggests where the 
title is on the page. So for automatic indexing or scraping of the page, 
it's a no go.
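
To make the "no title element" case concrete, here is a rough sketch in Python (not LiveCode, and not Ken's or Andre's actual code). It assumes that "direct read" means scanning the raw file bytes for a /Title entry written as a plain literal string. The helper name find_pdf_title is made up for illustration. It will miss titles stored in compressed object streams, as hex or UTF-16 strings, or with unescaped nested parentheses, so a None result only means this simple scan found nothing.

```python
import re

# Hypothetical helper, for illustration only: matches a /Title entry whose
# value is a literal string, e.g.  /Title (Some Paper Title)
# Escaped characters such as \( and \) are allowed inside the string.
TITLE_RE = re.compile(rb"/Title\s*\(((?:\\.|[^\\()])*)\)")


def find_pdf_title(data: bytes):
    """Return the first non-empty /Title literal string found, or None."""
    for match in TITLE_RE.finditer(data):
        raw = match.group(1)
        # Undo the backslash escapes for parentheses and backslash itself.
        text = re.sub(rb"\\([()\\])", rb"\1", raw).decode("latin-1").strip()
        if text:
            return text
    return None
```

Used on a file, that would be something like find_pdf_title(open(path, "rb").read()), where path is whatever PDF you are testing. For the "pretty print" papers described above, this kind of scan is expected to come back empty, which is the whole problem.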

Unfortunately this appears to be a result of the publishers not 
thinking through the implications of needing a machine to read a file. 
The worst offenders have no consistent structure and assume one person 
sitting at a machine at a time, having the leisure to actually read 
something. What the heck were they thinking?

This is one area where Google wins. Thanks guys.


