PDF files handling
bryan at deepfoo.com
Fri Jul 6 06:57:59 CDT 2007
Thanks for taking the time on this vexing PDF issue. Either solution
does appear to work at least some of the time. The better-formatted
papers that have a standard form (JEL classification, etc.) can easily be
read by Andre's solution, and sometimes by Ken's as well. However, many
papers were created in other countries, though published in English. Thus
the file seems radically different in structure and honked the app in
On some level, the problem is simply not a technical one. Many
papers that were published "pretty print" don't have any explicit
structure. So for example using the direct read method you'd never find
a title element. And when read in using the pdftohtml conversion (cool
trick!) there is nothing, nada, rien du tout that suggests where the
title is on the page. So for automatic indexing or scraping of the page,
it's a no go.
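
To illustrate the difference, here is a minimal sketch in Python. It is not
Andre's or Ken's solution; the function name and the patterns are
hypothetical, and it assumes the PDF text has already been extracted to a
plain string. A paper with an explicit marker such as "JEL Classification"
gives a scraper something to latch onto, while a "pretty print" paper gives
it nothing.

```python
import re

# Hypothetical patterns: assume the paper labels its codes with an explicit
# "JEL Classification" (or "JEL codes") marker, followed by codes like C63.
JEL_LINE = re.compile(r"JEL\s+(?:classification|codes?)\s*[:.]?\s*(.+)",
                      re.IGNORECASE)
JEL_CODE = re.compile(r"\b[A-Z]\d{1,2}\b")


def find_jel_codes(text):
    """Return the JEL codes listed after an explicit marker, or [] if none."""
    match = JEL_LINE.search(text)
    if match is None:
        # No explicit marker anywhere in the text: nothing to anchor on.
        return []
    return JEL_CODE.findall(match.group(1))


# A well-structured paper exposes its codes.
structured = "Some Title\nJEL Classification: C63, D83, G12\nAbstract ..."
print(find_jel_codes(structured))

# A "pretty print" paper has no marker, so the scraper comes back empty.
unstructured = "A Study of Something\nJohn Doe\nWe examine ..."
print(find_jel_codes(unstructured))
```

The same limitation applies to titles: without a label or some other
explicit cue in the extracted text, there is no pattern to search for.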
Unfortunately this appears to be a result of the publishers not thinking
through the implications of needing a machine to read a file. The
worst offenders have no consistent structure and assumed one person
at a time sitting at a machine with the leisure to actually read
something. What the heck were they thinking?
This is one area where Google wins. Thanks guys.