PDF files handling

Andre Garzia andre at andregarzia.com
Fri Jul 6 10:27:09 EDT 2007


Bryan,

there's still hope! :D

Two new tricks. First, does the filename contain meaningful data about the
title? If so, check for the presence of those words and what is near them.
Second, I believe titles use big font faces and appear alone on a page, or
at least stand out on it, so look for text set in a large font size.
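
Here is a rough sketch of both tricks together. It is in Python rather than
rev only to keep it short, and it assumes you already have the first page as
(text, font size) pairs from whatever extraction step you use. The function
name guess_title and the scoring weights are made up for illustration, not
something tested on real papers.

```python
import re
from pathlib import Path


def guess_title(pdf_filename, spans):
    """Pick the most title-like span from a page.

    spans is a list of (text, font_size) pairs. How they are extracted
    from the PDF is outside this sketch (assumption).
    Returns the best text, or None when nothing stands out.
    """
    if not spans:
        return None

    # Words taken from the file name, ignoring very short ones.
    stem = Path(pdf_filename).stem.lower()
    name_words = {w for w in re.split(r"[^a-z0-9]+", stem) if len(w) > 3}

    # Use the median font size as the "normal" body text size.
    sizes = sorted(size for _, size in spans)
    body_size = sizes[len(sizes) // 2]
    if body_size <= 0:
        return None

    best_text, best_score = None, 0.0
    for text, size in spans:
        words = set(re.split(r"[^a-z0-9]+", text.lower()))
        overlap = len(name_words & words)
        # Hypothetical weights: a font bigger than body text counts a
        # little, each file name word found in the span counts more.
        score = max(0.0, size / body_size - 1.0) + 2.0 * overlap
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```

When every span has the same size and none shares a word with the filename,
the function returns None, which is the case a human has to look at.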

What you need is a conflict resolution screen. For the PDFs where the
process works, you're fine. For those where the process gets lost, just
launch them in Preview or your favorite application and tell the "user" to
highlight/select the text of the title in Preview. In rev, use a simple
AppleScript to get the selected text of Preview. This way, for the few
cases where your software does not work, you have a quick fix that involves
only a human selecting the text of the title and pressing a button.
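
The overall flow could look something like the sketch below, again in Python
just for brevity. Both auto_guess and ask_user are hypothetical callbacks:
the first stands for whatever automatic detection you have, the second for
the step where a human selects the title and presses the button.

```python
def resolve_titles(pdf_paths, auto_guess, ask_user):
    """Try the automatic process first, then fall back to a human.

    auto_guess(path) returns a title or None (hypothetical callback).
    ask_user(path) returns the title the human selected (hypothetical).
    Returns (titles, needs_review): a dict of path -> title, and the
    list of paths that had to go through the manual step.
    """
    titles = {}
    needs_review = []

    # First pass: everything the automatic process can handle.
    for path in pdf_paths:
        title = auto_guess(path)
        if title:
            titles[path] = title.strip()
        else:
            needs_review.append(path)

    # Second pass: the conflict resolution step for the leftovers.
    for path in needs_review:
        titles[path] = ask_user(path).strip()

    return titles, needs_review
```

Doing the manual pass last means the human sees all the problem files in one
sitting instead of being interrupted throughout the run.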

Well, I am assuming you are using Mac OS X. If you're actually on Windows,
then someone here with more Windows experience may know how to get the
selected text of Adobe Reader using VBScript or shell commands or something
like that.

A system-agnostic approach would be to ask the user to select and copy the
title to the clipboard. This way, you just need to check
clipboarddata["text"] to get your title.


Cheers
andre "this is a hack" garzia

On 7/6/07, Bryan McCormick <bryan at deepfoo.com> wrote:
>
> Ken, Andre
>
> Thanks for taking the time on this vexing PDF issue. Either solution
> does appear to work some of the time at least. The better formatted
> papers that have standard form (JEL classification, etc) can easily be
> read by Andre's solution. Sometimes by Ken's as well although many
> papers were created in other countries though published in English. Thus
> the file seems radically different in structure and honked the app in
> some cases.
>
> The problem is one that is simply not technical on some level. Many
> papers that were published "pretty print" don't have any explicit
> structure. So for example using the direct read method you'd never find
> a title element. And when read in using the pdftohtml conversion (cool
> trick!) there is nothing, nada, rien du tout that suggests where the
> title is on the page. So for automatic indexing or scraping of the page,
> it's a no go.
>
> Unfortunately this appears to be a result of the publishers not thinking
> through the implications of needing a machine to read a file. The
> worst offenders have no consistent structure and assumed one person
> sitting at a machine at a time, having the leisure to actually read
> something. What the heck were they thinking?
>
> This is one area where Google wins. Thanks guys.


