Extracting text from PDF

Thu Feb 21 01:12:15 EST 2008

--- Richard Gaskin <ambassador at fourthworld.com> wrote:
> 
> Is there an easy way to do this in script?
> 
> -- 
>   Richard Gaskin
>   Fourth World Media Corporation
>  

Hi Richard et al,

Extracting text from a PDF file is possible, and can
indeed be done via scripting, though not for all files
until you've climbed the decompression, decryption and
decoding mountains.
But that's actually just the start of it: PDF is just
about the worst text file format in history. Even
after stripping out the intermingled styling and
positioning instructions, you're left with a bunch of
strings which may not necessarily be in the correct
order.
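
To make that concrete, here is a minimal Python sketch (not
Revolution script, and not how any particular library does it).
It assumes you have already located one Flate-compressed,
unencrypted page content stream and have its raw bytes. The
name strings_from_stream and the demo stream are made up for
illustration. It ignores fonts, encodings, hex strings and
nested parentheses.

```python
import re
import zlib

# A literal PDF string: "(" ... ")" with backslash escapes.
# Simplification: unescaped nested parentheses are not handled.
LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def strings_from_stream(compressed: bytes) -> list[str]:
    """Inflate a content stream and return its literal strings in file order."""
    content = zlib.decompress(compressed)
    result = []
    for match in LITERAL.finditer(content):
        # Undo only the escapes for "(", ")" and "\"; others are left as-is.
        raw = re.sub(rb"\\([()\\])", rb"\1", match.group(1))
        # Assumption: single-byte text; real fonts may use other encodings.
        result.append(raw.decode("latin-1"))
    return result


# Hypothetical stream: "World" is drawn first at y=700, then the text
# position moves up 20 units and "Hello (PDF)" is drawn at y=720.
demo = b"BT /F1 12 Tf 72 700 Td (World) Tj 0 20 Td (Hello \\(PDF\\)) Tj ET"
print(strings_from_stream(zlib.compress(demo)))
```

The strings come back in the order they were written to the
file, which here is the reverse of how they read on the page.
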
The applications out there that convert PDF to Word
files have a lot in common with Optical Character
Recognition (OCR) applications, which attempt to
convert scanned images to text: both apply algorithms
to "collate" the pieces of text into a collection of
words and paragraphs.
Heck, even Adobe Reader, Apple Preview and other PDF
viewers have to "best-guess" what text makes up a
sentence when you use the text selection tool.
Granted, a good number of files can be read
sequentially and churn out the strings in a reasonably
effective order - but all bets are off if you take a
random document that came out of graphically-oriented
tools where people play around with layers and filter
effects.
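
The "collate" step mentioned above could be sketched roughly
like this in Python. It assumes you already have each string's
page position; the Fragment type, the collate function and the
2-unit line tolerance are all invented for illustration. Real
converters have to cope with columns, rotation and font
metrics, so treat this as a toy version of the idea.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # horizontal position in PDF user space
    y: float   # vertical position; larger y is nearer the page top
    text: str


def collate(fragments: list[Fragment], line_tolerance: float = 2.0) -> list[str]:
    """Group fragments into lines by similar y, then order each line by x."""
    lines: list[list[Fragment]] = []
    # Walk from the top of the page downward.
    for frag in sorted(fragments, key=lambda f: -f.y):
        # Same line if close enough to the first fragment of the current line.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    ]


# Hypothetical fragments, deliberately out of reading order.
shuffled = [
    Fragment(105, 700.0, "PDF"),
    Fragment(110, 720.0, "world"),
    Fragment(72, 700.5, "from"),
    Fragment(72, 720.0, "Hello"),
]
print(collate(shuffled))
```

Even this small example has to guess at a tolerance for what
counts as "the same line", which is where the best-guessing
comes in.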

Sorry to disappoint you,

Jan Schenkel.

Quartam Reports & PDF Library for Revolution
<http://www.quartam.com>

=====
"As we grow older, we grow both wiser and more foolish at the same time."  (La Rochefoucauld)
