Extracting text from PDF

Jan Schenkel janschenkel at yahoo.com
Thu Jan 11 15:42:35 EST 2007


--- Richard Gaskin <ambassador at fourthworld.com> wrote:
> Anyone here have an efficient algo for extracting
> text from PDFs?
> 
> -- 
>   Richard Gaskin
>   Fourth World Media Corporation
> 

Well, one would hope I know a thing or two about PDF
files ;-)
There are a couple of things that make this a
challenge: a text string can be stored either in a
single-byte Latin encoding or in Unicode / UTF-16
(Big Endian). You can check for the byte order mark
(the bytes FE FF) at the start of a string to tell
whether it is Latin or Unicode.
But PDF files can also be compressed and/or encrypted,
which makes them nearly impossible to read from
Revolution directly.
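
As a rough sketch of the BOM check (in Python rather than
Revolution, just to show the idea): the function name is my
own, and plain Latin-1 is only a stand-in for the single-byte
case, since PDF's own Latin encoding differs from it for a
handful of characters. It also assumes you have already
pulled the raw string bytes out of an uncompressed,
unencrypted file.

```python
import codecs


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a text string that is either UTF-16BE (with BOM) or Latin."""
    # A leading byte order mark (FE FF) signals UTF-16 Big Endian.
    if raw.startswith(codecs.BOM_UTF16_BE):
        # Drop the two BOM bytes before decoding the remainder.
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # Assumption: Latin-1 approximates the single-byte Latin encoding.
    return raw.decode("latin-1")
```

For example, decode_pdf_text_string(b"\xfe\xff\x00H\x00i") and
decode_pdf_text_string(b"Hi") both give back the same two-letter
string.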

If this is Mac-only, you might be able to AppleScript
another application to get this information -
Preview.app doesn't seem to be scriptable, but perhaps
another application could do the trick.
Some googling turned up the pdftotext command line
tool, which is open-source:
<http://www.glyphandcog.com/textext.html>
There's also a build for MacOSX, which you can
download at:
<http://www.bluem.net/downloads/pdftotext_en/>
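
If you go the command line route, calling the tool from a
script could look roughly like the sketch below. It is in
Python purely for illustration; the helper names are made
up, and it assumes pdftotext is installed and on the PATH.
Passing "-" as the output file asks pdftotext to write to
stdout, and "-enc UTF-8" requests UTF-8 output. From
Revolution the same command line could presumably be run
through shell().

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path: str) -> list:
    """Build the argument list for a pdftotext call that writes to stdout."""
    # "-" in place of an output file name sends the extracted text to stdout.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on pdf_path and return the extracted text."""
    # Assumption: the pdftotext binary is installed and reachable on PATH.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The command is built in its own function so it can be
inspected or logged without actually running the tool.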

Hope this helped,

Jan Schenkel.

Quartam Reports for Revolution
<http://www.quartam.com>

=====
"As we grow older, we grow both wiser and more foolish at the same time."  (La Rochefoucauld)


More information about the use-livecode mailing list