Extracting text from PDF
Jan Schenkel
janschenkel at yahoo.com
Thu Jan 11 15:42:35 EST 2007
--- Richard Gaskin <ambassador at fourthworld.com> wrote:
> Anyone here have an efficient algo for extracting
> text from PDFs?
>
> --
> Richard Gaskin
> Fourth World Media Corporation
>
Well, one would hope I know a thing or two about PDF
files ;-)
There are a couple of things that make this a
challenge: text can be in either Latin or Unicode
(UTF-16 Big Endian) encoding. You can check for the
byte order mark (BOM) at the start of a piece of text
to figure out whether it is Latin or Unicode.
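
The BOM check can be sketched roughly as follows. This is
only an illustration in Python, not Revolution code; the
function name decode_pdf_text_string is made up, it assumes
you already have the raw, unescaped bytes of a string, and
it treats the "Latin" case as plain Latin-1, which is only
an approximation of what PDF actually uses there.

```python
import codecs

def decode_pdf_text_string(raw: bytes) -> str:
    """Decode the raw bytes of a PDF text string (hypothetical helper).

    Assumption: `raw` has already been unescaped and, if needed, decrypted.
    """
    # UTF-16 Big Endian strings start with the BOM bytes FE FF.
    if raw.startswith(codecs.BOM_UTF16_BE):
        # Skip the two BOM bytes, then decode the rest as UTF-16BE.
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # No BOM: treat as single-byte "Latin" text.
    # Assumption: Latin-1 is used here as a stand-in for PDF's own
    # single-byte encoding, which differs for a handful of code points.
    return raw.decode("latin-1")
```

For plain ASCII text the approximation makes no difference;
for accented or typographic characters without a BOM, a
proper mapping table would be needed.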
But PDF files can also be compressed and/or encrypted,
which makes them nearly impossible to read directly
from Revolution.
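
Before attempting to parse anything yourself, it may be
worth checking whether a file is compressed or encrypted
at all. The sketch below (Python again, with a made-up
function name) simply scans the raw bytes for the /Encrypt
and /FlateDecode names. It is a rough heuristic rather than
a real parser: it can give false positives if those byte
sequences happen to occur inside other data, and it only
looks for one of several possible compression filters.

```python
def inspect_pdf_bytes(data: bytes) -> dict:
    """Heuristically inspect raw PDF bytes (hypothetical helper).

    This does not parse the file; it only looks for telltale names.
    """
    return {
        # A PDF file begins with a "%PDF-" header followed by the version.
        "is_pdf": data.startswith(b"%PDF-"),
        # An encrypted file references an /Encrypt dictionary.
        "maybe_encrypted": b"/Encrypt" in data,
        # Streams compressed with zlib/deflate name the /FlateDecode filter.
        "maybe_compressed": b"/FlateDecode" in data,
    }
```

If both flags come back False, the text operators in the
page content may be readable as-is; otherwise an external
tool is probably the more practical route.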
If this is Mac-only, you might be able to AppleScript
another application to get this information -
Preview.app doesn't seem to be scriptable, but perhaps
another application could do the trick.
Some googling turned up the pdftotext command line
tool, which is open-source:
<http://www.glyphandcog.com/textext.html>
There's also a build for MacOSX, which you can
download at:
<http://www.bluem.net/downloads/pdftotext_en/>
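
Once pdftotext is installed, calling it from a script is
mostly a matter of building the command line and capturing
its output. A minimal sketch in Python, assuming the
pdftotext binary is on the PATH; the two function names are
made up for illustration. It relies on pdftotext writing to
stdout when the output file is given as "-", and on the
-enc and -layout options.

```python
import shutil
import subprocess

def build_pdftotext_command(pdf_path: str, layout: bool = False) -> list:
    """Build the argument list for pdftotext (hypothetical helper)."""
    # Ask for UTF-8 output so the result decodes predictably.
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # -layout tries to preserve the physical layout of the page.
        cmd.append("-layout")
    # "-" as the output file sends the extracted text to stdout.
    cmd += [pdf_path, "-"]
    return cmd

def extract_text(pdf_path: str, layout: bool = False) -> str:
    """Run pdftotext on pdf_path and return the extracted text."""
    # Assumption: the tool is installed and reachable via PATH.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on the PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8", errors="replace")
```

From Revolution, the same command line could be passed to
the shell() function instead.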
Hope this helped,
Jan Schenkel.
Quartam Reports for Revolution
<http://www.quartam.com>
=====
"As we grow older, we grow both wiser and more foolish at the same time." (La Rochefoucauld)