Extracting text from PDF

Thu Feb 21 23:28:10 EST 2008

On Thu, Feb 21, 2008 at 9:17 AM, Richard Gaskin <ambassador at fourthworld.com>
wrote:

>
> Is there an easy way to do this in script?
>

Ah, my favourite pastime :-)

Probably no use to you, Richard, as I believe this is for personal use only:
it requires PDF2RTFService from DEVONtechnologies:

http://www.devon-technologies.com/products/freeware/services.html

I quickly checked the website but couldn't find any licence terms. Maybe
they're in the Read Me when you install, but I seem to remember something
about personal use only. Still, for the inventive users out there, this will
help.

Secondly, this is OSX only.

Easy Solution:

After installing PDF2RTFService, set the Preferences in TextEdit so that a
'new' document will be opened as Plain Text, not RTF.

Use Rev to 'force' the pdf document to be opened with TextEdit.
PDF2RTFService will automatically translate the PDF to RTF.

Use AppleScript to take the contents (rtf) of the file, open a new document
(txt) and insert the now plain text into the document. Save the document
with a fixed name and location.

Use Rev to read the fixed file.

So in Rev:

launch tFileName with "/yourHD/Applications/TextEdit.app"

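the short AppleScript looks like this:

tell application "TextEdit"
  -- document 1 is the pdf, already converted by PDF2RTFService
  set theText to the text of document 1
  make new document
  set the path of document 1 to "/yourHD/Users/Shared/Untitled1.txt"
  -- copy the extracted text into the new plain text document
  set the text of document 1 to theText
  save document 1
  -- first close: the new txt document; second close: the original pdf
  close document 1
  close document 1
end tell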

It is important to note that there should be no other documents open in
TextEdit when you run this.

After Rev launches your pdf with TextEdit it will become document 1. When
AppleScript 'makes new document', that is now document 1. I fix the location
to save 'Untitled1' in the Shared folder as this eliminates any permissions
issues. AppleScript then closes document 1, which leaves the original file
open, which now becomes document 1, which explains why document 1 is closed
twice.

I'll leave it up to you to put the above in a variable ;-) but obviously
you'd run it in Rev:

do tAppleScript as AppleScript

followed by:

put URL "file:/yourHD/Users/Shared/Untitled1.txt" into tTheText
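
For anyone who wants a head start on the variable, here is a minimal sketch
of one way it might be built. The function name pdfToTextScript is made up
for illustration, and the script text includes the line that copies theText
into the new document (as the longer script further down does). It assumes
the pdf has already been launched with TextEdit as shown above.

```livecode
function pdfToTextScript
   -- "quote" is Rev's constant for the double-quote character,
   -- which saves escaping the quotes the AppleScript needs
   put "tell application" && quote & "TextEdit" & quote & return into tScript
   put "set theText to the text of document 1" & return after tScript
   put "make new document" & return after tScript
   put "set the path of document 1 to" && quote & \
         "/yourHD/Users/Shared/Untitled1.txt" & quote & return after tScript
   put "set the text of document 1 to theText" & return after tScript
   put "save document 1" & return after tScript
   -- closed twice: first the new txt document, then the original pdf
   put "close document 1" & return after tScript
   put "close document 1" & return after tScript
   put "end tell" after tScript
   return tScript
end pdfToTextScript
```

You would then use "put pdfToTextScript() into tAppleScript" before the
"do tAppleScript as AppleScript" line.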

Clearly this can easily be put into a repeat loop to run through a bunch of
pdf files to be opened, converted then fed into Untitled1 and finally read
into Rev.
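
A rough sketch of such a loop, under some assumptions: pFolder is a
hypothetical folder holding the pdf files, pAppleScript holds the script
text, and the two second wait is only a guess at how long TextEdit needs to
open and convert each file. The function name convertPDFs is made up too.

```livecode
function convertPDFs pFolder, pAppleScript
   -- list the pdf files in pFolder, then restore the default folder
   put the defaultFolder into tOldFolder
   set the defaultFolder to pFolder
   put the files into tFiles
   set the defaultFolder to tOldFolder
   filter tFiles with "*.pdf"

   repeat for each line tFile in tFiles
      put pFolder & "/" & tFile into tFileName
      launch tFileName with "/yourHD/Applications/TextEdit.app"
      -- assumption: launch returns before TextEdit has finished opening
      -- the file, so pause before scripting it
      wait 2 seconds
      do pAppleScript as AppleScript
      -- read the fixed file and keep its text, keyed by pdf file name
      put URL "file:/yourHD/Users/Shared/Untitled1.txt" into tTexts[tFile]
   end repeat
   return tTexts
end convertPDFs
```

The array that comes back has one element per pdf, so tTexts["some.pdf"]
would hold the text extracted from that file.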

Slightly more complex:
For those who wish to keep the converted files, here's the AppleScript that
will save the txt version in the same location as the pdf version, with just
the extension changed:

tell application "TextEdit"
  set thePDFPath to the path of document 1
  -- drop the 4-character ".pdf" extension and append ".txt"
  set endChar to (length of thePDFPath) - 4
  set theTxtPath to (text 1 thru endChar of thePDFPath) & ".txt"
  set theText to the text of document 1
  make new document
  set the path of document 1 to theTxtPath
  set the text of document 1 to theText
  save document 1
  -- first close: the new txt document; second close: the original pdf
  close document 1
  close document 1
end tell

Obviously in this case you'll need to do a bit of chunk manipulation in Rev
to figure out from the start file name and location where you'll find the
txt version so you can URL it. But that's so easy in Rev :-)
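
The chunk manipulation could look something like this. It is only a sketch:
the function name txtPathFor is made up, and it assumes the file name really
does end in a four-character ".pdf" extension, just as the AppleScript does.

```livecode
function txtPathFor pPDFPath
   -- char 1 to -5 drops the last four characters (".pdf"),
   -- mirroring the endChar arithmetic in the AppleScript
   return (char 1 to -5 of pPDFPath) & ".txt"
end txtPathFor
```

So after running the AppleScript on tFileName you could do
"put URL ("file:" & txtPathFor(tFileName)) into tTheText".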

HTHs someone