Parsing a PDF file

Kay C Lan lan.kc.macmail at gmail.com
Sun Jul 10 22:50:32 EDT 2016


On Mon, Jul 11, 2016 at 9:36 AM, Roger Eller
<roger.e.eller at sealedair.com> wrote:
> Since this seems to be Mac only, why not "do as Applescript" the select
> all, and Copy?
>
Because Preview isn't properly scriptable and you can't "Select All"
or "Copy". As Richard said, the answer is with Automator.

If you open Automator and create a new 'Application', then in the left
hand column you'll see "PDFs" as an option. If you click on that and
browse down the middle column you'll see 'Extract PDF Text', and if
you click on that, in its description you'll see that it can extract
Plain or Rich text.

So how can we get this to work with LC?

1) In Automator, drag the 'Extract PDF Text' action into the right
hand workspace window.
a) Choose the output type - most likely Plain Text
b) Select a folder to save to - for convenience we'll use "Desktop"
c) For the Output File Name you probably want to use a Custom Name -
pdf2text or whatever. You do not need to specify the suffix.
d) Tick the Replace Existing files box.

2) Back in the left hand column where you clicked on the PDFs icon,
now click on the 'Files & Folders' icon (looks like the Finder icon).
From the middle column drag 'Ask for Finder Items' into the right hand
column, place it above 'Extract PDF Text'.
a) Set 'Start at:' to a logical location, like Downloads, if that
is where your PDFs are likely to be located.
b) 'Type:' should be left at Files, and do NOT tick the Allow Multiple
Selection box as these instructions are for a single file only.


3) From the middle column drag 'Open Finder Items' and place it
'between' the last two actions - so the order will be Ask for Finder
Items, Open Finder Items, Extract PDF Text.
a) Set Open with: to Preview.

4) Optionally, if you don't always have Preview open and you don't
want to be left with the PDF file open, in the left hand column click
Utility, and from the middle column drag 'Quit Application' to the end
of your workflow.
a) Set it to "Preview.app"

You can now test this by clicking the Run button in the top right
corner. You should get a standard Open File dialog box; select a
file, and shortly thereafter the Automator log window at the bottom
should show all green ticks.

You should then be able to navigate to the Desktop folder and the file
'pdf2text.txt' should be there.
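
If you'd rather check from within LC than in the Finder, something
like the following rough sketch should do. The handler name
checkPdf2Text is just made up for illustration, and it assumes the
Desktop folder and the pdf2text name chosen in step 1, and that the
extracted text is UTF-8. It puts the first few lines of the file into
the Message Box.

on checkPdf2Text
   local tPath
   -- assumption: the workflow saved its output to the Desktop as pdf2text.txt
   put specialFolderPath("desktop") & "/pdf2text.txt" into tPath
   if there is a file tPath then
      -- show the first five lines as a quick sanity check
      put line 1 to 5 of textDecode(URL ("binfile:" & tPath), "utf8")
   else
      put "Not found:" && tPath
   end if
end checkPdf2Text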

To complete the LC integration, save your Automator workflow and call
it something like pdf2text. For this example we'll also save it to the
Desktop.

Then in your LC script:

on mouseUp
   set the defaultFolder to specialFolderPath("desktop")
   launch "pdf2text.app"
   --if the file is large, consider a 'wait 1 second' (or longer) here.
   put textDecode(URL "file:/Users/yourname/Desktop/pdf2text.txt","utf8") into tNotPDF
   --do what you have to after this

   --your Automator app will auto Quit once it's done its thing, so
   --there is no need to balance the 'launch' command with a 'kill' command
end mouseUp
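
If a fixed wait turns out to be unreliable, one possible variation is
to poll until the text file appears. This is only a sketch, not
something I've run against every kind of PDF. It assumes pdf2text.app
and its output are both on the Desktop, the 120 second timeout is an
arbitrary figure (it has to cover the time you spend in the Open File
dialog), and it reads via "binfile:" so that textDecode gets the raw
bytes.

on mouseUp
   local tApp, tOutFile, tDeadline, tNotPDF
   -- assumption: the app and its output both live on the Desktop
   put specialFolderPath("desktop") & "/pdf2text.app" into tApp
   put specialFolderPath("desktop") & "/pdf2text.txt" into tOutFile

   -- remove any stale output so we don't read the previous run's text
   if there is a file tOutFile then delete file tOutFile

   launch tApp

   -- poll for the output file, giving up after an arbitrary 120 seconds
   put the seconds + 120 into tDeadline
   repeat until (there is a file tOutFile) or (the seconds > tDeadline)
      wait 500 milliseconds with messages
   end repeat

   if there is not a file tOutFile then
      answer "No text file was produced."
      exit mouseUp
   end if

   -- the file can exist before Automator has finished writing it
   wait 1 second with messages
   put textDecode(URL ("binfile:" & tOutFile), "utf8") into tNotPDF
   --do what you have to after this
end mouseUp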

It should be noted that Automator's Extract PDF Text typically does a
better job of text extraction than manually Select All + Copy + Paste.

Unfortunately I consider both these options about 30% or less accurate
than using my old PPC G5 running Leopard and Devon Technologies' old
PDF2RTFService. I had not previously offered a solution to the OP
because "get a PPC Mac, install Leopard and PDF2TEXTService" is only
really an option if you are handling many large, complex formatted
PDFs day in, day out, as I am. Jim's problem sounds like a one-off.



