PDF files handling
andre at andregarzia.com
Fri Jul 6 11:49:31 CDT 2007
I met one of the Devon Tech developers in Malta :-) Small world, ain't it: a
Brazilian and he is from Bulgaria, met in Malta in the house of Sims and
My needs are simple. I put online versions of some parts of magazines for
the Himalayan Academy Publications (www.himalayanacademy.com). We're
interested in the content not in the presentation layer, so to extract that
data, I tried some pdf conversion tools and tried parsing the pdf myself. As
Bryan found, many documents are made for humans and computers have no way to
make sense of them. So in the end I created a little stack tool that allows
me to place the content in fields and then generate the HTML to be put
online using XSLT. I am lucky to have an old version of Adobe InDesign
that my university allowed me, as a monitor, to install. Sivakatirswami then
sends me an INX file, which is an XML InDesign Interchange file. I open this
file in InDesign and use a mix of AppleScript to get the data from InDesign.
It's the same approach I described before: one AppleScript gets the selected
text of InDesign. The Adobe application does not return the text itself but
a pair of chunk positions like "char 100 to 10024 of box 23 of document 1";
another AppleScript then uses those chunk positions to get the actual text.
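
To make the two-step idea concrete, here is a minimal sketch in Python (not
the AppleScript I actually use). It assumes the reference always has the
shape shown above, and it stands in for InDesign with a plain nested dict;
the names parse_chunk, resolve_chunk and documents are made up for
illustration.

```python
import re

# Assumed shape of the reference, taken from the example
# "char 100 to 10024 of box 23 of document 1".
CHUNK_RE = re.compile(r"char (\d+) to (\d+) of box (\d+) of document (\d+)")


def parse_chunk(ref):
    """Step 1: turn the chunk reference into (first, last, box, document)."""
    match = CHUNK_RE.fullmatch(ref.strip())
    if match is None:
        raise ValueError("unrecognised chunk reference: %r" % ref)
    first, last, box, doc = (int(g) for g in match.groups())
    return first, last, box, doc


def resolve_chunk(ref, documents):
    """Step 2: use the positions to fetch the actual text.

    `documents` is a hypothetical stand-in for the application:
    {document number: {box number: text}}.
    """
    first, last, box, doc = parse_chunk(ref)
    text = documents[doc][box]
    # Chunk positions are 1-based and inclusive, Python slices are not.
    return text[first - 1:last]


if __name__ == "__main__":
    docs = {1: {23: "Hello, world"}}
    print(resolve_chunk("char 1 to 5 of box 23 of document 1", docs))
```

The point is only that the first call yields positions and a second lookup
yields text; the real lookup happens inside InDesign.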
I can work really well with this approach: my stack replaces unicode
entities and strange chars and transforms everything automatically. At first
I used PDFs, but selecting multi-column text in Preview mixed up the text
when AppleScript picked it, so I ditched that in favor of InDesign files.
If you're not selecting multi-column text, AppleScript + Preview works fine.
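
For the cleanup step, something along these lines would do. This is a rough
Python sketch, not the code in my stack; which characters count as "strange"
is an assumption here (control characters and non-breaking spaces), and
clean_text is a made-up name.

```python
import html
import unicodedata


def clean_text(raw):
    """Decode entities and strip characters that tend to break the HTML output."""
    # Turn &amp;, &eacute;, &#233; and friends into real characters.
    text = html.unescape(raw)
    # Compose accents so "e" + combining acute becomes a single "é".
    text = unicodedata.normalize("NFC", text)
    # Treat non-breaking spaces as ordinary spaces (an assumption).
    text = text.replace("\u00a0", " ")
    # Drop control characters, but keep newlines and tabs.
    kept = [
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    ]
    return "".join(kept)


if __name__ == "__main__":
    print(clean_text("Caf&eacute; &amp; tea\x0b"))
```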
Be aware that the trouble you quote about Revolution clipboard text will not
affect applescript workflows like this.
I don't have much experience with pdf2html programs. I know they exist,
which is why I told Bryan to investigate, but so far I have managed to solve
my troubles using the raw InDesign source files.
The idea of using the clipboard or AppleScript is to make the user more
productive, since selecting is faster than typing and less error prone.
On 7/6/07, Kay C Lan <lan.kc.macmail at gmail.com> wrote:
> On 7/6/07, Andre Garzia <andre at andregarzia.com> wrote:
> > A system agnostic approach would be to ask the user to select and copy
> > title to the clipboard, this way, you just need to check
> > clipboarddata["text"] to get your title.
> Unfortunately Rev's inter-app clipboard transferring ability is less than
> stellar. Sometimes accessing the data via script works, a lot of the time
> it doesn't; sometimes using keyboard shortcuts works, sometimes it doesn't;
> using menu items to cut and paste works consistently. Again, this is only
> referring to transferring clipboardData from other apps to Rev.
> Andre, can you provide a little more info on pdf2html command line tools
> for OSX? I did a quick search of man pages (using ManOpen) and the only
> reference that came up for pdf was snmpdf, which of course has nothing to
> do with PDFs. I did a search on Google for OSX command line pdf tools and
> came up with commercial products:
> selling for $360 a pop!!!
> If the tool already comes with OSX, how can I find out the command syntax
> and options?
> I currently use the free and excellent PDF2RTFService
> but I can't see anywhere how to access this via the command line.
> I work with lots and lots of PDFs, but currently, because the clipboardData
> is unreliable, I work in two steps: I use Automator to open all the pdf
> files with TextEdit (PDF2RTFService taking care of the conversion process)
> and save them as plain text. Then I use Rev to open and read the files and
> do the real work.