Parsing a PDF file

Paul Dupuis paul at researchware.com
Fri Jul 8 12:30:17 EDT 2016


On 7/8/2016 11:55 AM, Colin Holgate wrote:
> I was trying an export as spreadsheet from Acrobat Pro, but that didn’t work. Doing a Save as Text from Acrobat Reader was more successful, but the columns come out in a different order, and some columns get combined into a single string.

Over the last few years, I have spent a ridiculous amount of time exploring
PDF access via LiveCode in every way possible. Ultimately, for our needs
we created the XPDF external and transferred it to LiveCode, but we also
explored JavaScript extraction from a browser, interapplication
communication, shell command-line tools, etc.

The reality is that the PDF format is great for visually representing a
printed page and totally sucks for text content, that is, actually
getting the characters of the document rather than an image of the
characters.

There is NO real mapping of characters to their appearance in the PDF
other than geometric position on the page. You get no font information,
no size, no styles, zip. You get line breaks at the end of every visible
line, and you can get line breaks in what appears to be the middle of
content, depending upon how the original source document was rendered
into a PDF. Headers and footers end up in the middle of paragraphs. You
have no real way to tell a line break from a paragraph break, and more.
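
To make the line-break problem concrete, here is a minimal Python sketch of
the kind of guesswork you end up writing once you have the extracted text.
It is only a heuristic under stated assumptions, not a real solution: the
function name reflow and the short_ratio threshold are made up for
illustration, and the rule "a short line ending in sentence punctuation
ends a paragraph" will guess wrong on plenty of real documents.

```python
def reflow(raw, short_ratio=0.7):
    """Guess paragraphs from hard-wrapped extracted text (heuristic only)."""
    lines = [ln.strip() for ln in raw.splitlines()]
    # Assumption: the longest line approximates the column width.
    width = max((len(ln) for ln in lines), default=0)
    paragraphs, current = [], ""
    for ln in lines:
        if not ln:
            # A blank line is treated as a paragraph break.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-"):
            # Assumption: a trailing hyphen is a word split across lines.
            current = current[:-1] + ln
        elif current:
            current += " " + ln
        else:
            current = ln
        # Assumption: a noticeably short line that ends a sentence
        # also ends the paragraph.
        if ln[-1] in ".!?" and len(ln) < short_ratio * width:
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs
```

Note that a genuinely hyphenated word at the end of a line (for example
"command-" followed by "line") gets joined incorrectly, and a header or
footer dropped into the middle of a paragraph is simply merged in. That is
exactly the information the PDF does not give you.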

In truth, a NEW portable document format needs to be invented that
connects content to its appearance and preserves both, but I suspect that
people who want to keep both intact and portable are just using HTML5
and CSS3.




