Parsing a PDF file

Mike Bonner bonnmike at gmail.com
Fri Jul 8 12:50:23 EDT 2016


Might read this one too:
http://stackoverflow.com/questions/1554280/extract-text-from-pdf-in-javascript

On Fri, Jul 8, 2016 at 10:48 AM, Mike Bonner <bonnmike at gmail.com> wrote:

> It's ugly, but could you use pdf.js to extract the text in a browser
> widget showing the PDF?
> http://git.macropus.org/2011/11/pdftotext/example/
>
> Not sure what else is in pdf.js but it looks interesting.
>
> On Fri, Jul 8, 2016 at 10:30 AM, Paul Dupuis <paul at researchware.com>
> wrote:
>
>> On 7/8/2016 11:55 AM, Colin Holgate wrote:
>> > I was trying an export as spreadsheet from Acrobat Pro, but that didn’t
>> > work. Doing a Save as Text from Acrobat Reader was more successful, but the
>> > columns come out in a different order, and some columns get combined into a
>> > single string.
>>
>> Over the past few years, I have spent a ridiculous amount of time exploring
>> PDF access via LiveCode in every way possible. Ultimately, for our needs
>> we created the XPDF external and transferred it to LiveCode, but we also
>> explored JavaScript extraction from a browser, interapplication
>> communication, shell command-line tools, etc., etc.
>>
>> The reality is that the PDF format is great for visually representing a
>> printed page and totally sucks for text content, that is, actually
>> getting the characters of the document rather than an image of the
>> characters.
>>
>> There is NO real mapping of characters to their appearance in the PDF
>> other than geometric position on the page. You get no font information,
>> no size, no styles, zip. You get line breaks at the end of every visible
>> line and you can get line breaks in what appears to be the middle of
>> content depending upon how the original source document was rendered
>> into a PDF. Headers and footers end up in the middle of paragraphs. You
>> have no real way to tell a line break from a paragraph break and more.
>>
>> In truth, a NEW portable document format needs to be invented that
>> ties content to its appearance and preserves both, but I suspect that
>> people who want to keep both intact and portable are just using HTML5
>> and CSS3.
>
>
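
Paul's point that geometric position is the only thing you can rely on
suggests a heuristic for telling a line break from a paragraph break. The
following is only a rough sketch, assuming an extractor has already handed
you text fragments with page coordinates. The Fragment type, the
top-of-page y convention, and both thresholds are hypothetical choices made
for illustration; this is not what the XPDF external or pdf.js does
internally.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Fragment:
    """Hypothetical extractor output: a run of text and where it sits."""
    text: str
    x: float  # left edge, in points
    y: float  # baseline, measured from the top of the page (assumption)


def group_lines(fragments, y_tolerance=2.0):
    """Cluster fragments with nearly equal baselines into (y, text) lines."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(frag.y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))
    # Within a line, read the fragments left to right.
    return [
        (y, " ".join(f.text for f in sorted(frags, key=lambda f: f.x)))
        for y, frags in lines
    ]


def group_paragraphs(lines, gap_factor=1.5):
    """Start a new paragraph where the vertical gap is unusually large."""
    if not lines:
        return []
    gaps = [b[0] - a[0] for a, b in zip(lines, lines[1:])]
    typical = median(gaps) if gaps else 0.0
    paragraphs = [[lines[0][1]]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if gap > typical * gap_factor:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
    return [" ".join(p) for p in paragraphs]


if __name__ == "__main__":
    sample = [
        Fragment("Hello", 72, 100), Fragment("world,", 110, 100.5),
        Fragment("this is", 72, 112), Fragment("one paragraph.", 72, 124),
        Fragment("Second", 72, 150), Fragment("paragraph.", 115, 150),
    ]
    print(group_paragraphs(group_lines(sample)))
    # ['Hello world, this is one paragraph.', 'Second paragraph.']
```

A guess like this still breaks on exactly the cases described above: a
header or footer sitting between two lines looks like just another line, and
multi-column pages need the fragments split by x position first.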


