Parsing a PDF file

Jim Hurley jhurley0305 at sbcglobal.net
Sat Jul 9 10:43:58 EDT 2016


Thanks Richard. 

You are so right about releasing data in complex formats.
I spoke to the elections office about posting election results in PDF format.
I knew there was no use fighting them when they told me that it was now County "policy" to post everything in PDF--not unlike those 10 policies of renown that were carved in stone--and a metaphor was born.

In the County's old system, the results for the 50 election precincts were stored on 50 separate web pages as HTML documents.
That was perfect for LiveCode's "get url". It was a matter of seconds to visit all 50 pages, parse the text, and store the data.
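
For anyone curious, the old approach amounted to a loop along these lines. It is only a sketch: the base URL and the "precinct1.html" naming pattern are invented for illustration (they are not the County's real addresses), the handler name is my own, and the parsing step is left out.

```livecode
-- Sketch only: tBase and the page-naming pattern are hypothetical
function fetchPrecinctPages
   local tBase, tPage, tPages, i
   put "http://www.example.com/elections/precinct" into tBase
   repeat with i = 1 to 50
      -- download the HTML for precinct i
      put url (tBase & i & ".html") into tPage
      -- "the result" is non-empty when the download failed
      if the result is not empty then next repeat
      put tPage into tPages[i]
   end repeat
   return tPages  -- array keyed by precinct number
end fetchPrecinctPages
```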

Thankfully this new PDF web page has all the data for all 50 precincts on the one page.
If I save the page to a PDF file, open that file in Adobe Acrobat, and save it as "Text (Accessible)", as you suggested, I get a readable text file for LC to work its magic on.

(The other two text options in Adobe are "Rich Text Format" and "Text (Plain)", neither of which works--only "Text (Accessible)" does.)
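
Once the text file exists, getting it into LC is short. The following is a rough sketch rather than my actual script: the function name is hypothetical, and it only loads the non-blank lines, since how to pick the precinct data out of each line depends on the layout of the exported text.

```livecode
-- Sketch only: readReportLines is a hypothetical helper name
function readReportLines pPath
   local tText, tLine, tLines, tCount
   put url ("file:" & pPath) into tText
   if the result is not empty then return empty  -- file could not be read
   put 0 into tCount
   repeat for each line tLine in tText
      if tLine is empty then next repeat  -- skip blank lines
      add 1 to tCount
      put tLine into tLines[tCount]
   end repeat
   return tLines  -- array of lines, keyed 1..tCount
end readReportLines
```

It could be called as, for example, put readReportLines("/Users/me/folder/precinctreport.txt") into tLines, where the path is again just a placeholder.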

I was unaware of Apple's Automator. I'll look into it--but it is unnecessary for this project.

Thanks again,

Jim Hurley


> Message: 9
> Date: Fri, 8 Jul 2016 08:44:50 -0700
> From: Richard Gaskin <ambassador at fourthworld.com>
> To: use-livecode at lists.runrev.com
> Subject: Re: Parsing a PDF file
> Message-ID: <577FCA72.2040901 at fourthworld.com>
> Content-Type: text/plain; charset=utf-8; format=flowed
> 
> Jim Hurley wrote:
> 
>> My County is now publishing the election results to the web as a PDF
>> file:
>> 
>> 
> https://www.mynevadacounty.com/nc/elections/docs/2016%20Elections/June%207%2c%202016%2c%20Presidential%20Primary/Election%20Results/precinctreport.pdf
>> 
>> Is there a way to parse these PDF files?
> 
> It's unfortunate that so many orgs release data useful to analysis in 
> complex formats that inhibit such use.  PDF is great when the goal is to 
> preserve page layout, but a uniquely poor choice for sharing data to be 
> used for analytics.  Alas, that hasn't slowed its unfortunate use in 
> such contexts.
> 
> If this is to be done within an application for others to use, perhaps 
> the smoothest user experience would be via the XPDF external, currently 
> available only in LiveCode Business Edition at $1999/yr.  While that may 
> seem high, for commercial products of such scope it may be a good bargain.
> 
> However, if this is only for use in tools you'll be using yourself, 
> where an extra step or two is less important, there are many options.
> 
> If it's just one file, perhaps the simplest is to use Save As Text from 
> Adobe's PDF Viewer.
> 
> If you'll need to automate this for reuse, here's a way to use Apple's 
> Automator for that:
> <https://www.engadget.com/2013/02/11/mac-101-use-automater-to-extract-text-from-pdfs/>
> 
> I believe there may also be a command line option available on macOS, 
> which could be called from within LC using the shell function.  I don't 
> know the name of the command line tool for that on macOS, but in Linux I 
> use pdftotext, where the syntax is pretty simple:
> 
>   pdftotext <sourcePdfFile> <destTextFile>
> 
> e.g.:
> 
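>   -- Source PDF and destination text file (example paths)
>   put "/Users/me/folder/SomeFile.pdf" into tSrc
>   put "/Users/me/folder/SomeFile.txt" into tDest
>   -- Quote each path so names containing spaces survive the shell
>   get shell("pdftotext" && quote & tSrc & quote && quote & tDest & quote)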
> 
> -- 
>  Richard Gaskin
>  Fourth World Systems
>  Software Design and Development for the Desktop, Mobile, and the Web
>  ____________________________________________________________________
>  Ambassador at FourthWorld.com                http://www.FourthWorld.com
> 
> 
> 




