Parsing a PDF file

Richard Gaskin ambassador at fourthworld.com
Fri Jul 8 11:44:50 EDT 2016


Jim Hurley wrote:

 > My County is now publishing the election results to the web as a PDF
 > file:
 >
 > https://www.mynevadacounty.com/nc/elections/docs/2016%20Elections/June%207%2c%202016%2c%20Presidential%20Primary/Election%20Results/precinctreport.pdf
 >
 > Is there a way to parse these PDF files?

It's unfortunate that so many orgs release data useful for analysis in 
complex formats that inhibit such use.  PDF is great when the goal is to 
preserve page layout, but a uniquely poor choice for sharing data to be 
used for analytics.  Alas, that hasn't slowed its use in such contexts.

If this is to be done within an application for others to use, perhaps 
the smoothest user experience would be via the XPDF external, currently 
available only in LiveCode Business Edition at $1999/yr.  While that may 
seem high, for commercial products of such scope it may be a good bargain.

However, if this is only for use in tools you'll be using yourself, 
where an extra step or two is less important, there are many options.

If it's just one file, perhaps the simplest is to use Save As Text from 
Adobe's PDF Viewer.

If you'll need to automate this for reuse, here's a way to use Apple's 
Automator for that:
<https://www.engadget.com/2013/02/11/mac-101-use-automater-to-extract-text-from-pdfs/>

I believe there may also be a command line option available on macOS, 
which could be called from within LC using the shell function.  I don't 
know the name of the command line tool for that on macOS, but in Linux I 
use pdftotext, where the syntax is pretty simple:

   pdftotext <sourcePdfFile> <destTextFile>

e.g.:

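   put "/Users/me/folder/SomeFile.pdf" into tSrc
   put "/Users/me/folder/SomeFile.txt" into tDest
   -- quote both paths so folder or file names containing spaces work
   get shell("pdftotext" && quote & tSrc & quote && quote & tDest & quote)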
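If the goal is to parse the text rather than keep a .txt file around, a 
rough sketch like the one below may help.  It assumes pdftotext is 
installed and reachable from LC's shell function, and relies on two 
pdftotext options: "-layout", which tries to preserve the column 
layout, and "-" as the output name, which sends the text to stdout.  The 
handler name pdfAsText and the field "Results" are made up for 
illustration, and the real parsing will depend on how the report is 
laid out.

```livecode
-- Hypothetical helper: returns the text of a PDF as a string.
-- Assumes pdftotext is installed and visible to shell(); on some
-- systems the full path to the tool may be needed.
function pdfAsText pPdfPath
   local tCmd
   -- Quote the path so names with spaces survive the shell.
   -- "-layout" keeps columns roughly aligned; "-" writes to stdout.
   put "pdftotext -layout" && quote & pPdfPath & quote && "-" into tCmd
   return shell(tCmd)
end function

on mouseUp
   local tText, tLine, tRows
   put pdfAsText("/Users/me/folder/SomeFile.pdf") into tText
   repeat for each line tLine in tText
      -- Skip blank and whitespace-only lines
      if word 1 of tLine is empty then next repeat
      put tLine & return after tRows
   end repeat
   delete the last char of tRows  -- drop the trailing return
   put tRows into field "Results" -- hypothetical field name
end mouseUp
```

From there each line can be split into columns with word or item 
chunks, depending on how cleanly the layout survives the conversion.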

-- 
  Richard Gaskin
  Fourth World Systems
  Software Design and Development for the Desktop, Mobile, and the Web
  ____________________________________________________________________
  Ambassador at FourthWorld.com                http://www.FourthWorld.com




