PDF
Mike Bonner
bonnmike at gmail.com
Sun May 13 11:05:40 EDT 2018
I ended up using pdftotext, and it worked like a charm. (Though I had to look
up how to send it a file list using find. Too long away from the shell.)
I now have a little app that can do a word search for matching files and
show either the extracted text, or the original pdf using the browser
widget.
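
For anyone wanting to try the same workflow, here is a rough sketch in Python. It is not the LiveCode app described above, only an illustration of the two steps: run pdftotext over a folder of PDFs, then search the extracted text for a word. The function names and the folder layout are made up for the example, and it assumes pdftotext is on the PATH and accepts the -enc option.

```python
import shutil
import subprocess
from pathlib import Path


def extract_text(pdf_dir, out_dir):
    """Run pdftotext on every *.pdf under pdf_dir, writing .txt files to out_dir.

    Assumption: files are named by stem, so two PDFs with the same name in
    different subfolders would overwrite each other's output.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).rglob("*.pdf")):
        target = out / (pdf.stem + ".txt")
        # Command form: pdftotext [options] PDF-file [text-file]
        subprocess.run(
            ["pdftotext", "-enc", "UTF-8", str(pdf), str(target)],
            check=True,
        )


def find_matches(text_dir, word):
    """Return the names of extracted .txt files that contain word (case-insensitive)."""
    needle = word.lower()
    hits = []
    for txt in sorted(Path(text_dir).glob("*.txt")):
        text = txt.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            hits.append(txt.name)
    return hits
```

Calling extract_text("reports", "extracted") once and then find_matches("extracted", "renewal") would list the text files that mention the word; the matching PDF has the same name with a .pdf extension.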
As far as being on the "make pdfs go away" bandwagon, yes I am.
Unfortunately, they're still used all over the place. Insurance companies
autogenerate a huge number of pdf reports, some of them built live through
horribly slow, clunky, awful websites (insert a bunch of other words here to
describe how NOT enjoyable they are to use) that eventually, after you click
through a huge number of different screens, hand you a pdf.
/endInsuranceWebsiteVent
Reminds me of when I worked as phone support for a "large computer
manufacturer". When there was a workflow issue, and slow call times due to
waiting on page loads for Vantive, the answer usually ended up being:
"Hey, it's already slow, so let's add 3 more required page loads that can take
forever to complete, especially on busy days, thereby slowing things down
even more..." /endPhoneSupportVent
I seem to be on a "KISAF" kick lately: Keep It Simple And Fast.
On Sun, May 13, 2018 at 8:30 AM, R.H. via use-livecode <
use-livecode at lists.runrev.com> wrote:
> To extract text from a PDF document, I am using a command line tool on
> Windows which is available also for Linux based systems called Xpdf.
>
> It was working well, using shell() on LiveCode Community 8x, but tested
> only in the IDE on Windows.
>
> It should work with Linux and Mac as well.
>
> If PDFs just contain images where the text is in the image, you need to
> first run them through a character recognition program. Since I found that
> different tools generate different results when converting image characters
> in PDF to embedded text, I still find that Acrobat from Adobe is doing the
> best job.
>
> I needed this since some people had sent huge lists of numerical data in
> PDF which were impossible to extract, and the manual method could have taken
> weeks. Also, it is helpful for building Document Management Systems where
> words within associated documents need to be indexed.
>
> Converting PDF to .docx formats (Word) usually does not give good results.
> The resulting documents are quite unclean. Extracting the text also does
> not necessarily result in a meaningful text if the original PDF is not
> structured with clearly separated paragraphs, headlines, etc. ideally in
> one top-to-bottom and left-to-right flow. So, a lot of manual work will
> often be required.
>
> Nevertheless, I cannot see PDF losing ground as the standard for
> many years to come. There are possibly billions of documents in PDF around?
> What should replace it? And people are still printing.
>
> Xpdf can generate a pure text file that can be read from LiveCode and
> processed further.
>
> *Open Source Xpdf*
>
> http://www.xpdfreader.com/download.html
>
> https://en.wikipedia.org/wiki/Pdftotext
>
> *Command line tools in Xpdf*
>
> The open source Xpdf toolkit also includes several command line tools which
> perform various functions on PDF files:
>
> - *pdftotext*: converts PDF to text
> - *pdftops*: converts PDF to PostScript
> - *pdftoppm*: converts PDF pages to netpbm (PPM/PGM/PBM) image files
> - *pdftopng*: converts PDF pages to PNG image files
> - *pdftohtml*: converts PDF to HTML
> - *pdfinfo*: extracts PDF metadata
> - *pdfimages*: extracts raw images from PDF files
> - *pdffonts*: lists fonts used in PDF files
> - *pdfdetach*: extracts attached files from PDF files
>
> *Cross-platform*
>
> All of the Xpdf tools are available for Linux, Windows, and Mac.
>
> The viewer (xpdf / XpdfReader) uses the Qt toolkit.
>
> Roland