roland.huettmann at gmail.com
Sun May 13 10:30:37 EDT 2018
To extract text from a PDF document, I am using Xpdf, a set of command line
tools that is available for Windows and also for Linux-based systems.
It worked well using shell() in LiveCode Community 8.x, but I have tested it
only in the IDE on Windows.
It should work with Linux and Mac as well.
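As a rough illustration of the shell-out approach described above, here is a minimal sketch in Python rather than LiveCode. It assumes the pdftotext executable from Xpdf is on the PATH; the function names and file names are made up for the example. The same command line could presumably be handed to shell() in LiveCode.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for pdftotext: PDF file in, text file out."""
    cmd = ["pdftotext", "-enc", "UTF-8"]  # ask for UTF-8 encoded output
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def pdf_to_text(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext and return the extracted text (hypothetical helper)."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext was not found on the PATH")
    # check=True raises CalledProcessError if pdftotext exits with an error
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, which is a common stumbling block on Windows.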
If a PDF contains only images, with the text inside the images, you need to
first run it through a character recognition (OCR) program. Since I found that
different tools generate different results when converting image characters
in PDF to embedded text, I still find that Acrobat from Adobe is doing the
I needed this since some people had sent huge lists of numerical data in
PDF which were impossible to extract, and the manual method could have taken
weeks. Also, it is helpful for building Document Management Systems where
words within associated documents need to be indexed.
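For the indexing use case, a very small sketch of a word index over already extracted text might look like the following. It is written in Python, the function name and the sample file names are hypothetical, and a real Document Management System would need much more (stemming, stop words, persistent storage).

```python
import re
from collections import defaultdict


def build_word_index(documents):
    """Map each lowercased word to the sorted names of documents containing it.

    `documents` is a dict of {document name: extracted text}.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        # \w+ picks up runs of letters, digits and underscores as "words"
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return {word: sorted(names) for word, names in index.items()}


if __name__ == "__main__":
    idx = build_word_index({"a.txt": "Invoice 2018 total", "b.txt": "Total due"})
    print(idx.get("total", []))
```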
Converting PDF to .docx (Word) format usually does not give good results.
The resulting documents are quite messy. Extracting the text also does
not necessarily produce meaningful text if the original PDF is not
structured with clearly separated paragraphs, headlines, etc., ideally in
one top-to-bottom and left-to-right flow. So a lot of manual work will
often be required.
Nevertheless, I cannot see PDF losing ground as the standard for
many years to come. There are possibly billions of PDF documents around.
What should replace it? And people are still printing.
Xpdf can generate a pure text file that can be read from LiveCode and
*Open Source Xpdf*
Command line tools in Xpdf
The open source Xpdf toolkit also includes several command line tools which
perform various functions on PDF files:
- *pdftotext*: converts PDF to text
- *pdftops*: converts PDF to PostScript
- *pdftoppm*: converts PDF pages to netpbm (PPM/PGM/PBM) image files
- *pdftopng*: converts PDF pages to PNG image files
- *pdftohtml*: converts PDF to HTML
- *pdfinfo*: extracts PDF metadata
- *pdfimages*: extracts raw images from PDF files
- *pdffonts*: lists fonts used in PDF files
- *pdfdetach*: extracts attached files from PDF files
All of the Xpdf tools are available for Linux, Windows, and Mac.
The viewer (xpdf / XpdfReader) uses the Qt toolkit.
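The other tools in the list can be driven the same way as pdftotext. As one example, pdfinfo prints its metadata as "Key: value" lines, which are easy to turn into a lookup table. The sketch below is in Python, assumes pdfinfo is on the PATH, and uses hypothetical function names.

```python
import subprocess


def parse_pdfinfo(output):
    """Turn pdfinfo's "Key: value" lines into a dict of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, since values such as dates contain colons
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def read_pdf_info(pdf_path):
    """Run pdfinfo on a file and return its metadata as a dict (hypothetical helper)."""
    completed = subprocess.run(
        ["pdfinfo", str(pdf_path)],
        capture_output=True, text=True, check=True,
    )
    return parse_pdfinfo(completed.stdout)
```

All values come back as strings, so a page count, for instance, would still need converting with int() before doing arithmetic on it.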