PDF
Bob Sneidar
bobsneidar at iotecdigital.com
Mon May 14 11:21:15 EDT 2018
Document Management systems use PDFs almost exclusively. I think PDF is here to stay.
Bob S
> On May 13, 2018, at 08:05 , Mike Bonner via use-livecode <use-livecode at lists.runrev.com> wrote:
>
> I ended up using pdftotext, it worked like a charm. (Though I had to look
> up how to send it a file list using find.. Too long away from the shell.)
> I now have a little app that can do a word search for matching files and
> show either the extracted text, or the original pdf using the browser
> widget.
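For anyone wanting to try the same idea, here is a minimal sketch of the "extract, then word search" workflow described above. It is written in Python rather than LiveCode, and it is not the app from the post: the function names (`pdftotext_command`, `extract_text`, `find_matches`, `search_folder`) are made up for illustration. It assumes `pdftotext` is installed and on the PATH, and it relies on `-` as the output name to send the text to stdout.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path):
    # "-" as the output file asks pdftotext to write the text to stdout;
    # "-enc UTF-8" requests UTF-8 output so it can be decoded predictably.
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"]


def extract_text(pdf_path):
    # Assumption: pdftotext is on the PATH. check=True raises
    # CalledProcessError if the tool exits with an error.
    result = subprocess.run(pdftotext_command(pdf_path),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def find_matches(texts, word):
    # texts maps a file name to its extracted text.
    # Returns the sorted names whose text contains word, ignoring case.
    needle = word.lower()
    return sorted(name for name, text in texts.items()
                  if needle in text.lower())


def search_folder(folder, word):
    # Walks the folder recursively, roughly what `find folder -name '*.pdf'`
    # would hand to the tool from the shell.
    texts = {str(p): extract_text(p) for p in Path(folder).rglob("*.pdf")}
    return find_matches(texts, word)
```

Calling `search_folder("reports", "policy")` would return the paths of the PDFs under a (hypothetical) `reports` folder whose text mentions "policy". Keeping `find_matches` separate from the extraction step means the search can be tried on plain strings without any PDFs at hand.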
>
> As far as being on the "make pdfs go away" bandwagon, yes I am.
> Unfortunately, they're still used all over the place. Insurance companies
> autogenerate a huge number of pdf reports, some of them built live through
> horribly slow, clunky, awful websites (insert a bunch of other words here to
> describe how NOT enjoyable they are to use) that eventually, after
> making you go through a huge number of different screens, hand you a pdf.
> /endInsuranceWebsiteVent
>
> Reminds me of when I worked as phone support for a "large computer
> manufacturer".. When there was a workflow issue, and slow call times due to
> waiting on page loads for vantive.. The answer usually ended up being..
> "Hey, it's already slow so let's add 3 more required page loads that can take
> forever to complete especially on busy days, thereby slowing things down
> even more..." /endPhoneSupportVent
>
> I seem to be on a "KISAF" kick lately. Keep It Simple And Fast
>
> On Sun, May 13, 2018 at 8:30 AM, R.H. via use-livecode <
> use-livecode at lists.runrev.com> wrote:
>
>> To extract text from a PDF document, I am using a command line tool on
>> Windows called Xpdf, which is also available for Linux-based systems.
>>
>> It was working well, using shell() on LiveCode Community 8.x, but tested
>> only in the IDE on Windows.
>>
>> It should work with Linux and Mac as well.
>>
>> If PDFs just contain images where the text is in the image, you need to
>> first run them through a character recognition (OCR) program. Since I found that
>> different tools generate different results when converting image characters
>> in PDF to embedded text, I still find that Acrobat from Adobe is doing the
>> best job.
>>
>> I needed this since some people had sent huge lists of numerical data in
>> PDF which were impossible to extract, and the manual method could have taken
>> weeks. Also, it is helpful for building Document Management Systems where
>> words within associated documents need to be indexed.
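As a rough illustration of the numerical-data case mentioned above, the following Python sketch pulls the numbers out of each line of already-extracted text. It is only a sketch under simple assumptions: plain decimal numbers with a dot as the decimal separator, no thousands separators, and no dates mixed into the data (a date like 2018-05-14 would be read as three numbers). The name `numbers_per_line` is hypothetical.

```python
import re

# Optional sign, digits, optional decimal part. Deliberately simple.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def numbers_per_line(text):
    # Returns one list of floats per line that contains at least one number;
    # lines without numbers are skipped.
    rows = []
    for line in text.splitlines():
        found = [float(match) for match in NUMBER.findall(line)]
        if found:
            rows.append(found)
    return rows
```

Each returned row corresponds to one line of the text file, so a table that survived extraction with one record per line comes back as a list of rows. Real reports usually need extra cleanup, as noted further down about unstructured PDFs.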
>>
>> Converting PDF to .docx formats (Word) usually does not give good results.
>> The resulting documents are quite unclean. Extracting the text also does
>> not necessarily result in a meaningful text if the original PDF is not
>> structured with clearly separated paragraphs, headlines, etc. ideally in
>> one top-to-bottom and left-to-right flow. So, a lot of manual work will
>> often be required.
>>
>> Nevertheless, I cannot see that PDF will lose ground as the standard for
>> many years to come. There are possibly billions of documents in PDF around?
>> What should replace it? And people are still printing.
>>
>> Xpdf can generate a pure text file that can be read from LiveCode and
>> processed further.
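The "generate a text file, then read and process it" step could look something like the sketch below. It uses Python instead of LiveCode's shell(), and the helper names (`convert_to_text_file`, `split_pages`) are invented for the example. It assumes `pdftotext` is on the PATH, uses its `-layout` option to keep the physical page layout, and relies on the tool separating pages with a form feed character (which it does unless `-nopgbrk` is given).

```python
import subprocess
from pathlib import Path


def convert_to_text_file(pdf_path, txt_path):
    # Writes the extracted text to txt_path, then reads it back.
    # Assumption: pdftotext is installed; check=True raises on failure.
    subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", str(pdf_path), str(txt_path)],
        check=True,
    )
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")


def split_pages(text):
    # Pages are separated by form feeds ("\f"); a trailing form feed
    # leaves an empty last piece, which is dropped here.
    pages = text.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages
```

Something like `split_pages(convert_to_text_file("in.pdf", "out.txt"))` would then give one string per page, which is handy when search results should point to a page number.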
>>
>> *Open Source Xpdf*
>>
>> http://www.xpdfreader.com/download.html
>>
>> https://en.wikipedia.org/wiki/Pdftotext
>>
>> Command line tools in Xpdf
>>
>> The open source Xpdf toolkit also includes several command line tools which
>> perform various functions on PDF files:
>>
>> - *pdftotext*: converts PDF to text
>> - *pdftops*: converts PDF to PostScript
>> - *pdftoppm*: converts PDF pages to netpbm (PPM/PGM/PBM) image files
>> - *pdftopng*: converts PDF pages to PNG image files
>> - *pdftohtml*: converts PDF to HTML
>> - *pdfinfo*: extracts PDF metadata
>> - *pdfimages*: extracts raw images from PDF files
>> - *pdffonts*: lists fonts used in PDF files
>> - *pdfdetach*: extracts attached files from PDF files
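The other tools in the list can be driven the same way. As one example, here is a small Python sketch around `pdfinfo`, which prints metadata as "Key: value" lines. The function names (`parse_pdfinfo`, `pdf_metadata`) are hypothetical, and the sketch assumes `pdfinfo` is on the PATH.

```python
import subprocess


def parse_pdfinfo(output):
    # Splits each "Key:   value" line at the first colon only, so values
    # that contain colons themselves (dates, times) stay intact.
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def pdf_metadata(pdf_path):
    # Assumption: pdfinfo is installed; check=True raises on failure.
    result = subprocess.run(["pdfinfo", str(pdf_path)],
                            capture_output=True, check=True)
    return parse_pdfinfo(result.stdout.decode("utf-8", errors="replace"))
```

All values come back as strings, so a page count would still need `int(info["Pages"])` before doing arithmetic with it.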
>>
>> Cross-platform
>>
>> All of the Xpdf tools are available for Linux, Windows, and Mac.
>>
>> The viewer (xpdf / XpdfReader) uses the Qt toolkit.
>> Roland
>> _______________________________________________
>> use-livecode mailing list
>> use-livecode at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your
>> subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>