PDF

Richard Gaskin ambassador at fourthworld.com
Sat May 12 16:08:47 EDT 2018


Mike Bonner wrote:

 > I haven't needed to do this before, but is there a (relatively) easy
 > way to extract the text from a bunch of pdf files?  I'm hoping I can
 > build some indexes for the boatload of files I want to go through
 > (Though, I guess I could bypass LC and just grep my heart out)
 >
 > Any suggestions?

Long term:

Per Postel's Law, reduce the stockpile of PDFs littering humanity's 
infosphere by generating none except in the increasingly rare cases 
where no other format is a better choice.

PDF is an archaic format held over from the days when nearly all display 
devices had screens at least as wide as a printed page.  Back in the 
'90s, when it was popularized, a fixed-size format emulating a printed 
piece of paper was not an unreasonable thing to do.

But times have changed.  We rarely kill trees just to read anymore, so 
the bounds of a printed page are approaching meaninglessness.

This becomes critically important for delivering an enjoyable reading 
experience when we consider that an ever-smaller minority of our time is 
spent on screens large enough to accommodate that size.

Many of our screens are much smaller, and moreover they vary enough to 
make any single fixed size needlessly cumbersome.

Attempting to read PDFs on a phone ranges from mildly annoying to 
prohibitively frustrating.

That unnecessary pain is easily replaced these days with modern formats 
that reflow text to fit any of the many devices we might be using at any 
given moment.

There's a good argument for using EPub as that foundation.

But that's a long-term solution, and while I believe it's an 
inevitability as mobile use continues to grow, it won't solve your need 
in the here-and-now, so:


Short term:

The Linux universe has many good command-line solutions available for 
extracting text from PDFs easily and efficiently, like this one:
https://www.howtogeek.com/228531/how-to-convert-a-pdf-file-to-editable-text-using-the-command-line-in-linux/

For those Win10 Pro users who can be convinced to tick a checkbox, the 
entire universe of the Ubuntu shell is now available.

macOS also includes utilities for this, but I don't believe they're the 
same ones (at least not without installing an independent package 
manager like Homebrew).
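
As a rough sketch of the indexing described in the question, here is one 
way it might look in Python.  It assumes the pdftotext utility (part of 
poppler-utils) is installed and on the PATH; the function names and the 
one-folder layout are made up for illustration, not taken from any 
particular tool.

```python
import re
import subprocess
import sys
from collections import defaultdict
from pathlib import Path


def pdftotext_command(pdf_path):
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def extract_text(pdf_path):
    """Run pdftotext (assumed to be installed) and return the extracted text."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout


def add_to_index(index, name, text):
    """Record, for each lowercase word in text, that document `name` has it."""
    for word in set(re.findall(r"[a-z0-9]+", text.lower())):
        index[word].add(name)
    return index


def index_folder(folder):
    """Index every *.pdf directly inside `folder` (hypothetical layout)."""
    index = defaultdict(set)
    for pdf in sorted(Path(folder).glob("*.pdf")):
        add_to_index(index, pdf.name, extract_text(pdf))
    return index


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python pdfindex.py FOLDER")
    word_index = index_folder(sys.argv[1])
    for word in sorted(word_index):
        print(word, ", ".join(sorted(word_index[word])))
```

If plain grep is all that's needed, running pdftotext on each file and 
grepping the resulting .txt files would do the same job with no 
scripting at all.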

-- 
  Richard Gaskin
  Fourth World Systems




