ambassador at fourthworld.com
Sat May 12 16:08:47 EDT 2018
Mike Bonner wrote:
> I haven't needed to do this before, but is there a (relatively) easy
> way to extract the text from a bunch of pdf files? I'm hoping I can
> build some indexes for the boatload of files I want to go through
> (Though, I guess I could bypass LC and just grep my heart out)
> Any suggestions?
Per Postel's Law, reduce the stockpile of PDFs littering humanity's
infosphere by generating none except in the increasingly rare cases
where no other format is a better choice.
PDF is an archaic format held over from the days when nearly all display
devices had screens at least as wide as a printed page. Back in the
'90s, when it was popularized, a fixed-size format emulating a printed
piece of paper was not an unreasonable thing to do.
But times have changed. We rarely kill trees just to read anymore, so
the bounds of a printed page are approaching meaninglessness.
This becomes critically important for delivering an enjoyable reading
experience when we consider that an ever-smaller minority of our time is
spent on screens large enough to accommodate that size.
Many of our screens are much smaller, and moreover they vary enough to
make any single fixed size needlessly cumbersome.
Attempting to read PDFs on a phone ranges from mildly annoying to
That unnecessary pain is easily replaced these days with modern formats
that reflow text to fit any of the many devices we might be using at any
There's a good argument for using EPub as that foundation.
But that's a long-term solution, and while I believe it's an
inevitability as mobile use continues to grow, it won't solve your need
in the here-and-now, so:
The Linux universe has many good command-line solutions available for
extracting text from PDFs easily and efficiently, like this one:
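Which tool was meant above is not preserved in this message. As one
possibility only, here is a minimal Python sketch that assumes the
pdftotext utility (part of poppler-utils) is installed and on the PATH.
The function names build_command, extract_text, build_index and
index_folder are made up for illustration, and the word index is a
deliberately naive one:

```python
import shutil
import subprocess
from collections import defaultdict
from pathlib import Path


def build_command(pdf_path):
    """Return the argv list for one pdftotext call; "-" sends text to stdout."""
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def extract_text(pdf_path):
    """Run pdftotext on a single file and return the extracted text."""
    result = subprocess.run(
        build_command(pdf_path),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout


def build_index(texts):
    """Map each lowercased word to the set of file names that contain it."""
    index = defaultdict(set)
    for name, text in texts.items():
        for word in text.lower().split():
            # Trim common punctuation from both ends of the word.
            index[word.strip(".,;:!?()[]\"'")].add(name)
    index.pop("", None)  # drop tokens that were punctuation only
    return dict(index)


def index_folder(folder):
    """Extract every *.pdf in a folder (assumed layout) and index the words."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils first")
    texts = {p.name: extract_text(p) for p in sorted(Path(folder).glob("*.pdf"))}
    return build_index(texts)
```

Calling index_folder("some/folder") returns a dictionary you can query
by word. If all you need is plain text files to grep through, running
pdftotext on each file from the shell does the same job without Python.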
For those Win10 Pro users who can be convinced to tick a checkbox, the
entire universe of the Ubuntu shell is now available.
macOS also includes utilities for this, but I don't believe the same
ones (at least not without installing an independent package manager
Fourth World Systems