Reading PDF - a cry for help

Dar Scott dsc at swcp.com
Thu Sep 29 12:50:49 EDT 2011


On Sep 29, 2011, at 9:24 AM, Ken Ray wrote:
> Are you looking at just extracting the images? Or other relevant parts of the PDF? The reason I ask is that it looks like binary data is always contained between two lines: "stream" and "endstream", so extracting just the streaming data should be pretty quick to do; although the next step would be going to read the bytes of what was extracted and then determine if it's an image or some other thing that had to be represented with a "stream" in the PDF...


There are a couple of issues that complicate this in general.

The parameters needed to process a stream have to be parsed first, and they can be far away from the stream itself.
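
To illustrate the "far away" point, here is a minimal sketch in Python (not LiveCode, just to keep it short). It assumes the common case where the stream dictionary gives /Length either directly or as an indirect reference such as "12 0 R", whose value lives in a separate object elsewhere in the file. The function name stream_length is made up for this example, and the regex search stands in for a real cross-reference table lookup, so treat it as a rough idea rather than a robust parser.

```python
import re

def stream_length(pdf: bytes, dict_text: bytes) -> int:
    """Return the /Length value for a stream dictionary.

    Hypothetical helper: handles a direct integer and one level of
    indirect reference, found by scanning the whole file.
    """
    # Indirect form: "/Length 12 0 R" points at object "12 0 obj".
    m = re.search(rb"/Length\s+(\d+)\s+(\d+)\s+R", dict_text)
    if m:
        obj_num, gen = m.group(1), m.group(2)
        # Assumption: the referenced object holds a bare integer.
        target = re.search(
            rb"(?<!\d)" + obj_num + rb"\s+" + gen + rb"\s+obj\s+(\d+)", pdf
        )
        if target is None:
            raise ValueError("indirect /Length object not found")
        return int(target.group(1))

    # Direct form: "/Length 42".
    m = re.search(rb"/Length\s+(\d+)", dict_text)
    if m is None:
        raise ValueError("stream dictionary has no /Length")
    return int(m.group(1))
```

A real reader would resolve the reference through the xref table instead of scanning, but even this toy version shows why grabbing the bytes between "stream" and "endstream" is not the whole job.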

There are many stream filters (some involve complicated compression), and they can be nested.  I looked at a corpus of PDF files and, yeah, several are used in practice.

However, if one only needs to parse the output of a specific program or a specific model of scanner, then doing the parsing in LiveCode is a lot less work.

I hope that makes sense; I'm a little under the weather today.

Dar





