Getting page counts of PDFs

Ralph DiMola rdimola at evergreeninfo.net
Sun Aug 23 10:17:50 EDT 2020


PDF Widget will do the trick. I started reading the PDF spec 20 years ago and got a giant headache.

Ralph DiMola
IT Director
Evergreen Information Services
rdimola at evergreeninfo.net

-----Original Message-----
From: use-livecode [mailto:use-livecode-bounces at lists.runrev.com] On Behalf Of David V Glasgow via use-livecode
Sent: Sunday, August 23, 2020 8:07 AM
To: How to use LiveCode
Cc: David V Glasgow
Subject: Getting page counts of PDFs

Livecoders,

In my day job, some of my income depends on the number of pages in the PDF documents that I have to read for individual cases.  I thought it would be fun and useful to write an LC script that would either count the pages of a single PDF or (even better) get the page counts for a folder full of PDFs.

I didn’t imagine it would be too hard, because both Mac and Win OSs report the page count instantly and accurately in the file information windows.

I discovered that in a small sample of PDFs a line… << /Type /Pages /MediaBox [0 0 612 792] /Count 149 /Kids [ 1396 0 R 1397 0 R


...contained the page count, which was a bit confusing because I had read that the MediaBox was only about page dimensions.  Then I found that some PDFs don’t contain that line, or at least not in the clear.
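
For PDFs that do have that line in the clear, a rough scan can be sketched in Python as below. This is only a heuristic under stated assumptions, not a proper parser: the function name guess_page_count is made up for illustration, it assumes the /Pages dictionaries are stored uncompressed, and it takes the largest /Count on the assumption that the root of the page tree carries the total. It returns None when nothing is found, which is what happens when the objects live in compressed streams.

```python
import re

# One indirect object: everything between "obj" and the next "endobj".
OBJECT_RE = re.compile(rb"\bobj\b(.*?)\bendobj\b", re.S)
# A page-tree node is /Type /Pages (plural); a single page is /Type /Page.
PAGES_RE = re.compile(rb"/Type\s*/Pages\b")
COUNT_RE = re.compile(rb"/Count\s+(\d+)")


def guess_page_count(data: bytes):
    """Heuristic page count for a PDF whose /Pages objects are in the clear.

    Assumption: the root /Pages node has the largest /Count, because each
    node's /Count covers all pages beneath it. Returns None if no
    uncompressed /Pages object with a /Count is found.
    """
    counts = []
    for obj in OBJECT_RE.finditer(data):
        body = obj.group(1)
        if PAGES_RE.search(body):
            found = COUNT_RE.search(body)
            if found:
                counts.append(int(found.group(1)))
    return max(counts) if counts else None
```

To try it on a file, read the file in binary mode and pass the bytes, e.g. guess_page_count(open(path, "rb").read()). Files that have been edited and saved incrementally may still contain older /Pages objects, so the result should be treated as a guess.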

There is a general online consensus that reliably finding the page count of a PDF involves quite a lot of messing about and parsing, and may involve pretty much counting the pages.

I found some code here <http://www.angusj.com/delphitips/pdfpagecount.php> with the following walk-through:

//1.  See if there's a 'Linearization dictionary' for easy parsing.
//    Mostly there isn't so ...
//2.  Locate 'startxref' at end of file
//3.  get 'xref' offset and go to xref table
//4.  depending on version the xref table may or may not be in a compressed
//    stream. If it's in a compressed stream (PDF ver 1.5+) then getting the
//    page number requires a LOT of code which is too convoluted to summarise
//    here. Otherwise it still requires a moderate amount of code ...
//5.  parse the xref table and fill a list with object numbers and offsets
//6.  handle subsections within xref table.
//7.  read 'trailer' section at end of each xref
//8.  store 'Root' object number if found in 'trailer'
//9.  if 'Prev' xref found in 'trailer' - loop back to step 3
//10. locate Root in the object list
//11. locate 'Pages' object from Root
//12. get Count from Pages.
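
The first few steps of that walk-through can be sketched in Python as follows. This is a minimal illustration, not a translation of the Delphi code at the link: the function names are hypothetical, only the classic (uncompressed) xref table case is handled, and following 'Prev' links and resolving the Root and Pages objects (steps 9 to 12) are left out.

```python
import re


def find_xref_offset(data: bytes) -> int:
    """Steps 2-3: find the last 'startxref' and read the offset after it."""
    idx = data.rfind(b"startxref")
    if idx == -1:
        raise ValueError("no startxref keyword found")
    found = re.match(rb"startxref\s+(\d+)", data[idx:])
    if found is None:
        raise ValueError("startxref is not followed by an offset")
    return int(found.group(1))


def xref_kind(data: bytes, offset: int) -> str:
    """Step 4: a classic table starts with the keyword 'xref'.

    Anything else at that offset is assumed here to be a cross-reference
    stream (PDF 1.5+), which this sketch does not decode.
    """
    return "table" if data[offset:offset + 4] == b"xref" else "stream"


def trailer_root(data: bytes, offset: int):
    """Steps 7-8: return the Root object number from the trailer, or None.

    Only meaningful for a classic xref table.
    """
    start = data.find(b"trailer", offset)
    if start == -1:
        return None
    found = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", data[start:])
    return int(found.group(1)) if found else None
```

Because startxref sits at the very end of the file and points straight at the cross-reference data, a reader does not have to scan the whole file to begin following the chain from Root to Pages to Count.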


If this is right, how on earth do OSs do it so quickly?  Also, and more to the point, am I on a fool's errand trying to do this with LC?  I haven’t seen anything that obviously couldn’t be done (I didn’t understand the regex, but assumed with effort…).  However, parsing huge files just doesn’t look like it would be worth the effort, particularly as I can select all the documents, get info, and sum the pages in my head.

Cheers,

David Glasgow




