Getting page counts of PDFs

David V Glasgow dvglasgow at gmail.com
Sun Aug 23 08:07:20 EDT 2020


Livecoders,

In my day job, some of my income depends on the number of pages in the PDF documents that I have to read for individual cases. I thought it would be fun and useful to write an LC script that would either count the pages of a single PDF or (even better) get the page counts for a folder full of PDFs.

I didn’t imagine it would be too hard, because both Mac and Windows report the page count instantly and accurately in their file information windows.

I discovered that in a small sample of PDFs a line… 
<< /Type /Pages /MediaBox [0 0 612 792] /Count 149 /Kids [ 1396 0 R 1397 0 R


...contained the page count (the /Count value), which was a bit confusing because I had read that the MediaBox was only about page dimensions. Then I found that some PDFs don’t contain that line, or at least not in the clear.
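
For illustration, the "look for that line" approach can be sketched in Python as below (the same idea should carry over to LC with matchText or similar). This is only a heuristic, not a proper parser: it assumes the /Pages dictionary is stored uncompressed, and it takes the largest /Count it finds on the assumption that the root of the page tree holds the total. The names naive_page_count and folder_page_counts are made up for this example.

```python
import re
from pathlib import Path

# Each indirect object looks like "12 0 obj ... endobj".
OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj(.*?)endobj", re.DOTALL)
# "/Type /Pages" but not "/Type /Page" or "/Type /PagesSomething".
PAGES_RE = re.compile(rb"/Type\s*/Pages(?![A-Za-z])")
COUNT_RE = re.compile(rb"/Count\s+(\d+)")


def naive_page_count(data: bytes):
    """Return the largest /Count found in any plain-text /Pages object.

    Returns None when no such object is visible, for example when the
    objects live inside compressed object streams (PDF 1.5+).
    """
    counts = []
    for match in OBJ_RE.finditer(data):
        body = match.group(1)
        if PAGES_RE.search(body):
            count = COUNT_RE.search(body)
            if count:
                counts.append(int(count.group(1)))
    # Assumption: the root /Pages node carries the biggest count.
    return max(counts) if counts else None


def folder_page_counts(folder):
    """Map each *.pdf file name in a folder to its count (or None)."""
    return {
        path.name: naive_page_count(path.read_bytes())
        for path in sorted(Path(folder).glob("*.pdf"))
    }
```

A None result means "could not tell", not "zero pages", so any file reporting None would still need checking by hand or with a real PDF library.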

There is a general online consensus that reliably finding the page count of a PDF involves quite a lot of messing about and parsing, and may involve pretty much counting the pages.

I found some code here <http://www.angusj.com/delphitips/pdfpagecount.php> with the following walkthrough:

//1.  See if there's a 'Linearization dictionary' for easy parsing.
//    Mostly there isn't so ...
//2.  Locate 'startxref' at end of file
//3.  get 'xref' offset and go to xref table
//4.  depending on version the xref table may or may not be in a compressed
//    stream. If it's in a compressed stream (PDF ver 1.5+) then getting the
//    page number requires a LOT of code which is too convoluted to summarise
//    here. Otherwise it still requires a moderate amount of code ...
//5.  parse the xref table and fill a list with object numbers and offsets
//6.  handle subsections within xref table.
//7.  read 'trailer' section at end of each xref
//8.  store 'Root' object number if found in 'trailer'
//9.  if 'Prev' xref found in 'trailer' - loop back to step 3
//10. locate Root in the object list
//11. locate 'Pages' object from Root
//12. get Count from Pages.
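
As a rough sketch of what steps 2 to 12 involve for the simpler case (a classic, uncompressed xref table), here is a Python version. It is not the Delphi code from the link, and it deliberately gives up (returns None) on the compressed xref streams mentioned in step 4, on encrypted files, and on anything malformed. The function names are hypothetical.

```python
import re


def _ref(body: bytes, key: bytes):
    """Object number from an indirect reference such as '/Root 1 0 R'."""
    m = re.search(rb"/" + key + rb"\s+(\d+)\s+\d+\s+R", body)
    return int(m.group(1)) if m else None


def page_count_via_xref(data: bytes):
    # Steps 2-3: the last 'startxref' holds the offset of the newest xref.
    pos = data.rfind(b"startxref")
    m = re.match(rb"startxref\s+(\d+)", data[pos:]) if pos >= 0 else None
    if not m:
        return None
    offset, offsets, root, seen = int(m.group(1)), {}, None, set()
    try:
        while offset is not None and offset not in seen:
            seen.add(offset)
            # Step 4: no literal 'xref' keyword means an xref stream; give up.
            if not data.startswith(b"xref", offset):
                return None
            trailer_pos = data.find(b"trailer", offset)
            if trailer_pos < 0:
                return None
            # Steps 5-6: each subsection is "first count" followed by
            # count entries of the form "offset generation n|f".
            tokens = data[offset + 4:trailer_pos].split()
            i = 0
            while i < len(tokens):
                first, count = int(tokens[i]), int(tokens[i + 1])
                i += 2
                for k in range(count):
                    if tokens[i + 2] == b"n":
                        # Newest xref is read first, so keep the first value.
                        offsets.setdefault(first + k, int(tokens[i]))
                    i += 3
            # Steps 7-9: read the trailer, remember Root, follow Prev.
            end = data.find(b"startxref", trailer_pos)
            trailer = data[trailer_pos:end if end >= 0 else len(data)]
            if root is None:
                root = _ref(trailer, b"Root")
            prev = re.search(rb"/Prev\s+(\d+)", trailer)
            offset = int(prev.group(1)) if prev else None
    except (ValueError, IndexError):
        return None  # malformed table

    def body_of(num):
        start = offsets.get(num)
        if start is None:
            return None
        stop = data.find(b"endobj", start)
        return data[start:stop] if stop >= 0 else None

    # Steps 10-12: Root (catalog) -> Pages -> Count.
    catalog = body_of(root)
    pages = body_of(_ref(catalog, b"Pages")) if catalog else None
    m = re.search(rb"/Count\s+(\d+)", pages) if pages else None
    return int(m.group(1)) if m else None
```

Note that nothing here reads the page objects themselves: only the tail of the file, the xref table and two small objects are touched, which may be part of why a lookup like this can be quick even on a large file.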


If this is right, how on earth do OSs do it so quickly? Also, and more to the point, am I on a fool’s errand trying to do this with LC? I haven’t seen anything that obviously couldn’t be done (I didn’t understand the regex, but assumed that with effort…). However, parsing huge files just doesn’t look like it would be worth the effort, particularly as I can select all the documents, get info, and sum the pages in my head.

Cheers,

David Glasgow

