How to extract whole text from a PDF file with the PDF widget?

Monte Goulding monte at appisle.net
Mon Dec 13 01:30:01 EST 2021


Both the page index and the character index are clamped to the number of pages and the number of characters on a page, so you could set both to very high numbers to select through to the end of the document. Adding character counts to the documentPages property might be useful here too.
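
As an untested sketch of the approach described above: the handler below sets the hilitedRange with deliberately oversized end values and then reads the hilitedRangeText. The form of the range (startPage,startChar,endPage,endChar), the 1-based start indices, and the handler and parameter names are assumptions that are not confirmed in this thread, so please check the dictionary entry for hilitedRange before relying on it.

```livecode
-- Untested sketch. Assumptions (not confirmed in this thread):
--   * hilitedRange is a list of the form startPage,startChar,endPage,endChar
--   * page and character indices start at 1
--   * pWidget is the name of a PDF widget that already has a document loaded
function pdfWholeText pWidget
   local tText
   lock screen
   -- both end values are far too large on purpose; the widget clamps them
   set the hilitedRange of widget pWidget to "1,1,1000000,1000000"
   put the hilitedRangeText of widget pWidget into tText
   unlock screen
   return tText
end pdfWholeText
```

With a widget named "PDF" and a field named "Extracted Text" (both hypothetical names), it could be called as: put pdfWholeText("PDF") into field "Extracted Text". The sketch leaves the whole document highlighted afterwards.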

Cheers

Monte

> On 13 Dec 2021, at 11:17 am, Paul Dupuis via use-livecode <use-livecode at lists.runrev.com> wrote:
> 
> Thank you Monte,
> 
> We've just started to make a map from XPDF APIs to the PDF Widget APIs, so I'll make sure that gets done soon and add any missing capabilities as requests to the LC Quality Center.
> 
> With regard to the hilitedRange and hilitedRangeText properties, can you just advise on the correct use to get a PDF's text? I.e., can you use a range of 1 to -1 to get the whole document text, or would that just be the current page text?
> 
> Thanks in advance,
> 
> 
> On 12/12/2021 6:49 PM, Monte Goulding via use-livecode wrote:
>> Hi Folks
>> 
>> Currently you can extract text in the widget by setting the hilitedRange and getting the hilitedRangeText. It wouldn’t be that hard to add extracted text to the documentPages property. The PDF widget was built to meet the requirements of a client rather than to match the features of XPDF, so it’s worthwhile for anyone still using XPDF to take the time to audit their use and see if there are any extra features required. If so, please create feature requests for them. While XPDF will continue to function, we intend to stop including it in LiveCode.
>> 
>> Cheers
>> 
>> Monte
>> 
>>> On 12 Dec 2021, at 12:27 am, Paul Dupuis via use-livecode <use-livecode at lists.runrev.com> wrote:
>>> 
>>> I suspect it is for backward compatibility.
>>> 
>>> When I turned over the XPDF external to LiveCode, I asked that they maintain it for a couple of years. I had expected we'd migrate our apps to the PDF widget by then, but business factors mean we're only now just starting a migration.
>>> 
>>> That's why I jumped in on this thread - we HAVE to have the ability to extract text and images from the PDF widget (as you can with the External) - to migrate to the Widget.
>>> 
>>> I suspect many other commercial developers who used the External still have active code using it that they have not migrated yet OR the issue of the undocumented (or, even worse, missing) properties of the widget most likely would have been raised before now.
>>> 
>>> To migrate, all the commands and functions of the External need to be mapped to the properties of the Widget. We probably have a couple hundred calls to the External in our code, all of which need to be mapped, updated, and tested - so no trivial task.
>>> 
>>> 
>>> On 12/11/2021 6:50 AM, matthias rebbe via use-livecode wrote:
>>>> Ah, I thought you were referring only to XPDF.
>>>> By the way, do you have any idea why both the XPDF external and the PDF widget are maintained? Wouldn't it make sense to have only one PDF solution included?
>>>> Or am I missing something?
>>>> 
>>>> Regards,
>>>> Matthias
>>>> 
>>>> 
>>>>> On 11.12.2021 at 02:01, Paul Dupuis via use-livecode <use-livecode at lists.runrev.com> wrote:
>>>>> 
>>>>> Yes, I am familiar with the XPDF external (based on Google's PDFium library), having designed it and paid Monte to code it and then turned it over to LiveCode.
>>>>> 
>>>>> I was referring to the PDF Widget (also based on Google's PDFium), which should have a comparable property for fetching the text of a page. The LC dictionary does not list any property for returning the page text, so I assume that is a Dictionary/Documentation error and that Monte can tell us the correct property of the PDF widget that will return the text of a page.
>>>>> 
>>>>> 
>>>>> On 12/10/2021 7:05 PM, matthias rebbe via use-livecode wrote:
>>>>>> Paul,
>>>>>> 
>>>>>> Here on macOS, the dictionary of LC 10 DP1 definitely lists the function XPDFViewer_Text(viewerName, pageNumber).
>>>>>> By the way, checking this showed me that this function seems to be deprecated and that the command
>>>>>>      XPDFViewer_Unicode viewerName, pageNumber, variableName
>>>>>> should be used instead.
>>>>>> 
>>>>>> 
>>>>>>> On 10.12.2021 at 23:22, Paul Dupuis via use-livecode <use-livecode at lists.runrev.com> wrote:
>>>>>>> 
>>>>>>> There must be an undocumented property for the text of a page - there was a function to return the full text of a page in the External (XPDF) and to get the full text of the PDF file, you just stepped through the pages (1..N) getting and concatenating the page text.
>>>>>>> 
>>>>>>> Monte? LC 10.0.0 Dictionary does not list a property for the page text.
>>>>>>> 
>>>>>>> 
>>>>>>> On 12/10/2021 4:46 PM, Torsten Holmer via use-livecode wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I have a PDF file with text and pictures, but I just want the text.
>>>>>>>> 
>>>>>>>> I can do it manually with Ctrl-A and Ctrl-Copy by viewing the file with Preview on MacOS.
>>>>>>>> 
>>>>>>>> I have a business licence and want to use the PDF widget but I cannot find a way to do it.
>>>>>>>> 
>>>>>>>> Can someone help me out?
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Torsten
>>>>>>>> _______________________________________________
>>>>>>>> use-livecode mailing list
>>>>>>>> use-livecode at lists.runrev.com
>>>>>>>> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
>>>>>>>> http://lists.runrev.com/mailman/listinfo/use-livecode



