Extracting text from a PDF
Paul Dupuis
paul at researchware.com
Mon Mar 9 12:21:48 EDT 2026
I have no idea if this will help, as you are using the PDF Widget and
this was for the XPDF External, but the Widget is based on Google PDFium
just like the older External. The XPDF External had a problem with
hyphenation in PDFs where the hyphen was actually a 2-byte Unicode
character. The following takes the text returned (which may include such
a hyphen) and "fixes" it to use a normal hyphen:
command Rehyphenate @xText
   -- This handler is a workaround for the following bug:
   --    http://quality.livecode.com/show_bug.cgi?id=18442
   -- This bug is fundamentally an issue in the PDFium PDF library where
   -- certain hyphenated strings (such as URLs) with line breaks are
   -- returned with a Unicode BOM (xFFFE) instead of a hyphen.
   -- Rehyphenate restores the hyphen wherever xFFFE is returned
   -- between non-whitespace characters.
   --
   -- The intended usage is:
   --    XPDFViewer_GetSelectionUnicode "Document1", "tUnicode"
   --    put textDecode(tUnicode, "UTF16") into tUnicode
   --    Rehyphenate tUnicode
   --    put tUnicode into ...
   local tStart, tEnd, tBadUnicodeChar
   -- xText has already been decoded, so test for the codepoint U+FFFE
   -- (65534) rather than for the raw bytes 255 and 254
   put numToCodepoint(65534) into tBadUnicodeChar
   if xText contains tBadUnicodeChar then
      -- tStart and tEnd receive the position of the captured xFFFE
      repeat while matchChunk(xText, "[^\s]*(\x{FFFE})[^\s]*", tStart, tEnd)
         put "-" into char tStart to tEnd of xText
      end repeat
   end if
end Rehyphenate
At the very least, there may be a similar bug in the widget (because the
bug was in the underlying PDFium library) that requires a similar
workaround.
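
One way to check whether the widget text has the same problem is to look
at the actual codepoints rather than at charToNum, which (as far as I
know) reports 63, a question mark, for any character that has no native
equivalent. That would also explain why replacing numToChar(63) finds
nothing. The following is only a sketch: ListOddCodepoints is a made-up
helper name, and it assumes the extracted text is already an ordinary
LiveCode Unicode string.

```livecode
-- Hypothetical helper (not part of LiveCode or the PDF widget):
-- returns one line per character that is outside printable ASCII,
-- giving its codepoint in decimal and in hex. Linefeeds (10) are skipped.
function ListOddCodepoints pText
   local tReport, tCodepoint, tNum
   repeat for each codepoint tCodepoint in pText
      put codepointToNum(tCodepoint) into tNum
      if tNum > 126 or (tNum < 32 and tNum is not 10) then
         put tNum && "(hex" && baseConvert(tNum, 10, 16) & ")" & cr after tReport
      end if
   end repeat
   return tReport
end ListOddCodepoints
```

For example, put ListOddCodepoints(the hilitedRangeText of widget "PDF")
would list the suspects (the widget name "PDF" is an assumption). If
65534 (hex FFFE) shows up, then replace numToCodepoint(65534) with "-"
in the extracted text should deal with it, or Rehyphenate above could be
tried as is.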
On 3/9/2026 11:40 AM, David Epstein via use-livecode wrote:
> Does anyone have experience trying to clean up the text that can be extracted from a PDF shown in the PDF widget by getting “the hilitedRangeText” of the widget?
>
> In the case I’m working with there is an invisible numToChar(10) at the end of each visible line; and to obtain text that will wrap freely in a LiveCode field I can “replace numToChar(10) with space” in the text I’ve extracted. This works.
>
> When a word is divided at the end of the line, the visible hyphen is a numToChar(63). But a command to “replace numToChar(63) with empty” does not work, and the character remains in place (showing up, in a field whose font is Palatino, as a boxed question mark).
>
> My impression is that not all PDF documents work the same way, and that there are other problems trying to extract their text. But why does this numToChar(63) character not get replaced?
>
> David Epstein