Parsing and Extracting Text from ePub XHTML
brahma at hindu.org
Thu Jul 30 06:06:21 CEST 2015
Pursuant to my last long-winded email:
I've boiled it down to something very simple.
The company doing our ePubs is mixing:
1) Unicode HTML decimal entities for diacriticals in the IAST roman character range
2) Unicode UTF strings for Tamil and Devanagari script
3) old-fashioned punctuation in the ISO-8859-1 range 128-255
<http://www.alanwood.net/demos/charsetdiffs.html#a> (I think...)
1 and 2 above work fine. Mixing #3 into the text blocks does, for
mysterious reasons I cannot fathom, show correctly in a browser (I wonder
if it shows correctly on Windows).
The curly apostrophe in "Sampradaya's" and the mdash on the last line
"souls—he who..." are rendered in the browser. I guess browsers are built
to properly render mixed Unicode and ISO 8859-1 text in the same
BUT! They turn into garbage if saved, copied out of an LC field and pasted
into a MySQL column/field. Meanwhile all the Unicode (both HTML entities
and UTF strings) is rendered correctly and preserved across all
"transport agents" -- GET/POST via an api.php, then processed in stacks
with LC, JSON-encoded arrays, etc. -- but not any character in the range of
Dunno why, but at least we know what to do... not use any ANSI! Unicode
all the way...
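
For what it's worth, here is one possible explanation, offered as an assumption rather than a diagnosis. The garbage sequences in the quoted sample further down are exactly what appears when UTF-8 bytes for curly quotes and dashes are interpreted as MacRoman. The sketch below is in Python rather than LiveCode, purely for illustration; the function names are made up for this example.

```python
# Illustration only: UTF-8 punctuation read as MacRoman (assumed here to be
# the native encoding on the Mac) produces three-character garbage sequences.
SAMPLES = {
    "left double quote": "\u201c",
    "right double quote": "\u201d",
    "em dash": "\u2014",
    "right single quote": "\u2019",
}


def misread_as_macroman(text):
    """Encode text as UTF-8, then decode those bytes as if they were MacRoman."""
    return text.encode("utf-8").decode("mac_roman")


def repair(garbled):
    """Reverse the misreading: back to the raw bytes, then decode as UTF-8."""
    return garbled.encode("mac_roman").decode("utf-8")


if __name__ == "__main__":
    for name, ch in SAMPLES.items():
        print(name, "->", misread_as_macroman(ch))
```

If this is what is happening, the punctuation in the files may already be valid UTF-8, and the damage would come from reading the bytes in the wrong encoding rather than from ISO-8859-1 characters in the source.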
It's as simple as:
command getContent pBookPart
   put the uBookFilesLocation of this stack into tPath
   put url ("file:" & tPath & "/ops/xhtml/" & pBookPart) into tText
   # Fix ANSI chars first
   # (decimal entity forms assumed here; the archived copy of this message
   # shows the same literal character on both sides of each replace)
   replace "—" with "&#8212;" in tText
   replace "’" with "&#8217;" in tText
   # Unicode all the way!
   put uniDecode(uniEncode(tText,"UTF8")) into tText
   set the htmlText of fld "CurrentChapterText" to tText
end getContent
Now the content in the field can be cut, pasted and moved to MySQL, and
the quotes and mdashes all work... hurray...
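
The same "entities all the way" idea can be sketched outside LiveCode. This is a minimal Python illustration, not the handler above: it assumes the text has already been decoded into a proper Unicode string, and the function name is hypothetical. Every non-ASCII character becomes a decimal numeric entity, so only plain ASCII travels through GET/POST, JSON and MySQL.

```python
def to_decimal_entities(text):
    """Replace every non-ASCII character with its decimal HTML entity.

    Uses Python's built-in 'xmlcharrefreplace' error handler, which writes
    characters that cannot be encoded as ASCII in the form &#NNNN;.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_decimal_entities("Sampradaya\u2019s"))
    print(to_decimal_entities("souls\u2014he who"))
```

ASCII characters, including existing entities such as those used for the IAST diacriticals, pass through unchanged.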
Swasti Astu, Be Well!
Kauai's Hindu Monastery
Mark Schonewille wrote:
> Hi Brahmanathaswami,
> This works on LC 6.7.3:
> on mouseUp
>    put fld 1 into x
>    if the platform is not "MacOS" then
>       // not sure why this works
>       put isoToMac(x) into x
>    end if
>    put uniDecode(uniEncode(x,"UTF8")) into x
>    set the htmlText of fld 2 to x
> end mouseUp
> Best regards,
> Mark Schonewille
> Economy-x-Talk Consulting and Software Engineering
> Homepage: http://economy-x-talk.com
> Twitter: http://twitter.com/xtalkprogrammer
> KvK: 50277553
> On 7/26/2015 21:31, Brahmanathaswami wrote:
>> We do a lot of work with the contents of ePubs. For those who don't know
>> the spec:
>> "someBook.epub" is just "someBook.zip"
>> which when inflated has a mini-portable web site based on responsive CSS
>> (all percentages). You get
>> /ops          # "Open Package Structure"
>>    /fonts
>>    /images
>>    /styles
>>    /xhtml
>> The xhtml folder then has all of these files:
>> The text is pretty advanced in the sense that it uses Unicode... (I
>> think) for rendering diacriticals, mdashes, etc.
>> If I simply import the raw file unprocessed into a LC field (7.0.5)... I
>> get the usual, expected text:
>> <h3 class="h3s"><samp>Is Monistic Theism Found in the <span
>> <h4 class="h4"><samp><span class="smallcap"><samp>ŚLOKA
>> <p class="noindent"><samp><span class="cmbold"><samp>Again and again in
>> the <em>Vedas </em>and from <em>satgurus </em>we hear ‚ÄúAhaṁ
>> Brahmāsmi,‚Äù ‚ÄúI am God,‚Äù and that God is both immanent and
>> transcendent. Taken together, these are clear statements of monistic
>> theism. Aum Namaḥ Śivāya.</samp></span></samp></p>
>> <h4 class="h4"><samp><span
>> <p class="noindent"><samp>Monistic theism is the philosophy of the <span
>> class="cmitalic"><samp>Vedas</samp></span>. Scholars have long noted
>> that the Hindu scriptures are alternately monistic, describing the
>> oneness of the individual soul and God, and theistic, describing the
>> reality of the Personal God. One cannot read the <span
>> class="cmitalic"><samp>Vedas</samp></span>, <span
>> class="cmitalic"><samp>Śaiva Āgamas</samp></span> and hymns
>> of the saints without being overwhelmed with theism as well as monism.
>> Monistic theism is the essential teaching of Hinduism, of Śaivism.
>> It is the conclusion of Tirumular, Vasugupta, Gorakshanatha, Bhaskara,
>> Srikantha, Basavanna, Vallabha, Ramakrishna, Yogaswami, Nityananda,
>> Radhakrishnan and thousands of others. It encompasses both
>> Siddhānta and Vedānta. It says, God is and is in all things.
>> It propounds the hopeful, glorious, exultant concept that every soul
>> will finally merge with Śiva in undifferentiated oneness, none
>> left to suffer forever because of human transgression. The <span
>> class="cmitalic"><samp>Vedas</samp></span> wisely proclaim, ‚ÄúHigher
>> and other than the world-tree, time and forms is He from whom this
>> expanse proceeds‚Äîthe bringer of <span
>> class="cmitalic"><samp>dharma,</samp></span> the remover of evil, the
>> lord of prosperity. Know Him as in one‚Äôs own Self, as the immortal
>> abode of all.‚Äù Aum Namaḥ Śivāya.</samp></p>
>> The goal is to create a tool for volunteers to go in and extract quotes,
>> allowing them to grab a few sentences, which we will then push to an online
>> So: What is the best way to get this text rendered? Do I go the path of
>> setting the field's Unicode? But then what about the HTML markup? If we
>> create a browser object... can users select text, and does LC know that
>> there is a selected chunk if it is inside a browser object?
>> Before I start wading into this I thought I'd see if anyone else has some
>> good guidance in advance.
>> Swasti Astu, Be Well!
>> Kauai's Hindu Monastery
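
On the point in the quoted message that an .epub is just a zip archive, here is a rough Python sketch of pulling the chapter files straight out of the archive without unzipping to disk. It rests on two assumptions not confirmed in the thread: that the chapter files end in ".xhtml" and that they are UTF-8 encoded. The function name is hypothetical.

```python
import io
import zipfile


def read_xhtml_parts(epub_source):
    """Return {archive path: decoded text} for each .xhtml file in an ePub.

    epub_source may be a file path or a file-like object. Decoding as UTF-8
    is an assumption about how the files were produced.
    """
    parts = {}
    with zipfile.ZipFile(epub_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".xhtml"):
                # Read raw bytes and decode explicitly, so no platform
                # default encoding gets a chance to garble the punctuation.
                parts[name] = archive.read(name).decode("utf-8")
    return parts
```

Calling read_xhtml_parts("someBook.epub") would return a dictionary keyed by paths such as "ops/xhtml/<file>", from which a quote-extraction tool could take its text.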