Parsing and Extracting Text from ePub XHTML
Brahmanathaswami
brahma at hindu.org
Sun Jul 26 15:31:13 EDT 2015
We do a lot of work with the contents of ePubs. For those who don't know
the spec:
"someBook.epub" is just "someBook.zip"
which, when inflated, contains a small portable web site based on
responsive CSS (all percentages). You get:
someBook
/ops # "Open Publication Structure"
/ fonts
/ images
/ styles
/ xhtml
toc.ncx
The xhtml folder then has all of these files:
ch09_05_b.html
ch09_05_c.html
ch09_06.html
etc.
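Since the .epub is an ordinary zip archive, the chapter files can be listed and read without inflating it by hand. The following is only a minimal sketch in Python, not LiveCode; the function names are made up for illustration, and it assumes the chapters are UTF-8 encoded and carry an .html, .htm or .xhtml extension.

```python
import zipfile


def list_xhtml_files(epub_path):
    """Return the sorted names of the chapter files inside an ePub archive."""
    with zipfile.ZipFile(epub_path) as book:
        return sorted(
            name
            for name in book.namelist()
            if name.lower().endswith((".html", ".htm", ".xhtml"))
        )


def read_chapter(epub_path, member_name):
    """Read one chapter from the archive and decode it as UTF-8 (an assumption)."""
    with zipfile.ZipFile(epub_path) as book:
        return book.read(member_name).decode("utf-8")
```

Calling list_xhtml_files("someBook.epub") would give the full member paths as stored in the archive, which can then be passed to read_chapter one at a time.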
The text is pretty advanced in the sense that it uses Unicode (I
think) for rendering diacritical marks, mdashes, etc.
If I simply import the raw file unprocessed into an LC field (7.0.5), I
get the usual, expected text:
<h3 class="h3s"><samp>Is Monistic Theism Found in the <span
class="cmitalic"><samp>Vedas?</samp></span></samp></h3>
<h4 class="h4"><samp><span class="smallcap"><samp>ŚLOKA
145</samp></span></samp></h4>
<p class="noindent"><samp><span class="cmbold"><samp>Again and again in
the <em>Vedas </em>and from <em>satgurus </em>we hear “Ahaṁ
Brahmāsmi,” “I am God,” and that God is both immanent and
transcendent. Taken together, these are clear statements of monistic
theism. Aum Namaḥ Śivāya.</samp></span></samp></p>
<h4 class="h4"><samp><span
class="smallcap"><samp>BHĀSHYA</samp></span></samp></h4>
<p class="noindent"><samp>Monistic theism is the philosophy of the <span
class="cmitalic"><samp>Vedas</samp></span>. Scholars have long noted
that the Hindu scriptures are alternately monistic, describing the
oneness of the individual soul and God, and theistic, describing the
reality of the Personal God. One cannot read the <span
class="cmitalic"><samp>Vedas</samp></span>, <span
class="cmitalic"><samp>Śaiva Āgamas</samp></span> and hymns
of the saints without being overwhelmed with theism as well as monism.
Monistic theism is the essential teaching of Hinduism, of Śaivism.
It is the conclusion of Tirumular, Vasugupta, Gorakshanatha, Bhaskara,
Srikantha, Basavanna, Vallabha, Ramakrishna, Yogaswami, Nityananda,
Radhakrishnan and thousands of others. It encompasses both
Siddhānta and Vedānta. It says, God is and is in all things.
It propounds the hopeful, glorious, exultant concept that every soul
will finally merge with Śiva in undifferentiated oneness, none
left to suffer forever because of human transgression. The <span
class="cmitalic"><samp>Vedas</samp></span> wisely proclaim, “Higher
and other than the world-tree, time and forms is He from whom this
expanse proceeds—the bringer of <span
class="cmitalic"><samp>dharma,</samp></span> the remover of evil, the
lord of prosperity. Know Him as in one’s own Self, as the immortal
abode of all.” Aum Namaḥ Śivāya.</samp></p>
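One way to get from markup like this to plain, quotable text is to decode the file as UTF-8 first and then drop the tags while keeping the character data. The sketch below is in Python rather than LiveCode and only illustrates the idea; the class and function names are hypothetical. It also shows why the decoding step matters: a curly quote stored as UTF-8 but read in a legacy Mac encoding turns into three junk characters.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data and ignore every tag."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &#8220; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def xhtml_to_text(markup):
    """Strip tags and collapse line wraps and runs of spaces into single spaces."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Note: adjacent blocks with no whitespace between them would run together
    return " ".join("".join(parser.parts).split())


sample = (
    '<h3 class="h3s"><samp>Is Monistic Theism Found in the <span\n'
    'class="cmitalic"><samp>Vedas?</samp></span></samp></h3>'
)
plain = xhtml_to_text(sample)  # "Is Monistic Theism Found in the Vedas?"

# UTF-8 bytes of a left curly quote, misread as MacRoman, become three characters
garbled = "\u201c".encode("utf-8").decode("mac_roman")
```

This loses the italic and bold styling, which may be acceptable if volunteers only need to copy sentences out of the text.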
The goal is to create a tool that lets volunteers go in and extract
quotes: they grab a few sentences, which we will then push to an online
database.
So: what is the best way to get this text rendered? Do I go the path of
treating the field's contents as Unicode? But then what about the HTML
markup? If we create a browser object, can users select text, and does LC
know that there is a selected chunk if it is inside a browser object?
Before I start wading into this, I thought I would see if anyone else has
some good guidance in advance.
Swasti Astu, Be Well!
Brahmanathaswami
Kauai's Hindu Monastery
www.HimalayanAcademy.com