Parsing and Extracting Text from ePub XHTML

Richmond richmondmathewson at gmail.com
Sun Jul 26 15:45:31 EDT 2015


Presumably you have already tried SET THE HTMLTEXT OF FIELD "BLAH" TO . . .

Richmond.

from my jail-broken, recycled iPad 1


On 26 Jul 2015, at 20:31, Brahmanathaswami <brahma at hindu.org> wrote:

> We do a lot of work with the contents of ePubs. For those who don't know the spec:
> 
> "someBook.epub" is just "someBook.zip"
> 
> which, when inflated, contains a portable mini web site based on responsive CSS (all percentages). You get:
> 
> someBook
> /ops # "Open Publication Structure"
> / fonts
> / images
> / styles
> / xhtml
> toc.ncx
> 
> The xhtml folder then has all of these files:
> ch09_05_b.html
> ch09_05_c.html
> ch09_06.html
> 
> etc.
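
Since the archive is an ordinary zip, the chapter files can be listed without unpacking anything by hand. A minimal Python sketch, assuming the layout quoted above. The helper names `list_chapters` and `make_sample_epub` and the tiny in-memory sample archive are made up for illustration, and a real book may use a folder other than ops/xhtml:

```python
import io
import zipfile


def list_chapters(epub_bytes, folder="ops/xhtml/"):
    """Return the sorted names of the .html/.xhtml members under `folder`.

    `folder` is an assumption based on the layout quoted above.
    """
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return sorted(
            name for name in zf.namelist()
            if name.startswith(folder) and name.endswith((".html", ".xhtml"))
        )


def make_sample_epub():
    """Build a tiny stand-in archive in memory (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("ops/toc.ncx", "<ncx/>")
        zf.writestr("ops/xhtml/ch09_06.html", "<p>six</p>")
        zf.writestr("ops/xhtml/ch09_05_b.html", "<p>five b</p>")
    return buf.getvalue()


sample = make_sample_epub()
chapters = list_chapters(sample)
print(chapters)
```

For a book on disk, the same function should work on the bytes read from the .epub file opened in binary mode.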
> 
> The text is fairly advanced in the sense that it uses Unicode (I think) for rendering diacritical marks, em dashes, etc.
> 
> If I simply import the raw file unprocessed into an LC field (7.0.5), I get the usual, expected text:
> 
> <h3 class="h3s"><samp>Is Monistic Theism Found in the <span class="cmitalic"><samp>Vedas?</samp></span></samp></h3>
> <h4 class="h4"><samp><span class="smallcap"><samp>ŚLOKA 145</samp></span></samp></h4>
> <p class="noindent"><samp><span class="cmbold"><samp>Again and again in the <em>Vedas </em>and from <em>satgurus </em>we hear ‚ÄúAhaṁ Brahm&#x101;smi,‚Äù ‚ÄúI am God,‚Äù and that God is both immanent and transcendent. Taken together, these are clear statements of monistic theism. Aum Nama&#x1e25; &#x15a;iv&#x101;ya.</samp></span></samp></p>
> <h4 class="h4"><samp><span class="smallcap"><samp>BHĀSHYA</samp></span></samp></h4>
> <p class="noindent"><samp>Monistic theism is the philosophy of the <span class="cmitalic"><samp>Vedas</samp></span>. Scholars have long noted that the Hindu scriptures are alternately monistic, describing the oneness of the individual soul and God, and theistic, describing the reality of the Personal God. One cannot read the <span class="cmitalic"><samp>Vedas</samp></span>, <span class="cmitalic"><samp>&#x15a;aiva Āgamas</samp></span> and hymns of the saints without being overwhelmed with theism as well as monism. Monistic theism is the essential teaching of Hinduism, of &#x15a;aivism. It is the conclusion of Tirumular, Vasugupta, Gorakshanatha, Bhaskara, Srikantha, Basavanna, Vallabha, Ramakrishna, Yogaswami, Nityananda, Radhakrishnan and thousands of others. It encompasses both Siddh&#x101;nta and Ved&#x101;nta. It says, God is and is in all things. It propounds the hopeful, glorious, exultant concept that every soul will finally merge with &#x15a;iva in undifferentiated oneness, none left to suffer forever because of human transgression. The <span class="cmitalic"><samp>Vedas</samp></span> wisely proclaim, ‚ÄúHigher and other than the world-tree, time and forms is He from whom this expanse proceeds‚Äîthe bringer of <span class="cmitalic"><samp>dharma,</samp></span> the remover of evil, the lord of prosperity. Know Him as in one‚Äôs own Self, as the immortal abode of all.‚Äù Aum Nama&#x1e25; &#x15a;iv&#x101;ya.</samp></p>
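
For quote extraction, the nested samp/span wrappers carry no text of their own, so one option is to drop every tag and keep only the character data. A rough Python sketch using the standard library's `html.parser`. The class name `TextOnly`, the function `xhtml_to_text`, and the set of tags treated as paragraph breaks are my own choices, not anything from the ePub spec:

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data and drop every tag."""

    # Tags assumed to end a paragraph-like block (illustrative choice).
    BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "div"}

    def __init__(self):
        # convert_charrefs=True turns references like &#x101; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")


def xhtml_to_text(markup):
    """Return one line of plain text per block element."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    lines = "".join(parser.parts).split("\n")
    # Collapse runs of whitespace and skip empty lines.
    return "\n".join(" ".join(line.split()) for line in lines if line.strip())


sample = (
    '<h4 class="h4"><samp><span class="smallcap"><samp>BH\u0100SHYA'
    '</samp></span></samp></h4>\n'
    '<p class="noindent"><samp>It encompasses both Siddh&#x101;nta and '
    'Ved&#x101;nta.</samp></p>'
)
text = xhtml_to_text(sample)
print(text)
```

This throws away italics and small caps, which may or may not matter for the quotes database.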
> 
> The goal is to create a tool that lets volunteers go in and grab a few sentences as quotes, which we will then push to an online database.
> 
> So: what is the best way to get this text rendered? Do I go the path of setting the field's Unicode? But then what about the HTML markup? If we create a browser object, can users select text, and does LC know that there is a selected chunk if it is inside a browser object?
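
On the Unicode question: sequences such as ‚Äú and ‚Äù in the sample look like the UTF-8 bytes of curly quotes read as Mac Roman, which suggests the files want decoding as UTF-8 before anything is displayed. A small Python sketch of that diagnosis, assuming the chapter files really are UTF-8. The function names are hypothetical, and the repair step only works on runs whose characters all exist in Mac Roman:

```python
def decode_chapter(raw_bytes):
    """Decode chapter bytes as UTF-8 (assumed encoding of the files)."""
    return raw_bytes.decode("utf-8")


def repair_mac_roman_mojibake(garbled):
    """Undo a UTF-8-read-as-Mac-Roman mix-up for an already garbled run.

    Raises UnicodeEncodeError if `garbled` contains a character that
    Mac Roman cannot represent.
    """
    return garbled.encode("mac_roman").decode("utf-8")


raw = "\u201cI am God,\u201d".encode("utf-8")  # bytes as stored on disk
wrong = raw.decode("mac_roman")                # what a Mac Roman reader shows
right = decode_chapter(raw)
print(wrong)
print(right)
print(repair_mac_roman_mojibake(wrong) == right)
```

Decoding correctly up front is the cleaner route; the repair function is only there to show where the odd characters come from.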
> 
> Before I start wading into this, I thought I would see whether anyone else has some good guidance in advance.
> 
> 
> Swasti Astu, Be Well!
> Brahmanathaswami
> 
> Kauai's Hindu Monastery
> www.HimalayanAcademy.com
> 
> 
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode



