Parsing and Extracting Text from ePub XHTML

Hi Brahmanathaswami,

This works on LC 6.7.3:

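on mouseUp
   put fld 1 into x
   if the platform is not "MacOS" then
      // not sure why this works
      put isoToMac(x) into x
   end if
   // treat x as UTF-8, convert it to UTF-16, then back to the native encoding
   put uniDecode(uniEncode(x,"UTF8")) into x
   // let the field render the markup instead of showing the raw tags
   set the htmlText of fld 2 to x
end mouseUp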
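
On LC 7 and later, textDecode may be a more direct route than the uniEncode/uniDecode round trip. What follows is only a minimal sketch, not something I have run against your files: the handler name showChapter, the parameter pPath, and the use of fld 2 are assumptions for illustration. It reads a chapter file as raw bytes, decodes them as UTF-8, and hands the result to the field's htmlText.

```livecode
-- Hypothetical helper: showChapter, pPath and fld 2 are illustrative names.
command showChapter pPath
   local tBytes, tText
   -- read the file as raw bytes so no line-ending or charset translation occurs
   put URL ("binfile:" & pPath) into tBytes
   -- LC 7+: decode the UTF-8 bytes into LiveCode text
   put textDecode(tBytes, "UTF-8") into tText
   -- render the markup in the field, as in the handler above
   set the htmlText of fld 2 to tText
end showChapter
```

You would call it with the full path to one of the chapter files in the xhtml folder, for example from a button or a file-selection dialog.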

On 7/26/2015 21:31, Brahmanathaswami wrote:
> We do a lot of work with the contents of ePubs. For those who don't know
> the spec:
> "someBook.epub" is just ""
> which when inflated has a mini-portable web site based on responsive CSS
> (all percentages). You get
> someBook
> /ops # "Open Publication Structure"
> / fonts
> / images
> / styles
> / xhtml
> toc.ncx
> The xhtml folder then has all of these files:
> ch09_05_b.html
> ch09_05_c.html
> ch09_06.html
> etc.
> The text is pretty advanced in the sense that it uses Unicode (I
> think) for rendering diacritical fonts, mdashes, etc.
> If I simply import the raw file unprocessed into a LC field (7.0.5)... I
> get the usual, expected text:
> <h3 class="h3s"><samp>Is Monistic Theism Found in the <span
> class="cmitalic"><samp>Vedas?</samp></span></samp></h3>
> <h4 class="h4"><samp><span class="smallcap"><samp>ŚLOKA
> 145</samp></span></samp></h4>
> <p class="noindent"><samp><span class="cmbold"><samp>Again and again in
> the <em>Vedas </em>and from <em>satgurus </em>we hear “Ahaṁ
> Brahm&#x101;smi,” “I am God,” and that God is both immanent and
> transcendent. Taken together, these are clear statements of monistic
> theism. Aum Nama&#x1e25; &#x15a;iv&#x101;ya.</samp></span></samp></p>
> <h4 class="h4"><samp><span
> class="smallcap"><samp>BHĀSHYA</samp></span></samp></h4>
> <p class="noindent"><samp>Monistic theism is the philosophy of the <span
> class="cmitalic"><samp>Vedas</samp></span>. Scholars have long noted
> that the Hindu scriptures are alternately monistic, describing the
> oneness of the individual soul and God, and theistic, describing the
> reality of the Personal God. One cannot read the <span
> class="cmitalic"><samp>Vedas</samp></span>, <span
> class="cmitalic"><samp>&#x15a;aiva Āgamas</samp></span> and hymns
> of the saints without being overwhelmed with theism as well as monism.
> Monistic theism is the essential teaching of Hinduism, of &#x15a;aivism.
> It is the conclusion of Tirumular, Vasugupta, Gorakshanatha, Bhaskara,
> Srikantha, Basavanna, Vallabha, Ramakrishna, Yogaswami, Nityananda,
> Radhakrishnan and thousands of others. It encompasses both
> Siddh&#x101;nta and Ved&#x101;nta. It says, God is and is in all things.
> It propounds the hopeful, glorious, exultant concept that every soul
> will finally merge with &#x15a;iva in undifferentiated oneness, none
> left to suffer forever because of human transgression. The <span
> class="cmitalic"><samp>Vedas</samp></span> wisely proclaim, “Higher
> and other than the world-tree, time and forms is He from whom this
> expanse proceeds—the bringer of <span
> class="cmitalic"><samp>dharma,</samp></span> the remover of evil, the
> lord of prosperity. Know Him as in one’s own Self, as the immortal
> abode of all.” Aum Nama&#x1e25; &#x15a;iv&#x101;ya.</samp></p>
> The goal is to create a tool that lets volunteers go in and grab a few
> sentences as quotes, which we will then push to an online
> database.
> So: What is the best way to get this text rendered? Do I go the path of
> setting the field's Unicode? But then what about the HTML markup? If we
> create a browser object, can users select text, and does LC know that
> there is a selected chunk if it is inside a browser object?
> Before I start wading into this, I thought I would see if anyone else has
> some good guidance in advance,
> Swasti Astu, Be Well!
> Brahmanathaswami
> Kauai's Hindu Monastery

