Parsing and Extracting Text from ePub XHTML

Brahmanathaswami brahma at hindu.org
Wed Jul 29 22:57:37 EDT 2015


Aloha, James:

Great, thanks for this. It seems we are each re-inventing the wheel here.

Your code is useful though.

I see you are still having to deal with the pesky "—", but only in 
your TOC XML processing routines.

What I don't understand (which makes it hard to make good strategic 
decisions moving forward) is why these ePub companies (Atritex in 
Chennai, for us) take text from books (InDesign) and then output 
strings that are, to my eyes, mixed. Take for example this file from 
the front matter of a book (see below).

If I take this raw text and push it to a field using the code suggested by Mark:

      put uniDecode(uniEncode(x,"UTF8")) into x
      set the htmlText of fld 2 to x

and *if* I make sure to set the field to a font like Arial Unicode MS 
or Helvetica Neue, it works "brilliantly" and I can even cut and paste 
to email or InDesign or Pages or MS Word and all characters are rendered. 
"Wunderbar! Marvelous!"
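
For comparison, LiveCode 7 also provides textDecode for turning raw bytes into text. The following is only a minimal sketch, assuming LC 7 or later and assuming the XHTML file is saved as UTF-8; the handler name showChapter and the parameter pPath are made up for illustration, and it is not claimed to behave identically to Mark's line.

```livecode
-- Sketch only: pPath is a hypothetical path to one XHTML file unpacked from the ePub
on showChapter pPath
   local tBytes, tText
   -- binfile: reads the raw bytes without any translation
   put url ("binfile:" & pPath) into tBytes
   -- Assumption: the bytes are UTF-8
   put textDecode(tBytes, "UTF-8") into tText
   -- htmlText resolves numeric entities such as &#x101; when the field is set
   set the htmlText of fld 2 to tText
end showChapter
```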

But it makes me nervous when I look at the raw code, because we see 
Unicode characters output as decimal entities:

Tamil Letter U
HTML Entity (decimal) 	&#2953;

UTF-16 (hex) 	0x0B89 (0b89)
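
The decimal entity and the hex value are two spellings of the same code point, since hex 0B89 is decimal 2953. A minimal sketch to check that, assuming LC 7 or later (numToCodepoint is new in 7) and a field 2 set to a font that contains Tamil:

```livecode
on mouseUp
   local tFromEntity, tFromHex
   put 2953 into tFromEntity                      -- the number inside the decimal entity
   put baseConvert("0B89", 16, 10) into tFromHex  -- the hex value converted to decimal
   -- Both numbers name the same code point, so the comparison is true
   answer (tFromEntity = tFromHex)
   -- numToCodepoint turns the number into the character itself
   put numToCodepoint(tFromEntity) into fld 2
end mouseUp
```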



(Line 1 here) &#2953;   Unicode expressed as a decimal entity. We don't 
see the script here.

(Line 4 here) ????????   which is obviously "pure" Unicode for the Tamil 
language. I don't know what encoding it is, because I see the actual 
script in the raw text,

mixed with curly quotes and mdashes, which I assume are ANSI 
characters, because if I open the file in BBEdit those characters are 
not encoded; they just appear as ' and ---

So we seem to have three different encodings.
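
One way to test this worry: an entity is plain ASCII text, while the literal Tamil, the curly quotes and the mdashes may all simply be UTF-8 bytes in the same file. In UTF-8 each of those characters takes three bytes, which a single-byte ANSI character never would. A minimal sketch, assuming LC 7 or later and a field 2 for the output:

```livecode
on mouseUp
   local tChar, tBytes, tReport
   -- 2953 = Tamil letter U, 8212 = em dash, 8217 = right curly quote
   repeat for each item tNum in "2953,8212,8217"
      put numToCodepoint(tNum) into tChar
      put textEncode(tChar, "UTF-8") into tBytes
      -- Each of these three characters encodes to 3 bytes in UTF-8
      put tNum && "takes" && the number of bytes in tBytes && "bytes" & return after tReport
   end repeat
   put tReport into fld 2
end mouseUp
```

If the raw file shows three bytes at the position of each curly quote or mdash, it is UTF-8 throughout and only the entities are a separate notation.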

I paste here a rendering of the text below, which I copied out of my LC 
field to InDesign, then from InDesign to this email.

I wonder if you are seeing the Unicode Tamil, all the diacritical marks, 
the mdash and the curly quote?

Or do you see garbage when you open this post?

-------
????????
The thirty-six elements dance. Sadāśiva dances. Consciousness dances. 
Śiva-Śakti dances. The animate and inanimate dance. All these and the 
Vedas dance when the Supreme dances His dance of bliss. The seven worlds 
as His golden abode, the five chakras as His pedestal, the central 
kuṇḍalinī śakti as His divine stage, thus in rapture He dances, He 
who is Transcendent Light. He dances with the celestials. He dances in 
the golden hall. He dances with the three Gods. He dances with the 
assembly of silent sages. He dances in song. He dances in ultimate 
energy. He dances in souls---He who is the Lord of Dances. Tat Astu.
-----------

Everything appears to work, which is amazing, and means LC 7 is a huge 
step forward for us, provided this really holds all the way through a 
JSON-encoded POST to MySQL and back out again to a desktop client or 
mobile app without anything breaking.
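
A simple way to find out is to compare what was sent with what comes back. The following is a minimal sketch, assuming LC 7 or later and assuming the server returns the text as UTF-8 bytes; the function name and both parameters are hypothetical, and the POST, the JSON handling and MySQL itself are outside the sketch.

```livecode
-- pOriginal: the text as it was in the field before sending
-- pReturnedBytes: the raw bytes of the same text as received back from the server
function sameAfterRoundTrip pOriginal, pReturnedBytes
   local tBack
   -- Compare strictly: no case folding, no Unicode normalization
   set the caseSensitive to true
   set the formSensitive to true
   put textDecode(pReturnedBytes, "UTF-8") into tBack
   return tBack is pOriginal
end sameAfterRoundTrip
```

If this returns false for the dedication text, something in the chain (the connection character set, the table collation, the JSON escaping) is changing the characters.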

We *can* dumb this all down to 0-127. (I used to do that years ago and 
have a whole stack dedicated to stripping all diacriticals, replacing 
ANSI chars, etc., according to our spelling/lexicon conventions.)
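
For the diacriticals part, Unicode normalization can do most of the work: decompose each letter into base letter plus combining mark (form NFD), then drop the marks. A minimal sketch, assuming LC 7 or later; the function name is made up, and it covers only the combining marks in the range U+0300 to U+036F, not curly quotes or dashes, which would still need explicit replacements.

```livecode
function stripDiacritics pText
   local tDecomposed, tResult, tNum
   -- NFD splits a letter like a-with-macron into "a" plus a combining macron
   put normalizeText(pText, "NFD") into tDecomposed
   repeat for each codepoint tCP in tDecomposed
      put codepointToNum(tCP) into tNum
      -- 768 to 879 is the Combining Diacritical Marks block (U+0300 to U+036F)
      if tNum >= 768 and tNum <= 879 then next repeat
      put tCP after tResult
   end repeat
   return tResult
end stripDiacritics
```

Applied to "Sadāśiva" this should give "Sadasiva", which can then go through whatever spelling conventions are wanted.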

But if LC 7 can actually provide us a way to display everything, and 
I can actually put this on a web page also, it will be a quantum leap 
forward for us.

Here is what is in the ePub. Maybe I really shouldn't worry about the 
different encodings at all, and just assume this will retain "integrity" 
through all processes, as long as the rendering context is using a 
Unicode font?

<h4 class="h4g"><samp><small>&#2953; </small></samp></h4>
<h3 class="h3d"><samp><span 
class="cmbold"><samp>Dedication</samp></span></samp></h3>
<h4 class="h4g"><samp><em>Samarpaṇam</em></samp></h4>
<h4 class="h4gg"><samp>????????</samp></h4>
<p class="noindent"> <span class="smallcapr"><samp>GAṆE&#x15a;A, 
THE LORD OF CATEGORIES, WHO REMOVED ALL BARRIERS TO THE MANIFESTATION OF 
THIS CONTEMPORARY HINDU CATECHISM, TO HIM WE OFFER OUR REVERENT 
OBEISANCE. THIS TEXT IS DEDICATED TO MY <span 
class="cmitalic"><samp>SATGURU, </samp></span>SAGE 
YOGASWAMI</samp></span> of Columbuthurai, Sri Lanka, perfect <span 
class="cmitalic"><samp>siddha yogī </samp></span>and illumined 
master who knew the Unknowable and held Truth in the palm of his hand. 
As monarch of the Nandin&#x101;tha Samprad&#x101;ya's Kail&#x101;sa 
Parampar&#x101;, this obedient disciple of Satguru Chellappaswami 
infused in me all that you will find herein. Yogaswami commanded all to 
seek within, to know the Self and see God &#x15a;iva everywhere and in 
everyone. Among his great sayings: "Know thy Self by thyself. &#x15a;iva 
is doing it all. All is &#x15a;iva. Be still." Well over 2,000 years ago 
Rishi Tirumular, of our lineage, aptly conveyed the spirit of <span 
class="cmitalic"><samp>Dancing with &#x15a;iva:</samp></span></samp></p>
<p class="quote"><samp>The thirty-six elements dance. 
Sad&#x101;&#x15b;iva dances. Consciousness dances. 
&#x15a;iva-&#x15a;akti dances. The animate and inanimate dance. All 
these and the <span class="cmitalic"><samp>Vedas</samp></span> dance 
when the Supreme dances His dance of bliss. The seven worlds as His 
golden abode, the five chakras as His pedestal, the central <span 
class="cmitalic"><samp>ku&#x1e47;&#x1e0d;alinī 
&#x15b;akti</samp></span> as His divine stage, thus in rapture He 
dances, He who is Transcendent Light. He dances with the celestials. He 
dances in the golden hall. He dances with the three Gods. He dances with 
the assembly of silent sages. He dances in song. He dances in ultimate 
energy. He dances in souls---He who is the Lord of Dances. Tat Astu. 
</samp></p>


-- 
Swasti Astu, Be Well!
Brahmanathaswami

Kauai's Hindu Monastery
www.HimalayanAcademy.com



James Hale wrote:
> Hi Brahmanathaswami,
>
> I wrote a sample stack that opens and displays ePubs if that is of any use.
>
> You can find it here...
>
> http://livecodeshare.runrev.com/stack/761/Epub-Opener
>
> If it is of help, let me know:-)
>
> James
