Parsing and Extracting Text from ePub XHTML
Brahmanathaswami
brahma at hindu.org
Wed Jul 29 22:57:37 EDT 2015
Aloha, James:
Great, thanks for this. It seems we are each re-inventing the wheel here.
Your code is useful, though.
I see you are still having to deal with the pesky "—" But only in
your TOC xml processing routines.
What I don't understand (which makes it hard to make good strategic
decisions moving forward)
is why these ePub companies (Atritex in Chennai for us) take text
from books (InDesign) and then output strings that are (to my eyes)
mixed. Take for example this file from the front matter of a book (see
below).
If I take this raw text and push it to a field using the code suggested by Mark:
    put uniDecode(uniEncode(x,"UTF8")) into x
    set the htmlText of fld 2 to x
and *if* I make very sure to set the field to a font like Arial Unicode MS
or Helvetica Neue, it works "brilliantly", and I can even cut and paste
to email or InDesign or Pages or MS Word and all characters are rendered.
"Wunderbar! Marvelous!"
But it makes me nervous when I look at the raw code, because we see
Unicode characters output as decimal entities:
Tamil Letter U
HTML Entity (decimal) &#2953;
UTF-16 (hex) 0x0B89 (0b89)
(Line 1 here) &#2953; is Unicode expressed as a decimal entity. We don't
see the script here...
(Line 4 here) ???????? # which is obviously "pure" Unicode for the Tamil
language. I don't know what encoding it is, because I see the actual
script in the raw text,
mixed with curly quotes and mdashes, which I assume are ANSI
characters, because if I open the file in BBEdit those characters are
not encoded; they just appear as ' and ---
So we seem to have three different encodings.
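
One way to check whether these forms really differ once decoded is a small sketch like the one below. It is in Python purely for illustration (not LiveCode), and the sample string is made up from the characters discussed above rather than copied from the ePub. It assumes the file itself is read as UTF-8.

```python
import html

# Hypothetical sample: a decimal entity, a literal diacritic,
# and a curly quote plus em dash written as escapes.
raw = "&#2953; \u015aiva \u2019 \u2014"

# html.unescape turns the numeric reference &#2953; into the character U+0B89.
decoded = html.unescape(raw)

# Show each code point and its UTF-8 bytes (ASCII-only output).
for ch in decoded:
    if ch != " ":
        print(f"U+{ord(ch):04X}", ch.encode("utf-8").hex())

# After decoding, the entity and the literal character are the same code point.
same = html.unescape("&#2953;") == "\u0b89"
```

If that holds, the decimal entity is just an ASCII-safe spelling of a character inside the markup, while the literal script, curly quotes and dashes are the same kind of character stored directly in the file's encoding.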
I paste here a rendering from below which I copied out of my LC
field... to Indesign.. then from Indesign to this email:
I wonder if you are seeing the unicode Tamil, all the diacritical marks
and the mDash and curly quote... ?
Or do you see garbage when you open this post?
-------
????????
The thirty-six elements dance. Sada-s'iva dances. Consciousness dances.
S'iva-S'akti dances. The animate and inanimate dance. All these and the
Vedas dance when the Supreme dances His dance of bliss. The seven worlds
as His golden abode, the five chakras as His pedestal, the central
kun.d.alini- s'akti as His divine stage, thus in rapture He dances, He
who is Transcendent Light. He dances with the celestials. He dances in
the golden hall. He dances with the three Gods. He dances with the
assembly of silent sages. He dances in song. He dances in ultimate
energy. He dances in souls---He who is the Lord of Dances. Tat Astu.
-----------
Everything appears to work, which is amazing, and means LC 7 is a huge
step forward for us,
if this really will hold all the way through a JSON-encoded POST to MySQL
and back out again to a desktop client or mobile app without anything
breaking.
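
A rough way to test the JSON leg of that trip is sketched here, again in Python only as an illustration. The record and its field names are hypothetical. The MySQL side is not shown; it would also need a UTF-8 character set (utf8mb4) on both the column and the connection.

```python
import json

# Hypothetical record; the field names are illustrative only.
record = {"title": "Dancing with \u015aiva", "mark": "\u0b89"}

# By default json.dumps escapes non-ASCII as \uXXXX, so the payload is pure ASCII.
ascii_payload = json.dumps(record)

# ensure_ascii=False keeps the characters; the bytes sent are then UTF-8.
utf8_payload = json.dumps(record, ensure_ascii=False).encode("utf-8")

# Both forms should decode back to the identical record.
round_trip_ascii = json.loads(ascii_payload)
round_trip_utf8 = json.loads(utf8_payload.decode("utf-8"))
```

Either payload style should survive the POST as long as every hop agrees on UTF-8. The escaped form is the more forgiving of the two, because it never leaves the 0-127 range in transit.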
We *can* dumb this all down to 0-127 (I used to do that years ago and
have a whole stack dedicated to stripping all diacriticals, replacing
ANSI chars etc according to our spelling/lexicon conventions... )
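
For comparison, the "dumb it down" fallback could look something like this sketch (Python, illustrative only; it is not the stack mentioned above). The punctuation table is a hypothetical stand-in for a real lexicon, and anything with no ASCII equivalent, such as Tamil script, is simply dropped.

```python
import unicodedata

def to_ascii(text):
    # Decompose accented letters into base letter + combining marks (NFKD),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Hypothetical punctuation table; a real lexicon would be much larger.
    table = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
             "\u2014": "--"}
    for src, dst in table.items():
        stripped = stripped.replace(src, dst)
    # Anything still outside 0-127 (for example Tamil script) is dropped here.
    return stripped.encode("ascii", "ignore").decode("ascii")

plain = to_ascii("Sad\u0101\u015biva, ku\u1e47\u1e0dalin\u012b")
```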
But if LC 7 can actually provide us a way to display everything, and
I can actually put this on a web page also, it will be a quantum leap
forward for us.
Here is what is in the ePub... Maybe I really shouldn't worry about the
different encodings at all? and just assume this will retain "integrity"
through all processes, assuming the rendering context is using a unicode
font?
<h4 class="h4g"><samp><small>&#2953; </small></samp></h4>
<h3 class="h3d"><samp><span
class="cmbold"><samp>Dedication</samp></span></samp></h3>
<h4 class="h4g"><samp><em>Samarpaṇam</em></samp></h4>
<h4 class="h4gg"><samp>????????</samp></h4>
<p class="noindent"> <span class="smallcapr"><samp>GAṆEŚA,
THE LORD OF CATEGORIES, WHO REMOVED ALL BARRIERS TO THE MANIFESTATION OF
THIS CONTEMPORARY HINDU CATECHISM, TO HIM WE OFFER OUR REVERENT
OBEISANCE. THIS TEXT IS DEDICATED TO MY <span
class="cmitalic"><samp>SATGURU, </samp></span>SAGE
YOGASWAMI</samp></span> of Columbuthurai, Sri Lanka, perfect <span
class="cmitalic"><samp>siddha yogī </samp></span>and illumined
master who knew the Unknowable and held Truth in the palm of his hand.
As monarch of the Nandinātha Sampradāya's Kailāsa
Paramparā, this obedient disciple of Satguru Chellappaswami
infused in me all that you will find herein. Yogaswami commanded all to
seek within, to know the Self and see God Śiva everywhere and in
everyone. Among his great sayings: "Know thy Self by thyself. Śiva
is doing it all. All is Śiva. Be still." Well over 2,000 years ago
Rishi Tirumular, of our lineage, aptly conveyed the spirit of <span
class="cmitalic"><samp>Dancing with Śiva:</samp></span></samp></p>
<p class="quote"><samp>The thirty-six elements dance.
Sadāśiva dances. Consciousness dances.
Śiva-Śakti dances. The animate and inanimate dance. All
these and the <span class="cmitalic"><samp>Vedas</samp></span> dance
when the Supreme dances His dance of bliss. The seven worlds as His
golden abode, the five chakras as His pedestal, the central <span
class="cmitalic"><samp>kuṇḍalinī
śakti</samp></span> as His divine stage, thus in rapture He
dances, He who is Transcendent Light. He dances with the celestials. He
dances in the golden hall. He dances with the three Gods. He dances with
the assembly of silent sages. He dances in song. He dances in ultimate
energy. He dances in souls---He who is the Lord of Dances. Tat Astu.
</samp></p>
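
For what it's worth, pulling plain text out of a fragment like the one above can be sketched as follows. This uses Python's standard html.parser purely as an illustration; the class and function names are made up for the example, and the sample is a shortened, hypothetical version of the first lines. With convert_charrefs on, the entity and the literal characters come out as the same kind of text.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects character data and ignores all tags (illustrative helper)."""

    def __init__(self):
        # convert_charrefs=True decodes references like &#2953; automatically.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def extract_text(fragment):
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any buffered text
    # Collapse runs of whitespace and line breaks into single spaces.
    return " ".join("".join(parser.parts).split())

sample = ('<h4 class="h4g"><samp><small>&#2953; </small></samp></h4>\n'
          '<h4 class="h4g"><samp><em>Samarpa\u1e47am</em></samp></h4>')
text = extract_text(sample)
```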
--
Swasti Astu, Be Well!
Brahmanathaswami
Kauai's Hindu Monastery
www.HimalayanAcademy.com
James Hale wrote:
> Hi Brahmanathaswami,
>
> I wrote a sample stack that opens and displays ePubs if that is of any use.
>
> You can find it here...
>
> http://livecodeshare.runrev.com/stack/761/Epub-Opener
>
> If it is of help, let me know:-)
>
> James