Parsing and Extracting Text from ePub XHTML

Thu Jul 30 09:25:50 EDT 2015

Hi Brahmanathaswami,

I started my code back in LC 5.5 and slowly made the transition through 6 and then 7.
I think I still have a switch in there, in case the stack is opened in LC 6, to ensure it does some of the fudging required.
I have learnt in doing all this that standards (such as ePub 2) seem to be fairly loosely adhered to unless you are making the ePub yourself. 
There is also the issue that LC text fields are not fully HTML compliant: they use only a subset of HTML and even then do their own thing with it (see the discussion on the forum concerning LC's interpretation of header level tags). You may have noticed that I reassign the header levels to ensure everything stays reasonable on a page.
But I want to be able to use other features available in LC's text fields, so for now I am stuck with them rather than using a browser object for display.
The other thing that is important to remember, in order to use the Unicode magic of LC 7, is that all text transport in and out of LC needs to pass through the textDecode/textEncode functions (rather than the uniEncode/uniDecode fudge of versions before 7).
This means that when saving your text to a database or reading from one, you may have to use these functions. Peter Haworth did quite a bit of investigation in this area while working on his SQL app (yet to be released but eagerly awaited). I can't remember if he shared this on the list or the forum, but if you write to him he will be able to share his experience in dealing correctly with Unicode and its transfer in and out of DBs and LC.
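To give an idea of what I mean by reassigning header levels, here is a rough sketch in Python rather than LC script. The mapping table and the function name remap_headings are made up for illustration; they are not the levels my stack actually uses, and a regex pass like this assumes reasonably clean XHTML.

```python
import re

# Hypothetical mapping from source heading level to displayed level.
# The numbers are illustrative only; pick whatever suits your field.
HEADING_MAP = {1: 3, 2: 4, 3: 5, 4: 6, 5: 6, 6: 6}

# Matches the start of an opening or closing h1..h6 tag, e.g. "<h2" or "</h2".
_HEADING_TAG = re.compile(r"<(/?)h([1-6])\b", re.IGNORECASE)


def remap_headings(xhtml, mapping=HEADING_MAP):
    """Rewrite heading tags according to mapping, leaving attributes intact."""

    def swap(match):
        slash = match.group(1)
        level = int(match.group(2))
        # Levels missing from the mapping are left unchanged.
        return "<{}h{}".format(slash, mapping.get(level, level))

    return _HEADING_TAG.sub(swap, xhtml)


if __name__ == "__main__":
    print(remap_headings('<h1 class="t">Title</h1><p>x</p><H2>Sub</H2>'))
```

The same idea should carry over to LC with replaceText or a simple loop over the six levels before setting the field's htmlText.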
I haven't yet crossed the bridge of needing to copy and paste in or out of LC in my app, but I don't see this as an issue, as I expect at this stage not to be concerned with roman texts. However, you might want to ask someone in the mothership what is actually being moved when that is not the case.
BTW, if your ePub is well formed, each XHTML file making it up should include the UTF encoding info at the start of the file.
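As a rough illustration of using that declaration, here is a minimal sketch in Python rather than LC script. The function names declared_encoding and decode_xhtml are hypothetical, UTF-8 is assumed when no declaration is found (the XML default), and only ASCII-compatible encodings are handled, so UTF-16 files would need extra work.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"

# Looks for encoding="..." inside an XML declaration at the very start of the data.
_XML_DECL = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)


def _strip_bom(data):
    """Drop a leading UTF-8 byte order mark, if there is one."""
    return data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data


def declared_encoding(data, default="utf-8"):
    """Return the encoding named in the XML declaration, or default if absent."""
    match = _XML_DECL.match(_strip_bom(data)[:200])
    return match.group(1).decode("ascii") if match else default


def decode_xhtml(data):
    """Decode raw XHTML bytes to text using the declared encoding."""
    return _strip_bom(data).decode(declared_encoding(data))


if __name__ == "__main__":
    sample = '<?xml version="1.0" encoding="UTF-8"?>\n<p>\u0950</p>'.encode("utf-8")
    print(declared_encoding(sample))
    print(decode_xhtml(sample))
```

In LC 7 the equivalent step would be to read the file as binary, pull the encoding name out of the first line, and pass it to textDecode.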

As for seeing some of what your email contained, unfortunately I have trouble even when posts contain only roman script. I get the digest version of the list and usually read it on my iPad, and many posts have lots of question marks around words which have obviously been incorrectly rendered. So no, the example contained in your post didn't quite make it. The web page link certainly did, which shows that either my browser and my mail client render differently or something funny happens with the list's digest function. Probably the latter.

Thanks for sharing your work. I really appreciate it.

James