HTML to text in field

Stephen MacLean smaclean at madmansoft.com
Thu Aug 9 10:25:37 EDT 2018


Hi David,

I’m working on something similar at the moment (although I’m stripping out almost everything except for basic HTML formatting).

I too found that nugget by JLG and had similar results. I’ve also looked at using XML and regex. The problem with HTML is that while it’s a standard, unless it’s strict XHTML it isn’t really treated as one, and a lot of it ends up malformed. Today’s browsers are very forgiving and figure most of it out, but if you open the dev tools in Safari or Firefox and look at most pages, you will see a LOT of errors.

So I don’t think there is an “out of the box” answer. My solution is similar to what you are trying: strip out the tags and some other things, then clean up what is left to give me what I want.

If your solution (after fixing the function vs command syntax that Klaus pointed out) doesn’t work for you, I could put together an example stack using my “cleaner” to share. I don’t claim it’s the best or the only way to do it, just that it works for me, and I’m happy to share.
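
For what it’s worth, here is a rough sketch of how your script might look with the function syntax fixed. The handler name importHtmlAsText and its pFilePath parameter are made up for illustration. The pattern "<[^>]*>" is my guess at what "<*>" was meant to do, since replaceText takes a regular expression rather than a wildcard; it is not something from your original script. It also won’t cope with every kind of malformed HTML.

```livecode
-- Hypothetical handler: load an HTML file, turn table rows into lines,
-- and replace every remaining tag with a "|" separator.
command importHtmlAsText pFilePath
   local tTemp
   put URL ("file:" & pFilePath) into tTemp

   -- Each closing row tag becomes a line ending
   replace "</tr>" with cr in tTemp

   -- replaceText is a function, so its return value has to be put somewhere.
   -- Assumed pattern: "<", then any run of characters other than ">", then ">"
   put replaceText(tTemp, "<[^>]*>", "|") into tTemp

   -- Drop the empty lines left behind
   filter lines of tTemp without empty

   set the text of field "import" to tTemp
end importHtmlAsText
```

You would call it from the button with something like: importHtmlAsText theFilePath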

Best,

Steve MacLean

> On Aug 9, 2018, at 8:00 AM, David V Glasgow via use-livecode <use-livecode at lists.runrev.com> wrote:
> 
> Hello folks,
> 
> I am having an interesting time (MacOS 10.13.5 LC 8.1.9) trying to load some HTML files (≤ 5 ish MB).  Most of them will be lists or tables, generated by various users on various systems.
> 
> I don’t want to retain any of the formatting, except line endings, so I would be happy for tables to appear as lists.  I found a little 2013 nugget from the estimable  Jacqueline Landman Gay
> 
> set the htmltext of the templatefield to htmlVar -- variable contains the html string
> put the text of the templatefield into tPlainText
> 
> In some cases that works fine, but in others, it seems that HTML tables consisting  of maybe 20-30 thousand rows are rendered onto a single line of the field.  A sort of black-letters-overwritten splodge appears in the first row and LC cranks up to 100% of the processor and BBoD ensues.
> 
> Sometimes it never seems to recover, but other times it hands back control after maybe 20 minutes or so, and in those cases I can see the text if I set dontwrap to false.  It contains no line endings from the original table, and a shedload of tabs.
> 
> I have tried to operate on the HTML string in a variable before putting it into the field, but frankly don’t really know what property of some HTML tables might mean that line endings are lost.  I can only see </tr> when I examine the files in an editor.  
> 
> I tried a different approach, replacing a row end with a cr, and then stripping out tags:
> 
> put URL ("file:" & theFilePath) into ttemp
> 
> replace "</tr>" with cr in ttemp
> 
> replaceText (ttemp, "<*>", "|")
> 
> filter lines of ttemp without empty
> 
> set the text of field "import" to ttemp
> 
> 
> The replaceText line generates an error “button "Import HTML": execution error at line 7 (Handler: can't find handler) near "replaceText", char 1”  
> 
> Firstly I don’t get the error, and secondly I am worried I may be over complicating something which should be simple.
> 
> Advice please!
> 
> Best wishes,
> 
> David Glasgow