Stripping away HTML

Wilhelm Sanke sanke at hrz.uni-kassel.de
Fri Apr 8 15:47:50 EDT 2005


On Fri Apr 8, Eric Chatonet eric.chatonet at sosmartsoftware.com wrote:

> Hi Gregory,
>
> Put your web page into a field and get the text of the field:
>
> put url "xyz" into fld "MyHiddenField"
> put fld "MyHiddenField" into tPlainText
>
> On Apr 7, 2005, at 23:07, Gregory Lypny wrote:
>
> > Hello Everyone,
> >
> > Is there a way in Revolution to strip away the HTML code from a web
> > page, leaving just the content in plain text?
> >
> >     Greg



This is a quick and convenient way to get a "raw" version of the text 
of an HTML file, which in most cases then needs further editing. If you 
extract text from a particular kind of HTML file regularly, however, 
you should adapt your script to the specific structure of that kind of 
file.
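
As a starting point, the field-based conversion can be wrapped in a small function. This is only a minimal sketch: the function name "htmlToPlainText" and the field name "Scratch" are hypothetical, and the field is assumed to exist (it may be hidden) on the current card.

```livecode
function htmlToPlainText pHTML
  # Let the field interpret the markup (hypothetical field "Scratch") ...
  set the htmlText of fld "Scratch" to pHTML
  # ... then read back only the unformatted text
  return the text of fld "Scratch"
end htmlToPlainText
```

It could be called as: put htmlToPlainText(url "xyz") into tPlainText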

Two examples:

1. Extracting text from articles in the online version of the magazine 
"Education Week" <www.edweek.org>

The script assumes three fields: field 1 holds the HTML source, and the 
other two are named "HTMLText" and "Transtext".

"on mouseUp
  # fld 1 contains the HTML code of an "Education Week" article from <www.edweek.org>
  set the htmltext of fld "HTMLText" to fld 1
  put  fld "HTMLText" into fld "Transtext"
  put the htmltext of fld "TransText" into tInterim
  put the number of lines of tInterim into LNumber
  repeat with i = LNumber down to 1
    if line i of tInterim contains " " then delete line i of tInterim
  end repeat
  replace "“" with Quote in tInterim
  replace "”" with Quote in tInterim
  replace "’" with "'" in tInterim
  replace "—" with "'" in tInterim
  set the htmltext of fld "Transtext" to tInterim
end mouseUp"

The "replace" lines turn typographic quotation marks and apostrophes 
into plain ones.

Line
"if line i of tInterim contains " " then delete line i of tInterim"
serves to remove code from the beginning of the web page. If this line 
were left out, you would get text like the following at the beginning 
of your "plain" text:

"var _hbEC=0,_hbE=new Array;function _hbEvent(a,b){b=_hbE[_hbEC++]=new 
Object();b._N=a;b._C=0;return b;} var 
hbx=_hbEvent("pv");hbx.vpc="HBX0100u";hbx.gn="ehg-editorialpro.hitbox.com";  
//BEGIN EDITABLE SECTION //CONFIGURATION VARIABLES 
hbx.acct="DM540902PMCA";//ACCOUNT NUMBER(S) 
hbx.pn="PUT+PAGE+NAME+HERE";//PAGE NAME(S) 
hbx.mlc="CONTENT+CATEGORY";//MULTI-LEVEL CONTENT CATEGORY 
hbx.pndef="title";//DEFAULT PAGE NAME hbx.ctdef="full";//DEFAULT CONTENT 
CATEGORY  //OPTIONAL PAGE VARIABLES //ACTION SETTINGS hbx.fv="";//FORM 
VALIDATION MINIMUM ELEMENTS OR SUBMIT FUNCTION NAME hbx.lt="auto";//LINK 
TRACKING hbx.dlf="n";//DOWNLOAD FILTER hbx.dft="n";//DOWNLOAD FILE 
NAMING hbx.elf="n";//EXIT LINK FILTER  //SEGMENTS AND FUNNELS 
hbx.seg="++";//VISITOR SEGMENTATION hbx.fnl="";//FUNNELS  //CAMPAIGNS 
hbx.cmp="";//CAMPAIGN ID hbx.cmpn="";//CAMPAIGN ID IN QUERY 
hbx.dcmp="";//DYNAMIC CAMPAIGN ID hbx.dcmpn="";//DYNAMIC CAMPAIGN ID IN 
QUERY..."
etc.
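
Another way to keep such script code out of the result, offered only as a sketch and not as part of the script above, is to cut the script blocks out of the HTML before it goes into the field. The function name "stripBlocks" and its parameters are hypothetical, and the sketch assumes each opening tag has a matching closing tag.

```livecode
function stripBlocks pHTML, pTag
  # Removes every <pTag ...> ... </pTag> block from pHTML
  put "<" & pTag into tOpen
  put "</" & pTag & ">" into tClose
  repeat forever
    put offset(tOpen, pHTML) into tStart
    if tStart = 0 then exit repeat
    # Search for the closing tag after the opening one;
    # the result counts from char tStart
    put offset(tClose, pHTML, tStart) into tSkip
    if tSkip = 0 then exit repeat
    delete char tStart to (tStart + tSkip + length(tClose) - 1) of pHTML
  end repeat
  return pHTML
end stripBlocks
```

For example: set the htmltext of fld "HTMLText" to stripBlocks(fld 1, "script")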

2. Extracting the plain text worth searching from the XML files of the 
Rev "Dictionary"

I used similar routines to store the searchable text portions as arrays 
in my tool "Searchdocs" (see the latest version at
<http://www.sanke.org/Software/SearchDocsXML24-Rev.zip>).

The script assumes two fields named "Display" and "Transtext". Because 
tabs can end up in the plain text during the conversion from XML to 
text, the line

"replace Tab with CR in tXML"

helps with formatting. Compare the results when you leave this line 
out.


"on mouseUp
  answer file "Choose XML file from"&&Quote&"Dictionary"&Quote&&"folder."
  if it is empty then exit mouseUp  # the dialog was cancelled
  put it into Adresse
  put "file:"&Adresse  into Fxml
  put URL Fxml into tXML
  # "<name><![CDATA[" is 15 chars long, so the title starts at offset + 15
  put offset("<name>",tXML) + 15 into ANam
  put offset("]]></name>",tXML) -1 into ENam
  put char ANam to ENam of tXML into tTitle
  # "<syntax><![CDATA[" is 17 chars long
  put offset("<syntax>",tXML) + 17 into Asyn
  put offset("]]></syntax>",tXML) -1 into Esyn
  put char Asyn to Esyn of tXML into tSyntax
  put lineoffset("<summary>",tXML) into Zeile
  delete line 2 to (Zeile - 1) of tXML
  put tsyntax before tXML
  set the htmltext of fld "Transtext" to tXML
  put the text of fld "Transtext" into tXML
  replace Tab with CR in tXML
  put tTitle&CR&CR before tXML
  put tXML into fld "Display"
  set the textstyle of line 1 of fld "Display" to bold
end mouseUp"
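
The offset arithmetic for "<name>" and "<syntax>" could also be folded into one helper, so that the numbers 15 and 17 need not be counted by hand. This is a sketch only: the function name "cdataOf" is hypothetical, and it assumes the element content is wrapped as <tag><![CDATA[ ... ]]></tag> with nothing between the tag and the CDATA marker.

```livecode
function cdataOf pTag, pXML
  put "<" & pTag & "><![CDATA[" into tOpen
  put "]]></" & pTag & ">" into tClose
  put offset(tOpen, pXML) into tStart
  if tStart = 0 then return empty
  # First char of the content
  put tStart + length(tOpen) into tFirst
  # Search for the closing marker from tFirst onwards;
  # the result counts from char tFirst
  put offset(tClose, pXML, tFirst - 1) into tLen
  if tLen = 0 then return empty
  return char tFirst to (tFirst + tLen - 2) of pXML
end cdataOf
```

With it, the title would be fetched as: put cdataOf("name", tXML) into tTitle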

Parsing the XML files to reproduce the layout of the full Dictionary 
articles, as displayed in the left pane of the "SearchDocs" stack, of 
course requires a different and more complex approach.


Regards,

Wilhelm Sanke
<http://www.sanke.org/MetaMedia>


