Stripping away HTML
Wilhelm Sanke
sanke at hrz.uni-kassel.de
Fri Apr 8 15:47:50 EDT 2005
On Fri Apr 8, Eric Chatonet eric.chatonet at sosmartsoftware.com wrote:
> Hi Gregory,
>
> Put your web page into a field and get the text of the field:
>
> put url "xyz" into fld "MyHiddenField"
> put fld "MyHiddenField" into tPlainText
>
> On 7 Apr 05, at 23:07, Gregory Lypny wrote:
>
> > Hello Everyone,
> >
> > Is there a way in Revolution to strip away the HTML code from a web
> > page, leaving just the content in plain text?
> >
> > Greg
While this is a quick and convenient way to get a "raw" version of the
text of an HTML file, the result in most cases needs further editing.
If you want to extract text from a particular kind of HTML file often or
on a regular basis, you should adapt your script to the specific
structure of those files.
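As a minimal sketch of that quick approach, the conversion can be wrapped
in a function. The handler name htmlToPlainText is hypothetical, and the
sketch assumes a field named "MyHiddenField" exists on the current card.
It sets the htmltext property of the field, as the scripts further down
do, so that the field interprets the tags instead of displaying them.

```livecode
function htmlToPlainText pHTML
   # assumption: a field named "MyHiddenField" exists on the current card
   set the htmltext of fld "MyHiddenField" to pHTML
   # the text of the field is the rendered content without the tags
   return the text of fld "MyHiddenField"
end function
```

It could be called as: put htmlToPlainText(URL "xyz") into tPlainText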
Two examples:
1. Extracting text from articles of the online version of the magazine
"Education Week" <www.edweek.org>
The script assumes you have three fields: the first field (fld 1), which
holds the HTML code, and two others named "HTMLText" and "Transtext".
"on mouseUp
# fld 1 contains the HTML code of an "Education Week" article from <www.edweek.org>
set the htmltext of fld "HTMLText" to fld 1
put fld "HTMLText" into fld "Transtext"
put the htmltext of fld "TransText" into tInterim
put the number of lines of tInterim into LNumber
repeat with i = LNumber down to 1
if line i of tInterim contains " " then delete line i of tInterim
end repeat
replace "â" with Quote in tInterim
replace "â" with Quote in tInterim
replace "â" with "'" in tInterim
replace "â" with "'" in tInterim
set the htmltext of fld "Transtext" to tInterim
end mouseUp"
The "replace" lines substitute plain quotation marks and apostrophes for
the mis-encoded characters.
The line
"if line i of tInterim contains " " then delete line i of tInterim"
serves to remove code from the beginning of the web page. If this line
were left out, you would get text like the following at the beginning
of your "plain" text:
"var _hbEC=0,_hbE=new Array;function _hbEvent(a,b){b=_hbE[_hbEC++]=new
Object();b._N=a;b._C=0;return b;} var
hbx=_hbEvent("pv");hbx.vpc="HBX0100u";hbx.gn="ehg-editorialpro.hitbox.com";
//BEGIN EDITABLE SECTION //CONFIGURATION VARIABLES
hbx.acct="DM540902PMCA";//ACCOUNT NUMBER(S)
hbx.pn="PUT+PAGE+NAME+HERE";//PAGE NAME(S)
hbx.mlc="CONTENT+CATEGORY";//MULTI-LEVEL CONTENT CATEGORY
hbx.pndef="title";//DEFAULT PAGE NAME hbx.ctdef="full";//DEFAULT CONTENT
CATEGORY //OPTIONAL PAGE VARIABLES //ACTION SETTINGS hbx.fv="";//FORM
VALIDATION MINIMUM ELEMENTS OR SUBMIT FUNCTION NAME hbx.lt="auto";//LINK
TRACKING hbx.dlf="n";//DOWNLOAD FILTER hbx.dft="n";//DOWNLOAD FILE
NAMING hbx.elf="n";//EXIT LINK FILTER //SEGMENTS AND FUNNELS
hbx.seg="++";//VISITOR SEGMENTATION hbx.fnl="";//FUNNELS //CAMPAIGNS
hbx.cmp="";//CAMPAIGN ID hbx.cmpn="";//CAMPAIGN ID IN QUERY
hbx.dcmp="";//DYNAMIC CAMPAIGN ID hbx.dcmpn="";//DYNAMIC CAMPAIGN ID IN
QUERY..."
etc.
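The line-removal step can also be isolated in a small function, sketched
here under assumptions: the name withoutLinesContaining and the parameter
pMarker are hypothetical, and the marker would be whatever string
identifies the unwanted lines in your pages. Counting down from the last
line matters, because deleting a line shifts the numbers of all lines
after it.

```livecode
function withoutLinesContaining pText, pMarker
   # walk backwards so a deletion never changes the number of a line still to be checked
   repeat with i = the number of lines of pText down to 1
      if line i of pText contains pMarker then delete line i of pText
   end repeat
   return pText
end function
```

For instance, put withoutLinesContaining(tInterim, "hbx.") into tInterim
would drop every line that contains the illustrative marker "hbx.".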
2. Extracting the plain text worth searching from the XML files of the
Rev "Dictionary"
I used similar routines to store the searchable text portions as arrays
in my tool "SearchDocs" (see the latest version at
<http://www.sanke.org/Software/SearchDocsXML24-Rev.zip>).
The script assumes you have two fields named "Display" and "Transtext".
Because tabs can end up in the plain text during the conversion from XML
to text, the line
"replace Tab with CR in tXML"
helps with formatting. Compare the results when you leave this line out.
"on mouseUp
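answer file "Choose XML file from"&&Quote&"Dictionary"&Quote&&"folder."
# stop if the user cancelled the file dialog
if it is empty then exit mouseUp
put it into Adresse
put "file:"&Adresse into Fxml
put URL Fxml into tXML
# the title starts after "<name><![CDATA[" (15 characters)
put offset("<name>",tXML) + 15 into ANam
put offset("]]></name>",tXML) -1 into ENam
put char ANam to ENam of tXML into tTitle
# the syntax starts after "<syntax><![CDATA[" (17 characters)
put offset("<syntax>",tXML) + 17 into Asyn
put offset("]]></syntax>",tXML) -1 into Esyn
put char Asyn to Esyn of tXML into tSyntax
# drop the lines between line 1 and the "<summary>" line
put lineoffset("<summary>",tXML) into Zeile
delete line 2 to (Zeile - 1) of tXML
put tsyntax before tXML
# let the field convert the remaining markup to plain text
set the htmltext of fld "Transtext" to tXML
put the text of fld "Transtext" into tXML
replace Tab with CR in tXML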
put tTitle&CR&CR before tXML
put tXML into fld "Display"
set the textstyle of line 1 of fld "Display" to bold
end mouseUp"
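The fixed offsets 15 and 17 in the script above are the lengths of
"<name><![CDATA[" and "<syntax><![CDATA[". A more general variant might
compute them from the tag name. The following is only a sketch: the
function name cdataContent is hypothetical, and it assumes the element
wraps its content in a CDATA section, as the end markers in the script
suggest.

```livecode
function cdataContent pTag, pXML
   put "<" & pTag & "><![CDATA[" into tOpen
   put "]]></" & pTag & ">" into tClose
   put offset(tOpen, pXML) into tStart
   if tStart = 0 then return empty
   # first character after the opening marker
   put tStart + the length of tOpen into tFirst
   # look for the closing marker only after the opening one;
   # with a skip count, offset returns a position relative to the skipped characters
   put offset(tClose, pXML, tFirst - 1) into tRel
   if tRel = 0 then return empty
   return char tFirst to (tFirst + tRel - 2) of pXML
end function
```

With this, the title could be fetched as: put cdataContent("name", tXML) into tTitle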
Parsing the XML files to achieve a layout similar to the display of the
full Dictionary articles in the left pane of the "SearchDocs" stack of
course requires a different and more complex approach.
Regards,
Wilhelm Sanke
<http://www.sanke.org/MetaMedia>