Extract everything between <body ...> & </body> tags

Posted: Wed Feb 23, 2011 3:33 pm
by mjacobson
I am trying to figure out a regular expression that extracts all of the text between the opening <body ...> tag and the closing </body> tag but keeps all of the other HTML tags.

I know I can use fetch to get the document, and if I use <urltext>, I get the body text without the HTML tags. If I use <urlinfo "rawdoc">, I get the raw HTML, but it includes everything outside the body, which I don't want.

So if I have something like this:

<html>
<head>
<title>JIR 11-Jan-2011 *Exposed to elements - Strategic metal supply poses security threat</title>
<META NAME="Publication" CONTENT="Jane's Intelligence Review">
<META NAME="PubAbbrev" CONTENT="JIR">
<META NAME="VolIssue" CONTENT="023/002">
</head>
<body bgcolor="#DDDDDD" text="#000000" vlink="#FF0000" link="#0000FF" alink="#00AA00">

<center>
<img border="0" src="html.jpg" alt="Jane's Information Group">

</center>

<table width="100%">
...
...
</body>
</html>

I just want:

<center>
<img border="0" src="html.jpg" alt="Jane's Information Group">

</center>

<table width="100%">
...
...

Posted: Wed Feb 23, 2011 3:45 pm
by mark
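<!-- Assumes $rawdoc already holds the raw HTML. The opening body tag is matched
     but excluded via \P; the match then runs up to the closing body tag. -->
<rex ">><body=[^>]*>\P=!</body>+" $rawdoc>
<$bodyonly=$ret>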
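
For anyone reading along who doesn't know REX, here is roughly the same idea as a Python sketch. It is not Vortex code, and the names extract_body, BODY_RE and sample are made up for illustration. It finds the opening <body ...> tag, leaves it out of the result, and keeps everything up to the first </body>, with the inner tags untouched. It assumes no attribute value in the body tag contains a ">" character.

```python
import re

# Hypothetical names for illustration; none of this is part of Vortex or REX.
# Match the opening <body ...> tag, then lazily capture everything up to
# (but not including) the first closing </body> tag.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def extract_body(rawdoc: str) -> str:
    """Return the markup between <body ...> and </body>, inner tags intact.

    Returns an empty string when no body element is found.
    """
    match = BODY_RE.search(rawdoc)
    return match.group(1) if match else ""


# Shortened stand-in for the document in the question.
sample = (
    '<html>\n<head>\n<title>Example</title>\n</head>\n'
    '<body bgcolor="#DDDDDD" text="#000000">\n'
    '<center>\n<img border="0" src="html.jpg">\n</center>\n'
    '</body>\n</html>\n'
)

print(extract_body(sample))
```

A regular expression like this is fine for well-behaved pages such as the sample, but a real HTML parser is the safer choice if the markup may be malformed.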

Posted: Wed Feb 23, 2011 3:52 pm
by mjacobson
That was a lot easier than I thought it was going to be. I was up to 6 lines of sandr code and it still wasn't working. Thanks, Mark.