Extract everything between <body ...> & </body> tags

mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Post by mjacobson »

I am trying to figure out a regular expression that extracts all of the text between the opening <body ...> tag and the closing </body> tag but keeps all of the other HTML tags.

I know I can use fetch to get the document. If I use <urltext>, I get the body text without the HTML tags. If I use <urlinfo "rawdoc">, I get the raw HTML, but it also includes everything outside the body, which I don't want.

So if I have something like this,

<html>
<head>
<title>JIR 11-Jan-2011 *Exposed to elements - Strategic metal supply poses security threat</title>
<META NAME="Publication" CONTENT="Jane's Intelligence Review">
<META NAME="PubAbbrev" CONTENT="JIR">
<META NAME="VolIssue" CONTENT="023/002">
</head>
<body bgcolor="#DDDDDD" text="#000000" vlink="#FF0000" link="#0000FF" alink="#00AA00">

<center>
<img border="0" src="html.jpg" alt="Jane's Information Group">

</center>

<table width="100%">
...
...
</body>

I just want


<center>
<img border="0" src="html.jpg" alt="Jane's Information Group">

</center>

<table width="100%">
...
...

</html>
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

<rex ">><body=[^>]*>\P=!</body>+" $rawdoc>
<$bodyonly=$ret>
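As I read the rex expression above, it anchors on the opening <body ...> tag, excludes that tag from the hit with \P, and then matches everything up to, but not including, </body>. For anyone comparing against a non-Vortex toolchain, here is a minimal sketch of the same idea in Python. It is only an illustration, not Vortex code: the helper name extract_body and the sample string are made up for this example, and it assumes the document contains a single body element.

```python
import re

# Hypothetical helper for illustration only; not part of Vortex or rex.
# Matches the opening <body ...> tag (with any attributes), then lazily
# captures everything up to the first closing </body> tag.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def extract_body(rawdoc: str) -> str:
    """Return the markup between <body ...> and </body>, or "" if absent."""
    match = BODY_RE.search(rawdoc)
    return match.group(1) if match else ""


# Made-up sample document, loosely modeled on the one in the question.
sample = """<html>
<head><title>Example</title></head>
<body bgcolor="#DDDDDD" text="#000000">
<center>
<img border="0" src="html.jpg" alt="Jane's Information Group">
</center>
</body>
</html>"""

# Prints the inner markup with all other HTML tags left intact.
print(extract_body(sample))
```

The lazy (.*?) stops at the first </body>, which plays roughly the same role as the "anything that is not </body>" part of the rex expression.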
mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Post by mjacobson »

That was a lot easier than I thought it was going to be. I was up to six lines of sandr code and it still wasn't working. Thanks, Mark.