I am trying to figure out a regular expression that extracts all of the text between the opening <body ...> tag and the closing </body> tag but keeps all of the other HTML tags inside it.
I know I can use fetch to get the document, and if I use <urltext> I get the body text without the HTML tags. If I use <urlinfo "rawdoc">, I get the raw HTML, but it also includes everything outside the body (the head, title and meta tags), which I don't want.
So if I have something like this,
<html>
<head>
<title>JIR 11-Jan-2011 *Exposed to elements - Strategic metal supply poses security threat</title>
<META NAME="Publication" CONTENT="Jane's Intelligence Review">
<META NAME="PubAbbrev" CONTENT="JIR">
<META NAME="VolIssue" CONTENT="023/002">
</head>
<body bgcolor="#DDDDDD" text="#000000" vlink="#FF0000" link="#0000FF" alink="#00AA00">
<center>
<img border="0" src="html.jpg" alt="Jane's Information Group">
</center>
<table width="100%">
...
...
</body>
</html>
I just want
<center>
<img border="0" src="html.jpg" alt="Jane's Information Group">
</center>
<table width="100%">
...
...
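To make the goal concrete, here is a minimal sketch of the kind of pattern I have in mind. It is written in Python rather than Vortex, purely as an illustration, and the names `BODY_RE` and `extract_body` are hypothetical. It assumes the document has a single body element and that no attribute value in the <body ...> tag contains a ">" character.

```python
import re

# Match the opening <body ...> tag (any attributes), lazily capture
# everything up to the first closing </body> tag.
# IGNORECASE handles <BODY>, DOTALL lets "." span newlines.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def extract_body(raw_html: str) -> str:
    """Return the markup between <body ...> and </body>, inner tags intact."""
    match = BODY_RE.search(raw_html)
    # Return an empty string when no body element is found.
    return match.group(1).strip() if match else ""


sample = """<html>
<head>
<title>Example</title>
</head>
<body bgcolor="#DDDDDD" text="#000000">
<center>
<img border="0" src="html.jpg" alt="Jane's Information Group">
</center>
</body>
</html>"""

print(extract_body(sample))
```

The capture group is lazy so that it stops at the first </body>. For malformed pages a real HTML parser would probably be more robust than a regular expression.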