indexing html code

sambob
Posts: 2
Joined: Fri Jan 26, 2007 2:12 pm

Post by sambob »

Hello,
We need to index text that is pre-formatted as HTML inside the main HTML document we are indexing. To keep this text from appearing when a user views the page, it is also commented out. To index it as proper XML, we therefore need to 1) remove the comment markers and 2) escape the angle brackets (< and >).
Here is an example:

<byline>
<!--
<span>some text</span>
-->
</byline>

We would need to index it as:

<byline>
&lt;span&gt;some text&lt;/span&gt;
</byline>

How can this be accomplished?

Thank you for your help.
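
For reference, the transformation described above could be applied as a preprocessing step before the page reaches the indexer. This is only a minimal sketch in Python, not a feature of the appliance: the function name `uncomment_and_escape` is hypothetical, and it assumes the comment is the only content of the `<byline>` element.

```python
import html
import re

# Matches an HTML comment that is the sole content of <byline>...</byline>.
BYLINE_COMMENT = re.compile(
    r"(<byline>\s*)<!--(.*?)-->(\s*</byline>)",
    re.DOTALL,
)

def uncomment_and_escape(page):
    """Drop the comment markers inside <byline> and escape the markup they hid."""
    def repl(match):
        inner = match.group(2).strip()
        # quote=False: only &, < and > are escaped, which is enough for element content.
        return match.group(1) + html.escape(inner, quote=False) + match.group(3)
    return BYLINE_COMMENT.sub(repl, page)

sample = "<byline>\n<!--\n<span>some text</span>\n-->\n</byline>"
print(uncomment_and_escape(sample))
```

A regular expression is used here only because the pattern is narrow and fixed; for arbitrary pages a real HTML parser would be the safer choice.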
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

Post by jason112 »

An easier method would be to use CSS so that the text does not display.

<span style="display: none">some text</span>

The appliance would still pick it up normally.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Or put that text into a meta field and extract that.
Text in comments is intended to not be seen or generally accessible to the user so it is removed from indexing.
Having the indexer keep commented data would open a whole big can of worms when it started indexing all kinds of irrelevant cruft.
sambob
Posts: 2
Joined: Fri Jan 26, 2007 2:12 pm

Post by sambob »

Thanks for your help.
I didn't mention that this text will probably have to be stored in the head section of the html, so that would rule out the solution with the span tag.
But is it safe to put a considerable amount of text in the meta tag (could be a few hundred characters)?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Yes, that's safe.
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

Post by jason112 »

> But is it safe to put a considerable amount of text in
> the meta tag (could be a few hundred characters)?

Yes, just beware of text that contains double quotes, as they'll need to be escaped.

To get the content

I said, "Look at me!"

Use this:
<meta name="byline" content="I said, &quot;Look at me!&quot;">
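
If the pages are generated by a script, the quote escaping can be done automatically instead of by hand. The following is a rough sketch in Python using the standard library's `html.escape`; the helper name `meta_tag` is made up for illustration and is not part of any product discussed here.

```python
import html

def meta_tag(name, content):
    """Build a <meta> tag whose attribute values are safely escaped."""
    # html.escape defaults to quote=True, so " becomes &quot; and ' becomes &#x27;
    # in addition to &, < and > being escaped.
    return '<meta name="{}" content="{}">'.format(
        html.escape(name), html.escape(content)
    )

print(meta_tag("byline", 'I said, "Look at me!"'))
# prints: <meta name="byline" content="I said, &quot;Look at me!&quot;">
```

Escaping `<` and `>` as well is harmless inside an attribute value, so the same helper also covers content that contains markup such as a `<span>`.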