parts of page not being indexed.

jgdoke
Posts: 167
Joined: Wed Jul 14, 2004 10:52 am

Post by jgdoke »

This page is not being indexed completely: http://www.ab.com/eoi/panelbuilder32.html

Can you help determine why the first part of the page ("PanelBuilder32™ Software Catalog Number 2711-") did not make it into the index? In that profile the only Ignore tags we have start with <script src and end with </script>

This snippet is from the list/edit URL section:
Body: ND3
PanelBuilder32 version 3.80 now supports resizable bitmaps and contains an updated color palette with 32 selections. Users can explicitly select color mapping when converting an application from one terminal type to another. Also, support for tag search (where used) and purge functions have been added to the tag editor.

PanelBuilder32 will run on Microsoft Windows XP, 2000, 95, 98, and NT operating systems. It uses a Windows graphical interface, color palettes, pre-configured symbols, objects, and graphics to easily create new applications or reuse existing screen configurations developed for other PanelView Standard terminals.
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

You may want to change the start expression to just <script (with a space after it), as some of the scripts start <script language> instead, so it may remove from a </script> backward till it hits a <script src.
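
A rough illustration of the start-expression point, in plain Python rather than Thunderstone's own matcher. The sample markup and the helper name `tags_matched` are made up for this sketch. It only shows which opening tags a literal start expression would catch: `<script src` misses a block that opens with `<script language`, while `<script ` (with the trailing space) catches both.

```python
def tags_matched(html: str, start_expr: str) -> list:
    """Return each opening tag in html that begins with start_expr.

    Matching is literal and case-insensitive; this is an assumption of the
    sketch, not a statement about how the real profile setting matches.
    """
    lowered = html.lower()
    needle = start_expr.lower()
    found = []
    pos = lowered.find(needle)
    while pos != -1:
        close = lowered.find(">", pos)
        if close == -1:
            # Unterminated tag: report the rest of the text and stop.
            found.append(html[pos:])
            break
        found.append(html[pos:close + 1])
        pos = lowered.find(needle, close + 1)
    return found


# Hypothetical page fragment with two styles of script tag.
sample = (
    '<script src="menu.js"></script>'
    "<p>Catalog text</p>"
    '<script language="JavaScript">var x = 1;</script>'
)

narrow = tags_matched(sample, "<script src")
broad = tags_matched(sample, "<script ")
print(narrow)
print(broad)
```

With the narrow expression only the first tag is listed, so the second script block would stay in the indexed text.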
John Turnbull
Thunderstone Software
jgdoke
Posts: 167
Joined: Wed Jul 14, 2004 10:52 am

parts of page not being indexed.

Post by jgdoke »

How does that apply to this particular page? Is that what is happening here?
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

Probably not. More likely it is the Remove Common setting, if there are other pages that start with the same text, "PanelBuilder32™ Software Catalog Number 2711-".

When possible, it is generally more efficient and less prone to inadvertent removal to use the keep or ignore tags. For example, that page contains <!-- START OF BODY COPY AREA ==================================== -->
and <!-- END BODY COPY AREA ====================================== -->, which could serve as the start and end of a keep tag.
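
To show the effect a keep tag built on those two comments would have, here is a minimal Python sketch. It is not Thunderstone code; the function name `keep_body_copy` and the sample page are assumptions for illustration. It matches only the leading part of each comment so the run of `=` padding does not have to be exact.

```python
START_MARK = "<!-- START OF BODY COPY AREA"
END_MARK = "<!-- END BODY COPY AREA"


def keep_body_copy(html: str) -> str:
    """Return the text between the two body-copy comments.

    If either comment is missing, the page is returned unchanged
    (a choice made for this sketch only).
    """
    start = html.find(START_MARK)
    if start == -1:
        return html
    body_from = html.find("-->", start)
    if body_from == -1:
        return html
    body_from += len("-->")
    end = html.find(END_MARK, body_from)
    if end == -1:
        return html
    return html[body_from:end].strip()


# Hypothetical page: navigation and footer sit outside the comments.
page = (
    "<body><div>Site navigation</div>"
    "<!-- START OF BODY COPY AREA ==================================== -->"
    "<h1>PanelBuilder32 Software</h1><p>Catalog Number 2711-ND3</p>"
    "<!-- END BODY COPY AREA ====================================== -->"
    "<div>Footer links</div></body>"
)
kept = keep_body_copy(page)
print(kept)
```

Only the heading and catalog number survive, so shared navigation never reaches the index and there is nothing for Remove Common to guess at.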
John Turnbull
Thunderstone Software
jgdoke
Posts: 167
Joined: Wed Jul 14, 2004 10:52 am

Post by jgdoke »

From talks with support, Remove Common only works from the beginning of the page, i.e. right after the <body> tag, not in the middle of a page.

There are no other pages in that crawl that have: "PanelBuilder32™ Software"
much less "PanelBuilder32™ Software Catalog Number 2711"
So I ask again: why is this text stripped out?

Second: is there any way to find out what is stripped by the "Remove Common" option?
jgdoke
Posts: 167
Joined: Wed Jul 14, 2004 10:52 am

Post by jgdoke »

Here is some additional help to figure this out. I am finding these pages with this text stripped out:

http://www.ab.com/eoi/ethernetip.html
Page text:
PanelView Standard Terminals with EtherNet/IP Communication
Bulletin 2711
Database Body starts with: 2711

http://www.ab.com/eoi/panelbuilder32.html
Page text:
PanelBuilder32™ Software
Catalog Number 2711-ND3
Database Body starts with: ND3

http://www.ab.com/eoi/
Page text:
Electronic Operator Interface Solutions
At Rockwell Automation, we understand that your automation system needs to provide the...
Database Body starts with:
we understand that your automation system needs to provide the....

http://www.ab.com/eoi/classicdisplays/
Page text: Classic Message Displays

DataLiner™ RediPANEL™ DTAM Plus™ DTAM Micro™

Before the advent of the PanelView product lines,

Database Body starts with: (first line is the "Alt text" from the pictures)
Dataliner Info RediPANEL Info DTAM Plus Info DTAM Micro Info
DataLiner™ RediPANEL™ DTAM Plus™ DTAM Micro™
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Those all look like classic examples of remove common, especially the eoi page. There must be some other page in the database that "had" the same leading text. No stripping occurs except for remove common, keep/ignore tags, data from field, and ignore chars.

There's no way to see what Remove Common did without comparing against another walk that has Remove Common turned off.
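
If the stored Body text from both walks can be copied out (for example from the list/edit URL section), the comparison can be done with a few lines of Python. This is a sketch under assumptions: `stripped_prefix` is a made-up helper, not part of the product, and the "full" body below is what a walk with Remove Common off is assumed to store for the PanelBuilder32 page.

```python
def stripped_prefix(full_body: str, stored_body: str):
    """Return the leading text present in full_body but missing from stored_body.

    Whitespace is normalized first. Returns None when stored_body is not the
    tail of full_body, meaning something other than leading text was removed.
    """
    full = " ".join(full_body.split())
    stored = " ".join(stored_body.split())
    if not full.endswith(stored):
        return None
    return full[: len(full) - len(stored)]


# Assumed body from a walk with Remove Common off.
full = (
    "PanelBuilder32™ Software Catalog Number 2711-ND3 "
    "PanelBuilder32 version 3.80 now supports resizable bitmaps"
)
# Body as reported in this thread with Remove Common on.
stored = "ND3 PanelBuilder32 version 3.80 now supports resizable bitmaps"

removed = stripped_prefix(full, stored)
print(repr(removed))
```

Searching the rest of the walk for the returned text is one way to look for the other page that shares it.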
jgdoke
Posts: 167
Joined: Wed Jul 14, 2004 10:52 am

Post by jgdoke »

If there is such a page, I do not know what it is. I searched the entire web site for code that matched and found none. No matter, I am turning it off.

In your post #2 you said
"You may want to change the start expression to just <script (with a space after it), as some of the scripts start <script language> instead, so it may remove from a </script> backward till it hits a <script src."

This sounds like it does not start at the start text and then run until the end text. That is a problem.

I have a menu that I want to remove.
<form action="" name="DropSearch"
is the start of this form. And I am going until </form>

What kind of danger am I inviting if I use this as the start of an ignore tag? Is it going to take any </form> tag and run backwards from it?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Sorry for the confusion. It scans forward from the begin tag looking for the end tag (or end of page).
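
The forward scan described above can be modeled in a few lines of Python. This is only a sketch of the stated behavior, not the product's implementation; `strip_ignored` and the extra `SiteSearch` form are hypothetical. A `</form>` that appears before the begin tag is never touched, because the search for the end tag starts at the begin tag.

```python
def strip_ignored(html: str, begin: str, end: str) -> str:
    """Drop every span that starts at `begin` and runs forward to the next `end`.

    If no `end` follows a `begin`, everything through the end of the page
    is dropped.
    """
    out = []
    pos = 0
    while True:
        start = html.find(begin, pos)
        if start == -1:
            out.append(html[pos:])  # no more begin tags: keep the rest
            break
        out.append(html[pos:start])
        stop = html.find(end, start + len(begin))
        if stop == -1:
            break  # no end tag: ignore through end of page
        pos = stop + len(end)
    return "".join(out)


# Hypothetical page: an unrelated form comes before the menu form.
page = (
    '<form action="/search" name="SiteSearch"><input name="q"></form>'
    "<p>Intro text</p>"
    '<form action="" name="DropSearch"><select><option>Menu</option></select></form>'
    "<p>Body text</p>"
)
cleaned = strip_ignored(page, '<form action="" name="DropSearch"', "</form>")
print(cleaned)
```

The first form and both paragraphs are kept; only the `DropSearch` form is removed.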