Exclude index pages

pete.smith
Posts: 73
Joined: Tue May 17, 2005 2:08 pm

Post by pete.smith »

Hi, here is my issue: I have a frontend for newsgroups that I am crawling. I only want the posts themselves in my results, not the indexes of posts (the thread view). URLs are pretty standard, like thread.php?= or ?article.php? I just want articles.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Ideally the index pages would have
<meta name="robots" content="noindex,follow">
on them. See http://www.robotstxt.org/wc/meta-user.html
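For illustration only, here is a minimal Python sketch of how a crawler might interpret that tag: "noindex" means the page itself is not stored, while "follow" means its links are still crawled. The class and function names are hypothetical, and this is not how any particular crawler implements it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute names arrive lowercased; values keep their original case.
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def robots_policy(html):
    """Return (index_page, follow_links) for one HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    index_page = "noindex" not in parser.directives
    follow_links = "nofollow" not in parser.directives
    return index_page, follow_links


page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
print(robots_policy(page))
```

With the tag above, the page is skipped in the results but the crawler can still reach the articles it links to, which is the behaviour wanted for the thread view.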

If you can't control that, you can use "Data from Field" to simulate it. Set it to recognize your index page URLs or content and set "Exclude" to "Pages only".
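The effect of such a rule can be sketched in Python as below. This is only a model of the behaviour (drop the page, keep its links), not the product's own configuration or code. The host name, the `id` query parameter, and the function name are assumptions made for the example; only `thread.php` and `article.php` come from the question.

```python
from urllib.parse import urlsplit

# Script names treated as index (thread view) pages; taken from the question.
INDEX_SCRIPTS = {"thread.php"}


def crawl_decision(url):
    """Decide whether to store a page and whether to follow its links."""
    path = urlsplit(url).path
    script = path.rsplit("/", 1)[-1].lower()
    is_index = script in INDEX_SCRIPTS
    return {
        # Index pages are left out of the results ...
        "store_page": not is_index,
        # ... but their links are still followed so the articles get found.
        "follow_links": True,
    }


# Hypothetical URLs for illustration.
print(crawl_decision("http://news.example.com/thread.php?id=5"))
print(crawl_decision("http://news.example.com/article.php?id=42"))
```

Excluding both pages and links for the thread view would likely stop the crawler from ever discovering the articles, which is why only the pages are excluded.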
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

With exclude by field you could search for thread.php in the URL and exclude pages only, not links.
John Turnbull
Thunderstone Software
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Oops, I meant "Exclude by field", not "Data from field".