Exclude index pages
Posted: Wed Jun 28, 2006 12:42 pm
by pete.smith
Hi, here is my issue: I have a frontend for newsgroups that I am crawling. I really only want THE RESULTS (the posts), not the indexes of posts (the thread view), in my results. URLs are pretty standard, like thread.php?= or ?article.php? I just want articles.
Posted: Wed Jun 28, 2006 12:55 pm
by mark
Ideally the index pages would have
<meta name="robots" content="noindex,follow">
on them. See
http://www.robotstxt.org/wc/meta-user.html
If you can't control that, you can use "Data from Field" to simulate it. Set it to recognize your index page URLs or content, and set "Exclude" to "Pages only".
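For illustration, here is a rough Python sketch of how a crawler might read that robots meta tag and decide what to do with a page. It is only a sketch, not how any particular search product works; the names RobotsMetaParser and robots_policy are made up for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_policy(html):
    """Return (index_page, follow_links) for one page of HTML.

    Hypothetical helper: pages are indexed and followed unless the
    robots meta tag says "noindex" or "nofollow".
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    index_page = "noindex" not in parser.directives
    follow_links = "nofollow" not in parser.directives
    return index_page, follow_links
```

With noindex,follow on a thread view, the page itself stays out of the results while the crawler still follows its links through to the articles.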
Posted: Wed Jun 28, 2006 1:52 pm
by John
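With "Exclude by field" you could search for thread.php in the URL and exclude pages only, not links. That way the index pages stay out of the results, but the links on them are still followed to reach the articles.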
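The same idea as a small Python sketch, in case it helps to see the logic spelled out. This is not the product's own configuration syntax; INDEX_MARKERS, should_index, should_follow_links and the example.com URLs are all assumptions made up for illustration.

```python
# Substrings that mark a URL as an index (thread view) page.
# Assumption: thread views are served by thread.php, as described above.
INDEX_MARKERS = ("thread.php",)


def should_index(url):
    """Hypothetical rule: keep a page in the results unless its URL
    looks like an index page."""
    return not any(marker in url for marker in INDEX_MARKERS)


def should_follow_links(url):
    """Links are always followed, so articles that are only reachable
    from an index page still get crawled ("pages only, not links")."""
    return True


if __name__ == "__main__":
    # Made-up URLs for demonstration only.
    for url in (
        "http://news.example.com/thread.php?group=comp.lang.c",
        "http://news.example.com/article.php?id=1234",
    ):
        print(url, "index:", should_index(url), "follow:", should_follow_links(url))
```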
Posted: Wed Jun 28, 2006 1:54 pm
by mark
Oops, I meant "Exclude by field", not "Data from field".