How to exclude the directory browsing page?

Posted: Fri Oct 19, 2007 3:34 pm
by twu
Hi, I am creating a site that contains two URLs, e.g.:
http://www.mysite.com/myApplication/ -- this contains all the application pages.
http://www.mysite.com/NR/Resource/ -- this contains all the resource documents that are linked from the application pages, e.g. PDF and DOC files.

I entered both URLs as Base URLs. Because http://www.mysite.com/NR/Resource/ has several levels of subdirectories with no index pages in them, only PDF files, I had to enable the Directory Browsing option for http://www.mysite.com/NR/Resource/. The problem now is that when a user searches for "index of", the results list all the directory index pages. How can I avoid this?
I did a little searching and found that modifying the dowalk script would do the trick, but is there a way to do it with configuration alone, such as Exclude by Field?

Posted: Fri Oct 19, 2007 3:50 pm
by jason112
If you're generating the pages, you can add a standard "robots" command to the page that tells Webinator (and other search engines) not to use the page's contents.

Add this in the <head> of the HTML pages:

<meta name="robots" content="noindex"/>
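
To check whether a given page actually carries that tag, something like the following Python sketch could be used. It is only an illustration built on the standard library's html.parser; the names RobotsMetaParser and has_noindex are made up for this example and have nothing to do with Webinator itself.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag contains "noindex"."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta .../> also end up here.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # The content attribute may hold several comma-separated directives.
        content = attr.get("content") or ""
        directives = [d.strip().lower() for d in content.split(",")]
        if "noindex" in directives:
            self.noindex = True


def has_noindex(html):
    """Return True if the HTML text asks robots not to index the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex
```

Calling has_noindex() on the text of a generated page is a quick way to confirm the tag made it into the output before re-walking the site.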

Posted: Fri Oct 19, 2007 3:58 pm
by jason112
Re-reading, I realize you're not _generating_ the index pages yourself. Exclude By Field, matching on the URL, would probably be the best way to do it.

The logic is that we want to exclude everything under http://www.mysite.com/NR/Resource/ that ends with a slash.

The REX expression for this should be

http://www.mysite.com/NR/Resource=!http ... urce*/=>>=

Set the field to "url" and exclude to "pages only"; that should do the trick.
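
For anyone who wants to sanity-check which URLs that rule is meant to catch, the logic (under the Resource base and ending with a slash) can be sketched as a plain Python predicate. This is not REX or Metamorph syntax, just an illustration; the function name is_directory_listing is made up for the example.

```python
# Base taken from the thread; adjust for the real site.
BASE = "http://www.mysite.com/NR/Resource"


def is_directory_listing(url, base=BASE):
    """True for URLs under the base that end with a slash.

    Those are the server-generated directory index pages; files such
    as PDFs do not end with a slash and are left alone.
    """
    return url.startswith(base + "/") and url.endswith("/")
```

Note that the top-level http://www.mysite.com/NR/Resource/ itself matches under this rule, while a file such as .../Resource/sub/file.pdf does not.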

Posted: Fri Oct 19, 2007 3:59 pm
by twu
Thanks for your quick response, but those are not real pages; they are generated by the web server to list the subdirectories and the files in them.

Posted: Fri Oct 19, 2007 4:02 pm
by mark
They're still "pages" even if they're dynamically generated by the server. Jason's suggestion still applies.

Posted: Fri Oct 19, 2007 4:28 pm
by twu
Thanks guys, I tried Jason's REX, but it didn't work. I found that the following setting excludes most of the index pages, except for the parent directory (http://www.mysite.com/NR/Resource/).
Query: http://www.mysite.com/NR/Resource/
Field: URL
Exclude to: Links Only

The only index page that still shows up is the top directory, and the PDFs show up fine.

Posted: Fri Oct 19, 2007 5:17 pm
by mark
Since Exclude by Field is a Metamorph query, not just REX, you'd have to put a leading / on Jason's expression:

/http://www.mysite.com/NR/Resource=!http ... urce*/=>>=

Or you could look at the HTML of those pages and pick out something unique to match on.
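
As a rough idea of what "something unique" might be: the thread mentions that searching for "index of" finds these pages, so a title starting with "Index of" is a plausible marker. That is an assumption about this particular web server's output, so check the real HTML first. A minimal Python sketch of that test (the name looks_like_listing is made up for the example):

```python
import re

# Assumption: the server-generated listing has a <title> that starts
# with "Index of". Verify against the actual pages before relying on it.
TITLE_RE = re.compile(r"<title>\s*Index of\b", re.IGNORECASE)


def looks_like_listing(html):
    """Return True if the HTML has a title beginning with "Index of"."""
    return TITLE_RE.search(html) is not None
```

Whatever marker turns out to be unique could then be used as the match text in place of the URL-based expression.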