
adding exclusions in refresh

Posted: Mon Oct 18, 2004 12:13 pm
by KMandalia
I understand that refresh will only refresh what is already in the database. But if I add some Exclusions and Exclusion REX entries after I pause the walk, will it stop refreshing what is already in the database, and will it stop bringing in URLs similar to those already there? I missed one of the sort types to exclude when crawling a website, and that mistake is now costing me some 18,000 pages; the refresh walk keeps bringing in more pages even after I added the exclusions.

Also, how can I add URL patterns that are not directory structures but query patterns?

instead of http://www.somesite.com/somedir/*

I want to add: http://www.somesite.com?somequery?somevar=*

If I have multiple categories, and assuming there is some way to accomplish the query pattern, what happens to URLs that match none of the categories?

adding exclusions in refresh

Posted: Mon Oct 18, 2004 1:08 pm
by mark
You should use List/Edit URLs to find and delete the URLs you don't want in the database. Exclusion rules mainly apply to newly discovered URLs.

Your example pattern should work. The entire literal URL is considered for category matching.
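
To illustrate the idea, here is a rough Python sketch (purely illustrative; the category names and the wildcard-to-regex translation are my own assumptions, not the actual walker code):

import re

# Hypothetical category patterns: the two URL shapes from the question,
# with the trailing * translated to the regex .*
CATEGORIES = {
    "dir-based":   re.compile(r"http://www\.somesite\.com/somedir/.*"),
    "query-based": re.compile(r"http://www\.somesite\.com\?somequery\?somevar=.*"),
}

def categorize(url):
    # The entire literal URL, query string included, is tested,
    # which is why a query pattern can match at all.
    for name, pattern in CATEGORIES.items():
        if pattern.match(url):
            return name
    return None  # matches no category

print(categorize("http://www.somesite.com/somedir/page.html"))      # dir-based
print(categorize("http://www.somesite.com?somequery?somevar=123"))  # query-based
print(categorize("http://www.somesite.com/elsewhere.html"))         # None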

URLs with no category will still be returned when searching in "everything".

adding exclusions in refresh

Posted: Mon Oct 18, 2004 2:01 pm
by KMandalia
That's the problem. If something like

http://www.somesite.com/somepage.asp?q1=.... is already in the database, and I exclude

somepage.asp\?q1\=
and even \somepage.asp,

and then do a refresh walk, it should no longer bring in pages with this pattern, but it looks like it does. List/Edit URLs is always an option, but I just wanted to know if I am doing something wrong.
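
For reference, reading the first expression as an ordinary regular expression (an assumption on my part; REX escaping is close to, but not identical to, standard regex), it does match the URL, so the pattern itself seems right:

import re

# The URL shape in question; the q1 value here is a placeholder.
url = "http://www.somesite.com/somepage.asp?q1=foo"

# The exclusion expression exactly as written above; the unescaped .
# matches any character, including a literal dot, so it still works.
pattern = r"somepage.asp\?q1\="

print(bool(re.search(pattern, url)))  # True: the exclusion does match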