Including Other URLS

robert.phillips
Posts: 12
Joined: Tue Mar 22, 2005 4:26 pm

Including Other URLS

Post by robert.phillips »

mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Including Other URLS

Post by mark »

You didn't say which REX expression you tried.
Either add that location as another base URL, or use an extra URLs REX such as
>>http://leagis:8080/Public/=.*
robert.phillips
Posts: 12
Joined: Tue Mar 22, 2005 4:26 pm

Including Other URLS

Post by robert.phillips »

I didn't want to include it as another base URL because I don't want it to crawl the actual web pages starting at the root of "Public". My base page contains numerous links to actual files, which are what I want to include.

I added your expression, but it doesn't seem to be working. The links I want indexed show up as children, but they are not being indexed.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Including Other URLS

Post by mark »

Turn verbosity up to 4, then do a new walk with the rewalk type set to new. The child links under list/edit urls should then indicate why they were rejected. Also give an actual example URL that's not getting indexed.
robert.phillips
Posts: 12
Joined: Tue Mar 22, 2005 4:26 pm

Including Other URLS

Post by robert.phillips »

Did that. Here's the list of errors so far:

Recent errors
Visited              Reason               Url
--------------------+--------------------+-------------------------------------------------------
Less than 1 min ago  Unwanted prefix      http://leagis:8080/Public/2001C/Bills/0 ... 94579.html
Less than 1 min ago  Unwanted prefix      http://leagis:8080/Public/2001C/Bills/0 ... 04821.html
Less than 1 min ago  Unwanted prefix      http://leagis:8080/Public/2001C/Bills/0 ... 30139.html
Less than 1 min ago  Unwanted prefix      http://leagis:8080/Public/2001C/Bills/0 ... 35844.html
Less than 1 min ago  Unwanted prefix      http://leagis:8080/Public/2001C/Bills/0 ... 25013.html
Less than 1 min ago  Unwanted prefix      http://leagis:8080/Public/2001C/Bills/0 ... 64883.html
Less than 1 min ago  Unwanted prefix      http://leagis:8080/Public/2001C/Bills/0 ... 40029.html
Less than 1 min ago  Unwanted prefix      http://leagis:8080/Public/2001C/Bills/0 ... 90421.html
Less than 1 min ago  Unwanted prefix      http://leagis:8080/Public/2001C/Bills/0 ... 55639.html
Less than 1 min ago  Unwanted prefix      http://leagis:8080/Public/2001C/Bills/0 ... 34074.html
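
With a long list like this it can help to tally the rejections by reason, to confirm that every link is failing for the same cause. The following is only a rough sketch in Python, not anything the walker provides: it assumes the listing has been pasted as plain text in the layout above, that the URL starts at the first "http://" on each row, and that the visited column ends with the word "ago". The function name tally_reasons is made up for illustration.

```python
from collections import Counter


def tally_reasons(listing: str) -> Counter:
    """Count rejected URLs per reason in a pasted 'Recent errors' listing."""
    counts = Counter()
    for row in listing.splitlines():
        # Title, header and separator rows contain no URL, so skip them.
        url_start = row.find("http://")
        if url_start == -1:
            continue
        left = row[:url_start].strip()
        # Assumption: the visited column looks like "Less than 1 min ago".
        _, found, reason = left.partition(" ago ")
        # Fall back to the whole left part if the "ago" marker is missing.
        counts[reason.strip() if found else left] += 1
    return counts
```

Calling tally_reasons on the pasted listing returns a Counter keyed by reason; for the ten rows above, every entry would fall under "Unwanted prefix".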
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Including Other URLS

Post by mark »

Sorry, I had my blinders on. Setting extra urls isn't the way to do what you want. You need to put
http://leagis:8080/Public/
in the base url. Then in "exclude by field" enter a query of
/http://leagis:8080/Public/=>>=
for field "URL" and exclude "Pages and Links". That tells the walker that the site leagis:8080 is acceptable, but that the particular page http://leagis:8080/Public/ is not, and that it should not be followed.
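
The intended effect of that combination can be modeled roughly as follows. This is a simplified Python sketch of the rule as described in this post, not the walker's actual logic: it assumes acceptance is decided per site (host and port) from the base URL list, and that the exclusion applies to one exact URL. The function names and the URLs in the usage note are hypothetical.

```python
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Return the host:port part of a URL, e.g. 'leagis:8080'."""
    return urlsplit(url).netloc.lower()


def classify(url: str, base_urls: list[str], excluded_pages: set[str]) -> str:
    """Return 'excluded', 'index', or 'unwanted prefix' for one URL.

    Simplified model (assumption): a URL is wanted when its site matches
    the site of any base URL, unless it is one of the excluded pages.
    """
    # The exclusion is checked first, so the excluded page is never followed.
    if url in excluded_pages:
        return "excluded"
    allowed_sites = {site_of(base) for base in base_urls}
    if site_of(url) in allowed_sites:
        return "index"
    return "unwanted prefix"
```

For example, with base URLs of a hypothetical starting page plus http://leagis:8080/Public/, and http://leagis:8080/Public/ as the only excluded page, a file such as http://leagis:8080/Public/sample.html would be classified as "index", while http://leagis:8080/Public/ itself would be "excluded".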
robert.phillips
Posts: 12
Joined: Tue Mar 22, 2005 4:26 pm

Including Other URLS

Post by robert.phillips »

After applying the REX expression in #2 above, I later noticed that I was getting an "out of memory" error. I found a reference to that in this knowledge base, which said to restart the walk. I did that, and now it is working as expected.

There were about 27,000 links to crawl.