Crawling Dynamic Database Content

Posted: Wed Jan 03, 2007 1:09 pm
by lcherry0
Is it possible to crawl a dynamic web site using aspx pages via Webinator? If so, how is this done? Do you need a particular version of Webinator to do this?

Posted: Wed Jan 03, 2007 1:17 pm
by mark
Yes. Nothing special is needed. Just make sure .aspx is in the allowed "extensions" list, remove ? from the default "exclusions", and turn off "strip queries", so that URLs with query strings are followed and kept distinct.

If you have a perpetual calendar or anything else that keeps returning new pages forever, add an exclusion for it or set a max depth so the crawl doesn't go too deep into it.
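
To make the effect of those settings concrete, here is a rough Python sketch of the kind of filtering they describe. This is not Webinator's code or configuration format. The function names, the substring matching for exclusions, and the example paths are all assumptions for illustration only.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def normalize(url, strip_queries):
    """Drop the query string when strip_queries is on (hypothetical helper)."""
    if not strip_queries:
        return url
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def should_crawl(url, depth, allowed_exts, exclusions, max_depth):
    """Return True if the URL passes the depth, exclusion, and extension checks.

    Exclusions are treated as plain substrings here, which is an assumption.
    """
    if depth > max_depth:
        return False
    if any(pattern in url for pattern in exclusions):
        return False
    ext = posixpath.splitext(urlsplit(url).path)[1].lower()
    return ext in allowed_exts


if __name__ == "__main__":
    url = "http://somesite/find.aspx?zip=44107"
    # With "?" still in the exclusions, the dynamic page is skipped.
    print(should_crawl(url, 1, {".aspx"}, ["?"], 5))
    # With "?" removed and a calendar exclusion added, it is crawled.
    print(should_crawl(url, 1, {".aspx"}, ["/calendar/"], 5))
    # With strip_queries on, every zip code collapses to the same URL.
    print(normalize(url, strip_queries=True))
```

The last call shows why "strip queries" has to be off: with it on, every query-string variant of find.aspx would look like the same page.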

Posted: Wed Jan 03, 2007 1:51 pm
by lcherry0
Yes, I see that it crawls aspx pages with those settings. What if I want to take this a step further and crawl an underlying database? Information from this database is typically returned on the web site when a user enters a zip code using a web form.

Posted: Wed Jan 03, 2007 2:18 pm
by mark
Where the data comes from doesn't really matter much; what matters is how it's presented on a web page. You can supply form inputs by putting a query string into a URL, something like:
http://somesite/find.aspx?zip=44107
or, if the form insists on method POST:
http-post://somesite/find.aspx?zip=44107

The first URL above will work in a browser, so you can experiment with it there. The second is Webinator-specific syntax.
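
If there are many zip codes to cover, a list of such URLs could be generated with a small script. The following Python sketch is only an illustration. The host and page come from the example URLs above, while the function name, the extra zip code, and the idea of feeding the output to the crawler as starting URLs are assumptions.

```python
from urllib.parse import urlencode


def seed_urls(base, zips, post=False):
    """Build one URL per zip code.

    base: host and page without a scheme, e.g. "somesite/find.aspx"
    post: if True, use the http-post:// scheme from the example above
    """
    scheme = "http-post" if post else "http"
    return [f"{scheme}://{base}?{urlencode({'zip': z})}" for z in zips]


if __name__ == "__main__":
    # Hypothetical zip codes; replace with the ones the form should be queried for.
    for url in seed_urls("somesite/find.aspx", ["44107", "44113"]):
        print(url)
```

Passing post=True produces the http-post:// form instead, for pages that only accept POST.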

If making the database searchable is the primary purpose of your index, you should be using the Texis product instead of Webinator.

Posted: Wed Jan 03, 2007 3:07 pm
by lcherry0
If I used Texis, how would I configure it to index a database?

Posted: Wed Jan 03, 2007 4:33 pm
by mark
There are various ways to load data into a Texis database: timport (command-line and Vortex versions), the C API, Perl DBD, and Java JDBC.

Importing the data into Texis is more precise than scraping web pages. It allows specialized schemas and more flexible multi-field searches with various grouping and ordering options. Using Texis gives you full control of the application, rather than having to shoehorn something into Webinator.