Excuse all the questions, but we recently received the search appliance, and I am working through the settings trying to figure out why the results are not coming back as I expect.
Current issue...
I have been changing one setting at a time, then re-indexing my server. The most results I've been able to get are 627 files (it's still not going into the option lists, but I'll work on that later). Setting Max Page Size to "-1" (without quotes), which is supposed to allow any file size if I'm reading the instructions correctly, cuts my result set down to 173.
When I set it to "20000000" (adding a zero to the default) it also returns 173 results.
Shouldn't increasing this option allow it to index more pages?
For HTML pages it will typically just truncate the larger pages. Depending on what you have the Max Process Size set to, you might want to try a refresh crawl; if the walk stopped because of a memory size issue, the refresh will resume it.
That setting should only have an effect if there were pages getting truncated. Your change in number is probably something else. Make sure you're doing a new walk each time while playing with settings rather than a refresh.
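One way to check whether truncation could be involved at all is to list the files that are larger than the current limit. The following is a minimal sketch, assuming the content is also reachable as files on a local or mounted filesystem and that Max Page Size is counted in bytes; neither is confirmed for the appliance. The function name files_over_limit and the 2000000 figure (the default mentioned in this thread) are used here for illustration only.

```python
import os


def files_over_limit(root, max_bytes):
    """Return (path, size) pairs for files under root larger than max_bytes.

    Assumption: max_bytes mirrors the appliance's Max Page Size setting,
    interpreted as a byte count.
    """
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file; skip it
            if size > max_bytes:
                oversized.append((path, size))
    # Largest files first, since those are the most likely to be truncated.
    return sorted(oversized, key=lambda item: item[1], reverse=True)
```

Calling files_over_limit("/path/to/docroot", 2000000) with your own document root would return the files that exceed the limit. If the list is empty, the setting is unlikely to explain a change in the result count.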
The thing to do is turn verbosity to 4 and do a new walk. Then go to List/Edit urls. Click submit to get everything. Click on a page to get info about it. Then click on the "Children" link on the info page. That will show which urls were found on the page, whether each one is in the database, and, for those that are not, the reason. That should clue you in fairly quickly about what settings you might need to change.
Another approach is to pick a page you think should be in the database but isn't. Determine which page on your site links to that page (its parent). Look up that parent page in List/Edit urls and check the Children links.
I've changed the 2000000 to -1, 2500000, 3000000, 1000000, 20, etc., and the results are not what I'd expect.
I still get the most returns with 2000000. When I look at the error file for vortex, I see a lot of pdf files that have been truncated. That's why I've tried increasing the size. I also set process size to "large" since I'm only running this one at a time.
I also have a ton of /dir/dir/post and /dir/dir/undefined errors in the log. I'm not sure what it's looking for since it's only listing the directory root. Any suggestions?
The walker will descend into any directories it discovers under your specified directory. We'd need more specifics on the errors encountered to comment on them.
Those urls were found on one of the pages it walked. They don't exist on your server. Use the techniques described above to find what page links to those urls. Then examine that page for why it links to those pages.
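Once you have a candidate parent page, you can also check its HTML source directly. The following is a minimal sketch, assuming you have the page's HTML as a string; the names LinkCollector and suspect_links are hypothetical, and the set of attributes scanned is a guess at what a crawler might follow, not the appliance's actual rules. It resolves each link against the page URL and reports the ones that end in /post or /undefined.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Attributes that commonly carry URLs (an assumption, not the walker's real list).
URL_ATTRS = {"href", "src", "action"}
# Path endings taken from the errors reported in this thread.
SUSPECT_ENDINGS = ("/undefined", "/post")


class LinkCollector(HTMLParser):
    """Collect (tag, attribute, raw value) for every URL-bearing attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                self.links.append((tag, name, value))


def suspect_links(page_url, html_text):
    """Return links on the page whose resolved path ends in a suspect name."""
    collector = LinkCollector()
    collector.feed(html_text)
    hits = []
    for tag, attr, raw in collector.links:
        resolved = urljoin(page_url, raw)  # resolve relative to the page
        if urlparse(resolved).path.lower().endswith(SUSPECT_ENDINGS):
            hits.append((tag, attr, raw, resolved))
    return hits
```

A literal "undefined" in a link often comes from a script writing an unset variable into the page, so the tag and attribute reported for each hit may point to the cause. This only sees links present in the static HTML; links built at run time by scripts would not show up here.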