Titles from anchor text

webinatoruser
Posts: 39
Joined: Fri Apr 08, 2005 8:54 am

Post by webinatoruser »

Most of my sites use anchor text that matches the real title of the document, but the HTML/PDF titles themselves are either garbage or generated, and therefore the same on every page. I looked over the "data from field" settings and noticed there is no option to take the title from the anchor text (I got excited at first when I saw the URL anchor option, but that's for something else). I suspect that information isn't sent over by the server, so there is no way of holding on to it at crawl/fetch time. Any ideas other than writing a custom regex to find the title on each page? The pages are different for each site, and if a page design ever changes, every regex has to be rewritten.
John
Site Admin
Posts: 2597
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

The other issue with link text is that you could have multiple links with different text, and knowing which one to use could be problematic.

There probably isn't a good way right now if the <title> tag isn't used properly on the site, and the pages aren't consistent.

A future update will have a better title guesser for PDFs by looking at the content of the page.
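
To make the multiple-links problem concrete, here is a minimal Python sketch (not Webinator code) of one possible tie-break: take the anchor text that is used most often for a URL. The function name `pick_anchor_title` and the sample data are hypothetical, and majority vote is only an assumption about which link text should win.

```python
from collections import Counter

def pick_anchor_title(anchor_texts):
    """Return the most frequent non-empty anchor text, or None if there is none.

    Hypothetical helper: majority vote is an assumed tie-break, not how
    Webinator (or any search engine) is known to choose.
    """
    # Normalize whitespace so "Annual  Report" and "Annual Report" count together.
    cleaned = [" ".join(t.split()) for t in anchor_texts]
    counts = Counter(t for t in cleaned if t)
    if not counts:
        return None
    # most_common(1) returns [(text, count)]; equal counts keep first-seen order.
    return counts.most_common(1)[0][0]

links = ["Annual Report 2004", "click here", "Annual  Report 2004"]
print(pick_anchor_title(links))
```

Generic link text such as "click here" would still win if it were the most common, so a real implementation would probably need a stop list as well.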
John Turnbull
Thunderstone Software

Post by webinatoruser »

For PDFs it makes sense to wait for the future update. For HTML, the title is sometimes somewhere else in the document that gets fetched. I tried using the data from field, but any time I use a regex I get no match. Really frustrating.

I found the title in an h1 tag like the following:

<h1 class="content">text of title</h1>

My rex is <h1\x20class\x3d\x22content\x22>\P=!</h1>+\F</h1>+

This works if I fetch the page in a test script, but not in Webinator. Is there something special about the data from field rex syntax?

I already checked whether I could pick up other things from the html field by rexing for something like <td>, and that worked, but the expression above does not.

Many thanks for your help.
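
For comparison, the same extraction can be sketched in Python outside Webinator. This is only an illustration of what the rex above is meant to capture, assuming the title sits in an <h1 class="content"> element as in the sample; the function name `h1_title` is made up for the example.

```python
import re
from html import unescape

# Matches <h1 class="content">...</h1>; DOTALL lets the title span lines.
H1_TITLE = re.compile(r'<h1\s+class="content"\s*>(.*?)</h1>',
                      re.IGNORECASE | re.DOTALL)

def h1_title(page):
    """Return the text of the first <h1 class="content">, or None."""
    m = H1_TITLE.search(page)
    if m is None:
        return None
    # Drop any nested tags, decode entities, and collapse whitespace.
    text = re.sub(r"<[^>]+>", "", m.group(1))
    return " ".join(unescape(text).split())

print(h1_title('<h1 class="content">text of title</h1>'))
```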

Post by webinatoruser »

Just as an added note: I checked one of the challenging sites in the Google index and found that they have the right title, even though the page title is garbage! How would they ever know to check the h1 tag for the title? Or are they using the referring anchor text to replace titles in all cases?

Post by John »

Do you have Keep or Ignore tags? The data from field applies after those.

Using an <h1> would make sense if there is no <title> tag. The rules for distinguishing a real title from a garbage title would be trickier, but there may be some heuristics as to length and use of language.
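
A rough Python sketch of that kind of heuristic, purely as an illustration: prefer the <title>, but fall back to the first <h1> when the title is missing, very short, or identical to a known site name. The function names, the `site_name` parameter, and the length threshold are all assumptions, not anything Webinator does today.

```python
import re

def _first_tag_text(page, tag):
    """Text of the first <tag>...</tag> in page, whitespace-collapsed, or ""."""
    m = re.search(r"<%s\b[^>]*>(.*?)</%s\s*>" % (tag, tag), page,
                  re.IGNORECASE | re.DOTALL)
    if m is None:
        return ""
    # Remove nested tags before collapsing whitespace.
    return " ".join(re.sub(r"<[^>]+>", "", m.group(1)).split())

def guess_title(page, site_name="", min_len=4):
    """Prefer <title>; fall back to <h1> when the title looks like garbage.

    "Garbage" is an assumed rule here: empty, shorter than min_len, or equal
    to site_name (the same title repeated on every page).
    """
    title = _first_tag_text(page, "title")
    h1 = _first_tag_text(page, "h1")
    looks_bad = len(title) < min_len or (
        bool(site_name) and title.lower() == site_name.lower())
    return h1 if (looks_bad and h1) else title

page = ('<html><head><title>Example News</title></head>'
        '<body><h1 class="content">Real story</h1></body></html>')
print(guess_title(page, site_name="Example News"))
```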
John Turnbull
Thunderstone Software
mark
Site Admin
Posts: 5515
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

What are you using for the Replace? It should be \2

Post by webinatoruser »

Thanks John,

I see your point about the difficulty. I just wish I could select at the profile level whether to go with the default action or send the anchor text to the title field instead.

I don't have anything in the keep or ignore tags. Just data from field against the html content. Should I be specifying something there?

Mark, I thought I could just supply the rex and the match would be passed on to the title field automatically, without needing to specify a replace argument. I tried \2 anyway and it is still not working.

I also thought it might require MM syntax and added a forward slash, but that's not working either.

Is the HTML fetcher stripping tag attributes by any chance? Maybe I am rexing for stuff that isn't there anymore?

Post by mark »

Perhaps the best thing is if you can provide a link to the page so we can try it.

Post by webinatoruser »

www.forum18.org

You will notice that all their stories have the same title, which is the site name, while the real titles are in an H1 tag on every story page.

Post by mark »

Their pages have some NUL characters in them, and there's a small bug with NUL handling in data from field. You can work around it by adding some code. In the datafromfield function, after
<case "HTML">
<$txt = $htmlpage>
add
<strfmt "%U" $htmlpage><sandr "%00" "" $ret><strfmt "%!U" $ret><$txt=$ret>
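
For readers who don't know Vortex: as far as I can tell, the added line URL-encodes the page, deletes every %00 (an encoded NUL), and decodes the result again. A rough Python equivalent of that idea is sketched below. It is an illustration only, not part of the Webinator script, and the function name `strip_nuls` is made up.

```python
from urllib.parse import quote, unquote

def strip_nuls(page):
    """Remove NUL characters via an encode / delete / decode round trip."""
    # With safe="", NUL becomes "%00" and a literal "%" becomes "%25",
    # so the only "%00" sequences in the encoded text are real NULs.
    encoded = quote(page, safe="")
    encoded = encoded.replace("%00", "")
    return unquote(encoded)

page = '<h1 class="content">text\x00 of title</h1>'
print(strip_nuls(page))
```

In Python a plain `page.replace("\x00", "")` would do the same job; the round trip is shown only to mirror the Vortex line.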