Search::Circa::Parser - provides the functions Circa uses to parse HTML pages
use Search::Circa::Indexer;
my $index = Search::Circa::Indexer->new;
$index->connect(...);
$index->Parser->look_at({ url => $url,
                          idr => $account });
This module uses HTML::Parser facilities. It is called by Search::Circa::Indexer
to index each document. The main method is look_at.
Indexes a URL. The work done is:
Keys for refHashParameters:
Returns (-1,0) if the URL is not valid; otherwise returns the number of words and the number of links found.
Splits the data into words and puts them in the global %$RM with a score. The hash structure is ('word' => factor).
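The word/score hash described above can be sketched as follows. This is only an illustration under stated assumptions: add_words is a hypothetical helper, not a function of Search::Circa::Parser, and the tokenizing rule and the factors 1 and 10 are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Search::Circa::Parser.
# Adds $factor to the score of every word found in $text.
sub add_words {
    my ($scores, $text, $factor) = @_;
    for my $word (split /\W+/, lc $text) {
        next if length($word) < 2;    # assumption: skip empty and one-letter tokens
        $scores->{$word} += $factor;  # same shape as ('word' => factor)
    }
    return scalar keys %$scores;      # number of distinct words so far
}

my %scores;
add_words(\%scores, 'Circa indexes HTML pages', 1);
add_words(\%scores, 'Circa', 10);     # assumption: e.g. a title word weighted higher
printf "%s => %d\n", $_, $scores{$_} for sort keys %scores;
```

A word that appears in several places accumulates the factors of each occurrence, so here 'circa' ends with a higher score than the other words.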
Checks whether the URL $links will be added to Circa. The URL must begin with $self->host_indexed, and its extension must not be one of doc, zip, ps, gif, jpg, gz, pdf, eps, png, deb, xls, ppt, class, GIF, css, js, wav, mid.
If $links is accepted, returns the URL; otherwise returns 0.
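A minimal sketch of such a link filter, assuming a plain prefix comparison and a case-sensitive extension match. accept_link is a hypothetical stand-in for the module's own check, $host plays the role of $self->host_indexed, and the example.com URLs are placeholders.

```perl
use strict;
use warnings;

# Extensions rejected according to the description above (case-sensitive).
my %skip_ext = map { $_ => 1 } qw(
    doc zip ps gif jpg gz pdf eps png deb xls ppt class GIF css js wav mid
);

# Hypothetical stand-in for the documented check, not the module's code.
sub accept_link {
    my ($host, $link) = @_;
    return 0 unless index($link, $host) == 0;   # must begin with the indexed host
    (my $path = $link) =~ s/[?#].*//s;          # assumption: ignore query and fragment
    return 0 if $path =~ /\.([^.\/]+)$/ && $skip_ext{$1};
    return $link;                               # accepted: return the URL itself
}

my $host = 'http://www.example.com';
for my $link ("$host/a/index.html", "$host/logo.gif", 'http://other.example.org/x.html') {
    printf "%-40s %s\n", $link, accept_link($host, $link) ? 'accepted' : 'rejected';
}
```

Returning the URL on success and 0 on failure mirrors the documented contract, so the result can be used directly in a boolean test.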
$Revision: 1.27 $
Alain BARBET alian@alianwebserver.com
Last modified Wed Oct 29 18:11:23 2003
© Alain & Estelle BARBET, text and images 1997-2003