Circa::Indexer - functions to administer Circa, a WWW search engine running on MySQL
use Circa::Indexer;
my $indexor = new Circa::Indexer;
die "Error connecting to MySQL: $DBI::errstr\n"
if (!$indexor->connect);
$indexor->create_table_circa;
$indexor->drop_table_circa;
$indexor->addSite({url => "http://www.alianwebserver.com/",
email => 'alian@alianwebserver.com',
title => "Alian Web Server"});
my ($nbIndexe,$nbAjoute,$nbWords,$nbWordsGood) = $indexor->parse_new_url(1);
print "$nbIndexe pages indexed, ",
"$nbAjoute pages added, ",
"$nbWordsGood words indexed, ",
"$nbWords words read\n";
$indexor->update(30,1);
See also circa_admin, admin.cgi, and admin_compte.cgi.
This is Circa::Indexer, a module that provides functions to administer Circa, a WWW search engine running on MySQL. Circa can serve a single Web site or a list of sites. It indexes pages the way AltaVista does: it can read, add, and parse all URLs found in a page, and it stores URLs and words in MySQL for use at search time.
This module provides routines to:
Remarks:
Circa parses an HTML document and converts it to text. It counts every word found and stores the result in a hash keyed by word. In addition, it reads the title, keywords, and description, and adds a weight to every word found there.
Example: A config:
my %ConfigMoteur=(
  'author' => 'circa@alianwebserver.com', # Engine maintainer
  'temporate' => 1, # Delay requests to the server by 8 s
  'facteur_keyword' => 15, # <meta name="KeyWords"
  'facteur_description' => 10, # <meta name="description"
  'facteur_titre' => 10, # <title></title>
  'facteur_full_text' => 1, # remaining text
  'facteur_url' => 15, # Words found in the URL
  'nb_min_mots' => 2, # Minimum weight for keeping a word
  'niveau_max' => 7, # Maximum level to index
  'indexCgi' => 0, # Index CGI links (e.g. ?nom=toto&riri=eieiei)
);
An HTML document:
<html>
<head>
<meta name="KeyWords"
CONTENT="informatique,computing,javascript,CGI,perl">
<meta name="Description"
CONTENT="Rubriques Informatique (Internet,Java,Javascript, CGI, Perl)">
<title>Alian Web Server:Informatique,Société,Loisirs,Voyages</title>
</head>
<body>
different word: cgi, perl, cgi
</body>
</html>
After parsing, the hash contains:
$words{'informatique'} = 15 + 10 + 10 = 35
$words{'cgi'} = 15 + 10 + 1 = 26
$words{'different'} = 1
A word is added to the database if its total is > $ConfigMoteur{'nb_min_mots'} (2 by default). If you set this to 1, the database will grow very quickly, but you will be able to run very exact searches with many words, such as phrase searches. If you do that, keep an eye on the size of the relation table.
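The weighting and threshold described above can be sketched in a few lines of Perl. This is only an illustration, not the module's parser: it assumes the four text fields have already been extracted from the page, the helper name weigh_words is hypothetical, the title is shortened, and each field's factor is added once per distinct word, which is the reading that reproduces the totals shown above.

```perl
use strict;
use warnings;

# Factors copied from the %ConfigMoteur example.
my %ConfigMoteur = (
    facteur_keyword     => 15,
    facteur_description => 10,
    facteur_titre       => 10,
    facteur_full_text   => 1,
    nb_min_mots         => 2,
);

# Assumption: fields already extracted from the sample page, keyed by factor name.
# The title is shortened here to keep the sketch ASCII-only.
my %fields = (
    facteur_keyword     => 'informatique,computing,javascript,CGI,perl',
    facteur_description => 'Rubriques Informatique (Internet,Java,Javascript, CGI, Perl)',
    facteur_titre       => 'Alian Web Server:Informatique',
    facteur_full_text   => 'different word: cgi, perl, cgi',
);

# Hypothetical helper: add a field's factor once for each distinct word in it.
sub weigh_words {
    my ($config, $fields) = @_;
    my %words;
    for my $factor (keys %$fields) {
        my %seen;
        for my $word (split /[^a-z0-9]+/, lc $fields->{$factor}) {
            next if $word eq '' || $seen{$word}++;
            $words{$word} += $config->{$factor};
        }
    }
    return \%words;
}

my $words = weigh_words(\%ConfigMoteur, \%fields);

# Keep only the words whose total exceeds nb_min_mots.
my @kept = grep { $words->{$_} > $ConfigMoteur{nb_min_mots} } sort keys %$words;

printf "%-12s %d\n", $_, $words->{$_} for qw(informatique cgi different);
print "kept: @kept\n";
```

With nb_min_mots left at 2, 'different' (total 1) is dropped while 'informatique' and 'cgi' are kept.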
After a page is read, Circa follows its HTML links, then the links on those pages, and so on. Each step increases the level by one. A URL is added only if its level is < $Config{'niveau_max'}.
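The level limit can be pictured as a breadth-first walk over links. The sketch below is an assumption-laden illustration rather than Circa's crawler: the link graph is a hypothetical in-memory hash instead of fetched pages, the start page is assumed to be at level 0, and urls_to_index is an invented name.

```perl
use strict;
use warnings;

# Hypothetical link graph standing in for fetched pages: url => links found on it.
my %links = (
    '/'      => ['/a', '/b'],
    '/a'     => ['/a/1'],
    '/b'     => ['/'],
    '/a/1'   => ['/a/1/x'],
    '/a/1/x' => [],
);

# Walk the links breadth-first and record the level of every URL that is added.
# A link is added only when its level is below $niveau_max.
sub urls_to_index {
    my ($start, $niveau_max, $links) = @_;
    my %level = ($start => 0);    # assumption: the start page is level 0
    my @queue = ($start);
    while (defined(my $url = shift @queue)) {
        my $next_level = $level{$url} + 1;
        next unless $next_level < $niveau_max;
        for my $link (@{ $links->{$url} || [] }) {
            next if exists $level{$link};    # already added
            $level{$link} = $next_level;
            push @queue, $link;
        }
    }
    return \%level;
}

my $added = urls_to_index('/', 2, \%links);
print "$_ (level $added->{$_})\n" for sort keys %$added;
```

With a limit of 2, only the start page and the pages it links to directly are added; raising the limit lets the walk go one level deeper each time.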
You can use the following keys in PARAMHASH:
Get or set the proxy for LWP::RobotUA or LWP::UserAgent.
Ex: $circa->proxy('http://proxy.sn.no:8001/');
ref_hash can have these keys: url, email, title, categorieAuto, cgi, rep, file
Creates an account with url as its first URL. Returns the id of the account created.
Parses the pages that have just been added. The program analyses every page whose 'parse' column equals 0.
Returns the number of pages analysed, the number of pages added, and the number of words indexed.
Updates URLs not visited in the last nb days for the account with id account. If idp is not given, 1 is used. URLs that have never been parsed are indexed.
Returns ($nb,$nbAjout,$nbWords,$nbWordsGood).
Creates the tables needed by Circa.
Drops all Circa tables. Be careful!
Drops the tables for the account with id account.
Creates the tables needed by Circa for the account with id account.
Exports data from MySQL to <directory of export>/circa.sql using mysqldump.
mysqldump: path to the mysqldump binary. If not given, it is searched for in /usr/bin/mysqldump, /usr/local/bin/mysqldump, /opt/bin/mysqldump.
<directory of export>: path of the directory where circa.sql will be created. If not given, the file is created in $CircaConf::export, or else in the /tmp directory.
Imports data into MySQL from circa.sql.
path_of_bin_mysql: path to the mysql binary. If not given, it is searched for in /usr/bin/mysql, /usr/local/bin/mysql, /opt/bin/mysql, ENV{PATH}.
path_of_circa_file: path of the directory from which circa.sql will be read. If not given, the file is read from $CircaConf::export, or else from the /tmp directory.
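The fallback lookup for the mysqldump and mysql binaries described above could look roughly like this. It is a sketch under assumptions, not the module's code: find_binary is a hypothetical helper, an explicitly given path is returned unchecked, and PATH is searched after the fixed directories.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the caller's path if one was given, otherwise
# the first executable named $name found in @dirs and then in $ENV{PATH}.
sub find_binary {
    my ($name, $given, @dirs) = @_;
    return $given if defined $given && length $given;
    for my $dir (@dirs, File::Spec->path) {
        my $candidate = File::Spec->catfile($dir, $name);
        return $candidate if -f $candidate && -x _;
    }
    return;    # nothing found
}

my $mysql = find_binary('mysql', undef, '/usr/bin', '/usr/local/bin', '/opt/bin');
print defined $mysql ? "mysql found at $mysql\n" : "mysql not found\n";
```

Returning nothing when the search fails lets the caller decide whether to die or to report the missing binary.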
Returns a hash with information about the account with id account. Keys are:
Search::Circa, Root class for circa
Search::Circa::Parser, Manage Parser of Indexer
circa_admin, command-line tool for using the indexer
$Revision: 1.39 $
Alain BARBET alian@alianwebserver.com
Last modified Wed Oct 29 18:11:23 2003
© Alain & Estelle BARBET, text and images 1997-2003