PubCrawler - Automated Retrieval of PubMed and GenBank Reports
usage: pubcrawler.pl [-add_path] [-base_URL <URL>] [-c <config_file>] [-check] [-copyright] [-d <directory>] [-db <database>] [-extra_range <range for extra entries>] [-force_mail] [-format <results format>] [-fullmax <max-docs in full>] [-getmax <max-docs to get>] [-h] [-help] [-head <output-header>] [-i] [-indent <pixels>] [-l <log_file>] [-lynx <alternative-browser>] [-mail <address for results>] [-mail_ascii <address for text-only results>] [-mail_simple <address for slim HTML results>] [-mute] [-n <neighbour_URL>] [-notify <address for notification>] [-no_decap] [-no_test] [-os <operating_system>] [-out <output-file>] [-p <proxy_server>] [-pp <proxy_port>] [-pauth <proxy_authorization>] [-ppass <proxy_password>] [-pre <prefix>] [-q <query_URL>] [-r <retrieve_URL>] [-relentrezdate <relative-entrez-date>] [-retry <number of retries>] [-s <search-term>] [-spacer <gif>] [-t <timeout>] [-u <test_URL>] [-v <verbose>] [-viewdays <view-days>] [-version]
options:
-add_path       adds the path /cwd/lib to @INC (the list of library directories), where cwd stands for the current working directory
-base_URL       specify a URL that will be used for the 'Back To Top' link; the name of the output file will be appended to it (specify 'local_file' to go to anchors in the local file)
-c              configuration file for PubCrawler
-check          checks whether the program and additional files are set up correctly
-copyright      shows copyright information
-d              PubCrawler working directory (configuration, database, and output)
-db             name of database file
-extra_range    specifies the number of documents combined in a link; minimum value is 1, defaults to 'fullmax'
-fullmax        maximum number of full-length reports shown (per search)
-getmax         maximum number of documents to retrieve (per search)
-h              this help message
-head           HTML header for output file
-help           same as -h
-i              include configuration file in HTML output
-indent         indent PubCrawler comments n pixels (default 125)
-l              name of file for log information
-lynx           command for alternative browser
-mail           e-mail address to send results to (optionally append '#' or '@@' and a user name)
-mail_ascii     e-mail address to send text-only results to (optionally append '#' or '@@' and a user name)
-mail_simple    e-mail address to send slim HTML results to (optionally append '#' or '@@' and a user name)
-mute           suppresses messages to STDERR
-n              URL to which neighbourhood searches are directed
-notify         e-mail address for notification (optionally append '#' or '@@' and a user name)
-no_decap       prevents results from NCBI from being processed (chopping off head and tail and collecting UIs)
-no_test        skips the proxy test
-os             operating system (some badly configured versions of Perl need this to be set explicitly -> 'MacOS', 'Win', and 'Unix')
-out            name of file for HTML output
-p              proxy
-pp             proxy port
-pauth          proxy authorization (user name)
-ppass          proxy password
-pre            prefix used for default file names (config file, database, log)
-q              URL to which normal (not neighbourhood) searches are directed
-r              specify a URL that will be used for retrieving new hits from NCBI
-relentrezdate  maximum age (relative date of publication in days) of a document to be retrieved; other valid entries: '1 year', '2 years', '5 years', '10 years', 'no limit'
-retry          specify how often PubCrawler should retry queries in case of server errors (default = 0)
-s              search term ('database#alias#query#')
-spacer         location of a gif that acts as a space holder for the left column
-t              timeout (in seconds, defaults to 180)
-u              test URL (to test the proxy configuration)
-v              verbose output
-viewdays       number of days each document will be shown
-version        show version of program
Some command-line options can also be set in the configuration file. If both are set and they conflict, the command-line setting takes priority.
PubCrawler automates requests for user-specific searches in the PubMed and GenBank databases at NCBI ( http://www.ncbi.nlm.nih.gov/ ).
To test whether everything is set up correctly, run the program in check mode first. At your command prompt, enter the name of the script together with any command-line options that might be necessary, plus the -check option.
Mac users can do this either by setting the check variable in their configuration file to '1' or by entering -check at the command line (when the prompt variable in the configuration file is set to '1', the user is prompted for command-line options after the program starts).
Windows users can start the Perl executable in a DOS box, followed by the name of the script, followed by any necessary parameters, including the -check option.
The program will perform a quick check on all the settings and report any errors that it encountered.
Recommended when using the program for the first time!
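On a Unix-like system, for example, a check-mode run might look like this (the working directory shown is illustrative; use your own):

```
perl pubcrawler.pl -d $HOME/pubcrawler -check
```

The -d option is only needed if your configuration and database files do not live in one of the default locations.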
PubCrawler allows two forms of customization: command-line options and a configuration file.
The configuration file is divided into three parts: Mandatory Variables, Optional Variables, and Search Terms.
The value of any variable can be set by writing the variable name and its value, separated by a blank, on a single line in the configuration file. The value can consist of several words (e.g. a Mac directory name). Any leading or trailing quotes are stripped when the data is read in.
Each search specification has to be written on one line. The first word must specify the database (genbank or pubmed). Any following words enclosed in single quotes (') are used as an alias for this query; otherwise they are considered Entrez search terms, as is the rest of the line. Do not enclose anything but the alias in single quotes!
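Putting the two together, a configuration fragment might look like the following sketch (variable names are taken from the option list above, and all values and queries are illustrative, not defaults):

```
viewdays 7
prompt 0
pubmed 'Yeast genomics' Saccharomyces cerevisiae [ORGN] AND genome
genbank 'Polymerase' DNA polymerase
```

Here the first two lines set variables, and the last two lines each define one search: database first, then an optional quoted alias, then the Entrez query.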
A web tool is available for convenient generation of configuration files: visit the PubCrawler Configurator at http://pubcrawler.gen.tcd.ie/configurator.html
PubCrawler is most useful if it is started automatically, for example overnight. This ensures that your database is always up to date.
How to automate this depends on the operating system you are using. A few hints on setting up schedules follow:
The crontab program lets you maintain your crontab files. To run PubCrawler every night at 1:00 a.m., execute the command crontab -e
and insert a line like the following:
0 1 * * * $HOME/bin/PubCrawler.pl
assuming your executable resides in $HOME/bin/PubCrawler.pl. Any command-line options can be added as desired.
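For instance, to pass a working directory and suppress messages to STDERR (both options appear in the SYNOPSIS; the path is illustrative), the crontab entry could read:

```
0 1 * * * $HOME/bin/PubCrawler.pl -d $HOME/pubcrawler -mute
```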
For more information, consult the crontab manual page (man crontab).
Start the Task Scheduler by clicking on
Start->Programs->Accessories->System Tools->Scheduled Tasks
Follow the instructions to add or create a new task. Select the Perl executable (perl.exe), type in a name for the task, and choose the time and day of start. Open the advanced properties and edit the Run: line by adding the name of the script and any command-line options. It might look something like the following:
C:\perl\bin\perl.exe D:\pubcrawler\pubcrawler.pl -d D:\pubcrawler
You might also consider entering the working directory for PubCrawler in the Start in:-line.
By default PubCrawler looks for a configuration file named PubCrawler.config (or, more precisely, prefix.config if the prefix has been changed with the -pre command-line option).
The first part of the name (the prefix) defaults to the main part of the program name (the basename up to the last dot, normally PubCrawler).
Other prefixes can be specified via the command-line option -pre followed by the desired prefix.
A different name for the configuration file can be specified via the command-line option -c followed by the file name.
Using different prefix names allows multiple uses of the program with multiple output files.
The program first looks for a configuration file in the PubCrawler working directory, if one was specified via a command-line option. The second place it looks is the home directory, as given by the environment variable 'HOME'. The last place is the current working directory. If no configuration file can be found and not all of the Mandatory Variables are specified via command-line options (see SYNOPSIS), the program exits with an error message.
There are a number of variables for which PubCrawler expects values from the user. These are:
The values for these variables can be specified either in the PubCrawler configuration file or as command-line options (see SYNOPSIS).
For some variables an optional value can be set in the configuration file. These are:
CAUTION: Storing passwords in a file is a potential security risk! Use the command-line option '-ppass' instead!
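A proxy setup given entirely on the command line, using the proxy options from the SYNOPSIS with an illustrative host, port, and account, might look like:

```
pubcrawler.pl -p proxy.example.com -pp 8080 -pauth jsmith -ppass secret
```

Entered this way, the password lives only in the shell command rather than in a file (though it may still appear in your shell history).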
The definition of search terms is the same as in the Entrez search system. Please see their site for more information: http://www.ncbi.nlm.nih.gov/Entrez/entrezhelp.html#ComplexExpression
A search term entered at the command line (e.g.
pubcrawler.pl -s 'pubmed#Test-search#Gilbert [AUTH]') overrides all queries specified in the configuration file.
The HTTP requests are handled by Perl modules, namely LWP::Simple, LWP::UserAgent, and HTML::Parser. These are included in the latest distributions of Perl for Windows and MacPerl (see http://www.perl.com/pace/pub/perldocs/latest.html ). They are also freely available from the CPAN archive at http://www.perl.com/CPAN/ .
If you would like to run PubCrawler without these modules, you have to provide a command-line browser as an alternative (such as lynx, available at http://lynx.browser.org ).
To disable the use of the modules, comment out the three lines following #### ADDITIONAL MODULES #### at the beginning of the file by putting a '#' in front of each of them. You also have to comment out the line
@proxy_config = split /\n/, (get "$proxy_conf");
in the subroutine proxy_setting, somewhere near the middle of the program script (approx. line 2080).
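After editing, the relevant lines would look something like the following sketch (the exact lines in your copy of the script may differ slightly; the module names are those listed above):

```
#### ADDITIONAL MODULES ####
# use LWP::Simple;
# use LWP::UserAgent;
# use HTML::Parser;

# ... and in subroutine proxy_setting:
# @proxy_config = split /\n/, (get "$proxy_conf");
```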
Only queries to the same database are allowed to share an alias. If queries with the same alias but different databases are detected, their alias is changed to the query name, which means the results for such an entry appear in a separate section.
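For example, with the following two search specifications (the queries are illustrative), the alias 'Polymerase' is shared across two different databases, so it would be replaced by the query name and the results would be listed in separate sections:

```
pubmed 'Polymerase' DNA polymerase
genbank 'Polymerase' DNA polymerase
```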
Valid database specifiers are:
Please look at the sample configuration file for a variety of queries.
If you have difficulty installing the program, or if it doesn't produce the expected results, please read this documentation carefully and also check out PubCrawler's web page at http://www.pubcrawler.ie
If none of this advice helps, send an e-mail with a detailed description of the problem to pubcrawlerREMOVECAPShelp@gmail.com
Original author: Dr. Ken H. Wolfe, firstname.lastname@example.org
Second author: Dr. Karsten Hokamp, email@example.com
Both from the Department of Genetics, Trinity College, Dublin
If you have problems, corrections, or questions, please see REPORTING PROBLEMS above.
This is free software, and you are welcome to redistribute it under conditions of the GNU General Public License (see file 'COPY' or http://www.gnu.org/copyleft/gpl.html for more information).
This program has a home page at http://www.pubcrawler.ie/ .
$Id: pubcrawler.pl,v 1.80 2002/11/26 02:31:36 pubcrawl Exp pubcrawl $