PubCrawler - POD File

(extracted from the POD-section [Plain Old Documentation] at the end of the PubCrawler Perl-script)

NAME

PubCrawler - Automated Retrieval of PubMed and GenBank Reports


SYNOPSIS

     usage: pubcrawler.pl [-add_path] [-base_URL <URL>] [-c <config_file>]
       [-check] [-copyright] [-d <directory>] [-db <database>]
       [-extra_range <range for extra entries>] [-force_mail] 
       [-format <results format>] [-fullmax <max-docs in full>]
       [-getmax <max-docs to get>] [-h] [-help] [-head <output-header>]
       [-i] [-indent <pixels>] [-l <log_file>] 
       [-lynx <alternative-browser>] [-mail <address for results>]
       [-mail_ascii <address for text-only results>]
       [-mail_simple <address for slim HTML results>] [-mute]
       [-n <neighbour_URL>] [-notify <address for notification>]
       [-no_decap] [-no_test] [-os <operating_system>]
       [-out <output-file>] [-p <proxy_server>] [-pp <proxy_port>]
       [-pauth <proxy_authorization>] [-ppass <proxy_password>]
       [-pre <prefix>] [-q <query_URL>] [-r <retrieve_URL>] 
       [-relentrezdate <relative-entrez-date>] [-retry <number of retries>]
       [-s <search-term>] [-spacer <gif>] [-t <timeout>] [-u <test_URL>]
       [-v <verbose>] [-viewdays <view-days>] [-version]
    options:
    -add_path adds the path /cwd/lib to @INC (list of library directories)
              where cwd stands for the current working directory
    -base_URL specify a URL that will be used for the 'Back To Top' link;
              the name of the output file will be appended to it
              (specify 'local_file' to go to anchors in the local file)
    -c       configuration file for pubcrawler
    -check   checks if the program and additional files are set up correctly
    -copyright shows copyright information
    -d       pubcrawler working directory (config, database, and output)
    -db      name of database file
    -extra_range specifies the number of documents combined in a link;
                  minimum value is 1, defaults to 'fullmax'
    -fullmax maximum number of full length reports shown (per search)
    -getmax  maximum number of documents to retrieve (per search)
    -h       this help message
    -head    HTML-header for output file
    -help    same as -h
    -i       include configuration file in HTML-output
    -indent  indent PubCrawler comments by n pixels (default 125)
    -l       name of file for log-information
    -lynx    command for alternative browser
    -mail    e-mail address to send results to
             (optionally append '#' or '@@'  and user name)
    -mail_ascii  e-mail address to send text-only results to
             (optionally append '#' or '@@'  and user name)
    -mail_simple  e-mail address to send slim HTML results to
             (optionally append '#' or '@@' and user name)
    -mute    suppresses messages to STDERR
    -n       URL where neighbourhood searches are directed to
    -notify  e-mail address for notification
             (optionally append '#' or '@@' and user name)
    -no_decap prevents results from NCBI from being processed
             (chopping off head and tail and collecting UIs)
    -no_test skips the proxy-test
    -os      operating system (some badly configured versions of Perl need  
             this to be set explicitly -> 'MacOS', 'Win', and 'Unix')
    -out     name of file for HTML-output
    -p       proxy
    -pp      proxy port
    -pauth   proxy authorization (user name)
    -ppass   proxy password
    -pre     prefix used for default file names (config-file,database,log)
    -q       URL where normal (not neighbourhood) searches are directed to
    -r       specify a URL that will be used for retrieving
             new hits from NCBI
    -relentrezdate maximum age (relative date of publication in days) 
             of a document to be retrieved
             other valid entries: '1 year','2 years','5 years','10 years','no limit'
    -retry   specify how often PubCrawler should retry queries in case of
             server errors (default = 0)
    -s       search-term ('database#alias#query#')
    -spacer  location of gif that acts as space holder for left column
    -t       timeout (in seconds, defaults to 180)
    -u       test-URL (to test proxy configuration)
    -v       verbose output
    -viewdays number of days each document will be shown
    -version show version of program

Some command-line options can also be set in the configuration file. If both are set and they conflict, the command-line setting takes priority.


DESCRIPTION

PubCrawler automates requests for user-specific searches in the PubMed- and the GenBank-database at NCBI at http://www.ncbi.nlm.nih.gov/ .


USAGE

Testing

To test whether everything is set up correctly, run the program in check mode first. At your command prompt, enter the name of the script together with any command-line options that might be necessary, plus the -check option.

Mac users can do this either by setting the check-variable in their configuration file to '1', or by entering -check at the command line (when the prompt-variable in the configuration file is set to '1', the user will be prompted for command-line options after the start of the program).

Windows users can start the Perl executable in a DOS-box followed by the name of the script, followed by any necessary parameters, including the -check option.

The program will perform a quick check on all settings and report any errors it encounters.

Recommended when using the program for the first time!

Customization

PubCrawler allows two forms of customization:

command-line options
For a temporary change of parameters the command-line options as listed in SYNOPSIS can be used.

configuration file
Permanent customization of settings can be achieved by editing the PubCrawler configuration file.

The configuration file is divided into three parts: Mandatory Variables, Optional Variables, and Search Terms.

The value of any variable can be set by writing the variable name and its value, separated by a blank, on a single line in the configuration file. The value can consist of several words (e.g. a Mac directory name). Any leading or trailing quotes are stripped when the data is read in.
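For example, a configuration file might contain lines like the following (the values are illustrative only; see Mandatory Variables and Optional Variables below for the available names). The quotes around the last value would be stripped on reading:

    work_dir  /home/user/pubcrawler
    viewdays  30
    header    "My PubCrawler Header.html"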

Each search specification has to be written on one line. The first word must specify the database (genbank or pubmed). Any following words enclosed in single quotes (') will be used as an alias for this query; otherwise they, and the rest of the line, are treated as Entrez search terms. Do not enclose anything but the alias in single quotes!
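For instance, the following hypothetical lines define one PubMed query with the alias 'Yeast' and one GenBank query without an alias (the search terms themselves are illustrative):

    pubmed 'Yeast' Saccharomyces cerevisiae [ORGN] AND genome
    genbank Wolfe KH [AUTH]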

A Web tool is available for convenient generation of configuration files. Visit the PubCrawler Configurator at http://pubcrawler.gen.tcd.ie/configurator.html

Automated Execution

PubCrawler is most useful if you have it started automatically, for example overnight. This ensures that your database is always up to date.

How to automate execution depends on the operating system you are using. The following sections give a few hints on how to set up schedules:

UNIX/LINUX
Unix and Unix-like operating systems normally have the cron-daemon running, which is responsible for the execution of scheduled tasks.

The program crontab lets you maintain your crontab files. To run PubCrawler every night at 1:00 a.m. execute the following command:

    crontab -e

and insert a line like the following:

    0 1 * * * $HOME/bin/PubCrawler.pl

presuming your executable resides at $HOME/bin/PubCrawler.pl. Any command-line options can be added as desired.
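For example, a crontab entry that also sets the working directory and suppresses messages to STDERR might look like this (the paths are illustrative):

    0 1 * * * $HOME/bin/PubCrawler.pl -d $HOME/pubcrawler -mute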

For more information about crontab, type man crontab.

WINDOWS 95/98
Windows 98 and the Plus!-pack for Windows 95 come with the ability to schedule tasks. To set up a schedule for PubCrawler, proceed as follows:

Start the Task Scheduler by clicking on

    Start->Programs->Accessories->System Tools->Scheduled Tasks

Follow the instructions to add or create a new task. Select the Perl-executable (perl.exe), type in a name for the task and choose time and day of start. Open the advanced properties and edit the Run:-line by adding the name of the script and any command line options. It might look something like the following:

    Run: C:\perl\bin\perl.exe D:\pubcrawler\PubCrawler.pl -d D:\pubcrawler

You might also consider entering the working directory for PubCrawler in the Start in:-line.

MACINTOSH
Unfortunately the Mac operating system does not have a built-in scheduler, but a shareware program called Cron for Macintosh is available. It costs $10 and has a web site at http://gargravarr.cc.utexas.edu/cron/cron.html


MISCELLANEOUS

Location of the Configuration File

By default PubCrawler looks for a configuration file named PubCrawler.config, or more precisely prefix.config, where the prefix defaults to the main part of the program name (the basename up to the last dot, normally PubCrawler). An alternative prefix can be specified via the command-line option -pre, and an entirely different configuration file name via the command-line option -c. Using different prefixes allows multiple uses of the program with multiple output files.

The program first looks for a configuration file in the PubCrawler working directory, if one was specified via command-line option. The second place it looks is the home directory as set in the environment variable 'HOME'. The last place is the current working directory. If no configuration file can be found and not all of the Mandatory Variables are specified via command-line options (see SYNOPSIS), the program exits with an error message.

Mandatory Variables

There are a number of variables for which PubCrawler expects values given by the user. These are:

html_file
name of the output file

viewdays
number of days each document will be shown

relentrezdate
maximum age (in days) of database entries to be reported

getmax
maximum number of documents to be retrieved for each search carried out

fullmax
the maximum number of documents for which a full report is being presented

include_config
whether or not to append config-file to output

search_URL
URL where standard searches are being carried out (for neighbourhood searches see next section)

retrieve_URL
URL from which documents are being retrieved

The values for these variables can be specified either in the PubCrawler configuration file or as command-line options (see SYNOPSIS).
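Put together, the mandatory part of a configuration file might look like the following sketch. All values, including the URL placeholders, are illustrative; the Web-based PubCrawler Configurator mentioned above can generate correct ones:

    html_file      pubcrawler.html
    viewdays       30
    relentrezdate  60
    getmax         100
    fullmax        20
    include_config 0
    search_URL     <URL of the Entrez search script>
    retrieve_URL   <URL of the Entrez retrieval script>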

Optional Variables

For some variables an optional value can be set in the configuration file. These are:

work_dir
working directory for PubCrawler

extra_range
The reports that exceed the number of fully presented articles (variable 'fullmax') are incorporated into the results page as links. The variable extra_range specifies how many documents are assigned to each link. The minimum value is 1, and it defaults to the value of 'fullmax'.

check
PubCrawler runs in check-mode if set to '1' (see Testing)

prompt
PubCrawler prompts Mac-users for command-line options if set to '1'

verbose
PubCrawler prints log messages to screen if set to '1'

lynx
PubCrawler uses the specified command to invoke a command-line browser for HTTP requests (see also Perl-Modules and Alternative Choice)

header
location of header (in HTML-format) that will be used for the output file

base_URL
The specified URL will be used for the 'Back To Top' link in the HTML output. The name of the output file will be appended to it. This is particularly useful if the results are viewed on remote hosts over the internet. Specify 'local_file' to go to anchors in the local file.

mail
You can specify an e-mail address to which the results are being sent after each PubCrawler run. This feature needs the program 'metasend' to work.

mail_ascii
You can specify an e-mail address to which the results in text-only format are sent after each PubCrawler run. This feature needs the programs 'metasend' and 'lynx' to work.

mail_simple
You can specify an e-mail address to which the results in slim HTML format are sent after each PubCrawler run. This feature needs the programs 'metasend' and 'lynx' to work.

notify
You can specify an e-mail address to which a notification will be sent after each PubCrawler run. It contains the number of new hits and can be personalized by appending '#' and a name to the address. This feature needs the program 'metasend' to work.

neighbour_URL
This is the URL from which UIs for neighbourhood searches (related articles/sequences) are retrieved

prefix
alternative prefix for standard files (configuration, database, log)

system
explicit assignment of operating system ('MacOS','Win','Unix', or 'Linux')

proxy
proxy server for internet connection

proxy_port
port of the proxy server, defaults to 80

proxy_auth
user name for proxy authorization

proxy_pass
password for proxy authorization

CAUTION: Storing passwords in a file represents a possible security risk! Use command line option ('-ppass' flag) instead!

spacer
Specify the location of a gif that acts as a space holder for the left column of the results (defaults to http://pubcrawler.gen.tcd.ie/pics/spacer.gif )

time_out
time in seconds to wait for internet responses, defaults to 180

test_URL
URL for test of proxy-configuration

no_test
disables test of proxy-configuration

no_decap
disables processing of Entrez results (chopping off head and tail and collecting UIs)

indent
indents PubCrawler comments by n pixels (default 125) to align them with Entrez output

format
format for reports from PubMed and GenBank. Available options: 'DocSum', 'Brief', 'Abstract', 'Citation', 'MEDLINE', 'ASN.1', 'ExternalLink', 'GenBank', 'FASTA' (make sure to use one compatible with the databases you are querying!)

Search Terms

The definition of the search terms is the same as in the Entrez search system. Please see their site for more information ( http://www.ncbi.nlm.nih.gov/Entrez/entrezhelp.html#ComplexExpression )

A search term entered at the command line (e.g. pubcrawler.pl -s 'pubmed#Test-search#Gilbert [AUTH]') overrides all other queries specified in the configuration file.

Perl-Modules and Alternative Choice

The HTTP-requests are handled by Perl-modules, namely LWP::Simple, LWP::UserAgent, and HTML::Parser. These are included in the latest distributions of Perl for Windows and MacPerl (check out http://www.perl.com/pace/pub/perldocs/latest.html ). They are also freely available from the CPAN-archive at http://www.perl.com/CPAN/.

If you would like to run PubCrawler without these modules, you have to provide a command-line browser as an alternative (such as lynx, available at http://lynx.browser.org ).

To disable the use of the modules comment out the three lines following #### ADDITIONAL MODULES #### at the beginning of the file by putting a '#' in front of them. You also have to comment out the line

    @proxy_config = split /\n/, (get "$proxy_conf");

in the subroutine proxy_setting, somewhere near the middle of the program script (approx. line 2080).

Setup of Queries

Only queries to the same database are allowed to share an alias. If queries with the same alias but different databases are detected, their alias is changed to the query name, which means the results for those entries appear in separate sections.
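For example, given the following two hypothetical queries, the shared alias 'Evolution' would be replaced with the respective query names, and each set of results would appear in its own section:

    pubmed 'Evolution' molecular evolution
    genbank 'Evolution' molecular evolution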

Valid database specifiers are:

pubmed
searches PubMed

genbank
searches GenBank

pm_neighbour
searches PubMed for articles related to a paper specified by its PMID

gb_neighbour
Currently disabled! (search GenBank for sequences related to an entry specified by GI)

Please look at the sample configuration file for a variety of queries.


REPORTING PROBLEMS

If you have difficulty installing the program or if it does not produce the expected results, please read this documentation carefully and also check PubCrawler's web page at http://www.pubcrawler.ie

If none of this advice helps, send an e-mail with a detailed description of the problem to pubcrawlerREMOVECAPShelp@gmail.com


AUTHOR

Original author: Dr. Ken H. Wolfe, khwolfe@tcd.ie

Second author: Dr. Karsten Hokamp, karsten@oscar.gen.tcd.ie

Both from the Department of Genetics, Trinity College, Dublin

If you have problems, corrections, or questions, please see REPORTING PROBLEMS above.


REDISTRIBUTION

This is free software, and you are welcome to redistribute it under the conditions of the GNU General Public License (see the file 'COPY' or http://www.gnu.org/copyleft/gpl.html for more information).


HOME PAGE

This program has a home page at http://www.pubcrawler.ie/ .


LAST MODIFIED

$Id: pubcrawler.pl,v 1.80 2002/11/26 02:31:36 pubcrawl Exp pubcrawl $