Step 2 for Unix PubCrawler:
Customizing your searches.

Please report problems to

Customizing your searches

When you have completed Step ALPHA (downloading and installing PubCrawler) and the test searches have run correctly, you can customize PubCrawler to run your own searches as follows.

1.  The search queries are written using a syntax developed by NCBI.  It's quite easy to learn, but you should look carefully at the examples below. These examples specified the test searches that you just ran:
Each example query is followed by an explanation of what it means:
pubmed 'Candida' candida[ALL] BUTNOT candida albicans[ALL]
This query searches PubMed for all articles about Candida, but not those that contain the phrase Candida albicans.  The alias for this search is 'Candida'.
pubmed 'Rivals' Dujon B[AUTH]
pubmed 'Rivals' Oliver SG[AUTH]
pubmed 'Rivals' Philippsen P[AUTH]
These 3 queries search PubMed for articles by three different authors. The results are merged into a single alias called 'Rivals'.
pubmed 'Yeast' yeast[ALL] AND (DNA[ALL] OR gene[ALL])
pubmed 'Yeast' (Mol Cell Biol[JOUR] OR Curr Genet[JOUR]) AND yeast[ALL]
These 2 queries search PubMed for papers about yeast.  One looks for papers about yeast genes or yeast DNA.  The other looks for papers about yeast published in the journals Mol. Cell. Biol. or Curr. Genet.  The results are merged into a single alias called 'Yeast'.
genbank 50000:350000[SLEN] AND Venter JC[AUTH] AND human[ORGN]
This query searches GenBank for all human sequences 50 - 350 kb long, with JC Venter as author. No alias was defined for this search.
genbank 'Acyl-CoA synth*' acyl-coa synth*[ALL] BUTNOT est[PROP]
This query searches GenBank for sequence entries containing the text "Acyl-CoA synth..." but excluding EST sequences.  The asterisk is a wild-card so the query will match words such as "synthesis" or "synthase".  The alias for this search is 'Acyl-CoA synth*'.
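The merging of searches that share an alias (as with 'Rivals' and 'Yeast' above) can be pictured with a small Python sketch. This is only an illustration of the idea, not PubCrawler's own code; the function name merge_hits_by_alias and the record IDs are made up for the example.

```python
def merge_hits_by_alias(search_results):
    """Merge hits from searches that share an alias, keeping each record once.

    search_results: list of (alias, [record_id, ...]) pairs, one per search.
    Returns a dict mapping alias -> unique record IDs in first-seen order.
    """
    merged = {}
    for alias, hits in search_results:
        seen = merged.setdefault(alias, {})
        for record_id in hits:
            # Dict keys keep insertion order and ignore repeats.
            seen.setdefault(record_id, None)
    return {alias: list(ids) for alias, ids in merged.items()}


# Made-up record IDs, for illustration only.
results = [
    ("Yeast", ["1001", "1002", "1003"]),   # first 'Yeast' search
    ("Yeast", ["1003", "1004"]),           # second 'Yeast' search, overlaps on 1003
    ("Rivals", ["2001"]),
]
merged = merge_hits_by_alias(results)
print(merged["Yeast"])  # ['1001', '1002', '1003', '1004']
```

Record 1003 is hit by both 'Yeast' searches but appears only once in the merged list.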

How to write a search query
There are 3 parts to each query in PubCrawler:

database name: pubmed or genbank.
    This is mandatory.

alias: a short name for the search, such as 'Candida' or 'Rivals'.
    This must be written inside single quotes (').  It is not mandatory to supply an alias name (e.g., the first genbank search in the example has no alias name).
    The same alias name can be used for multiple searches at the same database (e.g., the alias 'Yeast' in the example), in which case the "hits" from each search are merged together in the results.  This is useful because you may need to write several searches to try to cover a scientific topic completely, and these searches may partly overlap.  It does not matter if the same database entry is hit by two searches that are part of the same alias, because when the hits are merged the record will only be shown once in the final output.

search description: The full syntax for this is set out in the NCBI Boolean Query Syntax document.  The most important things are:

  • Use AND, OR, BUTNOT (and parentheses () as necessary) to combine search elements,
    e.g.: (Mol Cell Biol[JOUR] OR Curr Genet[JOUR]) AND yeast[ALL]
  • Each element is in the form text[FIELD]
  • An asterisk (*) can be used as a wildcard at the end of text words to cover all possible endings
  • The main FIELDs are:
    [ALL]  all text (for PubMed, this means title, abstract, and index terms)
    [TITL] title (for PubMed) or the Description line (for GenBank)
    [AUTH] author name (e.g., Smith BJ)
    [JOUR] journal name (standard abbreviation, or you can spell out the full name)
    [ORGN] organism (for GenBank only)
    [PROP] properties (for GenBank only; useful for getting rid of ESTs)
    [SLEN] sequence length (for GenBank only)
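To make the three-part structure concrete, here is a rough Python sketch that splits a search line into database name, optional alias, and search description, and flags GenBank-only fields used in a PubMed query. It is an assumption-laden illustration, not how PubCrawler itself reads its config file; the names LINE_PATTERN, parse_search_line and misplaced_fields are hypothetical.

```python
import re

# Assumed layout: database name, optional 'alias' in single quotes, then the query.
LINE_PATTERN = re.compile(r"^\s*(pubmed|genbank)\s+(?:'([^']*)'\s+)?(.+?)\s*$")

# Fields described above as "for GenBank only".
GENBANK_ONLY_FIELDS = {"ORGN", "PROP", "SLEN"}


def parse_search_line(line):
    """Return (database, alias, query); alias is None when none was given."""
    match = LINE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a search line: {line!r}")
    return match.groups()


def misplaced_fields(database, query):
    """Return the GenBank-only field tags used in a PubMed query, sorted."""
    fields = set(re.findall(r"\[([A-Z]+)\]", query))
    if database == "pubmed":
        return sorted(fields & GENBANK_ONLY_FIELDS)
    return []


print(parse_search_line("pubmed 'Rivals' Dujon B[AUTH]"))
# ('pubmed', 'Rivals', 'Dujon B[AUTH]')
print(parse_search_line("genbank 50000:350000[SLEN] AND Venter JC[AUTH] AND human[ORGN]"))
# ('genbank', None, '50000:350000[SLEN] AND Venter JC[AUTH] AND human[ORGN]')
```

The second example has no alias, so the middle value comes back as None.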

    2.  Try composing one search query of your own.  Before putting it into PubCrawler, test it directly at NCBI as follows:

  • Compose one query, such as:  gene[ALL] AND map[ALL] BUTNOT map kinase[ALL] (for PubMed) or chloroplast[ALL] BUTNOT rbcL[ALL] (for GenBank).
  • Paste your query into the NCBI Advanced Medline query box (for PubMed) or the NCBI Entrez Nucleotide query box (for GenBank) as appropriate.
  • If you get an error message, there is something wrong with your query syntax and you need to fix it. Try looking at the NCBI's Boolean Query Syntax description document.
  • Note: If you specify a very vague search description (such as "cancer[ALL]"), you will get zillions of hits each day and the results will be unmanageable.  But if your searches are very specific you might see no new hits some days. When you're testing your searches you can get an idea of how many hits to expect per day by changing the "Entrez Date limit" box on the results page to "30 days", clicking on "Retrieve", and dividing the number of hits by 30.
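The hits-per-day estimate from the note above is simple arithmetic. As a tiny Python sketch (the function name and the figure of 450 hits are made up for illustration):

```python
def estimated_hits_per_day(hits_in_window, window_days=30):
    """Average number of hits per day over the date-limit window."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return hits_in_window / window_days


# Example: a query that returns 450 hits with the date limit set to 30 days.
print(estimated_hits_per_day(450))  # 15.0
```

This is only an average, so the number of new hits on any given day may be higher or lower.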

    3.  If it works, you can now add the search description to your PubCrawler config file.  This file is normally called pubcrawler.config and is located in the PubCrawler folder.  The config file contains your search strategies and your preferences about how you want the results to be presented.

    4.  Open the file pubcrawler.config in a text editor (such as vi or emacs), or in a word processor in text-only mode, and scroll to the bottom of the file (you must go below the line reading "##### SEARCH SPECIFICATION ######").  In this file, a hash mark (#) indicates a comment, so any text to the right of a hash mark is ignored (up to the end of that line).
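The comment rule and the position of the search section can be sketched in Python as follows. This is a simplified illustration under the assumptions stated in the comments, not PubCrawler's actual config reader; the names SEARCH_MARKER, strip_comment and search_lines are hypothetical, and the sample text is invented.

```python
SEARCH_MARKER = "SEARCH SPECIFICATION"


def strip_comment(line):
    """Drop everything from the first hash mark to the end of the line."""
    return line.split("#", 1)[0].strip()


def search_lines(config_text):
    """Return the non-comment, non-empty lines found below the marker line."""
    in_section = False
    result = []
    for line in config_text.splitlines():
        if not in_section:
            # Assumption: the section starts right after the marker line.
            if SEARCH_MARKER in line:
                in_section = True
            continue
        content = strip_comment(line)
        if content:
            result.append(content)
    return result


# Invented sample text standing in for the end of a config file.
sample = "\n".join([
    "# general preferences would appear up here",
    "##### SEARCH SPECIFICATION ######",
    "# pubmed 'Old' old[ALL]",
    "pubmed 'gene maps' gene[ALL] AND map[ALL] BUTNOT map kinase[ALL]  # my first query",
    "",
])
print(search_lines(sample))
# ["pubmed 'gene maps' gene[ALL] AND map[ALL] BUTNOT map kinase[ALL]"]
```

The line that starts with a hash mark is skipped entirely, and the trailing comment on the query line is removed.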

    5.  Paste the query you just composed into the bottom of the pubcrawler.config file.  On the same line, in front of your search query, write the database name and an alias for your search.  For example:

    pubmed 'gene maps' gene[ALL] AND map[ALL] BUTNOT map kinase[ALL]
    genbank 'my first attempt at a search' chloroplast[ORGN] BUTNOT rbcL[ALL]

    6.  Save the pubcrawler.config file (as a text-only or ASCII file).

    7.  Run PubCrawler with the -v option (for verbose output) by typing   pubcrawler -v   into your (x-)terminal, and your new search should be included in the results.

    8.  Add more searches as you wish (go back to point 2, above).  Try making some that share an alias name.

    9.  Finally, you can delete the example searches in pubcrawler.config, unless you're very interested in yeast or Craig Venter.  Or you can put hash marks (#) in front of them to turn them into comments (so the program will ignore them).
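If you prefer to comment the example searches out rather than delete them, the edit amounts to putting a hash mark in front of each line. A minimal Python sketch of that idea follows; the function name comment_out is hypothetical, it matches lines by their quoted alias, and it therefore would not catch a search that has no alias (such as the first genbank example).

```python
def comment_out(config_text, aliases):
    """Put '# ' in front of every line that uses one of the given aliases."""
    out = []
    for line in config_text.splitlines():
        stripped = line.lstrip()
        uses_alias = any(f"'{alias}'" in stripped for alias in aliases)
        # Leave lines alone if they are already comments.
        if uses_alias and not stripped.startswith("#"):
            line = "# " + line
        out.append(line)
    return "\n".join(out)


text = "\n".join([
    "pubmed 'Yeast' yeast[ALL] AND (DNA[ALL] OR gene[ALL])",
    "pubmed 'gene maps' gene[ALL] AND map[ALL] BUTNOT map kinase[ALL]",
])
print(comment_out(text, ["Yeast"]))
```

Only the 'Yeast' line gains a leading hash mark; the 'gene maps' search is left active.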

    You are now ready to go on to Step GAMMA.

    Last modified at $Date: 1999/05/20 16:47:41 $