
How the Web Version of the Species Database works

Some people have asked how the database manages to be so fast, so here's how it works.

The original database is in Microsoft Access format, but I haven't used this for the web version, for a number of reasons: 1) It's Microsoft; 2) Linking a database to the web usually requires some "backend" software which runs all the time and handles the web requests as they come in, and this can place a big demand on the service provider; 3) Without it, it's easier to get the look and feel I want.

So how is it done? Basically the web version consists of a number of text files plus a few C programs. Each time a request comes in, the appropriate C program starts up. It reads in the text file, finds matching records and prints the results. Programs like this can be made pretty quick and don't put a big load on the server. Just using text files also makes the whole thing smaller.
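As a rough illustration of that read, match and print cycle, here is a minimal sketch. It is not the actual PFAF code: the name print_matches, the plain substring test and the 1024-byte line limit are all assumptions, and a real CGI program would also take the search term from the request and print the HTTP headers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy every line of `in` that contains `needle`
 * to `out` and return how many lines matched.
 * Assumes no line is longer than 1023 characters; longer lines would be
 * read in pieces and each piece tested separately. */
int print_matches(FILE *in, const char *needle, FILE *out)
{
    char line[1024];
    int matches = 0;

    assert(in != NULL && needle != NULL && out != NULL);

    while (fgets(line, sizeof line, in) != NULL) {
        if (strstr(line, needle) != NULL) {
            fputs(line, out);   /* print the matching record as-is */
            matches++;
        }
    }
    return matches;
}
```

Called with the names file, a search term and stdout, this prints the matching records and exits, so nothing has to stay running between requests.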

There are actually four different C programs. One, find_lat, just looks for Latin and common names which match. There's a separate file which contains this data. This file is quite small, so such a search can be carried out fairly quickly.

Another, find_use, searches for uses, habitats, distributions and growing properties. The uses, habitats and distributions are each stored in separate files, and again can be searched quickly. Most of the data about each plant is stored in 26 files, one for each letter of the alphabet; searching through these takes a bit of time.

find_gen is the slowest: it needs to go through the 26 alphabetical files searching for particular words. As there's about 17MB of data, this can take a little while.

The data-sheet for each plant is printed by arr_html, which takes the exact Latin name of the plant. It opens the file which starts with the first letter of the name and prints things out. It also reads in a number of other files which contain details of cultivars, scented plants, and references to other pages on the web.

All the programs have been written to be as fast as possible; for example, I create a 2MB buffer and read a whole file in at one go. Quite a bit of hand optimisation has also gone into speeding up the crucial lines. Most of the programs are fairly straightforward, apart from a bit of fun with and-ing and or-ing linked lists. Most of the programs are less than 2000 lines inclusive, and I've used the NCSA routines to clean up the URLs to guard against hackers.
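For the curious, the and-ing and or-ing of linked lists might look something like the sketch below. This is an illustration under assumptions rather than the real code: it supposes each search produces a linked list of record numbers in ascending order, and the names Node, list_from, list_and and list_or are made up for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed representation: one node per matching record, ids ascending. */
typedef struct Node {
    int id;
    struct Node *next;
} Node;

/* Append a new node at *tail and return the address of its next pointer. */
static Node **append(Node **tail, int id)
{
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        abort();
    n->id = id;
    n->next = NULL;
    *tail = n;
    return &n->next;
}

/* Build a list from an array that is already sorted in ascending order. */
Node *list_from(const int *ids, size_t n)
{
    Node *head = NULL, **tail = &head;
    assert(ids != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        tail = append(tail, ids[i]);
    return head;
}

/* AND: ids present in both lists (one pass over two sorted lists). */
Node *list_and(const Node *a, const Node *b)
{
    Node *head = NULL, **tail = &head;
    while (a != NULL && b != NULL) {
        if (a->id < b->id) {
            a = a->next;
        } else if (b->id < a->id) {
            b = b->next;
        } else {
            tail = append(tail, a->id);
            a = a->next;
            b = b->next;
        }
    }
    return head;
}

/* OR: ids present in either list, without duplicates. */
Node *list_or(const Node *a, const Node *b)
{
    Node *head = NULL, **tail = &head;
    while (a != NULL || b != NULL) {
        if (b == NULL || (a != NULL && a->id < b->id)) {
            tail = append(tail, a->id);
            a = a->next;
        } else if (a == NULL || b->id < a->id) {
            tail = append(tail, b->id);
            b = b->next;
        } else {                      /* same id in both lists */
            tail = append(tail, a->id);
            a = a->next;
            b = b->next;
        }
    }
    return head;
}

/* Release every node of a list. */
void list_free(Node *n)
{
    while (n != NULL) {
        Node *next = n->next;
        free(n);
        n = next;
    }
}
```

Because both inputs are sorted, each operation is a single pass over the two lists, which suits a program that has to start up, answer and exit on every request.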

Most of the data-files are in a pipe-delimited format:

	Abelia triflora||Caprifoliaceae|Zabelia triflora. (Wallich.)Makino.
	Abelmoschus esculentus|Okra|Malvaceae|Hibiscus esculentus. L.
	Abelmoschus manihot|Aibika|Malvaceae|Hibiscus manihot.
	Abelmoschus moschatus|Musk mallow|Malvaceae|Hibiscus abelmoschus.
	Abies alba|Silver fir|Pinaceae|A. pectinata.  A. picea.

apart from the habitat one, which is a little more compressed. In general each file is in alphabetical order, which helps to speed things up a bit. A slight complication is caused by the presence of newlines in some of the text descriptions; keeping track of the number of fields read in copes with this.
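The field-counting trick can be sketched as follows. This is a guess at the idea, not the real parser: it assumes every record has a fixed, known number of fields and that the last field never contains a newline. The name record_end is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Find the end of the record that starts at buf[pos].
 * A newline only ends the record once nfields - 1 pipe characters have
 * been seen; any earlier newline is treated as part of a field.
 * Returns the index of the terminating newline, or len if the buffer
 * runs out first. */
size_t record_end(const char *buf, size_t len, size_t pos, int nfields)
{
    int pipes = 0;
    size_t i;

    assert(buf != NULL && nfields > 0);

    for (i = pos; i < len; i++) {
        if (buf[i] == '|')
            pipes++;
        else if (buf[i] == '\n' && pipes >= nfields - 1)
            break;
    }
    return i;
}
```

With four fields per record, a description that wraps onto a second line is still read as one record, because its newline arrives before the third pipe has been counted.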

The original source of the information about native ranges is a text description, which is hard to search through. I've compiled a list of about 600 words and phrases which indicate particular countries or regions. When updating the database, each of these words is searched for in the records for each plant, and an index of the regions the plant can be found in is built up. It takes about half an hour to do this.
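A minimal sketch of that indexing step is given below. The table layout, the names RegionPhrase and index_regions, and the case-sensitive substring match are all assumptions made for illustration; the real list of about 600 phrases is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry of the (hypothetical) phrase table: a phrase that may appear
 * in a range description, and the region it indicates. */
typedef struct {
    const char *phrase;
    const char *region;
} RegionPhrase;

/* Look for every phrase of the table in `range_text` and store the regions
 * found in `out` (at most `max` of them, each region only once).
 * Returns the number of regions stored. */
size_t index_regions(const char *range_text,
                     const RegionPhrase *table, size_t ntable,
                     const char **out, size_t max)
{
    size_t found = 0;

    assert(range_text != NULL && table != NULL && out != NULL);

    for (size_t i = 0; i < ntable && found < max; i++) {
        if (strstr(range_text, table[i].phrase) == NULL)
            continue;
        /* Skip regions already recorded through another phrase. */
        size_t j = 0;
        while (j < found && strcmp(out[j], table[i].region) != 0)
            j++;
        if (j == found)
            out[found++] = table[i].region;
    }
    return found;
}
```

Running every phrase against every plant's description is a lot of string searching, which fits with the update taking a while.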

Compiling the links to other information and pictures on the web was a bit of fun. From each of the sources mentioned in the links page I've either downloaded the data-files, or crawled through their web pages (plain text data-files are nice). A few Perl scripts convert this into a format Access can handle. I then take all the Latin names and synonyms and cross reference them with the PFAF ones. Quite often there are slight changes in the spelling of the Latin names; in particular the ending can change according to the gender in Latin. So I've found those names where there's a slight change in spelling (1 extra character etc., Levenshtein distance 1 I think). This still takes a bit of hand editing to find the allowable differences in spelling. Once this is done there are a couple more cross references, so that I can get plants whose synonyms match one of ours, and we're done. All the sources are stored in one big file with details of how to construct the appropriate URLs.
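A check for names that differ by at most one character (Levenshtein distance 0 or 1) can be written without the full distance calculation. The sketch below is in C to match the search programs, although the matching described above was done with Perl scripts and Access; the function name within_one_edit is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if `a` and `b` are identical or differ by exactly one inserted,
 * deleted or substituted character, otherwise 0. */
int within_one_edit(const char *a, const char *b)
{
    size_t la, lb, i = 0;

    assert(a != NULL && b != NULL);
    la = strlen(a);
    lb = strlen(b);

    if (la > lb) {                 /* make `a` the shorter string */
        const char *ts = a; a = b; b = ts;
        size_t tl = la; la = lb; lb = tl;
    }
    if (lb - la > 1)
        return 0;

    while (i < la && a[i] == b[i]) /* skip the common prefix */
        i++;

    if (la == lb) {
        if (i == la)
            return 1;              /* identical names */
        /* one substitution: everything after position i must match */
        return strcmp(a + i + 1, b + i + 1) == 0;
    }
    /* one extra character in `b`: skip it and compare the rest */
    return strcmp(a + i, b + i + 1) == 0;
}
```

For instance, "esculentus" and "esculentum" pass the check, while "alba" and "albus" are two edits apart and do not, which is why some hand editing of the allowable differences is still needed.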

Hum, that's about it. It's taken just over a year (from Oct '97) to get to its current state. There are still a few other things I'd like to do:

Create a mechanism by which people can add data to the database.
Make a combined search where you can search for a Latin name, a use and a word all at the same time.
Add in regular expressions.
Cross reference with more web sources.
Figure out a way for people to get the database on their home system without having to set up a web server (or find a public domain server which is easy to set up).

If you think any of these would be good, let us know. Also, if you want any of the code or data-files, I'll be happy to oblige.

If you want to find out all about forms and CGI scripts, then An Instantaneous Introduction to CGI Scripts and HTML Forms is a good place to start. Also check out Matt's Script Archive.

How to add a link to the database

Note: if you want to link to pages about particular plants it may be best to use find_lat as this will check for synonyms and is less likely to fail if a user gets the name slightly wrong. arr_html is very fussy and needs the exact name. For example you could use the URL

	http://www.comp.leeds.ac.cgi-bin/find_lat?LAT=Allium&CAN=my_name
or
	http://www.ibiblio.org/pfaf/cgi-bin/find_lat?LAT=Allium&CAN=my_name

to find members of the onion genus (Allium). The CAN=my_name bit is optional, but it's nice for me as I can keep track of where the hits are coming from.

If you really must link to the actual plant you can use

	http://www.pfaf.org/database/plants.php?Allium+triquetrum&CAN=my_name
or
	http://www.ibiblio.org/pfaf/cgi-bin/arr_html?Allium+triquetrum&CAN=my_name

Note that the Latin name begins with a capital, and spaces are replaced by plus signs (+).
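If you are generating such links from a program, the name part could be prepared along these lines. This is only a sketch: the function name latin_to_query and its return convention are made up, and it handles nothing beyond the two rules just mentioned (capital first letter, plus signs for spaces).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Write `latin` into `out` in the form used in the links above:
 * first letter upper-case, spaces replaced by '+'.
 * Returns 0 on success, or -1 if `out` (of size `outsz`) is too small. */
int latin_to_query(const char *latin, char *out, size_t outsz)
{
    size_t i;

    assert(latin != NULL && out != NULL);

    for (i = 0; latin[i] != '\0'; i++) {
        if (i + 1 >= outsz)        /* keep room for the final '\0' */
            return -1;
        if (latin[i] == ' ')
            out[i] = '+';
        else if (i == 0)
            out[i] = (char)toupper((unsigned char)latin[i]);
        else
            out[i] = latin[i];
    }
    if (i >= outsz)                /* covers outsz == 0 */
        return -1;
    out[i] = '\0';
    return 0;
}
```

The result can then be appended after the ? in one of the URLs above, followed by the optional &CAN=my_name part.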

Updates and Bug fixes

Fixed a bug where paging through ten or more pages of the list of plants matching a particular query caused the query string to grow so much that it overflowed an internal buffer. (Nov 99)
Now if a common name matches one of the other data sources then the matching name is printed. (Nov 99)
Fixed bug where not all the common names from other sources were found by sorting lists. (Nov 99)
A few little HTML bugs were found and eliminated using web-lint and html-tidy.
Improved the way the URLs for Prev and Next links are constructed so files from the browser cache will be used. (Dec 99)
Improvements to the log files kept when searches are made. Browser versions are now recorded. You can see the results on the Stats page. (Dec 99)
A textual description of many of the fields is now produced, much more human readable than giving the results for the different fields. (Summer 2000)
Plants can now be grouped by genus which produces a much shorter list.
The lists of plants can be browsed by either Latin or common names, and there are pages listing the names in alphabetical order.
The PFAF ratings are now used more extensively and printed in the lists of plants. (Sept 02)


Plant information taken from the Plants For A Future - Species Database.
Copyright (c) 1997-2000. Last Update: Feb 2000.
WEB search engine by Rich Morris - Home Page
Plants for a Future, The Field, Penpol, Lostwithiel, Cornwall, PL22 0NG, UK.
www.pfaf.org - Contact Info