Release 2.8 - 4/12/93 Original README below is followed by update info at end of this file. *************************************************************************** * * Release 2.0 of NCBI Software Toolkit Available * 10/92 * *************************************************************************** This is to announce the availability of release 2.0 of the NCBI Software Development Kit. This includes the version of the ASN.1 specification which will be "frozen" for the next 12 months. No changes will be made to data produced by NCBI which is incompatible with that specification for one year. This means that the software in release 2.0 will be able to read and write NCBI data for one year as well without change. We expect to adding functionality to the software tools over the course of this year, and these additions will be made part of periodic software releases. However, if you do not care for the additional functionality you do not need to change your software. In six months we will solicit and propose changes in the ASN.1 specification for next year. In nine months we will freeze the proposals and produce sample data files and software that conforms to the new spec. In twelve months we will begin producing data releases that conform to the new spec. Of course, in the event of an unforseen, catastrophic bug, we will be forced to make an emergency release, but we will do everything in our power to avoid this. And, there are always new releases of operating systems and compilers, which may uncover or create new problems, but unfortunately this sort of thing is beyond our control. However, it is in our interest (and yours) to keep this to an absolute minimum, and that is what we will do. The release is available for anonymous ftp from ncbi.nlm.nih.gov. cd toolbox\ncbi_tools bin get ncbi.tar.Z quit There is also a self-extracting Mac version (ncbi.sea.hqx) and a self- extracting PC version (ncbiZ.exe) in the same directory. All three versions contain source code for ALL supported hardware platforms. In toolbox\vms_utils are a readme file and source code for a number of utilities to facilitate transfer of the source code onto VMS systems. The documentation has not caught up with the software and specification. However, now that it is frozen, documentation writing has begun. It will be announced on BITS when it is complete. The README file from ncbi.tar.Z follows: Release 2.8 - Update history at end of this file. Introduction This distribution is release 2.0 of the NCBI core library for building portable software, and AsnLib, a collection of routines for handling ASN.1 data and developing ASN.1 software applications. AsnLib and the asntool application are built using the CoreLib routines. This software supports release 1.0 of the Entrez cdrom. The lowest layer of code is the CoreLib [chapter 2]. These are multi- platform functions for memory allocation (including byte stores), string manipulation, file input and output, error and general messages, and time and date notification. These functions have been written only where we found that the existing ANSI functions were not sufficiently multi-platform or well- behaved among all of the platforms that we support. For each platform (a combination of processor, operating system, compiler, and windowing system), we supply a specific ncbilcl.h file, which contains typedefs and defines for multi-platform symbols, and includes a number of standard header files. (For example, ncbilcl.msw is used for the Microsoft C compiler under Microsoft Windows on the PC.) Use of these symbols, and of the functions in the CoreLib, allow us to write multi-platform source code for a variety of disparate platforms. The next layer of code is the AsnLib stream reader [chapter 5]. This is used in conjunction with a header file and a parse table loader file, both of which are produced by processing the formal ASN.1 specification with the AsnTool application [chapter 4]. The symbolic defines in the header file are pointers into the parse table, in which the ASN.1 specification is represented. To read at the stream reader level, a program alternates between calls to AsnReadId and AsnReadVal. AsnReadId returns a pointer into the parse table, which can be compared against the defines in the AsnTool-generated header. For example, in the specification for MEDLINE records, the Medline-entry section has an item called "uid", for the unique ID of the record. This is symbolized in the header file as MEDLINE_ENTRY_uid. When AsnReadId returns this symbol, the program calls AsnReadVal to obtain the uid for that record. AsnKillValue is also needed to free any memory allocated by AsnReadVal, which occurs when the value is a string and not an integer. The entire set of records on the Entrez CD-ROM can be read as a single stream with the AsnLib functions. The ASN.1 records may be accessed at a higher level through the object loaders [chapter 6], which utilize the stream processing functions to load C memory structures with the contents of the ASN.1 objects. For each ASN.1 object we specify, we also define an equivalent C memory structure. The object loader level of code contains functions to read and write each ASN.1 object. These are hierarchical, as are the ASN.1 specifications. Calling the top level loader, SeqEntryAsnRead, will load an entire SeqEntry from an open AsnIo channel, and will return a pointer to the loaded memory structure. The read function for an AsnIo channel can be swapped to refer to a normal disk file, a network socket, or to compressed data, which it automatically decompresses. The object loader code can interconvert between the highly-branched memory object and a linear ASN.1 message with complete fidelity. The object loaders have additional functions, including the ability to explore the structure and notify the program when particular data elements are encountered. The entire contents of the Entrez CD-ROM can also be streamed through the object loaders. However, most calls to the object loaders for simply reading a particular record are done via the data access functions (see below). The data access functions allow a program to call the object loaders on a sequence or MEDLINE record given the uid of the record [chapter 7]. This will get the data into memory regardless of whether the data are compressed on the Entrez CD-ROM or are obtained through a service over the Internet. This means that a detailed understanding of the files and formats on the Entrez disc is not needed by application programmers. The function to load a sequence record, SeqEntryGet, needs the uid to retrieve and a complexity code parameter. A sequence record is in the form of a NucProt set. This contains a nucleotide (which may itself be composed of segments) and all of the proteins it is known to encode. The set of segments is called a SegSet, and the individual sequences are called BioSeqs. We have taken the liberty of producing this integrated view, but the complexity code parameter allows the record to be easily loaded in a simpler, more traditional form, if desired. The accession number term list is built to supply the proper uids to support this facility. This access library is compatible with Entrez release 1.0 or later only. The sequence utilities and application programmer interface layer [chapter 7] allows exploration of the loaded memory structures and generation of standard literature or sequence reports from those objects. For example, a BioSeq can be converted to FASTA or GenBank flat file formats and saved to a file, and a MEDLINE record can be saved in MEDLARS format, which is suitable for entry into personal bibliographic database programs. A sequence port can be opened that gives a simple, linear view of a segmented sequence, converting alphabets, merging exon segments, and dealing with information on both strands of the DNA. This layer also includes some functions to explore the NucProt set. The explore functions visit each individual BioSeq in the set, calling a callback function for each sequence node so that a program can examine feature tables and other information that are associated with the NucProt or SegSets or with the individual sequences. Vibrant is a multi-platform user interface development library that runs on the Macintosh, Microsoft Windows on the PC, or X11 and OSF/Motif on UNIX and VAX computers [separate documentation]. It is used to build the graphical interface for the Entrez application (whose source code is in the browser directory). The philosophy behind Vibrant is that everything in the published user interface guidelines (the generic behavior of windows, menus, buttons, etc.), as well as positioning and sizing of graphical control objects, is taken care of automatically. The program provides callback functions that are notified when the user has manipulated an object. Vibrant and Entrez code are not supported, but are provided on an as-is basis. The advantage of using AsnLib and the object loaders, as they are implemented, is that application program developers merely need to recompile their programs with the new (AsnTool-generated) header files and load the new parse tables (included with the Entrez software) in order to be able to read the new data. This process is straightforward, and will not break existing program code. The application is free to ignore new fields if it does not choose to take advantage of the new kinds of information. The documentation is currently being brought up to date. The programs in the demo directory are designed to teach the proper use of many of the functions discussed above. Many of these programs are not yet documented. The simplest is testcore.c, which tests various functions in the CoreLib. The most complex is getfeat.c, which takes an accession number of locus name, determines the unique seq ID, retrieves the entry from the Entrez CD-ROM using the data access library, locates all coding region features using the explore functions, and prints the DNA sequences of all exons using sequence port functions. If you cannot extract and print the doc.tar.Z file, please send an email message with your land mailing address and phone number to toolbox@ncbi.nlm.nih.gov, and we will mail a copy to you. The contents of the ncbi directory (the highest level, containing the NCBI Software Development Kit source code in several subdirectories) is shown below. The readme file contains instructions on copying the appropriate make files to be built in the build directory. The makeall file copies headers to the include directory builds four libraries (ncbi, ncbiobj, ncbicdr and vibrant), copying them to the lib directory. The makedemo file builds the demo programs and the Entrez application: api Application Programmer Interface, Sequence Utilities asn ASN.1 specifications for publications and sequences asnlib Source code for AsnLib and asntool asnload AsnLib headers and dynamic parse tables (Mac and PC) asnstat AsnLib headers that use static memory (UNIX and VMS) bin Asntool executable copied here browser Source code for Entrez application build Empty directory for building tools and libraries cdromlib Access routines for data on the Entrez CD-ROM config Configuration files for NCBI software: dos mac unix vms win corelib Source code for NCBI Core Software Library data Data files used for sequence conversion demo AsnLib and sequence utility demonstration programs doc Documentation in ASCII files include Include files required by applications are copied here lib Libraries copied here link Contains three subdirectories with build accessory files: macmpw Macintosh MPW C msdos Microsoft C and Borland C for DOS mswin Microsoft C and Borland C for Windows make Make files for various systems network Network version of data access (to be added later) apple entrcli entrserv ncsasock netmanag nsclilib nsconfig nsdaemon nsdemocl object Functions for reading and writing complex objects readme File that contains important building instructions vibrant Source code for Vibrant portable interface package The platforms that are supported (as indicated by the suffix on the relevant ncbilcl.h file) are shown below. Those marked with an asterisk (*) are available as-is: 370* IBM 370 acc SUN acc compiler alf DEC Alpha under OSF/1 aov DEC Alpha under AXP/OpenVMS aux* Macintosh A/UX bor Borland for DOS bwn Borland for Microsoft Windows ccr CenterLine CodeCenter cpp SUN C++ cra* Cray cvx* Convex gcc Gnu gcc hp * Hewlett Packard mpw Macintosh Programmer's Workshop msc Microsoft C for DOS msw Microsoft for Windows nxt* NeXT r6k* IBM RS 6000 sgi Silicon Graphics thc THINK C on Macintosh ult DEC ULTRIX vms DEC VAX/VMS Questions or comments can be directed to toolbox@ncbi.nlm.nih.gov. It is our intent that the ASN.1 specifications [chapter 3] will be frozen for at least 12 months (through September of 1993). As molecular biology progresses, new features (e.g., RNA editing) will need to be represented in the ASN.1 specifications. Proposed additions will be made available for evaluation (with code and sample records included) in April of 1993, and the specifications will be finalized in July of 1993. Changes will be made with every effort given to backward compatibility with data. We expect to add only new choices and optional fields. Existing records will thus continue to be valid instances of the new specification, eliminating the need for us to reprocess the databases. While we feel confident we can to hold to this schedule, we hope users will understand if events force us to make an interim release of some sort. In addition, we will make releases of software with additional features over the next year, but the use of such releases will be at the option of the software developer. ANSI C: This software requires an ANSI C compiler. This will be no problem at all except to people on Sun machines, where the bundled C compiler, cc, is non-ansi. However, you can use the Sun unbundled compiler, acc, or the Gnu compiler, gcc (which is free) and that works just fine. If you have written applications on the Sun with non-ANSI functions, the ANSI compilers will complain. See the notes below if this is a problem. Installation ALL - change to the build subdirectory MS-DOS (Also see NEW MAKEFILES, below) Microsoft C version 7.00 copy ..\make\*.dos ren makeall.dos makefile nmake MSC=1 [note: nmake requires windows or DPMI] copy ..\config\ncbi.dos ncbi.cfg check paths in ncbi.cfg file [see section on CONFIGURATION] Optional: edit AUTOEXEC.BAT with "set NCBI=[path to directory containing ncbi.cfg". reboot to activate To make demo programs: nmake -f makedemo.dos MSC=1 Microsoft Windows version 7.00 copy ..\make\*.dos ren makeall.dos makefile nmake MSW=1 [note: nmake requires windows or DPMI] check paths in "ncbi.ini" as above copy ncbi.ini to your windows directory To make demos: nmake -f makedemo.dos MSW=1 Borland C++ 3.1 copy ..\make\*.dos ren makeall.dos makefile make -DBOR then set paths as in Microsoft C, above. To make demos: make -f makedemo.dos -DBOR Borland C++ 3.1 for Windows copy ..\make\*.dos ren makeall.dos makefile make -DBWN then set paths as in Microsoft Windows, above. To make demos: make -f makedemo.dos -DBWN Mac tested on THINK C 5.0.4 and MPW C 3.2 Both - copy config:mac:ncbi.cnf to your System Folder, or to the System Folder:Preferences subfolder edit the "ASNLOAD" line in "ncbi.cnf" to point to the ncbi:asnload directory in this release edit the "DATA" line to point to the ncbi/data directory Think C - rename corelib:ncbimain.mac ncbimain.c rename corelib:ncbienv.mac ncbienv.c rename corelib:ncbilcl.thc ncbilcl.h build separate projects, allocating .c files as described in the makeall.thc document. MPW C - copy make:makeall.mpw build:makeall.make copy make:makedemo.mpw build:makedemo.make set directory to build build makeall build makedemo Unix tested on Sun Sparc, Silicon Graphics, IBM 3090 with AIX Sun (with gcc version 2.2 but not Vibrant) cp ../make/*.unx . mv makeall.unx makefile make LCL=gcc CC=gcc RAN=ranlib To make demos: make -f makedemo.unx CC=gcc Sun (with gcc version 2.2 with Vibrant) cp ../make/*.unx . mv makeall.unx makefile source ../make/viball.gcc To make demos: source ../make/vibdemo.gcc Silicon Graphics (no Vibrant) cp ../make/*.unx . mv makeall.unx makefile [ edit makefile to delete make $(LIB4) ] make LCL=sgi OTHERLIBS="-lm -lPW -lsun" To make demos: make -f makedemo.unx LCL=sgi OTHERLIBS="-lm -lPW -lsun" Silicon Graphics (with Vibrant) cp ../make/*.unx . mv makeall.unx makefile source ../make/viball.sgi To make demos: source ../make/vibdemo.sgi Other UNIX: AIX, ULTRIX, NeXt, Sun acc, DEC Alpha OSF/1 Follows models above. Read header in makeall.unx and makedemo.unx for details. Look at viball.* and vibdemo.* for model shell scripts for Vibrant versions. for all UNIX, edit .ncbirc as described above for MSC. optional edit .login to "setenv NCBI=[path to .ncbirc file" VMS (without Vibrant) on VAX $set def [ncbi.build] $copy [-.make]*.dcl *.com $@makeall check ncbi.cfg as described above for MSC. edit LOGIN.COM to "define NCBI [path to ncbi.cfg file]" To make demos: $@makedemo VMS (with Vibrant) on VAX $set def [ncbi.build] $copy [-.make]*.dcl *.com $@viball check ncbi.cfg as described above for MSC. edit LOGIN.COM to "define NCBI [path to ncbi.cfg file]" To make demos: $@vibdemo Testing VMS only: look in rundemo.dcl in [make] to see how to give command line arguments. Not all demo programs are shown. Run at least testcore. All else: In build should be a program called testcore. Type "testcore -" and it should show you some default arguments. Type "testcore" and it will run through a variety of functions in CoreLib, prompting you for responses along the way. It should run without a crash or error report. If you made Vibrant versions all demos will have startup dialog boxes. If not, they take command line arguments. If testcore runs, read the documentation for CoreLib and for AsnLib. In the AsnLib documentation are instructions for running asntool itself. for running a few of the demo programs. There are a large number of demo programs now (including Entrez itself, if you made the Vibrant versions). CONFIGURATION OR SETTINGS FILES: One of the fundamental problems in writing portable software concerns configuration issues. Each individual user's computer will have its own particular hardware and software environment, and each machine will have its disk file hierarchy set up in a unique manner. A program that needs accessory information, such as help files, parse tables, or format converters, must be given a means of finding the data regardless of where the user has placed the files. The difficulty is compounded by the different conventions for naming files and specifying paths on each class of machine. For example, the name of a CD-ROM on the Macintosh is fixed, determined by information on the CD itself, whereas on the PC it is addressed by a drive letter, which can be assigned by the user, but which cannot be reconciled with the name the Macintosh sees. An associated problem is that many programs will want to allow the user to make persistent changes to parameters. These parameters typically involve numbers or font specifications, but may also include paths to data files. Some platforms supply such configuration information in preferences files, others in environment variables. Manipulating these settings is platform dependent, as is the format in which the preference is specified. The NCBI Software Toolkit core library addresses these problems by providing configuration or settings files. These are modeled after the .INI files used by Microsoft Windows. Settings files are plain ASCII text files that may be edited by the user or modified by the program. They are divided into sections, each of which is headed by the section name enclosed in square brackets. Below each section heading is a series of key=value strings, somewhat analogous to the environment variables that are used on many platforms. The ncbi configuration file supplies general purpose configuration information on paths for commonly used data files. The typical file set up for the Entrez application running on the PC under Microsoft Windows is shown below: [NCBI] ROOT=D: ASNLOAD=C:\ENTREZ\ASNLOAD\ DATA=C:\ENTREZ\DATA The only section is entitled NCBI. The ROOT entry refers to the path to the Entrez CD-ROM. In this example, the user has configured the machine to use drive letter D. (On the Macintosh, the name of the disc is SEQDATA, which cannot be changed by the user.) The ASNLOAD specifies the path to the ASN.1 parse tables. These files are required by the AsnLib functions, and all higher-level procedures that call them, including the Object Loader, Sequence Utility, and Data Access functions. Files pointed to by the DATA entry contain information necessary to convert biomolecule sequence data into different alphabets (e.g., unpacking the 2-bit nucleotide code stored on the Entrez CD into standard IUPAC letters). Although the contents of a configuration file is similar regardless of platform, the name of the file and its location is platform dependent. If the base name of the configuration file is xxx, then the actual file name is shown below for each platform: Macintosh xxx.cnf Microsoft Windows xxx.INI MS-DOS (without Windows) xxx.CFG UNIX .xxxrc VMS xxx.cfg Samples of such files are in subdirectories of \config. The UNIX version does not have the leading '.' in filename so you can see it. Since VMS and DOS both use the same file name (ncbi.cfg) the DOS version was called ncbi.dos. You will have to rename it. Remember these are just models. You will have to set the paths appropriately for your machine yourself. The location in which these files must reside is also platform dependent, and the functions that manipulate the contents may look in several places to find these files. On the Macintosh, the function first looks in the System Folder, then in the Preferences folder within the System Folder. Under Microsoft Windows, the file must be in the Windows directory, along with all of the other .INI files. Under DOS without Windows, the function first looks in the current working directory, then in the directory whose path is specified in the NCBI environment variable. Under UNIX and VMS, the current working directory is first checked, then the user's home directory, and finally the directory specified by the NCBI environment variable. (Under UNIX, when it uses the environment variable, it will check for configuration files first without and then with the initial dot.) On the multi- user platforms (UNIX and VMS), the use of the NCBI environment variable allows a common settings file to be used as the default by multiple users. If such a settings file is changed under program control, it is copied over into the user's home directory, and the new copy is modified. The order of searching for settings files ensures that this new copy is used in all subsequent operations. contents of ASNLOAD are in ncbi/asnload contents of DATA are in ncbi/data MAJOR CHANGES FROM DOCUMENTATION: AsnNode structures have proved to be generally useful and moved from AsnLib to ncbimisc. In addition, some elements of structs used in the object loaders were called "class" to match the ASN.1 names. Class is a C++ reserved word, so all instances of "class" have been changed to "_class". To conform to our naming conventions, we have changed the names appropriately: AsnValue = DataVal AsnNode = ValNode class = _class A global search and replace of your code with these strings (not restricted to words... we want to change AsnNodePtr = ValNodePtr as well) should fix any problems. Field names within structures have not been changed. If your code uses only the object loaders, you may not find these strings in your code at all. DATA ACCESS LIBRARIES cdromlib contains data access routines compatible with release 1.0-6.0 of the Entrez CDROM. The documentation for these functions are out of date. The routines in cdromlib have been split into entrez, sequence, and medline access functions. The interface you should normally program to is defined in accentr.[ch], accseq.[ch], and accml.[ch]. The form of this calls has been changed to make them compatible with the NCBI network server, a client/server version of data access. A program written to use these calls can access the the cdrom data, the network data, a combination, or that plus a local database by just fiddling with defines. The form of the api for these functions has also been changed to hide the details of storage and caching more so that the different optimizations done to support cdrom and network access are transparent to the application programmer. The end user tool called "Entrez" now uses these libraries as it's only means of data access (i.e., you can write an application of your own with any or all of Entrez's functionality using just these routines). DOCUMENTATION We are rewriting the documentation to conform with all the new features contained in this software. We will add it to the package as soon as possible. DEMO PROGRAMS As in the tools, there are a number of undocumented programs in the demo directory as well, that use a number of the utility functions in api. There is also a demo program called "getseq" in the cdromlib directory which retrieves a sequence from the cdrom given any valid sequence id. These will be described in more detail in the next set of documentation. Briefly: asn2gnbk.c converts ASN.1 to GenBank flatfile asn2rpt.c converts ASN.1 to human readable report dosimple.c converts ASN.1 to a "simple sequence" getseq.c gets sequence from Entrez Cdrom using data access library, writes to disk getfeat.c ditto, but writes sequence of any CdRegion features to "test.out" getmesh.c documented getpub.c documented indexpub.c documented seqtest.c reads ASN.1 sequence, converts to iupac, reports segmented sequences, outputs fasta format to seqtest.out testcore.c documented testobj.c tests Medline object loader, demonstrates error checking using NULL asnio stream. entrez If Vibrant is installed, the full Entrez program is made. asndhuff Demonstrates streaming ASN.1 data from the huffman compressed Entrez CDROM (only works on release 1.0 or later). Update History of this release from 2.0 Release 2.8 - 4/12/93 many small fixes and improvements performance improvements in cdromlib vibrant improvements improvements to toreport and togenbk Release 2.7 - 3/12/93 continued refinement of access functions accentr made "stand-alone" with no dependency on accml, accseq See changes in demo program, getseq.c minor bug fixes in object loaders refinements in tognbk many improvements and refinements to vibrant bug fix to asnlex.c for piping text asn1 streams changed SeqFeat.except to SeqFeat._except to avoid reserved word in Win32 Release 2.6 - 2-16-93 much internal rearrangement of access functions at low levels to accommodate 2 cdrom, and network data access. Invisible to applications access.doc - a brief writeup on data access and exploring in /doc improvements to toreport and tognbk for more sequence data sources Release 2.5 - 1-4-93 many small fixes and improvements asnlib: added generic copy (through file) AsnIoCopy(), copy (in memory) AsnIoMemCopy(), and compare in memory, AsnIoMemComp() for any object define in object/*.h vms version: (Will Gilbert) a new Makefile which compiles the sources from there native directory This version also checks to see if the user's system has the proper include file to make the vibrant library. Support for OpenVMS on DEC Alpha machine added. data access libraries: Changed data access library so that CD-only, network only, or Net+Cd may be built from the same baseline. Eliminated memory leaks in data access library. environment: Modified ncbienv.* so that GetAppParam() holds the last-read configuration file in a cache. Also modified GetAppParam() and SetAppParam() so that they can read all the keys in a section, or destroy a section, if the key parameter is NULL. files: Modified ncbifile.c so that EjectCd() takes the raw device name as an extra parameter. configuration files: Added VMS ncbi config files for the various multi-CD/hard disk configurations, and moved all ncbi and entrez config files into their appropriate config subdirectories (vms, unix, win, etc.). demo programs: Will Gilbert added many tutorial comments to getseq.c, getfeat.c, and seqtest.c Release 2.4 - 12-3-92 sequtil.c - improvements to BioseqContextGetSeqFeat() and SeqIdPrint(). toreport.c - improvements for PDB data togenbnk?.c - improvements for unusual ASN.1 objloc.c - bug fix in reading release date for PDB id. seqport.c - addition of translation routine given CdRegion data access libraries - support for pre-release 0.6 removed support for multiple CDROM's (Entrez:Sequences and Entrez:MEDLINE) at once. vibrant - performance improvements Release 2.3 - ncbimain.xxx - GetArgs now handles negative numbers correctly Release 2.2 - 10-29-92 objfeat.c - write to uninitialized memory when reading a tRNA feature which contains codon, but no amino acid transferred. Release 2.1 - 10/22/92 ncbifile.c - VMS open did not work correctly on CDROM. objseq.c - Seq-hist incorrectly read