Minimum Project Technical Standards

From the Alberta Heritage Digitization minimum standards

The AHDP employs standards consistent with academic and industry practice for the digitization of paper documents. The industry-standard TIFF format with LZW compression is used for high-quality archival digital images. Documents are scanned at original size at 300 dpi in 8-bit greyscale mode to preserve as much of the original document information as possible. Colour documents are scanned in 24-bit RGB.

Capture Standards
As the chosen standard for archiving generates large files, the AHDP uses four sizes for on-line delivery to the user: a 200x200 pixel thumbnail for quick reference, and three larger sizes (600, 768 and 1000 pixels high) for more detailed examination and use. On-line delivery files are created as JPEG images at a medium compression level, balancing on-screen quality against download size.
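
The scaling arithmetic behind these four sizes can be sketched as follows. This is only an illustration, not the AHDP's actual tooling: the function names are hypothetical, and the sketch assumes that the thumbnail is fitted inside a 200x200 box with the aspect ratio preserved, which the text does not state.

```python
# Hypothetical sketch: pixel dimensions for the four on-line delivery sizes.
# Assumption: the thumbnail is fitted inside a 200x200 box, aspect ratio kept.

DISPLAY_HEIGHTS = (600, 768, 1000)  # the three detailed-view heights from the text
THUMBNAIL_BOX = 200                 # side of the thumbnail bounding box, in pixels


def scale_to_height(width, height, target_height):
    """Return (w, h) scaled so that the image is target_height pixels high."""
    new_width = max(1, round(width * target_height / height))
    return new_width, target_height


def fit_in_box(width, height, box):
    """Return (w, h) scaled to fit inside a box x box square."""
    factor = min(box / width, box / height)
    return max(1, round(width * factor)), max(1, round(height * factor))


def delivery_sizes(width, height):
    """Map each delivery size name to its (width, height) in pixels."""
    sizes = {"thumbnail": fit_in_box(width, height, THUMBNAIL_BOX)}
    for target in DISPLAY_HEIGHTS:
        sizes[f"{target}px"] = scale_to_height(width, height, target)
    return sizes


# An 8.5 x 11 inch page scanned at 300 dpi is 2550 x 3300 pixels.
for name, size in delivery_sizes(2550, 3300).items():
    print(name, size)
```

For a page of that size, the 600-pixel-high version comes out 464 pixels wide; the actual resizing and JPEG compression would be done by the image-processing software.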

The workflow for digitization consists of the following:

  1. Meta-Data
    1. Creation of descriptive meta-data (usually exists for original)
    2. Addition of fields for U of C specific items
    3. Input of meta-data into workflow database
  2. Scanning
    1. Preliminary survey of material for completeness and scanning issues
    2. Scan document using assigned Volume_Code to name files
    3. Record scan parameters into the workflow database
    4. Run automated image clean-up applications
  3. Quality Assurance (Scans)
    1. Compare images against original document to ensure that all pages are present and that pages all meet quality standards
    2. Ensure filenaming is consistent
    3. Record any corrections in the workflow database
    4. Correct images as needed
    5. Archive the source images on CD-R (2 copies) and generate a CRC-32 value for future reference
    6. Record CD-R location information
  4. Automated Procedures
    1. OCR the images, generating one text file per image
    2. Create the web display images (4 sizes)
  5. Indexing
    1. Generate a table of contents based on the structure of the book by recording each chapter with an associated UC_PageID
    2. Quality assurance - double check the table of contents against the original
  6. Web Mounting
    1. Export the workflow database records to the web database format
    2. Upload the files to the web server
    3. Update the server database and text indices
    4. Test the volume for correct linking and image order
    5. Activate the volume for public use
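
Part of the scan quality assurance step (page completeness and consistent file naming) lends itself to automation. The sketch below is a hypothetical helper, not the project's own software. It assumes that the Volume_Code is the three-character item prefix of the 8.3 naming convention described later in this document, and it uses an invented file-type letter "S" for the sample names.

```python
import re

# Assumed pattern, following the 8.3 convention described later in this document:
# collection letter, two alphanumerics, file-type letter, four-digit page, extension.
NAME_PATTERN = re.compile(
    r"^([A-Z][A-Z0-9]{2})([A-Z])(\d{4})\.([A-Z]{3})$", re.IGNORECASE
)


def check_volume(filenames, volume_code):
    """Return (missing_pages, bad_names) for the files of one scanned volume."""
    pages = set()
    bad_names = []
    for name in filenames:
        match = NAME_PATTERN.match(name)
        if match is None or match.group(1).upper() != volume_code.upper():
            bad_names.append(name)  # wrong shape or wrong volume prefix
            continue
        pages.add(int(match.group(3)))
    if not pages:
        return [], bad_names
    # Gaps are only detectable between the lowest and highest page found;
    # pages missing at either end still need the check against the original.
    missing = [p for p in range(min(pages), max(pages) + 1) if p not in pages]
    return missing, bad_names


# Sample names are invented; "S" as the file-type letter is an assumption.
files = ["A01S0001.TIF", "A01S0002.TIF", "A01S0004.TIF", "a01s05.tif"]
missing, bad = check_volume(files, "A01")
print("missing pages:", missing)
print("inconsistent names:", bad)
```

A check like this does not replace the visual comparison against the original document; it only flags obvious gaps and misnamed files before that comparison is made.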

Equipment and Software Used

Scanning
Windows 98/2000 workstations
Ricoh 450 Page Scanners
HP 6300 Page Scanners
Wicks and Wilson 4100 Microfilm scanner
ScreenScan Microfiche scanner
Maxtor Firewire drives (storage)
Adobe Photoshop
Cerious ThumbsPlus (image indexing and processing)
NameWiz (file renaming)
OCR
  • high accuracy - Prime Recognition
  • average accuracy - Abbyy Finereader
Web
Windows 2000 Advanced Server
White box web server
  • Pentium III-1000 Processor
  • 512 MB RAM
  • 10 73-GB SCSI drives in a RAID-5 configuration
Microsoft SQL Server 2000 Standard
Microsoft Access 2000
TextDB (Full Text Indexing)


Cataloguing and Meta Data Standards

The AHDP has benefited greatly from the item-level cataloguing that librarians have already done for the works it has digitized. Descriptive cataloguing for items in the AHDP is drawn from the University of Calgary library catalogue. New items are catalogued by trained cataloguers from U of C library bibliographic services following AACR2 rules. The core records exist in MARC format.

The AHDP exists primarily in English, although full-text searching is in the original language of the document. Each item in the AHDP has an index page that provides an interactive table of contents to that volume. A Dublin Core header will be added to each of these index pages to allow harvesting of each individual volume.

The AHDP employs an 8.3 file name convention for all of its files. The general convention is:

Character   Role
1           Letter - Describes the general collection to which the item belongs (handles 26 collections)
2-3         Alphanumeric - Identifies the specific item in the collection (handles 1,296 items)
4           Letter - Describes the type of file (A - size 1 display, B - size 2 display, etc.)
5-8         Numeric - Identifies the specific page of the document (handles 10,000 pages)
.           Separator between the file name and the extension
9-11        Letter - standard MIME extension for the file format

The AHDP uses the first three characters to provide persistent access to the item on its site.
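
The convention and its stated capacities can be illustrated with a short sketch. The function names are hypothetical, and writing names in upper case is an assumption; the source does not say which case is used.

```python
import string

LETTERS = string.ascii_uppercase
ALPHANUMERIC = LETTERS + string.digits

# Capacity implied by each field of the convention.
COLLECTIONS = len(LETTERS)                     # 26 collections
ITEMS_PER_COLLECTION = len(ALPHANUMERIC) ** 2  # 36 * 36 = 1,296 items
PAGES_PER_ITEM = 10 ** 4                       # 0000-9999 = 10,000 pages


def _all_in(text, alphabet, length):
    """True if text has exactly `length` characters, all drawn from alphabet."""
    return len(text) == length and all(c in alphabet for c in text)


def build_filename(collection, item, file_type, page, extension):
    """Assemble an 8.3 file name from its parts, validating each field."""
    collection, item = collection.upper(), item.upper()
    file_type, extension = file_type.upper(), extension.upper()
    if not _all_in(collection, LETTERS, 1):
        raise ValueError("collection must be a single letter")
    if not _all_in(item, ALPHANUMERIC, 2):
        raise ValueError("item must be two alphanumeric characters")
    if not _all_in(file_type, LETTERS, 1):
        raise ValueError("file type must be a single letter")
    if not 0 <= page < PAGES_PER_ITEM:
        raise ValueError("page must be between 0 and 9999")
    if not _all_in(extension, LETTERS, 3):
        raise ValueError("extension must be three letters")
    return f"{collection}{item}{file_type}{page:04d}.{extension}"


def item_identifier(filename):
    """Return the first three characters, which identify the item."""
    return filename[:3].upper()


# Page 12 of item "01" in collection "A", size 2 display image (type B).
name = build_filename("A", "01", "B", 12, "jpg")
print(name, item_identifier(name))
```

Because the item prefix never changes across file types or pages, it can serve as a stable key for links to the volume.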
Notes: DC_Subject and DC_Description are currently not used.
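
As a rough sketch of what a Dublin Core header for an index page might look like, the following renders a record as HTML meta tags. It assumes that the fields are stored under DC_* names (as the note above suggests) and that they map directly to the common DC.element naming used in HTML meta tags; the function name and the sample record are invented for illustration.

```python
from html import escape

# Per the note above, DC_Subject and DC_Description are currently not used.
UNUSED_FIELDS = {"DC_Subject", "DC_Description"}


def dublin_core_header(record):
    """Render a dict of DC_* fields as HTML meta tags for a page head."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for field, value in record.items():
        if field in UNUSED_FIELDS or not field.startswith("DC_") or not value:
            continue  # skip unused, non-Dublin-Core and empty fields
        element = field[len("DC_"):].lower()
        content = escape(str(value), quote=True)
        lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


# Placeholder record; the values are invented for illustration only.
sample = {
    "DC_Title": 'A "Sample" Volume',
    "DC_Language": "en",
    "DC_Subject": "not emitted",
}
print(dublin_core_header(sample))
```

The tags are written in HTML 4 style (no self-closing slash) to match the markup the site already uses.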

Database Specifications

The AHDP uses a database structure that tracks administrative and descriptive meta-data for item-level and page-level information in relatively simple table structures. Additional tables are used for controlled vocabularies, but all information at either the page or item level can be exported to a comma-delimited format. The core database engine for the AHDP is the Microsoft Database Engine (MDE); the project uses both Microsoft Access and SQL Server for its operations.
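
The comma-delimited export mentioned above can be sketched in a few lines. This is a generic illustration rather than the project's export routine: apart from UC_PageID, which appears in the workflow description, the field name and sample values are hypothetical.

```python
import csv
import io


def export_records(records, fieldnames):
    """Serialize a list of dict records to comma-delimited text."""
    buffer = io.StringIO(newline="")
    # extrasaction="ignore" drops any keys that are not in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical page-level records; "Label" is an invented field name.
pages = [
    {"UC_PageID": "1", "Label": "Title page"},
    {"UC_PageID": "2", "Label": "Chapter 1, Introduction"},
]
print(export_records(pages, ["UC_PageID", "Label"]))
```

The csv module quotes any value that itself contains a comma, so a label such as "Chapter 1, Introduction" survives the round trip intact.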

Web Site Specifications

General
One of the guiding principles for the AHDP is to provide a simple interface to the resources. Complex client-side scripting and the use of plug-ins are avoided where possible. Pages are created using standard HTML 4.0 code. The majority of the scripting and validation occurs at the server level.

Web Site Auditing and Evaluation
Complete logs of server activity are maintained by the AHDP.

Programming and Scripting Languages
The AHDP employs Active Server Pages 3.0 for its server-side scripting. Cookies have been avoided in the past over concerns about privacy and lack of support on the client side (whether by choice or because of incompatible clients). As stated above, client-side scripting is avoided wherever possible to ensure that the majority of users will have few or no problems accessing the site.

Preservation and Records Management
The AHDP currently uses CD-R technology for long-term storage of digital files and employs only standard file formats (TIFF, JPEG, HTML) that are open and non-proprietary for all of its work. CD-Rs are written using the standard ISO-9660 format to ensure maximum compatibility.

The general principles that the AHDP employs to ensure long-term access to the media are to create CRC-32 values for individual files and to do spot checks on the media on a regular basis. Duplicate copies of the media are made so that one copy can be stored on-site and one off-site. In addition to backing up the database information separately, the AHDP plans to save the meta-data with the files in XML format once the template has been developed.
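
A CRC-32 value of the kind described above can be computed with Python's standard zlib module. The sketch below is a minimal illustration, not the tool the AHDP uses; the function names are hypothetical, and storing the value as an eight-digit hexadecimal string is an assumption.

```python
import zlib


def crc32_of_chunks(chunks):
    """Return the CRC-32 of a sequence of byte chunks as 8 hex digits."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)  # feed the running value back in
    return f"{crc & 0xFFFFFFFF:08X}"


def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC-32 of a file, read in chunks to limit memory use."""
    def read_chunks():
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                yield chunk
    return crc32_of_chunks(read_chunks())


def matches_recorded(path, recorded_crc):
    """Spot check: does the file still match the CRC-32 recorded earlier?"""
    return file_crc32(path) == recorded_crc.upper()


# Standard check value for CRC-32: the ASCII digits 1 through 9.
print(crc32_of_chunks([b"123456789"]))  # CBF43926
```

During a spot check, a mismatch between the recomputed and recorded values indicates that the file on that disc has changed or can no longer be read correctly, and that the duplicate copy should be consulted.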