Alberta Heritage Digitization minimum standards
The AHDP employs standards consistent with academic and industry practice for
digitization of paper documents. The industry standard TIFF format with LZW compression
is used for high-quality archival digital images. Documents are scanned at original
size at 300 dpi in 8-bit greyscale mode to preserve as much of the original document
information as possible. When documents are in colour, 24-bit RGB scans are made.
As the chosen standard for archiving generates large files, the AHDP uses 4 sizes
to provide on-line delivery to the user: a 200x200 pixel thumbnail for quick reference,
and three sizes (600 pixels high, 768 pixels high and 1000 pixels high) for more
detailed examination and usage. On-line delivery files are created as JPEG images
at a medium compression level, balancing on-screen quality against download size.
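The arithmetic behind these choices can be sketched as follows. This is only an illustration: the letter-size page (8.5 x 11 inches) is an assumed example, and fitting the thumbnail inside a 200x200 box while keeping the aspect ratio is an assumption about how the thumbnail is produced, not a documented AHDP procedure.

```python
DELIVERY_HEIGHTS = (600, 768, 1000)  # the three detail sizes named in the text


def uncompressed_bytes(width_in, height_in, dpi=300, bits_per_pixel=8):
    """Raw size of a scan in bytes, before LZW compression is applied."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def scale_to_height(width_px, height_px, target_height):
    """Pixel dimensions after resizing to a fixed height, keeping aspect ratio."""
    return round(width_px * target_height / height_px), target_height


def fit_in_box(width_px, height_px, box=200):
    """Pixel dimensions of the largest copy that fits in a box x box square."""
    scale = min(box / width_px, box / height_px)
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))


if __name__ == "__main__":
    # Assumed example: a letter-size page scanned at 300 dpi is 2550 x 3300 pixels.
    print(uncompressed_bytes(8.5, 11))                      # 8-bit greyscale
    print(uncompressed_bytes(8.5, 11, bits_per_pixel=24))   # 24-bit RGB
    for height in DELIVERY_HEIGHTS:
        print(scale_to_height(2550, 3300, height))
    print(fit_in_box(2550, 3300))
```

Under these assumptions a greyscale page is roughly 8.4 MB before compression and a colour page three times that, which is why smaller JPEG derivatives are used for delivery.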
The workflow for digitization consists of the following:
- Creation of descriptive meta-data (usually exists for original)
- Addition of fields for U of C specific items
- Input of meta-data into workflow database
- Preliminary survey of material for completeness and scanning issues
- Scan document using assigned Volume_Code to name files
- Record scan parameters into the workflow database
- Run automated image clean-up applications
- Quality Assurance (Scans)
- Compare images against original document to ensure that all pages are present and
  that pages all meet quality standards
- Ensure file naming is consistent
- Record any corrections in the workflow database
- Correct images as needed
- Archive the source images on CD-R (2 copies) and generate a CRC-32 value for future
  verification
- Record CD-R location information
- Automated Procedures
- OCR the images, generating one text file per image
- Create the web display images (4 sizes)
- Generate a table of contents based on the structure of the book by recording
  each chapter with an associated UC_PageID
- Quality assurance - double check the table of contents against the original
- Web Mounting
- Export the workflow database records to the web database format
- Upload the files to the web server
- Update the server database and text indices
- Test the volume for correct linking and image order
- Activate the volume for public use
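The "all pages are present" check in the quality assurance step lends itself to automation. The sketch below is a hypothetical helper, not the AHDP's actual tool: it assumes the page numbers have already been extracted from the file names as integers and that the expected first and last page numbers are known from the preliminary survey.

```python
from collections import Counter


def missing_pages(page_numbers, first, last):
    """Return the page numbers in [first, last] that were not scanned."""
    present = set(page_numbers)
    return [n for n in range(first, last + 1) if n not in present]


def duplicate_pages(page_numbers):
    """Return the page numbers that appear more than once, in ascending order."""
    counts = Counter(page_numbers)
    return sorted(n for n, count in counts.items() if count > 1)


if __name__ == "__main__":
    scanned = [1, 2, 4, 5, 5, 7]  # illustrative data only
    print("missing:", missing_pages(scanned, first=1, last=7))
    print("duplicated:", duplicate_pages(scanned))
```

Any page reported here would still need to be compared against the original document, since a gap may reflect a blank or missing leaf in the source rather than a scanning error.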
Equipment and Software Used
- Windows 98/2000 workstations
- Ricoh 450 Page Scanners
- HP 6300 Page Scanners
- Wicks and Wilson 4100 Microfilm scanner
- ScreenScan Microfiche scanner
- Maxtor Firewire drives (storage)
- Adobe Photoshop
- Cerious ThumbsPlus (image indexing and processing)
- NameWiz (file renaming)
- OCR, high accuracy - Prime Recognition
- OCR, average accuracy - ABBYY FineReader
- Windows 2000 Advanced Server
- White box web server
- Pentium III-1000 Processor
- 512 MB RAM
- 10 73-GB SCSI drives in a RAID-5 configuration
- Microsoft SQL Server 2000 Standard
- Microsoft Access 2000
- TextDB (Full Text Indexing)
Cataloguing and Meta Data Standards
The AHDP has benefited greatly from the item-level cataloguing that librarians have
already done for the works it has digitized. Items in the AHDP draw their descriptive
cataloguing from the University of Calgary library catalogue. New items are catalogued
by trained cataloguers from U of C library bibliographic services following AACR2
rules. The core records exist in MARC format.
The AHDP site is presented primarily in English, although full-text searching operates
in the original language of each document. Each item in the AHDP has an index page that
provides an interactive table of contents to that volume. A Dublin Core header will
be added to each of these index pages so that each volume can be harvested individually.
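As an illustration of what such a header might look like, the sketch below renders a record as HTML meta tags. It assumes the common convention for embedding Dublin Core in HTML (a schema link plus "DC."-prefixed meta names); the record keys and values are hypothetical, and the AHDP's actual template is not described here. Empty elements are skipped, in the same spirit as leaving unused fields out.

```python
from html import escape

DC_SCHEMA_LINK = '<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">'


def dublin_core_meta(record):
    """Render a dict of Dublin Core elements as HTML <meta> tags.

    Keys are element names such as "title" or "language" (hypothetical
    input format); elements with empty values are omitted.
    """
    lines = [DC_SCHEMA_LINK]
    for element, value in record.items():
        if value:
            content = escape(str(value), quote=True)
            lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative record only; not taken from the AHDP catalogue.
    print(dublin_core_meta({"title": "Example volume", "language": "en", "subject": ""}))
```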
The AHDP names all of its files using an 8.3 file name convention. The general
convention is:
- Letter (1 character) - Describes the general collection to which the item belongs (handles 26 collections)
- Alphanumeric (2 characters) - Identifies the specific item in the collection (handles 1,296 items)
- Letter (1 character) - Describes the type of file (A - size 1 display, B - size 2 display, etc.)
- Numeric (4 digits) - Identifies the specific page of the document (handles 10,000 pages)
- Letters (extension) - The standard extension for the file format
The AHDP uses the first three characters to provide persistent access to the item
on its site.
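A file name following this convention can be split into its parts mechanically. The sketch below infers the field widths from the stated capacities (26 collections = 1 letter, 1,296 items = 36 x 36 = 2 alphanumeric characters, 10,000 pages = 4 digits, for 8 characters in total). The sample name "A0XA0012.jpg", the case-insensitive matching, and the function names are illustrative assumptions.

```python
import re
from typing import NamedTuple

# collection letter, 2-character item code, file-type letter, 4-digit page, extension
FILENAME_RE = re.compile(
    r"([A-Z])([A-Z0-9]{2})([A-Z])(\d{4})\.([A-Z0-9]{1,3})", re.IGNORECASE
)


class ParsedName(NamedTuple):
    collection: str
    item: str
    file_type: str
    page: int
    extension: str


def parse_filename(name):
    """Split an 8.3 file name into its convention fields or raise ValueError."""
    match = FILENAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid 8.3 name under this convention: {name!r}")
    collection, item, file_type, page, extension = match.groups()
    return ParsedName(
        collection.upper(), item.upper(), file_type.upper(), int(page), extension.lower()
    )


def persistent_id(name):
    """First three characters: collection letter plus item code."""
    parsed = parse_filename(name)
    return parsed.collection + parsed.item


if __name__ == "__main__":
    print(parse_filename("A0XA0012.jpg"))  # hypothetical file name
    print(persistent_id("A0XA0012.jpg"))
```

Because the persistent identifier depends only on the first three characters, display sizes and page numbers can change without affecting links to the item.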
Notes: DC_Subject and DC_Description are currently not used.
Currently the AHDP uses a database structure that tracks administrative and descriptive
meta data for item and page level information in relatively simple table structures.
Additional tables are utilized for controlled vocabularies but all information at
either the page or item level can be exported to a comma-delimited format. The core
database engine for the AHDP is the Microsoft Database Engine (MDE); the project
utilizes both Microsoft Access and SQL Server for the operations.
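A comma-delimited export of this kind can be sketched with Python's standard csv module. Volume_Code and UC_PageID are field names mentioned in this document; the Title column, the sample values, and the function name are hypothetical, and the rows would in practice come from the workflow database rather than a literal list.

```python
import csv
import io


def export_rows(rows, fieldnames):
    """Serialize a list of dict records to comma-delimited text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Illustrative page-level records only.
    sample = [
        {"Volume_Code": "A0X", "UC_PageID": 1, "Title": "Calgary, Alberta"},
        {"Volume_Code": "A0X", "UC_PageID": 2, "Title": "Preface"},
    ]
    print(export_rows(sample, ["Volume_Code", "UC_PageID", "Title"]))
```

Values that themselves contain commas are quoted by the writer, so they survive a round trip through the delimited format.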
Web Site Specifications
One of the guiding principles for the AHDP is to provide a simple interface to
the resources. Complex client-side scripting and plug-ins are avoided where possible.
Pages are created using standard HTML 4.0 code. The majority of the scripting and
validation occurs at the server level.
Web Site Auditing and Evaluation
Complete logs of server activity are maintained by the AHDP.
Programming and Scripting Languages
The AHDP employs Active Server Pages 3.0 for its server-side scripting. Cookies
have been avoided over concerns about privacy and lack of support at the client side
(whether by choice or because of incompatible clients). As stated above, client-side
scripting is avoided wherever possible to ensure that the majority of users will have
little to no problem accessing the site.
Preservation and Records Management
The AHDP currently uses CD-R technology for long-term storage of digital files
and employs only standard file formats (TIFF, JPEG, HTML) that are open and
non-proprietary for all of its work. CD-Rs are written using the standard ISO 9660
format to ensure maximum compatibility.
The general principles that the AHDP employs to ensure long-term access to the media
are to create CRC-32 values for individual files and to spot-check the media on a
regular basis. Duplicate copies of the media are made so that one copy can be stored
on-site and one off-site. In addition to backing up the database information separately,
the AHDP plans to save the meta data with the files in XML format once the template
has been developed.
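A CRC-32 spot check of the kind described above can be sketched as follows. This example uses Python's zlib.crc32, which is an assumption made for illustration; the tool the AHDP actually uses to generate and compare CRC-32 values is not named in this document, and the function names are hypothetical.

```python
import zlib


def crc32_of_chunks(chunks):
    """CRC-32 of a sequence of byte strings, computed incrementally."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def crc32_of_file(path, chunk_size=1 << 20):
    """CRC-32 of a file, read in chunks so large TIFFs need not fit in memory."""
    with open(path, "rb") as handle:
        return crc32_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def verify_file(path, expected_crc):
    """Spot check: True if the file still matches the CRC-32 recorded at archiving."""
    return crc32_of_file(path) == expected_crc


if __name__ == "__main__":
    # 0xCBF43926 is the standard CRC-32 check value for the ASCII string "123456789".
    print(hex(crc32_of_chunks([b"123456789"])))
```

A mismatch during a spot check indicates that the copy on that disc has changed, at which point the duplicate copy can be used to restore the file.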