Global Control of Database Cleanup.
by Johanna Bowen, Jonathan Jiras and David Ritchie
Presentation delivered by Johanna Bowen at the LITA/ALCTS Program entitled: "Database Cleanup After Retrospective Conversion" on Sunday July 7, 1996 at ALA Annual in New York
Opening up the issue of post-recon database cleanup within a technical services department can elicit a variety of responses, ranging from the "who cares about older records" attitude to the maniacal zeal of the purists for whom every deviation is a major stumbling block in the ongoing pursuit of the perfect database. Perfection itself is up for redefinition when a project is being planned. Ideally the whole library will enter into a dialogue about whether an effort to clean up bad but benign MARC tagging should take precedence over projects driven by the degree to which user access is directly affected. Both goals should share some of the resources devoted to post-recon cleanup. Bad but benign MARC is benign today, but it will not necessarily remain benign after some future update from your vendor or a migration to another OPAC provider.
Patron access to a cleanly defined set of records following a properly structured search is a RIGHT not a privilege. All of Bibliographic Instruction and the hoped for outcome of the Reference Interview should have the effect of empowering the user to control the parameters of a search and find out what exactly a library has in its holdings that matches the need of the patron. Where in the past the misfiled card could effectively remove from the card catalog the existence of a title, in today's OPAC world it is the mis-tagged record which serves to effectively remove material from the pool of available resources.
Early retrospective conversion activities focused with some success on matching the item in hand to a record in a utility's database. The data from a card could be carefully matched to the textual content of the record. Unfortunately the card being matched to the record gave no clue as to the correctness of the MARC tags. This focus on the text of the record extended for years in activities which essentially used the online bibliographic records to produce cards, where what you read was often more important than how it was tagged. Often, in fact, tagging decisions were made to control the print constants that governed the appearance of a card.
For years libraries produced catalog card sets from a bibliographic utility record. Ultimately most libraries moved to an online catalog. Many of the global clean-up projects which I will outline are designed to solve problems which become apparent once a library switches from a card catalog production environment to an on-line environment. If we had all remained in a card environment, we would not even have many of these problems. Likewise, if a library had chosen a different OPAC provider they may not have discovered many of these specific problems, but others would have, no doubt, surfaced.
Part Two. Overview of global cleanup activities
Global database cleanup is possible because the MARC standards exist and deviations from those standards can be found with relatively simple programming statements in the appropriate software environment. They can not only be identified but in many cases they can be corrected with relatively simple programming statements.
Technical Services librarians can set out to specify global searches which are designed to:
* find the types of errors which can clearly cause problems for users.
* find errors that are inconsistent with a library's cataloging policies
* gather statistical data about a library's OPAC
In the three software tools that I am familiar with, MARC Review, MultiLIS, and Innovative Interfaces, the structure of the activity is roughly the same. Using a combination of operators, either classic Boolean or supplied by the software, a librarian creates a global search for a class or category of records which meet some parameters. The resulting file is a subset of the database.
With MultiLIS one learns to write powerful replacement rules that select the fields for export and make the changes at the point of export. With the Global version of MARC Review, changes can be made within the subset file after export. For both MARC Review and MultiLIS, changes can be examined and sampled for success prior to the final replacement reload. When the file is reloaded, the changes take effect through the replacement of original data, the addition of new fields or data, or the deletion of offending data or fields. Thus the search for the problems is GLOBAL -- across the entire database -- but the global correction is carried out within a subgroup of records and has no effect on the main database until it has been tried and tested.
Innovative Interfaces has a sophisticated and easy to use sub-system which allows for finding and identifying the set of records which share attributes, but correction in-house has to be done one record at a time. The company will cost out making global corrections for customers. The representative I spoke with seemed to be shocked at the potential for damage which this activity could carry if it were in the hands of the library itself. When the software package can identify problems but cannot fix them, a library is faced with a huge labor-intensive manual project or a custom programming bill. This leads to the frustrating experience of succeeding at manipulating the software and identifying a class or group of records with the "problem," yet having no ability to effect a similarly simple software fix.
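The search, export, correct, and reload cycle described above can be sketched in a few lines. This is only an illustration and does not reproduce how MARC Review or MultiLIS work internally: the dictionary-based record layout and the function names `global_search`, `apply_fix`, and `reload` are assumptions made for the example.

```python
import copy

def global_search(database, matches):
    """Search the whole database; return copies of the matching records."""
    return [copy.deepcopy(rec) for rec in database.values() if matches(rec)]

def apply_fix(subset, fix):
    """Correct the exported subset only; the main database is untouched."""
    return [fix(copy.deepcopy(rec)) for rec in subset]

def reload(database, corrected):
    """Replace the original records once the corrections have been reviewed."""
    for rec in corrected:
        database[rec["id"]] = rec

# Hypothetical two-record database keyed by record id.
database = {
    1: {"id": 1, "tags": ["245", "305"]},
    2: {"id": 2, "tags": ["245", "300"]},
}
subset = global_search(database, lambda rec: "305" in rec["tags"])
corrected = apply_fix(
    subset,
    lambda rec: {**rec, "tags": ["300" if t == "305" else t for t in rec["tags"]]},
)
unchanged_before_reload = database[1]["tags"] == ["245", "305"]
reload(database, corrected)
```

Working on copies is the point of the design: the corrected subset can be sampled and checked, and nothing changes in the main database until `reload` is called.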
Part Three: Global Cleanup examples from the Card Catalog environment
Example 1 -- GMD's in added title entries
In the card environment both Cortland and Hobart & William Smith had decided to add GMD's to all title added entries. In an on-line system, the only title which appears in a results screen is the title from the 245. As such, there is no longer any reason to include GMDs in title added entries. For the sake of consistency, both libraries decided to delete all GMD's from title added entries.
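A rough sketch of such a deletion follows. It assumes a field is held as a list of (subfield code, value) pairs, which is an invented layout for illustration, and it makes no claim about which tags either library treated as title added entries. The helper name `strip_gmd` is hypothetical. Any ISBD punctuation that trailed the GMD is carried back to the preceding subfield.

```python
def strip_gmd(subfields):
    """Drop $h from a title added entry, keeping its trailing punctuation."""
    result = []
    for code, value in subfields:
        if code != "h":
            result.append((code, value))
            continue
        stripped = value.strip()
        core = stripped.rstrip(" :/=.")
        tail = stripped[len(core):]  # punctuation after the GMD, e.g. "." or " :"
        if result and tail:
            prev_code, prev_value = result[-1]
            result[-1] = (prev_code, prev_value + tail)
    return result

entry = [("a", "When the saints go marching in"), ("h", "[microform].")]
cleaned = strip_gmd(entry)
```

The sketch does not check whether the preceding subfield already ends in punctuation, so a sample of the output would still need review before reload.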
Example 2 - Listing alternate locations in "Notes" fields [Hobart and William Smith]
One of the best advantages of an online system is the ability to make a clear distinction between the bibliographic record and the item record. Each serves a different purpose. The bib record is concerned with description of and providing access points to the work. The item record is primarily concerned with providing location(s) and availability. In an online system indicating that a work appears in two or more formats and/or locations in the library is a simple matter of adding an additional item record for each location in which a work resides. In a card environment it was not that simple. We could only list one location in the call number area of the card. If we wanted to list another location, we had to make a "local note" in the bib record. Since these location specific local notes are not necessary any more, the notes were deleted. The fact that consistent language had been used for constructing these notes was a great help in locating and deleting them.
Example 3. Presence of Bibliographies value used to create an index
Cortland established a searchable index for Bibliographies which depended on the existence of a 'b' in position 24 of the 008. In the OPAC there were 137,459 records which satisfied the Document Type=Bibliography search. Unfortunately, there were also 33,225 records which had the character string 'ibliog' in a 504 field and lacked a 'b' in the 008.24.
One year ago, with only manual fixing as an option, it was decided to ignore the bibliography index problem. Today, using MARC Review's Global option the correction has been made.
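The test and the fix can be sketched as follows. This is not MARC Review syntax; the record layout (a 40-character 008 string plus a list of 504 note texts) and the function names are assumptions. As a precaution the sketch only sets the 'b' when position 24 is blank, so that a different code already stored there is left for manual review.

```python
def needs_bibliography_code(record):
    """True if a 504 note mentions a bibliography but 008/24 is not 'b'."""
    has_note = any("ibliog" in text for text in record.get("504", []))
    return has_note and record["008"][24] != "b"

def set_bibliography_code(record):
    """Return a corrected copy; leave the record alone if 008/24 is occupied."""
    fixed = record["008"]
    if not needs_bibliography_code(record) or fixed[24] != " ":
        return record
    return {**record, "008": fixed[:24] + "b" + fixed[25:]}

# Hypothetical record: 18 characters of real 008 data padded to 40 with blanks.
blank_008 = "960707s1996    nyu".ljust(40)
record = {"008": blank_008, "504": ["Includes bibliographical references."]}
fixed = set_bibliography_code(record)
```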
Example 4. - State Government documents value used to create an Index
When profiling its original OPAC Cortland specified an index for State level government documents. It was thought this would allow us to assess the availability of NY publications on a particular topic. Much to our surprise the resulting set included poetry collections and assorted "non-governmental" publications emanating from state university presses. In this case we are bumping into a little known practice of the Library of Congress. All publications of State University Presses are encoded as state government level publications. A search for the character string 'niversity' as a publisher combined with the presence of the s in the government publications identification created a pool of records. This identification followed by the use of MultiLIS replacement rules to remove the "s" in the Fixed Field for Government publications resulted in the index which was originally specified. They still have to deal with the College of William and Mary but there aren't that many State institutions without the word University in the name of the Press.
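In outline, the selection and replacement look like the sketch below. It is not a MultiLIS replacement rule. It assumes the government publication code sits at position 28 of a 40-character 008 (as it does for book records), that publisher names come from 260 $b, and that "removing" the code means replacing it with a blank. The record layout and function names are invented for the example.

```python
def is_university_press_coded_state(record):
    """True if 008/28 is 's' and a publisher name contains 'niversity'."""
    coded_state = record["008"][28] == "s"
    university = any("niversity" in name for name in record.get("260b", []))
    return coded_state and university

def clear_state_code(record):
    """Return a copy with 008/28 blanked for university press records."""
    if not is_university_press_coded_state(record):
        return record
    fixed = record["008"]
    return {**record, "008": fixed[:28] + " " + fixed[29:]}

base_008 = "960707s1996    nyu".ljust(40)
state_008 = base_008[:28] + "s" + base_008[29:]
poetry = {"008": state_008, "260b": ["State University of New York Press,"]}
agency = {"008": state_008, "260b": ["Dept. of Environmental Conservation,"]}
cleared = clear_state_code(poetry)
```

A press without "University" in its name, such as the College of William and Mary, passes through unchanged and still has to be handled separately.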
Example 5. - Missing microform GMD's
SUNY Cortland found that they had 1785 records which were in microtext format but did NOT have $h [microform] in the 245. Using global fix capabilities, they added the $h in the correct place by going through several iterations (the correct place for $h varies with the presence of a $b).
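The placement logic can be sketched as below. This is an illustration under stated assumptions, not the rules Cortland actually ran: the 245 is held as a list of (subfield code, value) pairs, `add_gmd` is a hypothetical name, and the $h is placed after the title proper ($a plus any $n or $p) and before $b or $c, with the ISBD punctuation that used to end the title moved to follow the GMD.

```python
import re

def add_gmd(subfields, gmd="[microform]"):
    """Return a copy of the 245 subfields with $h inserted before $b/$c."""
    result = list(subfields)
    if any(code == "h" for code, _ in result):
        return result  # already has a GMD
    insert_at = 0
    for i, (code, _) in enumerate(result):
        if code in ("a", "n", "p"):
            insert_at = i + 1
    if insert_at == 0:
        return result  # no title proper found; leave for manual review
    code, value = result[insert_at - 1]
    match = re.search(r"\s*([:/=.])$", value)
    if match:
        # Punctuation that closed the title now follows the GMD instead.
        mark = match.group(1)
        result[insert_at - 1] = (code, value[:match.start()])
        gmd = gmd + ("." if mark == "." else " " + mark)
    result.insert(insert_at, ("h", gmd))
    return result

with_b = add_gmd([("a", "Annual report :"), ("b", "fiscal year /"), ("c", "State Dept.")])
without_b = add_gmd([("a", "Annual report /"), ("c", "State Dept.")])
```

Titles ending in an abbreviation or an ellipsis would be mishandled by the final-period rule, which is one reason several iterations and sampling are needed.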
Example 6. - Change 305's to 300's (305's are obsolete collation statements in early sound recording cataloging)
Example 7. - Change 500 with "ibliog" character string to 504
Example 8. - Is the 008 Date 1 the same as the 260 $c?
If not, copy the date from the 260 $c to the 008.
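A minimal sketch of this comparison, assuming Date 1 occupies positions 07-10 of a 40-character 008 and that the 260 $c is available as plain text. The function names are hypothetical. The sketch only acts when the 260 $c contains exactly one distinct four-digit year; anything else (two different years, or an incomplete date such as "19--") is left for a cataloger.

```python
import re

def year_from_260c(text):
    """Return the single four-digit year in a 260 $c, or None if unclear."""
    years = re.findall(r"\d{4}", text)
    return years[0] if len(set(years)) == 1 else None

def date1_mismatch(fixed_008, imprint_date):
    """True if 260 $c gives one clear year that differs from 008/07-10."""
    year = year_from_260c(imprint_date)
    return year is not None and fixed_008[7:11] != year

def copy_date1(fixed_008, imprint_date):
    """Return the 008 with Date 1 replaced by the year from 260 $c."""
    year = year_from_260c(imprint_date)
    if year is None:
        return fixed_008
    return fixed_008[:7] + year + fixed_008[11:]

sample_008 = "960707s1996    nyu".ljust(40)
updated_008 = copy_date1(sample_008, "c1985.")
```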
Example 9. - Delete 222's (Serial record key titles)
Example 10. - Delete 510's
- or, alternatively, delete specific 510 character strings.
(510's are Abstracting and Indexing coverage data for Serial records)
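Examples 6, 7, 9 and 10 are all tag-level rules: retag a field, or delete it. One way to picture them together is the sketch below, which assumes each field is a (tag, text) pair; the names `RETAG`, `DELETE`, and `apply_tag_rules` are invented for illustration and are not the syntax of any of the tools discussed.

```python
RETAG = {"305": "300"}    # Example 6: obsolete collation statement
DELETE = {"222", "510"}   # Examples 9 and 10: key titles, A&I coverage

def apply_tag_rules(fields):
    """Return a new field list with the retag and delete rules applied."""
    result = []
    for tag, text in fields:
        if tag in DELETE:
            continue
        if tag == "500" and "ibliog" in text:
            tag = "504"   # Example 7: bibliography note in a general note
        tag = RETAG.get(tag, tag)
        result.append((tag, text))
    return result

fields = [
    ("222", "Key title"),
    ("305", "2 s. 12 in. 33 1/3 rpm."),
    ("500", "Bibliography: p. 210-215."),
    ("500", "Includes index."),
    ("510", "Indexed by a hypothetical index"),
]
cleaned = apply_tag_rules(fields)
```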
Example 11. - Computer files without a 753
Cortland is receiving an increasing number of computer files and interactive multimedia for the library to catalog. Local policy dictated the addition of 753 for every computer file record. A search of the database produced a small file of records which were identified in position 6 of the 000 as computer files and yet did not have a 753 $a for make and model of the platform associated with the software package.
Search statement: AND 000.6=`m', NOT 753a=` '
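Read as "the record type is 'm' and there is no 753 $a," the same test can be written out as a sketch. The record layout below (a 24-character leader string and a flat list of (tag, subfield code, value) triples) and the function name are assumptions for illustration, not part of MARC Review.

```python
def computer_file_missing_753(record):
    """True if leader/06 is 'm' and no non-empty 753 $a is present."""
    is_computer_file = record["leader"][6] == "m"
    has_753a = any(
        tag == "753" and code == "a" and value.strip()
        for tag, code, value in record["subfields"]
    )
    return is_computer_file and not has_753a

# Hypothetical records; only leader position 06 matters to the test.
software = {"leader": "00000nmm  2200000 a 4500",
            "subfields": [("245", "a", "Tax planner")]}
documented = {"leader": "00000nmm  2200000 a 4500",
              "subfields": [("753", "a", "IBM PC")]}
book = {"leader": "00000nam  2200000 a 4500",
        "subfields": [("245", "a", "Tax planning")]}
flagged = [r for r in (software, documented, book) if computer_file_missing_753(r)]
```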
Part Four - The Difference Between OCLCMARC, USMARC, and "VENDOR"MARC: Or, Fun with Print Constants
Both OCLC and any local system will vehemently insist that they adhere to the USMARC standard, and I am not suggesting that they don't. USMARC is after all, merely a communications standard, designed as a means of transferring bibliographic, authority, or holdings records from one system to another. How any one system, OCLC, MultiLIS, or otherwise, makes use of the data in a MARC record is to an alarming degree completely up to their own design. Indicators used to generate print constants in one system do not necessarily print or display uniformly across systems. When we were in the card catalog environment, we were completely within a single implementation of USMARC -- OCLCMARC. Once we brought up our on-line system we suddenly had to cope with two different implementations of USMARC. We cataloged and often produced a shelflist card in one implementation and the record was displayed in another. Focusing in on print constants led Hobart and William Smith College to the discovery of numerous database cleanup projects.
Example 1: Brackets around GMDs:
In the card catalog environment we did not see the need to add square brackets when they were missing because OCLC would conveniently print them automatically when catalog cards were produced. In our on-line environment, these same records became very confusing. It seemed as if "map," "sound recording," or "videorecording" was part of the title proper. There was no way to tell where the title ended and the GMD began.
At SUNY Albany the head of cataloging was asked to do something about the existence of the title "When the saints go marching in microform." It was a simple matter for Hobart and William Smith to add the brackets around the GMD by using the export replacement rules in MultiLIS. Unfortunately at SUNY Albany, a GEAC Advance library, they have added the brackets only upon display. The bracket has not been added to the record and there is the potential of confronting the entire problem again if they ever migrate to another system.
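Writing the brackets into the record itself, rather than supplying them at display time, amounts to a small string rule on the 245 $h. The sketch below is an assumption-laden illustration, not a MultiLIS export rule, and `bracket_gmd` is a hypothetical name. It keeps any trailing ISBD punctuation outside the brackets and leaves an already bracketed GMD unchanged.

```python
def bracket_gmd(value):
    """Wrap a 245 $h value in square brackets, preserving end punctuation."""
    stripped = value.strip()
    core = stripped.rstrip(" :/=.")
    tail = stripped[len(core):]   # e.g. " /", " :" or "."
    if not core:
        return value              # nothing to bracket
    return "[" + core.strip("[] ") + "]" + tail

examples = [bracket_gmd(v) for v in ("microform", "sound recording /", "[map] :")]
```

Because the rule is idempotent, it can safely be run over every record with a $h rather than only over those known to lack brackets.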
Example 2 : 78x Linking entry translation
Linking title added entries in serial records: The 780 and 785 fields are used for title added entries for preceding and succeeding titles in serial records. The second indicator of these fields is used to automatically generate a note. MultiLIS is able to use the second indicator to display most of the print constants, but some of them are not displayed properly. The single note "Formed by the union of X and Y" will display in MultiLIS as two separate notes: "Formed by the union of X" and "Formed by the union of Y." The single note "Split into X and Y" will display in MultiLIS as two separate notes: "Split into X" and "Split into Y." And the single note "Merged with X to form Y" will display as two separate notes: "Merger of X" and "Merger of Y." The MultiLIS version of these print constants is confusing to users. We were able to go into these records and change the first indicator of the 780 and 785 fields in question from 0 to 1.
This suppressed the misleading notes. At the same time, we were able to generate 580 notes with the correct syntax of the note explicitly stated.
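The two steps, suppressing the generated note and writing an explicit 580, can be sketched together. The layout (each linking field as a (tag, first indicator, second indicator, title) tuple) and all names are assumptions. The second-indicator values used are the standard ones: 780 with 4 for "formed by the union," 785 with 6 for "split into," and 785 with 7 for "merged with ... to form."

```python
def join_titles(titles):
    """Join titles as 'X and Y' or 'X, Y and Z'."""
    if len(titles) == 1:
        return titles[0]
    return ", ".join(titles[:-1]) + " and " + titles[-1]

NOTE_BUILDERS = {
    ("780", "4"): lambda t: f"Formed by the union of {join_titles(t)}.",
    ("785", "6"): lambda t: f"Split into {join_titles(t)}.",
    ("785", "7"): lambda t: f"Merged with {join_titles(t[:-1])} to form {t[-1]}.",
}

def fix_linking_notes(fields):
    """Set ind1 to 1 on multi-title link groups and append one 580 per group."""
    grouped = {}
    for tag, ind1, ind2, title in fields:
        if (tag, ind2) in NOTE_BUILDERS and ind1 == "0":
            grouped.setdefault((tag, ind2), []).append(title)
    result = []
    for tag, ind1, ind2, title in fields:
        if ind1 == "0" and len(grouped.get((tag, ind2), [])) >= 2:
            ind1 = "1"  # do not display the system-generated note
        result.append((tag, ind1, ind2, title))
    for key, titles in grouped.items():
        if len(titles) >= 2:
            # Appended at the end for simplicity, not sorted into tag order.
            result.append(("580", " ", " ", NOTE_BUILDERS[key](titles)))
    return result

split = fix_linking_notes([("785", "0", "6", "Journal A"), ("785", "0", "6", "Journal B")])
```

The merge case assumes the last 785 in the group names the resulting title, which should be confirmed against a sample of records.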
Part Five: Resident Auto Review Search Programs In MARCREVIEW
MARCReview was originally designed to serve as a quality filter for new records entering or about to enter a database. A library could run the software over the new cataloging records in an export file before loading them into an OPAC. In January 1995, Cortland received an export file including every MARC record in their database. At that time it was decided to use this software as a way to define and list problems in the database which would be addressed through database maintenance activities.
These resident auto review programs are bundled with the MARC Review software to check for the presence and correct use of certain MARC coding and to catch certain types of coding errors. Although these reviews are designed to be used against the mini-environment of daily work, they can be highly revealing if applied to an entire database.
1. Auto Review -- non-filing character code indicators [All formats].
The utility of this program is immediate if the OPAC loads using the filing indicators. It is not immediately relevant if the OPAC is using tables which match the initial articles.
Once again a library has to decide whether or not fixing an error which has no immediate consequence should be done or not. There may be a serious consequence of ignoring the problem in either a future migration or in a future combination into a regional consortium OPAC.
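The kind of check such a review performs can be approximated as follows. This is not the MARC Review implementation: it is a sketch that assumes English-language titles only, compares the second indicator of the 245 with the length of a leading article plus its space, and uses invented function names.

```python
ARTICLES = ("the ", "an ", "a ")  # English only; an assumption for the sketch

def expected_nonfiling(title):
    """Number of characters to skip in filing, based on a leading article."""
    lowered = title.lower()
    for article in ARTICLES:
        if lowered.startswith(article):
            return len(article)
    return 0

def nonfiling_mismatch(ind2, title):
    """True if the coded indicator differs from the expected value."""
    return str(expected_nonfiling(title)) != ind2

suspect = nonfiling_mismatch("0", "The race is not to the swift")
```

A title such as "A is for apple," where the first word is not an article, will be flagged wrongly, so the output is a list for review rather than a list of certain errors.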
2. Auto review which checks to see if conditions are being met for indexed data [10 reviews, all formats]
3. Auto Review -- Missing elements [7 reviews -- All formats]
4. Auto Review -- Book Records only
examples:
-CIP record that was not upgraded
-if 008 has bibliography coded in Cont, is there also a bibliography note, or vice versa?
-if 008 has index code is there an index note, or vice versa?
-if 008 has Illustration code, is there illustration data in 300?
Part Six: Playing around in the database, cataloger curiosity:
Missing Statements of Responsibility
Our first independently constructed search of our database using MARC Review software was designed to assess the number of records with a personal author main entry that lacked a statement of responsibility. There were 60,487 records without a statement of responsibility.
Example of MARCReview Search statement:
AND 100=` ', NOT 245c=` '
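Interpreting that statement as "a 100 field is present and the 245 has no $c," the equivalent test can be sketched in a few lines. The record layout (a dictionary mapping each tag to a list of (subfield code, value) pairs) and the function name are assumptions for illustration only.

```python
def lacks_statement_of_responsibility(record):
    """True if the record has a 100 field but its 245 has no $c."""
    has_personal_author = "100" in record
    title_codes = [code for code, _ in record.get("245", [])]
    return has_personal_author and "c" not in title_codes

# Hypothetical records for illustration.
incomplete = {"100": [("a", "Author, Anne.")], "245": [("a", "A title.")]}
complete = {"100": [("a", "Author, Anne.")],
            "245": [("a", "A title /"), ("c", "Anne Author.")]}
title_entry = {"245": [("a", "A title.")]}
count = sum(lacks_statement_of_responsibility(r)
            for r in (incomplete, complete, title_entry))
```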
Missing Call Numbers:
Out of close to 300,000 records there were 150 records without a call number.
These 150 records were of two types: 32% had no call number in the bib record but had a call number in the item record, and 68% had no call number whatsoever. The 32% with a call number in the item record could easily be fixed by adding an 090 field in the bib record. The 68% with no call number whatsoever are a little more tricky. If we could find the books at all, we could transcribe their spine labels into an 090 field, but finding them on the shelves might be too much bother considering the small number of records with this problem.
Who did the cataloging that built the SUNY Cortland OPAC?
1. 7% of the records were original cataloging
2. 35% of the records were contributed cataloging
3. 62% of the records were LC copy
How many records were AACR2?
31% of our records are AACR2.
Part Seven: Serials at SUNY Cortland
How many latest entry serial records were in SUNY Cortland's OPAC?
In theory Cortland had never accepted a latest entry record. In fact there were 127 records in the database that were coded for latest entry cataloging.
Serials and choice of main entry at SUNY Cortland
Another local practice applied during serial recon was to correct records which had a corporate main entry that did not satisfy the requirements of AACR2 21.1B2. Cortland converted the 110 to a 710 and adjusted the 245 tag accordingly. They had also reshelved the entire alphabetically organized periodical collection to agree with title main entries. There were 721 titles identified. Each would in fact have to be examined by a librarian to decide whether the title would satisfy the AACR2 21.1B2 rules for entry under corporate bodies. The first ten examined needed to be shifted to 245 main entry, as this is how they are shelved and this is how they would be cataloged under strict AACR2.
Working with this type of software can provide librarians with a unique opportunity to examine their database for its "compliance" with national standards and with internally specified local policy and practice. One is able to specify a search and view the resulting set of records that match the criteria within hours. This is a particularly satisfying experience for people who like cataloging. Speaking for all of us, I can say that these activities were more "fun" than routine cataloging activities and the daily managing of workflow usually provide.
When I gave a presentation a year ago on this topic I subtitled the paper "The Race is not to the swift" (Ecclesiastes 9:11). This has been true for automation in libraries. Those hardy folk who jumped into early applications and early recon had a great time and were excited about the potential of the new toys. I am sure that database cleanup after recon is particularly onerous for those who did very early recon projects in the late 70's or early 80's. SUNY Cortland began using OCLC in 1974 for daily cataloging, and did the major retrospective conversion pass through the shelf list in 1981-1983. In 1983 there were only 10,000,000 records in the OCLC database and the Enhance program was just beginning. Unfortunately, Cortland was in the position of not seeing the records online for another decade. It is true in any learning situation that you need feedback in order to adjust your work habits and optimize your efforts. Database analyses based on MARC can be a major dose of feedback.
Overview of a project to analyze a database.
1. Define the problems, based on user needs and future migration.
2. Structure the searches, lots of trial and error.
3. Organize results by:
Serves the user
Serves future migration
4. Present results to all Librarians, Public Services and Technical Services
5. Prioritize.
Just as we once sought budget money for recon, we may now be seeking budget money for post recon cleanup. The justification for funding this activity is simple: Patron access to a clearly defined set of records following a properly structured search is a RIGHT not a privilege.
Global Control of Database Cleanup. by Johanna Bowen, Jonathan Jiras
and David Ritchie (c)1996
<jobowen@cabrillo.edu><jiras@hws.edu><ritchie@snycorva.cortland.edu>
Part One:Historical context of Recon activities
Part Two: Global searching
Combine knowledge of MARC standards
with,
Software search tool
to produce,
Set of records which meet the specifications
Part Three: Global cleanup examples
Example 1 -- GMD's in added title entries
Example 2 - Listing alternate locations in "Notes" fields
Example 3. - Presence of Bibliographies value used to create an Index:
Part 4 - The Difference Between OCLCMARC, USMARC, and "VENDOR"MARC: Or, Fun with Print Constants
Part 7. Serials --