Data Preservation and Migration

TalkBank

Data Preservation

TalkBank has a preservation policy based on backups in CMU Cloud and long-term preservation by the University.
Transcript data are backed up through github.com repositories. The Git repositories are stored both on several local machines, from which commits are managed, and in a master repository on an Ubuntu Linux server in the CMU Cloud facility. The TalkBankDB facility also stores time-stamped versions of the database.
Media data are on a local machine with four 5TB local disks, along with backup disks for each local disk. Whenever new data are ingested, the sync-all.sh script sends copies of new materials to synchronize with a master copy in the CMU Cloud Plus facility.
Media data can be recovered from either the local backups or the CMU Cloud Plus backups.
In addition to transcript and media data, we include a variety of documentation files inside the transcript databases. These include Excel .xlsx files, Word .docx files, Adobe PDF files, and various image files.
Transcript and documentation data can be recovered from the git repositories.
Risk management aims to minimize the possibility of either complete data loss through disk failure or hacking, or partial data loss through system error. The former is addressed by keeping multiple image copies, and the latter by running ChronoSync comparisons between the image copies and the current archive.
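As an illustration of what a comparison between an image copy and the current archive can detect, here is a minimal Python sketch. It is not how ChronoSync works internally; it simply walks two directory trees and compares SHA-256 checksums. The function names and the report keys are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(archive: Path, image_copy: Path) -> dict[str, list[str]]:
    """Report files that are missing from, extra in, or altered in the copy."""
    a, b = snapshot(archive), snapshot(image_copy)
    return {
        "missing_from_copy": sorted(a.keys() - b.keys()),
        "only_in_copy": sorted(b.keys() - a.keys()),
        "content_differs": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

Calling compare_copies with the path of the current archive and the path of a mounted image copy returns three lists. Three empty lists mean the two trees hold the same files with identical content.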
One image copy is kept offsite, one in another University building, and one in another part of Baker Hall. All are under lock and key.
Because the storage media are hard drives, deterioration means disk failure. If one drive fails, we can restore the data from one of the three remaining complete copies or from CMU Campus Cloud. Because these local drives are accessed only during the copying process, they undergo little wear and tear and are unlikely to fail. The chances of all four data storage methods failing at once are extremely low, barring catastrophes impacting the entire city of Pittsburgh. In that case, copies of much of the data would still be preserved in Nijmegen and throughout European CLARIN centers. In the event of a full-scale nuclear war involving several continents, it is possible that all of the data would be lost.
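The reasoning about simultaneous failure can be made concrete with a small calculation. The sketch below assumes that the copies fail independently, and the 5% annual failure probability is a purely hypothetical figure chosen for illustration, not a measured value for these drives.

```python
import math


def joint_failure_probability(per_copy_probabilities):
    """Probability that every copy fails, assuming the failures are independent."""
    for p in per_copy_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return math.prod(per_copy_probabilities)


# Hypothetical: each of four copies has a 5% chance of failing in a given year.
print(joint_failure_probability([0.05] * 4))
```

Under these assumptions the chance of losing all four copies in the same year is roughly six in a million. Independence is the key assumption: a city-wide catastrophe would affect several copies at once, which is why geographically separate copies matter.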

Data Migration and Compatibility

Our basic file format relies on text-only Unicode files. We expect only minor changes in this format over time. However, the CHAT coding system occasionally undergoes changes. To guarantee preservation of the data on this level, we use the Chatter program to make sure that the CHAT files can be round-tripped from CHAT to XML and back without changes.
Obsolescence of media files is a more difficult problem. For audio, we maintain both MP3 and WAV formats, in the hope that the latter can be converted without loss to any new popular formats. However, some corpora come with only MP3 files. For video, we have stored raw video for some corpora, but for others we only have the resources to store compressed versions. For those, we focus on making sure that everything is in H.264 format.
The transcript files will be usable in their current format as long as computers can read text files and Unicode. We have developed programs that convert CHAT files, when necessary, to six other current file formats, but we rely on CHAT format as the current standard in the field.
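The round-trip check described above can be sketched in a few lines of Python. This is not the Chatter program or its interface: the two converters are passed in as functions, and toy_to_xml and toy_from_xml are hypothetical stand-ins that only wrap and escape text so that the example can run on its own.

```python
from typing import Callable, Iterable
from xml.sax.saxutils import escape, unescape


def failed_round_trips(
    documents: Iterable[tuple[str, str]],
    to_xml: Callable[[str], str],
    from_xml: Callable[[str], str],
) -> list[str]:
    """Return the names of documents that change after CHAT -> XML -> CHAT."""
    failures = []
    for name, chat_text in documents:
        if from_xml(to_xml(chat_text)) != chat_text:
            failures.append(name)
    return failures


def toy_to_xml(chat_text: str) -> str:
    """Hypothetical stand-in converter: escape the text and wrap it in one element."""
    return "<utterance>" + escape(chat_text) + "</utterance>"


def toy_from_xml(xml_text: str) -> str:
    """Hypothetical stand-in converter: unwrap the element and unescape the text."""
    inner = xml_text.removeprefix("<utterance>").removesuffix("</utterance>")
    return unescape(inner)


sample = [("sample.cha", "*CHI:\tmore cookie & milk .")]
print(failed_round_trips(sample, toy_to_xml, toy_from_xml))
```

An empty list means every document survived the round trip unchanged. Comparing the full text, rather than a parsed form, is deliberate: it catches even whitespace-level changes introduced by a converter.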

Responses to Core Trust Seal (CTS) queries

Does the repository have a documented approach to preservation? Response: Yes, it is given on this web page.
Is the level of responsibility for the preservation of each item understood? How is this defined? Response: We assume responsibility for preservation of all items in TalkBank, including transcripts, media, and documentation files.
Are plans related to future migrations or similar measures to address the threat of obsolescence in place? Response: We do not think that .txt or .wav files will become obsolete. If the MP4 video format becomes obsolete, we will convert our video files to a new format. We have done this in previous decades for the .mov, .avi, and .mpeg video formats.
Does the contract between depositor and repository provide for all actions necessary to meet the responsibilities? Response: We assume responsibility for preservation and migration.
Is the transfer of custody and responsibility handover clear to the depositor and repository? Response: Yes, it is clear. If a depositor wishes to remove data, we can de-accession it.
Does the repository have the rights to copy, transform, and store the items, as well as provide access to them? Response: Yes, this is a fundamental requirement for keeping the TalkBank databases in the best possible condition. Earlier versions of transcripts or the whole database can be retrieved from TalkBankDB.
Are actions relevant to preservation specified in documentation, including custody transfer, submission information standards, and archival information standards? Response: Yes, the standards are documented in the CHAT manual. Contribution procedures are documented at https://talkbank.org/share/contributing.html
Are there measures to ensure these actions are taken? Response: Yes, our curation software requires that all standards be met.