Governance is an important means of assessing the long-term viability of a repository and the data it contains. Non-profit organizations are governed by a board of directors or similar body that has a fiduciary responsibility to the organization and its mission. As part of this responsibility, the board must take the long view and keep in mind the interests of the community it serves. A repository overseen by a commercial entity, no matter how community-minded, must necessarily concern itself with profit-making, which may result in greater submitter and/or end-user costs or the cessation of the service should it not be profitable.
The source of funding for the repository is another way of evaluating its durability. Government funding is subject to congressional appropriations, which may fluctuate based on the priorities of the prevailing political party and the general well-being of the economy. Repositories under the auspices of large governmental agencies (e.g., the National Institutes of Health), however, are not likely to fold. Repositories funded through foundation gifts are susceptible to gaps between grants or even the suspension of funding altogether. Organizations with several sources of foundation funding (e.g., the Center for Open Science), however, are more likely to persist.
Understanding the technical and financial means a repository employs for data backup is a critical assessment criterion. A repository that explicitly states how it provides durability and geographic dispersion, and the funding arrangement for such services, is likely to take great care in executing this important activity. It is also worth investigating the organization’s plan for data retrieval by owners should the repository be decommissioned or the organization lose its funding.
Some repositories are 100% free and open access, while others employ a tiered fee structure based on the size of the files in your dataset. A cost-free option is often available for datasets up to a certain size, with additional paid tiers past certain storage benchmarks. It’s important to consider the potential future growth of your dataset, in case an increase in size incurs additional storage costs. Some services also offer a free public storage option alongside a paid private storage option.
A repository’s support services may include FAQs, documentation on the repository, instructional videos, and staff who are available to answer questions. A repository with limited user support can present significant challenges to depositing and accessing data. This criterion is especially important for collaborative projects, where multiple users need to deposit and access data and may need both coordination and detailed knowledge of the repository.
Most repositories will accept a wide variety of file formats, but that is no guarantee that your research data will be accepted. Check the FAQ and other documentation for the repository in question to make sure your data is properly formatted.
Portability of data is also important. Research data is sometimes collected and recorded in paid software or other proprietary tools. If you use specialized programs in your work, it is very helpful to other researchers if you also make that data available in a standard and portable format. Example formats include comma-separated values (CSV) text files, XML files, PDF, and raw text.
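As a minimal sketch of exporting to a portable format, the Python snippet below writes a small set of records to CSV text using only the standard library. The function name records_to_csv, the column names, and the sample values are hypothetical and chosen for illustration; real data would first be read out of whatever specialized program you use.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    # lineterminator="\n" keeps line endings the same on every platform.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical measurements standing in for data exported from a proprietary tool.
sample = [
    {"sample_id": "A1", "temperature_c": 21.5},
    {"sample_id": "A2", "temperature_c": 22.0},
]
csv_text = records_to_csv(sample, ["sample_id", "temperature_c"])
```

The resulting text can be saved with a .csv extension and opened in any spreadsheet or statistics package, which is the point of choosing a portable format.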
The size of individual files in your dataset and the total size of your data can determine a repository’s suitability. For example, a repository may not have a limit on how much data you deposit, but it may have a limit on the individual file sizes you upload. If your data consists of large image files over 1 GB each, you may exceed a repository’s file upload limit. Alternatively, some repositories that accept large file sizes may require coordination with campus IT due to bandwidth considerations.
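Before choosing a repository, it can help to measure your dataset against its stated limits. The sketch below assumes a hypothetical per-file limit passed in as max_bytes; the function names are illustrative, and the actual limit should come from the repository's own documentation.

```python
from pathlib import Path


def files_over_limit(dataset_dir, max_bytes):
    """Return (relative path, size in bytes) for each file larger than max_bytes."""
    root = Path(dataset_dir)
    oversized = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > max_bytes:
                oversized.append((path.relative_to(root).as_posix(), size))
    return oversized


def total_size(dataset_dir):
    """Return the combined size in bytes of every file under dataset_dir."""
    return sum(p.stat().st_size for p in Path(dataset_dir).rglob("*") if p.is_file())
```

For example, files_over_limit("my_dataset", 1024**3) would list any files larger than 1 GiB in a folder named my_dataset (a placeholder path), and total_size("my_dataset") gives the figure to compare against any overall storage tier.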
Research projects typically generate several versions of data. Versioning, the practice of tracking changes to files, is essential if your research project involves multiple collaborators. Additionally, if your published data is ever updated, it is important that you and future researchers are able to cite and access the actual dataset that supports your or their research.
Many scientific findings have come into question or been disproven altogether in recent years because they could not be replicated. This “reproducibility crisis” has compelled an increasing number of journal publishers and foundations to mandate that scholars make replication files publicly available so that third-party researchers can validate findings. To replicate computational or statistical findings, the repository must allow you to maintain directory structures when submitting your data, in exactly the arrangement your code needs to call the data, analyze it, and produce outputs (e.g., tables). To the degree that the ability to reproduce your work in this way is important to you, your funder, or your publisher, you should seek a repository that can maintain folder structures, either through the creation of folders in the host repository or through a .zip file with its internal directory structure intact.
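One way to keep a directory structure intact is to package the project as a .zip file whose entries are stored with their relative paths. The sketch below uses Python's standard zipfile module; the function name zip_with_structure is illustrative, and it assumes the archive is written outside the project folder so that it does not try to include itself.

```python
import zipfile
from pathlib import Path


def zip_with_structure(project_dir, archive_path):
    """Zip every file under project_dir, keeping paths relative to project_dir.

    Returns the list of entry names stored in the archive.
    """
    root = Path(project_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # arcname stores "data/raw.csv" rather than an absolute path.
                archive.write(path, arcname=path.relative_to(root).as_posix())
    # Reopen the archive to confirm what was actually stored.
    with zipfile.ZipFile(archive_path) as archive:
        return archive.namelist()
```

Checking the returned list against the paths your analysis scripts expect is a quick way to confirm that someone who unzips the archive can run the code without rearranging files.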
Metadata is structured information that can facilitate the discovery, understanding, and reuse of your data. Examples of dataset metadata include the name of the principal investigator of the research project, the date when the data was captured, and the license associated with the data. When selecting a repository, check to see if it uses metadata standards that have been widely adopted, especially by your discipline. While it is preferable to use metadata standards where possible, also check whether the repository allows you to customize metadata fields to suit your needs.
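To make the idea of structured metadata concrete, the sketch below builds a small record from the example fields mentioned above and checks that none is empty. The field names and values are placeholders invented for illustration and do not follow any particular metadata standard; a real deposit should use the schema the repository specifies.

```python
import json

# Hypothetical field names based on the examples in the text, not a formal standard.
REQUIRED_FIELDS = ("title", "principal_investigator", "date_captured", "license")


def missing_fields(record):
    """Return the required fields that are absent or empty in record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Placeholder values for illustration only.
record = {
    "title": "Example dataset",
    "principal_investigator": "Jane Doe",
    "date_captured": "2020-01-15",
    "license": "CC-BY-4.0",
}

# A machine-readable form that could accompany the data files.
metadata_json = json.dumps(record, indent=2, sort_keys=True)
```

Keeping a file like this alongside the data means the same information can be entered into the repository's own metadata form and still travels with the dataset if it is moved later.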
One of the primary reasons for adding your data to a repository is to make your research widely available and citable. A unique and stable identifier like a DOI (digital object identifier) or handle makes your data more visible by providing a citable reference to the material, and enables other researchers to access your data long after you have published it.
If multiple collaborators will be working with your data, consider data repositories that allow you to specify different roles and permissions. Some collaborators might require administrator permissions, while others might require only submission or read-only access permissions. Also check whether the data repository has a review process whereby new submissions or changes to existing content must be approved.
There are cases where you should or might want to restrict access to your data. For example, your data might include personally identifiable information, protected health information, or other sensitive information; your publisher might set an embargo on your data; or your data might be a work in progress and not ready to be shared. For information about managing data that involves human subjects, contact Haverford's Institutional Review Board.