A file format is a standardized way of encoding data so that a software program can read and interpret them. As technology changes over time, some hardware and software become unsupported or obsolete, and files in formats that depend on them can become unreadable. During your research project you may use different formats to collect, analyze, and share your data. Even if some of your working formats aren't ideal for long-term preservation and accessibility, keep copies of them for data reproducibility.
Some things to keep in mind when choosing file formats:
- Privilege open, non-proprietary formats when possible.
- Privilege lossless formats when possible.
- Privilege unencrypted formats when possible (exception: sensitive data).
- Consider formats preferred by your academic discipline or subject domain.
- Consider formats best suited for data creation.
- Consider formats best suited for data manipulation and analysis.
- Consider formats best suited for conversion to other formats.
Some preferred file formats (source: Stanford Libraries):
- Containers: TAR, GZIP, ZIP
- Databases: XML, CSV
- Geospatial: SHP, DBF, GeoTIFF, NetCDF
- Moving images: MOV, MPEG, AVI, MXF
- Sounds: WAVE, AIFF, MP3, MXF
- Statistics: ASCII, DTA, POR, SAS, SAV
- Still images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
- Tabular data: CSV
- Text: XML, PDF/A, HTML, ASCII, UTF-8
- Web archive: WARC
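As a rough illustration of how the list above might be used in practice, the following sketch flags files whose extension is not associated with a preferred format. The extension table is a hypothetical, partial mapping: the guide names formats, not extensions, so the extensions shown are common conventions chosen for this example, and the function names are likewise invented here.

```python
from pathlib import Path

# Hypothetical, partial mapping from file extension to a format name in the
# list above. The extensions are common conventions, not part of the guide.
PREFERRED_EXTENSIONS = {
    ".tar": "TAR",
    ".zip": "ZIP",
    ".csv": "CSV",
    ".xml": "XML",
    ".tif": "TIFF",
    ".tiff": "TIFF",
    ".png": "PNG",
    ".wav": "WAVE",
    ".warc": "WARC",
}


def preferred_format(filename):
    """Return the preferred-format name for a file, or None if its extension is not in the table."""
    suffix = Path(filename).suffix.lower()
    return PREFERRED_EXTENSIONS.get(suffix)


def needs_review(filenames):
    """Return the files whose extension is not in the table, in their original order."""
    return [name for name in filenames if preferred_format(name) is None]
```

A file returned by `needs_review` is not necessarily a problem. It is only a prompt to check whether a copy in a preferred format should be kept alongside the working copy.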
A file naming convention is a set of agreed-upon rules for naming files. Consistently structured file names help you automate your workflows, find your files, and keep track of different versions of your files. Documentation about your file naming convention should include the rules, any codes or abbreviations used, and examples.
Some things to keep in mind when creating a file naming convention:
- All file names should be unique.
- Keep file names as short as possible.
- Include descriptive information in the file name such as the experiment name, researcher name, or date the experiment was run. Consider using acronyms or codes for some of the descriptive information.
- File names should be consistently structured. Each file name should contain the same information in the same defined sequence.
- The file naming convention should provide a logical sequence that can easily be identified and reproduced.
- The more complicated the file naming convention, the more susceptible the file name is to human error during manual input.
- Think long term: How well will this file naming convention scale as more data are collected? Is any component of the convention likely to change over time?
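One way to keep names consistently structured is to generate them from their components rather than typing them by hand. The sketch below is a minimal example under assumed conventions: the field order (project code, researcher initials, date, sequence number), the underscore separator, the three-digit sequence, and the function name `build_filename` are all illustrative choices, not rules from this guide.

```python
from datetime import date


def build_filename(project, researcher, run_date, sequence, extension):
    """Join the components in one fixed order: project, researcher, date, sequence."""
    parts = [
        project,                        # short project or experiment code
        researcher,                     # researcher initials
        run_date.strftime("%Y%m%d"),    # date written as YYYYMMDD
        f"{sequence:03d}",              # sequence number padded to three digits
    ]
    return "_".join(parts) + "." + extension


name = build_filename("SOIL", "JD", date(2024, 3, 5), 7, "csv")
# name is "SOIL_JD_20240305_007.csv"
```

Because every name comes from the same function, the sequence of fields cannot drift between files, and the codes used ("SOIL", "JD") are the only parts that need to be explained in the documentation.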
Additional file naming tips:
- Restrict characters to letters of the Latin alphabet (a-z, A-Z); Arabic numerals (0-9); and underscore (_).
- Limit character length to no more than 32 characters.
- Avoid using blank spaces between characters.
- When using a sequential numbering convention, use leading zeros (e.g., "001, 002, ... 010, 011, ... 100, 101" instead of "1, 2, ... 10, 11, ... 100, 101") so that files sort in numeric order. Take into account the maximum number of digital objects that will be part of the collection and reflect that in the number of digits used.
- When including dates, format them according to the ISO 8601 standard: YYYYMMDD, YYYYMM, or YYYY.
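The tips above can be checked mechanically. The following sketch is one possible checker, not a definitive implementation. It assumes the 32-character limit and the character restriction apply to the name without its extension, which this guide does not specify. The helper names `check_name` and `padded` are invented for this example.

```python
import re

MAX_LENGTH = 32                           # limit from the tips above
ALLOWED = re.compile(r"[A-Za-z0-9_]+")    # letters, digits, and underscore only


def check_name(stem):
    """Return a list of problems found in a file name (given without its extension)."""
    problems = []
    if len(stem) > MAX_LENGTH:
        problems.append("too long")
    if " " in stem:
        problems.append("contains spaces")
    if not ALLOWED.fullmatch(stem):
        problems.append("disallowed characters")
    return problems


def padded(number, max_items):
    """Zero-pad a sequence number to as many digits as max_items has."""
    width = len(str(max_items))
    return str(number).zfill(width)
```

For example, `check_name("soil data (final)")` reports both the spaces and the disallowed characters, while `check_name("SOIL_JD_20240305_007")` returns an empty list. Padding matters because file names sort as text: `sorted(["10", "2", "1"])` gives `["1", "10", "2"]`, whereas the padded forms `"001"`, `"002"`, `"010"` sort in numeric order.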