QIIME 2 archives

One of the first things that new QIIME 2 users often notice is the .qza and .qzv files that QIIME 2 uses. All files generated by QIIME 2 are either .qza or .qzv files. These are simply zip files that store your data alongside some QIIME 2-specific metadata. You can unzip them with any typical unzip utility, such as WinZip, 7-Zip, or unzip, and you don’t need to have QIIME 2 installed to do that.

For example, let’s download a .qza file and take a quick look.

curl -sL \
  "https://docs.qiime2.org/2020.11/data/tutorials/moving-pictures/rep-seqs.qza" > \
  "sequences.qza"
unzip sequences.qza

Notice that we haven’t used any QIIME 2 commands so far. We downloaded a file, and then unzipped it as we would any zip file. Allowing users to access their data without QIIME 2 was one of the earliest design goals of the system. This ensures that if QIIME 2 isn’t available to you for some reason, you can still access any data that you generated with QIIME 2.

If you look through the list of files that were created by the unzip command above, you’ll see there is a top-level directory in the output with a crazy-looking name. This directory contains copies of all of the files and directories from sequences.qza. Within this crazy-named directory, the data directory contains a single file, sequences.fasta, which contains (you guessed it!) sequence data in fasta format. If, for example, you’re interested in getting your sequence data out of QIIME 2 to analyze it with another program, you can unzip your .qza file and use the sequences.fasta file for whatever you need to do with it.

You might wonder why we bothered with having QIIME 2 create these zip files in the first place, rather than just have it use the typical file formats like fasta, newick, biom, and so on. That has to do with the other information stored in the zip file. The other files in the zip file are not intended to be viewed by a human, and you don’t need any of them to work with the file (or files) in the data directory. QIIME 2 uses the information in the provenance directory to record data provenance, helping you to ensure that your bioinformatics work will be reproducible. It also stores a unique identifier for the data in the metadata.yaml file, which facilitates data management. That unique identifier is also the name of the directory that contains all of the files from the .qza when you unzip it. The metadata.yaml file also identifies the semantic type of the data (we’ll come back to that shortly). This other information empowers your bioinformatics work in ways that will become clearer as you get further along in your learning. We’ll revisit this topic throughout the book.

The .qza file extension is an abbreviation for QIIME Zipped Artifact, and the .qzv file extension is an abbreviation for QIIME Zipped Visualization. .qza files (which are often simply referred to as artifacts) are intermediary files in a QIIME 2 analysis, usually containing raw data of some sort. These files are generated by QIIME 2 and are intended to be consumed by QIIME 2. .qzv files (which are often simply referred to as visualizations) are terminal results in a QIIME 2 analysis, such as an interactive figure or the results of a statistical test. These files are generated by QIIME 2 and are intended to be consumed by a human.

QIIME 2 also provides some of its own utilities for getting data out of .qza and .qzv files. If you’re working with the QIIME 2 command-line interface (which we’ll use a lot in this book), the most relevant command is qiime tools export. To run it on the .qza file we downloaded above, you would use the following command:

qiime tools export --input-path sequences.qza --output-path exported-sequences/

This command will unzip the archive, take all of the files from the data directory, and place them in exported-sequences. Thus, if you do have QIIME 2 installed, you can use this command to get your data out of a QIIME 2 artifact without all of the QIIME 2-specific metadata.

Jargon: Confused by the term “artifact”?

It has been brought to our attention that the term artifact can be confusing, since it is often used in science to indicate a feature that is not present naturally in a system but rather observed as a result of some technical aspect of studying that system. For example, homopolymer runs such as the As in ACTGTACTAAAAAAAAAAATGCACGTGAC were commonly reported by some early sequencing instruments to be longer than they were in nature due to the way the sequencing reaction worked. In QIIME 2 we use the definition of an artifact as an object that was created by some process, like an archaeological artifact. This is common in data science, and we didn’t realize the potential for confusion until we were a little too far along to easily change the name.