Sequencing run QC¶
Get some sequence data¶
The first thing we'll do is to get some sequence data to work with. If you are working on a new sequencing project the data might instead come directly from a sequencing facility. For this tutorial we'll work with data that is available in public databases.
Published sequence data is usually archived in the NCBI SRA, the ENA, and the DDBJ. Each of these databases provide convenient search interfaces, which can be used to find samples by keyword and sample type (e.g. shotgun metagenome). The data-sets we'll use in the following exercises (SRA accessions SRR9323808, SRR9323810, SRR9323811, SRR9323809) are small in size and therefore quick to process.
With an accession number in hand, the easiest way to obtain the related sequencing data from the databases is via the fastq-dump
software.
First, let's create a new conda environment, install Python along with the SRA Toolkit, and then activate it. To do this, start a terminal session in your Jupyter server (click the Terminal icon) and run the following commands (it's OK to copy and paste):
Setup an initial set of tools
```bash
# fast download from github
git clone https://github.com/rvalieris/parallel-fastq-dump.git

# create a new conda environment and activate it
conda create -y -n workshop "python<3" sra-tools
conda activate workshop
```
Info
We've just created a new conda environment and activated it. Your prompt will now reflect this fact with the prefix `(workshop)`.
Now that `fastq-dump` is installed (along with the `parallel-fastq-dump` wrapper), we can use it to download data by accession number. Copy and paste the following into your terminal:
Download the sequencing data for our data-sets
```bash
parallel-fastq-dump/parallel-fastq-dump -s SRR9323808 -s SRR9323810 -s SRR9323811 -s SRR9323809 \
    --threads 4 --split-files --gzip --outdir qc_data/
```
If the download was successful, a new subfolder `qc_data` should exist, and within it you should find the fastq files. Using your terminal session, try listing the contents of the subfolder (`ls qc_data`). You should see something like the following:
You can check that the fastq files exist
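Because we used `--split-files`, each accession yields a pair of files named `<accession>_1.fastq.gz` and `<accession>_2.fastq.gz`. As a quick sketch, this loop prints the file names you should expect to see in the listing:

```shell
# print the paired-end file names that --split-files produces for each accession
for srr in SRR9323808 SRR9323809 SRR9323810 SRR9323811
do
    echo "${srr}_1.fastq.gz ${srr}_2.fastq.gz"
done
```

If any pair is missing from `qc_data`, re-run the download command for that accession.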
Evaluating sequence quality with FastQC and MultiQC¶
The very first thing one would normally do when working with a new dataset is to look at some basic quality statistics on the data.
There are many, many tools available to compute quality metrics. For our current data we will use the FastQC software, applied to each sample, and then combine the results into a single report using MultiQC. The first step is to install `fastqc` and `multiqc`.
Install `fastqc` and `multiqc` for QC analysis
```bash
conda install -y -n workshop fastqc
pip install multiqc
```
In the above we've used conda to install fastqc, but we've installed multiqc another way, with something called `pip`. pip is an installer for Python programs, and like conda it will download and install the software along with any dependencies. The reason we use pip in this case is that conda can be very, very slow to install some programs, and here pip is much faster.
Use fastqc to analyse the datasets
```bash
cd qc_data
ls *.fastq.gz | xargs fastqc
```
Unpacking the commands
- First we change the current directory (`cd qc_data`); any subsequent command will, by default, now run in `qc_data`.
- Next we get the list of fastq files (`ls *.fastq.gz`) and pass that to `xargs`.
    - The wildcard expression (`*.fastq.gz`) matches all of the sequencing data files.
    - This list of files is then piped (`|`) to the next command, `xargs`.
    - The `xargs` command allows you to build and execute command lines. In its simplest form, `xargs` merely takes the list of files it receives and makes them arguments to `fastqc`.
    - `fastqc` is then run, with each of the fastq files included as an argument.
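To see what `xargs` is doing, here is a minimal stand-alone sketch that substitutes `echo` for `fastqc` (the file names are just examples):

```shell
# xargs converts the lines on its standard input into arguments of one command line
printf 'a_1.fastq.gz\na_2.fastq.gz\n' | xargs echo fastqc
# prints: fastqc a_1.fastq.gz a_2.fastq.gz
```

The result is exactly the command you would otherwise have typed out by hand, one argument per input file.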
If this step has worked, you should have several new `.zip` files containing the QC data in that directory, along with some HTML files. When we have a lot of samples it is too tedious to look at the QC report for each sample individually, so we can summarize them using multiqc.
Create a summary report
```bash
multiqc --title "Raw Reads" .
```
At this point a multiqc report file will appear inside the QC directory. To view it, first double-click the QC folder in the Jupyter file browser.
Opening our working directory in Jupyter
Once we've navigated to the QC directory, a file called `Raw-Reads_multiqc_report.html` will appear in the listing. To open this report, simply double-click on the file name and a new tab will appear in Jupyter.
Opening the HTML report in Jupyter
Switching to the report tab, first click the "Trust HTML" button in the top left-hand corner. This permits JavaScript to execute within the page, which is used for drawing the figures.
Opening the HTML report in Jupyter
From here we can evaluate the quality of the libraries.
Cleaning up reads with fastp¶
Now that we have evaluated the condition of our raw data-sets, we are in a better position to judge whether everything looks OK. Though we may feel that our data is in an acceptable state for further analysis, sometimes smaller details can escape us. Therefore, subjecting data-sets to an experimentally suitable clean-up procedure is good practice. What is suitable for a given experiment depends on what will ultimately be inferred. In the case of assembly, contaminating sequence or an abundance of low-quality base-calls will likely lead to worse results; identifying and removing these types of error will therefore be advantageous. In contrast, clean-up procedures that attempt to correct base-calling errors can be inappropriate when a sample is not clonal (i.e. metagenomics). Here, the assumption that SNVs are errors leads the software to remove the micro-diversity within the sample.
While sequencing technology continues to evolve and refine, data-sets generated from even the latest NGS platforms will not be error-free. There exist both random and systematic sources of error in sequencing data and a wide assortment of tools are available to identify and eliminate them. It is important to note that not every clean-up procedure is appropriate for all types of experiment and care should be taken not to falsely remove true observations from your data.
For our shotgun sequencing data, we will use `fastp`, which runs quickly and is easy to use.
Install fastp for cleaning up reads
```bash
conda install -y -n workshop fastp
```
By default, `fastp` will apply the following clean-ups:
- automatically identify and remove lingering adapter sequence.
- trim off low quality bases on the tail-end.
- filter out very short reads and reads with too many Ns.
Lingering adapter sequence can be thought of as experimental contamination and represents a misleading source of systematic error, while low-quality base calls are more a source of random error.
Run fastp
on our data-sets
```bash
mkdir cleaned
for r1 in $(ls *_1.fastq.gz)
do
    # extract the SRR from the filename
    srr=$(basename $r1 _1.fastq.gz)
    # run fastp on paired reads
    fastp --in1 ${srr}_1.fastq.gz --in2 ${srr}_2.fastq.gz \
        --out1 cleaned/${srr}_1.fastq.gz --out2 cleaned/${srr}_2.fastq.gz
done
```
Unpacking the commands
- Create a new subfolder for holding the cleaned data-sets (`mkdir cleaned`).
- Using a for-loop in Bash, individually submit R1/R2 pairs to fastp.
    - The loop iterates over the list of just the R1 files, obtained using the wildcard `*_1.fastq.gz`.
    - For each R1 file, extract the base SRR id (`srr=$(basename $r1 _1.fastq.gz)`).
    - Call fastp for paired R1/R2 files, constructing from the srr:
        - the input files (`${srr}_1.fastq.gz`, `${srr}_2.fastq.gz`)
        - the output files (`cleaned/${srr}_1.fastq.gz`, `cleaned/${srr}_2.fastq.gz`)
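The `basename` trick is worth seeing in isolation: given a path and a suffix, it strips any leading directory and the suffix, leaving just the accession. A minimal sketch:

```shell
# strip the directory and the _1.fastq.gz suffix to recover the SRR id
srr=$(basename qc_data/SRR9323808_1.fastq.gz _1.fastq.gz)
echo $srr
# prints: SRR9323808
```

Because the R1 and R2 file names differ only in their suffix, this one id is enough to name both inputs and both outputs inside the loop.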
Although `fastp` produces its own reports, we will again use `fastqc` and `multiqc` so that we can compare our cleaned and raw data-sets.
Perform QC analysis on the cleaned data-sets
```bash
ls cleaned/*.fastq.gz | xargs fastqc
multiqc --title "Cleaned Reads" cleaned
```
The resulting `multiqc` report will be entitled `Cleaned-Reads_multiqc_report.html`. Double-click this file in Jupyter to open the tab and select "Trust HTML" to enable the JavaScript figures, as described above for the raw data-sets.
As none of our data-sets suffers from serious problems, there are not many obvious changes when we compare the Raw and Cleaned reports. One easily noticeable difference is found in the section "Adapter Content". Although only a small proportion of reads were affected by lingering adapter sequence in the raw data-sets, `fastp` has removed what remained. Without too much effort, we have eliminated a source of systematic error -- tools downstream will no longer be tasked with reconciling an artifact of data collection with real observations.
Taxonomic analysis¶
Metagenome taxonomic analysis offers a means to estimate a microbial community profile from metagenomic sequence data. It can give us a very high-level, rough idea of what kinds of microbes are present in a sample. It can also give an idea of how complex/diverse the microbial community is -- whether there are many species or few. It is useful as an initial quality check to ensure that the microbial community composition looks roughly as expected, and to confirm that nothing obvious went wrong during the sample collection and sequencing steps.
Taxonomic analysis with Metaphlan2¶
While it may be possible to install metaphlan2 via conda, at least in my experience, conda struggles with "solving the environment". Therefore it's suggested to install it via the simple download method described on the metaphlan tutorial page:
Skip downloading metaphlan2
The >500MB Metaphlan2 package has already been downloaded and we'll just be extracting it.
Obtaining metaphlan
```bash
# this would be how you could download the package yourself
# wget -c -O metaphlan.tar.bz2 https://bitbucket.org/nsegata/metaphlan/get/default.tar.bz2

# extract the archive
cd ; tar xvjf /data/metaphlan.tar.bz2

# rename the folder to something simple
mv nsegata-metaphlan* metaphlan

# apply a technical fix for metaphlan plotting
sed -i 's/axisbg/facecolor/' metaphlan/plotting_scripts/metaphlan_hclust_heatmap.py

# install dependencies
conda install -y -n workshop bowtie2 numpy scipy matplotlib
```
Once `metaphlan` has been prepared we can run it on our QC samples:
Running `metaphlan` on our data
```bash
cd ~/qc_data
for pig in SRR9323808 SRR9323810 SRR9323811 SRR9323809
do
    zcat ${pig}*.fastq.gz | ~/metaphlan/metaphlan.py --input_type multifastq --bt2_ps very-sensitive \
        --bowtie2db ~/metaphlan/bowtie2db/mpa --bowtie2out ${pig}.bt2out -o ${pig}.mph2
done
```
Unpacking the commands
The above series of commands is another example of a Bash "for loop", where each iteration processes one of our metagenome samples. As before, this is a convenient and less error-prone way to process many samples, avoiding the need to type out the commands for each one. At each loop iteration, one of the sample names is placed into the loop variable `${pig}`, where it can be used in the command inside the loop.
Secondly, we decompress (`zcat`) each pig's sequencing data and feed it to `metaphlan` through a pipe (`|`).
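Note that `zcat` accepts several files at once and decompresses them into a single stream, which is why the wildcard `${pig}*.fastq.gz` hands both the R1 and R2 files of a sample to metaphlan together. A toy demonstration with throw-away files:

```shell
# make two small gzipped files
echo read1 | gzip > r1.gz
echo read2 | gzip > r2.gz

# zcat concatenates their decompressed contents into one stream
zcat r1.gz r2.gz
# prints: read1, then read2, as if the inputs had been a single file
```

Metaphlan treats each read independently, so pooling R1 and R2 into one stream this way is harmless for profiling.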
Finally we can plot the taxonomic profile of the samples:
Plot the taxonomic profile of each sample
```bash
~/metaphlan/utils/merge_metaphlan_tables.py *.mph2 > pig_mph2.merged
~/metaphlan/plotting_scripts/metaphlan_hclust_heatmap.py -c bbcry --top 25 --minv 0.1 -s log \
    --in pig_mph2.merged --out mph2_heatmap.png
```
Once that has completed successfully, a new file called `mph2_heatmap.png` will appear in the `qc_data` folder of our Jupyter file browser. We can double-click it to view.
There are other ways to visualize the data, and they are described in the graphlan section of the metaphlan tutorial page.
Taxonomic analysis with other tools¶
There are a whole range of other software tools available for metagenome taxonomic analysis. They all have strengths and weaknesses.
A few other commonly used tools are listed here:
Evaluating the host genomic content¶
In many applications of metagenomics we are working with host-associated samples.
The samples might have come from an animal gut, mouth, or other surface.
Similarly for plants we might be working with leaf or root surfaces or rhizobia.
When such samples are collected the resulting DNA extracts can include a significant fraction of host material.
Let's have a look at what this looks like in data.
To do so, we'll download another set of pig gut samples: a set of samples that was taken using scrapings or swabs from different parts of the porcine digestive system, including the duodenum, jejunum, ileum, colon, and caecum.
Rather than using `metaphlan2` to profile these, we will use the `kraken2` classifier in conjunction with the `bracken` tool for estimating relative abundances.
We first need to install `kraken2` and `bracken`:
Install `kraken2` and `bracken`
```bash
conda install -y -n workshop kraken2 bracken
```
Next, we need a `kraken2` database. For this tutorial we will simply use a precomputed `kraken2` database; note, however, the very important limitation that the only non-microbial genome it includes is the human genome. If you would like to evaluate host genomic content in plants or other organisms, you should follow the instructions on the `kraken2` web site to build a complete database.
Skip downloading the database
The Kraken team maintain this and a number of other prepared databases which they make available for download. However, as a matter of courtesy, we will avoid repeatedly requesting the same data.
The archive has already been downloaded and we will just be extracting the contents as follows:
Extract the precomputed `kraken2` reference database
```bash
cd ; tar xvzf /data/minikraken2_v2_8GB_201904_UPDATE.tgz
```
Finally we are ready to profile our samples with `kraken2` and `bracken`. We'll first download the sample set with `parallel-fastq-dump`, then run `kraken2` and `bracken`'s `est_abundance.py` script on each sample using a bash for loop. The analysis can be run as follows:
Download some new samples
```bash
# the list of samples
samples="SRR9332442 SRR9332438 SRR9332439 SRR9332443 SRR9332440"

# download the associated data
parallel-fastq-dump/parallel-fastq-dump ${samples//SRR/-s SRR} --threads 4 --split-files \
    --gzip --outdir host_qc/

# switch to the data directory and analyse
cd host_qc
for s in $samples
do
    # run kraken2 analysis
    kraken2 --paired ${s}_1.fastq.gz ${s}_2.fastq.gz --db ~/minikraken2_v2_8GB_201904_UPDATE/ \
        --report ${s}.kreport > ${s}.kraken

    # estimate abundance with bracken
    est_abundance.py -i ${s}.kreport -o ${s}.bracken \
        -k ~/minikraken2_v2_8GB_201904_UPDATE/database150mers.kmer_distrib
done
```
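One piece of the download command deserves a closer look: `${samples//SRR/-s SRR}` is Bash pattern substitution, which turns the plain list of accessions into a string of repeated `-s` flags. A stand-alone sketch:

```shell
samples="SRR9332442 SRR9332438"

# ${var//pattern/replacement} replaces every occurrence of pattern in var
printf '%s\n' "${samples//SRR/-s SRR}"
# prints: -s SRR9332442 -s SRR9332438
```

This saves typing each `-s` flag by hand and keeps the sample list defined in exactly one place.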
Once the above has completed, navigate over to the `host_qc` folder in the Jupyter file browser and click on the `*.bracken` files to open them.
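The `.bracken` files are plain tab-separated tables. Assuming the usual bracken column layout (name, taxonomy id, taxonomic level, kraken-assigned reads, added reads, estimated reads, fraction of total reads), you can rank taxa by abundance directly in the terminal. A sketch on a made-up two-row table:

```shell
# a tiny stand-in for a .bracken table (tab-separated; last column = fraction of reads)
printf 'Escherichia coli\t562\tS\t90\t5\t95\t0.95\nBifidobacterium\t1678\tS\t4\t1\t5\t0.05\n' > demo.bracken

# sort by the fraction column, largest first, to list the dominant taxa
sort -t$'\t' -k7,7 -gr demo.bracken
```

Replacing `demo.bracken` with one of your real reports gives a quick top-taxa listing without opening the file in Jupyter.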
What do you see?¶
In particular, why does sample SRR9332438 look so different to sample SRR9332440?
Keep in mind the isolation sources of the samples were as follows:
| Sample | Source |
|---|---|
| SRR9332442 | Duodenum |
| SRR9332440 | Caecum |
| SRR9332439 | Ileum |
| SRR9332438 | Jejunum |
| SRR9332443 | Colon |
Some challenge questions¶
- If we sequenced samples from pigs, why is human DNA being predicted in these samples?
- If we were to design a large study around these samples which of them would be suitable for metagenomics, and why?
- How much sequencing data would we need to generate from sample SRR9332440 to reconstruct the genome of the Bifidobacterium in that sample? What about the E. coli?
- Are there really six species of Lactobacillus present in SRR9332440?
- Go to the NCBI SRA search tool and find a metagenome of interest to you. Download the first 100000 reads from it (use the `--maxSpotId` parameter) and analyse it with kraken2. Is it what you expected? How does it compare to the others we've looked at?
A note on negative controls¶
Negative controls are a key element in any microbiome profiling or metagenome analysis work. Every aspect of the sample collection and processing can be impacted by the presence of contaminating microbial DNA. This is true even of 'sterile' material -- just because no viable cells exist in a sample collection swab, for example, does not mean there is no microbial DNA on that swab. It is well known that molecular biology reagents frequently contain contaminating DNA. Usually it is at low levels, but this is not always the case.
Therefore, the best practice is to collect multiple negative control samples that are taken all the way through sequencing. These negative controls can then be used to correct for contamination artifacts in the remaining samples.