Assembly and more quality controls

Last week we focused on doing more quality controls, and we started assembling the reads. After the adapters had been removed from the reads, a quality control with FastQC clearly showed an increase in the quality of the data.

Then we proceeded to trim and filter the reads with FastX. Trimming removes bases of poor quality, while filtering removes entire reads of poor quality from the dataset, either due to poor average quality, ambiguous base calls or short length. FastX is a collection of command line tools for pre-processing short reads. Using FastX to filter the data did not change the quality of the data: the file sizes of the fastq files before and after remained the same, and analyzing the files with FastQC confirmed this since the plots were unchanged compared to before filtering. The per base quality plot below is given as an example; it is identical to the plot from before the filtering.

The per base quality plot after filtering the reads.
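For reference, a minimal sketch of the kind of FastX commands used for this step (the quality thresholds and file names are assumptions, not the exact values we ran):

    # Trim low-quality bases from the 3' end, dropping reads shorter than 30 nt
    # (-Q33 may be needed for Illumina 1.8+ quality encoding)
    fastq_quality_trimmer -Q33 -t 20 -l 30 -i reads.fastq -o reads.trimmed.fastq
    # Discard reads where fewer than 80% of the bases reach Q20
    fastq_quality_filter -Q33 -q 20 -p 80 -i reads.trimmed.fastq -o reads.filtered.fastq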

Since the filtering did not change anything, we decided to continue with the unfiltered reads.

Next we tried assembling the reads into contigs with two different assemblers: Velvet and SPAdes. After discussion with Anders Nilsson and consulting documentation from Illumina it was decided to assemble the reads with Velvet using three different k-mer sizes: 21, 41 and 61. SPAdes chooses its k-mer sizes automatically. The literature also recommends assembling with a lower coverage than is common when assembling phage genomes: phage genomes are small, and with a high coverage there is a risk that systematic errors would be treated as natural variation. For this reason the reads were also assembled with only 10% of the reads, for both Velvet and SPAdes. In addition, the SPAdes assemblies were run with the "careful" setting both on and off. In total ten assemblies were made using SPAdes and Velvet; sketches of the commands are given below.
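A minimal sketch of the commands involved (the file names, the seqtk subsampling step and all parameters are illustrative assumptions, not a record of our exact runs):

    # Subsample ~10% of the read pairs; the same seed (-s) keeps the pairs matched
    seqtk sample -s100 reads_R1.fastq 0.1 > sub_R1.fastq
    seqtk sample -s100 reads_R2.fastq 0.1 > sub_R2.fastq

    # Velvet: one run per k-mer size (21, 41, 61)
    velveth velvet_k21 21 -fastq -shortPaired -separate reads_R1.fastq reads_R2.fastq
    velvetg velvet_k21 -exp_cov auto

    # SPAdes picks its k-mer sizes automatically; --careful adds mismatch correction
    spades.py --careful -1 reads_R1.fastq -2 reads_R2.fastq -o spades_careful
    spades.py -1 reads_R1.fastq -2 reads_R2.fastq -o spades_default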

The software Quast was used to analyze all the assemblies and acquire metrics; a sketch of the Quast invocation is given after the table below. The results reported by Quast were as follows (only contigs > 500 bp are reported; N50 is the length such that contigs of that length or longer make up half of the total assembly):

Assembly                              Contigs   Largest contig   N50        Total length
Velvet, k-mer 21                      32        2926 bp          749 bp     24391 bp
Velvet, k-mer 41                      43        1730 bp          711 bp     32159 bp
Velvet, k-mer 61                      63        1241 bp          660 bp     42473 bp
Velvet, k-mer 21 (10% of reads)       63        5390 bp          1862 bp    85004 bp
Velvet, k-mer 41 (10% of reads)       64        7011 bp          2777 bp    100054 bp
Velvet, k-mer 61 (10% of reads)       60        14272 bp         2194 bp    103048 bp
SPAdes (careful)                      34        90035 bp         76572 bp   193763 bp
SPAdes (careful off)                  34        90035 bp         76700 bp   193891 bp
SPAdes (10% of reads, careful)        8         90035 bp         90035 bp   169704 bp
SPAdes (10% of reads, careful off)    7         90035 bp         90035 bp   169705 bp
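A sketch of how these metrics can be produced (the output paths are assumptions; Velvet writes contigs.fa and SPAdes writes contigs.fasta in their output directories):

    # Quast reports contigs >= 500 bp by default (--min-contig 500)
    quast.py velvet_k21/contigs.fa -o quast_velvet_k21
    quast.py spades_careful/contigs.fasta -o quast_spades_careful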

We are unsure how to interpret which of these assemblies is the best based on these metrics, but the genome of the phage is expected to be 80-90 kb in total. None of the SPAdes assemblies fall within this range. Out of the Velvet assemblies only the 21 k-mer with 10% of the reads falls within the expected genome size. But the best assembly still remains to be discussed.

Quality control: FastQC before and after adapter removal

Since we realized that we needed to quality control the reads, remove adapters, and trim and filter them, we found the software FastQC, which is a quality control tool for high throughput sequencing data. FastQC does not modify the reads; it just gives different kinds of graphs that report the quality of the reads.
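Running it from the command line is straightforward (the file names are placeholders):

    # Writes one HTML report per input fastq file into fastqc_raw/
    mkdir -p fastqc_raw
    fastqc reads_R1.fastq reads_R2.fastq -o fastqc_raw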

We analyzed the original dataset of reads with FastQC and the quality of the data is reported as good, even though some categories give failures and warnings. The per base sequence quality shows that the general quality of the bases is good even though it starts to drop towards the end of the reads.

The per base quality before the adapter removal.

The category for sequence duplication levels is the only category that gives a failure. This may indicate some kind of enrichment.

Sequence duplication levels of the raw reads.

The category for over-represented sequences gives a warning. A sequence is regarded as over-represented, and the software will raise a warning, if the sequence makes up more than 0.1% of the total number of sequences. An over-represented sequence may be due to biological importance or to contamination. In the table it can be seen that the over-representation of two of the sequences is due to the Illumina adapter sequence.

Overrepresented reads in the dataset.

The category for adapter content also gives a warning. The graph shows that the source of the warning is a significant amount of Illumina adapter sequences.

The graph for adapter content shows a significant amount of Illumina adapter sequences.

The analysis of the raw reads shows that there is a significant amount of Illumina adapter sequence in the dataset and thus adapter removal should be performed. This was previously done with Trimmomatic and the resulting reads were again analyzed with FastQC.
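As a reference, a minimal sketch of the kind of Trimmomatic command used for paired-end adapter removal (the jar version, adapter file and clip thresholds are assumptions, not our exact settings):

    # ILLUMINACLIP:<adapter fasta>:<seed mismatches>:<palindrome clip>:<simple clip>
    java -jar trimmomatic-0.39.jar PE \
        reads_R1.fastq reads_R2.fastq \
        trimmed_R1_paired.fastq trimmed_R1_unpaired.fastq \
        trimmed_R2_paired.fastq trimmed_R2_unpaired.fastq \
        ILLUMINACLIP:TruSeq3-PE.fa:2:30:10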

The per base quality improved drastically compared to before the removal of the adapters, as seen in the image below.

The per base quality improved after the adapters were removed.

Before the adapter removal the distribution of sequence lengths had a perfect score since all the sequences were 300 nucleotides long. After the adapters were removed the distribution of sequence lengths changed, as expected; most of the sequences are still very close to 300 nucleotides. According to the FastQC manual the software raises a warning if all the sequences are not the same length, but this should not be a big issue in this case.

The sequence length distribution changed from a pass to a warning after the adapter removal.

The sequence duplication levels did not change much after the adapter removal and still raise a failure, indicating an enrichment bias. The software raises a failure if more than 50% of the total amount of sequences are non-unique. We are not sure if this will cause an issue with the assembly, but we decided to continue with the next steps without looking closer into this.

The sequence duplication levels did not change compared to before the adapters were removed.

In the table of overrepresented sequences it can be seen that the adapter sequences have been removed, as expected. The rest of the sequences from before the removal are still present and their sources are still unknown; they should probably be BLASTed to find out more about their importance.

In the overrepresented table adapter sequences have been removed.
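If we look into this, a remote BLAST search against the NCBI nucleotide database would be one way to start (the input fasta of overrepresented sequences is hypothetical and would have to be prepared first):

    # Requires the NCBI BLAST+ tools; -remote sends the query to NCBI's servers
    blastn -query overrepresented.fasta -db nt -remote -out overrepresented_hits.txt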

Finally, as expected, the graph for adapter content shows that all the adapters were removed. This category turned from a warning to a pass.

The adapter content shows that all the adapters were removed.

In conclusion, the adapter removal improved the data quality and we decided to continue with the adapter-trimmed reads.