Quality control: Removing adapters from reads in dataset

The group met today and added quality control of the reads to the project plan. We considered using either Trimmomatic or Cutadapt. Trimmomatic is the preferred option because it is a trimming tool designed for Illumina NGS data and the adapter sequences to be removed are distributed with the software. With Cutadapt, the user has to specify the adapter sequences that should be removed.

According to the Trimmomatic manual, a FASTA file containing the adapter sequences (and PCR sequences etc.) should be specified in addition to the dataset. This file is distributed with Trimmomatic and contains the Illumina adapter sequences. It does not really make sense to me that the path to this file needs to be specified, since it is distributed with the software, and since the software only works with data from Illumina sequencing machines there are not many different options for the user to choose between. Finding this file on a distributed system like Uppmax is what took the most time when trying to use this software. In the end, the solution was to find the FASTA file with the Illumina sequences on the internet and upload it to the same folder as the files with the reads.
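For next time, a possible way to locate the bundled adapter file instead of downloading it. This is only a sketch: it assumes that the trimmomatic module sets $TRIMMOMATIC_HOME (the variable used in the java command further down) and that the adapter files sit somewhere below that directory, which I have not verified on Uppmax. The function name find_adapter_file is made up for this note.

```shell
# Print the first file called $2 found anywhere below directory $1.
# Prints nothing if no such file exists.
find_adapter_file() {
  search_root=$1
  file_name=$2
  # -L follows symlinks, in case the install directory is a symlink.
  find -L "$search_root" -type f -name "$file_name" 2>/dev/null | head -n 1
}

# Hypothetical usage after `module add bioinfo-tools trimmomatic`:
#   adapters=$(find_adapter_file "$TRIMMOMATIC_HOME" TruSeq3-PE-2.fa)
#   echo "$adapters"
```

If this prints a path, that path could be given to ILLUMINACLIP in place of the local copy of TruSeq3-PE-2.fa.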

The options used for Trimmomatic were the default options given in the paired-end example on the Trimmomatic webpage:

module add bioinfo-tools

module add trimmomatic

java -jar $TRIMMOMATIC_HOME/trimmomatic.jar PE -phred33 ETECp7_TCCGCGAA-CAGGACGT_L001_R1_001.fastq.gz ETECp7_TCCGCGAA-CAGGACGT_L001_R2_001.fastq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

And the output was:

TrimmomaticPE: Started with arguments:
-phred33 ETECp7_TCCGCGAA-CAGGACGT_L001_R1_001.fastq.gz ETECp7_TCCGCGAA-CAGGACGT_L001_R2_001.fastq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 799754 Both Surviving: 718152 (89.80%) Forward Only Surviving: 77273 (9.66%) Reverse Only Surviving: 631 (0.08%) Dropped: 3698 (0.46%)
TrimmomaticPE: Completed successfully

I can’t really interpret the summary of the output yet: are the results good or bad? I will look into it tomorrow.
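As a first step towards reading the summary line, here is a small sketch that recomputes the percentages from the counts. It assumes the line has exactly the whitespace-separated layout shown in the output above. The function name summarize_trimmomatic and the reads_kept_pct figure are my own additions, not something Trimmomatic reports. "Both Surviving" counts pairs where both mates passed all trimming steps, "Forward Only" and "Reverse Only" count pairs where just one mate survived (these end up in the unpaired output files), and "Dropped" counts pairs where neither mate survived.

```shell
# Summarise the "Input Read Pairs" line of Trimmomatic PE output (read on stdin).
# Field positions assume the layout seen in the log above.
summarize_trimmomatic() {
  awk '/^Input Read Pairs:/ {
    total = $4; both = $7; fwd = $12; rev = $17; dropped = $20
    # Share of pairs where both mates passed all trimming steps.
    printf "pairs_kept_pct=%.2f\n", 100 * both / total
    # Share of individual reads kept (each pair holds two reads).
    printf "reads_kept_pct=%.2f\n", 100 * (2 * both + fwd + rev) / (2 * total)
    # The four categories should add up to the input pair count.
    ok = (both + fwd + rev + dropped == total) ? "yes" : "no"
    printf "counts_consistent=%s\n", ok
  }'
}
```

Feeding the summary line from this run into the function gives pairs_kept_pct=89.80, reads_kept_pct=94.67 and counts_consistent=yes. So about 90% of the pairs survived intact and fewer than 0.5% of the pairs were lost completely. Whether that is good enough still has to be judged against what the downstream steps need.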