Quality control of sequencing data
screen, slurm, Apptainer, FastQC, MultiQC
Regardless of what type of bioinformatics project you are working on, you will have to assess whether the data you have is of good enough quality to proceed with - bad input data can seriously impair your ability to draw conclusions from the samples later on!
Poor quality sequencing data can be caused by a variety of factors, such as sample contamination, improper sample handling, or technical problems during the run. Some of these problems are checked for before and during sequencing - it is always good to read all documentation coming from the sequencing facility.
When it comes to sequencing data, FastQC is a well known and often used software to check raw reads for their quality.
Sample preparation includes fragmenting the genome into the pieces that make up the sequencing library.
A read is the inferred nucleotide base sequence of a genome fragment as determined by the sequencer.
Quality Control Tutorial
Within this tutorial we will
- use the job scheduler slurm
- learn about containers, where to get them, and how to use them with apptainer
- run fastqc on some raw sequencing files to practice and to understand the output
- run multiqc to summarize the FastQC output
- run fastp to do adapter trimming
Preparations
Connect to HPC2N
For this tutorial we will connect to the course server.
To connect to the server, please follow the instructions here.
Quick access link to the on-demand service of HPC2N:
Start a screen session
Screen or GNU Screen is a terminal multiplexer. In other words, it means that you can start a screen session and then open any number of windows (virtual terminals) inside that session. Processes running in Screen will continue to run when their window is not visible even if you get disconnected.
Start a named session, with the name qc:
screen -S qc
You can detach from the screen session. The process within the screen will continue to run.
Ctrl + a d
You can always reattach to the session. If you have a number of screen sessions running, or are unsure of the name or ID of the screen you want to reattach to, you can list the currently running screens:
screen -ls
To resume your screen session use the following command:
screen -r name
Create a directory to work in
From within your personal directory on the course project, start by creating a workspace for the raw data used in this exercise in your scratch space, and then move into it:
mkdir -p NGS_course/raw
cd NGS_course/raw
Create symbolic link to the data
The raw data files are located in:
path/to/course/training_data
You could copy the files into your workspace to access them. However, it is better to create symbolic links (also called soft links) to the data files. This saves disk space and still allows you to work with them as if they were in your own directory.
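As a quick illustration of how symbolic links behave (using throwaway file names, not the course data), you can create a file, link to it, and then read it through the link:

```shell
# Create a small file, then a symbolic link pointing at it
echo "hello" > original.txt
ln -s original.txt link.txt
# Reading the link transparently reads the target file
cat link.txt
# List with -l to see the arrow notation: link.txt -> original.txt
ls -l link.txt
```

Deleting a symbolic link removes only the link itself, not the target file - another reason it is a safe way to expose shared course data in your own workspace.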
Create symbolic links to the fastq files in your workspace:
ln -s path/to/course/data/*.fastq .
You now have four files in your directory: two for the TMEB117 cultivar containing the DNA sequences, and two for the TMEB419 cultivar containing the RNA sequencing results.
Leave the /raw directory and make sure that you are back in your personal course directory/NGS_course folder.
Data format
You now have the raw sequencing data, in a very common data format, fastq. Let’s have a look at the data (change the name of the file accordingly):
zcat raw/sample.fastq.gz | head -n 10
For each read we see an entry consisting of four lines. Every new read starts with an @.
| Line | Content |
|---|---|
| 1 | Information about the read, always starts with an @ |
| 2 | nucleic sequence of the read |
| 3 | starts with `+`, can (but does not have to) contain data |
| 4 | characters encoding the quality scores of the bases in the read sequence |
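To make the four-line structure concrete, here is a toy fastq record (made-up read name, sequence, and qualities - not from the course data), written to a file and then used to count reads:

```shell
# Write one toy fastq record: header, sequence, separator, qualities
cat > toy.fastq <<'EOF'
@read1 length=10
ACGTACGTAC
+
IIIIHHFF##
EOF
# Every record is exactly 4 lines, so: number of reads = total lines / 4
echo "$(( $(wc -l < toy.fastq) / 4 )) read(s)"
```

This lines-divided-by-four trick is a handy sanity check on real (uncompressed) fastq files, too.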
The quality score is encoded as ASCII characters, representing Phred scores from 0 (high probability of error in calling the base) to 42 (low probability of error in calling the base).
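You can check the encoding yourself: take any character from line 4, look up its ASCII code, and subtract 33 (the standard fastq offset). A small sketch, using the character 'I':

```shell
# Decode the quality character 'I' into its Phred score
qual_char="I"
ascii=$(printf '%d' "'$qual_char")  # printf "'X" yields the ASCII code of X
phred=$((ascii - 33))               # subtract the fastq (Phred+33) offset
echo "'$qual_char' encodes Phred score $phred"   # 73 - 33 = 40
```

A Phred score of 40 corresponds to an error probability of 10^(-40/10) = 0.0001, i.e. one expected error in 10,000 base calls.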
We could now look through the fastq file and scan the quality scores for the reads in our sample one by one. Sounds kind of tedious, doesn’t it? Luckily, here comes FastQC (and other, similar tools).
FastQC Intro
FastQC is a simple tool to monitor the quality and the properties of an NGS sequencing file in fastq, SAM, or BAM format.
FastQC will give us an overview of the entire sample, so we won’t have to look at the quality score of every single read in the sample.
In addition to the quality scores, FastQC also performs a series of quality control analyses, called modules. The output is an HTML report with one section for each module, and a summary evaluation of the results at the top. “Normal” results are marked with a green tick, slightly abnormal results raise warnings (orange exclamation mark), and very unusual results raise failures (red cross).
Keep in mind that even though FastQC is giving out pass/fail results for your samples, these evaluations must be taken in the context of what you expect from your library. A ‘normal’ sample as far as FastQC is concerned is random and diverse. However, because of project design your samples might deviate from this expectation. The summary evaluations should be pointers to where you have to concentrate your attention and understand why your library may not look random and diverse.
Let’s see which version is installed on the server:
fastqc --version
Oh no! FastQC is not installed on the server, what can we do?
Containers
In this tutorial, we will be using containers to run tools and specific tool versions that may not be directly installed on the system we are working on.
What are containers?
Containers are stand-alone pieces of software that require a container management tool to run. They are built and exchanged as container images that specify the contents of the container, such as the operating system, all dependencies, and the software, in an isolated environment. The container management tool then takes the image and builds the container. These management tools can be run on all operating systems, and since the container has the operating system within it, it will run the same in all environments. Container images are easily portable and immutable, so they are stable over time.
Running Containers
There are several programs that can be used to build and run containers. Docker, Apptainer, and Podman are the most commonly used platforms to date. They all have their pros and cons. If you are using a Windows machine that only you are using, then Docker is likely the least complex tool to install. On multi-user systems like a server, Apptainer is the best tool for the job. For this tutorial and the rest of the course, we will use Apptainer commands. There are small syntax differences between bash and PowerShell commands, but they are very similar.
Getting apptainer image
One good place to get quality controlled Apptainer/Singularity containers that contain the tools we want to use is seqera containers.
Go to their homepage.
- In the search bar, type in the tool you want - fastqc
- Add the tool you want to have in your container (in this case fastqc from Bioconda).
- In the container settings underneath the search bar, select Singularity and linux/amd64.
- Click “get container”.
- Once the container is ready, copy the path of the image.
Download container images
To have a nice and clean project directory we will make a new sub-directory that will contain all the singularity images we will use during this tutorial.
mkdir singularity_images
cd singularity_images
Now we can pull the container image from its location into our folder:
singularity pull --name fastqc_0.12.1.sif oras://community.wave.seqera.io/library/fastqc:0.12.1--104d26ddd9519960
Then we move out of the directory again:
cd ..
Running Containers
Once you have pulled the container image, you want to be able to use it. Apptainer can be used to build the container from the image.
running “from the outside”
There are two different ways to use a container: run and exec. The apptainer run command launches the container, first runs the %runscript for the container if one is defined, and then runs your command (we will cover %runscript in the Building Containers section). The apptainer exec command will not run the %runscript even if one is defined. It is a small, fiddly detail that might be relevant if you use other people’s containers. After calling apptainer with run or exec, you can use your software as you usually would:
apptainer exec singularity_images/fastqc_0.12.1.sif fastqc --version
This command runs your fastqc_0.12.1.sif container from the image, calls the program fastqc that is within the container, and shows you its version. If you had installed FastQC locally, you would simply have used
fastqc --version
FastQC is just an example. If you want to run any other tool, everything after apptainer run or apptainer exec has to be substituted by the name of the specific container image and the command for that particular tool!
running interactively “from the inside”
You can also enter the container, and work interactively from within. For that you use the apptainer shell command:
apptainer shell singularity_images/fastqc_0.12.1.sif
Inside the container, your prompt will change to Singularity (remember, that is the legacy name for Apptainer). Now you can use the tools inside the container.
To exit the container, simply type exit and press Enter.
Great, now we have FastQC ready and can use it. How can we submit the job to the server?
SLURM
HPC2N is running SLURM (Simple Linux Utility for Resource Management) as its job scheduling system. When you submit a job on the cluster log-in node, SLURM will start, execute and monitor your jobs on the working nodes. It allocates access to the cluster resources, and manages the queue of pending jobs. Here is a more thorough documentation of SLURM on Kebnekaise.
To be able to do so, SLURM needs a bit of information from us when we submit a job:
- -A: project ID (to deduct the used computing time from the correct project)
- -t: allocated time dd-hh:mm:ss (to optimize the job queue)
- -n: number of cores (default is one)
Running fastqc with sbatch
We will now write a script that runs fastqc on our samples - using the container we pulled. We will add the slurm flags from above and then submit to the server queue with the SLURM command sbatch.
Again, we want to maintain a clean and orderly project directory:
In your NGS_course folder, create a new directory called scripts, within this directory create a file called fastqc.sh.
mkdir scripts
cd scripts
nano fastqc.sh
Nano is a Linux command line text editor. Commands are prefixed with ^ or M characters. The caret symbol ^ represents the Ctrl key. For example, the ^X command means to press the Ctrl and X keys at the same time. The letter M represents the Alt key.
More information here.
Copy the following into the file, and save the contents. Read through the file and try to understand what the different lines are doing.
#! /bin/bash -l
# The name of the compute account you are running in, mandatory.
#SBATCH -A hpc2n2025-XXX
# Request runtime for the job (HHH:MM:SS) where 168 hours is the maximum. Here asking for 15 min.
#SBATCH -t 15:00
# Request resources - here for four cpus
#SBATCH -n 4
# Create the output directory for the reports if it does not exist yet
mkdir -p fastqc
apptainer exec singularity_images/fastqc_0.12.1.sif \
fastqc -t 4 -o fastqc/ raw/*.fastq
echo "complete"
The slurm options used here:
- -A: project ID
- -t: allocated time dd-hh:mm:ss
- -n: number of cpus
Move back into the NGS_course directory and submit the script to slurm:
cd ..
sbatch scripts/fastqc.sh
After running the batch script you will get a slurm output file. Look at that output. See if you understand what it contains.
less slurm-XXXXX.out
Locate the output of FastQC.
Which output directory did you specify in the batch file?
For each fastq file you will get two output files:
TMEB117_R1_frac_fastqc.zip (report, data files and graphs)
TMEB117_R1_frac_fastqc.html (report in html)
Now you can download the .html reports from the server and look at them:
In VScode: right click on the file in the file explorer –> download
From another terminal: open a local different terminal and navigate to where you want the files on your computer. Then copy the files with rsync (modify as needed).
rsync -ah <user_name>@kebnekaise.hpc2n.umu.se:/proj/nobackup/hpc2nstor2025-XXX/<folder_name>/NGS_course/fastqc .
- What is the quality of your sample?
- Look at online resources to see if what you see is a problem, or expected because of the type of data you work with. A nice resource here is for example the FastQC tutorial from the Michigan State University.
You see that it is getting kind of tedious to look through all the different files one by one. Okay with only a few files, but imagine having to sift through a few dozen, or even hundreds of reports.
MultiQC
MultiQC searches a given directory for analysis logs and compiles a HTML report. It’s a general use tool, perfect for summarising the output from numerous bioinformatics tools. It aggregates results from bioinformatics analyses across many samples into a single report.
Pull the apptainer image
On the seqera container page choose bioconda::multiqc for your container image. Proceed to pull the container image, following the steps we did for fastqc.
Download container image
Download the container image with singularity pull. The --name flag lets you re-name the image to a more intuitive name.
Good practice is to name it after the tool and its version number.
Copy the image into your singularity_images folder, if it isn’t there yet.
Running multiqc with sbatch
Within your scripts directory, make a new file, multiqc.sh, and add the following:
#! /bin/bash -l
#SBATCH -A hpc2n2025-XXX
#SBATCH -t 15:00
#SBATCH -n 1
singularity exec singularity_images/multiqc_1.25.1.sif \
multiqc -f -o multiqc .
Navigate out of the scripts directory back into the NGS_course directory.
Run the bash script with
sbatch scripts/multiqc.sh
The command output (in the slurm.out file) looks something like:
/// MultiQC 🔍 v1.32
config | Loading config settings from: multiqc_config.yaml
file_search | Search path: /cfs/klemming/scratch/a/amrbin/NGS_course
fastqc | Found 4 reports
write_results | Data : multiqc/multiqc_data (overwritten)
write_results | Report : multiqc/multiqc_report.html (overwritten)
multiqc | MultiQC complete
Download the report and look at it.
Understand what is going on. Read the documentation.
Do we need to adapter trim any samples?
FastP
FastP is a FASTQ data pre-processing tool. The algorithm has functions for quality control, trimming of adapters, filtering by quality, and read pruning.
Depending on what analysis you need to do with the NGS data, it is wise to process the data according to the quality control results and remove low-scoring sequences and/or low-scoring 5’ and 3’ fragments. It makes sense to trim adapters for downstream analyses, but quality filtering can remove information that modern downstream tools can still utilize.
Let’s get the output into a different directory:
mkdir fastp
Then retrieve the container image from seqera containers:
singularity pull --name fastp_1.0.1.sif oras://community.wave.seqera.io/library/fastp:1.0.1--a5a7772c43b5ebcb
Make sure the image is in the same folder as the other images we used so far.
Run fastp with the following bash script:
#! /bin/bash -l
#SBATCH -A hpc2n2025-XXX # Project allocation
#SBATCH -t 15:00 # Time limit
#SBATCH -n 4 # Number of cores
# Get CPUs allocated to the script
CPUS=$SLURM_NPROCS
# Define input files
DATA_DIR=raw/
OUT_DIR=fastp/
FILES=( $DATA_DIR/*_R1*.fastq )
# Function to run fastp
apply_fastp() {
READ1="$1" # Read 1 of the pair
READ2="$2" # Read 2 of the pair
# Ensure READ1 and READ2 are distinct
if [ "$READ1" == "$READ2" ]; then
>&2 echo "Error: READ1 and READ2 are the same file. Check string substitution."
exit 1
fi
# Extract prefix from READ1
PREFIX=$(basename "${READ1%_R1*}")
# Run fastp within the Singularity container
singularity exec singularity_images/fastp_1.0.1.sif \
fastp -w $CPUS \
-i "$READ1" \
-I "$READ2" \
-o "${OUT_DIR}${PREFIX}_fastp-trimmed_R1.fastq" \
-O "${OUT_DIR}${PREFIX}_fastp-trimmed_R2.fastq" \
--json "${OUT_DIR}${PREFIX}_fastp.json" \
--html "${OUT_DIR}${PREFIX}_fastp.html"
echo "Processed ${PREFIX}"
}
# Main script execution
# Process files as pairs
for ((i = 0; i < ${#FILES[@]}; i+=1)); do
FASTQ="${FILES[i]}"
apply_fastp "$FASTQ" "${FASTQ/_R1/_R2}"
done
echo "complete"
Do you understand the bash script? Discuss with your neighbour and check out the manual for fastp.
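The trick the loop relies on is bash parameter substitution: ${FASTQ/_R1/_R2} replaces the first occurrence of _R1 with _R2 to derive the mate's filename. A standalone sketch (the filename here is just an example following the tutorial's naming pattern):

```shell
# Derive the R2 filename from the R1 filename via string substitution
FASTQ="raw/TMEB117_R1_frac.fastq"   # example R1 filename
READ2="${FASTQ/_R1/_R2}"            # replace the first _R1 with _R2
echo "$READ2"                        # prints raw/TMEB117_R2_frac.fastq
```

This is also why the script checks that READ1 and READ2 differ: if a filename contained no _R1, the substitution would silently return it unchanged and fastp would be fed the same file twice.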
Once you have the cleaned sequences, run multiqc again to check the result.