Introduction to Bioinformatics NGS

Quality control of sequencing data

screen, slurm, Apptainer, FastQC, MultiQC


Regardless of what type of bioinformatics project you are working on, you will have to assess whether the data you have is of good enough quality to proceed with - bad input data can seriously impair your ability to draw conclusions from the samples later on!

Poor quality sequencing data can be caused by a variety of factors, such as sample contamination, improper sample handling, or technical problems. Some of these problems are checked for before and during sequencing; it is always good to read all documentation coming from the sequencing facility.

When it comes to sequencing data, FastQC is a well-known and widely used tool for checking the quality of raw reads.

Note

Sample preparation includes fragmenting the genome into the sequencing library.

A read is the inferred nucleotide base sequence of a genome fragment as determined by the sequencer.

Quality Control Tutorial

Within this tutorial we will

  • use the job scheduler slurm
  • learn about containers, and where to get them, and how to use them with apptainer
  • run fastqc on some raw sequencing files to practice and to understand the output
  • run multiqc to summarize the FastQC output
  • run fastp to do adapter trimming

Preparations

Connect to HPC2N

For this tutorial we will connect to the course server.

To connect to the server, please follow the instructions here.

Quick access link to the on-demand service of HPC2N:

HPC2N on-demand link.

Navigate to your course directory

Make sure you are in your folder in the course directory.

Tip

Use the command line commands ls to list the contents of the directory you are currently in, pwd to print the path of the current working directory, and cd to navigate between directories.

Start a screen session

Screen or GNU Screen is a terminal multiplexer. In other words, it means that you can start a screen session and then open any number of windows (virtual terminals) inside that session. Processes running in Screen will continue to run when their window is not visible even if you get disconnected.

Start a named session, with the name qc:

screen -S qc

You can detach from the screen session. The process within the screen will continue to run.

Ctrl + a d

You can always reattach to the session. If you have several screen sessions running, or are unsure of the name or ID of the session you want to reattach to, you can list the currently running screens:

screen -ls

To resume your screen session, use the following command (replacing name with the session name):

screen -r name

Create a directory to work in

From within your personal directory on the course project, start by creating a workspace for the raw data used in this exercise in your scratch space, and then move into it:

mkdir -p NGS_course/raw
cd NGS_course/raw

Create symbolic link to the data

The raw data files are located in:

path/to/course/training_data

You could copy the files into your workspace to access them. However, it is better to create symbolic links (also called soft links) to the data files. This saves disk space and still allows you to work with them as if they were in your own directory.
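
If you want to see how a symbolic link behaves, here is a small self-contained demo using throwaway files (the file names here are made up; this is not the course data):

```shell
# Create a throwaway file, link to it, and read through the link.
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$tmpdir/sample.fastq"
ln -s "$tmpdir/sample.fastq" "$tmpdir/link.fastq"

readlink "$tmpdir/link.fastq"   # prints the path of the original file
cat "$tmpdir/link.fastq"        # reading the link reads the original

rm -r "$tmpdir"                 # clean up
```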

Create symbolic links to the fastq files in your workspace:

ln -s path/to/course/data/*.fastq .

You now have four files in your directory: two for the TMEB117 cultivar containing the DNA sequences, and two for the TMEB419 cultivar, containing the RNA sequencing results.

Important: To do for you

Leave the raw/ directory and make sure that you are back in your NGS_course folder inside your personal course directory.

Data format

You now have the raw sequencing data, in a very common data format, fastq. The linked files are plain, uncompressed .fastq files, so we can look at the first two reads directly (change the name of the file accordingly):

head -n 8 raw/sample.fastq

For each read we see an entry consisting of four lines. Every new read starts with an @.
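
To make this concrete, here is a toy two-read fastq file built on the fly (read names and sequences are invented for illustration), along with a quick way to count the reads it contains:

```shell
# Write a tiny fastq file with two records (four lines each).
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTTTGGGG\n+\nFFFFFFFF\n' > toy.fastq

head -n 4 toy.fastq               # show the first complete record
awk 'END{print NR/4}' toy.fastq   # reads = total lines / 4, prints 2

rm toy.fastq
```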

Fastq format

| Line | Content |
|------|---------|
| 1 | Information about the read; always starts with an `@` |
| 2 | Nucleotide sequence of the read |
| 3 | Starts with `+`; can (but does not have to) repeat the read information |
| 4 | One character per base, encoding the quality score of that base |
Note

The quality score is encoded in ASCII, to represent scores from 0 (high probability of error in calling the base) to 42 (low probability of error in calling the base).
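
As a concrete illustration of the encoding: subtracting 33 from a quality character's ASCII code gives its Phred score (the character I below is just an example):

```shell
# Decode a Phred+33 quality character: score = ASCII code - 33.
# 'I' has ASCII code 73, so it encodes Q40,
# i.e. an error probability of 10^(-40/10) = 0.0001.
char='I'
score=$(( $(printf '%d' "'$char") - 33 ))
echo "$score"   # prints 40
```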

We could now look through the fastq file and scan the quality scores for the reads in our sample one by one. Sounds kind of tedious, doesn’t it? Luckily, here comes FastQC (and other, similar tools).

FastQC Intro

Fastqc is a simple tool to monitor the quality and the properties of an NGS sequencing file in fastq, SAM and BAM format.

FastQC will give us an overview of the entire sample, so we won’t have to look at the quality score of every single read in the sample.

In addition to the quality scores, FastQC also performs a series of quality control analyses, called modules. The output is an HTML report with one section for each module and a summary evaluation of the results at the top. “Normal” results are marked with a green tick, slightly abnormal results raise warnings (orange exclamation mark), and very unusual results raise failures (red cross).

Keep in mind that even though FastQC is giving out pass/fail results for your samples, these evaluations must be taken in the context of what you expect from your library. A ‘normal’ sample as far as FastQC is concerned is random and diverse. However, because of project design your samples might deviate from this expectation. The summary evaluations should be pointers to where you have to concentrate your attention and understand why your library may not look random and diverse.

Important: To do for you

Let’s see which version is installed on the server:

fastqc --version

Oh no! FastQC is not installed on the server, what can we do?

Containers

In this tutorial, we will be using containers to run tools and specific tool versions that may not be directly installed on the system we are working on.

What are containers?

Containers are stand-alone pieces of software that require a container management tool to run. They are built and exchanged as container images that specify the contents of the container, such as the operating system, all dependencies, and software, in an isolated environment. The container management tool then takes the image and builds the container. These management tools run on all major operating systems, and since the container carries its operating system with it, it will run the same in all environments. Container images are easily portable and immutable, so they are stable over time.

Running Containers

There are several programs that can be used to build and run containers. Docker, Apptainer, and Podman are the most commonly used platforms to date. They all have their pros and cons. If you are using a Windows machine that only you are using, then Docker is likely the least complex tool to install. On multi-user systems like a server, Apptainer is the best tool for the job. For this tutorial and the rest of the course, we will use Apptainer commands. There are small syntax differences between bash and PowerShell, but the commands are very similar.

Getting apptainer image

One good place to get quality controlled Apptainer/Singularity containers that contain the tools we want to use is seqera containers.

  • Go to their homepage.

  • In the search bar, type in the name of the tool you want: fastqc



  • Add the tool you want to have in your container (in this case fastqc from Bioconda).

  • In the container settings underneath the search bar, select Singularity and linux/amd64

  • Click “get container”.



  • Once the container is ready copy the path of the image.

Download container images

To have a nice and clean project directory we will make a new sub-directory that will contain all the singularity images we will use during this tutorial.

mkdir singularity_images
cd singularity_images

Now we can pull the container image from its location into our folder:

singularity pull --name fastqc_0.12.1.sif oras://community.wave.seqera.io/library/fastqc:0.12.1--104d26ddd9519960

Then we move out of the directory again:

cd .. 

Running Containers

Once you have pulled the container image, you want to be able to use it. Apptainer can be used to build the container from the image.

running “from the outside”

There are two different ways to use a container: run and exec. The apptainer run command launches the container, first runs the %runscript for the container if one is defined, and then runs your command (we will cover %runscript in the Building Containers section). The apptainer exec command will not run the %runscript even if one is defined. It is a small but fiddly detail that matters mainly when you use other people’s containers. After calling apptainer run or apptainer exec, you can use your software as you usually would:

apptainer exec singularity_images/fastqc_0.12.1.sif fastqc --version

This command runs your fastqc_0.12.1.sif container from the image, calls the program fastqc inside the container, and shows you the version. If you had installed FastQC locally, you would have just used

fastqc --version
Important

FastQC is just an example. If you want to run any other tool, everything after apptainer run or apptainer exec has to be replaced with the name of the specific container image and the command for that particular tool!

running interactively “from the inside”

You can also enter the container, and work interactively from within. For that you use the apptainer shell command:

apptainer shell singularity_images/fastqc_0.12.1.sif

Inside the container, your prompt will change to Singularity (remember, that is the legacy name for Apptainer). Now you can use the tools inside the container.

To exit the container, simply type and enter exit.

Great, now we have FastQC ready and can use it. How can we submit the job to the server?

SLURM

HPC2N is running SLURM (Simple Linux Utility for Resource Management) as its job scheduling system. When you submit a job from the cluster login node, SLURM will start, execute and monitor your jobs on the working nodes. It allocates access to the cluster resources and manages the queue of pending jobs. More thorough documentation of SLURM on Kebnekaise is available here.

To be able to do so, SLURM needs a bit of information from us when we submit a job:

  • -A: project ID (to deduct the used computing time from the correct project)
  • -t: allocated time, dd-hh:mm:ss (to optimize the job queue)
  • -n: number of cores (default is one)

Running fastqc with sbatch

We will now write a script that runs fastqc on our samples - using the container we pulled. We will add the slurm flags from above and then submit to the server queue with the SLURM command sbatch.

Again, we want to maintain a clean and orderly project directory:

In your NGS_course folder, create a new directory called scripts, within this directory create a file called fastqc.sh.

mkdir scripts
cd scripts
nano fastqc.sh
Tip: Nano

Nano is a Linux command line text editor. Commands are prefixed with ^ or M characters. The caret symbol ^ represents the Ctrl key. For example, the ^X command means to press the Ctrl and X keys at the same time. The letter M represents the Alt key.

More information here.

Copy the following into the file, and save the contents. Read through the file and try to understand what the different lines are doing.

#!/bin/bash -l
# The name of the compute account you are running in, mandatory.
#SBATCH -A hpc2n2025-XXX
# Request runtime for the job (HHH:MM:SS) where 168 hours is the maximum. Here asking for 15 min.
#SBATCH -t 15:00
# Request resources - here four cpus
#SBATCH -n 4

# FastQC expects the output directory to exist, so create it first
mkdir -p fastqc

apptainer exec singularity_images/fastqc_0.12.1.sif \
    fastqc -o fastqc/ raw/*.fastq

echo "complete"

The slurm options used here:

  • -A: project ID
  • -t: allocated time, dd-hh:mm:ss
  • -n: number of cpus

Move back into the NGS_course directory and submit the script to slurm:

cd ..
sbatch scripts/fastqc.sh

After the job has run, SLURM writes an output file in the directory you submitted from. Look at that output and see if you understand what it contains.

less slurm-XXXXX.out

Locate the output of FastQC.

Note

Which output directory did you specify in the batch file?

For each fastq file you will get two output files:

TMEB117_R1_frac_fastqc.zip (report, data files and graphs)

TMEB117_R1_frac_fastqc.html (report in html)

Important: To do for you

Now you can download the .html reports from the server and look at them:

In VS Code: right-click the file in the file explorer, then choose download.

From another terminal: open a different terminal on your local machine and navigate to where you want the files on your computer. Then copy the files with rsync (modify as needed).

rsync -ah <user_name>@kebnekaise.hpc2n.umu.se:/proj/nobackup/hpc2nstor2025-XXX/<folder_name>/NGS_course/fastqc .
  • What is the quality of your sample?
  • Look at online resources to see if what you see is a problem, or expected because of the type of data you work with. A nice resource here is for example the FastQC tutorial from the Michigan State University.

You can see that it gets kind of tedious to look through all the different files one by one. It is fine with only a few files, but imagine having to sift through a few dozen, or even hundreds, of reports.

MultiQC

MultiQC searches a given directory for analysis logs and compiles an HTML report. It’s a general-use tool, perfect for summarising the output from numerous bioinformatics tools. It aggregates results from bioinformatics analyses across many samples into a single report.

Pull the apptainer image

On the seqera container page choose bioconda::multiqc for your container image. Proceed to pull the container image, following the steps we did for fastqc.

Download container image

Download the container image with singularity pull. The --name flag lets you rename the image to something more intuitive.

Note

Good practice is to name it after the tool and its version number.

Copy the image into your singularity_images folder, if it isn’t there yet.

Running multiqc with sbatch

Within your scripts directory, make a new file, multiqc.sh, and add the following:

#! /bin/bash -l

#SBATCH -A hpc2n2025-XXX
#SBATCH -t 15:00
#SBATCH -n 1

singularity exec singularity_images/multiqc_1.25.1.sif \
    multiqc -f -o multiqc .

Navigate out of the scripts directory back into the NGS_course directory.

Run the bash script with

sbatch scripts/multiqc.sh 

The command output (in the slurm.out file) looks something like:

/// MultiQC 🔍 v1.32

config        | Loading config settings from: multiqc_config.yaml
file_search   | Search path: /cfs/klemming/scratch/a/amrbin/NGS_course
fastqc        | Found 4 reports
write_results | Data   : multiqc/multiqc_data (overwritten)
write_results | Report : multiqc/multiqc_report.html (overwritten)
multiqc       | MultiQC complete

Download the report and look at it.

Important: To do for you

Understand what is going on. Read the documentation.

Do we need to adapter trim any samples?

FastP

FastP is a FASTQ data pre-processing tool. It provides functions for quality control, adapter trimming, quality filtering, and read pruning.

Depending on the analysis you need to do with the NGS data, it is wise to process the data according to the quality control results and remove low-scoring sequences and/or low-scoring 5’ and 3’ fragments. Trimming adapters makes sense for downstream analyses, but aggressive quality filtering can remove information that modern downstream tools could still utilize.
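
As a toy illustration of what adapter trimming does (this is not fastp, just the idea): a read that runs past the end of the insert continues into the adapter sequence, which must be removed before downstream analysis. Real trimmers like fastp also tolerate mismatches and partial adapter matches at the read end.

```shell
# Remove a known adapter suffix from a read (exact match only).
ADAPTER="AGATCGGAAGAGC"              # start of the Illumina TruSeq adapter
READ="ACGTACGTACGTAGATCGGAAGAGC"     # insert followed by adapter read-through
echo "$READ" | sed "s/${ADAPTER}.*//"
# prints ACGTACGTACGT
```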

Let’s get the output into a different directory:

mkdir fastp

Then retrieve the container image from seqera containers:

singularity pull --name fastp_1.0.1.sif oras://community.wave.seqera.io/library/fastp:1.0.1--a5a7772c43b5ebcb

Make sure the image is in the same folder as the other images we used so far.

Run fastp with the following bash script:

#! /bin/bash -l

#SBATCH -A hpc2n2025-XXX     # Project allocation
#SBATCH -t 15:00             # Time limit
#SBATCH -n 4                 # Number of cores

# Get CPUs allocated to the script
CPUS=$SLURM_NPROCS

# Define input files
DATA_DIR=raw/
OUT_DIR=fastp/
FILES=( $DATA_DIR/*_R1*.fastq )

# Function to run fastp
apply_fastp() {
    READ1="$1"      # Read 1 of the pair
    READ2="$2"      # Read 2 of the pair

    # Ensure READ1 and READ2 are distinct
    if [ "$READ1" == "$READ2" ]; then
        >&2 echo "Error: READ1 and READ2 are the same file. Check string substitution."
        exit 1
    fi

    # Extract prefix from READ1
    PREFIX=$(basename "${READ1%_R1*}")

    # Run fastp within the Singularity container
    singularity exec singularity_images/fastp_1.0.1.sif \
        fastp -w $CPUS \
        -i "$READ1" \
        -I "$READ2" \
        -o "${OUT_DIR}${PREFIX}_fastp-trimmed_R1.fastq" \
        -O "${OUT_DIR}${PREFIX}_fastp-trimmed_R2.fastq" \
        --json "${OUT_DIR}${PREFIX}_fastp.json" \
        --html "${OUT_DIR}${PREFIX}_fastp.html"

    echo "Processed ${PREFIX}"
}

# Main script execution: process files as R1/R2 pairs
for FASTQ in "${FILES[@]}"; do
    apply_fastp "$FASTQ" "${FASTQ/_R1/_R2}"
done

echo "complete"
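
The two bash parameter expansions in the script do the file-name bookkeeping. A quick demo with a file name matching the *_R1*.fastq pattern used above:

```shell
# How the script derives the mate file and the sample prefix.
FASTQ="raw/TMEB117_R1_frac.fastq"

echo "${FASTQ/_R1/_R2}"              # substitute _R1 with _R2: the mate file
echo "$(basename "${FASTQ%_R1*}")"   # strip _R1 and everything after: TMEB117
```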
Important: To do for you

Do you understand the bash script? Discuss with your neighbour and check out the manual for fastp.

Important: To do for you

Once you get the cleaned sequences run multiqc again to check the result.

Source Code
---
title: "Quality control of sequencing data"
subtitle: "screen, slurm, Apptainer, FastQC, MultiQC"
---

<br>
Regardless of what type of bioinformatics project you are working on, you will have to assess whether the data you have is of good enough quality to proceed with - bad input data can seriously impair your ability to draw conclusions from the samples later on!

Poor quality sequencing data can be caused by a variety of factors, such as sample contamination, improper sample handling, technical problems. Some of these problems will be checked for before and during the sequencing - it is always good to read all documentation coming from the sequencing facility. 

When it comes to sequencing data, FastQC is a well known and often used software to check raw reads for their quality. 

::: {.callout-note}
Sample preparation includes fragmenting the genome into the **sequencing library**.

A **read** is the inferred nucleotide base sequence of a genome fragment as determined by the sequencer. 
:::


# Quality Control Tutorial

Within this tutorial we will

- use the job scheduler `slurm`
- learn about containers, and where to get them, and how to use them with `apptainer`
- run `fastqc` on some raw sequencing files to practice and to understand the output
- run `multiqc` to summarize the FastQC output
- run `fastp` to do adapter trimming


# Preparations

## Connect to HPC2N

For this tutorial we will connect to the course server.

To connect to the server, please follow the instructions [here](server_access.qmd).

Quick access link to the on-demand service of HPC2N: 

[HPC2N on-demand link](https://portal.hpc2n.umu.se/public/landing_page.html).

## Navigate to your course directory

Make. sure you are in your folder in the course directory. 

::: {.callout-tip}
Use the command line commands `ls` to list the contents of the directory you are currently in, `pwd` to print the path of the current working directory, and `cd` to navigate between directories. 
:::

## Start a screen session

[Screen](https://www.gnu.org/software/screen/manual/screen.html) or GNU Screen is a terminal multiplexer. In other words, it means that you can start a screen session and then open any number of windows (virtual terminals) inside that session. Processes running in Screen will continue to run when their window is not visible even if you get disconnected.


Start a `named session`, with the name `qc`:

```{.bash}
screen -S qc
```
You can detach from the screen session. The process within the screen will continue to run.

```{.bash}
Ctrl + a d
```

You can always reattach to the session. If you have a number of screen running, or are unsure of the name or ID of the screen you want to reattach to you can list the currently running screens:

```{.bash}
screen -ls
```

To `resume your screen session` use the following command:

```{.bash}
screen -r name
```

## Create a directory to work in

From within your personal directory on the course project, start by creating a workspace for the raw data used in this exercise in your scratch space, and then move into it:

```{.bash}
mkdir -p  NGS_course/raw
cd NGS_course/raw
```

## Create symbolic link to the data

The raw data files are located in:

```{.bash}
path/to/course/training_data
```

You could copy the files into your workspace to access them. However, it is better to create symbolic links (also called soft links) to the data files. This saves disk space and still allows you to work with them as if they were in your own directory.

Create symbolic links to the fastq files in your workspace:

```{.bash}
ln -s path/to/course/data/*.fastq .
```

You now have four files in your directory: two for the TMEB117 cultivar containing the DNA sequences, and two for the TMEB419 cultivar, containing the RNA sequencing results.

::: {.callout-important}
## To do for you
Leave the `/raw` directory and make sure that you are back in your personal course `directory/NGS_course` folder.
:::


# Data format

You now have the raw sequencing data, in a very common data format, fastq. Let's have a look at the data (change the name of the file accordingly): 

```{.bash}
cat raw/sample.fastq.gz | head -n 10 
```

For each read we see an entry consisting of four lines. Every new read starts with an `@`. 

| Line | Content|
|-|-----|
| 1    | Information about the read, always starts with an `@`|
| 2    | nucleic sequence of the read|
| 3    | starts with \`+\`, can (but does not have to) contain data|
| 4    | characters representing the quality scores with the bases of the read sequence |

: Fastq format

::: {.callout-note}
The quality score is encoded in ASCII, to represent scores from 0 (high probability of error in calling the base) to 42 (low probability of error in calling the base). 
:::

We could now look through the fastq file and scan the quality scores for the reads in our sample one by one. Sounds kind of tedious, doesn't it? Luckily, here comes FastQC (and other, similar tools). 


# FastQC Intro

[`Fastqc`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is a simple tool to monitor the quality and the properties of a NGS sequencing file in fastq, SAM and BAM format.

FastQC will give us an overview of the entire sample, so we won't have to look at the quality score of every single read in the sample. 

In addition to the quality scores, FastQC also performs a series of quality control analyses, called modules. The output is a HTML report with one section for each module, and a summary evaluation of the results in the top. "Normal" results are marked with a green tick, slightly abnormal results raise warnings (orange exclamation mark), and very unusual results raise failures (red cross).

Keep in mind that even though FastQC is giving out pass/fail results for your samples, these evaluations must be taken in the context of what you expect from your library. A ‘normal’ sample as far as FastQC is concerned is random and diverse. However, because of project design your samples might deviate from this expectation. The summary evaluations should be pointers to where you have to concentrate your attention and understand why your library may not look random and diverse.

::: {.callout-important}
## To do for you
Let's see which version is installed on the server:

```{.bash}
fastqc --version
```
:::

Oh no! FastQC is not installed on the server, what can we do? 

# Containers

In this tutorial, we will be using `containers` to run tools and specific tool versions that may not be directly installed on the system we are working on.

## What are containers?

Containers are stand-alone pieces of software that require a container management tool to run. They are build and exchanged as container images that specify the contents of the container, such as the operating system, all dependencies, and software in an isolated environment. The container management tool then takes the images and build the container. These management tools can be run on all operating systems, and since the container has the operating system within it, it will run the same in all environments. Container images are easily portable and immutable, so they are stable over time.

## Running Containers

There are several programs that can be used to build and run containers. [Docker](https://www.docker.com), [Appptainer](https://apptainer.org), and [Podman](https://podman.io) are the most commonly used platforms to date. They all have their pros and cons. If you are using a Windows machine that only you are using, then Docker is likely the least complex tool to install. On multi-user systems like a server, Apptainer is the best tool for the job. For this tutorial and the rest of the course, we will use Apptainer commands. There are small syntax changes between bash and powershell commands, but they are very similar.

## Getting apptainer image

One good place to get quality controlled Apptainer/Singularity containers that contain the tools we want to use is [seqera containers](https://seqera.io/containers/).

- Go to their homepage.

- In the searchbar, type in the tool you want - fastqc

[![](images/seqera_cont_one)]()

<br><br>

- Add the tool you want to have in your container (in this case `fastqc` from Bioconda).

- In the container settings underneath the search bar, select Singularity and linux/amd64

- Click "get container".

[![](images/seqera_cont_2)]()

<br><br>

- Once the container is ready copy the path of the image.

[![](images/seqera_cont_3)]()


## Download container images

To have a nice and clean project directory we will make a new sub-directory that will contain all the singularity images we will use during this tutorial.

```{.bash}
mkdir singularity_images
cd singularity_images
```

Now we can pull the container image from its location into our folder:

```{.bash}
singularity pull --name fastqc_0.12.1.sif oras://community.wave.seqera.io/library/fastqc:0.12.1--104d26ddd9519960
```

Then we move out of the directory again:

```{.bash}
cd .. 
```


### Running Containers
Once you have pulled the container image, you want to be able to use it. Apptainer can be used to build the container from the image.

#### running "from the outside"

There are 2 different ways to use a container: `run` and `exec`. The `apptainer run` command launches the container and first runs the `%runscript` for the container if one is defined, and then runs your command (we will cover `%runscript` in the `Building Containers` section). The `apptainer exec` command will not run the `%runscript` even if one is defined. It is a small, fiddly detail that might be applicable if you use other people's containers. After calling Apptainer and the `run` or `exec` commands, you can use your software as you usually would

```{.bash}
apptainer exec singularity_images/fastqc_0.12.1.sif fastqc --version
```

This command runs your `fastqc_0.12.1.sif` container from the image, calls on the program `fastqc` that is within the container, and shows you the version. If you had installed FastQC locally, you would have just used 

```{.bash}
fastqc --version
```

::: {.callout-important}
FastQC is just an example. If you want to run any other tool everything after `apptainer run` or `apptainer exec` has to be substituted by the name of the specific container image and the run commands for that particular tool!
:::


#### running interactively "from the inside"

You can also enter the container, and work interactively from within. For that you use the `apptainer shell` command:

```{.bash}
apptainer shell singularity_images/fastqc_0.12.1.sif
```

Inside the container, your prompt will change to `Singularity` (remember, that is the legacy name for Apptainer). Now you can use the tools inside the container. 

To exit the container, simply type and enter `exit`. 

Great, now we have FastQC ready and can use it. How can we submit the job to the server?

## SLURM

HPC2N is running SLURM (Simple Linux Utility for Resource Management) as its job scheduling system. When you submit a job on the cluster log-in node, SLURM will start, execute and monitor your jobs on the working nodes. It allocates access to the cluster resources, and manages the queue of pending jobs. [Here](https://docs.hpc2n.umu.se/documentation/batchsystem/intro/) is a more thorough documentation of SLURM on Kebnekaise. 

To be able to do so, SLURM needs a bit of information from us when we submit a job:

- -A: project_ID (to deduct the used computing time from the correct project) 
- -t allocated time dd-hh:mm:ss (to optimize the job queue) 
- -n number of cores (default is one)

## Running fastqc with sbatch

We will now write a script that runs fastqc on our samples - using the container we pulled. We will add the slurm flags from above and then submit to the server queue with the SLURM command `sbatch`. 

Again, we want to maintain a clean and orderly project directory:

In your NGS_course folder, create a new directory called scripts, within this directory create a file called fastqc.sh.

```{.bash}
mkdir scripts
cd scripts
nano fastqc.sh
```

::: {.callout-tip}
## Nano
Nano is a Linux command line text editor. Commands are prefixed with `^`or `M` characters. The caret symbol `^` represents the `Ctrl` key. For example, the `^X` commands mean to press the `Ctrl` and `X` keys at the same time. The letter `M` represents the `Alt` key. 

More information [here](https://linuxize.com/post/how-to-use-nano-text-editor/).
:::

Copy the following into the file, and save the contents. Read through the file and try to understand what the different lines are doing.

```{.bash}
#! /bin/bash -l
# The name of the compute account you are running in, mandatory.
#SBATCH -A hpc2n2025-XXX
# Request runtime for the job (HHH:MM:SS) where 168 hours is the maximum. Here asking for 15 min. 
#SBATCH -t 15:00
# Request resources - here for four cpus
#SBATCH -n 4

apptainer exec singularity_images/fastqc_0.12.1.sif \
    fastqc -o fastqc/ raw/*.fastq
    echo "complete"
```

The slurm options used here:

- A: project ID
- t allocated time dd-hh:mm:ss
- n number of cpus

Move back into the NGS_course directory and submit the script to slurm:

```{.bash}
cd ..
sbatch scripts/fastqc.sh
```

After running a bash script you will get a slurm output. Look at that output. See if you understand what that output contains.

```{.bash}
less slurm-XXXXX.out
```

Locate the output of FastQC.

::: {.callout-note}
Which output directory did you specify in the batch file?
:::


For each fastq file you will get two output files:

> TMEB117_R1_frac_fastq.zip (report, data files and graphs) 

> TMEB117_R1_frac_fastq.html (report in html)

::: {.callout-important}
## To do for you
Now you can download the .html report from the server and look at them: 

In VScode: right click on the file in the file explorer --> download

From another terminal: open a different local terminal and navigate to where you want the files on your computer. Then copy the files with `rsync` (modify the paths as needed).

```{.bash}
rsync -ah <user_name>@kebnekaise.hpc2n.umu.se:/proj/nobackup/hpc2nstor2025-XXX/<folder_name>/NGS_course/fastqc .
```

- What is the quality of your sample? 
- Look at online resources to see if what you see is a problem, or expected because of the type of data you work with. A nice resource here is for example the [FastQC tutorial from the Michigan State University](https://rtsf.natsci.msu.edu/genomics/technical-documents/fastqc-tutorial-and-faq.aspx).
:::

You can see that it gets tedious to look through all the different files one by one. That is fine with only a few files, but imagine having to sift through a few dozen, or even hundreds, of reports.

# MultiQC

[MultiQC](https://seqera.io/multiqc/) searches a given directory for analysis logs and compiles an HTML report. It is a general-use tool, perfect for summarising the output of numerous bioinformatics tools, aggregating results from many samples into a single report.

### Pull the apptainer image

On the seqera container page choose `bioconda::multiqc` for your container image. Proceed to pull the container image, following the steps we did for `fastqc`.

### Download container image

Download the container image with `singularity pull`. The `--name` flag lets you rename the image to something more intuitive.

::: {.callout-note}
Good practice is to name it after the tool and its version number.
:::

Copy the image into your `singularity_images` folder, if it isn't there yet.


### Running multiqc with sbatch

Within your `scripts` directory, make a new file, `multiqc.sh`, and add the following:

```{.bash}
#! /bin/bash -l

#SBATCH -A hpc2n2025-XXX
#SBATCH -t 15:00
#SBATCH -n 1

singularity exec singularity_images/multiqc_1.25.1.sif \
    multiqc -f -o multiqc .
```

Navigate out of the `scripts` directory back into the `NGS_course` directory.

Run the bash script with

```{.bash}
sbatch scripts/multiqc.sh 
```

The command output (in the `slurm-XXXXX.out` file) looks something like:


> /// MultiQC 🔍 v1.32
> 
> config | Loading config settings from: multiqc_config.yaml
> file_search | Search path: /cfs/klemming/scratch/a/amrbin/NGS_course
> 
> fastqc | Found 4 reports
> 
> write_results | Data        : multiqc/multiqc_data   (overwritten)
> write_results | Report      : multiqc/multiqc_report.html   (overwritten)
> multiqc | MultiQC complete


Download the report and look at it. 

::: {.callout-important}
## To do for you

Understand what is going on. Read the documentation.

Do we need to adapter trim any samples?
:::

# FastP

[FastP](https://github.com/OpenGene/fastp) is a FASTQ data pre-processing tool. It provides functions for quality control, adapter trimming, quality filtering, and read pruning.

Depending on what analysis you need to do with the NGS data, it is wise to process the data according to the quality-control results and remove low-quality reads and/or low-quality 5' and 3' ends. Trimming adapters makes sense for most downstream analyses, but aggressive quality filtering can remove information that modern downstream tools could still utilize.

Let's get the output into a different directory:

```{.bash}
mkdir fastp
```

Then retrieve the container image from seqera containers:

```{.bash}
singularity pull --name fastp_1.0.1.sif oras://community.wave.seqera.io/library/fastp:1.0.1--a5a7772c43b5ebcb
```

Make sure the image is in the same folder as the other images we used so far.

Run fastp with the following bash script:

```{.bash}
#! /bin/bash -l

#SBATCH -A hpc2n2025-XXX     # Project allocation
#SBATCH -t 15:00             # Time limit
#SBATCH -n 4                 # Number of cores

# Get CPUs allocated to the script
CPUS=$SLURM_NPROCS

# Define input files
DATA_DIR=raw/
OUT_DIR=fastp/
FILES=( $DATA_DIR/*_R1*.fastq )

# Function to run fastp
apply_fastp() {
    READ1="$1"      # Read 1 of the pair
    READ2="$2"      # Read 2 of the pair

    # Ensure READ1 and READ2 are distinct
    if [ "$READ1" == "$READ2" ]; then
        >&2 echo "Error: READ1 and READ2 are the same file. Check string substitution."
        exit 1
    fi

    # Extract prefix from READ1
    PREFIX=$(basename "${READ1%_R1*}")

    # Run fastp within the Singularity container
    singularity exec singularity_images/fastp_1.0.1.sif \
        fastp -w $CPUS \
        -i "$READ1" \
        -I "$READ2" \
        -o "${OUT_DIR}${PREFIX}_fastp-trimmed_R1.fastq" \
        -O "${OUT_DIR}${PREFIX}_fastp-trimmed_R2.fastq" \
        --json "${OUT_DIR}${PREFIX}_fastp.json" \
        --html "${OUT_DIR}${PREFIX}_fastp.html"

    echo "Processed ${PREFIX}"
}

# Main script execution
# Process files as pairs
for ((i = 0; i < ${#FILES[@]}; i+=1)); do 
    FASTQ="${FILES[i]}"
    apply_fastp "$FASTQ" "${FASTQ/_R1/_R2}"
done

echo "complete"
```
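The pairing logic in the script relies on two bash parameter expansions: `${FASTQ/_R1/_R2}` replaces the first occurrence of `_R1` with `_R2`, and `${READ1%_R1*}` strips everything from `_R1` to the end of the string. A small standalone sketch (the filename is illustrative):

```shell
READ1=raw/TMEB117_R1_frac.fastq

READ2=${READ1/_R1/_R2}               # substitute: raw/TMEB117_R2_frac.fastq
PREFIX=$(basename "${READ1%_R1*}")   # strip suffix and directory: TMEB117

echo "$READ2"    # prints: raw/TMEB117_R2_frac.fastq
echo "$PREFIX"   # prints: TMEB117
```

Because the `for` loop only globs `*_R1*.fastq` files, each iteration derives the matching R2 file and the sample prefix from the R1 filename alone.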

::: {.callout-important}
## To do for you

Do you understand the bash script? Discuss with your neighbour and check out the manual for fastp.
:::



::: {.callout-important}
## To do for you

Once you get the cleaned sequences run multiqc again to check the result.
:::