Introduction to NGS Mapping

Now that you have looked at the quality of the samples, let’s do the next step of a typical analysis: mapping the reads to a reference genome. Once we know where our reads map we could then continue to annotate our reads. However this will be beyond the scope of this tutorial.

Learning objectives

map NGS sequences to a reference
use different tools for different types of data
train the command line more and more independently
navigate absolute and relative paths when using the command line
structure your project directories to be well organized

Preparations

Connect to Dardel

For this tutorial we will connect to Dardel. For everyone connecting via Kerberos this is the command:

ssh -o GSSAPIAuthentication=yes <PDC username>@dardel.pdc.kth.se

Start a screen session

Screen or GNU Screen is a terminal multiplexer. In other words, it means that you can start a screen session and then open any number of windows (virtual terminals) inside that session. Processes running in Screen will continue to run when their window is not visible even if you get disconnected.

Start a named session

screen -S mapping

You can detach from the screen session. The process within the screen will continue to run.

Ctrl + a d

You can always reattach to the session. If you have a number of screen running, or are unsure of the name or ID of the screen you want to reattach to you can list the currently running screens:

screen -ls

To resume your screen session use the following command:

screen -r name

Exercise

Can you find the screen session that you started last friday?

Change into PDC scratch space

To move into the scratch space, change into it:

cd $PDC_TMP

You can check that you are in it by printing your working directory:

pwd

/cfs/klemming/scratch/<user letter>/<user name>

Exercise

Do you remember why we decided to work from scratch?

Read mapping

When studying an organism with a reference genome, it is possible to infer which transcripts are expressed by mapping the reads to the reference genome (genome mapping) or transcriptome (transcriptome mapping). Mapping reads to the genome requires no knowledge of the set of transcribed regions or the way in which exons are spliced together. This approach allows the discovery of new, unannotated transcripts. ¹

Read mapping is the process to align the reads on a reference genome. A mapper takes as input a reference genome and a set of reads. Its aim is to align each read in the set of reads on the reference genome, allowing mismatches, indels and clipping of some short fragments on the two ends of the reads.

Below is an illustration of the mapping process. The input consists of a set of reads and a reference genome. In the middle, it gives the results of mapping: the locations of the reads on the reference genome. The first read is aligned at position 100 and the alignment has two mismatches. The second read is aligned at position 114. It is a local alignment with clippings on the left and right. The third read is aligned at position 123. It consists of a 2-base insertion and a 1-base deletion ².

Data

In this tutorial we will map data of two experiments against a reference sequence.

The first data set is a paired end DNA sequence of a cassava genotype called TMEB117 and the second data set is a paired end RNA sequencing from a cassava genotype called TMEB419. You have worked with these samples in the last exercise.

And again a special shoutout to Andreas Gisel for providing these samples!

The reference sequence is chromosome01 from the latest cassava genome built V8.

Create a directory to work in

Start by creating a workspace for this exercise in your scratch folder, and then move into it:

mkdir -p  NGS_course/mapping/ref
cd NGS_course/mapping/ref

Symbolic links to data

To save time and computation power we will use only chromosome01 for the exercise.

The reference data files are located in:

/sw/courses/slu_bioinfo

Create symbolic links to the fastq files in your workspace:

ln -s /sw/courses/slu_bioinfo/chromosome01.fasta  .
ln -s /sw/courses/slu_bioinfo/Mesculenta_671_v8.1.gene_exons.gtf .
ln -s /sw/courses/slu_bioinfo/Mesculenta_671_v8.1.gene_exons.gff3 .

Exercise

Do you remember why we decided to link to the files instead of copying them?

Check that you can now see the linked files in the directory and move back into the directory “mapping”.

ls
cd ..

Mapping DNA sequences

Bowtie 2

Now we are all set to map our paired end DNA sequences onto the reference.

Bowtie2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.

Apptainer image

Go to seqera containers, and get the container image path for bioconda::bowtie2. Remember to get the singularity container, not the one for docker.

Pull the image.

Expand to see the code - try yourself first

singularity pull --name bowtie2_2.5.4.sif https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/9b/9bbc1c148cefc585f681c71d9369a84f72f46ff1195850c7d416f9bbf65cb51e/data

Create index

Each NGS mapper uses indices to accelerate the mapping process, but unfortunately each one has his own way to create these indices. Therefore we first need to create the index for bowtie2 of our reference sequence, which we will call Bowtie2Cassava01Index and will be located in the reference directory.

Create a bash file within the scripts directory:

nano bowtie_index.sh

And copy the followiing into the file:

#! /bin/bash -l

#SBATCH -A edu24.bk0001
#SBATCH -t 05:00
#SBATCH -n 1
#SBATCH -p shared

set -ueo pipefail

module load PDC apptainer

# Define input reference and output directory
REFERENCE=ref/chromosome01.fasta
OUTPUT_DIR=ref/

singularity exec  -B /sw/courses/slu_bioinfo/ ../singularity_images/bowtie2_2.5.4.sif \
    bowtie2-build -o 3 $REFERENCE $OUTPUT_DIR/Bowtie2Cassava01Index
    
echo "complete"

Save the above code and run the script using

sbatch ../scripts/bowtie_index.sh

Important

Make sure you run this code from within the “mapping directory”.

You will get the following index files:

Map with bowtie2

Now that we have the index we create a directory for the bowtie2 results:

mkdir bowtie2

and then we can run the mapping:

Make a file in the scripts folder bowtie2_mapping.sh.

#! /bin/bash -l

#SBATCH -A edu24.bk0001
#SBATCH -t 15:00
#SBATCH -n 4
#SBATCH -p shared

set -ueo pipefail

module load PDC apptainer

# Get CPUS allocated to slurm script (-n above)
CPUS=$SLURM_NPROCS

INDEX=ref/Bowtie2Cassava01Index
R1=../raw/TMEB117_R1_frac.fastq
R2=../raw/TMEB117_R2_frac.fastq

singularity exec  -B /sw/courses/slu_bioinfo/ -B $PWD/../raw ../singularity_images/bowtie2_2.5.4.sif \
    bowtie2 --very-sensitive-local \
    --threads $CPUS \
    -x $INDEX \
    -S bowtie2/TMEB117.sam \
    -1 $R1 \
    -2 $R2 |& tee bowtie2/TMEB117.log

echo "complete"

Save the script in the scripts directory and run the code.

We will get a mapping file in the SAM format and the log file with the mapping statistics.

mapping/bowtie2/TMEB117.sam
mapping/bowtie2/TMEB117.log

Exercise

Check the log and find how many unique hits of pairs we have. What does that even mean?

Check the SAM file and check how many hits we have - use samtools to do so. Module load the tool, or get the appropriate container. Look into the samtools manual to figure out how to look at the file.

Now you can play around a bit. You could see what influence the trimming might have on the mapping of these data. For that re-run the mapping with the trimmed data. Make sure you point towards the correct file paths!

Check the log and SAM file and see what difference we have.

Time allowing, I propose to run a third test mapping either using the raw data of trimmed data by changing the mapping mode from local to end-to-end or from very-sensitive to fast, loosing some accuracy but reducing the mapping time.

Replace in the commands above –very-sensitive-local to –very-sensitive or from –very-sensitive-local to –fast-local.

Again, compare the output with the previous ones.

Mapping RNA seq data

The data set TMEB117 is DNA sequencing data and bowtie2 the right tool for mapping and data analysis.

TMEB419 however is RNA sequencing data. While we could map the data with bowtie2 (try it out if you want to), there are aligners that are dedicated for RNA sequencing data. These find splicing sites within the reads and guarantee an accurate reconstruction of the sequenced transcripts.

STAR

STAR is one such spicing site sensitive mapper for RNA sequencing data.

Apptainer image

Go to seqera containers, and get the container image path for bioconda::star.

Pull the image into the same directory as the other container images.

Note

Which one was that again?

Expand to see the code for the pull - try yourself first

singularity pull --name star_2.7.11b.sif https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/9b/9b8ecb2f9a77b5e7573ef6fae2f4c2e771064f7a129ed1329913c1025c33f365/data

Prepare index

The data we will use for this part of the tutorial is paired end RNA sedquencing data and is already in your /raw directory (TMEB419RNA_frag_R1.fastq, TMEB419RNA_frag_R2.fastq).

First, we need to generate the STAR index, and this time we will include gene annotations to enable the mapped reads to be associated with known genes and transcripts.

Next we have to create the directory where we will store the STAR index for chromosome 1. Let’s use the /refdirectory for that:

mkdir ref/STARCassava01Index

The indexing script for STAR will be the following:

Exercise

Do you remember all the steps you need to do to run the script?

Expand to see the code for the pull - try yourself first

Save the below in a file:

star_index.sh

in the appropriate directory, and run with

sbatch star_index.sh

from the appropriate directory.

#! /bin/bash -l

#SBATCH -A edu24.bk0001
#SBATCH -t 05:00
#SBATCH -n 1
#SBATCH -p shared

set -ueo pipefail

module load PDC apptainer

INDEX=ref/STARCassava01Index
REFERENCE=ref/chromosome01.fasta
GTF=ref/Mesculenta_671_v8.1.gene_exons.gtf

singularity exec  -B /sw/courses/slu_bioinfo/ -B $PWD/../raw ../singularity_images/star_2.7.11b.sif \
    STAR    --runMode genomeGenerate \
        --genomeDir $INDEX \
        --genomeFastaFiles $REFERENCE \
        --sjdbGTFfile $GTF \
        --genomeSAindexNbases 11 \
        --genomeChrBinNbits 16 \
        --outTmpDir ./star

With this you will end up with a bunch of index files:

Files like geneInfo.tab, exonGeTrInfo.tab, transcriptInfo.tab, exonInfo.tab, sjdbList.fromGTF.out.tab, sjdbList.out.tab, sjdbInfo.txt contain the indexing information of the predicted genes, transcripts, exons and splicing sites annotated in the Mesculenta_671_v8.1.gene_exons.gtf file. Have a look around!

Exercise

Have a look around the files to see what you discover!

Map with STAR

Now we are ready to map the DNA sequencing files to the reference sequence.

Make a directory for the output:

mkdir star

And then save the following script and run it with sbatch (like above).

#! /bin/bash -l

#SBATCH -A edu24.bk0001
#SBATCH -t 15:00
#SBATCH -n 4
#SBATCH -p shared

set -ueo pipefail

module load PDC apptainer

# Get CPUS allocated to slurm script (-n above)
CPUS=$SLURM_NPROCS

INDEX=ref/STARCassava01Index
READ1=../raw/TMEB419RNA_frag_R1.fastq
READ2=../raw/TMEB419RNA_frag_R2.fastq

singularity exec  -B /sw/courses/slu_bioinfo/ -B $PWD/../raw ../singularity_images/star_2.7.11b.sif \
    STAR --genomeDir $INDEX \
        --outFilterMismatchNoverLmax 0.06 \
        --outFilterMatchNminOverLread 0.5 \
        --outFilterScoreMinOverLread 0.2 \
        --alignIntronMax 20000 \
        --alignMatesGapMax 10000 \
        --outFileNamePrefix star/TMEB419RNA_  \
        --readFilesIn $READ1 $READ2

echo "complete"

Run with sbatch.

Your output will be the SAM file, report files and the list of predicted splicing sites.

Exercise

Inspect the report file and check the mapped reads and the average mapped length.

Is this a good mapping?

Next steps

After quality control and mapping, a typical next step when working with NGS data would be to annotate the reads, so you get an idea of what you are looking at. You will do that later in this course.

Exercise

Look back at the NGS_course directory. What do you think, did you manage to keep the chaos at bay? What went well in keeping order and what could be improved? Discuss with your neighbour.

For now, please do not forget to fill out the quick quizzes (one for the quality control exercise, one for the mapping exercise). Then you are done for the day. (o:

Footnotes

# Introduction to NGS Mapping Now that you have looked at the quality of the samples, let's do the next step of a typical analysis: mapping the reads to a reference genome. Once we know where our reads map we could then continue to annotate our reads. However this will be beyond the scope of this tutorial. # Learning objectives - map NGS sequences to a reference - use different tools for different types of data - train the command line more and more independently - navigate absolute and relative paths when using the command line - structure your project directories to be well organized # Preparations ## Connect to Dardel For this tutorial we will connect to Dardel. For everyone connecting via Kerberos this is the command: ```{.bash} ssh -o GSSAPIAuthentication=yes <PDC username>@dardel.pdc.kth.se ```` ## Start a screen session [Screen](https://www.gnu.org/software/screen/manual/screen.html) or GNU Screen is a terminal multiplexer. In other words, it means that you can start a screen session and then open any number of windows (virtual terminals) inside that session. Processes running in Screen will continue to run when their window is not visible even if you get disconnected. Start a `named session` ```{.bash} screen -S mapping ``` You can detach from the screen session. The process within the screen will continue to run. ```{.bash} Ctrl + a d ``` You can always reattach to the session. If you have a number of screen running, or are unsure of the name or ID of the screen you want to reattach to you can list the currently running screens: ```{.bash} screen -ls ``` To `resume your screen session` use the following command: ```{.bash} screen -r name ``` ::: {.callout-tip} ## Exercise Can you find the screen session that you started last friday? ::: ## Change into PDC scratch space To move into the scratch space, change into it: ```{.bash} cd $PDC_TMP ``` You can check that you are in it by printing your working directory: ```{.bash} pwd ``` > /cfs/klemming/scratch/\<user letter>/\<user name> ::: {.callout-tip} ## Exercise Do you remember why we decided to work from scratch? ::: # Read mapping When studying an organism with a reference genome, it is possible to infer which transcripts are expressed by mapping the reads to the reference genome (**genome mapping**) or transcriptome (**transcriptome mapping**). Mapping reads to the genome requires no knowledge of the set of transcribed regions or the way in which exons are spliced together. This approach allows the discovery of new, unannotated transcripts. ^[[ https://www.ebi.ac.uk/training/online/courses/functional-genomics-ii-common-technologies-and-data-analysis-methods/rna-sequencing/performing-a-rna-seq-experiment/data-analysis/read-mapping-or-alignment/ ]( https://www.ebi.ac.uk/training/online/courses/functional-genomics-ii-common-technologies-and-data-analysis-methods/rna-sequencing/performing-a-rna-seq-experiment/data-analysis/read-mapping-or-alignment/ )] Read mapping is the process to align the reads on a reference genome. A mapper takes as input a reference genome and a set of reads. Its aim is to align each read in the set of reads on the reference genome, allowing mismatches, indels and clipping of some short fragments on the two ends of the reads. Below is an illustration of the mapping process. The input consists of a set of reads and a reference genome. In the middle, it gives the results of mapping: the locations of the reads on the reference genome. The first read is aligned at position 100 and the alignment has two mismatches. The second read is aligned at position 114. It is a local alignment with clippings on the left and right. The third read is aligned at position 123. It consists of a 2-base insertion and a 1-base deletion ^[[https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/mapping/tutorial.html ](https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/mapping/tutorial.html )]. <br><br> ![](images/mapping.png) # Data In this tutorial we will map data of two experiments against a reference sequence. The first data set is a paired end DNA sequence of a cassava genotype called TMEB117 and the second data set is a paired end RNA sequencing from a cassava genotype called TMEB419. You have worked with these samples in the last exercise. And again a special shoutout to Andreas Gisel for providing these samples! The reference sequence is chromosome01 from the latest cassava genome built [V8](https://phytozome-next.jgi.doe.gov/info/Mesculenta_v8_1). ## Create a directory to work in Start by creating a workspace for this exercise in your scratch folder, and then move into it: ```{.bash} mkdir -p NGS_course/mapping/ref cd NGS_course/mapping/ref ``` ## Symbolic links to data To save time and computation power we will use only chromosome01 for the exercise. The reference data files are located in: ```{.bash} /sw/courses/slu_bioinfo ``` Create symbolic links to the fastq files in your workspace: ```{.bash} ln -s /sw/courses/slu_bioinfo/chromosome01.fasta . ln -s /sw/courses/slu_bioinfo/Mesculenta_671_v8.1.gene_exons.gtf . ln -s /sw/courses/slu_bioinfo/Mesculenta_671_v8.1.gene_exons.gff3 . ``` ::: {.callout-tip} ## Exercise Do you remember why we decided to link to the files instead of copying them? ::: Check that you can now see the linked files in the directory and move back into the directory "mapping". ```{.bash} ls cd .. ``` # Mapping DNA sequences ## Bowtie 2 Now we are all set to map our paired end DNA sequences onto the reference. [Bowtie2](https://bowtie-bio.sourceforge.net/bowtie2/index.shtml) is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes. ## Apptainer image Go to [seqera containers](https://seqera.io/containers/), and get the container image path for `bioconda::bowtie2`. Remember to get the singularity container, not the one for docker. Pull the image. ::: {.callout-caution collapse="true"} ## Expand to see the code - try yourself first ```{.bash} singularity pull --name bowtie2_2.5.4.sif https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/9b/9bbc1c148cefc585f681c71d9369a84f72f46ff1195850c7d416f9bbf65cb51e/data ``` ::: ## Create index Each NGS mapper uses indices to accelerate the mapping process, but unfortunately each one has his own way to create these indices. Therefore we first need to create the index for `bowtie2` of our reference sequence, which we will call Bowtie2Cassava01Index and will be located in the reference directory. Create a bash file within the scripts directory: ```{.bash} nano bowtie_index.sh ``` And copy the followiing into the file: ```{.bash} #! /bin/bash -l #SBATCH -A edu24.bk0001 #SBATCH -t 05:00 #SBATCH -n 1 #SBATCH -p shared set -ueo pipefail module load PDC apptainer # Define input reference and output directory REFERENCE=ref/chromosome01.fasta OUTPUT_DIR=ref/ singularity exec -B /sw/courses/slu_bioinfo/ ../singularity_images/bowtie2_2.5.4.sif \ bowtie2-build -o 3 $REFERENCE $OUTPUT_DIR/Bowtie2Cassava01Index echo "complete" ``` Save the above code and run the script using ```{.bash} sbatch ../scripts/bowtie_index.sh ``` ::: {.callout-important} Make sure you run this code from within the "mapping directory". ::: You will get the following index files: ![](images/Bowtie_Index_files.png) ## Map with bowtie2 Now that we have the index we create a directory for the bowtie2 results: ```{.bash} mkdir bowtie2 ``` and then we can run the mapping: Make a file in the scripts folder `bowtie2_mapping.sh`. ```{.bash} #! /bin/bash -l #SBATCH -A edu24.bk0001 #SBATCH -t 15:00 #SBATCH -n 4 #SBATCH -p shared set -ueo pipefail module load PDC apptainer # Get CPUS allocated to slurm script (-n above) CPUS=$SLURM_NPROCS INDEX=ref/Bowtie2Cassava01Index R1=../raw/TMEB117_R1_frac.fastq R2=../raw/TMEB117_R2_frac.fastq singularity exec -B /sw/courses/slu_bioinfo/ -B $PWD/../raw ../singularity_images/bowtie2_2.5.4.sif \ bowtie2 --very-sensitive-local \ --threads $CPUS \ -x $INDEX \ -S bowtie2/TMEB117.sam \ -1 $R1 \ -2 $R2 |& tee bowtie2/TMEB117.log echo "complete" ``` Save the script in the scripts directory and run the code. We will get a mapping file in the SAM format and the log file with the mapping statistics. > mapping/bowtie2/TMEB117.sam\ > mapping/bowtie2/TMEB117.log ::: {.callout-tip} ## Exercise Check the log and find how many unique hits of pairs we have. What does that even mean? ::: Check the SAM file and check how many hits we have - use `samtools` to do so. Module load the tool, or get the appropriate container. Look into the `samtools manual` to figure out how to look at the file. Now you can play around a bit. You could see what influence the trimming might have on the mapping of these data. For that re-run the mapping with the trimmed data. Make sure you point towards the correct file paths! Check the log and SAM file and see what difference we have. Time allowing, I propose to run a third test mapping either using the raw data of trimmed data by changing the mapping mode from local to end-to-end or from very-sensitive to fast, loosing some accuracy but reducing the mapping time. Replace in the commands above *--very-sensitive-local* to *--very-sensitive* or from *--very-sensitive-local* to *--fast-local*. Again, compare the output with the previous ones. # Mapping RNA seq data The data set `TMEB117` is `DNA` sequencing data and `bowtie2` the right tool for mapping and data analysis. `TMEB419` however is `RNA` sequencing data. While we could map the data with bowtie2 (try it out if you want to), there are aligners that are dedicated for RNA sequencing data. These find splicing sites within the reads and guarantee an accurate reconstruction of the sequenced transcripts. ## STAR [STAR](https://github.com/alexdobin/STAR) is one such spicing site sensitive mapper for RNA sequencing data. ## Apptainer image Go to [seqera containers](https://seqera.io/containers/), and get the container image path for `bioconda::star`. Pull the image into the same directory as the other container images. ::: {.callout-note} Which one was that again? ::: ::: {.callout-caution collapse="true"} ## Expand to see the code for the pull - try yourself first ```{.bash} singularity pull --name star_2.7.11b.sif https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/9b/9b8ecb2f9a77b5e7573ef6fae2f4c2e771064f7a129ed1329913c1025c33f365/data ``` ::: ## Prepare index The data we will use for this part of the tutorial is paired end `RNA sedquencing data` and is already in your `/raw` directory (`TMEB419RNA_frag_R1.fastq, TMEB419RNA_frag_R2.fastq`). First, we need to generate the STAR index, and this time we will include gene annotations to enable the mapped reads to be associated with known genes and transcripts. Next we have to create the directory where we will store the STAR index for chromosome 1. Let's use the `/ref`directory for that: ```{.bash} mkdir ref/STARCassava01Index ``` The indexing script for STAR will be the following: :::{.callout-tip} ## Exercise Do you remember all the steps you need to do to run the script? ::: ::: {.callout-caution collapse="true"} ## Expand to see the code for the pull - try yourself first Save the below in a file: ```{.bash} star_index.sh ``` in the appropriate directory, and run with ```{.bash} sbatch star_index.sh ``` from the appropriate directory. ::: ```{.bash} #! /bin/bash -l #SBATCH -A edu24.bk0001 #SBATCH -t 05:00 #SBATCH -n 1 #SBATCH -p shared set -ueo pipefail module load PDC apptainer INDEX=ref/STARCassava01Index REFERENCE=ref/chromosome01.fasta GTF=ref/Mesculenta_671_v8.1.gene_exons.gtf singularity exec -B /sw/courses/slu_bioinfo/ -B $PWD/../raw ../singularity_images/star_2.7.11b.sif \ STAR --runMode genomeGenerate \ --genomeDir $INDEX \ --genomeFastaFiles $REFERENCE \ --sjdbGTFfile $GTF \ --genomeSAindexNbases 11 \ --genomeChrBinNbits 16 \ --outTmpDir ./star ``` With this you will end up with a bunch of index files: ![](images/star_index_files.png) Files like `geneInfo.tab`, `exonGeTrInfo.tab`, `transcriptInfo.tab`, `exonInfo.tab`, `sjdbList.fromGTF.out.tab`, `sjdbList.out.tab`, `sjdbInfo.txt` contain the indexing information of the predicted genes, transcripts, exons and splicing sites annotated in the `Mesculenta_671_v8.1.gene_exons.gtf file`. Have a look around! :::{.callout-tip} ## Exercise Have a look around the files to see what you discover! ::: ## Map with STAR Now we are ready to map the DNA sequencing files to the reference sequence. Make a directory for the output: ```{.bash} mkdir star ``` And then save the following script and run it with sbatch (like above). ```{.bash} #! /bin/bash -l #SBATCH -A edu24.bk0001 #SBATCH -t 15:00 #SBATCH -n 4 #SBATCH -p shared set -ueo pipefail module load PDC apptainer # Get CPUS allocated to slurm script (-n above) CPUS=$SLURM_NPROCS INDEX=ref/STARCassava01Index READ1=../raw/TMEB419RNA_frag_R1.fastq READ2=../raw/TMEB419RNA_frag_R2.fastq singularity exec -B /sw/courses/slu_bioinfo/ -B $PWD/../raw ../singularity_images/star_2.7.11b.sif \ STAR --genomeDir $INDEX \ --outFilterMismatchNoverLmax 0.06 \ --outFilterMatchNminOverLread 0.5 \ --outFilterScoreMinOverLread 0.2 \ --alignIntronMax 20000 \ --alignMatesGapMax 10000 \ --outFileNamePrefix star/TMEB419RNA_ \ --readFilesIn $READ1 $READ2 echo "complete" ``` Run with `sbatch`. Your output will be the SAM file, report files and the list of predicted splicing sites. :::{.callout-tip} ## Exercise Inspect the report file and check the mapped reads and the average mapped length. Is this a good mapping? ::: # Next steps After quality control and mapping, a typical next step when working with NGS data would be to annotate the reads, so you get an idea of what you are looking at. You will do that later in this course. :::{.callout-tip} ## Exercise Look back at the NGS_course directory. What do you think, did you manage to keep the chaos at bay? What went well in keeping order and what could be improved? Discuss with your neighbour. ::: For now, please do not forget to fill out the quick quizzes (one for the quality control exercise, one for the mapping exercise). Then you are done for the day. \(o: