STAR 2.5.3 RNA-Seq User Manual



After trimming the adaptors and cleaning up low quality reads, we want to know what genomic regions these reads are from and what genes they are aligned to.


Software

Log in and do checks


First log in to the ACF and go to your home directory. Commands can be found at the bottom of the course homepage.

Location check

Check to make sure you are in your directory by using the command for 'print working directory':

pwd

This should return:

/lustre/haven/courses/EPP531-2019Su/youruserid

Login node check

No one should be on a login node for this exercise. See what computer you are on:

uname -a

If it does not say 'login' anywhere in the name, you are good. Otherwise, you need to run the qsub command!
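The check above can also be scripted. This is a minimal sketch that assumes, as the text does, that login nodes have "login" somewhere in their host name; the function name node_kind is made up for illustration.

```shell
# Classify a host name: anything containing "login" is treated as a login node.
# This mirrors the rule of thumb in the text; it is not an official ACF check.
node_kind() {
    case "$1" in
        *login*) echo "login" ;;
        *)       echo "compute" ;;
    esac
}

host=$(uname -n)   # just the node name, without the rest of the `uname -a` output
if [ "$(node_kind "$host")" = "login" ]; then
    echo "On a login node ($host): request a compute node with qsub before continuing."
else
    echo "On $host: OK to continue."
fi
```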

Clean up your directory

Mapping reads with STAR

The basic STAR workflow consists of 2 steps:

1. Generating genome index files.
In this step the user supplies the reference genome sequences (FASTA files) and annotations (GFF file), from which STAR generates the genome indexes that are used in the 2nd (mapping) step. The genome indexes are saved to the genome directory (--genomeDir) and need only be generated once for each genome/annotation combination.
2. Mapping reads to the genome.
In this step the user supplies the genome files generated in the 1st step, as well as the RNA-seq reads (sequences) in the form of FASTA or FASTQ files. STAR maps the reads to the genome and writes several output files, such as alignments (SAM/BAM), mapping summary statistics, splice junctions, unmapped reads, and signal (wiggle) tracks.

Now, let's start mapping. Load the STAR software into your working environment

Generate genome index files

First, we need to create a folder to store the index files

Next, we use our genome Ppersica_v2.0_chr1.fa and annotation Ppersica_2.0_chr1.genes.gff3 to build the index.

--runMode genomeGenerate directs STAR to run a genome index generation job.

--genomeDir specifies the path to the directory (henceforth called "genome directory") where the genome indices are stored. This directory has to be created (with mkdir) before the STAR run and needs to have write permissions. The file system needs to have at least 100GB of disk space available for a typical mammalian genome. It is recommended to remove all files from the genome directory before running the genome generation step. This directory path will have to be supplied at the mapping step to identify the reference genome.

--genomeFastaFiles specifies one or more FASTA files with the genome reference sequences. Multiple reference sequences (henceforth called chromosomes) are allowed for each FASTA file.

--runThreadN defines the number of threads to be used; set it to the number of cores available to your job on the server node.

--sjdbGTFfile specifies the path to the file with annotated transcripts in the standard GTF format. STAR will extract splice junctions from this file and use them to greatly improve accuracy of the mapping. While this is optional, and STAR can be run without annotations, using annotations is highly recommended whenever they are available.

--sjdbOverhang specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as well as the ideal value.

In addition, for GFF3 formatted annotations you need to use --sjdbGTFtagExonParentTranscript Parent. In general, for --sjdbGTFfile files STAR only processes lines which have --sjdbGTFfeatureExon (=exon by default) in the 3rd field (column). The exons are assigned to the transcripts using the parent-child relationship defined by the --sjdbGTFtagExonParentTranscript (=transcript_id by default) GTF/GFF attribute.
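Putting the options above together, the index step might look like the sketch below. It assumes STAR is already on your PATH (for example after loading it through your cluster's module system). The directory name genome_index and the thread count are placeholders, not values from the course. The command is only printed, so you can inspect it before running it.

```shell
genome_dir=genome_index   # hypothetical directory name; any writable path works
threads=2                 # assumption: match the number of cores your job was given

# STAR expects the genome directory to exist before the run.
mkdir -p "$genome_dir"

# Backslash-newline inside double quotes is a line continuation,
# so index_cmd ends up as a single-line command.
index_cmd="STAR --runMode genomeGenerate \
--genomeDir $genome_dir \
--genomeFastaFiles Ppersica_v2.0_chr1.fa \
--sjdbGTFfile Ppersica_2.0_chr1.genes.gff3 \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbOverhang 100 \
--runThreadN $threads"

echo "$index_cmd"
# To actually run it (STAR installed, input files present, no spaces in paths):
# $index_cmd
```

The --sjdbOverhang 100 shown here is simply the default value spelled out; replace it with ReadLength-1 if you know your read length.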

So many options! This is just the tip of the iceberg; see the (very long) STAR manual for the rest.
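Of the options above, --sjdbOverhang is the one that involves a little arithmetic. The following sketch computes max(ReadLength)-1 directly from a FASTQ file. The function name sjdb_overhang and the tiny demo file are made up for illustration; for gzipped reads you would decompress to a pipe first.

```shell
# Print max(ReadLength) - 1 for a FASTQ file.
# In FASTQ, the sequence is the 2nd line of every 4-line record.
sjdb_overhang() {
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
         END { print max - 1 }' "$1"
}

# Tiny made-up FASTQ (reads of length 10 and 6), only to show the arithmetic.
demo_fq=$(mktemp)
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$demo_fq"

sjdb_overhang "$demo_fq"   # longest read is 10 bases, so this prints 9
```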

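Let's also take a look at the GFF file.

The third column defines the type of each genomic region: gene, mRNA, exon, five_prime_UTR, CDS, and so on. STAR only uses the exon lines, and --sjdbGTFtagExonParentTranscript Parent tells it to use each exon's Parent attribute to assign the exon to its transcript.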
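One quick way to see which feature types a GFF3 file contains is to tally its third column. The sketch below does this with awk. The function name gff_feature_counts and the few demo lines are invented for illustration and are not taken from the peach annotation.

```shell
# Count how often each feature type occurs in column 3 of a GFF3 file,
# skipping comment lines (those starting with "#").
gff_feature_counts() {
    awk -F '\t' '!/^#/ && NF >= 9 { n[$3]++ }
                 END { for (f in n) print f "\t" n[f] }' "$1" | sort
}

# Made-up GFF3 fragment: one gene, one mRNA, two exons.
demo_gff=$(mktemp)
printf '##gff-version 3\n' > "$demo_gff"
printf 'chr1\tdemo\tgene\t1\t900\t.\t+\t.\tID=g1\n' >> "$demo_gff"
printf 'chr1\tdemo\tmRNA\t1\t900\t.\t+\t.\tID=t1;Parent=g1\n' >> "$demo_gff"
printf 'chr1\tdemo\texon\t1\t400\t.\t+\t.\tID=e1;Parent=t1\n' >> "$demo_gff"
printf 'chr1\tdemo\texon\t600\t900\t.\t+\t.\tID=e2;Parent=t1\n' >> "$demo_gff"

gff_feature_counts "$demo_gff"
```

On the real annotation you would pass Ppersica_2.0_chr1.genes.gff3 instead of the demo file. Note how each exon line carries a Parent attribute that points at its mRNA.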

Mapping reads to the reference genome

STAR mapping of one file

--genomeDir specifies the path to the genome directory where the genome indices were generated.

--readFilesIn specifies the name(s) (with path) of the files containing the sequences to be mapped (e.g. RNA-seq FASTQ files). If using Illumina paired-end reads, both the read1 and read2 files have to be supplied. If the read files are compressed, use the --readFilesCommand UncompressionCommand option, where UncompressionCommand is the un-compression command that takes the file name as its input parameter and sends the uncompressed output to stdout. For example, for gzipped files (*.gz) use --readFilesCommand zcat OR --readFilesCommand gunzip -c. For bzip2-compressed files, use --readFilesCommand bunzip2 -c.

--outFileNamePrefix specifies the output file location and name.

--outSAMtype BAM SortedByCoordinate tells STAR to write coordinate-sorted BAM files directly, which saves us the separate steps of converting SAM to BAM and sorting.
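A mapping command built from these options might look like the following sketch. The read file name, output prefix, genome directory name and thread count are all placeholders, and the helper read_files_command is made up for illustration; it just picks the decompression option from the file extension as described above. The command is printed rather than executed.

```shell
# Print the --readFilesCommand option suited to a read file's extension;
# print nothing for uncompressed files.
read_files_command() {
    case "$1" in
        *.gz)  echo "--readFilesCommand zcat" ;;
        *.bz2) echo "--readFilesCommand bunzip2 -c" ;;
        *)     echo "" ;;
    esac
}

genome_dir=genome_index     # hypothetical; must be the directory used for the index
reads=sample.fastq.gz       # hypothetical single-end read file
prefix=sample_              # hypothetical output prefix
threads=2                   # assumption: match the cores your job was given

rfc=$(read_files_command "$reads")
map_cmd="STAR --genomeDir $genome_dir --readFilesIn $reads $rfc --outFileNamePrefix $prefix --outSAMtype BAM SortedByCoordinate --runThreadN $threads"

echo "$map_cmd"
# To actually run it (STAR installed, index built, reads present):
# $map_cmd
```

For paired-end data, give both files to --readFilesIn, separated by a space.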

Let's see what output files you will get.

Log.out: main log file with a lot of detailed information about the run. This file is most useful for troubleshooting and debugging.

Log.progress.out: reports job progress statistics, such as the number of processed reads, % of mapped reads etc. It is updated in 1 minute intervals.

Log.final.out: summary mapping statistics after the mapping job is complete, very useful for quality control. The statistics are calculated for each read (single- or paired-end) and then summed or averaged over all reads.

SJ.out.tab: splice junctions detected in the run, in tab-delimited format. Each splice is counted in the numbers of splices reported in Log.final.out, which correspond to summing the counts in SJ.out.tab.

Aligned.sortedByCoord.out.bam: alignments in coordinate-sorted BAM format (the file is named Aligned.out.bam when the output is unsorted).
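For quality control it is often enough to pull one or two numbers out of Log.final.out. The sketch below assumes the usual layout of that file, where each statistic is written as a label, a "|" character, and then the value. The function name log_final_value is made up and the numbers in the demo file are invented.

```shell
# Print the value that follows "<label> |" in a Log.final.out-style file.
# $1 = label text, $2 = path to the log file
log_final_value() {
    awk -F '|' -v key="$1" '
        { label = $1; gsub(/^[ \t]+|[ \t]+$/, "", label) }   # trim the label
        label == key { val = $2; gsub(/^[ \t]+|[ \t]+$/, "", val); print val }
    ' "$2"
}

# Made-up two-line excerpt, only to demonstrate the parsing.
demo_log=$(mktemp)
printf '%s\t%s\n' '          Number of input reads |' '1000' > "$demo_log"
printf '%s\t%s\n' '        Uniquely mapped reads % |' '87.50%' >> "$demo_log"

log_final_value "Uniquely mapped reads %" "$demo_log"
```

In a real run you would point the second argument at the Log.final.out file in your output directory.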

Formats

BAM files are binary, which makes them readable only by a computer. Let's try for fun anyway.

Yep, unreadable. Let's use samtools to convert this file to a SAM file and see what it looks like in human-readable format.
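A conversion along these lines should work, assuming samtools is on your PATH. The file names are placeholders, so adjust bam to whatever BAM file STAR wrote in your directory. The -h option keeps the header lines in the output and -o names the output file. As before, the command is only printed here.

```shell
bam=Aligned.sortedByCoord.out.bam   # assumption: adjust to the BAM file in your directory
sam=Aligned.sortedByCoord.out.sam   # hypothetical output name

to_sam_cmd="samtools view -h -o $sam $bam"

echo "$to_sam_cmd"
# To actually run it where samtools is installed and the BAM file exists:
# $to_sam_cmd
# head -n 20 "$sam"
```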

Look at the head of the SAM file and see what you can interpret based on the SAM format information you just learned.

What if you had a SAM file and needed to convert it to BAM? Work with a partner to come up with a command to do this. Consult the manual for the version of samtools installed on the ACF.


Running all files with a for loop

If you have many files to map to the same reference genome, you can write a loop and let the computer run them all for you, one after another. (This takes about an hour, so don't actually do this; it's just an example.)
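Such a loop could be sketched as follows. This version only prints one STAR command per FASTQ file, so it is safe to try. The function name map_all, the genome directory name, the thread count and the demo file names are all placeholders rather than the course's actual files.

```shell
# Print one STAR mapping command per *.fastq file in a directory (dry run).
# $1 = directory holding the FASTQ files, $2 = genome directory
map_all() {
    for fq in "$1"/*.fastq; do
        [ -e "$fq" ] || continue              # no matches: skip the literal pattern
        sample=$(basename "$fq" .fastq)       # e.g. leaf_1.fastq -> leaf_1
        echo "STAR --genomeDir $2 --readFilesIn $fq --outFileNamePrefix ${sample}_ --outSAMtype BAM SortedByCoordinate --runThreadN 2"
    done
}

# Demonstration on two empty placeholder files in a temporary directory.
demo_dir=$(mktemp -d)
: > "$demo_dir/leaf_1.fastq"
: > "$demo_dir/leaf_2.fastq"

map_all "$demo_dir" genome_index
```

Giving each sample its own --outFileNamePrefix keeps the runs from overwriting each other's output. To run the jobs for real, the printed lines could be saved to a script and submitted to a compute node.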

I've already pre-created all the files, so you can copy them into your directory.