Genome assembly and resequencing

  • Created by: lridgeway
  • Created on: 13-11-20 10:15

Sequence alignment

Aligning sequences is a fundamental problem in bioinformatics, used for:

  • Identification of homologous sequences 
  • Inferring evolutionary relationships between sequences and organisms
  • Assembling sequence reads 
  • Mapping sequences to a reference genome

Comparison of sequences

When comparing sequences, one long gap is easier to explain in evolutionary terms than many short gaps and mismatches, because a single insertion or deletion event is enough to produce one long gap.

Alignments can be scored by adding to the score for each match and subtracting from it for each mismatch. In nucleotide alignment all mismatches are treated the same, but for amino acids a substitution matrix such as PAM70 is used, in which biochemically similar amino acid substitutions are penalised less. The PAM matrices were formed from data extrapolated from 1572 mutations in 71 protein family alignments.
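The scoring idea above can be sketched in a few lines. This is only an illustration: the match, mismatch and gap values are assumptions chosen for the example, and the gap handling is simplified (a gap in one sequence directly followed by a gap in the other is treated as one gap). Charging more to open a gap than to extend it is one common way to make a single long gap cheaper than several short ones.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Score two already-aligned strings of equal length; '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # Opening a gap costs more than extending an existing one.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


one_long_gap = score_alignment("ACGTTTGCA", "ACG---GCA")
three_short_gaps = score_alignment("ACGTTTGCA", "A-GT-TG-A")
print(one_long_gap, three_short_gaps)  # 1 -3
```

Both alignments contain three gap positions, but the single three-base gap scores higher than three separate one-base gaps.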


Global and local alignment

Global alignment: Assumes alignment across the whole length of the sequences being aligned. E.g. aligning two complete homologous proteins

Local alignment: Picks out one or more regions of good alignment within the sequences. E.g. protein domains/regions of conservation
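The difference between the two can be seen in a small dynamic-programming sketch. The function below is a simplified illustration with assumed scores (match +1, mismatch -1, linear gap penalty -2), not the implementation of any particular tool. With `local=False` it behaves like a Needleman-Wunsch style global alignment, and with `local=True` like a Smith-Waterman style local alignment, where a cell score is never allowed to fall below zero.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Return the best global or local alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment must pay for gaps at the ends of the sequences.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,
                       table[i - 1][j] + gap,
                       table[i][j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so scores floor at 0.
                cell = max(cell, 0)
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


a, b = "TTTTACGTACGGGG", "CCACGTACCC"
print(alignment_score(a, b, local=True))  # 6, from the shared region ACGTAC
```

For these two sequences the local score comes entirely from the conserved region, while the global score is dragged down by the unrelated flanking bases.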


BLAST

BLAST -  Basic Local Alignment Search Tool 

BLAST finds regions of similarity between biological sequences. It outputs the best hits at the top, each with an alignment score and an E-value. The E-value is the P value normalised to the database size and the length of the query sequence. It can be thought of as the number of hits at least that good that would be expected to be found by chance in a database of that size. Clicking on a hit gives more information, such as the number of bases that match and the number that do not (these are also presented as percentages).
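As a rough illustration of how the E-value depends on the search space, the sketch below uses the standard relationship E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the total database length. It is a simplification: real BLAST uses corrected "effective" lengths, and the function names here are made up for the example.

```python
import math


def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value for a hit with the given bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of seeing at least one chance hit that good."""
    return 1.0 - math.exp(-e_value)


small_db = expected_hits(40, 100, 1_000_000)
large_db = expected_hits(40, 100, 2_000_000)
print(small_db, large_db)
```

The same alignment score gives twice the E-value when the database is twice as large, which is why an E-value is only meaningful for the database it was computed against. For very small E-values the P value and the E-value are nearly equal.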


De novo assembly

Like a jigsaw puzzle but with: 

  • No box (unknown target) 
  • Missing pieces (coverage bias) 
  • Broken pieces (sequencing errors) 
  • Duplicate pieces (repeats) 
  • Disconnected sub puzzles (multiple replicons) 
  • Random pieces from another puzzle (contamination) 
  • No corner or edge pieces (circular genomes) 

Overlap-layout-consensus is the method of assembly used in shotgun sequencing. You take the reads and look for overlaps between them. This would work well if genomes were not repetitive and sequencing were not error-prone. Repeats can cause errors in assembly in which chunks of DNA are missed out.
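The overlap step can be sketched as a search for the longest suffix of one read that matches a prefix of another. This is a minimal, exact-match illustration (real assemblers tolerate sequencing errors and use indexing to avoid comparing every pair of reads). The reads and the `min_length` cut-off are invented for the example.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def merge(a, b, min_length=3):
    """Join two reads on their overlap, or return None if they do not overlap."""
    n = overlap(a, b, min_length)
    return a + b[n:] if n else None


print(merge("ATGGCGT", "GCGTACC"))  # ATGGCGTACC

# A read that ends inside a repeat overlaps equally well with two
# different continuations, so the correct layout is ambiguous.
print(overlap("TTTACGCG", "ACGCGAAA"), overlap("TTTACGCG", "ACGCGCCC"))  # 5 5
```

The second example shows why repeats are a problem: overlaps alone cannot say which continuation is correct.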


De Bruijn Graphs

De Bruijn graphs are used to avoid these assembly errors. You split a sequence up into k-mers (subsequences of length k). Adjacent k-mers overlap by all but one base, and each k-mer goes into the graph only once. If there are no repeats then there is only one way through the graph.

Repeats lead to a loop (aka bubble) in the graph because the same k-mer is found twice. At a repeat there is more than one possible path, so this needs to be resolved during assembly to work out the correct sequence.
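A minimal sketch of this construction is given below, with k-mers (fixed-length subsequences) as nodes and an edge between k-mers that follow each other in the sequence. For simplicity it builds the graph from a known sequence rather than from reads, and the example sequences are made up. Nodes with more than one outgoing edge mark the places where a repeat creates a choice of paths.

```python
def de_bruijn(sequence, k):
    """Map each k-mer to the set of k-mers that follow it in the sequence."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    graph = {}
    for kmer in kmers:
        graph.setdefault(kmer, set())  # each k-mer is stored only once
    for left, right in zip(kmers, kmers[1:]):
        graph[left].add(right)
    return graph


def branching_nodes(graph):
    """k-mers with more than one possible successor."""
    return sorted(node for node, nxt in graph.items() if len(nxt) > 1)


print(branching_nodes(de_bruijn("ATGCGTAC", 3)))     # []
print(branching_nodes(de_bruijn("ATGCATTGCAG", 3)))  # ['GCA']
```

In the second sequence TGCA occurs twice, so after GCA the path can continue to either CAT or CAG.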

First you use the graph to find contigs (unambiguous paths that were definitely present in the original sequence). You now have contigs, but you do not know what order they occur in or how many times each is present, because of the repeats. The repeat bubbles therefore need to be resolved.
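One way to picture contig finding is as a walk along stretches of the graph where there is no choice to make: the walk stops whenever a node has more than one successor or the next node has more than one predecessor. The sketch below is an illustration under that assumption, built from a known sequence for simplicity. It ignores isolated cycles, and real assemblers also handle errors and reverse complements.

```python
def build_graph(sequence, k):
    """Return successor and predecessor sets for every k-mer."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    succ = {kmer: set() for kmer in kmers}
    pred = {kmer: set() for kmer in kmers}
    for left, right in zip(kmers, kmers[1:]):
        succ[left].add(right)
        pred[right].add(left)
    return succ, pred


def find_contigs(succ, pred):
    """Spell out every maximal path that contains no branch point."""
    def starts_contig(node):
        if len(pred[node]) != 1:
            return True
        (parent,) = pred[node]
        return len(succ[parent]) != 1  # parent branches, so a new contig begins

    contigs = []
    for node in succ:
        if not starts_contig(node):
            continue
        path, current = node, node
        while len(succ[current]) == 1:
            (nxt,) = succ[current]
            if len(pred[nxt]) != 1:
                break  # several paths merge here, so stop before it
            path += nxt[-1]
            current = nxt
        contigs.append(path)
    return contigs


succ, pred = build_graph("ATGCATTGCAG", 3)
contigs = find_contigs(succ, pred)
print(contigs)  # ['ATG', 'TGCA', 'CATTG', 'CAG']
```

Each contig is a true piece of the original sequence, but the graph alone does not say that TGCA is used twice or in what order the pieces are joined.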

This can be done using read pairs. You identify where each read of a pair falls on the graph and in the contigs. Because the two reads of a pair are known to be a fixed distance apart (500 bp here), this helps to work out the order of the genome and which contigs sit close together.

The shorter the k-mer, the more complicated the de Bruijn graph, because short k-mers are more likely to occur more than once in the sequence.
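The effect of k-mer length can be checked by counting which k-mers occur more than once, since every repeated k-mer is a potential branch point in the graph. The sequence below is a made-up example.

```python
from collections import Counter


def repeated_kmers(sequence, k):
    """k-mers that occur more than once in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return sorted(kmer for kmer, n in counts.items() if n > 1)


sequence = "ATGCATTGCAG"
for k in (3, 4, 5):
    print(k, repeated_kmers(sequence, k))
# 3 ['GCA', 'TGC']
# 4 ['TGCA']
# 5 []
```

Once k is longer than the repeat, every k-mer is unique and the graph has a single path. In practice k cannot be increased indefinitely, because it is limited by the read length.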


Resequencing

Reasons to resequence 

  • Individuals of a species are not identical
  • Study of genetic disorders
  • Sequence a cancer
  • Understand functional elements of the human genome

You can do this easily by mapping to a reference genome, which acts like an index in a book 

The process again uses k-mers: a read is aligned in many candidate places and a scoring system is used to see which alignment is most likely. Repeats again cause problems, which paired reads can help to resolve.
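The mapping idea can be sketched as a lookup table of every k-mer in the reference (the "index in a book"), followed by scoring each candidate position by its number of mismatches. This is a toy illustration with an invented reference and read. Real mappers use compressed indexes, allow gaps and report mapping quality.

```python
from collections import defaultdict


def build_index(reference, k):
    """Record every position at which each k-mer occurs in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k):
    """Return (start, mismatches) for the best candidate position, or None."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset  # where the read would begin
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    best = None
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for r, g in zip(read, window) if r != g)
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best


reference = "GATTACAGGCTTAACGTCCATG"
index = build_index(reference, 4)
print(map_read("GGCTTTACG", reference, index, 4))  # (7, 1)
```

The read carries one mismatch, yet two of its k-mers still match the reference exactly, which is enough to find the right position. If the read came from a repeat, several positions would score equally well.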


IGV and SNPs

You can visualise mapped reads in a programme called IGV (Integrative Genomics Viewer). It gives an overview of the whole chromosome, with the focused region highlighted and genomic coordinates shown. A "pile-up" plot shows read depth (coverage) at each position. More reads give higher coverage, and regions with high GC content tend to have more reads. It also shows the position and orientation of mapped reads, and mismatches to the reference are highlighted. If the sequence is in an exon, a more zoomed-in view will also show the amino acid translation. Equivalent information from other samples can be viewed below.

If a mismatch is isolated and occurs in only a single read, it is likely to be a sequencing error. If it occurs in almost all the reads, the position is likely to be a SNP, or the reference genome has a mutation there. Heterozygous SNPs present as two different alleles. It can be very difficult to distinguish between heterozygous SNPs, homozygous SNPs and sequencing errors.
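The reasoning above can be written as a simple rule on the fraction of reads that carry a non-reference base at a position. The thresholds below are arbitrary assumptions for illustration only. Real variant callers use statistical models that take base quality and mapping quality into account, which is one reason the three cases are hard to separate with a fixed cut-off.

```python
from collections import Counter


def classify_site(ref_base, read_bases, min_depth=8, error_max=0.15, hom_min=0.85):
    """Guess what a pile-up column represents from its allele fraction."""
    depth = len(read_bases)
    if depth < min_depth:
        return "low coverage"
    alt_counts = Counter(base for base in read_bases if base != ref_base)
    if not alt_counts:
        return "matches reference"
    _, alt_count = alt_counts.most_common(1)[0]
    fraction = alt_count / depth
    if fraction <= error_max:
        return "likely sequencing error"
    if fraction >= hom_min:
        return "likely homozygous SNP"
    return "likely heterozygous SNP"


print(classify_site("A", "AAAAAAAAAG"))  # likely sequencing error
print(classify_site("A", "AAAAAGGGGG"))  # likely heterozygous SNP
print(classify_site("A", "GGGGGGGGGG"))  # likely homozygous SNP
```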

SNPs can be (in order of increasing pathogenicity down the list):

  • Intergenic - not in a gene 
  • Intronic - in an intron 
  • Synonymous - no change to encoded protein sequence 
  • Regulatory 
  • Non-synonymous - alters encoded protein sequence
  • Nonsense - introduces premature stop codon
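For SNPs that fall inside a coding sequence, the last three categories can be told apart by translating the codon before and after the change. The sketch below uses the standard genetic code and a hypothetical helper name. It covers only coding SNPs, because deciding whether a SNP is intergenic, intronic or regulatory needs a genome annotation.

```python
from itertools import product

# Standard genetic code, with codons in TCAG order and '*' for stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon, position, new_base):
    """Classify a single-base change at position 0, 1 or 2 of a codon."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if after == before:
        return "synonymous"
    if after == "*":
        return "nonsense"
    return "non-synonymous"


print(classify_coding_snp("CTT", 2, "C"))  # synonymous (Leu to Leu)
print(classify_coding_snp("GAG", 1, "T"))  # non-synonymous (Glu to Val)
print(classify_coding_snp("TGG", 2, "A"))  # nonsense (Trp to stop)
```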

GWAS

Genome-wide association studies (GWAS) sequence the genomes of many unrelated individuals with a genetic condition and many without it. They look for genomic changes that are statistically more likely to be found in affected individuals.

The results of these studies are presented on Manhattan plots, which show the genomic regions associated with a particular phenotype.
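"Statistically more likely" can be made concrete with a test on allele counts at one position. The sketch below uses a plain chi-square test on a 2×2 table of allele counts (one degree of freedom), which is one simple choice among several; real studies also correct for population structure and for the very large number of positions tested. The counts are invented. A Manhattan plot shows -log10 of the P value for each position along the genome.

```python
import math


def allele_association(case_alt, case_ref, control_alt, control_ref):
    """Chi-square statistic and P value for a 2x2 table of allele counts."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0, 1.0  # an empty row or column carries no information
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


chi2, p = allele_association(80, 20, 50, 50)
print(round(chi2, 2), -math.log10(p))
```

An allele seen equally often in cases and controls gives a statistic of 0 and a P value of 1, so it would sit at the bottom of the plot.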


Gene finding strategies

In bacteria and archaea large genes can be identified by long open reading frames, start codons and other features. Long ORFs are very likely to be genes and can be used to train a model to identify shorter genes. 
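A minimal sketch of an open reading frame search is shown below. It is a simplification that scans only the three forward-strand reading frames (a real gene finder also checks the reverse complement and other signals), and the `min_codons` cut-off is an assumed parameter that counts the start and stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end) of ATG...stop stretches on the forward strand."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


sequence = "CCATGAAATTTGGGTAACC"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])  # 2 17 ATGAAATTTGGGTAA
```

Raising `min_codons` keeps only long ORFs, which are the ones most likely to be real genes.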

Eukaryotic genomes are more difficult because introns are present. You can look for transcription start sites, splice sites, etc., but these don't work well. RNA-seq can tell us which regions are present in mRNA, which can help locate genes.


Gene annotation

Genes used to be annotated manually, but that has become impossible due to the increase in the volume of data.

Pipelines such as Prokka (bacteria/archaea) and MAKER (eukaryotes) provide automated annotation 


Structural variant detection

Not all changes to sequence result in SNPs. Structural variants are a result of large scale chromosome rearrangements: insertions, deletions, duplications, inversions, translocations. 

Using coverage depth to detect these: 

  • Duplications - extra reads
  • Deletions - no reads 
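The coverage approach can be sketched by comparing the mean depth in fixed windows with the typical depth across the region. The window size and the ratio thresholds below are illustrative assumptions, and the depth values are invented.

```python
from statistics import median


def classify_windows(depths, window=5, low=0.25, high=1.75):
    """Label each non-overlapping window by its depth relative to the median."""
    typical = median(depths)
    calls = []
    for start in range(0, len(depths) - window + 1, window):
        mean_depth = sum(depths[start:start + window]) / window
        ratio = mean_depth / typical if typical else 0.0
        if ratio <= low:
            label = "possible deletion"
        elif ratio >= high:
            label = "possible duplication"
        else:
            label = "normal"
        calls.append((start, label))
    return calls


depths = [30] * 10 + [0] * 5 + [30] * 10 + [62] * 5 + [30] * 10
for start, label in classify_windows(depths):
    if label != "normal":
        print(start, label)
# 10 possible deletion
# 25 possible duplication
```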

Using read pairs to detect these: 

  • Reads either side of a deletion will map onto the reference genome further apart than the expected 500 bp
  • If a read spans a region that has been deleted in the sequenced genome, the read will be split when mapped onto the reference genome, because the deleted region is still present in the reference
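The first of these signals can be sketched as a check on the distance between the mapped positions of the two reads in each pair. The tuple layout, the read names and the tolerance are assumptions made for the example. The expected distance of 500 bp follows the text.

```python
def flag_read_pairs(pairs, expected=500, tolerance=100):
    """Return (name, estimated deletion size) for pairs mapped too far apart.

    Each pair is (name, left_position, right_position) on the reference.
    """
    flagged = []
    for name, left_position, right_position in pairs:
        distance = right_position - left_position
        if distance > expected + tolerance:
            # The extra distance approximates the length of the deleted region.
            flagged.append((name, distance - expected))
    return flagged


pairs = [("pair1", 1000, 1500), ("pair2", 2000, 3700), ("pair3", 5000, 5450)]
print(flag_read_pairs(pairs))  # [('pair2', 1200)]
```

A single stretched pair could be a mapping error, so in practice a deletion is only called when several pairs point to the same region.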

Using assembly to detect these:

  • Assemble the reads that do not map to the reference to see whether there has been a sequence insertion
