Genome assembly from sequence reads

Matija Rezar (2016) Genome assembly from sequence reads. MSc thesis.

Abstract

Genome assembly can be roughly described as finding Eulerian tours in de Bruijn graphs. We present the theory behind (de Bruijn) graph data structures and describe several implementations. A directed graph G(V, E) can be represented as the set of its edges, written as ordered pairs v_i → v_j ∈ E. De Bruijn graphs are defined so that all possible neighbors of a node can be computed from that node’s label; given the adjacency set, we can therefore navigate the graph by testing set membership alone. The edge set can be stored as a dictionary, which can be either a deterministic data structure, such as a tree or an FM-index, or a probabilistic one, such as a Bloom filter. In this thesis we present kBWT, a new space-efficient deterministic data structure for storing a de Bruijn graph. It uses a near-optimal n · σ + o(n) bits of memory, where n is the number of k-grams in the graph and σ is the size of the alphabet, and it retrieves the neighborhood information for a given node in Θ(σ · k) time. We also compare kBWT to an existing data structure from the GATB framework, which is based on Bloom filters and is therefore probabilistic. In benchmarks, the deterministic kBWT was slower in practice than GATB’s data structure: it had better cache efficiency, but this did not make up for the number of processor cycles spent executing the algorithm.
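
To illustrate the idea of navigating a de Bruijn graph purely by set membership, the following is a minimal sketch and not the kBWT or GATB implementation. It assumes a DNA alphabet, nodes labelled by k-grams, and edges stored as (k+1)-grams in a plain Python set standing in for the dictionary; the function names and the sample reads are hypothetical.

```python
ALPHABET = "ACGT"  # assumed alphabet, so sigma = 4


def build_edge_set(reads, k):
    """Collect every (k+1)-gram in the reads; each one encodes an edge u -> v."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add(read[i:i + k + 1])
    return edges


def successors(node, edges):
    """Out-neighbors of a k-gram node, found with sigma membership tests."""
    return [node[1:] + c for c in ALPHABET if node + c in edges]


def predecessors(node, edges):
    """In-neighbors of a k-gram node, found with sigma membership tests."""
    return [c + node[:-1] for c in ALPHABET if c + node in edges]


# Hypothetical reads, chosen only to produce a small graph.
reads = ["ACGTAC", "GTACG"]
edges = build_edge_set(reads, k=3)

print(successors("ACG", edges))
print(predecessors("ACG", edges))
```

Each neighbor query builds the σ candidate labels from the node's own label and tests each against the set, so no explicit adjacency lists are stored. Replacing the Python set with a Bloom filter would give a probabilistic variant of the same scheme, at the cost of possible false-positive edges.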

Item Type:

Thesis (MSc thesis)

Keywords:

genome, genome assembly, graph theory, de Bruijn graph, Eulerian cycle

Number of Pages:

Language of Content:

English

Mentor / Comentors:

Name and Surname	ID	Function
doc. dr. Andrej Brodnik	5540	Mentor

Link to COBISS:

http://www.cobiss.si/scripts/cobiss?command=search&base=51012&select=(ID=1537310659)

Institution:

University of Ljubljana

Department:

Faculty of Computer and Information Science

Item ID:

3666

Date Deposited:

01 Dec 2016 11:46

Last Modified:

15 Dec 2016 13:18

URI:

http://eprints.fri.uni-lj.si/id/eprint/3666
