ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Genome assembly from sequence reads

Matija Rezar (2016) Genome assembly from sequence reads. MSc thesis.

    Abstract

    Genome assembly can be roughly described as finding Eulerian tours in de Bruijn graphs. We present the theory behind (de Bruijn) graph data structures and describe some of their implementations. A directed graph G(V, E) can be represented as the set of its edges, written as ordered pairs v_i → v_j ∈ E. De Bruijn graphs are defined so that all possible neighbors of a node can be computed from that node's label alone; given the adjacency set, we can therefore navigate the graph purely by testing set membership. The edge set can be stored as a dictionary, which can be either a deterministic data structure, such as a tree or an FM-index, or a probabilistic one, such as a Bloom filter. In this thesis we present kBWT, a new space-efficient deterministic data structure for storing a de Bruijn graph. It uses a near-optimal n · σ + o(n) bits of memory, where n is the number of k-grams in the graph and σ is the size of the alphabet, and it retrieves the neighborhood information for a given node in Θ(σ · k) time. We also compare it to an existing data structure from the GATB framework, which is based on Bloom filters and is therefore probabilistic. Benchmarks show that the deterministic kBWT is slower in practice than GATB's data structure. Testing showed that kBWT had better cache efficiency, but this did not make up for the number of processor cycles needed to execute its algorithm.
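
The navigation-by-membership idea from the abstract can be illustrated with a minimal sketch. This is not kBWT and not GATB's Bloom-filter structure: it assumes a node-centric de Bruijn graph over the DNA alphabet (σ = 4), stores the k-grams in an ordinary Python set, and the function names (`build_kmer_set`, `successors`, `predecessors`) are hypothetical, chosen only for this example.

```python
from typing import Iterable, List, Set

# Assumption: DNA alphabet, so sigma = 4.
ALPHABET = "ACGT"


def build_kmer_set(reads: Iterable[str], k: int) -> Set[str]:
    """Collect every k-gram (substring of length k) that occurs in the reads."""
    kmers: Set[str] = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def successors(kmer: str, kmers: Set[str]) -> List[str]:
    """Outgoing neighbors: drop the first symbol, append each alphabet
    symbol in turn, and keep the candidates that are in the set."""
    suffix = kmer[1:]
    return [suffix + c for c in ALPHABET if suffix + c in kmers]


def predecessors(kmer: str, kmers: Set[str]) -> List[str]:
    """Incoming neighbors: drop the last symbol, prepend each alphabet
    symbol in turn, and keep the candidates that are in the set."""
    prefix = kmer[:-1]
    return [c + prefix for c in ALPHABET if c + prefix in kmers]


if __name__ == "__main__":
    kmers = build_kmer_set(["ACGTAC"], k=3)
    print(sorted(kmers))
    print(successors("ACG", kmers))
    print(predecessors("CGT", kmers))
```

Each neighborhood query performs σ membership tests on labels of length k, which mirrors the shape of the Θ(σ · k) bound quoted in the abstract, although a hash set uses far more memory than the n · σ + o(n) bits claimed for kBWT. Replacing the set with a Bloom filter would give the probabilistic variant, in which a membership test may report false positives.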

    Item Type: Thesis (MSc thesis)
    Keywords: genome, genome assembly, graph theory, de Bruijn graph, Eulerian cycle
    Number of Pages: 67
    Language of Content: English
    Mentor / Comentors:
    Name and Surname | ID | Function
    doc. dr. Andrej Brodnik | 5540 | Mentor
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=51012&select=(ID=1537310659)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 3666
    Date Deposited: 01 Dec 2016 11:46
    Last Modified: 15 Dec 2016 13:18
    URI: http://eprints.fri.uni-lj.si/id/eprint/3666
