Analysis and application of hash-based similarity estimation techniques for biological sequence analysis

dc.contributor.advisor: Rahmann, Sven
dc.contributor.author: Timm, Henning
dc.contributor.referee: Mosig, Axel
dc.date.accepted: 2022-01-13
dc.date.accessioned: 2023-01-12T07:44:47Z
dc.date.available: 2023-01-12T07:44:47Z
dc.date.issued: 2021
dc.description.abstract: In bioinformatics, a large group of problems requires the computation or estimation of sequence similarity. However, the analysis of biological sequence data faces, among many others, three major challenges: a large amount of generated data, which contains technology-specific errors (that can be mistaken for biological signals) and which might need to be analyzed without access to a reference genome. Through the use of locality-sensitive hashing methods, both the efficient estimation of sequence similarity and tolerance against the errors specific to biological data can be achieved. We developed a variant of the winnowing algorithm for local minimizer computation that is specifically geared toward dealing with repetitive regions within biological sequences. By compressing redundant information, we can reduce both the size of the hash tables required to store minimizer sketches and the number of redundant low-quality alignment candidates. By analyzing the distribution of segment lengths generated by this approach, we can better estimate the size of the required data structures and identify hash functions feasible for this technique. Our evaluation verified that simple and fast hash functions, even with small hash value spaces (hash functions with a small codomain), are sufficient to compute compressed minimizers and perform comparably to uniformly randomly chosen hash values. We also outlined an index for a taxonomic protein database that uses multiple compressed winnowings to identify alignment candidates. To store MinHash values, we present a cache-optimized implementation of a hash table that uses Hopscotch hashing to resolve collisions. As a biological application of similarity-based analysis, we describe the analysis of double digest restriction site associated DNA sequencing (ddRADseq).
We implemented simulation software that models the biological and technological influences of this technology, to support better development and testing of ddRADseq analysis software. Using datasets generated by our software, as well as data obtained from population genetic experiments, we developed an analysis workflow for ddRADseq data based on the Stacks software. Since the quality of the results generated by Stacks depends strongly on how well the parameters are adapted to the specific dataset, we developed a Snakemake workflow that automates preprocessing tasks and also allows the automatic exploration of different parameter sets. As part of this workflow, we developed a PCR deduplication approach that generates consensus reads incorporating the base quality values (as reported by the sequencing device) without performing an alignment first. As an outlook, we outline a MinHashing approach that can be used for faster and more robust clustering while addressing incomplete digestion and null alleles, two effects specific to ddRADseq that current analysis tools cannot reliably detect. [en]
dc.identifier.uri: http://hdl.handle.net/2003/41188
dc.identifier.uri: http://dx.doi.org/10.17877/DE290R-23034
dc.language.iso: en [de]
dc.subject: DNA [de]
dc.subject: RNA [de]
dc.subject: Protein [de]
dc.subject: Sequencing [en]
dc.subject: NGS [de]
dc.subject: SGS [de]
dc.subject: PCR [de]
dc.subject: Hashing [en]
dc.subject: Hash table [en]
dc.subject: Hash function [en]
dc.subject: Hopscotch hashing [en]
dc.subject: MinHash [en]
dc.subject: Minimizer [en]
dc.subject: Winnowing [en]
dc.subject: LSH [de]
dc.subject: Locality sensitive hashing [en]
dc.subject: ddRAD [de]
dc.subject: ddRADseq [de]
dc.subject.ddc: 004
dc.subject.rswk: DNS [de]
dc.subject.rswk: RNS [de]
dc.subject.rswk: Proteine [de]
dc.subject.rswk: Sequenzanalyse <Chemie> [de]
dc.subject.rswk: Hash-Algorithmus [de]
dc.title: Analysis and application of hash-based similarity estimation techniques for biological sequence analysis [en]
dc.type: Text [de]
dc.type.publicationtype: doctoralThesis [de]
dcterms.accessRights: open access
eldorado.secondarypublication: false [de]

Files

Original bundle
Name: Timm_Dissertation.pdf
Size: 16.65 MB
Format: Adobe Portable Document Format
Description: DNB
License bundle
Name: license.txt
Size: 4.85 KB
Format: Item-specific license agreed upon to submission
