Discovering nucleotide-level and structural variants in cancer genome data from second- and third-generation sequencing technologies
Date
2024
Abstract
The field of bioinformatics is a diverse one: as the name suggests, it bridges the fields of biology and computer science. For example, in (human) cancer research, one commonly
1. obtains a blood and/or tissue sample of a patient in a study,
2. sequences or otherwise analyses the sample on a specialized device to obtain relevant information (such as its genome, transcriptome or methylome),
3. determines variation between the sample and some reference sample(s) or determines other aspects that may be of interest,
4. annotates, filters and analyses these for further examination,
5. and subsequently makes use of them to further one’s research or study goal.
In this thesis, we mainly concern ourselves with the last three steps, though we also explain the sequencing process for two different technologies. We focus on the genome part of the second step, i.e. the DNA contained within the cells of most organisms. In this context, determining variation usually involves comparing many short DNA sequences to a larger reference DNA sequence (such as "the human genome").
Homopolymer-aware PairHMM:
We introduce a homopolymer-aware PairHMM, which addresses one major issue of Oxford Nanopore Technologies (ONT) sequencing: due to the design of the technology, the lengths of so-called homopolymers (runs of identical nucleotides) are often called inaccurately, which degrades downstream results. This specialized model allows for more accurate alignments and probability estimates, and can be applied to any ONT sequencing data.
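To make the homopolymer problem concrete, the following minimal sketch collapses sequences into runs and checks whether two sequences differ only in their run lengths. It illustrates the error pattern only and is not the PairHMM introduced in the thesis; the function names and the toy sequences are assumptions made for this example.

```python
from itertools import groupby


def run_length_encode(seq):
    """Collapse a DNA string into (nucleotide, run length) pairs."""
    return [(base, sum(1 for _ in run)) for base, run in groupby(seq)]


def differs_only_in_homopolymer_lengths(a, b):
    """True if both sequences contain the same runs in the same order,
    regardless of how long each run is."""
    bases_a = [base for base, _ in run_length_encode(a)]
    bases_b = [base for base, _ in run_length_encode(b)]
    return bases_a == bases_b


# Toy data (hypothetical): the read reports a run of three T's
# where the reference has five.
reference = "ACGTTTTTGA"
read = "ACGTTTGA"

print(run_length_encode(reference))
print(differs_only_in_homopolymer_lengths(reference, read))
```

A plain alignment would score the two missing T's as an ordinary deletion. A homopolymer-aware model can instead treat such length differences within a run as a likely sequencing artefact.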
Detecting extrachromosomal circular DNA:
The homopolymer-aware PairHMM finds practical application in the calling of extrachromosomal circular DNA (eccDNA), i.e. DNA that is both circular and located outside the chromosomal structure of the genome. eccDNA plays an important part in cancer research, as a biomarker for certain cancers or as a way to track tumour progression. We developed a graph-based method to detect eccDNA in ONT sequencing samples, and also provide an easy-to-set-up workflow that guarantees reproducibility and provides an interactive report for exploration of results and documentation of the pipeline.
Detecting copy number variation:
A different kind of variation is copy number variation, where (larger) regions of a sample's genome have been repeated or deleted. Over the years, many approaches making use of different kinds of information have been explored, for example read-depth information. Determining variation is often done by comparing, mapping or aligning short DNA sequences (reads) to a reference sequence. For each position in the reference sequence, one can then count the reads that overlap that position; this count is the read depth. The read-depth information may then be used to locate copy number variation and estimate its magnitude. However, the mapping process needed to acquire the read-depth information can be both computationally expensive and require a lot of disk space. Therefore, to improve resource usage, we use a kind of pseudo-mapping based on k-mers instead, and show that using k-mer counts is at least as good as relying on classic read-depth information, while at the same time saving disk space.
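The idea of replacing mapped read depth with k-mer counts can be sketched as follows. This is a toy illustration under simplifying assumptions (every reference k-mer is unique, there are no sequencing errors, and reverse complements are ignored), not the method developed in the thesis; the function names and sequences are made up for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads, without any alignment."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_depth_profile(reference, counts, k):
    """For each k-mer start position in the reference, look up how often
    that k-mer was seen in the reads (a proxy for read depth)."""
    return [counts[reference[i:i + k]] for i in range(len(reference) - k + 1)]


# Toy data (hypothetical): the first half of the reference is covered
# by three reads, the second half by only one.
reference = "ACGTTGCA"
reads = ["ACGTTGCA", "ACGTT", "ACGTT"]

profile = kmer_depth_profile(reference, count_kmers(reads, k=3), k=3)
print(profile)  # [3, 3, 3, 1, 1, 1]
```

The jump from 3 to 1 in the profile is the kind of signal that a copy number caller would interpret as a change in copy number between the two regions. Note that only the k-mer counts need to be stored, not an alignment file.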
Filtering of variant calls:
Because the field of bioinformatics wouldn't be the same without its custom file formats, we address one specific format: the Variant Call Format (VCF). VCF is a text-based format describing genomic variation, for example a duplication (copy number variation), a gene fusion or a circle. As the format is widely used and there initially was no binary counterpart, it can be tempting to resort to established text-processing tools such as awk, sed and grep. However, the intricacies of the file format essentially prohibit this, as the results will likely not be what one expects. Even with existing specialized tools such as bcftools, certain combinations of syntax and semantics are quite unintuitive and potentially lead to incorrect results. We therefore provide vembrane, a VCF tool which does not introduce its own domain-specific language but uses Python instead. It also has built-in support for certain custom annotations (such as those provided by SnpEff or VEP) and is the only tool which handles breakend records correctly. For example, extrachromosomal circular DNA variants have to be encoded using such breakend records.
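A small sketch of why plain text matching is fragile for VCF: a substring search for `DP=12` also matches a different key that merely ends in `DP`, whereas parsing the INFO column into key-value pairs does not. The parser below is a deliberately minimal illustration and is not vembrane's implementation; the two records and the `MIN_DP` key are made up for this example, and VCF header lines and genotype columns are ignored.

```python
def parse_info(info_field):
    """Parse a VCF INFO column into a dict; flags without '=' map to True."""
    info = {}
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def parse_record(line):
    """Split one tab-separated VCF data line into its eight fixed columns."""
    chrom, pos, id_, ref, alt, qual, filter_, info = line.rstrip("\n").split("\t")[:8]
    return {
        "CHROM": chrom, "POS": int(pos), "ID": id_, "REF": ref,
        "ALT": alt.split(","), "QUAL": qual, "FILTER": filter_,
        "INFO": parse_info(info),
    }


# Hypothetical records: only the first one actually has DP=12.
lines = [
    "1\t100\t.\tA\tT\t50\tPASS\tDP=12;SVTYPE=DUP",
    "1\t200\t.\tG\tC\t50\tPASS\tMIN_DP=12;DP=3",
]

# grep-style substring match: also hits "MIN_DP=12" in the second record.
naive = [line for line in lines if "DP=12" in line]

# Filtering on the parsed record with an ordinary Python expression.
parsed = [r for r in map(parse_record, lines) if int(r["INFO"]["DP"]) >= 12]

print(len(naive), [r["POS"] for r in parsed])  # 2 [100]
```

Once records are available as Python objects, the filter itself is just a Python expression, which is the design idea the paragraph above describes.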
Keywords
Bioinformatics, Algorithms, DNA sequencing, Genomics, Variants, Hidden Markov model, Copy number variation, Extrachromosomal circular DNA, K-mer methods, Variant calling, Sequencing errors, Nanopore sequencing, Software, Workflows