master's thesis
Abstract
Transcription start site (TSS) prediction is a classification problem at the intersection of machine learning and laboratory methods for measuring gene expression. The TSS is the position at which RNA polymerase transcribes the first nucleotide, and locating it helps characterize an organism's genome. We developed two variants of a prediction model for the plant model organism Arabidopsis thaliana, both based on the existing expression model Enformer and extended with upscaling and a custom loss function that proved crucial for successful training. The GFF model type supplements the sequence context with genome annotation information, which facilitated transfer between plant species, as demonstrated by transfer learning on corn. The MultiTSS model type uses the DNA sequence alone and shows no substantial performance degradation compared to the GFF model, demonstrating that it can capture and learn the important motifs that characterize a TSS. We show that the developed methods compare favorably with existing approaches and can also be applied without retraining. We also describe the procedure and pitfalls of this problem area, together with potential solutions.
Keywords
transcription start site;polymerase;bioinformatics;transformer;computer science;master's thesis;
Data
Language: |
English |
Year of publishing: |
2023 |
Typology: |
2.09 - Master's Thesis |
Organization: |
UL FRI - Faculty of Computer and Information Science |
Publisher: |
[D. Miškić] |
UDC: |
004:575.112(043.2) |
COBISS: |
169486083 |
Other data
Secondary language: |
Slovenian |
Secondary title: |
Transfer learning for transcription start site prediction across different plant species (Učenje s prenosom znanja za napovedovanje začetnega mesta transkripcije med različnimi vrstami rastlin) |
Secondary abstract: |
Transcription start site (TSS) prediction is a classification problem at the intersection of machine learning and laboratory methods for measuring expression. The site is the position where RNA polymerase begins transcribing the first nucleotide, and it can help characterize the genome of an organism. We developed two model variants on data from the plant model organism A. thaliana, both based on the core of the existing expression prediction model Enformer. To this core we added layers for increasing resolution and a custom loss function, which proved crucial for training success. The GFF model type uses information from the genome annotation to supplement the context, which demonstrably eased transfer between plants; we also showed this on the example of corn. The MultiTSS model type uses only the DNA sequence and, with no substantial performance degradation compared to GFF, demonstrates that this architecture is able to capture and learn the important motifs characteristic of a TSS. We also demonstrate that the developed methods compare favorably with existing approaches and can be used without retraining. Finally, we describe the procedure and pitfalls of this problem and propose possible solutions. |
Secondary keywords: |
transcription start site;polymerase;transformer;transfer learning;master's theses;Bioinformatics;Computer science;University and higher education works; |
Type (COBISS): |
Master's thesis/paper |
Study programme: |
1000471 |
Embargo end date (OpenAIRE): |
1970-01-01 |
Thesis comment: |
Univ. v Ljubljani, Fak. za računalništvo in informatiko |
Pages: |
XII, 83 pp. |
ID: |
20010373 |