master's thesis
Uroš Polanc (Author), Tomaž Curk (Mentor), Jan Zrimec (Co-mentor)

Abstract

Predicting tissue-specific gene expression is a crucial task in understanding the complex regulatory mechanisms that govern gene expression. In this research, we employed three distinct models, two convolutional neural networks (CNNs) and DNABERT, to explore the prediction of tissue-specific gene expression. As the genome, we chose the publicly available genome of Arabidopsis thaliana. Our approach involved systematically testing various methodologies, encompassing diverse transcript filtering techniques and an array of input sequences. The integration of multiple models and comprehensive input variations represents a significant step towards enhancing our understanding of tissue-specific gene expression prediction and furthering advancements in bioinformatics and computational biology. Our findings demonstrate the significance of both sequence data and additional CDS features in predicting gene expression. Combining these features yielded only a marginal performance increase. DNABERT struggled with sequence-only inputs but performed comparably to the CNN models once CDS features were added. The Washburn model exhibited the most pronounced tissue-specific performance (R-squared ≈ 0.40), followed by DNABERT (R-squared ≈ 0.34) and Zrimec (R-squared ≈ 0.31). The models faced challenges in predicting both low- and highly expressed genes but excelled in predicting mid-expressed genes. Additionally, predicting tissue-specific expression closely resembled predicting transcript mean expression, showing a consistent performance ordering across tissues. We analyzed kernel activations to showcase the model's pattern recognition skills. We cross-referenced these patterns with databases, finding around 650 matches. We used sequence occlusion to pinpoint important areas within the sequences. Our results highlighted the importance of the promoter near the TSS and the 5'UTR near the CDS in shaping model performance, especially with shorter occlusions. Additionally, all genomic regions except the terminator proved relevant when occluded in their entirety. In conclusion, we have demonstrated the model's capability to forecast tissue-specific gene expression and underscored the significance of non-coding genomic regions. While research in this field is ongoing, we hope that our findings contribute to the understanding of tissue-specific gene expression.
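
The sequence occlusion step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how occluded positions are masked, so this example assumes a one-hot encoded input in which the occluded window is set to zero; `one_hot`, `occlusion_scores`, and the stand-in model `toy_predict` are hypothetical names introduced only for illustration and are not taken from the thesis.

```python
import numpy as np

BASES = "ACGT"  # column order of the one-hot encoding (an assumption)


def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases stay all-zero."""
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            x[i, j] = 1.0
    return x


def occlusion_scores(x, predict, window, stride=1):
    """Slide a zeroed window over x and record the drop in the prediction.

    x: (length, 4) one-hot array; predict: callable returning a float.
    Returns the window start positions and the score (baseline - occluded).
    """
    baseline = predict(x)
    starts = np.arange(0, x.shape[0] - window + 1, stride)
    scores = np.empty(len(starts), dtype=np.float64)
    for k, s in enumerate(starts):
        occluded = x.copy()
        occluded[s:s + window, :] = 0.0  # assumed masking: remove all base information
        scores[k] = baseline - predict(occluded)
    return starts, scores


def toy_predict(x):
    # Hypothetical stand-in for a trained model: counts C/G at positions 4..7.
    return float(x[4:8, 1:3].sum())


seq = "AAAAGCGCAAAA"
starts, scores = occlusion_scores(one_hot(seq), toy_predict, window=2)
print(list(zip(starts.tolist(), scores.tolist())))
```

Windows that overlap the positions the stand-in model depends on produce the largest scores, which is the sense in which occlusion points to important regions. Short windows localize individual positions, while a window spanning a whole region (for example an entire promoter or terminator) measures the relevance of that region as a unit.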

Keywords

bioinformatics;convolutional neural network;DNA;DNABERT;gene expression;machine learning;sequence motifs;mRNA;predictive models;regulatory mechanisms;tissue-specific gene expression;tissue-specificity;computer science;master's thesis;

Data

Language: English
Year of publishing:
Typology: 2.09 - Master's Thesis
Organization: UL FRI - Faculty of Computer and Information Science
Publisher: [U. Polanc]
UDC: 004.85:575(043.2)
COBISS: 177600003

Other data

Secondary language: Slovenian
Secondary title: Globoko učenje tkivno specifičnega izražanja genov iz zaporedij DNA
Secondary abstract: Predvidevanje tkivno specifične genske ekspresije je ključno za razumevanje kompleksnih regulatornih mehanizmov, ki urejajo izražanje genov. V tem delu smo za raziskovanje napovednih modelov za tkivno specifično gensko ekspresijo uporabili tri različne modele: dve konvolucijski nevronski mreži (angl. convolutional neural network, CNN) in DNABERT. Za genom smo izbrali javno dostopno Arabidopsis thaliana. Naš postopek je obsegal sistematično testiranje različnih metod, ki zajemajo različne tehnike filtriranja transkriptov in raznolika vhodna zaporedja. Integracija večjega števila modelov in variacije vhodov predstavljajo pomemben korak k izboljšanju razumevanja napovedi tkivno specifične ekspresije genov ter prispevajo k napredku bioinformatike in računske biologije. Naši rezultati kažejo na pomembnost tako vhodnih zaporedij kot dodatnih značilk kodirajoče regije (CDS) pri napovedovanju izražanja genov. Kombinacija teh vhodnih podatkov je pokazala le zmerno izboljšanje učinkovitosti. DNABERT se je spopadal z vnosom samo vhodnih zaporedij, vendar je dosegel rezultate, primerljive z modeli CNN z dodanimi značilkami CDS. Najizrazitejšo tkivno specifično učinkovitost je pokazal model Washburn (R-kvadrat približno 0,40), sledila sta model DNABERT (R-kvadrat približno 0,34) in model Zrimec (R-kvadrat približno 0,31). Modeli so se soočali z izzivi pri napovedovanju nizko in visoko izraženih genov, izkazali pa so se pri napovedovanju zmerno izraženih genov. Ocena napovedi tkivno specifičnega izražanja genov je podobna oceni napovedi povprečne vrednosti vseh primerov transkripta. Dodatno smo pokazali, da sta oba modela CNN ovrednotila tkiva s primerljivim vrstnim redom. Da bi prikazali modelove spretnosti prepoznavanja vzorcev, smo analizirali aktivacije konvolucijskih jeder. Te vzorce smo primerjali z referencami v bazah podatkov in našli približno 650 ujemanj. Da bi natančneje določili pomembna območja znotraj zaporedij, smo uporabili zamegljevanje zaporedja. Naši rezultati so poudarili pomen promotorja blizu TSS in 5'UTR blizu CDS pri oblikovanju učinkovitosti modela, še posebej pri krajših zameglitvah. Pri zameglitvi celotnih območij so se vse genomske regije razen terminatorja izkazale za pomembne. Dokazali smo torej, da je model sposoben napovedati tkivno specifične genske ekspresije, in poudarili pomembnost nekodirajočih genomskih območij. Čeprav na tem področju poteka nenehno raziskovanje, si želimo, da bi naše ugotovitve prispevale k napredku razumevanja tkivno specifičnega izražanja genov.
Secondary keywords: DNA;DNABERT;genska ekspresija;konvolucijska nevronska mreža;mRNA;napovedni modeli;regulatorni mehanizmi;sekvenčni motivi;tkivna specifičnost;tkivno specifična genska ekspresija;magisteriji;Globoko učenje (strojno učenje);Bioinformatika;Genetika;Računalništvo;Univerzitetna in visokošolska dela;
Type (COBISS): Master's thesis/paper
Study programme: 1000471
Embargo end date (OpenAIRE): 1970-01-01
Thesis comment: Univ. v Ljubljani, Fak. za računalništvo in informatiko
Pages: VIII, 88 pp.
ID: 21172141