master's thesis
Abstract
The legal industry needs modern language technologies to cope with the large amounts of text it produces. Search is a core feature that lets users do their work better and faster. Modern context-aware approaches can improve many search-related features by quantifying the similarity between texts more accurately.
As a solution, we propose a transformer-based model that creates document embeddings using two interlaced encoders. We train three models with different levels of interlacing and also inform one model of the relative location of each segment within the document. As no differences between the models were detected in the training stage, the most feature-rich model was selected and compared, in a human evaluation, with a baseline doc2vec model on the task of recommending similar documents.
The results show that doc2vec is the better and more suitable model for the selected task. The evaluation revealed key problems with the proposed model's concept of similarity, which does not match the requirements of legal document recommendation.
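The recommendation task described above can be illustrated with a minimal sketch. It assumes that each document has already been mapped to a fixed-length embedding (by doc2vec, the proposed model, or anything else) and that candidates are ranked by cosine similarity; the abstract does not state which similarity measure the thesis uses, so the measure is an assumption. The function names, document identifiers, and toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_similar(query_id, embeddings, top_k=3):
    """Return up to top_k (doc_id, score) pairs, most similar first.

    The query document itself is excluded from the candidates.
    """
    query = embeddings[query_id]
    scored = [
        (doc_id, cosine_similarity(query, vector))
        for doc_id, vector in embeddings.items()
        if doc_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real document embeddings would come from a
# trained model and have far more dimensions.
toy_embeddings = {
    "ruling_a": [1.0, 0.0, 0.0],
    "ruling_b": [0.9, 0.1, 0.0],
    "contract_c": [0.0, 1.0, 0.0],
}

recommendations = recommend_similar("ruling_a", toy_embeddings, top_k=2)
```

Under this setup the embedding model alone determines what counts as "similar", which is why a mismatch between a model's concept of similarity and the needs of legal users shows up directly in the recommendations.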
Keywords
document similarity;document recommendation;legal documents;long documents;natural language processing;transformer neural networks;computer science;master's thesis;
Data
Language: |
English |
Year of publishing: |
2022 |
Typology: |
2.09 - Master's Thesis |
Organization: |
UL FRI - Faculty of Computer and Information Science |
Publisher: |
[L. Vranješ] |
UDC: |
004.8:81'322(043.2) |
COBISS: |
125574147 |
Other data
Secondary language: |
Slovenian |
Secondary title: |
Podobnost poljubno dolgih pravnih besedil |
Secondary abstract: |
The use of modern language technologies in the legal industry is necessary for it to cope more easily with the large amounts of text it produces. Efficient search is one of the key solutions that allow users to manage their work better and faster. Through better awareness of context, modern approaches can improve many search-related functions.
As a solution, we propose an architecture based on the transformer neural network that creates a document representation using two interlaced encoders. We tested three models with different levels of interlacing, and one model that we inform of the relative location of a segment within the document. We detected no differences among them on the validation set, so we used the most elaborate model for manual testing. In manual testing on the task of recommending similar documents, we compare our selected model with the doc2vec model.
The results show that the doc2vec model is more suitable for the tested problem. Testing revealed shortcomings of the proposed model, especially in its representation of similarity, which does not match what is required in the context of recommending similar legal texts. |
Secondary keywords: |
document similarity;document recommendation;legal documents;long documents;transformer neural networks;master's theses;Natural language processing (computer science);Computational linguistics;Computer science;University and higher education works; |
Type (COBISS): |
Master's thesis/paper |
Study programme: |
1000471 |
Embargo end date (OpenAIRE): |
1970-01-01 |
Thesis comment: |
Univ. v Ljubljani, Fak. za računalništvo in informatiko |
Pages: |
IV, 46 pp. |
ID: |
16643704 |