Open Science Slovenia – access to knowledge from Slovenian research organizations

In 2013, Slovenian universities, with co-funding from the European Regional Development Fund and the Ministry of Education, Science and Sport, launched a national Open Science portal and repositories providing open access to theses and to the research results of researchers. Bilingual web and mobile interfaces and a recommendation system are available to users worldwide. Between 2014 and 2022, this infrastructure was complemented by repositories for independent research organizations and other higher education institutions, a national server for assigning persistent identifiers, and a big data repository. Several other providers of research results have also been included. The repositories and the national Open Science portal make the research results of Slovenian research organizations accessible to researchers, students, businesses, and other users at home and around the world. Researchers have an infrastructure that enables them to comply with the provisions on mandatory open access to the results of publicly funded research.

For a more detailed description of the establishment of a national open access infrastructure in 2013, see Milan Ojsteršek, Janez Brezovnik, Mojca Kotar, Marko Ferme, Goran Hrovat, Albin Bregant, Mladen Borovič, (2014) "Establishing of a Slovenian open access infrastructure: a technical point of view", Program: electronic library and information systems, Volume 48, Issue: 4, pp. 394 - 412.

The Slovenian open-access infrastructure

Figure 1: Structure diagram of Slovenian open-access infrastructure

The Slovenian open-access infrastructure consists of the Slovenian universities' repositories (the Repository of the University of Ljubljana, the Digital Library of the University of Maribor, the Repository of the University of Primorska, and the Repository of the University of Nova Gorica), the repository for research organizations (DIRROS), the repository for standalone faculties that are not part of universities (REVIS), and a national portal that aggregates content from the repositories and from other Slovenian archives (the Digital Library of Slovenia (dLib.si), VideoLectures.NET, the Digital Library of the Ministry of Defence (DKMORS), the Social Science Data Archive (SSDA), CLARIN.si, and the repositories of Slovenian open access publishers of journals and monographs). The national portal provides a common search engine, a recommender system for similar publications, and text-matching software. It is also used for deduplication of publications, normalization of publication authors, and distribution of content and metadata to other repositories when the authors come from different institutions. The repositories are connected to the national bibliographic system COBISS.SI and the national current research information system SICRIS. Aggregated data from the repositories enable funders to check the actual openness of scientific publications, research data, and other research results. Metadata from the repositories are harvested by OpenAire, Google Scholar, Google Dataset Search, Core, the European portal for doctoral theses DART-Europe, and other directories, aggregators, and search engines. The Slovenian open-access infrastructure has a national service for assigning persistent identifiers to digital objects and a big data archive for storing big data sets from the repositories. Digital objects from the national infrastructure are used in the Slovenian national supercomputing network SLING, the national COVID-19 data portal, and the European COVID-19 data portal.
The Academic and Research Network of Slovenia (ARNES) provides the backbone network, based on broadband fiber-optic links, and hosts the DIRROS and REVIS repositories on its server infrastructure. We also use ARNES disk storage for backups. The Institute of Information Sciences in Maribor (IZUM) provides the VEGA supercomputer and a big data archive, from which users can process data on the Slovenian National Supercomputing Network (SLING) and on other supercomputers in Europe. The infrastructure enables Slovenia to implement the policy of open access to the results of nationally funded research, as expected of the EU Member States participating in the ERA.

Web links:

  • National portal: http://www.openscience.si/
  • Description of Slovenian open-access infrastructure:
  • University of Ljubljana Repository: https://repozitorij.uni-lj.si/info/index.php/eng/
  • University of Maribor Digital Library: https://dk.um.si/info/index.php/eng/  
  • Repository of the University of Primorska: http://repozitorij.upr.si/info/index.php/eng/  
  • Repository of the University of Nova Gorica: http://repozitorij.ung.si/info/index.php/eng/   
  • Repository of independent higher education and higher vocational education organizations: http://revis.openscience.si/info/index.php/eng/
  • Repository of Slovenian research organizations that are not part of Slovenian universities: http://dirros.openscience.si/info/index.php/eng/
  • Social science data archive: https://www.adp.fdv.uni-lj.si
  • CLARIN.si: https://www.clarin.si/repository/xmlui/
  • Digital library of Ministry of Defence: https://dk.mors.si/info/index.php/sl/
  • Digital library of Slovenia: https://www.dlib.si/
  • Videolectures.net: https://videolectures.net/
  • Journals from Slovenian Academy of Science: https://ojs.sazu.si/
  • Journals from the University of Maribor: https://journals.um.si/index.php/
  • Slovenian COVID-19 national portal: http://covid19dataportal.si/

Key facts:

  • A national approach to building open science infrastructure and FAIR digital objects.
  • National PID service.
  • National big data archive.
  • Templates for amendments to the policies about a mandatory copy of research publications, research data, final theses, and other research results (software, workflows, lab notebooks, online courses…) are developed for all partner institutions.
  • Development of adapted processes for depositing publications, research data sets, and other research results by students and researchers in all partner institutions.
  • OpenAire compatibility is established to facilitate the registration, discovery, access, and re-use of research publications and research data, in particular in the context of funded projects across European countries.
  • Integration with national bibliographic system COBISS, national current research information system SICRIS, ARNES AAI, Crossref, Datacite, university information systems, and university authentication systems.
  • Plagiarism detection software is developed and included in the processes for depositing publications by students and researchers.
  • A recommender system for similar works within repositories and between repositories and other archives is implemented.

 

Repository and its structure

Figure 2: Structure diagram of repository infrastructure

The repository software (figure 2) is based on the software solution used by the Digital Library of the University of Maribor and developed by the Laboratory for Heterogeneous Computing Systems of the University of Maribor. It has been significantly updated and upgraded with new functionalities due to the establishment of different processes of publication submission by students and university staff.

To support the submission, preservation and cataloguing of digital objects, each institutional repository of the universities of Maribor, Ljubljana, Primorska and Nova Gorica is connected to the university authentication system, the university higher education information system and the COBISS.SI system. Several institutions in the REVIS repository also have their academic information system linked to the repository software.

Each digital object is given a national persistent identifier (PID): once the object has been cataloged, the repository software calls a national service that returns the identifier. We use EUDAT's B2Handle service to assign persistent identifiers.

The repositories store the big data sets in a big data archive. We use EUDAT's B2Safe service to archive them. We use EUDAT's B2Stage service for transferring data between the supercomputers and the big data archives.

To speed up the metadata deposit of digital objects already assigned a persistent DOI in repositories, we use the API services offered by Crossref and Datacite. The repository software calls the service by sending the DOI persistent identifier as input to the service and retrieves back the metadata stored by Crossref and Datacite about that digital object.
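The DOI lookup described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the record layout returned by `from_crossref` and the sample payload are assumptions made for the example, and no network request is sent. The two base URLs are the public REST endpoints of Crossref and DataCite.

```python
from urllib.parse import quote

CROSSREF_API = "https://api.crossref.org/works/"
DATACITE_API = "https://api.datacite.org/dois/"


def lookup_urls(doi: str) -> dict:
    """Build the REST lookup URLs for a DOI (nothing is requested here)."""
    encoded = quote(doi, safe="/")
    return {"crossref": CROSSREF_API + encoded, "datacite": DATACITE_API + encoded}


def from_crossref(payload: dict) -> dict:
    """Flatten a Crossref 'works' response into a simple deposit record.

    The output field names are hypothetical; a real repository would map
    the values onto its own metadata schema.
    """
    msg = payload["message"]
    date_parts = msg.get("issued", {}).get("date-parts", [[None]])
    return {
        "doi": msg.get("DOI"),
        "title": (msg.get("title") or [None])[0],
        "authors": [
            " ".join(p for p in (a.get("given"), a.get("family")) if p)
            for a in msg.get("author", [])
        ],
        "year": date_parts[0][0] if date_parts and date_parts[0] else None,
        "journal": (msg.get("container-title") or [None])[0],
    }


# Hand-written sample in the shape of a Crossref response (not real data).
SAMPLE = {
    "status": "ok",
    "message": {
        "DOI": "10.1234/example",
        "title": ["An example article"],
        "author": [
            {"given": "Ana", "family": "Novak"},
            {"given": "Janez", "family": "Kranjc"},
        ],
        "issued": {"date-parts": [[2014, 9, 1]]},
        "container-title": ["Example Journal"],
    },
}

record = from_crossref(SAMPLE)
```

In practice the depositor would still review the prefilled record, since the metadata held by the registration agency can be incomplete.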

For the purpose of aggregation of metadata by OpenAire, we have established an OAI-PMH service that returns metadata according to the instructions provided by OpenAire. Other aggregators (Core, Dart Europe, Base...) also aggregate metadata through this service. For Google Scholar and Google Dataset search, we have embedded metadata for individual digital objects in the repository’s website according to the Highwire press format and the Schema.org specification.
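To illustrate what an aggregator does with such a service, the sketch below builds an OAI-PMH `ListRecords` request and reads Dublin Core fields from a response. The base URL, the set name `openaire`, and the sample record are assumptions for the example; the verb, the `metadataPrefix` parameter, and the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Extract identifier, titles and rights statements from a response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "titles": [e.text for e in rec.iterfind(".//dc:title", NS)],
            "rights": [e.text for e in rec.iterfind(".//dc:rights", NS)],
        })
    return records


# Hand-written sample response (hypothetical identifier and title).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example:1234</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example thesis</dc:title>
          <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://repository.example/oai", set_spec="openaire")
records = parse_records(SAMPLE)
```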

At the University of Ljubljana, digital objects are stored in the University's document system after being deposited in their repository. The national portal, ARNES archive, and IZUM big data archive are used to archive digital objects and their metadata from the repositories.

The repositories send metadata and electronic versions of digital objects to the national portal as soon as they have been cataloged in the national bibliographic system COBISS.SI. From the national portal, the University of Ljubljana repository retrieves metadata and electronic versions of publications from ePrints.FRI, PeFprints and ADP. The repository also retrieves additional metadata about researchers and research organizations from the national portal, which it extracts from SICRIS.

The repositories send to dCOBISS the data needed for open access analytics. The metadata they send is linked to the projects that funded the research and to the APC payments charged by the publishers.

The national portal generates the recommendations of digital objects. When a user opens the metadata of a digital object in the repository, the national portal sends a list of similar documents to the institutional repository. The recommendations consist of the titles of documents within the repository and the titles of documents in other university repositories, dLib.si, the Social Science Data Archive, CLARIN.si, journal and monograph publishers' repositories, VideoLectures.NET and DKMORS.

The repositories provide both curator-oriented and user-oriented functionalities. The curator part is used by the librarians, the data stewards, and the system administrators and is designed differently for each institution. The student office employees carry out the checking and locking of students' final theses. Librarians review student and staff publications, catalog them in COBISS, and transfer their metadata from COBISS.SI to the repository. In the curator part, the librarian can import the publication metadata from the local COBISS.SI database and add an electronic version of the digital object to it. In this way, digital objects that are already cataloged in COBISS.SI, for which electronic versions exist and for which the University has the appropriate copyrights, can also be stored in the repository.

The user part of the institutional repository is divided into a part for the interested public and a part for registered users (students and university staff). After registration, students and university staff can submit their works to the repository and browse their own content (metadata and similar content found by the text-matching software). The part accessible to the interested public is bilingual (Slovene and English user interface) and is available on the web and on mobile platforms (Android and iOS). The web version is accessible to users with disabilities and includes the main features of web applications that comply with the WAI specification. The web interface can be used by people with reduced mobility and by people with slightly reduced vision (e.g. the elderly and the visually impaired).

The software supports simple and advanced search and browsing. A member institution can integrate content from the repository into its own website by calling the JavaScript API, which gives access to simple or advanced search and browsing of the institutional repository. The same API is also used by the mobile applications. University members and staff can also export metadata about their publications in RSS, JSON, and RDF formats.

The repository displays various statistics. For each institution or individual unit within an institution, they show the total number of its digital objects in the repository, how many were stored in the last period, and the number of metadata accesses and downloads of each digital object. For the faculties of each university, the statistics of interest are those that report the annual number of views and downloads of the faculty's materials for previous years. The statistics on thesis mentors show which other mentors they cooperate with and which theses students have produced under their supervision. Another interesting statistic is based on the keywords of the mentor's publications: it indirectly shows which research areas the mentor is involved in and how the mentor's research area has changed over time.

 

Common services offered by the national portal

The repositories use common services offered by the national portal (figure 3). These services are:

  • Persistent Identifier (PID) Assignment and Resolution Service: Each digital object is assigned a national Persistent Identifier (PID). After the object has been cataloged, the repository software calls the national service, which returns the PID. We use EUDAT's B2Handle service to assign PIDs.
  • Big Data Archive: Big research data sets are stored by the repositories in a big data archive. We use EUDAT's B2Safe service to archive them. We use EUDAT's B2Stage service to transfer data between the supercomputers and the big data archives.
  • Shared services:
    • Recommendation system service. The service returns for each digital object the most similar digital objects in the same repository and digital objects from other repositories and external repositories and archives included in the national open access infrastructure.
    • A service for the conversion of different types of documents into text.
    • Optical character recognition (OCR) and image-to-text service.
    • Similar Content Detection Service: For each digital object containing files from which text can be extracted, the service searches for the most similar texts.
    • A service for determining geographical and temporal coverage. The service shall allow the determination of the geographical and temporal coverage defined by the user of the repository through a web application. The geographic and temporal coverage metadata shall be added to the metadata of the specified digital object. The service is still in the test phase.

 
Figure 3: Structure diagram of research big data archive and PID service infrastructure

 

Established processes

The process established at the universities allows authors to submit their work to the institutional repository, after which a librarian reviews the submission and catalogues it in COBISS.SI. Experience accumulated since 2008 shows that a manual review of metadata is necessary, because submissions are often incomplete or contain spelling errors. In addition, only librarians can normalize author names using the CONOR.SI authority file. Once the publication metadata have been catalogued in COBISS.SI, they can be transferred to the institutional repository via SRU/SRW services.
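As a rough illustration of the metadata transfer, an SRU `searchRetrieve` request is an ordinary HTTP GET with a CQL query. In the sketch below the endpoint address, the CQL index `dc.identifier`, and the record identifier are hypothetical, since the text does not give the real COBISS.SI service details; only the parameter names are taken from the SRU specification.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; the real COBISS.SI SRU address is not given in the text.
SRU_ENDPOINT = "https://sru.example.org/cobiss"


def sru_search_url(cql_query, max_records=1, base_url=SRU_ENDPOINT):
    """Build an SRU searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": max_records,
    }
    return f"{base_url}?{urlencode(params)}"


# Ask for the single catalogue record with a given (made-up) identifier.
url = sru_search_url('dc.identifier="123456"')
query = parse_qs(urlsplit(url).query)
```

The response is an XML document containing the matching catalogue records, which the repository would then map onto its own metadata fields.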

A working group of librarians, legal experts and IT professionals from all four Slovenian universities proposed the submission processes for final study works. Slovenia lacks a common university academic information system (UAIS), so each university provides its own variation of this process. Students at the universities of Maribor and Nova Gorica submit their final study works to the university repository, which is prefilled with some metadata from the university academic information system. A sequence diagram of final study work submission and publication at the universities of Maribor and Nova Gorica is presented in figure 4. At the universities of Ljubljana and Primorska, submission takes place in the university academic information system; the sequence of operations is otherwise the same as described for the universities of Maribor and Nova Gorica.

Figure 4: A sequence diagram of final study work submission and publication at the universities of Maribor and of Nova Gorica

Figure 5: A sequence diagram of research item submission and publication

All four institutions have established the same process for the submission of research publications, shown as a sequence diagram in figure 5. Researchers can submit articles, monograph chapters, monographs, conference papers, e-lectures, publications about patents, research data, and other types of publications. The publication types follow the COBISS.SI typology (http://home.izum.si/COBISS/bibliografije/Tipologija_eng.pdf), which the Slovenian Research Agency uses to evaluate researchers' bibliographies. Once logged into the institutional repository, researchers can either enter a new record in full or reuse the metadata of a record already catalogued in COBISS.SI (the optional request and reply at the beginning of the process). In the latter case, the institutional repository transfers the metadata from COBISS.SI using the SRU/SRW protocol, and researchers only need to provide the electronic version of their publication. Authors are also given a link to the SHERPA/RoMEO portal, where they can check what type of access the publisher's copyright transfer agreement allows. When names and surnames are entered, suggestions from the CONOR.SI authority file are offered. These suggestions include the year of birth, if available in CONOR.SI, and the researcher identifier from SICRIS, which makes it easier to identify the correct author and can greatly simplify the librarian's work of cataloguing the publication in COBISS.SI. Authors can also specify the copyright holder and the type of access to the full text: immediate publication, closed access, or delayed publication with an embargo (these metadata are part of OpenAIRE compliance).
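The three access choices can be expressed with the access-rights terms from the OpenAIRE guidelines. The sketch below is only an illustration: the function names and the keys `immediate`, `closed` and `embargo` are hypothetical, and the real repository may store these values differently.

```python
from datetime import date

# Access-rights terms defined by the OpenAIRE guidelines.
ACCESS_TERMS = {
    "immediate": "info:eu-repo/semantics/openAccess",
    "closed": "info:eu-repo/semantics/closedAccess",
    "embargo": "info:eu-repo/semantics/embargoedAccess",
}


def access_metadata(choice, embargo_end=None):
    """Return the Dublin Core fields that describe the chosen access type."""
    if choice not in ACCESS_TERMS:
        raise ValueError(f"unknown access type: {choice}")
    fields = {"dc:rights": ACCESS_TERMS[choice]}
    if choice == "embargo":
        if embargo_end is None:
            raise ValueError("an embargo needs an end date")
        fields["dc:date"] = f"info:eu-repo/date/embargoEnd/{embargo_end.isoformat()}"
    return fields


def is_downloadable(choice, embargo_end, today):
    """Decide whether the full text may be served on a given day."""
    if choice == "immediate":
        return True
    if choice == "embargo":
        return embargo_end is not None and today >= embargo_end
    return False


fields = access_metadata("embargo", date(2025, 1, 1))
```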

Establishing processes to support the handling of research data in the national open access infrastructure:

Pre-publication activities:

1. Phase before publication of research data:

  • Planning and finding data sources.
  • Preparation of a research data management plan, applications for the ethics committee, and proposals for informed consent, proposals for declarations by data providers.
  • Obtaining relevant statements and opinions.
  • Data collection or creation.
  • Data processing and analysis.
  • Preparation of files in appropriate formats.
  • Preparation of documentation.

2. Before researchers apply for the publication of a research dataset in the national open access infrastructure, they must have:

  • a data management plan (if requested by the funder or by the organization that employs them),
  • metadata about the research dataset,
  • documentation that is necessary for understanding and using the data,
  • data files in appropriate formats,
  • ethical approval if the research study involves humans, animals or environmental data,
  • statements of data providers and signed informed consents of research participants,
  • defined licenses for the use of research data,
  • the software, containers, and workflows that were used to generate or process the data, if the researchers created them themselves,
  • research notes and other research results, if any.

Publication in the repository or data archive:

  • The researcher or the researcher's librarian inserts the research data set and other research results into the repository or data archive.
  • The librarian checks the adequacy of the metadata and whether the appropriate documentation is available.
  • The librarian informs the body within the institution that is in charge of checking the appropriateness of publishing data and other research results that the data set and other research results have been uploaded. At this stage they are in closed access and are only available via a link that requires a password provided by the librarian.
  • The responsible body within the institution checks the adequacy of the content of the data set and other research results. If the content is appropriate, it informs the librarian that the data set and other research results can be published.
  • After a positive response from the responsible body, the librarian publishes the data set and other research results in the repository and catalogs them in COBISS.
  • The central specialized information center for the scientific field, established by the Slovenian Research and Innovation Agency, checks the adequacy of the typology, metadata, and documentation of the research data set and other research results.

Digital preservation:

Data can be stored in different formats and in several versions. For the digital preservation of research data, we must ensure that the data are independent of the technology. We are working on establishing processes for digital preservation according to the OAIS reference model (ISO 14721).

 

Recommendation system

Content-based recommendation within the Slovenian national open access infrastructure is carried out at the national portal. On viewing a document within the institutional repository, the national portal sends a list of similar documents. Two different kinds of recommendation lists are sent. The first consists of similar documents within the institutional repository. The second shows similar content from across all other repositories including dLib.si, VideoLectures.NET and DKMORS. Both recommendation lists are cached for each document and stored within a database for enabling real-time responses.

The objective of the recommendation system is to help visitors of institutional repositories find similar documents after clicking on a document within the institutional repository. Partial duplicates are omitted from the recommendations. They are determined using similar-sentence and substring detection: if two documents have a sentence-based or substring-based coverage value of more than 60 %, they are marked as partial duplicates. The recommendation software performs content-based document recommendation using the BM25 ranking function, with additional weights applied during ranking. First, the metadata for each publication are obtained (authors, title, keywords, and abstract). Second, the metadata and the full text of the publication are lemmatised. Third, the metadata and full text are semantically tagged using Wikipedia articles. Fourth, term frequencies (TF) and inverse document frequencies (IDF) are calculated for each publication from the semantically tagged metadata and full text. Similarity with other publications is then calculated using the BM25 ranking function proposed by Robertson, Zaragoza and Taylor. Pairs of documents with a similarity value of 0 are discarded as dissimilar. The result is a list of similar document pairs, which is stored in the database. The recommendation threshold is set depending on the BM25 values of the documents on the recommendation list. The recommendation therefore consists of the five highest-ranking publications on the list that exceed the threshold, ordered by BM25 value. Other criteria, such as the issue year, number of downloads, number of views, and average rating, are also used during ranking. A recommendation list can be empty if the system cannot find any similar publications.
The essential task is to maintain the database with up-to-date similarities when new documents are added to the system.
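The ranking step can be illustrated with a plain BM25 scorer. This is a simplified sketch under several assumptions: it uses single-field BM25 with common default parameters (`k1=1.2`, `b=0.75`) and a smoothed IDF rather than the weighted multi-field variant of Robertson, Zaragoza and Taylor, it treats the terms of the viewed document as the query, and it leaves out lemmatisation, semantic tagging and the additional ranking criteria. The toy documents are invented.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (id -> list of terms) against the query."""
    if not docs:
        return {}
    n = len(docs)
    avgdl = sum(len(terms) for terms in docs.values()) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for terms in docs.values() for t in set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        score = 0.0
        for t in set(query_terms):
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(terms) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def recommend(doc_id, docs, threshold=0.0, top_n=5):
    """Return up to `top_n` most similar documents scoring above the threshold."""
    others = {k: v for k, v in docs.items() if k != doc_id}
    scores = bm25_scores(docs[doc_id], others)
    ranked = sorted(((s, k) for k, s in scores.items() if s > threshold), reverse=True)
    return [k for _, k in ranked[:top_n]]


DOCS = {
    "a": ["open", "access", "repository", "slovenia"],
    "b": ["open", "access", "policy"],
    "c": ["protein", "folding", "simulation"],
    "d": ["repository", "metadata", "harvesting", "open"],
}

similar_to_a = recommend("a", DOCS)
similar_to_c = recommend("c", DOCS)
```

Documents that share no terms with the viewed document score 0 and are dropped, which is why a recommendation list can come back empty.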

Figure 6: Example of recommendations of similar digital objects

References:

  • Bobadilla, J., Ortega, F., Hernando, A. and Gutiérrez, A. (2013). Recommender systems survey. Knowledge-based systems, 46 (7), 109-132. http://dx.doi.org/10.1016/j.knosys.2013.03.012.
  • Robertson, S., Zaragoza, H. and Taylor, M. (2004). Simple BM25 extension to multiple weighted fields. In Proceedings of the thirteenth ACM international conference on Information and knowledge management. New York: ACM, 42–49.
  • Su, X. and Khoshgoftaar, T. M. (2009). A survey of collaborative filtering techniques. Advances in artificial intelligence, Article ID 421425. http://dx.doi.org/10.1155/2009/421425.

 

Text matching software

Figure 7: Example of interactive report of text matching system

In terms of content, there is a distinction between checking for similarity of content and checking for plagiarism. The relevant software determines the degree of similarity of the content. Plagiarism is decided by a human based on the degree of similarity and other criteria. The software solution for the detection of similar content is tailored to the analysis of texts in the Slovenian language (it takes into account inflection of the language, synonyms), which is its main advantage compared to competing products that are mainly focused on checking texts in the English language.

Text matching (detection of similar document content) is performed within the national portal. The results of the similarity check are available to students and staff, who can only see those documents in which they had the role of author, co-author or mentor. The check is performed for every publication submitted to the repositories. Both coarse-checking (step 1) and fine-checking (step 2) analyses are supported by the custom-built software of our laboratory. The software does not check the similarity between images.

The coarse-checking software returns similar sentences that are longer than forty characters (an example of a result is shown in the figure above). Coarse-checking relies on our natural language processing tool, TextProc. The language infrastructure used during the plagiarism detection process consists of a morphological dictionary for the Slovenian language, which contains approximately 8,000,000 word forms and 320,000 lemmas. Labels taken from the article titles of the Slovenian, English, and German Wikipedia are also used; they were extracted from DBpedia and Wikidata. A domain-specific semantic dictionary was created using keywords from the publication metadata within the Open Science Slovenia portal. Sentences marked as similar by the coarse-checking software are effectively the same in both texts; they differ only if the authors used synonyms, a different grammatical person, or filler words (e.g. "therefore", "however"). The software detects similarity even if the word order is changed or if any of the words are misspelled. The coarse-checking algorithm, which has since been updated, first converts the text to UTF-8 and removes extra whitespace and newline characters (CR, LF). It then splits the content into sentences and lemmatises the words of these sentences. Common words (e.g. "and", "or", "that") are filtered out and the remaining words are sorted alphabetically. This step also corrects spelling using a morphological dictionary and a POS tagger; spelling errors are corrected with the "Symmetric Delete Spelling Correction Algorithm". After lemmatization, the algorithm normalizes synonyms that are stored in the dictionary and can be reduced to a single form without changing the meaning. A good example is the normalisation of the verbs "to present", "to describe" and "to show", which are synonyms in most cases.
The algorithm then calculates hashes for these sentences. Finally, it compares the hashes of sentences from other documents within the corpus of documents, calculates the coverage of hashes for document pairs and provides a list of similar documents for every document within the corpus.
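The coarse-checking steps can be sketched in a few lines. This is a toy approximation rather than the laboratory's software: the stop-word list, lemma table and synonym table are tiny hypothetical stand-ins for the morphological and semantic dictionaries, SHA-1 is an arbitrary choice of hash, and spelling correction is omitted.

```python
import hashlib
import re

# Toy stand-ins for the real dictionaries (assumptions for this example).
STOPWORDS = {"and", "or", "that", "the", "a", "of", "therefore", "however"}
LEMMAS = {"repositories": "repository", "shows": "show",
          "presents": "present", "describes": "describe"}
SYNONYMS = {"describe": "present", "show": "present"}


def sentence_hashes(text, min_chars=40):
    """Hash each sufficiently long sentence in a word-order-independent way."""
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace, CR and LF
    hashes = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if len(sentence) <= min_chars:
            continue  # only sentences longer than forty characters count
        words = re.findall(r"\w+", sentence.lower())
        words = [LEMMAS.get(w, w) for w in words]      # lemmatise
        words = [SYNONYMS.get(w, w) for w in words]    # normalise synonyms
        words = sorted(w for w in words if w not in STOPWORDS)
        hashes.add(hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest())
    return hashes


def coverage(doc, other):
    """Share of the hashed sentences of `doc` that also occur in `other`."""
    own, theirs = sentence_hashes(doc), sentence_hashes(other)
    return len(own & theirs) / len(own) if own else 0.0


DOC_A = ("The national portal shows similar documents from the repositories "
         "of universities. This second sentence is unique to the first "
         "document only.")
DOC_B = ("From the repositories of universities the national portal presents "
         "similar documents.")

cov_ab = coverage(DOC_A, DOC_B)
cov_ba = coverage(DOC_B, DOC_A)
```

Because the words are sorted before hashing, the reordered sentence in `DOC_B` produces the same hash as the first sentence of `DOC_A`, and the synonym table maps "shows" and "presents" to the same form.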

Documents found to be more than 1 % similar to the reviewed document during coarse-checking become candidates for the fine-checking step. If there are fewer than 50 such candidates, the remaining similar documents are retrieved using the BM25 ranking function. The fine-checking algorithm (figure below) finds the longest common sub-sequences between two texts. Kärkkäinen's algorithm is used for finding common sub-sequences longer than 14 characters. Our own sequence-matching algorithm (by Ferme and Ojsteršek) is used when a pair of documents has more than 60 % coarse-checking coverage of hashes; its time complexity is nearly O(N) if the two documents are very similar. No spelling corrections are carried out in this step; the texts are used as-is. When two documents are fine-checked (figure 7), the software marks the phrases or parts of sentences that are the same in both documents. The reviewer's task is to determine whether a sentence or a paragraph has been copied. Some sentences can be semantically identical as a whole or in part if the author has paraphrased the copied content. The similarity constants (1 % coarse-checking coverage, 50 candidate documents for fine-checking, sentence length of 40 characters, subsequence length of more than 14 characters, etc.) were selected on the basis of experience gained from examining plagiarism cases.
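The idea of fine-checking can be illustrated with the standard library's `difflib.SequenceMatcher`, used here purely as a stand-in: it is neither Kärkkäinen's algorithm nor the algorithm by Ferme and Ojsteršek, and it is far slower on long texts. The function name and the sample sentences are invented; only the length limit of more than 14 characters comes from the text.

```python
from difflib import SequenceMatcher


def shared_passages(text_a, text_b, min_len=15):
    """Return passages of at least `min_len` characters common to both texts."""
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    return [
        text_a[m.a:m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size >= min_len  # i.e. longer than 14 characters
    ]


TEXT_A = "Our results show that open access increases the visibility of research."
TEXT_B = "As expected, open access increases the visibility of theses in Slovenia."

passages = shared_passages(TEXT_A, TEXT_B)
```

The returned passages are what an interactive report would highlight in both documents, leaving the judgement about copying to the reviewer.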

 

Segmentation of PDF documents

Document segmentation is performed to improve the detection of similar content and to enable knowledge extraction and recommendation. It is based on parsing the document structure using regular expressions. Currently, segmentation works properly on final study works, which are also the most common type of document in the institutional repositories. The goal of segmentation is to extract the title, abstract, and keywords in the primary and secondary language, the table of tables, the table of figures, lists of URLs, DOIs and URNs, the list of equations, the glossary of terminology, the glossary of abbreviations, the table of contents, the chapters, and the bibliography. The chapters are also split into sentences. Segmentation lets us enrich the publication metadata, which is useful because some libraries omit abstracts, titles, and keywords in an alternative language. In these cases, the software suggests these metadata to the librarian.

The quality of segmentation depends heavily on how well students follow the instructions and guidelines for formatting the final study work, as provided by the faculty or academy. There were many inconsistencies in this respect, which resulted in poor segmentation. This was especially visible where the table of contents did not match the segmented chapters, and the same problem appeared when segmenting the tables of figures. Citations and abbreviations were also a problem in some cases. The regular expressions used were written quite generally, an approach that works well for patterns that do not deviate too much from the average. Despite all the difficulties, the majority of documents were successfully segmented. Most of the remaining problems occurred in parsing references, because students do not follow the citation guidelines provided by the faculty or academy.
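A much reduced sketch of regex-based segmentation is shown below. The patterns and the assumed layout (an "Abstract" heading, a "Keywords:" line, numbered chapter headings) are illustrative guesses and not the expressions used in the repositories; the sample text and its DOI are invented.

```python
import re

DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
URL_RE = re.compile(r"https?://[^\s\"<>]+")
# A numbered heading such as "2.1 Data collection" on a line of its own.
HEADING_RE = re.compile(r"^(?P<num>\d+(?:\.\d+)*)[ \t]+(?P<title>[A-Z].*)$", re.MULTILINE)
ABSTRACT_RE = re.compile(r"^Abstract\s*\n(.+?)(?=\n\s*\n|\nKeywords?:)",
                         re.MULTILINE | re.DOTALL | re.IGNORECASE)
KEYWORDS_RE = re.compile(r"^Keywords?:[ \t]*(.+)$", re.MULTILINE | re.IGNORECASE)


def segment(text):
    """Pull a few structural elements out of the plain text of a thesis."""
    abstract = ABSTRACT_RE.search(text)
    keywords = KEYWORDS_RE.search(text)
    return {
        "abstract": " ".join(abstract.group(1).split()) if abstract else None,
        "keywords": [k.strip() for k in keywords.group(1).split(",")] if keywords else [],
        "chapters": [(m["num"], m["title"].strip()) for m in HEADING_RE.finditer(text)],
        "dois": DOI_RE.findall(text),
        "urls": URL_RE.findall(text),
    }


SAMPLE = """Abstract
This thesis studies open access repositories
in Slovenia.

Keywords: open access, repositories, metadata

1 Introduction
Some text, see https://www.openscience.si/ for details.

2 Methods
2.1 Data collection
The dataset has DOI 10.1234/example.5678 assigned.
"""

parts = segment(SAMPLE)
```

A thesis that labels its abstract differently or leaves its chapters unnumbered would yield `None` or an empty list here, which mirrors the dependence on students following the formatting guidelines.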

 

Mobile applications

The mobile apps for searching the National Open Access Infrastructure run on Android and iOS.
