Open Science Slovenia – access to knowledge from Slovenian research organizations
In 2013, Slovenian universities, with co-funding from the European Regional Development Fund and the Ministry of Education, Science and Sport, launched a national Open Science Portal and repositories for open access to theses and researchers' results. Bilingual web and mobile interfaces and a recommendation system are available to users from all over the world. Between 2014 and 2022, this infrastructure was complemented by repositories for independent research organizations and independent higher education institutions, a national server for assigning persistent identifiers, and a big data repository. We have also included several other providers of research results. The repositories and the national Open Science portal make the research results of Slovenian research organizations accessible to researchers, students, businesses, and other users at home and around the world. Researchers have an infrastructure that enables them to comply with the provisions on mandatory open access to the results of publicly funded research.
For a more detailed description of the establishment of the national open access infrastructure in 2013, see Milan Ojsteršek, Janez Brezovnik, Mojca Kotar, Marko Ferme, Goran Hrovat, Albin Bregant, Mladen Borovič (2014), "Establishing of a Slovenian open access infrastructure: a technical point of view", Program: electronic library and information systems, Volume 48, Issue 4, pp. 394-412.
The Slovenian open-access infrastructure

Figure 1: Structure diagram of Slovenian open-access infrastructure
The Slovenian open-access infrastructure
consists of Slovenian universities’ repositories (a repository of the
University of Ljubljana, a Digital library of the University of Maribor, a Repository
of the University of Primorska, and a Repository of the University of Nova
Gorica), a Repository for research organizations (DIRROS), the Repository for
standalone faculties which are not part of universities (REVIS) and a national
portal that aggregates content from the repositories and other Slovenian
archives (Digital library of Slovenia (dLib.si), VideoLectures.NET, Digital
library of Ministry of Defence (DKMORS), Social Science data archive (SSDA),
CLARIN.si and repositories of Slovenian open access publishers of journals and
monographs). The national portal provides a common search engine, a recommender
system of similar publications, and text-matching software. It is also used for
deduplication of publications, normalization of authors of publications, and
distribution of content and metadata to other repositories if authors are from
different institutions. The repositories are connected to the national
bibliographic system COBISS.SI and the national current research information
system SICRIS. Aggregated data from repositories enable funders to check the
actual openness of scientific publications, research data, and other research
results. Metadata from the repositories are harvested by OpenAIRE, Google Scholar,
Google Dataset Search, CORE, the European portal for doctoral theses
DART-Europe, and other directories, aggregators, and search engines. The
Slovenian open-access infrastructure has a national service for assigning
persistent identifiers to digital objects and a big data archive for the
storage of big data sets from repositories. The digital objects from the
national infrastructure are used in the Slovenian national supercomputing
network SLING, the national COVID-19 data portal, and the European COVID-19
data portal.
The Academic and Research Network of Slovenia (ARNES) enables us to use a backbone network based on broadband fiber-optic links. ARNES also hosts the DIRROS and REVIS repositories on its server infrastructure, and we use its disk storage for backups. The Institute of Information Science (IZUM) in Maribor provides us with the use of the VEGA supercomputer and a big data archive from which users can process data on the Slovenian National Supercomputing Network (SLING) and other supercomputers in Europe.
The infrastructure enables Slovenia to implement the policy of
open access to the results of nationally funded research, as expected of the EU
Member States participating in the ERA.
Web links:
- National portal: http://www.openscience.si/
- Description of Slovenian open-access infrastructure:
- University of Ljubljana Repository: https://repozitorij.uni-lj.si/info/index.php/eng/
- University of Maribor Digital Library: https://dk.um.si/info/index.php/eng/
- Repository of the University of Primorska: http://repozitorij.upr.si/info/index.php/eng/
- Repository of the University of Nova Gorica: http://repozitorij.ung.si/info/index.php/eng/
- Repository of independent higher education organizations: http://revis.openscience.si/info/index.php/eng/
- Repository of Slovenian research organizations that are not part of Slovenian universities: http://dirros.openscience.si/info/index.php/eng/
- Social science data archive: https://www.adp.fdv.uni-lj.si
- CLARIN.si: https://www.clarin.si/repository/xmlui/
- Digital library of Ministry of Defence: https://dk.mors.si/info/index.php/sl/
- Digital library of Slovenia: https://www.dlib.si/
- Videolectures.net: https://videolectures.net/
- Journals from Slovenian Academy of Science: https://ojs.sazu.si/
- Journals from the University of Maribor: https://journals.um.si/index.php/
- Slovenian COVID-19 national portal: http://covid19dataportal.si/
Key facts:
- A national approach to building open science infrastructure and FAIR digital objects.
- National PID service.
- National big data archive.
- Templates for amendments to the policies about a mandatory copy of research publications, research data, final theses, and other research results (software, workflows, lab notebooks, online courses…) are developed for all partner institutions.
- Adapted processes for the submission of publications, research data sets, and other research results by students and researchers have been developed in all partner institutions.
- OpenAIRE compatibility is established to facilitate the registration, discovery, access, and re-use of research publications and research data, in particular in the context of funded projects across European countries.
- Integration with the national bibliographic system COBISS, the national current research information system SICRIS, ARNES AAI, Crossref, DataCite, university information systems, and university authentication systems.
- Plagiarism detection software has been developed and included in the processes for the submission of publications by students and researchers.
- A recommender system for similar works within a repository, and between the repositories and other archives, is implemented.
Repository and its structure

Figure 2: Structure diagram of repository infrastructure
The repository software (figure 2) is based
on the software solution used by the Digital Library of the University of
Maribor and developed by the Laboratory for Heterogeneous Computing Systems of
the University of Maribor. It has been significantly updated and upgraded with
new functionalities due to the establishment of different processes of
publication submission by students and university staff.
For the needs of the processes of
submission, preservation and cataloguing of digital objects, each institutional
repository of the universities of Maribor, Ljubljana, Primorska and Nova Gorica
is connected to the university authentication system, the university higher
education information system and the COBISS.SI system. A number of institutions
in the REVIS repository also have their academic information systems linked
to the repository software.
Each digital object is given a national
persistent identifier (PID): once the object has been cataloged, the repository
software calls a national service that returns the persistent identifier. We
use EUDAT's B2Handle service to assign persistent identifiers.
The repositories store the big data sets in
a big data archive. We use EUDAT's B2Safe service to archive them. We use
EUDAT's B2Stage service for transferring data between the supercomputers and
the big data archives.
To speed up the metadata deposit of digital
objects that have already been assigned a DOI, we use the API services
offered by Crossref and DataCite. The repository software calls the service,
sending the DOI as input, and retrieves
the metadata stored by Crossref or DataCite about that digital object.
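The DOI lookup described above can be sketched roughly as follows. This is a minimal illustration, not the repository software's actual code: it assumes the public Crossref REST endpoint (`https://api.crossref.org/works/{doi}`) and the DataCite REST endpoint (`https://api.datacite.org/dois/{doi}`), and the function names and the reduced output record are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

CROSSREF_API = "https://api.crossref.org/works/"
DATACITE_API = "https://api.datacite.org/dois/"


def lookup_url(doi: str, registry: str) -> str:
    """Build the REST URL for a DOI at the chosen registry."""
    base = CROSSREF_API if registry == "crossref" else DATACITE_API
    return base + quote(doi, safe="/")


def parse_crossref(payload: dict) -> dict:
    """Reduce a Crossref 'works' response to title and author names."""
    msg = payload["message"]
    authors = [
        f"{a.get('given', '')} {a.get('family', '')}".strip()
        for a in msg.get("author", [])
    ]
    titles = msg.get("title") or [None]
    return {"doi": msg.get("DOI"), "title": titles[0], "authors": authors}


def parse_datacite(payload: dict) -> dict:
    """Reduce a DataCite 'dois' response to title and creator names."""
    attrs = payload["data"]["attributes"]
    titles = attrs.get("titles") or []
    return {
        "doi": attrs.get("doi"),
        "title": titles[0].get("title") if titles else None,
        "authors": [c.get("name") for c in attrs.get("creators", [])],
    }


def fetch_metadata(doi: str, registry: str = "crossref") -> dict:
    """Call the registry over HTTP and return the reduced record."""
    with urlopen(lookup_url(doi, registry), timeout=10) as resp:
        payload = json.load(resp)
    parser = parse_crossref if registry == "crossref" else parse_datacite
    return parser(payload)
```

A depositor would then only need to confirm or correct the pre-filled fields, for example after `fetch_metadata("10.1234/example.5678")` (a placeholder DOI).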
For the purpose of aggregation of metadata
by OpenAIRE, we have established an OAI-PMH service that returns metadata
according to the guidelines provided by OpenAIRE. Other aggregators (CORE,
DART-Europe, BASE...) also harvest metadata through this service. For Google
Scholar and Google Dataset Search, we have embedded metadata for individual
digital objects in the repository's web pages according to the Highwire Press
format and the Schema.org specification.
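As an illustration of the two exposure paths described above, the sketch below builds an OAI-PMH `ListRecords` request URL and renders Highwire Press style `<meta>` tags for one record. The base URL, the record dictionary and the function names are assumptions made for the example; only the OAI-PMH parameters (`verb`, `metadataPrefix`, `set`) and the `citation_*` tag names come from the respective conventions.

```python
from html import escape
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request for a harvester."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def highwire_meta_tags(record: dict) -> str:
    """Render the <meta> tags that Google Scholar reads from a landing page."""
    tags = [("citation_title", record["title"])]
    tags += [("citation_author", author) for author in record.get("authors", [])]
    if "date" in record:
        tags.append(("citation_publication_date", record["date"]))
    if "pdf_url" in record:
        tags.append(("citation_pdf_url", record["pdf_url"]))
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in tags
    )
```

Escaping the content values matters in practice, because titles and author names regularly contain ampersands and quotation marks.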
At the University of Ljubljana, digital
objects are stored in the University's document system after being deposited in
their repository. The national portal, ARNES archive, and IZUM big data archive
are used to archive digital objects and their metadata from the repositories.
The repositories send metadata and
electronic versions of digital objects to the national portal as soon as they
have been cataloged in the national bibliographic system COBISS.SI. From the
national portal, the University of Ljubljana repository retrieves metadata and
electronic versions of publications from ePrints.FRI, PeFprints and ADP. The
repository also retrieves additional metadata about researchers and research
organizations from the national portal, which it extracts from SICRIS.
The repositories send to dCOBISS the data
needed for open access analytics. The metadata they send is linked to the
projects that funded the research and to the APC payments charged by the
publishers.
The national portal carries out the
recommendation of digital objects. When the metadata record of a digital object is
opened in the repository, a list of similar documents is sent from the
national portal to the institutional repository. The recommendation consists of
the titles of documents within the repository and the titles of documents in
other university repositories, dLib.si, the Social Science Data Archive,
CLARIN.si, journal and monograph publishers' repositories, VideoLectures.NET
and DKMORS.
The repositories provide both curator-oriented
and user-oriented functionalities. The curator part is used by the librarians,
the data stewards, and the system administrators and is designed differently
for each institution. The student office employees carry out the checking and
locking of students' final theses. Librarians review student and staff
publications, catalog them in COBISS, and transfer their metadata from
COBISS.SI to the repository. In the curator part, the librarian can import the
publication metadata from the local COBISS.SI database and add an electronic
version of the digital object to it. In this way, digital objects that are
already cataloged in COBISS.SI, for which electronic versions exist and for
which the University has the appropriate copyrights, can also be stored in the
repository.
The user part of the institutional
repository is divided into a part for the interested public and a part for
registered users (students and university staff). After registration, students
and university staff can submit their works to the repository and browse their
content (metadata and similar content found by the text-matching software). The
part accessible to the interested public is bilingual (Slovene and English user
interface) and is accessible online and on mobile platforms (Android and iOS).
The web version is friendly for users with disabilities and contains the main
features of web applications that comply with the WAI specification. The web
interface allows usage by people with reduced mobility and people with slightly
reduced vision (e.g. the elderly and visually impaired).
The software allows easy and advanced
search and browsing. A member institution can integrate the content from the
repository on its website by calling the JavaScript API to access simple or
advanced search and browsing of the institutional repository. The same API is
also used by mobile applications. University members and staff can also export
metadata about their publications in RSS, JSON, and RDF formats.
The repository displays various statistics
that can be used to identify for each institution or individual unit within an
institution the total number of its digital objects in the repository and how
many have been stored in the last period, as well as the number of metadata
accesses and downloads of the digital object. For the faculties of each
university, the statistics of interest are those that report the number of
views and downloads of the faculty's materials for the previous years on an
annual basis. The statistics on thesis mentors can be used to find out with
which tutors they cooperate and which theses have been produced by students
under their supervision. Another interesting statistic is based on the keywords
of the mentor's publications: it indirectly shows which research areas the
mentor is involved in and how the mentor's research area has changed over time.
Common services offered by the national portal
The repositories use common services offered by the national portal (figure 3). These services are:
- Persistent Identifier (PID) Assignment and Resolution Service: Each digital object is assigned a national persistent identifier (PID). After the object has been cataloged, the repository software calls the national service, which returns the PID. We use EUDAT's B2Handle service to assign PIDs.
- Big Data Archive: Big research data sets are stored by the repositories in a big data archive. We use EUDAT's B2Safe service to archive them. We use EUDAT's B2Stage service to transfer data between the supercomputers and the big data archives.
- Shared services:
- Recommendation system service. The service returns for each digital object the most similar digital objects in the same repository and digital objects from other repositories and external repositories and archives included in the national open access infrastructure.
- A service for the conversion of different types of documents into text.
- Optical character recognition (image-to-text) service.
- Similar Content Detection Service: For each digital object containing files from which text can be extracted, the service searches for the most similar texts.
- A service for determining geographical and temporal coverage. The service shall allow the determination of the geographical and temporal coverage defined by the user of the repository through a web application. The geographic and temporal coverage metadata shall be added to the metadata of the specified digital object. The service is still in the test phase.
Figure 3: Structure diagram of research big data archive and PID service infrastructure
Established processes
The process established at the universities allows authors to submit their
work to the institutional repository. The submission is then reviewed
and catalogued in COBISS.SI by a librarian. Experience
accumulated since 2008 has shown that a manual review of metadata is
necessary, as in many cases the submissions are incomplete or contain spelling
errors. In addition, only librarians can normalize author names using the
CONOR.SI authority file. After the publication metadata have been successfully
catalogued in COBISS.SI, they can be transferred to the institutional
repository via SRU/SRW services.
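For readers unfamiliar with SRU, the sketch below shows what a `searchRetrieve` request for one catalogued record might look like. The parameter names (`operation`, `version`, `query`, `maximumRecords`, `recordSchema`) are those of the SRU 1.1 protocol; the base URL, the CQL index `dc.identifier` and the record identifier are placeholders, since the actual COBISS.SI endpoint and indexes are not described here.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, max_records=1, record_schema=None):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": str(max_records),
    }
    if record_schema:
        params["recordSchema"] = record_schema
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and index name, for illustration only.
url = sru_search_url("https://catalogue.example.org/sru", "dc.identifier=12345")
```

The XML response would then be parsed and mapped to the repository's own metadata fields.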
The working group, which consisted of
librarians, legal experts and IT professionals from all four Slovenian
universities, proposed the submission processes for final study works.
Slovenia lacks a common university academic information system (UAIS),
so each university provides its own variation of this process. Students
at the universities of Maribor and of Nova Gorica submit their final study
works to the university repository, which is pre-filled with some metadata from the
university academic information system. A sequence diagram of the final study
work publication at the universities of Maribor and of Nova Gorica is
presented in figure 4. At the universities of Ljubljana and
Primorska, submission takes place in the university academic information system; the
sequence of operations is otherwise the same as described for the universities
of Maribor and of Nova Gorica.

Figure 4: A sequence diagram of final study
work submission and publication at the universities of Maribor and of Nova
Gorica

Figure 5: A sequence diagram of research
item submission and publication
All four institutions have established the same process for the submission of research publications,
which is presented as a sequence diagram in figure 5. Researchers can submit articles, monograph chapters,
monographs, conference papers, e-lectures, publications about patents, research data, and other types of
publications. The types of publications are adapted according to the COBISS.SI
typology (http://home.izum.si/COBISS/bibliografije/Tipologija_eng.pdf), which is used by the Slovenian Research Agency for researchers’
bibliographies evaluation. Once researchers are logged into the institutional
repository, they can submit new content as a whole, or they can use metadata
from the catalogued record in COBISS.SI (the optional request and reply at the
beginning of the process). In the latter case, the institutional repository
takes care of the metadata transfer from COBISS.SI using the SRU/SRW protocol,
and researchers only need to provide the electronic version of their
publications. A link to the SHERPA/RoMEO portal is also provided for
authors, so they can check what type of access they can offer
under the publisher's copyright transfer agreement. When
names and surnames are entered, suggestions from the CONOR.SI authority
file are provided. These suggestions include the year of
birth, if available in CONOR.SI, and the researcher identifier from SICRIS, which
makes it easier to identify the correct author. This can greatly simplify
the librarian's work of cataloguing the publication in COBISS.SI. Authors can
also specify the copyright holder and the type of access to the full-text
publication. They can choose between immediate publication, closed access, or
delayed publication with an embargo (these metadata are part of OpenAIRE
compliance).
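The three access choices map naturally onto the access-rights vocabulary used in the OpenAIRE guidelines for literature repositories (`info:eu-repo/semantics/...`). The sketch below is one possible mapping under that assumption; the `AccessChoice` class, the mode names and the rule that an expired embargo is reported as open access are illustrative, not taken from the repository software.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

OPEN_ACCESS = "info:eu-repo/semantics/openAccess"
EMBARGOED_ACCESS = "info:eu-repo/semantics/embargoedAccess"
CLOSED_ACCESS = "info:eu-repo/semantics/closedAccess"


@dataclass
class AccessChoice:
    mode: str                            # "immediate", "closed" or "embargo"
    embargo_end: Optional[date] = None   # required when mode == "embargo"


def access_rights(choice: AccessChoice, today: date) -> str:
    """Translate the author's choice into an access-rights term."""
    if choice.mode == "immediate":
        return OPEN_ACCESS
    if choice.mode == "closed":
        return CLOSED_ACCESS
    if choice.mode == "embargo":
        if choice.embargo_end is None:
            raise ValueError("an embargo needs an end date")
        # Assumption: once the embargo has ended, the item is open.
        return OPEN_ACCESS if today >= choice.embargo_end else EMBARGOED_ACCESS
    raise ValueError(f"unknown access mode: {choice.mode}")
```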
Establishing processes to support the
handling of research data in the national open access infrastructure:
Pre-publication activities:
1. Phase before publication of research data:
- Planning and finding data sources.
- Preparation of a research data management plan, applications to the ethics committee, proposals for informed consent, and proposals for declarations by data providers.
- Obtaining relevant statements and opinions.
- Data collection or creation.
- Data processing and analysis.
- Preparation of files in appropriate formats.
- Preparation of documentation.
2. Before applying to publish a research dataset in the national open access infrastructure, a researcher must have:
- a data management plan (if requested by the funder or by the organization in which the researcher is employed),
- metadata about the research dataset,
- documentation that is necessary for understanding and using the data,
- data files in appropriate formats,
- ethical approval if the research study involves humans, animals or environmental data,
- statements of data providers and signed informed consents of research participants,
- defined licenses for the use of research data,
- the software, containers, and workflows that were used to generate or process the data, if the researcher created them,
- research notes and other research results, if any.
Publication in the repository or data archive:
- The researcher, or the researcher's librarian, inserts the research data set and other research results into the repository or data archive.
- The librarian checks the adequacy of the metadata and whether the appropriate documentation is available.
- The librarian informs the body within the institution that is in charge of checking the appropriateness of publishing data and other research results that the data set and other research results have been uploaded. At this stage they are in closed access and are only available via a link that requires a password provided by the librarian.
- That body checks the adequacy of the content of the data set and other research results. If the content is appropriate, it informs the librarian that the data set and other research results can be published.
- After a positive response from the body, the librarian publishes the data set and other research results in the repository and catalogs them in COBISS.
- The central specialized information center for the scientific field, established by the Slovenian Research and Innovation Agency, checks the adequacy of the typology, metadata, and documentation of the research data set and other research results.
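The publication steps listed above form a simple linear workflow in which each transition belongs to one role. A minimal sketch of that idea follows; the stage and role names are hypothetical labels chosen for the example, not identifiers from the repository software.

```python
from enum import Enum


class Stage(Enum):
    UPLOADED = "uploaded"                  # closed access, password-protected link
    METADATA_CHECKED = "metadata_checked"  # librarian has checked metadata and documentation
    UNDER_REVIEW = "under_review"          # institutional body has been informed
    APPROVED = "approved"                  # body confirmed the content is appropriate
    PUBLISHED = "published"                # public and cataloged in COBISS
    VERIFIED = "verified"                  # checked by the specialized information center


# (current stage, acting role) -> next stage
TRANSITIONS = {
    (Stage.UPLOADED, "librarian"): Stage.METADATA_CHECKED,
    (Stage.METADATA_CHECKED, "librarian"): Stage.UNDER_REVIEW,
    (Stage.UNDER_REVIEW, "review_body"): Stage.APPROVED,
    (Stage.APPROVED, "librarian"): Stage.PUBLISHED,
    (Stage.PUBLISHED, "information_center"): Stage.VERIFIED,
}


def advance(stage: Stage, role: str) -> Stage:
    """Move a data set to the next stage if the role may act on it."""
    try:
        return TRANSITIONS[(stage, role)]
    except KeyError:
        raise ValueError(f"{role} cannot act on a data set in stage {stage.value}") from None


def is_public(stage: Stage) -> bool:
    """Only published (and later verified) data sets are openly visible."""
    return stage in (Stage.PUBLISHED, Stage.VERIFIED)
```

Keeping the transitions in one table makes it easy to see that a data set cannot become public without the institutional body's approval.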
Digital preservation:
Data can be
stored in different formats and in several versions. For the digital
preservation of research data, we must ensure the independence of the data from
the technology. We are working on establishing processes for digital
preservation according to the OAIS reference model (ISO 14721).
Recommendation system
Content-based
recommendation within the Slovenian national open access infrastructure is
carried out at the national portal. On viewing a document within the
institutional repository, the national portal sends a list of similar
documents. Two different kinds of recommendation lists are sent. The first
consists of similar documents within the institutional repository. The second
shows similar content from across all other repositories including dLib.si,
VideoLectures.NET and DKMORS. Both recommendation lists are cached for each
document and stored within a database for enabling real-time responses.
The
objective of the recommendation system is to enable the visitors of
institutional repositories to find similar documents after clicking on a
document within the institutional repository. Partial duplicates are omitted
from the recommendations. Partial duplicates are determined using similar
sentence and substring detection. If two documents have a sentence-based or
substring-based coverage value of more than 60 %, they are marked as partial
duplicates. The recommendation software includes content-based document
recommendation that uses the BM25 ranking function and utilizes additional
weights during ranking. First, the metadata for each publication are obtained
(authors, title, keywords, and abstract). Second, the metadata and the full
text of the publication are lemmatised. Third, the metadata and full text
are semantically tagged using Wikipedia articles. Fourth, term frequencies (TF)
and inverse document frequencies (IDF) are calculated for each publication from
the semantically tagged metadata and full text.
Similarity with other publications is then calculated using the BM25 ranking
function suggested by Robertson, Zaragoza and Taylor (2004). Those
pairs of documents with similarity values of 0 are discarded, as they are
dissimilar. The result is a list of similar document pairs, which is then
stored within the database. The recommendation threshold is set depending on
the BM25 values of the documents on the recommendation list. Thus, the
recommendation of similar publications is a result of selecting the top five highest-ranking
publications that exceed the threshold on the list, as ordered by the BM25
value. Additionally, other criteria such as the issue year, number of
downloads, number of views, and average rating, are used during the ranking
process. A recommendation list could also be empty if the system cannot find
any similar publications. The essential task is to maintain the database with
up-to-date similarities when new documents are added to the system.
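To make the ranking step more concrete, here is a minimal sketch of plain BM25 scoring followed by the "top five above a threshold" selection. It deliberately leaves out the field weights, semantic tagging and the additional criteria (issue year, downloads, views, rating) mentioned above, uses the common parameter defaults k1 = 1.2 and b = 0.75, a non-negative IDF variant, and a fixed threshold; all of these are assumptions for illustration rather than the portal's actual settings.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against `query_terms`."""
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query_terms):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def recommend(doc_id, corpus, threshold=0.0, top_n=5):
    """Return up to `top_n` (id, score) pairs scoring above `threshold`."""
    ids = [i for i in corpus if i != doc_id]
    scores = bm25_scores(corpus[doc_id], [corpus[i] for i in ids])
    ranked = sorted(zip(ids, scores), key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in ranked if s > threshold][:top_n]
```

With the default threshold of 0, documents that share no terms with the viewed document are dropped, which mirrors the rule that pairs with a similarity of 0 are discarded, and the list may come back empty.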

Figure 6: Example of recommendations of similar digital objects
References:
- Bobadilla, J., Ortega, F., Hernando, A. and Gutiérrez, A. (2013). Recommender systems survey. Knowledge-based systems, 46 (7), 109-132. http://dx.doi.org/10.1016/j.knosys.2013.03.012.
- Robertson, S., Zaragoza, H. and Taylor, M. (2004). Simple BM25 extension to multiple weighted fields. In Proceedings of the thirteenth ACM international conference on Information and knowledge management. New York: ACM, 42–49.
- Su, X. and Khoshgoftaar, T. M. (2009). A survey of collaborative filtering techniques. Advances in artificial intelligence, Article ID 421425. http://dx.doi.org/10.1155/2009/421425.
Text matching software

Figure 7: Example of interactive report of text matching system
In
terms of content, there is a distinction between checking for similarity of
content and checking for plagiarism. The relevant software determines the
degree of similarity of the content. Plagiarism is decided by a human based on
the degree of similarity and other criteria. The software solution for the
detection of similar content is tailored to the analysis of texts in the
Slovenian language (it takes into account inflection of the language,
synonyms), which is its main advantage compared to competing products that are
mainly focused on checking texts in the English language.
Text matching (detection of similar document content)
is performed within the national portal. The
results of similar-content detection
are available to students and staff, who can only see those documents in which
they had the role of author, co-author or mentor. The detection is performed
for each publication submitted to the repositories. Both the coarse-checking
(step 1) and fine-checking (step 2) analyses are supported by the custom-built
software of our laboratory. The software does not check
the similarities between images.
The software
for coarse-checking texts returns similar sentences that are longer than forty
characters (an example of a result is shown in the figure above). It
relies on our tool for natural
language processing, called TextProc. The language
infrastructure used during the detection process consists of a
morphological dictionary for the Slovenian language, which contains
approximately 8,000,000 word forms and 320,000 lemmas. Wikipedia labels from
article titles from the Slovenian, English, and German Wikipedia are also used.
They were extracted from DBpedia and Wikidata. A domain-specific semantic
dictionary was created using keywords from publication metadata within the
Open Science Slovenia portal. Sentences marked as similar by the coarse-checking
software are essentially the same in both texts. They differ only if the
authors used synonyms, used a different grammatical person or used filler words
(e.g. »therefore«, »however«). The software detects similarity even if the word
order is changed or if any of the words are misspelled. The coarse-checking
algorithm, which has since been updated, first converts the text into
UTF-8 format and eliminates extra whitespace and new line characters (CR, LF).
Then it splits the content into sentences. Words from these sentences are then
lemmatised. Common words (e.g. »and«, »or«, »that«, etc.) are filtered out and
all the remaining words are sorted alphabetically. This step also carries out
spelling corrections using a morphological dictionary and a POS tagger. The
“Symmetric Delete Spelling Correction Algorithm” is used to correct spelling
errors. After lemmatization, the algorithm normalizes
those synonyms that are stored in the dictionary and can be transformed into a single
form without changing the meaning. A good example of this is the
normalisation of the verbs »to present«, »to describe« and »to show«, which
are synonyms in most cases. The algorithm then calculates hashes for these
sentences. Finally, it compares them with the hashes of sentences from other documents
within the corpus of documents, calculates the coverage of hashes for document
pairs and provides a list of similar documents for every document within the
corpus.
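The core of the coarse-checking idea (normalise each sentence into an order-independent key, hash it, and measure how many hashes two documents share) can be sketched as follows. This is a simplified stand-in: it has no lemmatisation, no spelling correction and no morphological dictionary, the stop-word and synonym tables are tiny placeholders, and defining coverage as the share of one document's sentence hashes found in the other is an assumption.

```python
import hashlib
import re

# Placeholder tables; the real system uses large dictionaries.
STOPWORDS = {"and", "or", "that", "the", "a", "of"}
SYNONYMS = {"describe": "present", "show": "present"}


def sentence_hashes(text: str, min_len: int = 40) -> set:
    """Hash every sentence longer than `min_len` characters."""
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace, CR and LF
    hashes = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if len(sentence) <= min_len:
            continue
        words = re.findall(r"\w+", sentence.lower())
        words = [SYNONYMS.get(w, w) for w in words if w not in STOPWORDS]
        # Sorting makes the key independent of word order.
        key = " ".join(sorted(words))
        hashes.add(hashlib.sha1(key.encode("utf-8")).hexdigest())
    return hashes


def coverage(doc: str, other: str) -> float:
    """Share of `doc`'s sentence hashes that also occur in `other`."""
    own, theirs = sentence_hashes(doc), sentence_hashes(other)
    return len(own & theirs) / len(own) if own else 0.0
```

Because the words are sorted before hashing, a sentence with reordered words or a swapped synonym still produces the same hash, which is the property the text describes.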
Those
documents that are found to be more than 1 % similar to the reviewed document
during the coarse-checking step become candidates for the
fine-checking step. If the number of these candidates is less than 50, the rest
of the similar documents are retrieved using the BM25 ranking function. The fine-checking algorithm (figure below)
finds the longest common sub-sequences between two texts. Kärkkäinen's
algorithm is used for finding common sub-sequences longer than 14 characters. Our own
sequence matching algorithm (by Ferme and Ojsteršek) is
used in cases where the pairs of
documents have more than 60 % coarse-checking coverage of hashes. This
algorithm has a time complexity of nearly O(N) if the pairs of documents are
very similar. No spelling corrections are carried out in this step; the texts are used
‘as-is’. If two documents are fine-checked (figure 7), the software marks those
phrases or parts of sentences that are the same within both documents. A
reviewer's task is to determine whether a sentence or a paragraph has been
copied. Some sentences can be semantically identical as a whole or in part if
the author has paraphrased the copied content. The similarity constants (1 % of
coarse-checking coverage, 50 candidate documents for fine-checking, sentence
length of 40 characters, subsequence length of more than 14 characters, etc.)
were selected on the basis of experience from examining plagiarism cases.
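The candidate selection and the search for shared passages can be illustrated with the short sketch below. It is not Kärkkäinen's algorithm nor the authors' own sequence matcher: Python's standard `difflib.SequenceMatcher` is swapped in as a stand-in, and it returns an ordered, non-overlapping alignment rather than every common substring. The function names and input shapes are assumptions; the constants (1 % coverage, 50 candidates, more than 14 characters) are the ones quoted in the text.

```python
from difflib import SequenceMatcher


def select_candidates(coverages, bm25_ranked, min_coverage=0.01, min_candidates=50):
    """Pick documents for fine-checking.

    `coverages` maps document id -> coarse-checking coverage (0..1);
    `bm25_ranked` lists document ids ordered by BM25 similarity.
    """
    candidates = [doc for doc, cov in coverages.items() if cov > min_coverage]
    # Top up with BM25 neighbours until there are enough candidates.
    for doc in bm25_ranked:
        if len(candidates) >= min_candidates:
            break
        if doc not in candidates:
            candidates.append(doc)
    return candidates


def common_passages(a: str, b: str, min_len: int = 15):
    """Return passages of at least `min_len` characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        a[m.a:m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size >= min_len
    ]
```

The passages returned by `common_passages` correspond to the phrases that an interactive report would highlight for the reviewer.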
Segmentation of pdf documents
Document
segmentation is performed in order to better detect similar content as well as
to enable knowledge extraction and recommendation. It is based on document
structure parsing using regular expressions. Currently, segmentation works
properly on final study works, which are also the most common type of
document in the institutional repositories. The goal of segmentation is to extract the
title, abstract, and keywords in the primary and secondary language, the list of tables,
the list of figures, a list of URLs, DOIs and URNs, a list of equations, a glossary
of terminology, a glossary of abbreviations, the table of contents, chapters
and bibliographies. The chapters are also split into sentences. Using
segmentation, we can enrich the publication metadata. This is useful because some
libraries omit abstracts, titles, and keywords in an alternative language.
In these cases, the software suggests these metadata to the librarian.
The
quality of segmentation is highly dependent on how well the students abide by
the instructions and guidelines for properly designing the final study work, as
provided by the faculty and academy. There were quite a lot of inconsistencies
in this aspect, which resulted in poor segmentation. This could especially be
seen in the fact that the table of contents did not match the segmented
chapters. The same problem could be seen when segmenting the tables of figures.
Citations and abbreviations were also a problem in some cases. The regular
expressions that are used were written quite generally. This approach is very
good if one is looking for patterns that do not deviate too much from the
average. Despite all the difficulties, the majority of documents were
successfully segmented. Most of the problems still occurred in parsing
references, because students do not abide by the guidelines for citations, as
provided by the faculty or academy.
Mobile applications
The mobile apps for searching the National Open Access
Infrastructure run on Android and iOS.

