The EXCITE project aims to extract citations from social science publications and to make more citation data available to researchers. To this end, a set of algorithms for information extraction and matching has been developed, focusing on social science publications in the German language. EXCITE provides several online services to extract and segment citations, along with online tools for creating more gold standard data. The project was run jointly by the Institute for Web Science and Technologies (WeST, University of Koblenz-Landau) in Koblenz and GESIS (Leibniz Institute for the Social Sciences) in Cologne, and was funded by the Deutsche Forschungsgemeinschaft (DFG). Support has since been handed over to Analytic Computing (University of Stuttgart). The second phase of the project, titled OUTCITE, which will deal with the unmatched citations, has been approved and will run at the University of Stuttgart and GESIS in Cologne.
Excite: September 2016 - July 2019
Outcite: April 2021 - August 2023
Analytic Computing - University of Stuttgart: Prof. Dr. Steffen Staab
GESIS - Leibniz-Institut für Sozialwissenschaften Abteilung Wissenstechnologien für Sozialwissenschaften (WTS): Dr. Philipp Mayr.
Former: WeST - The Institute for Web Science and Technologies: Prof. Dr. Steffen Staab
The shortage of citation data for the international and especially the German social sciences is well known to researchers in the field and has itself often been the subject of academic studies. Citation data is the basis of effective information retrieval, recommendation systems, and knowledge discovery processes. The accessibility of information in the social sciences lags behind other fields (e.g. the natural sciences) where more citation data is available. The EXCITE project aims to close this gap by developing a toolchain of software components for reference extraction, which is applied to existing scientific databases (esp. full texts in the social sciences). The tools are made available to other researchers. The project develops a number of algorithms for extracting references and citations from PDF full texts and improves the matching of reference strings to bibliographic databases. The extraction of citations is implemented as a five-step process:
This is done with the help of machine learning methods which control the quality of the extracted data of the individual components. The extracted citation data is integrated into the services maintained by the proposers (sowiport) and published as linked open data under permissive licenses to enable reuse. The resulting software of this project is published under open source licenses and made accessible via a web service API.
EXCITE provides several services to extract and parse citations. All tools are licensed under Creative Commons Attribution-NonCommercial (CC BY-NC) and their code is available on GitHub.
EXParser: A Python tool that extracts and segments references from PDF files by adopting a feedback mechanism.
EXMatcher: This algorithm is implemented for finding corresponding items in a bibliography corpus (such as Sowiport.org or related-work.net) for reference strings.
EXPublisher: This code is dedicated to the task of converting EXCITE data to a JSON file with OCC ontology.
EXRef-Identifier: An annotation tool that helps to annotate reference strings in text files and thus create a gold standard.
EXRef-Segmentation: It is an annotator tool that helps to manually parse reference strings.
RefExt: A Java tool that extracts references from PDF files using Conditional Random Fields (CRF).
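To give a feel for the matching task that EXMatcher addresses (finding the corpus record that corresponds to a reference string), here is a minimal Python sketch. It is not EXMatcher's actual algorithm: it assumes a plain character-similarity score from the standard library's `difflib`, and the names `normalize`, `best_match`, the record ids, and the `0.6` threshold are hypothetical choices made for illustration only.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and replace punctuation with single spaces."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def best_match(reference, corpus, threshold=0.6):
    """Return (record_id, score) for the most similar corpus record.

    record_id is None when no record reaches the (assumed) threshold.
    """
    ref_norm = normalize(reference)
    best_id, best_score = None, 0.0
    for record_id, record_text in corpus.items():
        score = SequenceMatcher(None, ref_norm, normalize(record_text)).ratio()
        if score > best_score:
            best_id, best_score = record_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


# Toy corpus built from two EXCITE publications; the ids are hypothetical.
corpus = {
    "rec-1": "Körner M., Ghavimi B., Mayr P., Hartmann H., Staab S. "
             "Evaluating Reference String Extraction Using Line-Based "
             "Conditional Random Fields. 2017",
    "rec-2": "Hosseini A., Ghavimi B., Boukhers Z., Mayr P. "
             "EXCITE - A toolchain to extract, match and publish "
             "open literature references. 2019",
}

reference = ("Hosseini, A. et al. (2019): EXCITE: a toolchain to extract, "
             "match and publish open literature references")
matched_id, matched_score = best_match(reference, corpus)
print(matched_id, round(matched_score, 2))
```

A reference string that resembles no record falls below the threshold and comes back with `None` as its id, which corresponds to the unmatched citations that the OUTCITE phase is concerned with.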
The ULITE (Understanding Literature references in academic full text) workshop, co-located with JCDL 2022, took place on 24 June 2022.
The German Research Foundation (DFG) has accepted the second phase of our project that creates an open citation graph from … In the second phase, titled OUTCITE, which we will run at the University of Stuttgart, we will deal with the unmatched citations, these ar…
The ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) is a major international forum focusing on digital libraries and associated technical, practical, and social issues. Two of our papers have been accepted at this conference:
An End-to-end Approach for Extracting and Segmenting High-Variance References from PDF Documents. Boukhers Z., Ambhore S., Staab S.
EXCITE - A toolchain to extract, match and publish open literature references. Hosseini A., Ghavimi B., Boukhers Z., Mayr P.
The conference will take place at the I-Hotel and Conference Center on the campus of the University of Illinois at Urbana-Champaign from 2 to 5 June 2019.
EXCITE, Open Citation Corpus, Europe PMC and the University of Bologna are organising a workshop on “Open Citations”, which will take place at the University of Bologna from September 3rd to 5th. The workshop addresses experts and scholars working on open bibliographic metadata, citations, and approaches to extracting them. It also offers the chance to attend presentations by our invited speakers, who are experienced in scholarly publishing. At the hack day, new services and data will be presented.
EXCITE is pleased to announce its collaboration with the Open Citation Corpus (OCC), which started in 2010 as a one-year project funded by the Joint Information Systems Committee (JISC). In addition to its tools and services, OCC publishes accurate bibliographic and citation data in an open repository made available under a Creative Commons public domain dedication. The collaboration with OCC serves our vision of transparent access to bibliographic metadata as well as citation data for facilitating research in social science in particular, and in all sciences and the humanities in general.
When: 30.03.2017 - 31.03.2017
Where: GESIS-Leibniz-Institut für Sozialwissenschaften, Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Our first community meeting is planned as a “noon to noon” event and aims to bring together experts in reference extraction, text mining, and machine learning to explore the possibilities in the project. We plan to have scientific presentations with invited speakers on the first day and hands-on sessions on the second day. For the second day we will release a test corpus (PDF files of scientific papers and manually annotated data) for developers.
Boukhers Z., Ambhore S., Staab S. (2019) An End-to-end Approach for Extracting and Segmenting High-Variance References from PDF Documents. In Proceedings of the 19th ACM/IEEE on Joint Conference on Digital Libraries, JCDL 2019. ACM.
Hosseini A., Ghavimi B., Boukhers Z., Mayr P. (2019) EXCITE - A toolchain to extract, match and publish open literature references. In Proceedings of the 19th ACM/IEEE on Joint Conference on Digital Libraries, JCDL 2019. ACM.
Körner M., Ghavimi B., Mayr P., Hartmann H., Staab S. (2017) Evaluating Reference String Extraction Using Line-Based Conditional Random Fields: A Case Study with German Language Publications. In: Kirikova M. et al. (eds) New Trends in Databases and Information Systems. ADBIS 2017. Communications in Computer and Information Science, vol 767. Springer, Cham.