EXCITE : Home page

EXCITE - Extraction of Citations from PDF Documents

The EXCITE project aims to extract citations from social science publications and to make more citation data available to researchers. To this end, a set of algorithms for information extraction and matching has been developed, focusing on social science publications in the German language. EXCITE provides several online services to extract and segment citations, and further online tools are available for creating more gold standard data. The project was run jointly by the Institute for Web Science and Technologies (University of Koblenz-Landau) in Koblenz and GESIS (Leibniz Institute for the Social Sciences) in Cologne, and was funded by the Deutsche Forschungsgemeinschaft (DFG). Support has since been handed over to Analytic Computing (University of Stuttgart). The second phase of the project, titled OUTCITE, will deal with the unmatched citations; it has been approved and will run at the University of Stuttgart and GESIS in Cologne.

About Excite

The shortage of citation data for the international and especially the German social sciences is well known to researchers in the field and has itself often been the subject of academic studies. Citation data is the basis of effective information retrieval, recommendation systems, and knowledge discovery processes. The accessibility of information in the social sciences lags behind other fields (e.g. the natural sciences) where more citation data is available. The EXCITE project aims to close this gap by developing a toolchain of software components for reference extraction, which is applied to existing scientific databases (especially full texts in the social sciences). The tools are made available to other researchers. The project develops a number of algorithms for extracting references and citations from PDF full texts, and it improves the matching of reference strings to bibliographic databases. The extraction of citations is implemented as a five-step process:

1) Extraction of text from the source documents,

2) Identification of reference sections in the text,

3) Segmentation of individual references into fields such as author, title, etc.,

4) Matching of reference strings against bibliographic databases,

5) Export of the matched references in usable formats and services.

Special attention is paid to the optimization of individual components of the citation extraction.

This is done with the help of machine learning methods which control the quality of the extracted data of the individual components. The extracted citation data is integrated into the services maintained by the proposers (sowiport) and published as linked open data under permissive licenses to enable reuse. The resulting software of this project is published under open source licenses and made accessible via a web service API.
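As a rough illustration of the five-step process described above, the following Python sketch runs toy versions of steps 2, 3 and 5 on text that is assumed to have been extracted already (step 1). It is not the EXCITE implementation: the heading list, the regular expression, the function names and the two sample references are all invented for this example, and a simple pattern stands in for the machine learning models mentioned in the text.

```python
import json
import re

# Hypothetical heading names that mark the start of a reference section.
REFERENCE_HEADINGS = {"references", "literatur", "literaturverzeichnis", "bibliography"}

# Toy pattern for a single reference style: "Author(s) (Year). Title. Rest"
REFERENCE_PATTERN = re.compile(
    r"^(?P<author>.+?)\s+\((?P<year>\d{4})\)[.:]?\s+(?P<title>[^.]+)\."
)


def find_reference_lines(text):
    """Step 2: return the non-empty lines that follow a reference heading."""
    lines = [line.strip() for line in text.splitlines()]
    for index, line in enumerate(lines):
        if line.lower() in REFERENCE_HEADINGS:
            return [entry for entry in lines[index + 1:] if entry]
    return []


def segment_reference(reference):
    """Step 3: split one reference string into author, year and title."""
    match = REFERENCE_PATTERN.match(reference)
    if match is None:
        # Keep the raw string so that unparsed references are not lost.
        return {"raw": reference}
    fields = match.groupdict()
    fields["raw"] = reference
    return fields


def extract_citations(text):
    """Run steps 2 and 3 on plain text produced by step 1."""
    return [segment_reference(ref) for ref in find_reference_lines(text)]


# Invented sample document; the two references are made up for illustration.
sample_text = """Einleitung
Dieser Text zitiert zwei Arbeiten.

Literatur
Müller, A. (2015). Soziale Netzwerke im Wandel. Springer.
Schmidt, B. and Weber, C. (2012). Methoden der Sozialforschung. Campus.
"""

citations = extract_citations(sample_text)
# Step 5: export in a reusable format (step 4, matching, is omitted here).
print(json.dumps(citations, ensure_ascii=False, indent=2))
```

A fixed pattern like this only covers one citation style, which is why a learned model is the more plausible choice for publications with highly variable reference formats.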

Outcite Team

Analytic Computing, University of Stuttgart / Institute for Web Science and Technologies, University of Koblenz-Landau

Prof. Dr. Steffen Staab

Team Leader

Steffen.Staab@ipvs.uni-stuttgart.de

Dr. Zeyd Boukhers

Researcher

boukhers@uni-koblenz.de

Martin Körner

Researcher

mkoerner@uni-koblenz.de

Anastasiia Iurshina

Researcher

Anastasiia.Iurshina@ipvs.uni-stuttgart.de

GESIS - Leibniz-Institut für Sozialwissenschaften

Dr. Philipp Mayr

Team Leader

philipp.mayr@gesis.org

Tobias Backes

Researcher

tobias.backes@gesis.org

Muhammad Ahsan Shahid

Software Developer

ahsan.shahid@gesis.org

Behnam Ghavimi

Researcher

behnam.ghavimi@gesis.org

Azam Hosseini

Programmer

azam.hosseini@gesis.org

External Supporter

Dr. Heinrich Hartmann

heinrich@heinrichhartmann.com

Software

EXCITE provides several services to extract and parse citations. All tools are licensed under the Creative Commons Attribution-NonCommercial license (CC BY-NC), and their code is available on GitHub.

EXParser: A Python tool that extracts and segments references from PDF files by adopting a feedback mechanism.

EXMatcher: An algorithm that finds, for a given reference string, the corresponding item in a bibliographic corpus (such as Sowiport.org or related-work.net).

EXPublisher: Code that converts EXCITE data to a JSON file following the OCC ontology.

EXRef-Identifier: An annotation tool that helps to annotate reference strings in text files and thus to create a gold standard. A live demo is available.

EXRef-Segmentation: An annotation tool that helps to manually parse reference strings. A live demo is available.

RefExt: A Java tool that extracts references from PDF files using Conditional Random Fields (CRF).
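To make the matching task handled by EXMatcher more concrete, here is a minimal Python sketch that looks up a reference string in a small in-memory corpus using string similarity from the standard library. This is an assumption-laden illustration and not the EXMatcher algorithm: the function names, the record identifiers, the threshold of 0.6 and the use of difflib are all choices made for this example. The two corpus records reuse the JCDL 2019 papers listed on this page.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and keep only letters, digits and single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def best_match(reference, corpus, threshold=0.6):
    """Return (record, score) for the most similar record, or (None, score)."""
    query = normalize(reference)
    best_record, best_score = None, 0.0
    for record in corpus:
        candidate = normalize(record["authors"] + " " + record["title"])
        score = SequenceMatcher(None, query, candidate).ratio()
        if score > best_score:
            best_record, best_score = record, score
    if best_score < threshold:
        # Below the (hypothetical) threshold the reference stays unmatched.
        return None, best_score
    return best_record, best_score


# Tiny corpus; the "id" values are hypothetical identifiers.
corpus = [
    {
        "id": "jcdl2019-a",
        "authors": "Boukhers Z., Ambhore S., Staab S.",
        "title": "An End-to-end Approach for Extracting and Segmenting "
                 "High-Variance References from PDF Documents",
    },
    {
        "id": "jcdl2019-b",
        "authors": "Hosseini A., Ghavimi B., Boukhers Z., Mayr P.",
        "title": "EXCITE - A toolchain to extract, match and publish "
                 "open literature references",
    },
]

reference = (
    "Hosseini, A., Ghavimi, B., Boukhers, Z., Mayr, P. (2019) EXCITE: a toolchain "
    "to extract, match and publish open literature references. JCDL 2019"
)
match, score = best_match(reference, corpus)
print(match["id"] if match else "unmatched", round(score, 2))
```

Comparing every reference with every record does not scale to a real bibliographic database, so a production matcher would more likely retrieve a short candidate list from a search index first and score only those candidates.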

News From Excite/Outcite

ULITE workshop 2022

The ULITE (Understanding Literature references in academic full text) workshop, co-located with JCDL 2022, took place on 24 June 2022.

OUTCITE: second phase of the project is approved!

The German Research Foundation (DFG) has accepted the second phase of our project that creates an open citation graph from [...] In the second phase, titled OUTCITE, which we will run at the University of Stuttgart, we will deal with the unmatched citations; these are [...]

JCDL 2019 Conference

The ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) is a major international forum focusing on digital libraries and associated technical, practical, and social issues. Two of our papers have been accepted at this conference:

An End-to-end Approach for Extracting and Segmenting High-Variance References from PDF Documents.

Boukhers Z., Ambhore S., Staab S.

EXCITE - A toolchain to extract, match and publish open literature references.

Hosseini A., Ghavimi B., Boukhers Z., Mayr P.

The conference will take place at the I-Hotel and Conference Center on the campus of the University of Illinois at Urbana-Champaign from 2 to 5 June 2019.

Open Citation Workshop 2018

EXCITE, the Open Citation Corpus, Europe PMC and the University of Bologna are organising a workshop on “Open Citations”, which will take place at the University of Bologna from September 3rd to 5th. The workshop addresses experts and scholars working on open bibliographic metadata, citations, and approaches to extracting them. It is also an opportunity to attend presentations by our invited speakers, who have extensive experience in scholarly publishing. At the hack day, new services and data will be presented.

EXCITE collaborates with the Open Citation Corpus

EXCITE is pleased to announce its collaboration with the Open Citation Corpus (OCC), which started in 2010 as a one-year project funded by the Joint Information Systems Committee (JISC). In addition to its tools and services, OCC publishes accurate bibliographic and citation data in an open repository made available under a Creative Commons public domain dedication. The collaboration with OCC serves our vision of transparent access to bibliographic metadata as well as citation data for facilitating research in social science in particular, and in all sciences and the humanities in general.

EXCITE Workshop 2017

When: 30.03.2017 - 31.03.2017

Where: GESIS-Leibniz-Institut für Sozialwissenschaften, Unter Sachsenhausen 6-8, 50667 Cologne, Germany

Our first community meeting is planned as a “noon to noon” event and has the goal to bring together experts in reference extraction, text mining, and machine learning to explore the possibilities in the project. We plan to have scientific presentations with invited speakers on the first day and hands-on sessions on the second day. For the second day we will release a test corpus (PDF files of scientific papers and manually annotated data) for developers.

Publications

An End-to-end Approach for Extracting and Segmenting High-Variance References from PDF Documents

Boukhers Z., Ambhore S., Staab S. (2019) An End-to-end Approach for Extracting and Segmenting High-Variance References from PDF Documents. In Proceedings of the 19th ACM/IEEE on Joint Conference on Digital Libraries, JCDL 2019. ACM.

EXCITE - A toolchain to extract, match and publish open literature references.

Hosseini A., Ghavimi B., Boukhers Z., Mayr P. (2019) EXCITE - A toolchain to extract, match and publish open literature references. In Proceedings of the 19th ACM/IEEE on Joint Conference on Digital Libraries, JCDL 2019. ACM.

Evaluating Reference String Extraction Using Line-Based Conditional Random Fields: A Case Study with German Language Publications

Körner M., Ghavimi B., Mayr P., Hartmann H., Staab S. (2017) Evaluating Reference String Extraction Using Line-Based Conditional Random Fields: A Case Study with German Language Publications. In: Kirikova M. et al. (eds) New Trends in Databases and Information Systems. ADBIS 2017. Communications in Computer and Information Science, vol 767. Springer, Cham.
