ScannedTables

Tool Support for the Automatic Extraction of Table Data from Historical Journals

Qualitative and quantitative data analyses in all disciplines rely on a structured data foundation. Textual data, such as that found in newspapers, is often accompanied by additional tabular data to convey information in a structured manner. At first glance, these tables may appear to be well-organized, but in most cases, they are best classified as semi-structured or unstructured, as it is often not possible to access individual elements of the dataset in a targeted way.

The project was carried out in 2024 at Wismar University of Applied Sciences under the direction of RosDH associate member Frank Krüger.

Project description

In the project "Tool Support for the Automatic Extraction of Table Data from Historical Journals" (short title: ScannedTables), an automated pipeline was developed to extract tabular data from historical journals, using the Swinemünder Badeanzeiger as a case study. The aim was to minimize the need for specific tool training and to showcase how existing open-source tools can be leveraged. The pipeline comprised four stages: table segmentation using a machine learning approach, OCR extraction with Tesseract, OCR correction, and structuring of the extracted data using a large language model (LLM).
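The later stages of such a pipeline can be illustrated with a minimal Python sketch. It is not the project's implementation: the function names, the lexicon-based correction, the whitespace-based row splitting (used here as a stand-in for the LLM step), and the column names are all assumptions made for illustration, and the sample row is invented. Table segmentation is left out because it depends on a trained model. The Tesseract call is only assembled, not executed, using the standard command-line options `-l` (language) and `--psm` (page segmentation mode).

```python
import difflib
import re


def tesseract_command(image_path, lang="deu", psm=6):
    """Assemble a Tesseract CLI call that prints recognized text to stdout.

    The language and page segmentation mode are assumed defaults,
    not settings taken from the project.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def correct_ocr(line, lexicon, cutoff=0.8):
    """Replace each token with its closest lexicon entry, if similar enough.

    A simple stand-in for the OCR correction stage (assumption).
    """
    fixed = []
    for token in line.split():
        match = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
        fixed.append(match[0] if match else token)
    return " ".join(fixed)


def structure_row(line, columns):
    """Split a corrected row into named fields.

    Stand-in for the LLM structuring stage: columns are assumed to be
    separated by two or more spaces. Returns None if the number of
    cells does not match the number of column names.
    """
    cells = re.split(r"\s{2,}", line.strip())
    if len(cells) != len(columns):
        return None
    return dict(zip(columns, cells))


def run_pipeline(ocr_lines, lexicon, columns):
    """Apply correction and structuring to every OCR line, skipping bad rows."""
    records = []
    for line in ocr_lines:
        record = structure_row(line, columns)
        if record is None:
            continue
        # Correct each cell separately so column boundaries are preserved.
        records.append({k: correct_ocr(v, lexicon) for k, v in record.items()})
    return records


if __name__ == "__main__":
    # Invented sample row and hypothetical column names.
    rows = ["Müller  Berlln  Beispielstraße 4"]
    print(run_pipeline(rows, ["Berlin"], ["name", "origin", "address"]))
```

Correcting each cell after splitting, rather than the whole line before it, keeps the column boundaries intact; a real pipeline might order these steps differently.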

Through this pipeline, approximately 350,000 structured datasets were extracted from editions of the Swinemünder Badeanzeiger published between 1910 and 1932. To disambiguate and link the manually annotated data, geocoordinates of streets and buildings in Swinemünde were manually determined. In line with the FAIR and CARE principles, these datasets were made openly available, along with the intermediate pipeline results (extracted tables, OCR outputs, and OCR corrections).
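Linking extracted addresses to manually determined coordinates can be sketched as a lookup against a small gazetteer. The following Python example is an assumption about how such a lookup might work, not the project's method: the street names, the coordinates (round placeholder values, not surveyed locations), and the normalization rules are all hypothetical.

```python
import re

# Hypothetical gazetteer: keys are normalized street names,
# values are placeholder (latitude, longitude) pairs.
GAZETTEER = {
    "beispielstrasse": (53.0, 14.0),
    "musterweg": (53.5, 14.5),
}


def normalize_street(raw):
    """Reduce an address string to a comparable street key.

    Lowercases, expands the abbreviation 'str.', maps 'ß' to 'ss',
    and drops house numbers, spaces, and punctuation.
    """
    text = raw.lower().strip()
    text = text.replace("ß", "ss")
    text = re.sub(r"str\.", "strasse", text)
    return re.sub(r"[^a-zäöü]", "", text)


def link_address(raw):
    """Return the gazetteer coordinates for an address, or None if unknown."""
    return GAZETTEER.get(normalize_street(raw))


if __name__ == "__main__":
    for address in ["Beispielstr. 4", "Beispielstraße 4", "Unbekannte Gasse 1"]:
        print(address, "->", link_address(address))
```

Normalizing before the lookup lets spelling variants of the same street resolve to one entry, while unknown addresses return `None` and can be flagged for manual review.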

Project information

Project management:
Prof. Dr.-Ing. Frank Krüger
Professor of Data Science and Machine Learning
Wismar University of Applied Sciences
frank.krueger@hs-wismar.de

Project staff:
Dr. Steffen Steiner
Work Area General Electrical Engineering
Wismar University of Applied Sciences

Project period:
April 2024 to December 2024 (9 months)

Project funding:
NFDI Consortium Text+ (Collaborative Project / 2023 Funding Call)
German Research Foundation (DFG)

Publications and talks

Contact

Digital Humanities
Institute for German Studies
Gertrudenstraße 11, Torhaus
18057 Rostock

E-Mail: phf.dh@uni-rostock.de
