ScannedTables
Tool Support for the Automatic Extraction of Table Data from Historical Journals
Qualitative and quantitative data analyses in all disciplines rely on a structured data foundation. Textual data, such as that found in newspapers, is often accompanied by additional tabular data to convey information in a structured manner. At first glance, these tables may appear to be well-organized, but in most cases, they are best classified as semi-structured or unstructured, as it is often not possible to access individual elements of the dataset in a targeted way.
The project was carried out in 2024 at Wismar University of Applied Sciences under the direction of RosDH associate member Frank Krüger.
In the project "Tool Support for the Automatic Extraction of Table Data from Historical Journals" (short title: ScannedTables), an automated pipeline was developed to extract tabular data from historical journals, using the Swinemünder Badeanzeiger as a case study. The aim was to minimize the need for tool-specific training and to show how existing open-source tools can be combined. The pipeline comprises four steps: table segmentation using a machine learning approach, OCR with Tesseract, OCR correction, and structuring of the extracted data using a large language model (LLM).
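The four steps named above can be sketched as a chain of interchangeable stages. This is a minimal illustration under stated assumptions, not the project's actual code: the names `PipelineStages`, `extract_records`, and `tesseract_ocr` are hypothetical, the segmentation, correction, and LLM stages are left as plug-in functions, and the language code `deu` is an assumed default. Only the Tesseract command-line call (`tesseract <image> stdout -l <lang>`) refers to a real tool.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PipelineStages:
    """Hypothetical container for the four pipeline steps."""
    segment_tables: Callable[[str], List[str]]  # page image path -> table image paths
    run_ocr: Callable[[str], str]               # table image path -> raw OCR text
    correct_ocr: Callable[[str], str]           # raw OCR text -> corrected text
    structure: Callable[[str], List[Dict]]      # corrected text -> structured records


def tesseract_ocr(image_path: str, lang: str = "deu") -> str:
    """Run the Tesseract CLI on one image and return the recognized text.

    Assumes the `tesseract` binary and the given language data are installed.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_records(page_image: str, stages: PipelineStages) -> List[Dict]:
    """Apply segmentation, OCR, correction, and structuring to one page."""
    records: List[Dict] = []
    for table_image in stages.segment_tables(page_image):
        raw_text = stages.run_ocr(table_image)
        corrected_text = stages.correct_ocr(raw_text)
        records.extend(stages.structure(corrected_text))
    return records
```

Keeping each step behind a plain function makes it possible to swap one tool for another, or to store the intermediate results of every step, without touching the rest of the chain.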
Using this pipeline, approximately 350,000 structured data records were extracted from editions of the Swinemünder Badeanzeiger published between 1910 and 1932. To disambiguate and link the manually annotated data, geocoordinates of streets and buildings in Swinemünde were determined by hand. In line with the FAIR and CARE principles, the extracted data were made openly available together with the intermediate pipeline results (extracted tables, OCR outputs, and OCR corrections).
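One way to picture the linking step is a lookup from a normalized street name to a manually determined coordinate. The sketch below is only an assumption about how such a lookup could work: the functions `normalize_street` and `link_coordinates`, the normalization rules, and the street name and coordinate in the usage example are all hypothetical placeholders, not project data.

```python
import re
import unicodedata
from typing import Dict, Optional, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude)


def normalize_street(name: str) -> str:
    """Reduce spelling variants of a street name to one comparable key."""
    text = unicodedata.normalize("NFKC", name).lower().strip()
    text = re.sub(r"\s*\d+\s*[a-z]?$", "", text)  # drop a trailing house number
    text = text.replace("ß", "ss")
    text = re.sub(r"str\.", "strasse", text)      # expand the abbreviation "Str."
    return re.sub(r"[^a-zäöü0-9]+", "", text)     # drop spaces and punctuation


def link_coordinates(
    street: str, gazetteer: Dict[str, Coordinate]
) -> Optional[Coordinate]:
    """Return the coordinate for a street, or None if it is not in the gazetteer."""
    index = {normalize_street(key): value for key, value in gazetteer.items()}
    return index.get(normalize_street(street))
```

A usage example with placeholder values:

```python
gazetteer = {"Beispielstraße": (1.0, 2.0)}  # placeholder, not a real location
link_coordinates("Beispielstr. 5", gazetteer)
link_coordinates("Unbekannte Gasse", gazetteer)
```

The first call resolves to the placeholder coordinate, the second returns `None` and would need manual review.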
Project management:
Prof. Dr.-Ing. Frank Krüger
Professor of Data Science and Machine Learning
Wismar University of Applied Sciences
frank.krueger@hs-wismar.de
Project staff:
Dr. Steffen Steiner
Work Area General Electrical Engineering
Wismar University of Applied Sciences
Project period:
April 2024 to December 2024 (9 months)
Project funding:
NFDI Consortium Text+ (Collaborative Project / 2023 Funding Call)
German Research Foundation (DFG)
Publications and project outputs:
- Krüger, Frank, Antje Theise, Max Schröder, Anja Eggert, and Manuela Reichelt. Code Expedition – Kulturhackathon Rostock 2022: Potentiale offener Kulturdaten. Analyse ankommender Badegäste auf Basis des Swinemünder Bade-Anzeigers. Invited talk, Lecture Series "Digital Humanities im Fokus", University of Rostock, June 26, 2023. https://www.germanistik.uni-rostock.de/forschung/digital-humanities/rosdh/ringvorlesung/2023/n/code-expedition-kulturhackathon-rostock-2022-potentiale-offener-kulturdaten-analyse-ankommender-badegaeste-auf-basis-des-swinemuender-bade-anzeigers-167361/ [Accessed: January 27, 2025].
- Krüger, Frank, Max Schröder, Anja Eggert, and Manuela Reichelt. SwineBad: Data Visualisation of Swinemuender Badeanzeiger. Dataset, GitHub, 2022. https://github.com/ORDS-MV/SwineBad [Accessed: January 27, 2025].
- Steiner, Steffen, and Frank Krüger. SwineBad: Tabellenextraktion und Informationsstrukturierung aus dem Swinemünder Badeanzeiger. Poster, 3rd Text+ Plenary, Mannheim, October 10–11, 2024. https://events.gwdg.de/event/638/page/161-posters-text-plenary-2024#poster95 [Accessed: January 27, 2025].
- Steiner, Steffen, and Frank Krüger. SwineBad: Tool support for the automatic extraction of newspaper data from data from historical newspapers (Version 1.0). Software, GitHub, 2025. https://github.com/ORDS-MV/SwineBad-Toolsupport [Accessed: January 27, 2025].
- Steiner, Steffen, and Frank Krüger. OCR Groundtruth for Swinemünder Badeanzeiger (1.0.0). Dataset, Zenodo, 2025. https://doi.org/10.5281/zenodo.14603757.