Ground Truth, Neural Networks, OCR: Towards Full Text of Republican China Newspapers
Presenter Lightning Session(s)
University of Heidelberg, Germany
This presentation introduces an approach towards full-text digitization of Republican China newspapers within the Early Chinese Periodicals Online project (ECPO). Because OCR engines cannot yet process full pages automatically (dense document layout, special characters, imperfect visuals), we developed a workflow whose individual processing steps can be adjusted to different publications. We created ground truth for annotations (labeled and grouped bounding boxes) and for full text (blind double keying). We then trained a neural network to detect different visual features of the pages. This enables us, for example, to process “masthead” areas separately from “advertisement” areas. We OCR areas such as “article” and build a dictionary to train the engine and improve text recognition. Our outcomes are quality ground truth data sets (texts and segmentation), trained models, and a full-text data set for Republican China newspapers, searchable within ECPO.
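The routing step described above, where detected areas are handled differently depending on their label, can be illustrated with a minimal Python sketch. It is not the ECPO implementation: the `Region` class, the `route_regions` function, and the choice of `OCR_LABELS` are hypothetical names and assumptions made for illustration, and the sample boxes are made-up values. Only the labels “masthead”, “advertisement”, and “article” come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    """A labeled bounding box on a page (hypothetical structure)."""
    label: str
    box: Tuple[int, int, int, int]  # (x, y, width, height) in pixels


# Assumption: only "article" areas are sent to OCR in this sketch.
OCR_LABELS = {"article"}


def route_regions(regions: List[Region]) -> Tuple[List[Region], List[Region]]:
    """Split detected regions into those to OCR and those handled separately."""
    to_ocr = [r for r in regions if r.label in OCR_LABELS]
    other = [r for r in regions if r.label not in OCR_LABELS]
    return to_ocr, other


# Illustrative page with made-up coordinates.
page = [
    Region("masthead", (0, 0, 800, 120)),
    Region("article", (0, 130, 400, 600)),
    Region("advertisement", (410, 130, 390, 600)),
]
to_ocr, other = route_regions(page)
```

Keeping the set of OCR-eligible labels in one place is one way such a step could be adjusted per publication, in line with the adjustable workflow the abstract describes.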