Sequence-to-Label Script Identification for Multilingual OCR

Jonathan Michael Baccash
Patrick Michael Hurst
Proceedings of the 14th International Conference on Document Analysis and Recognition (ICDAR), IEEE (2017)

Abstract

We describe a novel line-level script identification
method. In multilingual OCR, script identification is a crucial
component as it automates the provision of a language hint.
Previous work repurposed an OCR model that generates per-character
script codes, aggregated by a counting heuristic to
obtain line-level script ID. This baseline has two shortcomings.
First, as a sequence-to-sequence model it is more complex than
necessary for the sequence-to-label problem of line script ID,
making it hard to train and inefficient to run. Second, the counting
heuristic may be suboptimal compared to a learned model.
Therefore we reframe line script identification as a
sequence-to-label problem and solve it using two components, trained
end-to-end: an encoder and a summarizer. The encoder converts a line
image into a sequence of features. The summarizer aggregates
this sequence to classify the line. We test various summarizers
while keeping identical inception-style convolutional networks as
encoders. Experiments on scanned books and photos containing
232 languages in 30 scripts show a 16% reduction in script ID error
rate compared to the baseline. This improved script ID reduces
the character error rate attributable to script misidentification
by 33%.
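
To make the contrast between the two formulations concrete, the following is a minimal sketch, not the authors' implementation. The function names (`count_vote`, `mean_pool_summarizer`), the script codes, and the hand-written scores are all hypothetical. The encoder is stubbed out: the per-step score vectors stand in for its output, and mean pooling is only one assumed choice among the summarizers the abstract alludes to. No training is shown.

```python
from collections import Counter
import math


def count_vote(char_scripts):
    """Baseline-style heuristic: the most frequent per-character script code wins."""
    return Counter(char_scripts).most_common(1)[0][0]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool_summarizer(step_scores, scripts):
    """Hypothetical summarizer: average per-step script scores, then softmax.

    step_scores is a list of per-step score vectors, one value per script,
    standing in for the encoder's feature sequence (an assumption).
    """
    n_steps = len(step_scores)
    pooled = [
        sum(step[k] for step in step_scores) / n_steps
        for k in range(len(scripts))
    ]
    probs = softmax(pooled)
    best = max(range(len(scripts)), key=probs.__getitem__)
    return scripts[best], probs


# Sequence-to-sequence baseline: per-character codes, then a counting heuristic.
baseline_label = count_vote(["Latn", "Latn", "Cyrl"])

# Sequence-to-label sketch: one label is produced directly from the whole sequence.
scripts = ["Latn", "Cyrl"]
step_scores = [[2.0, 0.0], [1.0, 0.5], [0.0, 3.0]]
label, probs = mean_pool_summarizer(step_scores, scripts)
```

In this toy input the counting heuristic picks the script seen at the most positions, while the pooled scores let a single high-confidence step outweigh two weaker ones, which is the kind of behavior a learned summarizer can capture and a count cannot.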