Mathematical Formulas Extraction
Jianming Jin, Xionghu Han, Qingren Wang
Institute of Machine Intelligence, Nankai University, Tianjin, China, 300071

Abstract
As a universal technical language, mathematics has been widely applied in many fields, and it describes information more accurately than any other language. Numerous mathematical formulas therefore exist in all kinds of documents. Automatic processing of mathematical formulas is important and necessary, and extracting formulas from document images is its first step. In this paper, formula extraction methods that are not based on recognition results are presented: isolated formulas are extracted based on the Parzen window, and embedded expressions are extracted based on 2-D structure detection. Experiments show that our methods are effective in formula extraction.

1 Introduction

Mathematics has been widely applied in education, research, management, business and many other fields. As a universal language for all fields, all nations and all races of mankind, mathematics is more accurate than any other language in describing human activities for understanding and/or changing the world. Numerous mathematical formulas therefore exist in all kinds of documents, and automatic mathematical formula processing is important and necessary. However, formulas differ greatly from normal text. For example, normal text lines are one-dimensional and the characters are placed one after another, while most formulas are two-dimensional and characters may lie below, above or inside one another. Therefore, extracting formulas from document images is the first step of formula processing.

According to the typesetting style, there are two classes of formulas, namely isolated formulas (IFs) and embedded formulas (EFs). There are many differences in typesetting style between IFs and non-IFs, so it is possible to distinguish IFs from non-IFs without recognition. For example, the line height [1], the line spacing [1], whether the line is center aligned [3], the number of “long” words in the line [3], the line density [2][5] and the vertical projection of the line [2] can all be used to distinguish IFs from non-IFs. However, none of these studies gives a detailed description of how to use those features or of how well the resulting methods perform.

EFs are embedded in non-IFs, so they are much more difficult to extract than IFs. Almost all EF extraction methods [1][3] are based on recognition results. [4] classifies all characters into nine classes, which is in fact a rough recognition process. [2] observes that italic characters are much more likely to belong to EFs than to non-EFs, which also requires a recognition process. In fact, the existence of 2-D structure formulas dramatically decreases the character recognition rate. Therefore, extracting 2-D structure EFs before recognition is necessary.

In this paper, the system model is introduced in Section 2. A method to distinguish IFs from non-IFs, based on the Parzen window, is presented in Section 3. A method to extract 2-D structure EFs from non-IFs, based on MEANLINE/BASELINE position estimation, is presented in Section 4. Conclusions are drawn in Section 5.

2 System overview
Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003) 0-7695-1960-1/03 $17.00 © 2003 IEEE
As shown in Figure 1, the input of the system is a monochromatic scanned document image and the output is the location of all extracted formulas. At the preprocessing stage, the image is deskewed, noise is removed and connected components are extracted. At the line segmentation stage, all text lines are segmented based on a hybrid top-down and bottom-up page decomposition method. At the formula extraction stage, all text lines are first classified into IFs and non-IFs by a Parzen classifier; 2-D structure EFs are then extracted from non-IFs based on 2-D structure detection.

Figure 1. System model (Document Image → Preprocessing → Lines Segmentation → Formulas Extraction → Extraction Result)

3 Isolated formulas extraction

There are many differences in typesetting style between IFs and non-IFs, so it is possible to distinguish IFs from non-IFs without recognition. Features that represent those differences are extracted to distinguish IFs from non-IFs.

3.1 Features extraction

The following features are extracted:
(1) Line Height: HT = h / h0 (1)
(2) Space Above Line: AS = as / h0 (2)
(3) Space Below Line: BS = bs / h0 (3)
(4) Left Indent: LI = li / l (4)
(5) Right Indent: RI = ri / l (5)
(6) Distance between the formula and its sequence number: LD = ld / h0 (6)

In (1)-(6), h is the height of the line, l is the length of the line, h0 is the average height of all characters in the line, and as, bs, li, ri and ld are defined in Figure 2. Each line is therefore represented by the feature vector x = (HT, AS, BS, LI, RI, LD).

Figure 2. Definition of some variables (as, bs, li, ri, ld)

3.2 Parzen window

Knowing the class-conditional probability density function p(x|ωi) is a precondition for using the Bayes classification method. For the IF/non-IF classification problem, the form of p(x|ωi) is completely unknown, so a non-parametric estimation method is adopted to estimate it; here a Parzen window is used. Suppose there are Nk training samples xk1, xk2, …, xkNk in class ωk; then a Parzen classifier is determined by a kernel function and the window width hk. We adopt (7) as the kernel function, where p̂(x|ωk) is the estimate of p(x|ωk) and Σ̂k is the covariance matrix of ωk’s training samples.

If p(x|ωk) is estimated, the minimum-error-probability Bayes classification method can be used, namely,

$$ p(x|\omega_i)\,P(\omega_i) = \max_{j=1,\ldots,k} \{\, p(x|\omega_j)\,P(\omega_j) \,\} \;\Rightarrow\; x \in \omega_i . $$
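The feature vector of (1)-(6), the Parzen estimate (7) and the Bayes decision rule can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the use of the unbiased sample covariance, the equal priors, the window width h_k = 1.0 and the two-feature training samples are all assumptions made for the example.

```python
import math

def line_features(h, l, h0, a_s, b_s, li, ri, ld):
    """Feature vector x = (HT, AS, BS, LI, RI, LD) from equations (1)-(6)."""
    return (h / h0, a_s / h0, b_s / h0, li / l, ri / l, ld / h0)

def covariance(samples):
    """Sample covariance matrix (unbiased, divides by N - 1; an assumption)."""
    n, count = len(samples[0]), len(samples)
    mean = [sum(s[i] for s in samples) / count for i in range(n)]
    return [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (count - 1)
             for j in range(n)] for i in range(n)]

def inverse_and_det(m):
    """Gauss-Jordan elimination with partial pivoting: returns (inverse, determinant)."""
    n = len(m)
    a = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a], det

def parzen_density(x, samples, h_k):
    """Parzen estimate of p(x | class) with the Gaussian kernel of equation (7)."""
    n = len(x)
    inv, det = inverse_and_det(covariance(samples))
    norm = (2 * math.pi) ** (n / 2) * h_k ** n * math.sqrt(det)
    total = 0.0
    for s in samples:
        d = [xi - si for xi, si in zip(x, s)]
        q = sum(d[i] * inv[i][j] * d[j] for i in range(n) for j in range(n))
        total += math.exp(-q / (2 * h_k ** 2)) / norm
    return total / len(samples)

def classify(x, classes, h_k=1.0):
    """classes maps label -> (prior, samples); picks the label maximizing p(x|w) P(w)."""
    return max(classes,
               key=lambda c: classes[c][0] * parzen_density(x, classes[c][1], h_k))

# Synthetic two-feature (HT, LI) samples, invented for illustration only.
classes = {
    "IF": (0.5, [(2.0, 0.30), (2.4, 0.25), (1.8, 0.35), (2.2, 0.20)]),
    "non-IF": (0.5, [(1.0, 0.00), (1.1, 0.05), (0.9, 0.02), (1.2, 0.00)]),
}
print(classify((2.1, 0.28), classes))
print(classify((1.0, 0.01), classes))
```

With real data each sample would be the full six-dimensional vector returned by `line_features`, and each class would need enough training lines for its covariance matrix to be non-singular.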
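$$ \hat{p}(x|\omega_k) = \frac{1}{N_k} \sum_{j=1}^{N_k} \frac{1}{(2\pi)^{n/2}\, h_k^{n}\, |\hat{\Sigma}_k|^{1/2}} \exp\left\{ -\frac{1}{2h_k^{2}} (x - x_{kj})^{T}\, \hat{\Sigma}_k^{-1}\, (x - x_{kj}) \right\} \tag{7} $$

BASELINE (b0) and MEANLINE (m0) are estimated by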
(8), where HP = {h_y | y = 1, …, h} is the horizontal projection result:

$$ b_0 = b \ \text{such that}\ h_b = \max_{1 \le i < h/2} \{h_i\}, \qquad m_0 = m \ \text{such that}\ h_m = \max_{h/2 \le i \le h} \{h_i\} \tag{8} $$

[…] > Tsup (ω.bottom − ω.top) (10)
ω.top − b > Tsub (ω.bottom − ω.top) (11)
then ω is an abnormal character, where Tsup is the threshold used to judge whether ω is above MEANLINE, and Tsub is the threshold used to judge whether ω is below BASELINE.

If a word satisfies (12),
nab / n > Tab (12)
then the word is an EF, where n is the total number of characters in the word, nab is the number of abnormal characters in the word, and Tab is the threshold used to judge whether the word is an EF.

4.4 Experimental results

Figure 4 shows the EF extraction result. Most 2-D structure EFs are extracted correctly.

Figure 4. Embedded formulas extraction result

5 Conclusions

Automatic mathematical formula processing is a comprehensive and difficult problem, and formula extraction is its first step. We present a method, based on the Parzen window, to distinguish isolated formulas from non-isolated formulas; 91.65% of isolated formulas are extracted correctly. After IF extraction, embedded formulas are extracted from non-IFs. A method to extract 2-D structure embedded formulas without recognition information, based on BASELINE/MEANLINE position estimation, is presented. Experiments show that this method is effective in extracting 2-D structure EFs. Future research will focus on: (1) decreasing the confusion rate between IFs and non-IFs by using more features, and (2) extracting 1-D EFs by using recognition information.

Acknowledgement

This work is supported by NNSFC (National Natural Science Foundation of China) Grant number TY10026002-04-04-01.

References

[1] Hsi-Jian Lee and Jiumn-Shine Wang. “Design of a Mathematical Expression Recognition System”. Proceedings of 3rd International Conference on Document Analysis and Recognition, ICDAR'95, Montréal, Canada, pp. 1084-1087, August, 1995.
[2] Richard J. Fateman. “How to Find Mathematics on a Scanned Page”. Technical Report, 1996.
[3] J.-Y. Toumit, S. Garcia-Salicetti, H. Emptoz. “A Hierarchical and Recursive Model of Mathematical Expressions for Automatic Reading of Mathematical Documents”. Proceedings of 5th International Conference on Document Analysis and Recognition, ICDAR'99, Bangalore, India, pp. 119-122, 1999.
[4] A. Kacem, A. Belaid and M. Ben Ahmed. “EXTRAFOR: automatic EXTRAction of mathematical FORmulas”. Proceedings of 5th International Conference on Document Analysis and Recognition, ICDAR'99, Bangalore, India, pp. 527-530, 1999.
[5] Kyong-Ho Lee, Yoon-Chul Choy and Sung-Bae Cho. “Geometric Structure Analysis of Document Images: A Knowledge-Based Approach”. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 11, pp. 1224-1240, November, 2000.