Mathematical Formulas Extraction

Jianming Jin, Xionghu Han, Qingren Wang
Institute of Machine Intelligence, Nankai University, Tianjin, China, 300071
[email protected]

Abstract

As a universal technical language, mathematics has been widely applied in many fields, and it describes information more accurately than any other language. Therefore, numerous mathematical formulas exist in all kinds of documents. Automatic processing of mathematical formulas is both important and necessary, and extracting formulas from document images is its first step. In this paper, formula extraction methods that are not based on recognition results are presented: isolated formulas are extracted based on a Parzen window, and embedded expressions are extracted based on 2-D structure detection. Experiments show that our methods are very effective in formula extraction.

1 Introduction

Mathematics has been widely applied in education, research, management, business and many other fields. As a universal language for all fields, all nations and all races of mankind, mathematics is more accurate than any other language in describing human activities for understanding and/or changing the world. Therefore, numerous mathematical formulas exist in all kinds of documents, and automatic processing of mathematical formulas is both important and necessary. However, formulas differ greatly from normal text. For example, normal text lines are one-dimensional and their characters are placed one after another, while most formulas are two-dimensional and their characters may lie below, above or inside each other. Therefore, extracting formulas from document images is the first step of formula processing.

According to typesetting style, there are two classes of formulas, namely isolated formulas (IFs) and embedded formulas (EFs). IFs and non-IFs differ in many aspects of typesetting style, so it is possible to distinguish IFs from non-IFs without recognition. For example, the line height [1], the line spacing [1], whether the line is center-aligned [3], the number of “long” words in the line [3], the line density [2][5] and the vertical projection of the line [2] can all be used to distinguish IFs from non-IFs. However, none of these studies describes in detail how to use those features or how effective the resulting methods are.

EFs are embedded in non-IFs, so they are much more difficult to extract than IFs. Almost all EF extraction methods [1][3] are based on recognition results. [4] classified all characters into nine classes, which is in fact a rough recognition process. [2] observed that italic characters are much more likely to belong to EFs than to non-EFs, an approach that also needs a recognition process. In fact, the existence of 2-D structure formulas decreases the character recognition rate dramatically. Therefore, it is necessary to extract 2-D structure EFs before recognition.

In this paper, the system model is introduced in section 2. A method to distinguish IFs from non-IFs, based on a Parzen window, is presented in section 3. A method to extract 2-D structure EFs from non-IFs, based on MEANLINE/BASELINE position estimation, is presented in section 4. Conclusions are drawn in section 5.

2 System overview

Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003) 0-7695-1960-1/03 $17.00 © 2003 IEEE

As shown in Figure 1, the input of the system is a monochrome scanned document image and the output is the location of all extracted formulas. At the preprocessing stage, the image is deskewed, noise is removed and connected components are extracted. At the line segmentation stage, all text lines are segmented using a hybrid top-down and bottom-up page decomposition method. At the formula extraction stage, all text lines are first classified into IFs and non-IFs by a Parzen classifier, and then 2-D structure EFs are extracted from non-IFs based on 2-D structure detection.

Figure 1. System model (Document Image → Preprocessing → Lines Segmentation → Formulas Extraction → Extraction Result)

3 Isolated formulas extraction

There are many differences in typesetting style between IFs and non-IFs, so it is possible to distinguish IFs from non-IFs without recognition. Features that represent those differences are extracted to distinguish IFs from non-IFs.

3.1 Features extraction

The following features are extracted:

(1) Line Height:
$$\mathit{HT} = h / h_0 \tag{1}$$

(2) Space Above Line:
$$\mathit{AS} = \mathit{as} / h_0 \tag{2}$$

(3) Space Below Line:
$$\mathit{BS} = \mathit{bs} / h_0 \tag{3}$$

(4) Left Indent:
$$\mathit{LI} = \mathit{li} / l \tag{4}$$

(5) Right Indent:
$$\mathit{RI} = \mathit{ri} / l \tag{5}$$

(6) Distance between the formula and its sequence number:
$$\mathit{LD} = \mathit{ld} / h_0 \tag{6}$$

In (1)-(6), $h$ is the height of the line, $l$ is the length of the line, $h_0$ is the average height of all characters in the line, and $\mathit{as}$, $\mathit{bs}$, $\mathit{li}$, $\mathit{ri}$ and $\mathit{ld}$ are defined in Figure 2. Therefore, each line is represented by $x = (\mathit{HT}, \mathit{AS}, \mathit{BS}, \mathit{LI}, \mathit{RI}, \mathit{LD})$.

Figure 2. Definition of some variables (as, bs, li, ri, ld)

3.2 Parzen window

Knowing the class-conditional probability density function $p(x|\omega_i)$ is the precondition for using the Bayes classification method. For the IF/non-IF classification problem, the form of $p(x|\omega_i)$ is totally unknown, so a non-parametric estimation method is adopted to estimate $p(x|\omega_i)$. In this case, a Parzen window is used. Suppose there are $N_k$ training samples, $x_{k1}, x_{k2}, \ldots, x_{kN_k}$, in class $\omega_k$; then a Parzen classifier is determined by a kernel function and the window width $h_k$. We adopt (7) as the kernel function, where $\hat{p}(x|\omega_k)$ is the estimate of $p(x|\omega_k)$, $\hat{\Sigma}_k$ is the covariance matrix of $\omega_k$'s training samples, and $n$ is the dimension of $x$.

$$\hat{p}(x|\omega_k) = \frac{1}{N_k} \sum_{j=1}^{N_k} \frac{1}{(2\pi)^{n/2}\, h_k^{n}\, |\hat{\Sigma}_k|^{1/2}} \exp\left\{ -\frac{1}{2h_k^{2}} (x - x_{kj})^{T} \hat{\Sigma}_k^{-1} (x - x_{kj}) \right\} \tag{7}$$

Once $p(x|\omega_k)$ is estimated, the minimum-error-probability Bayes classification method can be used, namely,

$$p(x|\omega_i)P(\omega_i) = \max_{j=1,\ldots,k} \left\{ p(x|\omega_j)P(\omega_j) \right\} \Rightarrow x \in \omega_i.$$
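The feature vector of section 3.1 and the classifier of section 3.2 can be sketched together as follows. This is a minimal illustration rather than the authors' implementation: the names `line_features` and `ParzenClassifier`, the single shared window width, the small ridge added to the covariance matrix, the class priors taken from training frequencies, and the synthetic training data are all assumptions made for the example.

```python
import numpy as np


def line_features(h, l, h0, as_, bs, li, ri, ld):
    """Feature vector x = (HT, AS, BS, LI, RI, LD) of equations (1)-(6)."""
    return np.array([h / h0, as_ / h0, bs / h0, li / l, ri / l, ld / h0])


class ParzenClassifier:
    """Bayes classifier whose class densities follow the Gaussian kernel of (7)."""

    def __init__(self, window_width=1.0):
        self.h = window_width  # assumption: the same width h_k for every class
        self.classes = {}

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        for label in np.unique(y):
            samples = X[y == label]
            cov = np.atleast_2d(np.cov(samples, rowvar=False))
            cov = cov + 1e-6 * np.eye(cov.shape[0])  # assumption: ridge for invertibility
            self.classes[label] = {
                "samples": samples,
                "inv_cov": np.linalg.inv(cov),
                "det_cov": np.linalg.det(cov),
                "prior": len(samples) / len(X),  # assumption: P(w_k) from frequencies
            }
        return self

    def density(self, x, label):
        """Estimate p(x | w_label) by averaging the kernel over the class samples."""
        c = self.classes[label]
        n = c["samples"].shape[1]
        diff = c["samples"] - np.asarray(x, dtype=float)
        # Quadratic form (x - x_kj)^T Sigma^-1 (x - x_kj) for every sample j
        quad = np.einsum("ij,jk,ik->i", diff, c["inv_cov"], diff)
        norm = (2 * np.pi) ** (n / 2) * self.h ** n * np.sqrt(c["det_cov"])
        return np.mean(np.exp(-quad / (2 * self.h ** 2)) / norm)

    def predict(self, x):
        """Pick the class that maximizes p(x | w_j) P(w_j)."""
        return max(self.classes,
                   key=lambda k: self.density(x, k) * self.classes[k]["prior"])


# Synthetic, made-up feature clusters; real features would come from line_features.
rng = np.random.default_rng(0)
text_center = np.array([1.0, 0.3, 0.3, 0.0, 0.0, 0.0])
if_center = np.array([2.5, 1.0, 1.0, 0.3, 0.3, 0.5])
X = np.vstack([rng.normal(text_center, 0.05, size=(40, 6)),
               rng.normal(if_center, 0.05, size=(40, 6))])
y = np.array(["non-IF"] * 40 + ["IF"] * 40)

clf = ParzenClassifier(window_width=1.0).fit(X, y)
print(clf.predict(if_center), clf.predict(text_center))
```

In practice the window width would have to be tuned on held-out lines, since it controls how smooth the estimated densities are.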

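[…]

BASELINE ($b_0$) and MEANLINE ($m_0$) are estimated by (8), where $HP = \{h_y \mid y = 1, \ldots, h\}$ is the horizontal projection result.

$$\begin{cases} b_0 = b, & h_b = \max\limits_{1 \le i < h/2} \{h_i\} \\ m_0 = m, & h_m = \max\limits_{h/2\,[\ldots]} \{h_i\} \end{cases} \tag{8}$$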
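A small sketch of how (8) might be evaluated on a binary text-line image is given below. The function names are hypothetical. The search range for $m$ is not recoverable from the text, so the sketch assumes it is the half of the rows not used for $b$; the direction in which rows are numbered is also left to the caller.

```python
import numpy as np


def horizontal_projection(line_image):
    """HP = {h_y}: number of foreground pixels in each row of a binary line image."""
    return np.asarray(line_image, dtype=bool).sum(axis=1)


def estimate_reference_rows(hp):
    """Return (b, m) as 1-based row indices into the projection hp.

    b is the row with the largest projection among rows 1 <= i < h/2, as in (8).
    m is the row with the largest projection among the remaining rows; that
    range is an assumption. Ties are resolved in favor of the first row.
    """
    hp = np.asarray(hp)
    h = len(hp)
    if h < 3:
        raise ValueError("need at least three rows to split the projection")
    first_count = -(-h // 2) - 1  # number of rows with 1 <= i < h/2
    b = int(np.argmax(hp[:first_count])) + 1
    m = int(np.argmax(hp[first_count:])) + first_count + 1
    return b, m


# Made-up projection with one clear peak in each half of the line.
hp = [0, 2, 9, 3, 1, 1, 4, 12, 2, 0]
b0, m0 = estimate_reference_rows(hp)
print(b0, m0)
```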

[…]

$$\frac{[\ldots]}{\omega.bottom - \omega.top} > T_{sup} \tag{10}$$

$$\frac{\omega.top - b}{\omega.bottom - \omega.top} > T_{sub} \tag{11}$$

then $\omega$ is an abnormal character, where $T_{sup}$ is the threshold used to judge whether $\omega$ is above MEANLINE and $T_{sub}$ is the threshold used to judge whether $\omega$ is below BASELINE.

If a word satisfies (12),

$$\frac{n_{ab}}{n} > T_{ab} \tag{12}$$

then the word is an EF, where $n$ is the total number of characters in the word, $n_{ab}$ is the number of abnormal characters in the word, and $T_{ab}$ is the threshold used to judge whether the word is an EF.

4.4 Experimental results

Figure 4 shows an EF extraction result. It can be seen that most 2-D structure EFs are extracted correctly.

Figure 4. Embedded formulas extraction result

5 Conclusions

Automatic processing of mathematical formulas is a comprehensive and difficult problem, and formula extraction is its first step. We present a method, based on a Parzen window, to distinguish isolated formulas from non-isolated formulas; 91.65% of isolated formulas are extracted correctly. After IF extraction, embedded formulas are extracted from non-IFs. A method that extracts 2-D structure embedded formulas without recognition information, based on BASELINE/MEANLINE position estimation, is presented. Experiments show that this method is effective in extracting 2-D structure EFs. Future research will focus on: (1) decreasing the confusion rate between IFs and non-IFs by using more features, and (2) extracting 1-D EFs by using recognition information.

Acknowledgement

This work is supported by NNSFC (National Natural Science Foundation of China), grant number TY10026002-04-04-01.

References

[1] Hsi-Jian Lee and Jiumn-Shine Wang. “Design of a Mathematical Expression Recognition System”. Proceedings of 3rd International Conference on Document Analysis and Recognition, ICDAR'95, Montréal, Canada, pp. 1084-1087, August, 1995.

[2] Richard J. Fateman. “How to Find Mathematics on a Scanned Page”. Technical Report, 1996.

[3] J.-Y. Toumit, S. Garcia-Salicetti, H. Emptoz. “A Hierarchical and Recursive Model of Mathematical Expressions for Automatic Reading of Mathematical Documents”. Proceedings of 5th International Conference on Document Analysis and Recognition, ICDAR'99, Bangalore, India, pp. 119-122, 1999.

[4] A. Kacem, A. Belaid and M. Ben Ahmed. “EXTRAFOR: automatic EXTRAction of mathematical FORmulas”. Proceedings of 5th International Conference on Document Analysis and Recognition, ICDAR'99, Bangalore, India, pp. 527-530, 1999.

[5] Kyong-Ho Lee, Yoon-Chul Choy and Sung-Bae Cho. “Geometric Structure Analysis of Document Images: A Knowledge-Based Approach”. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 11, pp. 1224-1240, November, 2000.
