An Adaptative Recognition System using a Table Description Language for Hierarchical Table Structures in Archival Documents

Isaac Martinat, Bertrand Coüasnon and Jean Camillerapp

IRISA / INSA, Campus universitaire de Beaulieu, F-35042 Rennes Cedex, France

Abstract. Archival documents are difficult to recognize because they are often damaged. Moreover, documents that a priori share the same structure can still vary considerably from one to another. A recognition system needs external knowledge to overcome these difficulties. We therefore present a recognition system that uses a user description. To use table descriptions when analyzing the image, our system relies on final intersections: intersections of two rulings located close to an extremity of one or both of these rulings. We present results showing how our system recognizes tables with a general description and how it deals with noise with a more precise description.

Keywords: archival documents, table structure analysis, knowledge specification.

1 Introduction

Much work has been carried out on table recognition [9], but very little on tables in archival documents. These documents are difficult to analyze because they are often damaged by age and storage conditions. The rulings can be broken, skewed or curved. Another difficulty is that the paper is thin and ink bleeds through it, so rulings from the flip side can be visible. For these reasons, such tables are very difficult to recognize. We also want to recognize sets of documents that share the same logical structure but whose physical structure can change from one page to the next.

To overcome these difficulties, a recognition system needs a priori knowledge. We therefore propose a recognition system that uses a user description written in a dedicated language. This language makes it possible to define both the logical and the physical structure of tables. Its advantage is that a single specification can describe one logical structure with large variations in physical structure (figure 4).

In this paper, we first present related work on table representations and on archival document recognition. In Section 3 we propose a language to describe tables. Section 4 explains how our recognition system works and how it uses table descriptions. Before concluding, we show with some results that our system can recognize very different kinds of tables with the same general description, and can also recognize noisy and very damaged tables with a more precise description.

2 Related work

2.1 Table representations for editing

Wang [8] proposed a model for editing tables, composed of a logical part and a physical part. The logical part contains the row and column hierarchy. In the physical part, for a cell or a set of cells, the user specifies the separator type, the size, the content display (such as font and alignment), and so on. When a user edits a table, the number of columns and rows has to be known. Many other table descriptions exist in different languages, for example in XML, but they are designed for editing. For editing, a description must be complete with respect to the data, and the number of cells is fixed. For table recognition, we need a single description for a set of tables that can differ from one another, for example in the number of columns and rows or in the hierarchy.

2.2 Archival document recognition

Much work has been carried out on table recognition [9], but very little on damaged tables in archival documents. The analysis of these documents is difficult because they are often badly damaged. For the recognition of tables with rulings, Tubbs et al. recognized 1910 U.S. census tables [7], but the coordinates of each cell of the tables are given by hand in an input file of 1,451 lines. The drawbacks of this method are the long time the user spends writing this description and the fact that the recognized documents cannot have physical variations. Nielson et al. [6] recognized tables whose rows and columns are separated by rulings. Projection profiles are used to identify rulings. For each document a mesh is created, and individual meshes are combined to form a template with a single mesh. Individual meshes must be almost identical to be combined, so this method cannot recognize documents with large variations between them.

Among other archival document recognition methods, a graphical interface is used to recognize archive biological cards [4] and World War II lists [1]. Esposito et al. [3] designed a document processing system, WISDOM++, for some specific archival documents (articles, registration cards); the result of the analysis can be modified by the user, and training observations are generated from these user operations. All these methods are used for a very specific type of document, and the information given by the user is very precise.

To help archival document recognition, systems use a user description [7, 2], a graphical interface [4, 1], information from other documents of the same type [6] or user corrections [3]. These works all use external knowledge. However, this knowledge often takes a long time to define and is too precise, so these systems do not allow large variations between documents. We presented in [2] a specific description system for military forms of the 19th century. We also showed that a general system was not able to recognize these archival documents. This specific description took a long time to write, so a faster way of describing tables is necessary. In [5] we presented a table recognition system using a short user description, but this system was limited: the numbers of rows and columns were fixed and the row and column hierarchies could not be described. Therefore, we propose a general table recognition system based on a table description language, which can adapt to damaged archival tables when given a precise description.

3 Table Language for Table Recognition

A table is a set of cells organized in columns and rows. We want to recognize the organization of a table, which means locating its cells and labeling each cell with the name and hierarchy of its column and the name and hierarchy of its row. We also have to detect table structures in very damaged documents. To address these two difficulties, we need a user description.

3.1 Specification Precision Levels

The language we propose is composed of two parts. The first is a logical part, in which the user describes the row and column hierarchy. The second is a physical part, in which the user describes the row and column separators and can optionally define the number and/or size of columns and/or rows. The advantage of this language is that tables can be described at different levels of precision.

On the one hand, the description can be very general. In this case, documents with large differences can be recognized with the same description, but the documents to recognize must not contain noise. For example, in a general description, a multi-level row hierarchy can be described without specifying the number of rows at each level.

On the other hand, the description given by the user can be very precise. In this case, very damaged documents can be recognized, but for a given description the variations between documents cannot be large. For example, in a precise description, the numbers of rows and columns can be specified. For an even more precise description, sizes can also be given for some columns and some rows.

The user can move easily and quickly from a general description to a precise one by adding or modifying some specifications, as in figure 1. The user can also mix both levels in one description: for example, the description can be precise for the columns, whose number is fixed, and general for the rows, whose number is unfixed.

3.2 Table Language Definition

We now use the term element to mean either a column or a row. Like Wang's model, our language is composed of two parts, a logical one and a physical one. The main difference between our language and Wang's model is that in our language the user can specify an unfixed number of columns and rows for a table. In the logical part, the user describes the element hierarchy (COL, ROW) and the relationship between columns and rows (COLS_IN_ROW). The physical part is optional: the user can specify the number of repetitions of an element (REPEAT, or REPEAT+ if the number is unfixed), the size of an element, and the separator types (SEPCOL, SEPROW). The user can also describe specific separators for some cells (SEPCELL).

3.3 Language Examples

These examples (figure 1) show that a description is easy and fast to write. The words in capital letters are reserved words of the language.

Fig. 1. Example of a general description to recognize very different tables, which can be easily modified into a precise description to recognize very damaged documents. (a) General description: it is used to recognize very different tables (figures 4 and 5) with an unfixed number of columns and rows at each level of hierarchy. (b) Precise description: it is used to recognize very damaged documents (figure 6). (c) Example of a table which can be recognized with the general description. (d) Example of a damaged table which can be recognized with the precise description.

To turn the general description into a more precise one, REPEAT+(1,info) is replaced by REPEAT(7,info) and REPEAT+(1,person) by REPEAT(31,person). With the general description, the recognition system can recognize tables with large variations between them (figures 4 and 5). Indeed, the number of columns is unfixed in this description and, for the rows, the number of rows is unknown at each level of hierarchy. The precise description allows the recognition system to recognize very damaged documents (figure 6). For documents where flip side rulings are visible, the user can make the description more precise still by giving sizes for rows and columns. The sizes help the system to avoid detecting flip side rulings.
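The move from a general to a precise description can be illustrated with a small sketch. It assumes a hypothetical in-memory model of a description: the `Element` class, its fields and the `make_precise` helper are illustrative names, not part of the language itself. Only the `REPEAT`/`REPEAT+` notation and the element names `info` and `person` come from the text.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Element:
    """A column or a row of the description (hypothetical model)."""
    name: str
    count: int = 1               # REPEAT(count, name); minimum for REPEAT+
    unfixed: bool = False        # True models REPEAT+ (count or more)
    size: Optional[int] = None   # optional physical size, for example in pixels
    children: Tuple["Element", ...] = ()


def make_precise(elem, fixed_counts):
    """Return a copy in which the listed elements get a fixed repetition count."""
    children = tuple(make_precise(c, fixed_counts) for c in elem.children)
    if elem.name in fixed_counts:
        return replace(elem, count=fixed_counts[elem.name],
                       unfixed=False, children=children)
    return replace(elem, children=children)


def show(elem):
    """Render the repetition of an element in the notation used in the text."""
    op = "REPEAT+" if elem.unfixed else "REPEAT"
    return f"{op}({elem.count},{elem.name})"


# General description: unfixed numbers of columns (info) and rows (person).
general = [Element("info", 1, unfixed=True), Element("person", 1, unfixed=True)]
# Precise description: 7 columns and 31 person rows, as in the text.
precise = [make_precise(e, {"info": 7, "person": 31}) for e in general]
```

Here `[show(e) for e in general]` gives the two `REPEAT+` forms and `[show(e) for e in precise]` the two `REPEAT` forms quoted above; giving a value to `size` would model the still more precise description used against flip side rulings.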

4 From the Description to the Image

4.1 Final Intersections

From the image, we extract a set of line segments. Our goal is to match the image information with the column and row information given by a user. Therefore we need to associate each line segment with a row or a column separator. We also need an intermediate level, with common elements, to match the image information with the user description. These elements must be stable. To detect row and column separators with their hierarchy, we need to use line segment extremities. The elements we use must be derivable from a user description and must be easy to extract from the image. We propose specific intersections, which we call final intersections, that involve at least one line segment extremity. From the user description, we can derive the final intersections that have to be found in the image, and from the image we can extract the final intersections.

We call final intersection (figure 2) an intersection of two rulings located close to an extremity of one or both of these rulings. The two rulings may not actually intersect each other; in this case we call tolerance the distance between the two rulings.

Fig. 2. Examples of final intersections; double arrows represent the tolerance.

These final intersections make it possible to detect the beginning and end of separators, or specific changes in separator types. We do not use cross intersections because they are too ambiguous. Final intersections carry stronger dependencies: they are typed and can be differentiated. For example, some intersections can be identified as a table corner or as the beginning of a row separator.
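The definition of a final intersection can be sketched in code. This is a minimal illustration under simplifying assumptions that are not in the text: rulings are modeled as perfectly horizontal or vertical segments, and the names `HSeg`, `VSeg` and `final_intersection` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HSeg:
    """Horizontal ruling at height y, spanning x1 <= x2 (assumed model)."""
    y: float
    x1: float
    x2: float


@dataclass(frozen=True)
class VSeg:
    """Vertical ruling at abscissa x, spanning y1 <= y2 (assumed model)."""
    x: float
    y1: float
    y2: float


def _dist_to_interval(v, lo, hi):
    """Distance from v to the interval [lo, hi]; 0 if v lies inside it."""
    if v < lo:
        return lo - v
    if v > hi:
        return v - hi
    return 0.0


def final_intersection(h, v, tol):
    """Return (x, y, h_ends, v_ends) for a final intersection, else None.

    h_ends / v_ends tell which ruling has an extremity close to the meeting
    point, which is what allows the intersection to be typed.
    """
    px, py = v.x, h.y
    # Each ruling must reach the meeting point, up to the tolerance.
    if _dist_to_interval(px, h.x1, h.x2) > tol:
        return None
    if _dist_to_interval(py, v.y1, v.y2) > tol:
        return None
    h_ends = min(abs(px - h.x1), abs(px - h.x2)) <= tol
    v_ends = min(abs(py - v.y1), abs(py - v.y2)) <= tol
    if not (h_ends or v_ends):
        return None  # plain cross intersection: not a final intersection
    return (px, py, h_ends, v_ends)
```

With this model, two rulings that both end near the meeting point form a corner candidate, one ruling ending on the other forms a T-shaped candidate, and two rulings crossing in their middle are rejected.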

4.2 Recognition System using Final Intersections

To detect the table structure, our system performs a depth-first analysis of the rows and, for each terminal row:

– detects a horizontal separator, which we call SepH;
– from the table description, gets the final intersections associated with this row, which we call DescrInterList;
– from the image, gets the final intersections associated with SepH, which we call ImageInterList;
– matches DescrInterList and ImageInterList:
  • if the matching succeeds, the vertical separators associated with the image final intersections are detected and labeled with the column names from the table description;
  • if the matching fails, it is delayed, which means it will be run later. When the matching is released, that is to say run again, the column separators detected during the delay can allow it to succeed. The search for intersections is also extended: the tolerance is automatically increased to help the matching succeed.
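The steps above can be sketched as follows. This is a simplified illustration, not the system's implementation: the image is reduced to a hypothetical dictionary giving, for each row separator, candidate intersections as `(x, gap)` pairs, where `gap` is the distance between the two rulings; matching is assumed to succeed when the number of intersections found equals the number expected from the description; and doubling the tolerance on release is an arbitrary choice.

```python
def visible(candidates, tol):
    """Final intersections whose ruling gap fits within the tolerance."""
    return sorted(x for x, gap in candidates if gap <= tol)


def try_match(candidates, cols, tol, known_x=None):
    """Return {separator name: x} if the row matches the description, else None."""
    xs = visible(candidates, tol)
    if known_x:
        # Keep only intersections aligned with already detected column separators.
        xs = [x for x in xs if any(abs(x - k) <= tol for k in known_x)]
    if len(xs) != len(cols):
        return None
    return dict(zip(cols, xs))


def match_rows(rows, cols, image, tol=2, max_tol=16):
    """Match every row; delay failures and release them with a larger tolerance."""
    labels, delayed = {}, []
    for row in rows:                      # first pass, initial tolerance
        match = try_match(image[row], cols, tol)
        if match is None:
            delayed.append(row)           # delay: this matching is run later
        else:
            labels[row] = match
    while delayed and tol < max_tol:      # release with an increased tolerance
        tol *= 2
        known_x = {x for m in labels.values() for x in m.values()}
        still_delayed = []
        for row in delayed:
            match = try_match(image[row], cols, tol, known_x)
            if match is None:
                still_delayed.append(row)
            else:
                labels[row] = match
        delayed = still_delayed
    return labels, delayed


# Hypothetical data: the top separator has a broken ruling at x=100 (gap 6)
# and a false intersection at x=150 (gap 5).
cols = ["s0", "s1", "s2", "s3"]
image = {
    "top": [(0, 0), (100, 6), (150, 5), (200, 0), (300, 0)],
    "middle": [(0, 0), (100, 1), (200, 0), (300, 0)],
    "bottom": [(0, 0), (100, 0), (200, 1), (300, 0)],
}
labels, delayed = match_rows(["top", "middle", "bottom"], cols, image)
```

In this data the top row is delayed at the initial tolerance, then matched after release: the larger tolerance recovers the broken ruling, and the column separators labeled from the other rows let the false intersection be discarded.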

4.3 System Adaptation as a Function of the Description Level

The more precise the user description, the more the system can adapt to a document, and so the noisier the documents it can recognize. If the user description is very general, the system searches for final intersections in the image with a small tolerance, the initial value. When the user description is more precise and the system has not detected, after the first detection, a structure in the image that matches the table description, the delayed row detections are released. After this release, the tolerance is automatically increased to help the system find the right structure. For example, if the number of columns is fixed, then when the tolerance is increased the system searches for final intersections in larger zones and can detect the right number of column separators. When the user description is very precise, that is, when sizes are given for rows and/or columns, the system searches for rulings in image zones delimited by these sizes. This helps the system to avoid detecting false rulings, for example flip side rulings.
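The use of sizes to delimit search zones can be illustrated with a short sketch. It assumes, hypothetically, that column sizes are widths measured from a known left origin and that candidate vertical rulings are given by their x position; the function names are illustrative.

```python
from itertools import accumulate


def expected_positions(origin, sizes):
    """Expected x positions of column separators, from cumulative sizes."""
    return list(accumulate(sizes, initial=origin))


def filter_rulings(ruling_xs, origin, sizes, tol):
    """Keep only rulings lying in a zone around an expected separator."""
    expected = expected_positions(origin, sizes)
    return [x for x in ruling_xs if any(abs(x - e) <= tol for e in expected)]


# Hypothetical values: three columns of widths 100, 50 and 50 from x=10.
zones = expected_positions(10, [100, 50, 50])
kept = filter_rulings([12, 60, 108, 161, 209, 240], 10, [100, 50, 50], tol=5)
```

Rulings that fall outside every zone, such as a flip side ruling showing through in the middle of a column, are simply never considered as separator candidates.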

5 Results

The system takes 14 seconds on Linux with a 2.0 GHz processor to recognize a 2495x3779 image at 256 dpi.

5.1 Example on a Noisy Document

We show on one synthetic example how our system can recognize noisy documents by using the user description. In this first example (fig. 3), the user specifies that the document is a table containing 3 columns (A, B and C) and 3 rows. The user also specifies that, in the second row, the separator of column A is blank. For the analysis, a preliminary step derives from the user description the final intersections that the system has to find in the image. The final intersections in this example are the circles and the ellipses on the image (fig. 3).

Fig. 3. Example where our system can detect a noisy document with a false separator.

The system starts the table recognition with the first horizontal separator and extracts from the image the final intersections of this top separator. The system extracts 3 final intersections, whereas according to the user description it should detect 4, so it delays this matching. The system then detects the next two horizontal separators, together with the final intersections of each separator, and the matching with the user description succeeds. The system then detects the bottom separator and, as the separator for column A has already been detected, it detects a final intersection in the prolongation of this vertical separator with a higher tolerance value. After this detection, the delay is released: the system again detects the top separator and its associated final intersections, this time with a higher tolerance value and, as with the bottom separator, it detects the correct final intersections. This example shows how our system can recognize difficult documents. The description allows the system to eliminate false separators and to detect separators with missing parts.

5.2 General Description

With the same general description (figure 1), the system can recognize census tables from different years (figures 4 and 5) with different structures. In figure 5, the 1881 table contains 8 columns whereas the 1911 table contains 10 columns. Within the same year, the row hierarchy is different for each document, so it is not possible to have a precise description of this hierarchy. Using this description, the recognition system, after detecting the boxHead, labeled the column separators with the names from the description. For the row detection, a horizontal separator is detected, then the system gets the vertical ruling that intersects the left extremity of the horizontal separator. Starting from the terminal level of the hierarchy, the system checks whether the label of this vertical ruling matches the specification of the row level. If this check fails, the system tries again with the next level up in the hierarchy until it finds the right level.
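The search for the row level can be sketched as a walk from the terminal level upward. In this minimal illustration, each level of the row hierarchy is assumed to be associated with the label of the vertical ruling at which its separators start. The level names `house`, `household` and `person` appear in figure 4; the ruling labels and the function name are hypothetical.

```python
def row_level(left_label, levels):
    """Return the hierarchy level of a row separator, or None if none matches.

    levels lists (level name, label of the starting vertical ruling) from the
    top level down to the terminal level; the search starts at the terminal
    level and moves up, as described in the text.
    """
    for name, start_label in reversed(levels):
        if start_label == left_label:
            return name
    return None


# Hypothetical labels for the vertical rulings where each level starts.
levels = [
    ("house", "left_border"),
    ("household", "sep_house"),
    ("person", "sep_household"),
]
```

For instance, a horizontal separator whose left extremity meets the ruling labeled `sep_household` is classified at the `person` level, while one that reaches `left_border` is only matched after the two lower levels have been rejected.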

5.3 Precise Description

With a more precise description, in which the row and column numbers are fixed, the system can recognize the damaged archival document in figure 6. The row hierarchy is not detected, but at the lowest level the person rows are detected. As the system fails to recognize the structure at the first step, the tolerance for ruling gaps, like the tolerance for final intersections, is automatically increased until the right structure is recognized. The system therefore detects the right structure.

Fig. 4. Census table of 1881 and the structure recognized with a general description (fig. 1); the column number is unfixed, as is the row number at each level of hierarchy. (a) Original image. (b) Level 1: boxHead, house. (c) Level 2: head1, head2, household. (d) Level 3: person.

Fig. 5. Census tables of 1881 and 1911 and the structures recognized with the same general description (fig. 1); the column number is unfixed, as is the row number. (a) 1881: 8 columns. (b) Level 1: boxHead, house. (c) 1911: 10 columns. (d) Level 1: boxHead, house.

Fig. 6. Damaged census table of 1876 and the structure recognized with a precise description; row and column numbers are fixed. (a) Original image. (b) Level 3: person.

6 Conclusion

We presented a language to describe tables. With the same language, table descriptions can be very precise, to recognize damaged documents, or general, to detect tables with large variations between them, and these descriptions are fast to write. To match a table description with image information, we have shown the value of using specific intersections, which we defined as final intersections. Finally, we have shown in our results how our system can detect a table with a multi-level row hierarchy using a general description. With this description, a large number of different structures can be recognized. If documents are too damaged to be recognized with this description, the user can easily and quickly add or modify some specifications to obtain a more precise description. The system can then detect very damaged documents, which is important for the automatic processing of archival documents. Our future work is to validate our system on a significant number of archival documents and to study the limits of our system at different levels of precision.

Acknowledgments

This work has been done in cooperation with the Archives départementales des Yvelines in France, with the support of the Conseil Général des Yvelines.

References

1. A. Antonacopoulos and D. Karatzas. Document image analysis for World War 2 personal records. In 1st International Workshop on Document Image Analysis for Libraries (DIAL 2004), pages 336–341, Palo Alto, CA, USA, January 2004.
2. Bertrand Coüasnon. DMOS, a generic document recognition method: Application to table structure analysis in a general and in a specific way. International Journal of Document Analysis and Recognition (IJDAR), 8(2-3):111–122, 2006.
3. F. Esposito, D. Malerba, G. Semeraro, S. Ferilli, O. Altamura, T. M. A. Basile, M. Berardi, M. Ceci, and N. Di Mauro. Machine learning methods for automatically processing historical documents: From paper acquisition to XML transformation. In 1st International Workshop on Document Image Analysis for Libraries (DIAL 2004), pages 328–335, Palo Alto, CA, USA, January 2004.
4. J. He and Andy C. Downton. User-assisted archive document image analysis for digital library construction. In 7th International Conference on Document Analysis and Recognition (ICDAR 2003), pages 498–502, Edinburgh, UK, August 2003.
5. Isaac Martinat and Bertrand Coüasnon. A minimal and sufficient way of introducing external knowledge for table recognition in archival documents. In Sixth IAPR International Workshop on Graphics Recognition, GREC 2005, Revised Selected Papers, volume LNCS 3926, pages 206–217, 2006.
6. H.E. Nielson and W.A. Barrett. Consensus-based table form recognition. In 7th International Conference on Document Analysis and Recognition (ICDAR 2003), pages 906–910, Edinburgh, UK, August 2003.
7. K.M. Tubbs and D.W. Embley. Recognizing records from the extracted cells of microfilm tables. In ACM Symposium on Document Engineering, pages 149–156, 2002.
8. X. Wang. Tabular abstraction, editing, and formatting. PhD thesis, University of Waterloo, 1996.
9. Richard Zanibbi, Dorothea Blostein, and James R. Cordy. A survey of table recognition. International Journal of Document Analysis and Recognition (IJDAR), 7(1):1–16, 2004.