Segmentation of Color Documents by Line Oriented Clustering using Spatial information

Marcel Worring and Leon Todoran
Intelligent Sensory Information Systems, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam
{worring, [email protected]
http://carol.wins.uva.nl/{~worring,~todoran}

Abstract

In this contribution we introduce a new method for global segmentation of color documents with a structure based on text frames and pictures. It is based on an extensive analysis of the expected shape of clusters in RGB color space. The method provides an improved segmentation and gives a proper basis for indexing and layout analysis. Results are very promising.

keywords: color documents, document segmentation, document structure models, document indexing, document layout.

1 Introduction

With color scanners and copiers becoming available at every workplace, the time is ripe to go beyond the mere reproduction of the color pixels of the original, and to consider the layout analysis and subsequent indexing of color documents. Color documents as they appear in magazines and other color intensive publications are created using advanced desktop publishing systems, rather than simple word processors. Hence, the layout of color documents is highly complex, and trivial extensions of methods for black-and-white images do not apply.

The first step in the analysis of a document is the segmentation of the document into its constituent parts. In color documents these are typically the text frames, figures and photographs. Although color document analysis is still in its infancy, some methods for this initial step have been published in the literature. All these methods incorporate at least two steps in the processing of the document: one step in which different colors are separated out, and another step in which individual pixels or basic elements are grouped into larger objects. The methods differ in the moment at which the two steps are applied and in whether they are applied in a local or global fashion.

The DjVu algorithm [1] is geared towards compression of color document images. It decomposes the image into a set of small tiles, approximately the size of a character, and at that level the assumption is made that the image consists of at most two colors. Hence, a rather simple segmentation method suffices. Local coherence between neighboring tiles is used to group the locally obtained segmentations into a global distinction between foreground and background. Both are then compressed using type-specific compression techniques.

When aiming at indexing the images rather than compression, the disadvantage of local methods is that a lot of the structural information is lost in the process. Global methods allow one to consider the choice of colors and contrasts the document author used. Furthermore, it is hard, if not impossible, to classify pictures on local information only.

Global methods based on clustering in color space are described in [3][5][7]. To find clusters in color space, [3] uses a standard K-means algorithm with random initialization. In [5] 3-dimensional mathematical morphology is used to erode the shapes in color space until a set of cluster centers remains. The method based on the Euclidean Minimum Spanning Tree (EMST) in color space [7] is similar, but based on graph algorithms rather than mathematical morphology. The clustering obtained with any of the three methods gives a set of distinct colors. These are then used to assign pixels in the image to one of the resulting clusters, giving groups of connected pixels with the same color, which can easily be derived using connected component analysis in the (binary) identification image of each color.

Both in [5] and [3] it is observed that for documents the distributions in color space are not random, but exhibit specific shapes. In these references the presence of such shapes is not explained, nor are the shapes employed in the design of the algorithm. Furthermore, all processing of the image is performed in color space; no spatial information is used.

In this contribution we present a new method for document segmentation which incorporates knowledge about the document creation process to identify the expected cluster shapes in RGB color space. The clustering gives a global segmentation of the image; spatial information is used to fine-tune the result. Spatial information is also used to find the initial clusters. Grouping of pixels into meaningful entities is immediate, as the result of our clustering yields document based entities rather than color clusters.

The paper is organized as follows. In sec. 2 the authoring of complex color documents is described formally, so as to define a model for the class of documents considered in this paper. Characteristics of the document model are used in sec. 3 to describe the expected shape of clusters in RGB color space. Sec. 4 presents our proposed algorithm. Results are presented in sec. 5.

2 Document model

Up to this point the most popular models for document analysis can be traced back to the ODA standard. Good early examples are [6] and [2]. Their most prominent characteristic is the use of a tree to describe both the logical and the geometric structure of the document. For magazine pages a hierarchical structure is not appropriate. To define a model for such documents to be used in analysis, we consider the creation process underlying this class of documents.

2.1 Document creation

In the document creation process we distinguish two aspects: firstly, the principles underlying most of the systems used for the creation of documents; secondly, the intention of users when they use such a system to create a document.

Most systems are organized using frames. Such a frame can either be a text frame or a picture frame of arbitrary shape containing photographs and/or graphics. For each frame a distinction is made between the foreground, containing the content, and the background, which is mostly used for aesthetic reasons or to draw attention. The frames are then organized in layers, where frames in higher layers can (partially) hide frames in lower layers.

A professional graphic designer will not use a set of arbitrary colors in the document, but rather selects a small set of appropriate colors to be used in backgrounds. Contrasting colors are chosen for text and graphics. In some color documents (linear) variations of the different colors are also used, e.g. to obtain a background which has one hue, with spatially varying brightness or saturation.

2.2 Document definition

Based on the above considerations, we define the document model as it will be used in the rest of the paper. A document consists of a set of frames denoted by $F_{0..M-1}$, where $F_0$ denotes the frame corresponding to the whole page. In this paper we make the assumption that for the documents under consideration the background color of each frame is uniform over the whole frame. For a frame $F_i$ we can therefore define $Col(F_i) = (r_i, g_i, b_i)$, denoting this background color. When the frame is totally filled with a picture, $Col(F_i)$ is not defined.

For a frame $F_i$ with textual content we define the set of colored text objects $t_{ij}$, with $j = 1..k$, as the decomposition of the textual symbols within the frame into categories based on their color, assuming colors are used to distinguish within a frame between header, paragraphs etc. As before, $Col(t_{ij})$ denotes their color.

To describe the structure of the document it is only necessary to indicate which frames are on top of each other. This can be described by a directed graph $G = (V, E)$, where the vertices $V$ are the frames and two vertices $v_1, v_2$ are connected by a directed edge in $E$ whenever the corresponding frames are directly on top of each other.

For this paper the assumption is made that textual content is always part of the directly enclosing frame. The consequence of this is that whenever a piece of text is placed over two frames, it should be assigned to the frame which encloses them both; it will, however, be decomposed into two components and distributed over the two frames.

Finally, when the document is printed it yields the image given by:

$p(x, y) = (r(x, y), g(x, y), b(x, y))$
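As an illustration only, the document model above could be held in a small data structure such as the following. This is a minimal sketch and not part of the original method: the class names `TextObject`, `Frame` and `Document`, the field `on_top_of`, and the helper methods `colors` and `contrasts` are hypothetical names chosen here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

RGB = Tuple[int, int, int]

@dataclass
class TextObject:
    """Text object t_ij: all symbols of one color inside a frame F_i."""
    color: RGB                          # Col(t_ij)

@dataclass
class Frame:
    """Frame F_i; background is None when a picture fills the whole frame."""
    background: Optional[RGB]           # Col(F_i), undefined for picture frames
    texts: List[TextObject] = field(default_factory=list)

@dataclass
class Document:
    """Frames F_0..F_{M-1} (F_0 is the page) plus the directed graph G = (V, E)."""
    frames: List[Frame]
    # Edge (u, v) means: frame u lies directly on top of frame v (hypothetical encoding).
    on_top_of: Set[Tuple[int, int]] = field(default_factory=set)

    def colors(self) -> Set[RGB]:
        """Union of all defined Col(F_i) and Col(t_ij)."""
        found = set()
        for frame in self.frames:
            if frame.background is not None:
                found.add(frame.background)
            found.update(t.color for t in frame.texts)
        return found

    def contrasts(self) -> Set[Tuple[RGB, RGB]]:
        """Color pairs that share a boundary: text on its frame, frame on frame."""
        pairs = set()
        for frame in self.frames:
            if frame.background is not None:
                pairs.update((t.color, frame.background) for t in frame.texts)
        for upper, lower in self.on_top_of:
            a, b = self.frames[upper].background, self.frames[lower].background
            if a is not None and b is not None and a != b:
                pairs.add((a, b))
        return pairs

# Example: a white page, a yellow text frame with red and black text, and a photo.
page = Frame(background=(255, 255, 255))
box = Frame(background=(250, 240, 120),
            texts=[TextObject((200, 30, 30)), TextObject((0, 0, 0))])
photo = Frame(background=None)
doc = Document(frames=[page, box, photo], on_top_of={(1, 0), (2, 0)})
print(sorted(doc.colors()))
print(sorted(doc.contrasts()))
```

Storing the overlap relation as a set of index pairs keeps the sketch close to the graph $G = (V, E)$; a picture frame simply carries no background color.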

3 Cluster Shape

To understand how non-random shapes can appear in RGB-space, we take the above model of documents and consider the color histogram in RGB-space of the resulting image $p(x, y) = (r(x, y), g(x, y), b(x, y))$.

3.1 Image acquisition

In images acquired with a camera, varying intensities and different shades due to lighting and orientation are known to lead to line segments in RGB space for objects of uniform hue. If extended, these line segments would pass through the origin [4]. The acquisition process for documents using a scanner is radically different. Assuming careful calibration, illumination is constant and orientation is fixed. Hence, no highlights or shadows are present if they were not part of the original. Ideally, uniform regions of fixed hue map to one single point in RGB-space. Transitions from one color to another are, however, altered by the printing of the original and by the scanning. A common model for this is to assume that the original step edges in the image lead to Gaussian tapered transitions.

Taking into account the document model defined above, the shapes in RGB-space can now be predicted. Each element in the union of $Col(F_i)$ and $Col(t_{ij})$ over all $i$ and $j$ leads to one cluster of points around the ideal point in RGB-space. Any text object $t_{ij}$ leads to a set of edges in the image, the color of which gradually changes from $Col(t_{ij})$ to $Col(F_i)$. Furthermore, each edge in $G$ leads to a similar transition between the colors of the corresponding frames. As the edge in image space is a gradual transition, this gives a cloud of points between the two associated color clusters. These clouds are called streaks here.

In conclusion, in color space we expect clouds of points around dominant colors, and streaks connecting colors which are part of a dominant contrast. Photographs or figures can map to many different points in color space, and although streaks will be present here also, they are likely to be less pronounced.

The above also explains the L-shaped clusters observed in [5]: these are transitions from two frames or text fragments of different color to the same color, for example text with the same color in frames with different background colors.

3.2 Color space visualization

To verify the appearance of streaks in color space we have developed two color space visualization and interaction methods similar to the ones in [3]. Both visualizations are based on a 3D color cube which is projected and can be viewed from an arbitrary viewpoint. The presence visualization method plots a point for any non-empty bin in the histogram. The density visualization method plots a sphere for each non-empty (or significant) bin, where the radius of the sphere is proportional to the number of points which map to this bin. Both methods use the color associated with the bin for display. An example visualization is shown in figure 1; other examples in color can be seen at our website (footnote 1). The non-randomness of the histogram is very clear. Interactive verification confirmed our expectations. It further indicated that it is fair to assume that streaks are roughly line shaped.

Figure 1: The RGB color cube exhibiting streaks. Bottom: original document. Top left: presence visualization. Top right: density visualization.
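The expectation of roughly line-shaped streaks can be checked on synthetic data. The sketch below is an illustration under stated assumptions, not the authors' tool: it blurs a one-dimensional scanline containing a step edge between two colors with a Gaussian kernel, bins the result in a reduced 64x64x64 color cube, and lists the data the presence and density visualizations would plot. All function names, the test colors and the value of `sigma` are hypothetical.

```python
import math
from collections import Counter

def gaussian_kernel(sigma, radius):
    """Normalised 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_scanline(pixels, sigma=1.5):
    """Gaussian-blur a list of RGB tuples, clamping indices at the borders."""
    radius = int(3 * sigma)
    kernel = gaussian_kernel(sigma, radius)
    last = len(pixels) - 1
    blurred = []
    for x in range(len(pixels)):
        acc = [0.0, 0.0, 0.0]
        for k, w in zip(range(-radius, radius + 1), kernel):
            src = pixels[min(max(x + k, 0), last)]
            for c in range(3):
                acc[c] += w * src[c]
        blurred.append(tuple(acc))
    return blurred

def rgb_histogram(pixels, levels=64):
    """Count pixels per bin of a levels x levels x levels RGB cube."""
    step = 256 // levels
    return Counter(tuple(min(int(round(v)), 255) // step for v in p) for p in pixels)

def presence_points(hist):
    """Presence visualization: one point per non-empty bin."""
    return sorted(hist)

def density_spheres(hist, min_count=1, scale=1.0):
    """Density visualization: (bin, radius) with radius proportional to the count."""
    return [(b, scale * n) for b, n in sorted(hist.items()) if n >= min_count]

def distance_to_segment(c, a, b):
    """Euclidean distance from point c to the segment from a to b."""
    ab = [b[k] - a[k] for k in range(3)]
    ac = [c[k] - a[k] for k in range(3)]
    denom = sum(v * v for v in ab)
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(ac[k] * ab[k] for k in range(3)) / denom))
    return math.dist(c, [a[k] + t * ab[k] for k in range(3)])

# Red text on a yellow background, blurred as a stand-in for printing and scanning.
background, text = (250, 240, 120), (200, 30, 30)
scanline = [background] * 20 + [text] * 20 + [background] * 20
hist = rgb_histogram(blur_scanline(scanline))
bg_bin, text_bin = (62, 60, 30), (50, 7, 7)      # the two colors in bin coordinates
worst = max(distance_to_segment(b, bg_bin, text_bin) for b in presence_points(hist))
print(len(hist), "occupied bins; largest distance to the segment:", round(worst, 2))
```

Because every blurred pixel is a weighted average of the two original colors, the occupied bins stay within quantization error of the segment joining them, which is the streak described in sec. 3.1.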

4 The LOCUSi Method

We have developed a method for segmentation of document images based on the above mentioned considerations and call it LOCUSi (Line Oriented Clustering Using Spatial information). The method is composed of two main steps. In the first step the aim is to find the set of lines $PL = \{l_1, l_2, ..., l_m\}$, where ideally the endpoints correspond to the colors chosen by the author when creating the original, and the segment connecting the endpoints corresponds to the contrasts selected by the author. The latter can either be associated with text on a background or with overlapping frames. In the second step, points in the image are assigned to one of the lines, or remain unclassified.

4.1 Detection of line segments in color space

We initially select the $N$ most dominant colors in the histogram as possible endpoints for lines in $PL$. In general $N$ is chosen such that it yields too many endpoints, and this set is then reduced by combining clusters that are close. To conform to the assumption that the author has deliberately chosen a limited number of colors, the smaller cluster is assigned to the larger one, rather than taking their average. To speed up the processing, all analysis is done in a reduced color space.

As indicated, the transitions between differently colored objects are Gaussian shaped. We apply a simple intensity based edge detection scheme to detect candidate connections between colors in color space. As we only use it to find lines in color space, edges do not have to be complete. To reduce the number of detected lines, irrelevant segments are removed and line segments with almost the same characteristics are merged using a number of rules (see figure 2). Again the dominant line is selected, rather than averaging. The whole detection algorithm is depicted in figure 3.

Figure 2: Rules used to remove and merge line segments.

Figure 3: The algorithm for line segment detection.
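The endpoint selection described in sec. 4.1 can be sketched as follows. This is a rough illustration under assumptions and not the algorithm of figure 3: the function names, the parameter `merge_dist` and the voting scheme that turns edge pixels into candidate lines are all hypothetical, and the removal and merging rules of figure 2 are not reproduced.

```python
import math
from collections import Counter

def select_endpoints(hist, n, merge_dist):
    """Take the n most populated bins and absorb each smaller cluster into a
    larger one within merge_dist; the larger cluster keeps its own color."""
    kept = []                                    # entries are [color, count]
    for color, count in hist.most_common(n):     # visited in decreasing count
        for entry in kept:
            if math.dist(color, entry[0]) <= merge_dist:
                entry[1] += count                # assign the smaller to the larger
                break
        else:
            kept.append([color, count])
    return [(color, count) for color, count in kept]

def nearest_endpoint(color, endpoints):
    """Index of the endpoint whose color is closest to the given color."""
    return min(range(len(endpoints)), key=lambda i: math.dist(color, endpoints[i][0]))

def candidate_lines(edge_pairs, endpoints, min_votes=1):
    """Assumed scheme: every detected edge pixel contributes the colors on its
    two sides; each pair votes for a line between the nearest endpoints."""
    votes = Counter()
    for side_a, side_b in edge_pairs:
        i = nearest_endpoint(side_a, endpoints)
        j = nearest_endpoint(side_b, endpoints)
        if i != j:
            votes[(min(i, j), max(i, j))] += 1
    return [(endpoints[i][0], endpoints[j][0])
            for (i, j), v in votes.items() if v >= min_votes]

# Toy histogram in a reduced color space: yellow, a near-duplicate of yellow, red, black.
hist = Counter({(62, 60, 30): 320, (61, 60, 30): 15, (50, 7, 7): 120, (0, 0, 0): 40})
endpoints = select_endpoints(hist, n=4, merge_dist=3.0)

# Five edge pixels between yellow and red, one spurious edge between black and yellow.
edge_pairs = [((60, 58, 29), (51, 9, 8))] * 5 + [((1, 1, 0), (62, 60, 30))]
lines = candidate_lines(edge_pairs, endpoints, min_votes=2)
print(endpoints)
print(lines)
```

Visiting the bins in decreasing order of count is what makes the larger cluster win a merge, in line with the choice not to average the two colors.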

4.2 Point classification

Having found the set $PL$, we have to consider each pixel $p(x, y) = (r(x, y), g(x, y), b(x, y))$ and assign it to the appropriate line. Let $d(p, l_i)$ denote the distance in color space from the color of $p$ to the line segment $l_i$, and let $d_{min}(p)$ be the minimum distance found for point $p$ when all lines in $PL$ are considered. We define the candidate segment list for point $p$ as follows:

$CL(p) = \{ l_i \in PL \mid d(p, l_i) - d_{min}(p) \le distTh \}$

where $distTh$ is a predefined threshold. Whenever this set contains only one line, the point can be assigned directly to this line, but if it contains more elements the choice has to depend on other characteristics. We use spatial information by searching in an $M \times M$ neighborhood of $p$. Whenever a pixel $p'$ with significantly different color ($d(p, p') > ColTh$) is found, $p$ is assigned to the element in $CL(p)$ for which the distance to the line segment $|pp'|$ is minimal. If no point $p'$ can be found, we are in a uniform region and we just assign $p$ to the line yielding $d_{min}(p)$.

Figure 4: The algorithm for classification.

Footnote 1: http://carol.wins.uva.nl/~todoran/demos.html

5 Results

The method was tested on an image set consisting of document pages scanned from a magazine, and bitmap representations of document pages created in CorelDraw. The created images are noise free and color variation in uniform regions is minimal (see figure 6). The scanned magazine pages present large variation in color as well as random noise and transparency noise (figure 8). The scanning resolution was 300 dpi, and for processing the color space was reduced to 64*64*64 colors. Parameters were selected manually, but were the same for each document. Example results are shown in figure 6 and figure 8.

Figure 5: An original document created in CorelDraw with RGB-cube in density visualization mode (only bins with significant count shown) and indication of lines detected.

The results shown clearly indicate the quality of the method. Detected lines have a direct relation to document entities. They allow a clear separation between text and background. Furthermore, as is apparent in the results for the magazine page, the method finds the use of similar contrasts in different parts of the document. It should be stressed that the results shown are based on the endpoints. If we consider the whole line, the complete frame with text is detected, and by moving along the line, characters are thickened or thinned depending on the direction.

Looking at the complete results for all lines, it is found that almost all lines detected originate from text to background transitions, rather than from overlapping frames. We expect that edge detection at coarser scales (a larger Gaussian kernel) might capture the latter aspect. A further observation is that pieces of photographs and pictures pop up in the result. This is of course expected, as they contain an unrestricted palette of colors. Subsequent processing of the result, using spatial and topological information, allows frames to be distinguished from pictures.
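Returning to the point classification rule of sec. 4.2, the following sketch shows one way it could be written down. It is a minimal illustration under assumptions, not the algorithm of figure 4: the function names are hypothetical, the neighbor $p'$ is taken to be the most different pixel in the $M \times M$ window, and the distance between a line and the segment $|pp'|$ is approximated by the sum of the distances of $p$ and $p'$ to that line.

```python
import math

def dist_to_segment(c, seg):
    """Distance in color space from color c to the line segment seg = (a, b)."""
    a, b = seg
    ab = [b[k] - a[k] for k in range(3)]
    ac = [c[k] - a[k] for k in range(3)]
    denom = sum(v * v for v in ab)
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(ac[k] * ab[k] for k in range(3)) / denom))
    return math.dist(c, [a[k] + t * ab[k] for k in range(3)])

def candidate_list(p, lines, dist_th):
    """Indices of CL(p) = {l_i | d(p, l_i) - d_min(p) <= distTh}, plus all distances."""
    d = [dist_to_segment(p, line) for line in lines]
    d_min = min(d)
    return [i for i, di in enumerate(d) if di - d_min <= dist_th], d

def classify_pixel(image, x, y, lines, dist_th, col_th, m=3):
    """Return the index of the line to which pixel (x, y) is assigned."""
    p = image[y][x]
    cl, d = candidate_list(p, lines, dist_th)
    if len(cl) == 1:
        return cl[0]                              # unambiguous in color space
    r = m // 2
    q, q_diff = None, col_th                      # most different neighbor so far
    for yy in range(max(0, y - r), min(len(image), y + r + 1)):
        for xx in range(max(0, x - r), min(len(image[yy]), x + r + 1)):
            diff = math.dist(p, image[yy][xx])
            if diff > q_diff:
                q, q_diff = image[yy][xx], diff
    if q is None:
        return min(cl, key=lambda i: d[i])        # uniform region: line giving d_min
    # Assumed stand-in for the distance between line i and the segment |pp'|.
    return min(cl, key=lambda i: d[i] + dist_to_segment(q, lines[i]))

# Yellow background shared by two contrasts: yellow-red and yellow-blue.
Y, R, B = (250, 240, 120), (200, 30, 30), (0, 0, 200)
lines = [(Y, R), (Y, B)]
near_red = classify_pixel([[Y, Y, R]], 1, 0, lines, dist_th=10, col_th=40)
near_blue = classify_pixel([[Y, Y, B]], 1, 0, lines, dist_th=10, col_th=40)
print(near_red, near_blue)
```

A background pixel lies on both lines in color space, so only its neighborhood can decide between them; this is the case the spatial search is meant to resolve.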

Figure 6: Results for the document of figure 5. For two of the lines detected, the left-hand side shows the elements at one endpoint of the line segment, the right-hand side shows the elements at the other endpoint.

Figure 7: A scanned magazine page with RGB-cube in presence visualization mode. Overlaid are the most prominent lines detected.

Figure 8: Results for the document of figure 7. For two of the lines detected, the left-hand side shows the elements at one endpoint of the line segment, the right-hand side shows the elements at the other endpoint.

6 Conclusion

In this paper we have considered the creation model of color documents as they typically appear in magazines and studied its impact on the shapes of clusters in RGB-color space. We found that clusters are indeed not random, which was also observed by other authors. With the document model, however, we can also explain the observed cluster shapes. In fact, point clouds around the main colors used in the document are found, with line-shaped streaks related to the main contrasts used in the document. To be precise, the apparent clusters originate mostly from text on a colored background.

Based on these observations we have developed the LOCUSi method, which locates line segments in the RGB-color space, using edge information to initialize the process. In the post-processing of the result, spatial information is further employed. This leads to a segmentation in which each detected line segment in RGB-color space corresponds to relevant document entities. It not only yields improved segmentation results, but also provides a proper basis for global indexing of the document based on the use of color and color contrasts, or for further layout analysis by examining the relations between detected frames. The results of the method indicate that those entities can be detected reliably.

Current research is devoted to a more extensive evaluation of the method on a large set of documents, either created artificially for quantitative evaluation or scanned from magazines.

Acknowledgements

This research work was supported by Senter Den Haag (IOP research project IBV 96008).

References

[1] L. Bottou, P. Haffner, P.G. Howard, P. Simard, Y. Bengio, and Y. LeCun, High quality document image compression with DjVu, Tech. report, July 1998, http://djvuserver.research.att.com.

[2] A. Dengel, From paper to office document standard representation, IEEE Computer 25 (1992), no. 7, 63-67.

[3] G.C. Fox, R.D. Williams, and P.C. Messina, Parallel computing works, Morgan Kaufmann Publishers, Inc., 1994, URL: http://www.npac.syr.edu/copywrite/pcw/.

[4] T. Gevers, A.W.M. Smeulders, and H. Stokman, Photometric invariant region detection, Proceedings of the ninth British Machine Vision Conference, Southampton, 1998.

[5] S.H. Park, I.D. Yun, and S.U. Lee, Color image segmentation based on 3-D clustering: Morphological approach, Pattern Recognition 31 (1998), no. 8, 1061-1076.

[6] S. Tsujimoto and H. Asada, Major components of a complete text reading system, 80 (1992), no. 7, 1133-1149.

[7] J. Zhou and D. Lopresti, Extracting Text from WWW Images, Proceedings of the 4th ICDAR, Ulm, Germany 1 (1997), 248-252.
