Segmentation of CAPTCHAs Based on Complex Networks

Kun Fang, Zhan Bu, and Zheng You Xia

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China

Abstract. A CAPTCHA is a simple test that is designed to be easily generated by computers and easily solved by humans, but difficult for computers to solve. It is now almost a standard security technology. The most widely deployed CAPTCHAs are text-based schemes, but for such CAPTCHAs, segmenting connected and distorted characters is still an unsolved problem. In this paper, we propose a Community Divided Model algorithm, based on complex networks, to segment these CAPTCHAs. To evaluate the effectiveness of the proposed segmentation algorithm, we conducted several experiments on a database of CAPTCHAs collected randomly from the Internet. The results show that the proposed algorithm is effective at segmenting two or more connected and distorted characters.

Keywords: CAPTCHA, complex networks, segmentation, connected characters.

1 Introduction

CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) is a program that generates and grades tests that are human solvable but intended to be beyond the capabilities of current computer programs [1]. This technology is now almost a standard security mechanism and is widely used in the registration of free email and forum accounts to guard against registrations by undesirable or malicious bots and against spam. The most widely deployed CAPTCHAs are text-based schemes, which typically require users to recognize connected and distorted characters [2]. It is well known that CAPTCHA segmentation is much more difficult than recognition: machine learning algorithms can efficiently solve the general recognition problem, but currently no general algorithm can segment all CAPTCHAs effectively [3]. If a scheme is vulnerable to segmentation, it can be broken easily. Convolutional neural networks have been widely used for recognizing single characters [4]. Chellapilla et al. [3] showed that computers can recognize single characters even at high distortion and clutter settings, using a sequence of character transformations. A commonly accepted goal for CAPTCHA design is that automated attacks should not be more than 0.01% successful, while the human success rate should be at least 90% [5].

J. Lei et al. (Eds.): AICI 2012, LNAI 7530, pp. 735–743, 2012. © Springer-Verlag Berlin Heidelberg 2012


K. Fang, Z. Bu, and Z.Y. Xia

Therefore, various CAPTCHA segmentation mechanisms have been proposed. Mori and Malik [6] broke the EZ-Gimpy and Gimpy CAPTCHAs with sophisticated object recognition algorithms. Chellapilla et al. [7] used machine learning algorithms to attack a number of early CAPTCHAs. Huang et al. [8] used a projection-based segmentation algorithm to break MSN and Yahoo CAPTCHAs. Bursztein et al. [9] described the strengths and weaknesses of text-based CAPTCHAs in detail. Fig. 1 shows an attack on a Microsoft CAPTCHA using the vertical segmentation, color filling segmentation and thick arc removal algorithms [2].

Fig. 1. (a) Original image of Microsoft CAPTCHA. (b) Completely segmented image.

The CAPTCHAs attacked by Yan [2] and Huang [8] mainly used straight and curved lines as image clutter to confuse attacking programs. When CAPTCHAs with connected and distorted characters are encountered, however, these algorithms are not very effective. In this study, we therefore propose a Community Divided Model segmentation algorithm, based on complex networks, to segment such connected and distorted characters. Our experimental results indicate that this novel method is effective at segmenting connected and distorted characters. The remainder of this paper is organized as follows. In Section 2, we discuss our motivation and create the database. Section 3 presents our algorithm based on complex networks. Experimental results and analysis are discussed in Section 4. We draw conclusions and outline further work in Section 5.

2 Motivation and Database Creation

In the last decade, complex networks [10,11] have become a new direction of research. Community structure is an important property of complex networks. Girvan and Newman [12] highlighted the property of community structure, in which network nodes are joined together in tightly knit groups, between which there are only looser connections. They tested their method on computer-generated and real-world graphs and found that it detected the structure with high sensitivity and reliability. A CAPTCHA has a fixed length, which means that the number of communities is known in advance; moreover, the pixels of every single character are tightly connected, while touching characters are only loosely connected to each other. In most cases, adjacent connected and distorted characters touch in the middle of the core zone. The idea of this method is shown in Fig. 2. In Fig. 2(a), the CAPTCHA consists of three characters that are connected to each other. We can use this characteristic to divide them into three communities: red is ‘q’, green is ‘o’ and black is ‘s’; the result is shown in Fig. 2(b). In this paper, we modify the Girvan and Newman algorithm and propose a Community Divided Model that can segment connected and distorted characters.

Fig. 2. (a) A CAPTCHA collected from the Internet. (b) Community structure of “qos”. The adjacency matrix is built from the eight neighbors of each foreground pixel, and the three communities were detected by the Girvan and Newman algorithm.

To determine whether the Community Divided Model can be used as an effective method to segment CAPTCHAs, we randomly collected hundreds of CAPTCHAs from Authorize, 360buy, Tianya, Windows Live and Taobao [13,14,15,16,17,18], where the CAPTCHAs were used for account registration. Fig. 3 shows some examples. Our method does not consider characters with a dot shape, “i” and “j”, since El Ahmad et al. [5] explained in detail how to detect characters that contain a dot.

Fig. 3. Examples of CAPTCHAs from Authorize (a), 360buy (b), Tianya (c), Windows Live (d) and Taobao (e)

From our database, we find that these CAPTCHAs have the following features. Some backgrounds are lightly spotted with different colors. Some characters are tilted, squeezed, bent, moved up and down, connected or distorted. All of them contain digits and lowercase or uppercase letters, and the number of characters is fixed.

3 Proposed Segmentation Algorithm

3.1 Community Divided Model

The connected graph $G(V, E)$ is composed of the vertex set $V(G)$ and the edge set $E(G)$. Vertices $v_i$ and $v_j$ are elements of $V$, and $e_{ij}$, their edge weight, is an element of $E$; the initial value of every $e_{ij}$ is 0. Let the adjacency list $L[i]$ ($i = 1 \dots n$) hold the neighbors of $v_i$. Queue $Q[k]$ holds the elements of the $k$-th non-connected node group. $N$ is the number of communities. Since each CAPTCHA has a fixed length, the algorithm we propose for identifying communities is as follows:

1. Calculate the adjacency list of all vertices: Use the eight neighbors of each foreground pixel on the thinned pattern to search the entire image. If $v_j$ is one of the eight neighbors of $v_i$ ($j \neq i$), then $v_j$ is adjacent to $v_i$ and is put into $L[i]$.
2. Calculate the number of non-connected node groups: According to the adjacency list $L$, compute all the connected node groups and add them to $Q$. If the number of non-connected node groups equals $N$, the algorithm stops.
3. Calculate the betweenness score for all edges: Use the DFS algorithm to calculate the shortest paths among all vertices. Each time a shortest path passes through edge $e_{ij}$, its score is increased by 1. Then find the highest betweenness score among the edges.
4. Remove the highest-scoring edge: Remove the edge $e_{ij}$ with the highest betweenness score and update the edge set $E(G)$. At the same time, delete the corresponding entry from the adjacency list $L[i]$.
5. Repeat from step 2, recalculating the betweenness scores of all edges affected by the removal, until the system breaks up into $N$ non-connected node groups.

3.2 Pre-processing

Binarization is the first and most important step: whether the segmentation step works well depends on the quality of the bi-level image. We therefore first convert the original image into a binary image using the standard thresholding method: pixels whose color value is above a heuristically predetermined threshold are converted to black, and those below it are converted to white [5]. The image sometimes contains noise points, so we scan the entire image, and if the number of pixels in a connected region is larger than a threshold, we regard the region as noise and remove it. We also care about processing time. To reduce it, we use Zhang’s algorithm [19] to thin the images. Fig. 4 shows the final images after applying the standard thresholding method and the thinning algorithm to the images in Fig. 3.

Fig. 4. Examples from Authorize, 360buy, Tianya, Windows Live and Taobao. (a) Resulting images after the application of the standard thresholding method. (b) Resulting images after the application of the thinning algorithm.
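The thresholding and noise-scanning steps of Section 3.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `binarize`, `connected_regions` and `remove_noise` are hypothetical, images are plain lists of gray values, and the noise criterion is passed in as a predicate on region size so that the threshold and its direction stay adjustable. Thinning with Zhang's algorithm [19] is not covered here.

```python
from collections import deque

# Offsets of the eight neighbors of a pixel (row, column).
NEIGHBORS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]


def binarize(gray, threshold):
    """Map values above the threshold to 1 (black) and the rest to 0 (white),
    following the rule as stated in the text."""
    return [[1 if value > threshold else 0 for value in row] for row in gray]


def connected_regions(binary):
    """Return every 8-connected region of 1-pixels as a list of (row, col)."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = set()
    regions = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] != 1 or (r, c) in seen:
                continue
            seen.add((r, c))
            queue = deque([(r, c)])
            region = []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy, dx in NEIGHBORS_8:
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and binary[ny][nx] == 1 and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def remove_noise(binary, is_noise):
    """Clear every region whose pixel count satisfies the is_noise predicate
    (a hypothetical hook; the size threshold itself is heuristic)."""
    cleaned = [row[:] for row in binary]
    for region in connected_regions(binary):
        if is_noise(len(region)):
            for y, x in region:
                cleaned[y][x] = 0
    return cleaned


# Toy 3x4 gray image: a vertical stroke on the left and an isolated dot.
gray = [[200, 10, 10, 10],
        [200, 10, 10, 10],
        [200, 10, 10, 180]]
binary = binarize(gray, threshold=128)
sizes = sorted(len(region) for region in connected_regions(binary))
```

On this toy image, `sizes` lists one region per connected blob, which is the quantity the noise scan compares against its threshold.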

3.3 Segmentation

We use the proposed Community Divided Model to segment the characters. If characters are connected, we draw a red line to mark the edge with the highest score, which should be removed; removing it yields the segmented characters. Because most Authorize and 360buy CAPTCHAs are non-connected and already form four or five communities, we obtain their segmentation results directly. Tianya, Windows Live and Taobao CAPTCHAs have connected characters, so we draw a red line to indicate the removal point. Fig. 5 shows the removal points with red lines and Fig. 6 shows the final segmentation results. The original images are shown in Fig. 4(b).

Fig. 5. Removal points: the red line indicates the edge with the highest score

Fig. 6. Results of segmentation of the Authorize, 360buy, Tianya, Windows Live and Taobao CAPTCHAs

From the above segmentation results, we find that this method performs well on both non-connected and connected characters, even when the characters are distorted. The red lines indicate the segmentation points, which are the highest-scoring edges that should be removed. These experimental results are basically consistent with the actual character boundaries.
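The procedure of Section 3.1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation (the experiments used VC6.0). All function names are hypothetical. Breadth-first search is swapped in for the DFS named in step 3, shortest paths of equal length share their score in the Girvan and Newman fashion, and all scores are recomputed after each removal instead of only the affected ones.

```python
from collections import deque

NEIGHBORS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]


def adjacency_list(skeleton):
    """Step 1: link each foreground pixel to its foreground 8-neighbors."""
    pixels = {(r, c) for r, row in enumerate(skeleton)
              for c, value in enumerate(row) if value}
    return {p: {(p[0] + dy, p[1] + dx) for dy, dx in NEIGHBORS_8
                if (p[0] + dy, p[1] + dx) in pixels} for p in pixels}


def components(adj):
    """Step 2: collect the connected node groups."""
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        group, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            group.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        groups.append(group)
    return groups


def edge_betweenness(adj):
    """Step 3: score each edge by the shortest paths that pass through it."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        dist, paths, parents = {source: 0}, {source: 1.0}, {source: []}
        order, queue = [], deque([source])
        while queue:                      # BFS from the source
            node = queue.popleft()
            order.append(node)
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    paths[nb], parents[nb] = 0.0, []
                    queue.append(nb)
                if dist[nb] == dist[node] + 1:
                    paths[nb] += paths[node]
                    parents[nb].append(node)
        credit = {node: 0.0 for node in order}
        for node in reversed(order):      # push path counts back to the source
            for parent in parents[node]:
                share = paths[parent] / paths[node] * (1.0 + credit[node])
                score[frozenset((parent, node))] += share
                credit[parent] += share
    return score


def community_divide(skeleton, n_communities):
    """Steps 2 to 5: remove top-scoring edges until n_communities groups exist."""
    adj = adjacency_list(skeleton)
    groups = components(adj)
    while len(groups) < n_communities:
        score = edge_betweenness(adj)
        if not score:                     # no edges left to remove
            break
        u, v = max(score, key=score.get)  # step 4: highest-scoring edge
        adj[u].discard(v)
        adj[v].discard(u)
        groups = components(adj)
    return groups


# Two 2x2 blobs joined by a two-pixel bridge; N = 2 communities.
image = [[1, 1, 0, 0, 1, 1],
         [1, 1, 1, 1, 1, 1]]
groups = community_divide(image, 2)
```

In this toy image the edge between the two bridge pixels carries every shortest path between the blobs, so it is the one removed, and each resulting group contains one blob plus its half of the bridge.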

4 Experiment Results and Analysis

To evaluate the effectiveness of our technique, we collected more CAPTCHAs and designed three experiments of different difficulty:

1) Select 100 CAPTCHAs randomly from each of 360buy and Authorize. Every 360buy CAPTCHA has four characters and every Authorize CAPTCHA has five characters; most characters are non-touching, except for a few cases in which two characters touch.
2) Select 100 CAPTCHAs randomly from each of Tianya and Windows Live. Every Tianya CAPTCHA has four characters and every Windows Live CAPTCHA has six characters; in most of them, two characters touch.
3) Select 100 CAPTCHAs randomly from Taobao. Every CAPTCHA has four characters; in most of them, more than two characters touch.

In 2008, Huang et al. [8] proposed a projection-based segmentation algorithm to break MSN and Yahoo CAPTCHAs. When the projection values in the sliding window were smaller than the threshold, the algorithm marked the position and erased these clutter items. Yan [2] proposed a low-cost attack on a CAPTCHA designed by Microsoft in 2008; to segment connected characters, they worked out the width of the object and then divided it vertically into parts of equal width. Here, we repeat the above three experiments with Huang’s and Yan’s algorithms to compare them with the proposed algorithm. If an algorithm can segment 60 out of 100 images, its segmentation accuracy is 60/100 = 0.6, or 60%. In this experiment, the number of images with two connected characters is 153, with three connected characters 35, and with four connected characters 72. Table 1 displays the results of the experiment for Huang’s, Yan’s and the proposed algorithm. The results show that the proposed algorithm outperforms Huang’s and Yan’s algorithms.

Table 1. Segmentation results for different numbers of connected characters using Huang’s, Yan’s and the proposed algorithm

                 Huang’s algorithm   Yan’s algorithm   Proposed algorithm
2 characters     39.22%              60.13%            66.01%
3 characters     22.86%              22.86%            31.43%
≥4 characters    9.72%               12.50%            29.17%
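The accuracy definition above amounts to a single division, sketched below with hypothetical helper names. The second helper back-computes the whole number of images that a reported percentage corresponds to; the paper reports only percentages, so any count obtained this way is an inference from Table 1 and the stated group sizes, not a reported figure.

```python
def segmentation_accuracy(segmented, total):
    """Fraction of images that were segmented correctly."""
    return segmented / total


def implied_count(rate_percent, total):
    """Whole number of images that a reported percentage corresponds to
    (an inference, since only percentages are reported)."""
    return round(rate_percent / 100 * total)


# The worked example from the text: 60 of 100 images.
print(f"{segmentation_accuracy(60, 100):.0%}")   # 60%

# Proposed algorithm, two connected characters: 66.01% of 153 images.
print(implied_count(66.01, 153))                 # 101
```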

These algorithms were also used to compute segmentation rates for the Authorize, 360buy, Tianya, Windows Live and Taobao systems. Table 2 displays the results. For the Authorize system, the segmentation rate of the proposed algorithm is the same as Huang’s and 2 percentage points higher than Yan’s. Similarly, for the 360buy system, the segmentation rate of the proposed algorithm is equal to Yan’s and 5 percentage points higher than Huang’s. For the Tianya, Windows Live and Taobao systems, the segmentation rates of the proposed algorithm are higher than those of both other algorithms. Therefore, the proposed algorithm is more effective than Huang’s and Yan’s at segmenting two or more connected and distorted characters. Fig. 7 shows an example comparing the proposed algorithm with Huang’s and Yan’s algorithms.

Table 2. Segmentation rates of Huang’s, Yan’s and the proposed algorithm

                     Authorize   360buy   Tianya   Windows Live   Taobao
Huang’s algorithm    98%         90%      58%      40%            15%
Yan’s algorithm      96%         95%      67%      46%            22%
Proposed algorithm   98%         95%      71%      55%            33%

In addition, the average run time is another important factor when segmenting a CAPTCHA. The average run times of the proposed algorithm on the Authorize, 360buy, Tianya, Windows Live and Taobao systems are 0.571 s, 0.099 s, 6.748 s, 0.556 s and 10.472 s, respectively. The differences arise because most 360buy and Authorize characters are non-connected, whereas most Tianya and Windows Live CAPTCHAs have two connected characters and Taobao CAPTCHAs have more than two connected characters. We therefore conclude that the more characters are connected, the more time the segmentation takes. The experiments were carried out on a PC with a 2.29 GHz Core 2 processor and 2 GB RAM, using VC6.0 on Windows XP.

Fig. 7. An example comparing the proposed algorithm with Huang’s and Yan’s. (b) Failed segmentation by Huang’s algorithm. (c) Failed segmentation by Yan’s algorithm. (d) Successful segmentation by the proposed algorithm.
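To make the comparison more concrete, the two baseline ideas summarized earlier in this section can be sketched as follows. These are simplified readings of the one-sentence descriptions given in this paper, not the published algorithms of [8] or [2]: the function names are hypothetical, the sliding window of the projection-based method is reduced to a per-column test, and the equal-width split rounds boundaries to whole columns.

```python
def column_projection(binary):
    """Number of foreground pixels in each column of a 0/1 image."""
    return [sum(column) for column in zip(*binary)]


def low_projection_columns(binary, threshold):
    """Columns whose projection falls below the threshold.

    A per-column simplification of the projection-based idea: such
    columns are candidate positions for a cut.
    """
    return [x for x, count in enumerate(column_projection(binary))
            if count < threshold]


def equal_width_split(binary, n_chars):
    """Cut the object's column span into n_chars vertical slices of
    (nearly) equal width, as in the width-based idea."""
    occupied = [x for x, count in enumerate(column_projection(binary))
                if count > 0]
    if not occupied:
        return []
    left, right = occupied[0], occupied[-1] + 1
    # Integer boundaries, so widths differ by at most one column.
    bounds = [left + (right - left) * i // n_chars
              for i in range(n_chars + 1)]
    return [[row[bounds[i]:bounds[i + 1]] for row in binary]
            for i in range(n_chars)]


# Toy object spanning columns 1..6, split into three characters.
binary = [[0, 1, 1, 1, 1, 1, 1, 0],
          [0, 1, 0, 1, 1, 0, 1, 0]]
parts = equal_width_split(binary, 3)
```

Both sketches cut along straight vertical lines, whereas the Community Divided Model removes individual pixel links wherever the betweenness score is highest.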

To further evaluate the effectiveness of the proposed segmentation method, we collected more CAPTCHAs from different websites. The results of the experiment, illustrated in Fig. 8, indicate that the proposed algorithm can also effectively segment these connected and distorted characters.

Fig. 8. Some CAPTCHAs and their segmentation results using the proposed algorithm. (a) Original images. (b) Removal points. (c) Results of segmentation.

However, our method also fails to segment some images. In Fig. 9(a), the characters “5” and “c”, and “7” and “e”, overlap; the red line marks the highest-scoring edge, but removing it fails to segment the image. In Fig. 9(b), the characters “E” and “r”, and “M” and “c”, also fail to be segmented, because “E” has many more pixels than “r” and “M” has many more pixels than “c”. Therefore, our method is ineffective when characters overlap or when one character has many more pixels than another.

Fig. 9. Failed segmentation results. (a) Overlapping characters. (b) One character has many more pixels than another.

5 Conclusion and Further Work

In this paper, we proposed a Community Divided Model algorithm, based on complex networks, to segment CAPTCHAs. The experimental results show that the algorithm is more effective than the compared algorithms at segmenting two or more connected or distorted characters. With the growing use of CAPTCHAs, our research can give web designers some inspiration for improving the security of their websites. In future work, we plan to study better methods for segmenting overlapping characters and cases in which one character has many more pixels than another. Segmenting text-based Chinese CAPTCHAs is also part of our future work.

References

1. von Ahn, L., Blum, M., Hopper, N.J., Langford, J.: CAPTCHA: Using Hard AI Problems for Security. In: Biham, E. (ed.) EUROCRYPT 2003. LNCS, vol. 2656, pp. 294–311. Springer, Heidelberg (2003)
2. Yan, J., El Ahmad, A.S.: A low-cost attack on a Microsoft CAPTCHA. In: 15th ACM Conference on Computer and Communications Security (2008)
3. Chellapilla, K., Larson, K., Simard, P., Czerwinski, M.: Computers beat humans at single character recognition in reading based Human Interaction Proofs (HIPs). In: 2nd Conference on Email and Anti-Spam (2005)
4. Simard, P.Y., Steinkraus, D., Platt, J.C.: Best Practice for Convolutional Neural Networks Applied to Visual Document Analysis. In: 7th International Conference on Document Analysis and Recognition, pp. 958–962. IEEE Computer Society, Los Alamitos (2003)
5. El Ahmad, A.S., Yan, J., Tayara, M.: The Robustness of Google CAPTCHAs. Technical report, Newcastle University (2011)
6. Mori, G., Malik, J.: Recognising objects in adversarial clutter: breaking a visual CAPTCHA. In: IEEE Conference on Computer Vision & Pattern Recognition (2003)
7. Chellapilla, K., Larson, K., Simard, P.Y., Czerwinski, M.: Building Segmentation Based Human-Friendly Human Interaction Proofs (HIPs). In: Baird, H.S., Lopresti, D.P. (eds.) HIP 2005. LNCS, vol. 3517, pp. 1–26. Springer, Heidelberg (2005)
8. Huang, S., Lee, Y., Bell, G., Ou, Z.: A projection-based segmentation algorithm for breaking MSN and YAHOO CAPTCHAs. In: Proceedings of the 2008 International Conference of Signal and Image Engineering, London, UK (2008)
9. Bursztein, E., Martin, M., Mitchell, C.: Text-based CAPTCHA Strengths and Weaknesses. In: 18th ACM Conference on Computer and Communications Security (2011)
10. Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., Hwang, D.-U.: Complex networks: Structure and dynamics. Phys. Rep. 424(4-5), 175–308 (2006)
11. Freeman, L.C.: A set of measures of centrality based on betweenness. Sociometry 40, 35–41 (1977)
12. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. PNAS 99, 7821–7826 (2002)
13. Bursztein, E., Bethard, S., Fabry, C., Mitchell, J., Jurafsky, D.: How good are humans at solving CAPTCHAs? a large scale evaluation. In: 2010 IEEE Symposium on Security and Privacy (SP), pp. 399–413 (2010)
14. Authorize CAPTCHAs, https://account.authorize.net/ui/themes/anet/Welcome/ForgottenLoginID.aspx (accessed April 2012)
15. 360buy CAPTCHAs, https://passport.360buy.com/new/registpersonal.aspx (accessed April 2012)
16. Tianya CAPTCHAs, http://passport.tianya.cn/register (accessed March 2012)
17. Windows Live CAPTCHAs, https://signup.msn.cn/register (accessed April 2012)
18. Taobao CAPTCHAs, http://member1.taobao.com/member/new_register.jhtml (accessed April 2012)
19. Zhang, T.Y., Suen, C.Y.: A fast parallel algorithm for thinning digital patterns. Communications of the ACM 27(3), 236–239 (1984)