US008612207B2

(12) United States Patent
     Sakao et al.

(10) Patent No.: US 8,612,207 B2
(45) Date of Patent: Dec. 17, 2013

(54) TEXT MINING DEVICE, METHOD THEREOF, AND PROGRAM

(75) Inventors: Yousuke Sakao, Tokyo (JP); Kenji Satoh, Tokyo (JP); Susumu Akamine, Tokyo (JP)

(73) Assignee: NEC Corporation, Tokyo (JP)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 1283 days.

(21) Appl. No.: 10/593,375

(22) PCT Filed: Mar. 17, 2005

(86) PCT No.: PCT/JP2005/005440
     § 371 (c)(1), (2), (4) Date: Feb. 26, 2007

(87) PCT Pub. No.: WO2005/091170
     PCT Pub. Date: Sep. 29, 2005

(65) Prior Publication Data
     US 2007/0233458 A1    Oct. 4, 2007

(30) Foreign Application Priority Data
     Mar. 18, 2004 (JP) 2004-079077

(51) Int. Cl.
     G06F 17/27 (2006.01)

(52) U.S. Cl.
     USPC 704/9; 704/260; 704/2; 704/1; 704/6; 704/10; 704/235; 704/270; 704/7; 704/246; 704/251; 704/255; 704/257; 715/234; 715/260; 715/273; 714/37; 709/206; 707/741

(58) Field of Classification Search
     USPC 704/2, 9, 260, 1, 6, 10, 235, 270, 7, 704/246, 251, 255, 257; 714/37; 715/234, 715/260, 273; 709/206; 707/741
     See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
5,111,398 A *    5/1992    Nunberg et al.     704/9
5,170,349 A *    12/1992   Yagisawa et al.    704/9
(Continued)

FOREIGN PATENT DOCUMENTS
JP    4-218872     8/1992
JP    10-198697    7/1998
(Continued)

OTHER PUBLICATIONS
Tatsuya Asai, et al., "Efficient Substructure Discovery from Large Semi-Structured Data", Proc. of 2nd SIAM International Conf. on Data Mining, SDM2002, Apr. 2002.
(Continued)

Primary Examiner: Michael Colucci
(74) Attorney, Agent, or Firm: Scully, Scott, Murphy & Presser PC

(57) ABSTRACT

Language analysis means 21 analyzes texts read from a text DB 11, and generates a sentence structure as the analysis result. Similar-structure generation adjustment means 25 generates, from an input of an input device, a determination item for determining, for each type of difference between the sentence structures, whether or not the structures are treated as identical. Similar-structure determination adjustment means 26 generates, from an input of the input device 6, a determination item for determining, for each type of attribute value, whether or not the difference between attribute values is ignored. Similar-structure generating means 22 generates a similar structure of a partial structure forming the sentence structure obtained by language analysis means 21 in accordance with the determination item from the similar-structure generation adjustment means 25, and sets the generated similar structure as an equivalent class of the partial structure from which it was generated. Frequent-similar-pattern detection means 24 ignores the attribute value in accordance with the determination item given from the similar-structure determination adjustment means 26, detects the frequent pattern on the basis of a set of equivalent classes from the similar-structure generating means 22, and outputs the frequent pattern to an output device 3.

16 Claims, 23 Drawing Sheets

[Representative drawing: a partial structure and three similar structures, each annotated "INFORMATION ABOUT ATTACHED WORD: PERFECT".]
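The abstract describes generating similar structures for each partial structure, treating them as one equivalent class, and detecting frequent patterns over those classes. The following is a minimal Python sketch of that idea, not the patented implementation. The tuple-of-edges representation, the function names, and the choice of "reverse every branch" as the only similarity rule are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy representation (not from the patent): a partial structure
# is a tuple of (head, dependent) word pairs.

def similar_structures(structure):
    """Assumed similarity rule: the structure plus a direction-reversed copy."""
    reversed_copy = tuple((dep, head) for head, dep in structure)
    return {structure, reversed_copy}

def equivalent_class(structure):
    """All structures treated as identical to the given one."""
    return frozenset(similar_structures(structure))

def frequent_patterns(sentences, min_support):
    """Count each equivalent class once per sentence; keep the frequent ones."""
    support = Counter()
    for partial_structures in sentences:
        support.update({equivalent_class(s) for s in partial_structures})
    return {cls: n for cls, n in support.items() if n >= min_support}

# Toy input: each sentence is a list of single-branch partial structures.
sentences = [
    [(("cheap", "type A of vehicle"),), (("type A of vehicle", "fast"),)],
    [(("type A of vehicle", "cheap"),), (("type A of vehicle", "fast"),)],
    [(("type A of vehicle", "cheap"),)],
]
frequent = frequent_patterns(sentences, min_support=3)
```

In this toy input the "cheap" branch appears in both directions across the three sentences, yet it is counted as a single pattern because both directions fall into the same equivalent class.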

US 8,612,207 B2
Page 2

(56) References Cited

U.S. PATENT DOCUMENTS
5,424,947              6/1995     Nagao et al.        704/9
5,799,268              8/1998     Boguraev            704/9
5,960,384              9/1999     Brash               704/9
6,272,455              8/2001     Hoshen et al.       704/1
6,278,967              8/2001     Akers et al.        704/2
6,499,026              12/2002    Rivette et al.      1/1
6,741,988              5/2004     Wakefield et al.    707/741
7,051,022              5/2006
7,146,308              12/2006    Lin et al.          704/9
7,523,126              4/2009     Rivette et al.      1/1
2003/0004942           1/2003
2003/0163537           8/2003
2003/0204496           10/2003    Faisal
2004/0059577 A1 *      3/2004     Pickering           704/260
2004/0064447 A1 *      4/2004     Simske et al.       707/5
2004/0260979 A1 *      12/2004    Kumai               714/37

FOREIGN PATENT DOCUMENTS
JP    2000-76274     3/2000
JP    2001-84250     3/2001
JP    2001-134575    5/2001
JP    2002-14990     1/2002

OTHER PUBLICATIONS
Taku Kudo, et al., "Text Mining Using Linguistic Information", Information Processing Society of Japan Kenkyu Hokoku, 2002-NL-148, Mar. 5, 2002, vol. 2002, No. 20, pp. 65 to 72.

* cited by examiner

U.S. Patent    Dec. 17, 2013    Sheet 1 of 23    US 8,612,207 B2

[Sheet 2 of 23. FIG. 3: sentence-structure diagrams with nodes KNOW, HE, PRICE, TYPE A OF VEHICLE, and HAS BEEN DOWN; branches labeled SURFACE CASE: HA, SURFACE CASE: WO, and SURFACE CASE: GA; node annotations INFORMATION ABOUT ATTACHED WORD: NEGATION and INFORMATION ABOUT ATTACHED WORD: PERFECT.]



[Sheet 3 of 23. Figure panels, including FIG. 4C and FIG. 4D: paired example sentences with their sentence structures. Legible sentences include "FAST TYPE A OF VEHICLE" and "TYPE A OF VEHICLE IS FAST"; "FAST AND CHEAP TYPE A OF VEHICLE" and "CHEAP AND FAST TYPE A OF VEHICLE"; and "TYPE A OF VEHICLE HAS HIGH VELOCITY". Nodes include FAST, HIGH VELOCITY, TYPE A OF VEHICLE, and TYPE B OF VEHICLE; branches are labeled SURFACE CASE: HA.]

[Sheet 4 of 23. Sentence-structure panels for the example sentences "TYPE A OF VEHICLE IS FAST", "TYPE A OF VEHICLE WAS FAST", and a sentence beginning "ACCELERATION OF TYPE A OF VEHICLE"; nodes ACCELERATE, FAST, and TYPE A OF VEHICLE; branches labeled SURFACE CASE: HA and SURFACE CASE: NO; node annotation INFORMATION ABOUT ATTACHED WORD: PERFECT.]

[FIG. 6: block diagram. MEMORY DEVICE containing TEXT DB 11; DATA PROCESSING DEVICE containing LANGUAGE ANALYSIS MEANS 21, SIMILAR-STRUCTURE GENERATING MEANS 22, and FREQUENT-PATTERN DETECTION MEANS 23; OUTPUT DEVICE 3.]

[Sheet 5 of 23. Flowchart: A1 ANALYZE LANGUAGE OF DOCUMENT DATA; A2 GENERATE SIMILAR STRUCTURE OF SENTENCE STRUCTURE; A3 DETECT FREQUENT PATTERN; A4 OUTPUT DETECTED PATTERN.]

[FIG. 8: flowchart with boxes PARALLEL MODIFICATION (A2-1); GENERATE PARTIAL STRUCTURE (A2-2); NON-DIRECTIONAL BRANCHING OF DIRECTIONAL BRANCH (A2-3); REPLACE SYNONYM; NON-DIRECTIONAL BRANCHING OF ORDERING TREES (A2-5); GENERATE EQUIVALENT CLASS (A2-6).]

[Sheet 6 of 23. FIG. 9: block diagram. DATA PROCESSING DEVICE containing LANGUAGE ANALYSIS MEANS 21, SIMILAR-STRUCTURE GENERATING MEANS 22, and FREQUENT-SIMILAR-PATTERN DETECTION MEANS 24; OUTPUT DEVICE 3.]

[FIG. 10: flowchart. A1 ANALYZE LANGUAGE OF DOCUMENT DATA; A2 GENERATE SIMILAR STRUCTURE OF SENTENCE STRUCTURE; B3 DETECT FREQUENT SIMILAR PATTERN; A4 OUTPUT DETECTED PATTERN.]

[Sheet 8 of 23. FIG. 12: flowchart with steps A1, C1, C2, C3, C4 (DETECT FREQUENT SIMILAR PATTERN), and A4 (OUTPUT DETECTED PATTERN).]
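According to the abstract, a determination item decides for each type of attribute value whether differences in that attribute are ignored when frequent similar patterns are detected. A minimal Python sketch of that comparison follows. The node representation, the attribute names `surface_case` and `attached_word`, and the function `pattern_key` are hypothetical and chosen only to mirror the labels that appear in the drawings.

```python
def pattern_key(nodes, ignored_attributes):
    """Build a comparison key, leaving out attribute types marked as ignored."""
    key = []
    for word, attributes in nodes:
        kept = tuple(sorted(
            (name, value) for name, value in attributes.items()
            if name not in ignored_attributes
        ))
        key.append((word, kept))
    return tuple(key)

# "Type A of vehicle is fast" versus "Type A of vehicle was fast":
# the structures differ only in the attached-word information.
is_fast = [("fast", {}), ("type A of vehicle", {"surface_case": "HA"})]
was_fast = [("fast", {"attached_word": "PERFECT"}),
            ("type A of vehicle", {"surface_case": "HA"})]

ignore = {"attached_word"}
same_when_ignored = pattern_key(is_fast, ignore) == pattern_key(was_fast, ignore)
same_when_kept = pattern_key(is_fast, set()) == pattern_key(was_fast, set())
```

With the attached-word attribute ignored, the two structures produce the same key and would be counted as one pattern; with nothing ignored, they stay distinct.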

[Sheet 9 of 23. FIG. 13: flowchart. Decisions: PARALLEL MODIFICATION IS DETERMINED?; SYNONYM REPLACEMENT IS DETERMINED?; NON-DIRECTIONAL BRANCHING OF DIRECTIONAL BRANCH IS DETERMINED?; NON-ORDERING OF ORDERING TREES IS DETERMINED? Actions: PARALLEL MODIFICATION; REPLACE SYNONYM; GENERATE PARTIAL STRUCTURE; NON-DIRECTIONAL BRANCHING OF DIRECTIONAL BRANCH; NON-ORDERING OF ORDERING TREES; GENERATE EQUIVALENT CLASS. Legible step labels: C3-1, C3-2, C3-4, A2-1, A2-2, A2-3, A2-4.]
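The flowchart of FIG. 13 checks, one by one, whether each kind of similar-structure generation has been selected before applying it. One way to picture this is a set of boolean switches that control a normalization step. The sketch below is an illustration under assumptions: the class `DeterminationItems`, its field names, and the edge-list representation are hypothetical, and parallel modification is left out for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeterminationItems:
    """Hypothetical switches, one per kind of difference to be disregarded."""
    replace_synonyms: bool = True
    undirected_branches: bool = True
    unordered_trees: bool = True

def normalize(edges, items, synonyms):
    """Apply only the selected transformations to a list of (head, dependent) pairs."""
    out = []
    for head, dep in edges:
        if items.replace_synonyms:
            head, dep = synonyms.get(head, head), synonyms.get(dep, dep)
        pair = (head, dep)
        if items.undirected_branches:
            pair = tuple(sorted(pair))  # drop branch direction
        out.append(pair)
    if items.unordered_trees:
        out.sort()  # drop sibling order
    return tuple(out)

SYNONYMS = {"high velocity": "fast"}  # synonym -> representative word
a = [("type A of vehicle", "cheap"), ("type A of vehicle", "fast")]
b = [("high velocity", "type A of vehicle"), ("cheap", "type A of vehicle")]

all_on = DeterminationItems()
no_synonyms = DeterminationItems(replace_synonyms=False)
same_class = normalize(a, all_on, SYNONYMS) == normalize(b, all_on, SYNONYMS)
same_without_synonyms = normalize(a, no_synonyms, SYNONYMS) == normalize(b, no_synonyms, SYNONYMS)
```

With every switch on, the two edge lists normalize to the same value; turning synonym replacement off is enough to keep them apart.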

[Sheet 10 of 23. FIG. 14: block diagram. MEMORY DEVICE 1 containing TEXT DB 11; DATA PROCESSING DEVICE 8; INPUT DEVICE 6; TEXT MINING PROGRAM 7; OUTPUT DEVICE 3.]

[Sheet 11 of 23.]

FIG. 15
SENTENCE 1: FAST TYPE A OF VEHICLE IS CHEAP
SENTENCE 2: FAST AND CHEAP TYPE A OF VEHICLE
SENTENCE 3: HIGH-VELOCITY TYPE A OF VEHICLE THAT HAS BEEN CHEAP

[FIG. 16A, FIG. 16B, FIG. 16C: sentence structures of sentences 1, 2, and 3. Nodes include CHEAP, TYPE A OF VEHICLE, FAST, and HIGH VELOCITY; one branch is labeled SURFACE CASE: HA; one node carries INFORMATION ABOUT ATTACHED WORD: PERFECT.]

[Sheet 12 of 23. SYNONYM DICTIONARY table with REPRESENTATIVE WORD "FAST" and the entry "HIGH VELOCITY".]

[FIG. 18: STEP A2-1. PARTIAL STRUCTURE 2a-0 and STRUCTURE 2a-1, related by PARALLEL MODIFICATION; nodes TYPE A OF VEHICLE, CHEAP, and FAST.]

[Sheet 13 of 23. FIG. 19: STEP A2-2. Partial structures generated from the sentence structure, with legible labels including PARTIAL STRUCTURE 2a-0, 2b-0, 2d-0, and 2f-0, and STRUCTURE 2a-1; nodes TYPE A OF VEHICLE, CHEAP, and FAST.]
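FIG. 19 concerns the step that generates partial structures from a sentence structure. As a rough illustration, the Python sketch below enumerates every connected sub-structure of a small tree by brute force. The node and edge lists, the function names, and the brute-force strategy are assumptions for illustration; the source does not state how the partial structures are enumerated.

```python
from itertools import combinations

def is_connected(nodes, edges):
    """Check that the chosen nodes form one connected piece of the tree."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for a, b in edges:
            # Treat each branch as walkable in both directions.
            for here, there in ((a, b), (b, a)):
                if here == current and there in nodes and there not in seen:
                    seen.add(there)
                    stack.append(there)
    return seen == nodes

def partial_structures(nodes, edges):
    """Return (node subset, induced edges) for every connected subset."""
    found = []
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            if is_connected(subset, edges):
                kept = [(a, b) for a, b in edges if a in subset and b in subset]
                found.append((subset, kept))
    return found

# Toy chain: "type A of vehicle" <- "cheap" <- "fast"
nodes = ["type A of vehicle", "cheap", "fast"]
edges = [("type A of vehicle", "cheap"), ("cheap", "fast")]
structures = partial_structures(nodes, edges)
```

For this three-node chain the sketch yields six partial structures: three single nodes, two single branches, and the whole chain. The pair "type A of vehicle" and "fast" is skipped because the two nodes are not directly connected. Brute force over subsets is only practical for very small trees.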

[Sheet 14 of 23. STEP A2-3, panels including FIG. 20E: NON-DIRECTIONAL BRANCHING applied to partial structures to produce similar structures. Legible pairs: PARTIAL STRUCTURE 2a-0 and SIMILAR STRUCTURE 2a-2; PARTIAL STRUCTURE 2c-0 and SIMILAR STRUCTURE 2c-1; SIMILAR STRUCTURE 2a-1 and SIMILAR STRUCTURE 2a-3; PARTIAL STRUCTURE 2g-0 and SIMILAR STRUCTURE 2g-1; PARTIAL STRUCTURE 2b-0 and SIMILAR STRUCTURE 2b-1. Nodes TYPE A OF VEHICLE, CHEAP, and FAST. Figure note: "PARTIAL STRUCTURES 2d-0, 2e-0, AND 2f-0 ARE OMITTED BECAUSE WITHOUT MODIFICATION".]
