US008612207B2

(12) United States Patent
     Sakao et al.

(10) Patent No.: US 8,612,207 B2
(45) Date of Patent: Dec. 17, 2013

(54) TEXT MINING DEVICE, METHOD THEREOF, AND PROGRAM

(75) Inventors: Yousuke Sakao, Tokyo (JP); Kenji Satoh, Tokyo (JP); Susumu Akamine, Tokyo (JP)

(73) Assignee: NEC Corporation, Tokyo (JP)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 1283 days.

(21) Appl. No.: 10/593,375

(22) PCT Filed: Mar. 17, 2005

(86) PCT No.: PCT/JP2005/005440
     § 371 (c)(1), (2), (4) Date: Feb. 26, 2007

(56) References Cited

U.S. PATENT DOCUMENTS

5,111,398 A *    5/1992   Nunberg et al. ........ 704/9
5,170,349 A *   12/1992   Yagisawa et al. ....... 704/9
(Continued)

FOREIGN PATENT DOCUMENTS

JP    4-218872     8/1992
JP   10-198697     7/1998
(Continued)

OTHER PUBLICATIONS

Tatsuya Asai, et al., "Efficient Substructure Discovery from Large Semi-Structured Data", Proc. of 2nd SIAM International Conf. on Data Mining, SDM2002, Apr. 2002.
(Continued)

Primary Examiner: Michael Colucci
(74) Attorney, Agent, or Firm: Scully, Scott, Murphy & Presser PC
(87) PCT Pub. No.: WO2005/091170
     PCT Pub. Date: Sep. 29, 2005

(65) Prior Publication Data
     US 2007/0233458 A1    Oct. 4, 2007

(30) Foreign Application Priority Data
     Mar. 18, 2004 (JP) ........ 2004-079077

(51) Int. Cl.
     G06F 17/27 (2006.01)

(52) U.S. Cl.
     USPC 704/9; 704/260; 704/2; 704/1; 704/6; 704/10; 704/235; 704/270; 704/7; 704/246; 704/251; 704/255; 704/257; 715/234; 715/260; 715/273; 714/37; 709/206; 707/741

(58) Field of Classification Search
     USPC 704/2, 9, 260, 1, 6, 10, 235, 270, 7, 246, 251, 255, 257; 714/37; 715/234, 260, 273; 709/206; 707/741
     See application file for complete search history.

(57) ABSTRACT

Language analysis means 21 analyzes texts read from a text DB 11, and generates a sentence structure as the analysis result. Similar-structure generation adjustment means 25 generates, from an input of an input device, a determination item for determining whether or not the structures are identical for each type of difference between the sentence structures. Similar-structure determination adjustment means 26 generates, from an input of the input device 6, a determination item for determining whether or not the difference between attribute values is ignored for each type of attribute value. Similar-structure generating means 22 generates a similar structure of a partial structure forming the sentence structure obtained by language analysis means 21 in accordance with the determination item from the similar-structure generation adjustment means 25, and sets the generated similar structure as an equivalent class of the partial structure on the generation source. Frequent-similar-pattern detection means 24 ignores the attribute value in accordance with the determination item given from the similar-structure determination adjustment means 26, detects the frequent pattern on the basis of a set of equivalent classes from the similar-structure generating means 22, and outputs the frequent pattern to an output device 3.
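The last stage described in the abstract (ignore selected attribute values, then detect patterns that occur frequently) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the patented implementation: the nested-tuple representation, the names `strip_attributes` and `frequent_patterns`, and the `min_support` threshold do not come from the patent.

```python
from collections import Counter

# Assumed representation: a structure is a nested tuple
# (word, attached_word_info, children), where attached_word_info is a
# string such as "perfect" (or "" when absent) and children is a tuple.

def strip_attributes(node):
    """Return a copy of the structure with attached-word information blanked."""
    word, _info, children = node
    return (word, "", tuple(strip_attributes(c) for c in children))

def frequent_patterns(sentences, ignore_attributes=True, min_support=2):
    """Count in how many sentences each pattern occurs; keep the frequent ones.

    `sentences` is a list; each element is the list of partial structures
    (or their similar structures) extracted from one sentence.
    """
    support = Counter()
    for structures in sentences:
        seen = set()
        for s in structures:
            seen.add(strip_attributes(s) if ignore_attributes else s)
        support.update(seen)  # each sentence votes once per pattern
    return {p: n for p, n in support.items() if n >= min_support}
```

With attribute values ignored, a structure marked "perfect" and an unmarked one fall into the same pattern and are counted together; with `ignore_attributes=False` they are counted separately.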
16 Claims, 23 Drawing Sheets

[Representative drawing: a partial structure and its similar structures drawn as trees, with nodes annotated "INFORMATION ABOUT ATTACHED WORD: PERFECT".]
US 8,612,207 B2
Page 2

(56) References Cited

U.S. PATENT DOCUMENTS

5,424,947            6/1995   Nagao et al. .......... 704/9
5,799,268            8/1998   Boguraev .............. 704/9
5,960,384            9/1999   Brash ................. 704/9
6,272,455            8/2001   Hoshen et al. ......... 704/1
6,278,967            8/2001   Akers et al. .......... 704/2
6,499,026           12/2002   Rivette et al. ........ 1/1
6,741,988            5/2004   Wakefield et al. ...... 707/741
7,051,022            5/2006
7,146,308           12/2006   Lin et al. ............ 704/9
7,523,126            4/2009   Rivette et al. ........ 1/1
2003/0004942         1/2003
2003/0163537         8/2003
2003/0204496        10/2003   Faisal
2004/0059577 A1*     3/2004   Pickering ............. 704/260
2004/0064447 A1*     4/2004   Simske et al. ......... 707/5
2004/0260979 A1*    12/2004   Kumai ................. 714/37

FOREIGN PATENT DOCUMENTS

JP   2000-76274     3/2000
JP   2001-84250     3/2001
JP   2001-134575    5/2001
JP   2002-14990     1/2002

OTHER PUBLICATIONS

Taku Kudo, et al., "Test Mining Using Linguistic Information", Information Processing Society of Japan Kenkyu Hokoku, 2002-NL-148, Mar. 5, 2002, Vol. 2002, No. 20, pp. 65 to 72.

* cited by examiner
U.S. Patent    Dec. 17, 2013    Sheet 1 of 23    US 8,612,207 B2
[Sheet 2 of 23]
[FIG. 3, with an adjacent figure whose label did not survive: sentence-structure trees with nodes KNOW, HE, HAS BEEN DOWN, PRICE, and TYPE A OF VEHICLE; edge labels SURFACE CASE: HA, SURFACE CASE: WO, and SURFACE CASE: GA; node annotations INFORMATION ABOUT ATTACHED WORD: NEGATION and INFORMATION ABOUT ATTACHED WORD: PERFECT.]
[Sheet 3 of 23]
[FIGS. 4C and 4D, with adjacent panels whose labels did not survive: pairs of expressions shown with their sentence-structure trees. Legible pairs include an expression beginning "FAST TYPE OF VEHICLE" paired with "TYPE A OF VEHICLE IS FAST", and "FAST AND CHEAP TYPE A OF VEHICLE" paired with "CHEAP AND FAST TYPE A OF VEHICLE". Other panels show "TYPE A OF VEHICLE HAS HIGH VELOCITY" and a structure involving TYPE A OF VEHICLE and TYPE B OF VEHICLE. Nodes: FAST, HIGH VELOCITY, TYPE A OF VEHICLE, TYPE B OF VEHICLE; edge label SURFACE CASE: HA.]
[Sheet 4 of 23]
[Figure panels whose labels did not survive: pairs of expressions shown with their sentence-structure trees, including "TYPE A OF VEHICLE IS FAST" paired with "TYPE A OF VEHICLE WAS FAST", and a pair involving "ACCELERATION OF TYPE A OF VEHICLE". Nodes: FAST, ACCELERATE, TYPE A OF VEHICLE; edge labels SURFACE CASE: HA and SURFACE CASE: NO; node annotation INFORMATION ABOUT ATTACHED WORD: PERFECT.]
[FIG. 6 (block diagram): a memory device 1 holding a text DB 11; a data processing device containing language analysis means 21, similar-structure generating means 22, and frequent-pattern detection means 23; an output device 3.]
[Sheet 5 of 23]
[Flowchart (figure label did not survive): A1 "ANALYZE LANGUAGE OF DOCUMENT DATA"; A2 "GENERATE SIMILAR STRUCTURE OF SENTENCE STRUCTURE"; A3 "DETECT FREQUENT PATTERN"; A4 "OUTPUT DETECTED PATTERN".]
[FIG. 8 (flowchart): A2-1 "PARALLEL MODIFICATION"; A2-2 "GENERATE PARTIAL STRUCTURE"; A2-3 "NON-DIRECTIONAL BRANCHING OF DIRECTIONAL BRANCH"; "REPLACE SYNONYM"; A2-5 "NON-DIRECTIONAL BRANCHING OF ORDERING TREES"; A2-6 "GENERATE EQUIVALENT CLASS".]
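One of the flowchart boxes generates partial structures of a sentence structure. A minimal Python sketch of one way to enumerate them follows. It assumes, hypothetically, that a partial structure is any connected subtree of the sentence-structure tree (single nodes included) and that a tree is a nested tuple `(word, children)`; the patent's own definition may differ, and the function names are illustrative only.

```python
from itertools import product

# Assumed representation: a structure is a nested tuple (word, children),
# where children is a tuple of structures.

def rooted_subtrees(node):
    """Every connected subtree that keeps `node` as its root."""
    word, children = node
    # For each child: drop it (None) or keep one of its own rooted subtrees.
    options = [[None] + rooted_subtrees(child) for child in children]
    result = []
    for choice in product(*options):
        kept = tuple(c for c in choice if c is not None)
        result.append((word, kept))
    return result

def partial_structures(node):
    """Rooted subtrees of `node` and of all its descendants."""
    _word, children = node
    found = rooted_subtrees(node)
    for child in children:
        found.extend(partial_structures(child))
    return found
```

The number of connected subtrees can grow exponentially with the number of children per node, so a practical system would likely bound the size of the structures it enumerates.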
[Sheet 6 of 23]
[FIG. 9 (block diagram): a data processing device containing language analysis means 21, similar-structure generating means 22, and frequent-similar-pattern detection means 24; an output device 3.]
[FIG. 10 (flowchart): A1 "ANALYZE LANGUAGE OF DOCUMENT DATA"; A2 "GENERATE SIMILAR STRUCTURE OF SENTENCE STRUCTURE"; B3 "DETECT FREQUENT SIMILAR PATTERN"; A4 "OUTPUT DETECTED PATTERN".]
[Sheet 8 of 23]
[FIG. 12 (flowchart): steps A1, C1, C2, and C3 (box text illegible); C4 "DETECT FREQUENT SIMILAR PATTERN"; A4 "OUTPUT DETECTED PATTERN".]
[Sheet 9 of 23]
[FIG. 13 (flowchart): decision boxes "PARALLEL MODIFICATION IS DETERMINED?" (C3-1), "SYNONYM REPLACEMENT IS DETERMINED?", "NON-DIRECTIONAL BRANCHING OF DIRECTIONAL BRANCH IS DETERMINED?", and "NON-ORDERING OF ORDERING TREES IS DETERMINED?"; on YES, the corresponding action boxes "PARALLEL MODIFICATION", "REPLACE SYNONYM", "NON-DIRECTIONAL BRANCHING OF DIRECTIONAL BRANCH", and "NON-ORDERING OF ORDERING TREES"; further boxes "GENERATE PARTIAL STRUCTURE" and "GENERATE EQUIVALENT CLASS". Step labels A2-1 to A2-4 appear but cannot be reliably matched to boxes.]
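FIG. 13 applies each transformation only when the corresponding item "is determined". A minimal Python sketch of that gating follows. The class `DeterminationItems`, the step order in `STEP_ORDER`, and the two toy transformations are assumptions made for illustration; structures are reduced to flat tuples of words to keep the sketch short, which is not how the patent represents them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeterminationItems:
    """Hypothetical yes/no switches, one per type of difference."""
    parallel_modification: bool = True
    synonym_replacement: bool = True
    non_directional_branching: bool = True
    non_ordering: bool = True

# Assumed order of application; the patent's order may differ.
STEP_ORDER = (
    "parallel_modification",
    "synonym_replacement",
    "non_directional_branching",
    "non_ordering",
)

SYNONYMS = {"high velocity": "fast"}  # synonym -> representative word

def replace_synonyms(words):
    return tuple(SYNONYMS.get(w, w) for w in words)

def ignore_order(words):
    return tuple(sorted(words))

# Only two toy transformations are supplied; the others are simply skipped.
TRANSFORMS = {"synonym_replacement": replace_synonyms, "non_ordering": ignore_order}

def generate_similar(structure, items, transforms):
    """Apply, in STEP_ORDER, every transformation that is switched on."""
    current = structure
    for name in STEP_ORDER:
        if getattr(items, name) and name in transforms:
            current = transforms[name](current)
    return current
```

For example, `generate_similar(("high velocity", "cheap"), DeterminationItems(synonym_replacement=False), TRANSFORMS)` leaves "high velocity" in place and only normalizes the order.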
[Sheet 10 of 23]
[FIG. 14 (block diagram): a memory device 1 holding a text DB 11; a data processing device 8; an input device 6; an output device 3; a text mining program 7.]
[Sheet 11 of 23]
[FIG. 15 (example sentences): Sentence 1: "FAST TYPE A OF VEHICLE IS CHEAP"; Sentence 2: "FAST AND CHEAP TYPE A OF VEHICLE"; Sentence 3: "HIGH-VELOCITY TYPE A OF VEHICLE THAT HAS BEEN CHEAP".]
[FIGS. 16A, 16B, and 16C: sentence structures of Sentences 1, 2, and 3, drawn as trees over the nodes CHEAP, FAST, HIGH VELOCITY, and TYPE A OF VEHICLE; edge label SURFACE CASE: HA; node annotation INFORMATION ABOUT ATTACHED WORD: PERFECT.]
[Sheet 12 of 23]
[Synonym dictionary figure (label did not survive): representative word FAST for HIGH VELOCITY; annotation "STEP A2-1, AFTER REPLACED".]
[FIG. 18: parallel modification applied to partial structure 2a-0, yielding a second structure whose label is garbled; nodes TYPE A OF VEHICLE, CHEAP, and FAST.]
[Sheet 13 of 23]
[FIG. 19: step A2-2, partial structures generated over the nodes TYPE A OF VEHICLE, CHEAP, and FAST. Legible labels include partial structures 2a-0, 2b-0, 2d-0, and 2f-0 and structure 2a-1; the remaining labels are garbled.]
[Sheet 14 of 23]
[FIG. 20E, with adjacent panels whose labels did not survive: step A2-3, non-directional branching applied to partial structures over the nodes TYPE A OF VEHICLE, CHEAP, and FAST. Legible pairs include partial structure 2a-0 with similar structure 2a-2 and partial structure 2c-0 with similar structure 2c-1; other labels are garbled. Note in the figure: partial structures 2d-0, 2e-0, and 2f-0 are omitted because they are not modified.]
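The non-directional branching step shown in this figure can be read as discarding the direction of each branch so that two structures with the same connections compare equal. A minimal Python sketch under that reading follows. The nested-tuple representation `(word, children)`, the function names, and the assumption that each word occurs at most once per structure are illustrative choices, not details from the patent.

```python
# Assumed representation: a structure is a nested tuple (word, children).

def directed_edges(node, parent=None):
    """List (dependent, head) pairs for every branch in the structure."""
    word, children = node
    found = []
    if parent is not None:
        found.append((word, parent))
    for child in children:
        found.extend(directed_edges(child, word))
    return found

def undirected_key(node):
    """Comparison key that ignores branch direction.

    Assumes words are unique within a structure; a structure with a
    single node has no branches and maps to an empty key.
    """
    return frozenset(frozenset(edge) for edge in directed_edges(node))
```

Two structures that differ only in which node is the head, such as CHEAP over TYPE A OF VEHICLE versus TYPE A OF VEHICLE over CHEAP, then share one key and can be placed in the same equivalent class.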