JOURNAL OF COMPUTING, VOLUME 4, ISSUE 6, JUNE 2012, ISSN (Online) 2151-9617, WWW.JOURNALOFCOMPUTING.ORG
A Simple Method for Estimating Term Mutual Information

D. Cai and T.L. McCluskey
School of Computing and Engineering, University of Huddersfield, UK, HD1 3BE

Abstract—The ability to formally analyze and automatically measure statistical dependence of terms is a core problem in many areas of science. One of the commonly used tools for this is the expected mutual information (MI) measure. However, it seems that MI methods have not achieved their potential. The main problem in using MI of terms is obtaining actual probability distributions estimated from training data, as the true distributions are invariably not known. This study focuses on this problem and proposes a novel but simple method for estimating probability distributions. Estimation functions are introduced; the mathematical meaning of the functions is interpreted and the verification conditions are discussed. Examples are provided to illustrate the possibility of failure of applying the method if the verification conditions are not satisfied. An extension of the method is also considered.

Index Terms—Information analysis and extraction, dependence and relatedness of terms, statistical semantic analysis.
1 INTRODUCTION
THE ability to formally analyze and automatically measure statistical dependence (relatedness, proximity, association, similarity) of terms in textual documents is a core problem in many areas of science, such as feature extraction and selection, concept learning and clustering, document representation and query formulation, and text analysis and data mining. Solving this problem has been a technical barrier for a variety of practical mathematical applications. One of the most commonly used tools of analysis and measurement is the expected mutual information (MI) measure drawn from information theory [10], [16]. Many studies have used the measure for a variety of tasks in, for instance, feature selection [1], [2], [11], [15], document classification [18], face image clustering [14], multimodality image registration [13], and information retrieval [6], [7], [8], [9], [14]. However, it seems that MI methods have not achieved their potential. The main problem we face in using the expected MI measure is obtaining actual probability distributions: the true distributions are invariably not known, and we have to estimate them from training data. This work explores techniques of estimation.

To set out this study clearly, let us first introduce the concept of a term state value distribution. A term is usually thought of as having the states "present" or "absent" in a document. Thus, for an arbitrary term t, it will be convenient to introduce a variable δ taking values from the set Ω = {1, 0}, where δ = 1 expresses that t is present and δ = 0 expresses that t is absent. Denote t_δ = t, t̄ when δ = 1, 0, respectively. We call Ω a state value space, and each element in Ω a state value, of t. Similarly, for an arbitrary term pair (t_i, t_j), we introduce a variable pair (δ_i, δ_j) taking values from the set Ω × Ω = {(1,1), (1,0), (0,1), (0,0)}. We call Ω × Ω a state value space, and each element in Ω × Ω a state value pair, of (t_i, t_j).

Let D be a collection of documents (training data), and V a vocabulary of terms used to index individual documents in D. Denote by V_d ⊆ V the set of terms occurring in document d ∈ D. Thus, for each term t occurring in d, its state value distribution is

$$P_d(\delta) = P(t_\delta \mid d) \qquad (\delta \in \Omega)$$

Obviously, each term t ∈ V_d is matched to a state value distribution, and there are in total |V_d| state value distributions for document d.

There exists statistical dependence between two terms, t_i and t_j, if the state value of one of them provides mutual information about the probability of the state value of the other. Losee [12] showed that there is a relationship between the frequencies (or probabilities) of terms and the MI of terms. Therefore, term t_i taking some state value δ_i (say δ_i = 1) should be looked upon as complex, because the other state value (say δ_i = 0) of t_i, and the state values of many other terms (i.e., all terms t ∈ V_d − {t_i}), may be dependent on this δ_i. Mathematically, for two arbitrary terms t_i, t_j ∈ V_d, the expected mutual information [10] about the probabilities of the state value pair (δ_i, δ_j) of term pair (t_i, t_j) can be expressed by
$$I_d(\delta_i, \delta_j) = H(\delta_i) - H(\delta_i \mid \delta_j) = H(\delta_j) - H(\delta_j \mid \delta_i) = \sum_{\delta_i, \delta_j \in \Omega} P_d(\delta_i, \delta_j) \log \frac{P_d(\delta_i, \delta_j)}{P_d(\delta_i)\, P_d(\delta_j)}$$
where H(δ_i) is the entropy of δ_i, measuring the uncertainty about δ_i, and H(δ_i | δ_j) is the conditional entropy of δ_i, measuring the uncertainty about δ_i given knowledge of δ_j. Thus, I_d(δ_i, δ_j) measures the amount of information that δ_j provides about δ_i, and vice versa.

The estimation of the distributions P_d(δ) and P_d(δ_i, δ_j) required in I_d(δ_i, δ_j) is crucial, and remains an open issue, for effectively distinguishing potentially dependent term pairs from the many others; it is therefore the main concern of our current study. In Section 2, we introduce estimation functions, interpret the mathematical meaning of the functions, and discuss verification conditions. In Section 3, we provide examples to clarify the idea of our method. Section 4 considers an extension of our method, and conclusions are drawn in Section 5. Some mathematical details are given in the Appendix.
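To make the computation concrete, the following minimal Python sketch (our own illustration, not part of the paper; the function name and data layout are assumptions) evaluates I_d(δ_i, δ_j) from a 2 × 2 joint distribution and its marginals:

```python
import math

def expected_mi(joint, p_i, p_j):
    """Expected mutual information I_d(delta_i, delta_j) for binary term states.

    joint -- dict mapping state pairs (delta_i, delta_j) to P_d(delta_i, delta_j)
    p_i   -- P_d(delta_i = 1);  p_j -- P_d(delta_j = 1)
    """
    marg_i = {1: p_i, 0: 1.0 - p_i}   # marginal distribution of delta_i
    marg_j = {1: p_j, 0: 1.0 - p_j}   # marginal distribution of delta_j
    mi = 0.0
    for (si, sj), p in joint.items():
        if p > 0.0:                   # 0 * log(0) is taken to be 0 by convention
            mi += p * math.log(p / (marg_i[si] * marg_j[sj]))
    return mi
```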
2 ESTIMATION
In practical application, the state value distributions may be estimated from training data. The estimation of the joint state value distribution, P_d(δ_i, δ_j), is the more complicated task, and is thus the main concern of this section.

Let us start by considering a given document. Say we have a document d ∈ D with V_d = {t_{i_1}, t_{i_2}, …, t_{i_s}} ⊆ {t_1, t_2, …, t_n} = V, where 1 ≤ i_1 < i_2 < ⋯ < i_s ≤ n. In this study, we always assume that 2 < s = |V_d| ≤ n (namely, each document has at least three distinct terms). Generally, if we denote by f_d(t) the frequency of term t in d and by ‖d‖ the length of d, then, for a given document d, the term occurrence frequency distribution is given by

$$p_d(t) = p(t \mid d) = \frac{f_d(t)}{\|d\|} = \frac{f_d(t)}{\sum_{t' \in V_d} f_d(t')} \qquad (t \in V_d)$$

which should not be confused with the term state value distribution P_d(δ).

In order to constitute the state value distributions, for arbitrary terms t, t_i, t_j ∈ V_d, let us introduce two estimation functions:

$$\psi_d(t) = \frac{f_d(t)}{\|d\|} \qquad (1)$$

$$\gamma_d(t_i, t_j) = \frac{f_d(t_i)\, f_d(t_j)}{\varpi} \qquad (2)$$

where, here and throughout this study, we denote the denominator of γ_d(t_i, t_j) by

$$\varpi = \sum_{t_k, t_l \in V_d,\; k < l} f_d(t_k)\, f_d(t_l)$$

Clearly, 0 < ψ_d(t), γ_d(t_i, t_j) < 1 for arbitrary t, t_i, t_j ∈ V_d.
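As a sketch of how (1) and (2) might be computed from raw tokens (again our own illustration; names such as `estimation_functions` are assumptions, not the paper's), consider:

```python
from collections import Counter
from itertools import combinations

def estimation_functions(doc_tokens):
    """Compute f_d, ||d||, varpi, and the estimation functions (1) and (2)
    for a document given as a list of term tokens."""
    f = Counter(doc_tokens)                  # term frequencies f_d(t)
    length = sum(f.values())                 # document length ||d||
    # varpi: sum of f_d(t_k) * f_d(t_l) over unordered term pairs in V_d (k < l)
    varpi = sum(f[a] * f[b] for a, b in combinations(sorted(f), 2))
    psi = {t: f[t] / length for t in f}      # eq. (1): psi_d(t) = f_d(t) / ||d||
    def gamma(ti, tj):                       # eq. (2): gamma_d(t_i, t_j)
        return f[ti] * f[tj] / varpi
    return f, length, varpi, psi, gamma
```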
Then, for each term t ∈ V_d, from the function ψ_d(t), define P_d(δ) by

$$P_d(\delta = 1) = P_d(t) = \psi_d(t), \qquad P_d(\delta = 0) = P_d(\bar{t}) = 1 - \psi_d(t) \qquad (3)$$

which is a probability distribution over Ω.

To constitute P_d(δ_i, δ_j) from the function γ_d(t_i, t_j), for given terms t_i, t_j ∈ V_d, define

$$\begin{aligned} \varphi_d(\delta_i = 1, \delta_j = 1) &= \gamma_d(t_i, t_j) \\ \varphi_d(\delta_i = 1, \delta_j = 0) &= \psi_d(t_i) - \gamma_d(t_i, t_j) \\ \varphi_d(\delta_i = 0, \delta_j = 1) &= \psi_d(t_j) - \gamma_d(t_i, t_j) \\ \varphi_d(\delta_i = 0, \delta_j = 0) &= 1 - \psi_d(t_i) - \psi_d(t_j) + \gamma_d(t_i, t_j) \end{aligned}$$

Note that φ_d(δ_i, δ_j) may not constitute a probability distribution. We therefore need to prove that, under some conditions, φ_d(δ_i, δ_j) is a probability distribution (Theorem 1 below). For doing so, for an arbitrary term t ∈ V_d, denote

$$\varpi_t = \sum_{t_k, t_l \in V_d - \{t\},\; k < l} f_d(t_k)\, f_d(t_l)$$

Clearly, ϖ > ϖ_t ≥ 1 as |V_d| > 2.

To prove Theorem 1, we need to introduce two lemmas. Detailed proofs are given in the Appendix.

Lemma 1. For an arbitrary term t ∈ V_d, we have

$$\varpi = \|d\| f_d(t) - f_d^2(t) + \varpi_t$$

Lemma 2. For the functions ψ_d(t) and γ_d(t_i, t_j) given in (1) and (2), respectively, we have:
(a) ϖ_{t_j} ≥ f_d²(t_j) if and only if ψ_d(t_i) ≥ γ_d(t_i, t_j);
(b) ϖ_{t_i} ≥ f_d²(t_i) if and only if ψ_d(t_j) ≥ γ_d(t_i, t_j).

We are now ready to introduce Theorem 1 below. The detailed proof is given in the Appendix.

Theorem 1. For arbitrary terms t_i, t_j ∈ V_d, the expression

$$P_d(\delta_i, \delta_j) = \varphi_d(\delta_i, \delta_j) \qquad (4)$$

is a probability distribution over Ω × Ω if it satisfies two inequalities: a) ϖ_{t_j} ≥ f_d²(t_j) and b) ϖ_{t_i} ≥ f_d²(t_i).

Thus, by the above expression for φ_d(δ_i, δ_j) and Theorem 1, we have, for instance,

$$P_d(\delta_i = 1, \delta_j = 1) = P_d(t_i, t_j) = \gamma_d(t_i, t_j)$$
$$P_d(\delta_i = 1, \delta_j = 0) = P_d(t_i, \bar{t}_j) = \psi_d(t_i) - \gamma_d(t_i, t_j)$$
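Continuing the sketch above, the joint distribution (4) can be constituted and the Theorem 1 conditions checked, via their Lemma 2 equivalents, before it is used; this is a hedged illustration rather than the paper's own code:

```python
def joint_distribution(psi, gamma, ti, tj):
    """Constitute P_d(delta_i, delta_j) as in (4), verifying the Theorem 1
    conditions through their Lemma 2 equivalents; returns None on failure."""
    g = gamma(ti, tj)
    # Lemma 2: psi_d(t_i) >= gamma_d(t_i, t_j) iff varpi_{t_j} >= f_d(t_j)^2, etc.
    if psi[ti] < g or psi[tj] < g:
        return None   # phi_d would contain a negative cell: not a distribution
    return {
        (1, 1): g,
        (1, 0): psi[ti] - g,
        (0, 1): psi[tj] - g,
        (0, 0): 1.0 - psi[ti] - psi[tj] + g,
    }
```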
The first lemma tells us that there exists a relationship between ϖ and ϖ_t. The second lemma tells us how to verify the two inequalities (conditions) required in Theorem 1: ϖ_{t_j} ≥ f_d²(t_j) and ϖ_{t_i} ≥ f_d²(t_i) may be simply verified by ψ_d(t_i) ≥ γ_d(t_i, t_j) and ψ_d(t_j) ≥ γ_d(t_i, t_j), respectively.

The reasoning behind the estimate P_d(δ_i, δ_j) is rather intuitive. P_d(δ_i = 1, δ_j = 0) and P_d(δ_i = 0, δ_j = 1) are derived from two constraints:

$$P(\delta_i = 1, \delta_j = 0) + P(\delta_i = 1, \delta_j = 1) = P(\delta_i = 1)$$
$$P(\delta_i = 0, \delta_j = 1) + P(\delta_i = 1, \delta_j = 1) = P(\delta_j = 1)$$

which ensure that both P(δ_i) and P(δ_j) are marginal distributions of P(δ_i, δ_j). P(δ_i = 0, δ_j = 0) is derived from another constraint:

$$\sum_{\delta_i, \delta_j \in \Omega} P_d(\delta_i, \delta_j) = 1$$
It is worth explaining the derivation of P(δ_i = 1, δ_j = 1) = γ_d(t_i, t_j) in more detail. In practice, the estimation functions should be considered carefully and introduced meaningfully according to the specific application problem. Let us now explain the meaning of γ_d(t_i, t_j) given in (2). It may be easiest to make the explanation through an n × n matrix. Suppose we are given a document d represented by a (frequency) 1 × n matrix

$$\mathbf{m} = [f_d(t_1), f_d(t_2), \ldots, f_d(t_n)] = [f_d(t)]_{1 \times n}$$

in which each element is a frequency satisfying f_d(t) ≥ 1 when t ∈ V_d and f_d(t) = 0 when t ∈ V − V_d. The matrix product can be written as

$$\mathbf{m}^{\mathrm{T}} \times \mathbf{m} = \begin{bmatrix} f_d(t_1) \\ \vdots \\ f_d(t_n) \end{bmatrix} \times [f_d(t_1), \ldots, f_d(t_n)] = \begin{bmatrix} f_d(t_1)f_d(t_1) & \cdots & f_d(t_1)f_d(t_n) \\ \vdots & \ddots & \vdots \\ f_d(t_n)f_d(t_1) & \cdots & f_d(t_n)f_d(t_n) \end{bmatrix} = [f_d(t_i)\, f_d(t_j)]_{n \times n}$$
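In matrix-library terms, this construction is just an outer product followed by normalization. A small NumPy sketch (the frequency vector is our own illustrative choice):

```python
import numpy as np

m = np.array([[1.0, 2.0, 1.0, 1.0, 1.0, 2.0]])   # frequency row vector, shape 1 x n

cooc = m.T @ m                     # co-occurrence frequency matrix [f_d(t_i) f_d(t_j)]
varpi = np.triu(cooc, k=1).sum()   # sum of entries with i < j, i.e. varpi (= 26 here)
gamma_matrix = cooc / varpi        # normalized co-occurrence matrix [gamma_d(t_i, t_j)]
```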
Generally, [f_d(t_i) f_d(t_j)]_{n×n}, which is symmetric, is called the co-occurrence frequency matrix of terms concerning d. Hence,

$$[f_d(t_i)\, f_d(t_j)]_{n \times n} = \varpi\, [P_d(\delta_i = 1, \delta_j = 1)]_{n \times n}$$

and

$$[P_d(\delta_i = 1, \delta_j = 1)]_{n \times n} = \frac{1}{\varpi}\, [f_d(t_i)\, f_d(t_j)]_{n \times n} = [\gamma_d(t_i, t_j)]_{n \times n}$$

can be referred to as the normalized co-occurrence frequency matrix of terms concerning d. Consequently, P_d(δ_i = 1, δ_j = 1) = γ_d(t_i, t_j), for i, j = 1, …, n, can be represented by an n × n matrix: its numerator, f_d(t_i) f_d(t_j), characterizes the co-occurrence frequencies of t_i and t_j in document d; its denominator, ϖ, the sum of all possible numerators f_d(t_i) f_d(t_j) for i < j (i, j = 1, …, n), is a normalization factor for the characterization. Clearly, ϖ is a constant for all term pairs (t_i, t_j) occurring in a given document.

Note that the assumption |V_d| > 2 ensures that there exists more than one non-zero element in the matrix, so that [P_d(δ_i = 1, δ_j = 1)]_{n×n} ≠ [0]_{n×n}. Notice also that, because no two components of (t_i, t_j) can be the same, the elements where i = j, corresponding to P_d(δ_i = 1, δ_i = 1) for i = 1, …, n, should not be considered in our context. It is only for notational convenience that these elements are included in the matrix.

3 DISCUSSION

It should be emphasized, in order to speak of the MI of terms, that we must verify that the two arguments of I(δ_i, δ_j) are probability distributions. For instance, in our method, they should satisfy the two inequalities given in Theorem 1. Let us look at the examples below, which will help to clarify the idea and make understandable the computation involved in all the above formulae.

Example 1. Suppose d = {t_1, t_2, t_2, t_3, t_4, t_5, t_6, t_6}; then we have V_d = {t_1, t_2, t_3, t_4, t_5, t_6}, ‖d‖ = 8, and ϖ = 26. Here f_d(t_1) = 1 and f_d(t_2) = 2, so ψ_d(t_1) = 1/8 and ψ_d(t_2) = 1/4. Thus, for the term pair (t_1, t_2), from (2) and (4), we have

$$P_d(\delta_1 = 1, \delta_2 = 1) = \frac{1 \times 2}{26} = \frac{1}{13}$$
$$P_d(\delta_1 = 1, \delta_2 = 0) = \frac{1}{8} - \frac{1}{13} = \frac{5}{104} > 0$$
$$P_d(\delta_1 = 0, \delta_2 = 1) = \frac{1}{4} - \frac{1}{13} = \frac{9}{52} > 0$$
$$P_d(\delta_1 = 0, \delta_2 = 0) = 1 - \frac{1}{8} - \frac{1}{4} + \frac{1}{13} = \frac{73}{104} > 0$$

Then it follows immediately (using natural logarithms) that

$$I_d(\delta_1, \delta_2) = \frac{1}{13}\log\frac{1/13}{(1/8)(1/4)} + \frac{5}{104}\log\frac{5/104}{(1/8)(3/4)} + \frac{9}{52}\log\frac{9/52}{(7/8)(1/4)} + \frac{73}{104}\log\frac{73/104}{(7/8)(3/4)}$$
$$\approx 0.0693 - 0.0321 - 0.0405 + 0.0472 = 0.0439$$
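Example 1 can be reproduced with the sketches from Section 2 (a usage illustration under the same assumptions):

```python
# Reproducing Example 1 (natural logarithms).
d = ["t1", "t2", "t2", "t3", "t4", "t5", "t6", "t6"]
f, length, varpi, psi, gamma = estimation_functions(d)
assert varpi == 26

P = joint_distribution(psi, gamma, "t1", "t2")
mi = expected_mi(P, psi["t1"], psi["t2"])
print(round(mi, 4))   # -> 0.0439
```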
Example 2. Suppose that we are given a document d = {t_1, t_2, t_2, t_2, t_3, t_4}, from which we have V_d = {t_1, t_2, t_3, t_4}, ‖d‖ = 6, and

$$\varpi = \sum_{t_k, t_l \in V_d,\; k<l} f_d(t_k)\, f_d(t_l) = f_d(t_1)f_d(t_2) + f_d(t_1)f_d(t_3) + f_d(t_1)f_d(t_4) + f_d(t_2)f_d(t_3) + f_d(t_2)f_d(t_4) + f_d(t_3)f_d(t_4)$$
$$= 1 \times 3 + 1 \times 1 + 1 \times 1 + 3 \times 1 + 3 \times 1 + 1 \times 1 = 12$$

Thus, for instance, for the term pair (t_1, t_2), we have γ_d(t_1, t_2) = (1 × 3)/12 = 1/4, and

$$P_d(\delta_1 = 1, \delta_2 = 0) = \psi_d(t_1) - \gamma_d(t_1, t_2) = \frac{1}{6} - \frac{3}{12} = -\frac{1}{12} < 0$$

from which we can conclude that P_d(δ_1, δ_2) is not a probability distribution, since ψ_d(t_1) − γ_d(t_1, t_2) < 0. Also, we can verify this in an alternative way:

$$\varpi_{t_2} = \sum_{t_k, t_l \in V_d - \{t_2\},\; k<l} f_d(t_k)\, f_d(t_l) = f_d(t_1)f_d(t_3) + f_d(t_1)f_d(t_4) + f_d(t_3)f_d(t_4) = 1 + 1 + 1 = 3 < 9 = f_d^2(t_2)$$

That is, the first inequality given in Lemma 2 is not satisfied. Example 2 is thus a specific instance of failure in applying the estimation given in (1)-(4).

From the above two examples, we can see:

(i) In order to compute term dependence, we must verify that both ψ_d(t_i) ≥ γ_d(t_i, t_j) and ψ_d(t_j) ≥ γ_d(t_i, t_j), or equivalently both ϖ_{t_j} ≥ f_d²(t_j) and ϖ_{t_i} ≥ f_d²(t_i), are satisfied simultaneously, for each term pair considered.

(ii) γ_d(t_i, t_j) rapidly becomes smaller as documents become longer; it should thus not be a problem to satisfy the above two inequalities in practical applications.

In a practical application, we generally concentrate on the statistics of co-occurrence of terms. That is, the dependence with which we are really concerned is the state value (δ_i, δ_j) = (1, 1) of term pair (t_i, t_j). In this case, what we need is to apply only the first item of I(δ_i, δ_j) and to verify the second condition given in Theorem 1:

$$P_d(\delta_i = 1, \delta_j = 1) = \gamma_d(t_i, t_j)$$
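Returning to Example 2, the failure can be observed with the same sketches; `joint_distribution` refuses the pair because the Lemma 2 check fails:

```python
# Example 2: the verification conditions fail for the pair (t1, t2).
d = ["t1", "t2", "t2", "t2", "t3", "t4"]
f, length, varpi, psi, gamma = estimation_functions(d)
assert varpi == 12

print(psi["t1"] - gamma("t1", "t2"))               # -> -0.0833... (= -1/12)
print(joint_distribution(psi, gamma, "t1", "t2"))  # -> None: not a distribution
```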
4 EXTENSION

Our method can be extended to deal with general document representations. Suppose document d is represented by a 1 × n matrix

$$\mathbf{m} = [w_d(t_1), w_d(t_2), \ldots, w_d(t_n)] = [w_d(t)]_{1 \times n}$$

satisfying w_d(t) > 0 when t ∈ V_d and w_d(t) = 0 when t ∈ V − V_d. The w_d(t) is called a weighting function, indicating the importance of term t in representing document d. For instance, a widely used weighting function would be

$$w_d(t) = f_d(t) \times \log\frac{|D|}{n_t}$$

where n_t is the number of documents in D in which t occurs. Also, the method described in the previous sections is a special case where w_d(t) = f_d(t) for t ∈ V_d.

With document representation by w_d(t), let us continue to denote the "length" of document d by ‖d‖ = Σ_k w_d(t_k), and define the estimation functions

$$\psi_d(t) = \frac{w_d(t)}{\|d\|} \qquad (5)$$

$$\gamma_d(t_i, t_j) = \frac{w_d(t_i)\, w_d(t_j)}{\varpi}, \qquad \varpi = \sum_{t_k, t_l \in V_d,\; k<l} w_d(t_k)\, w_d(t_l) \qquad (6)$$

Also, the verification conditions are given by the following lemmas and theorem. The proofs of Lemmas 3-4 and Theorem 2 are omitted here, as they are similar to the respective proofs of Lemmas 1-2 and Theorem 1.

Lemma 3. For an arbitrary term t ∈ V_d, we have

$$\varpi = \|d\| w_d(t) - w_d^2(t) + \varpi_t$$

Lemma 4. For the functions ψ_d(t) and γ_d(t_i, t_j) given in (5) and (6), respectively, we have:
(a) ϖ_{t_j} ≥ w_d²(t_j) if and only if ψ_d(t_i) ≥ γ_d(t_i, t_j);
(b) ϖ_{t_i} ≥ w_d²(t_i) if and only if ψ_d(t_j) ≥ γ_d(t_i, t_j).

Theorem 2. P_d(δ_i, δ_j) constituted from the functions given in (5) and (6) is a probability distribution if it satisfies two inequalities: a) ϖ_{t_j} ≥ w_d²(t_j) and b) ϖ_{t_i} ≥ w_d²(t_i).

Obviously, w_d(t) is the main component of the estimation functions ψ_d(t) and γ_d(t_i, t_j). As is well known, document representations w_d(t) play an essential role in determining effectiveness. The issue of the accuracy and validity of document representation has long been a crucial and open problem; it is beyond the scope of this paper to discuss it in greater detail. A detailed discussion of representation techniques may be found, for instance, in [3], [4].
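As a hedged sketch of the extension, the estimation functions (5) and (6) can be built from any positive weights; the TF-IDF-style weighting shown is one common choice (the helper names and the smoothing caveat are our own):

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_weights(doc_tokens, doc_freq, n_docs):
    """One widely used weighting: w_d(t) = f_d(t) * log(|D| / n_t),
    where n_t = doc_freq[t] is the number of documents containing t.
    Note: a term occurring in every document gets weight 0, violating
    w_d(t) > 0; practical systems often smooth the idf factor."""
    f = Counter(doc_tokens)
    return {t: f[t] * math.log(n_docs / doc_freq[t]) for t in f}

def weighted_estimation_functions(w):
    """Estimation functions (5) and (6) built from weights w_d(t) > 0."""
    length = sum(w.values())                 # ||d|| = sum_k w_d(t_k)
    varpi = sum(w[a] * w[b] for a, b in combinations(sorted(w), 2))
    psi = {t: w[t] / length for t in w}      # eq. (5)
    def gamma(ti, tj):                       # eq. (6)
        return w[ti] * w[tj] / varpi
    return psi, gamma
```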
5 CONCLUSION

It seems that MI methods have not achieved their potential for automatically measuring statistical dependence of terms. The main problem in MI methods is to obtain actual probability distributions estimated from training data. This study concentrated on this problem and proposed a novel but simple method for such measures. We introduced the estimation functions ψ_d(t) and γ_d(t_i, t_j), which may be used to capture the occurrence and co-occurrence information of terms and to define the distributions P_d(δ) and P_d(δ_i, δ_j). We interpreted the mathematical meaning of the functions within practical application contexts. We discussed verification conditions in order to ensure that P_d(δ) and P_d(δ_i, δ_j) are probability distributions under those conditions. We provided examples to clarify the idea of our method, to make understandable the computation involved in all the formulae and, in particular, to illustrate the possibility of failure in applying our method if the verification conditions are not satisfied. We considered a possible extension of our method, and indicated that it is applicable to any quantitative document representation with a weighting function. The generality of the formal discussion means our method can be applicable to many areas of science involving statistical semantic analysis of textual data.

APPENDIX

Lemma 1. For an arbitrary term t ∈ V_d, we have

$$\varpi = \|d\| f_d(t) - f_d^2(t) + \varpi_t$$

Proof. Without loss of generality, suppose t = t_{i_1}. (Otherwise, let t = t_{i_k}; notice that the order of the elements in the set is immaterial, so we can rewrite V_d = {t_{i_k}, t_{i_1}, …, t_{i_{k-1}}, t_{i_{k+1}}, …, t_{i_s}}, and thus V_d − {t} = {t_{i_1}, …, t_{i_{k-1}}, t_{i_{k+1}}, …, t_{i_s}}.) Then

$$\varpi = \sum_{t_k, t_l \in V_d,\; k<l} f_d(t_k) f_d(t_l) = f_d(t) \sum_{t' \in V_d - \{t\}} f_d(t') + \sum_{t_k, t_l \in V_d - \{t\},\; k<l} f_d(t_k) f_d(t_l)$$
$$= f_d(t) \left( \|d\| - f_d(t) \right) + \varpi_t = \|d\| f_d(t) - f_d^2(t) + \varpi_t \qquad \blacksquare$$

Lemma 2. For the functions ψ_d(t) and γ_d(t_i, t_j) given in (1) and (2), we have:
(a) ϖ_{t_j} ≥ f_d²(t_j) if and only if ψ_d(t_i) ≥ γ_d(t_i, t_j);
(b) ϖ_{t_i} ≥ f_d²(t_i) if and only if ψ_d(t_j) ≥ γ_d(t_i, t_j).

Proof. We prove only (a); the proof of (b) is similar. Notice that ϖ ≠ 0 and ‖d‖ ≠ 0. Thus, by Lemma 1 (applied to t = t_j), we have

ϖ_{t_j} − f_d²(t_j) ≥ 0
if and only if ϖ = ‖d‖ f_d(t_j) + [ϖ_{t_j} − f_d²(t_j)] ≥ ‖d‖ f_d(t_j)
if and only if ϖ f_d(t_i) ≥ ‖d‖ f_d(t_i) f_d(t_j)
if and only if ψ_d(t_i) = f_d(t_i)/‖d‖ ≥ f_d(t_i) f_d(t_j)/ϖ = γ_d(t_i, t_j). ∎

Theorem 1. P_d(δ_i, δ_j) given in (4) is a probability distribution if a) ϖ_{t_j} ≥ f_d²(t_j) and b) ϖ_{t_i} ≥ f_d²(t_i).

Proof. P_d(δ_i = 1, δ_j = 1) > 0 as 0 < γ_d(t_i, t_j) < 1; P_d(δ_i = 1, δ_j = 0) ≥ 0 and P_d(δ_i = 0, δ_j = 1) ≥ 0 as ψ_d(t_i), ψ_d(t_j) ≥ γ_d(t_i, t_j) by Lemma 2; and P_d(δ_i = 0, δ_j = 0) = [1 − ψ_d(t_i) − ψ_d(t_j)] + γ_d(t_i, t_j) > 0, since ψ_d(t_i) + ψ_d(t_j) < 1 (by the assumption |V_d| > 2, document d contains at least one further distinct term) and γ_d(t_i, t_j) > 0. Finally, Σ_{δ_i, δ_j} P_d(δ_i, δ_j) = 1 can easily be seen from the definition of φ_d(δ_i, δ_j). ∎