Shared Components Topic Models

Shared Components Topic Models
Matthew R. Gormley, Mark Dredze, Benjamin Van Durme, Jason Eisner
Center for Language and Speech Processing
Human Language Technology Center of Excellence
Johns Hopkins University
NAACL 2012, June 6, 2012

Contrast of LDA Extensions

Topic Model = Distributions over topics (docs) + Distributions over words (topics)

Most extensions to LDA focus on the distributions over topics (docs); our model focuses on the distributions over words (topics).

LDA for Topic Modeling (Blei, Ng, & Jordan, 2003)

[Figure: six bar charts, one per topic ϕ1…ϕ6, each plotting probability against the words of the vocabulary]

• Each topic is defined as a Multinomial distribution over the vocabulary, parameterized by ϕk.
• A topic is visualized as its high-probability words, e.g. the {hockey} topic: team, season, hockey, player, penguins, ice, canadiens, puck, montreal, stanley, cup.
• A pedagogical label is used to identify the topic: ϕ1 {Canadian gov.}, ϕ2 {government}, ϕ3 {hockey}, ϕ4 {U.S. gov.}, ϕ5 {baseball}, ϕ6 {Japan}.

LDA for Topic Modeling

[Figure: the six topics ϕ1 {Canadian gov.}, ϕ2 {government}, ϕ3 {hockey}, ϕ4 {U.S. gov.}, ϕ5 {baseball}, ϕ6 {Japan}, each drawn from Dirichlet(β), and one bar chart of topic proportions θm per document, each drawn from Dirichlet(α)]

θ1 = topic proportions for: "The 54/40' boundary dispute is still unresolved, and Canadian and US Coast Guard vessels regularly if infrequently detain each other's fish boats in the disputed waters off Dixon…"

θ2 = topic proportions for: "In the year before Lemieux came, Pittsburgh finished with 38 points. Following his arrival, the Pens finished…"

θ3 = topic proportions for: "The Orioles' pitching staff again is having a fine exhibition season. Four shutouts, low team ERA, (Well, I haven't gotten any baseball…"

The ϕk are the distributions over words (topics); the θm are the distributions over topics (docs).
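As a concrete illustration of this generative story, here is a minimal NumPy sketch of LDA's sampling process; the dimensions and hyperparameter values are made up for the example and are not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, M, N = 1000, 6, 3, 50    # vocabulary size, topics, documents, words per document
beta, alpha = 0.01, 0.1        # symmetric Dirichlet hyperparameters (illustrative values)

# Each topic phi_k is a Multinomial distribution over the vocabulary, drawn from Dirichlet(beta).
phi = rng.dirichlet(np.full(V, beta), size=K)          # shape (K, V)

docs = []
for m in range(M):
    theta_m = rng.dirichlet(np.full(K, alpha))         # document's distribution over topics
    z = rng.choice(K, size=N, p=theta_m)               # a topic assignment for every token
    words = [rng.choice(V, p=phi[k]) for k in z]       # each word drawn from its assigned topic
    docs.append(words)
```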

LDA for Topic Modeling

Two problems with the LDA generative story for topics:
1. Independently generate each topic
2. For each topic, store a parameter per word in the vocabulary

We're not the first to notice this…

Our Model

Shared Components Topic Model (SCTM):
– Generate a pool of "components" (proto-topics)
– Assemble each topic from some of the components
  • Multiply and renormalize ("product of experts")
– Documents are mixtures of topics (just like LDA)

1. So the wordlists of two topics are not generated independently!
2. Fewer parameters

SCTM: Motivating Example

Components are distributions over words. How do we combine components into topics?

Top words per component, in decreasing probability:
  ϕ1 {sports}:     player, team, hockey, baseball, Orioles, Canucks, season
  ϕ2 {Canada}:     canada, Quebec, parliament, snow, Hansard's, Elizabeth II, hockey
  ϕ3 {government}: democracy, socialism, voted, election, Obama, Putin, parliament

We can imagine a component as a set of words (i.e. all the non-zero probabilities are identical).

To create a {Canadian government} topic we could take the union of {government} and {Canada}.

SCTM: Motivating Example

Better yet, to create a {Canadian government} topic we could take the intersection of {government} and {Canada}.

[Venn diagram: ϕ1 {sports}, ϕ2 {Canada}, ϕ3 {government}; {Canadian gov.} lies in the overlap of {Canada} and {government}, and {hockey}? in the overlap of {sports} and {Canada}]

SCTM: Motivating Example

More complex intersections might be more realistic:

Components: ϕ1 {Canada}, ϕ2 {government}, ϕ3 {sports}, ϕ4 {U.S.}, ϕ5 {Japan}
Topics: b1 {Canadian gov.}, b2 {government}, b3 {hockey}, b4 {U.S. gov.}, b5 {baseball}, b6 {Japan}

[Figure: each topic bk is formed from a subset of the components, e.g. {Canadian gov.} from {Canada} and {government}, and so on for the other topics]

Soft Intersection and Union

• We don't want topics to be sets of words; we want probability distributions over words
• In probability space…
  – Union: Mixture
  – Intersection: Normalized Product

Product of Experts

Product of Experts (PoE) model (Hinton, 2002): another name for a normalized product.

For a subset of components 𝒞, define the model as

p(x | φ1, …, φC) = ∏_{c ∈ 𝒞} φ_{c,x} / Σ_{v=1}^{V} ∏_{c ∈ 𝒞} φ_{c,v}

where the summation in the denominator is over all possible word types.

Intersection (in probability space) = Normalized Product (PoE)
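To make the normalized product concrete, here is a small sketch (illustrative names, not code from the paper): given component distributions and a binary vector selecting a subset of them, the topic's word distribution is the elementwise product of the selected components, renormalized over the vocabulary.

```python
import numpy as np

def poe_topic(phi, b):
    """Normalized product of the components selected by the 0/1 vector b.

    phi: (C, V) array, each row a component distribution over the vocabulary.
    b:   (C,)   array of 0/1 selections for one topic.
    """
    log_prod = b @ np.log(phi)       # sum_c b_c * log phi_{c,v}, one value per word type
    log_prod -= log_prod.max()       # shift for numerical stability before exponentiating
    w = np.exp(log_prod)
    return w / w.sum()               # renormalize over the V word types
```

Words that any selected component assigns low probability are pushed toward zero, which is exactly the soft-intersection behaviour of the previous slide.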

Our Model

Shared Components Topic Model (SCTM):
– Generate a pool of "components" (proto-topics)
– Assemble each topic from some of the components
  • Multiply and renormalize ("product of experts")
– Documents are mixtures of topics (just like LDA)

1. So topics are not independent!
2. Fewer parameters

Learning the Structure of Topics

How do we decide which subset of components combine to form a single topic?

[Figure: components ϕ1 {Canada}, ϕ2 {government}, ϕ3 {sports}, ϕ4 {U.S.}, ϕ5 {Japan}, each with an inclusion probability πc; the binary vector b1 selects a subset of them to form the {Canadian gov.} topic, b2 forms {government}, and so on through b6 {Japan}]

• Each topic k has a binary vector bk over the components, with bkc ~ Bernoulli(πc).
• The inclusion probabilities themselves are drawn as πc ~ Beta(γ/C, 1).

Beta-Bernoulli model:
– The finite version of the Indian Buffet Process (Griffiths & Ghahramani, 2006)
– A prior over K × C binary matrices
– We can stack the binary vectors bk to form a matrix (rows b1…b6 over columns ϕ1…ϕ5)
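A minimal sketch of this Beta-Bernoulli (finite IBP) prior over the K × C binary matrix, with illustrative sizes and concentration parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, gamma = 6, 5, 2.0                        # topics, components, concentration (illustrative)

pi = rng.beta(gamma / C, 1.0, size=C)          # pi_c ~ Beta(gamma/C, 1)
B = (rng.random((K, C)) < pi).astype(int)      # b_kc ~ Bernoulli(pi_c)

# Row k of B is the binary vector b_k saying which components are combined to form topic k.
print(B)
```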

Our Model (recap)

SCTM: generate a pool of components, assemble each topic from some of them by multiplying and renormalizing, and let documents be mixtures of topics just like LDA. So topics are not independent, and there are fewer parameters.

Our Model (SCTM)

[Figure: components ϕ1 {Canada}, ϕ2 {government}, ϕ3 {sports}, ϕ4 {U.S.}, ϕ5 {Japan} and topics b1 {Canadian gov.}, b2 {government}, b3 {hockey}, b4 {U.S. gov.}, b5 {baseball}, b6 {Japan}]

How do we generate the components? Each component ϕc is drawn from Dirichlet(β).

As in LDA, each document's topic proportions θm are drawn from Dirichlet(α):
θ1 = topic proportions for the 54/40' boundary-dispute document
θ2 = topic proportions for the Lemieux/Pittsburgh document
θ3 = topic proportions for the Orioles document

SCTM

[Figure: the full SCTM generative model]
– πc ~ Beta(γ/C, 1), and bkc ~ Bernoulli(πc) selects which components form each topic
– Components ϕ1 {Canada} … ϕ5 {Japan} are drawn from Dirichlet(β); the resulting topics b1 {Canadian gov.} … b6 {Japan} are the distributions over words (topics)
– Document topic proportions θ1, θ2, θ3 are drawn from Dirichlet(α); these are the distributions over topics (docs)
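Putting the pieces together, a compact sketch of the full generative story on this slide (illustrative dimensions and hyperparameters; the guard against empty topics is an implementation convenience, not part of the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
V, C, K, M, N = 1000, 5, 6, 3, 50              # vocab, components, topics, docs, words per doc
gamma, beta, alpha = 2.0, 0.01, 0.1            # illustrative hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=C)  # components ~ Dirichlet(beta)
pi = rng.beta(gamma / C, 1.0, size=C)          # pi_c ~ Beta(gamma/C, 1)
B = (rng.random((K, C)) < pi).astype(int)      # b_kc ~ Bernoulli(pi_c)
B[B.sum(axis=1) == 0, 0] = 1                   # guard: give an empty topic at least one component

def poe_topic(phi, b):                         # normalized product of the selected components
    log_w = b @ np.log(phi)
    log_w -= log_w.max()                       # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

topics = np.array([poe_topic(phi, B[k]) for k in range(K)])   # (K, V) distributions over words

for m in range(M):                             # documents are mixtures of topics, just like LDA
    theta_m = rng.dirichlet(np.full(K, alpha)) # distribution over topics for this document
    z = rng.choice(K, size=N, p=theta_m)
    words = [rng.choice(V, p=topics[k]) for k in z]
```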

Contrast of LDA Extensions

Topic Model = Distributions over topics (docs) + Distributions over words (topics)

Extensions to the distributions over topics (docs):
• Hierarchical LDA (hLDA) (Blei et al., 2004)
• Author-Topic Model (Rosen-Zvi et al., 2004)
• HDP mixture model (Teh et al., 2004)
• Correlated Topic Models (CTM) (Blei & Lafferty, 2006)
• Pachinko Allocation Model (PAM) (Li & McCallum, 2006)
• Hierarchical PAM (hPAM) (Mimno et al., 2007)
• Syntactic Topic Models (Boyd-Graber & Blei, 2009)
• Focused Topic Models (Williamson et al., 2010)
• 2D Topic-Aspect Model (Paul & Girju, 2010)
• DILN for mixed-membership modeling (Paisley et al., 2011)
• Doubly Correlated Nonparametric TM (Kim & Sudderth, 2011)

Correlated Topics

• Correlated topic approaches:
  – Correlated Topic Models (CTM)
  – Pachinko Allocation Model (PAM)
  – Hierarchical LDA (hLDA)
  – Hierarchical PAM (hPAM)

• Key difference from SCTM: their correlation is limited to topics that appear together in the same document
  – Example: the {hockey} and {baseball} topics share many words in common, but never appear in the same document

• The spirit of learning relationships between topics is very similar!

[Recap figure: in SCTM, the {hockey} (b3) and {baseball} (b5) topics can still share the {sports} component (ϕ3) even if they never co-occur in a document]

Contrast of LDA Extensions

Topic Model = Distributions over topics (docs) + Distributions over words (topics)

Extensions to the distributions over words (topics):
• Asymmetric Dirichlet prior (Wallach et al., 2009)
• Spherical Topic Models (Reisinger et al., 2010)
• Sparse Topic Models (Wang & Blei, 2009)
• SAGE for topic modeling (Eisenstein et al., 2011)
• Shared Components Topic Models (this work)

Comparison of a few Topic Models

(Models compared on whether topics are dependently generated and whether the model has fewer parameters.)

• LDA (Blei et al., 2003)
• Asymmetric Dirichlet Prior (Wallach et al., 2009): all topics drawn from a language-specific base distribution
• Spherical Topic Model (Reisinger et al., 2010)
• SparseTM (Wang & Blei, 2009): each topic is sparse
• SAGE (Eisenstein et al., 2011)
• SCTM (this paper): topics are products of a shared pool of components

Parameter Estimation

• Goal: infer values for the model parameters: the components ϕc, the inclusion probabilities πc, and the document topic proportions θm

• Monte Carlo EM (MCEM) algorithm, where the M-step minimizes a Contrastive Divergence (CD) objective

Parameter Estimation

[Figure: the full SCTM graphical model, with πc ~ Beta(γ/C, 1), components ~ Dirichlet(β), and document proportions ~ Dirichlet(α)]

• The π (component inclusion probabilities) and the θ (document topic proportions) are integrated out.

Parameter Estimation

[Figure: components ϕ1…ϕ5, topic vectors b1…b6, and the per-token topic assignments zmn for the three example documents]

• Model parameters: the components ϕc
• Latent variables: the topic assignments zmn and the topic-component indicators bkc

Parameter Estimation

• Standard M-step: maximize the likelihood of ϕc conditioned on zmn and bkc
• Standard E-step: compute expectations of zmn and bkc conditioned on ϕc
• Monte-Carlo E-step (used here): sample zmn and bkc conditioned on ϕc
• CD M-step (used here): minimize a contrastive divergence objective for ϕc conditioned on zmn and bkc

Parameter Estimation

Algorithm 1: SCTM Training (following Hinton, 2002, for the CD M-step)
  Initialize parameters ξc, bkc, zi
  while not converged do
    {E-step:}
    for j = 1 to J do
      {Draw the jth sample {Z, B}(j):}
      for i = 1 to N do: sample zi
      for k = 1 to K, c = 1 to C do: sample bkc
    {M-step:}
    for c = 1 to C, v = 1 to V do:
      single gradient step over ξ:  φcv(t+1) = φcv(t) − η · dCD({Z, B})/dφcv

Here each word is drawn as xmn ~ p(· | zmn, b_zmn, φ), where

p(x | z, bz, φ) = ∏_{c=1}^{C} φ_{c,x}^{b_{z,c}} / Σ_{v=1}^{V} ∏_{c=1}^{C} φ_{c,v}^{b_{z,c}}

and the Contrastive Divergence objective CD({Z, B}) treats Z and B as fixed.
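Algorithm 1's control flow can be sketched as below. The Gibbs conditionals for z and b and the CD gradient are not spelled out on the slide, so they are passed in as callables rather than invented here; the final projection back onto the simplex is likewise an illustrative stand-in for the paper's parameterization via ξ.

```python
import numpy as np

def train_sctm(docs, phi, B, sample_z, sample_b, cd_grad,
               n_iters=100, n_samples=5, lr=0.01):
    """MCEM skeleton for SCTM: Monte-Carlo E-step, contrastive-divergence M-step."""
    for _ in range(n_iters):
        samples = []
        for _ in range(n_samples):                       # E-step: draw J samples of {Z, B}
            Z = [sample_z(doc, phi, B) for doc in docs]  # resample each token's topic assignment
            B = sample_b(Z, phi, B)                      # resample the topic-component indicators
            samples.append((Z, B.copy()))
        # M-step: one gradient step on the component parameters, treating Z and B as fixed
        grad = np.mean([cd_grad(phi, Z_s, B_s, docs) for Z_s, B_s in samples], axis=0)
        phi = phi - lr * grad                            # phi^(t+1) = phi^(t) - eta * dCD/dphi
        phi = np.clip(phi, 1e-12, None)
        phi /= phi.sum(axis=1, keepdims=True)            # illustrative projection back to distributions
    return phi, B
```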

Parameter Estimation (recap)

Goal: infer the model parameters ϕc, πc, and θm via Monte Carlo EM, where the M-step minimizes a Contrastive Divergence objective.

Experiments: Topic Modeling

• Experiments:
  – Can SCTM combine a fixed number of components (multinomials) into topics to achieve lower perplexity?
  – Does SCTM achieve lower perplexity than LDA with a more compact model?

• Analysis:
  – What are the learned topics like?
  – What are the learned components like?
  – What topic structure is learned?

Experiments: Topic Modeling

Experimental Setup:
– Datasets:
  • 1,000 random articles from 20 Newsgroups
  • 1,617 NIPS abstracts
– Evaluation:
  • left-to-right average perplexity on held-out data
– Models:
  • LDA trained with a collapsed Gibbs sampler
    – In LDA, components and topics are in a one-to-one relationship (i.e. LDA is a special case of SCTM where each topic is comprised of only its corresponding component); see the sketch after this list
  • SCTM with parameter estimation as described
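A tiny check of that special-case claim, reusing the product-of-experts helper sketched earlier: with the binary matrix set to the identity, each topic's normalized product is just its own component, i.e. SCTM collapses to LDA's topics.

```python
import numpy as np

def poe_topic(phi, b):
    w = np.exp(b @ np.log(phi))
    return w / w.sum()

rng = np.random.default_rng(0)
C, V = 4, 100
phi = rng.dirichlet(np.full(V, 0.01), size=C)   # C components, one per topic
B = np.eye(C, dtype=int)                        # one-to-one: topic k uses only component k
topics = np.array([poe_topic(phi, B[k]) for k in range(C)])
assert np.allclose(topics, phi)                 # the PoE reduces to the component itself
```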

Experiments: Topic Modeling

First question: can SCTM combine a fixed number of components (multinomials) into topics to achieve lower perplexity?

Experiments: Topic Modeling (20News)

[Figure: held-out perplexity (y-axis) vs. number of components (x-axis). LDA is plotted as a curve; SCTM runs are plotted with labels showing the number of topics, from runs with # components = # topics up to runs with 100 components combined into as many as 500 topics.]

Experiments: Topic Modeling (NIPS)

[Figure: the corresponding perplexity vs. number of components plot for the NIPS abstracts, with the same LDA and SCTM series.]

Experiments: Topic Modeling

Second question: does SCTM achieve lower perplexity than LDA with a more compact model?

Experiments: Topic Modeling (20News)

[Figure: held-out perplexity (y-axis) vs. number of model parameters in thousands (x-axis). Labels for LDA show the number of topics; labels for SCTM show the number of components and the number of topics.]

Experiments: Topic Modeling (NIPS)

[Figure: the corresponding perplexity vs. number of model parameters plot for the NIPS abstracts.]

Experiments: Topic Modeling

Analysis: what are the learned topics like? What are the learned components like? What topic structure is learned?

What does SCTM learn? (20News)

Figure 2: SCTM binary matrix and topics from 3599 training documents of 20NEWS; shaded squares of the binary matrix are "on" (equal to 1).

 k   αk     Top words for topic
 1   0.306  subject organization israel return define law org
 2   0.031  encryption chip clipper keys des escrow security law
 3   0.025  turkish armenian armenians war turkey turks armenia
 4   0.102  drive card disk scsi hard controller mac drives
 5   0.071  image jpeg window display code gif color mit
 6   0.018  jews israeli jewish arab peace land war arabs
 7   0.074  org money back question years thing things point
 8   0.106  christian bible church question christ christians life
 9   0.011  administration president year market money senior
10   0.055  health medical center research information april
11   0.063  gun law state guns control bill rights states
12   0.160  world organization system israel state usa cwru reply
13   0.042  space nasa gov launch power wire ground air
14   0.038  space nasa gov launch power wire ground air
15   0.079  team game year play games season players hockey
16   0.158  car lines dod bike good uiuc sun cars
17   0.136  windows file government key jesus system program
18   0.122  article writes center page harvard virginia research
19   0.017  max output access digex int entry col line
20   0.380  lines people don university posting host nntp time

Experiments: Topic Modeling

Analysis continued: what topic structure is learned?

SCTM: Hasse Diagram over Topics (NIPS)

Figure 4: Hasse diagram on NIPS for C = 10, K = 20, showing the top words for topics and unrepresented components (in the shaded box). Notice that some topics consist of only a single component.

Components (top words):
 c=9  visual image images cells cortex scene support spatial feature vision cues stimulus statistics
 c=4  paper units output layer networks patterns unit pattern set rule network rules weights training
 c=2  network networks data learning optimal linear vector independent binary natural algorithms pca
 c=1  model information parameters kalman robust matrices likelihood experimentally

Topics (top words):
 k=1   αk=0.11  model learning system information parameters networks robust kalman rules estimation
 k=2   αk=0.13  network input information time recurrent back propagation units architecture forward layer
 k=3   αk=0.06  object recognition system objects information visual matching problem based classification
 k=4   αk=0.12  bayesian results show estimation method based parameters likelihood methods models
 k=5   αk=0.04  object recognition system objects information visual matching problem based classification
 k=6   αk=0.23  neural network paper recognition speech systems based results performance artificial
 k=7   αk=0.08  data paper networks network output feature features patterns set train introduced unit functions
 k=8   αk=0.23  algorithm training error function method performance input classification classifier
 k=9   αk=0.02  vector feature classification support vectors kernel regression weight inputs dimensionality
 k=10  αk=0.09  neural neurons analog synaptic neuron networks memory time capacity model associative noise dynamics
 k=11  αk=0.08  learning networks system recognition time network describes hand context views classification
 k=12  αk=0.13  problem state control reinforcement problems models time based decision markov systems function
 k=13  αk=0.05  networks network learning distributed system weight vectors property binary point optimal real
 k=14  αk=0.07  models images image problem structure analysis mixture clustering approach show computational
 k=15  αk=0.12  cells neurons visual cortex motion response processing spatial cell properties patterns spike
 k=16  αk=0.11  training units paper hidden number output problem rule set order unit show present method weights task
 k=17  αk=0.10  number functions weights function layer generalization error results loss linear size
 k=18  αk=0.07  information analysis component rules signal independent representations noise basis
 k=19  αk=0.03  system networks set neurons visual phase feature processing features output associative
 k=20  αk=0.02  time network weights activation delay current chaotic connected discrete connections

Experiments: Topic Modeling

• Experiments:
  – For the same number of components (multinomials), SCTM achieves lower perplexity than LDA
  – Non-square SCTM achieves lower perplexity than LDA with a more compact model

• Analysis:
  – SCTM learns diverse LDA-like topics
  – Components are usually only interpretable when they also appear as a topic
  – SCTM learns an implicit Hasse diagram defining subsumption relationships between topics (see the sketch below)
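One way to read such a diagram off the learned binary matrix, assuming (as in Figure 4) that topics are partially ordered by the subset relation on the components they select; this is an illustrative reading, not code from the paper.

```python
import numpy as np

def subsumption_edges(B):
    """Pairs (j, k) where topic j's component set is a strict subset of topic k's."""
    sets = [frozenset(np.flatnonzero(row)) for row in B]
    return [(j, k) for j, sj in enumerate(sets)
                   for k, sk in enumerate(sets)
                   if j != k and sj < sk]

# Toy example: topic 0 uses components {0, 1}, topic 1 uses only component {1}.
B = np.array([[1, 1, 0],
              [0, 1, 0]])
print(subsumption_edges(B))   # [(1, 0)]: topic 1 is subsumed by topic 0
```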

Summary

Shared Components Topic Model (SCTM):
1. Generate a pool of "components" (proto-topics)
2. Assemble each topic from some of the components
   • Multiply and renormalize ("product of experts")
3. Documents are mixtures of topics (just like LDA)
– So the wordlists of two topics are not generated independently!
– Fewer parameters

Future Work

• Improve inference for SCTM
• Topics as products of components in other applications
  – Selectional preference: components could correspond to semantic features that intersect to define semantic classes
  – Vision: topics are classes of objects; the components could be features of those objects

Thank  you!  

Questions, comments?