LSI (Latent Semantic Analysis) with gensim - Machine Learning / Natural Language Processing Study Notes

Perform latent semantic analysis using a corpus and a dictionary.
* Document vectorization (dimensionality reduction)

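Create a dictionary from the document set.
Remove unwanted words (those that are too rare or too common).
Convert the documents to a BoW (bag-of-words) representation.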

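from gensim import corpora

# documents: a list of tokenized documents (each one a list of word strings)
dic = corpora.Dictionary(documents)
# Keep only words that appear in at least 20 documents and in at most 30% of all documents
dic.filter_extremes(no_below=20, no_above=0.3)
bow_corpus = [dic.doc2bow(d) for d in documents]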
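
To make the filtering and BoW conversion concrete, here is a minimal pure-Python sketch of roughly what these two steps do. It is not gensim's implementation: the toy documents, the helper names (`document_frequencies`, `build_vocab`, `doc2bow`) and the alphabetical ID assignment are all assumptions made for illustration, and gensim's exact rounding of the `no_above` threshold may differ slightly.

```python
from collections import Counter

# Hypothetical toy data: each document is already a list of tokens.
documents = [
    ["batman", "robin", "action"],
    ["batman", "action", "movie"],
    ["godzilla", "movie", "action"],
    ["alien", "movie", "action"],
]

def document_frequencies(docs):
    """Count in how many documents each token appears."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set(): count each token once per document
    return df

def build_vocab(docs, no_below, no_above):
    """Keep tokens with no_below <= document frequency <= no_above * len(docs)."""
    df = document_frequencies(docs)
    n_docs = len(docs)
    kept = sorted(t for t, c in df.items() if no_below <= c <= no_above * n_docs)
    # Assumption: IDs are assigned alphabetically here; gensim assigns them differently.
    return {token: idx for idx, token in enumerate(kept)}

def doc2bow(doc, token2id):
    """Turn a token list into sorted (token_id, count) pairs, skipping filtered tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

token2id = build_vocab(documents, no_below=2, no_above=0.75)
bow_corpus = [doc2bow(d, token2id) for d in documents]
print(token2id)
print(bow_corpus)
```

In this toy data, "action" appears in every document and is dropped by the `no_above` limit, while "robin", "godzilla" and "alien" appear only once and are dropped by `no_below`.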

Convert the vectorized documents to a TF-IDF representation (optional).

from gensim import models 
 
tfidf_model = models.TfidfModel(bow_corpus) 
tfidf_corpus = tfidf_model[bow_corpus]
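
As a rough illustration of what the TF-IDF step computes, the sketch below reimplements the weighting in pure Python under the assumption of gensim's default settings (raw term count times log2(N / document frequency), then scaling each document to unit length). The function name `tfidf_transform` and the toy corpus are hypothetical, and this is a simplified stand-in rather than gensim's actual code.

```python
import math

def tfidf_transform(bow_corpus):
    """Weight each (token_id, count) pair by tf * log2(N / df), then L2-normalize."""
    n_docs = len(bow_corpus)
    # Document frequency: number of documents containing each token id.
    df = {}
    for bow in bow_corpus:
        for token_id, _ in bow:
            df[token_id] = df.get(token_id, 0) + 1
    idf = {token_id: math.log2(n_docs / count) for token_id, count in df.items()}

    result = []
    for bow in bow_corpus:
        weights = [(token_id, tf * idf[token_id]) for token_id, tf in bow]
        # Drop entries whose weight is (numerically) zero, e.g. tokens found in every document.
        weights = [(token_id, w) for token_id, w in weights if abs(w) > 1e-12]
        norm = math.sqrt(sum(w * w for _, w in weights))
        if norm > 0:
            weights = [(token_id, w / norm) for token_id, w in weights]
        result.append(weights)
    return result

# Hypothetical BoW corpus: token 1 occurs in every document.
toy_bow = [
    [(0, 2), (1, 1)],
    [(1, 1), (2, 1)],
    [(1, 3)],
]
toy_tfidf = tfidf_transform(toy_bow)
print(toy_tfidf)
```

Token 1 appears in all three documents, so its IDF is zero and it vanishes from every vector. This is why TF-IDF tends to suppress very common words before the LSI step.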

Build an LSI model from the corpus and dictionary created above.
Use the LSI model to create a dimension-reduced corpus.

lsi_model = models.LsiModel(tfidf_corpus, id2word=dic, num_topics=200)
lsi_corpus = lsi_model[tfidf_corpus]
print(lsi_model.print_topic(10))
print(lsi_model.print_topic(20))
0.272*"batman" + -0.259*"alien" + -0.259*"jacki" + -0.180*"truman" + -0.156*"vampir" + 0.150*"arnold" + 0.131*"robin" + 0.126*"action" + -0.119*"wed" + -0.115*"chan" 
0.533*"godzilla" + 0.163*"broderick" + 0.156*"-" + -0.143*"trooper" + 0.130*"--" + -0.130*"bug" + -0.126*"starship" + -0.114*"verhoeven" + -0.112*"harri" + -0.102*"titan" 
 
print(lsi_corpus[0])
[(0, 0.24008384938376839), (1, 0.035493134948721645), (2, 0.0027661163076791824), (3, -0.022602169535407168), (4, -0.0032662025820950026), (5, -0.053840986794015493), (6, 0.031955950390335205), (7, -0.0075247347929682795), (8, -0.0176339887741583), (9, -0.0044738729175389591), (10, 0.034031529582302801), (…, …………), (197, 0.032058850129589735), (198, 0.0085820201611369329), (199, 0.035471047578525046)]

* Features of document 1, compressed to 200 dimensions
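
One common use of such compressed vectors is comparing documents. The sketch below computes the cosine similarity between two vectors given in the same `(topic_id, weight)` list format as the output above. The function name `cosine_similarity` and the three short vectors are made up for illustration (they are not taken from the real model); for a whole corpus, gensim's own `similarities.MatrixSimilarity` is probably the more practical choice.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (index, weight) pairs."""
    a = dict(vec_a)
    b = dict(vec_b)
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional LSI vectors (a real model here would have 200 dimensions).
doc_a = [(0, 0.24), (1, 0.03), (2, -0.02)]
doc_b = [(0, 0.48), (1, 0.06), (2, -0.04)]  # same direction as doc_a, twice the length
doc_c = [(0, -0.03), (1, 0.24), (2, 0.0)]   # orthogonal to doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Because cosine similarity ignores vector length, `doc_a` and `doc_b` come out as (almost exactly) identical in direction, while `doc_a` and `doc_c` have no similarity.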

* Saving and loading

lsi_model.save('lsi_topics300.model') 
lsi_model = models.LsiModel.load('lsi_topics300.model')

With this, the model can be saved and reloaded, so there is no need to retrain it every time.