gensim corpus operations - Machine Learning and Natural Language Processing Study Notes

This note summarizes the operations used when building a corpus.

from gensim import corpora
doclist = [['human', 'interface', 'computer'],
            ['survey', 'user', 'computer', 'system'],
            ['eps', 'user', 'interface'],
            ['system', 'human', 'system', 'eps'],
            ['user','time'],
            ['trees', 'user'],
            ['graph', 'trees'],
            ['graph', 'minors', 'minors','trees'],
            ['graph', 'minors', 'survey']]
dic = corpora.Dictionary(doclist)
print(dic.token2id)
# {u'minors': 10, u'graph': 9, u'system': 5, u'trees': 8, u'eps': 6, u'computer': 1, u'survey': 3, u'user': 4, u'human': 2, u'time': 7, u'interface': 0}
print(dic.dfs)
# {0: 2, 1: 2, 2: 2, 3: 2, 4: 4, 5: 2, 6: 2, 7: 1, 8: 3, 9: 3, 10: 2}
# dfs is the number of documents each word appears in (a word that occurs several times in one document is still counted once)
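
The counting rule behind dfs can be checked without gensim. The following is a minimal sketch in plain Python, not gensim's implementation; `document_frequency` is a hypothetical helper name. It counts each word at most once per document by converting the document to a set first.

```python
from collections import Counter

doclist = [['human', 'interface', 'computer'],
           ['survey', 'user', 'computer', 'system'],
           ['eps', 'user', 'interface'],
           ['system', 'human', 'system', 'eps'],
           ['user', 'time'],
           ['trees', 'user'],
           ['graph', 'trees'],
           ['graph', 'minors', 'minors', 'trees'],
           ['graph', 'minors', 'survey']]

def document_frequency(docs):
    """Count, for each word, how many documents contain it."""
    df = Counter()
    for doc in docs:
        # set() collapses repeated words, so a document adds at most 1 per word
        df.update(set(doc))
    return df

df = document_frequency(doclist)
print(df['user'], df['system'], df['minors'])  # 4 2 2
```

'system' occurs twice in the fourth document and 'minors' twice in the eighth, but both still have a document frequency of 2.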

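Note that passing each document as a plain string instead of a list of tokens, as below, raises an error when the dictionary is created.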

doclist = ['human interface computer',
           'survey user computer system',
           'graph minors survey']
dic = corpora.Dictionary(doclist)
# TypeError: doc2bow expects an array of unicode tokens on input, not a single string
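
One way to avoid this error is to split each string into tokens before building the dictionary. A minimal sketch, assuming whitespace-separated text; `raw_docs` and `tokenized` are names chosen for illustration.

```python
raw_docs = ['human interface computer',
            'survey user computer system',
            'graph minors survey']

# str.split() with no arguments splits on any run of whitespace
tokenized = [doc.split() for doc in raw_docs]
print(tokenized[0])  # ['human', 'interface', 'computer']

# A list of token lists like this is the input shape that
# corpora.Dictionary(tokenized) expects.
```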

filter_extremes(no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)

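no_below: remove words that appear in fewer than N documents (an absolute count).
no_above: remove words that appear in more than the given fraction of all documents (a value between 0 and 1, e.g. 0.5 for 50%; words exactly at the threshold are kept).
keep_n: after the no_below and no_above filters are applied, keep only the keep_n most frequent of the remaining words (None keeps them all).
keep_tokens: words that are kept regardless of the filters (specified as keep_tokens=['human', 'survey']).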

dic.filter_extremes(no_below=3)

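* After removal, the word IDs are reassigned to close the gaps.
* Even if no_above is not specified, its default (0.5) still applies, so words can be removed unintentionally.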
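
To see which words survive a given setting, the filtering rule can be sketched in plain Python. This is an illustration based on the parameter descriptions above, not gensim's code; `surviving_tokens` is a hypothetical helper, and it assumes the fractional no_above is converted to an absolute document count by truncation.

```python
from collections import Counter

doclist = [['human', 'interface', 'computer'],
           ['survey', 'user', 'computer', 'system'],
           ['eps', 'user', 'interface'],
           ['system', 'human', 'system', 'eps'],
           ['user', 'time'],
           ['trees', 'user'],
           ['graph', 'trees'],
           ['graph', 'minors', 'minors', 'trees'],
           ['graph', 'minors', 'survey']]

def surviving_tokens(docs, no_below=5, no_above=0.5, keep_n=100000,
                     keep_tokens=None):
    """Return the set of words that pass the three filters."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))               # document frequency, once per document
    max_docs = int(no_above * len(docs))  # assumption: fraction -> absolute count
    protected = set(keep_tokens or [])
    kept = [w for w in df
            if no_below <= df[w] <= max_docs or w in protected]
    # keep_n is applied last: only the most frequent survivors remain,
    # with protected words ranked first so they are not cut off
    kept.sort(key=lambda w: len(docs) if w in protected else df[w],
              reverse=True)
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

print(sorted(surviving_tokens(doclist, no_below=3)))
# ['graph', 'trees', 'user']
print('user' in surviving_tokens(doclist, no_below=1, no_above=0.3))
# False
```

In the second call no_below removes nothing, yet 'user' (4 of 9 documents) is dropped by no_above alone, which is the kind of unintended removal noted above.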

Remove the N most frequent words.

dic.filter_n_most_frequent(3)
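
Which words this call removes can be previewed from the document frequencies. A plain-Python sketch that does not call gensim, assuming "most frequent" is measured by the number of documents a word appears in; `removed` and `remaining` are illustrative names.

```python
from collections import Counter

doclist = [['human', 'interface', 'computer'],
           ['survey', 'user', 'computer', 'system'],
           ['eps', 'user', 'interface'],
           ['system', 'human', 'system', 'eps'],
           ['user', 'time'],
           ['trees', 'user'],
           ['graph', 'trees'],
           ['graph', 'minors', 'minors', 'trees'],
           ['graph', 'minors', 'survey']]

df = Counter()
for doc in doclist:
    df.update(set(doc))  # count each word once per document

# The three words found in the largest number of documents
removed = [word for word, _ in df.most_common(3)]
remaining = sorted(set(df) - set(removed))

print(sorted(removed))  # ['graph', 'trees', 'user']
print(len(remaining))   # 8
```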

Once a dictionary has been built, saving it avoids the cost of rebuilding it later.

dic.save('word2vec.model')

Convert a document into its BoW representation: a list of (word ID, frequency) pairs.

dic.doc2bow(doclist[0])
# {u'minors': 9, u'graph': 0, u'system': 5, u'trees': 8, u'eps': 7, u'computer': 1, u'survey': 3, u'user': 4, u'human': 2, u'interface': 6}
# [(0, 1), (1, 1), (2, 1)]
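
The conversion is simple enough to sketch by hand. The following plain-Python illustration (not gensim's implementation; `to_bow` is a hypothetical name) uses the token-to-ID mapping printed at the start of this note, counts each token, and returns (ID, count) pairs sorted by ID. Words missing from the mapping are skipped.

```python
from collections import Counter

# Mapping copied from the token2id output shown earlier in this note
token2id = {'interface': 0, 'computer': 1, 'human': 2, 'survey': 3,
            'user': 4, 'system': 5, 'eps': 6, 'time': 7,
            'trees': 8, 'graph': 9, 'minors': 10}

def to_bow(doc, token2id):
    """Return sorted (word ID, count) pairs for the known words in doc."""
    counts = Counter(token2id[w] for w in doc if w in token2id)
    return sorted(counts.items())

print(to_bow(['human', 'interface', 'computer'], token2id))
# [(0, 1), (1, 1), (2, 1)]
print(to_bow(['system', 'human', 'system', 'eps'], token2id))
# [(2, 1), (5, 2), (6, 1)]
```

The second document shows where the frequency matters: 'system' occurs twice, so its pair is (5, 2).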

Convert a document into a feature vector (this completes the vectorization of the document).

from gensim import corpora,matutils
dense0 = list(matutils.corpus2dense([dic.doc2bow(doclist[0])], 
num_terms=len(dic)).T[0])
print(dense0)
# [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

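corpus2dense takes a corpus (a list of BoW documents) and returns a terms x documents matrix, which is why the single document above is wrapped in a list and the result is transposed. To vectorize every document, it is convenient to wrap this in a function and apply it in a list comprehension, as below.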

from gensim import corpora,matutils
def vec2dense(vec, num_terms):
    return list(matutils.corpus2dense([vec], num_terms=num_terms).T[0])
data_all = [vec2dense(dic.doc2bow(doc), len(dic)) for doc in doclist]
print(data_all)
# [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
#  [0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
#  [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
#  [0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0],
#  [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
#  [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
#  [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
#  [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0],
#  [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]
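
A dense vector is just the BoW pairs written into a zero-filled list whose length is the vocabulary size. A plain-Python sketch of that idea, without gensim or NumPy; `bow_to_dense` is a hypothetical helper, and it returns ordinary Python floats rather than NumPy values.

```python
def bow_to_dense(bow, num_terms):
    """Expand sparse (word ID, count) pairs into a fixed-length list."""
    dense = [0.0] * num_terms          # one slot per word in the vocabulary
    for token_id, count in bow:
        dense[token_id] = float(count)
    return dense

print(bow_to_dense([(0, 1), (1, 1), (2, 1)], 10))
# [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Every document ends up with the same length, which is what makes the result usable as input to a classifier or a clustering routine.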