Doc2Vec with gensim - Machine Learning and NLP Study Notes

What is Doc2Vec?

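Doc2Vec is a technique for turning documents of arbitrary length into fixed-length vectors.
It learns distributed representations of documents and other pieces of text.
* By measuring the similarity between these vectors, you can classify documents or search for similar ones.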

In Word2Vec's CBoW, the input consisted only of word IDs (one-hot encoded words),
whereas Doc2Vec takes the word IDs together with a paragraph ID as input.

The idea is illustrated in the figure below.
[Figure: model architecture diagram, excerpted from the paper below]
[1405.4053] Distributed Representations of Sentences and Documents

For a summary in Japanese, the following slides are easy to follow.

【論文紹介】Distributed Representations of Sentences and Documents, slides by Tomofumi Yoshida (www.slideshare.net)

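As in Word2Vec's CBoW model, the input layer represents the "context", and the model is trained to predict a word as its output. The paper calls this model PV-DM.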
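The input of this CBoW-like model (PV-DM in the paper) can be sketched in plain Python. This is a toy illustration and not gensim's implementation: the names `doc_vectors`, `word_vectors`, and `pv_dm_input`, the 4-dimensional vectors, and the choice to average the inputs (the paper also describes concatenating them) are all assumptions made for this example.

```python
import random

random.seed(0)
DIM = 4  # toy dimensionality, chosen only for this illustration


def random_vector(dim):
    """Return a small random vector, standing in for an untrained embedding."""
    return [random.uniform(-0.5, 0.5) / dim for _ in range(dim)]


# One vector per paragraph ID and one per word ID; both would be learned in training.
doc_vectors = {"pig": random_vector(DIM), "lion": random_vector(DIM)}
word_vectors = {w: random_vector(DIM) for w in ["a", "farm", "animal", "that"]}


def pv_dm_input(doc_tag, context_words):
    """Average the paragraph vector together with the context word vectors.

    CBoW would average only the context word vectors; here the paragraph
    vector is added as one extra input.
    """
    vectors = [doc_vectors[doc_tag]] + [word_vectors[w] for w in context_words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Input used to predict the word "animal" in document "pig" from its context.
hidden = pv_dm_input("pig", ["a", "farm", "that"])
print(hidden)
```

The same context words appearing in a different document would produce a different input, because the paragraph vector differs. That is what lets the paragraph vector absorb document-specific information.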

DBoW (PV-DBOW in the paper) uses a technique similar to Word2Vec's Skip-gram.
The difference from Skip-gram is that the input is a document ID rather than a word.
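The difference between the two can be made concrete by listing the (input, predicted word) training pairs each one would see. The sketch below is a simplification under stated assumptions: `skipgram_pairs` and `dbow_pairs` are hypothetical helper names, and the sampling of windows and words that real implementations perform is left out.

```python
def skipgram_pairs(words, window):
    """(input word, predicted word) pairs, in the style of Skip-gram."""
    pairs = []
    for i, center in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


def dbow_pairs(doc_tag, words):
    """(document ID, predicted word) pairs: the input is always the document ID."""
    return [(doc_tag, word) for word in words]


words = ["a", "large", "animal"]
sg = skipgram_pairs(words, window=1)
db = dbow_pairs("lion", words)
print(sg)
print(db)
```

In the Skip-gram pairs the input changes with every position, while in the DBoW pairs the single document ID has to predict every word in the document.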

Trying it out with gensim

The basic usage is as follows.

from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

# Each TaggedDocument pairs a list of tokens with the tag(s) identifying the document.
sent1 = TaggedDocument(words=['a', 'farm', 'animal', 'that', 'are', 'kept', 'for', 'their', 'meat'],
                       tags=["pig"])
sent2 = TaggedDocument(words=['animal', 'kept', 'on', 'a', 'farm', 'for', 'their', 'meat', 'or', 'milk'],
                       tags=["cattle"])
sent3 = TaggedDocument(words=['a', 'large', 'animal', 'of', 'the', 'cat', 'family', 'that', 'lives', 'in', 'Africa'],
                       tags=["lion"])

training_docs = [sent1, sent2, sent3]

# dm=0 selects DBoW; min_count=1 keeps every word, since this corpus is tiny.
model = Doc2Vec(documents=training_docs, min_count=1, dm=0)

# Note: in gensim 4.x, model.docvecs was renamed to model.dv.
print(model.docvecs['pig'])
print(model.docvecs.most_similar('pig'))

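Parameters for training the model
dm: training algorithm. dm=1 (the default) selects PV-DM; dm=0 selects DBoW
size: dimensionality of the distributed representation (named vector_size in gensim 4.x)
window: width of the context window
min_count: minimum number of occurrences for a word to be used in training
alpha: learning rate
sample: threshold for downsampling frequent words
seed: seed for the random number generator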
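To make `min_count` and `window` more concrete, here is a small plain-Python sketch of what they control. It is an illustration only and not gensim's code: `prune_vocab` and `context_windows` are hypothetical helpers, and the window is applied at its full width on every position, which is a simplification.

```python
from collections import Counter

docs = [
    ["a", "farm", "animal", "kept", "for", "meat"],
    ["animal", "kept", "on", "a", "farm"],
]


def prune_vocab(docs, min_count):
    """Keep only words whose total count across all documents is >= min_count."""
    counts = Counter(word for doc in docs for word in doc)
    return {word for word, count in counts.items() if count >= min_count}


def context_windows(words, window):
    """Yield (target word, context words at most `window` positions away)."""
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        yield target, left + right


vocab = prune_vocab(docs, min_count=2)
kept = [w for w in docs[0] if w in vocab]  # words below min_count are dropped
windows = list(context_windows(kept, window=1))
print(sorted(vocab))
print(windows)
```

With a corpus as small as the one in this post, any `min_count` above 1 would discard most of the vocabulary, which is why the example sets `min_count=1`.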

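To inspect the result, model.docvecs[tag] returns the vector of a document,
and model.docvecs.most_similar(tag) returns the most similar documents with their similarity scores.
* most_similar(tag, topn=N) shows only the top N.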

print(model.docvecs['pig'])
print(model.docvecs.most_similar('pig'))

[  7.37672264e-04   2.55218917e-03   3.87183158e-03   4.96673724e-03
  -2.12055026e-03  -1.79789859e-04   1.26332487e-03  -1.14310812e-03
   1.93652022e-03  -3.45532107e-03  -4.16487828e-03  -6.79772696e-04
  -2.06780876e-03  -3.85531620e-03  -2.83984147e-04  -2.28316057e-04
  -3.92117584e-03  -1.04109372e-03   3.93371237e-03   1.03355723e-03
  -1.23350730e-03   3.54638952e-03  -4.18237597e-03  -4.98958305e-03
   2.09533959e-03   3.08173359e-04  -2.41976828e-04  -2.69978796e-03
  -4.94299782e-03  -4.12264513e-03  -4.04989347e-03  -4.24184883e-03
  -3.99964629e-03   2.90936441e-03   4.15229239e-04  -1.50048616e-03
  -3.39782360e-04  -3.63648613e-03   1.25556451e-03   1.70723349e-03
  -2.16140714e-03  -2.15330481e-04   3.54462955e-03  -1.50826317e-03
  -5.40140725e-04   5.48483455e-04   4.97757597e-03   3.81828024e-04
   4.79781814e-03   3.17530736e-04  -1.34915707e-03  -1.89027516e-03
   2.36914819e-03   1.87047920e-03   1.52318855e-03  -2.42520310e-03
   3.42938164e-03   5.70843520e-04  -2.70692888e-03  -2.23776675e-03
   4.97038290e-03   1.18738913e-03  -2.37358734e-03  -2.54190556e-04
  -3.43979266e-03   4.35100542e-03  -3.38647538e-03   1.80602365e-03
   1.76456710e-03   1.49385724e-03  -1.68423669e-03  -2.57762847e-03
   1.19930680e-03   1.63393840e-03  -4.63356590e-03  -3.24527058e-03
  -1.26868521e-03   3.26017803e-03   5.91240940e-04  -4.34087451e-05
   3.84137640e-03   1.92171836e-03   1.44708622e-03   2.21109460e-03
   4.05864557e-03   2.81249546e-03   2.17977073e-03   1.11479964e-03
  -3.04341200e-03  -3.75849754e-03  -1.70324382e-03  -2.15367274e-03
   6.77816453e-04   3.41445883e-03  -2.50433641e-03   5.18219837e-04
   2.93624471e-04   1.78267073e-03  -3.49953887e-03   4.64853644e-03]
[('cattle', 0.01924954727292061), ('lion', -0.0010750014334917068)]

Just to be sure, I also checked the value with a hand-written cosine similarity function.

import numpy as np

def cos_sim(v1, v2):
    # Cosine similarity: dot product divided by the product of the vector norms.
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(cos_sim(model.docvecs['pig'], model.docvecs['cattle']))
# 0.0192495

This matches the similarity reported by the model.
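Since the scores agree, a most_similar-style lookup can be thought of as ranking all other documents by cosine similarity and keeping the top N. The following is a minimal sketch of that idea in plain Python, not gensim's actual implementation; the 2-dimensional `toy_vectors` are made up for the example and are unrelated to the trained model above.

```python
import math


def cos_sim(v1, v2):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


def most_similar(vectors, tag, topn=10):
    """Rank every other tag by cosine similarity to `tag`, best first."""
    scores = [(other, cos_sim(vectors[tag], vec))
              for other, vec in vectors.items() if other != tag]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:topn]


# Hypothetical document vectors, invented for this illustration.
toy_vectors = {
    "pig": [1.0, 0.0],
    "cattle": [1.0, 1.0],
    "lion": [0.0, 1.0],
}
top = most_similar(toy_vectors, "pig", topn=1)
print(top)
```

Here "cattle" ranks first for "pig" because its vector points in a closer direction than the vector for "lion".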

Finally, saving and loading the trained model.

model.save('doc2vec.model')
# Doc2Vec is already imported above, so load through the class directly.
model = Doc2Vec.load('doc2vec.model')