RNN with Theano, Part 2 - Machine Learning and NLP Study Notes

Continuing with the RNN

kento1109.hatenablog.com

Last time I summarized the __init__ part of elman.py.
This time I read through the code in elman-forward.py that comes after the RNN instance is created.

As covered in the previous post, the call

rnn = model(nh = s['nhidden'],
            nc = nclasses,
            ne = vocsize,
            de = s['emb_dimension'],
            cs = s['win'] )

creates the rnn instance.

We now use it for training.

load the dataset

I barely touched on the dataset last time, so let's check it here.
The tutorial shows the contents of sentence and index2word.

>>> sentence
array([383, 189,  13, 193, 208, 307, 195, 502, 260, 539,
        7,  60,  72, 8, 350, 384], dtype=int32)
>>> map(lambda x: index2word[x], sentence)
['please', 'find', 'a', 'flight', 'from', 'miami', 'florida',
        'to', 'las', 'vegas', '<UNK>', 'arriving', 'before', 'DIGIT', "o'clock", 'pm']

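From this, sentence is a vector of word IDs, and the xx_lex variables (train_lex and so on) are presumably collections of such vectors, one per sentence.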

train with early stopping on validation set

First, set up the variables.

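best_f1 = -numpy.inf  # start at negative infinity so the first F1 score always improves on it
s['clr'] = s['lr']  # current learning rate, initialized to the base learning rate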

Training runs for the configured number of epochs.

for e in xrange(s['nepochs']):
    # shuffle
    shuffle([train_lex, train_ne, train_y], s['seed'])
    s['ce'] = e
    tic = time.time()

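The shuffle function defined in utils shuffles the order of each dataset.
s['ce'] is set to the current epoch number.
Everything up to here is preparation; the actual training starts next.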
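
The key property of shuffle is that the three lists stay aligned after shuffling. The following is a minimal sketch of how such a helper can work, assuming it re-seeds the random generator before shuffling each list (written for Python 3; the toy data is made up for illustration).

```python
import random


def shuffle(lol, seed):
    """Shuffle every list in `lol` in place with the same permutation.

    Re-seeding before each shuffle gives all lists of equal length an
    identical permutation, so words, entities and labels stay aligned.
    """
    for l in lol:
        random.seed(seed)
        random.shuffle(l)


# Hypothetical toy data: three parallel lists, one entry per sentence.
train_lex = [[1, 2], [3], [4, 5, 6], [7]]
train_ne = ["a", "b", "c", "d"]
train_y = [[0, 0], [1], [2, 2, 2], [3]]

pairs_before = set(zip(map(tuple, train_lex), train_ne))
shuffle([train_lex, train_ne, train_y], 345)
pairs_after = set(zip(map(tuple, train_lex), train_ne))

# Each sentence is still paired with its own entity entry and labels.
print(pairs_before == pairs_after)
```

Because the seed is reset for every list, the same seed produces the same order on every call.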

for i in xrange(nsentences):
    cwords = contextwin(train_lex[i], s['win'])
    words  = map(lambda x: numpy.asarray(x).astype('int32'),\
                 minibatch(cwords, s['bs']))
    labels = train_y[i]

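nsentences is the number of training sentences.
contextwin takes a sentence and returns, for every word, a window of the specified size centered on that word, padding the sentence boundaries with -1.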

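The idea is to build, for each word in the sentence, a small window of neighbouring words (loosely a bag of words, although the order within the window is preserved).
So cwords looks like this.
Note: the real input holds word IDs, not words.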

cwords = contextwin(["I", "have", "an", "big", "apple"], 3)
# [[-1, 'I', 'have'], ['I', 'have', 'an'], ['have', 'an', 'big'], ['an', 'big', 'apple'], ['big', 'apple', -1]]

The point to note here is that a single sentence is turned into context windows of the specified size.
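
To make the padding behaviour concrete, here is a minimal sketch of a contextwin-style function. It assumes an odd window size and -1 as the padding index, consistent with the output shown above; it is written for Python 3 and is not necessarily identical to the helper in utils.

```python
def contextwin(l, win):
    """Return one context window of odd size `win` per element of `l`.

    The sequence is padded with win // 2 copies of -1 on each side so
    that every word, including the first and last, sits at the center
    of its own window.
    """
    assert win >= 1 and win % 2 == 1, "window size must be odd"
    l = list(l)
    pad = (win // 2) * [-1]
    lpadded = pad + l + pad
    return [lpadded[i:i + win] for i in range(len(l))]


cwords = contextwin([12, 5, 31, 12, 23], 3)
for window in cwords:
    print(window)
```

The number of windows always equals the number of words, which is what later lets each window be paired with one label.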

Next is the minibatch function.
I tried this one out as well.

cwords = contextwin([12, 5, 31, 12, 23], 3)
# [[-1, 12, 5], [12, 5, 31], [5, 31, 12], [31, 12, 23], [12, 23, -1]]
words = map(lambda x: numpy.asarray(x).astype('int32'), \
            minibatch(cwords, 5))
# [array([[-1, 12,  5]]), 
#  array([[-1, 12,  5], [12,  5, 31]]), 
#  array([[-1, 12,  5], [12,  5, 31], [ 5, 31, 12]]), 
#  array([[-1, 12,  5], [12,  5, 31], [ 5, 31, 12], [31, 12, 23]]), 
#  array([[-1, 12,  5], [12,  5, 31], [ 5, 31, 12], [31, 12, 23], [12, 23, -1]])]

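The last element of words holds all the context windows.
The number of elements in words equals the number of words in the sentence.
When len(cwords) > s['bs'], on the other hand,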

words = map(lambda x: numpy.asarray(x).astype('int32'), \
            minibatch(cwords, 3))
# [array([[-1, 12,  5]]), 
#  array([[-1, 12,  5], [12,  5, 31]]), 
#  array([[-1, 12,  5], [12,  5, 31], [ 5, 31, 12]]), 
#  array([[12,  5, 31], [ 5, 31, 12], [31, 12, 23]]), 
#  array([[ 5, 31, 12], [31, 12, 23],[12, 23, -1]])]

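the last element of words no longer contains the first context window.
With the default settings s['bs'] > s['win'], but what decides this is the sentence length, not the window size: the last element of words holds every context window of the sentence only when the sentence has at most s['bs'] words. For longer sentences, each element holds just the most recent s['bs'] windows.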
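
The two cases can be reproduced with a small sketch of a minibatch-style function. This assumes the behaviour inferred from the outputs above (a growing prefix until bs windows are available, then a sliding slice of the last bs windows); it is written for Python 3 and the input is the toy cwords from the experiment.

```python
def minibatch(l, bs):
    """Return one batch per element of `l`.

    Batch i ends at element i and contains at most `bs` elements:
    first growing prefixes, then a sliding slice of length `bs`.
    """
    out = [l[:i] for i in range(1, min(bs, len(l) + 1))]
    out += [l[i - bs:i] for i in range(bs, len(l) + 1)]
    assert len(l) == len(out)
    return out


cwords = [[-1, 12, 5], [12, 5, 31], [5, 31, 12], [31, 12, 23], [12, 23, -1]]

sizes_bs5 = [len(b) for b in minibatch(cwords, 5)]  # sentence length == bs
sizes_bs3 = [len(b) for b in minibatch(cwords, 3)]  # sentence longer than bs
sizes_bs9 = [len(b) for b in minibatch(cwords, 9)]  # sentence shorter than bs

print(sizes_bs5, sizes_bs3, sizes_bs9)
```

In every case the last window of batch i is the window of word i, so bs only limits how far back in the sentence each training step can see.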
labels holds the gold tags (IOB) of the i-th sentence, one per word.
Next comes the loop over the words of a single sentence.

for word_batch , label_last_word in zip(words, labels):
    rnn.train(word_batch, label_last_word, s['clr'])
    rnn.normalize()
if s['verbose']:
    print '[learning] epoch %i >> %2.2f%%'%(e,(i+1)*100./nsentences),'completed in %.2f (sec) <<\r'%(time.time()-tic),
    sys.stdout.flush()

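word_batch: the context windows ending at the current word (at most s['bs'] of them)
label_last_word: the tag of the last word in word_batch
are taken in order, one pair per iteration.
The first iteration looks like this.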

word_batch = [[-1, 12, 5]]  # shape (1, 3): a batch holding one context window
label_last_word = 4

These are passed in as idxs and y, respectively.
From here, let us follow idxs until it becomes the symbol x.

x = self.emb[idxs].reshape((idxs.shape[0], de*cs))

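is all there is to it, but it is worth pinning down what it means.
First, self.emb is the embedding matrix; the toy examples here call it emd. Concretely,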

emd = 0.2 * numpy.random.uniform(-1.0, 1.0, (20, 5))
[[-0.033  0.088 -0.2   -0.079 -0.141]
 [-0.163 -0.125 -0.062 -0.041  0.016]
 [-0.032  0.074 -0.118  0.151 -0.189]
 [ 0.068 -0.033  0.023 -0.144 -0.121]
 [ 0.12   0.187 -0.075  0.077  0.151]
 [ 0.158 -0.166 -0.184 -0.132  0.151]
 [-0.161 -0.032  0.183  0.013  0.077]
 [-0.074  0.075  0.134 -0.193  0.1  ]
 [ 0.196  0.099 -0.088  0.116 -0.159]
 [-0.021  0.163 -0.083 -0.085 -0.148]
 [-0.192  0.072 -0.115 -0.094 -0.003]
 [-0.179  0.03  -0.141  0.036  0.08 ]
 [-0.159 -0.034  0.078 -0.034 -0.18 ]
 [ 0.014  0.066  0.006  0.178  0.035]
 [ 0.161 -0.145 -0.144  0.123 -0.041]
 [-0.134  0.171 -0.061  0.1    0.09 ]
 [ 0.153  0.049  0.1   -0.06  -0.092]
 [ 0.158 -0.029  0.186  0.065  0.049]
 [-0.154  0.18  -0.02   0.031 -0.037]
 [-0.105  0.161  0.029 -0.199  0.047]]

is such a matrix (20 words; 5 dimensions here for simplicity).
Then, with

emd[idxs] # idxs = [[-1 12 5]]
[[[-0.105  0.161  0.029 -0.199  0.047]  # index -1, i.e. the last row
  [-0.159 -0.034  0.078 -0.034 -0.18 ]  # row index 12
  [ 0.158 -0.166 -0.184 -0.132  0.151]]] # row index 5

the embedding vectors for the given indices are extracted.
Likewise,

emd[idxs] # idxs = [[-1 12 5] [12  5 11]]
[[[-0.105  0.161  0.029 -0.199  0.047]
  [-0.159 -0.034  0.078 -0.034 -0.18 ]
  [ 0.158 -0.166 -0.184 -0.132  0.151]]

 [[-0.159 -0.034  0.078 -0.034 -0.18 ]
  [ 0.158 -0.166 -0.184 -0.132  0.151]
  [-0.179  0.03  -0.141  0.036  0.08 ]]]

is what we get, which is also worth keeping in mind.
Once this is clear,

x = emd[idxs].reshape(idxs.shape[0], 5*3)
[[-0.105  0.161  0.029 -0.199  0.047 -0.159 -0.034  0.078 -0.034 -0.18
   0.158 -0.166 -0.184 -0.132  0.151]
 [-0.159 -0.034  0.078 -0.034 -0.18   0.158 -0.166 -0.184 -0.132  0.151
  -0.179  0.03  -0.141  0.036  0.08 ]]

makes sense as well.
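
The lookup and reshape can be checked end to end with NumPy. This is a sketch using a randomly initialized toy matrix (the seed, the sizes and the variable name looked_up are chosen for illustration), so the printed values will differ from the ones above while the shapes and row correspondences are the same.

```python
import numpy

rng = numpy.random.RandomState(0)  # fixed seed, only for reproducibility
ne, de, cs = 20, 5, 3              # toy vocabulary size, embedding dim, window size

emd = 0.2 * rng.uniform(-1.0, 1.0, (ne, de))
idxs = numpy.asarray([[-1, 12, 5], [12, 5, 11]], dtype='int32')

# Indexing with a 2-D integer array yields one embedding row per index:
# shape (number of windows, window size, embedding dim).
looked_up = emd[idxs]

# Flatten each window so its cs embeddings are concatenated into one row.
x = looked_up.reshape((idxs.shape[0], de * cs))

print(looked_up.shape, x.shape)
```

Each row of x is the concatenation of the embeddings of the words in one window, in window order.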
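The recurrence function, which scan calls step by step, is applied once per row of this x.
Note that the input dimension is not that of a single word embedding but (embedding dimension × window size).
In other words, the words in a window are treated as one unit, fed through the network together, and yield one output.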
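
To illustrate the row-by-row recurrence, here is a NumPy sketch of an Elman-style forward pass over the rows of x. It assumes a sigmoid hidden layer and a softmax output, and it replaces theano.scan with a plain Python loop. The function name forward, the weight names, the sizes and the random initialization are illustrative assumptions, not values read from elman.py.

```python
import numpy


def sigmoid(z):
    return 1.0 / (1.0 + numpy.exp(-z))


def softmax(z):
    e = numpy.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def forward(x, Wx, Wh, W, bh, b, h0):
    """Apply one recurrence step per row of x and collect the outputs."""
    h = h0
    outputs = []
    for x_t in x:  # x_t has de * cs entries: one whole context window
        h = sigmoid(numpy.dot(x_t, Wx) + numpy.dot(h, Wh) + bh)
        outputs.append(softmax(numpy.dot(h, W) + b))
    return numpy.asarray(outputs)


rng = numpy.random.RandomState(0)
de, cs, nh, nc = 5, 3, 4, 6  # toy sizes (assumed)

x = rng.uniform(-0.2, 0.2, (2, de * cs))       # two windows, as in the example
Wx = 0.2 * rng.uniform(-1.0, 1.0, (de * cs, nh))
Wh = 0.2 * rng.uniform(-1.0, 1.0, (nh, nh))
W = 0.2 * rng.uniform(-1.0, 1.0, (nh, nc))
bh, b, h0 = numpy.zeros(nh), numpy.zeros(nc), numpy.zeros(nh)

p_y = forward(x, Wx, Wh, W, bh, b, h0)
print(p_y.shape)
```

The input weight matrix has de * cs rows, which reflects the point that the network consumes a whole window at each step rather than a single word embedding.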

What follows is model evaluation, so this is where the RNN walkthrough ends.
Next time I will cover LSTM, which also appears in many recent papers.