LSTM with Theano (1) - Study Notes on Machine Learning and Natural Language Processing

These notes organize my reading of the documentation below.

LSTM Networks for Sentiment Analysis — DeepLearning 0.1 documentation

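Introduction

Before reading the code, let's get an overview of the whole thing.

The code consists of the following two files:

lstm.py : model definition and training
imdb.py : loading and preprocessing of the IMDB dataset

Both are available from: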
github.com

The file to run is lstm.py.

While looking into this documentation I found that someone had already worked through it, so I used their write-up as a reference.
qiita.com

Now, on to the content.

train_lstm

This is the function called from __main__.
The main block does nothing except make this call; the actual main processing is done by train_lstm.

if __name__ == '__main__':
    # See function train for all possible parameter and there definition.
    train_lstm(
        max_epochs=100,
        test_size=500,
    )

Line 475 of lstm.py,

model_options = locals().copy()

stores the local variables and their values as a dictionary.
For example, model_options['dim_proj'] = 128.
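The behavior of locals().copy() can be checked with a small sketch. The function train_sketch below is a hypothetical stand-in for train_lstm that takes only a few of its arguments; it assumes the call to locals() happens before any other local variable is created.

```python
def train_sketch(dim_proj=128, max_epochs=100, test_size=500):
    # Hypothetical stand-in for train_lstm with only a few arguments.
    # Called before anything else is assigned, locals() holds just the arguments.
    model_options = locals().copy()
    return model_options


options = train_sketch(max_epochs=100, test_size=500)
print(options['dim_proj'])  # 128 (the default value)
print(sorted(options))      # ['dim_proj', 'max_epochs', 'test_size']
```

Taking a copy means later changes to the dictionary do not depend on how locals() behaves afterwards.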

After that, the dataset is loaded and split into training, validation, and test data.

init_params

Initializes the parameters for the embedding and the classifier.
(This is not the initialization of the LSTM parameters.)

# embedding
randn = numpy.random.rand(options['n_words'],  # 10000
                          options['dim_proj'])  # 128
params['Wemb'] = (0.01 * randn).astype(config.floatX)

This creates a 10000×128 parameter matrix.
It is the initialization of the embedding parameters.
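As a quick check of the shape and value range, here is a minimal NumPy-only sketch. It assumes float32 in place of config.floatX and uses the sizes noted in the comments (10000 and 128). Note that numpy.random.rand draws uniform samples in [0, 1), even though the variable in the excerpt is named randn.

```python
import numpy

n_words, dim_proj = 10000, 128  # sizes taken from the comments in the excerpt

# numpy.random.rand samples uniformly from [0, 1), so the scaled values
# fall roughly in [0, 0.01).
uniform = numpy.random.rand(n_words, dim_proj)
Wemb = (0.01 * uniform).astype('float32')  # float32 is an assumption here

print(Wemb.shape)  # (10000, 128)
print(Wemb.dtype)  # float32
```

Each row of Wemb is the vector for one word ID.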
Next,

params = get_layer(options['encoder'])[0](options,
                                          params,
                                          prefix=options['encoder'])

get_layer returns the value stored in layers under the given key.

# ff: Feed Forward (normal neural net), only useful to put after lstm
#     before the classifier.
layers = {'lstm': (param_init_lstm, lstm_layer)}

def get_layer(name):
    fns = layers[name]
    return fns

Since name is 'lstm', (param_init_lstm, lstm_layer) is returned.
In other words, the last part is equivalent to the following,

params = param_init_lstm(options, params, prefix=options['encoder'])

so we can see that the param_init_lstm function is being called.
(The dictionary is looked up by key, and the function stored as its value is executed.)
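The lookup-then-call pattern can be sketched on its own as follows. The functions param_init_demo and demo_layer and the key 'demo' are hypothetical placeholders, not part of lstm.py; only the shape of the pattern is the same.

```python
def param_init_demo(options, params, prefix='demo'):
    # Hypothetical initializer: registers one entry under a prefixed name.
    params[prefix + '_W'] = [0.0] * options['dim_proj']
    return params


def demo_layer(params, state_below, prefix='demo'):
    # Hypothetical layer function: passes its input straight through.
    return state_below


# name -> (initializer, layer function), the same layout as `layers`
layers = {'demo': (param_init_demo, demo_layer)}


def get_layer(name):
    return layers[name]


options = {'encoder': 'demo', 'dim_proj': 4}
# [0] selects the initializer, [1] would select the layer function.
params = get_layer(options['encoder'])[0](options, {}, prefix=options['encoder'])
print(sorted(params))  # ['demo_W']
```

With this layout, supporting another layer type only requires adding one more entry to the dictionary.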
Before going into that, let's look at the parameters for the classifier.

# classifier
params['U'] = 0.01 * numpy.random.randn(options['dim_proj'],  # 128
                                        options['ydim']).astype(config.floatX)  # 2
params['b'] = numpy.zeros((options['ydim'],)).astype(config.floatX)

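These define the parameters between the hidden layer and the output layer.
The parameters of the intermediate LSTM layer are also set in params, so params ends up holding all of the parameters.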
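To see how U and b connect a 128-dimensional hidden representation to 2 output classes, here is a rough NumPy sketch. It assumes a softmax output layer, and proj is just a random placeholder for whatever the LSTM layer produces; n_samples = 3 is arbitrary.

```python
import numpy

dim_proj, ydim, n_samples = 128, 2, 3  # n_samples is an arbitrary choice

U = 0.01 * numpy.random.randn(dim_proj, ydim)
b = numpy.zeros((ydim,))

# Random placeholder for the hidden representation of each sample.
proj = numpy.random.randn(n_samples, dim_proj)

scores = proj.dot(U) + b                              # (n_samples, ydim)
scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
exp_scores = numpy.exp(scores)
pred = exp_scores / exp_scores.sum(axis=1, keepdims=True)  # softmax (assumed)

print(pred.shape)  # (3, 2)
```

Each row of pred sums to 1 and can be read as class probabilities for one document.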

param_init_lstm

def ortho_weight(ndim):
    W = numpy.random.randn(ndim, ndim)
    u, s, v = numpy.linalg.svd(W)
    return u.astype(config.floatX)

def param_init_lstm(options, params, prefix='lstm'):
    """
    Init the LSTM parameter:
    :see: init_params
    """
    W = numpy.concatenate([ortho_weight(options['dim_proj']),
                           ortho_weight(options['dim_proj']),
                           ortho_weight(options['dim_proj']),
                           ortho_weight(options['dim_proj'])], axis=1)
    params[_p(prefix, 'W')] = W
    U = numpy.concatenate([ortho_weight(options['dim_proj']),
                           ortho_weight(options['dim_proj']),
                           ortho_weight(options['dim_proj']),
                           ortho_weight(options['dim_proj'])], axis=1)
    params[_p(prefix, 'U')] = U
    b = numpy.zeros((4 * options['dim_proj'],))
    params[_p(prefix, 'b')] = b.astype(config.floatX)

    return params

Here, singular value decomposition (SVD) appears.
I summarized SVD in the post below.
kento1109.hatenablog.com

Checking the shape of the return value:

import numpy

def ortho_weight(ndim):
    W = numpy.random.randn(ndim, ndim)
    u, s, v = numpy.linalg.svd(W)
    return u

print(ortho_weight(128).shape)
# (128, 128)

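So the result is a 128×128 matrix.
The purpose of ortho_weight appears to be to obtain a random initial value.
(SVD is presumably used because the returned u is an orthogonal matrix, as the name ortho_weight suggests.)
Four of these are concatenated along axis=1, so W is a 128×512 parameter matrix.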
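One property worth checking is that the u returned by numpy.linalg.svd is an orthogonal matrix, which may be the reason SVD is used here. The sketch below verifies this numerically and also confirms the shape after concatenation; it drops the astype(config.floatX) cast so that it runs with NumPy alone.

```python
import numpy


def ortho_weight(ndim):
    W = numpy.random.randn(ndim, ndim)
    u, s, v = numpy.linalg.svd(W)
    return u  # cast to config.floatX omitted in this sketch


u = ortho_weight(128)
# For an orthogonal matrix, u.T @ u is the identity (up to rounding error).
print(numpy.allclose(u.T.dot(u), numpy.eye(128)))  # True

# Four independent 128x128 blocks placed side by side.
W = numpy.concatenate([ortho_weight(128) for _ in range(4)], axis=1)
print(W.shape)  # (128, 512)
```

Each 128×128 block of W is orthogonal on its own; the concatenated 128×512 matrix as a whole is not square and therefore not orthogonal.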
Recalling the LSTM parameters, we have

$W_i$: weights for the input gate
$W_f$: weights for the forget gate
$W_o$: weights for the output gate
$W_c$: weights for the memory cells

And the hidden layer has 128 units.
So LSTM_W can be pictured as follows.
[Figure: layout of LSTM_W]
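LSTM_U has exactly the same layout.
For example, the input gate is computed as
$i_t=\sigma (W_i x_t+U_i h_{t-1}+b_i)$
From this we can see that

LSTM_W: the linear transformation applied to the input
LSTM_U: the linear transformation applied to the hidden state vector

b holds the bias for each of them.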
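The input gate formula can be tied to the concatenated matrices with the NumPy sketch below. It assumes the four blocks are laid out in the order listed above (input, forget, output, memory cell), uses random stand-ins for LSTM_W and LSTM_U, and introduces a hypothetical helper gate_slice to pick out one block. Vectors are treated as rows, so the products are written x_t · W rather than W x_t.

```python
import numpy

dim = 128
rng = numpy.random.RandomState(0)
W = rng.randn(dim, 4 * dim)  # stand-in for LSTM_W: [W_i | W_f | W_o | W_c] (assumed order)
U = rng.randn(dim, 4 * dim)  # stand-in for LSTM_U, same layout
b = numpy.zeros((4 * dim,))


def gate_slice(a, n, dim):
    # Hypothetical helper: the columns belonging to the n-th gate.
    return a[..., n * dim:(n + 1) * dim]


def sigmoid(z):
    return 1.0 / (1.0 + numpy.exp(-z))


x_t = rng.randn(dim)       # embedded input at time t
h_prev = numpy.zeros(dim)  # h_{t-1}

# One matrix product yields the pre-activations of all four gates at once.
preact = x_t.dot(W) + h_prev.dot(U) + b
i_t = sigmoid(gate_slice(preact, 0, dim))  # input gate

# The same value computed from the W_i, U_i, b_i blocks only.
i_t_direct = sigmoid(x_t.dot(gate_slice(W, 0, dim))
                     + h_prev.dot(gate_slice(U, 0, dim))
                     + gate_slice(b, 0, dim))
print(numpy.allclose(i_t, i_t_direct))  # True
```

Concatenating the gate weights means a single 128×512 product replaces four separate 128×128 products.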

That completes the definition of the parameter initialization.

Next,

tparams = init_tparams(params)

declares the parameters as shared variables.

build_model

The call to the model-building function:

# use_noise is for dropout
(use_noise, x, mask,
y, f_pred_prob, f_pred, cost) = build_model(tparams, model_options)

The symbol x is set to N arbitrary documents,
and the symbol y to the labels of those N documents.

x = tensor.matrix('x', dtype='int64')
mask = tensor.matrix('mask', dtype=config.floatX)
y = tensor.vector('y', dtype='int64')

n_timesteps = x.shape[0]
n_samples = x.shape[1]

emb = tparams['Wemb'][x.flatten()].reshape([n_timesteps,
                                            n_samples,
                                            options['dim_proj']])

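n_timesteps: the length of the sample documents (the same for all of them, since x is a single matrix)
n_samples: the number of sample documents, as the name says.

emb has shape (document length × number of samples × embedding dimension).
The input is a matrix of word IDs, but it is converted here so that the hidden layer receives the distributed representation of each word ID.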
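The lookup and reshape can be reproduced in plain NumPy, as in the sketch below. The word IDs are made up, and padding the shorter document with 0 together with a matching mask is an assumption about how documents of different lengths are brought to one length.

```python
import numpy

n_words, dim_proj = 10000, 128
Wemb = (0.01 * numpy.random.rand(n_words, dim_proj)).astype('float32')

# Two documents of lengths 3 and 2 (made-up word IDs), one column per document.
# The shorter one is padded with 0 (assumed padding scheme).
x = numpy.array([[5, 8],
                 [2, 3],
                 [9, 0]], dtype='int64')
mask = numpy.array([[1., 1.],
                    [1., 1.],
                    [1., 0.]], dtype='float32')  # 0 marks a padded position

n_timesteps, n_samples = x.shape
# Flatten the IDs, pick one row of Wemb per ID, then restore the 2-D layout.
emb = Wemb[x.flatten()].reshape([n_timesteps, n_samples, dim_proj])

print(emb.shape)                              # (3, 2, 128)
print(numpy.array_equal(emb[1, 0], Wemb[2]))  # True
```

emb[t, n] is the embedding vector of the word at position t of document n.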

proj = get_layer(options['encoder'])[1](tparams, emb, options,
                                        prefix=options['encoder'],
                                        mask=mask)

This corresponds to calling element [1] of the 'lstm' entry below,

layers = {'lstm': (param_init_lstm, lstm_layer)}

so the lstm_layer function is called.

The lstm_layer function is the definition of the main LSTM model.
Since this is getting long, the rest is left for the next post.