MLP (Multilayer Perceptron) in Theano - Machine Learning and NLP Study Notes

These notes work through the following documentation.

Multilayer Perceptron — DeepLearning 0.1 documentation

The model looks roughly like this.
[Figure: diagram of the MLP model]

Creating the MLP instance

First, an MLP instance is created as follows.

# construct the MLP class
classifier = MLP(
    rng=rng,
    input=x,
    n_in=28 * 28,
    n_hidden=n_hidden,  # 500
    n_out=10
    )

HiddenLayer

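The MLP constructor creates a HiddenLayer instance and a LogisticRegression instance.
Note: the key point in a multilayer network is how values are handed from one layer to the next.
For LogisticRegression, keep in mind that:

input is the HiddenLayer's output (the hidden layer's output)
n_in is n_hidden (the size of the hidden layer's output)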

class MLP(object):
    def __init__(self, rng, input, n_in, n_hidden, n_out):
        self.hiddenLayer = HiddenLayer(
            rng=rng,
            input=input,  # x
            n_in=n_in,  # 28×28
            n_out=n_hidden,  # 500
            activation=T.tanh
        )
        self.logRegressionLayer = LogisticRegression(
            input=self.hiddenLayer.output,
            n_in=n_hidden,
            n_out=n_out
        )
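
The hand-off between the two layers can be checked with a small NumPy sketch. This is not the tutorial's Theano code: the function name `mlp_forward`, the batch size of 20, and the zero-initialized output weights are assumptions made only for illustration.

```python
import numpy as np

def mlp_forward(x, W_h, b_h, W_o, b_o):
    """Forward pass: tanh hidden layer followed by a softmax output layer."""
    hidden = np.tanh(x @ W_h + b_h)              # (batch, n_hidden)
    logits = hidden @ W_o + b_o                  # (batch, n_out)
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(logits)
    return hidden, exp / exp.sum(axis=1, keepdims=True)

rng = np.random.RandomState(1234)
n_in, n_hidden, n_out, batch = 28 * 28, 500, 10, 20  # batch size is an assumption
x = rng.uniform(size=(batch, n_in))
bound = np.sqrt(6.0 / (n_in + n_hidden))
W_h = rng.uniform(-bound, bound, size=(n_in, n_hidden))
b_h = np.zeros(n_hidden)
W_o = np.zeros((n_hidden, n_out))  # assumption: zero-initialized output layer
b_o = np.zeros(n_out)

hidden, p_y_given_x = mlp_forward(x, W_h, b_h, W_o, b_o)
print(hidden.shape)       # (20, 500)
print(p_y_given_x.shape)  # (20, 10)
```

The second dimension of the hidden output (500) is exactly the first dimension of the output layer's weight matrix, which is why n_in of the output layer has to be n_hidden.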

The HiddenLayer constructor
First, the initial values of W are set.

self.input = input
if W is None:
    W_values = numpy.asarray(
        rng.uniform(
            low=-numpy.sqrt(6. / (n_in + n_out)),
            high=numpy.sqrt(6. / (n_in + n_out)),
            size=(n_in, n_out)
        ),
        dtype=theano.config.floatX
    )
    if activation == theano.tensor.nnet.sigmoid:
        W_values *= 4
    W = theano.shared(value=W_values, name='W', borrow=True)
if b is None:
    b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
    b = theano.shared(value=b_values, name='b', borrow=True)
self.W = W
self.b = b

Training tends to progress more easily when W is drawn uniformly at random from the following ranges.
tanh: $\pm \sqrt{\frac{6}{fan_{in} + fan_{out}}}$
sigmoid: $\pm 4 \sqrt{\frac{6}{fan_{in} + fan_{out}}}$
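
As a rough check of these ranges, the following NumPy sketch draws weights the same way and confirms they stay inside the bound. The helper name `init_weights` and its `sigmoid` flag are hypothetical and not part of the tutorial code.

```python
import numpy as np

def init_weights(rng, n_in, n_out, sigmoid=False):
    """Draw W uniformly from [-bound, bound), bound = sqrt(6 / (n_in + n_out))."""
    bound = np.sqrt(6.0 / (n_in + n_out))
    if sigmoid:
        bound *= 4.0  # the sigmoid case uses a range four times wider
    return rng.uniform(low=-bound, high=bound, size=(n_in, n_out))

rng = np.random.RandomState(1234)
W_tanh = init_weights(rng, 28 * 28, 500)
W_sigmoid = init_weights(rng, 28 * 28, 500, sigmoid=True)

bound = np.sqrt(6.0 / (28 * 28 + 500))
print(W_tanh.shape)                          # (784, 500)
print(np.abs(W_tanh).max() <= bound)         # True
print(np.abs(W_sigmoid).max() <= 4 * bound)  # True
```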

Next, the linear combination with the input is computed and the layer's output is defined.

lin_output = T.dot(input, self.W) + self.b
self.output = (
    lin_output if activation is None else activation(lin_output)
)
# parameters of the model
self.params = [self.W, self.b]

LogisticRegression

Next, the LogisticRegression instance (the output layer) is created.

self.logRegressionLayer = LogisticRegression(
    input=self.hiddenLayer.output,
    n_in=n_hidden,
    n_out=n_out
)

This class is the same as the single-layer model covered in the post below.
kento1109.hatenablog.com

Norms

These are the penalty terms used to suppress overfitting.
They are added to the loss function later, as needed.

# L1 norm ; one regularization option is to enforce L1 norm to
# be small
self.L1 = (
    abs(self.hiddenLayer.W).sum()
    + abs(self.logRegressionLayer.W).sum()
)

# square of L2 norm ; one regularization option is to enforce
# square of L2 norm to be small
self.L2_sqr = (
    (self.hiddenLayer.W ** 2).sum()
    + (self.logRegressionLayer.W ** 2).sum()
)

The caller adds them to the loss function.

cost = (
    classifier.negative_log_likelihood(y)
    + L1_reg * classifier.L1
    + L2_reg * classifier.L2_sqr
    )
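
To make the arithmetic concrete, here is a minimal NumPy sketch of the same sum. The function name `regularized_cost`, the toy weight matrices, and the coefficient values are assumptions chosen for illustration; as in the code above, only the weight matrices W are penalized, not the biases.

```python
import numpy as np

def regularized_cost(nll, weights, l1_reg, l2_reg):
    """Return nll + l1_reg * L1 + l2_reg * L2_sqr over the weight matrices."""
    l1 = sum(np.abs(W).sum() for W in weights)     # sum of absolute values
    l2_sqr = sum((W ** 2).sum() for W in weights)  # sum of squares
    return nll + l1_reg * l1 + l2_reg * l2_sqr

# Toy stand-ins for hiddenLayer.W and logRegressionLayer.W
W_hidden = np.array([[1.0, -2.0], [0.5, 0.0]])
W_out = np.array([[3.0], [-1.0]])

# L1 = 7.5 and L2_sqr = 15.25 for these matrices
cost = regularized_cost(2.0, [W_hidden, W_out], l1_reg=0.1, l2_reg=0.01)
print(cost)
```

Setting a coefficient to 0 switches that penalty off, which is how the same cost expression can cover the unregularized case.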

Defining the model functions

This is almost the same as for logistic regression.

# compute the gradient of cost with respect to theta (sorted in params)
# the resulting gradients will be stored in a list gparams
gparams = [T.grad(cost, param) for param in classifier.params]

# specify how to update the parameters of the model as a list of
# (variable, update expression) pairs

Although it was omitted above, the MLP constructor ends with the following line:

self.params = self.hiddenLayer.params + self.logRegressionLayer.params

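This stores all parameters of the hidden layer and the output layer in params.
As a result, gparams holds the partial derivative of the cost with respect to each parameter.
The gparams are then used as follows: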

# given two lists of the same length, A = [a1, a2, a3, a4] and
# B = [b1, b2, b3, b4], zip generates a list C of same size, where each
# element is a pair formed from the two lists :
#    C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)]
updates = [
    (param, param - learning_rate * gparam)
    for param, gparam in zip(classifier.params, gparams)
]

This updates each parameter in turn by gradient descent.
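
The update rule itself can be tried out in plain NumPy. In Theano the updates list is symbolic and is only applied when the compiled function is called, so the sketch below is an eager analogue, not the tutorial's code. The helper `sgd_step` and the toy cost (sum of squared parameters, whose gradient is 2 * param) are assumptions.

```python
import numpy as np

def sgd_step(params, gparams, learning_rate):
    """Return param - learning_rate * gparam for each (param, gparam) pair."""
    return [p - learning_rate * g for p, g in zip(params, gparams)]

# Toy cost: sum of squares of all parameters, so the gradient is 2 * param.
params = [np.array([1.0, -2.0]), np.array([0.5])]
for _ in range(100):
    gparams = [2.0 * p for p in params]
    params = sgd_step(params, gparams, learning_rate=0.1)

# Every parameter has been pulled toward the minimum at 0.
print([np.abs(p).max() < 1e-6 for p in params])
```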

# compiling a Theano function `train_model` that returns the cost, but
# in the same time updates the parameter of the model based on the rules
# defined in `updates`
train_model = theano.function(
    inputs=[index],
    outputs=cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)

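The train_model function above is the same as in logistic regression.
It takes a batch index and slices the corresponding minibatch out of the training set.
The cost is computed from that input and its target labels, backpropagation gives the gradients that reduce it, and the parameter values are updated little by little.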
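
The index-based slicing used in givens can be sketched in NumPy as follows. The helper name `minibatch`, the toy data, and the batch size of 20 are assumptions for illustration; the actual model update is left out.

```python
import numpy as np

def minibatch(data_x, data_y, index, batch_size):
    """Return the index-th minibatch, mirroring the slices used in givens."""
    start, stop = index * batch_size, (index + 1) * batch_size
    return data_x[start:stop], data_y[start:stop]

# Toy data: 100 examples with 4 features each, labels 0..9
train_set_x = np.arange(100 * 4, dtype=float).reshape(100, 4)
train_set_y = np.arange(100) % 10
batch_size = 20
n_train_batches = train_set_x.shape[0] // batch_size

for index in range(n_train_batches):
    x_batch, y_batch = minibatch(train_set_x, train_set_y, index, batch_size)
    # A real training loop would call train_model(index) here.
    print(index, x_batch.shape, y_batch.shape)
```

One pass over all n_train_batches indices corresponds to one epoch.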