【CS231n】Neural Networks Part1: Setting up the Architecture

I am studying NNs and CNNs using Stanford's CS231n course material.
This post focuses on Neural Networks (NN).

Quick intro

In linear classification, we computed the score of each class with ${s = Wx}$ in order to assign an image to a class.
- ${W}$ is a matrix and ${x}$ is the input vector
- For CIFAR-10, ${x}$ has shape [3072x1] and ${W}$ has shape [10x3072]
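
As a rough illustration of the shapes involved, the linear score ${s = Wx}$ can be sketched in NumPy as follows. The random values stand in for a trained ${W}$ and a real image, and the variable names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

num_classes, num_pixels = 10, 3072  # CIFAR-10: 32 * 32 * 3 = 3072 values per image
W = rng.standard_normal((num_classes, num_pixels)) * 0.01  # [10 x 3072], random placeholder
x = rng.standard_normal((num_pixels, 1))                   # [3072 x 1], stand-in for one image

scores = W @ x                            # [10 x 1]: one score per class
predicted_class = int(np.argmax(scores))  # index of the highest-scoring class
```

Each row of `W` produces the score of one class, so the prediction is simply the row with the largest score.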

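In contrast, a 2-layer Neural Network (NN) computes the scores as ${s = W_2 \max(0, W_1 x)}$
- ${W_1}$ has shape [100x3072]
  It transforms the image into a 100-dimensional intermediate vector
- max(0, -) is a non-linear function applied elementwise
- ${W_2}$ has shape [10x100]
- ${W_1, W_2}$ are learned with SGD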
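
A minimal sketch of the 2-layer score ${s = W_2 \max(0, W_1 x)}$ with the shapes listed above. The weights here are random placeholders rather than values learned with SGD, and the names `hidden` and `scores` are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal((3072, 1))            # stand-in for one flattened image
W1 = rng.standard_normal((100, 3072)) * 0.01  # [100 x 3072]
W2 = rng.standard_normal((10, 100)) * 0.01    # [10 x 100]

hidden = np.maximum(0, W1 @ x)  # [100 x 1] intermediate vector after the non-linearity
scores = W2 @ hidden            # [10 x 1] class scores
```

Without the `np.maximum` in between, the two matrices would collapse into a single matrix and the model would be a linear classifier again.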

A 3-layer network extends this to ${s = W_3 \max(0, W_2 \max(0, W_1 x))}$
- The sizes of the hidden vectors are hyperparameters
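
Because the hidden sizes are hyperparameters, the same computation can be written once for any list of layer sizes. The sketch below does this for ${s = W_3 \max(0, W_2 \max(0, W_1 x))}$; the helper names `init_weights` and `forward` and the hidden sizes 100 and 50 are assumptions for illustration.

```python
import numpy as np

def init_weights(layer_sizes, rng):
    """Return one random weight matrix per pair of adjacent layers."""
    return [rng.standard_normal((n_out, n_in)) * 0.01
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, weights):
    """Apply max(0, W h) for every matrix except the last, which stays linear."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(0, W @ h)
    return weights[-1] @ h

rng = np.random.default_rng(0)
hidden1, hidden2 = 100, 50  # hyperparameters (values chosen arbitrarily here)
weights = init_weights([3072, hidden1, hidden2, 10], rng)
scores = forward(rng.standard_normal((3072, 1)), weights)
```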

Modeling one neuron

The field of NNs was originally inspired by neurobiology,
so we start the discussion from there.

Biological motivation and connections

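The basic computational unit of the brain is the neuron
- Roughly 86 billion neurons are connected by about 10¹⁴ to 10¹⁵ synapses
- Each neuron receives input signals through its dendrites and produces an output signal along its axon
- The axon branches out and connects via synapses to the dendrites of other neurons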

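In the computational model
- a signal travels along an axon ( ${x_0}$ ) and
- interacts multiplicatively ( ${w_0 x_0}$ ) with the dendrite of another neuron, according to the synaptic strength at that synapse ( ${w_0}$ )
The key points are
- the synaptic strengths (the weights ${w}$ ) are learnable and control how strongly one neuron influences another
- in the basic model, the incoming signals are summed, and when the sum exceeds a threshold the neuron fires
The firing rate is modeled with an activation function
- Historically, the sigmoid function ${\sigma}$ has been the common choice of activation function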

[Figure: a biological neuron and its mathematical model (from CS231n)]

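An example implementation of the forward pass of a single neuron:

import math
import numpy as np

class Neuron(object):
  # ... (self.weights and self.bias are assumed to be set elsewhere, e.g. in __init__)
  def forward(self, inputs):
    """ assume inputs and weights are 1-D numpy arrays and bias is a number """
    cell_body_sum = np.sum(inputs * self.weights) + self.bias
    firing_rate = 1.0 / (1.0 + math.exp(-cell_body_sum)) # sigmoid activation function
    return firing_rate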

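This model of a biological neuron is very coarse
- In reality there are many different types of neurons
  - Dendrites can perform complex non-linear computations
  - Synapses are not just a single weight; they are complex non-linear dynamical systems
- If you are interested, see the review articles linked from the CS231n notes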

Single neuron as a linear classifier

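The forward computation of the neuron model should look familiar
- A neuron has the capacity to "like" (activation near one) or "dislike" (activation near zero) certain linear regions of its input space
- Therefore, with an appropriate loss function on its output, a single neuron can be turned into a linear classifier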

Binary Softmax classifier

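${\sigma(\sum_i w_i x_i + b)}$ can be interpreted as the probability of one of the classes, ${P(y_i = 1 \mid x_i; w)}$.
The probability of the other class is ${P(y_i = 0 \mid x_i; w) = 1 - P(y_i = 1 \mid x_i; w)}$, since the two must sum to one.
- With this interpretation we can compute the cross-entropy loss, and optimizing it yields a binary softmax classifier (also known as logistic regression)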
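
To make the loss concrete, here is a small sketch of the cross-entropy loss of a single sigmoid neuron on one example. The function name `binary_cross_entropy`, the label convention y in {0, 1}, and the numbers are assumptions for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(w, b, x, y):
    """Loss of a single sigmoid neuron on one example with label y in {0, 1}."""
    p1 = sigmoid(np.dot(w, x) + b)  # P(y = 1 | x)
    p0 = 1.0 - p1                   # P(y = 0 | x)
    return -(y * np.log(p1) + (1 - y) * np.log(p0))

w = np.array([1.0, -2.0])
b = 0.5
x = np.array([2.0, 0.5])  # w.x + b = 2 - 1 + 0.5 = 1.5

loss_if_positive = binary_cross_entropy(w, b, x, 1)
loss_if_negative = binary_cross_entropy(w, b, x, 0)
prediction = int(sigmoid(np.dot(w, x) + b) > 0.5)  # predict class 1 if P(y = 1 | x) > 0.5
```

The neuron's output is above 0.5 for this input, so the loss is small if the true label is 1 and large if it is 0.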

Binary SVM classifier

Alternatively, we can attach a max-margin hinge loss to the output of the neuron.
Training it then yields a binary Support Vector Machine.
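
A minimal sketch of the hinge loss for one example. It assumes that the raw weighted sum (without a sigmoid) is used as the score and that labels are encoded as -1 or +1; both choices, and the function name `hinge_loss`, are assumptions of this example.

```python
import numpy as np

def hinge_loss(w, b, x, y):
    """Max-margin hinge loss for one example with label y in {-1, +1}."""
    score = np.dot(w, x) + b
    return max(0.0, 1.0 - y * score)

w = np.array([1.0, -2.0])
b = 0.5
x = np.array([2.0, 0.5])  # score = 2 - 1 + 0.5 = 1.5

loss_correct_side = hinge_loss(w, b, x, +1)  # margin satisfied, so the loss is zero
loss_wrong_side = hinge_loss(w, b, x, -1)    # 1 + 1.5 = 2.5
```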

Regularization interpretation

The regularization loss in both the SVM and Softmax cases can be interpreted as gradual forgetting
- This is because it drives all synaptic weights ${w}$ towards zero after every parameter update
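
The "forgetting" effect can be seen by applying only the regularization part of the update. The sketch assumes an L2 penalty of the form 0.5 * lambda * ||w||^2 and deliberately leaves out the data-loss gradient; the values of `reg_strength` and `learning_rate` are arbitrary.

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])
reg_strength = 0.1   # lambda (assumed value)
learning_rate = 0.5  # assumed value

norms = [np.linalg.norm(w)]
for _ in range(20):
    reg_grad = reg_strength * w       # gradient of 0.5 * lambda * ||w||^2
    w = w - learning_rate * reg_grad  # data-loss gradient deliberately omitted
    norms.append(np.linalg.norm(w))
```

Every step multiplies ${w}$ by (1 - 0.05), so the weights decay geometrically towards zero unless the data loss pushes them back.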

Commonly used activation functions

sigmoid

[Figure: the sigmoid non-linearity (from CS231n)]

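The sigmoid non-linearity is ${\sigma(x) = 1 / (1 + e^{-x})}$ (plotted in the figure above)
- It takes a real number and "squashes" it into the range between 0 and 1
  - large negative values become close to 0, and
  - large positive values become close to 1
- It has a natural interpretation as the firing rate of a neuron and was used frequently in the past
- However, it has recently fallen out of favor because of the drawbacks below.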

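Sigmoids saturate and kill gradients
- When the activation saturates near 0 or 1, the gradient in those regions is almost zero.
- During backpropagation, this local gradient is multiplied by the gradient flowing in from above.
  So when the local gradient is close to zero, almost no signal flows back through the neuron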
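
The local gradient of the sigmoid is ${\sigma(x)(1 - \sigma(x))}$, which makes the saturation easy to check numerically. The sample points and the three-unit chain below are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
local_grads = sigmoid_grad(xs)  # peaks at 0.25 for x = 0, nearly 0 at both ends

# Backprop multiplies local gradients, so a chain of saturated units shrinks the signal
chained = 1.0 * np.prod(sigmoid_grad(np.array([10.0, 10.0, 10.0])))
```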

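Sigmoid outputs are not zero-centered
- If the inputs to a neuron are always positive, then during backpropagation the gradients on the weights ${w}$ are either all positive or all negative
- Gradient updates in this state tend to zig-zag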
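
For a neuron computing ${f = w^T x + b}$, the gradient on ${w}$ is the upstream gradient times ${x}$, so the sign pattern follows directly from the inputs. A small sketch (the helper `weight_grad` and the numbers are assumptions for this example):

```python
import numpy as np

def weight_grad(upstream_grad, x):
    """Gradient of the loss w.r.t. w for f = w.x + b, given the scalar dL/df."""
    return upstream_grad * x

x_positive = np.array([0.2, 0.9, 0.5])   # e.g. sigmoid outputs of a previous layer
x_centered = np.array([-0.3, 0.4, -0.1])  # zero-centered inputs for comparison

grad_pos = weight_grad(+1.5, x_positive)    # every component positive
grad_neg = weight_grad(-0.7, x_positive)    # every component negative
grad_mixed = weight_grad(+1.5, x_centered)  # signs can differ per weight
```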

tanh

[Figure: the tanh non-linearity (from CS231n)]

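The tanh non-linearity is shown in the figure above
- It takes a real number and "squashes" it into the range between -1 and 1

Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered.
For this reason, tanh is in practice always preferred to sigmoid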

Note that the tanh neuron is simply a scaled sigmoid neuron
- ${\tanh(x) = 2 \sigma(2x) - 1}$
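
The identity ${\tanh(x) = 2 \sigma(2x) - 1}$ can be checked numerically; the grid of test points is an arbitrary choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5, 5, 101)
lhs = np.tanh(xs)
rhs = 2.0 * sigmoid(2.0 * xs) - 1.0
max_abs_diff = np.max(np.abs(lhs - rhs))  # should be at floating-point noise level
```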

ReLU

[Figure: the ReLU activation function (from CS231n)]

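The Rectified Linear Unit (ReLU) computes ${f(x) = \max(0, x)}$
- The activation is simply thresholded at zero
- It has the following advantages and drawbacks

Advantages
- Compared with the saturating sigmoid and tanh, it speeds up the convergence of SGD
- Compared with sigmoid and tanh, which involve expensive operations such as exponentials, it is cheap to compute
Drawbacks
- ReLU units can be fragile during training and can "die"
  - For example, a large gradient flowing through a ReLU neuron can update the weights in such a way that the neuron never activates on any datapoint again; from then on the gradient through that unit is always zero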
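
The "dying" behaviour can be reproduced with a toy example: after one oversized update, the pre-activation is negative for every datapoint, so both the output and the gradient on the weights are zero. The dataset, the initial weights and the size of the update are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 1.0, size=(200, 3))  # stand-in dataset with positive inputs

def relu_forward(X, w, b):
    return np.maximum(0, X @ w + b)

w = np.array([0.5, 0.5, 0.5])
b = 0.0
active_before = np.mean(relu_forward(X, w, b) > 0)  # fraction of datapoints that activate

# One oversized update (hypothetical large gradient times learning rate)
w = w - np.array([5.0, 5.0, 5.0])
b = b - 1.0
active_after = np.mean(relu_forward(X, w, b) > 0)

# ReLU's local gradient is 1 where the pre-activation is positive and 0 elsewhere
local_grad = (X @ w + b > 0).astype(float)
grad_w = (local_grad[:, None] * X).sum(axis=0)  # upstream gradient of 1 per example
```

Since `grad_w` is exactly zero, no further gradient step can move the weights back, which is why the unit stays dead.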

Leaky ReLU

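Leaky ReLU is one attempt to fix the "dying ReLU" problem.
Instead of being zero for negative inputs, it has a small slope there: $$ f(x) = \left\{ \begin{array}{ll} \alpha x & ( x < 0) \\ x & ( x \geq 0) \end{array} \right. $$
${\alpha}$ is a small constant.
Some people report success with this function, but the results are not always consistent.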
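
A minimal sketch of Leaky ReLU and its local gradient. The default `alpha=0.01` is a commonly used value but is an assumption here, as are the function names.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """alpha * x for x < 0, x otherwise."""
    return np.where(x < 0, alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    """Local gradient: alpha for x < 0, 1 otherwise (never exactly zero)."""
    return np.where(x < 0, alpha, 1.0)

xs = np.array([-3.0, -0.5, 0.0, 2.0])
ys = leaky_relu(xs)
grads = leaky_relu_grad(xs)
```

Because the gradient is `alpha` rather than 0 on the negative side, a unit that has been pushed there can still receive updates.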

Maxout

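A Maxout neuron computes ${\max(w_1^T x + b_1, w_2^T x + b_2)}$
- ReLU and Leaky ReLU are special cases of Maxout (for ReLU, ${w_1, b_1 = 0}$), so Maxout inherits the benefits of ReLU
- It also does not suffer from the dying ReLU problem
- However, it doubles the number of parameters per neuron, which increases the total parameter count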
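
The special-case relationship can be checked directly: a Maxout neuron ${\max(w_1^T x + b_1, w_2^T x + b_2)}$ with ${w_1 = 0, b_1 = 0}$ gives the same output as a ReLU neuron. The function names and the random test inputs are assumptions for this sketch.

```python
import numpy as np

def maxout(x, w1, b1, w2, b2):
    """Maxout neuron with two linear pieces."""
    return np.maximum(np.dot(w1, x) + b1, np.dot(w2, x) + b2)

def relu_neuron(x, w, b):
    return np.maximum(0, np.dot(w, x) + b)

rng = np.random.default_rng(0)
w, b = rng.standard_normal(4), 0.3
zeros = np.zeros(4)
xs = rng.standard_normal((5, 4))

maxout_out = np.array([maxout(x, zeros, 0.0, w, b) for x in xs])
relu_out = np.array([relu_neuron(x, w, b) for x in xs])

# Parameter count per neuron for a 4-dimensional input
relu_params = w.size + 1           # 5
maxout_params = 2 * (w.size + 1)   # 10
```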

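This concludes the discussion of activation functions.
It is rare to mix different activation functions in the same network, although nothing fundamentally prevents it.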

Neural Network architectures

Layer-wise organization

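Characteristics of neural networks (see the figure below)
- Neurons are grouped into layers, and a collection of such layers forms a neural network
- The outputs of some neurons become the inputs of other neurons
- The connections form an acyclic graph; cycles are not allowed
- The most common layer type is the fully-connected layer
  - Neurons in two adjacent layers are fully pairwise connected
- Neurons within a single layer share no connections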

[Figure: a 2-layer neural network (left) and a 3-layer neural network (right) (from CS231n)]

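Naming conventions

When we say "N-layer NN", the input layer is not counted
- A 1-layer NN therefore has no hidden layer (the input is mapped directly to the output)
- Logistic regression and SVMs are special cases of a 1-layer NN

The terms Artificial Neural Network (ANN) and Multi-Layer Perceptron (MLP) are used interchangeably

The output layer usually has no activation function
- This is because its outputs represent, for example, the class scores

The size of an NN is usually discussed in terms of the number of neurons or the number of parameters
- Modern CNNs have on the order of 100 million parameters and roughly 10 to 20 layers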
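
Both size measures are easy to compute for a fully-connected network. In the sketch below the helper names and the small example architectures are assumptions; biases are included in the parameter count, and the input layer is not counted as neurons.

```python
def count_neurons(layer_sizes):
    """Number of neurons, not counting the input layer."""
    return sum(layer_sizes[1:])

def count_parameters(layer_sizes):
    """Weights between adjacent layers plus one bias per non-input neuron."""
    weights = sum(n_in * n_out
                  for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases

small = [3, 4, 2]             # 2-layer net: 3 inputs, 4 hidden, 2 outputs
larger = [3, 4, 4, 1]         # 3-layer net: 3 inputs, two hidden layers of 4, 1 output
cifar_2layer = [3072, 100, 10]  # 2-layer net with a 100-unit hidden layer on CIFAR-10 input

small_params = count_parameters(small)          # 12 + 8 weights, 6 biases
larger_params = count_parameters(larger)        # 12 + 16 + 4 weights, 9 biases
cifar_params = count_parameters(cifar_2layer)   # 307200 + 1000 weights, 110 biases
```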

Example feed-forward computation

One benefit of organizing neurons into layers is computational efficiency: evaluating the network reduces to matrix-vector products

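import numpy as np

# forward-pass of a 3-layer neural network:
f = lambda x: 1.0/(1.0 + np.exp(-x)) # activation function (use sigmoid)
# random placeholder parameters (an assumption here; in practice these are learned)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)
x = np.random.randn(3, 1) # random input vector of three numbers (3x1)
h1 = f(np.dot(W1, x) + b1) # calculate first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2) # calculate second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3 # output neuron (1x1)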

Representational power

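An NN with fully-connected layers can be viewed as a family of functions parameterized by the weights of the network
This raises the following questions
- "What is the representational power of this family of functions?"
- "Are there functions that cannot be modeled with an NN?"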

An NN with at least one hidden layer is a universal approximator
- References
  - Approximation by Superpositions of a Sigmoidal Function
  - intuitive explanation
- In other words, an NN can approximate any continuous function to arbitrary precision
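
One common intuition for this result is that two steep sigmoid hidden units with output weights +1 and -1 form a "bump" that is close to 1 on an interval and close to 0 elsewhere; many such bumps, scaled and added, can then trace out a continuous function. The sketch below builds a single bump. It is an illustration of the idea under these assumptions, not a proof, and the name `bump` and the steepness value are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, left, right, steepness=200.0):
    """Output of a tiny 1-hidden-layer net with two sigmoid units:
    sigma(k * (x - left)) - sigma(k * (x - right))."""
    return sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right))

xs = np.array([0.0, 0.3, 0.5, 0.7, 1.0])
ys = bump(xs, left=0.4, right=0.6)  # close to 1 inside [0.4, 0.6], close to 0 outside
```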

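If a single hidden layer suffices to approximate any function, why use more layers?
- The universal approximation property is mathematically true but a weak statement in practice
- What matters is that the function is easy to learn with our optimization algorithms
- Empirically, deeper networks often work better than single-hidden-layer networks even though their representational power is the same.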

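In practice, 3-layer NNs often outperform 2-layer NNs, but going even deeper rarely helps much more.
- To study this topic further, see the following references
  - Deep Learning book, Section 6.4
  - Do Deep Nets Really Need to be Deep?
  - FitNets: Hints for Thin Deep Nets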

Setting number of layers and their sizes

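For a practical problem, the following questions arise
- What architecture should the NN have?
- Should we use hidden layers?
  - If so, one or two?
- How large should each layer be?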

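Increasing the size and number of layers increases the capacity of the network
- The neurons can collaborate to express many different functions, so the space of representable functions grows
- For example
  - consider a binary classification problem in two dimensions.
  - Comparing three NNs with hidden layers of different sizes, the decision regions become more intricate as the number of hidden neurons grows (figure below)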

[Figure: decision regions of three NNs with hidden layers of different sizes (from CS231n)]

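Increasing the number of neurons has an upside and a downside
- Upside
  - The network can classify more complicated data
- Downside
  - It overfits the training data more easily
    - that is, it also fits the noise in the data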

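One might conclude that "a smaller NN is preferable when the data is not complex", but this is incorrect
- There are many ways to prevent overfitting in an NN
  - L2 regularization
  - dropout
  - input noise
- Control overfitting with these methods rather than with the number of neurons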

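Smaller NNs are harder to train with local methods such as SGD
- Their loss functions have relatively few local minima, but many of these are easy to converge to
  and are bad (i.e. with high loss)
Larger NNs have many more local minima, but these turn out to be much better in terms of final loss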

The regularization strength is the preferred way to control overfitting
- See the figure below, which compares different settings

[Figure: effect of the regularization strength on the decision regions (from CS231n)]

Summary

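Introduced a very coarse model of a biological neuron

Introduced several types of activation functions
- ReLU is the most common choice in practice

Introduced Neural Networks (NN)
- Neurons are organized in fully-connected layers
- Neurons in adjacent layers are connected, while neurons within a layer are not

The layered architecture is efficient because the NN can be evaluated with matrix-vector products

NNs are universal function approximators

Larger NNs generally work better
- but they need stronger regularization to prevent overfitting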

Parts I did not fully understand

This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue.

(-) Unfortunately, ReLU units can be fragile during training and can “die”. For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be “dead” (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.

The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It’s clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss). Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss.

For further study

yusuke-ujitoko.hatenablog.com

Summary post

yusuke-ujitoko.hatenablog.com
