
Commit d9d3e80

blester125 authored and neubig committed
Fix glorot initialization for convolutional kernels (#1420)
* fix glorot initialization for convs
* Update Docs
* Updated benchmark results
1 parent 4948b5c commit d9d3e80

File tree: 4 files changed, +16 -4 lines changed


dynet/param-init.cc

Lines changed: 11 additions & 2 deletions
@@ -25,8 +25,17 @@ void ParameterInitIdentity::initialize_params(Tensor & values) const {
 
 void ParameterInitGlorot::initialize_params(Tensor & values) const {
   int dims = 0, dim_len = values.d.nd - (lookup ? 1 : 0);
-  for (int i = 0; i < dim_len; ++i) dims += values.d[i];
-  float my_scale = gain * sqrt(3 * dim_len) / sqrt(dims);
+  float my_scale = 0.0;
+  if (dim_len == 4) {
+    // For a convolution the parameter tensor is (H, W, In, Out).
+    int receptive_field = values.d[0] * values.d[1];
+    // Other frameworks compute m + n by multiplying each by the kernel size.
+    dims = values.d[2] * receptive_field + values.d[3] * receptive_field;
+    my_scale = gain * sqrt(6) / sqrt(dims);
+  } else {
+    for (int i = 0; i < dim_len; ++i) dims += values.d[i];
+    my_scale = gain * sqrt(3 * dim_len) / sqrt(dims);
+  }
   TensorTools::randomize_uniform(values, -my_scale, my_scale);
 }
 
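
For reference, a small Python sketch of the new scale computation (glorot_scale is a hypothetical stand-alone helper, not part of DyNet); it mirrors the two branches of initialize_params above:

import math

def glorot_scale(shape, gain=1.0, lookup=False):
    """Half-width of the uniform Glorot sampling range for a tensor shape."""
    dim_len = len(shape) - (1 if lookup else 0)
    if dim_len == 4:
        # Convolutional kernel laid out as (H, W, In, Out): both fan-in and
        # fan-out are scaled by the receptive field H * W.
        receptive_field = shape[0] * shape[1]
        dims = shape[2] * receptive_field + shape[3] * receptive_field
        return gain * math.sqrt(6) / math.sqrt(dims)
    # Generic case: sum the dimensions, as before this commit.
    dims = sum(shape[:dim_len])
    return gain * math.sqrt(3 * dim_len) / math.sqrt(dims)

# A 3x3 kernel with 16 input and 32 output channels:
# dims = 16*9 + 32*9 = 432, so the bound is sqrt(6/432) ~= 0.118.
print(glorot_scale((3, 3, 16, 32)))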

dynet/param-init.h

Lines changed: 1 addition & 0 deletions
@@ -113,6 +113,7 @@ struct ParameterInitIdentity : public ParameterInit {
  * \ingroup params
  * \brief Initialize with the methods described in [Glorot, 2010](http://www.jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf?hc_location=ufi)
  * \details In order to preserve the variance of the forward and backward flow across layers, the parameters \f$\theta\f$ are initialized such that \f$\mathrm{Var}(\theta)=\frac 2 {n_1+n_2}\f$ where \f$n_1,n_2\f$ are the input and output dim.
+ * \details In the case of 4d tensors (common in convolutional networks) of shape \f$XH,XW,XC,N\f$ the weights are sampled from \f$\mathcal U([-g\sqrt{\frac 6 {d}},g\sqrt{\frac 6 {d}}])\f$ where \f$d = XC * (XH * XW) + N * (XH * XW)\f$
  * Important note : The underlying distribution is uniform (not gaussian)
  *
  * *Note:* This is also known as **Xavier initialization**
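
A worked instance of the 4d formula, with illustrative sizes not taken from this commit: a \f$3 \times 3\f$ kernel with 16 input maps and 32 output maps gives

d = 16 \cdot (3 \cdot 3) + 32 \cdot (3 \cdot 3) = 432, \qquad g\sqrt{\tfrac{6}{d}} = \sqrt{\tfrac{6}{432}} \approx 0.118 \quad (g = 1),

so the weights are drawn uniformly from roughly \f$[-0.118, 0.118]\f$.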

examples/mnist/basic-mnist-benchmarks/README.md

Lines changed: 2 additions & 2 deletions
@@ -67,5 +67,5 @@ Batch size: 64, learning rate: 0.01.
 | OS | Device | Framework | Speed | Accuracy (After 20 Epochs)|
 | --- | --- | --- | --- | --- |
 | Ubuntu 16.04 | GeForce GTX 1080 Ti | PyTorch | ~ 4.49±0.11 s per epoch | 98.95% |
-| Ubuntu 16.04 | GeForce GTX 1080 Ti | DyNet (autobatch) | ~ 8.58±0.09 s per epoch | 99.14% |
-| Ubuntu 16.04 | GeForce GTX 1080 Ti | DyNet (minibatch) | ~ 4.13±0.13 s per epoch | 99.16% |
+| Ubuntu 16.04 | GeForce GTX 1080 Ti | DyNet (autobatch) | ~ 8.58±0.09 s per epoch | 98.98% |
+| Ubuntu 16.04 | GeForce GTX 1080 Ti | DyNet (minibatch) | ~ 4.13±0.13 s per epoch | 98.99% |

python/_dynet.pyx

Lines changed: 2 additions & 0 deletions
@@ -519,6 +519,8 @@ cdef class GlorotInitializer(PyInitializer):
 
     If the dimensions of the parameter matrix are :math:`m,n`, the weights are sampled from :math:`\mathcal U([-g\sqrt{\\frac{6}{m+n}},g\sqrt{\\frac{6}{m+n}}])`
 
+    In the case of 4d tensors (common in convolutional networks) of shape :math:`XH,XW,XC,N` the weights are sampled from :math:`\mathcal U([-g\sqrt{\\frac{6}{d}},g\sqrt{\\frac{6}{d}}])` where :math:`d = XC * (XH * XW) + N * (XH * XW)`
+
     The gain :math:`g` depends on the activation function :
 
     * :math:`\\text{tanh}` : 1.0
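
A usage sketch against the Python bindings (assuming DyNet is installed as the dynet package; the 4d shape below is an illustrative conv kernel, not taken from this commit):

import dynet as dy

m = dy.ParameterCollection()
# (H, W, In, Out) conv kernel: GlorotInitializer now folds the 3x3
# receptive field into the fan-in + fan-out sum when picking the range.
W = m.add_parameters((3, 3, 16, 32), init=dy.GlorotInitializer(gain=1.0))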
