Fast “dropout” training
Sida Wang
Department of Computer Science
Stanford University
Stanford, CA 94305
[email protected]

Christopher D. Manning
Department of Computer Science
Stanford University
Stanford, CA 94305
[email protected]
Abstract

Recently, improved classification and regression performance has been achieved by preventing feature co-adaptation, or equivalently, by encouraging independent contributions from different features. In particular, the method proposed in [1], informally called “dropout”, did this by randomly dropping out (zeroing) hidden units and input features for training neural networks. However, sampling a random subset of input features during training makes training much slower. We first look at the implied objective function of dropout training. Then, instead of doing a Monte Carlo optimization of this objective as in [1], we show how to optimize it more directly by integrating under a Gaussian approximation justified by the central limit theorem and empirical evidence, giving a 2-30 times speedup and more stability. We describe how to do fast dropout training in multilayer neural networks with several popular types of hidden units and output units, for both classification and regression. Finally, we empirically compare the performance of this method to previously published results and to baselines.
1 Introduction
Recent work [1] has shown that encouraging independent contributions from each input feature, or equivalently, preventing feature co-adaptation, is a promising method for regularization. This can be considered an alternative approach to regularization in addition to the widely used parameter shrinkage methods and Bayesian averaging. In [1], neural networks are trained while randomly dropping out (zeroing) hidden units and input dimensions. This process discourages the use of a feature that is only helpful when other specific features are present. Similar observations exist in the literature: Naive Bayes, by completely ignoring co-adaptation, performs better than discriminative methods when there is little data [2], and continues to perform better on certain relatively large datasets [3]. Furthermore, there are several works of a similar spirit. 1) In [4], it is observed that training involves trade-offs among weights, where the presence of highly indicative features can cause other useful but weaker features to undertrain. They proposed feature bagging: training different models on subsets of features that are later combined. 2) In [3], each input feature has an L2 regularization strength that depends on how informative the particular feature is by itself, so individually uninformative features are strongly regularized, preventing co-adaptation.

While the effectiveness of these methods is demonstrated, actually sampling, or training multiple models, makes training much slower. For example, with a dropout rate of p = 0.5, the proportion of data still not seen after n epochs is p^n (i.e., around 5 epochs are needed to see 95% of the data). If the data is not highly redundant, and if the relevant data is only partially observable at random, then the task also becomes harder, and the training efficiency may be reduced further. In this paper, we look at how to achieve the benefit of “dropout” training without actually sampling, thereby using all the data efficiently. We do this by a Gaussian approximation that is justified by the central limit theorem
and empirical evidence. We begin with logistic regression for simplicity, and then extend the idea to neural networks.
2 The implied objective function, and its fast approximations
To avoid unnecessarily cumbersome notation, we illustrate the main idea with binary logistic regression (LR) with a single training vector x and label y ∈ {0, 1}.

2.1 The implied objective function for dropout training
To train LR with dropout on data with dimension m, first sample z_i ∼ Bernoulli(p_i) for i = 1...m. Here p_i is the probability of not dropping out input x_i. For neural networks, x is the activation of the previous layer; for LR, x is the data. Typically, p_bias = 1 and p = 0.1–0.9 for everything else. After sampling z = {z_i}_{i=1...m} we can compute the stochastic gradient descent (sgd) update as follows:

∆w = (y − σ(w^T D_z x)) D_z x    (1)

where D_z = diag(z) ∈ R^{m×m}, and σ(x) = 1/(1 + e^{−x}) is the logistic function. This update rule, applied over the training data over multiple epochs, can be seen as a Monte Carlo approximation to the following gradient:

∆w̄ = E_{z; z_i∼Bernoulli(p_i)}[(y − σ(w^T D_z x)) D_z x]    (2)

The objective function with the above gradient is the expected conditional log-likelihood of the label given the data with dropped-out dimensions indicated by z, for y ∼ Bernoulli(σ(w^T D_z x)). This is the implied objective function for dropout training:

L(w) = E_z[log p(y|D_z x; w)] = E_z[y log(σ(w^T D_z x)) + (1 − y) log(1 − σ(w^T D_z x))]    (3)
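For concreteness, the sampled update (1) takes only a few lines of NumPy. This is a minimal sketch, not the paper's reference implementation: the shared keep probability p, the fixed step size lr, and the function names are illustrative choices.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mc_dropout_sgd_step(w, x, y, p=0.5, lr=0.1, rng=np.random.default_rng(0)):
    """One Monte Carlo dropout update for binary LR, following eq. (1).

    w, x : weight and feature vectors of dimension m
    y    : label in {0, 1}
    p    : probability of keeping (not dropping) each input dimension
    """
    z = rng.binomial(1, p, size=x.shape)   # z_i ~ Bernoulli(p)
    xz = z * x                             # D_z x
    grad = (y - sigmoid(w @ xz)) * xz      # gradient of the log-likelihood
    return w + lr * grad                   # ascent step on the log-likelihood
```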
Since we are just taking an expectation, the objective is still concave provided that the original log-likelihood is concave, which it is for logistic regression. Evaluating the expectation in (2) naively by summing over all possible z has complexity O(2^m m). We are not sure how much better one can do in general when w_i x_i can be anything. One possibility is to evaluate the first K moments of Σ_{i=1}^m w_i x_i z_i, which can be computed quickly by considering the moment generating function, and speeding up the product using the Fast Fourier Transform if necessary. Given the moments, the Taylor series expansion of the function σ can then be used to compute the expectation of σ. A direct application does not work because σ is not very polynomial-like, and its radius of convergence increases too slowly for the range we need. Rather than directly computing the expectation with respect to z, we propose a variable transformation that allows us to approximately compute the expectation with respect to a simple random variable Y ∈ R, instead of z ∈ R^m. In the following section, we describe an efficient O(m) approximation that is accurate for machine learning applications where the w_i x_i usually come from a unimodal or bounded distribution.
2.2 Faster approximations to the dropout objective
We make the observation that evaluating the objective function L(w) involves taking the expectation with respect to the variable Y(z) = w^T D_z x = Σ_{i=1}^m w_i x_i z_i, a weighted sum of Bernoulli random variables. For most machine learning problems, {w_i} typically form a unimodal distribution centered at 0, and {x_i} is either unimodal or in a fixed interval. In this case, Y can be well approximated by a normal distribution even for relatively low dimensional data with m = 10. More technically, the Lyapunov condition is generally satisfied for a weighted sum of Bernoulli random variables of the form Y that are weighted by real data [5]. Then, Lyapunov’s central limit theorem states that Y(z) tends to a normal distribution as m → ∞. We empirically verify that the approximation is good for typical datasets of moderate dimensions, except when a couple of dimensions dominate all others. Finally, let

S = E_z[Y(z)] + √(Var_z[Y(z)]) ε = µ_S + σ_S ε    (4)

be the approximating Gaussian, where ε ∼ N(0, 1), E_z[Y(z)] = Σ_i p_i w_i x_i, and Var_z[Y(z)] = Σ_i p_i(1 − p_i)(w_i x_i)².
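The quantities in (4), and the resulting cheap estimate of the objective (3), can be computed as in the sketch below. The clipping constant and function names are our own illustrative choices; a shared keep probability is assumed for brevity.

```python
import numpy as np

def gaussian_params(w, x, p):
    """Mean and std of Y(z) = sum_i w_i x_i z_i with z_i ~ Bernoulli(p_i), eq. (4)."""
    mu = np.sum(p * w * x)
    var = np.sum(p * (1 - p) * (w * x) ** 2)
    return mu, np.sqrt(var)

def approx_objective(w, x, y, p, n_samples=200, rng=np.random.default_rng(0)):
    """Estimate the dropout log-likelihood (3) with samples of S instead of z."""
    mu, sigma = gaussian_params(w, x, p)
    s = mu + sigma * rng.standard_normal(n_samples)   # O(1) per sample vs O(m) for z
    q = np.clip(1.0 / (1.0 + np.exp(-s)), 1e-12, 1 - 1e-12)
    return np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))
```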
In the following subsections, based on the Gaussian assumption above, we present several approximations at different tradeoff points between speed and accuracy. In the end, we present experimental results showing that there is little to no performance loss when using the faster, less accurate approximations.
2.2.1 Sampling from the Gaussian
Given good convergence, we note that drawing samples of the approximating Gaussian S of Y(z), a constant time operation, is much cheaper than drawing samples directly of Y(z), which takes O(m). This effect is very significant for high dimensional datasets. So without doing much, we can already approximate the objective function (3) m times faster by sampling from S instead of Y(z). Empirically, this approximation is within the variance of the direct MC approximation of (3) obtained by taking 200 samples of z. Approximating the gradient introduces a complication while using samples from the Gaussian: the gradient (2) involves not only Y(z) → S, but also D_z x directly:

∇L(w) = E_z[(y − σ(Y(z))) D_z x]    (5)
Let f(Y) = f(Y(z)) = y − σ(Y(z)) and let g(z) = D_z x ∈ R^m. Naively approximating E_z[f(Y(z))g(z)] by either E_S[f(S)] E_z[g(z)], or worse, by f(E_S[S]) E_z[g(z)], works poorly in terms of both approximation error and final performance. Note g(z) is a linear function and therefore E_z[g(z)] = g(E_z[z]) = diag(p)x. A good way to approximate (5) is by analytically taking the expectation with respect to z_i and then using a linear approximation to the conditional expectation. More precisely, consider dimension i of the gradient:

∂L(w)/∂w_i = E_z[f(Y(z)) x_i z_i]
           = Σ_{z_i∈{0,1}} p(z_i) z_i x_i E_{z_{−i}|z_i}[f(Y(z))]
           = p(z_i = 1) x_i E_{z_{−i}|z_i=1}[f(Y(z))]
           ≈ p_i x_i ( E_{S∼N(µ_S,σ_S²)}[f(S)] + ∆µ_i ∂E_{T∼N(µ,σ_S²)}[f(T)]/∂µ |_{µ=µ_S} + ∆σ_i² ∂E_{T∼N(µ_S,σ²)}[f(T)]/∂σ² |_{σ²=σ_S²} )
           = p_i x_i ( α(µ_S, σ_S²) + ∆µ_i β(µ_S, σ_S²) + ∆σ_i² γ(µ_S, σ_S²) )    (6)
where z_{−i} is the collection of all other z’s except z_i, µ_S, σ_S are defined in (4), and ∆µ_i = (1 − p_i)x_i w_i, ∆σ_i² = −p_i(1 − p_i)x_i² w_i² are the changes in µ_S, σ_S² due to conditioning on z_i. Note that the partial derivatives, as well as E_{S∼N(µ_S,σ_S²)}[f(S)], only need to be computed once per training case, since they are independent of i. α, β, γ can be computed by drawing K samples from S, which takes time O(K). See A.2 for details. In practice, using only β approximates the derivative to within the variance of successive MC computations of the objective L (see figure 2). In our experiments, this is 2-30 times faster compared to sampling (see figure 3 and table 2). While MC dropout becomes inefficient quickly with a very high dropout rate (> 0.8), the Gaussian approximation is fairly insensitive because it still gets all the information from every data point (see figure 3). At a slightly higher loss in accuracy, we can get rid of z completely by reparameterizing the problem in µ_S and σ_S and taking derivatives with respect to them. The objective function (3) then becomes

L(w) ≈ E_{Y∼N(µ_S,σ_S²)}[y log(σ(Y)) + (1 − y) log(1 − σ(Y))]    (7)
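Putting (6) together with sample-based estimates of α, β, γ (their integral forms are in A.2) gives roughly the following sketch. As noted above, using only the β term is often sufficient; γ is included here only for completeness, and the function name and sample count are illustrative.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fast_dropout_grad(w, x, y, p, n_samples=100, rng=np.random.default_rng(0)):
    """Approximate gradient (6) using K Gaussian samples shared by all dimensions."""
    wx = w * x
    mu = np.sum(p * wx)                      # mu_S
    var = np.sum(p * (1 - p) * wx ** 2)      # sigma_S^2 (assumed > 0)
    t = mu + np.sqrt(var) * rng.standard_normal(n_samples)
    sig = sigmoid(t)

    # alpha = y - E[sigmoid(T)]; beta, gamma are its derivatives w.r.t. mu and
    # sigma^2, estimated with the same samples via the score-function forms of A.2.
    alpha = y - np.mean(sig)
    beta = -np.mean(sig * (t - mu)) / var
    gamma = -np.mean(sig * ((t - mu) ** 2 / (2 * var ** 2) - 1.0 / (2 * var)))

    # changes in mu_S and sigma_S^2 from conditioning on z_i = 1
    dmu = (1 - p) * wx
    dvar = -p * (1 - p) * wx ** 2
    return p * x * (alpha + dmu * beta + dvar * gamma)
```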
2.2.2 Deterministic approximations
We can further improve the performance by using fast deterministic approximations. With a potentially much faster deterministic approximation, randomness harmful to line search is removed. While not faster, we can deterministically “sample” from S. There is no analog of doing this with the MC dropout method (it would be silly to permanently corrupt predetermined data dimensions). In addition to taking the expectation as mentioned in 2.2.1, α, β, γ are functions of µ_S, σ_S² ∈ R, and can be computed by 1-d numerical integration as defined in (18), (19), and (20). Simpson’s method on [µ − 4σ, µ + 4σ] works quite well. One can also integrate the logit-normal distribution parameterized by µ and σ on [0, 1], but that is less stable numerically due to potentially huge masses at either end [6]. Doing numerical integration online takes as long as using 200 samples. Because we do not need that much accuracy in practice, numerical integration is not recommended. For even faster speed, one can also tabulate α, β, γ instead of computing them as needed; the function is smoother if parameterized by µ/σ and σ instead. However, numerical integration and a lookup table fail to scale to multinomial LR. In that case, we can use the Unscented Transform (UST) instead of sampling [7]. The rest of the procedure remains unchanged from 2.2.1.
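A minimal sketch of the deterministic route for α, assuming SciPy’s composite Simpson rule is available; β and γ can be handled the same way with the integrands of (19) and (20). The number of grid points is an illustrative choice.

```python
import numpy as np
from scipy.integrate import simpson  # composite Simpson's rule

def alpha_deterministic(y, mu, sigma, n_points=201):
    """alpha(mu, sigma^2) = y - E_{T~N(mu, sigma^2)}[sigmoid(T)], computed by 1-d
    numerical integration on [mu - 4*sigma, mu + 4*sigma] as suggested in 2.2.2."""
    x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, n_points)
    density = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    integrand = density / (1.0 + np.exp(-x))
    return y - simpson(integrand, x=x)
```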
2.2.3 A closed-form approximation
Interestingly, a fairly accurate closed-form approximation is also possible by using the Gaussian cumulative distribution function Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt to approximate the logistic function. It can be shown, by parameter differentiation with respect to s and then integrating with respect to s, that

∫_{−∞}^{∞} Φ(λx) N(x|µ, s²) dx = Φ( µ / √(λ^{−2} + s²) )    (8)

Substituting in σ(x) ≈ Φ(√(π/8) x), we get

∫_{−∞}^{∞} σ(x) N(x|µ, s²) dx ≈ σ( µ / √(1 + πs²/8) )    (9)

This is an approximation that is used for Bayesian prediction when the posterior is approximated by a Gaussian [8]. As we now have a closed-form approximation of α, one can also obtain expressions for β and γ by differentiating. Furthermore, we can even approximate the objective function (7) in a closed form that is easily differentiable:

E_{Y∼N(µ,s²)}[log(σ(Y))] = ∫_{−∞}^{∞} log(σ(x)) N(x|µ, s²) dx    (10)
                         ≈ √(1 + πs²/8) log σ( µ / √(1 + πs²/8) )    (11)

The actual objective as defined in (3) can be obtained from the above by observing that 1 − σ(x) = σ(−x). The gradient or Hessian with respect to w can be found by differentiating analytically.
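The closed-form quantities (9) and (11) amount to a few lines of code; the sketch below assembles the full objective using 1 − σ(x) = σ(−x) as described above. Function names are illustrative.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def expected_sigmoid(mu, s2):
    """E_{X~N(mu, s^2)}[sigmoid(X)] ~= sigmoid(mu / sqrt(1 + pi*s^2/8)), eq. (9)."""
    return sigmoid(mu / np.sqrt(1.0 + np.pi * s2 / 8.0))

def expected_log_sigmoid(mu, s2):
    """E_{X~N(mu, s^2)}[log sigmoid(X)] ~= sqrt(1 + pi*s^2/8) * log sigmoid(...), eq. (11)."""
    k = np.sqrt(1.0 + np.pi * s2 / 8.0)
    return k * np.log(sigmoid(mu / k))

def closed_form_objective(mu, s2, y):
    """Closed-form approximation of objective (3), using 1 - sigmoid(x) = sigmoid(-x)."""
    return y * expected_log_sigmoid(mu, s2) + (1 - y) * expected_log_sigmoid(-mu, s2)
```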
3 Fast dropout for neural networks
Dropout training, as originally proposed, was intended for neural networks where hidden units are dropped out, instead of the data. Fast dropout is directly applicable to dropping out the final hidden layer of neural networks. In this section, we approximately extend our technique to deep neural networks and show how to do it for several popular types of hidden units and output units.

3.1 The hidden layers
Under dropout training, each hidden unit takes a random variable as input and produces a random variable as output. Because of the CLT, we may approximate the inputs as Gaussians and characterize the outputs by their means and variances. An additional complication is that the inputs to hidden units have a covariance, which is close to diagonal in practice, as shown in figure 1. Considering any hidden unit in dropout training, we may approximate its input as a Gaussian variable X ∼ N(x|µ, s²), which leads to its output mean and variance ν, τ². For the commonly used sigmoid unit,

ν = ∫_{−∞}^{∞} σ(x) N(x|µ, s²) dx ≈ σ( µ / √(1 + πs²/8) )    (12)
This integral can be evaluated exactly for the rectified linear unit f(x) = max(0, x). Let r = µ/s; then

ν = ∫_{−∞}^{∞} f(x) N(x|µ, s²) dx = Φ(r)µ + s N(r|0, 1)    (13)
With dropout training, each hidden unit also has an output variance, which can be approximated fairly well (see A.4).
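For reference, the output means (12) and (13) can be computed as follows. This sketch assumes SciPy for the Gaussian cdf and pdf; the output variance τ² would come from the approximation in A.4.

```python
import numpy as np
from scipy.stats import norm

def sigmoid_mean(mu, s2):
    """Output mean of a sigmoid unit with Gaussian input N(mu, s^2), eq. (12)."""
    return 1.0 / (1.0 + np.exp(-mu / np.sqrt(1.0 + np.pi * s2 / 8.0)))

def relu_mean(mu, s2):
    """Output mean of a rectified linear unit with Gaussian input, eq. (13):
    E[max(0, X)] = Phi(r)*mu + s*N(r|0,1) with r = mu/s."""
    s = np.sqrt(s2)
    r = mu / s
    return norm.cdf(r) * mu + s * norm.pdf(r)
```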
Figure 1: Top: MC dropout covariance matrix of the inputs of 10 random hidden units; Top-left: trained to convergence; Top-right: at random initialization. The covariance is not completely diagonal once trained to convergence. Bottom: empirical input distribution of the input of a hidden unit; Bottom-left: trained to convergence; Bottom-right: at initialization. We lose almost nothing here.

3.2 Training with backpropagation
The resulting neural network can be trained by backpropagation with an additional set of derivatives. In normal backpropagation, one only needs to keep ∂L/∂µ_i for each hidden unit i with input mean µ_i. For approximate dropout training, we need ∂L/∂s_i² as well for the input variance s_i², where µ_i = p Σ_j w_ij ν′_j and s_i² = p(1 − p) Σ_j ν′_j² w_ij² + p Σ_j τ′_j² w_ij², and ν′_j, τ′_j² are the output mean and variance of unit j in the previous layer.
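A sketch of the corresponding forward computation for one layer, under the stated diagonal (independence) approximation; the weight-matrix layout and function name are illustrative choices.

```python
import numpy as np

def fast_dropout_layer_input(W, nu_prev, tau2_prev, p):
    """Input mean mu_i and variance s_i^2 of each hidden unit given the previous
    layer's output means nu_prev and variances tau2_prev (section 3.2).

    W : (n_out, n_in) weight matrix; p : keep probability of the previous layer.
    """
    mu = p * W @ nu_prev
    s2 = p * (1 - p) * (W ** 2) @ (nu_prev ** 2) + p * (W ** 2) @ tau2_prev
    return mu, s2
```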
3.3 The output layer
We still need to define the cost function L, which is task dependent. We outline how to do approximate dropout for the final layer for one-vs-rest logistic units, linear units under squared loss, and softmax units, each under its representative cost function. Logistic units with the cross-entropy loss function can be well approximated using the following:

E_{X∼N(µ,s²)}[log(σ(X))] = ∫_{−∞}^{∞} log(σ(x)) N(x|µ, s²) dx    (14)
                         ≈ √(1 + πs²/8) log σ( µ / √(1 + πs²/8) )    (15)

Linear units with squared error loss can be computed exactly:

E_{X∼N(µ,s²)}[(X − t)²] = ∫_{−∞}^{∞} (x − t)² N(x|µ, s²) dx    (16)
                        = s² + (µ − t)²    (17)

Since s² = Σ_j α w_j² x_j², this is L2 regularization.
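The linear output case reduces to a one-line expectation; the sketch below also spells out µ and s² for the dropout input, so the implicit L2 term is visible. A shared keep probability p is assumed for brevity.

```python
import numpy as np

def expected_squared_error(w, x, t, p):
    """Expected squared error of a linear output unit under dropout, eqs. (16)-(17):
    E[(X - t)^2] = s^2 + (mu - t)^2 with mu = sum_j p*w_j*x_j and
    s^2 = sum_j p*(1-p)*(w_j*x_j)^2; the s^2 term behaves like an L2 penalty."""
    wx = w * x
    mu = np.sum(p * wx)
    s2 = np.sum(p * (1 - p) * wx ** 2)
    return s2 + (mu - t) ** 2
```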
Unfortunately, the best way to compute the cross-entropy loss for softmax seems to be sampling from the input Gaussian directly (see A.3; a small sketch is given at the end of this subsection). Sampling can be applied to any output units with any loss function. The unscented transform can be used in place of random sampling, trading accuracy for faster speed. Unlike the best practice in real dropout, the test time procedure for approximate dropout is exactly the same as the training time procedure, with no weight halving needed. One shortcoming is that the training implementation does become more complicated, mainly in the backpropagation stage, while the network function is straightforward. In the case when we are not concerned about training time, we can still apply deterministic dropout at test time, instead of using the weight halving heuristic that does not exactly match the training objective being optimized. This alone can give a noticeable improvement in test performance without much extra work.
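A sketch of the sampled softmax loss mentioned above, assuming independent (diagonal) input Gaussians, consistent with figure 1; the sample count and function name are illustrative.

```python
import numpy as np

def softmax_xent_by_sampling(mu, s2, label, n_samples=100, rng=np.random.default_rng(0)):
    """Approximate E[-log softmax(X)_label] with X_k ~ N(mu_k, s2_k) independently,
    by sampling the input Gaussians directly (section 3.3 / A.3)."""
    x = mu + np.sqrt(s2) * rng.standard_normal((n_samples, mu.shape[0]))
    x = x - x.max(axis=1, keepdims=True)                      # numerical stability
    log_probs = x - np.log(np.exp(x).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[:, label])
```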
4 Relation to Bayesian model selection
Once we make the Gaussian approximation, there is an alternative interpretation of where the variance comes from. In the dropout framework, the variance comes from the dropout variable z. Under the alternative interpretation where w is a random variable, we can view dropout training as maximizing a lower bound on the Bayesian marginal likelihood in a class of models M_µ indexed by µ ∈ R^m. Concretely, consider random variables w_i ∼ N(µ_i, αµ_i²); then the dropout objective

L(w) = E_{z; z_i∼Bernoulli(p_i)}[log p(y|w^T D_z x)]
     ≈ E_{Y∼N(E[w^T D_z x], Var[w^T D_z x])}[log p(y|Y)]
     = E_{v: v_i∼N(µ_i, αµ_i²)}[log p(y|v^T x)]
     ≤ log E_{v: v_i∼N(µ_i, αµ_i²)}[p(y|v^T x)]
     = log(M_µ)

where M_µ = ∫ p(D|v) p(v|µ) dv is the Bayesian evidence, p(v_i|µ_i) = N(v_i|µ_i, αµ_i²), and p(y|v^T x) = σ(v^T x)^y (1 − σ(v^T x))^{1−y} is the logistic model. For dropout training, µ = w/p and α = (1 − p)/p.
Here the variance of v is tied to its magnitude, so a larger weight is only beneficial when it is robust to noise. While α can be determined by the dropout process, we are also free to choose α, and we find empirically that using a slightly larger α than that determined by dropout often performs slightly better.
5 Experiments
Figure 2 shows that the quality of the gradient approximation using Gaussian samples is comparable to the difference between different MC dropout runs with 200 samples. Figures 3 and 4 show that, under identical settings, the Gaussian approximation is much faster than MC dropout and has a very similar training profile. Both the Gaussian approximation and MC dropout training reduce the validation error rate by about 30% over plain LR when trained to convergence, without ever overfitting.

Method:   Plain   Aprx. Drp.   A.D.+Var   Prev.   SotA
#Errors:  182     124          110        150     30
Table 1: Neural network on MNIST: Results on MNIST with 2 hidden layers of 1200 units each. A.D.+Var is approximate dropout plus more artificial variance by increasing α. Prev. is the best previous result without using domain knowledge [1]. SotA is the state of the art result on the test set.

The accuracy and time taken are listed in table 2 for the datasets described in section A.1. The Gaussian approximation is generally around 10 times faster than MC dropout and performs comparably to NBSVM in [3]. Further speedup is possible by using one of the deterministic approximations instead of sampling. While each iteration of the Gaussian approximation is still slower than LR, it sometimes reaches a better validation performance in less time.
Figure 2: Scatterplot of various approximations (y-axis) vs. direct MC dropout: Each point is a random dimension of the gradient, with its x-value computed from MC dropout with 200 samples of z, and its y-value computed by the method in the legend. MC dropout and Gaussian approximation used 200 samples. Naive is the approximation defined after (5), by assuming that f (z) and g(z) are independent. The RMSE for different MC dropout runs is 0.0305, and the RMSE between MC dropout and Gaussian approximation is 0.0308, showing no difference between our approximation and MC dropout training. The green line is the reference y = x.
Figure 3: Validation error vs. time spent in training (left) and number of training iterations (right), trained using batch gradient descent on the 20-newsgroups subtask alt.atheism vs. religion.misc. 100 samples are used. For MC dropout, z_i is sampled only for non-zero x_i, with a dropout rate of 0.5.
For the MPQA dataset, which effectively has m = 4, the Gaussian assumption is unjustifiable, but the derived method works empirically anyway (table 2). We trained on bigram features; surprisingly, even plain LR with good regularization parameters performs better than some previous work. Experiments using a 2-hidden-layer neural network are shown in table 1 and figure 5.
Figure 4: Same as figure 3 but with a very high dropout rate of 0.7.
Figure 5: Training error (left) and validation error (right) on an MNIST subset, with and without dropout, for a neural network with 2 hidden layers of 500 and 300 hidden units, trained on 1000 examples.
6 Conclusions
We presented a way of getting the benefits of dropout training without actually sampling, thereby speeding up the process by a factor of 2-30. For high dimensional datasets (over a few hundred dimensions), each iteration of fast dropout is only about 2 times slower than normal training. While the objective for fast dropout is the same as for MC dropout in the long run, because fast dropout does not lose any information in individual training cases, it is capable of doing more work in each iteration, reaching the same validation set performance in a shorter time and generalizing better.

Acknowledgments We thank Andrej Karpathy, Charlie Tang and Percy Liang for helpful discussions. Sida Wang was partially funded by a Natural Sciences and Engineering Research Council of Canada PGS-M scholarship and a Stanford School of Engineering fellowship.
Methods \ Datasets     RT-2k   IMDB    RTs    subj    AthR   CR     MPQA   Average
MC dropout             89.8    91.2    79.2   93.3    86.7   82.0   86.0   86.88
  training time (s)    6363    6839    2264   2039    126    582    417    2661
Gaussian approx.       89.7    91.2    79.0   93.4    87.4   82.1   86.1   86.99
  training time (s)    240     1071    362    323     6      90     185    325
plain LR               88.2    89.5    77.2   91.3    83.6   80.4   84.6   84.97
  training time (s)    145     306     81     68      3      17     22     92
Previous results
TreeCRF [9]            -       -       77.3   -       -      81.4   86.1   -
Vect. Sent. [10]       88.9    88.89   -      88.13   -      -      -      -
RNN [11]               -       -       77.7   -       -      -      86.4   -
NBSVM [3]              89.45   91.22   79.4   93.2    87.9   81.8   86.3   87.03
|{i : xi > 0}|         788     232     22     25      346    21     4      -
Table 2: The main table of results. The top section contains the accuracy, and training time (in seconds) for various datasets. 100 samples are used for both MC dropout and the Gaussian approximation. The last row shows the average number of non-sparse dimensions in the dataset.
References

[1] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. “Improving neural networks by preventing co-adaptation of feature detectors”. In: CoRR abs/1207.0580 (2012).
[2] Andrew Y. Ng and Michael I. Jordan. “On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes”. In: Proceedings of NIPS. Vol. 2. 2002, pp. 841–848.
[3] Sida Wang and Christopher Manning. “Baselines and Bigrams: Simple, Good Sentiment and Topic Classification”. In: Proceedings of the ACL. 2012, pp. 90–94.
[4] Charles Sutton, Michael Sindelar, and Andrew McCallum. “Reducing Weight Undertraining in Structured Discriminative Learning”. In: Conference on Human Language Technology and North American Association for Computational Linguistics (HLT-NAACL). 2006.
[5] Erich L. Lehmann. Elements of Large-Sample Theory. Springer, 1998, p. 101. ISBN: 0387985956.
[6] Patrizio Frederic and Frank Lad. “Two Moments of the Logitnormal Distribution”. In: Communications in Statistics - Simulation and Computation 37.7 (2008), pp. 1263–1269.
[7] Simon J. Julier and Jeffrey K. Uhlmann. “A New Extension of the Kalman Filter to Nonlinear Systems”. In: Proceedings of AeroSense: Simulations and Controls. 1997, pp. 182–193.
[8] David J.C. MacKay. “The evidence framework applied to classification networks”. In: Neural Computation 5.4 (1992), pp. 720–736.
[9] Tetsuji Nakagawa, Kentaro Inui, and Sadao Kurohashi. “Dependency tree-based sentiment classification using CRFs with hidden variables”. In: Proceedings of ACL:HLT. 2010.
[10] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. “Learning Word Vectors for Sentiment Analysis”. In: Proceedings of the ACL. 2011.
[11] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. “Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions”. In: Proceedings of EMNLP. 2011.
[12] Bo Pang and Lillian Lee. “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales”. In: Proceedings of the ACL. 2005.
[13] Minqing Hu and Bing Liu. “Mining and summarizing customer reviews”. In: Proceedings of ACM SIGKDD. 2004, pp. 168–177.
[14] Janyce Wiebe, Theresa Wilson, and Claire Cardie. “Annotating Expressions of Opinions and Emotions in Language”. In: Language Resources and Evaluation 39.2-3 (2005), pp. 165–210.
[15] Bo Pang and Lillian Lee. “A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts”. In: Proceedings of the ACL. 2004.
A Supplementary Material

A.1 Description of the datasets used
We compare with published results on these datasets. Detailed statistics are shown in table 3.

RT-s: Short movie reviews dataset containing one sentence per review [12]. This is known as the movie review (MR) dataset in some works.
CR: Customer review dataset [13] processed like in [9] (http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html).
MPQA: Opinion polarity subtask of the MPQA dataset [14] (http://www.cs.pitt.edu/mpqa/).
Subj: The subjectivity dataset with subjective reviews and objective plot summaries [15].
IMDB-2k (RT-2k): The standard 2000 full-length movie review dataset [15]. We mistakenly referred to this as RT-2k in [3], but it actually comes from IMDB.
IMDB: A large movie review dataset with 50k full-length reviews [10] (http://ai.stanford.edu/~amaas/data/sentiment).
AthR, XGraph, BbCrypt: Classify pairs of newsgroups in the 20-newsgroups dataset (http://people.csail.mit.edu/jrennie/20Newsgroups) with all headers stripped off (the third (18828) version), namely: alt.atheism vs. religion.misc, comp.windows.x vs. comp.graphics, and rec.sport.baseball vs. sci.crypt, respectively.

Dataset   (N+, N−)       l     CV   |V|    ∆
RT-s      (5331, 5331)   21    10   21K    0.8
CR        (2406, 1366)   20    10   5713   1.3
MPQA      (3316, 7308)   3     10   6299   0.8
Subj.     (5000, 5000)   24    10   24K    0.8
RT-2k     (1000, 1000)   787   10   51K    1.5
IMDB      (25k, 25k)     231   N    392K   0.4
AthR      (799, 628)     345   N    22K    2.9
XGraph    (980, 973)     261   N    32K    1.8
BbCrypt   (992, 995)     269   N    25K    0.5
Table 3: Dataset statistics. (N+ , N− ): number of positive and negative examples. l: average number of words per example. CV: number of cross-validation splits, or N for train/test split. |V |: the vocabulary size. ∆: upper-bounds of the differences required to be statistically significant at the p < 0.05 level.
A.2 Computing α, β, γ

α, β, γ can be computed by either numerical integration or by sampling:

α(µ, σ²) = y − E_{T∼N(µ,σ²)}[1/(1 + e^{−T})] = y − ∫_{−∞}^{∞} (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} · 1/(1 + e^{−x}) dx    (18)

β(µ, σ²) = ∂α(µ, σ²)/∂µ = − ∫_{−∞}^{∞} (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} · (x − µ)/(σ²(1 + e^{−x})) dx = −σ^{−2} E_{T∼N(µ,σ²)}[(T − µ)/(1 + e^{−T})]    (19)

γ(µ, σ²) = ∂α(µ, σ²)/∂σ² = − ∫_{−∞}^{∞} (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} · [ (x − µ)²/(2σ⁴(1 + e^{−x})) − 1/(2σ²(1 + e^{−x})) ] dx    (20)
A.3 Approximating softmax and cross-entropy loss
A.4 Output Variance

For the sigmoid unit,

τ² = Var_{X∼N(µ,s²)}[σ(X)] = E[σ(X)²] − E[σ(X)]²    (21)
   ≈ E[σ(a(X − b))] − E[σ(X)]²    (22)
   ≈ σ( a(µ − b)/√(1 + πa²s²/8) ) − σ( µ/√(1 + πs²/8) )²    (23)
a, b can be found by matching the values and derivatives. Some reasonable values are a = 4 − 2√2 and b = −log(√2 − 1).
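The variance approximation (21)-(23) with these a, b values, as a short sketch (function name is illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_output_variance(mu, s2):
    """Approximate Var[sigmoid(X)] for X ~ N(mu, s^2), eqs. (21)-(23), using
    sigmoid(x)^2 ~= sigmoid(a*(x - b)) with a = 4 - 2*sqrt(2), b = -log(sqrt(2) - 1)."""
    a = 4.0 - 2.0 * np.sqrt(2.0)
    b = -np.log(np.sqrt(2.0) - 1.0)
    e_sq = sigmoid(a * (mu - b) / np.sqrt(1.0 + np.pi * a ** 2 * s2 / 8.0))  # E[sigmoid(X)^2]
    e = sigmoid(mu / np.sqrt(1.0 + np.pi * s2 / 8.0))                        # E[sigmoid(X)]
    return e_sq - e ** 2
```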
A.5 Regularization by individual informativeness
While the Gaussian and deterministic approximations are more efficient than the direct MC computation, they still do considerably more work than plain LR. We present another procedure in a very similar spirit that is as fast as plain LR, but possibly less robust than dropout. It is a reinterpretation of the feature transformation described in [3] applied to support vector machines; the idea can be very easily applied to logistic regression. Let L_0(w) = −log(p(y|x, w)) be the negative log-likelihood. Define the cost function:

L(w) = L_0(w) + Σ_{i,j} C_ij w_ij²    (24)
The only difference from regular logistic regression is that C_ij depends on how informative x_j is for label y = i, so that individually uninformative features are heavily regularized. For sparse binary features, a good choice is 1/C_ij = |log( p(x_j|y=i) / p(x_j|y≠i) )| + ε. More generally, one can either bin the data to make it binary or fit a decision stump τ_ij ∈ R, b_ij ∈ {0, 1} to the data, and set 1/C_ij = |log( p(b_ij x_j > τ_ij | y=i) / p(b_ij x_j > τ_ij | y≠i) )| + ε.
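A sketch of the penalty computation for sparse binary features in the binary-label case; the count smoothing and the constants eps, smooth are our own illustrative additions, not part of the recipe above.

```python
import numpy as np

def informativeness_penalties(X, y, eps=1.0, smooth=1.0):
    """Per-feature L2 strengths C_j for a binary task y in {0, 1}, following (24):
    1/C_j = |log p(x_j|y=1) - log p(x_j|y=0)| + eps for sparse binary features X."""
    X = (X > 0).astype(float)
    pos = (X[y == 1].sum(axis=0) + smooth) / (np.sum(y == 1) + 2 * smooth)
    neg = (X[y == 0].sum(axis=0) + smooth) / (np.sum(y == 0) + 2 * smooth)
    return 1.0 / (np.abs(np.log(pos) - np.log(neg)) + eps)
```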