Rico's Nerd Cluster

"Before leaving this world, everything is a process."

Deep Learning - Layer Normalization

Normalization For Sequential Data

Layer Normalization Batch normalization has two main constraints: when the batch size becomes smaller, it performs poorly. Nowadays, we tend to have higher data resolution, especially in large NLP tra...

Deep Learning - Batch Normalization (BN)

Internal Covariate Shift

Batch Normalization Among the many pitfalls of ML, statistical stability is always high on the list. Model training is random: the initialization, and even the common optimizers (SGD, Adam, etc.), are stoc...

Deep Learning - Optimizations Part 1

Momentum, RMSProp, Adam, AdamW, Learning Rate Decay, Local Minima, Gradient Clipping

Introduction Deep learning is still highly empirical: it works well where there is a lot of data, but its theories are not set in stone (at least not yet). So use the below optimization techniq...

Deep Learning - Exploding And Vanishing Gradients

When in doubt, be courageous, try things out, and see what happens! - James Dellinger

Why Exploding & Vanishing Gradients Happen In a very deep network, the output of each layer might diminish or explode. This is mainly because layer outputs are products of $W_1W_2…x$ (ignoring act...
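A quick numerical sketch (a toy example of my own, not from the post) of why repeated matrix products make activations vanish or explode: a scale just below 1 shrinks the signal geometrically, just above 1 blows it up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

norms = {}
for scale in (0.5, 1.5):
    W = scale * np.eye(3)          # toy weight matrix, reused at every layer
    h = x.copy()
    for _ in range(50):            # 50 linear layers, activations ignored
        h = W @ h                  # h = W^50 x, so ||h|| ~ scale^50 * ||x||
    norms[scale] = np.linalg.norm(h)

print(norms)  # scale 0.5 -> vanishes, scale 1.5 -> explodes
```

Real weight matrices aren't scaled identities, but the same geometric growth/decay shows up through their singular values.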

Deep Learning - Overfitting

Bias, Variance, Overfitting, Regularization, Dropout

A Nice Quote 💡 Before we delve in, I’d like to quote from James Dellinger that really hits home: I think the journey we took here showed us that this knee-jerk response of feeling of intimidat...

Deep Learning - Batch Gradient Descent

Batch Gradient Descent, Mini-Batch

A Neuron And Batch Gradient Descent A neuron has multiple inputs and a single output. First it computes the weighted sum of all its inputs, then feeds that into an "activation function". Below, the activat...
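A minimal sketch of the neuron described above (the sigmoid activation and the specific numbers are illustrative choices, not from the post):

```python
import numpy as np

def neuron(x, w, b):
    """Weighted sum of inputs, then a sigmoid activation."""
    z = np.dot(w, x) + b                 # weighted sum
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation

x = np.array([1.0, 2.0, 3.0])            # inputs
w = np.array([0.1, -0.2, 0.3])           # weights
out = neuron(x, w, b=0.0)                # z = 0.1 - 0.4 + 0.9 = 0.6
print(out)                               # sigmoid(0.6) ~ 0.646
```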

Deep Learning - Loss Functions

Mean Squared Error, Mean Absolute Error, Hinge Loss, Huber Loss, L1 Loss, Cross Entropy Loss, NLL Loss, Sparse Entropy, IoU Loss, Dice Loss, Focal Loss, Cauchy Robust Kernel

Regression Losses Mean Squared Error (MSE) \[\text{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2\] Disadvantages: Sensitive to outliers because errors are squared. Assumes Gau...
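A small sketch of the MSE formula above, alongside MAE to show the outlier sensitivity the excerpt mentions (data values are made up for illustration):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of squared residuals."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean absolute error: average of absolute residuals."""
    return np.mean(np.abs(y - y_hat))

y     = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 3.0, 14.0])   # one outlier with error 10

print(mse(y, y_hat))  # 25.0 -- dominated by the squared outlier
print(mae(y, y_hat))  # 2.5
```

Squaring makes the single bad prediction account for all of the MSE, which is exactly why MSE is called outlier-sensitive.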

Deep Learning - Activation

Sigmoid, ReLU, GELU, Tanh

Activation Functions Early work observed that the Rectified Linear Unit (ReLU) often trains faster than sigmoid-like activations because it avoids saturation for positive inputs and has a simple g...
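A minimal sketch contrasting the two activations (sample inputs are illustrative): sigmoid squashes large inputs toward 0 or 1, which is where its gradient saturates, while ReLU passes positive inputs through unchanged.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))   # values near 0, exactly 0.5, near 1
print(relu(x))      # [0. 0. 5.]
```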

Deep Learning - Auto Differentiator From Scratch

Auto Diff Is The Dark Magic Of All Dark Magics Of Deep Learning

Introduction Gradients here refer to scalar-to-matrix gradients. We need to accumulate gradients for mini-batch training. Elementwise multiplication gradients: $A \odot B = C$, $\partial C / \partial A_{ij}$...
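A sketch of the elementwise-multiplication gradient mentioned above, with a finite-difference sanity check (the scalar loss L = sum(C) is my own toy choice to make the gradient concrete):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(2, 3))

# Forward: elementwise product, then a scalar loss L = sum(C)
C = A * B
L = C.sum()

# Backward: dL/dC is all ones, and for C = A * B (elementwise)
# the chain rule gives dL/dA = dL/dC * B, also elementwise.
dL_dC = np.ones_like(C)
dL_dA = dL_dC * B

# Finite-difference check on one entry
eps = 1e-6
A2 = A.copy()
A2[0, 0] += eps
numeric = ((A2 * B).sum() - L) / eps
assert abs(numeric - dL_dA[0, 0]) < 1e-4
print("gradient check passed")
```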

Deep Learning - Gradient Checking

First Step To Debugging A Neural Net

How To Do Gradient Checking In calculus, we all learned that a derivative is defined as: \[\begin{gather*} f'(\theta) = \lim_{\epsilon \to 0} \frac{f(\theta + \epsilon) - f(\theta)}{\epsilon} \end{gather*}\] ...
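In practice, gradient checking usually uses the central difference $(f(\theta + \epsilon) - f(\theta - \epsilon)) / 2\epsilon$, whose error shrinks as $O(\epsilon^2)$ instead of $O(\epsilon)$. A minimal sketch (the quadratic test function is illustrative):

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-5):
    """Central-difference estimate of df/dtheta, one parameter at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i]  += eps
        minus[i] -= eps
        grad[i] = (f(plus) - f(minus)) / (2 * eps)
    return grad

# Check an analytic gradient: f(theta) = sum(theta^2)  =>  df/dtheta = 2*theta
theta = np.array([1.0, -2.0, 3.0])
analytic  = 2 * theta
numerical = numerical_grad(lambda t: np.sum(t ** 2), theta)

rel_err = (np.linalg.norm(analytic - numerical)
           / np.linalg.norm(analytic + numerical))
print(rel_err)  # a tiny relative error means the analytic gradient is right
```

A relative error much larger than ~1e-5 would suggest a bug in the analytic (backprop) gradient.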