Rico's Nerd Cluster

"Before leaving this world, everything is a process."

Deep Learning - Layer Normalization

Normalization For Sequential Data

Layer Normalization Batch normalization has two main constraints: when the batch size becomes smaller, it performs poorly. Nowadays, we tend to have higher data resolution, especially in large NLP tra...
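
To make the contrast concrete, here is a minimal NumPy sketch of layer normalization (the function name and shapes are mine, not from the post): the statistics are computed per sample over the feature dimension, so the batch size never enters the computation.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Minimal layer normalization: normalize each sample over its own
    feature dimension, so the statistics do not depend on the batch size."""
    mean = x.mean(axis=-1, keepdims=True)       # per-sample mean
    var = x.var(axis=-1, keepdims=True)         # per-sample variance
    x_hat = (x - mean) / np.sqrt(var + eps)     # standardize the features
    return gamma * x_hat + beta                 # learnable scale and shift

# Works identically for a batch of 1 or 1000 samples.
x = np.random.randn(2, 8)                       # (batch, features)
out = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```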

Deep Learning - Batch Normalization (BN)

Internal Covariate Shift

Batch Normalization Among the many pitfalls of ML, statistical stability is always high on the list. Model training is random: the initialization, and even the common optimizers (SGD, Adam, etc.), are stoc...

Deep Learning - Optimizations Part 1

Momentum, RMSProp, Adam, AdamW, Learning Rate Decay, Local Minima, Gradient Clipping

Introduction Deep learning is still highly empirical: it works well when there is a lot of data, but its theories are not set in stone (at least not yet). So use the below optimization techniq...

Deep Learning - Exploding And Vanishing Gradients

When in doubt, be courageous, try things out, and see what happens! - James Dellinger

Why Exploding & Vanishing Gradients Happen In a very deep network, the output of each layer might diminish or explode. This is mainly because layer outputs are products of the form $W_1 W_2 \dots x$ (ignoring act...
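
A rough illustration of that product-of-weights argument (my own toy example, not code from the post): repeatedly multiplying a signal by weight matrices whose scale sits slightly above or below 1.0 makes its norm grow or shrink geometrically.

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(64)

for scale, label in [(1.1, "explode"), (0.9, "vanish")]:
    h = x.copy()
    for layer in range(50):                          # 50 "layers", no activations
        W = scale * np.random.randn(64, 64) / np.sqrt(64)
        h = W @ h
    print(label, np.linalg.norm(h))                  # norm grows/shrinks roughly like scale**50
```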

Deep Learning - Overfitting

Bias, Variance, Overfitting, Regularization, Dropout

A Nice Quote 💡 Before we delve in, I’d like to share a quote from James Dellinger that really hits home: I think the journey we took here showed us that this knee-jerk response of feeling of intimidat...

Deep Learning - Batch Gradient Descent

Batch Gradient Descent, Mini-Batch

A Neuron And Batch Gradient Descent A neuron has multiple inputs and a single output. First it computes the weighted sum of all inputs, then feeds it into an “activation function”. Below, the activat...
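
As a minimal sketch of that description (the names and numbers here are illustrative, not from the post), a single neuron is just a weighted sum followed by an activation:

```python
import numpy as np

def neuron(x, w, b, activation=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """A single neuron: weighted sum of the inputs, then an activation (sigmoid here)."""
    z = np.dot(w, x) + b        # weighted sum of all inputs plus a bias
    return activation(z)        # single scalar output

print(neuron(x=np.array([0.5, -1.0, 2.0]),
             w=np.array([0.1, 0.4, -0.2]),
             b=0.3))
```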

Deep Learning - Activation and Loss Functions

Sigmoid, ReLU, GELU, Tanh, Mean Squared Error, Mean Absolute Error, Cross Entropy Loss, Hinge Loss, Huber Loss, IoU Loss, Dice Loss, Focal Loss, Cauchy Robust Kernel

Activation Functions Early papers found that the Rectified Linear Unit (ReLU) consistently trains faster than Sigmoid because of its larger derivatives and its non-zero derivative in the positive region. Howeve...
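
To make the derivative comparison concrete, here is a small sketch of my own (printed values are approximate): the sigmoid derivative never exceeds 0.25 and decays toward 0 for large |z|, while the ReLU derivative is exactly 1 everywhere in the positive region.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # capped at 0.25, vanishes for large |z|

def d_relu(z):
    return (z > 0).astype(float)    # exactly 1 for z > 0, 0 otherwise

z = np.array([-5.0, -1.0, 0.5, 5.0])
print(d_sigmoid(z))   # ≈ [0.0066, 0.1966, 0.2350, 0.0066]
print(d_relu(z))      # [0., 0., 1., 1.]
```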

Deep Learning - Auto Differentiator From Scratch

Auto Diff Is The Dark Magic Of All Dark Magics Of Deep Learning

Introduction Gradients here refer to the gradient of a scalar with respect to a matrix. We need to accumulate gradients for mini-batch training. Elementwise multiplication gradients: $C = A \odot B$, $\partial C / \partial A_{ij}$...
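
Here is a small numerical check of that elementwise-multiplication rule (a sketch under my own notation, assuming $C = A \odot B$ and a scalar loss $L$): each $C_{ij}$ depends on $A_{ij}$ only, with $\partial C_{ij} / \partial A_{ij} = B_{ij}$, so $\partial L / \partial A = (\partial L / \partial C) \odot B$.

```python
import numpy as np

np.random.seed(0)
A, B = np.random.randn(3, 4), np.random.randn(3, 4)
dL_dC = np.random.randn(3, 4)        # upstream gradient from the rest of the graph

dL_dA = dL_dC * B                    # analytic gradient via the chain rule

def loss(A_):
    return np.sum(dL_dC * (A_ * B))  # scalar whose gradient w.r.t. A is dL_dC * B

# Finite-difference check on a single entry (i, j) = (1, 2):
eps, i, j = 1e-6, 1, 2
A_plus = A.copy()
A_plus[i, j] += eps
numeric = (loss(A_plus) - loss(A)) / eps
print(dL_dA[i, j], numeric)          # the two values should match closely
```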

Deep Learning - Gradient Checking

First Step To Debugging A Neural Net

How To Do Gradient Checking In calculus, we all learned that a derivative is defined as: \[\begin{gather*} f'(\theta) = \lim_{\epsilon \to 0} \frac{f(\theta + \epsilon) - f(\theta)}{\epsilon} \end{gather*}\] ...
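
A minimal sketch of such a gradient check in NumPy (function names are mine; it uses the symmetric difference $(f(\theta+\epsilon) - f(\theta-\epsilon)) / 2\epsilon$, which is more accurate in practice than the one-sided formula):

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-6):
    """Finite-difference gradient of a scalar function f at theta,
    perturbing one coordinate at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e.flat[i] = eps
        grad.flat[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return grad

# Compare against a known analytic gradient, e.g. f(θ) = ||θ||², ∇f = 2θ.
theta = np.random.randn(5)
print(numerical_grad(lambda t: np.sum(t**2), theta))
print(2 * theta)   # the two should agree closely
```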

Deep Learning - Introduction

Why Do We Even Need Deep Neural Nets? Data Partition, ML Ops, Data Normalization

Why Do We Need Deep Learning Any bounded continuous function can be approximated by an arbitrarily wide single layer. Why? The idea is roughly that linear combinations of activation function...
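
As a toy illustration of that idea (my own sketch, not code from the post): a single tanh layer with random input weights and least-squares output weights already fits $\sin(x)$ well on a bounded interval, because the output is just a linear combination of shifted and scaled activations.

```python
import numpy as np

np.random.seed(0)
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(x)

hidden = 100
W = np.random.randn(1, hidden) * 2.0            # random input weights
b = np.random.randn(hidden)                     # random biases
H = np.tanh(x @ W + b)                          # hidden activations, shape (200, hidden)

w_out, *_ = np.linalg.lstsq(H, y, rcond=None)   # solve for the output layer weights
y_hat = H @ w_out
print("max abs error:", np.abs(y_hat - y).max())  # small once there are enough hidden units
```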