Rico's Nerd Cluster

"Before leaving the world, everything is a process."

Deep Learning - TensorFlow Basics

Nothing Fancy, Just A Basic TF Network

Basic Operations: Immutable (tf.constant) vs Variable (tf.Variable); notice the different capitalization. Max Operations: tf.math.reduce_max() finds the max along certain dimension(s)...
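A minimal sketch of these basics, assuming a standard TensorFlow 2.x install (the example tensors are made up for illustration):

```python
import tensorflow as tf

# Immutable tensor: its values cannot be reassigned after creation.
c = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# Mutable tensor: its values can be updated in place, e.g. during training.
v = tf.Variable([[1.0, 2.0], [3.0, 4.0]])
v.assign_add(tf.ones_like(v))          # v is now [[2., 3.], [4., 5.]]

# Reduce along a chosen dimension: axis=0 -> column max, axis=1 -> row max.
print(tf.math.reduce_max(c))           # 4.0 (global max)
print(tf.math.reduce_max(c, axis=0))   # [3. 4.]
print(tf.math.reduce_max(c, axis=1))   # [2. 4.]
```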

Deep Learning - Softmax And Cross Entropy Loss

Softmax, Cross Entropy Loss, and MLE

Softmax: When we build a classifier for cat classification, at the end of training, it’s necessary to find the most likely classes for given inputs. The raw unnormalized ...
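A minimal NumPy sketch of softmax turning raw, unnormalized scores into class probabilities (the three-class logits are made-up values for illustration):

```python
import numpy as np

def softmax(logits):
    """Convert raw, unnormalized scores (logits) into class probabilities."""
    # Subtracting the max is a standard numerical-stability trick; it does not
    # change the result because softmax is invariant to constant shifts.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

logits = np.array([2.0, 1.0, 0.1])     # e.g. scores for [cat, dog, bird]
probs = softmax(logits)
print(probs, probs.sum())              # ~[0.659, 0.242, 0.099], sums to 1
```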

Deep Learning - Hyper Parameter Tuning

It finally comes down to how much compute we have, actually...

How To Sample For Single Parameter Tuning: Generally, we need to try different sets of parameters to find the best-performing one. In terms of the number of layers, it could be a linear search: Defin...
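A minimal sketch of two common sampling strategies, with illustrative ranges and candidate counts that are assumptions rather than the post's actual values: linear/uniform sampling for a small integer hyperparameter such as the number of layers, and log-scale sampling for one that spans orders of magnitude such as the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Number of layers: a small integer range, so a linear (uniform) search is fine.
n_layers_candidates = rng.integers(low=2, high=10, size=5)

# Learning rate: spans several orders of magnitude, so sample on a log scale
# (uniform in the exponent), e.g. between 1e-4 and 1e-1.
exponents = rng.uniform(low=-4, high=-1, size=5)
lr_candidates = 10.0 ** exponents

print(n_layers_candidates)   # 5 candidate layer counts between 2 and 9
print(lr_candidates)         # 5 learning rates scattered between 1e-4 and 1e-1
```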

Deep Learning - Layer Normalization

Normalization For Sequential Data

Layer Normalization: Batch normalization has two main constraints: when the batch size becomes smaller, it performs poorly; and nowadays we tend to have higher data resolution, especially in large NLP tra...
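A minimal NumPy sketch of layer normalization, which normalizes each sample over its own feature dimension and therefore does not depend on the batch size (the toy batch below is made up):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each sample over its feature dimension (last axis),
    independently of the other samples in the batch."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])     # batch of 2 samples, 3 features each
y = layer_norm(x)
# Each row now has ~zero mean and ~unit variance, even with batch size 1.
print(y.mean(axis=-1), y.var(axis=-1))
```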

Deep Learning - Batch Normalization (BN)

Internal Covariate Shift

Batch Normalization: Among the many pitfalls of ML, statistical stability is always high on the list. Model training is random: the initialization, and even the common optimizers (SGD, Adam, etc.), are stoc...
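A minimal NumPy sketch of the training-time batch-norm transform, normalizing each feature over the mini-batch and then applying the learnable scale and shift (gamma and beta are simply initialized to 1 and 0 here for illustration):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch (axis 0), then apply
    the learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)              # per-feature batch mean
    var = x.var(axis=0)                # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 4) * 5 + 3     # mini-batch of 32 samples, 4 features
gamma, beta = np.ones(4), np.zeros(4)
y = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0), y.var(axis=0))   # ~0 and ~1 per feature
```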

Deep Learning - Optimizations Part 1

Momentum, RMSProp, Adam, AdamW, Learning Rate Decay, Local Minima, Gradient Clipping

Introduction: Deep learning is still highly empirical; it works well where there is a lot of data, but its theory is not set in stone (at least not yet). So use the optimization techniq...
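As one example of the techniques named above, here is a minimal sketch of an SGD-with-momentum step, using the exponentially-weighted-average convention v = beta * v + (1 - beta) * grad (some references instead fold the (1 - beta) factor into the learning rate); all values are illustrative:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One SGD-with-momentum update: the velocity is an exponentially
    weighted average of past gradients, which damps oscillations."""
    velocity = beta * velocity + (1.0 - beta) * grad
    w = w - lr * velocity
    return w, velocity

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
grad = np.array([0.5, -0.5])           # pretend gradient from one mini-batch
w, v = sgd_momentum_step(w, grad, v)
print(w, v)
```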

Deep Learning - Exploding And Vanishing Gradients

When in doubt, be courageous, try things out, and see what happens! - James Dellinger

Why Exploding & Vanishing Gradients Happen: In a very deep network, the output of each layer might diminish or explode. This is mainly because layer outputs are products of $W_1 W_2 \dots x$ (ignoring act...
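A quick numeric sketch of that product effect: repeatedly multiplying by weights whose scale is consistently above or below 1 makes the signal blow up or shrink toward zero (the depth, width, and scales below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)             # an arbitrary 8-dimensional input

for scale, label in [(1.5, "exploding"), (0.5, "vanishing")]:
    h = x.copy()
    for _ in range(50):                 # 50 "layers", activations ignored
        W = scale * np.eye(8)           # weights consistently >1 or <1 in scale
        h = W @ h
    # 1.5**50 is ~6e8 while 0.5**50 is ~9e-16, so the norm either
    # explodes or all but vanishes after 50 layers.
    print(label, np.linalg.norm(h))
```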

Deep Learning - Overfitting

Bias, Variance, Overfitting, Regularization, Dropout

A Nice Quote 💡: Before we delve in, I’d like to share a quote from James Dellinger that really hits home: I think the journey we took here showed us that this knee-jerk response of feeling of intimidat...

Deep Learning - Batch Gradient Descent

Batch Gradient Descent, Mini-Batch

A Neuron And Batch Gradient Descent: A neuron has multiple inputs and a single output. First it computes the weighted sum of all inputs, then feeds it into an “activation function”. Below, the activat...
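A minimal sketch of such a neuron with a sigmoid activation (the inputs, weights, and bias are made-up numbers):

```python
import numpy as np

def neuron(x, w, b):
    """A single neuron: weighted sum of inputs plus bias, fed through
    an activation function (sigmoid here)."""
    z = np.dot(w, x) + b               # weighted sum
    return 1.0 / (1.0 + np.exp(-z))    # sigmoid activation

x = np.array([0.5, -1.0, 2.0])         # three inputs
w = np.array([0.1, 0.4, -0.2])         # one weight per input
b = 0.05
print(neuron(x, w, b))                 # a single scalar output in (0, 1)
```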

Deep Learning - Activation and Loss Functions

Sigmoid, ReLU, GELU, Tanh, Mean Squared Error, Mean Absolute Error, Cross Entropy Loss, Hinge Loss, Huber Loss, IoU Loss, Dice Loss, Focal Loss

Activation Functions: Early papers found that the Rectified Linear Unit (ReLU) is consistently faster than Sigmoid because of its larger derivatives and its non-zero derivative in the positive region. Howeve...
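A quick sketch comparing the two derivatives: sigmoid's gradient is at most 0.25 and saturates toward zero for large |z|, while ReLU's gradient is exactly 1 everywhere in the positive region (the sample points are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)               # at most 0.25, and ~0 for large |z|

def d_relu(z):
    return (z > 0).astype(float)       # exactly 1 for all positive inputs

z = np.array([-5.0, -1.0, 0.5, 5.0])
print(d_sigmoid(z))                    # small everywhere, nearly 0 at the tails
print(d_relu(z))                       # 0 for negatives, 1 for positives
```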