How To Sample For Single Parameter Tuning
Generally, we need to try different sets of parameters to find the best performing one.
In terms of the number of layers, it could be a linear search (sketched after this list):
- Define a range of possible numbers of layers e.g., $[5, 20]$
- Uniformly sample from this range
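As a minimal sketch, assuming NumPy and the $[5, 20]$ range from the bullet above:

```python
import numpy as np

# Uniformly sample an integer layer count from [5, 20];
# randint's upper bound is exclusive, hence 21.
num_layers = np.random.randint(5, 21)
```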
However, the same cannot be applied to the learning rate, the momentum parameter, or the RMSProp parameter. Their plausible values lie in $(0, 1]$ but span several orders of magnitude, so uniform sampling would spend roughly 90% of its samples in $[0.1, 1]$. Therefore, the search would look like:
- Choose a range of $\log_{10}$ of possible values. E.g., to search $[10^{-3}, 1]$ for the learning rate, choose $[-3, 0]$.
- Uniformly sample in that log space.
For example, to sample the learning rate $\alpha$ from $[10^{-5}, 1]$:

```python
import numpy as np

r = -5 * np.random.rand()  # r is uniform in [-5, 0]
alpha = 10 ** r            # gives 1e-5 to 1
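A related trick, not stated above but standard practice (so treat this sketch as an assumption): good momentum values cluster near 1 (e.g., $0.9$ to $0.999$), so sample $1 - \beta$ on the log scale rather than $\beta$ itself:

```python
import numpy as np

# Assumed target range: beta in [0.9, 0.999],
# i.e., 1 - beta in [1e-3, 1e-1].
r = -3 + 2 * np.random.rand()  # r is uniform in [-3, -1]
beta = 1 - 10 ** r             # beta ranges from 0.9 to 0.999
```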
Tuning Strategy On A High Level
If you have lots of compute, you might be able to train multiple models at a time. If not, you might train a single model with one set of hyperparameters, “babysit” the training process, and, based on the results, decide what the next set should be. This process can take days.
Another point is that hyperparameters can go stale. As the business and its data grow, you might want to re-evaluate the hyperparameters and retrain.
Importance ratings:
learning_rate > num_hidden_layers > num_nodes_layer > Adam parameters
When trying different values, random sampling is better than a grid: on a grid, each individual hyperparameter takes on only a handful of distinct values, whereas random sampling tries a fresh value of every hyperparameter on each trial, which matters when some hyperparameters are far more important than others.
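As a minimal sketch of random search over two hyperparameters, reusing the ranges from above (`n_trials` and `train_and_evaluate` are hypothetical names, not from the notes):

```python
import numpy as np

n_trials = 25  # hypothetical search budget

for _ in range(n_trials):
    # Each trial draws a fresh value of every hyperparameter,
    # unlike a grid, which repeats values along each axis.
    alpha = 10 ** (-5 * np.random.rand())  # log-uniform in [1e-5, 1]
    num_layers = np.random.randint(5, 21)  # uniform integer in [5, 20]
    # train_and_evaluate(alpha, num_layers)  # hypothetical training call
```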
Coarse to fine is the process of zooming in on a promising region of the hyperparameter space and sampling more densely there.
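A minimal sketch of one coarse-to-fine round for the learning rate, again assuming a hypothetical `train_and_evaluate` that returns a validation score:

```python
import numpy as np

def sample_alpha(lo_exp, hi_exp):
    """Sample log-uniformly from [10**lo_exp, 10**hi_exp]."""
    return 10 ** np.random.uniform(lo_exp, hi_exp)

# Coarse pass over the full range [1e-5, 1].
coarse = [sample_alpha(-5, 0) for _ in range(20)]
# scores = [train_and_evaluate(a) for a in coarse]  # hypothetical
# best = coarse[int(np.argmax(scores))]
# Fine pass: zoom into roughly one decade around the best value.
# fine = [sample_alpha(np.log10(best) - 0.5, np.log10(best) + 0.5)
#         for _ in range(20)]
```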