NER, spaCy and Optimised Lasagne

Paul Ellis
Feb 3, 2021

We previously discussed that spaCy will utilise default optimiser settings if the sgd parameter is not set when creating a model.
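As a quick illustration of that fallback, here is a minimal sketch assuming the spaCy v2 training API used in this series; the pipeline, label, example sentence and dropout value are purely illustrative and not the Campus data itself.

```python
import spacy

# Minimal sketch, assuming the spaCy v2 training API.
nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("CAMPUS")          # hypothetical label, for illustration only

nlp.begin_training()             # initialises weights and sets up the default optimiser

# sgd is left unset here, so spaCy falls back to its default Adam optimiser,
# configured with the hyperparameter values listed at the link below.
losses = {}
nlp.update(
    ["Paul works at Campus North."],
    [{"entities": [(14, 26, "CAMPUS")]}],
    drop=0.2,                    # dropout value is illustrative
    losses=losses,
)
print(losses)
```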

A list of the default hyperparameters is outlined below.

https://spacy.io/api/cli#train-hyperparams

If no value is set for the sgd parameter, spaCy will utilise the Adam optimiser with default settings. Adam (Kingma and Ba, 2014[1]) is another adaptive learning rate optimization algorithm. The name “Adam” derives from the phrase “adaptive moment estimation”. In the context of the previous algorithms it is seen as a variant on the combination of RMSProp and momentum with a few distinctions. First, in Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient. The most straightforward way to add momentum to RMSProp would be to apply momentum to the rescaled gradients, and the use of momentum in combination with rescaling does not have a clear theoretical motivation. Second, Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization at the origin. (Goodfellow, 2016[2])
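To make those two distinctions concrete, here is a plain NumPy sketch of a single Adam update step using the default settings from the original paper. It is illustrative only and is not spaCy’s or thinc’s implementation.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter `param` at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: momentum folded into the gradient estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: uncentered, RMSProp-style rescaling
    m_hat = m / (1 - beta1 ** t)              # bias correction for initialisation at zero
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```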

Adam can be viewed as a combination of RMSprop and momentum: RMSprop contributes the exponentially decaying average of past squared gradients v_t, while momentum accounts for the exponentially decaying average of past gradients. We have also seen that Nesterov accelerated gradient (NAG) is superior to vanilla momentum. Nadam (Nesterov-accelerated Adaptive Moment Estimation) thus combines Adam and NAG.

https://ruder.io/optimizing-gradient-descent/index.html#nadam

The following article assisted in the setting of the optimiser learning rate.

To test an explicitly configured optimiser for minimising the loss function, we imported the Adam and Model utilities from the thinc library. We then set the attribute values to those defined in the hyperparameters, namely learn_rate, beta1, beta2, L2_penalty, optimizer_eps and grad_norm_clip. We retained the same dropout rate but chose to print the loss data before saving the model to a new directory, ‘model3’.
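A sketch of how this might look, assuming spaCy v2 with thinc 7.x (where the optimiser is imported from thinc.neural.optimizers). The training data, the dropout value of 0.2 and the epoch count are placeholders; the real Campus records are prepared elsewhere in this series.

```python
import random
import spacy
from thinc.neural import Model
from thinc.neural.optimizers import Adam

# Placeholder training data in spaCy v2 format.
TRAIN_DATA = [
    ("Paul studied at Campus North.", {"entities": [(16, 28, "ORG")]}),
]

nlp = spacy.load("en_core_web_sm")

# Build Adam with the values from spaCy's hyperparameter page:
# learn_rate=0.001, beta1=0.9, beta2=0.999, L2_penalty=1e-6,
# optimizer_eps=1e-8, grad_norm_clip=1.0.
optimizer = Adam(Model.ops, 0.001, beta1=0.9, beta2=0.999, eps=1e-8, L2=1e-6)
optimizer.max_grad_norm = 1.0

# Train only the NER component; dropout and epoch count are assumed values.
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):
    for epoch in range(30):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], drop=0.2, sgd=optimizer, losses=losses)
        print(epoch, losses)   # print the loss data before saving

nlp.to_disk("model3")
```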

As before, we used the same volume of records from the Campus data.

The results closely resembled those previously obtained with spaCy’s default optimiser.

Although we have successfully created and demonstrated an NER model, the underlying model is constructed from spaCy’s English core model, en_core_web_sm, whereas our data contains Unicode text from multiple languages. However, it provides the building blocks for further investigation into non-English models.

Paul

[1] Kingma, D. P. and Ba, J. (2014). Adam: A Method for Stochastic Optimization.

[2] Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning. www.deeplearningbook.org
