ENS-Data Science colloquium – Rebecca Willett: How do simple rotations affect the implicit bias of Adam?
Amphi Jean Jaurès, 29 rue d'Ulm, Paris, France

Adaptive gradient methods such as Adam and Adagrad are widely used in machine learning, yet their effect on the generalization of learned models, relative to methods like gradient descent, remains poorly understood. Prior work on binary classification suggests that Adam exhibits a “richness bias,” which can help it learn nonlinear decision boundaries closer to the Bayes-optimal decision boundary relative to gradient descent. However, the coordinate-wise preconditioning scheme employed by Adam renders the overall method sensitive to orthogonal transformations of feature […]
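The rotation sensitivity mentioned in the abstract can be illustrated with a small toy experiment. The sketch below is not taken from the talk: the synthetic data, the linear logistic model, the hyperparameters, and the helper names (`grad`, `run_gd`, `run_adam`) are all assumptions chosen for illustration. It trains the same model on a dataset and on a 45-degree rotation of it, then compares the resulting predictions. Gradient descent should give matching predictions up to floating-point error, while a textbook Adam update generally should not.

```python
import numpy as np

def grad(w, X, y):
    # Gradient of the mean logistic loss log(1 + exp(-y * x.w)), labels in {-1, +1}.
    margins = y * (X @ w)
    return -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

def run_gd(X, y, steps=50, lr=0.05):
    # Plain gradient descent from a zero initialization.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * grad(w, X, y)
    return w

def run_adam(X, y, steps=50, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    # Textbook Adam with bias correction; the preconditioner acts per coordinate.
    w = np.zeros(X.shape[1])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w, X, y)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat = m / (1 - b1**t)
        v_hat = v / (1 - b2**t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Hypothetical toy data: two features with very different scales, separable labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) * np.array([3.0, 0.3])
y = np.sign(X[:, 0] + X[:, 1])

# Orthogonal transformation of the features: rotation by 45 degrees.
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X_rot = X @ Q

# Largest difference in predicted scores between the original and rotated runs.
gd_gap = np.max(np.abs(X @ run_gd(X, y) - X_rot @ run_gd(X_rot, y)))
adam_gap = np.max(np.abs(X @ run_adam(X, y) - X_rot @ run_adam(X_rot, y)))

print("gradient descent prediction gap:", gd_gap)
print("Adam prediction gap:", adam_gap)
```

Gradient descent is equivariant here because the gradient on the rotated data is the rotated gradient, so the iterates simply rotate along with the features. Adam's element-wise division by the square root of the second-moment estimate does not commute with a rotation, which is why the two Adam runs drift apart.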