Adaptive gradient methods such as Adam and Adagrad are widely used in machine learning, yet their effect on the generalization of learned models – relative to methods like gradient descent – remains poorly understood. Prior work on binary classification suggests that Adam exhibits a “richness bias,” which can help it learn nonlinear decision boundaries closer to the Bayes-optimal decision boundary relative to gradient descent. However, the coordinate-wise preconditioning scheme employed by Adam renders the overall method sensitive to orthogonal transformations of feature space. We show that this sensitivity can manifest as a reversal of Adam’s competitive advantage: even small rotations of the underlying data distribution can make Adam forfeit its richness bias and converge to a linear decision boundary that is farther from the Bayes-optimal decision boundary than the one learned by gradient descent. To alleviate this issue, we show that a recently proposed reparameterization method – which applies an orthogonal transformation to the optimization objective – endows any first-order method with equivariance to data rotations, and we empirically demonstrate its ability to restore Adam’s bias towards rich decision boundaries. This is joint work with Adela DePavia and Vasileios Charisopoulos.
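The rotation sensitivity described in the abstract can be illustrated on a toy problem. The sketch below is not the classification setting or the reparameterization method from the talk. It is a minimal NumPy illustration, assuming a least-squares objective, synthetic data, a random orthogonal matrix `Q`, and hand-written `run_gd` and `run_adam` loops (all names and hyperparameters are hypothetical). It compares the predictions each optimizer produces on the original features and on the rotated features.

```python
import numpy as np

def grad(w, X, y):
    # Gradient of the least-squares loss 0.5 * mean((X @ w - y) ** 2).
    return X.T @ (X @ w - y) / len(y)

def run_gd(X, y, steps=20, lr=0.1):
    # Plain gradient descent from a zero initialization.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * grad(w, X, y)
    return w

def run_adam(X, y, steps=20, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Standard Adam update with bias correction; the per-coordinate
    # scaling by sqrt(v_hat) is what ties it to the coordinate system.
    w = np.zeros(X.shape[1])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w, X, y)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([1.0, 0.3, 0.1])
y = X @ np.array([1.0, -2.0, 0.5])

# Random orthogonal matrix; each feature vector x is mapped to Q @ x.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X_rot = X @ Q.T

# Largest difference between predictions learned with and without the rotation.
gd_gap = np.max(np.abs(X @ run_gd(X, y) - X_rot @ run_gd(X_rot, y)))
adam_gap = np.max(np.abs(X @ run_adam(X, y) - X_rot @ run_adam(X_rot, y)))
print(f"GD prediction gap:   {gd_gap:.2e}")
print(f"Adam prediction gap: {adam_gap:.2e}")
```

For gradient descent the iterates on rotated data are exactly the rotated iterates, so its gap should sit at floating-point rounding level. Adam's coordinate-wise scaling does not commute with `Q`, so its gap is generally much larger.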
These seminars are made possible by the support of the CFM-ENS Chair « Modèles et Sciences des Données ».
The organizers: Giulio Biroli, Alex Cayco Gajic, Bruno Loureiro, Stéphane Mallat, Gabriel Peyré.