Generative AI is increasingly presented as a potential substitute for humans, including as human research subjects in various disciplines. Yet there is no scientific consensus on how closely these in-silico clones could represent their human counterparts. While some defend the use of these “synthetic users,” others point to the biases in the responses provided by LLMs. Through an experiment using survey questionnaires, we demonstrate that the latter critics are right to be wary of using generative AI to emulate respondents, but probably not for the right reason. Our results show i) that, to date, models cannot replace research subjects for opinion or attitudinal research; ii) that they display a strong bias on each question (reaching only a small region of social space); and iii) that this bias varies randomly from one question to the next (reaching a different region every time). Beyond the two existing competing theses (“representativity” and “social bias”), we propose a third one, which we call “machine bias”. We detail this term and explore its consequences, not only for LLM research but also for studies on social biases.
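For illustration only, the following minimal Python sketch shows one way a survey-emulation experiment of this kind could be set up: the same survey item is posed to a model many times and the distribution of answers is tallied for comparison with human responses. The `ask_model` stub, the example item, and the 5-point scale are assumptions for the sketch, not details taken from the abstract.

```python
# Minimal sketch of a "synthetic respondent" survey run (illustrative only).
from collections import Counter

LIKERT = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

def ask_model(prompt: str) -> str:
    # Placeholder for a real LLM call; it always answers "agree" here so that
    # the sketch stays self-contained and runnable.
    return "agree"

def run_item(question: str, n_respondents: int = 100) -> Counter:
    """Pose the same survey item repeatedly and tally the answer distribution,
    which could then be compared with the human distribution for that item."""
    prompt = (
        "You are answering an opinion survey.\n"
        f"Question: {question}\n"
        f"Answer with exactly one of: {', '.join(LIKERT)}."
    )
    answers = Counter()
    for _ in range(n_respondents):
        reply = ask_model(prompt).strip().lower()
        answers[reply if reply in LIKERT else "invalid"] += 1
    return answers

if __name__ == "__main__":
    print(run_item("Immigration is good for the economy."))
```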
We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time. Learning is based on demonstrations of some correct answer to each training question, as in Supervised Fine-Tuning (SFT). Current standard practice focuses on maximum likelihood (i.e., log-loss minimization) approaches, but we argue that likelihood-maximization methods can fail even in simple settings. Instead, we view the problem as apprenticeship learning (i.e., imitation learning) in contextual bandits, with offline demonstrations from some expert (optimal, or very good) policy, and suggest alternative simple approaches with strong guarantees.
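As a point of reference, the maximum-likelihood SFT objective the abstract alludes to is commonly written as (this notation is a standard formulation, not necessarily the speaker's):

\[ \hat{\pi}_{\mathrm{MLE}} \in \arg\max_{\pi \in \Pi} \sum_{i=1}^{n} \log \pi(y_i \mid x_i), \]

where the \((x_i, y_i)\) are prompt-demonstration pairs and \(\Pi\) is the policy class; minimizing the log loss \(-\tfrac{1}{n}\sum_{i=1}^{n} \log \pi(y_i \mid x_i)\) is equivalent.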
Joint work with Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Kasiviswanathan, and Cong Ma
These seminars are being made possible through the support of the CFM-ENS Chair « Modèles et Sciences des Données ».
The organizers: Giulio Biroli, Alex Cayco Gajic, Bruno Loureiro, Stéphane Mallat, Gabriel Peyré.