Minisymposium Presentation
Benchmarking Economic Reasoning in Artificial Intelligence Models
Douglas Araujo joined the BIS in September 2018, where he has been an economist since May 2022. Previously, Douglas was in the Secretariat of the Basel Committee on Banking Supervision overseeing a range of policy and supervisory topics. Before that, he worked at the Central Bank of Brazil on macroprudential supervision (2011-15) and led the efforts to develop and implement the Brazilian proportionality framework for prudential regulation (2015-18). Douglas worked on financial stability monitoring as a Fellow at the BIS's Financial Stability Institute in 2014. From 2015 to 2018, he also supported a number of countries in enhancing their macroprudential frameworks as a member of International Monetary Fund missions. Until 2011, Douglas worked in the private sector in Brazilian financial markets. His current research focuses on banking and the effects of digitisation on finance. Douglas also contributes open source software at the intersection of machine learning and economics.
A theory-informed test of reasoning in artificial intelligence (AI) combines three sequential steps so that correct answers can be attributed to a reasoning process rather than to the luck of probabilistic word matching. The first step is information filtering: an AI model that reasons must distinguish the relevant information in a prompt from trivia. In the second step, knowledge association, the AI combines implicit or explicit knowledge with the relevant prompt information. Finally, in the third step, logic attribution, a reasoning AI assigns the correct logic operations (deductive, inductive and other types of logic) to uncover the correct answer. In economic settings, the logic steps involve different levels of counterfactual considerations and policy-relevant thought experiments. This paper leverages insights from the large language model benchmarking literature and the social economics literature to inform the design of benchmarking tests that are challenging, robust, evolving over time and informative about any type of reasoning shortcomings. The benchmarking process can be adapted to other sciences. An accompanying training dataset is available to help AI developers improve reasoning in their models, and interested users can submit proposals for material to create questions.
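As a rough illustration of how the three sequential steps could be scored, the following minimal Python sketch grades each step separately and credits a later step only if the earlier ones pass. It is not the paper's implementation: the names `BenchmarkItem`, `ModelResponse` and `score_steps`, the set-based representation of facts and knowledge, and the example question are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkItem:
    """One hypothetical test question; every field is illustrative."""
    relevant_facts: frozenset      # prompt information needed for the answer
    trivia: frozenset              # distractors the model should ignore
    required_knowledge: frozenset  # outside knowledge to combine with the facts
    correct_answer: str


@dataclass(frozen=True)
class ModelResponse:
    """A hypothetical structured response elicited from the model."""
    facts_used: frozenset
    knowledge_cited: frozenset
    answer: str


def score_steps(item, response):
    """Score the three steps in sequence; later steps require earlier ones."""
    # Step 1, information filtering: all relevant facts used, no trivia used.
    filtering = (item.relevant_facts <= response.facts_used
                 and not (item.trivia & response.facts_used))
    # Step 2, knowledge association: the required knowledge was brought in.
    association = filtering and item.required_knowledge <= response.knowledge_cited
    # Step 3, logic attribution: the final answer matches (case-insensitive).
    logic = association and (
        response.answer.strip().lower() == item.correct_answer.strip().lower()
    )
    return {"filtering": filtering, "association": association, "logic": logic}


# Illustrative item: the facts, trivia and answer are made up for this sketch.
item = BenchmarkItem(
    relevant_facts=frozenset({"policy rate raised"}),
    trivia=frozenset({"governor wore a blue tie"}),
    required_knowledge=frozenset({"higher rates tend to reduce borrowing"}),
    correct_answer="decrease",
)
lucky = ModelResponse(frozenset(), frozenset(), "decrease")
print(score_steps(item, lucky))
```

Under this design choice, a response that states the right answer without using the relevant facts earns no credit for the logic step, which mirrors the idea of separating reasoning from a lucky match.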