The transcript features a discussion between Francois Chollet, an AI researcher at Google and creator of Keras, Mike Knoop, co-founder of Zapier, and an interviewer, centered on the ARC (Abstraction and Reasoning Corpus) benchmark and the million-dollar prize offered to solve it.
**What is ARC?**
Chollet describes ARC as an IQ test for machines, designed to be resistant to memorization, a common issue with large language models (LLMs). Unlike traditional benchmarks that rely on knowledge recall, ARC requires reasoning and adaptation to novel puzzles, using basic core knowledge like physics, objectness, and counting. The puzzles present input-output pairs demonstrating a pattern, and the task is to apply that pattern to a new input to generate the correct output. Chollet clarifies that the core knowledge required for ARC is that of a five-year-old, underscoring that it is not about accumulated knowledge, but the ability to adapt to novelty.
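The task format can be made concrete with a toy example. The grids and the "mirror" rule below are invented for illustration, though real ARC tasks are distributed in a similar JSON-like structure of small integer grids with a few train pairs and a held-out test input:

```python
# A hypothetical ARC-style task: a few "train" input-output pairs that
# demonstrate a pattern, plus a "test" input to solve. The pattern here
# (mirror each row) is an invented illustration, not an actual ARC task.
task = {
    "train": [
        {"input": [[1, 2, 3]], "output": [[3, 2, 1]]},
        {"input": [[4, 0], [5, 6]], "output": [[0, 4], [6, 5]]},
    ],
    "test": [{"input": [[7, 8, 9]]}],
}

def mirror(grid):
    """Candidate solution program: reverse every row of the grid."""
    return [row[::-1] for row in grid]

# A candidate program is accepted only if it reproduces every train pair.
assert all(mirror(p["input"]) == p["output"] for p in task["train"])

# Apply the verified program to the test input.
prediction = mirror(task["test"][0]["input"])
print(prediction)  # [[9, 8, 7]]
```

The point of the format is that the rule must be induced from the demonstrations alone; knowing more facts about the world does not help.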
**LLMs vs. ARC**
The main point of contention is why current LLMs struggle with ARC despite their immense scale and strong performance on other complex tasks. Chollet argues that LLMs primarily excel at memorization and pattern fetching, rather than on-the-fly program synthesis. While LLMs can encode solution programs for seen tasks through fine-tuning, they struggle when presented with genuinely novel problems that require synthesizing a new solution program. The interviewer probes whether LLMs could bridge this gap through scale and in-context learning, particularly with the advent of multimodal models that can better handle spatial reasoning. However, Chollet remains skeptical, stating that the core challenge is not parsing the input data but the unfamiliar nature of each new task.
**The Skill vs. Intelligence Debate**
Chollet and the interviewer debate whether the increasing capabilities of LLMs are indicative of genuine intelligence or simply a scaled-up version of skill. The scaling-laws argument holds that increasing compute and training data improves benchmark performance. However, Chollet claims that this only enhances skill (proficiency in existing patterns), not intelligence (the ability to adapt to the unknown). He draws an analogy to creatures that lack intelligence, such as insects, which rely on hard-coded behavioral programs encoded in their genes.
**The Arc Prize**
The million-dollar ARC Prize, funded by Mike Knoop, aims to encourage the development of AI systems capable of genuine reasoning and adaptation. The contest will run annually for several years. To win a given year's grand prize, contestants must submit an open-source solution that solves at least 85% of the test tasks. The open-source requirement is deliberate: the prize is explicitly designed to prevent the winning approach from remaining a closed system.
**Progress So Far**
Knoop highlights that ARC has remained largely unsolved, in contrast to other benchmarks that LLMs quickly saturated. The key difference, as Chollet reiterates, is ARC's resistance to memorization. Jack Cole's work pre-training LLMs on millions of generated ARC tasks is highlighted, but Chollet points out the significant role of test-time fine-tuning, a form of active inference that lets the LLM adapt to each new problem. The panel considers this an extremely promising approach.
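Structurally, test-time fine-tuning amounts to: take the trained model, run a few gradient steps on the new task's demonstration pairs, then predict. The sketch below substitutes a toy linear model and a made-up task for an actual LLM; it is an analogy for the mechanism, not Jack Cole's pipeline:

```python
def predict(params, x):
    """Toy 'model': a line y = a*x + b, standing in for an LLM."""
    a, b = params
    return a * x + b

def fine_tune(params, pairs, lr=0.05, steps=500):
    """Run a few epochs of gradient descent (squared-error loss) on the
    task's demonstration pairs and return adapted parameters. The base
    params tuple is left untouched, analogous to adapting a fresh copy
    of the model for each task."""
    a, b = params
    for _ in range(steps):
        for x, y in pairs:
            err = (a * x + b) - y  # prediction error on this demo pair
            a -= lr * err * x      # gradient step on a
            b -= lr * err          # gradient step on b
    return (a, b)

base = (1.0, 0.0)         # "pre-trained" parameters: y = x
demos = [(1, 5), (2, 8)]  # this new task's hidden rule is y = 3x + 2

adapted = fine_tune(base, demos)
print(round(predict(adapted, 3)))  # 11: the adapted model fits the new rule
print(round(predict(base, 3)))     # 3: the frozen base model does not
```

The frozen model answers every task with its prior; the per-task adaptation step is what lets it track a rule it never saw in training.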
**Architectural Improvements Are Necessary**
Chollet believes that LLMs are excellent at pattern recognition, intuition, and memorization, but that they should not be used alone for general artificial intelligence. His proposal is to combine these strengths with discrete program search, using each component where it is strongest, to build a general form of reasoning and problem-solving. Ultimately, his key point is that AI still lacks a system for type-2 work: deliberate reasoning and planning.
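One way to make the proposal concrete is a minimal program-synthesis loop: enumerate compositions of primitives from a small DSL and keep the first program consistent with the demonstrations. The DSL below is invented for illustration; in Chollet's framing, a learned model would supply the type-1 "intuition" that proposes or ranks candidates, replacing the blind enumeration used here:

```python
from itertools import product

# A tiny, invented DSL of grid transformations (binary 0/1 cells assumed
# for "invert"). None of these names come from an actual ARC solver.
PRIMITIVES = {
    "identity": lambda g: g,
    "mirror":   lambda g: [row[::-1] for row in g],  # reverse each row
    "flip":     lambda g: g[::-1],                   # reverse row order
    "invert":   lambda g: [[1 - c for c in row] for row in g],
}

def synthesize(pairs, max_depth=2):
    """Return (names, program) for the first composition of primitives
    consistent with every demonstration pair, trying shallow programs
    first. This is pure type-2 search; a learned model would prioritize
    candidates instead of enumerating them blindly."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def program(g, names=names):
                for n in names:
                    g = PRIMITIVES[n](g)
                return g
            if all(program(x) == y for x, y in pairs):
                return names, program
    return None

# Demonstrations of an unseen rule: mirror each row, then invert the cells.
demos = [
    ([[0, 1], [1, 1]], [[0, 1], [0, 0]]),
    ([[1, 0, 0]],      [[1, 1, 0]]),
]

names, program = synthesize(demos)
print(names)                 # ('mirror', 'invert')
print(program([[1, 1, 0]]))  # [[1, 0, 0]]: the found program generalizes
```

The exhaustive loop explodes combinatorially as the DSL and program depth grow, which is exactly where a learned prior over likely programs would need to take over.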
**Next Steps**
Both Chollet and Knoop are committed to maintaining ARC as a valuable benchmark, despite the potential for it to be hacked or gamed. This will require carefully monitoring solutions and evolving the dataset so it remains challenging. They see the ARC Prize as a way to accelerate progress toward genuine AI, even if it means uncovering flaws in the benchmark itself. The goal is to inspire researchers to develop AI systems that can synthesize problem-solving programs from just a few examples, a new paradigm for software development.