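Before concluding that "method X doesn't work," be careful: make sure the dataset used for testing actually exercises that method.
OpenAI researcher Jason Wei has just published a post on an underrated but occasionally make-or-break skill in AI research: finding a dataset that actually exercises the new method you are working on. The skill did not really exist ten years ago, yet today it can decide whether a piece of research succeeds or fails.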
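A common example is the question "On what datasets does chain of thought (CoT) improve performance?" A recent paper even argued that CoT mainly helps on math and logic tasks. Wei sees this view as both a failure of imagination and a lack of diverse evals. If you naively test a CoT model on 100 random user chat prompts, you may not see much difference, but only because those prompts were already solvable without CoT. In fact, there is a small and very important slice of data where CoT makes a big difference: math and coding are the obvious examples, along with almost any task with asymmetry of verification.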
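In other words, before asserting that "method X doesn't work," make sure the dataset used for testing can actually show what the method is worth.
Wei's post stresses that as models grow more capable, the choice of dataset in AI research has become subtler and more critical.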
Full text
Jason Wei, AI researcher @OpenAI
An underrated but occasionally make-or-break skill in AI research (that didn’t really exist ten years ago) is the ability to find a dataset that actually exercises a new method you are working on. Back in the day when the bottleneck in AI was learning, many methods were dataset-agnostic; for example, a better optimizer would be expected to improve on both ImageNet and CIFAR-10. Nowadays language models are so multi-task that the answer to whether something works is almost always “it depends on the dataset”.
A common example of this is the question, “on what datasets does chain of thought improve performance?” A recent paper even argued (will link below) that CoT mainly helps on math/logic, and I think that is both a failure of imagination and a lack of diverse evals. Naively you might try CoT models on 100 random user chat prompts and not see much difference, but this is because the prompts were already solvable without CoT. In fact there is a small and very important slice of data where CoT makes a big difference—the obvious examples are math and coding, but include almost any task with asymmetry of verification. For example, generating a poem that fits a list of constraints is hard on the first try but much easier if you can draft and revise using CoT.
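The point about 100 random chat prompts can be made concrete with a per-slice comparison. The following is a minimal sketch, not anything from Wei's post: the slice names, the record format, and all of the numbers are made up for illustration. It shows how a method can look nearly useless in aggregate while making a large difference on one small slice.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Compare baseline vs. method accuracy within each data slice.

    records: iterable of (slice_name, baseline_correct, method_correct).
    This record format is a hypothetical convention for this sketch.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [count, baseline hits, method hits]
    for slice_name, base_ok, method_ok in records:
        t = totals[slice_name]
        t[0] += 1
        t[1] += int(base_ok)
        t[2] += int(method_ok)
    return {
        name: {"n": n, "baseline": b / n, "method": m / n, "delta": (m - b) / n}
        for name, (n, b, m) in totals.items()
    }


def overall_delta(records):
    """Accuracy gain of the method over the baseline on the pooled data."""
    records = list(records)
    base_hits = sum(int(b) for _, b, _ in records)
    method_hits = sum(int(m) for _, _, m in records)
    return (method_hits - base_hits) / len(records)


# Made-up results: 95 ordinary chat prompts the baseline mostly solves already,
# plus 5 constrained-writing prompts where drafting and revising helps.
records = (
    [("chat", True, True)] * 90
    + [("chat", False, False)] * 5
    + [("constrained_poem", False, True)] * 4
    + [("constrained_poem", False, False)] * 1
)

if __name__ == "__main__":
    print("overall delta:", overall_delta(records))
    for name, stats in accuracy_by_slice(records).items():
        print(name, stats)
```

On this made-up data the pooled gain is only 4 points, while the gain on the small slice is 80 points, which is the pattern the paragraph describes.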
As another made-up example, let’s say you want to know if browsing improves performance on geology exams. Maybe using browsing on some random geology dataset didn’t improve performance. The important thing to do here would be to see if the without-browsing model was actually suffering due to lack of world knowledge—if it wasn’t, then this was the wrong dataset to try browsing on.
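The check described in the browsing example can be sketched as a headroom calculation on the baseline's results. This is an illustrative sketch only: it assumes each baseline failure has already been labeled with a cause (by hand or otherwise), and the field names, cause labels, and counts are all hypothetical.

```python
def headroom(baseline_results, target_cause):
    """Estimate how much a method could help on this dataset at most.

    baseline_results: list of dicts with keys
        'correct' (bool) and 'failure_cause' (str, or None when correct).
    target_cause: the failure cause the new method is meant to address,
        e.g. missing world knowledge for a browsing tool.
    Both the schema and the labels are assumptions made for this sketch.
    """
    n = len(baseline_results)
    if n == 0:
        raise ValueError("no baseline results to analyze")
    failures = [r for r in baseline_results if not r["correct"]]
    addressable = [r for r in failures if r["failure_cause"] == target_cause]
    return {
        "baseline_accuracy": (n - len(failures)) / n,
        # Share of all examples the method could plausibly fix.
        "addressable_fraction": len(addressable) / n,
        # Ceiling if the method fixed every addressable failure.
        "max_accuracy_with_method": (n - len(failures) + len(addressable)) / n,
    }


# Made-up baseline run on a geology exam set: most errors are reasoning
# mistakes, and only one comes from missing world knowledge.
baseline_results = (
    [{"correct": True, "failure_cause": None}] * 14
    + [{"correct": False, "failure_cause": "reasoning"}] * 5
    + [{"correct": False, "failure_cause": "missing_knowledge"}] * 1
)

if __name__ == "__main__":
    print(headroom(baseline_results, "missing_knowledge"))
```

If the addressable fraction is close to zero, as in this made-up run, a flat result says little about browsing itself and mostly indicates that this was the wrong dataset to try it on.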
In other words you should hesitate to draw a conclusion like “X method doesn’t work” without ensuring that the dataset used for testing actually exercises that method. The inertia from five years ago is to take existing benchmarks and try to solve them, but nowadays there is a lot more flexibility and sometimes it even makes sense to create a custom dataset to showcase the initial usefulness of an idea. Obviously the danger with doing this is that a contrived dataset may not represent a substantial portion of user queries. But if the method is in principle general I think this is a good way to start and something people should do more often.
Source: AI寒武纪