Mars Benchmark

Abstract

Large Language Models (LLMs) trained on massive corpora have shown remarkable success in knowledge-intensive tasks. Yet most of them rely on pre-stored knowledge. Inducing new general knowledge from a specific environment and reasoning with the acquired knowledge, i.e., situated inductive reasoning, is crucial and challenging for machine intelligence. In this paper, we design Mars, an interactive environment devised for situated inductive reasoning. It introduces counter-commonsense game mechanisms by modifying terrain, survival settings, and task dependencies while adhering to certain principles. In Mars, agents need to actively interact with their surroundings, derive useful rules, and perform decision-making tasks in specific contexts. We conduct experiments on various RL-based and LLM-based methods, finding that they all struggle on this challenging situated inductive reasoning benchmark. Furthermore, we explore Induction from Reflection, where we instruct agents to perform inductive reasoning over their historical trajectories. Its superior performance underscores the importance of inductive reasoning in Mars. Through Mars, we aim to galvanize advancements in situated inductive reasoning and set the stage for developing the next generation of AI systems that can reason in an adaptive and context-sensitive way.

Leaderboard on Mars

Reward across the 8 world settings of Mars. Results for LLM-based methods are aggregated over 9 independent trials, and results for RL-based methods over 20 independent trials.

#	Method	Model	Source	ALL (Excl. Default)	Default	Terrain	Survival	Task. Dep	Terr. Surv.	Terr. Task.	Surv. Task.	All Three.
1	IfR 🥇	GPT-4-0125-preview	Link	5.5	9.0\(^{\pm 2.3}\)	8.0\(^{\pm 3.7}\)	7.7\(^{\pm 3.7}\)	5.6\(^{\pm 2.9}\)	6.8\(^{\pm 1.9}\)	6.9\(^{\pm 1.8}\)	3.3\(^{\pm 1.4}\)	0.1\(^{\pm 0.5}\)
2	ReAct 🥈	GPT-4-0125-preview	Link	4.6	7.7\(^{\pm 1.6}\)	7.4\(^{\pm 2.7}\)	6.4\(^{\pm 3.7}\)	5.0\(^{\pm 2.1}\)	6.7\(^{\pm 2.5}\)	4.8\(^{\pm 2.0}\)	1.5\(^{\pm 1.3}\)	0.7\(^{\pm 1.6}\)
3	Skill Library 🥉	GPT-4-0125-preview	Link	4.2	8.0\(^{\pm 2.1}\)	9.5\(^{\pm 2.9}\)	7.9\(^{\pm 2.9}\)	1.5\(^{\pm 1.9}\)	3.0\(^{\pm 2.5}\)	5.5\(^{\pm 1.5}\)	2.3\(^{\pm 1.5}\)	-0.5\(^{\pm 0.5}\)
4	Reflexion	GPT-4-0125-preview	Link	3.6	6.0\(^{\pm 1.7}\)	6.4\(^{\pm 3.0}\)	4.6\(^{\pm 3.9}\)	3.2\(^{\pm 1.6}\)	4.9\(^{\pm 2.5}\)	5.3\(^{\pm 2.5}\)	1.0\(^{\pm 1.6}\)	-0.4\(^{\pm 0.7}\)
5	IfR	LLaMA-3.1-8B-instruct	Link	2.9	3.8\(^{\pm 2.4}\)	3.8\(^{\pm 2.1}\)	3.7\(^{\pm 2.8}\)	2.9\(^{\pm 1.0}\)	3.8\(^{\pm 2.0}\)	3.3\(^{\pm 1.2}\)	1.1\(^{\pm 1.3}\)	0.8\(^{\pm 1.4}\)
6	ReAct	LLaMA-3.1-8B-instruct	Link	1.7	3.6\(^{\pm 2.1}\)	2.1\(^{\pm 2.2}\)	2.3\(^{\pm 2.5}\)	2.3\(^{\pm 1.0}\)	1.1\(^{\pm 1.4}\)	3.0\(^{\pm 1.6}\)	0.7\(^{\pm 2.0}\)	0.2\(^{\pm 1.2}\)
*	PPO	RL-based*	Link	0.0	1.9\(^{\pm 1.4}\)	-0.1\(^{\pm 0.6}\)	-0.6\(^{\pm 0.5}\)	2.1\(^{\pm 1.2}\)	0.0\(^{\pm 0.7}\)	-0.7\(^{\pm 0.3}\)	-0.6\(^{\pm 0.4}\)	0.1\(^{\pm 0.8}\)
*	DreamerV3	RL-based*	Link	7.9	11.5\(^{\pm 1.6}\)	9.3\(^{\pm 2.2}\)	8.6\(^{\pm 4.1}\)	8.8\(^{\pm 2.8}\)	7.1\(^{\pm 2.1}\)	6.6\(^{\pm 0.7}\)	9.6\(^{\pm 3.4}\)	5.1\(^{\pm 1.8}\)

RL-based*: For reference only. These methods train a separate model for each world, so their scores do not truly evaluate situated inductive reasoning.
ALL (Excl. Default): Average reward of all worlds except Default, i.e., overall performance of counter-commonsense worlds.
World types: Default: original Crafter setting, i.e., no modifications. Terrain, Survival and Task Dependency: three types of modifications. Terr. Surv., Terr. Task., Surv. Task., and All Three.: modifications combining two or all three types.
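As a quick check of how the ALL (Excl. Default) column is derived, the following sketch averages the seven non-Default world rewards for two rows of the table. The dictionary layout and the function name are illustrative assumptions and are not taken from the Mars codebase. Because the per-world values shown in the table are already rounded, recomputing the average for other rows may differ from the published column in the last digit.

```python
# Per-world mean rewards copied from the leaderboard (Default excluded).
# The dictionary layout and helper name are illustrative, not from the Mars codebase.
REWARDS = {
    "IfR (GPT-4-0125-preview)": [8.0, 7.7, 5.6, 6.8, 6.9, 3.3, 0.1],
    "DreamerV3": [9.3, 8.6, 8.8, 7.1, 6.6, 9.6, 5.1],
}


def all_excl_default(world_rewards):
    """Average reward over the seven modified (non-Default) worlds."""
    return sum(world_rewards) / len(world_rewards)


for method, rewards in REWARDS.items():
    print(f"{method}: {all_excl_default(rewards):.1f}")
```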

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

Mars: Situated Inductive Reasoning Benchmark

Imagine a scenario: in the United States, you drive on the right side of the road. When you travel to the UK, you might initially find it strange how people drive. However, you soon realize that driving on the left is the norm there and adapt to the new rule. This is inductive reasoning: the capacity to identify underlying rules, mechanisms, or general claims from past observations and experience.

Mars, an open-world environment for situated inductive reasoning, involves inductive reasoning through active interaction and applying newly acquired rules to make context-sensitive decisions.

Mars is built on Crafter, into which we introduce counter-commonsense elements. Agents interact with the environment and accumulate historical trajectories. For example, an agent might observe that, regardless of time or location, mining stone always yields diamonds and placing a table always consumes 2 diamonds. Consequently, the agent can induce the rules "Mining stone yields diamond" and "Placing table consumes 2 diamonds". When tasked with making a wooden pickaxe, the agent can apply these rules to plan and execute specific actions in different contexts.
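The induction step in this example can be illustrated with a small deterministic sketch. It is only a stand-in: in Mars the rules are induced by the agent itself (for LLM agents, through language), whereas this sketch simply keeps every (action, target) pair whose inventory effect was identical in all observations. The `Transition` record, the `min_support` threshold, and the toy trajectory are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple

Delta = Tuple[Tuple[str, int], ...]  # inventory change, e.g. (("diamond", 1),)


class Transition(NamedTuple):
    """Hypothetical record of one interaction step."""
    action: str   # e.g. "mine" or "place"
    target: str   # e.g. "stone" or "table"
    delta: Delta  # observed inventory change


def induce_rules(trajectory: List[Transition],
                 min_support: int = 2) -> Dict[Tuple[str, str], Delta]:
    """Keep (action, target) pairs whose inventory effect was identical every time."""
    effects = defaultdict(list)
    for step in trajectory:
        effects[(step.action, step.target)].append(step.delta)
    return {
        key: deltas[0]
        for key, deltas in effects.items()
        if len(deltas) >= min_support and len(set(deltas)) == 1
    }


trajectory = [
    Transition("mine", "stone", (("diamond", 1),)),
    Transition("place", "table", (("diamond", -2),)),
    Transition("mine", "stone", (("diamond", 1),)),
    Transition("place", "table", (("diamond", -2),)),
    Transition("mine", "tree", (("wood", 1),)),  # seen once: not enough support
]

rules = induce_rules(trajectory)
for (action, target), delta in rules.items():
    print(action, target, "->", dict(delta))
```

Requiring a minimum number of consistent observations is one simple way to avoid generalizing from a single event; the threshold of 2 here is arbitrary.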

Modification: From Crafter to Mars

To challenge the agent with an environment that deviates from prior (parametric) knowledge and necessitates situated inductive reasoning, we introduce targeted modifications to typical commonsense elements of Crafter, classified into three categories: (1) Terrain: altering the predictable terrain distributions; (2) Survival: modifying the behavior of non-player characters, which affects the agent's status levels (e.g., health); (3) Task Dependency: changing the dependencies between tasks.

While we can sample numerous new worlds following the above procedure, we enforce several strict principles so that the worlds are not completely fantastical and are always playable. We guarantee that each collectible item can be obtained in at least one way and that each tool has a practical use, motivating the agent to engage in crafting. For every resource that can be increased by some event, there must be a corresponding event that can decrease it, maintaining a balance. We also develop an automated program to ensure that each achievement is achievable. Finally, the quantity of items required for task achievements must not exceed what the world provides.
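One way such an achievability check could work is a fixed-point reachability pass over item dependencies. The sketch below is an assumption about the general idea, not the authors' program: the `World` mapping, the function names, and the two toy worlds are hypothetical, and quantities are ignored for brevity.

```python
from typing import Dict, List, Set

# Hypothetical world description: each item maps to the items required to obtain it.
# An empty requirement list means the item can be collected directly.
World = Dict[str, List[str]]


def obtainable_items(world: World) -> Set[str]:
    """Fixed-point iteration: an item is obtainable once all its requirements are."""
    obtained: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for item, requirements in world.items():
            if item not in obtained and all(r in obtained for r in requirements):
                obtained.add(item)
                changed = True
    return obtained


def is_playable(world: World) -> bool:
    """A world passes if every item it defines can eventually be obtained."""
    return obtainable_items(world) == set(world)


playable = {
    "stone": [],
    "diamond": ["stone"],
    "table": ["diamond"],
    "wood_pickaxe": ["table"],
}
broken = {"stone": ["diamond"], "diamond": ["stone"]}  # circular: nothing is obtainable

print(is_playable(playable), is_playable(broken))
```

A sampled world that fails such a check would be rejected and resampled; a fuller version would also track item quantities to cover the supply principle.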

Method: Induction from Reflection

Building on the JARVIS-1 framework, we introduce the Induction from Reflection (IfR) module in the controller. Given the selected task and the agent's observation, the planner decomposes the task into a sequence of subgoals. The controller then outputs specific actions to accomplish these subgoals. Successful plans are stored in the skill library, while failed plans prompt the agent to perform self-explanation and replan. Whenever the controller finishes a subgoal (with status "succeed", "failed", or "timeout"), we require the LLM to engage in reflective thinking and induce possible game mechanisms from the agent's historical trajectory. The derived rules are then stored in a rule library, which the task proposer, planner, and controller can use.
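The reflection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `RuleLibrary` class, the prompt wording, and the `fake_llm` stand-in (used so the example runs offline) are all hypothetical, and a real system would call an actual LLM and likely filter or revise the returned rules.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RuleLibrary:
    """Stores induced rules; meant to be shared by proposer, planner, and controller."""
    rules: List[str] = field(default_factory=list)

    def add(self, new_rules: List[str]) -> None:
        for rule in new_rules:
            if rule not in self.rules:  # avoid storing duplicates
                self.rules.append(rule)


def reflect_and_induce(llm: Callable[[str], str], subgoal: str, status: str,
                       trajectory: List[str], library: RuleLibrary) -> None:
    """After a subgoal ends (succeed, failed, or timeout), ask the LLM for rules."""
    prompt = (
        f"Subgoal: {subgoal}\nOutcome: {status}\n"
        "Known rules:\n" + "\n".join(library.rules) + "\n"
        "Trajectory:\n" + "\n".join(trajectory) + "\n"
        "List any game mechanisms you can induce, one per line."
    )
    response = llm(prompt)
    library.add([line.strip() for line in response.splitlines() if line.strip()])


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return "Mining stone yields diamond\nPlacing table consumes 2 diamonds"


library = RuleLibrary()
steps = ["mine stone -> +1 diamond", "mine stone -> +1 diamond", "place table -> -2 diamond"]
reflect_and_induce(fake_llm, "place table", "succeed", steps, library)
reflect_and_induce(fake_llm, "place table", "succeed", steps, library)  # duplicates ignored
print(library.rules)
```

Passing the known rules back into the prompt lets later reflections build on earlier ones instead of rediscovering them.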

Experimental Results

We systematically evaluate the performance of various RL-based and LLM-based methods on Mars. For LLMs that cannot accept image inputs, we provide a wrapper that gives text descriptions of the gameplay screen. Quantitative results on Mars are reported in the leaderboard above.

We find that all baseline models exhibit a decline in average reward when moving from the Default world to the modified Mars worlds, with the extent of the decline depending on the type of modification (terrain, survival, or task dependency) and on the number of modifications. This underscores that Mars presents significant challenges for current methods. We also evaluate the Induction from Reflection module, which outperforms the LLM-based baselines, demonstrating the importance of inductive reasoning in a counter-commonsense environment.

BibTeX

@inproceedings{tang2024mars,
  title={Mars: Situated Inductive Reasoning in an Open-World Environment},
  author={Tang, Xiaojuan and Li, Jiaqi and Liang, Yitao and Zhu, Song-chun and Zhang, Muhan and Zheng, Zilong},
  booktitle={38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks},
  year={2024}
}