List of papers
This list of 61 papers includes algorithms leveraging the cognitive functions of language. We use 5 different tags:
- Disembodied refers to disembodied agents usually based on supervised learning algorithms.
- Embodied refers to embodied agents usually trained with reinforcement learning algorithms.
- Autotelic refers to embodied agents able to represent, generate, pursue and master their own goals (see a review here).
- Vygotskian refers to disembodied or embodied agents that internalize social linguistic production (see details here).
- VAAI refers to embodied, Vygotskian and autotelic agents.
Click on the paper’s title to display the list of authors, the abstract, a link to the article and the bibtex.
2022: 9 Papers
-
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances Embodied
Authors:
Ahn M, Brohan A, Brown N, Chebotar Y, Cortes O, David B, Finn C, Gopalakrishnan K, Hausman K, Herzog A, Ho D, Hsu J, Ibarz J, Ichter B, Irpan A, Jang E, Ruano RJ, Jeffrey K, Jesmonth S, Joshi NJ, Julian R, Kalashnikov D, Kuang Y, Lee KH, Levine S, Lu Y, Luu L, Parada C, Pastor P, Quiambao J, Rao K, Rettinghouse J, Reyes D, Sermanet P, Sievers N, Tan C, Toshev A, Vanhoucke V, Xia F, Xiao T, Xu P, Xu S, Yan M
Abstract:
Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model's "hands and eyes," while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project's website and the video can be found at this https URLLinks:
Bibtex:
@article{ahn2022doasican, author = {Ahn, Michael and Brohan, Anthony and Brown, Noah and Chebotar, Yevgen and Cortes, Omar and David, Byron and Finn, Chelsea and Gopalakrishnan, Keerthana and Hausman, Karol and Herzog, Alex and Ho, Daniel and Hsu, Jasmine and Ibarz, Julian and Ichter, Brian and Irpan, Alex and Jang, Eric and Ruano, Rosario Jauregui and Jeffrey, Kyle and Jesmonth, Sally and Joshi, Nikhil J and Julian, Ryan and Kalashnikov, Dmitry and Kuang, Yuheng and Lee, Kuang-Huei and Levine, Sergey and Lu, Yao and Luu, Linda and Parada, Carolina and Pastor, Peter and Quiambao, Jornell and Rao, Kanishka and Rettinghouse, Jarek and Reyes, Diego and Sermanet, Pierre and Sievers, Nicolas and Tan, Clayton and Toshev, Alexander and Vanhoucke, Vincent and Xia, Fei and Xiao, Ted and Xu, Peng and Xu, Sichun and Yan, Mengyuan}, title = {{Do As I Can, Not As I Say: Grounding Language in Robotic Affordances}}, journal = {ArXiv - abs/2204.01691}, year = {2022}, }
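The abstract above describes a simple decision rule: among the robot's pretrained skills, pick the one that maximizes the language model's estimate of the skill's usefulness for the instruction times a value function's estimate of its feasibility in the current state. Below is a minimal, hypothetical sketch of that scoring step; `llm_score` and `affordance_value` are stand-ins for the paper's language model and learned value functions, not its actual interfaces.

```python
# Hypothetical sketch of a SayCan-style skill selection step: the chosen skill
# maximizes (LLM usefulness score) x (value-function feasibility score).

def select_skill(instruction, state, skills, llm_score, affordance_value):
    """llm_score(instruction, skill): probability the LLM assigns to `skill`
    as the next step for `instruction` (stand-in for the paper's LLM scoring).
    affordance_value(state, skill): learned estimate that `skill` can succeed
    from `state` (stand-in for the paper's value functions)."""
    scores = {s: llm_score(instruction, s) * affordance_value(state, s) for s in skills}
    return max(scores, key=scores.get)

# Toy usage with hard-coded numbers in place of the real models.
skills = ["find a sponge", "go to the table", "pick up the apple"]
llm = lambda instr, s: {"find a sponge": 0.7, "go to the table": 0.2, "pick up the apple": 0.1}[s]
value = lambda state, s: 0.9 if s == "find a sponge" else 0.5
print(select_skill("clean up the spilled drink", None, skills, llm, value))
```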
-
Help Me Explore: Minimal Social Interventions for Graph-Based Autotelic Agents Embodied Autotelic
Authors:
Akakzia A, Serris O, Sigaud O, Colas C
Abstract:
In the quest for autonomous agents learning open-ended repertoires of skills, most works take a Piagetian perspective: learning trajectories are the results of interactions between developmental agents and their physical environment. The Vygotskian perspective, on the other hand, emphasizes the centrality of the socio-cultural environment: higher cognitive functions emerge from transmissions of socio-cultural processes internalized by the agent. This paper argues that both perspectives could be coupled within the learning of autotelic agents to foster their skill acquisition. To this end, we make two contributions: 1) a novel social interaction protocol called Help Me Explore (HME), where autotelic agents can benefit from both individual and socially guided exploration. In social episodes, a social partner suggests goals at the frontier of the learning agent knowledge. In autotelic episodes, agents can either learn to master their own discovered goals or autonomously rehearse failed social goals; 2) GANGSTR, a graph-based autotelic agent for manipulation domains capable of decomposing goals into sequences of intermediate sub-goals. We show that when learning within HME, GANGSTR overcomes its individual learning limits by mastering the most complex configurations (e.g. stacks of 5 blocks) with only few social interventions.Links:
Bibtex:
@article{akakzia2022help, title={{Help Me Explore: Minimal Social Interventions for Graph-Based Autotelic Agents}}, author={Akakzia, Ahmed and Serris, Olivier and Sigaud, Olivier and Colas, C{\'e}dric}, journal={ArXiv - abs/2202.05129}, year={2022} }
-
Improving Intrinsic Exploration with Language Abstractions Embodied Vygotskian
Authors:
Mu J, Zhong V, Raileanu R, Jiang M, Goodman ND, Rocktaschel T, Grefenstette E
Abstract:
Reinforcement learning (RL) agents are particularly hard to train when rewards are sparse. One common solution is to use intrinsic rewards to encourage agents to explore their environment. However, recent intrinsic exploration methods often use state-based novelty measures which reward low-level exploration and may not scale to domains requiring more abstract skills. Instead, we explore natural language as a general medium for highlighting relevant abstractions in an environment. Unlike previous work, we evaluate whether language can improve over existing exploration methods by directly extending (and comparing to) competitive intrinsic exploration baselines: AMIGo (Campero et al., 2021) and NovelD (Zhang et al., 2021). These language-based variants outperform their non-linguistic forms by 45-85% across 13 challenging tasks from the MiniGrid and MiniHack environment suites.Links:
Bibtex:
@article{Mu2022ImprovingIE, title={{Improving Intrinsic Exploration with Language Abstractions}}, author={Jesse Mu and Victor Zhong and Roberta Raileanu and Minqi Jiang and Noah D. Goodman and Tim Rocktaschel and Edward Grefenstette}, journal={ArXiv - abs/2202.08938}, year={2022}, }
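As a rough illustration of the idea shared by these language-based variants, the sketch below pays an intrinsic bonus when the language description of a state is novel rather than the raw state itself; the `describe` captioner and the 1/sqrt(count) schedule are illustrative assumptions, not the paper's exact formulations of AMIGo or NovelD.

```python
# Illustrative only: intrinsic reward from the novelty of language descriptions
# of states (not the exact language-AMIGo / language-NovelD variants above).
import math
from collections import Counter

class LanguageNoveltyBonus:
    def __init__(self, describe):
        self.describe = describe      # state -> natural-language description (stand-in)
        self.counts = Counter()

    def __call__(self, state):
        caption = self.describe(state)
        self.counts[caption] += 1
        return 1.0 / math.sqrt(self.counts[caption])   # bonus decays with repetition

bonus = LanguageNoveltyBonus(describe=lambda s: s["message"])
print(bonus({"message": "you see a locked door"}))  # 1.0 on first encounter
print(bonus({"message": "you see a locked door"}))  # ~0.71 once repeated
```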
-
Intra-agent speech permits zero-shot task acquisition Embodied Vygotskian
Authors:
Chen Yan, Federico Carnevale, Petko Georgiev, Adam Santoro, Aurelia Guy, Alistair Muldal, Chia-Chun Hung, Josh Abramson, Timothy Lillicrap, Gregory Wayne
Abstract:
Human language learners are exposed to a trickle of informative, context-sensitive language, but a flood of raw sensory data. Through both social language use and internal processes of rehearsal and practice, language learners are able to build high-level, semantic representations that explain their perceptions. Here, we take inspiration from such processes of "inner speech" in humans (Vygotsky, 1934) to better understand the role of intra-agent speech in embodied behavior. First, we formally pose intra-agent speech as a semi-supervised problem and develop two algorithms that enable visually grounded captioning with little labeled language data. We then experimentally compute scaling curves over different amounts of labeled data and compare the data efficiency against a supervised learning baseline. Finally, we incorporate intra-agent speech into an embodied, mobile manipulator agent operating in a 3D virtual world, and show that with as few as 150 additional image captions, intra-agent speech endows the agent with the ability to manipulate and answer questions about a new object without any related task-directed experience (zero-shot). Taken together, our experiments suggest that modelling intra-agent speech is effective in enabling embodied agents to learn new tasks efficiently and without direct interaction experience.Links:
Bibtex:
@article{yan2022intra, author = {Yan, Chen and Carnevale, Federico and Georgiev, Petko and Santoro, Adam and Guy, Aurelia and Muldal, Alistair and Hung, Chia-Chun and Abramson, Josh and Lillicrap, Timothy and Wayne, Gregory}, title = {{Intra-agent speech permits zero-shot task acquisition}}, journal = {ArXiv - abs/2206.03139}, year = {2022} }
-
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents Embodied
Authors:
Huang W, Abbeel P, Pathak D, Mordatch I
Abstract:
Can world knowledge learned by large language models (LLMs) be used to act in interactive environments? In this paper, we investigate the possibility of grounding high-level tasks, expressed in natural language (e.g. "make breakfast"), to a chosen set of actionable steps (e.g. "open fridge"). While prior work focused on learning from explicit step-by-step examples of how to act, we surprisingly find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into mid-level plans without any further training. However, the plans produced naively by LLMs often cannot map precisely to admissible actions. We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions. Our evaluation in the recent VirtualHome environment shows that the resulting method substantially improves executability over the LLM baseline. The conducted human evaluation reveals a trade-off between executability and correctness but shows a promising sign towards extracting actionable knowledge from language models. Website at this https URLLinks:
Bibtex:
@article{huang2022language, author = {Huang, Wenlong and Abbeel, Pieter and Pathak, Deepak and Mordatch, Igor}, journal = {ArXiv - abs/2201.07207}, title = {Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents}, year = {2022} }
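The translation step described above can be pictured as a nearest-neighbour lookup in an embedding space: each free-form plan step generated by the LLM is mapped to the most similar admissible action. The sketch below is a toy version of that idea; the bag-of-letters `embed` function is only a placeholder for a real sentence encoder.

```python
# Toy sketch of mapping a free-form LLM plan step to the closest admissible
# action by embedding similarity (the embedding used here is a placeholder).
import numpy as np

def embed(text):
    v = np.zeros(26)                       # bag-of-letters, just to be runnable
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

def translate_to_admissible(generated_step, admissible_actions):
    step_vec = embed(generated_step)
    sims = [float(step_vec @ embed(a)) for a in admissible_actions]
    return admissible_actions[int(np.argmax(sims))]

print(translate_to_admissible("open the refrigerator door",
                              ["walk to fridge", "open fridge", "grab milk"]))
```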
-
Large Pre-Trained Language Models Contain Human-Like Biases of What Is Right and Wrong to Do Disembodied
Authors:
Schramowski P, Turan C, Andersen N, Rothkopf CA, Kersting K
Abstract:
Artificial writing is permeating our lives due to recent advances in large-scale, transformer-based language models (LMs) such as BERT, its variants, GPT-2/3, and others. Using them as pre-trained models and fine-tuning them for specific tasks, researchers have extended state of the art for many NLP tasks and shown that they capture not only linguistic knowledge but also retain general knowledge implicitly present in the data. Unfortunately, LMs trained on unfiltered text corpora suffer from degenerated and biased behaviour. While this is well established, we show that recent LMs also contain human-like biases of what is right and wrong to do, some form of ethical and moral norms of the society -- they bring a "moral direction" to surface. That is, we show that these norms can be captured geometrically by a direction, which can be computed, e.g., by a PCA, in the embedding space, reflecting well the agreement of phrases to social norms implicitly expressed in the training texts and providing a path for attenuating or even preventing toxic degeneration in LMs. Being able to rate the (non-)normativity of arbitrary phrases without explicitly training the LM for this task, we demonstrate the capabilities of the "moral direction" for guiding (even other) LMs towards producing normative text and showcase it on RealToxicityPrompts testbed, preventing the neural toxic degeneration in GPT-2.Links:
Bibtex:
@article{schramowski2022large, author = {Schramowski, Patrick and Turan, Cigdem and Andersen, Nico and Rothkopf, Constantin A and Kersting, Kristian}, journal = {Nature Machine Intelligence}, number = {3}, publisher = {Nature Publishing Group}, title = {{Large Pre-Trained Language Models Contain Human-Like Biases of What Is Right and Wrong to Do}}, year = {2022} }
-
Probing Pre-Trained Language Models for Cross-Cultural Differences in Values Disembodied
Authors:
Arora A, Kaffee LA, Augenstein I
Abstract:
Language embeds information about social, cultural, and political values people hold. Prior work has explored social and potentially harmful biases encoded in Pre-Trained Language models (PTLMs). However, there has been no systematic study investigating how values embedded in these models vary across cultures. In this paper, we introduce probes to study which values across cultures are embedded in these models, and whether they align with existing theories and cross-cultural value surveys. We find that PTLMs capture differences in values across cultures, but those only weakly align with established value surveys. We discuss implications of using mis-aligned models in cross-cultural settings, as well as ways of aligning PTLMs with value surveys.Links:
Bibtex:
@article{arora2022probing, title={{Probing Pre-Trained Language Models for Cross-Cultural Differences in Values}}, author={Arora, Arnav and Kaffee, Lucie-Aim{\'e}e and Augenstein, Isabelle}, journal={ArXiv - abs/2203.13722}, year={2022} }
-
Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning Disembodied
Authors:
Creswell A, Shanahan M, Higgins I
Abstract:
Large language models (LLMs) have been shown to be capable of impressive few-shot generalisation to new tasks. However, they still tend to perform poorly on multi-step logical reasoning problems. Here we carry out a comprehensive evaluation of LLMs on 50 tasks that probe different aspects of logical reasoning. We show that language models tend to perform fairly well at single step inference or entailment tasks, but struggle to chain together multiple reasoning steps to solve more complex problems. In light of this, we propose a Selection-Inference (SI) framework that exploits pre-trained LLMs as general processing modules, and alternates between selection and inference to generate a series of interpretable, causal reasoning steps leading to the final answer. We show that a 7B parameter LLM used within the SI framework in a 5-shot generalisation setting, with no fine-tuning, yields a performance improvement of over 100% compared to an equivalent vanilla baseline on a suite of 10 logical reasoning tasks. The same model in the same setting even outperforms a significantly larger 280B parameter baseline on the same suite of tasks. Moreover, answers produced by the SI framework are accompanied by a causal natural-language-based reasoning trace, which has important implications for the safety and trustworthiness of the system.
Links:
Bibtex:
@article{creswell2022selection, title={{Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning}}, author={Creswell, Antonia and Shanahan, Murray and Higgins, Irina}, journal={ArXiv - abs/2205.09712}, year={2022} }
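The alternation described in the abstract can be written as a short loop: a selection module picks the facts relevant to the question, an inference module derives one new fact from them, and the loop repeats until nothing new is produced. In the hypothetical sketch below, both modules are plain callables standing in for the paper's few-shot prompted LLM modules.

```python
# Schematic Selection-Inference loop (select and infer are stand-ins for
# prompted LLM modules, not the paper's actual implementation).

def selection_inference(question, facts, select, infer, max_steps=5):
    trace = []
    for _ in range(max_steps):
        selected = select(question, facts)   # selection: pick relevant facts
        new_fact = infer(selected)           # inference: derive one new fact
        trace.append((selected, new_fact))
        if new_fact in facts:                # nothing new derived: stop
            break
        facts = facts + [new_fact]
    return trace, facts[-1]

facts = ["Socrates is a man", "All men are mortal"]
select = lambda q, fs: [f for f in fs if "Socrates" in f or "men" in f]
infer = lambda sel: "Socrates is mortal" if len(sel) >= 2 else sel[0]
trace, answer = selection_inference("Is Socrates mortal?", facts, select, infer)
print(answer)   # "Socrates is mortal", with `trace` as the reasoning steps
```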
-
Semantic Exploration from Language Abstractions and Pretrained Representations Embodied Vygotskian
Authors:
Tam AC, Rabinowitz NC, Lampinen AK, Roy NA, Chan SC, Strouse DJ, Wang JX, Banino A, Hill F
Abstract:
Effective exploration is a challenge in reinforcement learning (RL). Novelty-based exploration methods can suffer in high-dimensional state spaces, such as continuous partially-observable 3D environments. We address this challenge by defining novelty using semantically meaningful state abstractions, which can be found in learned representations shaped by natural language. In particular, we evaluate vision-language representations, pretrained on natural image captioning datasets. We show that these pretrained representations drive meaningful, task-relevant exploration and improve performance on 3D simulated environments. We also characterize why and how language provides useful abstractions for exploration by considering the impacts of using representations from a pretrained model, a language oracle, and several ablations. We demonstrate the benefits of our approach in two very different task domains -- one that stresses the identification and manipulation of everyday objects, and one that requires navigational exploration in an expansive world -- as well as two popular deep RL algorithms: Impala and R2D2. Our results suggest that using language-shaped representations could improve exploration for various algorithms and agents in challenging environments.Links:
Bibtex:
@article{tam2022semantic, author = {Tam, Allison C. and Rabinowitz, Neil C. and Lampinen, Andrew K. and Roy, Nicholas A. and Chan, Stephanie C. Y. and Strouse, DJ and Wang, Jane X. and Banino, Andrea and Hill, Felix}, title = {{Semantic Exploration from Language Abstractions and Pretrained Representations}}, journal = {ArXiv - abs/2204.05080}, year = {2022}, copyright = {arXiv.org perpetual, non-exclusive license} }
2021: 16 Papers
-
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning Embodied
Authors:
Shridhar M, Yuan X, Cote MA, Bisk Y, Trischler A, Hausknecht M
Abstract:
Given a simple request like Put a washed apple in the kitchen fridge, humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does not yet provide the infrastructure necessary for both reasoning abstractly and executing concretely. We address this limitation by introducing ALFWorld, a simulator that enables agents to learn abstract, text based policies in TextWorld (Côté et al., 2018) and then execute goals from the ALFRED benchmark (Shridhar et al., 2020) in a rich visual environment. ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions. In turn, as we demonstrate empirically, this fosters better agent generalization than training only in the visually grounded environment. BUTLER's simple, modular design factors the problem to allow researchers to focus on models for improving every piece of the pipeline (language understanding, planning, navigation, and visual scene understanding).Links:
Bibtex:
@article{shridhar_alfworld_2021, author = {Shridhar, Mohit and Yuan, Xingdi and Cote, Marc-Alexandre and Bisk, Yonatan and Trischler, Adam and Hausknecht, Matthew}, journal = {Proc. of ICLR}, title = {{ALFWorld}: {Aligning} {Text} and {Embodied} {Environments} for {Interactive} {Learning}}, year = {2021} }
-
Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning Embodied
Authors:
Chen V, Gupta A, Marino K
Abstract:
Complex, multi-task problems have proven to be difficult to solve efficiently in a sparse-reward reinforcement learning setting. In order to be sample efficient, multi-task learning requires reuse and sharing of low-level policies. To facilitate the automatic decomposition of hierarchical tasks, we propose the use of step-by-step human demonstrations in the form of natural language instructions and action trajectories. We introduce a dataset of such demonstrations in a crafting-based grid world. Our model consists of a high-level language generator and low-level policy, conditioned on language. We find that human demonstrations help solve the most complex tasks. We also find that incorporating natural language allows the model to generalize to unseen tasks in a zero-shot setting and to learn quickly from a few demonstrations. Generalization is not only reflected in the actions of the agent, but also in the generated natural language instructions in unseen tasks. Our approach also gives our trained agent interpretable behaviors because it is able to generate a sequence of high-level descriptions of its actions.
Links:
Bibtex:
@article{chen_ask_2021, author = {Chen, Valerie and Gupta, Abhinav and Marino, Kenneth}, journal = {Proc. of ICLR}, title = {Ask {Your} {Humans}: {Using} {Human} {Instructions} to {Improve} {Generalization} in {Reinforcement} {Learning}}, year = {2021} }
-
ELLA: Exploration through Learned Language Abstraction Embodied Vygotskian
Authors:
Mirchandani S, Karamcheti S, Sadigh D
Abstract:
Building agents capable of understanding language instructions is critical to effective and robust human-AI collaboration. Recent work focuses on training these agents via reinforcement learning in environments with synthetic language; however, instructions often define long-horizon, sparse-reward tasks, and learning policies requires many episodes of experience. We introduce ELLA: Exploration through Learned Language Abstraction, a reward shaping approach geared towards boosting sample efficiency in sparse reward environments by correlating high-level instructions with simpler low-level constituents. ELLA has two key elements: 1) A termination classifier that identifies when agents complete low-level instructions, and 2) A relevance classifier that correlates low-level instructions with success on high-level tasks. We learn the termination classifier offline from pairs of instructions and terminal states. Notably, in departure from prior work in language and abstraction, we learn the relevance classifier online, without relying on an explicit decomposition of high-level instructions to low-level instructions. On a suite of complex BabyAI environments with varying instruction complexities and reward sparsity, ELLA shows gains in sample efficiency relative to language-based shaping and traditional RL methods.Links:
Bibtex:
@inproceedings{mirchandani2021ella, author = {Mirchandani, Suvir and Karamcheti, Siddharth and Sadigh, Dorsa}, booktitle = {Advances in Neural Information Processing Systems}, editor = {M. Ranzato and A. Beygelzimer and Y. Dauphin and P.S. Liang and J. Wortman Vaughan}, pages = {29529--29540}, publisher = {Curran Associates, Inc.}, title = {{ELLA: Exploration through Learned Language Abstraction}}, volume = {34}, year = {2021} }
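The two classifiers described above combine into a simple shaping rule: add a small bonus whenever a low-level instruction that is relevant to the current high-level task has just been completed. The sketch below shows that rule with hypothetical stand-ins for ELLA's learned termination and relevance classifiers.

```python
# Schematic ELLA-style reward shaping (terminates and is_relevant stand in for
# the learned termination and relevance classifiers described in the abstract).

def shaped_reward(env_reward, state, task, low_level_instructions,
                  terminates, is_relevant, bonus=0.1):
    shaping = sum(
        bonus
        for low in low_level_instructions
        if terminates(state, low) and is_relevant(task, low)
    )
    return env_reward + shaping

r = shaped_reward(
    env_reward=0.0,
    state={"just_picked": "red key"},
    task="open the locked door",
    low_level_instructions=["pick up the red key", "go to the blue ball"],
    terminates=lambda s, low: "red key" in low and s.get("just_picked") == "red key",
    is_relevant=lambda task, low: "key" in low and "door" in task,
)
print(r)   # 0.1: a relevant low-level instruction was just completed
```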
-
Episodic Transformer for Vision-and-Language Navigation Embodied
Authors:
Pashevich A, Schmid C, Sun C
Abstract:
Interaction and navigation defined by natural language instructions in dynamic environments pose significant challenges for neural agents. This paper focuses on addressing two challenges: handling long sequence of subtasks, and understanding complex human instructions. We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. To improve training, we leverage synthetic instructions as an intermediate representation that decouples understanding the visual appearance of an environment from the variations of natural language instructions. We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance. Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.Links:
Bibtex:
@article{pashevich2021episodic, journal = {ArXiv - abs/2105.06453}, author = {Alexander Pashevich and Cordelia Schmid and Chen Sun}, title = {{Episodic Transformer for Vision-and-Language Navigation}}, year = {2021} }
-
Grounding Language to Autonomously-Acquired Skills Via Goal Generation VAAI
Authors:
Akakzia A, Colas C, Oudeyer PY, Chetouani M, Sigaud O
Abstract:
We are interested in the autonomous acquisition of repertoires of skills. Language-conditioned reinforcement learning (LC-RL) approaches are great tools in this quest, as they allow to express abstract goals as sets of constraints on the states. However, most LC-RL agents are not autonomous and cannot learn without external instructions and feedback. Besides, their direct language condition cannot account for the goal-directed behavior of pre-verbal infants and strongly limits the expression of behavioral diversity for a given language input. To resolve these issues, we propose a new conceptual approach to language-conditioned RL: the Language-Goal-Behavior architecture (LGB). LGB decouples skill learning and language grounding via an intermediate semantic representation of the world. To showcase the properties of LGB, we present a specific implementation called DECSTR. DECSTR is an intrinsically motivated learning agent endowed with an innate semantic representation describing spatial relations between physical objects. In a first stage (G -> B), it freely explores its environment and targets self-generated semantic configurations. In a second stage (L -> G), it trains a language-conditioned goal generator to generate semantic goals that match the constraints expressed in language-based inputs. We showcase the additional properties of LGB w.r.t. both an end-to-end LC-RL approach and a similar approach leveraging non-semantic, continuous intermediate representations. Intermediate semantic representations help satisfy language commands in a diversity of ways, enable strategy switching after a failure and facilitate language grounding.Links:
Bibtex:
@article{akakzia_grounding_2021, author = {Akakzia, Ahmed and Colas, Cédric and Oudeyer, Pierre-Yves and Chetouani, Mohamed and Sigaud, Olivier}, journal = {Proc. of ICLR}, title = {Grounding {Language} to {Autonomously}-{Acquired} {Skills} {Via} {Goal} {Generation}}, year = {2021} }
-
Grounding Spatio-Temporal Language with Transformers Embodied
Authors:
Karch T, Teodorescu L, Hofmann K, Moulin-Frier C, Oudeyer PY
Abstract:
Language is an interface to the outside world. In order for embodied agents to use it, language must be grounded in other, sensorimotor modalities. While there is an extended literature studying how machines can learn grounded language, the topic of how to learn spatio-temporal linguistic concepts is still largely uncharted. To make progress in this direction, we here introduce a novel spatio-temporal language grounding task where the goal is to learn the meaning of spatio-temporal descriptions of behavioral traces of an embodied agent. This is achieved by training a truth function that predicts if a description matches a given history of observations. The descriptions involve time-extended predicates in past and present tense as well as spatio-temporal references to objects in the scene. To study the role of architectural biases in this task, we train several models including multimodal Transformer architectures; the latter implement different attention computations between words and objects across space and time. We test models on two classes of generalization: 1) generalization to new sentences, 2) generalization to grammar primitives. We observe that maintaining object identity in the attention computation of our Transformers is instrumental to achieving good performance on generalization overall, and that summarizing object traces in a single token has little influence on performance. We then discuss how this opens new perspectives for language-guided autonomous embodied agents.Links:
Bibtex:
@article{karch2021grounding, author = {Tristan Karch and Laetitia Teodorescu and Katja Hofmann and Clément Moulin-Frier and Pierre-Yves Oudeyer}, journal = {Proc. of NeurIPS}, title = {{Grounding Spatio-Temporal Language with Transformers}}, year = {2021} }
-
Interactive Learning from Activity Description Embodied
Authors:
Nguyen K, Misra D, Schapire R, Dudík M, Shafto P
Abstract:
We present a novel interactive learning protocol that enables training request-fulfilling agents by verbally describing their activities. Unlike imitation learning (IL), our protocol allows the teaching agent to provide feedback in a language that is most appropriate for them. Compared with reward in reinforcement learning (RL), the description feedback is richer and allows for improved sample complexity. We develop a probabilistic framework and an algorithm that practically implements our protocol. Empirical results in two challenging request-fulfilling problems demonstrate the strengths of our approach: compared with RL baselines, it is more sample-efficient; compared with IL baselines, it achieves competitive success rates without requiring the teaching agent to be able to demonstrate the desired behavior using the learning agent's actions. Apart from empirical evaluation, we also provide theoretical guarantees for our algorithm under certain assumptions about the teacher and the environment.Links:
Bibtex:
@article{nguyen2021interactive, author = {Khanh Nguyen and Dipendra Misra and Robert Schapire and Miro Dudík and Patrick Shafto}, journal = {Proc. of ICML}, title = {Interactive Learning from Activity Description}, year = {2021} }
-
Learning Rewards from Linguistic Feedback Embodied
Authors:
Sumers TR, Ho MK, Hawkins RD, Narasimhan K, Griffiths TL
Abstract:
We explore unconstrained natural language feedback as a learning signal for artificial agents. Humans use rich and varied language to teach, yet most prior work on interactive learning from language assumes a particular form of input (e.g., commands). We propose a general framework which does not make this assumption, using aspect-based sentiment analysis to decompose feedback into sentiment about the features of a Markov decision process. We then perform an analogue of inverse reinforcement learning, regressing the sentiment on the features to infer the teacher's latent reward function. To evaluate our approach, we first collect a corpus of teaching behavior in a cooperative task where both teacher and learner are human. We implement three artificial learners: sentiment-based "literal" and "pragmatic" models, and an inference network trained end-to-end to predict latent rewards. We then repeat our initial experiment and pair them with human teachers. All three successfully learn from interactive human feedback. The sentiment models outperform the inference network, with the "pragmatic" model approaching human performance. Our work thus provides insight into the information structure of naturalistic linguistic feedback as well as methods to leverage it for reinforcement learning.Links:
Bibtex:
@article{sumers2021learning, journal = {ArXiv - abs/2009.14715}, author = {Theodore R. Sumers and Mark K. Ho and Robert D. Hawkins and Karthik Narasimhan and Thomas L. Griffiths}, title = {Learning Rewards from Linguistic Feedback}, year = {2021} }
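The core recipe in this abstract can be reduced to a regression: score the sentiment of each piece of feedback, then regress those scores on the features of the situations the teacher commented on to recover latent reward weights. Below is a tiny, hypothetical version with hard-coded sentiment scores in place of an aspect-based sentiment model.

```python
# Toy sketch: infer reward weights by regressing feedback sentiment on the
# features of the commented-on situations (sentiments are hard-coded stand-ins).
import numpy as np

features = np.array([
    [1.0, 0.0],   # episode where the agent picked up the gem
    [0.0, 1.0],   # episode where the agent stepped in lava
    [1.0, 1.0],   # episode where it did both
])
sentiment = np.array([1.0, -1.0, 0.1])   # +1 praise, -1 criticism

# Least-squares regression of sentiment on features -> inferred reward weights.
weights, *_ = np.linalg.lstsq(features, sentiment, rcond=None)
print(weights)   # roughly +1.03 for the gem feature, -0.97 for the lava feature
```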
-
Leveraging Language to Learn Program Abstractions and Search Heuristics Disembodied
Authors:
Wong C, Ellis K, Tenenbaum JB, Andreas J
Abstract:
Inductive program synthesis, or inferring programs from examples of desired behavior, offers a general paradigm for building interpretable, robust, and generalizable machine learning systems. Effective program synthesis depends on two key ingredients: a strong library of functions from which to build programs, and an efficient search strategy for finding programs that solve a given task. We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis. When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization on three domains -- string editing, image composition, and abstract reasoning about scenes -- even when no natural language hints are available at test time.Links:
Bibtex:
@article{wong_leveraging_2021, author = {Wong, Catherine and Ellis, Kevin and Tenenbaum, Joshua B. and Andreas, Jacob}, journal = {ArXiv - abs/2106.11053}, title = {Leveraging {Language} to {Learn} {Program} {Abstractions} and {Search} {Heuristics}}, year = {2021} }
-
PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World Embodied
Authors:
Rowan Zellers, Ari Holtzman, Matthew Peters, Roozbeh Mottaghi, Aniruddha Kembhavi, Ali Farhadi, Yejin Choi
Abstract:
We propose PIGLeT: a model that learns physical commonsense knowledge through interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a physical dynamics model, and a separate language model. Our dynamics model learns not just what objects are but also what they do: glass cups break when thrown, plastic ones don’t. We then use it as the interface to our language model, giving us a unified model of linguistic form and grounded meaning. PIGLeT can read a sentence, simulate neurally what might happen next, and then communicate that result through a literal symbolic representation, or natural language. Experimental results show that our model effectively learns world dynamics, along with how to communicate them. It is able to correctly forecast what happens next given an English sentence over 80% of the time, outperforming a 100x larger, text-to-text approach by over 10%. Likewise, its natural language summaries of physical interactions are also judged by humans as more accurate than LM alternatives. We present comprehensive analysis showing room for future work.Links:
Bibtex:
@inproceedings{zellers-etal-2021-piglet, title = "{PIGL}e{T}: Language Grounding Through Neuro-Symbolic Interaction in a 3{D} World", author = "Zellers, Rowan and Holtzman, Ari and Peters, Matthew and Mottaghi, Roozbeh and Kembhavi, Aniruddha and Farhadi, Ali and Choi, Yejin", booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)", month = aug, year = "2021", address = "Online", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2021.acl-long.159", doi = "10.18653/v1/2021.acl-long.159", pages = "2040--2050", }
-
SILG: The Multi-environment Symbolic Interactive Language Grounding Benchmark Embodied
Authors:
Zhong V, Hanjie AW, Wang SI, Narasimhan K, Zettlemoyer L
Abstract:
Existing work in language grounding typically study single environments. How do we build unified models that apply across multiple environments? We propose the multi-environment Symbolic Interactive Language Grounding benchmark (SILG), which unifies a collection of diverse grounded language learning environments under a common interface. SILG consists of grid-world environments that require generalization to new dynamics, entities, and partially observed worlds (RTFM, Messenger, NetHack), as well as symbolic counterparts of visual worlds that require interpreting rich natural language with respect to complex scenes (ALFWorld, Touchdown). Together, these environments provide diverse grounding challenges in richness of observation space, action space, language specification, and plan complexity. In addition, we propose the first shared model architecture for RL on these environments, and evaluate recent advances such as egocentric local convolution, recurrent state-tracking, entity-centric attention, and pretrained LM using SILG. Our shared architecture achieves comparable performance to environment-specific architectures. Moreover, we find that many recent modelling advances do not result in significant gains on environments other than the one they were designed for. This highlights the need for a multi-environment benchmark. Finally, the best models significantly underperform humans on SILG, which suggests ample room for future work. We hope SILG enables the community to quickly identify new methodologies for language grounding that generalize to a diverse set of environments and their associated challenges.Links:
Bibtex:
@article{zhong_silg_2021, author = {Zhong, Victor and Hanjie, Austin W. and Wang, Sida I. and Narasimhan, Karthik and Zettlemoyer, Luke}, journal = {ArXiv - abs/2110.10661}, title = {{SILG}: {The} {Multi}-environment {Symbolic} {Interactive} {Language} {Grounding} {Benchmark}}, year = {2021} }
-
Skill Induction and Planning with Latent Language Embodied Vygotskian
Authors:
Sharma P, Torralba A, Andreas J
Abstract:
We present a framework for learning hierarchical policies from demonstrations, using sparse natural language annotations to guide the discovery of reusable skills for autonomous decision-making. We formulate a generative model of action sequences in which goals generate sequences of high-level subtask descriptions, and these descriptions generate sequences of low-level actions. We describe how to train this model using primarily unannotated demonstrations by parsing demonstrations into sequences of named high-level subtasks, using only a small number of seed annotations to ground language in action. In trained models, natural language commands index a combinatorial library of skills; agents can use these skills to plan by generating high-level instruction sequences tailored to novel goals. We evaluate this approach in the ALFRED household simulation environment, providing natural language annotations for only 10% of demonstrations. It achieves task completion rates comparable to state-of-the-art models (outperforming several recent methods with access to ground-truth plans during training and evaluation) while providing structured and human-readable high-level plans.Links:
Bibtex:
@article{sharma2021skill, journal = {ArXiv - abs/2110.01517}, author = {Pratyusha Sharma and Antonio Torralba and Jacob Andreas}, title = {{Skill Induction and Planning with Latent Language}}, year = {2021} }
-
Symbolic Knowledge Distillation: from General Language Models to Commonsense Models Disembodied
Authors:
West P, Bhagavatula C, Hessel J, Hwang JD, Jiang L, Bras RL, Lu X, Welleck S, Choi Y
Abstract:
The common practice for training commonsense models has gone from-human-to-corpus-to-machine: humans author commonsense knowledge graphs in order to train commonsense models. In this work, we investigate an alternative, from-machine-to-corpus-to-machine: general language models author these commonsense knowledge graphs to train commonsense models. Our study leads to a new framework, Symbolic Knowledge Distillation. As with prior art in Knowledge Distillation (Hinton et al., 2015), our approach uses larger models to teach smaller models. A key difference is that we distill knowledge symbolically-as text-in addition to the neural model. We also distill only one aspect-the commonsense of a general language model teacher, allowing the student to be a different type, a commonsense model. Altogether, we show that careful prompt engineering and a separately trained critic model allow us to selectively distill high-quality causal commonsense from GPT-3, a general language model. Empirical results demonstrate that, for the first time, a human-authored commonsense knowledge graph is surpassed by our automatically distilled variant in all three criteria: quantity, quality, and diversity. In addition, it results in a neural commonsense model that surpasses the teacher model's commonsense capabilities despite its 100x smaller size. We apply this to the ATOMIC resource, and share our new symbolic knowledge graph and commonsense models.Links:
Bibtex:
@article{west_symbolic_2021, author = {West, Peter and Bhagavatula, Chandra and Hessel, Jack and Hwang, Jena D. and Jiang, Liwei and Bras, Ronan Le and Lu, Ximing and Welleck, Sean and Choi, Yejin}, journal = {ArXiv - abs/2110.07178}, title = {Symbolic {Knowledge} {Distillation}: from {General} {Language} {Models} to {Commonsense} {Models}}, year = {2021} }
-
Teachable Reinforcement Learning via Advice Distillation Embodied Vygotskian
Authors:
Watkins O, Gupta A, Darrell T, Abbeel P, Andreas J
Abstract:
Training automated agents to complete complex tasks in interactive environments is challenging: reinforcement learning requires careful hand-engineering of reward functions, imitation learning requires specialized infrastructure and access to a human expert, and learning from intermediate forms of supervision (like binary preferences) is time-consuming and extracts little information from each human intervention. Can we overcome these challenges by building agents that learn from rich, interactive feedback instead? We propose a new supervision paradigm for interactive learning based on "teachable" decision-making systems that learn from structured advice provided by an external teacher. We begin by formalizing a class of human-in-the-loop decision making problems in which multiple forms of teacher-provided advice are available to a learner. We then describe a simple learning algorithm for these problems that first learns to interpret advice, then learns from advice to complete tasks even in the absence of human supervision. In puzzle-solving, navigation, and locomotion domains, we show that agents that learn from advice can acquire new skills with significantly less human supervision than standard reinforcement learning algorithms and often less than imitation learning.Links:
Bibtex:
@inproceedings{watkins2021teachable, author = {Olivia Watkins and Abhishek Gupta and Trevor Darrell and Pieter Abbeel and Jacob Andreas}, booktitle = {Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021}, editor = {A. Beygelzimer and Y. Dauphin and P. Liang and J. Wortman Vaughan}, title = {Teachable Reinforcement Learning via Advice Distillation}, year = {2021} }
-
Tell Me Why! -- Explanations Support Learning of Relational and Causal Structure Embodied Vygotskian
Authors:
Lampinen AK, Roy NA, Dasgupta I, Chan SC, Tam AC, McClelland JL, Yan C, Santoro A, Rabinowitz NC, Wang JX, Hill F
Abstract:
Inferring the abstract relational and causal structure of the world is a major challenge for reinforcement-learning (RL) agents. For humans, language--particularly in the form of explanations--plays a considerable role in overcoming this challenge. Here, we show that language can play a similar role for deep RL agents in complex environments. While agents typically struggle to acquire relational and causal knowledge, augmenting their experience by training them to predict language descriptions and explanations can overcome these limitations. We show that language can help agents learn challenging relational tasks, and examine which aspects of language contribute to its benefits. We then show that explanations can help agents to infer not only relational but also causal structure. Language can shape the way that agents generalize out-of-distribution from ambiguous, causally-confounded training, and explanations even allow agents to learn to perform experimental interventions to identify causal relationships. Our results suggest that language description and explanation may be powerful tools for improving agent learning and generalization.
Links:
Bibtex:
@article{lampinen_tell_2021, author = {Lampinen, Andrew K. and Roy, Nicholas A. and Dasgupta, Ishita and Chan, Stephanie C. Y. and Tam, Allison C. and McClelland, James L. and Yan, Chen and Santoro, Adam and Rabinowitz, Neil C. and Wang, Jane X. and Hill, Felix}, journal = {ArXiv - abs/2112.03753}, title = {Tell {Me} {Why}! -- {Explanations} {Support} {Learning} of {Relational} and {Causal} {Structure}}, year = {2021} }
-
Towards Teachable Autonomous Agents
Authors:
Sigaud O, Colas C, Akakzia A, Chetouani M, Oudeyer PY
Abstract:
Autonomous discovery and direct instruction are two distinct sources of learning in children but education sciences demonstrate that mixed approaches such as assisted discovery or guided play result in improved skill acquisition. In the field of Artificial Intelligence, these extremes respectively map to autonomous agents learning from their own signals and interactive learning agents fully taught by their teachers. In between should stand teachable autotelic agents (TAA): agents that learn from both internal and teaching signals to benefit from the higher efficiency of assisted discovery. Designing such agents will enable real-world non-expert users to orient the learning trajectories of agents towards their expectations. More fundamentally, this may also be a key step to build agents with human-level intelligence. This paper presents a roadmap towards the design of teachable autonomous agents. Building on developmental psychology and education sciences, we start by identifying key features enabling assisted discovery processes in child-tutor interactions. This leads to the production of a checklist of features that future TAA will need to demonstrate. The checklist allows us to precisely pinpoint the various limitations of current reinforcement learning agents and to identify the promising first steps towards TAA. It also shows the way forward by highlighting key research directions towards the design of autonomous agents that can be taught by ordinary people via natural pedagogy.
Links:
Bibtex:
@article{sigaud_towards_2021, author = {Sigaud, Olivier and Colas, Cedric and Akakzia, Ahmed and Chetouani, Mohamed and Oudeyer, Pierre-Yves}, journal = {ArXiv - abs/2105.11977}, title = {Towards {Teachable} {Autonomous} {Agents}}, year = {2021} }
2020: 13 Papers
-
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks Embodied
Authors:
Shridhar M, Thomason J, Gordon D, Bisk Y, Han W, Mottaghi R, Zettlemoyer L, Fox D
Abstract:
We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like "Rinse off a mug and place it in the coffee maker." and low-level language instructions like "Walk to the coffee maker on the right." ALFRED tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets. We show that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.Links:
Bibtex:
@inproceedings{ALFRED20, author = {Mohit Shridhar and Jesse Thomason and Daniel Gordon and Yonatan Bisk and Winson Han and Roozbeh Mottaghi and Luke Zettlemoyer and Dieter Fox}, booktitle = {2020 {IEEE/CVF} Conference on Computer Vision and Pattern Recognition, {CVPR} 2020, Seattle, WA, USA, June 13-19, 2020}, publisher = {{IEEE}}, title = {{ALFRED:} {A} Benchmark for Interpreting Grounded Instructions for Everyday Tasks}, year = {2020} }
-
Experience Grounds Language Disembodied
Authors:
Bisk Y, Holtzman A, Thomason J, Andreas J, Bengio Y, Chai J, Lapata M, Lazaridou A, May J, Nisnevich A, Pinto N, Turian J
Abstract:
Language understanding research is held back by a failure to relate language to the physical world it describes and to the social interactions it facilitates. Despite the incredible effectiveness of language processing models to tackle tasks after being trained on text alone, successful linguistic communication relies on a shared experience of the world. It is this shared experience that makes utterances meaningful. Natural language processing is a diverse field, and progress throughout its development has come from new representational theories, modeling techniques, data collection paradigms, and tasks. We posit that the present success of representation learning approaches trained on large, text-only corpora requires the parallel tradition of research on the broader physical and social context of language to address the deeper questions of communication.Links:
Bibtex:
@inproceedings{bisk2020experience, author = {Bisk, Yonatan and Holtzman, Ari and Thomason, Jesse and Andreas, Jacob and Bengio, Yoshua and Chai, Joyce and Lapata, Mirella and Lazaridou, Angeliki and May, Jonathan and Nisnevich, Aleksandr and Pinto, Nicolas and Turian, Joseph}, booktitle = {Proc. of EMNLP}, publisher = {Association for Computational Linguistics}, title = {Experience Grounds Language}, year = {2020} }
-
Exploration Based Language Learning for Text-Based Games Embodied
Authors:
Madotto A, Namazifar M, Huizinga J, Molino P, Ecoffet A, Zheng H, Papangelis A, Yu D, Khatri C, Tür G
Abstract:
This work presents an exploration and imitation-learning-based agent capable of state-of-the-art performance in playing text-based computer games. Text-based computer games describe their world to the player through natural language and expect the player to interact with the game using text. These games are of interest as they can be seen as a testbed for language understanding, problem-solving, and language generation by artificial agents. Moreover, they provide a learning environment in which these skills can be acquired through interactions with an environment rather than using fixed corpora. One aspect that makes these games particularly challenging for learning agents is the combinatorially large action space. Existing methods for solving text-based games are limited to games that are either very simple or have an action space restricted to a predetermined set of admissible actions. In this work, we propose to use the exploration approach of Go-Explore for solving text-based games. More specifically, in an initial exploration phase, we first extract trajectories with high rewards, after which we train a policy to solve the game by imitating these trajectories. Our experiments show that this approach outperforms existing solutions in solving text-based games, and it is more sample efficient in terms of the number of interactions with the environment. Moreover, we show that the learned policy can generalize better than existing solutions to unseen games without using any restriction on the action space.Links:
Bibtex:
@article{madotto_exploration_2020, author = {Andrea Madotto and Mahdi Namazifar and Joost Huizinga and Piero Molino and Adrien Ecoffet and Huaixiu Zheng and Alexandros Papangelis and Dian Yu and Chandra Khatri and G{\"{o}}khan T{\"{u}}r}, journal = {Proc. of IJCAI}, title = {Exploration Based Language Learning for Text-Based Games}, year = {2020} }
-
Graph Constrained Reinforcement Learning for Natural Language Action Spaces Embodied
Authors:
Ammanabrolu P, Hausknecht MJ
Abstract:
Interactive Fiction games are text-based simulations in which an agent interacts with the world purely through natural language. They are ideal environments for studying how to extend reinforcement learning agents to meet the challenges of natural language understanding, partial observability, and action generation in combinatorially-large text-based action spaces. We present KG-A2C, an agent that builds a dynamic knowledge graph while exploring and generates actions using a template-based action space. We contend that the dual uses of the knowledge graph to reason about game state and to constrain natural language generation are the keys to scalable exploration of combinatorially large natural language actions. Results across a wide variety of IF games show that KG-A2C outperforms current IF agents despite the exponential increase in action space sizeLinks:
Bibtex:
@inproceedings{ammanabrolu_graph_2020, author = {Prithviraj Ammanabrolu and Matthew J. Hausknecht}, booktitle = {Proc. of ICLR}, title = {Graph Constrained Reinforcement Learning for Natural Language Action Spaces}, year = {2020} }
-
HIGhER: Improving Instruction Following with Hindsight Generation for Experience Replay Embodied Vygotskian
Authors:
Cideron G, Seurin M, Strub F, Pietquin O
Abstract:
Language creates a compact representation of the world and allows the description of unlimited situations and objectives through compositionality. While these characterizations may foster instructing, conditioning or structuring interactive agent behavior, it remains an open-problem to correctly relate language understanding and reinforcement learning in even simple instruction following scenarios. This joint learning problem is alleviated through expert demonstrations, auxiliary losses, or neural inductive biases. In this paper, we propose an orthogonal approach called Hindsight Generation for Experience Replay (HIGhER) that extends the Hindsight Experience Replay (HER) approach to the language-conditioned policy setting. Whenever the agent does not fulfill its instruction, HIGhER learns to output a new directive that matches the agent trajectory, and it relabels the episode with a positive reward. To do so, HIGhER learns to map a state into an instruction by using past successful trajectories, which removes the need to have external expert interventions to relabel episodes as in vanilla HER. We show the efficiency of our approach in the BabyAI environment, and demonstrate how it complements other instruction following methods.Links:
Bibtex:
@inproceedings{cideron_higher_2020, author = {Cideron, G. and Seurin, M. and Strub, F. and Pietquin, O.}, booktitle = {{IEEE} {Symposium} {Series} on {Computational} {Intelligence} ({SSCI})}, title = {{HIGhER}: {Improving} {Instruction} {Following} with {Hindsight} {Generation} for {Experience} {Replay}}, year = {2020} }
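The relabelling mechanism described above is easy to state in code: when an episode fails its instruction, a learned state-to-instruction mapping produces a new instruction that matches what the agent actually achieved, and the episode is stored again with that instruction and a positive reward. The sketch below is illustrative only; `instruction_generator` stands in for HIGhER's learned mapping from states to instructions.

```python
# Illustrative hindsight relabelling for language-conditioned RL (HIGhER-style).

def relabel_episode(episode, instruction_generator):
    """episode: dict with 'trajectory', 'instruction', 'success'.
    instruction_generator(final_state): instruction describing that state
    (learned from past successful trajectories in HIGhER; a stand-in here)."""
    if episode["success"]:
        return episode                              # successful episodes stay as-is
    achieved = episode["trajectory"][-1]            # state actually reached
    return {
        "trajectory": episode["trajectory"],
        "instruction": instruction_generator(achieved),   # hindsight instruction
        "success": True,                            # relabelled with positive reward
    }

failed = {"trajectory": ["start", "next to the red ball"],
          "instruction": "go to the green box", "success": False}
print(relabel_episode(failed, instruction_generator=lambda s: f"go {s}"))
# -> instruction becomes "go next to the red ball", success becomes True
```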
-
Human Instruction-Following with Deep Reinforcement Learning via Transfer-Learning from Text Embodied
Authors:
Hill F, Mokra S, Wong N, Harley T
Abstract:
Recent work has described neural-network-based agents that are trained with reinforcement learning (RL) to execute language-like commands in simulated worlds, as a step towards an intelligent agent or robot that can be instructed by human users. However, the optimisation of multi-goal motor policies via deep RL from scratch requires many episodes of experience. Consequently, instruction-following with deep RL typically involves language generated from templates (by an environment simulator), which does not reflect the varied or ambiguous expressions of real users. Here, we propose a conceptually simple method for training instruction-following agents with deep RL that are robust to natural human instructions. By applying our method with a state-of-the-art pre-trained text-based language model (BERT), on tasks requiring agents to identify and position everyday objects relative to other objects in a naturalistic 3D simulated room, we demonstrate substantially-above-chance zero-shot transfer from synthetic template commands to natural instructions given by humans. Our approach is a general recipe for training any deep RL-based system to interface with human users, and bridges the gap between two research directions of notable recent success: agent-centric motor behavior and text-based representation learning.Links:
Bibtex:
@article{hill_human_2020, author = {Hill, Felix and Mokra, Sona and Wong, Nathaniel and Harley, Tim}, journal = {ArXiv - abs/2005.09382}, title = {Human {Instruction}-{Following} with {Deep} {Reinforcement} {Learning} via {Transfer}-{Learning} from {Text}}, year = {2020} }
-
Imitating Interactive Intelligence Embodied
Authors:
Abramson J, Ahuja A, Brussee A, Carnevale F, Cassin M, Clark S, Dudzik A, Georgiev P, Guy A, Harley T, Hill F, Hung A, Kenton Z, Landon J, Lillicrap TP, Mathewson K, Muldal A, Santoro A, Savinov N, Varma V, Wayne G, Wong N, Yan C, Zhu RAbstract:
A common vision from science fiction is that robots will one day inhabit our physical spaces, sense the world as we do, assist our physical labours, and communicate with us through natural language. Here we study how to design artificial agents that can interact naturally with humans using the simplification of a virtual environment. This setting nevertheless integrates a number of the central challenges of artificial intelligence (AI) research: complex visual perception and goal-directed physical control, grounded language comprehension and production, and multi-agent social interaction. To build agents that can robustly interact with humans, we would ideally train them while they interact with humans. However, this is presently impractical. Therefore, we approximate the role of the human with another learned agent, and use ideas from inverse reinforcement learning to reduce the disparities between human-human and agent-agent interactive behaviour. Rigorously evaluating our agents poses a great challenge, so we develop a variety of behavioural tests, including evaluation by humans who watch videos of agents or interact directly with them. These evaluations convincingly demonstrate that interactive training and auxiliary losses improve agent behaviour beyond what is achieved by supervised learning of actions alone. Further, we demonstrate that agent capabilities generalise beyond literal experiences in the dataset. Finally, we train evaluation models whose ratings of agents agree well with human judgement, thus permitting the evaluation of new agent models without additional effort. Taken together, our results in this virtual environment provide evidence that large-scale human behavioural imitation is a promising tool to create intelligent, interactive agents, and the challenge of reliably evaluating such agents is possible to surmount.Links:
Bibtex:
@article{abramson_imitating_2020, author = {Abramson, Josh and Ahuja, Arun and Brussee, Arthur and Carnevale, Federico and Cassin, Mary and Clark, Stephen and Dudzik, Andrew and Georgiev, Petko and Guy, Aurelia and Harley, Tim and Hill, Felix and Hung, Alden and Kenton, Zachary and Landon, Jessica and Lillicrap, Timothy P. and Mathewson, Kory and Muldal, Alistair and Santoro, Adam and Savinov, Nikolay and Varma, Vikrant and Wayne, Greg and Wong, Nathaniel and Yan, Chen and Zhu, Rui}, journal = {ArXiv - abs/2012.05672}, title = {Imitating {Interactive} {Intelligence}}, year = {2020} }
-
Inverse Reinforcement Learning with Natural Language Goals Embodied
Authors:
Zhou L, Small KAbstract:
Humans generally use natural language to communicate task requirements to each other. Ideally, natural language should also be usable for communicating goals to autonomous machines (e.g., robots) to minimize friction in task specification. However, understanding and mapping natural language goals to sequences of states and actions is challenging. Specifically, existing work along these lines has encountered difficulty in generalizing learned policies to new natural language goals and environments. In this paper, we propose a novel adversarial inverse reinforcement learning algorithm to learn a language-conditioned policy and reward function. To improve generalization of the learned policy and reward function, we use a variational goal generator to relabel trajectories and sample diverse goals during training. Our algorithm outperforms multiple baselines by a large margin on a vision-based natural language instruction following dataset (Room-2-Room), demonstrating a promising advance in enabling the use of natural language instructions in specifying agent goals.Links:
Bibtex:
@article{zhou_inverse_2020, author = {Zhou, Li and Small, Kevin}, journal = {ArXiv - abs/2008.06924}, title = {Inverse {Reinforcement} {Learning} with {Natural} {Language} {Goals}}, year = {2020} }
-
Language as a Cognitive Tool to Imagine Goals in Curiosity Driven Exploration VAAI
Authors:
Colas C, Karch T, Lair N, Dussoux JM, Moulin-Frier C, Dominey PF, Oudeyer PYAbstract:
Developmental machine learning studies how artificial agents can model the way children learn open-ended repertoires of skills. Such agents need to create and represent goals, select which ones to pursue and learn to achieve them. Recent approaches have considered goal spaces that were either fixed and hand-defined or learned using generative models of states. This limited agents to sample goals within the distribution of known effects. We argue that the ability to imagine out-of-distribution goals is key to enable creative discoveries and open-ended learning. Children do so by leveraging the compositionality of language as a tool to imagine descriptions of outcomes they never experienced before, targeting them as goals during play. We introduce IMAGINE, an intrinsically motivated deep reinforcement learning architecture that models this ability. Such imaginative agents, like children, benefit from the guidance of a social peer who provides language descriptions. To take advantage of goal imagination, agents must be able to leverage these descriptions to interpret their imagined out-of-distribution goals. This generalization is made possible by modularity: a decomposition between learned goal-achievement reward function and policy relying on deep sets, gated attention and object-centered representations. We introduce the Playground environment and study how this form of goal imagination improves generalization and exploration over agents lacking this capacity. In addition, we identify the properties of goal imagination that enable these results and study the impacts of modularity and social interactions.Links:
Bibtex:
@article{colas_language_2020, author = {C{\'{e}}dric Colas and Tristan Karch and Nicolas Lair and Jean{-}Michel Dussoux and Cl{\'{e}}ment Moulin{-}Frier and Peter F. Dominey and Pierre{-}Yves Oudeyer}, journal = {Proc. of NeurIPS}, title = {{Language as a Cognitive Tool to Imagine Goals in Curiosity Driven Exploration}}, year = {2020} }
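The abstract's central idea, imagining out-of-distribution goals by exploiting the compositionality of language, can be caricatured in a few lines: recombine the words of known goal descriptions across their slots and keep only the combinations never experienced. The three-slot grammar below is an assumption made for illustration, not the goal grammar used in the paper.

```python
from itertools import product

# Goal descriptions provided by a (simulated) social partner.
known_goals = {"grasp red ball", "grasp blue cat", "grow green plant"}

def imagine_goals(known_goals):
    """Recombine (verb, colour, object) slots across known goal descriptions
    and keep only combinations never heard before."""
    split = [g.split() for g in known_goals]
    verbs   = {words[0] for words in split}
    colours = {words[1] for words in split}
    objects = {words[2] for words in split}
    all_goals = {" ".join(g) for g in product(verbs, colours, objects)}
    return all_goals - known_goals          # only out-of-distribution goals

imagined = imagine_goals(known_goals)
print(f"{len(imagined)} imagined goals, e.g. {sorted(imagined)[:3]}")
```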
-
Language conditioned imitation learning over unstructured data Embodied
Authors:
Lynch C, Sermanet PAbstract:
Natural language is perhaps the most flexible and intuitive way for humans to communicate tasks to a robot. Prior work in imitation learning typically requires each task be specified with a task id or goal image -- something that is often impractical in open-world environments. On the other hand, previous approaches in instruction following allow agent behavior to be guided by language, but typically assume structure in the observations, actuators, or language that limit their applicability to complex settings like robotics. In this work, we present a method for incorporating free-form natural language conditioning into imitation learning. Our approach learns perception from pixels, natural language understanding, and multitask continuous control end-to-end as a single neural network. Unlike prior work in imitation learning, our method is able to incorporate unlabeled and unstructured demonstration data (i.e. no task or language labels). We show this dramatically improves language conditioned performance, while reducing the cost of language annotation to less than 1% of total data. At test time, a single language conditioned visuomotor policy trained with our method can perform a wide variety of robotic manipulation skills in a 3D environment, specified only with natural language descriptions of each task (e.g. "open the drawer...now pick up the block...now press the green button..."). To scale up the number of instructions an agent can follow, we propose combining text conditioned policies with large pretrained neural language models. We find this allows a policy to be robust to many out-of-distribution synonym instructions, without requiring new demonstrations. See videos of a human typing live text commands to our agent at this http URLLinks:
Bibtex:
@article{lynch_grounding_2020, title={Language conditioned imitation learning over unstructured data}, author={Lynch, Corey and Sermanet, Pierre}, journal={ArXiv - abs/2005.07648}, year={2020} }
-
PixL2R: Guiding Reinforcement Learning Using Natural Language by Mapping Pixels to Rewards Embodied
Authors:
Goyal P, Niekum S, Mooney RJAbstract:
Reinforcement learning (RL), particularly in sparse reward settings, often requires prohibitively large numbers of interactions with the environment, thereby limiting its applicability to complex problems. To address this, several prior approaches have used natural language to guide the agent's exploration. However, these approaches typically operate on structured representations of the environment, and/or assume some structure in the natural language commands. In this work, we propose a model that directly maps pixels to rewards, given a free-form natural language description of the task, which can then be used for policy learning. Our experiments on the Meta-World robot manipulation domain show that language-based rewards significantly improves the sample efficiency of policy learning, both in sparse and dense reward settings.Links:
Bibtex:
@article{DBLP:journals/corr/abs-2007-15543, author = {Prasoon Goyal and Scott Niekum and Raymond J. Mooney}, title = {PixL2R: Guiding Reinforcement Learning Using Natural Language by Mapping Pixels to Rewards}, journal = {ArXiv - abs/2007.15543}, year = {2020}, url = {https://arxiv.org/abs/2007.15543} }
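A schematic sketch of the pixels-plus-language reward idea described above, shared in spirit with LEARN further down this list: a learned relatedness model scores how well recent frames match the task description, and that score is added to the sparse environment reward. `relatedness` is a hand-coded stand-in for the trained network, and the shaping coefficient is an arbitrary illustrative choice.

```python
# `relatedness` is a hand-coded stand-in for the trained pixels-and-language model.
def relatedness(frames, description):
    """Score how well the latest frame matches the task description
    (here: keyword overlap with the frame's object list, purely illustrative)."""
    keywords = set(description.lower().split())
    visible = set(frames[-1])
    return len(keywords & visible) / max(len(keywords), 1)

def shaped_reward(env_reward, frames, description, coeff=0.1):
    """Dense auxiliary signal: sparse task reward plus scaled language score."""
    return env_reward + coeff * relatedness(frames, description)

frames = [["drawer", "block"], ["drawer", "block", "gripper"]]
print(shaped_reward(0.0, frames, "push the block into the drawer"))
```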
-
RTFM: Generalising to New Environment Dynamics via Reading Embodied
Authors:
Zhong V, Rocktäschel T, Grefenstette EAbstract:
Obtaining policies that can generalise to new environments in reinforcement learning is challenging. In this work, we demonstrate that language understanding via a reading policy learner is a promising vehicle for generalisation to new environments. We propose a grounded policy learning problem, Read to Fight Monsters (RTFM), in which the agent must jointly reason over a language goal, relevant dynamics described in a document, and environment observations. We procedurally generate environment dynamics and corresponding language descriptions of the dynamics, such that agents must read to understand new environment dynamics instead of memorising any particular information. In addition, we propose txt2π, a model that captures three-way interactions between the goal, document, and observations. On RTFM, txt2π generalises to new environments with dynamics not seen during training via reading. Furthermore, our model outperforms baselines such as FiLM and language-conditioned CNNs on RTFM. Through curriculum learning, txt2π produces policies that excel on complex RTFM tasks requiring several reasoning and coreference steps.Links:
Bibtex:
@inproceedings{zhong_rtfm_2020, author = {Victor Zhong and Tim Rockt{\"{a}}schel and Edward Grefenstette}, booktitle = {Proc. of ICLR}, publisher = {OpenReview.net}, title = {{RTFM:} Generalising to New Environment Dynamics via Reading}, year = {2020} }
-
The NetHack Learning Environment Embodied
Authors:
Küttler H, Nardelli N, Miller AH, Raileanu R, Selvatici M, Grefenstette E, Rocktäschel TAbstract:
Progress in Reinforcement Learning (RL) algorithms goes hand-in-hand with the development of challenging environments that test the limits of current methods. While existing RL environments are either sufficiently complex or based on fast simulation, they are rarely both. Here, we present the NetHack Learning Environment (NLE), a scalable, procedurally generated, stochastic, rich, and challenging environment for RL research based on the popular single-player terminal-based roguelike game, NetHack. We argue that NetHack is sufficiently complex to drive long-term research on problems such as exploration, planning, skill acquisition, and language-conditioned RL, while dramatically reducing the computational resources required to gather a large amount of experience. We compare NLE and its task suite to existing alternatives, and discuss why it is an ideal medium for testing the robustness and systematic generalization of RL agents. We demonstrate empirical success for early stages of the game using a distributed Deep RL baseline and Random Network Distillation exploration, alongside qualitative analysis of various agents trained in the environment. NLE is open source at this https URL.Links:
Bibtex:
@inproceedings{kuttler2020nethack, author = {Heinrich K{\"{u}}ttler and Nantas Nardelli and Alexander H. Miller and Roberta Raileanu and Marco Selvatici and Edward Grefenstette and Tim Rockt{\"{a}}schel}, booktitle = {Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual}, editor = {Hugo Larochelle and Marc'Aurelio Ranzato and Raia Hadsell and Maria{-}Florina Balcan and Hsuan{-}Tien Lin}, title = {The NetHack Learning Environment}, year = {2020} }
2019: 13 Papers
-
A Survey of Reinforcement Learning Informed by Natural Language
Authors:
Luketina J, Nardelli N, Farquhar G, Foerster JN, Andreas J, Grefenstette E, Whiteson S, Rocktäschel TAbstract:
To be successful in real-world tasks, Reinforcement Learning (RL) needs to exploit the compositional, relational, and hierarchical structure of the world, and learn to transfer it to the task at hand. Recent advances in representation learning for language make it possible to build models that acquire world knowledge from text corpora and integrate this knowledge into downstream decision making problems. We thus argue that the time is right to investigate a tight integration of natural language understanding into RL in particular. We survey the state of the field, including work on instruction following, text games, and learning from textual domain knowledge. Finally, we call for the development of new environments as well as further investigation into the potential uses of recent Natural Language Processing (NLP) techniques for such tasks.Links:
Bibtex:
@inproceedings{luketina_survey_2019, author = {Jelena Luketina and Nantas Nardelli and Gregory Farquhar and Jakob N. Foerster and Jacob Andreas and Edward Grefenstette and Shimon Whiteson and Tim Rockt{\"{a}}schel}, booktitle = {Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, {IJCAI} 2019, Macao, China, August 10-16, 2019}, editor = {Sarit Kraus}, publisher = {ijcai.org}, title = {A Survey of Reinforcement Learning Informed by Natural Language}, year = {2019} }
-
ACTRCE: Augmenting Experience via Teacher's Advice For Multi-Goal Reinforcement Learning Embodied
Authors:
Chan H, Wu Y, Kiros J, Fidler S, Ba JAbstract:
Sparse reward is one of the most challenging problems in reinforcement learning (RL). Hindsight Experience Replay (HER) attempts to address this issue by converting a failed experience to a successful one by relabeling the goals. Despite its effectiveness, HER has limited applicability because it lacks a compact and universal goal representation. We present Augmenting experienCe via TeacheR's adviCE (ACTRCE), an efficient reinforcement learning technique that extends the HER framework using natural language as the goal representation. We first analyze the differences among goal representation, and show that ACTRCE can efficiently solve difficult reinforcement learning problems in challenging 3D navigation tasks, whereas HER with non-language goal representation failed to learn. We also show that with language goal representations, the agent can generalize to unseen instructions, and even generalize to instructions with unseen lexicons. We further demonstrate it is crucial to use hindsight advice to solve challenging tasks, and even small amount of advice is sufficient for the agent to achieve good performanceLinks:
Bibtex:
@article{chan_actrce_2019, author = {Chan, Harris and Wu, Yuhuai and Kiros, Jamie and Fidler, Sanja and Ba, Jimmy}, journal = {ArXiv - abs/1902.04546}, title = {{ACTRCE}: {Augmenting} {Experience} via {Teacher}'s {Advice} {For} {Multi}-{Goal} {Reinforcement} {Learning}}, year = {2019} }
-
BabyAI: First Steps Towards Grounded Language Learning with a Human in the Loop Env
Authors:
Chevalier-Boisvert M, Bahdanau D, Lahlou S, Willems L, Saharia C, Nguyen TH, Bengio YAbstract:
Allowing humans to interactively train artificial agents to understand language instructions is desirable for both practical and scientific reasons, but given the poor data efficiency of the current learning methods, this goal may require substantial research efforts. Here, we introduce the BabyAI research platform to support investigations towards including humans in the loop for grounded language learning. The BabyAI platform comprises an extensible suite of 19 levels of increasing difficulty. The levels gradually lead the agent towards acquiring a combinatorially rich synthetic language which is a proper subset of English. The platform also provides a heuristic expert agent for the purpose of simulating a human teacher. We report baseline results and estimate the amount of human involvement that would be required to train a neural network-based agent on some of the BabyAI levels. We put forward strong evidence that current deep learning methods are not yet sufficiently sample efficient when it comes to learning a language with compositional properties.Links:
Bibtex:
@article{chevalier-boisvert_baby-ai_2019, author = {Chevalier-Boisvert, Maxime and Bahdanau, Dzmitry and Lahlou, Salem and Willems, Lucas and Saharia, Chitwan and Nguyen, Thien Huu and Bengio, Yoshua}, journal = {Proc. of ICLR}, title = {{BabyAI}: {First} {Steps} {Towards} {Grounded} {Language} {Learning} with a {Human} in the {Loop}}, year = {2019} }
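For readers who want to try the platform, a hedged usage sketch follows. It assumes the original `babyai` package together with the old-style `gym` API; the level name and observation keys may differ across versions, so treat this as an approximation rather than a verified recipe.

```python
import gym
import babyai  # registers the BabyAI-* levels with gym as a side effect

env = gym.make("BabyAI-GoToRedBall-v0")    # assumed name of one of the simpler levels
obs = env.reset()
print(obs["mission"])                      # synthetic instruction for this episode

done, reward = False, 0.0
while not done:                            # random policy, just to drive the loop
    obs, reward, done, info = env.step(env.action_space.sample())
print("final reward:", reward)
```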
-
Emergent Systematic Generalization in a Situated Agent Embodied
Authors:
Hill F, Lampinen AK, Schneider R, Clark S, Botvinick M, McClelland JL, Santoro AAbstract:
The question of whether deep neural networks are good at generalising beyond their immediate training experience is of critical importance for learning-based approaches to AI. Here, we consider tests of out-of-sample generalisation that require an agent to respond to never-seen-before instructions by manipulating and positioning objects in a 3D Unity simulated room. We first describe a comparatively generic agent architecture that exhibits strong performance on these tests. We then identify three aspects of the training regime and environment that make a significant difference to its performance: (a) the number of object/word experiences in the training set; (b) the visual invariances afforded by the agent's perspective, or frame of reference; and (c) the variety of visual input inherent in the perceptual aspect of the agent's perception. Our findings indicate that the degree of generalisation that networks exhibit can depend critically on particulars of the environment in which a given task is instantiated. They further suggest that the propensity for neural networks to generalise in systematic ways may increase if, like human children, those networks have access to many frames of richly varying, multi-modal observations as they learn.Links:
Bibtex:
@article{hill_emergent_2019, author = {Hill, Felix and Lampinen, Andrew K. and Schneider, Rosalia and Clark, Stephen and Botvinick, Matthew and McClelland, James L. and Santoro, Adam}, journal = {ArXiv - abs/1910.00571}, title = {Emergent {Systematic} {Generalization} in a {Situated} {Agent}}, year = {2019} }
-
From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following Embodied
Authors:
Fu J, Korattikara A, Levine S, Guadarrama SAbstract:
Reinforcement learning is a promising framework for solving control problems, but its use in practical situations is hampered by the fact that reward functions are often difficult to engineer. Specifying goals and tasks for autonomous machines, such as robots, is a significant challenge: conventionally, reward functions and goal states have been used to communicate objectives. But people can communicate objectives to each other simply by describing or demonstrating them. How can we build learning algorithms that will allow us to tell machines what we want them to do? In this work, we investigate the problem of grounding language commands as reward functions using inverse reinforcement learning, and argue that language-conditioned rewards are more transferable than language-conditioned policies to new environments. We propose language-conditioned reward learning (LC-RL), which grounds language commands as a reward function represented by a deep neural network. We demonstrate that our model learns rewards that transfer to novel tasks and environments on realistic, high-dimensional visual environments with natural language commands, whereas directly learning a language-conditioned policy leads to poor performance.Links:
Bibtex:
@inproceedings{fu_language_2019, author = {Justin Fu and Anoop Korattikara and Sergey Levine and Sergio Guadarrama}, booktitle = {Proc. of ICLR}, publisher = {OpenReview.net}, title = {From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following}, year = {2019} }
-
Guiding Policies with Language via Meta-Learning Embodied
Authors:
Co-Reyes JD, Gupta A, Sanjeev S, Altieri N, Andreas J, DeNero J, Abbeel P, Levine SAbstract:
Behavioral skills or policies for autonomous agents are conventionally learned from reward functions, via reinforcement learning, or from demonstrations, via imitation learning. However, both modes of task specification have their disadvantages: reward functions require manual engineering, while demonstrations require a human expert to be able to actually perform the task in order to generate the demonstration. Instruction following from natural language instructions provides an appealing alternative: in the same way that we can specify goals to other humans simply by speaking or writing, we would like to be able to specify tasks for our machines. However, a single instruction may be insufficient to fully communicate our intent or, even if it is, may be insufficient for an autonomous agent to actually understand how to perform the desired task. In this work, we propose an interactive formulation of the task specification problem, where iterative language corrections are provided to an autonomous agent, guiding it in acquiring the desired skill. Our proposed language-guided policy learning algorithm can integrate an instruction and a sequence of corrections to acquire new skills very quickly. In our experiments, we show that this method can enable a policy to follow instructions and corrections for simulated navigation and manipulation tasks, substantially outperforming direct, non-interactive instruction following.Links:
Bibtex:
@inproceedings{co-reyes_guiding_2019, author = {John D. Co{-}Reyes and Abhishek Gupta and Suvansh Sanjeev and Nick Altieri and Jacob Andreas and John DeNero and Pieter Abbeel and Sergey Levine}, booktitle = {Proc. of ICLR}, publisher = {OpenReview.net}, title = {Guiding Policies with Language via Meta-Learning}, year = {2019} }
-
Hierarchical Decision Making by Generating and Following Natural Language Instructions Embodied Vygotskian
Authors:
Hu H, Yarats D, Gong Q, Tian Y, Lewis MAbstract:
We explore using latent natural language instructions as an expressive and compositional representation of complex actions for hierarchical decision making. Rather than directly selecting micro-actions, our agent first generates a latent plan in natural language, which is then executed by a separate model. We introduce a challenging real-time strategy game environment in which the actions of a large number of units must be coordinated across long time scales. We gather a dataset of 76 thousand pairs of instructions and executions from human play, and train instructor and executor models. Experiments show that models using natural language as a latent variable significantly outperform models that directly imitate human actions. The compositional structure of language proves crucial to its effectiveness for action representation. We also release our code, models and data.Links:
Bibtex:
@article{hu_hierarchical_2019, author = {Hengyuan Hu and Denis Yarats and Qucheng Gong and Yuandong Tian and Mike Lewis}, journal = {Proc. of NeurIPS}, title = {{Hierarchical Decision Making by Generating and Following Natural Language Instructions}}, year = {2019} }
-
Interactive Language Learning by Question Answering Embodied
Authors:
Yuan X, Côté MA, Fu J, Lin Z, Pal C, Bengio Y, Trischler AAbstract:
Humans observe and interact with the world to acquire knowledge. However, most existing machine reading comprehension (MRC) tasks miss the interactive, information-seeking component of comprehension. Such tasks present models with static documents that contain all necessary information, usually concentrated in a single short substring. Thus, models can achieve strong performance through simple word- and phrase-based pattern matching. We address this problem by formulating a novel text-based question answering task: Question Answering with Interactive Text (QAit). In QAit, an agent must interact with a partially observable text-based environment to gather information required to answer questions. QAit poses questions about the existence, location, and attributes of objects found in the environment. The data is built using a text-based game generator that defines the underlying dynamics of interaction with the environment. We propose and evaluate a set of baseline models for the QAit task that includes deep reinforcement learning agents. Experiments show that the task presents a major challenge for machine reading systems, while humans solve it with relative ease.Links:
Bibtex:
@inproceedings{yuan_interactive_2019, author = {Yuan, Xingdi and C{\^o}t{\'e}, Marc-Alexandre and Fu, Jie and Lin, Zhouhan and Pal, Chris and Bengio, Yoshua and Trischler, Adam}, booktitle = {Proc. of EMNLP}, publisher = {Association for Computational Linguistics}, title = {{Interactive Language Learning by Question Answering}}, year = {2019} }
-
Language as an Abstraction for Hierarchical Deep Reinforcement Learning Embodied Vygotskian
Authors:
Jiang Y, Gu S, Murphy K, Finn CAbstract:
Solving complex, temporally-extended tasks is a long-standing problem in reinforcement learning (RL). We hypothesize that one critical element of solving such problems is the notion of compositionality. With the ability to learn concepts and sub-skills that can be composed to solve longer tasks, i.e. hierarchical RL, we can acquire temporally-extended behaviors. However, acquiring effective yet general abstractions for hierarchical RL is remarkably challenging. In this paper, we propose to use language as the abstraction, as it provides unique compositional structure, enabling fast learning and combinatorial generalization, while retaining tremendous flexibility, making it suitable for a variety of problems. Our approach learns an instruction-following low-level policy and a high-level policy that can reuse abstractions across tasks, in essence, permitting agents to reason using structured language. To study compositional task learning, we introduce an open-source object interaction environment built using the MuJoCo physics engine and the CLEVR engine. We find that, using our approach, agents can learn to solve to diverse, temporally-extended tasks such as object sorting and multi-object rearrangement, including from raw pixel observations. Our analysis reveals that the compositional nature of language is critical for learning diverse sub-skills and systematically generalizing to new sub-skills in comparison to non-compositional abstractions that use the same supervision.Links:
Bibtex:
@article{jiang_language_2019, author = {Yiding Jiang and Shixiang Gu and Kevin Murphy and Chelsea Finn}, journal = {Proc. of NeurIPS}, title = {Language as an Abstraction for Hierarchical Deep Reinforcement Learning}, year = {2019} }
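The two-level scheme described above can be summarised in a small control loop: a high-level policy emits a language instruction every k steps and a low-level instruction-conditioned policy produces primitive actions in between. Both policies below are random stand-ins, and the instruction set, action set and re-planning interval are invented for illustration.

```python
import random

INSTRUCTIONS = ["move the red block left of the blue ball",
                "push the green cube to the corner"]
ACTIONS = ["up", "down", "left", "right", "grasp"]

def high_level_policy(observation):
    return random.choice(INSTRUCTIONS)     # stand-in for the learned high-level policy

def low_level_policy(observation, instruction):
    return random.choice(ACTIONS)          # stand-in for the instruction-following policy

def rollout(env_step, horizon=20, k=5):
    obs, trace, instruction = {"t": 0}, [], None
    for t in range(horizon):
        if t % k == 0:                     # the high level re-plans in language space
            instruction = high_level_policy(obs)
        action = low_level_policy(obs, instruction)
        obs = env_step(obs, action)
        trace.append((instruction, action))
    return trace

dummy_env_step = lambda obs, action: {"t": obs["t"] + 1}
print(rollout(dummy_env_step)[:3])
```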
-
Learning to Understand Goal Specifications by Modelling Reward Embodied
Authors:
Bahdanau D, Hill F, Leike J, Hughes E, Hosseini SA, Kohli P, Grefenstette EAbstract:
Recent work has shown that deep reinforcement-learning agents can learn to follow language-like instructions from infrequent environment rewards. However, this places on environment designers the onus of designing language-conditional reward functions which may not be easily or tractably implemented as the complexity of the environment and the language scales. To overcome this limitation, we present a framework within which instruction-conditional RL agents are trained using rewards obtained not from the environment, but from reward models which are jointly trained from expert examples. As reward models improve, they learn to accurately reward agents for completing tasks for environment configurations---and for instructions---not present amongst the expert data. This framework effectively separates the representation of what instructions require from how they can be executed. In a simple grid world, it enables an agent to learn a range of commands requiring interaction with blocks and understanding of spatial relations and underspecified abstract arrangements. We further show the method allows our agent to adapt to changes in the environment without requiring new expert examples.Links:
Bibtex:
@inproceedings{bahdanau_learning_2019, author = {Dzmitry Bahdanau and Felix Hill and Jan Leike and Edward Hughes and Seyed Arian Hosseini and Pushmeet Kohli and Edward Grefenstette}, booktitle = {Proc. of ICLR}, publisher = {OpenReview.net}, title = {{Learning to Understand Goal Specifications by Modelling Reward}}, year = {2019} }
-
Systematic Generalization: What Is Required and Can It Be Learned? Disembodied
Authors:
Bahdanau D, Murty S, Noukhovitch M, Nguyen TH, de Vries H, Courville ACAbstract:
Numerous models for grounded language understanding have been recently proposed, including (i) generic models that can be easily adapted to any given task and (ii) intuitively appealing modular models that require background knowledge to be instantiated. We compare both types of models in how much they lend themselves to a particular form of systematic generalization. Using a synthetic VQA test, we evaluate which models are capable of reasoning about all possible object pairs after training on only a small subset of them. Our findings show that the generalization of modular models is much more systematic and that it is highly sensitive to the module layout, i.e. to how exactly the modules are connected. We furthermore investigate if modular models that generalize well could be made more end-to-end by learning their layout and parametrization. We find that end-to-end methods from prior work often learn inappropriate layouts or parametrizations that do not facilitate systematic generalization. Our results suggest that, in addition to modularity, systematic generalization in language understanding may require explicit regularizers or priors.Links:
Bibtex:
@inproceedings{bahdanau_systematic_2018, author = {Dzmitry Bahdanau and Shikhar Murty and Michael Noukhovitch and Thien Huu Nguyen and Harm de Vries and Aaron C. Courville}, booktitle = {Proc. of ICLR}, publisher = {OpenReview.net}, title = {Systematic Generalization: What Is Required and Can It Be Learned?}, year = {2019} }
-
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences from Natural Supervision Disembodied
Authors:
Mao J, Gan C, Kohli P, Tenenbaum JB, Wu JAbstract:
We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns visual concepts, words, and semantic parsing of sentences without explicit supervision on any of them; instead, our model learns by simply looking at images and reading paired questions and answers. Our model builds an object-based scene representation and translates sentences into executable, symbolic programs. To bridge the learning of two modules, we use a neuro-symbolic reasoning module that executes these programs on the latent scene representation. Analogical to human concept learning, the perception module learns visual concepts based on the language description of the object being referred to. Meanwhile, the learned visual concepts facilitate learning new words and parsing new sentences. We use curriculum learning to guide the searching over the large compositional space of images and language. Extensive experiments demonstrate the accuracy and efficiency of our model on learning visual concepts, word representations, and semantic parsing of sentences. Further, our method allows easy generalization to new object attributes, compositions, language concepts, scenes and questions, and even new program domains. It also empowers applications including visual question answering and bidirectional image-text retrieval.Links:
Bibtex:
@article{mao2019neuro, title={{The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences from Natural Supervision}}, author={Mao, Jiayuan and Gan, Chuang and Kohli, Pushmeet and Tenenbaum, Joshua B and Wu, Jiajun}, journal={ArXiv - abs/1904.12584}, year={2019} }
-
Using Natural Language for Reward Shaping in Reinforcement Learning Embodied Vygotskian
Authors:
Goyal P, Niekum S, Mooney RJAbstract:
Recent reinforcement learning (RL) approaches have shown strong performance in complex domains such as Atari games, but are often highly sample inefficient. A common approach to reduce interaction time with the environment is to use reward shaping, which involves carefully designing reward functions that provide the agent intermediate rewards for progress towards the goal. However, designing appropriate shaping rewards is known to be difficult as well as time-consuming. In this work, we address this problem by using natural language instructions to perform reward shaping. We propose the LanguagE-Action Reward Network (LEARN), a framework that maps free-form natural language instructions to intermediate rewards based on actions taken by the agent. These intermediate language-based rewards can seamlessly be integrated into any standard reinforcement learning algorithm. We experiment with Montezuma's Revenge from the Atari Learning Environment, a popular benchmark in RL. Our experiments on a diverse set of 15 tasks demonstrate that, for the same number of interactions with the environment, language-based rewards lead to successful completion of the task 60% more often on average, compared to learning without language.Links:
Bibtex:
@inproceedings{goyal_using_2019, author = {Prasoon Goyal and Scott Niekum and Raymond J. Mooney}, booktitle = {Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, {IJCAI} 2019, Macao, China, August 10-16, 2019}, editor = {Sarit Kraus}, publisher = {ijcai.org}, title = {Using Natural Language for Reward Shaping in Reinforcement Learning}, year = {2019} }
2018: 6 Papers
-
Embodied Question Answering Embodied
Authors:
Das A, Datta S, Gkioxari G, Lee S, Parikh D, Batra DAbstract:
We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where an agent is spawned at a random location in a 3D environment and asked a question ("What color is the car?"). In order to answer, the agent must first intelligently navigate to explore the environment, gather information through first-person (egocentric) vision, and then answer the question ("orange"). This challenging task requires a range of AI skills -- active perception, language understanding, goal-driven navigation, commonsense reasoning, and grounding of language into actions. In this work, we develop the environments, end-to-end-trained reinforcement learning agents, and evaluation protocols for EmbodiedQA.Links:
Bibtex:
@article{das_embodied_2018, author = {Abhishek Das and Samyak Datta and Georgia Gkioxari and Stefan Lee and Devi Parikh and Dhruv Batra}, journal = {Proc. of CVPR}, title = {Embodied Question Answering}, year = {2018} }
-
Gated-Attention Architectures for Task-Oriented Language Grounding Embodied
Authors:
Chaplot DS, Sathyendra KM, Pasumarthi RK, Rajagopal D, Salakhutdinov RAbstract:
To perform tasks specified by natural language instructions, autonomous agents need to extract semantically meaningful representations of language and map it to visual elements and actions in the environment. This problem is called task-oriented language grounding. We propose an end-to-end trainable neural architecture for task-oriented language grounding in 3D environments which assumes no prior linguistic or perceptual knowledge and requires only raw pixels from the environment and the natural language instruction as input. The proposed model combines the image and text representations using a Gated-Attention mechanism and learns a policy to execute the natural language instruction using standard reinforcement and imitation learning methods. We show the effectiveness of the proposed model on unseen instructions as well as unseen maps, both quantitatively and qualitatively. We also introduce a novel environment based on a 3D game engine to simulate the challenges of task-oriented language grounding over a rich set of instructions and environment states.Links:
Bibtex:
@article{chaplot_gated-attention_2018, author = {Devendra Singh Chaplot and Kanthashree Mysore Sathyendra and Rama Kumar Pasumarthi and Dheeraj Rajagopal and Ruslan Salakhutdinov}, journal = {Proc. of AAAI}, title = {{Gated-Attention Architectures for Task-Oriented Language Grounding}}, year = {2018} }
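The Gated-Attention fusion itself is compact enough to sketch in NumPy: the instruction embedding is projected to one gate per visual channel, squashed with a sigmoid, and multiplied element-wise with the convolutional feature map. Shapes and the random projection matrix below are illustrative, not the paper's hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, D = 64, 8, 8, 32                         # channels, height, width, text embedding dim

image_features = rng.standard_normal((C, H, W))   # output of a convolutional stack
instruction_emb = rng.standard_normal(D)          # output of a recurrent encoder over words
W_gate = 0.1 * rng.standard_normal((C, D))        # learned projection (random here)

def gated_attention(image_features, instruction_emb, W_gate):
    """Language-conditioned channel gating: sigmoid gates broadcast over H and W."""
    gates = 1.0 / (1.0 + np.exp(-W_gate @ instruction_emb))   # shape (C,)
    return image_features * gates[:, None, None]

fused = gated_attention(image_features, instruction_emb, W_gate)
print(fused.shape)   # (64, 8, 8): language-modulated visual features fed to the policy
```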
-
Learning with Latent Language Embodied
Authors:
Andreas J, Klein D, Levine SAbstract:
The named concepts and compositional operators present in natural language provide a rich source of information about the kinds of abstractions humans use to navigate the world. Can this linguistic background knowledge improve the generality and efficiency of learned classifiers and control policies? This paper aims to show that using the space of natural language strings as a parameter space is an effective way to capture natural task structure. In a pretraining phase, we learn a language interpretation model that transforms inputs (e.g. images) into outputs (e.g. labels) given natural language descriptions. To learn a new concept (e.g. a classifier), we search directly in the space of descriptions to minimize the interpreter's loss on training examples. Crucially, our models do not require language data to learn these concepts: language is used only in pretraining to impose structure on subsequent learning. Results on image classification, text editing, and reinforcement learning show that, in all settings, models with a linguistic parameterization outperform those without.Links:
Bibtex:
@inproceedings{andreas_learning_2017, author = {Andreas, Jacob and Klein, Dan and Levine, Sergey}, booktitle = {Proc. of NAACL-HLT}, publisher = {Association for Computational Linguistics}, title = {Learning with Latent Language}, year = {2018} }
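The search-in-language-space recipe described above reduces, in caricature, to scoring a pool of candidate descriptions by how well a pretrained interpretation model fits the few-shot examples and keeping the best one. The keyword-matching `interpret` function below is a deliberately trivial stand-in for that pretrained model, and the candidate pool is invented for illustration.

```python
# `interpret` is a trivial stand-in for the pretrained language
# interpretation model mapping (description, input) -> label.
def interpret(description, x):
    return int(all(word in x for word in description.split()))

def fit_concept(candidates, examples):
    """Pick the candidate description whose induced classifier
    makes the fewest errors on the few-shot examples."""
    def error(description):
        return sum(interpret(description, x) != y for x, y in examples)
    return min(candidates, key=error)

candidates = ["red", "round", "red round", "blue"]
examples = [({"red", "round", "small"}, 1),
            ({"red", "square"}, 0),
            ({"blue", "round"}, 0)]
print(fit_concept(candidates, examples))   # -> "red round"
```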
-
Representation Learning for Grounded Spatial Reasoning Embodied
Authors:
Janner M, Narasimhan K, Barzilay RAbstract:
The interpretation of spatial references is highly contextual, requiring joint inference over both language and the environment. We consider the task of spatial reasoning in a simulated environment, where an agent can act and receive rewards. The proposed model learns a representation of the world steered by instruction text. This design allows for precise alignment of local neighborhoods with corresponding verbalizations, while also handling global references in the instructions. We train our model with reinforcement learning using a variant of generalized value iteration. The model outperforms state-of-the-art approaches on several metrics, yielding a 45% reduction in goal localization error.Links:
Bibtex:
@article{janner_representation_2018, author = {Janner, Michael and Narasimhan, Karthik and Barzilay, Regina}, journal = {Transactions of the Association for Computational Linguistics}, title = {{Representation Learning for Grounded Spatial Reasoning}}, year = {2018} }
-
TextWorld: A Learning Environment for Text-Based Games Env
Authors:
Côté MA, Kádár Á, Yuan X, Kybartas B, Barnes T, Fine E, Moore J, Hausknecht MJ, Asri LE, Adada M, Tay W, Trischler AAbstract:
We introduce TextWorld, a sandbox learning environment for the training and evaluation of RL agents on text-based games. TextWorld is a Python library that handles interactive play-through of text games, as well as backend functions like state tracking and reward assignment. It comes with a curated list of games whose features and challenges we have analyzed. More significantly, it enables users to handcraft or automatically generate new games. Its generative mechanisms give precise control over the difficulty, scope, and language of constructed games, and can be used to relax challenges inherent to commercial text games like partial observability and sparse rewards. By generating sets of varied but similar games, TextWorld can also be used to study generalization and transfer learning. We cast text-based games in the Reinforcement Learning formalism, use our framework to develop a set of benchmark games, and evaluate several baseline agents on this set and the curated list.Links:
Bibtex:
@article{cote_textworld_2018, author = {Côté, Marc-Alexandre and K{\'a}d{\'a}r, {\'A}kos and Yuan, Xingdi and Kybartas, Ben and Barnes, Tavian and Fine, Emery and Moore, James and Hausknecht, Matthew J. and Asri, Layla El and Adada, Mahmoud and Tay, Wendy and Trischler, Adam}, journal = {Computer {Games} - 7th {Workshop} @ IJCAI}, title = {{TextWorld}: {A} {Learning} {Environment} for {Text}-{Based} {Games}}, year = {2018} }
-
Speaker-Follower Models for Vision-and-Language Navigation Embodied Vygotskian
Authors:
Fried D, Hu R, Cirik V, Rohrbach A, Andreas J, Morency LP, Berg-Kirkpatrick T, Saenko K, Klein D, Darrell TAbstract:
Navigation guided by natural language instructions presents a challenging reasoning problem for instruction followers. Natural language instructions typically identify only a few high-level decisions and landmarks rather than complete low-level motor behaviors; much of the missing information must be inferred based on perceptual context. In machine learning settings, this is doubly challenging: it is difficult to collect enough annotated data to enable learning of this reasoning process from scratch, and also difficult to implement the reasoning process using generic sequence models. Here we describe an approach to vision-and-language navigation that addresses both these issues with an embedded speaker model. We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction. Both steps are supported by a panoramic action space that reflects the granularity of human-generated instructions. Experiments show that all three components of this approach---speaker-driven data augmentation, pragmatic reasoning and panoramic action space---dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.Links:
Bibtex:
@inproceedings{NEURIPS2018_6a81681a, author = {Fried, Daniel and Hu, Ronghang and Cirik, Volkan and Rohrbach, Anna and Andreas, Jacob and Morency, Louis-Philippe and Berg-Kirkpatrick, Taylor and Saenko, Kate and Klein, Dan and Darrell, Trevor}, booktitle = {Advances in Neural Information Processing Systems}, editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett}, publisher = {Curran Associates, Inc.}, title = {Speaker-Follower Models for Vision-and-Language Navigation}, url = {https://proceedings.neurips.cc/paper/2018/file/6a81681a7af700c6385d36577ebec359-Paper.pdf}, volume = {31}, year = {2018} }
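Of the three components, the pragmatic-reasoning step is the easiest to sketch: the follower proposes candidate routes and the speaker rescores each one by how well it would explain the instruction. The two scoring functions below are toy stand-ins for the trained follower and speaker models, and the mixing weight alpha is an arbitrary illustrative choice.

```python
def follower_score(instruction, route):
    """Stand-in for log P_follower(route | instruction)."""
    return -abs(len(route) - len(instruction.split()))

def speaker_score(instruction, route):
    """Stand-in for log P_speaker(instruction | route): how well the route
    would explain the instruction."""
    return -abs(len(set(instruction.split())) - len(set(route)))

def pragmatic_rerank(instruction, candidate_routes, alpha=0.5):
    """Combine follower and speaker scores and return the best candidate route."""
    def combined(route):
        return (alpha * speaker_score(instruction, route)
                + (1 - alpha) * follower_score(instruction, route))
    return max(candidate_routes, key=combined)

routes = [["hall", "stairs", "kitchen"], ["hall", "kitchen"], ["hall"]]
print(pragmatic_rerank("walk past the stairs into the kitchen", routes))
```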
2017: 2 Papers
-
Grounded Language Learning in a Simulated 3D World Embodied
Authors:
Hermann KM, Hill F, Green S, Wang F, Faulkner R, Soyer H, Szepesvari D, Czarnecki WM, Jaderberg M, Teplyashin D, Wainwright M, Apps C, Hassabis D, Blunsom PAbstract:
We are increasingly surrounded by artificially intelligent technology that takes decisions and executes actions on our behalf. This creates a pressing need for general means to communicate with, instruct and guide artificial agents, with human language the most compelling means for such communication. To achieve this in a scalable fashion, agents must be able to relate language to the world and to actions; that is, their understanding of language must be grounded and embodied. However, learning grounded language is a notoriously challenging problem in artificial intelligence research. Here we present an agent that learns to interpret language in a simulated 3D environment where it is rewarded for the successful execution of written instructions. Trained via a combination of reinforcement and unsupervised learning, and beginning with minimal prior knowledge, the agent learns to relate linguistic symbols to emergent perceptual representations of its physical surroundings and to pertinent sequences of actions. The agent's comprehension of language extends beyond its prior experience, enabling it to apply familiar language to unfamiliar situations and to interpret entirely novel instructions. Moreover, the speed with which this agent learns new words increases as its semantic knowledge grows. This facility for generalising and bootstrapping semantic knowledge indicates the potential of the present approach for reconciling ambiguous natural language with the complexity of the physical world.Links:
Bibtex:
@article{hermann_grounded_2017, author = {Hermann, Karl Moritz and Hill, Felix and Green, Simon and Wang, Fumin and Faulkner, Ryan and Soyer, Hubert and Szepesvari, David and Czarnecki, Wojciech Marian and Jaderberg, Max and Teplyashin, Denis and Wainwright, Marcus and Apps, Chris and Hassabis, Demis and Blunsom, Phil}, journal = {ArXiv - abs/1706.06551}, title = {Grounded {Language} {Learning} in a {Simulated} {3D} {World}}, year = {2017} }
-
Modular Multitask Reinforcement Learning with Policy Sketches Embodied Disembodied
Authors:
Andreas J, Klein D, Levine SAbstract:
We describe a framework for multitask deep reinforcement learning guided by policy sketches. Sketches annotate tasks with sequences of named subtasks, providing information about high-level structural relationships among tasks but not how to implement them---specifically not providing the detailed guidance used by much previous work on learning policy abstractions for RL (e.g. intermediate rewards, subtask completion signals, or intrinsic motivations). To learn from sketches, we present a model that associates every subtask with a modular subpolicy, and jointly maximizes reward over full task-specific policies by tying parameters across shared subpolicies. Optimization is accomplished via a decoupled actor--critic training objective that facilitates learning common behaviors from multiple dissimilar reward functions. We evaluate the effectiveness of our approach in three environments featuring both discrete and continuous control, and with sparse rewards that can be obtained only after completing a number of high-level subgoals. Experiments show that using our approach to learn policies guided by sketches gives better performance than existing techniques for learning task-specific or shared policies, while naturally inducing a library of interpretable primitive behaviors that can be recombined to rapidly adapt to new tasks.Links:
Bibtex:
@inproceedings{andreas_modular_2017, author = {Jacob Andreas and Dan Klein and Sergey Levine}, booktitle = {Proceedings of the 34th International Conference on Machine Learning, {ICML} 2017, Sydney, NSW, Australia, 6-11 August 2017}, editor = {Doina Precup and Yee Whye Teh}, publisher = {{PMLR}}, series = {Proceedings of Machine Learning Research}, title = {Modular Multitask Reinforcement Learning with Policy Sketches}, volume = {70}, year = {2017} }
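A minimal sketch of executing a policy sketch, in the sense described above: each subtask symbol indexes a modular subpolicy that is shared across every task mentioning it, and a termination test decides when to advance to the next symbol. The subpolicies, toy environment and termination rule below are all illustrative stand-ins for learned components.

```python
import random

# One modular subpolicy per named subtask, shared across all tasks that use it.
subpolicies = {
    "get_wood":   lambda obs: random.choice(["chop", "move"]),
    "make_plank": lambda obs: "use_workbench",
    "get_gold":   lambda obs: random.choice(["dig", "move"]),
}

def run_sketch(sketch, env_step, subtask_done, obs, max_steps=50):
    """Execute the subpolicies in the order given by the sketch."""
    for subtask in sketch:
        policy = subpolicies[subtask]
        for _ in range(max_steps):
            obs = env_step(obs, policy(obs))
            if subtask_done(subtask, obs):
                break
    return obs

# Toy environment: count steps; declare a subtask done every third step.
env_step = lambda obs, action: {**obs, "steps": obs["steps"] + 1}
subtask_done = lambda subtask, obs: obs["steps"] % 3 == 0
print(run_sketch(["get_wood", "make_plank"], env_step, subtask_done, {"steps": 0}))
```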
2015: 1 Paper
-
Language Understanding for Text-based Games using Deep Reinforcement Learning Embodied
Authors:
Narasimhan K, Kulkarni T, Barzilay RAbstract:
In this paper, we consider the task of learning control policies for text-based games. In these games, all interactions in the virtual world are through text and the underlying state is not observed. The resulting language barrier makes such environments challenging for automatic game players. We employ a deep reinforcement learning framework to jointly learn state representations and action policies using game rewards as feedback. This framework enables us to map text descriptions into vector representations that capture the semantics of the game states. We evaluate our approach on two game worlds, comparing against baselines using bag-of-words and bag-of-bigrams for state representations. Our algorithm outperforms the baselines on both worlds demonstrating the importance of learning expressive representations.Links:
Bibtex:
@article{narasimhan_language_2015, author = {Narasimhan, Karthik and Kulkarni, Tejas and Barzilay, Regina}, journal = {Proc. of EMNLP}, title = {{Language Understanding for Text-based Games using Deep Reinforcement Learning}}, year = {2015} }
2010: 1 Paper
-
Integration of action and language knowledge: A roadmap for developmental robotics
Authors:
Cangelosi A, Metta G, Sagerer G, Nolfi S, Nehaniv C, Fischer K, Tani J, Belpaeme T, Sandini G, Nori F, OthersLinks:
Bibtex:
@article{cangelosi2010integration, author = {Cangelosi, Angelo and Metta, Giorgio and Sagerer, Gerhard and Nolfi, Stefano and Nehaniv, Chrystopher and Fischer, Kerstin and Tani, Jun and Belpaeme, Tony and Sandini, Giulio and Nori, Francesco and others}, journal = {IEEE Transactions on Autonomous Mental Development}, number = {3}, publisher = {IEEE}, title = {Integration of action and language knowledge: A roadmap for developmental robotics}, volume = {2}, year = {2010} }