IIR - Reinforcement learning and how general-purpose learning and communication methods can solve complex environments

This class aims to familiarize students with the process of reading scientific articles and summarizing them. It is closely related to the master’s thesis at the end of the two-year program. The various articles analyzed in this post were selected by me.


1. Student

LAGUE Pierre pierre.lague.etu@univ-lille.fr

2. Main Article

  • Citation : Vinyals, O., Babuschkin, I., Czarnecki, W.M. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019). https://doi.org/10.1038/s41586-019-1724-z
  • Paper / Journal : Nature
  • Classification : Impact Score: 23.95, H-Index: 1331
  • Number of citations : 1156 citations (based on Nature metrics)

3. Summary

Problem: How can intelligent agents achieve superhuman capabilities in complex reinforcement learning environments like StarCraft II using general-purpose learning methods, while adhering to human-like constraints?

Possible avenues (pointed out by the authors)

The authors propose several avenues for enabling intelligent agents in complex reinforcement learning environments like StarCraft II: exploring cooperation among multiple agents, deciding whether agents should be defined individually or share a common memory, balancing exploration and exploitation, understanding how AlphaStar navigates the vast action space, analyzing strategies in complex scenarios, comparing AI strategies with those of human players, and assessing the impact of human-like constraints. To do so, they employ a multi-agent actor-critic approach, evaluate AlphaStar empirically at several stages of training, use a learning algorithm that combines imitation learning and reinforcement learning, and design a dedicated architecture for AlphaStar’s policy. Together, these avenues contribute to understanding and overcoming the challenges of creating intelligent agents for real-world applications.

Key Questions:

  • How can multiple agents cooperate effectively towards a common reward?
  • Should agents be individually defined or share a common “memory”?
  • How can agents balance exploration and exploitation for maximum rewards?
  • How does AlphaStar navigate the vast combinatorial action space in StarCraft?

  • What strategies does AlphaStar use to explore complex strategy spaces in StarCraft?
  • What are the differences and similarities between AlphaStar’s strategies and those of human players in real-world play?
  • How do constraints like network latency and limited Actions Per Minute (APM) impact AlphaStar’s performance compared to humans in real-time strategy games?

Approach:

  • Limiting AlphaStar:
    • Constrained APM (22 actions per 5 seconds) and real-time processing delay (~80ms).
    • Restricted visibility to units visible on the minimap.
  • Dataset:
    • Used a dataset of 971,000 replays from the top 22% of players, covering all three races (Protoss, Terran, Zerg).
  • Multi-Agent Actor-Critic Approach:
    • Three pools of agents initialized from human data, trained with reinforcement learning.
    • Agents train, create copies (“League Exploiters” or “Main Exploiters”), and engage in a cyclic training process.
    • Fictitious self-play avoids strategy cycles, strengthening learning and addressing weaknesses (a matchmaking sketch follows this list).
  • Empirical Evaluation:
    • Evaluated in three states: after supervised training, after 27 days of League Training, and after 44 days of training.
    • Evaluated using the unconditional policy on Battle.net, the official online matchmaking system.
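The league dynamics above can be illustrated with a small sketch of prioritized fictitious self-play matchmaking, in which opponents the learner still loses to are sampled more often. The league contents, win-rate table, and weighting exponent below are illustrative assumptions, not DeepMind's actual implementation.

```python
import random

# Sketch of prioritized fictitious self-play matchmaking: opponents the
# learner still loses to are sampled more often. All names and numbers
# below are illustrative, not taken from the AlphaStar implementation.

def pfsp_weight(win_prob, p=2.0):
    # The lower our win probability against an opponent, the larger its weight.
    return (1.0 - win_prob) ** p

def sample_opponent(league, win_prob_vs):
    """Pick a training opponent from the league, favouring hard opponents."""
    weights = [pfsp_weight(win_prob_vs[name]) for name in league]
    return random.choices(league, weights=weights, k=1)[0]

league = ["main_agent_v1", "main_exploiter_v2", "league_exploiter_v3"]
win_prob_vs = {"main_agent_v1": 0.7, "main_exploiter_v2": 0.2,
               "league_exploiter_v3": 0.4}
print(sample_opponent(league, win_prob_vs))
```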

Implementation of the Approach:

  • Learning Algorithm:
    • Utilized imitation learning, reinforcement learning, and multi-agent learning.
    • Adopted a policy-based method that updates parameters along the gradient of the long-term reward (see the sketch after this list).
  • Architecture:
    • AlphaStar’s policy function, represented by π_θ(a_t | s_t, z), is implemented as a deep neural network.
    • The neural network has 139 million weights, with only 55 million used during inference.
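As a toy illustration of the policy-based update mentioned above, here is a minimal REINFORCE-style sketch in which the parameters of a small tabular softmax policy are moved along the gradient of the return. It is for intuition only and bears no relation to AlphaStar's actual 139-million-parameter architecture; the state/action sizes and learning rate are made up.

```python
import numpy as np

# Minimal REINFORCE-style sketch of a policy-based method: parameters are
# updated along the gradient of the (long-term) return. The tiny tabular
# softmax policy below is a toy illustration, not AlphaStar's architecture.

n_states, n_actions, lr = 4, 3, 0.1
theta = np.zeros((n_states, n_actions))           # policy parameters

def policy(state):
    logits = theta[state]
    exp = np.exp(logits - logits.max())           # numerically stable softmax
    return exp / exp.sum()

def reinforce_update(trajectory):
    """trajectory: list of (state, action, return-to-go G_t) tuples."""
    for s, a, G in trajectory:
        p = policy(s)
        grad_log = -p                             # d log pi(a|s) / d logits
        grad_log[a] += 1.0
        theta[s] += lr * G * grad_log             # ascend the expected return

# Hypothetical one-step episode, just to exercise the update.
reinforce_update([(0, 1, 1.0)])
print(policy(0))
```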

Results:

  • Game Results:
    • AlphaStar Final achieved MMR ratings surpassing 99.8% of human players, reaching Grandmaster level across all three races.
    • AlphaStar Supervised outperformed 84% of human players, highlighting the effectiveness of supervised learning.
  • Scientific Result:
    • AlphaStar attained Grandmaster level in StarCraft II without simplifying the game.
    • The success suggests that general-purpose machine learning algorithms can impact solving real-world problems in complex domains.

4. Related Articles

Article 1

Mastering the game of Go without human knowledge.

References and metrics

  • Citation : Silver, D., Schrittwieser, J., Simonyan, K. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017). https://doi.org/10.1038/nature24270

  • Paper / Journal : Nature
  • Classification : Impact Score: 23.95, H-Index: 1331
  • Number of citations : 4,192 (based on Nature metrics)

Summary

Problem: How can the limitations of supervised learning systems, such as the dependence on costly or unreliable expert datasets, be effectively addressed to achieve superhuman proficiency in complex domains like the game of Go?

Possible avenues (pointed out by the authors)

The authors propose a solution to the limitations of supervised learning systems by introducing AlphaGo Zero, an approach rooted in reinforcement learning. The primary avenue is a self-play reinforcement learning algorithm that starts from random play and uses no human data. The algorithm is designed to learn autonomously and improve iteratively, surpassing human capabilities in complex domains like the game of Go. In addition, incorporating Monte Carlo Tree Search (MCTS) during self-play, with a simplified tree search that avoids Monte Carlo rollouts, contributes to effective move selection using a single neural network.

Key Questions

  • Efficiency of Reinforcement Learning in Complex Domains: What evidence supports the claim that reinforcement learning is more effective than supervised learning?
  • Impact of Self-Play on Autonomous Learning: How does the self-play reinforcement learning algorithm contribute to the autonomous learning and iterative improvement of the AI system?
  • Single Neural Network Architecture: What are the specific advantages and potential drawbacks of a single neural network architecture that integrates both policy and value networks?
  • Role of Monte Carlo Tree Search (MCTS): To what extent does incorporating MCTS during self-play improve the quality of move selection?

Approach:

  • Neural Network:
    • AlphaGo Zero utilizes a deep neural network (fθ) for move probabilities and values.
  • Reinforcement Learning:
    • Trained through self-play reinforcement learning with policy iteration.
  • MCTS:
    • Monte Carlo Tree Search enhances move selection during self-play (a sketch of the resulting training target follows this list).
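As referenced above, the single network f_θ(s) outputs move probabilities p and a value v, and is trained against the MCTS visit distribution π and the game outcome z. The sketch below only illustrates the arithmetic of that combined objective; the toy numbers and the regularisation constant are assumptions, not values from the paper.

```python
import numpy as np

# Sketch of AlphaGo Zero's combined training objective:
#   loss = (z - v)^2 - pi . log(p) + c * ||theta||^2
# where (p, v) = f_theta(s), pi is the MCTS visit-count distribution and z
# the game outcome. The toy numbers below only illustrate the arithmetic.

def alphago_zero_loss(p, v, pi, z, theta_sq_norm, c=1e-4):
    value_loss = (z - v) ** 2                      # regression to the outcome
    policy_loss = -np.sum(pi * np.log(p + 1e-12))  # cross-entropy vs MCTS pi
    return value_loss + policy_loss + c * theta_sq_norm

p = np.array([0.5, 0.3, 0.2])    # network move priors (toy values)
pi = np.array([0.6, 0.3, 0.1])   # MCTS visit-count distribution (toy values)
print(alphago_zero_loss(p, v=0.1, pi=pi, z=1.0, theta_sq_norm=2.5))
```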

Implementation of the Approach:

  • Training Process:
    • The reinforcement learning pipeline was trained for 3 days over 4.9 million self-play games, with 1,600 MCTS simulations per move.
  • Parameters:
    • 700,000 mini-batches, 20 residual blocks.
  • Performance:
    • Learning progressed smoothly; AlphaGo Zero surpassed AlphaGo Lee after 36 hours and defeated it after 72 hours of training.

Results:

  • Extended Training:
    • Second instance trained for 40 days, 29 million self-play games, 3.1 million mini-batches, 40 residual blocks.
  • Performance Ratings:
    • AlphaGo Zero achieved 5,185 Elo, outperforming AlphaGo Master (4,858), AlphaGo Lee (3,739), and AlphaGo Fan (3,144).
  • Head-to-Head Match:
    • AlphaGo Zero won 89-11 against AlphaGo Master in a 100-game match with 2-hour time controls.

Article 2

The Hanabi Challenge: A New Frontier for AI Research

References and metrics

  • Citation : N. Bard, J. N. Foerster, S. Chandar, et al., “The Hanabi Challenge: A New Frontier for AI Research,” Artificial Intelligence, vol. 280, 2020, 103216. DOI: 10.1016/j.artint.2019.103216.
  • Conference / Journal : Artificial Intelligence
  • Classification : Impact Score: 14.4, H-Index: 161
  • Number of citations : 327 citations (based on Google Scholar metrics)

Summary

The article discusses the challenges posed by multi-agent environments, emphasizing the need for artificially intelligent agents to cooperate effectively with other agents, especially humans. It introduces the card game Hanabi as a benchmark that mirrors the human use of theory of mind in addressing multi-agent challenges. The article highlights the complexity of imperfect information in Hanabi, noting that human players approach the game by reasoning about the perspectives, beliefs, and intentions of others in order to signal and interpret actions effectively.

Problem

How can multi-agent reinforcement learning algorithms be effectively designed to address the unique challenges presented by cooperative, imperfect-information games such as Hanabi?

Possible avenues (proposed by the authors)

The authors propose using Hanabi as a benchmark for AI, presenting two challenges: learning fixed policies through self-play and ad-hoc team play. They recommend evaluation in the Sample Limited (SL) and Unlimited (UL) regimes. Possible avenues include developing efficient algorithms for the SL regime and exploring large-scale computation for the UL regime.

Key Questions:

  • How can efficient algorithms be developed for multi-agent learning in Hanabi’s cooperative, imperfect-information setting?
  • What strategies enable AI agents to adapt to unknown partners in ad-hoc team scenarios?
  • Which computational approaches improve asymptotic performance in Hanabi’s Unlimited regime?

Approach:

  • Utilize a diverse pool of agents, incorporating hard-coded and self-play-learned strategies for evaluation.
  • Expose agents to diverse strategies through ten random self-play games with ad-hoc teammates.
  • Conduct trials where agents play games in place of hold-out team players, reporting mean and standard deviation across 1000 trials.
  • Evaluate diversity using crosstables that analyze how hold-out teams perform when paired with each other (a sketch of this crosstable evaluation follows this list).
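The crosstable evaluation described above can be sketched as follows; `play_game` is a hypothetical placeholder for an actual Hanabi rollout, and the agent names and trial count are illustrative.

```python
import itertools
import random
import statistics

# Sketch of the ad-hoc crosstable evaluation: every pairing of hold-out
# agents plays a batch of games and we report mean score and deviation.
# `play_game` is a placeholder; a real run would execute Hanabi episodes.

def play_game(agent_a, agent_b):
    return random.randint(0, 25)                  # Hanabi scores range 0-25

def crosstable(agents, n_trials=1000):
    table = {}
    for a, b in itertools.product(agents, repeat=2):
        scores = [play_game(a, b) for _ in range(n_trials)]
        table[(a, b)] = (statistics.mean(scores), statistics.pstdev(scores))
    return table

results = crosstable(["SmartBot", "Rainbow", "BAD"], n_trials=100)
print(results[("SmartBot", "Rainbow")])
```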

Implementation of the approach:

  • hanabi_learning_environment
    • A research platform for running Hanabi experiments (a minimal usage sketch follows this list).
  • RL-based methods
    • Actor-critic agents demonstrate good performance in the single-agent task.
    • Rainbow agents, based on deep Q-learning, share parameters with other agents in the multi-agent setting.
    • BAD (Bayesian Action Decoder)
  • Rule-based methods
    • SmartBot, a rule-based agent relying solely on the information available in the game.
    • HatBot & WTFWThat are agents that provide a lower bound on optimal play rather than a baseline suggestive of human play.
    • FireFlower implements a set of human-style conventions to maximize win probability
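For reference, a minimal random-agent rollout in the hanabi_learning_environment platform mentioned above. This assumes the open-source package's `rl_env` interface; exact method names and observation keys may differ between versions.

```python
import random
from hanabi_learning_environment import rl_env   # assumes the package is installed

# Minimal random-agent rollout, assuming the rl_env interface of the
# open-source hanabi_learning_environment; details may vary by version.
env = rl_env.make("Hanabi-Full", num_players=2)
observations = env.reset()
done = False
while not done:
    current = observations["current_player"]
    legal_moves = observations["player_observations"][current]["legal_moves"]
    action = random.choice(legal_moves)           # pick any legal move
    observations, reward, done, _ = env.step(action)
print("Episode finished with reward:", reward)
```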

Results:

  • RL-based agent (Unlimited Regime)
    • shows mixed performance and is globally outperformed by a majority of the other agents.
  • Rule-based agents
    • varying performances across player configurations.
  • WTFWThat agent
    • significantly improves the rate of perfect games, reaching 91.5% in the 5-player configuration.
  • Rainbow (Sample Limited Regime)
    • exhibits varied performances with a low percentage of perfect games.
  • BAD (Unlimited Regime)
    • achieves a mean performance of 23.92 points in the 2-player configuration with a high percentage of perfect games at 58.6%.

Article 3

Learning to Teach in Cooperative Multiagent Reinforcement Learning

References and metrics

  • Citation : Omidshafiei, S., Kim, D.-K., Liu, M., Tesauro, G., Riemer, M., Amato, C., Campbell, M., & How, J. P. (2018). “Learning to Teach in Cooperative Multiagent Reinforcement Learning.” arXiv preprint arXiv:1805.07830.
  • Conference / Journal : Conference on Artificial Intelligence
  • Classification : Impact Factor: 4, H-Index: 212
  • Number of citations : 129 citations (based on Google Scholar metrics)

Summary

The paper introduces a novel framework and algorithm, Learning to Coordinate and Teach Reinforcement (LeCTR), addressing the learning-to-teach problem in cooperative multiagent reinforcement learning (MARL). LeCTR enables agents to learn when and what to advise, allowing peer-to-peer teaching. The algorithm proves effective in heterogeneous team settings with communication costs, outperforming state-of-the-art teaching methods.

Scientific Question: How can intelligent agents in a cooperative MARL setting learn to teach, i.e., determine when to teach and what to teach, in a multiagent learning environment?

Possible avenues (pointed out by the authors)

The authors propose a framework and algorithm called Learning to Coordinate and Teach Reinforcement (LeCTR) to address the learning-to-teach problem in cooperative multiagent reinforcement learning (MARL). The proposed approach focuses on agents learning when and what to advise, using local knowledge to improve team-wide performance and accelerate learning.

Key Questions

  • How can learning-to-teach algorithms improve coordination and task-level learning in cooperative multiagent settings?

  • What strategies and algorithms are effective for knowledge transfer through advising in cooperative MARL, addressing challenges like partial observability and decentralized learning?

  • How can learning-to-teach algorithms adapt to scenarios where agents lack complete knowledge, and how can success factors be addressed?

  • How can the learning-to-teach framework be optimized for specific applications, considering challenges in industrial robotics, communication policy learning, and online/lifelong learning?

Approach

The approach, named Learning to Coordinate and Teach Reinforcement (LeCTR), focuses on solving the advising-level problem (PAdvise).

  • Phase I:
    • Agents learn the task-level problem from scratch using blackbox learning algorithms with the latest advising policies.
  • Phase II:
    • Advising policies are updated based on advising-level rewards correlated with teammates’ task-level learning.
    • No constraints on the heterogeneity of agents’ task-level algorithms.
    • Iterative execution of Phases I and II trains progressively more effective advising policies (see the sketch after this list).
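A minimal sketch of this two-phase alternation is shown below; the phase functions are placeholders standing in for the paper's actual task-level learners and advising-level updates, not LeCTR's real implementation.

```python
# Sketch of LeCTR's alternating two-phase training, as summarised above.
# The phase functions are placeholders, not the paper's actual algorithms.

def run_phase_one(task_learners, advising_policies):
    # Phase I: task-level policies learned from scratch by black-box
    # algorithms, with the latest advising policies in the loop.
    return task_learners

def run_phase_two(task_learners, advising_policies):
    # Phase II: advising policies updated from advising-level rewards
    # correlated with teammates' task-level learning progress.
    return advising_policies

def lectr_training(n_iterations, task_learners, advising_policies):
    for _ in range(n_iterations):
        task_learners = run_phase_one(task_learners, advising_policies)
        advising_policies = run_phase_two(task_learners, advising_policies)
    return task_learners, advising_policies

lectr_training(3, task_learners={}, advising_policies={})
```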

Implementation of the Approach

  • Advising Policies:
    • LeCTR learns advising policies for students and teachers to determine when to request/give advice.
  • Advising-Level Observations:
    • Observations convey task-level state and knowledge more compactly than full policy parameters.
  • Student and Teacher Actions:
    • Students decide to request advice or not.
    • Teachers decide what to advise: an action from the student’s space or a no-advice action.
  • Execution and Local Behavior:
    • Without advice, students execute their local behavior actions.
    • With advice, students execute a transformed local behavior action (a minimal advising-step sketch follows this list).
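One advising step under this scheme could look like the sketch below; the request and advice policies are random placeholders standing in for LeCTR's learned advising policies, and the observations and action space are made up.

```python
import random

# Sketch of a single student/teacher advising step, following the
# description above. The random policies are placeholders only.
NO_ADVICE = None

def student_requests(student_obs):
    return random.random() < 0.5                  # placeholder request policy

def teacher_advises(teacher_obs, student_action_space):
    # The teacher picks an action from the student's space, or declines.
    return random.choice(student_action_space + [NO_ADVICE])

def advising_step(student_obs, teacher_obs, local_action, student_action_space):
    if student_requests(student_obs):
        advice = teacher_advises(teacher_obs, student_action_space)
        if advice is not NO_ADVICE:
            return advice                         # execute the advised action
    return local_action                           # no advice: local behaviour

print(advising_step("s_obs", "t_obs", local_action=0, student_action_space=[0, 1, 2]))
```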

Results:

  • Heterogeneous Teammates:

    • LeCTR outperforms existing methods for heterogeneous scenarios, achieving the highest AUC for all rotations.
  • Communication Cost Impact:
    • Introducing communication cost makes advising unidirectional, impacting action advice probabilities.
  • Overall:
    • LeCTR shows adaptability in heterogeneous scenarios and dynamic advising strategies with communication costs.

Article 4

Adaptive Load Balancing: A Study in Multi-Agent Learning

References and metrics

  • Citation : Schaerf, A., Shoham, Y., & Tennenholtz, M. (1995). “Adaptive Load Balancing: A Study in Multi-Agent Learning.” arXiv preprint cs/9505102.
  • Conference / Journal : JAIR (Journal of Artificial Intelligence Research)
  • Classification : Impact Score: 4.91, H-Index: 127
  • Number of citations : 282 (based on Google Scholar metrics)

Summary

The paper discusses the application of multi-agent reinforcement learning to the problem of load balancing. It introduces a formal framework called a multi-agent, multi-resource stochastic system, involving agents, resources, probabilistic changes in resource capacities, and probabilistic job assignments. The goal is to optimize resource usage while ensuring fairness among agents.

Problem

How should agents autonomously choose resources to optimize global performance measures in a dynamic multi-agent system with changing resource capacities and job assignments?

Possible avenues (pointed out by the authors)

  • Development of Effective Resource-Selection Rules:
    • Investigate and develop resource-selection rules that allow agents to adapt their behavior to changing system conditions
  • Impact of Heterogeneous Agent Behavior:
    • Explore the consequences of different agents using diverse resource-selection rules
  • Role of Communication:
    • Examine the potential benefits of communication among agents in improving system efficiency
  • Comparison with Existing Models:
    • Compare the proposed model with existing models to identify similarities, differences, and potential insights.
  • Evaluation of Learning Automata:
    • Assess the applicability of learning automata in the context of distributed computer systems for adaptive load balancing

Key Questions

  • Optimal Selection Rules:
    • What selection rules (SRs) for agents in a homogeneous system lead to efficient behavior?
  • Comparison of SRs:
    • How do different SRs, including the adaptive ones proposed in the study, compare to each other and to non-adaptive selection rules?
  • Competitive Benchmarking:
    • Can adaptive SRs outperform the non-adaptive SRs that perform best on specific problems?
  • Load-Querying SR Comparison:
    • How do adaptive SRs fare against a load-querying SR, where agents query resource loads and always choose the less crowded resource?

Approach:

  • Learning Rule:
    • The authors employ the Best Choice Selection Rule (BCSR) as their primary learning rule.
  • BCSR Characteristics:
    • BCSR corresponds to a high value of n, meaning the agent consistently selects the resource with the best efficiency estimate at any given point in the adaptive load-balancing context (a toy BCSR sketch follows this list).
  • Restrictions:
    • The discussion is restricted to discrete, synchronous systems, with a job being executable using any of the available resources—a common practice in distributed systems research (Mirchandaney & Stankovic, 1986).
  • Communication Assumption:
    • The authors initially assume no direct communication among agents; they later consider scenarios where each agent can communicate only with some of its neighbors.
  • Heterogeneous/Homogeneous populations:
    • The authors initially assume that all agents use the same SR; they then suggest that mixing different SRs within a given neighbourhood could yield better results.
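As promised above, a toy sketch of the BCSR idea: each agent keeps a per-resource efficiency estimate and deterministically picks the best one. The update rule and the weight `w` are illustrative assumptions, not the paper's exact formulation.

```python
import random

# Sketch of an adaptive selection rule in the spirit of the study: each agent
# keeps an efficiency estimate per resource, and the Best Choice Selection
# Rule (BCSR) greedily picks the resource with the highest estimate. The
# update rule and constants are illustrative, not the paper's exact ones.

class Agent:
    def __init__(self, n_resources, w=0.3):
        self.estimates = [1.0] * n_resources      # per-resource efficiency
        self.w = w                                # weight given to new observations

    def choose(self):
        # BCSR: always pick the currently best-looking resource.
        return max(range(len(self.estimates)), key=lambda r: self.estimates[r])

    def update(self, resource, observed_efficiency):
        old = self.estimates[resource]
        self.estimates[resource] = (1 - self.w) * old + self.w * observed_efficiency

agent = Agent(n_resources=3)
for _ in range(10):
    r = agent.choose()
    agent.update(r, observed_efficiency=random.random())
print(agent.estimates)
```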

Implementation of the approach

The experimental setup followed multiple steps:

  • Fixed Load
    • Fixed load scenario, which, while less dynamic, serves as a foundational test for the adaptability of selection rules.
  • Changing Load
    • Dynamic scenarios, specifically exploring cases where the system’s load, represented by the probability of agent job submissions, evolves over time. Two dynamic settings are introduced: one where the load follows a fixed pattern, and another where the load varies randomly (see the toy sketch after this list).
  • Changing Capacities
    • Scenarios where the capacities of resources can fluctuate over time. Specifically, the results will be presented within the previously outlined dynamic setting.
  • Communication Channels
    • Similar scenarios as above, but with an additional communication channel observable by the agents in a given neighbourhood.
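The two dynamic load settings can be pictured with the toy functions below; the periodic and jittered forms are our own illustrative assumptions, not the paper's exact load processes.

```python
import math
import random

# Toy sketch of the two dynamic load settings described above: a load that
# follows a fixed (here, periodic) pattern versus one with random variation.
# The functional forms are illustrative assumptions only.

def patterned_load(t, base=0.5, amplitude=0.3, period=50):
    return base + amplitude * math.sin(2 * math.pi * t / period)

def random_load(base=0.5, jitter=0.2):
    return min(1.0, max(0.0, base + random.uniform(-jitter, jitter)))

for t in range(3):
    print(t, round(patterned_load(t), 3), round(random_load(), 3))
```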

Results

These results highlight the performance of adaptive selection rules under random load scenarios. The comparison with non-adaptive deterministic rules and the influence of exploration-versus-exploitation dynamics contribute to the understanding of how agents adapt to varying system conditions. The study also suggests that the introduced communication mechanism, while not without merit, does not substantially enhance the adaptive load-balancing system’s performance.

5. Analysis Grid

The grid compares the reference article and the four related articles along five criteria: reinforcement learning, inter-agent communication, performance in the environment, general-purpose learning, and use of a specific framework or learning environment.

  • Reference Article (AlphaStar)
    • Reinforcement learning: yes, specifically multi-agent reinforcement learning (MARL).
    • Inter-agent communication: yes, via an actor-critic model and a common memory buffer.
    • Performed well in the environment: yes, achieved superhuman performance, ranking above 99.8% of players in less than 44 days.
    • General-purpose learning: the model is based on imitation learning, deep learning, and self-play.
    • Specific framework / environment: a dedicated framework was designed by Google DeepMind for the AlphaStar model.
  • Related Article 1 (AlphaGo Zero)
    • Reinforcement learning: single-agent reinforcement learning.
    • Inter-agent communication: the only communication between two iterations is the actor-critic feedback.
    • Performed well in the environment: outperformed every other model without the help of human data.
    • General-purpose learning: algorithm based solely on self-play and Monte Carlo tree search.
    • Specific framework / environment: yes, Google DeepMind released a specific learning environment to evaluate new self-play algorithms on Go.
  • Related Article 2 (The Hanabi Challenge)
    • Reinforcement learning: specifically MARL, alongside hard-coded agents.
    • Inter-agent communication: actor-critic agents with feedback; parameters shared among DQL-based agents.
    • Performed well in the environment: RL-based agents are outperformed; the rule-based WTFWThat reached a 91.5% success rate.
    • General-purpose learning: agents based on reinforcement learning.
    • Specific framework / environment: hanabi_learning_environment, which allows teams to conduct Hanabi experiments with new algorithms.
  • Related Article 3 (Learning to Teach)
    • Reinforcement learning: yes, specifically MARL.
    • Inter-agent communication: provided by the LeCTR algorithm, which allows agents to learn when to teach/learn.
    • Performed well in the environment: with heterogeneous teammates, LeCTR outperformed existing methods, even with a communication cost.
    • General-purpose learning: black-box learning and reinforcement learning.
    • Specific framework / environment: no specific framework, but the algorithm is available in a repository.
  • Related Article 4 (Adaptive Load Balancing)
    • Reinforcement learning: single-agent reinforcement learning and MARL.
    • Inter-agent communication: introduced later through the information carried by other agents in a neighbourhood.
    • Performed well in the environment: adaptive selection rules perform well on random loads; the communication mechanism does not make a big difference.
    • General-purpose learning: illustrated by adaptive policies and the BCSR.
    • Specific framework / environment: no specific environment, but the algorithm is available in a repository.

6. Research Question

Building on the articles studied above, our work will focus on multi-agent reinforcement learning using self-play and inter-agent communication (LeCTR-like) in an RL environment such as the hanabi_learning_environment or the StarCraft II learning environment.

The problem is the following: how can novel inter-agent communication techniques such as LeCTR be applied in a MARL setting, combined with general-purpose learning techniques such as self-play, to perform well in complex environments like Hanabi, load balancing, or StarCraft II?

7. Planned method for the proposed research

To answer our research question, we will undertake an extensive set of experiments on TPU-powered machines to train the various models in the environments defined above.

Resources

  • We will need to address a special request to Google DeepMind to access their learning environment for StarCraft II and the structure of their self-play algorithm.
  • The other environments or algorithms like Hanabi or LeCTR are available online.
  • We will need to implement a simple instance of the Load-Balancing environment.
  • Finally, we will need to allocate a certain number of TPU-powered nodes on a server for our models to train efficiently.

Implementation

  • The main implementation difficulty will be linking the self-play algorithm to the LeCTR algorithm and adapting them to the different environments (e.g. the StarCraft II and Hanabi environments do not have the same observation and action spaces); see the adapter sketch after this list.
  • The implementation of hard-coded agents to establish a lower-bound performance standard will be more time-consuming than the rest of the algorithms.
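The adapter layer mentioned above could take the following shape; the class and method names are working assumptions of ours, not an existing API.

```python
from abc import ABC, abstractmethod

# Sketch of the environment adapter needed so that one self-play + LeCTR
# training loop can run on environments with different observation and
# action spaces. Names are our own working assumptions, not an existing API.

class EnvAdapter(ABC):
    @abstractmethod
    def reset(self):
        """Return the initial joint observation in a common format."""

    @abstractmethod
    def legal_actions(self, player):
        """Return the legal actions of `player` as integer ids."""

    @abstractmethod
    def step(self, joint_action):
        """Apply a joint action; return (observations, rewards, done)."""

class HanabiAdapter(EnvAdapter):
    """Would wrap hanabi_learning_environment's rl_env behind the interface."""

class SC2Adapter(EnvAdapter):
    """Would wrap the StarCraft II learning environment in the same way."""
```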

Evaluation

  • The assessment of the models’ performance will be identical to what has been proposed in the articles mentioned above. For novelty purposes, we will also take into account the computational cost of the models and try to make them as lightweight as possible. We will need to create various scenarios (communication vs. no communication, ad-hoc teams vs. pre-trained teams).

This protocol will give us an overview of what is possible with, on the one hand, a self-play algorithm that spares us from relying on large amounts of human data and, on the other hand, a novel learning-based communication protocol between agents that has shown promise in a MARL context.
