ML-Agents Platformer: Curriculum Training for an Advanced Puzzle
Intro
In this tutorial, you’ll learn to use Curriculum when training Unity ML-Agents. Curriculum lets you start training your agent on a simpler task, then gradually make the task more difficult as the agent masters each stage.
The goal is not to hand you a complete project, but to show you how to apply curriculum to your own project. I provide the code I used as a demonstration.
In this example, I set up a green block that the agent needs to slide into place to use as a platform for accessing the upper level. If I start training with the block too far away, the agent never learns to push it into place. However, if I start training with the block already in place and gradually move it farther away, it slowly learns how to push the block into place. Eventually, the green block can start as a doorway that must be pushed inward to complete the puzzle.
Prerequisites
If you didn’t come from the ML-Agents Platformer: Simple Coin Collector and ML-Agents Platformer: Visual Coin Collector tutorials, make sure to do those first so that you’re starting from the same spot.
Create an Interesting Puzzle
This may require some creativity if you don’t already have a task in mind that you think could benefit from curriculum. In my case, I made a castle with a series of Blue Ramps inside and a large gap between the lower level and upper level. Then I created a Green Block that could be slid into place to access the upper level. Red Coins give a small amount of reward. Touching the Sun at the top gives a large reward and ends the episode, restarting the puzzle.
With the Green Block in place, the agent is able to jump up to the second level platform and continue to collect Red Coins.
The ultimate starting place of the sliding Green Block is as a doorway that is aligned with the gap between the lower and upper levels.
To keep things simple, I locked the position of the sliding block so that it could only slide forward and backward and would not rotate.
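One way to apply that kind of lock from code is sketched below. This is only an illustration: the same settings can be ticked in the Constraints section of the Rigidbody in the Inspector, and the tutorial does not say which approach was used. The SlidingBlockConstraints component name is hypothetical, and the sketch assumes the block slides along the world Z axis, which matches the Vector3.forward offset used in CastleArea.cs.

```csharp
using UnityEngine;

// Hypothetical helper component (not part of the original project).
// Attach it to the Green Block so that it can only slide forward/backward.
[RequireComponent(typeof(Rigidbody))]
public class SlidingBlockConstraints : MonoBehaviour
{
    private void Awake()
    {
        Rigidbody body = GetComponent<Rigidbody>();

        // Freeze sideways (X) and vertical (Y) movement plus all rotation,
        // leaving only translation along the Z axis free.
        // Assumption: the slide direction is world Z.
        body.constraints = RigidbodyConstraints.FreezePositionX
                         | RigidbodyConstraints.FreezePositionY
                         | RigidbodyConstraints.FreezeRotation;
    }
}
```

If your block slides along a different axis, swap the frozen position flags accordingly.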
Implement your Puzzle
Here’s some sample code I used for implementing this Green Block/Red Coin/Sun castle puzzle.
- The list of collectibles will contain all of the Red Coins 
- The block is the Green Block 
- The block offset is the distance to offset the Green Block from the most helpful final position (allowing easy access to the upper level). 5 meters places it in the doorway. 
In the Awake() function, we save the original block position and rotation for future use, then apply the offset supplied by the curriculum.
In the ResetArea() function, we re-activate each collectible (the Red Coins) and reset the Green Block to its original position/rotation plus an offset supplied by the curriculum.
The Collect() function is called by the agent when it touches a collectible Red Coin.
I’m assuming your own project will be different from mine, but hopefully this example can demonstrate how to use curriculum.
Curriculum
We access curriculum values through Academy.Instance.EnvironmentParameters. The GetWithDefault() function attempts to get a value defined in our config .yaml file while training. If it can’t find one, it falls back to the second parameter (blockOffset in this case). While training, the Academy has a value to give us because it knows which stage of the curriculum we are on. When running inference later, it uses the default you supply.
Take a look below at the AdvancedCollector.yaml file. It has a section called “environment_parameters” which has a child called “block_offset”. This is the value we read when we pass "block_offset" as the first parameter of GetWithDefault("block_offset", blockOffset).
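As a minimal sketch of that lookup in isolation, the component below reads the parameter once per episode, stores it, and logs it so you can watch the curriculum at work. BlockOffsetReader, defaultOffset, CurrentOffset and ReadOffset() are hypothetical names, not part of the original project. It reads the value once and reuses it because, as far as I understand, a parameter backed by a sampler can return a different value on each call.

```csharp
using Unity.MLAgents;
using UnityEngine;

// Hypothetical helper (not part of the original project) that reads the
// curriculum value once per reset and keeps it for later inspection.
public class BlockOffsetReader : MonoBehaviour
{
    // Fallback used when no trainer provides "block_offset", e.g. during inference.
    [Range(0, 5)]
    public float defaultOffset = 5f;

    // The offset chosen for the current episode.
    public float CurrentOffset { get; private set; }

    // Call once at the start of each episode, e.g. from a reset method.
    public float ReadOffset()
    {
        EnvironmentParameters parameters = Academy.Instance.EnvironmentParameters;

        // The first argument must match the key under environment_parameters
        // in the .yaml file; the second is returned when that key is missing.
        CurrentOffset = parameters.GetWithDefault("block_offset", defaultOffset);

        Debug.Log($"block_offset for this episode: {CurrentOffset}");
        return CurrentOffset;
    }
}
```

Pressing Play in the Editor without a trainer attached should log the default, which is a quick way to confirm the fallback path.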
CastleArea.cs
using System.Collections.Generic;
using Unity.MLAgents;
using UnityEngine;
public class CastleArea : MonoBehaviour
{
    public List<GameObject> collectibles;
    public GameObject block;
    [Range(0, 5)]
    public float blockOffset = 5f;
    private Vector3 originalBlockPosition;
    private Quaternion originalBlockRotation;
    private void Awake()
    {
        originalBlockPosition = block.transform.position;
        originalBlockRotation = block.transform.rotation;
        block.transform.position += Vector3.forward * Academy.Instance.EnvironmentParameters.GetWithDefault("block_offset", blockOffset);
    }
    public void ResetArea()
    {
        foreach (GameObject col in collectibles) col.SetActive(true);
        block.transform.position = originalBlockPosition + Vector3.forward * Academy.Instance.EnvironmentParameters.GetWithDefault("block_offset", blockOffset);
        block.transform.rotation = originalBlockRotation;
    }
    internal void Collect(GameObject gameObject)
    {
        gameObject.SetActive(false);
    }
}
AdvancedCollectorAgent.cs
using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using UnityEngine;
public class AdvancedCollectorAgent : Agent
{
    private Vector3 startPosition;
    private SimpleCharacterController characterController;
    new private Rigidbody rigidbody;
    private CastleArea castleArea;
    /// <summary>
    /// Called once when the agent is first initialized
    /// </summary>
    public override void Initialize()
    {
        startPosition = transform.position;
        characterController = GetComponent<SimpleCharacterController>();
        rigidbody = GetComponent<Rigidbody>();
        castleArea = GetComponentInParent<CastleArea>();
    }
    /// <summary>
    /// Called every time an episode begins. This is where we reset the challenge.
    /// </summary>
    public override void OnEpisodeBegin()
    {
        // Reset agent position, rotation
        transform.position = startPosition;
        transform.rotation = Quaternion.Euler(Vector3.up * Random.Range(0f, 360f));
        rigidbody.velocity = Vector3.zero;
        castleArea.ResetArea();
    }
    /// <summary>
    /// Controls the agent with human input
    /// </summary>
    /// <param name="actionsOut">The actions parsed from keyboard input</param>
    public override void Heuristic(in ActionBuffers actionsOut)
    {
        // Read input values and round them. GetAxisRaw works better in this case
        // because of the DecisionRequester, which only gets new decisions periodically.
        int vertical = Mathf.RoundToInt(Input.GetAxisRaw("Vertical"));
        int horizontal = Mathf.RoundToInt(Input.GetAxisRaw("Horizontal"));
        bool jump = Input.GetKey(KeyCode.Space);
        // Convert the actions to Discrete choices (0, 1, 2)
        ActionSegment<int> actions = actionsOut.DiscreteActions;
        actions[0] = vertical >= 0 ? vertical : 2;
        actions[1] = horizontal >= 0 ? horizontal : 2;
        actions[2] = jump ? 1 : 0;
    }
    /// <summary>
    /// React to actions coming from either the neural net or human input
    /// </summary>
    /// <param name="actions">The actions received</param>
    public override void OnActionReceived(ActionBuffers actions)
    {
        // Punish and end episode if the agent strays too far
        if (Vector3.Distance(startPosition, transform.position) > 20f)
        {
            AddReward(-1f);
            EndEpisode();
        }
        // Convert actions from Discrete (0, 1, 2) to expected input values (-1, 0, +1)
        // of the character controller
        float vertical = actions.DiscreteActions[0] <= 1 ? actions.DiscreteActions[0] : -1;
        float horizontal = actions.DiscreteActions[1] <= 1 ? actions.DiscreteActions[1] : -1;
        bool jump = actions.DiscreteActions[2] > 0;
        characterController.ForwardInput = vertical;
        characterController.TurnInput = horizontal;
        characterController.JumpInput = jump;
        // Small bonus for moving forward, at most 0.2 per episode (requires MaxStep > 0)
        if (vertical > 0f) AddReward(.2f / MaxStep);
    }
    /// <summary>
    /// Respond to entering a trigger collider
    /// </summary>
    /// <param name="other">The object (with trigger collider) that was touched</param>
    private void OnTriggerEnter(Collider other)
    {
        // If the other object is a collectible, reward the agent and deactivate the coin
        if (other.CompareTag("collectible"))
        {
            AddReward(1f / castleArea.collectibles.Count);
            castleArea.Collect(other.gameObject);
        }
        // If it is the goal, give the large reward and end the episode
        else if (other.CompareTag("goal"))
        {
            AddReward(1f);
            EndEpisode();
        }
    }
}
AdvancedCollector.yaml
behaviors:
  AdvancedCollector:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 256
        learning_rate: 3.0e-4
    keep_checkpoints: 5
    max_steps: 20000000
    time_horizon: 128
    summary_freq: 20000
    threaded: true
environment_parameters:
  block_offset:
    curriculum:
      - name: Lesson0 # This is the start of the first lesson
        completion_criteria:
          measure: reward
          behavior: AdvancedCollector
          signal_smoothing: true
          min_lesson_length: 100
          threshold: 1.5
        value:
          sampler_type: uniform
          sampler_parameters:
            min_value: 0.0
            max_value: 1.0
      - name: Lesson1
        completion_criteria:
          measure: reward
          behavior: AdvancedCollector
          signal_smoothing: true
          min_lesson_length: 100
          threshold: 1.8
        value:
          sampler_type: uniform
          sampler_parameters:
            min_value: 0.0
            max_value: 2.0
      - name: Lesson2
        completion_criteria:
          measure: reward
          behavior: AdvancedCollector
          signal_smoothing: true
          min_lesson_length: 100
          threshold: 1.8
        value:
          sampler_type: uniform
          sampler_parameters:
            min_value: 0.5
            max_value: 3.0
      - name: Lesson3
        completion_criteria:
          measure: reward
          behavior: AdvancedCollector
          signal_smoothing: true
          min_lesson_length: 100
          threshold: 1.8
        value:
          sampler_type: uniform
          sampler_parameters:
            min_value: 2.0
            max_value: 5.0
      - name: Lesson4
        value: 5.0
Training
Aside from using the new config .yaml file, the training command is the same as in the previous tutorials. The difference is that as your agent gets past the reward thresholds you set, the lessons increase in difficulty. In TensorBoard, you can see the Lesson Number increasing in the far right chart. Some lessons go much faster than others, and every training run will look a bit different due to randomness.
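When picking thresholds, it helps to work out how much reward a single episode can actually produce. The sketch below adds up the rewards handed out in AdvancedCollectorAgent.cs. RewardBudget is a hypothetical helper written only for this arithmetic, and it assumes MaxStep is greater than zero and that the reward measure tracks the rewards given through AddReward().

```csharp
// Back-of-the-envelope reward budget for one episode, based on the reward
// values in AdvancedCollectorAgent.cs. Hypothetical helper, not project code.
public static class RewardBudget
{
    // Each coin pays 1 / coinCount, so collecting every coin totals 1.
    public const float AllCoins = 1f;

    // Touching the Sun (tagged "goal") pays 1 and ends the episode.
    public const float Goal = 1f;

    // The forward bonus is 0.2 / MaxStep per action, so at most 0.2 per episode.
    public const float MaxForwardBonus = 0.2f;

    // Best case if the agent never reaches the Sun: about 1.2.
    public static float BestWithoutGoal => AllCoins + MaxForwardBonus;

    // Best case if the agent collects everything and reaches the Sun: about 2.2.
    public static float BestWithGoal => AllCoins + Goal + MaxForwardBonus;
}
```

Under these assumptions, an episode that never reaches the Sun tops out around 1.2, below the Lesson0 threshold of 1.5, so the agent cannot advance by collecting coins alone. The 1.8 thresholds of the later lessons sit below the roughly 2.2 maximum, which leaves room for a few missed coins.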
License
All code on this page is licensed under the MIT License.
MIT License
Copyright (c) 2021 Immersive Limit LLC
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE. 