ML-Agents Platformer: Visual Coin Collector
Intro
In this tutorial, you’ll learn how to use a camera for visual input to your agent instead of ray perception.
Prerequisites
If you didn’t come from the ML-Agents Platformer: Simple Coin Collector tutorial, make sure to do that first so that you’re starting from the same spot.
Companion YouTube Video
Check out the companion YouTube video, where I explain the whole thing in more detail and with extra context.
Camera Vision | Unity ML-Agents
Replace Ray Perception with a Camera
First, remove the Ray Perception Sensor from the agent.
Next, make sure Use Child Sensors is enabled on the agent's Behavior Parameters component.
Note: if you still have an NNModel hooked up from a training run that didn’t include any Visual Observations, you will get a warning. The warning in the image above is basically saying: “Hey, this neural network wasn’t trained with camera input, so it won’t work the way you expect.”
Add a Camera object as a child of the Character object and set the following parameters (a scripted version of these steps is sketched after this list):
Position: 0, 1.1, -0.75
Rotation: 15, 0, 0
Field of View: 90
Remove the Audio Listener component.
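If you prefer to set the camera up from a script rather than the Inspector, a minimal sketch could look like this. It assumes the agent has a child Transform named Character, as in the previous tutorial, and the script name is just a placeholder:

using UnityEngine;

// Sketch only: creates the agent camera as a child of the Character object
// with the values listed above. Attach this to the agent GameObject.
public class AgentCameraSetup : MonoBehaviour
{
    void Awake()
    {
        // Find the Character child and parent a new Camera object under it.
        Transform character = transform.Find("Character");
        GameObject cameraObject = new GameObject("Camera");
        cameraObject.transform.SetParent(character, false);

        // Match the Inspector values: position, rotation, and field of view.
        cameraObject.transform.localPosition = new Vector3(0f, 1.1f, -0.75f);
        cameraObject.transform.localEulerAngles = new Vector3(15f, 0f, 0f);
        Camera agentCamera = cameraObject.AddComponent<Camera>();
        agentCamera.fieldOfView = 90f;

        // No AudioListener is added here, which mirrors removing it in the editor.
    }
}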
Add a Camera Sensor to the Camera object (a code equivalent is sketched after these settings):
Drag the Camera object from the Hierarchy to the Camera field
SensorName: CameraSensor
Width/Height: 84/84
Grayscale: Disabled
Observation Stacks: 2
Compression: PNG
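These settings are normally applied in the Inspector, but here is a sketch of the same configuration done from code, in case a reference is useful. The property names match recent ML-Agents releases and may differ slightly in older versions, so treat this as an approximation:

using Unity.MLAgents.Sensors;
using UnityEngine;

// Sketch only: adds a Camera Sensor with the values listed above.
public class AgentCameraSensorSetup : MonoBehaviour
{
    // Drag the child Camera object here, mirroring the Inspector step.
    public Camera agentCamera;

    void Awake()
    {
        var sensor = agentCamera.gameObject.AddComponent<CameraSensorComponent>();
        sensor.Camera = agentCamera;
        sensor.SensorName = "CameraSensor";
        sensor.Width = 84;
        sensor.Height = 84;
        sensor.Grayscale = false;
        sensor.ObservationStacks = 2;
        sensor.CompressionType = SensorCompressionType.PNG;
    }
}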
This will create an 84 × 84 pixel RGB image that gets passed into your neural net. Setting Observation Stacks to 2 means the sensor passes in both the current frame and the previous frame, which works out to 84 × 84 × 3 × 2 = 42,336 input values per decision. While this may not be necessary, in theory it allows the agent to detect motion.
If you want to see roughly what the neural network sees, set up a new square (1:1) aspect ratio for the Game view.
Training
In the previous tutorial, we set up our config file to use the simple visual encoder:
network_settings:
  vis_encode_type: simple
According to the ML-Agents documentation:
(default = simple) Encoder type for encoding visual observations. simple (default) uses a simple encoder which consists of two convolutional layers, nature_cnn uses the CNN implementation proposed by Mnih et al., consisting of three convolutional layers, and resnet uses the IMPALA Resnet consisting of three stacked layers, each with two residual blocks, making a much larger network than the other two. match3 is a smaller CNN (Gudmundsson et al.) that is optimized for board games, and can be used down to visual observation sizes of 5x5.
So by using the simple encoder, we’re not doing anything fancy. If you’re interested in more advanced visual encoding, you may want to experiment with the other encoder types.
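For context, here is a sketch of roughly where that setting lives in a full trainer config. The behavior name CoinCollector is a placeholder, so keep the name and the rest of the hyperparameters from the config you built in the previous tutorial:

behaviors:
  CoinCollector:              # placeholder behavior name; use the one from your existing config
    trainer_type: ppo
    network_settings:
      vis_encode_type: simple # swap for nature_cnn, resnet, or match3 to try the other encoders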
The training command is identical to any other training run.
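For example, a run might look something like this, where the config path and run ID are placeholders for whatever you named yours:

mlagents-learn ./config/CoinCollector.yaml --run-id=VisualCoinCollector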
It can take longer to train with visual inputs for a couple of reasons:
The more cameras rendering per scene, the lower the framerate, which slows down the simulation
The neural network is much larger, because of the many pixel inputs and the added convolutional layers
For reference, with 8 simultaneous training areas, mine finished training in about 15 minutes.