ML-Agents Platformer: Visual Coin Collector
Intro
In this tutorial, you’ll learn how to use a camera for visual input to your agent instead of ray perception.
Prerequisites
If you didn’t come from the ML-Agents Platformer: Simple Coin Collector tutorial, make sure to do that first so that you’re starting from the same spot.
Companion YouTube Video
Check out the companion YouTube video, where I explain the whole thing in more detail and with extra context.
Camera Vision | Unity ML-Agents
Replace Ray Perception with a Camera
First, remove the Ray Perception Sensor from the agent.
Next, make sure Use Child Sensors is enabled on the agent's Behavior Parameters component.
Note: if you still have an NNModel hooked up from a training run that didn’t include any Visual Observations, you will get a warning. The warning in the image above is basically saying: “Hey, this neural network wasn’t trained with camera input, so it won’t work the way you expect.”
Add a Camera object as a child of the Character object and set the following parameters (a scripted version of these steps is sketched after this list):
Position: 0, 1.1, -0.75
Rotation: 15, 0, 0
Field of View: 90
Remove the Audio Listener component.
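If you prefer to set the camera up from a script rather than the Inspector, a minimal sketch could look like this. It assumes the agent has a child Transform named Character, as in the previous tutorial, and the script name is just a placeholder:

using UnityEngine;

// Sketch only: creates the agent camera as a child of the Character object
// with the values listed above. Attach this to the agent GameObject.
public class AgentCameraSetup : MonoBehaviour
{
    void Awake()
    {
        // Find the Character child and parent a new Camera object under it.
        Transform character = transform.Find("Character");
        GameObject cameraObject = new GameObject("Camera");
        cameraObject.transform.SetParent(character, false);

        // Match the Inspector values: position, rotation, and field of view.
        cameraObject.transform.localPosition = new Vector3(0f, 1.1f, -0.75f);
        cameraObject.transform.localEulerAngles = new Vector3(15f, 0f, 0f);
        Camera agentCamera = cameraObject.AddComponent<Camera>();
        agentCamera.fieldOfView = 90f;

        // No AudioListener is added here, which mirrors removing it in the editor.
    }
}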
Add a Camera Sensor to the Camera object (a code equivalent is sketched after these settings):
Drag the Camera object from the Hierarchy to the Camera field
SensorName: CameraSensor
Width/Height: 84/84
Grayscale: Disabled
Observation Stacks: 2
Compression: PNG
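These settings are normally applied in the Inspector, but here is a sketch of the same configuration done from code, in case a reference is useful. The property names match recent ML-Agents releases and may differ slightly in older versions, so treat this as an approximation:

using Unity.MLAgents.Sensors;
using UnityEngine;

// Sketch only: adds a Camera Sensor with the values listed above.
public class AgentCameraSensorSetup : MonoBehaviour
{
    // Drag the child Camera object here, mirroring the Inspector step.
    public Camera agentCamera;

    void Awake()
    {
        var sensor = agentCamera.gameObject.AddComponent<CameraSensorComponent>();
        sensor.Camera = agentCamera;
        sensor.SensorName = "CameraSensor";
        sensor.Width = 84;
        sensor.Height = 84;
        sensor.Grayscale = false;
        sensor.ObservationStacks = 2;
        sensor.CompressionType = SensorCompressionType.PNG;
    }
}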
This will create an 84 × 84 pixel RGB image that gets passed into your neural net. Setting Observation Stacks to 2 means the sensor passes in both the current frame and the previous frame, which works out to 84 × 84 × 3 × 2 = 42,336 input values per decision. While this may not be necessary, in theory it allows the agent to detect motion.
If you want to see roughly what the neural network sees, set up a new square (1:1) aspect ratio for the Game view.
Training
In the previous tutorial, we set up our config file to use the simple visual encoder:
network_settings:
  vis_encode_type: simple
According to the ML-Agents documentation:
(default = simple) Encoder type for encoding visual observations. simple (default) uses a simple encoder which consists of two convolutional layers, nature_cnn uses the CNN implementation proposed by Mnih et al., consisting of three convolutional layers, and resnet uses the IMPALA Resnet consisting of three stacked layers, each with two residual blocks, making a much larger network than the other two. match3 is a smaller CNN (Gudmundsson et al.) that is optimized for board games, and can be used down to visual observation sizes of 5x5.
So by using the simple encoder, we’re not doing anything fancy. If you’re interested in more advanced visual encoding, you may want to experiment with the other encoder types.
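For context, here is a sketch of roughly where that setting lives in a full trainer config. The behavior name CoinCollector is a placeholder, so keep the name and the rest of the hyperparameters from the config you built in the previous tutorial:

behaviors:
  CoinCollector:              # placeholder behavior name; use the one from your existing config
    trainer_type: ppo
    network_settings:
      vis_encode_type: simple # swap for nature_cnn, resnet, or match3 to try the other encoders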
The training command is identical to any other training run.
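For example, a run might look something like this, where the config path and run ID are placeholders for whatever you named yours:

mlagents-learn ./config/CoinCollector.yaml --run-id=VisualCoinCollector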
It can take longer to train with visual inputs for a couple of reasons:
The more cameras rendering per scene, the lower the framerate, which slows down the simulation
The neural network is much larger, because of the many pixel inputs and the added convolutional layers
For reference, with 8 simultaneous training areas, mine finished training in about 15 minutes.