Following the environment construction in the previous article, we will now actually run reinforcement learning with Unity! Previous article: https://qiita.com/nol_miryuu/items/32cda0f5b7172197bb09
Some basic knowledge of Unity is required (how to create and name objects)
Goal: train the AI so that the blue sphere (the agent) quickly reaches the yellow box (the Target) without falling off the floor.
State: Vector Observation (size = 8)
・Target's X, Y, Z coordinates (3 values)
・RollerAgent's X, Y, Z coordinates (3 values)
・RollerAgent's X and Z velocities (2 values; the Y velocity is excluded because the agent does not move in the Y direction)

Action: Continuous (size = 2)
・0: force applied to the RollerAgent in the X direction
・1: force applied to the RollerAgent in the Z direction

Reward:
・When the RollerAgent reaches the Target (the distance between them approaches 0), a reward of +1.0 is given and the episode ends.
・When the RollerAgent falls off the floor (its Y position drops below 0), the episode ends without any reward.

Decision: every 10 steps
Reinforcement learning cycle (the process executed every step): state acquisition → action decision → action execution and reward acquisition → policy update
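To see where each stage of this cycle lives in code, here is a minimal skeleton (a sketch only; the full script appears later in this article, and the method names assume the same Unity.MLAgents version used there):

using Unity.MLAgents;
using Unity.MLAgents.Sensors;

// Sketch: which Agent callback corresponds to each stage of the cycle.
public class CycleSketch : Agent
{
    // State acquisition: report observations to the policy
    public override void CollectObservations(VectorSensor sensor) { }

    // Action execution and reward acquisition: apply the decided action,
    // then call AddReward() / EndEpisode() as needed
    public override void OnActionReceived(float[] vectorAction) { }

    // The action decision itself is made by the policy (Behavior Parameters),
    // and the policy update happens in the external mlagents-learn trainer.
}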






In the Package Manager window, press the "+" button in the upper left and select "Add package from disk..."

Go to the directory you created last time and select ml-agents/com.unity.ml-agents/package.json

Attach the following components to the RollerAgent:
・Rigidbody: the physics simulation mechanism
・Behavior Parameters: sets the agent's state and action data
・Decision Requester: sets how many steps pass between "decision" requests. A step is executed every 0.02 seconds by default, so with a Decision Period of 5 a decision is made every 5 × 0.02 = 0.1 seconds, and with 10 every 10 × 0.02 = 0.2 seconds.

Finally, configure the RollerAgent as shown in the figure below.
Rigidbody

Behavior Parameters
・Behavior Name: RollerBall (the trained model is generated under this name)
・Vector Observation > Space Size: 8 (number of observed values)
・Vector Action > Space Type: Continuous (type of action)
・Vector Action > Space Size: 2 (number of action values)

Decision Requester
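The Decision Period can also be set from a script instead of the Inspector. A minimal sketch, assuming the DecisionRequester component and its DecisionPeriod field from the Unity.MLAgents namespace (attach this to the same GameObject as the agent, since DecisionRequester requires an Agent component):

using UnityEngine;
using Unity.MLAgents;

// Sketch: request a "decision" every 10 steps (10 x 0.02 s = every 0.2 seconds).
public class SetupDecisionRequester : MonoBehaviour
{
    void Awake()
    {
        var requester = gameObject.AddComponent<DecisionRequester>();
        requester.DecisionPeriod = 10;
    }
}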

Override the following methods of the Agent class:
・Initialize(): called only once, when the agent's GameObject is created
・OnEpisodeBegin(): called at the beginning of each episode
・CollectObservations(VectorSensor sensor): sets the observation data passed to the policy
・OnActionReceived(float[] vectorAction): executes the decided action, obtains the reward, and ends the episode
RollerAgent.cs
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
public class RollerAgent : Agent
{
    public Transform target;
    Rigidbody rBody;
    public override void Initialize()
    {
        this.rBody = GetComponent<Rigidbody>();
    }
    //Called at the beginning of the episode
    public override void OnEpisodeBegin()
    {
        if (this.transform.position.y < 0) // If the RollerAgent (sphere) has fallen off the floor, reset the following
        {
            this.rBody.angularVelocity = Vector3.zero; // Reset angular velocity
            this.rBody.velocity = Vector3.zero;        // Reset velocity
            this.transform.position = new Vector3(0.0f, 0.5f, 0.0f); // Reset position
        }
        }
        //Reset the position of the Target (cube)
        target.position = new Vector3(Random.value * 8 - 4, 0.5f, Random.value * 8 - 4);
    }
    //Set the observation data (8 items) to be passed to the agent
    public override void CollectObservations(VectorSensor sensor)
    {
        sensor.AddObservation(target.position); //XYZ coordinates of Target (cube)
        sensor.AddObservation(this.transform.position); //Roller Agent XYZ coordinates
        sensor.AddObservation(rBody.velocity.x); //Roller Agent X-axis velocity
        sensor.AddObservation(rBody.velocity.z); //Roller Agent Z-axis velocity
    }
    //Called when performing an action
    public override void OnActionReceived(float[] vectorAction)
    {
        //Apply force to the RollerAgent
        Vector3 controlSignal = Vector3.zero;
        controlSignal.x = vectorAction[0]; // vectorAction[0] is the force applied in the X direction (-1.0 to +1.0), decided by the policy
        controlSignal.z = vectorAction[1]; // vectorAction[1] is the force applied in the Z direction (-1.0 to +1.0), decided by the policy
        rBody.AddForce(controlSignal * 10);
        //Measure the distance between Roller Agent and Target
        float distanceToTarget = Vector3.Distance(this.transform.position, target.position);
        //When the Roller Agent arrives at the Target position
        if(distanceToTarget < 1.42f)
        {
            AddReward(1.0f); //Give a reward
            EndEpisode(); //Complete the episode
        }
        //When the Roller Agent falls off the floor
        if(this.transform.position.y < 0)
        {
            EndEpisode(); //Complete the episode without rewarding
        }
    }
}
Max Step: the episode ends automatically when the number of steps in the episode exceeds this value. Set Max Step to 1000, and assign the yellow box "Target" to the Target field.
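Max Step can also be set from code; in the ML-Agents version used here, Agent exposes it as a public MaxStep field. A minimal sketch, reusing the Initialize() override from the script above:

public override void Initialize()
{
    this.rBody = GetComponent<Rigidbody>();
    MaxStep = 1000; // same effect as setting Max Step to 1000 in the Inspector
}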
・Create a sample directory in ml-agents/config/
・Create a RollerBall.yaml file inside it, with the following contents
Hyperparameters (training configuration file, extension .yaml)
・Parameters used for training
・Need to be tuned by a human
・The available settings differ for each reinforcement learning algorithm (PPO / SAC)
RollerBall.yaml
behaviors:
  RollerBall:
    trainer_type: ppo
    summary_freq: 1000
    hyperparameters:
      batch_size: 10
      buffer_size: 100
      learning_rate: 0.0003
      learning_rate_schedule: linear
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
    network_settings:
      normalize: true
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5
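As a quick sanity check on gamma: a discount factor of 0.99 weights a reward received k steps in the future by 0.99^k, so a reward 100 steps away is worth about 0.99^100 ≈ 0.37 of an immediate one. Values closer to 1.0 make the agent more far-sighted.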
Activate the virtual environment created in the previous Qiita article
terminal
poetry shell
Execute the following command in the ml-agents directory
terminal
mlagents-learn config/sample/RollerBall.yaml --run-id=model01
The trailing model01 is the run ID; give each new training run its own ID.
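For example, a second training run could be started under a new ID (model02 here is just an illustrative name):

terminal
mlagents-learn config/sample/RollerBall.yaml --run-id=model02

After launching the command, the trainer waits with the following message: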
terminal
Start training by pressing the Play button in the Unity Editor.
When this message appears in the terminal, go back to Unity and press the Play button to start training.
Statistics are displayed in the terminal every summary_freq steps (1,000 in the configuration above). Mean Reward is the average reward: the higher the value, the better the trained policy. When it reaches 1.0, stop the training.