Following the environment setup in the previous article, this time we will actually run reinforcement learning in Unity! The previous article is here: https://qiita.com/nol_miryuu/items/32cda0f5b7172197bb09
Some basic knowledge of Unity is required (how to create and name objects)
Train the AI so that the blue sphere (the agent) quickly approaches the yellow box (the Target) without falling off the floor.
State: Vector Observation (size = 8)
- Target's X, Y, Z coordinates (3 values)
- RollerAgent's X, Y, Z coordinates (3 values)
- RollerAgent's X and Z velocities (2 values; the Y velocity is excluded because the agent does not move in the Y direction)

Action: Continuous (size = 2)
- 0: force applied to the RollerAgent in the X direction
- 1: force applied to the RollerAgent in the Z direction

Reward:
- When the RollerAgent approaches the Target (the distance between them approaches 0), a reward of +1.0 is given and the episode ends.
- When the RollerAgent falls off the floor (its Y position becomes less than 0), the episode ends without any reward.

Decision:
- Every 10 steps
Reinforcement learning cycle (the process executed at each step): state acquisition → action decision → action execution and reward acquisition → policy update
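For orientation, the following sketch shows how one pass of this cycle maps onto the ML-Agents Agent callbacks implemented later in this article (the policy update itself is performed by the external Python trainer, not by your C# code):

// One step of the reinforcement learning cycle, mapped to the Agent callbacks (sketch)
//   State acquisition            -> CollectObservations(VectorSensor sensor)
//   Action decision              -> the policy decides vectorAction (outside your C# code)
//   Action execution and reward  -> OnActionReceived(float[] vectorAction) + AddReward(...)
//   Policy update                -> performed by mlagents-learn (Python) during training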
Press the "+" button in the upper left of the Package Manager window and select "Add package from disk..."
Go to the directory you created last time and select ml-agents/com.unity.ml-agents/package.json
Add the following components to the RollerAgent:
- Rigidbody: enables physics simulation
- Behavior Parameters: holds the agent's observation (state) and action settings
- Decision Requester: sets how many steps pass between requests for a "decision". A step is executed every 0.02 seconds by default, so with a Decision Period of 5 a decision is requested every 5 × 0.02 = 0.1 seconds, and with 10, every 10 × 0.02 = 0.2 seconds (a small sketch for checking this interval follows after the settings below).
Finally, configure the RollerAgent as shown in the figures below.
Rigidbody
Behavior Parameters
- Behavior Name: RollerBall (the trained model is generated with this name)
- Vector Observation > Space Size: 8 (number of observed values)
- Vector Action > Space Type: Continuous (type of action)
- Vector Action > Space Size: 2 (number of actions)
Decision Requester
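Incidentally, the decision interval above is simply Decision Period × Time.fixedDeltaTime (0.02 s by default). Below is a minimal sketch for checking it in the Console; the component name DecisionIntervalLogger is a hypothetical example of mine, not part of ML-Agents:

using UnityEngine;

// Hypothetical helper (not part of ML-Agents): logs how often a "decision" is
// requested for a given Decision Period, assuming the default physics timestep of 0.02 s.
public class DecisionIntervalLogger : MonoBehaviour
{
    public int decisionPeriod = 10;

    void Start()
    {
        // Time.fixedDeltaTime is 0.02 s by default, so 10 x 0.02 = 0.2 s
        Debug.Log($"Decision requested every {decisionPeriod * Time.fixedDeltaTime} seconds");
    }
}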
The Agent class provides the following methods to override:
- Initialize(): called only once, when the agent's game object is created
- OnEpisodeBegin(): called at the beginning of each episode
- CollectObservations(VectorSensor sensor): sets the observation (state) data that the agent passes to the policy
- OnActionReceived(float[] vectorAction): executes the decided action, gives the reward, and ends the episode
RollerAgent.cs
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;

public class RollerAgent : Agent
{
    public Transform target;
    Rigidbody rBody;

    //Called only once, when the agent game object is created
    public override void Initialize()
    {
        this.rBody = GetComponent<Rigidbody>();
    }

    //Called at the beginning of the episode
    public override void OnEpisodeBegin()
    {
        if (this.transform.position.y < 0) //Reset the following when the RollerAgent (sphere) has fallen off the floor
        {
            this.rBody.angularVelocity = Vector3.zero; //Reset angular velocity
            this.rBody.velocity = Vector3.zero;        //Reset velocity
            this.transform.position = new Vector3(0.0f, 0.5f, 0.0f); //Reset position
        }

        //Reset the position of the Target (cube)
        target.position = new Vector3(Random.value * 8 - 4, 0.5f, Random.value * 8 - 4);
    }

    //Set the observation data (8 values) to be passed to the policy
    public override void CollectObservations(VectorSensor sensor)
    {
        sensor.AddObservation(target.position);         //XYZ coordinates of the Target (cube)
        sensor.AddObservation(this.transform.position); //XYZ coordinates of the RollerAgent
        sensor.AddObservation(rBody.velocity.x);        //RollerAgent velocity along the X axis
        sensor.AddObservation(rBody.velocity.z);        //RollerAgent velocity along the Z axis
    }

    //Called when performing an action
    public override void OnActionReceived(float[] vectorAction)
    {
        //Apply force to the RollerAgent
        Vector3 controlSignal = Vector3.zero;
        controlSignal.x = vectorAction[0]; //Action decided by the policy: force applied in the X direction (-1.0 to +1.0)
        controlSignal.z = vectorAction[1]; //Action decided by the policy: force applied in the Z direction (-1.0 to +1.0)
        rBody.AddForce(controlSignal * 10);

        //Measure the distance between the RollerAgent and the Target
        float distanceToTarget = Vector3.Distance(this.transform.position, target.position);

        //When the RollerAgent reaches the Target
        if (distanceToTarget < 1.42f)
        {
            AddReward(1.0f); //Give a reward
            EndEpisode();    //End the episode
        }

        //When the RollerAgent falls off the floor
        if (this.transform.position.y < 0)
        {
            EndEpisode(); //End the episode without giving a reward
        }
    }
}
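Before training, it can help to drive the agent manually to verify the scene. The following is a minimal sketch assuming the same legacy float[] API as the code above: add this method inside the RollerAgent class and set Behavior Type to "Heuristic Only" in Behavior Parameters to use it.

    //Sketch: manual control for testing. These values are passed to OnActionReceived
    //when Behavior Type is set to "Heuristic Only"
    public override void Heuristic(float[] actionsOut)
    {
        actionsOut[0] = Input.GetAxis("Horizontal"); //Force in the X direction (-1.0 to +1.0)
        actionsOut[1] = Input.GetAxis("Vertical");   //Force in the Z direction (-1.0 to +1.0)
    }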
Max Step: the episode is forcibly ended when the number of steps in the episode exceeds this value. In the RollerAgent's Inspector, set Max Step to 1000 and assign the yellow box "Target" to the Target field.
- Create a sample directory in ml-agents/config/
- Create a RollerBall.yaml file in it with the following contents

Hyperparameters (the training configuration file, extension .yaml):
- Parameters used for training
- They need to be tuned by a human
- The available settings differ for each reinforcement learning algorithm (PPO / SAC)
RollerBall.yaml
behaviors:
  RollerBall:
    trainer_type: ppo
    summary_freq: 1000
    hyperparameters:
      batch_size: 10
      buffer_size: 100
      learning_rate: 0.0003
      learning_rate_schedule: linear
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
    network_settings:
      normalize: true
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5
Activate the virtual environment created in the previous Qiita article
terminal
poetry shell
Execute the following command in the ml-agents directory
terminal
mlagents-learn config/sample/RollerBall.yaml --run-id=model01
The trailing model01 is the run ID; give each new training run its own unique ID.
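If you interrupt a run and want to continue it later under the same run ID, mlagents-learn also accepts a --resume flag (and --force to overwrite an existing run ID), for example:
terminal
mlagents-learn config/sample/RollerBall.yaml --run-id=model01 --resume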
terminal
Start training by pressing the Play button in the Unity Editor.
When the above message appears in the terminal, go back to Unity and press the Play button to start training.
Training statistics are printed to the terminal every summary_freq steps (1,000 in the configuration above). Mean Reward is the average reward: the higher the value, the better the agent performs. When it reaches about 1.0, you can finish the training.