
GSoC’18: AlphaGo.jl

By: Tejan Karmali

Re-posted from: https://tejank10.github.io/jekyll/update/2018/07/08/GSoC-Phase-2.html

Hello, world!

Phase 2 of GSoC is over and AlphaGo.jl is ready! In this post I will explain how to use it.
AlphaGo.jl lets you train and test the Alpha(Go)Zero algorithm with your own parameters on the game of Go. Today, I'll cover its higher-level methods. For more details, mainly the MCTS implementation, you can check out the repo. It is built using Flux.jl, a machine learning library for Julia.

Environment

GameEnv is an abstract type used to represent a game environment. Setting up the environment is the first thing to do, because the environment stores important information about the game, and the other modules take the environment as input in order to configure themselves with that information.

Example:

env = GoEnv(9)

Here we have set up an environment for Go with a board size of 9×9.

NeuralNet

The NeuralNet structure stores the AlphaZero neural network, which is made up of three parts: a base network that branches out into a value network and a policy network.
The base network accepts a Position of the board as input. The value network outputs a single value between -1 and 1 denoting who will win from the given position: -1 means white will win, 1 means black will win. The policy network returns a probability distribution over the possible actions for that board position.

You can replace any of these three networks with your own Flux model, provided it is consistent with the rest of the NeuralNet pipeline.

mutable struct NeuralNet
  base_net::Chain
  value::Chain
  policy::Chain
  opt

  function NeuralNet(env::T; base_net = nothing, value = nothing, policy = nothing,
                          tower_height::Int = 19) where T <: GameEnv
    if base_net == nothing
      res_block() = ResidualBlock([256,256,256], [3,3], [1,1], [1,1])
      # tower of tower_height residual blocks (19 by default, as in the AlphaGo Zero paper)
      tower = [res_block() for i = 1:tower_height]
      base_net = Chain(Conv((3,3), env.planes=>256, pad=(1,1)), BatchNorm(256, relu),
                        tower...) |> gpu
    end
    if value == nothing
      value = Chain(Conv((1,1), 256=>1), BatchNorm(1, relu), x->reshape(x, :, size(x, 4)),
                    Dense(env.N*env.N, 256, relu), Dense(256, 1, tanh)) |> gpu
    end
    if policy == nothing
      policy = Chain(Conv((1,1), 256=>2), BatchNorm(2, relu), x->reshape(x, :, size(x, 4)),
                      Dense(2env.N*env.N, env.action_space), x -> softmax(x)) |> gpu
    end

    all_params = vcat(params(base_net), params(value), params(policy))
    opt = Momentum(all_params, 0.02f0)
    new(base_net, value, policy, opt)
  end
end
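For example, with the env from the previous section, you can build a shallower network or swap in your own head via the constructor's keyword arguments. The custom policy head below simply mirrors the default one and is only an illustration:

# Default heads, but with a tower of only 5 residual blocks
nn = NeuralNet(env; tower_height = 5)

# Or supply your own Flux model for one of the three parts
my_policy = Chain(Conv((1,1), 256=>2), BatchNorm(2, relu), x -> reshape(x, :, size(x, 4)),
                  Dense(2env.N*env.N, env.action_space), softmax) |> gpu
nn_custom = NeuralNet(env; policy = my_policy)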

MCTSPlayer

The MCTSPlayer struct simulates a game using Monte-Carlo Tree Search (MCTS) and a NeuralNet. It takes a NeuralNet and an env as input and plays the game up to the specified number of readouts. MCTSPlayer can perform the following functions:

  • MCTS
  • Pick a move based on MCTS and play it
  • Extract data from the games played by it

These functionalities are used during the training and testing phases.

Selfplay

The self-play stage is used in the training phase. In this stage, the MCTSPlayer plays a game against itself: every move is picked based on MCTS and then played. After the game ends, the MCTSPlayer object is returned so that its data can be extracted, as in the snippet below.
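In code, one round of self-play and data extraction looks like the two lines below, taken from the training loop shown in the next section (cur_nn is the current NeuralNet and readouts is the number of MCTS readouts per move):

player = selfplay(env, cur_nn, readouts)   # MCTSPlayer plays one game against itself
p, π, v = extract_data(player)             # board positions, search policies, game results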

Training

The train() method trains the model based on the following parameters:

  • env
  • num_games: Number of self-play games to be played

  Optional arguments:
  • memory_size: Size of the memory buffer
  • batch_size: Number of samples drawn from the memory buffer at each training step
  • epochs: Number of epochs to train on the data
  • ckp_freq: Frequency of saving the model and weights
  • tower_height: AlphaGo Zero stacks residual blocks into a tower of residual networks; tower_height specifies how many residual blocks are stacked
  • model: Object of type NeuralNet
  • readouts: Number of readouts by the MCTSPlayer
  • start_training_after: Number of games after which training will be started

train() starts off with a game of selfplay() using the current best NeuralNet. When the game completes, the data from that game is extracted: the board states, the policy used at each move, and the result of the game. This data is stored in the memory buffer.

for i = 1:num_games
  player = selfplay(env, cur_nn, readouts)
  p, π, v = extract_data(player)

  pos_buffer = vcat(pos_buffer, p)
  π_buffer   = vcat(π_buffer, π)
  res_buffer = vcat(res_buffer, v)

  if length(pos_buffer) > memory_size
    pos_buffer = pos_buffer[end-memory_size+1:end]
    π_buffer   = π_buffer[end-memory_size+1:end]
    res_buffer = res_buffer[end-memory_size+1:end]
  end

  if length(pos_buffer) >= start_training_after
    replay_pos, replay_π, replay_res = get_replay_batch(pos_buffer, π_buffer, res_buffer; batch_size = batch_size)
    loss = train!(cur_nn, (replay_pos, replay_π, replay_res); epochs = epochs)
    result = player.result_string
    num_moves = player.root.position.n
    println("Episode $i over. Loss: $loss. Winner: $result. Moves: $num_moves.")
  end

  if i % ckp_freq == 0
    save_model(cur_nn)
    print("Model saved. ")
  end
end

At every training step, batch_size samples are picked from the memory buffer. Features are extracted from the sampled board states and fed into the NeuralNet, which outputs the value and policy as described in the NeuralNet section above.

We then compute the losses. There are three components: the policy loss, the value loss, and an L2 regularisation term.

# Policy loss: p is predicted policy
loss_π(π, p) = crossentropy(p, π; weight = 0.01f0)

# Value loss
loss_value(z, v) = 0.01f0 * mse(z, v)

The losses are added and backpropagated, after which the optimizer updates the weights. epochs can be specified in the train call to train on this data for multiple passes. Periodically, the NeuralNet and its weights are backed up using BSON.jl.
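A rough sketch of what happens inside one update step, continuing from the loop above: p and v denote the network's predicted policy and value for the sampled batch, the checkpoint file name is illustrative, and the snippet assumes the 2018-era Flux API (Flux.back! plus a callable optimiser), consistent with the Momentum optimiser constructed in the NeuralNet section:

using Flux, BSON

# p, v: policy and value predicted by the NeuralNet for the sampled positions
total_loss = loss_π(replay_π, p) + loss_value(replay_res, v)

Flux.back!(total_loss)   # backpropagate through all three sub-networks
nn.opt()                 # the stored optimiser updates the weights

# Periodic checkpoint with BSON.jl (file name is illustrative)
BSON.@save "agz_checkpoint.bson" nn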

Play

To play against a saved NeuralNet model, we first load it using load_model, which accepts the path to the model and the env as parameters and returns a NeuralNet object.
play() takes the following arguments:

  • env
  • nn: an object of type NeuralNet
  • tower_height
  • num_readouts
  • mode: Specifies whether the human plays as Black or White. If mode is 0, the human plays Black; otherwise White.
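For example, loading a checkpoint and playing against it as Black might look like this (the path is a placeholder; check the repo for the exact load_model signature):

nn = load_model("path/to/saved/model", env)
play(env, nn, mode = 0)   # human plays Black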

Sample usage

using AlphaGo

# This makes a Go board of 9x9
env = GoEnv(9)

# A NeuralNet object of tower_height 10 is made and trained and returned
neural_net = train(env, num_games=100, ckp_freq=10, tower_height=10, start_training_after=500)

# Plays a game against the trained network, with human as White
play(env, neural_net, mode = 1)

GSoC’18: From Go to AlphaGo

By: Tejan Karmali

Re-posted from: https://tejank10.github.io/jekyll/update/2018/06/09/GSoC-week3-4.html

Hello, world!

In the last post, I talked about the game of Go and its dynamics. In the last two weeks, I worked on the objectives set for this phase: Monte-Carlo Tree Search and putting things together to make AlphaGo Zero.

Week 3 was mostly about MCTS. The MCTS code is organized into two parts: a struct for a node of the tree, and a struct for a player that uses MCTS to pick moves. MCTSPlayer is the struct the user interacts with. A node of the Monte-Carlo tree is defined by a board position and the positions it can reach by playing any of the moves in the action space from that position. MCTSPlayer provides an API to perform the tree search, select a move based on it, and play that move. During tree search a virtual loss is used: when a node is selected, we pretend that its evaluation has already taken place and counted as a loss, which discourages subsequent selections from walking the exact same path and introduces some stochasticity into node selection.
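To make the virtual-loss idea concrete, here is a small conceptual sketch (not AlphaGo.jl's actual code; struct, field and function names are illustrative) of how selection can apply and later revert a virtual loss:

# Conceptual sketch only
mutable struct TreeNode
    N::Int        # visit count
    W::Float64    # accumulated value from the node's perspective
end

# While a leaf is waiting to be evaluated, pretend it already lost once,
# so the next selection pass is nudged towards different branches.
add_virtual_loss!(node::TreeNode) = (node.N += 1; node.W -= 1.0)

# When the real evaluation arrives, replace the pretended loss with it.
function revert_virtual_loss!(node::TreeNode, value::Float64)
    node.W += 1.0 + value
end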

The MCTSPlayer contains a neural network, which is used to predict a policy and a value for a given board position. These predictions are used to update the statistics of the nodes (replacing the virtual losses added earlier). The player's network is of type NeuralNet, which is broken down into three parts: a base network, a value head and a policy head. The input passes through the base network first; its output is fed into the value head to obtain the value of the state, and into the policy head to obtain the policy. There is also an evaluation routine in which two MCTSPlayers compete in a series of games to decide which network is stronger.
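Concretely, a forward pass through a NeuralNet nn splits like this, where x is the feature tensor for a batch of board positions:

features = nn.base_net(x)        # shared representation
v = nn.value(features)           # value of the state, in [-1, 1]
p = nn.policy(features)          # policy over the action space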

I have put up an example of the AlphaGo Zero algorithm here. You can configure it by setting the flags; by default it runs the AGZ algorithm from the paper.

GSoC’18: Flux baselines, Go and more

By: Tejan Karmali

Re-posted from: https://tejank10.github.io/jekyll/update/2018/05/26/GSoC-week1-2.html

Hello, world!

The community bonding period and the first two weeks of GSoC have come to an end. The community bonding period lasted over three weeks. I could not do much work during its first two weeks due to my end-semester exams. In the third week, I implemented MADE (Masked Autoencoder for Distribution Estimation). I also added a dilation feature for convolutions (which was a feature request for NNlib.jl).

  • PR#19: Implemented MADE architecture in Flux.
  • PR#40: Dilation support for convolutions. So far, dilation support is available for 1D, 2D and 3D convolutions.

In week 1, I implemented and created demos of some seminal papers in deep reinforcement learning. The implemented algorithms were tested on OpenAI Gym environments using the OpenAIGym.jl package; the environments used were CartPole-v0, Pong-v0 and Pendulum-v0. The work done during the pre-GSoC period and this week has been compiled into the Flux baselines repo, which currently contains six models.

In week 2, I started working towards one major milestone of the project: the AlphaGo Zero model. For those of you not familiar with it, AlphaGo Zero is the latest version of AlphaGo, a program that plays (and wins :P) the ancient Chinese game of Go. This version of AlphaGo doesn't learn from any human amateur or professional games. Instead, it learns to play by playing games against itself, starting from completely random play.

AlphaGo.jl is where I am implementing the Flux-based version of AlphaGo Zero. This part of the project is divided into three tasks:

  • Creating the environment for Go
  • Monte-Carlo Tree Search
  • Main model of AlphaGo Zero using Go and MCTS

In week 2 I created the environment for Go. The environment simulates the game of Go, with an abstraction similar to OpenAI Gym's. The game can be played on a board of size 9×9, 13×13, 17×17 or 19×19. A player is assigned stones of either black or white colour, and the player with black stones makes the first move. Since this can be an advantage for the black player, the white player is awarded some extra points, called komi. Players place stones on the intersections of the vertical and horizontal lines on the board; an N×N board has N^2 intersections. On a player's turn, they can either place a stone or pass. Thus, the action space for the environment has size N^2 + 1. The game ends when both players pass consecutively. Depending on the final state of the board, scores are calculated and the winner is decided.
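As a quick sanity check of those numbers, using the env fields that also appear in the AlphaGo.jl post above (env.N and env.action_space):

env = GoEnv(9)
env.N              # 9
env.action_space   # 9*9 + 1 = 82: every intersection plus the pass move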

In the coming days, my goal will be to complete the other two tasks. Hopefully in the next blog post, I'll be able to present a demo of the game.