Alpha Zero General playing Tic Tac Toe in p5 using tf.js

So for my final, I eventually decided on this goal:

a p5 based template for training RL models on games, with the hope of eventually having a game library in p5 that allows users to easily and intuitively set up custom board games (with boards, pieces, scorekeeping, and more) or select from pre-designed games such as Chess, Go, and other classics that are used as benchmarks for reinforcement learning algorithms. Ideally have trained a model that can play a game against me (and not lose horribly)

And what I ended up with was a great exploratory foundation to that idea haha. I have a game of tic tac toe in the browser, with a downloadable or self-training model that you can play against. I’m really excited to use this as a jumping off point for more games and more RL. Though a part of me wishes I had expanded upon my fun KNN-Classifier game from Week 2…

When I first started working on this, I was convinced I would have to try and train the model from scratch by doing some sort of adversarial/genetic algorithm based training program that would start with randomly generated neural networks playing games of tic tac toe against each other, and using the victor as the template for a new generation of opponents, gradually increasing in playing strength each time.

Upon researching how other models learn how to play games, I encountered a lot of interesting aspects of reinforcement learning: Deep Q Networks, alpha-beta pruning, minimax search algorithms, and more. All were sort of adjacent to what I was looking for, but it wasn’t until I learned about “Universal Learning Architectures” that I realized what I needed to base my system off of — Alpha Go Zero.

Since Go is considered the most complex of the classic abstract board games, it was a big deal when Google Deep mind’s original model, Alpha Go, beat the world’s best go player, Ke Jie. Alpha Go was trained on thousands of recorded human-played Go games, but the model that eventually surpassed it wasn’t trained on any. When the successor, Alpha Go Zero, was developed, it only took 21 days of training to surpass the original, and instead of learning from pre-played games, it simply played against itself over and over in a similar way to what I was proposing at the beginning of my project. It didn’t know anything about the game other than what the legal moves were and who had won at the end.

Just as a side note, I think this is one of the most revolutionary aspects of this type of Machine Learning, because instead of having to do the most challenging and often controversial part of training, data collection, all the data is self-generated with no need for human input.

So the model that I eventually used was called Alpha Zero General, based on Alpha Go Zero, but using that universal learning architecture aspect that allows it to be thrown at any game and learn how to play with no need for human input. My implementation was based off of the following previous implementations, thanks so much to those who made this possible!

https://github.com/grimmer0125/alphago-zero-tictactoe-js

https://github.com/suragnair/alpha-zero-general/tree/master/connect4

https://arxiv.org/abs/1712.01815

From that paper, this is how they describe their methods:

For at least four decades the strongest computer chess programs have used alpha-beta search (18, 23). AlphaZero uses a markedly different approach that averages over the position evaluations within a subtree, rather than computing the minimax evaluation of that subtree. However, chess programs using traditional MCTS were much weaker than alpha-beta search programs, (4, 24); while alpha-beta programs based on neural networks have previously been unable to compete with faster, handcrafted evaluation functions. AlphaZero evaluates positions using non-linear function approximation based on a deep neural network, rather than the linear function approximation used in typical chess programs. This provides a much more powerful representation, but may also introduce spurious approximation errors. MCTS averages over these approximation errors, which therefore tend to cancel out when evaluating a large subtree. In contrast, alpha-beta search computes an explicit minimax, which propagates the biggest approximation errors to the root of the subtree. Using MCTS may allow AlphaZero to effectively combine its neural network representations with a powerful, domain-independent search.

Meow Wolf Style Transfer Webcam

Try it out!

Final Project Proposal

I’m designing a browser-based game that uses socket.io and p5.js, and it explores a lot of group dynamics under the lens of cooperation/competition when players have a shared enemy. In this case, the universal enemy would (in theory) be a ML model that is constantly reacting to the players and working to bring them down. It would either be something that is learning/reacting on the fly (with a system more akin to genetic algorithm based evolution strategies and hard-coded directive parameters) or a pre-trained model that uses its history with hundreds of previous games to inform what it should do in this one (using a reinforcement learning based model).

Sooooo there’s a lot flying around my brain and I’m not sure how to simplify just yet into something that’s deliverable by the final deadline. The options I perceive for the final are as follows:

Make the final project the AI enemy in the game, just to demo as a proof of concept. One of three paths:
1. Focus solely on the evolutionary approach, using the tensorflow.js demo I’ve been playing around with for ml5.
2. Develop a training environment for a RL model so I can then use that model in live games.
3. Instead of using something not currently well documented, use an image classifier on the game map (since it will be a simple array of pixels) to determine a general “direction” for the AI enemy.
Take a more exploratory approach to implementing RL with tensorflow.js
1. concurrent with ongoing research by Aidan Nelson into this topic with regards to the ml5 library
2. lots of good similar projects with Open AI’s gym environments and Andrej Karpathy’s ConvNetJS