## Wednesday, July 23, 2014

This blog post contains entertaining videos of an AI agent playing Super Mario Bros, making a series of interesting mistakes but ultimately playing quite well (after several modifications). But before watching those, I want you to read about the algorithm that plays the game.

Monte Carlo Tree Search (MCTS) is the new black in game AI. At least for board games and similar games, including many types of card games and puzzles. The algorithm has only been around since 2006, but has already become the method of choice for Go (the Asian board game) and "general game playing" (developing AI that can play any of a wide range of games). People had been working on AI for Go for decades without producing anything that played better than a human beginner, but after MCTS was invented playing strength increased drastically to the point where computers can now compete with human club players. All the best Go-playing programs use MCTS. For the General Game Playing Competition, and the General Video Game Playing Competition, agents based on MCTS are pretty much the only ones that work at all.

At its core, MCTS is a tree search algorithm. Just like Minimax, the simple algorithm that is used for playing games such as Chess and Checkers among many others, it looks at the possible moves that can be made from a particular (board) state, and selects the best move based on which states these moves result in and which moves can be taken from them. There are two big differences with Minimax, however. The first is that while Minimax explores all moves equally, MCTS is prejudiced and explores promising branches of the tree much deeper than others. For this, it uses the UCB equation, which determines which nodes are underexplored and should be explored further. The second difference is that while Minimax needs an evaluation function that assigns a "goodness" value to a game state (how close that game state is to winning), MCTS evaluates a state through playing a number of games until the end by taking random actions. Yes, random; it sounds crazy but it works. Because MCTS does not need an evaluation function, it works well for games where it is very hard to effectively evaluate the quality of a game state (such as Go), and because MCTS builds an unbalanced tree, it works for games that have very high branching factors (such as Go).

Are you still with me? Good. The above was of course a gross simplification. For the full story of MCTS, including pseudocode, history and several common improvements, see this excellent survey paper. Now back to the story.

So, MCTS does really well for Go and for a number of other board game-like games. What other games would it work well on? An hypothesis of mine was that for MCTS to work well, the game needed to be:
1. Turn-based, or close to turn-based. Relatively few actions should be taken in a given time period, and each action should have a measurable and frequently meaningful impact on the game state.
2. Turn-limited. After playing a certain number of turns, even if just taking random actions, one should be guaranteed to arrive at an end state (win, loss or draw).
The reasoning is if that the game is not guaranteed to finish in a limited time, the playouts might go on forever; also, if the game is not turn-based, each action might mean too little for the value of each node to be estimated. Go, the main success story of MCTS, fulfils both of these criteria.

Most arcade games, and other action games, fulfils none of these criteria. Typically, you have high frame rates (meaning that each action only changes the state a little bit) and considerable freedom of movement (so that you're not pushed towards an inevitable end of the game). So I thought it would be interesting to investigate whether MCTS would work for such a game, in what ways it would fail if it didn't work, and how we could modify the algorithm to make it work. But then, someone would need to actually do this research. Luckily, two very talented students, Emil Juul Jacobsen and Rasmus Greve, walked into my office one day and inquired about doing a bachelors thesis. So we had a deal. In what follows, Emil and Rasmus did all the hard work, whereas I was mostly waving my hands around and saying encouraging things ("supervising").

What follows below is mostly a concise summary of our recent paper.

First, we decided on a game. We chose the Mario AI Benchmark, which is a hacked version of a clone of Super Mario Bros. (You can find the old version here, the new version here and an overview paper here.) The version of the Mario AI Benchmark we used has rather simple levels, and it has previously been shown by Baumgarten that A* search in state space can play these levels very well. (In fact, our MCTS implementation actually reuses the forward model from Baumgarten's A* implementation.)

Our first, naive attempts at playing Mario with MCTS met with limited success. We had to make a couple of modifications to the algorithm to make it work at all. As it is very rarely possible to "roll out" (play randomly) until the end of the level, we had to cap the rollouts to a very low number (e.g. 6) actions, and if Mario wasn't either dead or winning by then we returned how far Mario had travelled to the right as a state evaluation. This gave us rudimentary gameplay ability. After removing the possibility to press the "down" button (or rather, to issue the "down" command) performance improved somewhat, but we still see a shortsighted and somewhat cowardly Mario in the video below.

Indeed, not very impressive. Note the red line in front of Mario, which shows the best found path in each time step; it is not very long. The short action size and constraints on real-time control conspire to limit the maximum depth of the tree that can be explored. That explains the shortsightedness. But what about the cowardice, with Mario hesitating in front of pipes and cannons and not having the courage to jump? (An earlier version of the agent had him hesitating in front of holes too.) We hypothesised that this is because the average outcome of jumping (falling into a hole, or getting hurt by a flower) is worse than the average outcome of doing nothing, even though the best-case outcome is considerably better.

Now we had our baseline result. We decided to see whether we could improve on it, by modifying MCTS to perform better on this type of task. Below, describe the five modifications we implemented and evaluated.

Mixmax rewards tries to address the issue of cowardice. The basic idea is very simple: in addition to just backing up the average value of all children of a node, we also back up to maximum value of any child of that node. The actual value of the node is some combination of the average and max. A bit of empirical testing showed that around 20% max was a good number. The Mixmax Mario player below shows a somewhat more assertive behaviour, and also scores better than the baseline in systematic evaluations.

The next modification is Macro actions. This is a rather crude attempt to counteract the shortsightedness of Mario by making each move last longer. Each time an action is explored, the same action is taken for a number (e.g. 3 or 5) of time steps. The idea of Macro-actions is certainly not new; it has been explored before in the context of the Physical Travelling Salesperson Problem. In our case, an added feature is that the search reverts to a single time step when Mario is close to an enemy. This yields a player who sees further into the future than standard MCTS but does not play much better.

Partial expansion is another attempt to remedy Mario's shortsightedness. Here, we've modified the UCT formula that underlies the MCTS algorithm to make is possible to explore grandchildren of a node before having explored all children of that node. This allows the algorithm to explore deeper but sparser trees, and works relatively well in practice.

The order in which actions are explored can be quite important in MCTS. Ideally, you want the best actions explored first - but of course, you don't know which those are. There have been several attempts at making more intelligent action ordering in the literature. We call our solution Roulette wheel selection; we simply estimate the average value of each action (there are only 16 possible button combinations) and then use the classic roulette wheel selection operator from genetic algorithms to probabilistically select which action to explore each time we expand a node.

Finally, we experimented with throwing in a little bit of domain knowledge. Hole detection is a very simple modification of the reward scheme, where we simply give a negative reward for ever being inside a hole, rather than only for reaching the bottom of a hole. As you can see, this allows Mario to avoid holes better.

We then evaluated all possible combinations of these various improvements. It turns out that there are several combinations that perform indistinguishably from A*, and thus seem to play optimally (at least on these simple linear levels). In general, all modifications work with each other most of the time except macro actions, which are frequently detrimental. Below you can see an example of gameplay with all modifications except macro actions turned on.

Pretty neat, eh?

It even works well of you remove the domain knowledge.

So, it seems that we've cracked the problem of how to make MCTS play Super Mario Bros. A couple of rather simple modifications to the basic MCTS algorithm takes it from very bad performance to really rather good performance. We believe most of these modifications will translate well to other game domains, and might therefore be useful to anyone interested in applying MCTS to games that more like Super Mario Bros than like Go.

There's only one tiny fly in the ointment though: A* can already play Super Mario Bros just as well. So what's the point in attempting something fancy like MCTS when you can just the trusty old best-first search algorithm?

Here's why. A* frequently produces solutions that seem extremely brittle, and relying on everything going as planned. So we decided to see what happens if we changed the benchmark a bit, so things sometimes worked out differently than planned. We simply added a 20% action noise, so that there was a 20% chance that for any given action, a random action would be taken instead of the one that the agent chose. This resulted in complete collapse for A*; performance degraded by around 2/3. MCTS on the other hand held up relatively well, losing some performance but playing considerably better than A* under noisy conditions.

So MCTS can indeed be useful for arcade games, as it produces gameplay that is not only strong but also robust.

The above was of course a simplified account of what we did, missing lots of details and results. Read the paper, published in the proceedings of GECCO 2014, to find out more!