Performance of AI Algorithms in Playing Games — Empirical Evidence From Jupiter, My 2048 AI
The AI uses the Monte Carlo Tree Search (MCTS) algorithm, which makes moves based on the results of many simulations of random games, also known as Monte-Carlo simulations.
I've written an article on how this algorithm works, how it can be implemented, and where MCTS can be useful. I highly recommend reading that article:
Here's a brief summary of the algorithm that you can feel free to skip if you've read the above article or you understand it already:
To find the optimal move at any given position, the program conducts a set of simulations for each possible move in that position (ex: up, down, left, or right). For each set of simulations, the algorithm starts by playing that set's move first.
After that, the rest of the game can be played completely randomly until it is over. The algorithm can then gather the total final game scores (sum of all the tiles on the board) of all the simulations, and average them for each set. We can then find the optimal move by optimizing for the highest final game score.
For example, there could be 50 simulations where the first move was left, in which the average score for those simulations was 250. Suppose there were also 50 simulations for each of the up, right, and down moves, and the average score for the 50 simulations in each of those was only 225. In this case, the optimal move would be left, since the algorithm optimizes for the move that yields the highest final game score.
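Here's a minimal, self-contained sketch of that loop in JavaScript. The little board engine (slide/merge/spawn) and all of the function names are my own reconstruction for illustration, not Jupiter's actual source; game score is the sum of all tiles on the board, as defined later in the article.

```javascript
// Slide one row toward the left, merging equal adjacent pairs once
// (leftmost pairs merge first, matching standard 2048 behavior).
function slideRowLeft(row) {
  const tiles = row.filter(v => v !== 0);
  const merged = [];
  for (let i = 0; i < tiles.length; i++) {
    if (i + 1 < tiles.length && tiles[i] === tiles[i + 1]) {
      merged.push(tiles[i] * 2);
      i++; // skip the tile that merged into this one
    } else {
      merged.push(tiles[i]);
    }
  }
  while (merged.length < row.length) merged.push(0);
  return merged;
}

function transpose(board) {
  return board[0].map((_, c) => board.map(row => row[c]));
}

// Apply a move ("up" | "down" | "left" | "right") and return the new board.
function move(board, dir) {
  let b = board.map(row => row.slice());
  if (dir === "up" || dir === "down") b = transpose(b);
  const towardEnd = dir === "right" || dir === "down";
  b = b.map(row => towardEnd
    ? slideRowLeft(row.slice().reverse()).reverse()
    : slideRowLeft(row));
  if (dir === "up" || dir === "down") b = transpose(b);
  return b;
}

// After each move, a new tile appears: a 2 with 90% probability, a 4 with 10%.
function addRandomTile(board) {
  const empty = [];
  board.forEach((row, r) => row.forEach((v, c) => { if (v === 0) empty.push([r, c]); }));
  if (empty.length === 0) return;
  const [r, c] = empty[Math.floor(Math.random() * empty.length)];
  board[r][c] = Math.random() < 0.9 ? 2 : 4;
}

// Game score, as defined in this article: the sum of all tiles on the board.
function boardScore(board) {
  return board.flat().reduce((a, b) => a + b, 0);
}

function boardsEqual(a, b) {
  const fa = a.flat(), fb = b.flat();
  return fa.every((v, i) => v === fb[i]);
}

const MOVES = ["up", "down", "left", "right"];

// A move is legal if it actually changes the board.
function legalMoves(board) {
  return MOVES.filter(d => !boardsEqual(board, move(board, d)));
}

// Play `firstMove`, then play completely randomly until the game is over,
// returning the final game score of the simulation.
function simulate(board, firstMove) {
  let b = move(board, firstMove);
  addRandomTile(b);
  while (true) {
    const legal = legalMoves(b);
    if (legal.length === 0) return boardScore(b); // game over
    b = move(b, legal[Math.floor(Math.random() * legal.length)]);
    addRandomTile(b);
  }
}

// Average the simulations for each candidate move and pick the best one.
function bestMove(board, simsPerMove = 200) {
  let best = null, bestAvg = -Infinity;
  for (const dir of legalMoves(board)) {
    let total = 0;
    for (let i = 0; i < simsPerMove; i++) total += simulate(board, dir);
    const avg = total / simsPerMove;
    if (avg > bestAvg) { bestAvg = avg; best = dir; }
  }
  return best;
}
```

Calling `bestMove(board, 200)` mirrors the 200-simulations-per-move baseline used in the trials below.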
Let's start with a few definitions that are relevant to the rest of the article:
- Performance: how well the AI performs at the end of each game, in which a higher final game score would be better
- Game State: the set of tiles on the board at a specific point in time
- Game Score: the sum of all the tiles on the board
- Real Game: the game that is being played and shown on the browser, not a simulation
- Landmark Score/Tile: a high tile or game score that is a power of two, like 512, 1024, 2048, or 4096
Analyzing the Performance of the Algorithm
I ran 50 trial games with the AI at 200 simulations per move in about 34 minutes (avg. 40.8s/trial), storing data after every move, including:
- Current Game Score
- Best Tile in the Board
- Average Score of Simulations
- Average Move Count of Simulations
- Milliseconds Taken to Calculate Optimal Move
- The Move Made
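One way the per-move data listed above could be recorded is as a simple object per move; the field names and example values here are my own assumption for illustration, not Jupiter's actual schema.

```javascript
// Hypothetical shape of one per-move record (illustrative values only):
const moveRecord = {
  gameScore: 2248,       // current game score (sum of tiles on the board)
  bestTile: 1024,        // best tile on the board
  avgSimScore: 2310.5,   // average final score of this move's simulations
  avgSimMoveCount: 43.2, // average move count of this move's simulations
  calcTimeMs: 25,        // milliseconds taken to calculate the optimal move
  move: "left",          // the move made
};
```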
The default of 200 simulations per move is meant to be a baseline so that the AI runs at a decent speed on all tiers of devices, including phones and tablets as well as desktops and laptops. I have personally done runs with 1,500 simulations per move and achieved much better results, while sacrificing speed in the process.
Game Score and Best Tiles
Of the 50 trial games, 96% reached at least the 1024 tile, 62% reached at least the 2048 tile, and 2% reached the 4096 tile. No trial reached a tile beyond 4096.
For game score, every trial reached a score of at least 1024, including the two trials that never reached the 1024 tile itself.
In fact, there's a clear trend in which games reach a landmark game score like 2048 or 4096, but don't survive long enough to get the tile itself.
I hypothesize this is because the game board gets very cluttered with tiles right before a landmark tile is reached. For example, one move before getting 4096, the board must already include at least 11 tiles: 2048, 1024, 512, 256, 128, 64, 32, 16, 8, 4, and 2. In this state, the placement of the tiles might not be ideal, or a 4 tile may appear where a 2 tile was needed. As a result, the board can hold all the tiles it needs to achieve a landmark tile yet be too cluttered, and the game ends before the landmark tile is reached.
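The tile-count claim above is simple arithmetic to verify: the merge chain from 2048 down to 2 is 11 tiles whose values sum to just under 4096.

```javascript
// Arithmetic check of the clutter claim: building a 4096 tile by repeated
// merging requires the whole chain 2048, 1024, ..., 2 on the board at once.
const chain = [2048, 1024, 512, 256, 128, 64, 32, 16, 8, 4, 2];
const chainSum = chain.reduce((a, b) => a + b, 0);
console.log(chain.length); // 11 tiles
console.log(chainSum);     // 4094, i.e. 4096 - 2
```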
Another trend appears in trials that ended with a game score between 2048 and 4096. There are several of these, and this is likely also because of board clutter: the algorithm doesn't have a built-in heuristic for tile placement, so tiles aren't kept lined up in increasing order the way other strategies arrange them.
This becomes a problem for very high tiles, depending on the total simulations per move. For 200 simulations per move, it looks like 4096 is the most common to cause cluttering problems.
Jupiter Is Very Different From Other AIs and Strategies
One of the most interesting things about Jupiter's algorithm is that it doesn't follow a particular strategy. Watching Jupiter play 2048, it's hard to see any sort of pattern in its moves, other than the fact that it wins most of the time.
The most common winning strategy among humans is a system where all tiles are lined up in increasing order along rows, alternating the direction in each row so that tiles can easily be added as new ones appear. Jupiter does not follow this type of strategy; in fact, it has no built-in strategy or heuristics at all.
I personally see this as an advantage which makes Jupiter a more creative AI. Jupiter's algorithm typically leads to more interesting and less predictable 2048 gameplay than other 2048 AIs or strategies. However, Jupiter's algorithm has a lack of precision that can lead to the issues with board clutter, because tiles are not algorithmically placed in a logical pattern.
The common human strategy mentioned above also depends on keeping tiles concentrated in a particular corner, which results in two of the possible moves (including right in the above visualization) being used much less than their counterparts. Jupiter is not like this, and makes all four moves a roughly equal fraction of the time.
What a Single Trial Game Looks Like
Let's take a look at a single game trial, trial #50, which got to the 4096 tile.
As we can see, the current game score is almost linear, with a slope of approximately 2.2. This is because in 2048, a new tile is added after each move. This new tile is typically a 2, but has a 10% chance of being a 4 instead. Thus, the expected value of the new tile is 2 × 90% + 4 × 10% = 2.2, so the game score increases by an average of 2.2 with every move.
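The expected-value arithmetic behind that slope can be checked in a couple of lines:

```javascript
// Every move adds one new tile: a 2 with 90% probability, a 4 with 10%.
const expectedNewTile = 2 * 0.9 + 4 * 0.1; // ≈ 2.2

// Game score is the sum of all tiles, and merging preserves that sum, so the
// score grows by about 2.2 per move on top of the two starting tiles.
const expectedScoreAfter = moveCount => expectedNewTile * moveCount;
```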
The average game score of the simulations is always slightly above the current game score, because each simulation plays on from the current position with random moves, adding new tiles that increase the score before the simulated game ends.
And as noted previously, game score is directly tied to the current move count of the real game: game score can be estimated by multiplying the current move count by 2.2. Therefore, spikes in the average move count of simulations are mirrored identically in the average score of simulations.
Notice how all three of these things occur at the same time:
- Increase in best tile
- Spike in average move count of simulations
- Spike in average score of simulations
In the graph, average score of simulations is scaled to be much smaller than average move count of simulations, so you may have to look closely to see spikes there.
As we can also see, the game score reaches a landmark before the corresponding landmark tile is actually reached — when the orange best tile line jumps up, the blue game score line has already surpassed the value of the orange best tile line.
Finally, possibly the most interesting insights we can gain from this graph are from the yellow average move count of simulations variable.
In the beginning, the average move count starts very high because there are very few tiles on the board, and the ones that are there are 2 or 4 tiles. This means simulations can survive for quite a long time just by playing randomly.
As higher tiles are created, the average move count starts getting lower and lower because there is more clutter and therefore a higher chance of the game ending in a shorter period of time.
The less clutter there is on the board, the higher the average move count is. Clutter is reduced by combining larger and larger tiles.
As a result of this relationship between tile combination, board clutter, and average move count, we can see a clear fractal-like repeating shape, in which the average move count spikes up, falls gradually, spikes up again, and repeats.
These spikes always occur when large tiles are created by combining smaller ones. This is corroborated by the fact that several of these spikes coincide with a new best tile being created (see 512, 1024, and 2048, for example).
Between each pair of new best tiles, there is another smaller spike, which we can assume marks the creation of the tile half the value of the next best tile. For example, right in the middle of 1024 and 2048 being reached, there is a large spike; this is likely when a new 512 tile was created. Similarly, we can see even smaller spikes between adjacent spikes, corresponding to even smaller powers of two being created.
Improving the Speed of the AI
To speed up the AI, the simulations for each move are parallelized with Web Workers. I kept the original implementation without workers in the code as a fallback for browsers like Opera Mini that don't support the Web Workers specification.
This parallelization increased performance greatly. On a mid-tier laptop running on battery power, I was able to run 50 trials of full games at 200 simulations per move in approximately 34 minutes, or about 40.8 seconds on average for a full game of roughly 1,600 moves. This means the AI played ~39 moves per second, taking ~25 ms to calculate each move.
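As a hedged sketch (an assumption on my part, not Jupiter's actual source), the worker path could split the per-move simulation budget evenly across workers, falling back to the main thread when the Worker API is missing:

```javascript
// Split `total` simulations as evenly as possible among `workerCount` workers.
// This partitioning and the fallback check are illustrative, not Jupiter's code.
function splitSimulations(total, workerCount) {
  const base = Math.floor(total / workerCount);
  const remainder = total % workerCount;
  return Array.from({ length: workerCount }, (_, i) => base + (i < remainder ? 1 : 0));
}

function runSimulations(total, runBatchInline) {
  if (typeof Worker === "undefined") {
    // No Web Worker support (e.g. Opera Mini): run everything on the main thread.
    return runBatchInline(total);
  }
  // Worker path (elided): post one batch size from splitSimulations() to each
  // worker via postMessage, then combine the scores the workers send back.
}
```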
I hope you enjoyed this post and found the analysis of the performance and speed of Jupiter, my 2048 AI, interesting.
Go check out Jupiter and its source code on GitHub.
Thanks for scrolling.
— Gabriel Romualdo, October 11, 2020