I Made My Own Card Game, Then Proceeded To Lose To My Own AI: Reinforcement Learning and TFAgents
My cousin and I made a card game called Kerdu a long time ago. He was always better than me, so I made an AI to beat him.
Game Rules:
- Each player has four rows (1-4).
- At the end of every round, a card will move to the next row (ex: 4->3 or 2->1).
- If a card is in one of a player’s rows, they may play a card of equal or greater value against it, and both cards go into the discard pile.
- A player must have no cards in their first row at the end of a round or they lose.
- Once both players pass, the round is over and both players may draw back up to a hand size of 5 cards.
- Once the deck is empty, shuffle the discard pile as a new deck.
- You may attack a player with a card; depending on the value of the card, it gets placed in a different row. (1st Row: 2-3, 2nd Row: 4-10, 3rd Row: Jack, Queen, King, 4th Row: Ace)
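To make the placement rule concrete, here's a rough sketch of it in code (the function name and the numeric encoding of face cards are just for illustration, not the exact representation used in the project):

```python
# Illustrative helper for the attack-placement rule above.
# Card values are assumed to be encoded as 2-10, with Jack = 11,
# Queen = 12, King = 13, and Ace = 14.
def row_for_attack(card_value: int) -> int:
    if card_value in (2, 3):
        return 1                    # first row: must be cleared this round
    if 4 <= card_value <= 10:
        return 2
    if card_value in (11, 12, 13):  # Jack, Queen, King
        return 3
    return 4                        # Ace lands in the furthest row
```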
A balance between aggression and defence:
The case for aggressive play:
Because both players will draw the same average value of cards, the player who is attacking will always have an advantage: the defender has to answer each threat with a card of equal or greater value. If I’m being attacked by a 9 in my first row, and my hand is a 2, 3, 7, Queen, and King, I’ll have to spend the Queen to beat the 9. This means it’s almost always best to attack. So what counters this type of play?
The case for defensive play:
If a player spends their high-value cards (Queens, Kings, and Aces) on attacks, then when they’re attacked with a high-value card, they won’t be able to defend against it.
Also, if I’m too aggressive and play a lot of cards in a single turn, I may run out of cards to play for the rest of the round. If my opponent then plays a 2, I’d have to beat it immediately, but with no cards left I’d instantly lose. The same logic extends further: if I have 3 cards in hand and my opponent plays 4 cards that must be immediately beaten, I lose, since I can’t answer all 4 with only 3 cards.
Hardcoding AI:
My aggressive, hard-to-beat hardcoded AI would hold onto its highest-value cards to defend against Aces and the like, and use all of its low-value cards to attack. It would never let its hand drop below 2-3 cards unless it was forced to defend.
This is a good balance between aggressive and defensive play. We force our opponent into unsavory trades, we can defend against strong threats easily, and we’re not susceptible to being barraged with low-level threats we can’t deal with. This is a formidable AI that even a perfect player cannot always beat.
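In rough Python, the hardcoded attacker looked something like this (the names and the exact reserve size are illustrative, not the real implementation):

```python
RESERVE = 3        # assumed number of cards the AI tries to keep in hand
HIGH_VALUE = 11    # assumed threshold: Jack and above are saved for defence

def choose_attacks(hand):
    """Attack with low cards while keeping high cards and a small reserve back."""
    low_cards = sorted(card for card in hand if card < HIGH_VALUE)
    spare = max(0, len(hand) - RESERVE)
    return low_cards[:spare]   # never voluntarily dip below the reserve
```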
Understanding Theory:
Markov Decision Processes (MDPs):
We will first find a base-level representation of our game. We can use MDPs to represent the following:
- Agent: Our player.
- Environment: Our simple game board and hand.
- State: The specific configuration of the environment our player finds themselves in. This gives us the details of our hand and which card is in which row.
- Action: A move the player is legally allowed to make.
- Reward: The feedback on how good taking a particular action in a state was.
We want our agent to maximize rewards by analyzing a state and choosing the correct action. Every time an action is taken, there is subsequent feedback in the form of a reward.
Our agent’s brain is its policy. A policy π tells us which action to take in a given state. Initially we may want to take random actions regardless of how dumb they seem, but over time we want to narrow in on the actions that give us the best results. This is the epsilon-greedy strategy: with probability ε we explore a random action, otherwise we exploit the best one we know of, and we shrink ε as training goes on. It’s how we balance exploration with exploitation.
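A minimal sketch of epsilon-greedy selection (the decay schedule and numbers here are illustrative, not the exact ones I used):

```python
import random

def epsilon_greedy_action(q_values, legal_actions, epsilon):
    """With probability epsilon explore a random action; otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(legal_actions)               # explore
    return max(legal_actions, key=lambda a: q_values[a])  # exploit

# Typical schedule: start fully random, then decay toward mostly-greedy play.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
# after each episode: epsilon = max(epsilon_min, epsilon * decay)
```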
So how do we assess how valuable a state is? We need two functions. A state-value function vπ(s) tells us how valuable a given state is. An action-value function qπ(s,a) tells us how valuable taking an action a is from our current state s. Both are expectations of the rewards we’ll collect while following a specific policy.
\[q_{\pi}(s, a) = E[R_{t} | S_{t} = s, A_{t} = a]\]
To balance short-term losses with long-term gains, we’ll discount rewards: the further in the future a reward is, the less it counts. If we care about the future just as much as the present, we’d set gamma (γ) to 1. If we don’t care about the future at all, we’d set γ to 0. The equation below sums the discounted rewards into a total R_total at time t.
\[R_{total} = \sum\limits_{k=0}^{\infty} {\gamma}^k R_{k + t + 1}\]
The following equation therefore states that, for a given state and action, the expected return is the summation of all discounted rewards. This is what’s known as our Q-function, which calculates our Q-value.
\[q_{\pi}(s, a) = E[\sum\limits_{k=0}^{\infty} {\gamma}^k R_{k + t + 1} | S_{t} = s, A_{t} = a]\]
Q-tables come up as a way of storing the value of a particular action in a particular state; however, this obviously won’t scale to more complex games with nearly infinite state-action pairs.
Optimal Play
Now we’re going to return to the epsilon-greedy strategy and flesh out how we make our agent learn.
The only thing we’ve been vague about so far is which policy is actually best. For that we’d normally use the Bellman optimality equation, which states that the optimal policy π* is whatever maximizes the Q-value. The pair s’, a’ refers to the next state-action pair.
\[q_{\star}(s,a) = E[R_{t+1} + \gamma \max_{a'} q_{\star} (s',a') ]\]
If we were to have a Q-Table, once we take an action, our old assessment of that action will change to the following, where α is a learning rate between 0 and 1.
\[q^{new} (s,a) = (1-\alpha)q_{old\ value} + \alpha q_{new\ value}\]
\[q^{new} (s,a) = (1-\alpha)q(s,a) + \alpha (R_{t+1} + \gamma \max_{a'} q(s',a'))\]
We now have all the math required for a basic grid world! The next step is to scale this up with deep reinforcement learning!
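To see the update rule in action, here’s a tiny self-contained grid world (really a corridor) written as a sketch; the toy environment and hyperparameters are made up purely for illustration:

```python
import random
from collections import defaultdict

# Toy stand-in environment: walk along a 1-D corridor, reward at the far end.
class CorridorEnv:
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # actions: "left" or "right"
        self.pos = max(0, self.pos + (1 if action == "right" else -1))
        done = self.pos == 5
        return self.pos, (1.0 if done else 0.0), done

alpha, gamma = 0.1, 0.99                     # learning rate and discount
epsilon, epsilon_min, decay = 1.0, 0.05, 0.99
ACTIONS = ["left", "right"]
Q = defaultdict(float)                       # Q-table keyed by (state, action)
env = CorridorEnv()

for episode in range(500):
    state, done = env.reset(), False
    while not done:
        if random.random() < epsilon:
            action = random.choice(ACTIONS)                     # explore
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])  # exploit
        next_state, reward, done = env.step(action)
        # The update rule above: blend the old estimate with the new target.
        target = reward if done else reward + gamma * max(
            Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * target
        state = next_state
    epsilon = max(epsilon_min, epsilon * decay)                 # decay exploration
```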
Deep Q-Learning (DQN)
A Q-Table uses value iteration to find the best solution; however, in sufficiently large systems this doesn’t scale well and can’t generalize. The optimal Q-Function must be found via a function approximator instead.
For our model, each aspect of our state is represented by an input neuron of a neural network, and the outputs are the estimated Q-Values we wish to find, one per action. We calculate a reward after each step, build a target from it, and backpropagate the error. Just a simple neural network now.
Also, because we’re using a neural network, we need our training data to be approximately i.i.d. (independent and identically distributed). New experience arrives sequentially, so instead of updating our network after every step we hold onto a replay buffer and train on randomly sampled batches, which keeps the updates close to i.i.d. That’s all there really is to a DQN! I used TF-Agents to handle most of the complex overhead for this algorithm.
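For reference, a pared-down version of that TF-Agents setup looks something like the sketch below; the layer sizes, learning rate, and buffer size are assumptions, and `train_env` stands in for the card game wrapped in a `TFPyEnvironment`:

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import tf_py_environment
from tf_agents.networks import q_network
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.utils import common

# train_env = tf_py_environment.TFPyEnvironment(KerduEnv())  # the game env, defined elsewhere

# Q-network: the encoded hand/board state in, one Q-value per action out.
q_net = q_network.QNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=(128, 64))          # illustrative layer sizes

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=tf.Variable(0))
agent.initialize()

# Replay buffer: store transitions, then train on shuffled batches so the
# samples look i.i.d. rather than sequential.
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=100_000)
dataset = replay_buffer.as_dataset(
    sample_batch_size=64, num_steps=2).prefetch(3)
```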
One last agent (AI architecture) I trained was a Categorical DQN (C51), which doesn’t return a single expected value for each Q-Value but rather a probability distribution over returns for each action. The atoms in the C51 paper are the 51 evenly spaced support points that distribution is defined over. I saw that this paper got better results, but didn’t realize that this was overkill and unnecessary for our game: it has to train longer, and its benefit of capturing non-uniform return distributions is out of place in a simple card game.
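In TF-Agents the distributional variant is mostly a drop-in swap; something like the following, where the atom count, value range, and layer sizes are again my own assumptions:

```python
import tensorflow as tf
from tf_agents.agents.categorical_dqn import categorical_dqn_agent
from tf_agents.networks import categorical_q_network
from tf_agents.utils import common

# As before, `train_env` stands in for the game wrapped in a TFPyEnvironment.
categorical_q_net = categorical_q_network.CategoricalQNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    num_atoms=51,                       # the 51 atoms that give C51 its name
    fc_layer_params=(128, 64))          # illustrative layer sizes

c51_agent = categorical_dqn_agent.CategoricalDqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    categorical_q_network=categorical_q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    min_q_value=-20,                    # assumed support of the return distribution
    max_q_value=20,
    td_errors_loss_fn=common.element_wise_squared_loss,
    gamma=0.99,
    train_step_counter=tf.Variable(0))
c51_agent.initialize()
```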
Optimizing the Feature Space and Gym Environment
During training I noticed my AI would make a lot of illegal moves, so I focused on limiting the available actions to sensible ones, narrowing the AI’s job to the important high-level decisions rather than basic coverage. Let’s say we had two cards that could beat an attacking 4: a 5 and a 10. Defending with the 5 will always be the best play, so I hard-coded that choice and kept my AI from making such errors.
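In code that pruning boils down to always defending with the cheapest card that still wins; a hypothetical helper:

```python
def cheapest_defender(hand, attacker_value):
    """Return the lowest card that can beat the attacker, or None if we can't."""
    candidates = [card for card in hand if card >= attacker_value]
    return min(candidates) if candidates else None

# cheapest_defender([5, 10], 4) -> 5, saving the 10 for bigger threats
```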
I played around with introducing higher and higher difficulties into the environment for the AI to train on, but that unnecessarily increased training time and let the AI get lazy about decisions that should have been punished. With that in mind, I instead created a gym environment for the AI to play in that focused on learning “optimal” plays: I’d consistently put the model into exercises with only one “correct” play, so I could train it to learn much faster. (A skeleton of this environment is sketched after the list below.)
- Optimal First-Row Defence: The AI would sometimes fail to defend itself even when not defending meant losing, so one exercise drilled these obvious, forced defences.
- Perfect Defence: A 5 in my hand can beat an attacking 5. This means it’s almost always best to make that trade immediately.
- Assassinate: This is a hard one. If we have more 2s and 3s than the enemy player has cards in hand, we should play them all at once. This is the hardest and most technical victory to find, and it can be incredibly hard to recognize in regular play, so this exercise trains the AI to spot the opportunity when it comes along.
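The environment itself was built against TF-Agents’ `PyEnvironment` interface. A stripped-down skeleton of what that looks like (the spec shapes, action count, and reward logic here are placeholders, not the real ones):

```python
import numpy as np
from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts

class KerduEnv(py_environment.PyEnvironment):
    """Skeleton of a Kerdu training environment; sizes are assumptions."""

    def __init__(self):
        super().__init__()
        # One discrete action id per (card, target) choice plus "pass".
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=40, name="action")
        # Flattened encoding of our hand plus both players' rows.
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(60,), dtype=np.int32, minimum=0, name="observation")
        self._episode_ended = False

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def _observe(self):
        # Placeholder: real code encodes the hand and rows into this vector.
        return np.zeros(60, dtype=np.int32)

    def _apply(self, action):
        # Placeholder for the actual game/exercise logic: returns (reward, done).
        return 0.0, False

    def _reset(self):
        self._episode_ended = False
        return ts.restart(self._observe())

    def _step(self, action):
        if self._episode_ended:
            return self.reset()
        reward, done = self._apply(action)
        if done:
            self._episode_ended = True
            return ts.termination(self._observe(), reward)
        return ts.transition(self._observe(), reward, discount=0.99)
```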
Results
This AI had an 86% win rate against the hard-coded AI, and a winning record against me as well. I’m sure with more time I could’ve improved the model, as I never trained it for more than an hour on an unoptimized laptop. If I were to revisit this project, it’d be nice to try out newer techniques and compare them against my more primitive implementations.
Through this project I not only learned a lot about reinforcement learning on my own, but also developed code that was easy to work with for quick experimentation. All AIs and environments were separated, easily saved and loaded, and could be swapped out to face a human player or to continue training.
Thank you for reading!