Reinforcement Learning BlackJack

BlackJack is an ideal game for a computer to learn. The gameplay is relatively straightforward, and a workable strategy can be discovered through trial and error. The same reinforcement learning approach carries over to many other types of games.

This implementation uses reinforcement learning to 'learn' an appropriate strategy for the game. The player (Jack) is given the same information a human player would have and receives feedback only from the results of the games themselves.

The implementation begins by making moves at random and slowly builds confidence in its internally developed strategy. Eventually it stops playing randomly and makes only strategic decisions.
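
This explore/exploit trade-off is handled with a decaying epsilon-greedy rule. A minimal standalone sketch of the idea (the variable names here are illustrative, not the ones used in the notebook below):

import numpy as np

#A minimal sketch of decaying epsilon-greedy action selection
epsilon=1.0                        #start fully random
q_values=np.zeros(2)               #estimated value of [stay, hit] for one state

for game in range(10):
    if np.random.rand()<epsilon:   #explore: pick a random action
        action=np.random.randint(2)
    else:                          #exploit: pick the best known action so far
        action=int(np.argmax(q_values))
    epsilon*=0.999                 #slowly shift from exploring to exploiting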

Game Set-up

Card Deck and 'Hand' Class to assist with the gameplay and scoring.

Some aspects of the game have been simplified, but the core concept remains the same. For instance, there is no card counting, and the dealer will always hit while he is losing.
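
For reference, the simplified dealer used later in play() just keeps adding a random card value (2 through 11) while he is behind. A minimal standalone sketch of that rule (the function name is illustrative and not part of the notebook's classes):

import numpy as np

def dealer_total(player_score,upcard):
    #keep drawing random card values while behind the player and not yet bust
    total=upcard
    while (player_score>=total) & (total<22):
        total+=np.random.randint(2,12)
    return total

print(dealer_total(18,6))   #the dealer draws until he passes 18 or busts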

In [3]:
import numpy as np
#Build the 52-card deck as face+suit strings, e.g. 'AH' for the ace of hearts ('0' stands for ten)
face=["2","3","4","5","6","7","8","9","0","J","Q","K","A"]
suit=["H","S","C","D"]
deck=[]
for i in range(13):
    for z in range(4):
        deck.append(face[i]+suit[z])
In [4]:
class hand():
    #Takes an array of cards from "deck" and scores the hand, counting aces as 11
    #and dropping them to 1 while the total is over 21
    def __init__(self,hand):
        switch={"A":[11,1],"K":[10],"Q":[10],"J":[10],"0":[10],"9":[9],"8":[8],"7":[7],"6":[6],"5":[5],"4":[4],"3":[3],"2":[2]}
        self.cards=hand
        self.score=0
        self.aces=0
        for each in self.cards:
            self.score+=switch[each[0]][0]
            if each[0]=="A":
                self.aces+=1
        #Demote aces from 11 to 1 until the hand is no longer bust (or no aces remain)
        while (self.score>21) & (self.aces>=1):
            self.score-=10
            self.aces-=1
    def getscore(self):
        return self.score
    def getcards(self):
        return self.cards
    def __str__(self):
        return np.array2string(self.cards)
In [11]:
a=hand(np.array(["AC","KS"]))
print(a,a.getscore())
b=hand(np.array(["AC","AS"]))
print(b,b.getscore())
['AC' 'KS'] 21
['AC' 'AS'] 12
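
One more quick check (not in the original notebook) shows the ace demotion at work: with two aces and a nine, one ace drops from 11 to 1 so the hand scores 21 instead of busting.

c=hand(np.array(["AC","AS","9H"]))
print(c,c.getscore())   #expected output: ['AC' 'AS' '9H'] 21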

Player Class

The player initially 'explores' at random and continuously updates its 'q_space' to learn from rewards.

The player receives a reward of +1 for a win (and an extra +1 to the state that reached 21) and -1 for a bust or loss.
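
Concretely, each hand is logged as a list of [score, dealer card, action, reward] entries that the learn() method below consumes. For a hand that hits into a bust, the log would look roughly like this (illustrative values only):

#Illustrative only: the state log built inside play() for one losing hand
#Each entry is [current score, dealer's visible card, action taken, reward]
states=[[14,9,"Hit",-1],     #the hit from 14 led straight to a bust, so that decision is penalised
        [23,9,"Stay",-1]]    #the busted final state is penalised as well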

In [220]:
class player():
    def __init__(self,epsilon=1.0, q_space=None,loading=False):
        self.wins=0
        self.losses=0
        self.gamma=.9                       #decay applied to stored values on each update
        self.epsilon=epsilon                #probability of exploring (choosing a random action)
        self.cur_score=0
        self.hand=None
        if loading==False:
            #22 score rows x 13 dealer-card columns x 2 actions (stay, hit)
            self.q_space=np.zeros((22,13,2))
        else:
            self.q_space=q_space
    def playhand(self,cards,dealercard):
        #Score the current cards and choose "Stay" or "Hit" with an epsilon-greedy rule
        self.hand=hand(cards)
        self.cur_score=self.hand.getscore()
        if self.cur_score>21:
            #The hand is already bust, so there is nothing left to decide
            return "Stay"
        greed=np.random.choice(["Explore","Exploit"],p=[self.epsilon,1-self.epsilon])
        if greed=="Explore":
            action=np.random.choice(["Stay","Hit"],p=[.5,.5])
        else:
            #q_space rows run from a score of 4 upwards, so row = score - 4
            index=self.cur_score-4
            action=np.argmax(self.q_space[index,dealercard,:])
            if action==0:
                action="Stay"
            else:
                action="Hit"
        return action
    def play(self,_print=True):
        #Play one full hand: deal, act until "Stay" or bust, let the dealer play, then learn
        won=False
        a=np.random.choice(deck,2)          #two starting cards, drawn with replacement (simplified deal)
        dc=np.random.randint(2,12)          #dealer's visible card value, 2-11
        action=self.playhand(a,dc)
        states=[]                           #log of [score, dealer card, action, reward] per decision
        states.append([self.cur_score,dc,action,0])
        index=0

        while action=="Hit":
        
            card=np.random.choice(deck)

            a=list(a)
            a.append(card)
            a=np.array(a)
            
            action=self.playhand(a,dc)
            if self.cur_score>21:
                states[index][3]=-1         #the hit caused a bust: penalise the state that chose it
            if self.cur_score==21:
                states[index][3]=+1         #the hit reached 21: reward the state that chose it
            states.append([self.cur_score,dc,action,0])
            index+=1
            
        if self.cur_score>21:
            states[index][3]=-1             #the final state is a bust
            
        dealer=dc
        #Dealer plays: keeps adding random card values while behind and not yet bust
        if self.cur_score<22:
            while (self.cur_score>=dealer) & (dealer<22):
                dealer+=np.random.randint(2,12)
            
        if (dealer<22) & (self.cur_score<dealer):
            states[index][3]=-1             #dealer finished higher without busting
        elif self.cur_score<22:
            states[index][3]=1              #dealer busted or was beaten
            won=True
        else:
            states[index][3]=-1             #player busted
            
        self.learn(states)
        if won==True:
            
            self.wins+=1
            
            if self.wins%3==0:
                self.epsilon*=.999          #decay exploration slightly after every third win
        else:
            self.losses+=1
        result= "Win" if won==True else "Lost"
        if _print==True:
            print(self.hand,"Score: "+str(self.cur_score), "Dealer: "+str(dealer),result)
        
    def learn(self,states):
        #Update the q_space to learn from the hand just played.
        #Rows index the hand score (row = score - 4, busts collapsed into row 21),
        #columns index the dealer's visible card, and the last axis is [stay, hit].
        for each in states:
            if each[0]>21:
                index=21
            else:
                index=each[0]-4
            if each[2]=="Hit": 
                action=1
            else:
                action=0
            #Decay the stored value and add the reward observed for this state/action
            self.q_space[index,each[1],action]=self.q_space[index,each[1],action]*self.gamma+each[3]
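
For reference when reading the q_space printouts further down, here is a hypothetical helper (not part of the notebook) that maps a hand score and the dealer's visible card to the corresponding q_space cell:

def q_index(score,dealer_card):
    #row = score - 4, with busted scores collapsed into row 21 to match learn()
    row=21 if score>21 else score-4
    return row,dealer_card

print(q_index(21,5))   #(17, 5) - the (stay, hit) values for a 21 against a dealer 5
print(q_index(16,5))   #(12, 5)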
In [169]:
Jack=player()
Jack.play()
Jack.play()
['AC' '7C' 'JS'] Score: 18 Dealer: 22 Win
['AH' 'QD' 'QC'] Score: 21 Dealer: 22 Win

From these first attempts you can see that Jack is playing randomly. The second hand was dealt a 21 and still decided to hit, ironically ending up with 21 again. Jack will not play with any sense in the first stages of learning the game. This exploration phase is for learning the rewards of the game and trying new things, and it is extremely important for avoiding getting stuck in sub-optimal strategies.

In [221]:
Jack=player()
winlist=[]
for q in range(3000):
    for i in range(100):
        Jack.play(_print=False)
    #record the cumulative win rate after each block of 100 games
    winrate=Jack.wins/(Jack.wins+Jack.losses)
    winlist.append(winrate)

x=np.arange(len(winlist))*100

Learning Curve

The initial stages of the learning process are very volatile, but over time, as Jack becomes more comfortable, the win/loss ratio becomes more consistent. Of course, the dealer still has an advantage in this game setup and will win more than 50% of the time regardless of strategy.

This learning curve can be slowed or sped up with simple changes to the learning rate and explore/exploit ratio.
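
For example, the constructor's epsilon parameter sets how much random exploration the player starts with, so a smaller value shortens the random phase. Illustrative only (this run is not part of the notebook):

#Illustrative only: a player that starts out half as random,
#so it commits to its learned strategy sooner
GreedyJack=player(epsilon=0.5)
for i in range(1000):
    GreedyJack.play(_print=False)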

In [255]:
import pandas as pd
import matplotlib.pyplot as plt

z=pd.DataFrame({"Games":x,"Winrate": winlist},columns=["Games","Winrate"])
plt.figure(figsize=(13,7))
plt.plot(z["Games"],z["Winrate"])
plt.xlabel('Games')
plt.ylabel('Win-Rate')
plt.title('Learning Curve')
plt.show()

Strategies

Most of Jack's strategies end up being fairly monotonic, which is partly why this game is a good set-up for a computer to learn. Concretely, this means that good play is deterministic and static over time: the game does not adapt to your strategic choices the way some games do.

Looking at the q_space below (Jack's decision-making memory):

In [256]:
#Decisions when holding 21 (left column: stay, right column: hit). Of course, these decisions are fairly easy to make.
Jack.q_space[17,2:12,:]
Out[256]:
array([[10.        , -6.46065687],
       [10.        , -7.06011677],
       [10.        , -2.80074289],
       [10.        , -3.67710728],
       [10.        , -4.20706848],
       [10.        , -1.917327  ],
       [10.        , -2.24952502],
       [10.        , -5.53667808],
       [10.        , -4.93291285],
       [10.        , -2.83902765]])

The decision when holding 16 is more mixed: the dealer's card needs to be considered before choosing.

While holding 16, Jack ends up deciding to stay only when the dealer is showing a 5.

In [257]:
#Decisions when holding 16 (left column: stay, right column: hit)
Jack.q_space[12,2:12,:]
Out[257]:
array([[-8.66239175, -5.61061663],
       [-8.83849294, -3.12728268],
       [-8.21582785, -6.92599006],
       [-2.31666161, -8.13780881],
       [-8.77131512, -5.49943127],
       [-8.2443085 , -4.09373049],
       [-8.55068364, -3.95204492],
       [-9.10602751, -4.87100256],
       [-8.87112017, -4.12153443],
       [-8.55068324, -4.79273661]])
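
The claim above can be read straight off the array: the stay column only beats the hit column in the dealer-5 row. A small sketch (not in the original notebook) that prints the greedy action for a 16 against each dealer card:

#Illustrative only: read the greedy action for a score of 16 (row 12) per dealer card
for dealer_card in range(2,12):
    stay,hit=Jack.q_space[12,dealer_card]
    print(dealer_card,"Stay" if stay>hit else "Hit")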
In [235]:
Jack.wins, Jack.losses=0, 0
for i in range(100):
    Jack.play()
['4C' '5C' '3C' 'QC'] Score: 22 Dealer: 8 Lost
['AS' '4S' '9S' 'JD'] Score: 24 Dealer: 4 Lost
['6C' 'QC' '2H'] Score: 18 Dealer: 20 Lost
['5D' 'KH' '6C'] Score: 21 Dealer: 22 Win
['5H' '0D' 'JH'] Score: 25 Dealer: 9 Lost
['JS' 'KC'] Score: 20 Dealer: 22 Win
['6H' '7H' '9C'] Score: 22 Dealer: 2 Lost
['3H' 'QS' '4C'] Score: 17 Dealer: 21 Lost
['9C' '2S' '6H'] Score: 17 Dealer: 20 Lost
['JS' '2H' '0H'] Score: 22 Dealer: 4 Lost
['AC' '0C'] Score: 21 Dealer: 32 Win
['0D' '8H'] Score: 18 Dealer: 20 Lost
['3C' '5S' 'KC'] Score: 18 Dealer: 29 Win
['5H' '0S' 'QH'] Score: 25 Dealer: 6 Lost
['9H' 'QH'] Score: 19 Dealer: 27 Win
['8C' '0D'] Score: 18 Dealer: 27 Win
['4C' 'KH' '2H' '5D'] Score: 21 Dealer: 24 Win
['JS' 'QS'] Score: 20 Dealer: 24 Win
['QS' 'QD'] Score: 20 Dealer: 22 Win
['AH' 'KH'] Score: 21 Dealer: 23 Win
['2C' '4H' '6S' 'AC' '4D'] Score: 17 Dealer: 19 Lost
['8C' '7S' '4S'] Score: 19 Dealer: 27 Win
['3D' 'AS' 'QD' 'AH' '3D'] Score: 18 Dealer: 22 Win
['4H' '7D' '5H' 'KD'] Score: 26 Dealer: 3 Lost
['9S' '7H' 'JS'] Score: 26 Dealer: 10 Lost
['AD' 'KC'] Score: 21 Dealer: 30 Win
['JC' '8H'] Score: 18 Dealer: 22 Win
['AH' 'JD'] Score: 21 Dealer: 29 Win
['JH' '2C' 'AD' 'AD' '3D'] Score: 17 Dealer: 21 Lost
['JS' '0S'] Score: 20 Dealer: 22 Win
['7S' '8C' '0C'] Score: 25 Dealer: 8 Lost
['KD' '6D' 'QS'] Score: 26 Dealer: 2 Lost
['0H' '3H' '0H'] Score: 23 Dealer: 7 Lost
['7C' '3H' 'QH'] Score: 20 Dealer: 23 Win
['8S' '4H' 'KH'] Score: 22 Dealer: 10 Lost
['0C' '9H'] Score: 19 Dealer: 21 Lost
['QS' '8S'] Score: 18 Dealer: 20 Lost
['QD' '4H' '7H'] Score: 21 Dealer: 23 Win
['7C' '4H' '6S'] Score: 17 Dealer: 22 Win
['2H' '0C' '2D' '4H'] Score: 18 Dealer: 28 Win
['2S' '8C' 'QD'] Score: 20 Dealer: 22 Win
['2H' '3D' '7C' '9H'] Score: 21 Dealer: 24 Win
['9H' 'KS'] Score: 19 Dealer: 22 Win
['QH' 'QC'] Score: 20 Dealer: 23 Win
['QH' '3S' 'QS'] Score: 23 Dealer: 5 Lost
['5D' 'KS' '6S'] Score: 21 Dealer: 27 Win
['7S' 'KH'] Score: 17 Dealer: 20 Lost
['4C' '5S' 'JD'] Score: 19 Dealer: 25 Win
['7H' '9D' '9C'] Score: 25 Dealer: 4 Lost
['5C' 'KS' '3S'] Score: 18 Dealer: 27 Win
['2H' 'KD' '7C'] Score: 19 Dealer: 23 Win
['KS' '0C'] Score: 20 Dealer: 26 Win
['QH' '8C'] Score: 18 Dealer: 19 Lost
['5C' '3D' '2H' '0H'] Score: 20 Dealer: 26 Win
['5C' '4H' 'JD'] Score: 19 Dealer: 22 Win
['4D' '8D' 'KD'] Score: 22 Dealer: 2 Lost
['6D' 'JH' '6H'] Score: 22 Dealer: 11 Lost
['6H' 'KS' '6S'] Score: 22 Dealer: 8 Lost
['KD' '2H' '5S'] Score: 17 Dealer: 19 Lost
['0D' '9S'] Score: 19 Dealer: 21 Lost
['3S' 'AC' '9H' '6S'] Score: 19 Dealer: 22 Win
['6D' '4S' 'KH'] Score: 20 Dealer: 21 Lost
['8D' 'KH'] Score: 18 Dealer: 23 Win
['QH' '5S' 'KS'] Score: 25 Dealer: 11 Lost
['8H' '0D'] Score: 18 Dealer: 23 Win
['QD' '6H' '6H'] Score: 22 Dealer: 2 Lost
['AH' 'KC'] Score: 21 Dealer: 25 Win
['7H' '4H' '6S'] Score: 17 Dealer: 18 Lost
['2S' '9C' '6D'] Score: 17 Dealer: 26 Win
['3S' 'QH' '8S'] Score: 21 Dealer: 25 Win
['7C' '0S'] Score: 17 Dealer: 19 Lost
['JD' '7D'] Score: 17 Dealer: 18 Lost
['2H' '4D' '9C' '4S'] Score: 19 Dealer: 20 Lost
['2C' '8D' '8D'] Score: 18 Dealer: 20 Lost
['5C' '4D' '4S' 'QD'] Score: 23 Dealer: 3 Lost
['9H' '4D' '4C'] Score: 17 Dealer: 27 Win
['KS' '6D' '9C'] Score: 25 Dealer: 6 Lost
['QD' '9S'] Score: 19 Dealer: 24 Win
['8S' '7S' '3S'] Score: 18 Dealer: 24 Win
['JS' '8S'] Score: 18 Dealer: 28 Win
['3S' '5D' 'KH'] Score: 18 Dealer: 25 Win
['AS' 'QC'] Score: 21 Dealer: 28 Win
['KC' 'JH'] Score: 20 Dealer: 27 Win
['9D' '6C' '4C'] Score: 19 Dealer: 22 Win
['2H' '4D' 'QS' '6D'] Score: 22 Dealer: 8 Lost
['AC' '9C'] Score: 20 Dealer: 25 Win
['AS' 'QC'] Score: 21 Dealer: 27 Win
['9H' '8D'] Score: 17 Dealer: 24 Win
['5D' '2D' 'KD'] Score: 17 Dealer: 27 Win
['4C' 'QS' 'QH'] Score: 24 Dealer: 7 Lost
['7H' '9H' '6C'] Score: 22 Dealer: 8 Lost
['QS' '3S' 'QS'] Score: 23 Dealer: 10 Lost
['JH' '6C'] Score: 16 Dealer: 17 Lost
['QS' '0S'] Score: 20 Dealer: 22 Win
['JS' '4H' '5D'] Score: 19 Dealer: 20 Lost
['KD' '0D'] Score: 20 Dealer: 21 Lost
['JS' '4C' '0D'] Score: 24 Dealer: 10 Lost
['QC' '5C' '6S'] Score: 21 Dealer: 23 Win
['7D' '7C' 'QS'] Score: 24 Dealer: 9 Lost
['QH' 'KH'] Score: 20 Dealer: 23 Win
In [259]:
print("100 Sampled Hands Win rate {}%".format(Jack.wins*100/(Jack.losses+Jack.wins)))
100 Sampled Hands Win rate 52.0%