Building a Chess Agent using DQN

Published on 2025-03-24

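I recently tried to implement a DQN-based chess agent.

Now, anyone who knows how DQNs and chess work will tell you that this is a dumb idea.

And... it was, but as a beginner I enjoyed it nevertheless. In this article I'll share the insights I gained while working on it.

Understanding the Environment

Before I started implementing the agent itself, I had to familiarize myself with the environment I would be using and write a custom wrapper on top of it, so that it could interact with the agent during training.

I used the chess environment from the kaggle_environments library.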

```
from kaggle_environments import make
env = make("chess", debug=True)
```

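In this environment, the state of the board is stored in FEN format. I also used Chessnut, a lightweight Python chess library, to parse it.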
```
from Chessnut import Game
initial_fen = env.state[0]['observation']['board']
game = Game(initial_fen)
```
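FEN provides a compact way to represent all the pieces on the board and the currently active player. However, since I planned to feed the input to a neural network, I had to change the representation of the state.

Converting FEN to matrix format

Since there are 12 different types of pieces on the board (six per side), I created 12 channels of 8x8 grids, one channel per piece type, to represent where each type sits on the board.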
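The original conversion code did not survive in this copy of the article, so the following is only a minimal sketch of what such a conversion could look like. The function name `fen_to_board` matches the one called in the wrapper, but the body, the channel order, and the use of plain nested lists are assumptions.

```python
# Channel order is an assumption: white P, N, B, R, Q, K, then black p, n, b, r, q, k.
PIECE_CHANNELS = {piece: i for i, piece in enumerate("PNBRQKpnbrqk")}


def fen_to_board(fen):
    """Convert a FEN string into a 12x8x8 nested list of 0/1 values."""
    board = [[[0] * 8 for _ in range(8)] for _ in range(12)]
    placement = fen.split()[0]  # first FEN field: piece placement
    for row, rank in enumerate(placement.split("/")):  # rank 8 comes first
        col = 0
        for ch in rank:
            if ch.isdigit():
                col += int(ch)  # a digit means that many empty squares
            else:
                board[PIECE_CHANNELS[ch]][row][col] = 1
                col += 1
    return board


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
planes = fen_to_board(start)
print(sum(sum(row) for row in planes[0]))  # white pawns on the board → 8
```

The nested list can be passed straight to `torch.tensor(..., dtype=torch.float32)`. Note that this sketch drops the side-to-move field of the FEN string.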

Creating a wrapper for the environment

```
import random

from Chessnut import Game
from kaggle_environments import make


class EnvCust:
    def __init__(self):
        self.env = make("chess", debug=True)
        self.game = Game(self.env.state[0]['observation']['board'])
        print(self.env.state[0]['observation']['board'])
        self.action_space = self.game.get_moves()
        self.obs_space = self.env.state[0]['observation']['board']

    def get_action(self):
        # Legal moves (UCI strings) for the side to move in the current position
        return Game(self.env.state[0]['observation']['board']).get_moves()

    def get_obs_space(self):
        # Current position as a 12-channel 8x8 matrix
        return fen_to_board(self.env.state[0]['observation']['board'])

    def step(self, action):
        reward = 0
        g = Game(self.env.state[0]['observation']['board'])

        # Reward captures according to the black piece on the target square
        captured = g.board.get_piece(Game.xy2i(action[2:4]))
        if captured == 'q':
            reward = 7
        elif captured in ('n', 'b', 'r'):
            reward = 4
        elif captured == 'p':
            reward = 2

        g.apply_move(action)
        done = False
        if g.status == 2:  # checkmate delivered by the agent
            done = True
            reward = 10
        elif g.status == 1:  # status 1 is CHECK in Chessnut; the episode ends here with a penalty
            done = True
            reward = -5

        self.env.step([action, 'None'])
        self.action_space = list(self.get_action())
        if self.action_space == []:
            done = True
        else:
            # The opponent replies with a random legal move
            self.env.step(['None', random.choice(self.action_space)])
            g = Game(self.env.state[0]['observation']['board'])
            if g.status == 2:  # the agent has been checkmated
                reward = -10
                done = True

        self.action_space = list(self.get_action())
        return self.env.state[0]['observation']['board'], reward, done
```

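The point of this wrapper was to provide a reward policy for the agent and a step function for interacting with the environment during training.

I tried to design the reward policy so that it gives positive points for checkmates and for capturing enemy pieces, and negative points for losing the game.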

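Creating the replay buffer

The replay buffer is used during training to store the (state, action, reward, next state) outputs of each step. Mini-batches are later sampled from it at random and used, together with the target network, for backpropagation.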
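The buffer code itself is missing from this copy of the article. A minimal sketch, assuming a fixed-capacity buffer with uniform random sampling, might look like this. The `sample(batch_size)` method and the five-element tuple follow the training loop further down; the class name, the `push` method, and the default capacity are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity=10000):  # capacity is an illustrative value
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement, which breaks the correlation
        # between consecutive moves of the same game.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Storing the FEN strings rather than tensors keeps each entry small; the conversion to a matrix can then happen at sampling time.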

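Helper functions

Each action is encoded as one of 64*64 = 4096 (from-square, to-square) pairs. I know that not all 64*64 actions are legal, but I could handle legality using Chessnut, and the pattern was simple enough.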
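The helper code is also missing here, so this is a hedged sketch of one possible encoding. The names `uci_to_action_index` and `action_index` appear in the training loop; their bodies, the square numbering (a8 = 0, h1 = 63), and the hypothetical inverse `index_to_uci` are assumptions.

```python
def square_index(square):
    """Map a square such as 'e2' to 0..63, with a8 = 0 and h1 = 63."""
    col = ord(square[0]) - ord("a")
    row = 8 - int(square[1])
    return row * 8 + col


def uci_to_action_index(move):
    """Encode a UCI move as from_square * 64 + to_square (0..4095)."""
    # Any promotion suffix (the 'q' in 'e7e8q') is ignored in this sketch.
    return square_index(move[0:2]) * 64 + square_index(move[2:4])


def action_index(moves):
    """Encode a list of UCI moves, for example the legal moves from Chessnut."""
    return [uci_to_action_index(m) for m in moves]


def index_to_uci(index):
    """Inverse mapping, without promotion information."""
    src, dst = divmod(index, 64)

    def name(i):
        return "abcdefgh"[i % 8] + str(8 - i // 8)

    return name(src) + name(dst)


print(uci_to_action_index("e2e4"))  # → 3364
```

Passing the result of `Game(fen).get_moves()` to `action_index` gives the list of legal indices that can be used to mask the network's 4096 outputs.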

Neural Network structure

```
import torch
import torch.nn as nn
import torch.optim as optim


class DQN(nn.Module):
    def __init__(self):
        super(DQN, self).__init__()
        # Two 3x3 convolutions with padding=1 keep the 8x8 board size
        self.conv_layers = nn.Sequential(
            nn.Conv2d(12, 32, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU()
        )
        # One output per (from-square, to-square) pair: 64 * 64 = 4096
        self.fc_layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 4096)
        )

    def forward(self, x):
        x = x.unsqueeze(0)  # add a batch dimension: (12, 8, 8) -> (1, 12, 8, 8)
        x = self.conv_layers(x)
        x = self.fc_layers(x)
        return x

    def predict(self, state, valid_action_indices):
        with torch.no_grad():
            q_values = self.forward(state)
            q_values = q_values.squeeze(0)
            # Keep only the Q-values of legal moves
            valid_q_values = q_values[valid_action_indices]
            best_action_relative_index = valid_q_values.argmax().item()
            max_q_value = valid_q_values.max()
            best_action_index = valid_action_indices[best_action_relative_index]
            return max_q_value, best_action_index
```


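This neural network uses convolutional layers to take in the 12-channel input, and it uses the valid action indices to filter its output predictions down to legal moves.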
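The filtering idea can be illustrated without PyTorch. The sketch below picks an action among the legal indices only, with an optional random exploration step; the function name `select_action` and the `epsilon` value are assumptions for illustration and are not taken from the original code.

```python
import random


def select_action(q_values, valid_action_indices, epsilon=0.1, rng=random):
    """Pick an action index, considering only the legal moves.

    q_values: sequence of 4096 scores, one per (from, to) pair.
    valid_action_indices: indices of the moves that are legal right now.
    epsilon: exploration rate (an illustrative value).
    """
    if rng.random() < epsilon:
        return rng.choice(valid_action_indices)  # explore: random legal move
    # exploit: best score among the legal moves only
    return max(valid_action_indices, key=lambda i: q_values[i])


q = [0.0] * 4096
q[10] = 5.0  # highest score overall, but assume this move is illegal
q[3364] = 1.5
q[100] = 0.2
print(select_action(q, [100, 3364], epsilon=0.0))  # → 3364
```

Index 10 has the highest score but is never returned, because it is not in the list of legal indices.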

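Implementing the agent

```
# Fragment of the training loop. Lines shown as "# ..." are truncated in the source.
        # ...):
        #     break
        a_index = action_index(action)
        # if random.random() ...
        # ... batch_size:
            mini_batch = replay_buffer.sample(batch_size)
            for e in mini_batch:
                state, action, reward, next_state, done = e
                # Legal moves in the next state, used to mask the target network's output
                g = Game(next_state)
                act = g.get_moves()
                ind_a = action_index(act)
                input_state = torch.tensor(fen_to_board(next_state), dtype=torch.float32, requires_grad=True)  # ...
                tpred, _ = target_network.predict(input_state, ind_a)
                # One-step TD target; terminal states contribute only their reward
                target = reward + gamma * tpred * (1 - done)
                act_ind = uci_to_action_index(action)
                input_state2 = torch.tensor(fen_to_board(state), dtype=torch.float32, requires_grad=True)  # ...
                current_q_value = model(input_state2)[0, act_ind]
                loss = (current_q_value - target) ** 2
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    if ep % 5 == 0:
        # Copy the online network's weights into the target network every 5 episodes
        target_network.load_state_dict(model.state_dict())
```

This was obviously a very basic model that had no chance of actually performing well (and it didn't), but it did help me understand how DQNs work a little better.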

Reprint notice: this article is reproduced from https://dev.to/ankit_upadhyay_1c38ae52c0/building-a-chess-agent-using-dqn-40po?1