The future reward expectation, with the corresponding action under each state as its value, is initialized to 0. This value is obtained according to the Q function:

$Q(S_t, a_t) = R(S_t, a_t) + \gamma \cdot \max\{Q(S_{t+1}, a_{t+1})\}$ (6)

where $R(S_t, a_t)$ represents the reward return obtained at moment $t$ in state $S_t$ by taking action $a_t$; $\gamma$ is the discount factor, satisfying $0 \leq \gamma \leq 1$; and $\max\{Q(S_{t+1}, a_{t+1})\}$ represents the maximum expected Q-value of the next state $S_{t+1}$. The action $a_{t+1}$ corresponding to this maximum expected Q-value is the action taken at moment $t+1$, pertaining to the state transition information. The Q-values of the corresponding states and actions constitute the elements of the matrix, and the Q-table is thus constructed, as indicated in Table 1.

Table 1 has 11 global states as columns and m actions as rows. Each cell in the table contains the expected reward Q-value calculated using the Q function. Notably, this value does not represent the maximum reward return obtained after taking a certain action; instead, it aims at maximizing the sum of the future discounted rewards, i.e., obtaining the maximum discounted reward expectation. All values are initialized to 0, thereby yielding Q-table0.

Based on Q-table0, the first global state is selected as the starting state $S_0$, that is, the state with a coverage of 0%. The maximum number of iterations N is set for attaining the target point $S_{n-1}$. In the Q-table, among all actions available in the current state $S_t$, the action with the maximum Q-value is selected as $a_t$. If multiple actions share the same maximum Q-value, one of them is selected at random. After the agent takes action $a_t$, the generated stimuli are sent to the hardware design verification platform and the test design module. The state control module determines whether the current state $S_{t+1}$ is the target point in the global state space. If so, the bit configuration specified by the designer has been reached, the hardware design verification target is completed, and the training is terminated. If the target state is not reached in a given iteration, the following steps are implemented: the iteration counter is increased by 1, that is, $t = t + 1$; we then check whether the number of iterations has reached the maximum defined value, N. If so, the algorithm proceeds to the final step and the training is terminated; if not, the algorithm proceeds to the next step. A minimal sketch of this training loop is given after the list below.

The complete process for bit adjustment involves the following steps:

Initialization:
- Initialize the Q-table or neural network (if using deep RL) with arbitrary values.
- Set the initial state, typically the current bit-width configuration of the HLS design.
- Determine the termination conditions to ensure effective exploration.
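To make the training loop concrete, the following is a minimal Python sketch of the tabular Q-learning procedure described above. The environment function run_verification, the number of actions, and the placeholder reward it returns are hypothetical stand-ins for the actual verification platform and test design module; only the Q-table structure, greedy action selection with random tie-breaking, the update of Eq. (6), and the termination conditions follow the text.

```python
import random

NUM_STATES = 11        # global coverage states (columns of Table 1)
NUM_ACTIONS = 4        # m candidate actions; value chosen only for illustration
GAMMA = 0.9            # discount factor, 0 <= gamma <= 1
MAX_ITERATIONS = 100   # maximum number of iterations N
TARGET_STATE = NUM_STATES - 1   # target point S_{n-1}

# Q-table0: all expected discounted rewards initialized to 0
q_table = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]


def run_verification(state, action):
    """Hypothetical stand-in for the verification platform: in the real flow
    the generated stimuli are simulated and the coverage feedback yields the
    reward R(S_t, a_t) and the next state S_{t+1}."""
    reward = 1.0 if action == state % NUM_ACTIONS else 0.0   # placeholder reward
    next_state = min(state + 1, TARGET_STATE) if reward > 0 else state
    return reward, next_state


state = 0   # starting state S0: coverage of 0%
for t in range(MAX_ITERATIONS):
    # Select the action with the maximum Q-value in the current state;
    # break ties randomly when several actions share the maximum.
    row = q_table[state]
    best = max(row)
    action = random.choice([a for a, q in enumerate(row) if q == best])

    reward, next_state = run_verification(state, action)

    # Q-function update, Eq. (6): Q(S_t, a_t) = R(S_t, a_t) + gamma * max Q(S_{t+1}, .)
    q_table[state][action] = reward + GAMMA * max(q_table[next_state])

    if next_state == TARGET_STATE:
        break   # verification target reached; terminate training
    state = next_state
```

In practice the reward and the next coverage state would come from the coverage feedback of the verification platform rather than from this toy stand-in, and the loop would be restarted from $S_0$ for multiple training episodes.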
