Action idea


Author: Admin | 2025-04-28

Reinforcement learning makes it possible to tackle decision-making problems that were previously intractable because of their high-dimensional state and action spaces. In this section, we briefly describe our learning environment and AI agent, and discuss the learning process and some implementation considerations. Accordingly, an automated trading system is introduced to optimize the trading strategies of the experiments performed in this work.

3.1. Learning Environment

The parameter optimization problem is formulated as a Markov Decision Process represented by a tuple $(S, A, R, \tau)$, where $S$ is the set of possible states, $A$ is the set of legal actions, $R: S \times A \to \mathbb{R}$ is the reward function, and $\tau: S \times A \times \mathbb{R} \to S$ is the transition function that generates a new state in a possibly stochastic or deterministic environment $\mathcal{E}$.

The scenario, or state of the environment, is defined as the data sets $\mathcal{D}$ plus the history of evaluated parameter configurations and their corresponding responses.

The agent navigates the parameter response space through a series of actions, which are simply the next parameter configurations to be evaluated; the action space therefore corresponds to the space of all parameter configurations through the function $g: A \to \Lambda$. According to this definition of the action space, the agent executes an action from $A = \{1, \ldots, |A|\}$. For example, action $a = 1$, $a \in A$, corresponds to the parameter set $\lambda = g(a) = \{\lambda_1\}^{\dim(\Lambda)}$, and action $a = |A|$ corresponds to the parameter set $\lambda = \{\lambda_{|A|}\}^{\dim(\Lambda)}$.

The parameter response surface can be any performance metric, defined by the function $f: \mathcal{D} \times \Lambda \to \mathbb{R}$. The response surface estimates the value of an objective function $\mathcal{L}$ of a strategy $M_\lambda \in \mathcal{M}$, with parameters $\lambda \in \Lambda$, over a data set $D \subset \mathcal{D}$:

$$f(D, \lambda) = \mathcal{L}(M_\lambda, D). \quad (2)$$

Considering that the agent's task is to maximize the reward, the reward function is set as the parameter response function, and it depends on the data set $D$ and the action selected, as shown below:

$$R(D, a) = -f(D, \lambda = g(a)). \quad (3)$$

The observed reward depends solely on the data set and the parameter configuration selected. Once an action is selected, a new parameter configuration is evaluated. The transition function then generates a new state, $s' \in S$, by appending the newly evaluated parameter configuration and its observed response to the history.
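To make the formulation concrete, the following Python sketch shows one possible way such an environment could be implemented under the assumptions above. The names (`ParamTuningEnv`, `State`, `objective`, `param_values`) are hypothetical illustrations, not part of the original system: `objective` stands in for the objective function $\mathcal{L}(M_\lambda, D)$, `param_grid` plays the role of a discretized parameter space $\Lambda$ with $g(a)$ = `param_grid[a]`, and the reward follows Equation (3).

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Callable, List, Sequence, Tuple


@dataclass
class State:
    """State s: the data set D plus the history of evaluated parameter
    configurations and their observed responses."""
    dataset: object                                          # D, e.g. a price series
    history: List[Tuple[tuple, float]] = field(default_factory=list)


class ParamTuningEnv:
    """Hypothetical sketch of the MDP (S, A, R, tau) described above."""

    def __init__(self, dataset, param_values: Sequence[Sequence[float]],
                 objective: Callable[[tuple, object], float]):
        # Lambda: Cartesian product of the per-parameter candidate values,
        # so that g(a) = param_grid[a] maps an action index to a configuration.
        self.param_grid = list(product(*param_values))
        self.objective = objective                           # stands in for L(M_lambda, D)
        self.state = State(dataset)

    @property
    def n_actions(self) -> int:
        return len(self.param_grid)                          # |A|

    def step(self, action: int) -> Tuple[State, float]:
        """Evaluate the configuration selected by `action`; return (s', reward)."""
        lam = self.param_grid[action]                        # lambda = g(a)
        response = self.objective(lam, self.state.dataset)   # f(D, lambda) = L(M_lambda, D)
        # Assuming the objective is a quantity to be minimized (e.g. a loss),
        # negating it gives a reward to maximize, as in Equation (3).
        reward = -response
        # Transition tau: append the new (lambda, response) pair to the history.
        self.state.history.append((lam, response))
        return self.state, reward


# Hypothetical usage: two tunable parameters and a toy objective.
env = ParamTuningEnv(
    dataset=[1.0, 2.0, 3.0],
    param_values=[(5, 10, 20), (0.01, 0.05)],
    objective=lambda lam, D: (lam[0] - 10) ** 2 + lam[1],
)
next_state, reward = env.step(action=3)
```

A learning agent (for example, a Q-learning or bandit-style policy) would then call `step()` repeatedly, choosing the next configuration to evaluate based on the rewards and the history accumulated in the state.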
