How to define an MDP as a python function?

I'm interested in defining a Markov Decision Process as a python function. It would need to interface with the PyTorch API for reinforcement learning, however, and that constraint shapes the function's form, inputs and outputs.
For context, my problem involves optimally placing items in a warehouse without knowing the value of future items that might arrive. Anticipating these arrivals would limit the algorithm's greedy behavior, effectively reserving some high-value locations for high-value items that might arrive, as learned by the RL model.
How can I best define such a function? (I'm not asking about the business logic, but about the requirements on its form, inputs, outputs, etc.) What does PyTorch expect of an MDP?

Use CleanRL.
Make a custom environment using Gymnasium: https://gymnasium.farama.org/tutorials/gymnasium_basics/environment_creation.html
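To give an idea of the shape Gymnasium (and therefore CleanRL and other PyTorch RL libraries) expects, here is a minimal sketch of a custom environment for the warehouse-placement setting. The observation layout, reward, and termination logic are placeholders for your business logic, and the class and parameter names are made up for illustration:

import gymnasium as gym
import numpy as np
from gymnasium import spaces

class WarehouseEnv(gym.Env):
    def __init__(self, n_locations=50):
        super().__init__()
        self.n_locations = n_locations
        # observation: current item's value plus the occupancy of each location
        self.observation_space = spaces.Box(low=0.0, high=1.0,
                                            shape=(n_locations + 1,), dtype=np.float32)
        # action: the location to place the current item in
        self.action_space = spaces.Discrete(n_locations)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.occupied = np.zeros(self.n_locations, dtype=np.float32)
        self.item_value = float(self.np_random.random())
        return self._obs(), {}

    def step(self, action):
        # placeholder reward: the item's value if the chosen slot was free, else nothing
        reward = 0.0 if self.occupied[action] else self.item_value
        self.occupied[action] = 1.0
        self.item_value = float(self.np_random.random())  # value of the next arriving item
        terminated = bool(self.occupied.all())             # episode ends when the warehouse is full
        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        return np.concatenate(([self.item_value], self.occupied)).astype(np.float32)

In short, PyTorch itself expects nothing of an MDP; the RL libraries built on it expect an environment object exposing observation_space, action_space, reset(), and step() returning (observation, reward, terminated, truncated, info).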

Related

Best Deep-DQN implementation that works with (state,action) pairs

I'm moderately new to RL (Reinforcement Learning) and am trying to solve a problem using a deep Q-learning agent (trying a bunch of algorithms), and I don't want to implement my own agent (I would probably do a worse job than anyone writing an RL library).
My problem is that the way I'm able to view my state space is as (state,action) pairs which poses a technical problem more than an algorithmic one.
I've found libraries that allow me to upload my own neural network as a Q-function estimator,
but I haven't been able to find a library that lets me evaluate
(state, action) -> Q_estimation, or even better [(state, action)_1, ..., (state, action)_i] -> action to take according to the policy (either greedy or exploratory).
All I've found are libraries that allow me to input "state" and "possible actions" and get Q values or an action choice back.
My second problem is that I want to control the horizon - meaning I want to use a finite horizon.
In short, what I'm looking for is:
An RL library that will allow me to use a deep Q-network agent that accepts (state, action) pairs and approximates the relevant Q value.
I would also like to control the horizon of the problem.
Does anyone know of any solutions? I've spent days searching the internet for an implemented one.
There is some reward "function" R(s,a) (or perhaps R(s'), where s' is determined by the state-action pair) that gives the reward needed to train a deep Q-learner this way. The reason there aren't really out-of-the-box solutions is that the reward function is up to you to define for your specific learning problem, and it furthermore depends on your state transition function P(s,s',a), which gives the probability of reaching state s' from state s given action a. That function is also problem-specific.
Let's assume for simplicity that your reward is a function only of the state. In this case, you simply need to write a function that scores each state reachable from state s under action a and weights it by the probability of reaching that state.
def get_expected_reward(s, a):
    # sum over all possible next states
    # `reachable`, `P` and `reward` are assumed to be defined elsewhere:
    # the set of reachable next states, the transition-probability tensor,
    # and the state-reward function, respectively
    expected_reward = 0.0
    for s_next in reachable:
        # reward is a function of the next state, weighted by the
        # probability of reaching it
        expected_reward += P[s, s_next, a] * reward(s_next)
    return expected_reward
Then, when training:
states, actions = next(dataloader)  # here I assume a batch size of b
targets = torch.empty(len(actions))  # pre-allocate one target per sample in the batch
for b in range(len(actions)):
    targets[b] = get_expected_reward(states[b], actions[b])
# forward pass through model
predicted_rewards = model(states, actions)
# loss
loss = loss_function(predicted_rewards, targets)

Reinforcement learning DQN environment structure

I am wondering how best to feed the changes my DQN agent makes to its environment back to itself.
I have a battery model whereby an agent can observe a time-series forecast of 17 steps, and 5 features. It then makes a decision on whether to charge or discharge.
I want to include its current state of charge (empty, half full, full, etc.) in its observation space (i.e. somewhere within the (17,5) dataframes I am feeding it).
I have several options, I can either set a whole column to the state of charge value, a whole row, or I can flatten the whole dataframe and set one value to the state of charge value.
Is any of these unwise? It seems a little rudimentary to me to set a whole column to a single value, but should it actually impact performance? I am wary of flattening the whole thing, as I plan to use either conv or LSTM layers (although the current model is just dense layers).
You would not want to add unnecessary, repetitive features to the state representation, as they might hamper your RL agent's convergence later when you want to scale your model to larger input sizes (if that is in your plan).
Also, the decision of how much information to give in the state representation is mostly experimental. The best way to start would be to give just a single value as the battery state. But if the model does not converge, then you could try the other options you mentioned in your question.
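For illustration, here is a rough sketch of the single-value option, assuming the forecast is a (17, 5) NumPy array; the names forecast and soc are placeholders:

import numpy as np

forecast = np.random.rand(17, 5).astype(np.float32)  # stand-in for the real forecast window
soc = 0.5                                            # current state of charge in [0, 1]

# append the state of charge as one extra value instead of repeating it
# across a whole row or column
flat_observation = np.concatenate([forecast.ravel(), [soc]]).astype(np.float32)

If you later move to conv or LSTM layers, an alternative is to keep the (17, 5) forecast intact and feed the state of charge through a separate input branch rather than flattening everything.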

<lifelines> Solving Cox Proportional Hazard after creating interaction variable with time

I am using lifelines package to do Cox Regression. After trying to fit the model, I checked the CPH assumptions for any possible violations and it returned some problematic variables, along with the suggested solutions.
One of the solutions I would like to try is the one suggested here:
https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates
However, the example written there uses CoxTimeVaryingFitter which, unlike CoxPHFitter, does not have a concordance score, which would help me gauge model performance. Additionally, CoxTimeVaryingFitter does not have the check_assumptions feature. Does this mean that by putting the data into episodic format, all the assumptions are automatically satisfied?
Alternatively, after reading a SAS textbook on survival analysis, it seemed like their solution is to create the interaction term directly (multiplying the problematic variable by the survival time) without changing the data to episodic format (as shown in the link). This way, I was hoping to keep using CoxPHFitter because of its model-scoring capability.
However, after doing this alternative, when I call check_assumptions again on the model with the time-interaction variable, the CPH assumption on the time-interaction variable is violated.
Now I am torn between:
Using CoxTimeVaryingFitter without knowing what the model performance is (seems like a bad idea)
Using CoxPHFitter, but the assumption is violated on the time-interaction variable (which inherently does not seem to fix the problem)
Any help in resolving this confusion is greatly appreciated.
Here is one suggestion:
If you choose the CoxTimeVaryingFitter, then you need to somehow evaluate the quality of your model. Here is one way. Use the regression coefficients B and write down your model. I'll write it as S(t;x;B), where S is an estimator of the survival, t is the time, and x is a vector of covariates (age, wage, education, etc.). Now, for every individual i, you have a vector of covariates x_i. Thus, you have the survival function for each individual. Consequently, you can predict which individual will 'fail' first, which 'second', and so on. This produces a (predicted) ranking of survival. However, you know the real ranking of survival since you know the failure times or times-to-event. Now, quantify how many pairs (predicted survival, true survival) share the same ranking. In essence, you would be estimating the concordance.
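As a rough sketch of that idea using lifelines' own utilities (the dataframe long_df and the column names id, start, stop, event are assumptions, and summarising each subject by their last row is a simplification, since the covariates vary over time):

from lifelines import CoxTimeVaryingFitter
from lifelines.utils import concordance_index

ctv = CoxTimeVaryingFitter()
ctv.fit(long_df, id_col="id", event_col="event", start_col="start", stop_col="stop")

# summarise each subject by their last observed row
last_rows = long_df.groupby("id").last()
partial_hazards = ctv.predict_partial_hazard(last_rows)

# a higher partial hazard implies shorter survival, so negate it for the ranking
c_index = concordance_index(last_rows["stop"], -partial_hazards, last_rows["event"])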
If you opt to use CoxPHFitter, I don't think it is meant to be used with time-varying covariates. Instead, you could use two other approaches. One is to stratify the variable, i.e., cph.fit(dataframe, time_column, event_column, strata=['your variable to stratify']). The downside is that you no longer obtain a hazard ratio for that variable. The other approach is to use splines. Both of these methods are explained here.
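For example, a minimal sketch of the stratification route (the dataframe df and the column names duration, event, problem_var are placeholders):

from lifelines import CoxPHFitter

cph = CoxPHFitter()
# the problematic variable moves into the baseline hazard via strata,
# so no hazard ratio is estimated for it
cph.fit(df, duration_col="duration", event_col="event", strata=["problem_var"])
print(cph.concordance_index_)   # concordance is still available this way
cph.check_assumptions(df)       # re-check the remaining covariates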

How to implement a predictive model (10 types with 100 parameters) in C++ when a DLL (+header) is the expected delivery?

My desktop application processes documents (10 types exist) to provide intelligence using hundreds of parameters. Through supervised training, I came up with a predictive model that uses 100 of the parameters to indicate whether a document is a reliable source of information or not. The training and testing of the machine-learning model was done in Python. Currently I need to implement the prediction part, which uses the parameter weights from the training part, in my [MFC/VC++] desktop application.
The suggested architecture is to provide a DLL plus header that exposes a function:
bool isDocumentReliable(int docID)
Based on the type of the document, the prediction uses a set of parameters to calculate the probability of the document being reliable. Based on risk assessment (requirements of the business) we translate the probability into a true/false answer.
I am looking for some architectural/implementation information to guide my implementation.
That's my first Machine Learning project and my questions are:
What are the questions I need to be asking?
Should the parameters be hard-coded into my functions? or
Should the parameters be read from text files at runtime?
I strongly suspect that the function should be bool isDocumentReliable(std::wstring pathname), but that's a minor detail. The main question, by far, should be: How are we going to prototype this? Don't expect this to work straight away.
If you've got a boss who thinks Machine Learning is like writing software, and that it either works or it doesn't, tell him he's flat-out wrong.

Python libraries for on-line machine learning MDP

I am trying to devise an iterative markov decision process (MDP) agent in Python with the following characteristics:
- observable state: I handle a potential 'unknown' state by reserving some state space for answering query-type moves made by the DP (the state at t+1 will identify the previous query [or zero if the previous move was not a query] as well as the embedded result vector); this space is padded with 0s to a fixed length to keep the state frame aligned regardless of which query was answered (whose data lengths may vary)
- actions that may not always be available at all states
- a reward function that may change over time
- policy convergence should be incremental and only computed per move
So the basic idea is that the MDP should make its best-guess optimized move at T using its current probability model (and since it's probabilistic, the move it makes is expectedly stochastic, implying possible randomness), couple the new input state at T+1 with the reward from the previous move at T, and re-evaluate the model. The convergence must not be permanent, since the reward may modulate or the available actions could change.
What I'd like to know is if there are any current python libraries (preferably cross-platform as I necessarily change environments between Windoze and Linux) that can do this sort of thing already (or may support it with suitable customization eg: derived class support that allows redefining say reward method with one's own).
I'm finding information about on-line per-move MDP learning is rather scarce. Most use of MDP that I can find seems to focus on solving the entire policy as a preprocessing step.
Here is a python toolbox for MDPs.
Caveat: It's for vanilla textbook MDPs and not for partially observable MDPs (POMDPs), or any kind of non-stationarity in rewards.
Second Caveat: I found the documentation to be really lacking. You have to look at the Python code if you want to know what it implements, or you can quickly look at the documentation for the similar toolbox they have for MATLAB.
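If the toolbox in question is pymdptoolbox (a Python port of the MATLAB MDPToolbox, which matches the description above, though that is an assumption), basic usage looks roughly like this, using its built-in forest-management example and a discount factor of 0.9:

import mdptoolbox.example
import mdptoolbox.mdp

# P: (A, S, S) transition matrices, R: (S, A) rewards from the built-in example
P, R = mdptoolbox.example.forest()
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
vi.run()
print(vi.policy)  # note: the whole policy is solved offline, per the caveats above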
