I am trying to figure out what type of machine learning model to build for the following case:
I want to be able to predict whether a customer will deposit in the next 30 days, using past data:
# Data
import pandas as pd
client_id = [1 , 1, 1, 2, 2, 3]
deposit_amount = [10, 20, 30, 15, 45, 55]
deposit_date = ["2022-01-05", "2022-01-06", "2022-01-07", "2022-01-05", "2022-01-06", "2022-01-06"]
dat = pd.DataFrame([client_id, deposit_amount, deposit_date]).T
dat.columns = ["client_id", "deposit_amount", "deposit_date"]
dat
# Use ML algorithm
---
# Output of the ML Algorithm
model_prediction = pd.DataFrame([[1, 2, 3],[0, 1, 0]]).T
model_prediction.columns = ["client_id", "target"]
As you can see, there are multiple rows per customer, one for each time they deposit. Which machine learning algorithm do you suggest? Since I am trying to predict something time-related, is there a model that can deal with time-based data and output a binary column?
You can use whatever classification algorithm you want (SVC, RandomForest, LogisticRegression, etc.). The most important thing is to prepare your data correctly.
For example, the date variable is not useful as such, but the time delta between two deposits is interesting for you:
dat['deposit_date'] = pd.to_datetime(dat['deposit_date'])
dat['deposit_date'] = dat.groupby('client_id')['deposit_date'].diff().dt.days.fillna(0)
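Going one step further, here is a minimal sketch (the cutoff date and feature names are my own illustrative assumptions) of how you could turn the raw transaction log into one row per client with a binary 30-day target, which is the shape your model_prediction output expects:
import pandas as pd
# Assumed split date: features come from deposits before the cutoff,
# the target from deposits in the 30 days after it (adjust to your data)
cutoff = pd.Timestamp("2022-01-07")
raw = dat.copy()
raw['deposit_date'] = pd.to_datetime(raw['deposit_date'])
raw['deposit_amount'] = raw['deposit_amount'].astype(float)
history = raw[raw['deposit_date'] < cutoff]
future = raw[(raw['deposit_date'] >= cutoff) &
             (raw['deposit_date'] < cutoff + pd.Timedelta(days=30))]
# One row per client with illustrative behavioural features
features = history.groupby('client_id').agg(
    n_deposits=('deposit_amount', 'count'),
    total_amount=('deposit_amount', 'sum'),
    days_since_last=('deposit_date', lambda s: (cutoff - s.max()).days),
).reset_index()
# Binary target: did the client deposit in the 30 days after the cutoff?
features['target'] = features['client_id'].isin(future['client_id']).astype(int)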
Once you have prepared your data, it is important to normalize/standardize it so that variables with large values do not dominate.
You can use several algorithms and compare them. You can also try different hyperparameter settings for each model (watch out for overfitting).
The second most important thing is choosing the metric to evaluate your model. Since you have a classification problem (more specifically, binary classification), you should use metrics like F1-score, precision, and recall.
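For instance, with scikit-learn you can compare a couple of classifiers on a held-out set. This is only a sketch and assumes the one-row-per-client features table from the sketch above:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, classification_report
X = features.drop(columns=['client_id', 'target'])
y = features['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
# Standardize so that features with large ranges do not dominate
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)
for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=200, random_state=0)):
    clf.fit(X_train_s, y_train)
    preds = clf.predict(X_test_s)
    print(type(clf).__name__, 'F1:', round(f1_score(y_test, preds), 3))
    print(classification_report(y_test, preds))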
As you can see, the choice of the algorithm is not very important compared to your analysis of the problem :-)
Related
I want to use darts to model the predicted sales of our product.
I am using the autoARIMA model.
After fitting the model I want to visually investigate how closely it matches the known values.
Is there any way to make a prediction for all known values and put it on a graph?
At the moment I am graphing the next 8 predictions (blue line in the graph below), but I would like it to graph what it would have guessed for the previous month if it hadn't had that month in the training data.
Another, less important question: is there a way to predict 100 timesteps ahead without needing to explicitly specify the dates?
If there isn't, is there a utility function to create the dates that would be valid for that time series?
import pandas as pd
from darts import TimeSeries
from darts.utils.missing_values import fill_missing_values
all_data = fill_missing_values(TimeSeries.from_csv("rows.csv", time_col="ds", freq="B"))
all_data.add_holidays("UK")
all_data["y"].plot()
from darts.models import AutoARIMA
model = AutoARIMA()
model.fit(series=all_data["y"], future_covariates=all_data["users_joined"])
results = pd.DataFrame(list(zip([10] * 8,
                                [pd.Timestamp("2022-09-16"), pd.Timestamp("2022-09-19"),
                                 pd.Timestamp("2022-09-20"), pd.Timestamp("2022-09-21"),
                                 pd.Timestamp("2022-09-22"), pd.Timestamp("2022-09-23"),
                                 pd.Timestamp("2022-09-26"), pd.Timestamp("2022-09-27")])),
                       columns=["marketing_spend", "ds"])
model.predict(8,
              future_covariates=all_data["marketing_spend"].concatenate(
                  TimeSeries.from_dataframe(results, time_col="ds", freq="B"), axis=0)).plot()
I have been given an assignment to analyze a dataset of 1,000+ houses, build a multiple regression model to predict prices, and then select the three houses that are the cheapest relative to their predicted price. Besides selecting exactly three houses, there is also the constraint of a "budget" of 7,000,000 in total for purchasing the three houses.
So far I have developed the regression model, calculated the predicted prices and the residuals, and added them to the original dataset. However, I am completely stumped as to how to write code that selects the three houses, given the budget constraint, while optimizing for the highest combined residual.
Here is my code so far:
### modules
import pandas as pd
import statsmodels.api as sm
### Data import
df = pd.DataFrame({"Zip Code" : [94127, 94110, 94112, 94114],
"Days listed" : [38, 40, 40, 40],
"Price" : [633000, 1100000, 440000, 1345000],
"Bedrooms" : [0, 3, 0, 0],
"Loft" : [1, 0, 1, 1],
"Square feet" : [1124, 2396, 625, 3384],
"Lotsize" : [2500, 1750, 2495, 2474],
"Year" : [1924, 1900, 1923, 1907]})
### Creating LM
y = df["Price"] # dependent variable
x = df[["Zip Code", "Days listed", "Bedrooms", "Loft", "Square feet", "Lotsize", "Year"]]
x = sm.add_constant(x) # adds a constant
lm = sm.OLS(y,x).fit() # fitting the model
# predict house prices
prices = sm.add_constant(x)
### Summary
#print(lm.summary())
### Adding predicted values and risidual values to df
df["predicted"] = pd.DataFrame(lm.predict(prices)) # predicted values
df["risidual"] = df["Price"] - df["predicted"] # risidual values
If anyone has an idea, could you explain to me the steps and give a code example?
Thank you very much!
With the clarification that you are looking for the best combination, your problem is more complicated ;)
I have tried a "brute-force" approach, but at least my laptop takes forever with the full dataset. Find my thoughts below:
Obviously we have to evaluate combinations of many houses, so my first approach was to reduce the dataset as far as possible:
If Price + 2*min(Price) > budget, there is no combination of that house with two others that stays within the budget
If the residual is negative, we do not consider the house during optimization
In pandas this looks like this:
budget=7000000
df=df[df['Price']<(budget-2*df['Price'].min())].copy()
df=df[df['risidual']>0].copy()
This reduces the dataset from 1395 to 550 objects.
Unfortunately, 550 IDs still give a lot of combinations (27,578,100), as calculated with itertools:
import itertools
idx=[a for a in itertools.combinations(df.index,3)]
You can evaluate these combinations with:
result = {comb: df.loc[[*comb], 'risidual'].sum() for comb in idx[:10000] if df.loc[[*comb], 'Price'].sum() < budget}
Note: I have limited the evaluation to the first 10000 values due to the calculation duration.
print("Combination: {}\nPrice: {}\nCost: {}".format(max(result),df.loc[[*max(result)], 'Price'].sum(),result[max(result)] ))
Maybe it is advisable to calculate the combinations of just two objects first to further reduce the number of candidates. I think you should also have a look at the knapsack problem.
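Alternatively, the selection can be written as a small integer program: pick exactly three houses, maximize the combined residual, stay within the budget. Below is a sketch using PuLP (my assumption, it is not part of your setup), reusing the df with the Price and risidual columns from above:
import pulp
prob = pulp.LpProblem("house_selection", pulp.LpMaximize)
pick = pulp.LpVariable.dicts("pick", df.index, cat="Binary")
# Objective: maximize the combined residual of the chosen houses
prob += pulp.lpSum(df.loc[i, 'risidual'] * pick[i] for i in df.index)
# Exactly three houses, total price within the budget
prob += pulp.lpSum(pick[i] for i in df.index) == 3
prob += pulp.lpSum(df.loc[i, 'Price'] * pick[i] for i in df.index) <= budget
prob.solve()
chosen = [i for i in df.index if pick[i].value() == 1]
print(df.loc[chosen, ['Price', 'risidual']])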
I think you are almost there. Given that df["risidual"] holds the difference between the predicted and the real price, you have to select the subset that fits your limit, e.g.:
df_budget = df[df['Price'] <= budget].copy()
Using pandas nlargest() you can retrieve the three biggest differences:
df_budget.nlargest(3, 'risidual')
Note: Code was not tested due to missing sample data
I have two time series representing two independent periods of data observation. I would like to fit an autoregressive model to this data. In other words, I would like to perform two partial fits, or two sessions of incremental learning.
This is a simplified description of a not-unusual scenario which could also apply to batch fitting on large datasets.
How do I do this (in statsmodels or otherwise)? Bonus points if the solution can generalise to other time-series models like ARIMA.
In pseudocode, something like:
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
res = AutoReg(data_1, lags=12).fit()
res.aic
# This is more like what I would like to do
model = AutoReg(lags=12)
model.partial_fit(data_1)
model.partial_fit(data_2)
model.results.aic
Statsmodels does not directly have this functionality. As Kevin S mentioned though, pmdarima does have a wrapper that provides this functionality. Specifically the update method. Per their documentation: "Update the model fit with additional observed endog/exog values.".
See example below around your particular code:
from pmdarima.arima import ARIMA
import statsmodels.api as sm
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
model = ARIMA(order=(12,0,0))
model.fit(data_1)
# update the model parameters with the new parameters
model.update(data_2)
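After the update, the model can forecast past the combined sample as usual, e.g.:
# forecast the next 12 observations after data_2
forecast = model.predict(n_periods=12)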
I don't know how to achieve that with AutoReg; I think it can be done somehow, but you would need to manually evaluate the results or add the data yourself.
In ARIMA and SARIMAX, however, it is already implemented and it is simple.
For incremental learning, there are three related methods, documented here. First is apply, which uses the fitted parameters on new, unrelated data. Then there are extend and append. append can also refit. I don't know the exact difference though.
Here is my example that is different but similar...
import numpy as np
from statsmodels.tsa.api import ARIMA
data = np.array(range(200))
order = (4, 2, 1)
model = ARIMA(data, order=order)
fitted_model = model.fit()
prediction = fitted_model.forecast(7)
new_data = np.array(range(600, 800))
fitted_model = fitted_model.apply(new_data)
new_prediction = fitted_model.forecast(7)
print(prediction) # [200. 201. 202. 203. 204. 205. 206.]
print(new_prediction) # [800. 801. 802. 803. 804. 805. 806.]
This replaces all the data, so it can be used on unrelated data (with an unknown index). I profiled it, and apply is very fast in comparison to fit.
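For the two-period setup in the original question, append on the fitted results may be closer to what is wanted. A sketch with the sunspots data from the question, assuming the second chunk directly follows the first (which append requires):
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//2]
data_2 = data[len(data)//2:]
# First "partial fit" on the earlier period
res_1 = ARIMA(data_1, order=(12, 0, 0)).fit()
# append continues the same series; refit=True re-estimates the parameters
# using both periods, refit=False keeps the parameters from the first fit
res_2 = res_1.append(data_2, refit=True)
print(res_2.aic)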
Scenario: I am trying to do multiple portfolio optimizations, with different constraints (weights, risk, risk aversion...) in a multi-period scenario.
What I already did: From the examples of cvxpy I found how to optimize a portfolio under a non-linear quadratic formula that results in a list of weights for the assets in the portfolio composition. My problem is that, although I have 15 years of monthly data, I don't know how to optimize for different periods (the code, as of its current form, yields the best composition for the entire time span of my data).
Question 1: Is it possible to make the code optimize for different periods, such as 1, 3, 4, 6, 9, and 12 months (yielding different weights for each of those periods)? If so, how could one do that?
Question 2: Is it possible to restrict the number of assets in each portfolio composition? What is the best way to achieve that? (The current code uses all of them, but I would like to test limiting the number of assets, to control the turnover level.)
Code:
from cvxpy import *
from cvxopt import *
import pandas as pd
import numpy as np
prices = pd.DataFrame()
logret = pd.DataFrame()
normret = pd.DataFrame()
returns = pd.DataFrame()
prices = pd.read_excel(open('//folder//Dgms89//calculation v3.xlsx', 'rb'), sheetname='Prices Final')
logret = pd.read_excel(open('//folder//Dgms89//calculation v3.xlsx', 'rb'), sheetname='Returns log')
normret = pd.read_excel(open('//folder//Dgms89//calculation v3.xlsx', 'rb'), sheetname='Returns normal')
returns = normret
def calculate_portfolio(returns, selected_solver):
    cov_mat = returns.cov()
    Sigma = np.asarray(cov_mat.values)
    w = Variable(len(cov_mat))
    gamma = quad_form(w, Sigma)
    prob = Problem(Minimize(gamma), [sum_entries(w) == 1])
    prob.solve(solver=selected_solver)
    weights = []
    for weight in w.value:
        weights.append(float(weight[0]))
    return weights
The problem with multi-period optimization is that your model will be overfitted. On the other hand, you can backtest traditional portfolio optimization models assuming a rebalancing period. Riskfolio-Lib has an example using backtrader that compares the S&P 500 with different portfolios using quarterly rebalancing. You can check the example at this link: https://riskfolio-lib.readthedocs.io/en/latest/examples.html
The standard mean-variance portfolio model is a static model: there are no dynamics in it (time series are only used to estimate the variance-covariance matrix and the expected returns). Some related models can answer questions like when and how to rebalance.
Restricting the number of assets in a portfolio leads to what is called the cardinality-constrained portfolio problem, which basically becomes an MIQP (mixed-integer quadratic programming) problem.
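As an illustration, with the current cvxpy API the cardinality constraint can be sketched with binary selection variables (the variable names and the limit k are only illustrative, and a mixed-integer capable solver is required):
import cvxpy as cp
import numpy as np
Sigma = np.asarray(returns.cov().values)  # covariance matrix, as in the question
n = Sigma.shape[0]
k = 10                                    # assumed maximum number of assets
w = cp.Variable(n)                        # portfolio weights
z = cp.Variable(n, boolean=True)          # 1 if the asset is held, 0 otherwise
constraints = [cp.sum(w) == 1,
               w >= 0,
               w <= z,          # an asset can only receive weight if it is selected
               cp.sum(z) <= k]  # at most k assets in the portfolio
prob = cp.Problem(cp.Minimize(cp.quad_form(w, Sigma)), constraints)
prob.solve()  # needs a mixed-integer QP solver, e.g. SCIP or GUROBI
weights = w.value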
I keep having a problem with a deep learning model. I have a model trained on the rrc data frame, and if I do:
rrc['preds'] = dp.cross_validation_holdout_predictions().as_data_frame().predict
the response column and the predictions always end up misaligned. At the top of the data frame they are aligned, but at some point they drift apart, and if I calculate the correlation between them it is very poor because of this misalignment. I have been trying to fix this for over 3 days but I have no idea how to do it.
I'm using H2O 3.10.4.5.
The model itself:
dp = H2ODeepLearningEstimator(activation = "Tanh", hidden = [10, 10, 10], epochs = 10000,
keep_cross_validation_predictions=True,
ignored_columns = ['fn', 'pdb_id','pdb_id_chain', 'pdb_id_chain_source', 'source'])
dp.train(x = list(set(rrch.col_names) - set(['rmsd_all'])), y ="rmsd_all", training_frame = rrch,
fold_column="cv")
Edit: I think I found the problem (Cell #58): https://github.com/mmagnus/mmagnus.github.io/blob/master/mq-test.ipynb
If I do rrc3 = rrc3[rrc3.rmsd_all < 10] to remove the rows where rmsd_all (the response column) is higher than 10, and then do rrc3h = h2o.H2OFrame(rrc3), that causes the problem. I'm not sure why though. The dataset (40 MB): https://www.dropbox.com/s/1et38o3xx47jw1m/rasp_rnakb_cv2.csv?dl=0
Solved: rrc3.reset_index(inplace=True) will do the job!
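In context, a minimal sketch of the fix, assuming rrc3 is the filtered pandas DataFrame from the notebook:
import h2o
# Filtering leaves gaps in the pandas index; resetting it before building the
# H2OFrame keeps the response column and the predictions aligned
rrc3 = rrc3[rrc3.rmsd_all < 10]
rrc3.reset_index(inplace=True)  # (drop=True would additionally discard the old index column)
rrc3h = h2o.H2OFrame(rrc3)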