What is 'G' in CVXPY and how to fix it - python

I'm trying to use a binary integer linear program to assign members of my staff to different shifts. I have a 16x9 matrix of preferences for my staff in a CSV file (16 staff members, 9 slots to fill) and I used the following code to try to assign them:
import cvxpy as cvx
import numpy as np
import pandas as pd

# 16x9 matrix of staff preferences (16 staff members, 9 slots)
weights = pd.read_csv("holiday_green day.csv", index_col=0)
weights = weights.to_numpy().astype(float)
# binary assignment variable: rows are slots, columns are staff members
assignments = cvx.Variable((9, 16), boolean=True)
row_sum_vector = np.ones((16, 1)).astype(float)
result_constraint = np.ones((9, 1)).astype(float) * 2
objective = cvx.Minimize(cvx.trace(weights @ assignments))
prob = cvx.Problem(objective, [assignments @ row_sum_vector == result_constraint])
prob.solve()
When I try running this, I get the error TypeError: G must be a 'd' matrix and I don't know where to start debugging. I looked at this post, but it wasn't helpful. Can someone help me figure out what G is and what it means by a 'd' matrix? It's my first time actually using CVXPY and I'm very lost.
Full Stack Trace:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-23-d07ad22cbc25> in <module>()
6 objective = cvx.Minimize(cvx.atoms.affine.trace.trace(weights @ assignments))
7 prob = cvx.Problem(objective, [assignments @ row_sum_vector == result_constraint])
----> 8 prob.solve()
3 frames
/usr/local/lib/python3.7/dist-packages/cvxpy/problems/problem.py in solve(self, *args, **kwargs)
288 else:
289 solve_func = Problem._solve
--> 290 return solve_func(self, *args, **kwargs)
291
292 @classmethod
/usr/local/lib/python3.7/dist-packages/cvxpy/problems/problem.py in _solve(self, solver, warm_start, verbose, parallel, gp, qcp, **kwargs)
570 self._intermediate_problem)
571 solution = self._solving_chain.solve_via_data(
--> 572 self, data, warm_start, verbose, kwargs)
573 full_chain = self._solving_chain.prepend(self._intermediate_chain)
574 inverse_data = self._intermediate_inverse_data + solving_inverse_data
/usr/local/lib/python3.7/dist-packages/cvxpy/reductions/solvers/solving_chain.py in solve_via_data(self, problem, data, warm_start, verbose, solver_opts)
194 """
195 return self.solver.solve_via_data(data, warm_start, verbose,
--> 196 solver_opts, problem._solver_cache)
/usr/local/lib/python3.7/dist-packages/cvxpy/reductions/solvers/conic_solvers/glpk_mi_conif.py in solve_via_data(self, data, warm_start, verbose, solver_opts, solver_cache)
73 data[s.B],
74 set(int(i) for i in data[s.INT_IDX]),
---> 75 set(int(i) for i in data[s.BOOL_IDX]))
76 results_dict = {}
77 results_dict["status"] = results_tup[0]
TypeError: G must be a 'd' matrix
Edit: I tried casting all of the numpy arrays to float, as suggested in a different post. It didn't work.
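One way to narrow this down (a sketch, not a confirmed fix): the error is raised inside CVXPY's GLPK_MI interface, so it can help to check which solvers CVXPY can see and to try another mixed-integer-capable solver; if a different solver accepts the problem, the issue is specific to how the data is handed to GLPK_MI. ECOS_BB below is an assumption, use whichever MIP-capable solver installed_solvers() reports.
import cvxpy as cvx

print(cvx.installed_solvers())  # which solvers CVXPY can actually use

# prob is the Problem built above; ECOS_BB is an assumption -- substitute any
# mixed-integer-capable solver from the installed_solvers() list
prob.solve(solver=cvx.ECOS_BB, verbose=True)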

Related

lifetimes BetaGeoFitter model is not working, it gives ConvergenceError

I just wanted to apply the BetaGeoFitter model to my dataframe as follows:
from lifetimes import BetaGeoFitter
from lifetimes.utils import summary_data_from_transaction_data

df = summary_data_from_transaction_data(df, "COMPANY_ID", "INVOICE_DATE", "TOTAL_PRICE", include_first_transaction=True, observation_period_end=today_date, freq="W")
bgf = BetaGeoFitter(penalizer_coef=0.0)
bgf.fit(df['frequency'], df['recency'], df['T'])
It gives the error below (only the last rows, because the full error is too long). I don't know where the problem is or what this error is telling me. By the way, it gives the same error when I change it to a larger penalizer_coef. Can anyone help me fix it?
C:\ProgramData\Miniconda3\lib\site-packages\autograd\tracer.py:48: RuntimeWarning: invalid value encountered in multiply
return f_raw(*args, **kwargs)
C:\ProgramData\Miniconda3\lib\site-packages\autograd\tracer.py:48: RuntimeWarning: invalid value encountered in subtract
return f_raw(*args, **kwargs)
C:\ProgramData\Miniconda3\lib\site-packages\autograd\numpy\numpy_vjps.py:78: RuntimeWarning: invalid value encountered in double_scalars
defvjp(anp.log, lambda ans, x : lambda g: g / x)
---------------------------------------------------------------------------
ConvergenceError Traceback (most recent call last)
Cell In [19], line 2
1 bgf = BetaGeoFitter(penalizer_coef=0.0)
----> 2 bgf.fit(df['frequency'], df['recency'], df['T'])
3 print(bgf)
File C:\ProgramData\Miniconda3\lib\site-packages\lifetimes\fitters\beta_geo_fitter.py:137, in BetaGeoFitter.fit(self, frequency, recency, T, weights, initial_params, verbose, tol, index, **kwargs)
134 scaled_recency = recency * self._scale
135 scaled_T = T * self._scale
--> 137 log_params_, self._negative_log_likelihood_, self._hessian_ = self._fit(
138 (frequency, scaled_recency, scaled_T, weights, self.penalizer_coef),
139 initial_params,
140 4,
141 verbose,
142 tol,
143 **kwargs
144 )
146 self.params_ = pd.Series(np.exp(log_params_), index=["r", "alpha", "a", "b"])
147 self.params_["alpha"] /= self._scale
File C:\ProgramData\Miniconda3\lib\site-packages\lifetimes\fitters\__init__.py:115, in BaseFitter._fit(self, minimizing_function_args, initial_params, params_size, disp, tol, bounds, **kwargs)
113 return output.x, output.fun, hessian_
114 print(output)
--> 115 raise ConvergenceError(
116 dedent(
117 """
118 The model did not converge. Try adding a larger penalizer to see if that helps convergence.
119 """
120 )
121 )
ConvergenceError:
The model did not converge. Try adding a larger penalizer to see if that helps convergence.
Try grouping values together; it worked for me:
from lifetimes import BetaGeoBetaBinomFitter

df_ = df.groupby(["frequency", "recency", "T"]).size().reset_index()
BetaGeoBetaBinomFitter().fit(df_['frequency'], df_['recency'], df_['T'], weights=df_[0])
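If you want to stay with BetaGeoFitter rather than switching models, the same grouping trick can be fed in through its weights argument (the fit signature in the traceback above shows that weights is accepted). A minimal sketch, assuming df already has the frequency, recency and T columns from the question:
from lifetimes import BetaGeoFitter

# collapse duplicate (frequency, recency, T) rows and count how often each occurs
df_ = df.groupby(["frequency", "recency", "T"]).size().reset_index(name="weight")

# penalizer_coef=0.001 is an arbitrary example value; the error message itself
# suggests a larger penalizer if convergence is still a problem
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(df_["frequency"], df_["recency"], df_["T"], weights=df_["weight"])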

Sktime - how to make in-sample and out-of-sample forecasts with exogenous variables?

I'm trying to make forecasts using sktime for my entire training data and an arbitrary length of out-of-sample data but can't figure it out.
import numpy as np
import pandas as pd
from sktime.forecasting.arima import AutoARIMA

# Generate 2 years of daily data
data = np.random.random(365 * 2)
df = pd.DataFrame({'y': data})
# Arbitrary X variable (8% per year growth as a daily increase)
df['daily_growth'] = 8 / 365
# Forecast for entire dataset and 1 year into the future
fh = np.arange(-len(df) + 1, 365 + 1)
# Fit model
arima = AutoARIMA()
arima.fit(df.y, X=df.daily_growth)
# Create forecast df for in-sample and out-of-sample data...
# ...this is probably where the problem lies
forecast_df = pd.DataFrame(index=range(len(fh)))  # `index=fh` also fails
forecast_df['daily_growth'] = 8 / 365
# ValueError...
preds_with_X = arima.predict(fh=fh, X=forecast_df)
Output
ValueError Traceback (most recent call last)
Input In [3], in <cell line: 15>()
13 preds_no_X = arima_no_X.predict(fh=fh)
14 len(fh) == len(forecast_df)
---> 15 preds_with_X = arima_with_X.predict(fh=fh, X=forecast_df)
17 plt.plot(df.y, label='Actual')
18 plt.plot(preds_no_X, label='preds_no_X')
File ~/opt/anaconda3/envs/humbl_keywords/lib/python3.9/site-packages/sktime/forecasting/base/_base.py:318, in BaseForecaster.predict(self, fh, X)
316 # we call the ordinary _predict if no looping/vectorization needed
317 if not self._is_vectorized:
--> 318 y_pred = self._predict(fh=fh, X=X_inner)
319 else:
320 # otherwise we call the vectorized version of predict
321 y_pred = self._vectorize("predict", X=X_inner, fh=fh)
File ~/opt/anaconda3/envs/humbl_keywords/lib/python3.9/site-packages/sktime/forecasting/base/adapters/_pmdarima.py:84, in _PmdArimaAdapter._predict(self, fh, X)
81 # both in-sample and out-of-sample values
82 else:
83 y_ins = self._predict_in_sample(fh_ins, X=X)
---> 84 y_oos = self._predict_fixed_cutoff(fh_oos, X=X)
85 return pd.concat([y_ins, y_oos])
File ~/opt/anaconda3/envs/humbl_keywords/lib/python3.9/site-packages/sktime/forecasting/base/adapters/_pmdarima.py:177, in _PmdArimaAdapter._predict_fixed_cutoff(self, fh, X, return_pred_int, alpha)
162 """Make predictions out of sample.
163
164 Parameters
(...)
174 Returns series of predicted values.
175 """
176 n_periods = int(fh.to_relative(self.cutoff)[-1])
--> 177 result = self._forecaster.predict(
178 n_periods=n_periods,
179 X=X,
180 return_conf_int=False,
181 alpha=DEFAULT_ALPHA,
182 )
184 fh_abs = fh.to_absolute(self.cutoff)
185 fh_idx = fh.to_indexer(self.cutoff)
File ~/opt/anaconda3/envs/humbl_keywords/lib/python3.9/site-packages/pmdarima/utils/metaestimators.py:53, in _IffHasDelegate.__get__.<locals>.<lambda>(*args, **kwargs)
50 attrgetter(self.delegate_names[-1])(obj)
52 # lambda, but not partial, allows help() to work with update_wrapper
---> 53 out = (lambda *args, **kwargs: self.fn(obj, *args, **kwargs))
54 # update the docstring of the returned function
55 update_wrapper(out, self.fn)
File ~/opt/anaconda3/envs/humbl_keywords/lib/python3.9/site-packages/pmdarima/arima/auto.py:257, in AutoARIMA.predict(self, n_periods, X, return_conf_int, alpha, **kwargs)
247 @if_has_delegate("model_")
248 def predict(self,
249 n_periods=10,
(...)
254
255 # Temporary shim until we remove `exogenous` support completely
256 X, _ = pm_compat.get_X(X, **kwargs)
--> 257 return self.model_.predict(
258 n_periods=n_periods,
259 X=X,
260 return_conf_int=return_conf_int,
261 alpha=alpha,
262 )
File ~/opt/anaconda3/envs/humbl_keywords/lib/python3.9/site-packages/pmdarima/arima/arima.py:785, in ARIMA.predict(self, n_periods, X, return_conf_int, alpha, **kwargs)
783 X = self._check_exog(X) # type: np.ndarray
784 if X is not None and X.shape[0] != n_periods:
--> 785 raise ValueError('X array dims (n_rows) != n_periods')
787 # f = self.arima_res_.forecast(steps=n_periods, exog=X)
788 arima = self.arima_res_
ValueError: X array dims (n_rows) != n_periods
Alas, pmdarima doesn't print what input it receives for n_rows and n_periods. But I think I am passing the correct shapes.
len(fh) == len(forecast_df) # True
fh.shape, forecast_df.shape # ((1095,), (1095, 1))
P.S. I'm not sure my daily_growth var would actually have any impact on the results. Advice on this point and how to get the model to have 8% growth would be helpful too!
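One thing worth trying (a sketch only, not verified against this sktime/pmdarima combination): the traceback ends in pmdarima's check that X has exactly n_periods rows for the out-of-sample part, so requesting the in-sample and out-of-sample forecasts separately, each with an X that only covers that part, may avoid the mismatch. The index chosen for future_X below is an assumption.
# in-sample part: X covers the 730 training rows only
fh_ins = np.arange(-len(df) + 1, 1)
preds_ins = arima.predict(fh=fh_ins, X=df[['daily_growth']])

# out-of-sample part: X has exactly 365 rows, one per forecast step,
# indexed as a continuation of the training index (an assumption)
future_X = pd.DataFrame({'daily_growth': [8 / 365] * 365},
                        index=range(len(df), len(df) + 365))
preds_oos = arima.predict(fh=np.arange(1, 366), X=future_X)

preds_with_X = pd.concat([preds_ins, preds_oos])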

Shapes not aligned in Python AutoImpute data imputation package?

I'm trying to use the (relatively new) Python AutoImpute package, but I keep getting a shape mismatch error when trying to use a particular column as a predictor.
This is what my pandas dataframe looks like
I can impute using the 'sex', 'group', and 'binned_age' columns, but not using the 'experiment' column. When I try doing that, I get this error:
ValueError: shapes (9,) and (4,13) not aligned: 9 (dim 0) != 4 (dim 0)
This is my code for actually fitting and running the imputer:
from autoimpute.imputations import SingleImputer

cat_predictors = ['experiment', 'sex', 'group', 'binned_age']
si = SingleImputer(
    strategy={'FSIQ': 'default predictive'},
    predictors={'FSIQ': cat_predictors},
)
imputed_data = si.fit_transform(df2)
In trying to diagnose the problem, I found out that if I reduce the number of unique strings in the 'experiment' column to 3 or fewer, my problem goes away for some reason. But, I don't want to do that and lose some of my data. Any help?
Full trace below:
ValueError Traceback (most recent call last)
<ipython-input-11-3d4388ba92e4> in <module>
1 si = SingleImputer(
2 strategy={'FSIQ': 'pmm'}, imp_kwgs={'pmm': {'tune': 10000, 'sample':10000}})
----> 3 data_imputed_once = si.fit_transform(df2)
/om/user/agupta81/anaconda/envs/myenv/lib/python3.8/site-packages/autoimpute/imputations/dataframe/single_imputer.py in fit_transform(self, X, y)
288 X (pd.DataFrame): imputed in place or copy of original.
289 """
--> 290 return self.fit(X, y).transform(X)
/om/user/agupta81/anaconda/envs/myenv/lib/python3.8/site-packages/autoimpute/utils/checks.py in wrapper(d, *args, **kwargs)
59 err = f"Neither {d_err} nor {a_err} are of type pd.DataFrame"
60 raise TypeError(err)
---> 61 return func(d, *args, **kwargs)
62 return wrapper
63
/om/user/agupta81/anaconda/envs/myenv/lib/python3.8/site-packages/autoimpute/utils/checks.py in wrapper(d, *args, **kwargs)
124
125 # return func if no missingness violations detected, then return wrap
--> 126 return func(d, *args, **kwargs)
127 return wrapper
128
/om/user/agupta81/anaconda/envs/myenv/lib/python3.8/site-packages/autoimpute/utils/checks.py in wrapper(d, *args, **kwargs)
171 err = f"All values missing in column(s) {nc}. Should be removed."
172 raise ValueError(err)
--> 173 return func(d, *args, **kwargs)
174 return wrapper
175
/om/user/agupta81/anaconda/envs/myenv/lib/python3.8/site-packages/autoimpute/imputations/dataframe/single_imputer.py in transform(self, X, imp_ixs)
274
275 # perform imputation given the specified imputer and value for x_
--> 276 X.loc[imp_ix, column] = imputer.impute(x_)
277 return X
278
/om/user/agupta81/anaconda/envs/myenv/lib/python3.8/site-packages/autoimpute/imputations/series/pmm.py in impute(self, X)
187 # imputed values are actual y vals corresponding to nearest neighbors
188 # therefore, this is a form of "hot-deck" imputation
--> 189 y_pred_bayes = alpha_bayes + beta_bayes.dot(X.T)
190 n_ = self.neighbors
191 if X.columns.size == 1:
ValueError: shapes (9,) and (4,13) not aligned: 9 (dim 0) != 4 (dim 0)
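A small diagnostic sketch that may help localize the mismatch: the error compares a length-9 coefficient vector against 4 rows of 13 features, and the problem goes away when 'experiment' has 3 or fewer unique values, so counting how many dummy columns each categorical predictor expands to can show whether 'experiment' is what pushes the design matrix out of sync. The column names come from the question, and df2 is assumed to be the dataframe being imputed.
cat_predictors = ['experiment', 'sex', 'group', 'binned_age']
for col in cat_predictors:
    n_levels = df2[col].nunique(dropna=True)
    # one-hot encoding with a dropped reference level contributes n_levels - 1 columns
    print(col, n_levels, "levels ->", max(n_levels - 1, 0), "dummy columns")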

Converting TfidfVectorizer sparse matrix to dataframe or dense array results in memory error

My input is a pandas dataframe ("vector") with one column and 178885 rows holding strings with up to 600 words each.
0 this is an example text...
1 more examples...
...
178885 last example
Name: vectortext, Length: 178886, dtype: object
I'm doing feature extraction (unigrams) using the TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# `stop` is a predefined stop word list (not shown in the question)
vectorizer_uni = TfidfVectorizer(ngram_range=(1,1), use_idf=True, analyzer="word", stop_words=stop)
X = vectorizer_uni.fit_transform(vector).toarray()
X = pd.DataFrame(X, columns=vectorizer_uni.get_feature_names())  # map grams
k = len(X.columns)  # number of features
Unfortunately I'm receiving a MemoryError as shown below. I'm using the 64-bit version of Python 3.6 with 16GB RAM on my Windows 10 machine. I've read a lot about Python generators etc., but I can't figure out how to solve this problem without limiting the number of features (which is not really an option). Any ideas on how to solve this? Could I somehow split my dataframe beforehand?
Traceback:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-88-15b6091ceec7> in <module>()
1 vectorizer_uni = TfidfVectorizer(ngram_range=(1,1), use_idf=True, analyzer="word", stop_words=stop)
----> 2 X = vectorizer_uni.fit_transform(vector).toarray()
3 X = pd.DataFrame(X, columns=vectorizer_uni.get_feature_names()) #map grams
4 k = len(X.columns) # number of features
C:\Programme\Anaconda3\lib\site-packages\scipy\sparse\compressed.py in toarray(self, order, out)
962 def toarray(self, order=None, out=None):
963 """See the docstring for `spmatrix.toarray`."""
--> 964 return self.tocoo(copy=False).toarray(order=order, out=out)
965
966 ##############################################################
C:\Programme\Anaconda3\lib\site-packages\scipy\sparse\coo.py in toarray(self, order, out)
250 def toarray(self, order=None, out=None):
251 """See the docstring for `spmatrix.toarray`."""
--> 252 B = self._process_toarray_args(order, out)
253 fortran = int(B.flags.f_contiguous)
254 if not fortran and not B.flags.c_contiguous:
C:\Programme\Anaconda3\lib\site-packages\scipy\sparse\base.py in _process_toarray_args(self, order, out)
1037 return out
1038 else:
-> 1039 return np.zeros(self.shape, dtype=self.dtype, order=order)
1040
1041 def __numpy_ufunc__(self, func, method, pos, inputs, **kwargs):
MemoryError:
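The MemoryError comes from .toarray(), which materializes a dense matrix with one row per document and one column per vocabulary term. One common workaround (a sketch, assuming the downstream code can work with sparse input, as most scikit-learn estimators can): keep the TF-IDF matrix in its sparse form.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_uni = TfidfVectorizer(ngram_range=(1,1), use_idf=True, analyzer="word", stop_words=stop)
X = vectorizer_uni.fit_transform(vector)             # scipy sparse matrix, no .toarray()
feature_names = vectorizer_uni.get_feature_names()   # maps column index -> term
k = X.shape[1]                                       # number of features
# if a DataFrame is really needed, this keeps the data sparse (pandas >= 0.25):
# X_df = pd.DataFrame.sparse.from_spmatrix(X, columns=feature_names)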

Agglomerative Clustering to cluster doc2vec

I'm new to Agglomerative Clustering and doc2vec, so I hope somebody can help me with the following issue.
This is my code:
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(linkage='average',
                                connectivity=None, n_clusters=2)
X = model_dm.docvecs.doctag_syn0
model.fit(X, y=None)
model.fit_predict(X, y=None)
What I want is to predict the average of the distances of each observation. I got the following error:
MemoryErrorTraceback (most recent call last)
<ipython-input-22-d8b93bc6abe1> in <module>()
2 model = AgglomerativeClustering(linkage='average',connectivity=None,n_clusters=2)
3 X = model_dm.docvecs.doctag_syn0
----> 4 model.fit(X, y=None)
5
/usr/local/lib64/python2.7/site-packages/sklearn/cluster/hierarchical.pyc in fit(self, X, y)
763 n_components=self.n_components,
764 n_clusters=n_clusters,
--> 765 **kwargs)
766 # Cut the tree
767 if compute_full_tree:
/usr/local/lib64/python2.7/site-packages/sklearn/externals/joblib/memory.pyc in __call__(self, *args, **kwargs)
281
282 def __call__(self, *args, **kwargs):
--> 283 return self.func(*args, **kwargs)
284
285 def call_and_shelve(self, *args, **kwargs):
/usr/local/lib64/python2.7/site-packages/sklearn/cluster/hierarchical.pyc in _average_linkage(*args, **kwargs)
547 def _average_linkage(*args, **kwargs):
548 kwargs['linkage'] = 'average'
--> 549 return linkage_tree(*args, **kwargs)
550
551
/usr/local/lib64/python2.7/site-packages/sklearn/cluster/hierarchical.pyc in linkage_tree(X, connectivity, n_components, n_clusters, linkage, affinity, return_distance)
428 i, j = np.triu_indices(X.shape[0], k=1)
429 X = X[i, j]
--> 430 out = hierarchy.linkage(X, method=linkage, metric=affinity)
431 children_ = out[:, :2].astype(np.int)
432
/usr/local/lib64/python2.7/site-packages/scipy/cluster/hierarchy.pyc in linkage(y, method, metric)
669 'matrix looks suspiciously like an uncondensed '
670 'distance matrix')
--> 671 y = distance.pdist(y, metric)
672 else:
673 raise ValueError("`y` must be 1 or 2 dimensional.")
/usr/local/lib64/python2.7/site-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
1375
1376 m, n = s
-> 1377 dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
1378
1379 # validate input for multi-args metrics
MemoryError:
You are getting a MemoryError. This is a reliable indicator that you are running out of memory, on the line indicated.
The line indicates an attempt to allocate an np.zeros() array of (m * (m - 1)) // 2 values of type double (8 bytes). Looking at the scipy source, m, here, is the number of vectors in X, aka model_dm.docvecs.doctag_syn0.shape[0].
So, how many docvecs are you working with? If it's 200,000, you will need...
((200000 * 199999) // 2) * 8 bytes
...or about 160GB of RAM for that np.zeros() allocation to succeed. (If you have more docvecs, even more RAM.)
(Agglomerative clustering needs to know all the pairwise distances, which the scipy implementation tries to calculate and store at the beginning, which is very space-consuming.)
You may need more RAM, or to use fewer docvecs, or to use a different clustering algorithm, or to use an implementation that is lazier about calculating distances (but is then much, much slower, because it will often be recalculating, rather than reusing, distances it needs repeatedly).
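As an illustration of the "different clustering algorithm" option, a minimal sketch using scikit-learn's MiniBatchKMeans, which processes the data in small batches and never builds the O(m^2) pairwise-distance matrix. This is just one memory-friendly choice (and it optimizes a different objective than average-linkage agglomerative clustering), not a drop-in replacement.
from sklearn.cluster import MiniBatchKMeans

X = model_dm.docvecs.doctag_syn0   # the document vectors, as in the question
km = MiniBatchKMeans(n_clusters=2, batch_size=1000, random_state=0)
labels = km.fit_predict(X)         # one cluster label per docvec, no m*(m-1)/2 distance array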
