I have a dataframe similar to the following, which we'll call "df":
id value time
a 1 1
a 1.5 2
a 2 3
a 2.5 4
b 1 1
b 1.5 2
b 2 3
b 2.5 4
I am running various regressions by "id" in Python on this dataframe. Generally, this requires a grouping by "id" and then applying a function to those groupings that calculates the regression.
I am working with two similar regression techniques in SciPy's stats library:
Theil-Sen estimator:
(https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.mstats.theilslopes.html)
Siegel estimator:
(https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.siegelslopes.html).
Both of these take the same type of data, so the function that calculates them should be the same apart from the actual technique used.
For Theil-Sen, I wrote the following function and the groupby statement that would be applied to that function:
def theil_reg(df, xcol, ycol):
    model = stats.theilslopes(ycol,xcol)
    return pd.Series(model)
out = df.groupby('id').apply(theil_reg, xcol='time', ycol='value')
However, I get the following error, which I've been having the hardest time understanding how to address:
ValueError: could not convert string to float: 'time'
The time column itself holds NumPy floats, not strings. This makes me believe that stats.theilslopes is not recognizing that time is a column in the dataframe and is instead receiving the string 'time' as its input.
However, if that's the case, then this looks like a bug in the stats.theilslopes function that would need to be addressed by SciPy. I believe this because the exact same function as above, but calling siegelslopes instead, works perfectly fine and provides the output I'm expecting, and the two are essentially the same estimation with the same inputs.
Doing the following on Siegel:
def siegel_reg(df, xcol, ycol):
    model = stats.siegelslopes(ycol,xcol)
    return pd.Series(model)
out = df.groupby('id').apply(siegel_reg, xcol='time',ycol='value')
Does not create any errors about the time variable and conducts the regression as needed.
Does anyone have thoughts on whether I'm missing something here? If so I would appreciate any thoughts, or if not, any thoughts on how to address this with Scipy.
Edit: here is the full error message that shows up when I run this script:
ValueError Traceback (most recent call last)
C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in apply(self, func, *args, **kwargs)
688 try:
--> 689 result = self._python_apply_general(f)
690 except Exception:
C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in _python_apply_general(self, f)
706 keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707 self.axis)
708
C:\Anaconda\lib\site-packages\pandas\core\groupby\ops.py in apply(self, f, data, axis)
189 group_axes = _get_axes(group)
--> 190 res = f(group)
191 if not _is_indexed_like(res, group_axes):
C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in f(g)
678 with np.errstate(all='ignore'):
--> 679 return func(g, *args, **kwargs)
680 else:
<ipython-input-506-0a1696f0aecd> in theil_reg(df, xcol, ycol)
1 def theil_reg(df, xcol, ycol):
----> 2 model = stats.theilslopes(ycol,xcol)
3 return pd.Series(model)
C:\Anaconda\lib\site-packages\scipy\stats\_stats_mstats_common.py in theilslopes(y, x, alpha)
221 else:
--> 222 x = np.array(x, dtype=float).flatten()
223 if len(x) != len(y):
ValueError: could not convert string to float: 'time'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-507-9a199e0ce924> in <module>
----> 1 df_accel_correct.groupby('chart').apply(theil_reg, xcol='time', ycol='value')
C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in apply(self, func, *args, **kwargs)
699
700 with _group_selection_context(self):
--> 701 return self._python_apply_general(f)
702
703 return result
C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in _python_apply_general(self, f)
705 def _python_apply_general(self, f):
706 keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707 self.axis)
708
709 return self._wrap_applied_output(
C:\Anaconda\lib\site-packages\pandas\core\groupby\ops.py in apply(self, f, data, axis)
188 # group might be modified
189 group_axes = _get_axes(group)
--> 190 res = f(group)
191 if not _is_indexed_like(res, group_axes):
192 mutated = True
C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in f(g)
677 def f(g):
678 with np.errstate(all='ignore'):
--> 679 return func(g, *args, **kwargs)
680 else:
681 raise ValueError('func must be a callable if args or '
<ipython-input-506-0a1696f0aecd> in theil_reg(df, xcol, ycol)
1 def theil_reg(df, xcol, ycol):
----> 2 model = stats.theilslopes(ycol,xcol)
3 return pd.Series(model)
C:\Anaconda\lib\site-packages\scipy\stats\_stats_mstats_common.py in theilslopes(y, x, alpha)
220 x = np.arange(len(y), dtype=float)
221 else:
--> 222 x = np.array(x, dtype=float).flatten()
223 if len(x) != len(y):
224 raise ValueError("Incompatible lengths ! (%s<>%s)" % (len(y), len(x)))
ValueError: could not convert string to float: 'time'
Update 2: after calling df in the function, I received the following error message:
ValueError Traceback (most recent call last)
C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in apply(self, func, *args, **kwargs)
688 try:
--> 689 result = self._python_apply_general(f)
690 except Exception:
C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in _python_apply_general(self, f)
706 keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707 self.axis)
708
C:\Anaconda\lib\site-packages\pandas\core\groupby\ops.py in apply(self, f, data, axis)
189 group_axes = _get_axes(group)
--> 190 res = f(group)
191 if not _is_indexed_like(res, group_axes):
C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in f(g)
678 with np.errstate(all='ignore'):
--> 679 return func(g, *args, **kwargs)
680 else:
<ipython-input-563-5db69048f347> in theil_reg(df, xcol, ycol)
1 def theil_reg(df, xcol, ycol):
----> 2 model = stats.theilslopes(df[ycol],df[xcol])
3 return pd.Series(model)
C:\Anaconda\lib\site-packages\scipy\stats\_stats_mstats_common.py in theilslopes(y, x, alpha)
248 sigma = np.sqrt(sigsq)
--> 249 Ru = min(int(np.round((nt - z*sigma)/2.)), len(slopes)-1)
250 Rl = max(int(np.round((nt + z*sigma)/2.)) - 1, 0)
ValueError: cannot convert float NaN to integer
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-564-d7794bd1d495> in <module>
----> 1 correct_theil = df_accel_correct.groupby('chart').apply(theil_reg, xcol='time', ycol='value')
C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in apply(self, func, *args, **kwargs)
699
700 with _group_selection_context(self):
--> 701 return self._python_apply_general(f)
702
703 return result
C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in _python_apply_general(self, f)
705 def _python_apply_general(self, f):
706 keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707 self.axis)
708
709 return self._wrap_applied_output(
C:\Anaconda\lib\site-packages\pandas\core\groupby\ops.py in apply(self, f, data, axis)
188 # group might be modified
189 group_axes = _get_axes(group)
--> 190 res = f(group)
191 if not _is_indexed_like(res, group_axes):
192 mutated = True
C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in f(g)
677 def f(g):
678 with np.errstate(all='ignore'):
--> 679 return func(g, *args, **kwargs)
680 else:
681 raise ValueError('func must be a callable if args or '
<ipython-input-563-5db69048f347> in theil_reg(df, xcol, ycol)
1 def theil_reg(df, xcol, ycol):
----> 2 model = stats.theilslopes(df[ycol],df[xcol])
3 return pd.Series(model)
C:\Anaconda\lib\site-packages\scipy\stats\_stats_mstats_common.py in theilslopes(y, x, alpha)
247 # Find the confidence interval indices in `slopes`
248 sigma = np.sqrt(sigsq)
--> 249 Ru = min(int(np.round((nt - z*sigma)/2.)), len(slopes)-1)
250 Rl = max(int(np.round((nt + z*sigma)/2.)) - 1, 0)
251 delta = slopes[[Rl, Ru]]
ValueError: cannot convert float NaN to integer
However, I have no null values in either column, and both columns are floats. Any suggestions on this error?
Essentially, you are passing the column names as strings (not the column values) into the function, but the slopes calls require NumPy arrays (or pandas Series that can be coerced into arrays). Specifically, your call never references df, so it is equivalent to the following, hence your error:
model = stats.theilslopes('value', 'time')
Simply reference df in the calls:
model = stats.theilslopes(df['value'], df['time'])
model = stats.theilslopes(df[ycol], df[xcol])
A difference in behavior between two functions does not necessarily mean there is a bug in SciPy. Different functions have different implementations, so read the docs carefully to see how each one should be called. Some other packages do accept a data argument, in which case the named strings refer to its columns, like below:
slopes_call(y='y_string', x='x_string', data=df)
In general, Python requires explicit references to objects and does not infer them from context: a bare string such as 'time' is never resolved to a dataframe column automatically.
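Putting the fix together, a minimal sketch of the grouped regression could look like the following. The toy data mirrors the table in the question; the output column names (slope, intercept, lo_slope, hi_slope) are my own labels for the four values theilslopes returns, not names taken from the question.

```python
import pandas as pd
from scipy import stats

# Toy data mirroring the table in the question.
df = pd.DataFrame({
    "id": ["a"] * 4 + ["b"] * 4,
    "value": [1.0, 1.5, 2.0, 2.5] * 2,
    "time": [1.0, 2.0, 3.0, 4.0] * 2,
})

def theil_reg(group, xcol, ycol):
    # Index the group with the column names so SciPy receives the data,
    # not the strings 'time' and 'value'.
    slope, intercept, lo_slope, hi_slope = stats.theilslopes(group[ycol], group[xcol])
    # The labels below are illustrative names for the four returned values.
    return pd.Series({
        "slope": slope,
        "intercept": intercept,
        "lo_slope": lo_slope,
        "hi_slope": hi_slope,
    })

# Select the data columns explicitly, then fit one regression per id.
out = df.groupby("id")[["time", "value"]].apply(theil_reg, xcol="time", ycol="value")
print(out)
```

The same wrapper should work for siegelslopes by swapping the call, keeping in mind that it returns only two values (slope and intercept).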
The example is fully reproducible. Here is the full notebook (which downloads the data too): https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb
After this part in notebook above:
full_pipeline_with_predictor = Pipeline([
    ("preparation", full_pipeline),
    ("linear", LinearRegression())
])
full_pipeline_with_predictor.fit(housing, housing_labels)
full_pipeline_with_predictor.predict(some_data)
I am trying to get predictions on the test set with this code:
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = full_pipeline_with_predictor.predict(X_test_prepared)
But I am receiving this error:
C:\Users\Alex\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py:430: FutureWarning: Given feature/column names or counts do not match the ones for the data given during fit. This will fail from v0.24.
FutureWarning)
---------------------------------------------------------------------------
Empty Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
796 try:
--> 797 tasks = self._ready_batches.get(block=False)
798 except queue.Empty:
~\AppData\Local\Continuum\anaconda3\lib\queue.py in get(self, block, timeout)
166 if not self._qsize():
--> 167 raise Empty
168 elif timeout is None:
Empty:
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-141-dc87b1c9e658> in <module>
5
6 X_test_prepared = full_pipeline.transform(X_test)
----> 7 final_predictions = full_pipeline_with_predictor.predict(X_test_prepared)
8
9 final_mse = mean_squared_error(y_test, final_predictions)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
114
115 # lambda, but not partial, allows help() to work with update_wrapper
--> 116 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
117 # update the docstring of the returned function
118 update_wrapper(out, self.fn)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\pipeline.py in predict(self, X, **predict_params)
417 Xt = X
418 for _, name, transform in self._iter(with_final=False):
--> 419 Xt = transform.transform(Xt)
420 return self.steps[-1][-1].predict(Xt, **predict_params)
421
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in transform(self, X)
586
587 self._validate_features(X.shape[1], X_feature_names)
--> 588 Xs = self._fit_transform(X, None, _transform_one, fitted=True)
589 self._validate_output(Xs)
590
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in _fit_transform(self, X, y, func, fitted)
455 message=self._log_message(name, idx, len(transformers)))
456 for idx, (name, trans, column, weight) in enumerate(
--> 457 self._iter(fitted=fitted, replace_strings=True), 1))
458 except ValueError as e:
459 if "Expected 2D array, got 1D array instead" in str(e):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
1002 # remaining jobs.
1003 self._iterating = False
-> 1004 if self.dispatch_one_batch(iterator):
1005 self._iterating = self._original_iterator is not None
1006
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
806 big_batch_size = batch_size * n_jobs
807
--> 808 islice = list(itertools.islice(iterator, big_batch_size))
809 if len(islice) == 0:
810 return False
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in <genexpr>(.0)
454 message_clsname='ColumnTransformer',
455 message=self._log_message(name, idx, len(transformers)))
--> 456 for idx, (name, trans, column, weight) in enumerate(
457 self._iter(fitted=fitted, replace_strings=True), 1))
458 except ValueError as e:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\__init__.py in _safe_indexing(X, indices, axis)
404 if axis == 1 and indices_dtype == 'str' and not hasattr(X, 'loc'):
405 raise ValueError(
--> 406 "Specifying the columns using strings is only supported for "
407 "pandas DataFrames"
408 )
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
Question: How can I correct that error, and why does it happen?
Since your final pipeline:
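full_pipeline_with_predictor = Pipeline([
    ("preparation", full_pipeline),
    ("linear", LinearRegression())
])
already contains full_pipeline, you should not "prepare" X_test again; doing so means X_test is "prepared" twice, which is wrong. The second pass receives the already-transformed array rather than the original DataFrame, which is why the column transformer complains that string column names are only supported for pandas DataFrames. So, your code should simply be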
final_predictions = full_pipeline_with_predictor.predict(X_test)
exactly as it is for getting predictions for some_data, i.e.
full_pipeline_with_predictor.predict(some_data)
where you correctly do not "prepare" some_data before feeding it into the final pipeline.
The whole point of using pipelines is exactly this: to avoid running fit and predict separately for each of possibly several preparation steps, by wrapping all of them into a single pipeline instead. You correctly apply this process when you predict some_data, but you seem to have forgotten it in the next step, when you try to predict X_test.
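To illustrate the point, here is a small self-contained sketch. It does not use the notebook's housing data: the toy DataFrame, the column names and the names toy_preparation and toy_pipeline are all hypothetical stand-ins. It shows that calling predict on the raw DataFrame gives the same result as running the fitted preparation step once by hand and then passing its output to the fitted regressor.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the housing data.
X = pd.DataFrame({
    "rooms": [3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "area": [50.0, 65.0, 80.0, 95.0, 120.0, 150.0],
    "region": ["north", "south", "north", "south", "north", "south"],
})
y = np.array([100.0, 130.0, 160.0, 190.0, 240.0, 300.0])
X_train, y_train, X_test = X.iloc[:4], y[:4], X.iloc[4:]

# Preparation step that selects columns by name, as in the notebook.
toy_preparation = ColumnTransformer([
    ("num", StandardScaler(), ["rooms", "area"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
toy_pipeline = Pipeline([
    ("preparation", toy_preparation),
    ("linear", LinearRegression()),
])
toy_pipeline.fit(X_train, y_train)

# Correct usage: hand the raw DataFrame to the full pipeline.
predictions = toy_pipeline.predict(X_test)

# Equivalent manual route: prepare once, then call the final estimator only.
prepared = toy_pipeline.named_steps["preparation"].transform(X_test)
manual = toy_pipeline.named_steps["linear"].predict(prepared)

print(np.allclose(predictions, manual))
```

Passing prepared back into toy_pipeline.predict would run the preparation step a second time on an array that no longer has column names, which is the situation described in the question.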
I'm getting the below error when I try to target encode a categorical column.
"TypeError: Categorical cannot perform the operation mean"
When I run similar code in a Jupyter Notebook it works fine, but when I run it as part of a Python file it fails with the above error message.
I know this sounds a bit crazy, but I can't understand what's going on in the background.
Error:
TypeError Traceback (most recent call last)
<ipython-input-4-9ba53c6b7375> in <module>()
----> 1 tpmodeller.initialize()
~/SageMaker/tp/tp_kvr.py in initialize(self)
127
128 # Target Encode Te_cat_col Features
--> 129 df_cat_te = target_encode_bin(self.train[self.Te_cat_col], self.train['vol_trmnt_in_4_quarters'])
130 self.train = pd.concat([self.train, df_cat_te], axis=1)
131
~/SageMaker/tp/tp_kvr.py in target_encode_bin(df_te, target)
366 te = TargetEncoder(smoothing = 1, min_samples_leaf = 5, handle_unknown='ignore')
--> 367 df_te = te.fit_transform(df_te, target)
368 #
369 # Binning and then placing it in {col}_bin feature
~/anaconda3/envs/python3/lib/python3.6/site-packages/category_encoders/target_encoder.py in fit_transform(self, X, y, **fit_params)
249 transform(X)
250
--> 251 return self.fit(X, y, **fit_params).transform(X, y)
252
~/anaconda3/envs/python3/lib/python3.6/site-packages/category_encoders/target_encoder.py in fit(self, X, y, **kwargs)
138 self.ordinal_encoder = self.ordinal_encoder.fit(X)
139 X_ordinal = self.ordinal_encoder.transform(X)
--> 140 self.mapping = self.fit_target_encoding(X_ordinal, y)
141
142 X_temp = self.transform(X, override_return_df=True)
~/anaconda3/envs/python3/lib/python3.6/site-packages/category_encoders/target_encoder.py in fit_target_encoding(self, X, y)
164 values = switch.get('mapping')
165
--> 166 prior = self._mean = y.mean()
167
168 stats = y.groupby(X[col]).agg(['count', 'mean'])
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
10954 skipna=skipna)
10955 return self._reduce(f, name, axis=axis, skipna=skipna,
> 10956 numeric_only=numeric_only)
10957
10958 return set_function_name(stat_func, name, cls)
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
3614 # TODO deprecate numeric_only argument for Categorical and use
3615 # skipna as well, see GH25303
-> 3616 return delegate._reduce(name, numeric_only=numeric_only, **kwds)
3617 elif isinstance(delegate, ExtensionArray):
3618 # dispatch to ExtensionArray interface
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/arrays/categorical.py in _reduce(self, name, axis, **kwargs)
2170 if func is None:
2171 msg = 'Categorical cannot perform the operation {op}'
-> 2172 raise TypeError(msg.format(op=name))
2173 return func(**kwargs)
2174
TypeError: Categorical cannot perform the operation mean
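Your target feature needs to be converted to a numeric type. The traceback fails at y.mean() inside the encoder, and pandas cannot take the mean of a categorical Series, so convert the target column (here vol_trmnt_in_4_quarters) to a number dtype before passing it to fit_transform.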
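A minimal sketch of the failure and the conversion, using only pandas. The toy target below is hypothetical and assumes the real column holds integer-like labels such as 0 and 1; adjust the cast if yours holds something else.

```python
import pandas as pd

# Hypothetical binary target stored with a categorical dtype.
target = pd.Series([0, 1, 1, 0, 1], dtype="category")

try:
    target.mean()
except TypeError as exc:
    # This is the kind of error the encoder runs into at y.mean().
    print("mean() on a categorical fails:", exc)

# Cast to a numeric dtype before handing the target to the encoder.
numeric_target = target.astype(int)
prior = numeric_target.mean()
print(prior)
```

Checking the dtype of the target column in both the notebook and the script may also show where the two runs diverge.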
I'm trying to convert a column to a category dtype in order to perform a pivot_table operation.
I've tried the following:
user_item_df = user_item.pivot_table(index='msno',
                                     columns='song_id',
                                     values='interacted',
                                     aggfunc='mean')
And I got this:
ValueError Traceback (most recent call last)
<ipython-input-76-a870ece1f3e8> in <module>
2 columns='song_id',
3 values='interacted',
----> 4 aggfunc='mean')
~/anaconda3/lib/python3.6/site-packages/dask/dataframe/core.py in pivot_table(self, index, columns, values, aggfunc)
3123 from .reshape import pivot_table
3124 return pivot_table(self, index=index, columns=columns, values=values,
-> 3125 aggfunc=aggfunc)
3126
3127 def to_records(self, index=False):
~/anaconda3/lib/python3.6/site-packages/dask/dataframe/reshape.py in pivot_table(df, index, columns, values, aggfunc)
190 raise ValueError("'columns' must be the name of an existing column")
191 if not is_categorical_dtype(df[columns]):
--> 192 raise ValueError("'columns' must be category dtype")
193 if not has_known_categories(df[columns]):
194 raise ValueError("'columns' must have known categories. Please use "
ValueError: 'columns' must be category dtype
So I've tried to convert the column:
user_item.song_id = user_item.song_id.astype('category')
But I got this when calling pivot_table:
ValueError Traceback (most recent call last)
<ipython-input-78-a870ece1f3e8> in <module>
2 columns='song_id',
3 values='interacted',
----> 4 aggfunc='mean')
~/anaconda3/lib/python3.6/site-packages/dask/dataframe/core.py in pivot_table(self, index, columns, values, aggfunc)
3123 from .reshape import pivot_table
3124 return pivot_table(self, index=index, columns=columns, values=values,
-> 3125 aggfunc=aggfunc)
3126
3127 def to_records(self, index=False):
~/anaconda3/lib/python3.6/site-packages/dask/dataframe/reshape.py in pivot_table(df, index, columns, values, aggfunc)
192 raise ValueError("'columns' must be category dtype")
193 if not has_known_categories(df[columns]):
--> 194 raise ValueError("'columns' must have known categories. Please use "
195 "`df[columns].cat.as_known()` beforehand to ensure "
196 "known categories")
ValueError: 'columns' must have known categories. Please use `df[columns].cat.as_known()` beforehand to ensure known categories
Then I tried:
user_item.song_id = user_item.song_id.astype('category').cat.as_known()
And I immediately got:
KeyError Traceback (most recent call last)
<timed exec> in <module>
~/anaconda3/lib/python3.6/site-packages/dask/dataframe/categorical.py in as_known(self, **kwargs)
187 if self.known:
188 return self._series
--> 189 categories = self._property_map('categories').unique().compute(**kwargs)
190 return self.set_categories(categories.values)
191
~/anaconda3/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
154 dask.base.compute
155 """
--> 156 (result,) = compute(self, traverse=False, **kwargs)
157 return result
158
~/anaconda3/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
395 keys = [x.__dask_keys__() for x in collections]
396 postcomputes = [x.__dask_postcompute__() for x in collections]
--> 397 results = schedule(dsk, keys, **kwargs)
398 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
399
~/anaconda3/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
2336 try:
2337 results = self.gather(packed, asynchronous=asynchronous,
-> 2338 direct=direct)
2339 finally:
2340 for f in futures.values():
~/anaconda3/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
1660 return self.sync(self._gather, futures, errors=errors,
1661 direct=direct, local_worker=local_worker,
-> 1662 asynchronous=asynchronous)
1663
1664 @gen.coroutine
~/anaconda3/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
674 return future
675 else:
--> 676 return sync(self.loop, func, *args, **kwargs)
677
678 def __repr__(self):
~/anaconda3/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
275 e.wait(10)
276 if error[0]:
--> 277 six.reraise(*error[0])
278 else:
279 return result[0]
~/anaconda3/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
684 if value.__traceback__ is not tb:
685 raise value.with_traceback(tb)
--> 686 raise value
687
688 else:
~/anaconda3/lib/python3.6/site-packages/distributed/utils.py in f()
260 if timeout is not None:
261 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262 result[0] = yield future
263 except Exception as exc:
264 error[0] = sys.exc_info()
~/anaconda3/lib/python3.6/site-packages/tornado/gen.py in run(self)
1131
1132 try:
-> 1133 value = future.result()
1134 except Exception:
1135 self.had_exception = True
~/anaconda3/lib/python3.6/site-packages/tornado/gen.py in run(self)
1139 if exc_info is not None:
1140 try:
-> 1141 yielded = self.gen.throw(*exc_info)
1142 finally:
1143 # Break up a reference to itself
~/anaconda3/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1501 six.reraise(type(exception),
1502 exception,
-> 1503 traceback)
1504 if errors == 'skip':
1505 bad_keys.add(key)
~/anaconda3/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
683 value = tp()
684 if value.__traceback__ is not tb:
--> 685 raise value.with_traceback(tb)
686 raise value
687
/home/pi/env/lib/python3.5/site-packages/dask/dataframe/core.py in apply_and_enforce()
KeyError: '_func'
And the output of my workers is:
Exception: KeyError('_func',)
distributed.worker - WARNING - Compute Failed
Function: execute_task
args: ((<function apply at 0x764b3c90>, <function unique at 0x6ef24a50>, [(<function apply_and_enforce at 0x6eeede88>, <function Accessor._delegate_property at 0x6ef28198>, [(<function apply_and_enforce at 0x6eeede88>, <methodcaller: astype>, [(<built-in function getitem>, (<function apply at 0x764b3c90>, <function partial_by_order at 0x762ebd20>, [ msno ... interacted
0 vDi/nHqBu7wb+DtI2Ix4TupWQatUEFR41mDC0c8Voh8= ... 1
1 3IGfhB6dtaYxEGm20yFtRxN7KoFZjzGJbXPSjsjW5cM= ... 1
2 4QugsKXr1pJXSBj6CbSYCF6O7QY2/MHGICUU16p3fig= ... 1
3 i4g6DQpmkTuRCS6/osUsQ8GSBJM8261is4Q04NDGRPk= ... 1
4 TTNNMisplhps4y5gdQ6rsv0++TIKOOIIZLz05W97vFU= ... 1
5 sDR8kS+t73zE9QM8D03Zw3mVrsRXc0Nih/WRl02sfZI= ... 1
6 yiGYGWyGrCYHlMOtPv65urw9RfdH43PNGzu8TRaO+m8= ... 1
7 7lXXPZLRbAPWE5ILi2BFQVEhYzPz9cwNvuzIVCuHfZY= ... 1
8 4clHF4wjaFgY6+nQWoXm1EEAvB
kwargs: {}
Exception: KeyError('_func',)
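If anyone knows how to fix this issue, it would help me a lot.
Solved by installing the same versions of dask-distributed and dask-core on all the workers, the scheduler and the client. (Note that in the traceback above the client frames come from a Python 3.6 environment, while the worker frame comes from a Python 3.5 one.)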
I'm taking an online python course (EpiSkills, which uses the Jupyter notebook) that was written in Python 2.7, and I'm on Python 3.6.4 so I have run into a few compatibility issues along the way. Most of the time I've been able to stumble through, but can't figure out this one, so was hoping someone might be able to help.
I start with the following packages:
import pandas as pd
import epipy
import seaborn as sns
%pylab inline
import statsmodels.api as sm
from scipy import stats
import numpy as np
And use the following code to create a pandas series and model:
multivar_model = sm.formula.glm('age ~ onset_to_hospital + onset_to_death + sex',
                                data=my_data).fit()
new_data = pd.Series([6, 8, 'male'], index=['onset_to_hospital', 'onset_to_death', 'sex'])
When I pass this series to the following code, I get the error shown at the end of this post:
multivar_model.predict(new_data)
The intended output is meant to be this:
array([ 60.6497459])
I know that a lot of NameErrors are because something has been specified in the local, not global, environment but I'm unsure how to correct it in this instance. Any help is much appreciated.
Thanks!
C
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
~\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\patsy\compat.py in call_and_wrap_exc(msg, origin, f, *args, **kwargs)
116 try:
--> 117 return f(*args, **kwargs)
118 except Exception as e:
~\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\patsy\eval.py in eval(self, expr, source_name, inner_namespace)
165 return eval(code, {}, VarLookupDict([inner_namespace]
--> 166 + self._namespaces))
167
<string> in <module>()
NameError: name 'onset_to_death' is not defined
The above exception was the direct cause of the following exception:
PatsyError Traceback (most recent call last)
<ipython-input-79-e0364e267da7> in <module>()
----> 1 multivar_model.predict(new_data)
~\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\statsmodels\base\model.py in predict(self, exog, transform, *args, **kwargs)
774 exog_index = exog.index
775 exog = dmatrix(self.model.data.design_info.builder,
--> 776 exog, return_type="dataframe")
777 if len(exog) < len(exog_index):
778 # missing values, rows have been dropped
~\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\patsy\highlevel.py in dmatrix(formula_like, data, eval_env, NA_action, return_type)
289 eval_env = EvalEnvironment.capture(eval_env, reference=1)
290 (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 291 NA_action, return_type)
292 if lhs.shape[1] != 0:
293 raise PatsyError("encountered outcome variables for a model "
~\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\patsy\highlevel.py in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
167 return build_design_matrices(design_infos, data,
168 NA_action=NA_action,
--> 169 return_type=return_type)
170 else:
171 # No builders, but maybe we can still get matrices
~\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\patsy\build.py in build_design_matrices(design_infos, data, NA_action, return_type, dtype)
886 for factor_info in six.itervalues(design_info.factor_infos):
887 if factor_info not in factor_info_to_values:
--> 888 value, is_NA = _eval_factor(factor_info, data, NA_action)
889 factor_info_to_isNAs[factor_info] = is_NA
890 # value may now be a Series, DataFrame, or ndarray
~\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\patsy\build.py in _eval_factor(factor_info, data, NA_action)
61 def _eval_factor(factor_info, data, NA_action):
62 factor = factor_info.factor
---> 63 result = factor.eval(factor_info.state, data)
64 # Returns either a 2d ndarray, or a DataFrame, plus is_NA mask
65 if factor_info.type == "numerical":
~\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\patsy\eval.py in eval(self, memorize_state, data)
564 return self._eval(memorize_state["eval_code"],
565 memorize_state,
--> 566 data)
567
568 __getstate__ = no_pickling
~\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\patsy\eval.py in _eval(self, code, memorize_state, data)
549 memorize_state["eval_env"].eval,
550 code,
--> 551 inner_namespace=inner_namespace)
552
553 def memorize_chunk(self, state, which_pass, data):
~\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\patsy\compat.py in call_and_wrap_exc(msg, origin, f, *args, **kwargs)
122 origin)
123 # Use 'exec' to hide this syntax from the Python 2 parser:
--> 124 exec("raise new_exc from e")
125 else:
126 # In python 2, we just let the original exception escape -- better
~\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\patsy\compat.py in <module>()
PatsyError: Error evaluating factor: NameError: name 'onset_to_death' is not defined
age ~ onset_to_hospital + onset_to_death + sex
^^^^^^^^^^^^^^
I'm trying to reproduce the coal mining example with a deterministic function for the switchpoint instead of using Theano's switch function. Code:
%matplotlib inline
import matplotlib.pyplot as plt
import pymc3
import numpy as np
import theano.tensor as t
import theano
data = np.hstack((np.random.poisson(15,1000),np.random.poisson(2,100)))
plt.plot(data)
@theano.compile.ops.as_op(itypes=[t.lscalar, t.dscalar, t.dscalar], otypes=[t.dvector])
def rate1(sw, mu1, mu2):
    n = len(data)
    out = np.empty(n)
    out[:sw] = mu1
    out[sw:] = mu2
    return out
with pymc3.Model() as dis:
    switchpoint = pymc3.DiscreteUniform('switchpoint', lower=0, upper=len(data)-1)
    mu1 = pymc3.Exponential('mu1', lam=1.)
    mu2 = pymc3.Exponential('mu2', lam=1.)
    disasters = pymc3.Poisson('disasters', mu=rate1, observed=data)
But this code raises an error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
c:\program files\git\theano\theano\tensor\type.py in dtype_specs(self)
266 'complex64': (complex, 'theano_complex64', 'NPY_COMPLEX64')
--> 267 }[self.dtype]
268 except KeyError:
KeyError: 'object'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
c:\program files\git\theano\theano\tensor\basic.py in constant_or_value(x, rtype, name, ndim, dtype)
407 rval = rtype(
--> 408 TensorType(dtype=x_.dtype, broadcastable=bcastable),
409 x_.copy(),
c:\program files\git\theano\theano\tensor\type.py in __init__(self, dtype, broadcastable, name, sparse_grad)
49 self.broadcastable = tuple(bool(b) for b in broadcastable)
---> 50 self.dtype_specs() # error checking is done there
51 self.name = name
c:\program files\git\theano\theano\tensor\type.py in dtype_specs(self)
269 raise TypeError("Unsupported dtype for %s: %s"
--> 270 % (self.__class__.__name__, self.dtype))
271
TypeError: Unsupported dtype for TensorType: object
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
c:\program files\git\theano\theano\tensor\basic.py in as_tensor_variable(x, name, ndim)
201 try:
--> 202 return constant(x, name=name, ndim=ndim)
203 except TypeError:
c:\program files\git\theano\theano\tensor\basic.py in constant(x, name, ndim, dtype)
421 ret = constant_or_value(x, rtype=TensorConstant, name=name, ndim=ndim,
--> 422 dtype=dtype)
423
c:\program files\git\theano\theano\tensor\basic.py in constant_or_value(x, rtype, name, ndim, dtype)
416 except Exception:
--> 417 raise TypeError("Could not convert %s to TensorType" % x, type(x))
418
TypeError: ('Could not convert FromFunctionOp{rate1} to TensorType',
)
During handling of the above exception, another exception occurred:
AsTensorError Traceback (most recent call last)
in ()
14 mu2 = pymc3.Exponential('mu2',lam=1.)
15 #rate1 = pymc3.switch(switchpoint >= np.arange(len(data)), mu1,mu2)
---> 16 disasters=pymc3.Poisson('disasters', mu=rate1, observed = data)
C:\Users\User\Anaconda3\lib\site-packages\pymc3\distributions\distribution.py in __new__(cls, name, *args, **kwargs)
19 if isinstance(name, str):
20 data = kwargs.pop('observed', None)
---> 21 dist = cls.dist(*args, **kwargs)
22 return model.Var(name, dist, data)
23 elif name is None:
C:\Users\User\Anaconda3\lib\site-packages\pymc3\distributions\distribution.py in dist(cls, *args, **kwargs)
32 def dist(cls, *args, **kwargs):
33 dist = object.__new__(cls)
---> 34 dist.__init__(*args, **kwargs)
35 return dist
36
C:\Users\User\Anaconda3\lib\site-packages\pymc3\distributions\discrete.py in __init__(self, mu, *args, **kwargs)
185 super(Poisson, self).__init__(*args, **kwargs)
186 self.mu = mu
--> 187 self.mode = floor(mu).astype('int32')
188
189 def random(self, point=None, size=None, repeat=None):
c:\program files\git\theano\theano\gof\op.py in __call__(self, *inputs, **kwargs)
598 """
599 return_list = kwargs.pop('return_list', False)
--> 600 node = self.make_node(*inputs, **kwargs)
601
602 if config.compute_test_value != 'off':
c:\program files\git\theano\theano\tensor\elemwise.py in make_node(self, *inputs)
540 using DimShuffle.
541 """
--> 542 inputs = list(map(as_tensor_variable, inputs))
543 shadow = self.scalar_op.make_node(
544 *[get_scalar_type(dtype=i.type.dtype).make_variable()
c:\program files\git\theano\theano\tensor\basic.py in as_tensor_variable(x, name, ndim)
206 except Exception:
207 str_x = repr(x)
--> 208 raise AsTensorError("Cannot convert %s to TensorType" % str_x, type(x))
209
210 # this has a different name, because _as_tensor_variable is the
AsTensorError: ('Cannot convert FromFunctionOp{rate1} to TensorType',
)
How do I handle this?
The second thing: when I use the pymc3.switch function like this:
with pymc3.Model() as dis:
    switchpoint = pymc3.DiscreteUniform('switchpoint', lower=0, upper=len(data) - 1)
    mu1 = pymc3.Exponential('mu1', lam=1.)
    mu2 = pymc3.Exponential('mu2', lam=1.)
    rate1 = pymc3.switch(switchpoint >= np.arange(len(data)), mu1, mu2)
    disasters = pymc3.Poisson('disasters', mu=rate1, observed=data)
And then try to sample:
with dis:
    step1 = pymc3.NUTS([mu1, mu2])
    step2 = pymc3.Metropolis([switchpoint])
    trace = pymc3.sample(10000, step=[step1, step2])
I get an error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
c:\program files\git\theano\theano\compile\function_module.py in __call__(self, *args, **kwargs)
858 try:
--> 859 outputs = self.fn()
860 except Exception:
TypeError: expected type_num 9 (NPY_INT64) got 7
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-4-3247d908f897> in <module>()
2 step1 = pymc3.NUTS([mu1, mu2])
3 step2 = pymc3.Metropolis([switchpoint])
----> 4 trace = pymc3.sample(10000, step = [step1,step2])
C:\Users\User\Anaconda3\lib\site-packages\pymc3\sampling.py in sample(draws, step, start, trace, chain, njobs, tune, progressbar, model, random_seed)
153 sample_args = [draws, step, start, trace, chain,
154 tune, progressbar, model, random_seed]
--> 155 return sample_func(*sample_args)
156
157
C:\Users\User\Anaconda3\lib\site-packages\pymc3\sampling.py in _sample(draws, step, start, trace, chain, tune, progressbar, model, random_seed)
162 progress = progress_bar(draws)
163 try:
--> 164 for i, strace in enumerate(sampling):
165 if progressbar:
166 progress.update(i)
C:\Users\User\Anaconda3\lib\site-packages\pymc3\sampling.py in _iter_sample(draws, step, start, trace, chain, tune, model, random_seed)
244 if i == tune:
245 step = stop_tuning(step)
--> 246 point = step.step(point)
247 strace.record(point)
248 yield strace
C:\Users\User\Anaconda3\lib\site-packages\pymc3\step_methods\compound.py in step(self, point)
11 def step(self, point):
12 for method in self.methods:
---> 13 point = method.step(point)
14 return point
C:\Users\User\Anaconda3\lib\site-packages\pymc3\step_methods\arraystep.py in step(self, point)
116 bij = DictToArrayBijection(self.ordering, point)
117
--> 118 apoint = self.astep(bij.map(point))
119 return bij.rmap(apoint)
120
C:\Users\User\Anaconda3\lib\site-packages\pymc3\step_methods\metropolis.py in astep(self, q0)
123
124
--> 125 q_new = metrop_select(self.delta_logp(q,q0), q, q0)
126
127 if q_new is q:
c:\program files\git\theano\theano\compile\function_module.py in __call__(self, *args, **kwargs)
869 node=self.fn.nodes[self.fn.position_of_error],
870 thunk=thunk,
--> 871 storage_map=getattr(self.fn, 'storage_map', None))
872 else:
873 # old-style linkers raise their own exceptions
c:\program files\git\theano\theano\gof\link.py in raise_with_op(node, thunk, exc_info, storage_map)
312 # extra long error message in that case.
313 pass
--> 314 reraise(exc_type, exc_value, exc_trace)
315
316
C:\Users\User\Anaconda3\lib\site-packages\six.py in reraise(tp, value, tb)
656 value = tp()
657 if value.__traceback__ is not tb:
--> 658 raise value.with_traceback(tb)
659 raise value
660
c:\program files\git\theano\theano\compile\function_module.py in __call__(self, *args, **kwargs)
857 t0_fn = time.time()
858 try:
--> 859 outputs = self.fn()
860 except Exception:
861 if hasattr(self.fn, 'position_of_error'):
TypeError: expected type_num 9 (NPY_INT64) got 7
Apply node that caused the error: Elemwise{Composite{Switch(GE(i0, i1), i2, i3)}}(InplaceDimShuffle{x}.0, TensorConstant{[ 0 1..1098 1099]}, InplaceDimShuffle{x}.0, InplaceDimShuffle{x}.0)
Toposort index: 11
Inputs types: [TensorType(int64, (True,)), TensorType(int32, vector), TensorType(float64, (True,)), TensorType(float64, (True,))]
Inputs shapes: [(1,), (1100,), (1,), (1,)]
Inputs strides: [(4,), (4,), (8,), (8,)]
Inputs values: [array([549]), 'not shown', array([ 1.07762995]), array([ 1.01502801])]
Outputs clients: [[Elemwise{eq,no_inplace}(Elemwise{Composite{Switch(GE(i0, i1), i2, i3)}}.0, TensorConstant{(1,) of 0}), Elemwise{Composite{Switch(GE(i0, i1), ((Switch(i2, i3, (i4 * log(i0))) - i5) - i0), i3)}}[(0, 0)](Elemwise{Composite{Switch(GE(i0, i1), i2, i3)}}.0, TensorConstant{(1,) of 0}, InplaceDimShuffle{x}.0, TensorConstant{(1,) of -inf}, TensorConstant{[ 13. 13... 0. 1.]}, TensorConstant{[ 22.55216... ]})]]
HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
As a simple analyst, do I need to learn all this Theano machinery just to be able to work on my statistical problems? Is the new gradient-based MCMC sampler the only thing that should motivate me to switch from pymc2 to pymc3?
For your first question, it looks like you're passing a Theano function itself where a variable is expected. You need to call the function with the other variables as arguments; the call then returns a Theano variable. Try changing your line to:
disasters = pymc3.Poisson('disasters', mu=rate1(switchpoint, mu1, mu2), observed=data)
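To make the distinction concrete, here is a plain-Python analogy that involves neither Theano nor PyMC3. Everything in it is an assumption for illustration: `rate1` is a hypothetical stand-in for your decorated function (with an extra `n` argument in place of `len(data)`), and `poisson_mode` only mimics the `floor(mu)` step visible in your traceback. It is a sketch of the idea, not how PyMC3 is implemented.

```python
import math


def rate1(switchpoint, mu1, mu2, n=5):
    """Hypothetical stand-in: rate mu1 up to the switchpoint, mu2 after it."""
    return [mu1 if t <= switchpoint else mu2 for t in range(n)]


def poisson_mode(mu):
    """Mimics the floor(mu) step applied to the rate (illustration only)."""
    return [math.floor(m) for m in mu]


# Passing the function object itself fails: a function is not a set of rates.
try:
    poisson_mode(rate1)
    passed_function_ok = True
except TypeError:
    passed_function_ok = False

# Calling the function first yields actual values that can serve as the rate.
rates = rate1(2, 1.5, 4.2)
modes = poisson_mode(rates)

print(passed_function_ok, rates, modes)
```

The same thing happens in your model: `mu=rate1` hands over the function object, while `mu=rate1(switchpoint, mu1, mu2)` hands over its result.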
I couldn't reproduce the error in your second part; the sampling worked just fine for me.
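One thing that might explain why it fails for you and not for me, though this is only a guess and not a confirmed diagnosis: your traceback lists the first input of the failing node as int64 with a stride of 4 bytes, so the data arriving there looks like 32-bit integers. NumPy releases before 2.0 use the C `long` as their default integer type, which is 4 bytes on Windows and usually 8 bytes on 64-bit Linux and macOS. A quick standard-library check of what your platform uses (the variable names are mine):

```python
from array import array

# Size in bytes of a C long on this platform: 4 on Windows, typically 8 on
# 64-bit Linux/macOS. NumPy before 2.0 used C long as its default integer.
c_long_bytes = array('l').itemsize

# Size of a C long long, which is 8 bytes on mainstream platforms and
# corresponds to the int64 the failing node expects.
int64_bytes = array('q').itemsize

print("C long:", c_long_bytes, "bytes; long long:", int64_bytes, "bytes")
if c_long_bytes < int64_bytes:
    print("Default integers on this platform are narrower than int64.")
```

If the sizes differ on your machine, that would at least be consistent with the `expected type_num 9 (NPY_INT64) got 7` message, since type number 7 is NumPy's code for C `long`.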