Resampling of categorical column in pandas data frame

Resampling of categorical column in pandas data frame - python

I need some help in figuring out this. Have been trying a few things but not working. I have a pandas data frame shown below(in the end) :
The data is available at irregular intervals ( frequency not fixed). I am looking to sample the data at a fixed frequency for eg every 1 minute. If the column is a float then mean every 1 minute works fine
df1.resample('1T',base = 1).mean()
but since the data is categorical mean doesn't make sense, I also tried sum which is also not making sense from sampling. What essentially I need is the max count of the column when sampled at 1 minute To do this I used the following code to apply the custom function to the values that fall in 1 minute when resampling . .
def custome_mod(arraylike):
vals, counts = np.unique(arraylike, return_counts=True)
return (np.argwhere(counts == np.max(counts)))
df1.resample('1T',base = 1).apply(custome_mod)
The output I am expecting is : data frame available at every 1 minute and value with maximum count for the data that fall in that 1 minute .
For some reason it does not seem to work and gives me error . Have been trying to debugg for a very long time . Can somebody please provide some inputs/code check ?
The error I get is following :
ValueError: zero-size array to reduction operation maximum which has no identity
ValueError Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
264 try:
--> 265 return self._python_agg_general(func, *args, **kwargs)
266 except (ValueError, KeyError):
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_agg_general(self, func, *args, **kwargs)
935
--> 936 result, counts = self.grouper.agg_series(obj, f)
937 assert result is not None
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/ops.py in agg_series(self, obj, func)
862 grouper = libreduction.SeriesBinGrouper(obj, func, self.bins, dummy)
--> 863 return grouper.get_result()
864
pandas/_libs/reduction.pyx in pandas._libs.reduction.SeriesBinGrouper.get_result()
pandas/_libs/reduction.pyx in pandas._libs.reduction._BaseGrouper._apply_to_group()
pandas/_libs/reduction.pyx in pandas._libs.reduction._check_result_array()
ValueError: Function does not reduce
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in _groupby_and_aggregate(self, how, grouper, *args, **kwargs)
358 # Check if the function is reducing or not.
--> 359 result = grouped._aggregate_item_by_item(how, *args, **kwargs)
360 else:
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _aggregate_item_by_item(self, func, *args, **kwargs)
1171 try:
-> 1172 result[item] = colg.aggregate(func, *args, **kwargs)
1173
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
268 # see see test_groupby.test_basic
--> 269 result = self._aggregate_named(func, *args, **kwargs)
270
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _aggregate_named(self, func, *args, **kwargs)
453 if isinstance(output, (Series, Index, np.ndarray)):
--> 454 raise ValueError("Must produce aggregated value")
455 result[name] = output
ValueError: Must produce aggregated value
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<command-36984414005459> in <module>
----> 1 df1.resample('1T',base = 1).apply(custome_mod)
/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in aggregate(self, func, *args, **kwargs)
283 how = func
284 grouper = None
--> 285 result = self._groupby_and_aggregate(how, grouper, *args, **kwargs)
286
287 result = self._apply_loffset(result)
/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in _groupby_and_aggregate(self, how, grouper, *args, **kwargs)
380 # we have a non-reducing function
381 # try to evaluate
--> 382 result = grouped.apply(how, *args, **kwargs)
383
384 result = self._apply_loffset(result)
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
733 with option_context("mode.chained_assignment", None):
734 try:
--> 735 result = self._python_apply_general(f)
736 except TypeError:
737 # gh-20949
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
749
750 def _python_apply_general(self, f):
--> 751 keys, values, mutated = self.grouper.apply(f, self._selected_obj, self.axis)
752
753 return self._wrap_applied_output(
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
204 # group might be modified
205 group_axes = group.axes
--> 206 res = f(group)
207 if not _is_indexed_like(res, group_axes):
208 mutated = True
<command-36984414005658> in custome_mod(arraylike)
1 def custome_mod(arraylike):
2 vals, counts = np.unique(arraylike, return_counts=True)
----> 3 return (np.argwhere(counts == np.max(counts)))
<__array_function__ internals> in amax(*args, **kwargs)
/databricks/python/lib/python3.7/site-packages/numpy/core/fromnumeric.py in amax(a, axis, out, keepdims, initial, where)
2666 """
2667 return _wrapreduction(a, np.maximum, 'max', axis, None, out,
-> 2668 keepdims=keepdims, initial=initial, where=where)
2669
2670
/databricks/python/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
88 return reduction(axis=axis, out=out, **passkwargs)
89
---> 90 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
91
92
ValueError: zero-size array to reduction operation maximum which has no identity
Sample Dataframe and expected Output
Sample Df
6/3/2021 1:19:05 0
6/3/2021 1:19:15 1
6/3/2021 1:19:26 1
6/3/2021 1:19:38 1
6/3/2021 1:20:06 0
6/3/2021 1:20:16 0
6/3/2021 1:20:36 1
6/3/2021 1:21:09 1
6/3/2021 1:21:19 1
6/3/2021 1:21:45 0
6/4/2021 1:19:15 0
6/4/2021 1:19:25 0
6/4/2021 1:19:36 0
6/4/2021 1:19:48 1
6/4/2021 1:22:26 1
6/4/2021 1:22:36 0
6/4/2021 1:22:46 0
6/5/2021 2:20:19 0
6/5/2021 2:20:21 1
6/5/2021 2:20:40 0
Expected Output
6/3/2021 1:19 1
6/3/2021 1:20 0
6/3/2021 1:21 1
6/4/2021 1:19 0
6/4/2021 1:22 0
6/5/2021 2:20 0
Notice that original Data frame has data available at irregular frequency ( sometime every 5 second 20 seconds etc . The output expected is also show abover - need data every 1 minute ( resample to every minute instead of original irregular seconds) and the categorical column should have most frequent value during that minute. For ex : in orginal data at in 19minute there are four data points and the most frequent value in that is 1. Similarly at 20 minute there are three data points in original data and the most frquent is 0 . Similarly for 21 minutes there are three data points and the most frequent is 1. Also data I am working has 20 million rows . Hope it helps, This is an effort to reduce the data dimension .
After expected output I would do groupby column and count . This count will be in minutes and I will be able to know How long this column was 1 (in time )

Update after your edit:
out = df.set_index(pd.to_datetime(df.index).floor('T')) \
.groupby(level=0)['category'] \
.apply(lambda x: x.value_counts().idxmax())
print(out)
# Output
2021-06-03 01:19:00 1
2021-06-03 01:20:00 0
2021-06-03 01:21:00 1
2021-06-04 01:19:00 0
2021-06-04 01:22:00 0
2021-06-05 02:20:00 0
Name: category, dtype: int64
Old answer
# I used 'D' instead of 'T'
>>> df.set_index(df.index.floor('D')).groupby(level=0).count()
category
2021-06-03 6
2021-06-04 2
2021-06-06 1
2021-06-08 1
2021-06-25 1
2021-06-29 6
2021-06-30 3
# OR
>>> df.set_index(df.index.floor('D')).groupby(level=0).sum()
category
2021-06-03 2
2021-06-04 0
2021-06-06 1
2021-06-08 1
2021-06-25 0
2021-06-29 3
2021-06-30 1

Related

Pandas pivot_table Assertion error: `result` has not been initialized

df:
avg
count
date
val
prop
unit
distance
d-atmp
d-clouds
d-dewpoint
0.0786107
12
2014-10-03 00:00:00
22
atmp
(Deg C)
24829.6
24829.6
nan
nan
0.0786107
12
2014-10-03 00:00:00
0
clouds
(oktas)
22000.6
nan
22000.6
nan
0.0786107
12
2014-10-03 00:00:00
32
dewpoint
(Deg C)
21344.1
nan
nan
21344.1
0.0684246
6
2014-10-04 00:00:00
21.5
atmp
(Deg C)
26345.1
26345.1
nan
nan
cols = ['avg', 'date', 'count', 'd-atmp', 'd-cloud', 'd-dewpoint']
d = pd.pivot_table(x, index=cols, columns=['prop', 'unit'], values='val', aggfunc=max)
Ideal result:
date
countObs
avg
d-atmp
atmp (Deg C)
d-clouds
clouds (oktas)
d-dewpoint
dewpoint (Deg C)
2014-10-03 00:00:00
12
0.0786107
24829.6
22
22000.6
0
21344.1
32
2014-10-04 00:00:00
6
0.0684246
26345.1
21.5
nan
nan
nan
nan
Error
--------------------------------------------------------------------------- NotImplementedError Traceback (most recent call last) ~/.local/lib/python3.9/site-packages/pandas/core/groupby/generic.py in array_func(values) 1067 try:
-> 1068 result = self.grouper._cython_operation( 1069 "aggregate", values, how, axis=data.ndim - 1, min_count=min_count
~/.local/lib/python3.9/site-packages/pandas/core/groupby/ops.py in
_cython_operation(self, kind, values, how, axis, min_count, **kwargs)
998 ngroups = self.ngroups
--> 999 return cy_op.cython_operation( 1000 values=values,
~/.local/lib/python3.9/site-packages/pandas/core/groupby/ops.py in cython_operation(self, values, axis, min_count, comp_ids, ngroups,
**kwargs)
659
--> 660 return self._cython_op_ndim_compat(
661 values,
~/.local/lib/python3.9/site-packages/pandas/core/groupby/ops.py in
_cython_op_ndim_compat(self, values, min_count, ngroups, comp_ids, mask, **kwargs)
515
--> 516 return self._call_cython_op(
517 values,
~/.local/lib/python3.9/site-packages/pandas/core/groupby/ops.py in
_call_cython_op(self, values, min_count, ngroups, comp_ids, mask, **kwargs)
561 out_shape = self._get_output_shape(ngroups, values)
--> 562 func, values = self.get_cython_func_and_vals(values, is_numeric)
563 out_dtype = self.get_out_dtype(values.dtype)
~/.local/lib/python3.9/site-packages/pandas/core/groupby/ops.py in get_cython_func_and_vals(self, values, is_numeric)
204
--> 205 func = self._get_cython_function(kind, how, values.dtype, is_numeric)
206
~/.local/lib/python3.9/site-packages/pandas/core/groupby/ops.py in
_get_cython_function(cls, kind, how, dtype, is_numeric)
169 # raise NotImplementedError here rather than TypeError later
--> 170 raise NotImplementedError(
171 f"function is not implemented for this dtype: "
NotImplementedError: function is not implemented for this dtype: [how->mean,dtype->object]
During handling of the above exception, another exception occurred:
AssertionError Traceback (most recent call last) <ipython-input-119-b64b487d2810> in <module>
5 # o
6 # cols += []
----> 7 d = pd.pivot_table(x, index=cols, columns=['osmcObsProperty', 'unit'], values='val') #, aggfunc=max #np.mean or max appear similar , dropna=False
8
9 d.reset_index(inplace=True)
~/.local/lib/python3.9/site-packages/pandas/core/reshape/pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name, observed, sort)
93 return table.__finalize__(data, method="pivot_table")
94
---> 95 table = __internal_pivot_table(
96 data,
97 values,
~/.local/lib/python3.9/site-packages/pandas/core/reshape/pivot.py in
__internal_pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name, observed, sort)
163
164 grouped = data.groupby(keys, observed=observed, sort=sort)
--> 165 agged = grouped.agg(aggfunc)
166 if dropna and isinstance(agged, ABCDataFrame) and len(agged.columns):
167 agged = agged.dropna(how="all")
~/.local/lib/python3.9/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
977
978 op = GroupByApply(self, func, args, kwargs)
--> 979 result = op.agg()
980 if not is_dict_like(func) and result is not None:
981 return result
~/.local/lib/python3.9/site-packages/pandas/core/apply.py in agg(self)
156
157 if isinstance(arg, str):
--> 158 return self.apply_str()
159
160 if is_dict_like(arg):
~/.local/lib/python3.9/site-packages/pandas/core/apply.py in apply_str(self)
505 elif self.axis != 0:
506 raise ValueError(f"Operation {f} does not support axis=1")
--> 507 return self._try_aggregate_string_function(obj, f, *self.args, **self.kwargs)
508
509 def apply_multiple(self) -> FrameOrSeriesUnion:
~/.local/lib/python3.9/site-packages/pandas/core/apply.py in
_try_aggregate_string_function(self, obj, arg, *args, **kwargs)
575 if f is not None:
576 if callable(f):
--> 577 return f(*args, **kwargs)
578
579 # people may try to aggregate on a non-callable attribute
~/.local/lib/python3.9/site-packages/pandas/core/groupby/groupby.py in mean(self, numeric_only) 1685 numeric_only = self._resolve_numeric_only(numeric_only) 1686
-> 1687 result = self._cython_agg_general( 1688 "mean", 1689 alt=lambda x: Series(x).mean(numeric_only=numeric_only),
~/.local/lib/python3.9/site-packages/pandas/core/groupby/generic.py in
_cython_agg_general(self, how, alt, numeric_only, min_count) 1080 # TypeError -> we may have an exception in trying to aggregate 1081 # continue and exclude the block
-> 1082 new_mgr = data.grouped_reduce(array_func, ignore_failures=True) 1083 1084 if len(new_mgr) < len(data):
~/.local/lib/python3.9/site-packages/pandas/core/internals/managers.py in grouped_reduce(self, func, ignore_failures) 1233 for sb in blk._split(): 1234 try:
-> 1235 applied = sb.apply(func) 1236 except (TypeError, NotImplementedError): 1237 if not ignore_failures:
~/.local/lib/python3.9/site-packages/pandas/core/internals/blocks.py in apply(self, func, **kwargs)
379 """
380 with np.errstate(all="ignore"):
--> 381 result = func(self.values, **kwargs)
382
383 return self._split_op_result(result)
~/.local/lib/python3.9/site-packages/pandas/core/groupby/generic.py in array_func(values) 1074 # try to python agg 1075
# TODO: shouldn't min_count matter?
-> 1076 result = self._agg_py_fallback(values, ndim=data.ndim, alt=alt) 1077 1078 return result
~/.local/lib/python3.9/site-packages/pandas/core/groupby/groupby.py in
_agg_py_fallback(self, values, ndim, alt) 1396 # should always be preserved by the implemented aggregations 1397 # TODO: Is this exactly right; see WrappedCythonOp get_result_dtype?
-> 1398 res_values = self.grouper.agg_series(ser, alt, preserve_dtype=True) 1399 1400 if isinstance(values, Categorical):
~/.local/lib/python3.9/site-packages/pandas/core/groupby/ops.py in agg_series(self, obj, func, preserve_dtype) 1047 1048 else:
-> 1049 result = self._aggregate_series_fast(obj, func) 1050 1051 npvalues = lib.maybe_convert_objects(result, try_float=False)
~/.local/lib/python3.9/site-packages/pandas/core/groupby/ops.py in
_aggregate_series_fast(self, obj, func) 1072 ids = ids.take(indexer) 1073 sgrouper = libreduction.SeriesGrouper(obj, func, ids, ngroups)
-> 1074 result, _ = sgrouper.get_result() 1075 return result 1076
~/.local/lib/python3.9/site-packages/pandas/_libs/reduction.pyx in pandas._libs.reduction.SeriesGrouper.get_result()
AssertionError: `result` has not been initialized.

IIUC, you can use groupby_agg:
out = df.groupby('date', as_index=False).agg(max)
Output:
date
avg
count
val
prop
unit
distance
d-atmp
d-clouds
d-dewpoint
2014-10-03 00:00:00
0.0786107
12
32
dewpoint
(oktas)
24829.6
24829.6
22000.6
21344.1
2014-10-04 00:00:00
0.0684246
6
21.5
atmp
(Deg C)
26345.1
26345.1
nan
nan

You could pivot; then use groupby + max:
cols = ['avg', 'date', 'count', 'd-atmp', 'd-clouds', 'd-dewpoint']
tmp = df.pivot(index=cols, columns=['prop', 'unit'], values='val')
tmp.columns = tmp.columns.map(' '.join)
out = tmp.reset_index().groupby('date', as_index=False).max()\
[['date', 'count', 'avg', 'd-atmp', 'atmp (Deg C)', 'd-clouds',
'clouds (oktas)', 'd-dewpoint', 'dewpoint (Deg C)']]
Output:
date count avg d-atmp atmp (Deg C) d-clouds clouds (oktas) d-dewpoint dewpoint (Deg C)
0 2014-10-03 00:00:00 12 0.078611 24829.6 22.0 22000.6 0.0 21344.1 32.0
1 2014-10-04 00:00:00 6 0.068425 26345.1 21.5 NaN NaN NaN NaN

Python Pandas apply qcut to grouped by level 0 of multi-index in multi-index dataframe

I have a multi-index dataframe in pandas (date and entity_id) and for each date/entity I have obseravtions of a number of variables (A, B ...). My goal is to create a dataframe with the same shape but where the values are replaced by their decile scores.
My test data looks like this:
I want to apply qcut to each column grouped by level 0 of the multi-index - the issue I have is creating a result Dataframe
This code
def qcut_sub_index(df_with_sub_index):
# create empty return value same shape as passed dataframe
df_return=pd.DataFrame()
for date, sub_df in df_with_sub_index.groupby(level=0):
df_return=df_return.append(pd.DataFrame(pd.qcut(sub_df, 10, labels=False, duplicates='drop')))
print(df_return)
return df_return
print(df_values.apply(lambda x: qcut_sub_index(x), axis=0))
returns
A
as_at_date entity_id
2008-01-27 2928 0
2932 3
3083 6
3333 9
2008-02-27 2928 3
2935 9
3333 0
3874 6
2008-03-27 2928 1
2932 2
2934 0
2936 9
2937 4
2939 9
2940 7
2943 3
2944 0
2945 8
2946 6
2947 5
2949 4
B
as_at_date entity_id
2008-01-27 2928 9
2932 6
3083 0
3333 3
2008-02-27 2928 6
2935 0
3333 3
3874 9
2008-03-27 2928 0
2932 9
2934 2
2936 8
2937 7
2939 6
2940 3
2943 1
2944 4
2945 9
2946 5
2947 4
2949 0
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-104-72ff0e6da288> in <module>
11
12
---> 13 print(df_values.apply(lambda x: qcut_sub_index(x), axis=0))
~\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7546 kwds=kwds,
7547 )
-> 7548 return op.get_result()
7549
7550 def applymap(self, func) -> "DataFrame":
~\Anaconda3\lib\site-packages\pandas\core\apply.py in get_result(self)
178 return self.apply_raw()
179
--> 180 return self.apply_standard()
181
182 def apply_empty_result(self):
~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
272
273 # wrap results
--> 274 return self.wrap_results(results, res_index)
275
276 def apply_series_generator(self) -> Tuple[ResType, "Index"]:
~\Anaconda3\lib\site-packages\pandas\core\apply.py in wrap_results(self, results, res_index)
313 # see if we can infer the results
314 if len(results) > 0 and 0 in results and is_sequence(results[0]):
--> 315 return self.wrap_results_for_axis(results, res_index)
316
317 # dict of scalars
~\Anaconda3\lib\site-packages\pandas\core\apply.py in wrap_results_for_axis(self, results, res_index)
369
370 try:
--> 371 result = self.obj._constructor(data=results)
372 except ValueError as err:
373 if "arrays must all be same length" in str(err):
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
466
467 elif isinstance(data, dict):
--> 468 mgr = init_dict(data, index, columns, dtype=dtype)
469 elif isinstance(data, ma.MaskedArray):
470 import numpy.ma.mrecords as mrecords
~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in init_dict(data, index, columns, dtype)
281 arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
282 ]
--> 283 return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
284
285
~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype, verify_integrity)
76 # figure out the index, if necessary
77 if index is None:
---> 78 index = extract_index(arrays)
79 else:
80 index = ensure_index(index)
~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in extract_index(data)
385
386 if not indexes and not raw_lengths:
--> 387 raise ValueError("If using all scalar values, you must pass an index")
388
389 if have_series:
ValueError: If using all scalar values, you must pass an index
so something is preventing the second application of the lambda function.
I'd appreciate your help, thanks for takign a look.
p.s. if this can be done implcitly without using apply would love to hear. thanks

You solution appears over complicated. Your terminology is none standard, multi-indexes have levels. Stated as qcut() by level 0 of multi-index (not talking about sub-frames which are not pandas concepts)
Bring it all back together
use **kwargs approach to pass arguments to assign() for all columns in data frame
groupby(level=0) is as_of_date
transform() to get a row back for every entry in index
s = 12
df = pd.DataFrame({"as_at_date":np.random.choice(pd.date_range(dt.date(2020,1,27), periods=3, freq="M"), s),
"entity_id":np.random.randint(2900, 3500, s),
"A":np.random.random(s),
"B":np.random.random(s)*(10**np.random.randint(8,10,s))
}).sort_values(["as_at_date","entity_id"])
df = df.set_index(["as_at_date","entity_id"])
df2 = df.assign(**{c:df.groupby(level=0)[c].transform(lambda x: pd.qcut(x, 10, labels=False))
for c in df.columns})
df
A B
as_at_date entity_id
2020-01-31 2926 0.770121 2.883519e+07
2943 0.187747 1.167975e+08
2973 0.371721 3.133071e+07
3104 0.243347 4.497294e+08
3253 0.591022 7.796131e+08
3362 0.810001 6.438441e+08
2020-02-29 3185 0.690875 4.513044e+08
3304 0.311436 4.561929e+07
2020-03-31 2953 0.325846 7.770111e+08
2981 0.918461 7.594753e+08
3034 0.133053 6.767501e+08
3355 0.624519 6.318104e+07
df2
A B
as_at_date entity_id
2020-01-31 2926 7 0
2943 0 3
2973 3 1
3104 1 5
3253 5 9
3362 9 7
2020-02-29 3185 9 9
3304 0 0
2020-03-31 2953 3 9
2981 9 6
3034 0 3
3355 6 0

Using concat inside an iteration on the original dataframe does the trick but is there a smarter way to do this?
thanks
def qcut_sub_index(df_with_sub_index):
# create empty return value same shape as passed dataframe
df_return=pd.DataFrame()
for date, sub_df in df_with_sub_index.groupby(level=0):
df_return=df_return.append(pd.DataFrame(pd.qcut(sub_df, 10, labels=False,
duplicates='drop')))
return df_return
df_x=pd.DataFrame()
for (columnName, columnData) in df_values.iteritems():
df_x=pd.concat([df_x, qcut_sub_index(columnData)], axis=1, join="outer")
df_x

How to fix TypeError: data type not understood with a datetime object in Pandas

I am working with a date column in pandas. I have a date column. I want to have just the year and month as a separate column.
I achieved that by:
df1["month"] = pd.to_datetime(Table_A_df['date']).dt.to_period('M')
Printing it looks like this:
df1["month"]
Out:
0 2017-03
1 2017-03
2 2017-03
3 2017-03
4 2017-03
...
79638 2018-03
79639 2018-03
79640 2018-03
79641 2018-03
79642 2018-03
Name: month, Length: 79643, dtype: period[M]
My customer id looks like this:
0 5094298f068196c5349d43847de5afc9125cf989
1 NaN
2 NaN
3 433fdf385e33176cf9b0d67ecf383aa928fa261c
4 NaN
...
79638 6836d8cdd9c6c537c702b35ccd972fae58070004
79639 bbc08d8abad5e699823f2f0021762797941679be
79640 39b5fdd28cb956053d3e4f3f0b884fb95749da8a
79641 3342d5b210274b01e947cc15531ad53fbe25435b
79642 b3f02d0768c0ba8334047d106eb759f3e80517ac
Name: customer_id, Length: 79643, dtype: object
Now trying to groupby customer id and transform the data.
user_groups = df1.groupby("customer_id")["month"]
df1["Cohort_month"] = user_groups.transform("min")
I get the following error:
TypeError: data type not understood
Complete error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-108-107e17f9a489> in <module>
----> 1 df1["Cohort_month"] = user_groups.transform("min")
C:\Users\Public\Anaconda\lib\site-packages\pandas\core\groupby\generic.py in transform(self, func, *args, **kwargs)
475 # result to the whole group. Compute func result
476 # and deal with possible broadcasting below.
--> 477 result = getattr(self, func)(*args, **kwargs)
478 return self._transform_fast(result, func)
479
C:\Users\Public\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in f(self, **kwargs)
1375 # try a cython aggregation if we can
1376 try:
-> 1377 return self._cython_agg_general(alias, alt=npfunc, **kwargs)
1378 except DataError:
1379 pass
C:\Users\Public\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
887
888 result, agg_names = self.grouper.aggregate(
--> 889 obj._values, how, min_count=min_count
890 )
891
C:\Users\Public\Anaconda\lib\site-packages\pandas\core\groupby\ops.py in aggregate(self, values, how, axis, min_count)
568 ) -> Tuple[np.ndarray, Optional[List[str]]]:
569 return self._cython_operation(
--> 570 "aggregate", values, how, axis, min_count=min_count
571 )
572
C:\Users\Public\Anaconda\lib\site-packages\pandas\core\groupby\ops.py in _cython_operation(self, kind, values, how, axis, min_count, **kwargs)
560 result = type(orig_values)(result.astype(np.int64), dtype=orig_values.dtype)
561 elif is_datetimelike and kind == "aggregate":
--> 562 result = result.astype(orig_values.dtype)
563
564 return result, names
TypeError: data type not understood
This was working before when I had 1 as the day, but when I made it just year and month. I am getting an error. Is there a fix around this?

It's working for the sample you shared, not sure where the issue is, are there any missing values in your month column?
df['month'] = pd.to_datetime(df['month']).dt.to_period('M')
user_groups = df.groupby("customer_id")["month"]
df["Cohort_month"] = user_groups.transform("min")
print(df)
customer_id month Cohort_month
0 5094298f068196c5349d43847de5afc9125cf989 2017-03 2017-03
1 NaN 2017-03 NaT
2 NaN 2017-03 NaT
3 433fdf385e33176cf9b0d67ecf383aa928fa261c 2017-03 2017-03
4 NaN 2017-03 NaT

DateTimeIndex.to_period raises a ValueError exception for many offset aliases

I am trying to solve a very simple problem, but am running into a wall.
I have a DateTimeIndex based on a simple dataframe like follows:
df=pd.DataFrame(
index=pd.date_range(
start='2017-01-01',
end='2017-03-04', closed=None),
data=np.arange(63), columns=['val']).rename_axis(index='date')
In [179]: df
Out[179]:
val
date
2017-01-01 0
2017-01-02 1
2017-01-03 2
2017-01-04 3
2017-01-05 4
... ...
2017-02-28 58
2017-03-01 59
2017-03-02 60
2017-03-03 61
2017-03-04 62
[63 rows x 1 columns]
I wish to summarize the values by periods of weekly, semi-monthly, monthly etc.
So I tried:
In [180]: df.to_period('W').groupby('date').sum()
Out[180]:
val
date
2016-12-26/2017-01-01 0
2017-01-02/2017-01-08 28
2017-01-09/2017-01-15 77
2017-01-16/2017-01-22 126
2017-01-23/2017-01-29 175
2017-01-30/2017-02-05 224
2017-02-06/2017-02-12 273
2017-02-13/2017-02-19 322
2017-02-20/2017-02-26 371
2017-02-27/2017-03-05 357
That works fine for offset aliases like Y, M, D, W, T, S, L, U, N.
But fails for SM, SMS and others listed here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
It raises a ValueError exception:
In [181]: df.to_period('SMS').groupby('date').sum()
--------------------------------------------------------------------------- KeyError Traceback (most recent call
last) pandas/_libs/tslibs/frequencies.pyx in
pandas._libs.tslibs.frequencies._period_str_to_code()
KeyError: 'SMS-15'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call
last) <ipython-input-181-6779559a0596> in <module>
----> 1 df.to_period('SMS').groupby('date').sum()
~/.virtualenvs/py36/lib/python3.6/site-packages/pandas/core/frame.py
in to_period(self, freq, axis, copy) 8350 axis =
self._get_axis_number(axis) 8351 if axis == 0:
-> 8352 new_data.set_axis(1, self.index.to_period(freq=freq)) 8353 elif axis == 1:
8354 new_data.set_axis(0,
self.columns.to_period(freq=freq))
~/.virtualenvs/py36/lib/python3.6/site-packages/pandas/core/accessor.py
in f(self, *args, **kwargs)
91 def _create_delegator_method(name):
92 def f(self, *args, **kwargs):
---> 93 return self._delegate_method(name, *args, **kwargs)
94
95 f.__name__ = name
~/.virtualenvs/py36/lib/python3.6/site-packages/pandas/core/indexes/datetimelike.py
in _delegate_method(self, name, *args, **kwargs)
811
812 def _delegate_method(self, name, *args, **kwargs):
--> 813 result = operator.methodcaller(name, *args, **kwargs)(self._data)
814 if name not in self._raw_methods:
815 result = Index(result, name=self.name)
~/.virtualenvs/py36/lib/python3.6/site-packages/pandas/core/arrays/datetimes.py
in to_period(self, freq) 1280 freq =
get_period_alias(freq) 1281
-> 1282 return PeriodArray._from_datetime64(self._data, freq, tz=self.tz) 1283 1284 def to_perioddelta(self, freq):
~/.virtualenvs/py36/lib/python3.6/site-packages/pandas/core/arrays/period.py
in _from_datetime64(cls, data, freq, tz)
273 PeriodArray[freq]
274 """
--> 275 data, freq = dt64arr_to_periodarr(data, freq, tz)
276 return cls(data, freq=freq)
277
~/.virtualenvs/py36/lib/python3.6/site-packages/pandas/core/arrays/period.py
in dt64arr_to_periodarr(data, freq, tz)
914 data = data._values
915
--> 916 base, mult = libfrequencies.get_freq_code(freq)
917 return libperiod.dt64arr_to_periodarr(data.view("i8"), base, tz), freq
918
pandas/_libs/tslibs/frequencies.pyx in
pandas._libs.tslibs.frequencies.get_freq_code()
pandas/_libs/tslibs/frequencies.pyx in
pandas._libs.tslibs.frequencies.get_freq_code()
pandas/_libs/tslibs/frequencies.pyx in
pandas._libs.tslibs.frequencies.get_freq_code()
pandas/_libs/tslibs/frequencies.pyx in
pandas._libs.tslibs.frequencies._period_str_to_code()
ValueError: Invalid frequency: SMS-15
I am using python 3.6.5, pandas version '0.25.1'

Use DataFrame.resample here:
print (df.resample('W').sum())
val
date
2017-01-01 0
2017-01-08 28
2017-01-15 77
2017-01-22 126
2017-01-29 175
2017-02-05 224
2017-02-12 273
2017-02-19 322
2017-02-26 371
2017-03-05 357
print (df.resample('SM').sum())
val
date
2016-12-31 91
2017-01-15 344
2017-01-31 555
2017-02-15 663
2017-02-28 300
print (df.resample('SMS').sum())
val
date
2017-01-01 91
2017-01-15 374
2017-02-01 525
2017-02-15 721
2017-03-01 242
Alternatives with groupby and Grouper:
print (df.groupby(pd.Grouper(freq='W')).sum())
print (df.groupby(pd.Grouper(freq='SM')).sum())
print (df.groupby(pd.Grouper(freq='SMS')).sum())

Why are some values in the pandas dataframe becoming integers from strings after I add a column to the dataframe manually?

I have made a pandas dataframe from a CSV file like so:
import pandas as pd
data = pd.read_csv('dataset.csv')
In it, there's a column called CLASS. These are the values contained in CLASS:
from collections import Counter
Counter(CLASS)
Out [1]: Counter({'1': 60783, '2': 37313, '3': 2564, '4': 959, ' ': 346, 'D': 27})
Now, I add a column to the dataframe manually, and save it in a new csv:
data['DURATION'] = DURATION
data.to_csv('new_dataset.csv')
Then, when I open the new CSV and check the values in CLASS, some of them have become integers!
dataset = pd.read_csv('new_dataset.csv')
CLASS = dataset['OCCUP_CLASS']
Counter(CLASS)
Out [1]: Counter({' ': 346,
1: 48418,
'1': 12365,
2: 16189,
'2': 21124,
3: 848,
'3': 1716,
4: 81,
'4': 878,
'D': 43})
Why is this happening? This creates problems as I cannot plot or make histograms of CLASS anymore, while before I was able to do so:
import matplotlib.pyplot as plt
plt.plot(CLASS)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-158-b6bafcfd7ad5> in <module>()
1 import matplotlib.pyplot as plt
----> 2 plt.plot(CLASS)
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\pyplot.py in plot(*args, **kwargs)
3356 mplDeprecation)
3357 try:
-> 3358 ret = ax.plot(*args, **kwargs)
3359 finally:
3360 ax._hold = washold
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\__init__.py in inner(ax, *args, **kwargs)
1853 "the Matplotlib list!)" % (label_namer, func.__name__),
1854 RuntimeWarning, stacklevel=2)
-> 1855 return func(ax, *args, **kwargs)
1856
1857 inner.__doc__ = _add_data_doc(inner.__doc__,
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\axes\_axes.py in plot(self, *args, **kwargs)
1526
1527 for line in self._get_lines(*args, **kwargs):
-> 1528 self.add_line(line)
1529 lines.append(line)
1530
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\axes\_base.py in add_line(self, line)
1930 line.set_clip_path(self.patch)
1931
-> 1932 self._update_line_limits(line)
1933 if not line.get_label():
1934 line.set_label('_line%d' % len(self.lines))
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\axes\_base.py in _update_line_limits(self, line)
1952 Figures out the data limit of the given line, updating self.dataLim.
1953 """
-> 1954 path = line.get_path()
1955 if path.vertices.size == 0:
1956 return
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\lines.py in get_path(self)
949 """
950 if self._invalidy or self._invalidx:
--> 951 self.recache()
952 return self._path
953
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\lines.py in recache(self, always)
655 if always or self._invalidy:
656 yconv = self.convert_yunits(self._yorig)
--> 657 y = _to_unmasked_float_array(yconv).ravel()
658 else:
659 y = self._y
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\cbook\__init__.py in _to_unmasked_float_array(x)
2048 return np.ma.asarray(x, float).filled(np.nan)
2049 else:
-> 2050 return np.asarray(x, float)
2051
2052
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
490
491 """
--> 492 return array(a, dtype, copy=False, order=order)
493
494
ValueError: could not convert string to float:
EDIT: Adding the first 20 rows of the 2 relevant columns from the dataset:
DURATION CLASS
10 1
14 1
-1 1
-1 1
0 1
-1 1
14 2
8 2
-1 1
14 3
-1 3
-1
-1 4
-1 4
-1 3
8 1
-1 2
-1 2
-1 1
EDIT 2: Output of print(dataset['CLASS'].value_counts()):
import pandas as pd
dataset = pd.read_csv('dataset.csv', dtype={'CLASS': str})
print(dataset['CLASS'].value_counts())
1 48418
2 21124
2 16189
1 12365
3 1716
4 878
3 848
346
4 81
D 43
Name: CLASS, dtype: int64
EDIT 3: Plotting is not a problem for blank elements, as shown with the following plot with the original data, where the blank point on x-axis is highlighted:

Pandas tries to detect the data type of a column, but sometimes fails as you noticed. You can force the data type of a column in read_csv like this:
dataset = pd.read_csv('new_dataset.csv', dtype={'CLASS': str})

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Resampling of categorical column in pandas data frame - python

Related

Pandas pivot_table Assertion error: `result` has not been initialized

Python Pandas apply qcut to grouped by level 0 of multi-index in multi-index dataframe

How to fix TypeError: data type not understood with a datetime object in Pandas

DateTimeIndex.to_period raises a ValueError exception for many offset aliases

Why are some values in the pandas dataframe becoming integers from strings after I add a column to the dataframe manually?

Categories

Resources