I have a pandas Series filled with decimal numbers of type Decimal. I'd like to use the new pandas 0.8 resample function on the decimal time series like this:
resampled = ts.resample('D', how = 'mean')
When I try this I get a "GroupByError: No numeric types to aggregate" error. I assume the problem is that np.mean is used internally to resample the values and np.mean expects floats instead of Decimals.
Thanks to the help of this forum I managed to solve a similar question using groupby and the apply function, but I would love to also use the cool resample function.
How can I use the mean method on a pandas TimeSeries with Decimal type values?
Any idea how to solve this?
Here is the complete IPython session that produces the error:
In [37]: from decimal import Decimal
In [38]: from pandas import *
In [39]: rng = date_range('1.1.2012',periods=48, freq='H')
In [40]: rnd = np.random.randn(len(rng))
In [41]: rnd_dec = [Decimal(x) for x in rnd]
In [42]: ts = Series(rnd_dec, index=rng)
In [43]: ts[0:3]
Out[43]:
2012-01-01 00:00:00 -0.1020591335576267189022559023214853368699550628
2012-01-01 01:00:00 0.99245713975437366283216533702216111123561859130
2012-01-01 02:00:00 1.80080710727195758558139004890108481049537658691
Freq: H
In [44]: type(ts[0])
Out[44]: decimal.Decimal
In [45]: ts.resample('D', how = 'mean')
---------------------------------------------------------------------------
GroupByError Traceback (most recent call last)
C:\Users\THM\Documents\Python\<ipython-input-45-09c898403ddd> in <module>()
----> 1 ts.resample('D', how = 'mean')
C:\Python27\lib\site-packages\pandas\core\generic.pyc in resample(self, rule, how, axis, fill_method, closed, label, convention, kind, loffset, limit, base)
187 fill_method=fill_method, convention=convention,
188 limit=limit, base=base)
--> 189 return sampler.resample(self)
190
191 def first(self, offset):
C:\Python27\lib\site-packages\pandas\tseries\resample.pyc in resample(self, obj)
65
66 if isinstance(axis, DatetimeIndex):
---> 67 rs = self._resample_timestamps(obj)
68 elif isinstance(axis, PeriodIndex):
69 offset = to_offset(self.freq)
C:\Python27\lib\site-packages\pandas\tseries\resample.pyc in _resample_timestamps(self, obj)
184 if len(grouper.binlabels) < len(axlabels) or self.how is not None:
185 grouped = obj.groupby(grouper, axis=self.axis)
--> 186 result = grouped.aggregate(self._agg_method)
187 else:
188 # upsampling shortcut
C:\Python27\lib\site-packages\pandas\core\groupby.pyc in aggregate(self, func_or_funcs, *args, **kwargs)
1215 """
1216 if isinstance(func_or_funcs, basestring):
-> 1217 return getattr(self, func_or_funcs)(*args, **kwargs)
1218
1219 if hasattr(func_or_funcs,'__iter__'):
C:\Python27\lib\site-packages\pandas\core\groupby.pyc in mean(self)
290 """
291 try:
--> 292 return self._cython_agg_general('mean')
293 except GroupByError:
294 raise
C:\Python27\lib\site-packages\pandas\core\groupby.pyc in _cython_agg_general(self, how)
376
377 if len(output) == 0:
--> 378 raise GroupByError('No numeric types to aggregate')
379
380 return self._wrap_aggregated_output(output, names)
GroupByError: No numeric types to aggregate
Any help is appreciated.
Thanks,
Thomas
I found the answer myself. It is possible to pass a function to the 'how' argument of resample:
f = lambda x: Decimal(np.mean(x))
ts.resample('D', how = f)
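On current pandas versions the `how` argument no longer exists; as far as I know, the equivalent is to call `.apply` on the resampler. Below is a minimal sketch under that assumption. The data is made up and `decimal_mean` is a hypothetical helper that keeps the arithmetic in Decimal instead of going through np.mean:

```python
from decimal import Decimal

import pandas as pd


def decimal_mean(values):
    """Average Decimal values without converting them to float."""
    if len(values) == 0:
        # Empty bin: return a Decimal NaN instead of dividing by zero.
        return Decimal("NaN")
    return sum(values, Decimal(0)) / len(values)


# 48 hourly observations: 0.0, 0.1, ..., 4.7 (illustrative data only).
rng = pd.date_range("2012-01-01", periods=48, freq=pd.Timedelta(hours=1))
ts = pd.Series([Decimal(i) / Decimal(10) for i in range(48)], index=rng)

# One aggregated value per calendar day, computed by decimal_mean.
daily = ts.resample("D").apply(decimal_mean)
print(daily)
```

The helper sums with a Decimal start value, so no float ever enters the calculation; whether that matters depends on how much precision you need.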
I get the same error for object-dtype columns in a DataFrame. I got around it by using
df.resample('D', fill_method='ffill', how=lambda c: c[-1])
Hello,
I'm working on a column that has missing values ('year_of_release'). The data type is 'timestamp64'.
First, I wrote a function that extracts the year from the 'name' column, where a year sometimes appears next to the game's name, and stored the result in a new column, 'years_from_titles':
import re

def get_year(row):
    # Return the first four-digit number that looks like a release year.
    regex = r"\d{4}"
    match = re.findall(regex, row)
    for i in match:
        if 1970 < int(i) < 2017:
            return int(i)

gaming['years_from_titles'] = gaming['name'].apply(lambda x: get_year(str(x)))
I tested the function and it works.
Now, I'm trying to create another function, which will fill in those missing years of the original column - 'year_of_release', but only if they appear on the same row:
def year_row(row):
    if math.isnan(row['year_of_release']):
        return row['years_from_titles']
    else:
        return row['year_of_release']

gaming['year_of_release'] = gaming.apply(year_row, axis=1)
But when I run the code I get a TypeError:
/tmp/ipykernel_31/133192424.py in <module>
7 return row['year_of_release']
8
----> 9 gaming['year_of_release']=gaming.apply(year_row,axis=1)
/opt/conda/lib/python3.9/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7766 kwds=kwds,
7767 )
-> 7768 return op.get_result()
7769
7770 def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:
/opt/conda/lib/python3.9/site-packages/pandas/core/apply.py in get_result(self)
183 return self.apply_raw()
184
--> 185 return self.apply_standard()
186
187 def apply_empty_result(self):
/opt/conda/lib/python3.9/site-packages/pandas/core/apply.py in apply_standard(self)
274
275 def apply_standard(self):
--> 276 results, res_index = self.apply_series_generator()
277
278 # wrap results
/opt/conda/lib/python3.9/site-packages/pandas/core/apply.py in apply_series_generator(self)
288 for i, v in enumerate(series_gen):
289 # ignore SettingWithCopy here in case the user mutates
--> 290 results[i] = self.f(v)
291 if isinstance(results[i], ABCSeries):
292 # If we have a view on v, we need to make a copy because
/tmp/ipykernel_31/133192424.py in year_row(row)
2 # but only if a year is found, on the same row, and in correspond to years_from_titles column.
3 def year_row(row):
----> 4 if math.isnan(row['year_of_release']):
5 return row['years_from_titles']
6 else:
TypeError: must be real number, not Timestamp.
If anyone knows how to overcome this I would greatly appreciate it.
Thanks
You can use the fact that NaN is not equal to itself.
def year_row(row):
    if row['year_of_release'] != row['year_of_release']:
        return row['years_from_titles']
    else:
        return row['year_of_release']

gaming['year_of_release'] = gaming.apply(year_row, axis=1)
Or with Series.mask
gaming['year_of_release'] = gaming['year_of_release'].mask(gaming['year_of_release'].isna(), gaming['years_from_titles'])
Or with Series.fillna
gaming['year_of_release'] = gaming['year_of_release'].fillna(gaming['years_from_titles'])
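To see that the three variants agree, here is a small sketch on a made-up stand-in for the `gaming` frame. The column names come from the question; the rows are invented, and the years are plain floats rather than timestamps to keep the example short:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one known year, one recoverable from the title, one unknown.
gaming = pd.DataFrame(
    {
        "name": ["Game A", "Game B 2011", "Game C"],
        "year_of_release": [2006.0, np.nan, np.nan],
        "years_from_titles": [np.nan, 2011.0, np.nan],
    }
)


def year_row(row):
    # NaN is the only value that is not equal to itself.
    if row["year_of_release"] != row["year_of_release"]:
        return row["years_from_titles"]
    return row["year_of_release"]


by_apply = gaming.apply(year_row, axis=1)
by_mask = gaming["year_of_release"].mask(
    gaming["year_of_release"].isna(), gaming["years_from_titles"]
)
by_fillna = gaming["year_of_release"].fillna(gaming["years_from_titles"])

print(pd.DataFrame({"apply": by_apply, "mask": by_mask, "fillna": by_fillna}))
```

The vectorised `mask` and `fillna` forms avoid a Python-level loop over rows, which usually matters on larger frames.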
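Instead of using the math module to check for missing values, here's a more pandas-specific approach. A single Timestamp value has no .isna() method, so use the top-level pd.isna function, which accepts scalars.
Change this line:
if math.isnan(row['year_of_release']):
to this:
if pd.isna(row['year_of_release']):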
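As a quick illustration of the scalar check (the sample values are made up), `pd.isna` accepts timestamps, `NaT`, and floats alike, while `math.isnan` rejects a Timestamp with the TypeError from the question:

```python
import math

import numpy as np
import pandas as pd

samples = [pd.Timestamp("2010-01-01"), pd.NaT, np.nan, 2010.0]

# pd.isna works on every one of these scalars.
flags = [bool(pd.isna(value)) for value in samples]
print(flags)  # [False, True, True, False]

# math.isnan only accepts real numbers, so a Timestamp raises TypeError.
try:
    math.isnan(samples[0])
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```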
I have a pandas dataframe like this:
User-Id Training-Id TrainingTaken
0 4327024 25 10
1 6662572 3 10
2 3757520 26 10
and I need to convert it to a Matrix like they do here:
https://github.com/tr1ten/Anime-Recommender-System/blob/main/HybridRecommenderSystem.ipynb
Cell 13.
So I did the following:
from lightfm import LightFM
from lightfm.evaluation import precision_at_k
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_profiling
from scipy.sparse import csr_matrix
from lightfm.evaluation import auc_score
from lightfm.data import Dataset
user_training_interaction = pd.pivot_table(trainingtaken, index='User-Id', columns='Training-Id', values='TrainingTaken')
user_training_interaction.fillna(0,inplace=True)
user_training_csr = csr_matrix(user_training_interaction.values)
But I get this error:
---------------------------------------------------------------------------
DataError Traceback (most recent call last)
<ipython-input-96-5a2c7ba28976> in <module>
10 from lightfm.data import Dataset
11
---> 12 user_training_interaction = pd.pivot_table(trainingtaken, index='User-Id', columns='Training-Id', values='TrainingTaken')
13 user_training_interaction.fillna(0,inplace=True)
14 user_training_csr = csr_matrix(user_training_interaction.values)
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/pandas/core/reshape/pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name, observed)
110
111 grouped = data.groupby(keys, observed=observed)
--> 112 agged = grouped.agg(aggfunc)
113 if dropna and isinstance(agged, ABCDataFrame) and len(agged.columns):
114 agged = agged.dropna(how="all")
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
949 func = maybe_mangle_lambdas(func)
950
--> 951 result, how = self._aggregate(func, *args, **kwargs)
952 if how is None:
953 return result
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
305
306 if isinstance(arg, str):
--> 307 return self._try_aggregate_string_function(arg, *args, **kwargs), None
308
309 if isinstance(arg, dict):
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/pandas/core/base.py in _try_aggregate_string_function(self, arg, *args, **kwargs)
261 if f is not None:
262 if callable(f):
--> 263 return f(*args, **kwargs)
264
265 # people may try to aggregate on a non-callable attribute
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/pandas/core/groupby/groupby.py in mean(self, numeric_only)
1396 "mean",
1397 alt=lambda x, axis: Series(x).mean(numeric_only=numeric_only),
-> 1398 numeric_only=numeric_only,
1399 )
1400
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
1020 ) -> DataFrame:
1021 agg_blocks, agg_items = self._cython_agg_blocks(
-> 1022 how, alt=alt, numeric_only=numeric_only, min_count=min_count
1023 )
1024 return self._wrap_agged_blocks(agg_blocks, items=agg_items)
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
1128
1129 if not (agg_blocks or split_frames):
-> 1130 raise DataError("No numeric types to aggregate")
1131
1132 if split_items:
DataError: No numeric types to aggregate
What am I missing?
The Pandas Documentation states:
While pivot() provides general purpose pivoting with various data
types (strings, numerics, etc.), pandas also provides pivot_table()
for pivoting with aggregation of numeric data
Make sure the column is numeric. Without seeing how you create trainingtaken I can't provide more specific guidance. However the following may help:
Make sure you handle "empty" values in that column. The Pandas guide is a very good place to start. Pandas points out that "a column of integers with even one missing values is cast to floating-point dtype".
If you are working with a DataFrame, a column can be cast to a specific type via your_df.your_col.astype(int), or for your example, trainingtaken['TrainingTaken'].astype(int)
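A minimal sketch of that conversion follows. It assumes the 'TrainingTaken' column was read in as text, which is only a guess because the loading code is not shown; the three rows are copied from the question:

```python
import pandas as pd

# Assumption: the values arrived as strings, so the column has object dtype.
trainingtaken = pd.DataFrame(
    {
        "User-Id": [4327024, 6662572, 3757520],
        "Training-Id": [25, 3, 26],
        "TrainingTaken": ["10", "10", "10"],
    }
)
print(trainingtaken.dtypes)

# Convert to numbers; anything that cannot be parsed becomes NaN.
trainingtaken["TrainingTaken"] = pd.to_numeric(
    trainingtaken["TrainingTaken"], errors="coerce"
)

# With a numeric column the default 'mean' aggregation has something to work on.
user_training_interaction = pd.pivot_table(
    trainingtaken,
    index="User-Id",
    columns="Training-Id",
    values="TrainingTaken",
    fill_value=0,
)
print(user_training_interaction)
```

Checking `trainingtaken.dtypes` first is the quickest way to confirm whether the column really is non-numeric before changing anything.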
The code worked just fine but now it gives me this error after these lines:
end = dt.datetime.now()
start = dt.date(end.year - 3, end.month, end.day)
prices = reader.get_data_yahoo(tickers,start,end)['Adj Close']
I tried upgrading the packages, but it didn't help. The code no longer works even for data that I had previously downloaded and analysed with it.
ValueError Traceback (most recent call last)
Input In [6], in <cell line: 3>()
1 end = dt.datetime.now()
2 start = dt.date(end.year - 3, end.month, end.day)
----> 3 prices = reader.get_data_yahoo(tickers,start,end)['Adj Close']
File C:\Python310\lib\site-packages\pandas_datareader\data.py:80, in get_data_yahoo(*args, **kwargs)
79 def get_data_yahoo(*args, **kwargs):
---> 80 return YahooDailyReader(*args, **kwargs).read()
File C:\Python310\lib\site-packages\pandas_datareader\base.py:256, in _DailyBaseReader.read(self)
254 # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
255 elif isinstance(self.symbols, DataFrame):
--> 256 df = self._dl_mult_symbols(self.symbols.index)
257 else:
258 df = self._dl_mult_symbols(self.symbols)
File C:\Python310\lib\site-packages\pandas_datareader\base.py:285, in _DailyBaseReader._dl_mult_symbols(self, symbols)
283 stocks[sym] = df_na
284 if PANDAS_0230:
--> 285 result = concat(stocks, sort=True).unstack(level=0)
286 else:
287 result = concat(stocks).unstack(level=0)
File C:\Python310\lib\site-packages\pandas\core\frame.py:8413, in DataFrame.unstack(self, level, fill_value)
8351 """
8352 Pivot a level of the (necessarily hierarchical) index labels.
8353
(...)
8409 dtype: float64
8410 """
8411 from pandas.core.reshape.reshape import unstack
-> 8413 result = unstack(self, level, fill_value)
8415 return result.__finalize__(self, method="unstack")
File C:\Python310\lib\site-packages\pandas\core\reshape\reshape.py:478, in unstack(obj, level, fill_value)
476 if isinstance(obj, DataFrame):
477 if isinstance(obj.index, MultiIndex):
--> 478 return _unstack_frame(obj, level, fill_value=fill_value)
479 else:
480 return obj.T.stack(dropna=False)
File C:\Python310\lib\site-packages\pandas\core\reshape\reshape.py:501, in _unstack_frame(obj, level, fill_value)
499 def _unstack_frame(obj, level, fill_value=None):
500 if not obj._can_fast_transpose:
--> 501 unstacker = _Unstacker(obj.index, level=level)
502 mgr = obj._mgr.unstack(unstacker, fill_value=fill_value)
503 return obj._constructor(mgr)
File C:\Python310\lib\site-packages\pandas\core\reshape\reshape.py:140, in _Unstacker.__init__(self, index, level, constructor)
133 if num_cells > np.iinfo(np.int32).max:
134 warnings.warn(
135 f"The following operation may generate {num_cells} cells "
136 f"in the resulting pandas object.",
137 PerformanceWarning,
138 )
--> 140 self._make_selectors()
File C:\Python310\lib\site-packages\pandas\core\reshape\reshape.py:192, in _Unstacker._make_selectors(self)
189 mask.put(selector, True)
191 if mask.sum() < len(self.index):
--> 192 raise ValueError("Index contains duplicate entries, cannot reshape")
194 self.group_index = comp_index
195 self.mask = mask
ValueError: Index contains duplicate entries, cannot reshape
I know it can be frustrating, but for the moment you have to read each ticker individually. The API has probably been broken since the latest versions of pandas:
tickers = ['AAPL', 'MSFT']
end = dt.datetime.now()
start = dt.date(end.year - 3, end.month, end.day)
data = {}
for ticker in tickers:
data[ticker] = reader.get_data_yahoo(ticker, start, end)['Adj Close']
prices = pd.concat(data, axis=1)
Output:
>>> prices
AAPL MSFT
Date
2019-03-11 43.548748 109.345795
2019-03-12 44.038033 110.111404
2019-03-13 44.232773 110.964211
2019-03-14 44.724491 111.051437
2019-03-15 45.306278 112.330688
... ... ...
2022-03-07 159.300003 278.910004
2022-03-08 157.440002 275.850006
2022-03-09 162.949997 288.500000
2022-03-10 158.520004 285.589996
2022-03-10 158.520004 285.589996
[759 rows x 2 columns]
The error occurs when the query is made on a Saturday or Sunday, since Yahoo Finance then returns Friday's data twice.
You can check this by looking at the historical data on Yahoo Finance itself.
For a single stock, it can be solved with:
data = data[~data.index.duplicated(keep='last')]
When downloading data for a list of stocks, the proposed solution is to iterate over the list and then concatenate the series to build the DataFrame.
You can then use the line above to remove the duplicated index entries.
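The two steps can be combined as in the sketch below. It needs no network access: the `raw` dict is a hypothetical stand-in for the per-ticker downloads, using the last rows of the output shown earlier, including the repeated 2022-03-10 entry:

```python
import pandas as pd

dates = pd.to_datetime(["2022-03-09", "2022-03-10", "2022-03-10"])

# Stand-in for the series returned per ticker; the last date is repeated.
raw = {
    "AAPL": pd.Series([162.949997, 158.520004, 158.520004], index=dates),
    "MSFT": pd.Series([288.500000, 285.589996, 285.589996], index=dates),
}

# Drop repeated dates in each series, keeping the last occurrence.
cleaned = {
    ticker: series[~series.index.duplicated(keep="last")]
    for ticker, series in raw.items()
}

# Each dict key becomes a column of the combined frame.
prices = pd.concat(cleaned, axis=1)
print(prices)
```

Deduplicating before concatenating keeps every series on a unique index, so the later reshaping step has nothing to complain about.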
I am trying to create a function that returns either the mean, median, or standard deviation of all columns in a Pandas DataFrame using NumPy functions.
It is for a school assignment, so there's no reason for using NumPy other than it is what is being asked of me. I am struggling to figure out how to use a NumPy function with a Pandas DataFrame for this problem.
Here is the text of the problem.
The code cell below contains a function called comp_sample_stat that accepts 2 parameters "df" which contains data from the dow jones for a particular company, and stat which will contain 1 of the 3 strings: "mean", "std", or "median".
For this problem:
if the stat is equal to "mean" return the mean of the dataframe columns using numpy's mean function
if the stat is equal to "median" return the median of the dataframe columns using numpy's median function
if the stat is equal to "std" return the std of the dataframe columns using numpy's std function
Here is the function I have written.
def comp_sample_stat(df, stat='mean'):
    '''
    Computes a sample statistic for any dataframe passed in

    Parameters
    ----------
    df: Pandas dataframe

    Returns
    -------
    a pandas dataframe
    '''
    df_mean = df.apply(np.mean(df))
    df_median = df.apply(np.median(df))
    df_std = df.apply(np.std(df))

    if stat is str('std'):
        return df_std
    elif stat is str('median'):
        return df_median
    else:
        return df_mean
df is a DataFrame that has been defined previously in my assignment as follows:
def read_data(file_path):
    '''
    Reads in a dataset using pandas.

    Parameters
    ----------
    file_path : string containing path to a file

    Returns
    -------
    pandas dataframe with data read in from the file path
    '''
    read_file = pd.read_csv(file_path)
    new_df = pd.DataFrame(read_file)
    return new_df

df = read_data('data/dow_jones_index.data')
The variable df_AA has also been previously defined as follows:
def select_stock(df, symbol):
    '''
    Selects data only containing a particular stock symbol.

    Parameters
    ----------
    df: dataframe containing data from the dow jones index
    stock: string containing the stock symbol to select

    Returns
    -------
    dataframe containing a particular stock
    '''
    stock = df[df.stock == symbol]
    return stock

df_AA = select_stock(df.copy(), 'AA')
When I call the function within a Jupyter Notebook as follows:
comp_sample_stat(df_AA)
I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-a2bcbeedcc56> in <module>()
22 return df_mean
23
---> 24 comp_sample_stat(df_AA)
<ipython-input-17-a2bcbeedcc56> in comp_sample_stat(df, stat)
11 a pandas dataframe
12 '''
---> 13 df_mean = df.apply(np.mean(df))
14 df_median = df.apply(np.median(df))
15 df_std = df.apply(np.std(df))
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in
apply(self, func, axis, broadcast, raw, reduce, result_type, args,
**kwds)
6012 args=args,
6013 kwds=kwds)
-> 6014 return op.get_result()
6015
6016 def applymap(self, func):
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in
get_result(self)
316 *self.args, **self.kwds)
317
--> 318 return super(FrameRowApply, self).get_result()
319
320 def apply_broadcast(self):
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in
get_result(self)
140 return self.apply_raw()
141
--> 142 return self.apply_standard()
143
144 def apply_empty_result(self):
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in
apply_standard(self)
246
247 # compute the result using the series generator
--> 248 self.apply_series_generator()
249
250 # wrap results
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in
apply_series_generator(self)
275 try:
276 for i, v in enumerate(series_gen):
--> 277 results[i] = self.f(v)
278 keys.append(v.name)
279 except Exception as e:
TypeError: ("'Series' object is not callable", 'occurred at index quarter')
DataFrame.apply expects you to pass it a function, not a dataframe. So you should be passing np.mean without arguments.
That is, you should be doing something like this:
df_mean = df.apply(np.mean)
See the pandas documentation for DataFrame.apply.
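Putting that together, one possible shape for the whole function is sketched below. The lookup table `STAT_FUNCS`, the restriction to numeric columns, and the demo frame are my own assumptions, not part of the assignment text:

```python
import numpy as np
import pandas as pd

# Map each accepted string to the NumPy function itself (not to its result).
STAT_FUNCS = {"mean": np.mean, "median": np.median, "std": np.std}


def comp_sample_stat(df, stat="mean"):
    """Apply the chosen NumPy statistic to every numeric column of df."""
    func = STAT_FUNCS[stat]
    # Text columns such as the stock symbol cannot be averaged, so skip them.
    numeric = df.select_dtypes(include="number")
    # Hand each column to NumPy as a plain array.
    return numeric.apply(lambda col: func(col.to_numpy()))


# Made-up data standing in for df_AA.
df_demo = pd.DataFrame(
    {
        "stock": ["AA", "AA", "AA", "AA"],
        "open": [1.0, 2.0, 3.0, 4.0],
        "close": [10.0, 20.0, 30.0, 40.0],
    }
)

print(comp_sample_stat(df_demo, "mean"))
print(comp_sample_stat(df_demo, "median"))
print(comp_sample_stat(df_demo, "std"))
```

Looking the function up in a dict also sidesteps the `stat is str('std')` comparisons: `is` tests object identity, and string equality should be checked with `==`. Note that `np.std` computes the population standard deviation (ddof=0) by default.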
I have a pandas Panel with a non-unique major_axis and I am trying to sum the non-unique rows using groupby, but I get an error saying that the major_axis is not iterable. I have searched Stack Overflow and the message board, but it seems that the Panel is not as widely used as the DataFrame.
Here is an example that produces the error:
import pandas as pd
import datetime as dt
import dateutil.relativedelta as rd
import numpy as np
items = ['A','B']
minor_axis = ['x','y']
diff = rd.relativedelta(years=1)
major_axis = [dt.date(2013,1,1) + (diff * shift) for shift in xrange(4)] * 2
values = np.random.randn(2,8,2)
data = pd.Panel(data=values, major_axis=major_axis, minor_axis=minor_axis, items=items)
data.groupby(sum, axis='major')
and here is the stacktrace:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-29-e30fb9b32fce> in <module>()
----> 1 data.groupby(sum, axis='major')
/home/brendan/python_dev/venv/local/lib/python2.7/site-packages/pandas/core/panel.pyc in groupby(self, function, axis)
1084 from pandas.core.groupby import PanelGroupBy
1085 axis = self._get_axis_number(axis)
-> 1086 return PanelGroupBy(self, function, axis=axis)
1087
1088 def swapaxes(self, axis1='major', axis2='minor', copy=True):
/home/brendan/python_dev/venv/local/lib/python2.7/site-packages/pandas/core/groupby.pyc in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze)
195 if grouper is None:
196 grouper, exclusions = _get_grouper(obj, keys, axis=axis,
--> 197 level=level, sort=sort)
198
199 self.grouper = grouper
/home/brendan/python_dev/venv/local/lib/python2.7/site-packages/pandas/core/groupby.pyc in _get_grouper(obj, key, axis, level, sort)
1323 raise AssertionError(errmsg)
1324
-> 1325 ping = Grouping(group_axis, gpr, name=name, level=level, sort=sort)
1326 groupings.append(ping)
1327
/home/brendan/python_dev/venv/local/lib/python2.7/site-packages/pandas/core/groupby.pyc in __init__(self, index, grouper, name, level, sort)
1197 # no level passed
1198 if not isinstance(self.grouper, np.ndarray):
-> 1199 self.grouper = self.index.map(self.grouper)
1200 if not (hasattr(self.grouper,"__len__") and \
1201 len(self.grouper) == len(self.index)):
/home/brendan/python_dev/venv/local/lib/python2.7/site-packages/pandas/core/index.pyc in map(self, mapper)
856
857 def map(self, mapper):
--> 858 return self._arrmap(self.values, mapper)
859
860 def isin(self, values):
/home/brendan/python_dev/venv/local/lib/python2.7/site-packages/pandas/algos.so in pandas.algos.arrmap_object (pandas/algos.c:62269)()
TypeError: 'datetime.date' object is not iterable
Any ideas about how to handle this situation?
Many thanks,
Brendan
In 0.12 you can try
>>> data.groupby(np.sum, axis='major')
<pandas.core.groupby.PanelGroupBy object at 0x1a2ba50>
@alko's answer is indeed the solution to your question, although I think you misunderstand groupby. You still need to apply a function or aggregation to the result of the groupby() call; in your case, to sum all items in each group, use data.groupby(..).sum().
But I would recommend considering whether you need a Panel at all. Of course I don't know your use case, but in many cases a MultiIndex can solve the problem.
Your data and groupby would then look like the following:
>>> items = ['A', 'A', 'B', 'B']
>>> minor_axis = ['x','y', 'x', 'y']
>>> diff = rd.relativedelta(years=1)
>>> major_axis = [dt.date(2013,1,1) + (diff * shift) for shift in xrange(4)] * 2
>>> values = np.random.randn(8,4)
>>>
>>> data = pd.DataFrame(values, index=major_axis, columns=pd.MultiIndex.from_arrays([items, minor_axis]))
>>> data
A B
x y x y
2013-01-01 -1.063086 0.564123 0.128006 -0.658767
2014-01-01 2.182473 -0.851618 1.180264 0.165581
2015-01-01 -0.003941 0.590801 -1.616197 -2.270557
2016-01-01 -0.736524 0.172791 1.220589 -1.303294
2013-01-01 -1.052184 -1.171545 -0.473488 -0.140327
2014-01-01 0.021189 0.827241 0.775863 -0.882874
2015-01-01 -1.762289 0.705692 0.593365 -0.984109
2016-01-01 -1.946106 -1.108336 -1.691758 -0.088932
>>> data.groupby(data.index).sum()
A B
x y x y
2013-01-01 -2.115270 -0.607422 -0.345482 -0.799094
2014-01-01 2.203662 -0.024377 1.956127 -0.717293
2015-01-01 -1.766230 1.296492 -1.022832 -3.254667
2016-01-01 -2.682630 -0.935544 -0.471170 -1.392226