Filling in missing values using a function

Hello,
I'm working on a column that has missing values ('year_of_release'). The data type is 'timestamp64'.
At first, I created a function that "pulls" year numbers from a column in which years appear next to the names of some games, and then I stored this data in a new column, 'years_from_titles':
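import re

def get_year(row):
    # Find every four-digit run in the title.
    regex = r"\d{4}"
    match = re.findall(regex, row)
    for i in match:
        # Return the first candidate that falls in a plausible range.
        if 1970 < int(i) < 2017:
            return int(i)

gaming['years_from_titles'] = gaming['name'].apply(lambda x: get_year(str(x)))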
I tested the function and it works.
Now, I'm trying to create another function, which will fill in those missing years of the original column - 'year_of_release', but only if they appear on the same row:
def year_row(row):
    if math.isnan(row['year_of_release']):
        return row['years_from_titles']
    else:
        return row['year_of_release']

gaming['year_of_release'] = gaming.apply(year_row, axis=1)
But when I run the code I get a TypeError:
/tmp/ipykernel_31/133192424.py in <module>
7 return row['year_of_release']
8
----> 9 gaming['year_of_release']=gaming.apply(year_row,axis=1)
/opt/conda/lib/python3.9/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7766 kwds=kwds,
7767 )
-> 7768 return op.get_result()
7769
7770 def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:
/opt/conda/lib/python3.9/site-packages/pandas/core/apply.py in get_result(self)
183 return self.apply_raw()
184
--> 185 return self.apply_standard()
186
187 def apply_empty_result(self):
/opt/conda/lib/python3.9/site-packages/pandas/core/apply.py in apply_standard(self)
274
275 def apply_standard(self):
--> 276 results, res_index = self.apply_series_generator()
277
278 # wrap results
/opt/conda/lib/python3.9/site-packages/pandas/core/apply.py in apply_series_generator(self)
288 for i, v in enumerate(series_gen):
289 # ignore SettingWithCopy here in case the user mutates
--> 290 results[i] = self.f(v)
291 if isinstance(results[i], ABCSeries):
292 # If we have a view on v, we need to make a copy because
/tmp/ipykernel_31/133192424.py in year_row(row)
2 # but only if a year is found, on the same row, and in correspond to years_from_titles column.
3 def year_row(row):
----> 4 if math.isnan(row['year_of_release']):
5 return row['years_from_titles']
6 else:
TypeError: must be real number, not Timestamp.
If anyone knows how to overcome this I would greatly appreciate it.
Thanks

You can use the fact that NaN is not equal to itself (the same holds for NaT).
def year_row(row):
    if row['year_of_release'] != row['year_of_release']:
        return row['years_from_titles']
    else:
        return row['year_of_release']

gaming['year_of_release'] = gaming.apply(year_row, axis=1)
Or with Series.mask
gaming['year_of_release'] = gaming['year_of_release'].mask(gaming['year_of_release'].isna(), gaming['years_from_titles'])
Or with Series.fillna
gaming['year_of_release'] = gaming['year_of_release'].fillna(gaming['years_from_titles'])
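A minimal sketch of the fillna approach on made-up data. The three sample rows are hypothetical, and the sketch assumes you are happy to keep only the year number from the timestamp column so that both columns hold the same kind of value before filling:

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
gaming = pd.DataFrame({
    "name": ["FIFA 2014", "Tetris", "Madden NFL 2004"],
    "year_of_release": pd.to_datetime(["2013-09-24", None, None]),
    "years_from_titles": [2014.0, np.nan, 2004.0],
})

# Reduce the timestamps to year numbers; NaT becomes NaN.
release_year = gaming["year_of_release"].dt.year

# Fill the gaps from the title-derived years, aligned row by row on the index.
filled_year = release_year.fillna(gaming["years_from_titles"])
print(filled_year.tolist())
```

Rows where neither column has a year stay missing, which matches the "only if they appear on the same row" requirement.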

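Instead of using the math module to check for missing values, here's a more pandas-specific approach.
Change this line:
if math.isnan(row['year_of_release']):
to this:
if pd.isna(row['year_of_release']):
Inside the row function, row['year_of_release'] is a scalar (a Timestamp or NaT) rather than a Series, so it has no .isna() method; the top-level pd.isna function accepts scalars.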
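To see why the original check fails, here is a small sketch comparing the two tests on scalar values. The sample values are made up; pd.isna is the top-level pandas function, and math.isnan is the call from the question:

```python
import math
import pandas as pd

# A valid timestamp followed by three kinds of "missing".
values = [pd.Timestamp("2010-01-01"), pd.NaT, float("nan"), None]

# pd.isna accepts all of these scalars.
flags = [bool(pd.isna(v)) for v in values]
print(flags)

# math.isnan only accepts real numbers, so a Timestamp raises TypeError.
try:
    math.isnan(pd.Timestamp("2010-01-01"))
    raised = False
except TypeError:
    raised = True
print(raised)
```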

Related

DataError: No numeric types to aggregate pandas pivot

I have a pandas dataframe like this:
User-Id Training-Id TrainingTaken
0 4327024 25 10
1 6662572 3 10
2 3757520 26 10
and I need to convert it to a Matrix like they do here:
https://github.com/tr1ten/Anime-Recommender-System/blob/main/HybridRecommenderSystem.ipynb
Cell 13.
So I did the following:
from lightfm import LightFM
from lightfm.evaluation import precision_at_k
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_profiling
from scipy.sparse import csr_matrix
from lightfm.evaluation import auc_score
from lightfm.data import Dataset
user_training_interaction = pd.pivot_table(trainingtaken, index='User-Id', columns='Training-Id', values='TrainingTaken')
user_training_interaction.fillna(0,inplace=True)
user_training_csr = csr_matrix(user_training_interaction.values)
But I get this error:
---------------------------------------------------------------------------
DataError Traceback (most recent call last)
<ipython-input-96-5a2c7ba28976> in <module>
10 from lightfm.data import Dataset
11
---> 12 user_training_interaction = pd.pivot_table(trainingtaken, index='User-Id', columns='Training-Id', values='TrainingTaken')
13 user_training_interaction.fillna(0,inplace=True)
14 user_training_csr = csr_matrix(user_training_interaction.values)
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/pandas/core/reshape/pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name, observed)
110
111 grouped = data.groupby(keys, observed=observed)
--> 112 agged = grouped.agg(aggfunc)
113 if dropna and isinstance(agged, ABCDataFrame) and len(agged.columns):
114 agged = agged.dropna(how="all")
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
949 func = maybe_mangle_lambdas(func)
950
--> 951 result, how = self._aggregate(func, *args, **kwargs)
952 if how is None:
953 return result
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
305
306 if isinstance(arg, str):
--> 307 return self._try_aggregate_string_function(arg, *args, **kwargs), None
308
309 if isinstance(arg, dict):
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/pandas/core/base.py in _try_aggregate_string_function(self, arg, *args, **kwargs)
261 if f is not None:
262 if callable(f):
--> 263 return f(*args, **kwargs)
264
265 # people may try to aggregate on a non-callable attribute
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/pandas/core/groupby/groupby.py in mean(self, numeric_only)
1396 "mean",
1397 alt=lambda x, axis: Series(x).mean(numeric_only=numeric_only),
-> 1398 numeric_only=numeric_only,
1399 )
1400
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
1020 ) -> DataFrame:
1021 agg_blocks, agg_items = self._cython_agg_blocks(
-> 1022 how, alt=alt, numeric_only=numeric_only, min_count=min_count
1023 )
1024 return self._wrap_agged_blocks(agg_blocks, items=agg_items)
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
1128
1129 if not (agg_blocks or split_frames):
-> 1130 raise DataError("No numeric types to aggregate")
1131
1132 if split_items:
DataError: No numeric types to aggregate
What am I missing?

The Pandas Documentation states:
While pivot() provides general purpose pivoting with various data
types (strings, numerics, etc.), pandas also provides pivot_table()
for pivoting with aggregation of numeric data
Make sure the column is numeric. Without seeing how you create trainingtaken I can't provide more specific guidance. However, the following may help:
Make sure you handle "empty" values in that column. The Pandas guide is a very good place to start. Pandas points out that "a column of integers with even one missing values is cast to floating-point dtype".
If working with a dataframe, a column can be cast to a specific type via your_df['your_col'].astype(int), or for your example, trainingtaken['TrainingTaken'] = trainingtaken['TrainingTaken'].astype(int)
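As a rough sketch of that advice, assuming the TrainingTaken column was read in as strings (the question does not show how trainingtaken was created, so this is a guess at the cause), the column can be converted before pivoting:

```python
import pandas as pd

# Sample rows from the question; TrainingTaken is stored as strings here
# on purpose, which is an assumption about the real data.
trainingtaken = pd.DataFrame({
    "User-Id": [4327024, 6662572, 3757520],
    "Training-Id": [25, 3, 26],
    "TrainingTaken": ["10", "10", "10"],
})

# Convert to a numeric dtype; unparseable entries become NaN.
trainingtaken["TrainingTaken"] = pd.to_numeric(
    trainingtaken["TrainingTaken"], errors="coerce"
)

# pivot_table aggregates with the mean by default; fill_value replaces gaps.
user_training_interaction = pd.pivot_table(
    trainingtaken,
    index="User-Id",
    columns="Training-Id",
    values="TrainingTaken",
    fill_value=0,
)
print(user_training_interaction)
```

Checking trainingtaken.dtypes before the conversion is a quick way to confirm whether the column really is of object dtype.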

Cannot Get Series From Dataframe (Python)

a = np.random.standard_normal((9,4))
dg = pd.DataFrame(a)
dg.columns = [["No1", "No2", "No3", "No4"]]
dg["No1"]
Hello all. I have been using JupyterLab opened through Anaconda Navigator and I wrote the above code. The first three lines look normal, however, for the fourth line I was given an error as below. If I change the fourth line into dg[["No1"]] then it "worked". However, in that case type(dg[["No1"]]) is actually dataframe, not series.
I am a bit noob and I have scratched my head for almost two hours and still don't understand what's wrong. Can somebody help? Thanks!!!
TypeError Traceback (most recent call last)
<ipython-input-393-b26f43cf53bf> in <module>
----> 1 dg["No1"]
~\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2774 if self.columns.nlevels > 1:
2775 return self._getitem_multilevel(key)
-> 2776 return self._get_item_cache(key)
2777
2778 # Do we have a slicer (on rows)?
~\anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
3584 res = cache.get(item)
3585 if res is None:
-> 3586 values = self._data.get(item)
3587 res = self._box_item_values(item, values)
3588 cache[item] = res
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in get(self, item)
966 raise ValueError("cannot label index with a null key")
967
--> 968 return self.iget(loc)
969 else:
970
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in iget(self, i)
983 Otherwise return as a ndarray
984 """
--> 985 block = self.blocks[self._blknos[i]]
986 values = block.iget(self._blklocs[i])
987
TypeError: only integer scalar arrays can be converted to a scalar index

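You can just do this, unless you want a MultiIndex. A list nested inside another list makes pandas build MultiIndex columns; a flat list gives ordinary column labels: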
dg.columns = ["No1", "No2", "No3", "No4"]
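The difference between the two assignments can be checked directly. This is a small sketch using the same random data as the question; the names nested and flat are only for illustration:

```python
import numpy as np
import pandas as pd

a = np.random.standard_normal((9, 4))

# A list wrapped in another list is treated as an array of levels.
nested = pd.DataFrame(a)
nested.columns = [["No1", "No2", "No3", "No4"]]

# A flat list gives plain column labels.
flat = pd.DataFrame(a)
flat.columns = ["No1", "No2", "No3", "No4"]

print(type(nested.columns).__name__)
print(type(flat.columns).__name__)
print(type(flat["No1"]).__name__)
```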

Pivot: ValueError: Index contains duplicate entries, cannot reshape [duplicate]

I want to plot a heatmap of hashtags against usernames from the given final table, after cleaning and pre-processing.
I get the following error. I have pasted the full traceback below. I searched for similar errors on StackOverflow but could not find a fix that worked.
final_sns = final.pivot("hashtags", "username")
ax = sns.heatmap(final_sns)
ValueError Traceback (most recent call last)
<ipython-input-51-277e0506604d> in <module>()
----> 1 final_sns = final.pivot("hashtags", "username")
2 ax = sns.heatmap(final_sns)
c:\users\apex_predator\appdata\local\programs\python\python36\lib\site-packages\pandas\core\frame.py in pivot(self, index, columns, values)
5192 """
5193 from pandas.core.reshape.reshape import pivot
-> 5194 return pivot(self, index=index, columns=columns, values=values)
5195
5196 _shared_docs['pivot_table'] = """
c:\users\apex_predator\appdata\local\programs\python\python36\lib\site-packages\pandas\core\reshape\reshape.py in pivot(self, index, columns, values)
413 indexed = self._constructor_sliced(self[values].values,
414 index=index)
--> 415 return indexed.unstack(columns)
416
417
c:\users\apex_predator\appdata\local\programs\python\python36\lib\site-packages\pandas\core\frame.py in unstack(self, level, fill_value)
5532 """
5533 from pandas.core.reshape.reshape import unstack
-> 5534 return unstack(self, level, fill_value)
5535
5536 _shared_docs['melt'] = ("""
c:\users\apex_predator\appdata\local\programs\python\python36\lib\site-packages\pandas\core\reshape\reshape.py in unstack(obj, level, fill_value)
493 if isinstance(obj, DataFrame):
494 if isinstance(obj.index, MultiIndex):
--> 495 return _unstack_frame(obj, level, fill_value=fill_value)
496 else:
497 return obj.T.stack(dropna=False)
c:\users\apex_predator\appdata\local\programs\python\python36\lib\site-packages\pandas\core\reshape\reshape.py in _unstack_frame(obj, level, fill_value)
507 unstacker = partial(_Unstacker, index=obj.index,
508 level=level, fill_value=fill_value)
--> 509 blocks = obj._data.unstack(unstacker)
510 return obj._constructor(blocks)
511 else:
c:\users\apex_predator\appdata\local\programs\python\python36\lib\site-packages\pandas\core\internals.py in unstack(self, unstacker_func)
4608 unstacked : BlockManager
4609 """
-> 4610 dummy = unstacker_func(np.empty((0, 0)), value_columns=self.items)
4611 new_columns = dummy.get_new_columns()
4612 new_index = dummy.get_new_index()
c:\users\apex_predator\appdata\local\programs\python\python36\lib\site-packages\pandas\core\reshape\reshape.py in __init__(self, values, index, level, value_columns, fill_value, constructor)
135
136 self._make_sorted_values_labels()
--> 137 self._make_selectors()
138
139 def _make_sorted_values_labels(self):
c:\users\apex_predator\appdata\local\programs\python\python36\lib\site-packages\pandas\core\reshape\reshape.py in _make_selectors(self)
173
174 if mask.sum() < len(self.index):
--> 175 raise ValueError('Index contains duplicate entries, '
176 'cannot reshape')
177
ValueError: Index contains duplicate entries, cannot reshape
What am I missing?

It seems you have duplicated (hashtags, username) combinations in your DataFrame, so pivot doesn't know which row to take while pivoting.
Try the DataFrame.duplicated method to check for them.
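A minimal sketch of that check on made-up data. The count column and the sample rows are hypothetical, since the question does not show the contents of final. If the duplicates are legitimate, pivot_table with an explicit aggregation is one way to collapse them:

```python
import pandas as pd

# Hypothetical data: ("#a", "u1") appears twice, which breaks pivot().
final = pd.DataFrame({
    "hashtags": ["#a", "#a", "#b"],
    "username": ["u1", "u1", "u2"],
    "count": [1, 2, 5],
})

# Show every row whose (hashtags, username) pair occurs more than once.
dupes = final[final.duplicated(subset=["hashtags", "username"], keep=False)]
print(dupes)

# pivot_table aggregates the duplicates instead of raising.
table = final.pivot_table(
    index="hashtags", columns="username", values="count",
    aggfunc="sum", fill_value=0,
)
print(table)
```

Whether sum, mean, or dropping duplicates is the right choice depends on what the repeated rows mean in your data.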

list comprehension does not work with sparse dataframe + never ending groupby and apply computation

I have been stuck for a few days on a perhaps easy problem with Python (I am a new user). I will report here a simplified version of the issue, using a very small dataframe (df). In the simplified case the code works, while with the big df normal operations, like selecting a column, no longer work.
1) Consider a (5x2) df:
df = pd.DataFrame({'a': [1432, 1432, 1433, 1432, 1434],
'b': ['ab152', 'ab153', 'ab154', np.nan, 'ab156']})
df2 = pd.get_dummies(df.b, sparse=True)
type(df2)
[out] pandas.sparse.frame.SparseDataFrame
df2['a'] = df.a
df2 = df2.groupby('a').apply(max)[df2.columns[:-1]].to_sparse()
All works fine here. In plain terms, I'd like to create a dummy matrix from a specific column and then remove duplicates in the index by using, in this case, a max function (other functions could be used depending on the purpose). 'Sparse' is necessary for memory efficiency (the proportion of zeros is very high).
Moreover, if I want to extract column 'b', I just write
df['b']
and it works.
2) In my more complex case I have roughly 5 million rows and 3 thousand columns. I apply the same code.
dummy_matrix = pd.get_dummies(big_df.b, sparse=True)
type(dummy_matrix)
[out] pandas.sparse.frame.SparseDataFrame
dummy_matrix['a'] = big_df.a
dummy_matrix = dummy_matrix.groupby('a').apply(max)[dummy_matrix.columns[:-1]].to_sparse()
But the last line of code never finishes and does not produce any error message.
Moreover, if I try to select column 'b' in this case I get the following error:
In [81]: dummy_matrix['b']
Out[81]: ---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/usr/local/lib/python2.7/dist-packages/IPython/core/formatters.pyc in __call__(self, obj)
688 type_pprinters=self.type_printers,
689 deferred_pprinters=self.deferred_printers)
--> 690 printer.pretty(obj)
691 printer.flush()
692 return stream.getvalue()
/usr/local/lib/python2.7/dist-packages/IPython/lib/pretty.pyc in pretty(self, obj)
407 if callable(meth):
408 return meth(obj, self, cycle)
--> 409 return _default_pprint(obj, self, cycle)
410 finally:
411 self.end_group()
/usr/local/lib/python2.7/dist-packages/IPython/lib/pretty.pyc in _default_pprint(obj, p, cycle)
527 if _safe_getattr(klass, '__repr__', None) not in _baseclass_reprs:
528 # A user-provided repr. Find newlines and replace them with p.break_()
--> 529 _repr_pprint(obj, p, cycle)
530 return
531 p.begin_group(1, '<')
/usr/local/lib/python2.7/dist-packages/IPython/lib/pretty.pyc in _repr_pprint(obj, p, cycle)
709 """A pprint that just redirects to the normal repr function."""
710 # Find newlines and replace them with p.break_()
--> 711 output = repr(obj)
712 for idx,output_line in enumerate(output.splitlines()):
713 if idx:
/usr/local/lib/python2.7/dist-packages/pandas/core/base.pyc in __repr__(self)
62 Yields Bytestring in Py2, Unicode String in py3.
63 """
---> 64 return str(self)
65
66
/usr/local/lib/python2.7/dist-packages/pandas/core/base.pyc in __str__(self)
42 if compat.PY3:
43 return self.__unicode__()
---> 44 return self.__bytes__()
45
46 def __bytes__(self):
/usr/local/lib/python2.7/dist-packages/pandas/core/base.pyc in __bytes__(self)
54
55 encoding = get_option("display.encoding")
---> 56 return self.__unicode__().encode(encoding, 'replace')
57
58 def __repr__(self):
/usr/local/lib/python2.7/dist-packages/pandas/sparse/series.pyc in __unicode__(self)
290 def __unicode__(self):
291 # currently, unicode is same as repr...fixes infinite loop
--> 292 series_rep = Series.__unicode__(self)
293 rep = '%s\n%s' % (series_rep, repr(self.sp_index))
294 return rep
/usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in __unicode__(self)
897
898 self.to_string(buf=buf, name=self.name, dtype=self.dtype,
--> 899 max_rows=max_rows)
900 result = buf.getvalue()
901
/usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in to_string(self, buf, na_rep, float_format, header, length, dtype, name, max_rows)
962 the_repr = self._get_repr(float_format=float_format, na_rep=na_rep,
963 header=header, length=length, dtype=dtype,
--> 964 name=name, max_rows=max_rows)
965
966 # catch contract violations
/usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in _get_repr(self, name, header, length, dtype, na_rep, float_format, max_rows)
991 na_rep=na_rep,
992 float_format=float_format,
--> 993 max_rows=max_rows)
994 result = formatter.to_string()
995
/usr/local/lib/python2.7/dist-packages/pandas/core/format.pyc in __init__(self, series, buf, length, header, na_rep, name, float_format, dtype, max_rows)
146 self.dtype = dtype
147
--> 148 self._chk_truncate()
149
150 def _chk_truncate(self):
/usr/local/lib/python2.7/dist-packages/pandas/core/format.pyc in _chk_truncate(self)
159 else:
160 row_num = max_rows // 2
--> 161 series = concat((series.iloc[:row_num], series.iloc[-row_num:]))
162 self.tr_row_num = row_num
163 self.tr_series = series
/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
752 keys=keys, levels=levels, names=names,
753 verify_integrity=verify_integrity,
--> 754 copy=copy)
755 return op.get_result()
756
/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
803 for obj in objs:
804 if not isinstance(obj, NDFrame):
--> 805 raise TypeError("cannot concatenate a non-NDFrame object")
806
807 # consolidate
TypeError: cannot concatenate a non-NDFrame object
What is the difference between the simpler and the more complex case? Why does the code work in one case but not in the other? Could it be related to dtypes? I checked in both cases and the dtypes are the same for each column, so I don't think the issue lies there. Moreover, do you think the two issues, i.e. the column selection problem and the never-ending computation, are related? I hope so, since that would mean one solution for two problems.
Your help would be much appreciated and I am willing to give more details if necessary. Many thanks.
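One thing that may be worth trying, sketched here on the small example only, is the built-in groupby reduction .max() in place of .apply(max). This is an assumption about where the time goes, not a confirmed diagnosis, and the sketch uses dense integer dummies rather than the sparse frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1432, 1432, 1433, 1432, 1434],
    "b": ["ab152", "ab153", "ab154", np.nan, "ab156"],
})

# One 0/1 indicator column per distinct value of b (NaN gets no column).
dummies = pd.get_dummies(df["b"], dtype=int)
dummies["a"] = df["a"]

# Collapse duplicate keys with the built-in max reduction.
collapsed = dummies.groupby("a").max()
print(collapsed)
```

Grouping by 'a' moves it into the index, so the column slice from the original code is not needed here.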

How to resample a python pandas TimeSeries containing dtype Decimal values?

I have a pandas Series object filled with numbers of type Decimal. I'd like to use the new pandas 0.8 function to resample the decimal time series like this:
resampled = ts.resample('D', how = 'mean')
When trying this I get a "GroupByError: No numeric types to aggregate" error. I assume the problem is that np.mean is used internally to resample the values and np.mean expects floats instead of Decimals.
Thanks to the help of this forum I managed to solve a similar question using groupby and the apply function, but I would love to also use the cool resample function.
How can I use the mean method on a pandas TimeSeries with Decimal type values?
Any idea how to solve this?
Here is the complete ipython session creating the error:
In [37]: from decimal import Decimal
In [38]: from pandas import *
In [39]: rng = date_range('1.1.2012',periods=48, freq='H')
In [40]: rnd = np.random.randn(len(rng))
In [41]: rnd_dec = [Decimal(x) for x in rnd]
In [42]: ts = Series(rnd_dec, index=rng)
In [43]: ts[0:3]
Out[43]:
2012-01-01 00:00:00 -0.1020591335576267189022559023214853368699550628
2012-01-01 01:00:00 0.99245713975437366283216533702216111123561859130
2012-01-01 02:00:00 1.80080710727195758558139004890108481049537658691
Freq: H
In [44]: type(ts[0])
Out[44]: decimal.Decimal
In [45]: ts.resample('D', how = 'mean')
---------------------------------------------------------------------------
GroupByError Traceback (most recent call last)
C:\Users\THM\Documents\Python\<ipython-input-45-09c898403ddd> in <module>()
----> 1 ts.resample('D', how = 'mean')
C:\Python27\lib\site-packages\pandas\core\generic.pyc in resample(self, rule, how, axis, fill_method, closed, label, convention, kind, loffset, l
imit, base)
187 fill_method=fill_method, convention=convention,
188 limit=limit, base=base)
--> 189 return sampler.resample(self)
190
191 def first(self, offset):
C:\Python27\lib\site-packages\pandas\tseries\resample.pyc in resample(self, obj)
65
66 if isinstance(axis, DatetimeIndex):
---> 67 rs = self._resample_timestamps(obj)
68 elif isinstance(axis, PeriodIndex):
69 offset = to_offset(self.freq)
C:\Python27\lib\site-packages\pandas\tseries\resample.pyc in _resample_timestamps(self, obj)
184 if len(grouper.binlabels) < len(axlabels) or self.how is not None:
185 grouped = obj.groupby(grouper, axis=self.axis)
--> 186 result = grouped.aggregate(self._agg_method)
187 else:
188 # upsampling shortcut
C:\Python27\lib\site-packages\pandas\core\groupby.pyc in aggregate(self, func_or_funcs, *args, **kwargs)
1215 """
1216 if isinstance(func_or_funcs, basestring):
-> 1217 return getattr(self, func_or_funcs)(*args, **kwargs)
1218
1219 if hasattr(func_or_funcs,'__iter__'):
C:\Python27\lib\site-packages\pandas\core\groupby.pyc in mean(self)
290 """
291 try:
--> 292 return self._cython_agg_general('mean')
293 except GroupByError:
294 raise
C:\Python27\lib\site-packages\pandas\core\groupby.pyc in _cython_agg_general(self, how)
376
377 if len(output) == 0:
--> 378 raise GroupByError('No numeric types to aggregate')
379
380 return self._wrap_aggregated_output(output, names)
GroupByError: No numeric types to aggregate
Any help is appreciated.
Thanks,
Thomas

I found the answer myself. It is possible to pass a function to the 'how' argument of resample:
f = lambda x: Decimal(np.mean(x))
ts.resample('D', how = f)
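The 'how' argument belongs to the old resample API and was removed in later pandas releases. A rough equivalent on a recent pandas, assuming the method-chaining form of resample, might look like the sketch below. The integer-valued sample series and the decimal_mean helper are made up for illustration:

```python
from decimal import Decimal

import pandas as pd

# 48 hourly points with Decimal values 0..47 (object dtype).
rng = pd.date_range("2012-01-01", periods=48, freq=pd.Timedelta(hours=1))
ts = pd.Series([Decimal(i) for i in range(48)], index=rng)

def decimal_mean(values):
    # Average with Decimal arithmetic so no precision is lost to floats.
    return sum(values, Decimal(0)) / len(values)

# Apply the custom function to each daily bin.
daily = ts.resample("D").apply(decimal_mean)
print(daily)
```

If exact decimal arithmetic is not needed, converting first with ts.astype(float) and calling .resample('D').mean() is the simpler route.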

I get the error for object type columns in DataFrame. I got around it by using
df.resample('D', method='ffill', how=lambda c: c[-1])
