I have data similar to this post: pandas: Filling missing values within a group
That is, I have data in a number of observation sessions, and there is a focal individual for each session. That focal individual is only noted once, but I want to fill in the focal ID data for each line during that session. So, the data look something like this:
Focal Session
0 NaN 1
1 50101 1
2 NaN 1
3 NaN 2
4 50408 2
5 NaN 2
Based on the post linked above, I was using this code:
g = data.groupby('Session')
g['Focal'].transform(lambda s: s.loc[s.first_valid_index()])
But this returns a KeyError (specifically, KeyError:None). According to the .loc documentation, KeyErrors can result when the data isn't found. So, I've checked and while I have 152 sessions, I only have 150 non-null data points in the Focal column. Before I decide to manually search my data for which of the sessions is missing a Focal ID, I have two questions:
I am very much a beginner. So is this a reasonable explanation for why I am getting a KeyError?
If it is reasonable, is there a way to figure out which Session is missing Focal ID data, that will save me from manually looking through the data?
Output here:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-330-0e4f27aa7e14> in <module>()
----> 1 data['Focal'] = g['Focal'].transform(lambda s: s.loc[s.first_valid_index()])
2 g['Focal'].transform(lambda s: s.loc[s.first_valid_index()])
//anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in transform(self, func, *args, **kwargs)
1540 for name, group in self:
1541 object.__setattr__(group, 'name', name)
-> 1542 res = wrapper(group)
1543 # result[group.index] = res
1544 indexer = self.obj.index.get_indexer(group.index)
//anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in <lambda>(x)
1536 wrapper = lambda x: getattr(x, func)(*args, **kwargs)
1537 else:
-> 1538 wrapper = lambda x: func(x, *args, **kwargs)
1539
1540 for name, group in self:
<ipython-input-330-0e4f27aa7e14> in <lambda>(s)
----> 1 data['Focal'] = g['Focal'].transform(lambda s: s.loc[s.first_valid_index()])
2 g['Focal'].transform(lambda s: s.loc[s.first_valid_index()])
//anaconda/lib/python2.7/site-packages/pandas/core/indexing.pyc in __getitem__(self, key)
669 return self._getitem_tuple(key)
670 else:
--> 671 return self._getitem_axis(key, axis=0)
672
673 def _getitem_axis(self, key, axis=0):
//anaconda/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
756 return self._getitem_iterable(key, axis=axis)
757 else:
--> 758 return self._get_label(key, axis=axis)
759
760 class _iLocIndexer(_LocationIndexer):
//anaconda/lib/python2.7/site-packages/pandas/core/indexing.pyc in _get_label(self, label, axis)
58 return self.obj._xs(label, axis=axis, copy=False)
59 except Exception:
---> 60 return self.obj._xs(label, axis=axis, copy=True)
61
62 def _get_loc(self, key, axis=0):
//anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in _xs(self, key, axis, level, copy)
570
571 def _xs(self, key, axis=0, level=None, copy=True):
--> 572 return self.__getitem__(key)
573
574 def _ixs(self, i, axis=0):
//anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
611 def __getitem__(self, key):
612 try:
--> 613 return self.index.get_value(self, key)
614 except InvalidIndexError:
615 pass
//anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in get_value(self, series, key)
761 """
762 try:
--> 763 return self._engine.get_value(series, key)
764 except KeyError, e1:
765 if len(self) > 0 and self.inferred_type == 'integer':
//anaconda/lib/python2.7/site-packages/pandas/index.so in pandas.index.IndexEngine.get_value (pandas/index.c:2565)()
//anaconda/lib/python2.7/site-packages/pandas/index.so in pandas.index.IndexEngine.get_value (pandas/index.c:2380)()
//anaconda/lib/python2.7/site-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3166)()
KeyError: None
The problem is that first_valid_index returns None if there are no valid values (some groups in your DataFrame are all NaN):
In [1]: s = pd.Series([np.nan])
In [2]: s.first_valid_index() # None
Now, loc throws an error because there is no index None:
In [3]: s.loc[s.first_valid_index()]
KeyError: None
What do you want your code to do in this particular case? ...
If you wanted it to be NaN, you could backfill and then take the first element:
g['Focal'].transform(lambda s: s.bfill().iloc[0])
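As for the second question (which Session has no Focal ID at all), one option is to count the non-null values per group and keep the groups whose count is zero. This is a minimal sketch on made-up data shaped like the example in the question; session 3 is a hypothetical all-NaN session added purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy data shaped like the question; session 3 is a made-up all-NaN session.
data = pd.DataFrame({
    "Focal": [np.nan, 50101, np.nan, np.nan, 50408, np.nan, np.nan, np.nan],
    "Session": [1, 1, 1, 2, 2, 2, 3, 3],
})

# count() ignores NaN, so a count of 0 means the session has no Focal ID.
counts = data.groupby("Session")["Focal"].count()
missing_sessions = counts[counts == 0].index.tolist()
print(missing_sessions)  # [3]

# "first" picks the first non-null value per group and leaves NaN
# for groups that have none, so no KeyError is raised.
filled = data.groupby("Session")["Focal"].transform("first")
```

Using transform("first") instead of the lambda also sidesteps the first_valid_index() lookup entirely, at the cost of silently leaving NaN in the sessions that have no focal individual.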
If you want to handle the groups that contain only NaN, you could do the following:
g = data.groupby('Session')
data['Focal'] = g['Focal'].transform(lambda s: 'No values to aggregate' if s.isnull().all() else s.loc[s.first_valid_index()])
This way, 'No values to aggregate' (or whatever placeholder you prefer) is filled in whenever a group contains only NaN, instead of execution stopping with an error.
Hope this helps :)
Federico
I have the following pandas series:
>>>df.A.head()
0 {"Date_": "2022-06-01T01:00:00+05:30", "submit...
1 {"Growth": [{"textField": "", "Change_Size": "...
2 {"submit": true, "HSI_Tag": "xyz...
3 {"submit": true, "HSI_Tag": "xyz...
4 {"submit": true, "roleList": "xy...
Name: A, dtype: object
Every item in the series is a serialized JSON item. I would like to turn every item into a dictionary. I am trying to do the following, but I get an error:
for i in range(len(df.A)):
    df.A.iloc[i] = json.loads(df.A.iloc[i])
The error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-27-9b4e8d4e6d76> in <module>
1 for i in range(len(df.A)):
----> 2 df.A.iloc[i] = json.loads(df.A.iloc[i])
C:\ANACONDA3\lib\site-packages\pandas\core\indexing.py in __setitem__(self, key, value)
188 key = com.apply_if_callable(key, self.obj)
189 indexer = self._get_setitem_indexer(key)
--> 190 self._setitem_with_indexer(indexer, value)
191
192 def _validate_key(self, key, axis):
C:\ANACONDA3\lib\site-packages\pandas\core\indexing.py in _setitem_with_indexer(self, indexer, value)
640 # setting for extensionarrays that store dicts. Need to decide
641 # if it's worth supporting that.
--> 642 value = self._align_series(indexer, Series(value))
643
644 elif isinstance(value, ABCDataFrame):
C:\ANACONDA3\lib\site-packages\pandas\core\indexing.py in _align_series(self, indexer, ser, multiindex_indexer)
774
775 elif is_scalar(indexer):
--> 776 ax = self.obj._get_axis(1)
777
778 if ser.index.equals(ax):
C:\ANACONDA3\lib\site-packages\pandas\core\generic.py in _get_axis(self, axis)
376
377 def _get_axis(self, axis):
--> 378 name = self._get_axis_name(axis)
379 return getattr(self, name)
380
C:\ANACONDA3\lib\site-packages\pandas\core\generic.py in _get_axis_name(cls, axis)
373 pass
374 raise ValueError('No axis named {0} for object type {1}'
--> 375 .format(axis, type(cls)))
376
377 def _get_axis(self, axis):
ValueError: No axis named 1 for object type <class 'type'>
How can I fix it?
I managed to do it eventually with apply and a lambda like this:
df.A = df.A.apply(lambda x: json.loads(x))
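Judging by the traceback, the loop fails because assigning a dict through .iloc makes pandas wrap the value in a Series and try to align it, rather than storing it as a single object. Parsing the whole column in one pass avoids the item-by-item assignment. A minimal sketch, with two made-up JSON strings standing in for the real column:

```python
import json

import pandas as pd

# Two made-up JSON strings standing in for the real data.
df = pd.DataFrame({
    "A": [
        '{"submit": true, "HSI_Tag": "xyz"}',
        '{"Growth": [{"textField": ""}]}',
    ]
})

# json.loads can be passed directly; no lambda is needed.
df["A"] = df["A"].apply(json.loads)

print(type(df["A"].iloc[0]))  # <class 'dict'>
```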
Could someone tell me what non-printable character I have in my code that makes Python not recognize the column names in my DataFrame?
import pandas as pd
data_olymp = pd.read_csv("Olympics_data.csv", sep=";")
Here is the Traceback of the error when I try to group by teamname :
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-103-ae95f10f5210> in <module>
30 # print(type(réponse1))
31 # print(len(winter_games_bronze_won))
---> 32 print(data_olymp.loc[" winter_games_bronze_won"] == 9)
~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
893
894 maybe_callable = com.apply_if_callable(key, self.obj)
--> 895 return self._getitem_axis(maybe_callable, axis=axis)
896
897 def _is_scalar_access(self, key: Tuple):
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1122 # fall thru to straight lookup
1123 self._validate_key(key, axis)
-> 1124 return self._get_label(key, axis=axis)
1125
1126 def _get_slice_axis(self, slice_obj: slice, axis: int):
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_label(self, label, axis)
1071 def _get_label(self, label, axis: int):
1072 # GH#5667 this will fail if the label is not present in the axis.
-> 1073 return self.obj.xs(label, axis=axis)
1074
1075 def _handle_lowerdim_multi_index_axis0(self, tup: Tuple):
~\anaconda3\lib\site-packages\pandas\core\generic.py in xs(self, key, axis, level, drop_level)
3737 raise TypeError(f"Expected label or tuple of labels, got {key}") from e
3738 else:
-> 3739 loc = index.get_loc(key)
3740
3741 if isinstance(loc, np.ndarray):
~\anaconda3\lib\site-packages\pandas\core\indexes\range.py in get_loc(self, key, method, tolerance)
352 except ValueError as err:
353 raise KeyError(key) from err
--> 354 raise KeyError(key)
355 return super().get_loc(key, method=method, tolerance=tolerance)
356
KeyError: ' winter_games_bronze_won'
The file looks like this:
team_name; summer_games_played; summer_games_gold_won; summer_games_silver_won; summer_games_bronze_won; summer_games_medals_won; winter_games_played; winter_games_gold_won; winter_games_silver_won; winter_games_bronze_won; winter_games_medals_won; total_games_played
Canada (CAN);13;0;0;2;2;0;0;0;0;0;13
United States (USA);12;5;2;8;15;3;0;0;0;0;15
Russia (RUS);23;18;24;28;70;18;0;0;0;0;41
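A KeyError is raised when you try to access a key that is not in a dictionary. In pandas it is much the same thing: .loc is trying to locate a label that is not found in the DataFrame.
Looking at your code and the traceback, two things seem to be going on. First, .loc with a single label looks that label up in the row index (the traceback ends in the RangeIndex lookup), so to select a column you want data_olymp["..."] or data_olymp.loc[:, "..."] instead. Second, the header line in your file has a space after each semicolon, so the column names themselves start with a space; there may be no non-printable character at all. Try stripping the spaces from the column names and then referring to winter_games_bronze_won without the leading space, and see what happens.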
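Here is a small sketch of that idea, using the three sample rows from the question read from an in-memory string instead of Olympics_data.csv. The comparison against 0 is only for illustration, since every sample row has 0 in that column.

```python
import io

import pandas as pd

# Sample rows copied from the question; the real file is Olympics_data.csv.
raw = """team_name; summer_games_played; summer_games_gold_won; summer_games_silver_won; summer_games_bronze_won; summer_games_medals_won; winter_games_played; winter_games_gold_won; winter_games_silver_won; winter_games_bronze_won; winter_games_medals_won; total_games_played
Canada (CAN);13;0;0;2;2;0;0;0;0;0;13
United States (USA);12;5;2;8;15;3;0;0;0;0;15
Russia (RUS);23;18;24;28;70;18;0;0;0;0;41
"""

data_olymp = pd.read_csv(io.StringIO(raw), sep=";")

# The header has a space after each ';', so strip it from the column names.
data_olymp.columns = data_olymp.columns.str.strip()

# Select the column with [] (not .loc with a single label, which looks up rows),
# then use the boolean mask to filter rows.
mask = data_olymp["winter_games_bronze_won"] == 0
print(data_olymp.loc[mask, "team_name"])
```

Passing skipinitialspace=True to read_csv should have the same effect on the header as stripping the column names afterwards.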
I would appreciate any help with this. I'm getting the error
ValueError: Length of values (1191) does not match length of index (1250).
I don't understand where NumPy is getting the length of 1191 from. I've created a DataFrame of 1250 rows, and I'm trying to assign future['floor'] to it based on conditions. future['cap'] works fine, but that uses pandas, whereas 'floor' uses NumPy, and I don't understand why NumPy would cause this error. Thanks for your help. Gav
future = m.make_future_dataframe(periods=1250,freq='D', include_history=False)
conditions = [
g['Operator'] == 100151,
g['Operator'] == 20137,
g['Operator'] == 20147,
]
values = [
g['y'].mean()/2,
g['y'].mean()/2,
g['y'].mean()/2
]
future['floor'] = np.select(conditions,values)
future['cap'] = max(g['y'])*1.25
forecast = m.predict(future)
ValueError Traceback (most recent call last)
<ipython-input-184-a698f789f6b3> in <module>
----> 1 fout = df.groupby('Operator').apply(forecast_data)
~\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py in apply(self, func, *args, **kwargs)
892 with option_context("mode.chained_assignment", None):
893 try:
--> 894 result = self._python_apply_general(f, self._selected_obj)
895 except TypeError:
896 # gh-20949
~\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py in _python_apply_general(self, f, data)
926 data after applying f
927 """
--> 928 keys, values, mutated = self.grouper.apply(f, data, self.axis)
929
930 return self._wrap_applied_output(
~\Anaconda3\lib\site-packages\pandas\core\groupby\ops.py in apply(self, f, data, axis)
236 # group might be modified
237 group_axes = group.axes
--> 238 res = f(group)
239 if not _is_indexed_like(res, group_axes, axis):
240 mutated = True
<ipython-input-183-f88148e0e94e> in forecast_data(g)
42 g['y'].mean()/2
43 ]
---> 44 future['floor'] = np.select(conditions,values)
45 future['cap'] = max(g['y'])*1.25
46 forecast = m.predict(future)
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
3161 else:
3162 # set column
-> 3163 self._set_item(key, value)
3164
3165 def _setitem_slice(self, key: slice, value):
~\Anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
3240 """
3241 self._ensure_valid_index(value)
-> 3242 value = self._sanitize_column(key, value)
3243 NDFrame._set_item(self, key, value)
3244
~\Anaconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, key, value, broadcast)
3897
3898 # turn me into an ndarray
-> 3899 value = sanitize_index(value, self.index)
3900 if not isinstance(value, (np.ndarray, Index)):
3901 if isinstance(value, list) and len(value) > 0:
~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in sanitize_index(data, index)
749 """
750 if len(data) != len(index):
--> 751 raise ValueError(
752 "Length of values "
753 f"({len(data)}) "
ValueError: Length of values (1191) does not match length of index (1250)
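A likely explanation, assuming the group g passed to forecast_data has 1191 rows: the conditions are built from g, so np.select returns one value per row of g, not one per row of future. future['cap'] works because a scalar is broadcast to every row. The sketch below reproduces the length mismatch with small made-up frames (5 and 8 rows) in place of the real group and Prophet's make_future_dataframe, and then assigns the floor as a scalar instead.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins: g is one Operator group, future is the frame to fill.
g = pd.DataFrame({"Operator": [100151] * 5, "y": [10.0, 20.0, 30.0, 40.0, 50.0]})
future = pd.DataFrame({"ds": pd.date_range("2021-01-01", periods=8, freq="D")})

conditions = [g["Operator"] == 100151, g["Operator"] == 20137]
values = [g["y"].mean() / 2, g["y"].mean() / 2]

# np.select yields one element per row of g, so its length follows g.
per_row = np.select(conditions, values)
print(len(per_row), len(future))  # 5 8

# A scalar is broadcast to all rows of future, like 'cap' in the question.
future["floor"] = g["y"].mean() / 2
future["cap"] = g["y"].max() * 1.25
```

Since all three values in the question are the same expression, the scalar assignment gives every row the same floor; if the floor really needs to differ per operator, the conditions would have to be evaluated on something with the same length as future.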
Problem description
I have a DataFrame in which the last column is a format column. The purpose of this column is to hold the format of each DataFrame row.
Here is an example of such a dataframe:
df = pd.DataFrame({'ID': [1, 24, 31, 37],
'Status': ['to analyze', 'to analyze','to analyze','analyzed'],
'priority' : ['P1','P1','P2','P1'],
'format' : ['n;y;n','n;n;n','n;y;y','y;n;y']})
Each df['format'] row contains a string intended to be taken as a list (when split) to give the format of the row.
Symbols meaning:
n means "no highlight"
y means "to highlight in yellow"
df['format'].to_list()[0] = 'n;y;n' means for example:
n: first column ID item "1" not highlighted
y: second column Status item "to analyze" to be highlighted
n: third column Priority item "P1" not highlighted
So that expected outcome is:
What I've tried
I've tried to use df.format to get a list of lists containing the format needed. Here is my code:
import pandas as pd
import numpy as np
def highlight_case(df):
    list_of_format_lists = []
    for format_line in df['format']:
        format_line_list = format_line.split(';')
        format_list = []
        for form in format_line_list:
            if 'y' in form:
                format_list.append('background-color: yellow')
            else:
                format_list.append('')
        list_of_format_lists.append(format_list)
    list_of_format_lists = list(map(list, zip(*list_of_format_lists)))  # transpose
    print(list_of_format_lists)
    return list_of_format_lists

highlight_style = highlight_case(df)
df.style.apply(highlight_style)
It doesn't work, and I get this output:
TypeError Traceback (most recent call last)
c:\python38\lib\site-packages\IPython\core\formatters.py in __call__(self, obj)
343 method = get_real_method(obj, self.print_method)
344 if method is not None:
--> 345 return method()
346 return None
347 else:
c:\python38\lib\site-packages\pandas\io\formats\style.py in _repr_html_(self)
191 Hooks into Jupyter notebook rich display system.
192 """
--> 193 return self.render()
194
195 #doc(NDFrame.to_excel, klass="Styler")
c:\python38\lib\site-packages\pandas\io\formats\style.py in render(self, **kwargs)
538 * table_attributes
539 """
--> 540 self._compute()
541 # TODO: namespace all the pandas keys
542 d = self._translate()
c:\python38\lib\site-packages\pandas\io\formats\style.py in _compute(self)
623 r = self
624 for func, args, kwargs in self._todo:
--> 625 r = func(self)(*args, **kwargs)
626 return r
627
c:\python38\lib\site-packages\pandas\io\formats\style.py in _apply(self, func, axis, subset, **kwargs)
637 data = self.data.loc[subset]
638 if axis is not None:
--> 639 result = data.apply(func, axis=axis, result_type="expand", **kwargs)
640 result.columns = data.columns
641 else:
c:\python38\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7543 kwds=kwds,
7544 )
-> 7545 return op.get_result()
7546
7547 def applymap(self, func) -> "DataFrame":
c:\python38\lib\site-packages\pandas\core\apply.py in get_result(self)
142 # dispatch to agg
143 if is_list_like(self.f) or is_dict_like(self.f):
--> 144 return self.obj.aggregate(self.f, axis=self.axis, *self.args, **self.kwds)
145
146 # all empty
c:\python38\lib\site-packages\pandas\core\frame.py in aggregate(self, func, axis, *args, **kwargs)
7353 axis = self._get_axis_number(axis)
7354
-> 7355 relabeling, func, columns, order = reconstruct_func(func, **kwargs)
7356
7357 result = None
c:\python38\lib\site-packages\pandas\core\aggregation.py in reconstruct_func(func, **kwargs)
74
75 if not relabeling:
---> 76 if isinstance(func, list) and len(func) > len(set(func)):
77
78 # GH 28426 will raise error if duplicated function names are used and
TypeError: unhashable type: 'list'
Since the formats are encoded for each row, it makes sense to apply the function row-wise:
def format_row(r):
    formats = r['format'].split(';')
    # one style per format entry, plus an empty style for the 'format' column itself
    return ['background-color: yellow' if y == 'y' else '' for y in formats] + ['']

df.style.apply(format_row, axis=1)
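The original attempt fails because Styler.apply expects a function, while highlight_style is already a list of lists. To see what the row-wise function hands to the Styler, you can run it through DataFrame.apply with result_type="expand", which is what the traceback shows the Styler doing internally. A minimal sketch, reusing the example frame from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 24, 31, 37],
    "Status": ["to analyze", "to analyze", "to analyze", "analyzed"],
    "priority": ["P1", "P1", "P2", "P1"],
    "format": ["n;y;n", "n;n;n", "n;y;y", "y;n;y"],
})


def format_row(r):
    """Return one CSS string per column of the row."""
    formats = r["format"].split(";")
    # One style per format flag, plus an empty style for the 'format' column.
    return ["background-color: yellow" if y == "y" else "" for y in formats] + [""]


# Build the style table directly to inspect it; each cell holds a CSS string.
styles = df.apply(format_row, axis=1, result_type="expand")
styles.columns = df.columns
print(styles)
```

The function has to return exactly one entry per column, which is why the extra empty string for the format column is needed.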
Output:
Let's take the following contrived example where I create a DataFrame and then make a DatetimeIndex using a column with duplicate entries. I then place this DataFrame into a Panel and then attempt to iterate over the major axis.
import pandas as pd
import datetime as dt
from collections import OrderedDict
a = [1371215933513120, 1371215933513121, 1371215933513122, 1371215933513122]
b = [1,2,3,4]
df = pd.DataFrame({'a':a, 'b':b, 'c':[dt.datetime.fromtimestamp(t/1000000.) for t in a]})
df.index=pd.DatetimeIndex(df['c'])
d = OrderedDict()
d['x'] = df
p = pd.Panel(d)
for y in p.major_axis:
    print y
    print p.major_xs(y)
This leads to the following output:
2013-06-14 15:18:53.513120
x
a 1371215933513120
b 1
c 2013-06-14 15:18:53.513120
2013-06-14 15:18:53.513121
x
a 1371215933513121
b 2
c 2013-06-14 15:18:53.513121
2013-06-14 15:18:53.513122
Followed by a rather cryptic (to me) error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-35-045aaae5a074> in <module>()
13 for y in p.major_axis:
14 print y
---> 15 print p.major_xs(y)
/usr/local/lib/python2.7/dist-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.py in __str__(self)
667 if py3compat.PY3:
668 return self.__unicode__()
--> 669 return self.__bytes__()
670
671 def __bytes__(self):
/usr/local/lib/python2.7/dist-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.py in __bytes__(self)
677 """
678 encoding = com.get_option("display.encoding")
--> 679 return self.__unicode__().encode(encoding, 'replace')
680
681 def __unicode__(self):
/usr/local/lib/python2.7/dist-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.py in __unicode__(self)
692 # This needs to compute the entire repr
693 # so don't do it unless rownum is bounded
--> 694 fits_horizontal = self._repr_fits_horizontal_()
695
696 if fits_vertical and fits_horizontal:
/usr/local/lib/python2.7/dist-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.py in _repr_fits_horizontal_(self)
652 d=d.iloc[:min(max_rows, height,len(d))]
653
--> 654 d.to_string(buf=buf)
655 value = buf.getvalue()
656 repr_width = max([len(l) for l in value.split('\n')])
/usr/local/lib/python2.7/dist-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.py in to_string(self, buf, columns, col_space, colSpace, header, index, na_rep, formatters, float_format, sparsify, nanRep, index_names, justify, force_unicode, line_width)
1489 header=header, index=index,
1490 line_width=line_width)
-> 1491 formatter.to_string()
1492
1493 if buf is None:
/usr/local/lib/python2.7/dist-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/format.py in to_string(self, force_unicode)
312 text = info_line
313 else:
--> 314 strcols = self._to_str_columns()
315 if self.line_width is None:
316 text = adjoin(1, *strcols)
/usr/local/lib/python2.7/dist-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/format.py in _to_str_columns(self)
265 for i, c in enumerate(self.columns):
266 if self.header:
--> 267 fmt_values = self._format_col(i)
268 cheader = str_columns[i]
269
/usr/local/lib/python2.7/dist-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/format.py in _format_col(self, i)
403 float_format=self.float_format,
404 na_rep=self.na_rep,
--> 405 space=self.col_space)
406
407 def to_html(self, classes=None):
/usr/local/lib/python2.7/dist-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/format.py in format_array(values, formatter, float_format, na_rep, digits, space, justify)
1319 justify=justify)
1320
-> 1321 return fmt_obj.get_result()
1322
1323
/usr/local/lib/python2.7/dist-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/format.py in get_result(self)
1335
1336 def get_result(self):
-> 1337 fmt_values = self._format_strings()
1338 return _make_fixed_width(fmt_values, self.justify)
1339
/usr/local/lib/python2.7/dist-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/format.py in _format_strings(self)
1362
1363 print "vals:", vals
-> 1364 is_float = lib.map_infer(vals, com.is_float) & notnull(vals)
1365 leading_space = is_float.any()
1366
ValueError: operands could not be broadcast together with shapes (2) (2,3)
Now, having explained that I'm creating an index with duplicate entries, the source of the error is clear. Without having known that, however, it would have been more difficult (again, for a novice like me) to figure out why this Exception was popping up.
This leads me to a few questions.
Is this really the expected behavior of pandas? Is it forbidden to create an index with duplicate entries, or is it just forbidden to iterate over them?
If it's forbidden to create such an index, then shouldn't an exception be raised when initially creating it?
If the iteration is somehow incorrect, shouldn't the error be more informative?
Am I doing something wrong?
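On the first question, pandas does allow an index with duplicate entries, and you can check for them before iterating. The sketch below uses a plain DataFrame rather than a Panel (Panel has been removed from recent pandas releases), so it does not reproduce the error above; it only shows how to detect the duplicates and what a label lookup returns for them. Building the index with pd.to_datetime(..., unit="us") is my substitution for the fromtimestamp call in the question.

```python
import pandas as pd

a = [1371215933513120, 1371215933513121, 1371215933513122, 1371215933513122]
df = pd.DataFrame({"a": a, "b": [1, 2, 3, 4]})

# Microsecond timestamps as a DatetimeIndex; the last two entries are equal.
df.index = pd.to_datetime(a, unit="us")

print(df.index.is_unique)  # False

# Labels that occur more than once.
dupes = df.index[df.index.duplicated(keep=False)]

# Looking up a duplicated label returns every matching row, not a single one.
rows = df.loc[dupes[0]]
print(len(rows))  # 2

# Iterating over the unique labels visits each timestamp once.
for label in df.index.unique():
    print(label, len(df.loc[[label]]))
```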