Pandas: Group by combination of two columns in Pandas 0.23.4 - python

I am fairly new to Python. I came across Pandas: Group by combination of two columns on SO. Unfortunately, the accepted answer no longer works with pandas version 0.23.4. The objective of that post is to figure out the combinations of grouping variables and create a dictionary of values, i.e. group_by should ignore the order of grouping.
Here's the accepted answer:
import pandas as pd
from collections import Counter
d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])
d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)
Here, ...apply(sorted) throws the following exception:
ValueError: Must have equal len keys and value when setting with an iterable
Here's my pandas version:
> pd.__version__
Out: '0.23.4'
Here's what I tried after reading https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html:
d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])
d=d.sort_values(by=['x','y'],axis=1).reset_index(drop=True)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)
Unfortunately, this also throws an error (with axis=1, sort_values sorts the columns, so the by keys are looked up in the row index, where 'x' does not exist):
KeyError: 'x'  # raised in _get_label_or_level_values
Expected output:
score count
x y
a b {1: 1, 3: 2} 2
c {2: 1} 1
Can someone please help me? On a side note, it would be great if you could also guide me on how to compute the count of keys in the score column. I am looking for a vectorized solution.
I am using python 3.6.7
Many thanks.

The problem is that sorted returns lists, so it is necessary to convert them to a Series:
d[['x', 'y']] = d[['x', 'y']].apply(lambda x: pd.Series(sorted(x)), axis=1)
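For reference, a minimal end-to-end sketch of this fix, using the sample data from the question:

import pandas as pd
from collections import Counter

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])
# wrapping the sorted pair in a Series aligns it with the two target columns
d[['x', 'y']] = d[['x', 'y']].apply(lambda x: pd.Series(sorted(x)), axis=1)
print(d.groupby(['x', 'y']).agg(Counter))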
But it is faster to use numpy.sort with the DataFrame constructor, because apply loops under the hood:
import numpy as np

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])
d[['x', 'y']] = pd.DataFrame(np.sort(d[['x', 'y']], axis=1), index=d.index)
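For reference, after the row-wise sort the frame looks like this (derived from the sample data above):

print(d)
   x  y  score
0  a  b      1
1  a  c      2
2  a  b      3
3  a  b      3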
Then select the column for aggregation and pass a list of aggregation functions - e.g. nunique for the number of unique values:
x = d.groupby(['x', 'y'])['score'].agg([Counter, 'nunique'])
print(x)
Counter nunique
x y
a b {1: 1, 3: 2} 2
c {2: 1} 1
Or count by DataFrameGroupBy.size:
x = d.groupby(['x', 'y'])['score'].agg([Counter, 'size'])
print(x)
Counter size
x y
a b {1: 1, 3: 2} 3
c {2: 1} 1
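The two aggregations differ on group ('a', 'b') because of the duplicated row: its scores are [1, 3, 3], so size counts 3 rows while nunique counts 2 distinct values. A minimal side-by-side sketch (assuming the sorted frame d from above):

print(d.groupby(['x', 'y'])['score'].agg(['size', 'nunique']))
     size  nunique
x y
a b     3        2
  c     1        1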

Use -
# sort each (x, y) pair row-wise on the underlying NumPy array
a = d[['x','y']].values
a.sort(axis=1)
d[['x','y']] = a
x = d.groupby(['x', 'y']).agg(Counter)
print(x)
Output
score
x y
a b {1: 1, 3: 2}
c {2: 1}

Adding result_type = 'broadcast' as one of the args to .apply() worked.
>>> d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
...                  columns=['x', 'y', 'score'])
>>> d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')
>>> x = d.groupby(['x', 'y']).agg(Counter)
>>> print(x)
score
x y
a b {1: 1, 3: 2}
c {2: 1}
Note the difference with and without result_type = 'broadcast'.
>>> d[['x', 'y']].apply(sorted, axis=1)
0 [a, b]
1 [a, c]
2 [a, b]
3 [a, b]
dtype: object
>>> d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')
x y
0 a b
1 a c
2 a b
3 a b
As you can see, result_type = 'broadcast' splits (broadcasts) the result of .apply() back from a list into the respective columns, allowing the assignment to d[['x', 'y']].

Related

If item within list in pandas column is a dictionary key, replace with value, if not in dictionary, delete

If a pandas column contains a list, you can use a dictionary to convert all the values using
df['listColumn'] = df['listColumn'].apply(lambda x: [columnDictionary[i] for i in x])
However, there are instances where not all the items in a list are keys in the dictionary. In that case, how do you drop those items?
For example
columnDictionary = {'a': 1, 'b': 2, 'd': 7, 'f': 8}
Specific pandas row/column: ['a', 'b', 'c', 'd', 'e']
Specific pandas row/column after conversion: [1, 2, 7]
With a simple condition that checks whether a list value is among the target dict's keys:
In [47]: df = pd.DataFrame({'listColumn': ['a', 123, list('abcde')]})
In [48]: repl_dict = {'a':1, 'b':2, 'd':7, 'f':8 }
In [49]: df['listColumn'].apply(lambda x: [repl_dict[v] for v in x if v in repl_dict] if isinstance(x, list) else x)
Out[49]:
0 a
1 123
2 [1, 2, 7]
Name: listColumn, dtype: object
Use "if else" inside the lamdba function :
Method 1: apply lambda on columns, below on one column only ( axis = 0 )
# apply lambda on 1 column (axis = 0)
import pandas as pd

d = {'col1': ['a', 'b', 'c', 'd', 'e']}
df = pd.DataFrame(data=d)
columnDictionary ={'a':1, 'b':2, 'd':7, 'f':8 }
df['col1'] = df['col1'].apply(lambda x: [columnDictionary[x] if x in columnDictionary else ''])
df
Method 2: apply lambda on rows (axis = 1), row by row (I think it is slower)
d = {'col1':[ 'a', 'b', 'c', 'd', 'e']}
df = pd.DataFrame(data=d)
columnDictionary ={'a':1, 'b':2, 'd':7, 'f':8 }
df['listColumn'] = df.apply(lambda x: [columnDictionary[i] if i in columnDictionary else '' for i in x],axis=1)
df
Result :
col1 listColumn
0 a [1]
1 b [2]
2 c []
3 d [7]
4 e []
There is a built-in function to check whether something is a list: it is called isinstance(mydata, list) and returns True or False respectively.
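A quick demonstration of that check (note that strings are iterable but are not lists):

>>> isinstance(['a', 'b'], list)
True
>>> isinstance('abc', list)
False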

Efficiently find index of DataFrame values in array

I have a DataFrame that resembles:
x y z
--------------
0 A 10
0 D 13
1 X 20
...
and I have two sorted arrays for every possible value for x and y:
x_values = [0, 1, ...]
y_values = ['a', ..., 'A', ..., 'D', ..., 'X', ...]
so I wrote a function:
import numpy as np

def lookup(record, lookup_list, lookup_attr):
    return np.searchsorted(lookup_list, getattr(record, lookup_attr))
and then call:
df_x_indicies = df.apply(lambda r: lookup(r, x_values, 'x'), axis=1)
df_y_indicies = df.apply(lambda r: lookup(r, y_values, 'y'), axis=1)
# df_x_indicies: [0, 0, 1, ...]
# df_y_indicies: [26, ...]
but is there a more performant way to do this? And can it be done for multiple columns at once, returning a DataFrame rather than a Series?
I tried:
np.where(np.in1d(x_values, df.x))[0]
but this removes duplicate values and that is not desired.
You can convert your index arrays to pd.Index objects to make lookup fast(er).
u, v = map(pd.Index, [x_values, y_values])
pd.DataFrame({'x': u.get_indexer(df.x), 'y': v.get_indexer(df.y)})
x y
0 0 1
1 0 2
2 1 3
Where,
x_values
# [0, 1]
y_values
# ['a', 'A', 'D', 'X']
As to your requirement of having this work for multiple columns, you will have to iterate over each one. Here's a version of the code above that should generalise to N columns and indices.
val_list = [x_values, y_values] # [x_values, y_values, z_values, ...]
idx_list = map(pd.Index, val_list)
pd.DataFrame({
    f'{c}': idx.get_indexer(df[c]) for idx, c in zip(idx_list, df)})
x y
0 0 1
1 0 2
2 1 3
Update: using a Series with .loc; you may also try reindex:
pd.Series(range(len(x_values)),index=x_values).loc[df.x].tolist()
Out[33]: [0, 0, 1]
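A sketch of the reindex variant mentioned above (same x_values and df as in the question); reindex keeps duplicates, just like .loc:

pos = pd.Series(range(len(x_values)), index=x_values)
pos.reindex(df.x).tolist()
# [0, 0, 1]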

Inconsistent behavior of jitted function

I have a very simple function like this one:
import numpy as np
from numba import jit
import pandas as pd
@jit
def f_(n, x, y, z):
    for i in range(n):
        z[i] = x[i] * y[i]
f_(df.shape[0], df["x"].values, df["y"].values, df["z"].values)
To which I pass
df = pd.DataFrame({"x": [1, 2, 3], "y": [3, 4, 5], "z": np.NaN})
I expected that the function would modify the data in the z column in place, like this:
>>> f_(df.shape[0], df["x"].values, df["y"].values, df["z"].values)
>>> df
x y z
0 1 3 3.0
1 2 4 8.0
2 3 5 15.0
This works fine most of the time, but somehow fails to modify the data in other cases.
I double checked things and:
I haven't determined any problems with data points which could cause this problem.
I see that data is modified as expected when I print the result.
If I return z array from the function it is modified as expected.
Unfortunately I couldn't reduce the problem to a minimal reproducible case. For example removing unrelated columns seems to "fix" the problem making reduction impossible.
Do I use jit in a way that is not intended to be used? Are there any border cases I should be aware of? Or is it likely to be a bug?
Edit:
I found the source of the problem. It occurs when data contains duplicated column names:
>>> df_ = pd.read_json('{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"v","type":"integer"},{"name":"y","type":"integer"},
... {"name":"v","type":"integer"},{"name":"x","type":"integer"},{"name":"z","type":"number"}],"primaryKey":["index"],"pandas_version":"0.20.
... 0"}, "data": [{"index":0,"v":0,"y":3,"v":0,"x":1,"z":null}]}', orient="table")
>>> f_(df_.shape[0], df_["x"].values, df_["y"].values, df_["z"].values)
>>> df_
v y v x z
0 0 3 0 1 NaN
If duplicate is removed the function works like expected:
>>> df_.drop("v", axis="columns", inplace=True)
>>> f_(df_.shape[0], df_["x"].values, df_["y"].values, df_["z"].values)
>>> df_
y x z
0 3 1 3.0
Ah, that's because in your "failing case" the df["z"].values returns a copy of what is stored in the 'z' column of df. It has nothing to do with the numba function:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
>>> np.shares_memory(df['z'].values, df['z'])
False
While in the "working case" it's a view into the 'z' column:
>>> df = pd.DataFrame([[0, 3, 1, np.nan]], columns=['v', 'y', 'x', 'z'])
>>> np.shares_memory(df['z'].values, df['z'])
True
NB: It's actually quite funny that this works, because the copy is made when you do df['z'] not when you access the .values.
The take-away here is that you cannot expect that indexing a DataFrame or accessing the .values of a Series will always return a view. So updating the column in-place may not change the values of the original. Not only duplicate column names could be a problem. When the property values returns a copy and when it returns a view is not always clear (except for pd.Series then it's always a view). But these are just implementation details. So it's never a good idea to rely on a specific behavior here. The only guarantee that .values is making is that it returns a numpy.ndarray containing the same values.
However it's pretty easy to avoid that problem by simply returning the modified z column from the function:
import numba as nb
import numpy as np
import pandas as pd

@nb.njit
def f_(n, x, y, z):
    for i in range(n):
        z[i] = x[i] * y[i]
    return z  # this is new
Then assign the result of the function to the column:
>>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
>>> df['z'] = f_(df.shape[0], df["x"].values, df["y"].values, df["z"].values)
>>> df
v y v x z
0 0 3 0 1 3.0
>>> df = pd.DataFrame([[0, 3, 1, np.nan]], columns=['v', 'y', 'x', 'z'])
>>> df['z'] = f_(df.shape[0], df["x"].values, df["y"].values, df["z"].values)
>>> df
v y x z
0 0 3 1 3.0
In case you're interested in what happened in your specific case currently (as I mentioned, we're talking about implementation details here, so don't take this as given; it's just the way it's implemented now): if you have a DataFrame, it will store the columns that have the same dtype in a single multidimensional NumPy array. This can be seen if you access the blocks attribute (deprecated, because the internal storage may change in the near future):
>>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
>>> df.blocks
{'float64':
z
0 NaN
,
'int64':
v y v x
0 0 3 0 1}
Normally it's very easy to create a view into that block by translating the column name to the column index of the corresponding block. However, if you have a duplicate column name, accessing an arbitrary column cannot be guaranteed to be a view. For example, if you want to access 'v', it has to index the int64 block at positions 0 and 2:
>>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
>>> df['v']
v v
0 0 0
Technically it could be possible to index the non-duplicated columns as views (and in this case even the duplicated column, for example by using Int64Block[::2], but that's a very special case...). Pandas opts for the safe option of always returning a copy if there are duplicate column names (which makes sense if you think about it: why should indexing one column return a view while another returns a copy?). The indexing of the DataFrame has an explicit check for duplicate columns and treats them differently (resulting in copies):
def _getitem_column(self, key):
    """ return the actual column """
    # get column
    if self.columns.is_unique:
        return self._get_item_cache(key)

    # duplicate columns & possible reduce dimensionality
    result = self._constructor(self._data.get(key))
    if result.columns.is_unique:
        result = result[key]

    return result
columns.is_unique is the important check here. It's True for your "normal case" but False for the "failing case".
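A minimal sketch of that check from user code, which is a quick way to predict whether .values will give you a view (assuming the frames and imports from above):

>>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
>>> df.columns.is_unique
False
>>> np.shares_memory(df['z'].values, df['z'])
False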

Align python arrays with missing data

I have some time series data, say:
# [ [time] [ data ] ]
a = [[0,1,2,3,4],['a','b','c','d','e']]
b = [[0,3,4]['f','g','h']]
and I would like an output with some filler value, lets say None for now:
a_new = [[0,1,2,3,4],['a','b','c','d','e']]
b_new = [[0,1,2,3,4],['f',None,None,'g','h']]
Is there a built in function in python/numpy to do this (or something like this)? Basically I would like to have all of my time vectors of equal size so I can calculate statistics (np.mean) and deal with the missing data accordingly.
How about this? (I'm assuming your definition of b was a typo, and I'm also assuming you know in advance how many entries you want.)
>>> b = [[0,3,4], ['f','g','h']]
>>> b_new = [list(range(5)), [None] * 5]
>>> for index, value in zip(*b): b_new[1][index] = value
>>> b_new
[[0, 1, 2, 3, 4], ['f', None, None, 'g', 'h']]
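A small generalization of the same idea as a reusable helper (hypothetical name, assuming the target length is known in advance):

def align(series, length, filler=None):
    # scatter the known (time, value) pairs into a fixed-length timeline
    times, values = series
    out = [filler] * length
    for t, v in zip(times, values):
        out[t] = v
    return [list(range(length)), out]

>>> align([[0, 3, 4], ['f', 'g', 'h']], 5)
[[0, 1, 2, 3, 4], ['f', None, None, 'g', 'h']]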
smarx has a fine answer, but pandas was made exactly for things like this.
import pandas as pd

# your data
a = [[0,1,2,3,4],['a','b','c','d','e']]
b = [[0,3,4],['f','g','h']]
# make an empty DataFrame (can do this faster but I'm going slow so you see how it works)
df_a = pd.DataFrame()
df_a['time'] = a[0]
df_a['A'] = a[1]
df_a.set_index('time',inplace=True)
# same for b (a faster way this time)
df_b = pd.DataFrame({'B':b[1]}, index=b[0])
# now merge the two Series together (the NaNs are in the right place)
df = pd.merge(df_a, df_b, left_index=True, right_index=True, how='outer')
In [28]: df
Out[28]:
A B
0 a f
1 b NaN
2 c NaN
3 d g
4 e h
Now the fun is just beginning. Within a DataFrame you can
compute all of your summary statistics (e.g. df.mean())
make plots (e.g. df.plot())
slice/dice your data basically however you want (e.g df.groupby())
Fill in or drop missing data using a specified method (e.g. df.fillna()),
take quarterly or monthly averages (e.g. df.resample()) and a lot more.
If you're just getting started (sorry for the infomercial if you aren't), I recommend reading 10 minutes to pandas for a quick overview.
Here's a vectorized NumPythonic approach -
import numpy as np

def align_arrays(A):
    time, data = A
    time_new = np.arange(np.max(time) + 1)
    data_new = np.full(time_new.size, None, dtype=object)  # filler value
    data_new[np.in1d(time_new, time)] = data
    return time_new, data_new
Sample runs -
In [113]: a = [[0,1,2,3,4],['a','b','c','d','e']]
In [114]: align_arrays(a)
Out[114]: (array([0, 1, 2, 3, 4]), array(['a', 'b', 'c', 'd', 'e'], dtype=object))
In [115]: b = [[0,3,4],['f','g','h']]
In [116]: align_arrays(b)
Out[116]: (array([0, 1, 2, 3, 4]), array(['f', None, None, 'g', 'h'], dtype=object))

Prevent Pandas from unpacking a tuple when creating a dataframe from dict

When creating a DataFrame in Pandas from a dictionary, a tuple is automatically expanded, i.e.
import pandas
d = {'a': 1, 'b': 2, 'c': (3,4)}
df = pandas.DataFrame.from_dict(d)
print(df)
returns
a b c
0 1 2 3
1 1 2 4
Apart from converting the tuple to a string first, is there any way to prevent this from happening? I would want the result to be:
a b c
0 1 2 (3, 4)
Try adding [], so the value in the dictionary under key c is a list containing the tuple:
import pandas
d = {'a': 1, 'b': 2, 'c': [(3,4)]}
df = pandas.DataFrame.from_dict(d)
print(df)
a b c
0 1 2 (3, 4)
Pass param orient='index' and transpose the result so it doesn't broadcast the scalar values:
In [13]:
d = {'a': 1, 'b': 2, 'c': (3,4)}
df = pd.DataFrame.from_dict(d, orient='index').T
df
Out[13]:
a c b
0 1 (3, 4) 2
To handle the situation where the first dict entry is a tuple, you'd need to enclose all the dict values in lists so each one is iterable:
In [20]:
d = {'a': (5,6), 'b': 2, 'c': 1}
d1 = dict(zip(d.keys(), [[x] for x in d.values()]))
pd.DataFrame.from_dict(d1, orient='index').T
Out[23]:
a b c
0 (5, 6) 2 1
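As a further sketch (an alternative not shown in the answers above): wrapping the dict in a list makes pandas treat it as a single row, so the tuple is kept as one object cell:

>>> d = {'a': 1, 'b': 2, 'c': (3, 4)}
>>> pd.DataFrame([d])
   a  b       c
0  1  2  (3, 4)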
