Pandas GroupBy Index - python

I have a dataframe with a column that I want to group by. Within each group, I want to check whether the first value is less than the second value times some scalar, e.g. (x < y * .5). If it is, the first value is set to True and all other values to False. Otherwise, all values are set to False.
I have a sample data frame here:
import numpy as np
import pandas as pd

d = pd.DataFrame(np.array([[0, 0, 1, 1, 2, 2, 2],
                           [3, 4, 5, 6, 7, 8, 9],
                           [1.25, 10.1, 2.3, 2.4, 1.2, 5.5, 5.7]]).T,
                 columns=['a', 'b', 'c'])
I can get a stacked groupby that pulls out the data I want per value of 'a':
g = d.groupby('a')['c'].nsmallest(2).groupby(level='a')
This results in three groups, each with 2 entries. By adding an apply, I can call a function to return a boolean mask:
def func(group):
    if group.iloc[0] < group.iloc[1] * .5:
        return [True, False]
    else:
        return [False, False]
g = d.groupby('a')['c'].nsmallest(2).groupby(level='a').apply(func)
Unfortunately, this destroys the index into the original dataframe and removes the ability to handle cases where more than 2 elements are present.
Two questions:
Is it possible to maintain the index into the original dataframe and update a column with the results of a groupby? This is made slightly tricky because the .nsmallest call results in a Series on the 'c' column.
Does a more elegant method exist to compute a boolean array for groups in a dataframe based on some custom criterion, e.g. this ratio test?

Looks like transform is what you need:
>>> def func(group):
...     res = [False] * len(group)
...     if group.iloc[0] < group.iloc[1] * .5:
...         res[0] = True
...     return res
>>> d['res'] = d.groupby('a')['c'].transform(func).astype('bool')
>>> d
   a  b      c    res
0  0  3   1.25   True
1  0  4  10.10  False
2  1  5   2.30  False
3  1  6   2.40  False
4  2  7   1.20   True
5  2  8   5.50  False
6  2  9   5.70  False
From the documentation:
The transform method returns an object that is indexed the same (same
size) as the one being grouped. Thus, the passed transform function
should return a result that is the same size as the group chunk. For
example, suppose we wished to standardize the data within each group
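As for the second question, here is one sketch that keeps the original index, copes with groups of more than two rows, and avoids the nested groupby. It assumes the group minimum in 'c' is unique, since the winning row is matched back by value:
import numpy as np
import pandas as pd

d = pd.DataFrame(np.array([[0, 0, 1, 1, 2, 2, 2],
                           [3, 4, 5, 6, 7, 8, 9],
                           [1.25, 10.1, 2.3, 2.4, 1.2, 5.5, 5.7]]).T,
                 columns=['a', 'b', 'c'])

grp = d.groupby('a')['c']
smallest = grp.transform('min')                            # group minimum, aligned to d
second = grp.transform(lambda s: s.nsmallest(2).iloc[-1])  # second-smallest per group
# True only on the row holding the minimum, and only if the ratio test passes
d['res'] = (d['c'] == smallest) & (smallest < 0.5 * second)
This produces the same 'res' column as the transform answer above while reading closer to the original (x < y * .5) formulation.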

Related

Why numpy .isin function gives incorrect output

My requirement is that I have a large dataframe with millions of rows. I encoded all strings to numeric values in order to use numpy's vectorization to increase processing speed.
So I was looking for a way to quickly check whether a number exists in another list column. Previously I was using a list comprehension with string values, but after converting to np.arrays I was looking for a similar function.
I stumbled across this link: check if values of a column are in values of another numpy array column in pandas
In order to use numpy.isin, I tried running the code below:
import numpy as np
import pandas as pd

dt = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1, 2, 5, 1, 2],
                   'col_b': [2, 2, [2, 5, 4], 4, [1, 5, 6, 3, 2]]})
dt
  id  col_a            col_b
0  a      1                2
1  a      2                2
2  a      5        [2, 5, 4]
3  b      1                4
4  b      2  [1, 5, 6, 3, 2]
When I enter:
np.isin(dt['col_a'], dt['col_b'])
The output is:
array([False, True, False, False, True])
This is incorrect, as the 3rd row has 5 in both columns col_a and col_b.
Whereas if I change the value to 4 as below:
dt = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1, 2, 4, 1, 2],
                   'col_b': [2, 2, [2, 5, 4], 4, [1, 5, 6, 3, 2]]})
dt
  id  col_a            col_b
0  a      1                2
1  a      2                2
2  a      4        [2, 5, 4]
3  b      1                4
4  b      2  [1, 5, 6, 3, 2]
and execute the same code:
np.isin(dt['col_a'], dt['col_b'])
I get the correct result:
array([False, True, True, False, True])
Can someone please let me know why it's giving different results?
Since col_b not only has lists but also integers, you may need to use apply and treat them differently:
dt.apply(lambda x: x['col_a'] in x['col_b'] if isinstance(x['col_b'], list)
         else x['col_a'] == x['col_b'], axis=1)
Output:
0    False
1     True
2     True
3    False
4     True
dtype: bool
np.isin for each element from dt['col_a'] checks whether it is present in the whole dt['col_b'] column, i.e.:
[
    1 in dt['col_b'],
    2 in dt['col_b'],
    5 in dt['col_b'],
    ...
]
There's no 5 among the elements of dt['col_b'] (each stored list counts as a single element and is never unpacked), but there is a 4.
From the docs
isin is an element-wise function version of the python keyword in. isin(a, b) is roughly equivalent to np.array([item in b for item in a]) if a and b are 1-D sequences.
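Spelling out the docs' equivalence on the example frame makes this visible (a small sketch):
import numpy as np

# Manual re-implementation of np.isin per the docs' equivalence above.
# Each col_a value is tested against the *elements* of col_b; a list
# stored in col_b is one element, so 5 never matches [2, 5, 4].
manual = np.array([a in dt['col_b'].values for a in dt['col_a']])
manual  # array([False,  True, False, False,  True]) -- same as np.isin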
Also, your issue is that you have an inconsistent dt['col_b'] column (some values are numbers, some are lists). I think the easiest approach is to use apply:
def isin(row):
    if isinstance(row['col_b'], int):
        return row['col_a'] == row['col_b']
    else:
        return row['col_a'] in row['col_b']

dt.apply(isin, axis=1)
Output:
0    False
1     True
2     True
3    False
4     True
dtype: bool
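For the millions-of-rows case mentioned in the question, a per-row apply can be slow. One sketch that keeps the membership test itself vectorized is to normalize col_b to lists and use DataFrame.explode:
# Normalize scalars to one-element lists so every row can be exploded
col_b_lists = dt['col_b'].apply(lambda v: v if isinstance(v, list) else [v])
exploded = col_b_lists.explode()                       # one row per list element
aligned = dt['col_a'].reindex(exploded.index)          # repeat col_a to match
result = (exploded == aligned).groupby(level=0).any()  # collapse back per row
result  # same answer as the apply-based versions above
The normalization step is still a Python-level apply, but the comparison and the final reduction are plain vectorized pandas operations.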

function any is not consistent when applied on columns or the whole dataframe in python

I have a dataframe that might contain NaN values.
import numpy as np
import pandas as pd

array = np.empty((4, 5))
array[:] = 10
df = pd.DataFrame(array)
df.iloc[1, 3] = np.NaN
df.isna().apply(lambda x: any(x), axis=0)
Output:
0    False
1    False
2    False
3     True
4    False
dtype: bool
When I run:
any(df.isna())
It returns:
True
If there are no NaNs:
array = np.empty((4, 5))
array[:] = 10
df = pd.DataFrame(array)
# df.iloc[1, 3] = np.NaN
df.isna().apply(lambda x: any(x), axis=0)
0    False
1    False
2    False
3    False
4    False
dtype: bool
However when I run:
any(df.isna())
It returns:
True
Why is this the case? Do I have a misunderstanding of the function any()?
Why is this the case? Do I have a misunderstanding of the function any()?
When you loop over a DataFrame you are actually iterating over its column labels, not its rows or values as you might think. More precisely, the for loop calls DataFrame.__iter__, which returns an iterator over the column labels of the DataFrame.
For instance, in the following
df = pd.DataFrame(columns=['a', 'b', 'c'])
for x in df:
    print(x)
# Output:
#
# a
# b
# c
x holds the name of each df column. You can also check the output of list(df).
This means that when you do any(df.isna()), under the hood any is actually iterating over the column labels of df and checking their truthiness. If at least one is truthy it returns True.
In both of your examples the column labels are numbers: list(df.isna()) = list(df.columns) = [0, 1, 2, 3, 4], of which only 0 is falsy. Therefore, in both cases any(df.isna()) = True.
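A quick way to convince yourself, using the frames above:
list(df.isna())  # [0, 1, 2, 3, 4] -- just the column labels
any(df.isna())   # any([0, 1, 2, 3, 4]) -> True, because e.g. 1 is truthy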
Solution
The solution is to use DataFrame.any with axis=None instead of using the built-in any function.
df.isna().any(axis=None)
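Run on the two frames above, this returns the expected answers:
df.isna().any(axis=None)  # True for the frame with df.iloc[1, 3] = np.NaN,
                          # False for the frame without the NaN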

GroupBy aggregate function that computes two values at once

I have a dataframe like the following:
import pandas as pd
df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 2],
    'B': [1, 2, 3, 4, 5, 6],
    'C': [4, 5, 6, 7, 8, 9],
})
Now I want to group and aggregate with two values being produced per group. The result should be similar to the following:
expected = df.groupby('A').agg([min, max])
# B C
# min max min max
# A
# 1 1 3 4 6
# 2 4 6 7 9
However, in my case, instead of two distinct functions min and max, I have one function that computes these two values at once:
def minmax(x):
    """This function promises to compute the min and max in one go."""
    return min(x), max(x)
Now my question is, how can I use this one function to produce two aggregation values per group?
It's kind of related to this answer, but I couldn't figure out how to do it. The best I could come up with is a doubly-nested apply; however, this is not very elegant, and it also produces the multi-index on the rows rather than on the columns:
result = df.groupby('A').apply(
    lambda g: g.drop(columns='A').apply(
        lambda h: pd.Series(dict(zip(['min', 'max'], minmax(h))))
    )
)
#        B  C
# A
# 1 min  1  4
#   max  3  6
# 2 min  4  7
#   max  6  9
If you are stuck with a function that returns a tuple of values, I'd:
Define a new function that wraps the tuple values into a dict such that you predefine the dict.keys() to align with what you want the column names to be.
Use a careful for loop that doesn't waste time and space.
Wrap Function
# Given function
def minmax(x):
    """This function promises to compute the min and max in one go."""
    return min(x), max(x)

# Wrapped function
def minmax_dict(x):
    return dict(zip(['min', 'max'], minmax(x)))
Careful for loop
I'm aiming to pass this dictionary into the pd.DataFrame constructor. That means, I want tuples of the MultiIndex column elements in the keys. I want the values to be dictionaries with keys being the index elements.
dat = {}
for a, d in df.set_index('A').groupby('A'):
    for cn, c in d.items():  # items() yields (column name, column) pairs
        for k, v in minmax_dict(c).items():
            dat.setdefault((cn, k), {})[a] = v

pd.DataFrame(dat).rename_axis('A')
    B        C
  min max  min max
A
1   1   3    4   6
2   4   6    7   9
Added Detail
Take a look at the crafted dictionary:
dat
{('B', 'min'): {1: 1, 2: 4},
 ('B', 'max'): {1: 3, 2: 6},
 ('C', 'min'): {1: 4, 2: 7},
 ('C', 'max'): {1: 6, 2: 9}}
One other solution:
pd.concat({k: d.agg(minmax).set_axis(['min', 'max'])
           for k, d in df.drop('A', axis=1).groupby(df['A'])})
Output:
       B  C
1 min  1  4
  max  3  6
2 min  4  7
  max  6  9
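Another sketch, if the column MultiIndex from the expected output matters: have a per-column worker return a labelled Series, then unstack, which moves the 'min'/'max' labels up into the columns:
def minmax_series(x):
    # wrap the given minmax() so each call yields a labelled Series
    return pd.Series(minmax(x), index=['min', 'max'])

out = df.groupby('A').apply(
    lambda g: g.drop(columns='A').apply(minmax_series).unstack()
)
# out:
#      B        C
#    min max  min max
# A
# 1    1   3    4   6
# 2    4   6    7   9
minmax is still called only once per column per group; unstack merely reshapes the result.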

replace empty list with NaN in pandas dataframe

I'm trying to replace some empty lists in my data with NaN values. But how do I represent an empty list in the expression?
import numpy as np
import pandas as pd
d = pd.DataFrame({'x' : [[1,2,3], [1,2], ["text"], []], 'y' : [1,2,3,4]})
d
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2     [text]  3
3         []  4
d.loc[d['x'] == [],['x']] = d.loc[d['x'] == [],'x'].apply(lambda x: np.nan)
d
ValueError: Arrays were different lengths: 4 vs 0
Also, I want to select [text] by using d[d['x'] == ["text"]], which gives the error ValueError: Arrays were different lengths: 4 vs 1, yet selecting 3 with d[d['y'] == 3] works correctly. Why?
If you wish to replace empty lists in the column x with numpy nan's, you can do the following:
d.x = d.x.apply(lambda y: np.nan if len(y)==0 else y)
If you want to subset the dataframe on rows equal to ['text'], try the following:
d[[y==['text'] for y in d.x]]
I hope this helps.
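For reference, after the replacement the frame looks like this (row 3 now holds NaN):
>>> d.x = d.x.apply(lambda y: np.nan if len(y) == 0 else y)
>>> d
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2     [text]  3
3        NaN  4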
You can use the apply function to match a cell value regardless of whether it is a string, a list, and so on.
For example, in your case:
import pandas as pd
d = pd.DataFrame({'x' : [[1,2,3], [1,2], ["text"], []], 'y' : [1,2,3,4]})
d
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2     [text]  3
3         []  4
if you use d == 3 to select the cell whose value is 3, it's totally ok:
       x      y
0  False  False
1  False  False
2  False   True
3  False  False
However, if you use the equals sign to match a list, e.g. d == ['text'] or d == '[text]', the result may not be what you expect and can even raise an error.
There are some solutions:
Use the function apply() on the specified Series in your DataFrame, just like the answer at the top.
A more general method is the function applymap() on the whole DataFrame, which may be used as a preprocessing step:
d.applymap(lambda x: x == [])
       x      y
0  False  False
1  False  False
2  False  False
3   True  False
Hopefully this helps you and future readers. Note that it is better to add a type check inside your applymap function, which could otherwise raise exceptions for some inputs.
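A sketch of that safer variant, checking the type before comparing:
# Only cells holding an actual (empty) list come back True; other types
# are never compared against a list, avoiding surprises.
d.applymap(lambda v: isinstance(v, list) and len(v) == 0)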
To answer your main question, just leave out the empty lists altogether. The NaNs will automatically be filled in wherever one column has a value and the other doesn't, provided you use pandas.concat instead of building a dataframe from a dictionary.
>>> import pandas as pd
>>> ser1 = pd.Series([[1,2,3], [1,2], ["text"]], name='x')
>>> ser2 = pd.Series([1,2,3,4], name='y')
>>> result = pd.concat([ser1, ser2], axis=1)
>>> result
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2     [text]  3
3        NaN  4
About your second question: it seems that you can't search inside an element like that. Perhaps you should make that a separate question, since it's not really related to your main question.

Greater/less than comparisons between Pandas DataFrames/Series

How can I perform comparisons between DataFrames and Series? I'd like to mask elements in a DataFrame/Series that are greater/less than elements in another DataFrame/Series.
For instance, the following doesn't replace elements greater than the mean
with nans although I was expecting it to:
>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x[x > x.mean(axis=1)] = np.nan
>>> x
   a  b
0  1  3
1  2  4
If we look at the boolean array created by the comparison, it is really weird:
>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x > x.mean(axis=1)
       a      b      0      1
0  False  False  False  False
1  False  False  False  False
I don't understand by what logic this boolean array is produced. I'm able to work around the problem by using transpose:
>>> (x.T > x.mean(axis=1).T).T
       a     b
0  False  True
1  False  True
But I believe there is some "correct" way of doing this that I'm not aware of. And at least I'd like to understand what is going on.
The problem here is that the comparison aligns on labels: the index values of the Series (0 and 1) are matched against the column labels of the DataFrame ('a' and 'b'), so nothing lines up. If you use .gt and pass axis=0, you get the result you desire:
In [203]:
x.gt(x.mean(axis=1), axis=0)
Out[203]:
       a     b
0  False  True
1  False  True
You can see what I mean when you perform the comparison with the np array:
In [205]:
x > x.mean(axis=1).values
Out[205]:
       a      b
0  False  False
1  False   True
Here you can see that with a raw array there is no label alignment: the array is broadcast along the columns, each row being compared element-wise against it, which produces yet another result.
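Putting the axis-aware comparison back into the masking idiom from the question gives the originally intended behavior (a small sketch):
import numpy as np
import pandas as pd

x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
# gt(..., axis=0) aligns the row means against the rows, not the columns
x[x.gt(x.mean(axis=1), axis=0)] = np.nan
# x is now:
#      a   b
# 0  1.0 NaN
# 1  2.0 NaN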