Pandas groupby multiple columns to compare values - python

My df looks like this: (There are dozens of other columns in the df but these are the three I am focused on)
Param Value Limit
A 1.50 1
B 2.50 1
C 2.00 2
D 2.00 2.5
E 1.50 2
I am trying to use pandas to count, per [Param], how many [Value] entries are less than [Limit], hoping to get a list like this:
Param Count
A 1
B 1
C 1
D 0
E 0
I've tried with a few methods, the first being
value_count = df.loc[df['Value'] < df['Limit']].count()
but this just gives the full count per column in the df.
I've also tried the groupby function, which I think could be the right idea, after creating a subset of the df with the chosen rows
df_below_limit = df[df['Value'] < df['Limit']]
df_below_limit.groupby('Param')['Value'].count()
This is nearly what I want, but it leaves out the Params with no matching rows, which I also need. I'm not sure how to get the list in the form I need.

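Assuming you want, per Param, the number of rows where Value is at least Limit (which is what your expected output shows), you can compare the two columns first and then sum the boolean result per group: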
out = df['Value'].ge(df['Limit']).groupby(df['Param']).sum()
output:
Param
A 1
B 2
C 1
D 0
E 0
dtype: int64
used input (with a duplicated row "B" for the example):
Param Value Limit
0 A 1.5 1.0
1 B 2.5 1.0
2 B 2.5 1.0
3 C 2.0 2.0
4 D 2.0 2.5
5 E 1.5 2.0
as DataFrame
df['Value'].ge(df['Limit']).groupby(df['Param']).sum().reset_index(name='Count')
# or
df['Value'].ge(df['Limit']).groupby(df['Param']).agg(Count='sum').reset_index()
output:
Param Count
0 A 1
1 B 2
2 C 1
3 D 0
4 E 0
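Why this keeps the zero rows: the comparison is done on the full frame before grouping, so every Param still has at least one (possibly False) entry to sum. A minimal self-contained sketch, assuming the six-row input shown above; the names at_or_above, counts and below are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Param": ["A", "B", "B", "C", "D", "E"],
    "Value": [1.5, 2.5, 2.5, 2.0, 2.0, 1.5],
    "Limit": [1.0, 1.0, 1.0, 2.0, 2.5, 2.0],
})

# Boolean mask over ALL rows; no row is filtered out before grouping
at_or_above = df["Value"].ge(df["Limit"])

# Summing booleans counts the True values; groups with none still show 0
counts = at_or_above.groupby(df["Param"]).sum().reset_index(name="Count")
print(counts)

# Swapping ge for lt gives the strict "Value < Limit" count instead
below = df["Value"].lt(df["Limit"]).groupby(df["Param"]).sum()
print(below)
```

Filtering first and grouping afterwards (as in the question) drops any Param that has no matching row, which is why those groups disappear there.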

Related

I want to add sub-index in python with pandas [duplicate]

When using groupby(), how can I create a DataFrame with a new column containing an index of the group number, similar to dplyr::group_indices in R. For example, if I have
>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
a b
0 1 1
1 1 1
2 1 2
3 2 1
4 2 1
5 2 2
How can I get a DataFrame like
a b idx
0 1 1 1
1 1 1 1
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 4
(the order of the idx indexes doesn't matter)
Here is the solution using ngroup (available as of pandas 0.20.2) from a comment above by Constantino.
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
df['idx'] = df.groupby(['a', 'b']).ngroup()
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
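If the 1-based numbering from the question is wanted, adding 1 to the ngroup result should be enough. A minimal sketch on the same frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})

# ngroup() numbers the groups from 0 in sorted key order; shift to start at 1
df['idx'] = df.groupby(['a', 'b']).ngroup() + 1
print(df)
```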
Here's a concise way using drop_duplicates and merge to get a unique identifier.
group_vars = ['a','b']
df.merge( df.drop_duplicates( group_vars ).reset_index(), on=group_vars )
a b index
0 1 1 0
1 1 1 0
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 5
The identifier in this case goes 0,2,3,5 (just a residual of original index) but this could be easily changed to 0,1,2,3 with an additional reset_index(drop=True).
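A sketch of that variant, assuming the same frame; the name keys and the column name idx are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})
group_vars = ['a', 'b']

# One row per unique (a, b) pair; drop the old index so the pairs are
# renumbered 0..n-1, then expose that numbering as a column called idx
keys = (
    df.drop_duplicates(group_vars)
      .reset_index(drop=True)
      .reset_index()
      .rename(columns={'index': 'idx'})
)

# Attach the identifier to every original row
out = df.merge(keys, on=group_vars)
print(out)
```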
Update: Newer versions of pandas (0.20.2) offer a simpler way to do this with the ngroup method as noted in a comment to the question above by @Constantino and a subsequent answer by @CalumYou. I'll leave this here as an alternate approach but ngroup seems like the better way to do this in most cases.
A simple way to do that would be to concatenate your grouping columns (so that each combination of their values represents a uniquely distinct element), then convert it to a pandas Categorical and keep only its codes:
df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Edit: changed the labels property to codes, as the former seems to be deprecated
Edit2: Added a separator as suggested by Authman Apatira
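The separator matters because plain string concatenation can make two different pairs look identical. A small sketch with made-up values chosen to collide (1 and 12 versus 11 and 2):

```python
import pandas as pd

# Hypothetical data: (1, 12) and (11, 2) are different groups
df = pd.DataFrame({'a': [1, 11], 'b': [12, 2]})

# Without a separator both rows become the string '112'
no_sep = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes

# With a separator the strings are '1_12' and '11_2', so they stay distinct
with_sep = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes

print(no_sep[0] == no_sep[1])      # the two groups collapse into one code
print(with_sep[0] == with_sep[1])  # the two groups keep separate codes
```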
Definitely not the most straightforward solution, but here is what I would do (comments in the code):
df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
#create a dummy grouper id by just joining desired rows
df["idx"] = df[["a","b"]].astype(str).apply(lambda x: "".join(x),axis=1)
print(df)
That would generate a unique idx for each combination of a and b.
a b idx
0 1 1 11
1 1 1 11
2 1 2 12
3 2 1 21
4 2 1 21
5 2 2 22
But this is still a rather silly index (think about some more complex values in columns a and b). So let's clean up the index:
# create a dictionary of dummy group_ids and their index-wise representation
# (sorted so the numbering is reproducible; a bare set has no guaranteed order)
dict_idx = dict(enumerate(sorted(set(df["idx"]))))
# switch keys and values, so each dummy id maps to its number
dict_idx = {y: x for x, y in dict_idx.items()}
# replace values with the generated dict
df["idx"] = df["idx"].map(dict_idx)
print(df)
That would produce the desired output:
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
A way that I believe is faster than the current accepted answer by about an order of magnitude (timing results below):
def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
    df.sort_values(grouping_cols, inplace=True)
    # You could do the following three lines in one, I just thought
    # this would be clearer as an explanation of what's going on:
    duplicated = df.duplicated(subset=grouping_cols, keep='first')
    new_group = ~duplicated
    return new_group.cumsum()
Timing results:
import numpy as np
import pandas as pd
a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})
In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop
In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop
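Note that the function above sorts the caller's frame in place and numbers groups from 1. A possible non-mutating variant is sketched below, assuming the index labels are unique; the function name group_index_via_duplicated is hypothetical, and no timing claim is made for it:

```python
import pandas as pd

def group_index_via_duplicated(df, grouping_cols):
    # Work on a sorted copy so the caller's frame is left untouched
    ordered = df.sort_values(grouping_cols)
    # The first occurrence of each key combination starts a new group
    new_group = ~ordered.duplicated(subset=grouping_cols, keep='first')
    # Running total of group starts gives 1-based group numbers;
    # reindex puts them back in the original row order
    return new_group.cumsum().reindex(df.index)

df = pd.DataFrame({'a': [2, 1, 2, 1], 'b': [1, 1, 1, 2]})
result = group_index_via_duplicated(df, ['a', 'b'])
print(result)
```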
I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts the grouping columns and then checks whether each row is different than the previous row and if so accumulates by 1. Check further below for an answer with string data.
df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(axis=1).cumsum().add(1)
Output
0 1
1 1
2 2
3 3
4 3
5 4
dtype: int64
Breaking this up into steps, let's look at the output of df.sort_values(['a', 'b']).diff().fillna(0), which checks whether each row differs from the previous row. Any non-zero entry indicates a new group.
a b
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 1.0 -1.0
4 0.0 0.0
5 0.0 1.0
A new group only needs to differ in a single column, so this is what .ne(0).any(axis=1) checks: not equal to 0 in any of the columns. Then a cumulative sum keeps track of the groups.
Answer for columns as strings
#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])
output of df1
a b
0 a a
1 a a
4 a a
3 b a
2 b b
5 c c
6 c d
8 c d
7 d d
Take a similar approach by checking whether the group has changed
df1.ne(df1.shift().bfill()).any(axis=1).cumsum().add(1)
0 1
1 1
4 1
3 2
2 3
5 4
6 5
8 5
7 6
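As a sanity check, the shift-based numbering can be compared with ngroup on the same sorted string data. A minimal sketch, where shift_based and ngroup_based are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'a': list('aabbaccdc'), 'b': list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])

# A row starts a new group when it differs from the row before it;
# bfill makes the first row compare equal to itself
shift_based = df1.ne(df1.shift().bfill()).any(axis=1).cumsum().add(1)

# ngroup numbers the sorted key combinations from 0, so add 1 to compare
ngroup_based = df1.groupby(['a', 'b']).ngroup() + 1

print((shift_based == ngroup_based).all())
```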

Python Dataframe filling up non existing

I was wondering if there is an efficient way to add rows to a DataFrame that, e.g., contain the average or a predefined value when there are not enough rows for a specific value in another column. My description of the problem is probably not the best, so here is an example:
Say we have the Dataframe
df1
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
We want to have 2 rows for each client A, B, C, D, whether or not these 2 rows already exist. For clients A and B we can just copy the rows. For C we want to add a row with Client = C and NumberOfProducts = average of the existing rows = 9; ID is not of interest (so we could set it to the smallest existing ID - 1 = 0, but any other value, even NaN, would also be fine). For client D there is not a single row, so we want to add 2 rows where NumberOfProducts is equal to the constant 2.5. The output should then look like this:
df1
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
C 9 0
D 2.5 NaN
D 2.5 NaN
What I have done so far is to loop through the dataframe and add rows where necessary. Since this is pretty inefficient any better solution would be highly appreciated.
Use:
clients = ['A','B','C','D']
N = 2
#test only values from list and also filter only 2 rows for each client if necessary
df = df[df['Client'].isin(clients)].groupby('Client').head(N)
#create helper counter and reshape by unstack
df1 = df.set_index(['Client',df.groupby('Client').cumcount()]).unstack()
#set first if only 1 row per client - replace second NumberOfProducts by first
df1[('NumberOfProducts',1)] = df1[('NumberOfProducts',1)].fillna(df1[('NumberOfProducts',0)])
# ... replace second ID by first subtracted by 1
df1[('ID',1)] = df1[('ID',1)].fillna(df1[('ID',0)] - 1)
#add missing clients by reindex
df1 = df1.reindex(clients)
#replace NumberOfProducts by constant 2.5
df1['NumberOfProducts'] = df1['NumberOfProducts'].fillna(2.5)
print (df1)
NumberOfProducts ID
0 1 0 1
Client
A 1.0 5.0 2.0 1.0
B 1.0 6.0 2.0 1.0
C 9.0 9.0 1.0 0.0
D 2.5 2.5 NaN NaN
#last reshape to original
df2 = df1.stack().reset_index(level=1, drop=True).reset_index()
print (df2)
Client NumberOfProducts ID
0 A 1.0 2.0
1 A 5.0 1.0
2 B 1.0 2.0
3 B 6.0 1.0
4 C 9.0 1.0
5 C 9.0 0.0
6 D 2.5 NaN
7 D 2.5 NaN
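The key step above is the helper counter: cumcount numbers the rows within each client, and unstack then turns those numbers into columns, so a missing second row shows up as NaN that can be filled. A small sketch of just that step on the sample data; counter and wide are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    'Client': ['A', 'A', 'B', 'B', 'C'],
    'NumberOfProducts': [1, 5, 1, 6, 9],
    'ID': [2, 1, 2, 1, 1],
})

# Position of each row within its client: 0 for the first, 1 for the second
counter = df.groupby('Client').cumcount()

# One row per client, one column per (field, position) pair
wide = df.set_index(['Client', counter]).unstack()
print(wide)

# Client C has only one row, so its position-1 slot is NaN
print(wide.loc['C', ('NumberOfProducts', 1)])
```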

Is there a faster way to count values across multiple columns, excluding duplicated values on the same row?

Given the following df
id val1 val2 val3
0 1 A A B
1 1 A B B
2 1 B C NaN
3 1 NaN B D
4 2 A D NaN
I would like to sum the value counts within each id group for all columns; however, I need to only count values that appear on the same row once, so the expected output is:
id
1 B 4
A 2
C 1
D 1
2 A 1
D 1
I can accomplish this with
import pandas as pd
df.set_index('id').apply(lambda x: list(set(x)), axis=1).apply(pd.Series).stack().groupby(level=0).value_counts()
but the apply(...axis=1) (and perhaps apply(pd.Series)) really kills the performance on large DataFrames. Since I have a small number of columns, I guess I could just check for all pairwise duplicates, replace one with np.nan and then just use df.set_index('id').stack().groupby(level=0).value_counts(), but that doesn't seem like the right approach when the number of columns gets large.
Any ideas on a faster way around this?
Here are the missing steps that remove row duplicates from your dataframe:
nodups = df.stack().reset_index(level=0).drop_duplicates()
nodups = nodups.set_index(['level_0', nodups.index]).unstack()
nodups.columns = nodups.columns.levels[1]
# id val1 val2 val3
#level_0
#0 1 A None B
#1 1 A B None
#2 1 B C None
#3 1 None B D
#4 2 A D None
Now you can follow with:
nodups.set_index('id').stack().groupby(level=0).value_counts()
Perhaps you can further optimize the code.
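One possible way to fold the deduplication and the counting together is to go to long format with melt and drop repeated values per original row. This is a variation on the idea above rather than the answerer's code; it assumes pandas 1.1 or later (for ignore_index), and the column name row is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 1, 1, 2],
    'val1': ['A', 'A', 'B', None, 'A'],
    'val2': ['A', 'B', 'C', 'B', 'D'],
    'val3': ['B', 'B', None, 'D', None],
})

# Long format: one line per (original row, column) pair, keeping the
# original row label so duplicates within a row can be detected
long = (
    df.melt(id_vars='id', ignore_index=False)
      .dropna(subset=['value'])
      .rename_axis('row')
      .reset_index()
)

# Keep each value at most once per original row, then count per id
counts = (
    long.drop_duplicates(['row', 'value'])
        .groupby('id')['value']
        .value_counts()
)
print(counts)
```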
I am using get_dummies
s=df.set_index('id',append=True).stack().str.get_dummies().groupby(level=[0,1]).sum().gt(0).groupby(level=1).sum().stack().astype(int)
s[s.gt(0)]
Out[234]:
id
1 A 2
B 4
C 1
D 1
2 A 1
D 1
dtype: int32

Pandas data frame mean by variables

I have a data frame
a = pd.DataFrame({'a':[1,2,3,4], 'b':[1,1,2,2], 'c':[1,1,1,2]})
>>> a
a b c
0 1 1 1
1 2 1 1
2 3 2 1
3 4 2 2
I would like to compute the mean of a after grouping by the values of b and c.
So I should split the data into 3 groups:
b=1,c=1
b=2,c=1
b=2,c=2
and then compute the mean of a in each group.
How can I do that?
I suspect that I have to use groupby but I do not understand how.
You can group by multiple columns by passing a list of the column names; then it's just a simple case of calling mean on the groupby object:
In [4]:
a.groupby(['b','c']).mean()
Out[4]:
a
b c
1 1 1.5
2 1 3.0
2 4.0
If you want to restore the columns that were grouped by back as columns, just call reset_index():
In [5]:
a.groupby(['b','c']).mean().reset_index()
Out[5]:
b c a
0 1 1 1.5
1 2 1 3.0
2 2 2 4.0
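The same flat result can also be requested directly with as_index=False, which keeps the grouping keys as ordinary columns. A short sketch on the same data, with flat as an illustrative name:

```python
import pandas as pd

a = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 1, 2, 2], 'c': [1, 1, 1, 2]})

# as_index=False returns b and c as columns instead of a MultiIndex
flat = a.groupby(['b', 'c'], as_index=False)['a'].mean()
print(flat)
```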
