Count instances based on criteria with groupby() - python

I have a dataframe that looks like this:
In [60]: df1
Out[60]:
DIFF UID
0 NaN 1
1 13.0 1
2 4.0 1
3 NaN 2
4 3.0 2
5 23.0 2
6 NaN 3
7 4.0 3
8 29.0 3
9 42.0 3
10 NaN 4
11 3.0 4
and for each UID I want to calculate how many instances have a value for DIFF over a given threshold.
I have tried something like this:
In [61]: threshold = 5
In [62]: df1[df1.DIFF > threshold].groupby('UID')['DIFF'].count().reset_index().rename(columns={'DIFF':'ATTR_NAME'})
Out[62]:
UID ATTR_NAME
0 1 1
1 2 1
2 3 2
That works fine as far as returning the right count of instances per user. However, I would also like to include the users that have 0 instances, which are currently excluded by the df1[df1.DIFF > threshold] part.
The desired output would be:
UID ATTR_NAME
0 1 1
1 2 1
2 3 2
3 4 0
Any ideas?
Thanks

Simple, use .reindex:
req = df1[df1.DIFF > threshold].groupby('UID')['DIFF'].count()
req = req.reindex(df1.UID.unique(), fill_value=0).reset_index().rename(columns={'DIFF':'ATTR_NAME'})
In one line:
df1[df1.DIFF > threshold].groupby('UID')['DIFF'].count().reindex(df1.UID.unique(), fill_value=0).reset_index().rename(columns={'DIFF':'ATTR_NAME'})
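Note (my addition): without fill_value=0, reindex leaves NaN for the UIDs that have no matches; an equivalent spelling that fills afterwards and keeps integer counts would be:
df1[df1.DIFF > threshold].groupby('UID')['DIFF'].count().reindex(df1.UID.unique()).fillna(0).astype(int).reset_index().rename(columns={'DIFF':'ATTR_NAME'})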

Another way would be to use a function with apply() to do this:
In [101]: def count_instances(x, threshold):
   .....:     counter = 0
   .....:     for i in x:
   .....:         if i > threshold:
   .....:             counter += 1
   .....:     return counter
   .....:
In [102]: df1.groupby('UID')['DIFF'].apply(lambda x: count_instances(x, 5)).reset_index()
Out[102]:
UID DIFF
0 1 1
1 2 1
2 3 2
3 4 0
The two approaches appear to be roughly the same speed (this one came out marginally ahead in this run):
In [103]: %timeit df1.groupby('UID')['DIFF'].apply(lambda x: count_instances(x, 5)).reset_index()
100 loops, best of 3: 2.38 ms per loop
In [104]: %timeit df1[df1.DIFF > 5].groupby('UID')['DIFF'].count().reset_index()
100 loops, best of 3: 2.39 ms per loop

Would something like this work well for you?

Counting the values that match a criterion, without dropping the keys that have no matches, is equivalent to counting the number of True values per group, which can be done by summing a boolean Series:
(df1.DIFF > 5).groupby(df1.UID).sum().reset_index()
UID DIFF
0 1 1.0
1 2 1.0
2 3 2.0
3 4 0.0
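The counts show up as floats here because of how the grouped sum handles the booleans; if integer counts are preferred, one small tweak (my addition, assuming the same df1) is to cast at the end:
(df1.DIFF > 5).groupby(df1.UID).sum().astype(int).reset_index()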

Related

I want to add sub-index in python with pandas [duplicate]


Sum of count where values are less than row

I'm using Pandas to create a new column that, for each row, looks through the entire column of values [1-100] and counts how many are less than the current row's value.
See [df] example below:
A NewCol
1 0
3 2
2 1
5 4
8 5
3 2
Essentially, for each row I need to look at the entire Column A, and count how many values are less than the current row. So for Value 5, there are 4 values that are less (<) than 5 (1,2,3,3).
What would be the easiest way of doing this?
Thanks!
One way to do it is to use rank with method='min':
df['NewCol'] = (df['A'].rank(method='min') - 1).astype(int)
Output:
A NewCol
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
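A short note on why this works (my reading of rank, not part of the original answer): with method='min', tied values all receive the lowest rank in their group, so rank - 1 is exactly the number of strictly smaller values. The intermediate ranks for this A:
df['A'].rank(method='min')
0    1.0
1    3.0
2    2.0
3    5.0
4    6.0
5    3.0
Name: A, dtype: float64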
I am using numpy broadcasting:
s=df.A.values
# pairwise comparison: entry (i, j) is True when s[i] > s[j], so the row sum
# counts how many values are strictly smaller than s[i]
(s[:,None]>s).sum(1)
Out[649]: array([0, 2, 1, 4, 5, 2])
#df['NewCol']=(s[:,None]>s).sum(1)
timing
df=pd.concat([df]*1000)
%%timeit
s=df.A.values
(s[:,None]>s).sum(1)
10 loops, best of 3: 83.7 ms per loop
%timeit (df['A'].rank(method='min') - 1).astype(int)
1000 loops, best of 3: 479 µs per loop
Try this code:
A = [Your numbers]
less_than = []
for element in A:
    counter = 0
    for number in A:
        if number < element:
            counter += 1
    less_than.append(counter)
You can do it this way:
import pandas as pd
df = pd.DataFrame({'A': [1,3,2,5,8,3]})
df['NewCol'] = 0
for idx, row in df.iterrows():
    df.loc[idx, 'NewCol'] = (df.loc[:, 'A'] < row.A).sum()
print(df)
A NewCol
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
Another way is to sort and reset the index (note that tied values end up with distinct positions this way):
m=df.A.sort_values().reset_index(drop=True).reset_index()
m.columns=['new','A']
print(m)
new A
0 0 1
1 1 2
2 2 3
3 3 3
4 4 5
5 5 8
You didn't specify whether speed or memory usage was important (or whether you have a very large dataset). The "easiest" way to do it is straightforward: calculate how many values are less than i for each entry in the column and collect those into a new column:
df=pd.DataFrame({'A': [1,3,2,5,8,3]})
col=df['A']
df['new_col']=[ sum(col<i) for i in col ]
print(df)
Result:
A new_col
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
There might be more efficient ways to do this on large datasets, such as sorting your column first.
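To make the sorting idea concrete, here is one possible sketch (my own, not part of the original answer) using numpy.searchsorted on a sorted copy of the column; with side='left', the insertion position of each value equals the number of strictly smaller entries:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 2, 5, 8, 3]})
sorted_a = np.sort(df['A'].values)
# insertion index on the left side == count of values strictly less than each element
df['new_col'] = np.searchsorted(sorted_a, df['A'].values, side='left')
print(df)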

Cubic Root of Pandas DataFrame

I understand how to take the cube root of both positive and negative numbers. But when trying to use an apply/lambda approach to efficiently process all elements of a dataframe, I run into an ambiguity issue. Interestingly, this error does not arise with equalities, so I am wondering what could be wrong with the code:
sample[columns]=sample[columns].apply(lambda x: (-1)*np.power(-x,1./3) if x<0 else np.power(x,1./3))
It looks like you are passing a list or array of column names. I assume this because your variable name is plural, with an s at the end. If that is the case, then sample[columns] is a dataframe. This is an issue because apply iterates through the columns, passing each whole column to the lambda you gave it. So you get
(-1) * np.power(-series_object, 1./3) if series_object < 0 else ...
And it's the series_object < 0 that is messing things up because you are asking for the truthiness of a whole series being less than zero.
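A minimal sketch of what goes wrong (my own illustration, not from the answer): evaluating a whole Series in a boolean context raises the familiar ambiguity error:
import pandas as pd

s = pd.Series([-8.0, 27.0])
print(s < 0)  # element-wise comparison: a boolean Series, not a single bool

try:
    if s < 0:  # asking for the truthiness of the whole Series
        pass
except ValueError as err:
    print(err)  # The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().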
Use applymap instead, which applies the function elementwise:
f = lambda x: -np.power(-x, 1./3) if x < 0 else np.power(x, 1./3)
sample[columns] = sample[columns].applymap(f)
That said, I'd use a lambda defined as follows
f = lambda x: np.sign(x) * np.power(abs(x), 1./3)
Then you could perform this on the entire dataframe
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(-10, 10, (5, 5)))
df
0 1 2 3 4
0 6 1 -8 0 5
1 3 1 3 9 -2
2 -10 2 -10 -8 -10
3 -3 9 3 8 2
4 -6 -7 9 3 -3
f = lambda x: np.sign(x) * np.power(abs(x), 1./3)
f(df)
0 1 2 3 4
0 1.817121 1.000000 -2.000000 0.000000 1.709976
1 1.442250 1.000000 1.442250 2.080084 -1.259921
2 -2.154435 1.259921 -2.154435 -2.000000 -2.154435
3 -1.442250 2.080084 1.442250 2.000000 1.259921
4 -1.817121 -1.912931 2.080084 1.442250 -1.442250
Same as
df.applymap(f)
0 1 2 3 4
0 1.817121 1.000000 -2.000000 0.000000 1.709976
1 1.442250 1.000000 1.442250 2.080084 -1.259921
2 -2.154435 1.259921 -2.154435 -2.000000 -2.154435
3 -1.442250 2.080084 1.442250 2.000000 1.259921
4 -1.817121 -1.912931 2.080084 1.442250 -1.442250
Check for equality
df.applymap(f).equals(f(df))
True
And it's faster:
%timeit df.applymap(f)
%timeit f(df)
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 473 µs per loop
It doesn't have to be complicated; simply use NumPy's cube-root function, np.cbrt:
df[columns] = np.cbrt(df[columns])
It requires NumPy >= 1.10 though.
For older versions you could use np.absolute and np.sign instead of using conditionals:
df[columns] = df[columns].apply(lambda x: np.power(np.absolute(x), 1./3) * np.sign(x))
This calculates the cube root of the absolute value and then restores the sign appropriately.
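A quick sanity check (my own, not part of either answer) that np.cbrt handles negative values directly:
import numpy as np
print(np.cbrt([-8.0, 0.0, 27.0]))  # [-2.  0.  3.]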
Try:
sample[columns] = sample[columns]**(1/3)
(Note that this only behaves as expected for non-negative values; raising a negative float to the power 1/3 yields NaN.)

Getting average of rows in dataframe greater than or equal to zero

I would like to get the average value of a row in a dataframe where I only use values greater than or equal to zero.
For example:
if my dataframe looked like:
df = pd.DataFrame([[3,4,5], [4,5,6],[4,-10,6]])
3 4 5
4 5 6
4 -10 6
Currently, to get the average of each row I write:
df['mean'] = df.mean(axis = 1)
and get:
3 4 5 4
4 5 6 5
4 -10 6 0
I would like to get a dataframe that only uses values greater than zero to compute the average. I would like a dataframe that looks like:
3 4 5 4
4 5 6 5
4 -10 6 5
In the above example -10 is excluded in the average. Is there a command that excludes the -10?
You can use df[df > 0] to filter the data frame before calculating the average; df[df > 0] returns a data frame where cells less than or equal to zero are replaced with NaN, and NaN values are ignored when calculating the mean:
df[df > 0].mean(1)
#0 4.0
#1 5.0
#2 5.0
#dtype: float64
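An equivalent spelling (my own variation on the same idea) uses DataFrame.where, which also replaces the non-matching cells with NaN before taking the row mean:
df.where(df > 0).mean(1)
#0    4.0
#1    5.0
#2    5.0
#dtype: float64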
Not nearly as succinct as @Psidom's answer, but if you want to use numpy and get some extra speed:
v0 = df.values
v1 = np.where(v0 > 0, v0, np.nan)
v2 = np.nanmean(v1, axis=1)
df.assign(Mean=v2)
0 1 2 Mean
0 3 4 5 4.0
1 4 5 6 5.0
2 4 -10 6 5.0
Timing
small data
%timeit df.assign(Mean=df[df > 0].mean(1))
1000 loops, best of 3: 1.71 ms per loop
%%timeit
v0 = df.values
v1 = np.where(v0 > 0, v0, np.nan)
v2 = np.nanmean(v1, axis=1)
df.assign(Mean=v2)
1000 loops, best of 3: 407 µs per loop

Pandas: assign an index to each group identified by groupby

When using groupby(), how can I create a DataFrame with a new column containing an index of the group number, similar to dplyr::group_indices in R. For example, if I have
>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
a b
0 1 1
1 1 1
2 1 2
3 2 1
4 2 1
5 2 2
How can I get a DataFrame like
a b idx
0 1 1 1
1 1 1 1
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 4
(the order of the idx indexes doesn't matter)
Here is the solution using ngroup (available as of pandas 0.20.2) from a comment above by Constantino.
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
df['idx'] = df.groupby(['a', 'b']).ngroup()
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
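If the 1-based numbering shown in the question is wanted rather than ngroup's 0-based labels, a small tweak (my addition) is to shift the result:
df['idx'] = df.groupby(['a', 'b']).ngroup() + 1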
Here's a concise way using drop_duplicates and merge to get a unique identifier.
group_vars = ['a','b']
df.merge( df.drop_duplicates( group_vars ).reset_index(), on=group_vars )
a b index
0 1 1 0
1 1 1 0
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 5
The identifier in this case goes 0,2,3,5 (just a residual of original index) but this could be easily changed to 0,1,2,3 with an additional reset_index(drop=True).
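As a sketch of that last remark (my addition, assuming the same group_vars as above), renumbering before the merge gives consecutive identifiers:
df.merge( df.drop_duplicates( group_vars ).reset_index(drop=True).reset_index(), on=group_vars )
a b index
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3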
Update: Newer versions of pandas (0.20.2) offer a simpler way to do this with the ngroup method, as noted in a comment on the question above by @Constantino and a subsequent answer by @CalumYou. I'll leave this here as an alternate approach, but ngroup seems like the better way to do this in most cases.
A simple way to do that would be to concatenate your grouping columns (so that each combination of their values represents a uniquely distinct element), then convert the result to a pandas Categorical and keep only its category codes:
df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Edit: changed the labels property to codes, as the former seems to be deprecated
Edit2: Added a separator as suggested by Authman Apatira
Definitely not the most straightforward solution, but here is what I would do (comments in the code):
df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
#create a dummy grouper id by just joining desired rows
df["idx"] = df[["a","b"]].astype(str).apply(lambda x: "".join(x),axis=1)
print(df)
That would generate a unique idx for each combination of a and b.
a b idx
0 1 1 11
1 1 1 11
2 1 2 12
3 2 1 21
4 2 1 21
5 2 2 22
But this is still a rather silly index (think about some more complex values in columns a and b). So let's clean up the index:
# create a dictionary of dummy group_ids and their index-wise representation
dict_idx = dict(enumerate(set(df["idx"])))
# switch keys and values, so you can use the dict in the .replace method
dict_idx = {y: x for x, y in dict_idx.items()}
# replace values with the generated dict
df["idx"].replace(dict_idx, inplace=True)
print(df)
That would produce the desired output:
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
A way that I believe is faster than the current accepted answer by about an order of magnitude (timing results below):
def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
    df.sort_values(grouping_cols, inplace=True)
    # You could do the following three lines in one, I just thought
    # this would be clearer as an explanation of what's going on:
    duplicated = df.duplicated(subset=grouping_cols, keep='first')
    new_group = ~duplicated
    return new_group.cumsum()
Timing results:
a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})
In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop
In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop
I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts by the grouping columns and then checks whether each row differs from the previous row, accumulating by 1 whenever it does. See further below for an answer with string data.
df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(1).cumsum().add(1)
Output
0 1
1 1
2 2
3 3
4 3
5 4
dtype: int64
Breaking this up into steps, let's look at the output of df.sort_values(['a', 'b']).diff().fillna(0), which checks whether each row differs from the previous row. Any non-zero entry indicates a new group.
a b
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 1.0 -1.0
4 0.0 0.0
5 0.0 1.0
A new group only needs a single column to differ, so that is what .ne(0).any(1) checks: not equal to 0 in any of the columns. A cumulative sum then keeps track of the groups.
Answer for columns as strings
#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])
Output of df1:
a b
0 a a
1 a a
4 a a
3 b a
2 b b
5 c c
6 c d
8 c d
7 d d
Take a similar approach by checking whether the group has changed:
df1.ne(df1.shift().bfill()).any(1).cumsum().add(1)
0 1
1 1
4 1
3 2
2 3
5 4
6 5
8 5
7 6
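To attach these group numbers back to df in its original row order, plain assignment works (a sketch using the same df and df1 as above), since the result still carries the original index labels and pandas aligns on them:
df['idx'] = df1.ne(df1.shift().bfill()).any(1).cumsum().add(1)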
