drop group by number of occurrence - python

Hi I want to delete the rows with the entries whose number of occurrence is smaller than a number, for example:
df = pd.DataFrame({'a': [1,2,3,2], 'b':[4,5,6,7], 'c':[0,1,3,2]})
df
a b c
0 1 4 0
1 2 5 1
2 3 6 3
3 2 7 2
Here I want to delete all the rows if the number of occurrence in column 'a' is less than twice.
Wanted output:
a b c
1 2 5 1
3 2 7 2
What I know:
we can find the number of occurrence by condition = df['a'].value_counts() < 2, and it will give me something like:
2 False
3 True
1 True
Name: a, dtype: int64
But I don't know how I should approach from here to delete the rows.
Thanks in advance!

groupby + size
res = df[df.groupby('a')['b'].transform('size') >= 2]
The transform method maps df.groupby('a')['b'].size() to df aligned with df['a'].
value_counts + map
s = df['a'].value_counts()
res = df[df['a'].map(s) >= 2]
print(res)
a b c
1 2 5 1
3 2 7 2

You Can use df.where and the dropna
df.where(df['a'].value_counts() <2).dropna()
a b c
1 2.0 5.0 1.0
3 2.0 7.0 2.0

You could try something like this to get the length of each group, transform back to original index and index the df by it
df[df.groupby("a").transform(len)["b"] >= 2]
a b c
1 2 5 1
3 2 7 2
Breaking it into individual steps you get:
df.groupby("a").transform(len)["b"]
0 1
1 2
2 1
3 2
Name: b, dtype: int64
These are the group sizes transformed back onto your original index
df.groupby("a").transform(len)["b"] >=2
0 False
1 True
2 False
3 True
Name: b, dtype: bool
We then turn this into the boolean index and index our original dataframe by it

Related

I want to add sub-index in python with pandas [duplicate]

When using groupby(), how can I create a DataFrame with a new column containing an index of the group number, similar to dplyr::group_indices in R. For example, if I have
>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
a b
0 1 1
1 1 1
2 1 2
3 2 1
4 2 1
5 2 2
How can I get a DataFrame like
a b idx
0 1 1 1
1 1 1 1
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 4
(the order of the idx indexes doesn't matter)
Here is the solution using ngroup (available as of pandas 0.20.2) from a comment above by Constantino.
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
df['idx'] = df.groupby(['a', 'b']).ngroup()
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Here's a concise way using drop_duplicates and merge to get a unique identifier.
group_vars = ['a','b']
df.merge( df.drop_duplicates( group_vars ).reset_index(), on=group_vars )
a b index
0 1 1 0
1 1 1 0
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 5
The identifier in this case goes 0,2,3,5 (just a residual of original index) but this could be easily changed to 0,1,2,3 with an additional reset_index(drop=True).
Update: Newer versions of pandas (0.20.2) offer a simpler way to do this with the ngroup method as noted in a comment to the question above by #Constantino and a subsequent answer by #CalumYou. I'll leave this here as an alternate approach but ngroup seems like the better way to do this in most cases.
A simple way to do that would be to concatenate your grouping columns (so that each combination of their values represents a uniquely distinct element), then convert it to a pandas Categorical and keep only its labels:
df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Edit: changed labels properties to codes as the former seem to be deprecated
Edit2: Added a separator as suggested by Authman Apatira
Definetely not the most straightforward solution, but here is what I would do (comments in the code):
df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
#create a dummy grouper id by just joining desired rows
df["idx"] = df[["a","b"]].astype(str).apply(lambda x: "".join(x),axis=1)
print df
That would generate an unique idx for each combination of a and b.
a b idx
0 1 1 11
1 1 1 11
2 1 2 12
3 2 1 21
4 2 1 21
5 2 2 22
But this is still a rather silly index (think about some more complex values in columns a and b. So let's clear the index:
# create a dictionary of dummy group_ids and their index-wise representation
dict_idx = dict(enumerate(set(df["idx"])))
# switch keys and values, so you can use dict in .replace method
dict_idx = {y:x for x,y in dict_idx.iteritems()}
#replace values with the generated dict
df["idx"].replace(dict_idx,inplace=True)
print df
That would produce the desired output:
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
A way that I believe is faster than the current accepted answer by about an order of magnitude (timing results below):
def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
df.sort_values(grouping_cols, inplace=True)
# You could do the following three lines in one, I just thought
# this would be clearer as an explanation of what's going on:
duplicated = df.duplicated(subset=grouping_cols, keep='first')
new_group = ~duplicated
return new_group.cumsum()
Timing results:
a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})
In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop
In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop
I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts the grouping columns and then checks whether each row is different than the previous row and if so accumulates by 1. Check further below for an answer with string data.
df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(1).cumsum().add(1)
Output
0 1
1 1
2 2
3 3
4 3
5 4
dtype: int64
So breaking this up into steps, lets see the output of df.sort_values(['a', 'b']).diff().fillna(0) which checks if each row is different than the previous row. Any non-zero entry indicates a new group.
a b
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 1.0 -1.0
4 0.0 0.0
5 0.0 1.0
A new group only need to have a single column different so this is what .ne(0).any(1) checks - not equal to 0 for any of the columns. And then just a cumulative sum to keep track of the groups.
Answer for columns as strings
#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])
output of df1
a b
0 a a
1 a a
4 a a
3 b a
2 b b
5 c c
6 c d
8 c d
7 d d
Take similar approach by checking if group has changed
df1.ne(df1.shift().bfill()).any(1).cumsum().add(1)
0 1
1 1
4 1
3 2
2 3
5 4
6 5
8 5
7 6

How to filter groupby object in pandas

I have df like this
A B
1 1
1 2
1 3
2 2
2 1
3 2
3 3
3 4
I would like to extract rows whose col B is not ascending like
A B
2 2
2 1
I tried
df.groupby("A").filter()...
But I stacked to extract.
If you have any solution,please let me know.
One way is to use pandas.Series.is_monotonic:
df[df.groupby('A')['B'].transform(lambda x:not x.is_monotonic)]
Output:
A B
3 2 2
4 2 1
Use GroupBy.transform with Series.diff and compare by Series.lt for at least one negative value with Series.any and filter by boolean indexing:
df1 = df[df.groupby('A')['B'].transform(lambda x: x.diff().lt(0).any())]
print (df1)
A B
3 2 2
4 2 1

Pandas enumerate groups in descending order

I've the following column:
column
0 10
1 10
2 8
3 8
4 6
5 6
My goal is to find the today unique values (3 in this case) and create a new column which would create the following
new_column
0 3
1 3
2 2
3 2
4 1
5 1
The numbering starts from length of unique values (3) and same number is repeated if current row is same as previous row based on original column. Number gets decreased as row value changes. All unique values in original column have same number of rows (2 rows for each unique value in this case).
My solution was to groupby the original column and create a new list like below:
i=1
new_time=[]
for j, v in df.groupby('column'):
new_time.append([i]*2)
i=i+1
Then I'd flatten the list sort in decreasing order. Any other simpler solution?
Thanks.
pd.factorize
i, u = pd.factorize(df.column)
df.assign(new=len(u) - i)
column new
0 10 3
1 10 3
2 8 2
3 8 2
4 6 1
5 6 1
dict.setdefault
d = {}
for k in df.column:
d.setdefault(k, len(d))
df.assign(new=len(d) - df.column.map(d))
Use GroupBy.ngroup with ascending=False:
df.groupby('column', sort=False).ngroup(ascending=False)+1
0 3
1 3
2 2
3 2
4 1
5 1
dtype: int64
For DataFrame that looks like this,
df = pd.DataFrame({'column': [10, 10, 8, 8, 10, 10]})
. . .where only consecutive values are to be grouped, you'll need to modify your grouper:
(df.groupby(df['column'].ne(df['column'].shift()).cumsum(), sort=False)
.ngroup(ascending=False)
.add(1))
0 3
1 3
2 2
3 2
4 1
5 1
dtype: int64
Acutally, we can use rank with method being dense i.e
dense: like ‘min’, but rank always increases by 1 between groups
df['column'].rank(method='dense')
0 3.0
1 3.0
2 2.0
3 2.0
4 1.0
5 1.0
rank version of #cs95's solution would be
df['column'].ne(df['column'].shift()).cumsum().rank(method='dense',ascending=False)
Try with unique and map
df.column.map(dict(zip(df.column.unique(),reversed(range(df.column.nunique())))))+1
Out[350]:
0 3
1 3
2 2
3 2
4 1
5 1
Name: column, dtype: int64
IIUC, you want groupID of same-values consecutive groups in reversed order. If so, I think this should work too:
df.column.nunique() - df.column.ne(df.column.shift()).cumsum().sub(1)
Out[691]:
0 3
1 3
2 2
3 2
4 1
5 1
Name: column, dtype: int32

How to grouping and transform in pandas

Now I have below dataframe
A B C
1 1 1
1 2 1
1 3 2
2 4 2
2 5 2
2 6 3
I would like to grouping by df.A, and sum up in df.B
But, I would like to transform C as first of each group elements.
So I would like to get results below.
A B C
1 6 1
2 15 2
How I can remain df.C and transform the first element of each group?
I tried df.groupby(A)[B].sum() but I couldnt figure out next step...
You can use agg and pass a dict of funcs to perform on the cols of interest:
In [115]:
df.groupby('A').agg({'B':'sum','C':'first'}).reset_index()
Out[115]:
A C B
0 1 1 6
1 2 2 15
The dict has the col name and the func to perform on each col, here we can pass the string name of the func for sum and first.
To reorder the cols you can use fancy indexing:
In [116]:
df.groupby('A').agg({'B':'sum','C':'first'}).reset_index().ix[:,df.columns]
Out[116]:
A B C
0 1 6 1
1 2 15 2

Pandas: assign an index to each group identified by groupby

When using groupby(), how can I create a DataFrame with a new column containing an index of the group number, similar to dplyr::group_indices in R. For example, if I have
>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
a b
0 1 1
1 1 1
2 1 2
3 2 1
4 2 1
5 2 2
How can I get a DataFrame like
a b idx
0 1 1 1
1 1 1 1
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 4
(the order of the idx indexes doesn't matter)
Here is the solution using ngroup (available as of pandas 0.20.2) from a comment above by Constantino.
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
df['idx'] = df.groupby(['a', 'b']).ngroup()
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Here's a concise way using drop_duplicates and merge to get a unique identifier.
group_vars = ['a','b']
df.merge( df.drop_duplicates( group_vars ).reset_index(), on=group_vars )
a b index
0 1 1 0
1 1 1 0
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 5
The identifier in this case goes 0,2,3,5 (just a residual of original index) but this could be easily changed to 0,1,2,3 with an additional reset_index(drop=True).
Update: Newer versions of pandas (0.20.2) offer a simpler way to do this with the ngroup method as noted in a comment to the question above by #Constantino and a subsequent answer by #CalumYou. I'll leave this here as an alternate approach but ngroup seems like the better way to do this in most cases.
A simple way to do that would be to concatenate your grouping columns (so that each combination of their values represents a uniquely distinct element), then convert it to a pandas Categorical and keep only its labels:
df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Edit: changed labels properties to codes as the former seem to be deprecated
Edit2: Added a separator as suggested by Authman Apatira
Definetely not the most straightforward solution, but here is what I would do (comments in the code):
df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
#create a dummy grouper id by just joining desired rows
df["idx"] = df[["a","b"]].astype(str).apply(lambda x: "".join(x),axis=1)
print df
That would generate an unique idx for each combination of a and b.
a b idx
0 1 1 11
1 1 1 11
2 1 2 12
3 2 1 21
4 2 1 21
5 2 2 22
But this is still a rather silly index (think about some more complex values in columns a and b. So let's clear the index:
# create a dictionary of dummy group_ids and their index-wise representation
dict_idx = dict(enumerate(set(df["idx"])))
# switch keys and values, so you can use dict in .replace method
dict_idx = {y:x for x,y in dict_idx.iteritems()}
#replace values with the generated dict
df["idx"].replace(dict_idx,inplace=True)
print df
That would produce the desired output:
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
A way that I believe is faster than the current accepted answer by about an order of magnitude (timing results below):
def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
df.sort_values(grouping_cols, inplace=True)
# You could do the following three lines in one, I just thought
# this would be clearer as an explanation of what's going on:
duplicated = df.duplicated(subset=grouping_cols, keep='first')
new_group = ~duplicated
return new_group.cumsum()
Timing results:
a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})
In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop
In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop
I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts the grouping columns and then checks whether each row is different than the previous row and if so accumulates by 1. Check further below for an answer with string data.
df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(1).cumsum().add(1)
Output
0 1
1 1
2 2
3 3
4 3
5 4
dtype: int64
So breaking this up into steps, lets see the output of df.sort_values(['a', 'b']).diff().fillna(0) which checks if each row is different than the previous row. Any non-zero entry indicates a new group.
a b
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 1.0 -1.0
4 0.0 0.0
5 0.0 1.0
A new group only need to have a single column different so this is what .ne(0).any(1) checks - not equal to 0 for any of the columns. And then just a cumulative sum to keep track of the groups.
Answer for columns as strings
#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])
output of df1
a b
0 a a
1 a a
4 a a
3 b a
2 b b
5 c c
6 c d
8 c d
7 d d
Take similar approach by checking if group has changed
df1.ne(df1.shift().bfill()).any(1).cumsum().add(1)
0 1
1 1
4 1
3 2
2 3
5 4
6 5
8 5
7 6

Categories