I have a Pandas dataframe like the one below, where column A is a series of strings, and values in column B are true/false depending on whether the value of column A is the same as the value of column A in the previous row.
A B
1 False
1 True
1b False
1b True
1b True
1 False
I want to add a new column, C, that assigns the same value (it can be any value) to all consecutive duplicate entries, but this value must be unique from the values assigned to other groups of consecutive duplicate entries. For example:
A B C
1 False 1
1 True 1
1b False 2
1b True 2
1b True 2
1 False 3
Any thoughts about how to go about this in an efficient way?
Try groupby + ngroup + 1, and be sure to pass sort=False so that groups are numbered in the order they appear in the DataFrame:
df['C'] = df.groupby(['A', (~df['B']).cumsum()], sort=False).ngroup() + 1
A B C
0 1 False 1
1 1 True 1
2 1b False 2
3 1b True 2
4 1b True 2
5 1 False 3
Alternatively, since B is False exactly at the start of each run of duplicates, either of the following cumsum-based expressions can be used directly:
df['C'] = (~df['B']).cumsum()
A B C
0 1 False 1
1 1 True 1
2 1b False 2
3 1b True 2
4 1b True 2
5 1 False 3
This would be equivalent to:
df['A'].ne(df['A'].shift()).cumsum()
0 1
1 1
2 2
3 2
4 2
5 3
Name: A, dtype: int32
This would be the standard way to solve the problem if the B column were not already calculated.
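For reference, a minimal, self-contained sketch reproducing the example above (column names as in the question), using the shift/cumsum idiom:
import pandas as pd

df = pd.DataFrame({'A': ['1', '1', '1b', '1b', '1b', '1'],
                   'B': [False, True, False, True, True, False]})

# A new run starts wherever A differs from the previous row;
# cumsum then numbers the runs 1, 2, 3, ...
df['C'] = df['A'].ne(df['A'].shift()).cumsum()
    A      B  C
0   1  False  1
1   1   True  1
2  1b  False  2
3  1b   True  2
4  1b   True  2
5   1  False  3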
Try shift combined with cumsum:
df['C'] = df.A.ne(df.A.shift()).cumsum()
Out[191]:
0 1
1 1
2 2
3 2
4 2
5 3
Name: A, dtype: int64
I think that's what you are looking for.
df['C'] = df.groupby('A').ngroup() + 1
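One caveat: this groups on the values of A alone, not on consecutive runs, so on the sample data the final 1 row is assigned back to the first group instead of receiving a new label:
df.groupby('A').ngroup() + 1
0    1
1    1
2    2
3    2
4    2
5    1
dtype: int64
The last row gets 1 again rather than the 3 the question asks for.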
Hi, I want to delete the rows whose entries occur fewer than a given number of times, for example:
df = pd.DataFrame({'a': [1,2,3,2], 'b':[4,5,6,7], 'c':[0,1,3,2]})
df
a b c
0 1 4 0
1 2 5 1
2 3 6 3
3 2 7 2
Here I want to delete all rows where the value in column 'a' occurs fewer than two times.
Wanted output:
a b c
1 2 5 1
3 2 7 2
What I know:
We can flag the infrequent values with condition = df['a'].value_counts() < 2, which gives something like:
2 False
3 True
1 True
Name: a, dtype: bool
But I don't know how I should approach from here to delete the rows.
Thanks in advance!
groupby + size
res = df[df.groupby('a')['b'].transform('size') >= 2]
The transform method maps df.groupby('a')['b'].size() to df aligned with df['a'].
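For the sample frame above, that intermediate Series looks like this:
df.groupby('a')['b'].transform('size')
0    1
1    2
2    1
3    2
Name: b, dtype: int64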
value_counts + map
s = df['a'].value_counts()
res = df[df['a'].map(s) >= 2]
print(res)
a b c
1 2 5 1
3 2 7 2
You can use df.where together with dropna, mapping the counts back onto the rows first:
df.where(df['a'].map(df['a'].value_counts()) >= 2).dropna()
a b c
1 2.0 5.0 1.0
3 2.0 7.0 2.0
You could try something like this: get the length of each group, transform it back onto the original index, and use it to index the df
df[df.groupby("a").transform(len)["b"] >= 2]
a b c
1 2 5 1
3 2 7 2
Breaking it into individual steps you get:
df.groupby("a").transform(len)["b"]
0 1
1 2
2 1
3 2
Name: b, dtype: int64
These are the group sizes transformed back onto the original index.
df.groupby("a").transform(len)["b"] >=2
0 False
1 True
2 False
3 True
Name: b, dtype: bool
We then use this as a boolean mask to index the original dataframe.
Good morning chaps,
Any pythonic way to explode a dataframe column into multiple columns with boolean flags, based on some condition (str.contains in this case)?
Let's say I have this:
Position Letter
1 a
2 b
3 c
4 b
5 b
And I'd like to achieve this:
Position Letter is_a is_b is_c
1 a TRUE FALSE FALSE
2 b FALSE TRUE FALSE
3 c FALSE FALSE TRUE
4 b FALSE TRUE FALSE
5 b FALSE TRUE FALSE
I can do this with a loop through 'abc', explicitly creating new df columns, but I'm wondering if a built-in method already exists in pandas. The number of possible values, and hence the number of new columns, is variable.
Thanks and regards.
Use Series.str.get_dummies():
In [31]: df.join(df.Letter.str.get_dummies())
Out[31]:
Position Letter a b c
0 1 a 1 0 0
1 2 b 0 1 0
2 3 c 0 0 1
3 4 b 0 1 0
4 5 b 0 1 0
or
In [32]: df.join(df.Letter.str.get_dummies().astype(bool))
Out[32]:
Position Letter a b c
0 1 a True False False
1 2 b False True False
2 3 c False False True
3 4 b False True False
4 5 b False True False
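If you want column names like is_a, is_b, is_c as in the desired output, you can tack on add_prefix (a small sketch; add_prefix is a standard DataFrame method):
df.join(df.Letter.str.get_dummies().astype(bool).add_prefix('is_'))
   Position Letter   is_a   is_b   is_c
0         1      a   True  False  False
1         2      b  False   True  False
2         3      c  False  False   True
3         4      b  False   True  False
4         5      b  False   True  False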
I have a dataframe with three series. Column A contains a group_id. Column B contains True or False. Column C contains a 1-n ranking (where n is the number of rows per group_id).
I'd like to store a subset of this dataframe containing every row where:
1) Column C == 1
OR
2) Column B == True
The following logic copies my old dataframe row for row into the new dataframe:
new_df = df[df.column_b | df.column_c == 1]
IIUC, starting from a sample dataframe like:
A,B,C
01,True,1
01,False,2
02,False,1
02,True,2
03,True,1
you can:
df = df[(df['C']==1) | (df['B']==True)]
which returns:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1
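As an aside, one possible reason (an assumption, since the full data isn't shown) that the original attempt returned every row is operator precedence: | binds more tightly than == in Python, so the comparison needs its own parentheses:
# using the question's column names
new_df = df[df.column_b | (df.column_c == 1)]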
You have a couple of methods for filtering, and performance varies based on the size of your data:
In [722]: df[(df['C']==1) | df['B']]
Out[722]:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1
In [723]: df.query('C==1 or B==True')
Out[723]:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1
In [724]: df[df.eval('C==1 or B==True')]
Out[724]:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1
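If you want to check which variant is fastest on your own data, here is a small sketch using Python's standard timeit module (it assumes df is already defined in the current session):
import timeit

exprs = ["df[(df['C'] == 1) | df['B']]",
         "df.query('C == 1 or B == True')",
         "df[df.eval('C == 1 or B == True')]"]

for expr in exprs:
    # run each filtering expression 1,000 times and report the total time
    elapsed = timeit.timeit(expr, globals=globals(), number=1000)
    print(f"{expr:40} {elapsed:.4f} s")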
I want to adapt my former SAS code to Python using the dataframe framework.
In SAS I often use this type of code (assume the rows are sorted by group_id, where group_id takes values 1 to 10 and there are multiple observations for each group_id):
data want;set have;
by group_id;
if first.group_id then c=1; else c=0;
run;
What happens here is that I select the first observation for each group_id and create a new variable c that takes the value 1 for it and 0 for the others. The resulting dataset looks like this:
group_id c
1 1
1 0
1 0
2 1
2 0
2 0
3 1
3 0
3 0
How can I do this in Python using a DataFrame? Assume that I start with the group_id column only.
If you're using pandas 0.13+ you can use the groupby cumcount method:
In [11]: df
Out[11]:
group_id
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 3
In [12]: df.groupby('group_id').cumcount() == 0
Out[12]:
0 True
1 False
2 False
3 True
4 False
5 False
6 True
7 False
8 False
dtype: bool
You can force the dtype to be int rather than bool:
In [13]: df['c'] = (df.groupby('group_id').cumcount() == 0).astype(int)
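Putting it together as a minimal, runnable sketch that reproduces the wanted dataset (assuming, as in the SAS code, that the rows are already sorted by group_id):
import pandas as pd

df = pd.DataFrame({'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3]})

# cumcount numbers the rows within each group starting at 0,
# so == 0 flags the first observation of each group_id
df['c'] = (df.groupby('group_id').cumcount() == 0).astype(int)
   group_id  c
0         1  1
1         1  0
2         1  0
3         2  1
4         2  0
5         2  0
6         3  1
7         3  0
8         3  0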