Pandas conditional subset for dataframe with bool values and ints - python

I have a dataframe with three series. Column A contains a group_id. Column B contains True or False. Column C contains a 1-n ranking (where n is the number of rows per group_id).
I'd like to store a subset of this dataframe containing every row where:
1) Column C == 1
OR
2) Column B == True
The following logic, however, just copies my old dataframe row for row into the new dataframe:
new_df = df[df.column_b | df.column_c == 1]

IIUC, starting from a sample dataframe like:
A,B,C
01,True,1
01,False,2
02,False,1
02,True,2
03,True,1
you can:
df = df[(df['C']==1) | (df['B']==True)]
which returns:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1
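Note the parentheses around each condition; they matter because | binds more tightly than == in Python. The attempt in the question is therefore parsed as:
new_df = df[(df.column_b | df.column_c) == 1]
which bitwise-ORs the two columns before comparing to 1, instead of the intended:
new_df = df[df.column_b | (df.column_c == 1)]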

You have a couple of methods for filtering, and performance varies with the size of your data:
In [722]: df[(df['C']==1) | df['B']]
Out[722]:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1
In [723]: df.query('C==1 or B==True')
Out[723]:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1
In [724]: df[df.eval('C==1 or B==True')]
Out[724]:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1
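If you want to know which of these is fastest on your own data, a minimal timing harness like the following works (the frame size and value ranges are arbitrary assumptions, chosen only to mimic the sample's shape):
import timeit
import numpy as np
import pandas as pd

n = 100_000  # arbitrary size for the benchmark
df = pd.DataFrame({'A': np.random.randint(1, 10, n),
                   'B': np.random.rand(n) < 0.5,
                   'C': np.random.randint(1, 5, n)})

for expr in ("df[(df['C']==1) | df['B']]",
             "df.query('C==1 or B==True')",
             "df[df.eval('C==1 or B==True')]"):
    print(expr, timeit.timeit(expr, globals=globals(), number=100))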

Pandas creating new column based on consecutive duplicates

I have a Pandas dataframe like the one below, where column A is a series of strings, and values in column B are true/false depending on whether the value of column A is the same as the value of column A in the previous row.
A B
1 False
1 True
1b False
1b True
1b True
1 False
I want to add a new column, C, that assigns the same value (it can be any value) to all consecutive duplicate entries, but this value must be unique from the values assigned to other groups of consecutive duplicate entries. For example:
A B C
1 False 1
1 True 1
1b False 2
1b True 2
1b True 2
1 False 3
Any thoughts about how to go about this in an efficient way?
Try groupby with ngroup() + 1; be sure to pass sort=False so that groups are numbered in the order they appear in the DataFrame:
df['C'] = df.groupby(['A', (~df['B']).cumsum()], sort=False).ngroup() + 1
A B C
0 1 False 1
1 1 True 1
2 1b False 2
3 1b True 2
4 1b True 2
5 1 False 3
Alternatively, either of the following cumsum-based approaches can be used directly, assuming the values appear in order:
df['C'] = (~df['B']).cumsum()
A B C
0 1 False 1
1 1 True 1
2 1b False 2
3 1b True 2
4 1b True 2
5 1 False 3
This would be equivalent to:
df['A'].ne(df['A'].shift()).cumsum()
0 1
1 1
2 2
3 2
4 2
5 3
Name: A, dtype: int32
Which would be the standard way to solve this problem if the B column was not already calculated.
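To make that concrete, here is a minimal end-to-end sketch assuming the frame starts with only column A (so both B and C are derived):
df['B'] = df['A'].eq(df['A'].shift())            # True where the previous row has the same A
df['C'] = df['A'].ne(df['A'].shift()).cumsum()   # new group id wherever A changes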
Try shift combined with cumsum:
df['C'] = df.A.ne(df.A.shift()).cumsum()
Out[191]:
0 1
1 1
2 2
3 2
4 2
5 3
Name: A, dtype: int64
I think this is what you are looking for:
df['C'] = df.groupby('A').ngroup() + 1
(Note that this groups by value rather than by consecutive run, so in the sample the final 1 row gets label 1 again instead of 3.)

How can I select and index the highest value in each group of a Pandas dataframe?

I have a dataframe with multiple columns, where each combination of column values describes one experiment (e.g. multiple super-labels, and for each super-label multiple episodes with a different number of timesteps). I want to set the last timestep in each episode for all experiments to True, but I can't figure out how to do this. I have tried three different approaches, all using .loc: 1) .max().index, 2) .idxmax(), and 3) .tail(1).index, but they all fail (the first two with exceptions I don't understand, and the last one selects the wrong rows).
This is my minimal example:
import numpy as np
import pandas as pd
np.random.seed(4)
def gen(t):
    results = []
    for episode_id, episode in enumerate(range(np.random.randint(2, 4))):
        for i in range(np.random.randint(2, 6)):
            results.append(
                {
                    "episode": episode_id,
                    "timestep": i,
                    "t": t,
                }
            )
    return pd.DataFrame(results)
df = pd.concat([gen("a"), gen("b")])
base_groups = ["t", "episode"]
df["last_timestep"] = False
print("Expected:")
print(df.groupby(base_groups).timestep.max())
#df.loc[df.groupby(base_groups).timestep.max().index, "last_timestep"] = True
#df.loc[df.groupby(base_groups).timestep.idxmax(), "last_timestep"] = True
df.loc[df.groupby(base_groups).tail(1).index, "last_timestep"] = True
print("Is:")
print(df[df.last_timestep])
The output of df.groupby(base_groups).timestep.max() is exactly what I expect; the correct rows are identified:
Expected:
t episode
a 0 3
1 4
b 0 2
1 1
2 4
But when filtering the dataframe, this is what I get:
Is:
episode timestep t last_timestep
2 0 2 a True
3 0 3 a True
4 1 0 a True
8 1 4 a True
2 0 2 b True
3 1 0 b True
4 1 1 b True
8 2 3 b True
9 2 4 b True
The rows at positions 0, 2, 5 and 7 of this output should not have been selected.
Use GroupBy.transform to broadcast each group's maximum back to all of its rows, then compare the result with the timestep column:
df["last_timestep"] = df.groupby(base_groups)['timestep'].transform('max').eq(df['timestep'])
print (df)
episode timestep t last_timestep
0 0 0 a False
1 0 1 a False
2 0 2 a False
3 0 3 a True
4 1 0 a False
5 1 1 a False
6 1 2 a False
7 1 3 a False
8 1 4 a True
0 0 0 b False
1 0 1 b False
2 0 2 b True
3 1 0 b False
4 1 1 b True
5 2 0 b False
6 2 1 b False
7 2 2 b False
8 2 3 b False
9 2 4 b True
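As an aside (my reading of the question's failed attempts, not part of the answer above): one plausible reason the .idxmax() route misbehaves is that pd.concat left duplicate index labels, so the labels returned by idxmax match more rows than intended. Resetting the index first should make that approach work as well:
df = df.reset_index(drop=True)  # make the row labels unique
df.loc[df.groupby(base_groups)['timestep'].idxmax(), 'last_timestep'] = True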

Python - Pandas - DataFrame - Explode single column into multiple boolean columns based on conditions

Good morning chaps,
Any pythonic way to explode a dataframe column into multiple columns with boolean flags, based on some condition (str.contains in this case)?
Let's say I have this:
Position Letter
1 a
2 b
3 c
4 b
5 b
And I'd like to achieve this:
Position Letter is_a is_b is_c
1 a TRUE FALSE FALSE
2 b FALSE TRUE FALSE
3 c FALSE FALSE TRUE
4 b FALSE TRUE FALSE
5 b FALSE TRUE FALSE
I can do it with a loop over 'abc', explicitly creating new df columns, but I'm wondering whether some built-in pandas method already exists. The number of possible values, and hence the number of new columns, is variable.
Thanks and regards.
Use Series.str.get_dummies():
In [31]: df.join(df.Letter.str.get_dummies())
Out[31]:
Position Letter a b c
0 1 a 1 0 0
1 2 b 0 1 0
2 3 c 0 0 1
3 4 b 0 1 0
4 5 b 0 1 0
or
In [32]: df.join(df.Letter.str.get_dummies().astype(bool))
Out[32]:
Position Letter a b c
0 1 a True False False
1 2 b False True False
2 3 c False False True
3 4 b False True False
4 5 b False True False
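If you also want the is_a / is_b / is_c column names from the desired output, pd.get_dummies accepts a prefix argument; a small sketch (the dtype keyword assumes a reasonably recent pandas version):
df.join(pd.get_dummies(df['Letter'], prefix='is', dtype=bool))
This produces columns named is_a, is_b and is_c directly.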

Python - Pandas: select first observation per group

I want to adapt my former SAS code to Python using the dataframe framework.
In SAS I often use this type of code (assume the rows are sorted by group_id, where group_id takes values 1 to 10 and there are multiple observations for each group_id):
data want;
    set have;
    by group_id;
    if first.group_id then c=1;
    else c=0;
run;
What goes on here is that I flag the first observation for each id: I create a new variable c that takes the value 1 there and 0 for the others. The resulting dataset looks like this:
group_id c
1 1
1 0
1 0
2 1
2 0
2 0
3 1
3 0
3 0
How can I do this in Python with a pandas dataframe? Assume that I start with just the group_id column.
If you're using pandas 0.13+, you can use the cumcount groupby method:
In [11]: df
Out[11]:
group_id
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 3
In [12]: df.groupby('group_id').cumcount() == 0
Out[12]:
0 True
1 False
2 False
3 True
4 False
5 False
6 True
7 False
8 False
dtype: bool
You can force the dtype to be int rather than bool:
In [13]: df['c'] = (df.groupby('group_id').cumcount() == 0).astype(int)
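An equivalent spelling (my addition, not from the original answer): Series.duplicated marks every occurrence of a group_id after the first, so its negation flags exactly the first observation per group:
df['c'] = (~df['group_id'].duplicated()).astype(int)  # 1 on the first row of each group_id, else 0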
