Select all rows from where a condition is true in pandas - python

I have a dataframe
Id Seqno. Event
1 2 A
1 3 B
1 5 A
1 6 A
1 7 D
2 0 E
2 1 A
2 2 B
2 4 A
2 6 B
For each Id, I want to get all the events that happened from the second most recent occurrence of event A onwards. Seqno. is a sequence number within each Id.
The desired output is:
Id Seqno. Event
1 5 A
1 6 A
1 7 D
2 1 A
2 2 B
2 4 A
2 6 B
So far I have tried:
y = x.groupby('Id').apply(lambda x: x.eventtype.eq('A').cumsum().tail(2)).reset_index()
p = y.groupby('Id').apply(lambda x: x.iloc[0]).reset_index(drop=True)
q = x.reset_index()
s = pd.merge(q, p, on='Id')
dd = s[s['index'] >= s['level_1']]
I was wondering if there is a good way of doing it.
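For reference, the sample frame above can be reproduced with something like this (a minimal sketch; the values are taken from the table, the variable name df is assumed):
import pandas as pd

df = pd.DataFrame({
    'Id':     [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'Seqno.': [2, 3, 5, 6, 7, 0, 1, 2, 4, 6],
    'Event':  ['A', 'B', 'A', 'A', 'D', 'E', 'A', 'B', 'A', 'B'],
})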

Use groupby with cumsum, subtract it from the count of A's per group, and filter:
g = df['Event'].eq('A').groupby(df['Id'])
df[(g.transform('sum') - g.cumsum()).le(1)]
Id Seqno. Event
2 1 5 A
3 1 6 A
4 1 7 D
6 2 1 A
7 2 2 B
8 2 4 A
9 2 6 B
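To see why this works, it helps to look at the intermediates (a quick sketch using the sample df built above):
g = df['Event'].eq('A').groupby(df['Id'])
total_a = g.transform('sum')   # per-Id total count of 'A': 3,3,3,3,3, 2,2,2,2,2
seen_a = g.cumsum()            # running count of 'A' within each Id: 1,1,2,3,3, 0,1,1,2,2
# total_a - seen_a is the number of A's still to come; keeping rows where it is <= 1
# keeps everything from the second-to-last 'A' of each Id onwards
df[(total_a - seen_a).le(1)]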

Thanks to cold, ALollz and Vaishali, following the explanation (from the comments): use groupby with cumcount to get the count, then reindex and ffill:
s=df.loc[df.Event=='A'].groupby('Id').cumcount(ascending=False).add(1).reindex(df.index)
s.groupby(df['Id']).ffill()
Out[57]:
0 3.0
1 3.0
2 2.0
3 1.0
4 1.0
5 NaN
6 2.0
7 2.0
8 1.0
9 1.0
dtype: float64
yourdf=df[s.groupby(df['Id']).ffill()<=2]
yourdf
Out[58]:
Id Seqno. Event
2 1 5 A
3 1 6 A
4 1 7 D
6 2 1 A
7 2 2 B
8 2 4 A
9 2 6 B
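For reference, before the per-Id ffill the series s only has values on the 'A' rows; each 'A' row is labelled with how many A's remain in its Id, counting itself (again assuming the sample df from above):
s = df.loc[df.Event == 'A'].groupby('Id').cumcount(ascending=False).add(1).reindex(df.index)
# s: 3.0, NaN, 2.0, 1.0, NaN, NaN, 2.0, NaN, 1.0, NaN
# the groupby ffill then propagates these labels forward within each Id,
# giving the series shown in Out[57] above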

Related

How to calculate count within the same group based on ID

My DataFrame looks like:
df = pd.DataFrame({"ID": ['A','B','A','A','B','B','C','D','D','C'],
                   'count': [1,1,2,2,2,2,1,1,1,2]})
print(df)
ID count
0 A 1
1 B 1
2 A 2
3 A 2
4 B 2
5 B 2
6 C 1
7 D 1
8 D 1
9 C 2
I will only have the ID column and want to calculate the count column. The logic is to cumulatively count the occurrences of each ID, except that when an ID repeats immediately (like indexes 2 & 3) both rows should get the same count. How can I achieve this?
My attempt, which does not give the correct results:
df['x'] = df['ID'].eq(df['ID'].shift(-1)).astype(int)
df.groupby('ID')['x'].transform('cumsum')+1
0 1
1 1
2 2
3 2
4 2
5 2
6 1
7 2
8 2
9 1
Name: x, dtype: int32
The question may look like a plain groupby cumulative count, but it is different.
We can filter first, then reindex back:
(df[df.ID.ne(df.ID.shift())].groupby('ID').cumcount().add(1)
.reindex(df.index,method='ffill'))
Out[10]:
0 1
1 1
2 2
3 2
4 2
5 2
6 1
7 1
8 1
9 2
dtype: int64
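To unpack this: the filter keeps only the first row of each consecutive run of the same ID, cumcount then numbers those runs per ID, and the forward-fill reindex copies that number onto the rest of each run (a quick sketch of the intermediate step):
first_of_run = df[df.ID.ne(df.ID.shift())]
# first_of_run.ID -> indexes 0, 1, 2, 4, 6, 7, 9 with IDs A, B, A, B, C, D, C
first_of_run.groupby('ID').cumcount().add(1)
# -> 1, 1, 2, 2, 1, 1, 2  (the run number of each ID, which ffill then spreads)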
You could also use groupby() with sort=False:
df['count2'] = df[(df.ID.ne(df.ID.shift()))].groupby('ID', sort=False).cumcount().add(1)
df['count2'] = df['count2'].ffill()
Output:
ID count count2
0 A 1 1
1 B 1 1
2 A 2 2
3 A 2 2
4 B 2 2
5 B 2 2
6 C 1 1
7 D 1 1
8 D 1 1
9 C 2 2

Add Missing Values To Pandas Groups

Let's say I have a DataFrame like:
import pandas as pd
df = pd.DataFrame({"Quarter": [1,2,3,4,1,2,3,4,4],
                   "Type": ["a","a","a","a","b","b","c","c","d"],
                   "Value": [4,1,3,4,7,2,9,4,1]})
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 c 9
7 4 c 4
8 4 d 1
For each Type, there needs to be a total of 4 rows that represent one of four quarters as indicated by the Quarter column. So, it would look like:
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 b NaN
7 4 b NaN
8 1 c NaN
9 2 c NaN
10 3 c 9
11 4 c 4
12 1 d NaN
13 2 d NaN
14 3 d NaN
15 4 d 1
Then, where there are missing values in the Value column, fill the missing values using the next closest available value with the same Type (if it's the last quarter that is missing then fill with the third quarter):
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 b 2
7 4 b 2
8 1 c 9
9 2 c 9
10 3 c 9
11 4 c 4
12 1 d 1
13 2 d 1
14 3 d 1
15 4 d 1
What's the best way to accomplish this?
Use reindex:
idx = pd.MultiIndex.from_product([
    df['Type'].unique(),
    range(1, 5)
], names=['Type', 'Quarter'])
df.set_index(['Type', 'Quarter']).reindex(idx) \
  .groupby('Type') \
  .transform(lambda v: v.ffill().bfill()) \
  .reset_index()
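The key step is that from_product builds every (Type, Quarter) pair, so combinations missing from the data show up as NaN rows after reindex, which the per-Type transform then fills (a quick look at that intermediate, using the sample df above):
full = df.set_index(['Type', 'Quarter']).reindex(idx)
# e.g. ('b', 3), ('b', 4), ('c', 1), ('c', 2) and ('d', 1)..('d', 3) now exist
# with Value NaN; ffill().bfill() within each Type fills them from the
# nearest available quarter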
You can use set_index and unstack to create the missing rows you want (assuming each quarter is available for at least one Type), then ffill and bfill across the columns, and finally stack and reset_index to go back to the original shape:
df = df.set_index(['Type', 'Quarter']).unstack()\
       .ffill(axis=1).bfill(axis=1)\
       .stack().reset_index()
print (df)
Type Quarter Value
0 a 1 4.0
1 a 2 1.0
2 a 3 3.0
3 a 4 4.0
4 b 1 7.0
5 b 2 2.0
6 b 3 2.0
7 b 4 2.0
8 c 1 9.0
9 c 2 9.0
10 c 3 9.0
11 c 4 4.0
12 d 1 1.0
13 d 2 1.0
14 d 3 1.0
15 d 4 1.0
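Note that Value comes back as float because the temporarily missing entries force an upcast; if integers are needed, the column could be cast back once everything is filled (a one-line follow-up, assuming no NaN remains):
df['Value'] = df['Value'].astype(int)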

need to filter rows present in one dataframe on another

I have two pandas data frames, and I need to get the rows from the second whose values across all columns do not appear in the first.
For example:
df A
A B C D
6 4 1 6
7 6 6 3
1 6 2 9
8 0 4 9
1 0 2 3
8 4 7 5
4 7 1 1
3 7 3 4
5 2 8 8
3 2 8 8
5 2 8 8
df B
A B C D
1 0 2 3
8 4 7 5
4 7 1 1
1 0 2 3
8 4 7 5
4 7 1 1
3 7 3 4
5 2 8 8
1 1 1 1
2 2 2 2
1 1 1 1
Required output:
A B C D
1 1 1 1
2 2 2 2
1 1 1 1
I tried using pd.merge with inner/left joins on all columns, but it takes a lot of time and memory when there are many rows and columns. Is there another way to do this, such as iterating through each row of df A against df B on all columns and picking only the rows that exist solely in df B?
You can use merge with the indicator parameter:
df_b.merge(df_a, on=['A', 'B', 'C', 'D'],
           how='left', indicator='ind')\
    .query('ind == "left_only"')\
    .drop('ind', axis=1)
Output:
A B C D
9 1 1 1 1
10 2 2 2 2
11 1 1 1 1
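For clarity, indicator='ind' adds a column marking where each merged row came from ('both', 'left_only' or 'right_only'), and the query keeps the rows found only in df_b. A tiny self-contained sketch of the same idea (the frames here are hypothetical, not the data above):
import pandas as pd

df_a = pd.DataFrame({'A': [1, 2], 'B': [1, 2]})
df_b = pd.DataFrame({'A': [1, 3], 'B': [1, 3]})

out = (df_b.merge(df_a, on=['A', 'B'], how='left', indicator='ind')
           .query('ind == "left_only"')
           .drop('ind', axis=1))
# out keeps only the (3, 3) row, which appears in df_b but not in df_a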

Find data from row in Pandas DataFrame based upon calculated value?

As an extension of my previous question, I would like to take a DataFrame like the one below and find the correct row from which to pull data from column C and place it into column D, based upon the following criteria:
B_new = 2*A_old - B_old, i.e. the new row needs to have a B equal to the result of 2*A - B from the old row.
Where A is the same, i.e. A in the new row should have the same value as in the old row.
Any values not found should use a NaN result.
Code:
import pandas as pd
a = [2,2,2,3,3,3,3]
b = [1,2,3,1,3,4,5]
c = [0,1,2,3,4,5,6]
df = pd.DataFrame({'A': a , 'B': b, 'C':c})
print(df)
A B C
0 2 1 0
1 2 2 1
2 2 3 2
3 3 1 3
4 3 3 4
5 3 4 5
6 3 5 6
Desired output:
A B C D
0 2 1 0 2.0
1 2 2 1 1.0
2 2 3 2 0.0
3 3 1 3 6.0
4 3 3 4 4.0
5 3 4 5 NaN
6 3 5 6 3.0
Based upon the solutions in my previous question, I've come up with a method that uses a for loop to move through each unique value of A:
for i in df.A.unique():
    mapping = dict(df[df.A==i][['B', 'C']].values)
    df.loc[df.A==i, 'D'] = (2 * df[df.A==i]['A'] - df[df.A==i]['B']).map(mapping)
However, this seems clunky, and I suspect there is a better way that doesn't use for loops, which from my prior experience tend to be slow.
Question:
What's the fastest way to accomplish this transfer of data within the DataFrame?
You could do the following:
In [370]: (df[['A', 'C']].assign(B=2*df.A - df.B)
              .merge(df, how='left', on=['A', 'B'])
              .assign(B=df.B)
              .rename(columns={'C_x': 'C', 'C_y': 'D'}))
Out[370]:
A C B D
0 2 0 1 2.0
1 2 1 2 1.0
2 2 2 3 0.0
3 3 3 1 6.0
4 3 4 3 4.0
5 3 5 4 NaN
6 3 6 5 3.0
Details:
In [372]: df[['A', 'C']].assign(B=2*df.A - df.B)
Out[372]:
A C B
0 2 0 3
1 2 1 2
2 2 2 1
3 3 3 5
4 3 4 3
5 3 5 2
6 3 6 1
In [373]: df[['A', 'C']].assign(B=2*df.A - df.B).merge(df, how='left', on=['A', 'B'])
Out[373]:
A C_x B C_y
0 2 0 3 2.0
1 2 1 2 1.0
2 2 2 1 0.0
3 3 3 5 6.0
4 3 4 3 4.0
5 3 5 2 NaN
6 3 6 1 3.0
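If the original A B C D column order matters, the merged result can simply be reordered at the end, e.g. (a small variation on the answer above, not from the original post):
res = (df[['A', 'C']].assign(B=2*df.A - df.B)
         .merge(df, how='left', on=['A', 'B'])
         .assign(B=df.B)
         .rename(columns={'C_x': 'C', 'C_y': 'D'})
         [['A', 'B', 'C', 'D']])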

Creating a new column in pandas dataframe using logical indexing and group by

I have a data frame like below
df=pd.DataFrame({'a':['a','a','b','a','b','a','a','a'], 'b' : [1,0,0,1,0,1,1,1], 'c' : [1,2,3,4,5,6,7,8],'d':['1','2','1','2','1','2','1','2']})
df
Out[94]:
a b c d
0 a 1 1 1
1 a 0 2 2
2 b 0 3 1
3 a 1 4 2
4 b 0 5 1
5 a 1 6 2
6 a 1 7 1
7 a 1 8 2
I want something like below
df[(df['a']=='a') & (df['b']==1)]
In [97]:
df[(df['a']=='a') & (df['b']==1)].groupby('d')['c'].rank()
Out[97]:
0 1
3 1
5 2
6 2
7 3
dtype: float64
I want this rank as a new column in dataframe df, and wherever there is no rank I want NaN. So the final output will be something like below:
a b c d rank
0 a 1 1 1 1
1 a 0 2 2 NaN
2 b 0 3 1 NaN
3 a 1 4 2 1
4 b 0 5 1 NaN
5 a 1 6 2 2
6 a 1 7 1 2
7 a 1 8 2 3
I will appreciate all the help and guidance. Thanks a lot.
Almost there, you just need to call transform to return a series with an index aligned to your original df:
In [459]:
df['rank'] = df[(df['a']=='a') & (df['b']==1)].groupby('d')['c'].transform(pd.Series.rank)
df
Out[459]:
a b c d rank
0 a 1 1 1 1
1 a 0 2 2 NaN
2 b 0 3 1 NaN
3 a 1 4 2 1
4 b 0 5 1 NaN
5 a 1 6 2 2
6 a 1 7 1 2
7 a 1 8 2 3
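As a side note, transform('rank') should give the same result as transform(pd.Series.rank) here, and the NaN rows appear automatically because the assignment aligns on the original index (an equivalent sketch, not from the original answer):
mask = (df['a'] == 'a') & (df['b'] == 1)
df['rank'] = df[mask].groupby('d')['c'].transform('rank')
# rows excluded by the mask stay NaN because the assigned series only covers
# the filtered index, and pandas aligns on index when assigning a column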
