Improve setting values in pandas - python

I want to add some columns of group-level features (std, mean, ...). The code below works, but the dataset is really big and the performance is bad. Is there a good way to improve the code? Thanks
import pandas as pd

df = pd.DataFrame([[1, 2, 1], [1, 2, 2], [1, 3, 3], [1, 3, 4], [2, 8, 9], [2, 11, 11]],
                  columns=['A', 'B', 'C'])
df['mean'] = 0
df2 = df.groupby('A')
for a, group in df2:
    mean = group['C'].mean()
    df.loc[df['A'] == a, 'mean'] = mean
df
'''
A B C mean
0 1 2 1 2.5
1 1 2 2 2.5
2 1 3 3 2.5
3 1 3 4 2.5
4 2 8 9 10.0
5 2 11 11 10.0
'''

Pandas' groupby.transform does the job of broadcasting aggregate statistics across the original index. This makes it perfect for your purposes and should be considered the idiomatic way to perform this task.
Pipelined solution that produces a copy of df with the new column:
df.assign(Mean=df.groupby('A').C.transform('mean'))
A B C Mean
0 1 2 1 2.5
1 1 2 2 2.5
2 1 3 3 2.5
3 1 3 4 2.5
4 2 8 9 10.0
5 2 11 11 10.0
In-place assignment:
df['Mean'] = df.groupby('A').C.transform('mean')
df
A B C Mean
0 1 2 1 2.5
1 1 2 2 2.5
2 1 3 3 2.5
3 1 3 4 2.5
4 2 8 9 10.0
5 2 11 11 10.0
Alternatively, you can use pd.factorize and np.bincount:
import numpy as np

f, u = pd.factorize(df.A.values)       # dense integer codes for the values in A
totals = np.bincount(f, df.C.values)   # weighted bincount: per-group sums of C
counts = np.bincount(f)                # per-group row counts
df.assign(Mean=(totals / counts)[f])
A B C Mean
0 1 2 1 2.5
1 1 2 2 2.5
2 1 3 3 2.5
3 1 3 4 2.5
4 2 8 9 10.0
5 2 11 11 10.0

Here is one way:
s = df.groupby('A')['C'].mean()
df['mean'] = df['A'].map(s)
# A B C mean
# 0 1 2 1 2.5
# 1 1 2 2 2.5
# 2 1 3 3 2.5
# 3 1 3 4 2.5
# 4 2 8 9 10.0
# 5 2 11 11 10.0
Explanation
First, group by 'A' and calculate the mean of 'C'. This creates a Series whose index holds the unique entries of 'A' and whose values are the group means.
Second, map this Series onto your DataFrame. This works because pd.Series.map can take a Series as input, looking each value up in that Series' index.
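Put together as a runnable sketch (using the sample frame from the question), with the intermediate Series shown:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 1], [1, 2, 2], [1, 3, 3], [1, 3, 4],
                   [2, 8, 9], [2, 11, 11]], columns=['A', 'B', 'C'])

# Step 1: per-group means, indexed by the unique values of 'A'
s = df.groupby('A')['C'].mean()   # A=1 -> 2.5, A=2 -> 10.0

# Step 2: map looks each value of df['A'] up in the index of s
df['mean'] = df['A'].map(s)
```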

You can compute the per-A mean on the index (note: Series.mean(level=0) was removed in modern pandas; grouping by the index level is equivalent):
df.assign(mean=df.A.map(df.set_index('A').C.groupby(level=0).mean()))
Out[28]:
A B C mean
0 1 2 1 2.5
1 1 2 2 2.5
2 1 3 3 2.5
3 1 3 4 2.5
4 2 8 9 10.0
5 2 11 11 10.0
Or by aligning with reindex (modern pandas no longer accepts an array of keys in Series.get, nor the level argument to mean):
df['mean'] = df.set_index('A').C.groupby(level=0).mean().reindex(df.A).values
df
Out[35]:
A B C mean
0 1 2 1 2.5
1 1 2 2 2.5
2 1 3 3 2.5
3 1 3 4 2.5
4 2 8 9 10.0
5 2 11 11 10.0

Related

Mean of the summations of group by

I have data that looks like this:
A B C
1 1 1
1 1 5
1 2 7
1 2 3
2 1 8
2 1 10
2 2 1
2 2 4
I need to group by A and B, sum C, and then take the mean of the sums for each unique value of A.
Output1:
A  B  Sum C
1  1      6
   2     10
2  1     18
   2      5
Output2:
A  Mean C
1     8.0
2    11.5
My attempt:
DailyCount_ps = (df_new.groupby(["A", "B"])["C"].sum()).rename("Sum C")
Any help?
Well you can do it in 2 steps:
df = df.groupby(['A', 'B'], as_index=False)['C'].sum().rename({'C': 'Sum C'}, axis=1)
df['Mean C'] = df.groupby('A')['Sum C'].transform('mean')
df
A B Sum C Mean C
0 1 1 6 8.0
1 1 2 10 8.0
2 2 1 18 11.5
3 2 2 5 11.5
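A self-contained version of those two steps, using the sample data from the question:

```python
import pandas as pd

df_new = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2],
                       'B': [1, 1, 2, 2, 1, 1, 2, 2],
                       'C': [1, 5, 7, 3, 8, 10, 1, 4]})

# Step 1: sum C within each (A, B) pair
out = (df_new.groupby(['A', 'B'], as_index=False)['C'].sum()
             .rename({'C': 'Sum C'}, axis=1))

# Step 2: broadcast the mean of those sums back across each A group
out['Mean C'] = out.groupby('A')['Sum C'].transform('mean')
```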

Add Missing Values To Pandas Groups

Let's say I have a DataFrame like:
import pandas as pd
df = pd.DataFrame({"Quarter": [1,2,3,4,1,2,3,4,4],
"Type": ["a","a","a","a","b","b","c","c","d"],
"Value": [4,1,3,4,7,2,9,4,1]})
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 c 9
7 4 c 4
8 4 d 1
For each Type, there needs to be a total of 4 rows that represent one of four quarters as indicated by the Quarter column. So, it would look like:
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 b NaN
7 4 b NaN
8 1 c NaN
9 2 c NaN
10 3 c 9
11 4 c 4
12 1 d NaN
13 2 d NaN
14 3 d NaN
15 4 d 1
Then, where there are missing values in the Value column, fill the missing values using the next closest available value with the same Type (if it's the last quarter that is missing then fill with the third quarter):
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 b 2
7 4 b 2
8 1 c 9
9 2 c 9
10 3 c 9
11 4 c 4
12 1 d 1
13 2 d 1
14 3 d 1
15 4 d 1
What's the best way to accomplish this?
Use reindex:
idx = pd.MultiIndex.from_product([
    df['Type'].unique(),
    range(1, 5)
], names=['Type', 'Quarter'])

df.set_index(['Type', 'Quarter']).reindex(idx) \
  .groupby('Type') \
  .transform(lambda v: v.ffill().bfill()) \
  .reset_index()
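For reference, a self-contained run of this reindex approach on the question's sample frame, to show the result it produces:

```python
import pandas as pd

df = pd.DataFrame({"Quarter": [1, 2, 3, 4, 1, 2, 3, 4, 4],
                   "Type": ["a", "a", "a", "a", "b", "b", "c", "c", "d"],
                   "Value": [4, 1, 3, 4, 7, 2, 9, 4, 1]})

# Every (Type, Quarter) combination; reindexing exposes the gaps as NaN
idx = pd.MultiIndex.from_product([df['Type'].unique(), range(1, 5)],
                                 names=['Type', 'Quarter'])

out = (df.set_index(['Type', 'Quarter'])
         .reindex(idx)
         .groupby('Type')                        # fill only within each Type
         .transform(lambda v: v.ffill().bfill())
         .reset_index())
```

This yields the 16-row frame with b's quarters 3-4 filled with 2, c's quarters 1-2 filled with 9, and d's quarters 1-3 filled with 1.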
You can use set_index and unstack to create the missing rows you want (assuming each quarter is present in at least one Type), then ffill and bfill across the columns, and finally stack and reset_index to return to the original shape:
df = df.set_index(['Type', 'Quarter']).unstack() \
       .ffill(axis=1).bfill(axis=1) \
       .stack().reset_index()
print(df)
Type Quarter Value
0 a 1 4.0
1 a 2 1.0
2 a 3 3.0
3 a 4 4.0
4 b 1 7.0
5 b 2 2.0
6 b 3 2.0
7 b 4 2.0
8 c 1 9.0
9 c 2 9.0
10 c 3 9.0
11 c 4 4.0
12 d 1 1.0
13 d 2 1.0
14 d 3 1.0
15 d 4 1.0

Select all rows from where a condition is true in pandas

I have a dataframe
Id Seqno. Event
1 2 A
1 3 B
1 5 A
1 6 A
1 7 D
2 0 E
2 1 A
2 2 B
2 4 A
2 6 B
I want to get, for each Id, all events that happened since the second-most-recent occurrence of event A. Seqno. is a sequence number within each Id.
The output will be
Id Seqno. Event
1 5 A
1 6 A
1 7 D
2 1 A
2 2 B
2 4 A
2 6 B
So far I tried:
y = x.groupby('Id').apply(lambda x: x.Event.eq('A').cumsum().tail(2)).reset_index()
p = y.groupby('Id').apply(lambda x: x.iloc[0]).reset_index(drop=True)
q = x.reset_index()
s = pd.merge(q, p, on='Id')
dd = s[s['index'] >= s['level_1']]
I was wondering if there is a good way of doing it.
Use groupby with cumsum, subtract it from the count of A's per group, and filter:
g = df['Event'].eq('A').groupby(df['Id'])
df[(g.transform('sum') - g.cumsum()).le(1)]
Id Seqno. Event
2 1 5 A
3 1 6 A
4 1 7 D
6 2 1 A
7 2 2 B
8 2 4 A
9 2 6 B
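To unpack why the filter keeps those rows, the same frame can be rebuilt from the question's data and the intermediate quantity inspected (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'Id':     [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'Seqno.': [2, 3, 5, 6, 7, 0, 1, 2, 4, 6],
                   'Event':  list('ABAADEABAB')})

g = df['Event'].eq('A').groupby(df['Id'])

# For each row: number of A's in its Id group strictly after that row
remaining = g.transform('sum') - g.cumsum()

# remaining <= 1 keeps everything from the second-to-last A onward
out = df[remaining.le(1)]
```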
Thanks to cold, ALollz and Vaishali. Per the explanation from the comments: use groupby with cumcount to get a countdown of the remaining A's, then reindex and ffill:
s=df.loc[df.Event=='A'].groupby('Id').cumcount(ascending=False).add(1).reindex(df.index)
s.groupby(df['Id']).ffill()
Out[57]:
0 3.0
1 3.0
2 2.0
3 1.0
4 1.0
5 NaN
6 2.0
7 2.0
8 1.0
9 1.0
dtype: float64
yourdf=df[s.groupby(df['Id']).ffill()<=2]
yourdf
Out[58]:
Id Seqno. Event
2 1 5 A
3 1 6 A
4 1 7 D
6 2 1 A
7 2 2 B
8 2 4 A
9 2 6 B

Mapping data from one dataframe to another based on grouby

Probably a similar question has been asked before, but I could not find one that solves my problem. Maybe I am not using the right search words!
I have two pandas Dataframes as below:
import pandas as pd
import numpy as np
df1
a = np.array([1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3])
b = np.array([1,1,2,2,3,3,1,1,2,2,3,3,1,1,2,2,3,3])
df1 = pd.DataFrame({'a':a, 'b':b})
print(df1)
a b
0 1 1
1 1 1
2 1 2
3 1 2
4 1 3
5 1 3
6 2 1
7 2 1
8 2 2
9 2 2
10 2 3
11 2 3
12 3 1
13 3 1
14 3 2
15 3 2
16 3 3
17 3 3
df2 is as below:
a2 = np.array([1,1,1,2,2,2,3,3,3])
b2 = np.array([1,2,3,1,2,3,1,2,3])
c = np.array([4,8,3,np.nan, 2, 5,6, np.nan, 1])
df2 = pd.DataFrame({'a':a2, 'b':b2, 'c': c})
a b c
0 1 1 4.0
1 1 2 8.0
2 1 3 3.0
3 2 1 NaN
4 2 2 2.0
5 2 3 5.0
6 3 1 6.0
7 3 2 NaN
8 3 3 1.0
Now I want to map column c from df2 onto df1, keeping the grouping on columns a and b. Therefore, df1 is modified as shown below:
a b c
0 1 1 4
1 1 1 4
2 1 2 8
3 1 2 8
4 1 3 3
5 1 3 3
6 2 1 NaN
7 2 1 NaN
8 2 2 2.0
9 2 2 2.0
10 2 3 5.0
11 2 3 5.0
12 3 1 6.0
13 3 1 6.0
14 3 2 NaN
15 3 2 NaN
16 3 3 1.0
17 3 3 1.0
How can I achieve this with simple and intuitive way using pandas?
Quite simple using merge:
df1.merge(df2)
a b c
0 1 1 4.0
1 1 1 4.0
2 1 2 8.0
3 1 2 8.0
4 1 3 3.0
5 1 3 3.0
6 2 1 NaN
7 2 1 NaN
8 2 2 2.0
9 2 2 2.0
10 2 3 5.0
11 2 3 5.0
12 3 1 6.0
13 3 1 6.0
14 3 2 NaN
15 3 2 NaN
16 3 3 1.0
17 3 3 1.0
If you have more columns and you want to specifically only merge on a and b, use:
df1.merge(df2, on=['a','b'])
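One caveat: merge defaults to an inner join, so any (a, b) pair present in df1 but absent from df2 would silently drop rows. If that can happen, pass how='left' to keep every row of df1 (a small sketch with made-up data):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 1, 3], 'b': [1, 1, 3]})
df2 = pd.DataFrame({'a': [1, 2], 'b': [1, 2], 'c': [4.0, 2.0]})

# how='left' preserves all rows of df1; unmatched (a, b) pairs get NaN in c
out = df1.merge(df2, on=['a', 'b'], how='left')
```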

Find data from row in Pandas DataFrame based upon calculated value?

As an extension of my previous question, I would like to take a DataFrame like the one below and find the correct row from which to pull data from column C and place it into column D, based upon the following criteria:
B_new = 2*A_old - B_old, i.e. the new row needs a B equal to the following result from the old row: 2*A - B.
A is the same, i.e. A in the new row should have the same value as in the old row.
Any values not found should result in NaN.
Code:
import pandas as pd
a = [2,2,2,3,3,3,3]
b = [1,2,3,1,3,4,5]
c = [0,1,2,3,4,5,6]
df = pd.DataFrame({'A': a , 'B': b, 'C':c})
print(df)
A B C
0 2 1 0
1 2 2 1
2 2 3 2
3 3 1 3
4 3 3 4
5 3 4 5
6 3 5 6
Desired output:
A B C D
0 2 1 0 2.0
1 2 2 1 1.0
2 2 3 2 0.0
3 3 1 3 6.0
4 3 3 4 4.0
5 3 4 5 NaN
6 3 5 6 3.0
Based upon the solutions to my previous question, I've come up with a method that uses a for loop to move through each unique value of A:
for i in df.A.unique():
    mapping = dict(df[df.A == i][['B', 'C']].values)
    df.loc[df.A == i, 'D'] = (2 * df[df.A == i]['A'] - df[df.A == i]['B']).map(mapping)
However, this seems clunky, and I suspect there is a better way that doesn't use for loops, which from my prior experience tend to be slow.
Question:
What's the fastest way to accomplish this transfer of data within the DataFrame?
You could use a self-merge:
In [370]: (df[['A', 'C']].assign(B=2*df.A - df.B)
.merge(df, how='left', on=['A', 'B'])
.assign(B=df.B)
.rename(columns={'C_x': 'C', 'C_y': 'D'}) )
Out[370]:
A C B D
0 2 0 1 2.0
1 2 1 2 1.0
2 2 2 3 0.0
3 3 3 1 6.0
4 3 4 3 4.0
5 3 5 4 NaN
6 3 6 5 3.0
Details:
In [372]: df[['A', 'C']].assign(B=2*df.A - df.B)
Out[372]:
A C B
0 2 0 3
1 2 1 2
2 2 2 1
3 3 3 5
4 3 4 3
5 3 5 2
6 3 6 1
In [373]: df[['A', 'C']].assign(B=2*df.A - df.B).merge(df, how='left', on=['A', 'B'])
Out[373]:
A C_x B C_y
0 2 0 3 2.0
1 2 1 2 1.0
2 2 2 1 0.0
3 3 3 5 6.0
4 3 4 3 4.0
5 3 5 2 NaN
6 3 6 1 3.0
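As an alternative to the self-merge (not from the original answers; a sketch of the same lookup phrased with set_index and reindex):

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 2, 3, 1, 3, 4, 5],
                   'C': [0, 1, 2, 3, 4, 5, 6]})

# Index C by the (A, B) pair, then look up the key (A, 2*A - B) per row;
# pairs with no match come back as NaN
lookup = df.set_index(['A', 'B'])['C']
key = pd.MultiIndex.from_arrays([df['A'], 2 * df['A'] - df['B']])
df['D'] = lookup.reindex(key).values
```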
