I am trying to compute the Desired_Output column, which is defined as follows: for each Name and Subj group, find the minimum of all previous Score values (previous by Date).
Name Date Subj Score Desired_Output
A 2022-05-11 1200 70.88 69.60
A 2022-03-20 1200 69.96 69.60
A 2022-02-23 1200 69.60 69.63
A 2022-01-26 1200 69.63 70.22
A 2022-01-05 1200 70.35 70.22
A 2021-12-08 1200 70.22 70.69
A 2021-11-17 1000 56.73 null
A 2021-11-10 1200 70.69 null
B 2022-05-07 1600 96.16 94.53
B 2022-04-24 1600 94.53 null
B 2022-03-20 2000 124.60 null
B 2022-02-27 1800 109.16 null
B 2022-02-03 1400 82.54 null
Here is the dataset:
import pandas as pd

df = pd.DataFrame({
    'Name': ['A','A','A','A','A','A','A','A','B','B','B','B','B'],
    'Date': ['2022-05-11','2022-03-20','2022-02-23','2022-01-26','2022-01-05','2021-12-08','2021-11-17','2021-11-10','2022-05-07','2022-04-24','2022-03-20','2022-02-27','2022-02-03'],
    'Subj': [1200,1200,1200,1200,1200,1200,1000,1200,1600,1600,2000,1800,1400],
    'Score': [70.88,69.96,69.6,69.63,70.35,70.22,56.73,70.69,96.16,94.53,124.6,109.16,82.54]})
I don't know how to achieve that in pandas, especially without looping over the DataFrame.
Assuming the dates are sorted in reverse chronological order (newest first) within each group, you can use a reversed cummin plus shift per group:
df['Desired'] = (df[::-1]
    .groupby(['Name', 'Subj'])['Score']
    .apply(lambda s: s.cummin().shift())
)
Output:
Name Date Subj Score Desired
0 A 2022-05-11 1200 70.88 69.60
1 A 2022-03-20 1200 69.96 69.60
2 A 2022-02-23 1200 69.60 69.63
3 A 2022-01-26 1200 69.63 70.22
4 A 2022-01-05 1200 70.35 70.22
5 A 2021-12-08 1200 70.22 70.69
6 A 2021-11-17 1000 56.73 NaN
7 A 2021-11-10 1200 70.69 NaN
8 B 2022-05-07 1600 96.16 94.53
9 B 2022-04-24 1600 94.53 NaN
10 B 2022-03-20 2000 124.60 NaN
11 B 2022-02-27 1800 109.16 NaN
12 B 2022-02-03 1400 82.54 NaN
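Depending on the pandas version, groupby.apply may prepend the group keys as extra index levels, which would break the column assignment above. A version-stable sketch of the same idea (running min oldest-to-newest, then a grouped shift to exclude the current row):
rev = df.iloc[::-1]
cummin_incl = rev.groupby(['Name', 'Subj'])['Score'].cummin()  # min of scores so far, oldest to newest
df['Desired'] = cummin_incl.groupby([rev['Name'], rev['Subj']]).shift()  # exclude the current row's score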
I have an input data frame:
ID        Date  Amount
A   2021-08-03     100
A   2021-08-04     100
A   2021-08-06      20
A   2021-08-07     100
A   2021-08-09     300
A   2021-08-11     100
A   2021-08-12     100
A   2021-08-13      10
A   2021-08-23      10
A   2021-08-24      10
A   2021-08-26      10
A   2021-08-28      10
Desired output data frame:
ID        Date  Amount  TwoWeekSum
A   2021-08-03     100         320
A   2021-08-04     100         320
A   2021-08-06      20         320
A   2021-08-07     100         320
A   2021-08-09     300         830
A   2021-08-11     100         830
A   2021-08-12     100         830
A   2021-08-13      10         830
A   2021-08-23      10          40
A   2021-08-24      10          40
A   2021-08-26      10          40
A   2021-08-28      10          40
I want to calculate a trailing two-week total:
TwoWeekSum = current week's total sum + previous week's total sum,
i.e. if the current week is week 34, then TwoWeekSum is the week 34 total plus the week 33 total.
Please help me get this into the output data frame so I can use it for further analysis.
Thank you, folks!
Use:
#convert values to datetimes
df['Date'] = pd.to_datetime(df['Date'])
#convert values to weeks
df['week'] = df['Date'].dt.isocalendar().week
#aggregate sum per ID and week, then add missing weeks and sum with rolling
f = lambda x: (x.reindex(range(x.index.min(), x.index.max() + 1))
                .rolling(2, min_periods=1).sum())
df1 = df.groupby(['ID', 'week'])['Amount'].sum().reset_index(level=0).groupby('ID')[['Amount']].apply(f)
print(df1)
Amount
ID week
A 31 320.0
32 830.0
33 510.0
34 40.0
#last add to original DataFrame per ID and week
df = df.join(df1.rename(columns={'Amount': 'TwoWeekSum'}), on=['ID', 'week']).drop('week', axis=1)
print(df)
ID Date Amount TwoWeekSum
0 A 2021-08-03 100 320.0
1 A 2021-08-04 100 320.0
2 A 2021-08-06 20 320.0
3 A 2021-08-07 100 320.0
4 A 2021-08-09 300 830.0
5 A 2021-08-11 100 830.0
6 A 2021-08-12 100 830.0
7 A 2021-08-13 10 830.0
8 A 2021-08-23 10 40.0
9 A 2021-08-24 10 40.0
10 A 2021-08-26 10 40.0
11 A 2021-08-28 10 40.0
Another option: build the weekly totals on a Period index, take a rolling two-week sum, and map it back onto each row:
per = pd.period_range(df['Date'].min(), df['Date'].max(), freq='W')
mapper = df.groupby(df['Date'].astype('Period[W]'))['Amount'].sum().reindex(per, fill_value=0).rolling(2, min_periods=1).sum()
out = df['Date'].astype('Period[W]').map(mapper)
out
0 320.0
1 320.0
2 320.0
3 320.0
4 830.0
5 830.0
6 830.0
7 830.0
8 40.0
9 40.0
10 40.0
11 40.0
Name: Date, dtype: float64
Then assign out as the TwoWeekSum column:
df.assign(TwoWeekSum=out)
ID Date Amount TwoWeekSum
0 A 2021-08-03 100 320.0
1 A 2021-08-04 100 320.0
2 A 2021-08-06 20 320.0
3 A 2021-08-07 100 320.0
4 A 2021-08-09 300 830.0
5 A 2021-08-11 100 830.0
6 A 2021-08-12 100 830.0
7 A 2021-08-13 10 830.0
8 A 2021-08-23 10 40.0
9 A 2021-08-24 10 40.0
10 A 2021-08-26 10 40.0
11 A 2021-08-28 10 40.0
Update
If there are multiple IDs, group by ID as well, then merge:
per = pd.period_range(df['Date'].min(), df['Date'].max(), freq='W')
s = df['Date'].astype('Period[W]')
idx = pd.MultiIndex.from_product([df['ID'].unique(), per])
df1 = (df.groupby(['ID', s])['Amount'].sum()
         .reindex(idx, fill_value=0)
         .groupby(level=0).rolling(2, min_periods=1).sum()
         .droplevel(0)
         .reset_index()
         .set_axis(['ID', 'period', 'TwoWeekSum'], axis=1))
df.assign(period=s).merge(df1, how='left').drop('period', axis=1)
Try grouping by the ISO calendar week (dt.week is deprecated) and using transform('sum') to get each week's total, then add on the previous week's total (Date must already be datetime, e.g. via pd.to_datetime):
wk = df['Date'].dt.isocalendar().week
df['TwoWeekSum'] = df.groupby(wk)['Amount'].transform('sum') + wk.sub(1).map(df.groupby(wk)['Amount'].sum()).fillna(0).astype(int)
And then:
print(df)
Gives:
ID Date Amount TwoWeekSum
0 A 2021-08-03 100 320
1 A 2021-08-04 100 320
2 A 2021-08-06 20 320
3 A 2021-08-07 100 320
4 A 2021-08-09 300 830
5 A 2021-08-11 100 830
6 A 2021-08-12 100 830
7 A 2021-08-13 10 830
8 A 2021-08-23 10 40
9 A 2021-08-24 10 40
10 A 2021-08-26 10 40
11 A 2021-08-28 10 40
How can I overwrite a pandas dataframe with another one in the most pythonic and fastest way?
Let me elaborate on my use case, the methods I have tried so far, and the results I got with each of them.
I have two pandas dataframes; let me call them bigdf and smalldf. I want to replace rows and columns of bigdf with the values from smalldf.
bigdf
A B C D E F G H I J K L M N
0 2021-02-12 80 1 3000 100 3100 2021-02-12 05:00:00 0 2021-02-12 05:01:00 1.0 60.0 1.0 2021-02-12 05:03:00 -6.912197
1 2021-02-12 80 1 4000 100 4100 2021-02-12 05:05:00 1000 NaT 0.0 NaN NaN 2021-02-12 05:05:40 -6.658210
2 2021-02-12 80 1 5000 150 5100 2021-02-12 05:10:00 1200 NaT NaN NaN NaN NaT NaN
3 2021-02-12 80 1 6000 150 6100 2021-02-12 05:15:00 1500 NaT NaN NaN NaN NaT NaN
4 2021-02-12 40 1 7000 300 7100 2021-02-12 05:10:00 700 NaT NaN NaN NaN NaT NaN
5 2021-02-12 40 1 8000 300 8100 2021-02-12 05:05:00 980 NaT NaN NaN NaN NaT NaN
6 2021-02-12 60 1 9000 400 9100 2021-02-12 05:15:00 1300 NaT NaN NaN NaN NaT NaN
smalldf
A B C E F I J M N
0 2021-02-12 80 1 100 3100 NaT NaN NaT -6.912197
1 2021-02-12 80 1 100 4100 2021-02-12 05:04:30 1.0 2021-02-12 05:05:59 -6.658210
2 2021-02-12 80 1 150 5100 2021-02-12 05:11:30 1.0 2021-02-12 05:11:00 53.308885
The matching columns in both dataframes are A, B, E and F.
The result I am expecting after overwriting bigdf with the smalldf values is:
Resultant bigdf
A B C D E F G H I J K L M N
0 2021-02-12 80 1 3000 100 3100 2021-02-12 05:00:00 0 NaT NaN 60.0 1.0 NaT -6.912197
1 2021-02-12 80 1 4000 100 4100 2021-02-12 05:05:00 1000 2021-02-12 05:04:30 1.0 NaN NaN 2021-02-12 05:05:59 -6.658210
2 2021-02-12 80 1 5000 150 5100 2021-02-12 05:10:00 1200 2021-02-12 05:11:30 1.0 NaN NaN 2021-02-12 05:11:00 53.308885
3 2021-02-12 80 1 6000 150 6100 2021-02-12 05:15:00 1500 NaT NaN NaN NaN NaT NaN
4 2021-02-12 40 1 7000 300 7100 2021-02-12 05:10:00 700 NaT NaN NaN NaN NaT NaN
5 2021-02-12 40 1 8000 300 8100 2021-02-12 05:05:00 980 NaT NaN NaN NaN NaT NaN
6 2021-02-12 60 1 9000 400 9100 2021-02-12 05:15:00 1300 NaT NaN NaN NaN NaT NaN
As is evident from the resultant bigdf, the values in the rows and columns that matched smalldf are all overwritten by the corresponding values from smalldf, covering all of the scenarios below:
Value overwritten by another value.
NaN/NaT overwritten by value.
Value overwritten by NaN/NaT.
To get this result I have used the following code:
smalldf = smalldf.set_index(['A', 'B', 'E', 'F'])
bigdf = bigdf.set_index(['A', 'B', 'E', 'F'])
var1 = bigdf['I']
var2 = bigdf['J']
var3 = bigdf['M']
var4 = bigdf['N']
for index, row in smalldf.iterrows():
    if index in bigdf.index:
        var1[index] = row['I']
        var2[index] = row['J']
        var3[index] = row['M']
        var4[index] = row['N']
bigdf['I'] = var1
bigdf['J'] = var2
bigdf['M'] = var3
bigdf['N'] = var4
bigdf = bigdf.reset_index()
But the problem is that, although this code gives correct results, it is very slow: for a bigdf with about 8K rows and 50 columns it takes about 30 seconds.
For faster output I have tried combine_first, but this function only overwrites null values.
Next I tried bigdf.update(smalldf), but this one does not overwrite a value with a null (if the case be).
Can anyone suggest a faster and accurate method to achieve the result I am looking for? I am looking for a solution that takes no more than a couple of seconds to overwrite a dataframe of order approx (10K, 50); smalldf will be approx (2K, 30).
My emphasis here is on execution speed. Any help would be appreciated.
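A minimal sketch of a vectorized approach (not from the question; it assumes the key columns A, B, E, F uniquely identify rows in both frames): index both frames by the keys, then assign the overlapping rows and columns in a single .loc call, which handles all three overwrite scenarios at once.
keys = ['A', 'B', 'E', 'F']
cols = ['I', 'J', 'M', 'N']  # columns to overwrite
big = bigdf.set_index(keys)
small = smalldf.set_index(keys)
common = big.index.intersection(small.index)  # rows present in both frames
big.loc[common, cols] = small.loc[common, cols].to_numpy()  # bypass alignment; row order matches on both sides
bigdf = big.reset_index()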
Given the following df,
Account contract_date type item_id quantity price tax net_amount
ABC123 2020-06-17 P 1409 1000 0.355 10 400
ABC123 2020-06-17 S 1409 2000 0.053 15 150
ABC123 2020-06-17 C 1409 500 0.25 5 180
ABC123 2020-06-17 S 1370 5000 0.17 30 900
DEF456 2020-06-18 P 7214 3000 0.1793 20 600
I would like to transform df, grouped by Account, contract_date and item_id, splitting the values of the different types into separate columns. I can do this with a for loop/apply, but I would welcome suggestions for a groupby/pivot or any vectorized/pythonic solution. The intended result is as follows:
Account contract_date item_id quantity_P quantity_S quantity_C price_P price_S price_C tax_P tax_S tax_C net_amount_P net_amount_S net_amount_C
ABC123 2020-06-17 1409 1000 2000 500 0.355 0.053 0.25 10 15 5 400 150 180
ABC123 2020-06-17 1370 0 5000 0 0 0.17 0 0 30 0 0 900 0
DEF456 2020-06-18 7214 3000 0 0 0.1793 0 0 20 0 0 600 0 0
*Although the alignment looks a bit off, you can copy the df and use df = pd.read_clipboard() to read the table. I appreciate your help. Thank you.
Edit: I get an error when using df.pivot(index=['Account', 'contract_date', 'item_id'], columns=['type'])
Use df.pivot:
In [1660]: df.pivot(index=['Account', 'contract_date', 'item_id'], columns=['type'])
Out[1660]:
quantity price tax net_amount
type C P S C P S C P S C P S
Account contract_date item_id
ABC123 2020-06-17 1370 NaN NaN 5000.0 NaN NaN 0.170 NaN NaN 30.0 NaN NaN 900.0
1409 500.0 1000.0 2000.0 0.25 0.3550 0.053 5.0 10.0 15.0 180.0 400.0 150.0
DEF456 2020-06-18 7214 NaN 3000.0 NaN NaN 0.1793 NaN NaN 20.0 NaN NaN 600.0 NaN
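To match the intended layout exactly, a possible follow-up (the flat quantity_P-style names and zero-filling are taken from the desired table; a list-valued index in df.pivot needs pandas >= 1.1):
out = df.pivot(index=['Account', 'contract_date', 'item_id'], columns='type')
out.columns = [f'{value}_{typ}' for value, typ in out.columns]  # e.g. quantity_P
out = out.fillna(0).reset_index()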
I have a list of monthly sales numbers for events. I have a column Event_Ind that indicates whether that month had an event. I need to get the 3 values (inclusive) prior to each event. Values are allowed to overlap.
import pandas as pd
dates = pd.date_range(start='2019-01-01', end='2020-01-01', freq='M')
values = [1000,1067,1099,1100,2000,1000,1057,1082,1200,1300,1453,1500]
event_ind = ["*","","","","*","","","","*","","*",""]
df = pd.DataFrame({'Dates':dates, 'Values':values, 'Event_Ind':event_ind})
Dates Values Event_Ind
0 2019-01-31 1000 *
1 2019-02-28 1067
2 2019-03-31 1099
3 2019-04-30 1100
4 2019-05-31 2000 *
5 2019-06-30 1000
6 2019-07-31 1057
7 2019-08-31 1082
8 2019-09-30 1200 *
9 2019-10-31 1300
10 2019-11-30 1453 *
11 2019-12-31 1500
Goal would be for this sample data:
Dates Values Event_Ind
0 1/31/2019 1000 *
1 3/31/2019 1099
2 4/30/2019 1100
3 5/31/2019 2000 *
4 7/31/2019 1057
5 8/31/2019 1082
6 9/30/2019 1200 *
7 9/30/2019 1200 *
8 10/31/2019 1300
9 11/30/2019 1453 *
I'm thinking I can do something with shift() or groupby.tail(), but I can't seem to use them to get my desired output.
You could do something along these lines:
import numpy as np

s = df.Event_Ind.eq('*')
i = np.concatenate([np.arange(a, b + 1) for b, a in zip(s[s].index, s[s].index - 2)])
df.loc[i[i >= 0]]
Dates Values Event_Ind
0 2019-01-31 1000 *
1 2019-02-28 1067
2 2019-03-31 1099
3 2019-04-30 1100
4 2019-05-31 2000 *
5 2019-06-30 1000
6 2019-07-31 1057
7 2019-08-31 1082
8 2019-09-30 1200 *
7 2019-08-31 1082
8 2019-09-30 1200 *
9 2019-10-31 1300
10 2019-11-30 1453 *
Explanation
[np.arange(a, b + 1) for b, a in zip(s[s].index, s[s].index - 2)]
The above zips the index of each row marked with * with the index two rows above it, so each np.arange(a, b + 1) yields the indexes of one window of rows to show in the final df.
Since this generates a list of arrays, you need np.concatenate to get a single array of indexes to keep.
df.loc[i[i >= 0]]
Finally, i[i >= 0] filters out the negative values in i (negative indexes have a meaning in Python, so windows that would start before the first row must be clipped), and df.loc[...] retrieves the final df.
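A sketch of the same idea using NumPy broadcasting instead of a list comprehension (it works on positions via iloc, so it assumes nothing about the index labels):
import numpy as np

ev = np.flatnonzero(df['Event_Ind'].eq('*'))         # positions of the events
windows = (ev[:, None] + np.arange(-2, 1)).ravel()   # the 3 positions ending at each event
df.iloc[windows[windows >= 0]]                       # clip windows that start before row 0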
Try (note that, unlike the answer above, this keeps each row at most once, so overlapping windows are not repeated):
x = df["Event_Ind"] == "*"
ind = list(map(lambda i: any(x[i:i + 3]), range(len(x))))
print(df.loc[ind])
Output:
Dates Values Event_Ind
0 2019-01-31 1000 *
2 2019-03-31 1099
3 2019-04-30 1100
4 2019-05-31 2000 *
6 2019-07-31 1057
7 2019-08-31 1082
8 2019-09-30 1200 *
9 2019-10-31 1300
10 2019-11-30 1453 *
When I use this syntax it creates a series (sum) rather than adding a new column to my dataframe:
sum = data['variance'] = data.budget + data.actual
My dataframe data currently has everything except the budget - actual column. How do I create the variance column?
cluster date budget actual budget - actual
0 a 2014-01-01 00:00:00 11000 10000 1000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
I think you've misunderstood some Python syntax; the following does two assignments:
In [11]: a = b = 1
In [12]: a
Out[12]: 1
In [13]: b
Out[13]: 1
So in your code it was as if you were doing:
sum = df['budget'] + df['actual'] # a Series
# and
df['variance'] = df['budget'] + df['actual'] # assigned to a column
The latter creates a new column for df:
In [21]: df
Out[21]:
cluster date budget actual
0 a 2014-01-01 00:00:00 11000 10000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
In [22]: df['variance'] = df['budget'] + df['actual']
In [23]: df
Out[23]:
cluster date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
As an aside, you shouldn't use sum as a variable name, as it overrides the built-in sum function.
df['variance'] = df.loc[:,['budget','actual']].sum(axis=1)
You could also use the .add() function:
df.loc[:,'variance'] = df.loc[:,'budget'].add(df.loc[:,'actual'])
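If either column can contain NaN, Series.add also accepts a fill_value (treating missing values as 0 here is an assumption about the desired behavior):
df.loc[:,'variance'] = df.loc[:,'budget'].add(df.loc[:,'actual'], fill_value=0)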
The same thing can be done using a lambda function.
Here I am reading the data from an xlsx file.
import pandas as pd
df = pd.read_excel("data.xlsx", sheet_name=4)
print(df)
Output:
cluster Unnamed: 1 date budget actual
0 a 2014-01-01 00:00:00 11000 10000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
Sum the two columns into a third, new one:
df['variance'] = df.apply(lambda x: x['budget'] + x['actual'], axis=1)
print(df)
Output:
cluster Unnamed: 1 date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
If "budget" has any NaN values but you don't want it to sum to NaN then try:
def fun (b, a):
if math.isnan(b):
return a
else:
return b + a
f = np.vectorize(fun, otypes=[float])
df['variance'] = f(df['budget'], df_Lp['actual'])
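A vectorized equivalent of the function above (it likewise assumes a missing budget should count as 0):
df['variance'] = df['budget'].fillna(0) + df['actual']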
This is the most elegant solution; it follows DRY and works absolutely great:
dataframe_name[['col1', 'col2', 'col3']].sum(axis=1, skipna=True)
Thank you.
eval lets you sum and create columns right away:
In [12]: data.eval('variance = budget + actual', inplace=True)
In [13]: data
Out[13]:
cluster date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
Since inplace=True you don't need to assign it back to data.