Pandas: sum column values against specific value in another column - python

I am using this data frame:
InvoiceNo Amount Year-Month
1 100 2019-01
2 125 2019-02
3 200 2019-02
4 300 2019-03
5 120 2019-03
6 350 2019-03
7 500 2019-04
8 230 2019-04
9 100 2019-04
10 200 2019-05
I want to sum the values for all rows in the same month and display a new column with those monthly totals, like this:
InvoiceNo Amount Year-Month MonthlyValue
1 100 2019-01 100
2 125 2019-02 325
3 200 2019-02 325
4 300 2019-03 770
5 120 2019-03 770
6 350 2019-03 770
7 500 2019-04 830
8 230 2019-04 830
9 100 2019-04 830
10 200 2019-05 200
I tried df['MonthlyValue'] = df.groupby(['Year-Month'])['Year-Month'].transform(sum) and it doesn't seem to work.

You are close; you need to specify the Amount column after the groupby:
df['MonthlyValue'] = df.groupby('Year-Month')['Amount'].transform('sum')
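A minimal, self-contained sketch of the fix (frame values copied from the question), in case you want to run it end to end:

import pandas as pd

df = pd.DataFrame({
    'InvoiceNo': range(1, 11),
    'Amount': [100, 125, 200, 300, 120, 350, 500, 230, 100, 200],
    'Year-Month': ['2019-01', '2019-02', '2019-02', '2019-03', '2019-03',
                   '2019-03', '2019-04', '2019-04', '2019-04', '2019-05'],
})

# transform('sum') computes each group's total and broadcasts it back onto
# every row of the group, which is exactly what the new column needs
df['MonthlyValue'] = df.groupby('Year-Month')['Amount'].transform('sum')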

Related

Special kind of dataframes merging — inserting into a dataframe according to date values

df_1 is as follows -
date id score
2019-05 5 78.9
2019-06 5 77.5
2019-07 5 80.2
2019-08 5 82.0
2019-05 2 79.9
2019-06 2 69.3
2019-07 2 75.2
2019-08 2 80.0
2019-05 70 68.8
2019-06 70 67.5
2019-07 70 70.2
2019-08 70 86.0
df_2 is as follows -
date id score
2019-01 2 79.1
2019-02 2 79.2
2019-03 2 75.2
2019-04 2 80.0
2019-01 5 78.9
2019-02 5 78.5
2019-03 5 80.8
2019-04 5 82.8
2019-01 70 68.4
2019-02 70 72.2
2019-03 70 70.5
2019-04 70 81.0
How can I merge them into one dataframe according to date and id, resulting in -
date id score
2019-01 2 79.1
2019-02 2 79.2
2019-03 2 75.2
2019-04 2 80.0
2019-05 2 79.9
2019-06 2 69.3
2019-07 2 75.2
2019-08 2 80.0
2019-01 5 78.9
2019-02 5 78.5
2019-03 5 80.8
2019-04 5 82.8
2019-05 5 78.9
2019-06 5 77.5
2019-07 5 80.2
2019-08 5 82.0
2019-01 70 68.4
2019-02 70 72.2
2019-03 70 70.5
2019-04 70 81.0
2019-05 70 68.8
2019-06 70 67.5
2019-07 70 70.2
2019-08 70 86.0
Use pd.concat:
pd.concat([df_1, df_2]).sort_values(["date", "id"]).reset_index(drop=True)
Concat and sort values:
pd.concat([df_1, df_2]).sort_values(['id', 'date'])
date id score
0 2019-01 2 79.1
1 2019-02 2 79.2
2 2019-03 2 75.2
3 2019-04 2 80.0
4 2019-05 2 79.9
5 2019-06 2 69.3
6 2019-07 2 75.2
7 2019-08 2 80.0
4 2019-01 5 78.9
5 2019-02 5 78.5
6 2019-03 5 80.8
7 2019-04 5 82.8
0 2019-05 5 78.9
1 2019-06 5 77.5
2 2019-07 5 80.2
3 2019-08 5 82.0
8 2019-01 70 68.4
9 2019-02 70 72.2
10 2019-03 70 70.5
11 2019-04 70 81.0
8 2019-05 70 68.8
9 2019-06 70 67.5
10 2019-07 70 70.2
11 2019-08 70 86.0
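A note on the duplicated index labels above: pd.concat keeps each input frame's original index, so the sorted result interleaves the two old indices; reset_index(drop=True), as in the first snippet, replaces them with a clean 0..n-1 range. A minimal sketch, assuming two frames shaped like the question's:

import pandas as pd

df_1 = pd.DataFrame({'date': ['2019-05', '2019-06'], 'id': [5, 5], 'score': [78.9, 77.5]})
df_2 = pd.DataFrame({'date': ['2019-01', '2019-02'], 'id': [5, 5], 'score': [78.9, 78.5]})

# all rows from both frames, ordered by id then date, with a fresh index
out = pd.concat([df_1, df_2]).sort_values(['id', 'date']).reset_index(drop=True)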

How to calculate last two week sum for each group ID

I have an input data frame:
ID  Date        Amount
A   2021-08-03     100
A   2021-08-04     100
A   2021-08-06      20
A   2021-08-07     100
A   2021-08-09     300
A   2021-08-11     100
A   2021-08-12     100
A   2021-08-13      10
A   2021-08-23      10
A   2021-08-24      10
A   2021-08-26      10
A   2021-08-28      10
Desired output data frame:
ID  Date        Amount  TwoWeekSum
A   2021-08-03     100         320
A   2021-08-04     100         320
A   2021-08-06      20         320
A   2021-08-07     100         320
A   2021-08-09     300         830
A   2021-08-11     100         830
A   2021-08-12     100         830
A   2021-08-13      10         830
A   2021-08-23      10          40
A   2021-08-24      10          40
A   2021-08-26      10          40
A   2021-08-28      10          40
I want to calculate a two-week total for each row:
twoWeekSum = current week's total + previous week's total, i.e. if the current week is 34, then twoWeekSum is the week 34 total plus the week 33 total.
Please help me get this into the output data frame shape above so I can use it for further analysis.
Thank you, folks!
Use:
# convert values to datetimes
df['Date'] = pd.to_datetime(df['Date'])
# convert dates to ISO week numbers
df['week'] = df['Date'].dt.isocalendar().week
# aggregate sums per ID and week, add any missing weeks, then take a
# rolling two-week sum (Amount is selected so the ID column is not summed)
f = lambda x: (x[['Amount']].reindex(range(x.index.min(), x.index.max() + 1))
                            .rolling(2, min_periods=1).sum())
df1 = df.groupby(['ID', 'week'])['Amount'].sum().reset_index(level=0).groupby('ID').apply(f)
print (df1)
         Amount
ID week
A  31     320.0
   32     830.0
   33     510.0
   34      40.0
# finally, add back to the original DataFrame per ID and week
df = df.join(df1.rename(columns={'Amount': 'TwoWeekSum'}), on=['ID', 'week']).drop('week', axis=1)
print (df)
ID Date Amount TwoWeekSum
0 A 2021-08-03 100 320.0
1 A 2021-08-04 100 320.0
2 A 2021-08-06 20 320.0
3 A 2021-08-07 100 320.0
4 A 2021-08-09 300 830.0
5 A 2021-08-11 100 830.0
6 A 2021-08-12 100 830.0
7 A 2021-08-13 10 830.0
8 A 2021-08-23 10 40.0
9 A 2021-08-24 10 40.0
10 A 2021-08-26 10 40.0
11 A 2021-08-28 10 40.0
per = pd.period_range(df['Date'].min(), df['Date'].max(), freq='W')
mapper = df.groupby(df['Date'].astype('Period[W]'))['Amount'].sum().reindex(per, fill_value=0).rolling(2, min_periods=1).sum()
out = df['Date'].astype('Period[W]').map(mapper)
out
0 320.0
1 320.0
2 320.0
3 320.0
4 830.0
5 830.0
6 830.0
7 830.0
8 40.0
9 40.0
10 40.0
11 40.0
Name: Date, dtype: float64
Assign out as the TwoWeekSum column:
df.assign(TwoWeekSum=out)
ID Date Amount TwoWeekSum
0 A 2021-08-03 100 320.0
1 A 2021-08-04 100 320.0
2 A 2021-08-06 20 320.0
3 A 2021-08-07 100 320.0
4 A 2021-08-09 300 830.0
5 A 2021-08-11 100 830.0
6 A 2021-08-12 100 830.0
7 A 2021-08-13 10 830.0
8 A 2021-08-23 10 40.0
9 A 2021-08-24 10 40.0
10 A 2021-08-26 10 40.0
11 A 2021-08-28 10 40.0
Update
If there are multiple IDs, aggregate per ID and week, then merge:
per = pd.period_range(df['Date'].min(), df['Date'].max(), freq='w')
s = df['Date'].astype('Period[W]')
idx = pd.MultiIndex.from_product([df['ID'].unique(), per])
df1 = (df.groupby(['ID', s])['Amount'].sum()
         .reindex(idx, fill_value=0)
         .groupby(level=0).rolling(2, min_periods=1).sum()  # rolling within each ID
         .droplevel(0)
         .reset_index()
         .set_axis(['ID', 'period', 'TwoWeekSum'], axis=1))
df.assign(period=s).merge(df1, how='left').drop('period', axis=1)
Try grouping the dataframe by ISO calendar week (Series.dt.week is deprecated; use dt.isocalendar().week), then use transform('sum') to repeat each week's total across its rows and add the previous week's total on top:
week = df['Date'].dt.isocalendar().week
weekly = df.groupby(week)['Amount'].sum()
df['TwoWeekSum'] = (df.groupby(week)['Amount'].transform('sum')
                    + (week - 1).map(weekly).fillna(0)).astype(int)
And then:
print(df)
Gives:
ID Date Amount TwoWeekSum
0 A 2021-08-03 100 320
1 A 2021-08-04 100 320
2 A 2021-08-06 20 320
3 A 2021-08-07 100 320
4 A 2021-08-09 300 830
5 A 2021-08-11 100 830
6 A 2021-08-12 100 830
7 A 2021-08-13 10 830
8 A 2021-08-23 10 40
9 A 2021-08-24 10 40
10 A 2021-08-26 10 40
11 A 2021-08-28 10 40
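For anyone who wants to reproduce the snippets above, a minimal construction of the question's input frame (values copied from the input table):

import pandas as pd

df = pd.DataFrame({
    'ID': ['A'] * 12,
    'Date': ['2021-08-03', '2021-08-04', '2021-08-06', '2021-08-07',
             '2021-08-09', '2021-08-11', '2021-08-12', '2021-08-13',
             '2021-08-23', '2021-08-24', '2021-08-26', '2021-08-28'],
    'Amount': [100, 100, 20, 100, 300, 100, 100, 10, 10, 10, 10, 10],
})
df['Date'] = pd.to_datetime(df['Date'])  # the answers assume a datetime column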

Joining 2 dataframe based on a column [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 11 months ago.
Following is the structure of one of my dataframes:
strike coi chgcoi
120 200 20
125 210 15
130 230 12
135 240 9
and the other one is:
strike poi chgpoi
125 210 15
130 230 12
135 240 9
140 225 12
What I want is:
strike coi chgcoi strike poi chgpoi
120 200 20 120 0 0
125 210 15 125 210 15
130 230 12 130 230 12
135 240 9 135 240 9
140 0 0 140 225 12
First, build your two dataframes (note the capitalization: pd.DataFrame, not pd.Dataframe):
df1 = pd.DataFrame({'column_1': [val_1, val_2, ..., val_n], 'column_2': [val_1, val_2, ..., val_n]})
df2 = pd.DataFrame({'column_1': [val_1, val_2, ..., val_n], 'column_2': [val_1, val_2, ..., val_n]})
Then you can use an outer join:
df1.merge(df2, on='common_column_name', how='outer')
db1
strike coi chgcoi
0 120 200 20
1 125 210 15
2 130 230 12
3 135 240 9
db2
strike poi chgpoi
0 125 210 15
1 130 230 12
2 135 240 9
3 140 225 12
merge = db1.merge(db2,how="outer",on='strike')
merge
strike coi chgcoi poi chgpoi
0 120 200.0 20.0 NaN NaN
1 125 210.0 15.0 210.0 15.0
2 130 230.0 12.0 230.0 12.0
3 135 240.0 9.0 240.0 9.0
4 140 NaN NaN 225.0 12.0
merge.fillna(0)
strike coi chgcoi poi chgpoi
0 120 200.0 20.0 0.0 0.0
1 125 210.0 15.0 210.0 15.0
2 130 230.0 12.0 230.0 12.0
3 135 240.0 9.0 240.0 9.0
4 140 0.0 0.0 225.0 12.0
This matches your expected result, with the only difference that 'strike' is not repeated.
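One follow-up on the NaN handling: the outer merge introduces NaN for the non-matching keys, which forces the numeric columns to float (hence 200.0, 20.0, ...). If you want integer output like the expected result, cast back after filling:

merge = db1.merge(db2, how='outer', on='strike').fillna(0)
merge = merge.astype(int)  # safe here: every filled value is a whole number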

Remove multiple rows with same index values in python dataframe

I have a dataset with start station IDs, end station IDs, and travel durations for bikes in a city.
The data dates back to 2017, so some of the stations no longer exist.
I have the list of those station IDs. How can I remove rows from the dataframe that either start or end at those stations?
For example, if I want to remove StartStation Id = 135, which appears at index 4 and 5, what should I do? This extends to a million rows, where 135 can be present anywhere.
Bike Id StartStation Id EndStation Id Duration
0 395 573 137.0 660.0
1 12931 399 507.0 420.0
2 7120 399 507.0 420.0
3 1198 599 616.0 300.0
4 10739 135 486.0 1260.0
5 10949 135 486.0 1260.0
6 8831 193 411.0 540.0
7 8778 266 770.0 600.0
8 700 137 294.0 540.0
9 5017 456 39.0 3000.0
10 4359 444 445.0 240.0
11 2801 288 288.0 5340.0
12 9525 265 592.0 300.0
Calling your list of station IDs to remove removed_ids, filter with isin on both columns (note the column names from your frame, 'StartStation Id' and 'EndStation Id'):
df = df.loc[~df['StartStation Id'].isin(removed_ids)
            & ~df['EndStation Id'].isin(removed_ids)]
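A quick sanity check against the sample frame, assuming removed_ids = [135] (a hypothetical one-element list): rows 4 and 5 disappear because their StartStation Id matches, and a row ending at station 135 would be dropped by the second condition the same way:

removed_ids = [135]  # hypothetical list of retired station IDs
out = df.loc[~df['StartStation Id'].isin(removed_ids)
             & ~df['EndStation Id'].isin(removed_ids)]
print(out.index.tolist())  # indices 4 and 5 are gone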

How can I match one individual's survey responses across time to form a panel dataset?

I am working with survey data in which respondents were interviewed twice: once initially and once six to eight months later. Each month, new interviewees are contacted, resulting in a rotating panel structure. How can I match an individual to his/her previous interview in Python using the following information:
CASEID YYYYMM ID IDPREV DATEPR INCOME
1 2 198706 2 382 198612 12500
2 3 198706 3 4 198612 2500
3 4 198706 4 67 198612 27500
4 5 198706 5 134 198612 12500
5 6 198706 6 193 198612 22500
So, the first line states that the individual's previous answers to the survey are contained on the line where the previous date is 198612 (Dec. 1986) and the ID is 382. How can I match these responses using the information that I have to create a panel dataset of the following form:
CASEID YYYYMM ID IDPREV DATEPR INCOME
1 463 198612 382 - - 12000
1856 198706 2 382 198612 12500
2 97 198612 4 - - 3500
1857 198706 3 4 198612 2500
3 164 198612 67 - - 25000
1858 198706 4 67 198612 27500
4 289 198612 134 - - 12500
1859 198706 5 134 198612 12500
5 323 198612 193 - - 22500
1860 198706 6 193 198612 22500
I have looked into the "merge" documentation for pandas and have tried a couple of different ways of matching the dates and IDs by indexing them, but cannot seem to get the panel data structure.
Starting with:
CASEID YYYYMM ID IDPREV DATEPR INCOME
0 463 198612 382 NaN NaN 12000
1 1856 198706 2 382.0 198612.0 12500
2 97 198612 4 NaN NaN 3500
3 1857 198706 3 4.0 198612.0 2500
4 164 198612 67 NaN NaN 25000
5 1858 198706 4 67.0 198612.0 27500
6 289 198612 134 NaN NaN 12500
7 1859 198706 5 134.0 198612.0 12500
8 323 198612 193 NaN NaN 22500
9 1860 198706 6 193.0 198612.0 22500
You could combine the two observations by merging:
combined = pd.merge(df, df, left_on=['YYYYMM', 'ID'], right_on=['DATEPR', 'IDPREV'], suffixes=['_1', '_2'])
CASEID_1 YYYYMM_1 ID_1 IDPREV_1 DATEPR_1 INCOME_1 CASEID_2 YYYYMM_2 \
0 463 198612 382 NaN NaN 12000 1856 198706
1 97 198612 4 NaN NaN 3500 1857 198706
2 164 198612 67 NaN NaN 25000 1858 198706
3 289 198612 134 NaN NaN 12500 1859 198706
4 323 198612 193 NaN NaN 22500 1860 198706
ID_2 IDPREV_2 DATEPR_2 INCOME_2
0 2 382.0 198612.0 12500
1 3 4.0 198612.0 2500
2 4 67.0 198612.0 27500
3 5 134.0 198612.0 12500
4 6 193.0 198612.0 22500
from where you could select the columns you need, or while merging:
combined = pd.merge(df.loc[:, ['CASEID', 'YYYYMM', 'ID', 'INCOME']], df,
                    left_on=['YYYYMM', 'ID'], right_on=['DATEPR', 'IDPREV'], suffixes=['_1', '_2'])
CASEID_1 YYYYMM_1 ID_1 INCOME_1 CASEID_2 YYYYMM_2 ID_2 IDPREV \
0 463 198612 382 12000 1856 198706 2 382.0
1 97 198612 4 3500 1857 198706 3 4.0
2 164 198612 67 25000 1858 198706 4 67.0
3 289 198612 134 12500 1859 198706 5 134.0
4 323 198612 193 22500 1860 198706 6 193.0
DATEPR INCOME_2
0 198612.0 12500
1 198612.0 2500
2 198612.0 27500
3 198612.0 12500
4 198612.0 22500
You could form a panel from here:
combined = combined.reset_index().set_index('index')
df1 = combined.loc[:, ['CASEID_1', 'YYYYMM_1', 'ID_1', 'INCOME_1']]
df1.rename(columns={col: col[:-2] for col in df1.columns}, inplace=True)
df2 = combined.loc[:, ['CASEID_2', 'YYYYMM_2', 'ID_2', 'INCOME_2']]
df2.rename(columns={col: col[:-2] for col in df2.columns}, inplace=True)
panel = pd.concat([df1, df2]).sort_index()
CASEID YYYYMM ID INCOME
index
0 463 198612 382 12000
0 1856 198706 2 12500
1 97 198612 4 3500
1 1857 198706 3 2500
2 164 198612 67 25000
2 1858 198706 4 27500
3 289 198612 134 12500
3 1859 198706 5 12500
4 323 198612 193 22500
4 1860 198706 6 22500
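As a side note, the same reshape can be written in one call with pd.wide_to_long; a sketch, assuming the first combined frame from above (every column suffixed with _1/_2):

panel = (pd.wide_to_long(combined.reset_index(),
                         stubnames=['CASEID', 'YYYYMM', 'ID', 'IDPREV', 'DATEPR', 'INCOME'],
                         i='index', j='interview', sep='_', suffix=r'\d')
           .sort_index())  # interleaves each respondent's two interviews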
