Trying to combine a list of dataframes into one dataframe. The data looks like:
Date station_id Hour Temp
0 2004-01-01 1 1 46.0
1 2004-01-01 1 2 46.0
2 2004-01-01 1 3 45.0
3 2004-01-01 1 4 41.0
...
433730 2008-06-30 11 3 64.0
433731 2008-06-30 11 4 64.0
433732 2008-06-30 11 5 64.0
433733 2008-06-30 11 6 64.0
This gives me a list of dataframes:
stations = [x for _,x in df.groupby('station_id')]
When I reset the indices for "stations" and concat, I get a dataframe, but it doesn't look the way I'd like:
for i in range(0, 11):
    stations[i].reset_index(drop=True, inplace=True)
pd.concat(stations, axis=1)
Date station_id Hour Temp Date station_id Hour Temp
0 2004-01-01 1 1 46.0 2004-01-01 2 1 38.0
1 2004-01-01 1 2 46.0 2004-01-01 2 2 36.0
2 2004-01-01 1 3 45.0 2004-01-01 2 3 35.0
3 2004-01-01 1 4 41.0 2004-01-01 2 4 30.0
I'd much rather get to a df like this:
Date Hour Stn1 Stn2
0 2004-01-01 1 46.0 38.0
1 2004-01-01 2 46.0 36.0
2 2004-01-01 3 45.0 35.0
3 2004-01-01 4 41.0 30.0
How do I do this?
Based on your expected output, you are looking for a pivot table with index=['Date', 'Hour'], columns='station_id', values='Temp'. Demo:
# A bunch of example data
df
Date station_id Hour Temp
0 2004-01-01 1 1 10.0
1 2004-01-01 1 2 20.0
2 2004-01-01 1 3 30.0
3 2004-01-01 1 4 40.0
4 2004-01-01 2 1 50.0
5 2004-01-01 2 2 60.0
6 2004-01-01 2 3 70.0
7 2004-01-01 2 4 80.0
8 2004-01-01 3 1 90.0
9 2004-01-01 3 2 100.0
10 2004-01-02 3 1 110.0
11 2004-01-02 3 2 120.0
12 2004-01-01 4 4 130.0
13 2004-01-02 4 5 140.0
# Create pivot table, with ['Date', 'Hour'] in a MultiIndex
res = df.pivot_table(columns='station_id', index=['Date', 'Hour'], values='Temp')
# Add 'Stn' prefix to each column name
res = res.add_prefix('Stn')
# Remove the name of the columns' index, which is 'station_id'
# (assigning None works across pandas versions, unlike `del`)
res.columns.name = None
# Reset MultiIndex into columns
res.reset_index(inplace=True)
res
Date Hour Stn1 Stn2 Stn3 Stn4
0 2004-01-01 1 10.0 50.0 90.0 NaN
1 2004-01-01 2 20.0 60.0 100.0 NaN
2 2004-01-01 3 30.0 70.0 NaN NaN
3 2004-01-01 4 40.0 80.0 NaN 130.0
4 2004-01-02 1 NaN NaN 110.0 NaN
5 2004-01-02 2 NaN NaN 120.0 NaN
6 2004-01-02 5 NaN NaN NaN 140.0
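One design note: pivot_table silently averages duplicate (Date, Hour, station_id) combinations, because its default aggfunc is 'mean'. If duplicate rows should raise an error instead, newer pandas versions accept the same list-style index in df.pivot (a sketch, assuming the data has no duplicates):
res = df.pivot(index=['Date', 'Hour'], columns='station_id', values='Temp')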
For what it's worth, this gets me where I want to go.
stations = [x for _, x in df.groupby('station_id')]
for i in range(0, 11):
    stations[i].reset_index(drop=True, inplace=True)
    stations[i].rename(columns={'Temp': 'Stn' + str(i + 1)}, inplace=True)
    stations[i].drop(columns='station_id', inplace=True)
    if i > 0:
        stations[i].drop(columns=['Date', 'Hour'], inplace=True)
stations = pd.concat(stations, axis=1)
Feels a bit brute force to me, though. Additional pythonic suggestions welcome.
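One more pythonic sketch, reusing the pivot_table idea from the answer above (column names assumed from the question), which collapses the loop into a few chained calls:
out = (df.pivot_table(index=['Date', 'Hour'], columns='station_id', values='Temp')
         .add_prefix('Stn')
         .rename_axis(columns=None)
         .reset_index())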
I have a dataframe which has two columns: date and value.
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['date'] = ['2020-03-01 00:00:00','2020-03-01 00:00:15', '2020-03-01 00:00:30', '2020-03-02 00:00:00','2020-03-02 00:00:15', '2020-03-02 00:00:30' , '2020-03-03 00:00:15', '2020-03-03 00:00:30', '2020-03-05 00:00:00', '2020-03-05 00:00:30']
df['value'] = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4]
df
date value
0 2020-03-01 00:00:00 1
1 2020-03-01 00:00:15 2
2 2020-03-01 00:00:30 3
3 2020-03-02 00:00:00 4
4 2020-03-02 00:00:15 5
5 2020-03-02 00:00:30 6
6 2020-03-03 00:00:15 1
7 2020-03-03 00:00:30 2
8 2020-03-05 00:00:00 3
9 2020-03-05 00:00:30 4
In the date column, some days are missing (I want all the days, like 1-2-3-4-..., but in this example I don't have day 2020-03-04, so I put NaN for it). So first I want to build this df, which shows me the days for which I don't have data:
day 00:00:00 00:00:15 00:00:30
0 2020-03-01 1.0 2.0 3.0
1 2020-03-02 4.0 5.0 6.0
2 2020-03-03 NaN 1.0 2.0
3 2020-03-04 NaN NaN NaN
4 2020-03-05 3.0 NaN 4.0
Then replace the NaN values with the mean of each column, like:
day 00:00:00 00:00:15 00:00:30
0 2020-03-01 1.000000 2.000000 3.000000
1 2020-03-02 4.000000 5.000000 6.000000
2 2020-03-03 2.666667 1.000000 2.000000
3 2020-03-04 2.666667 2.666667 2.666667
4 2020-03-05 3.000000 5.000000 4.000000
And then build one df with a single row, like this (the column names are not important):
1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 1 2 4 5 6 2.67 1 2 2.67 2.67 2.67 3 5 4
I have been working with pivot and groupby, but I could not solve it, especially the missing dates. Could you please help me with that?
You can use resample():
df['date']=pd.to_datetime(df['date'])
dfx=df.set_index('date').resample('15S').first()
Resampling gives us every 15-second slot between the first and last timestamp, but we only need the values between 00:00:00 and 00:00:30.
dfx = dfx.between_time("00:00:00", "00:00:30").reset_index()
print(dfx)
'''
date value
0 2020-03-01 00:00:00 1.0
1 2020-03-01 00:00:15 2.0
2 2020-03-01 00:00:30 3.0
3 2020-03-02 00:00:00 4.0
4 2020-03-02 00:00:15 5.0
5 2020-03-02 00:00:30 6.0
6 2020-03-03 00:00:00
7 2020-03-03 00:00:15 1.0
8 2020-03-03 00:00:30 2.0
9 2020-03-04 00:00:00
10 2020-03-04 00:00:15
11 2020-03-04 00:00:30
12 2020-03-05 00:00:00 3.0
13 2020-03-05 00:00:15
14 2020-03-05 00:00:30 4.0
'''
Then I convert the times into columns using crosstab:
dfx=pd.crosstab(dfx['date'].dt.date, dfx['date'].dt.time,values=dfx['value'],aggfunc='sum',dropna=False)
print(dfx)
'''
date 00:00:00 00:00:15 00:00:30
date
2020-03-01 1.0 2.0 3.0
2020-03-02 4.0 5.0 6.0
2020-03-03 0.0 1.0 2.0
2020-03-04 0.0 0.0 0.0
2020-03-05 3.0 0.0 4.0
'''
Values of 0 correspond to times that are not in the data set. I replace them with NaN and fill them with the column averages:
dfx = dfx.replace(0, np.nan)
for i in dfx.columns:
    dfx[i] = dfx[i].fillna(dfx[i].mean())
print(dfx)
'''
date 00:00:00 00:00:15 00:00:30
date
2020-03-01 1.000000 2.000000 3.00
2020-03-02 4.000000 5.000000 6.00
2020-03-03 2.666667 1.000000 2.00
2020-03-04 2.666667 2.666667 3.75
2020-03-05 3.000000 2.666667 4.00
'''
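As a side note, the per-column loop can be written as a single vectorized call, since fillna accepts the Series of column means:
dfx = dfx.fillna(dfx.mean())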
I did not fully understand what you want at the last stage; if you describe it in detail, I will edit my answer.
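If the last stage is simply flattening the table above into a single-row frame (the question notes the column names are not important), a minimal sketch could be:
one_row = pd.DataFrame([dfx.to_numpy().ravel()])
print(one_row)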
I have a dataframe like this:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2023 1 2
.......
I would like to split the rows where end - start > 1 year (see the last row, where end = 2023 and start = 2020), keeping the same value for column A while splitting the value in column B proportionally:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2020 1 2/4
01.01.2021 31.12.2021 1 2/4
01.01.2022 31.12.2022 1 2/4
01.01.2023 31.12.2023 1 2/4
.......
Any idea?
Here is my solution. See the comments below:
import io
import pandas as pd

# TEST DATA:
text = """ start end A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
31.12.2020 20.01.2021 12 12
31.12.2020 01.01.2021 22 22
30.12.2020 01.01.2021 32 32
10.05.2020 28.09.2023 44 44
27.11.2020 31.12.2023 88 88
31.12.2020 31.12.2023 100 100
01.01.2020 31.12.2021 200 200
"""
df = pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python", parse_dates=[0, 1])
#print("\n----\n df:", df)
#----------------------------------------
# SOLUTION:
def split_years(r):
    """
    Split row 'r' whose "end" year is greater than its "start" year.
    The new rows repeat the value of 'A', with 'B' divided by the number of years.
    Return: a DataFrame with one row per year.
    """
    t1, t2 = r["start"], r["end"]
    ys = t2.year - t1.year
    kk = 0 if t1.is_year_end else 1
    if ys > 0:
        l1 = [t1] + [t1 + pd.offsets.YearBegin(i) for i in range(1, ys + 1)]
        l2 = [t1 + pd.offsets.YearEnd(i) for i in range(kk, ys + kk)] + [t2]
        return pd.DataFrame({"start": l1, "end": l2, "A": r.A, "B": r.B / len(l1)})
    print("year difference <= 0!")
    return None

# Create two groups, one for rows where 'start' and 'end' are in the same year, and one for the others:
grps = df.groupby(lambda idx: (df.loc[idx, "start"].year - df.loc[idx, "end"].year) != 0).groups
print("\n---- grps:\n", grps)
# Extract the "one year" rows into a data frame:
df1 = df.loc[grps[False]]
#print("\n---- df1:\n", df1)
# Extract the rows to be split:
df2 = df.loc[grps[True]]
print("\n---- df2:\n", df2)
# Split the rows and put the resulting data frames into a list:
ldfs = [split_years(df2.loc[row]) for row in df2.index]
print("\n---- ldfs:")
for fr in ldfs:
    print(fr, "\n")
# Insert the "one year" data frame into the list, and concatenate them:
ldfs.insert(0, df1)
df_rslt = pd.concat(ldfs, sort=False)
#print("\n---- df_rslt:\n", df_rslt)
# Housekeeping:
df_rslt = df_rslt.sort_values("start").reset_index(drop=True)
print("\n---- df_rslt:\n", df_rslt)
Outputs:
---- grps:
{False: Int64Index([0, 1, 2, 3, 4], dtype='int64'), True: Int64Index([5, 6, 7, 8, 9, 10, 11], dtype='int64')}
---- df2:
start end A B
5 2020-12-31 2021-01-20 12 12
6 2020-12-31 2021-01-01 22 22
7 2020-12-30 2021-01-01 32 32
8 2020-10-05 2023-09-28 44 44
9 2020-11-27 2023-12-31 88 88
10 2020-12-31 2023-12-31 100 100
11 2020-01-01 2021-12-31 200 200
---- ldfs:
start end A B
0 2020-12-31 2020-12-31 12 6.0
1 2021-01-01 2021-01-20 12 6.0
start end A B
0 2020-12-31 2020-12-31 22 11.0
1 2021-01-01 2021-01-01 22 11.0
start end A B
0 2020-12-30 2020-12-31 32 16.0
1 2021-01-01 2021-01-01 32 16.0
start end A B
0 2020-10-05 2020-12-31 44 11.0
1 2021-01-01 2021-12-31 44 11.0
2 2022-01-01 2022-12-31 44 11.0
3 2023-01-01 2023-09-28 44 11.0
start end A B
0 2020-11-27 2020-12-31 88 22.0
1 2021-01-01 2021-12-31 88 22.0
2 2022-01-01 2022-12-31 88 22.0
3 2023-01-01 2023-12-31 88 22.0
start end A B
0 2020-12-31 2020-12-31 100 25.0
1 2021-01-01 2021-12-31 100 25.0
2 2022-01-01 2022-12-31 100 25.0
3 2023-01-01 2023-12-31 100 25.0
start end A B
0 2020-01-01 2020-12-31 200 100.0
1 2021-01-01 2021-12-31 200 100.0
---- df_rslt:
start end A B
0 2020-01-01 2020-06-30 2 3.0
1 2020-01-01 2020-12-31 3 1.0
2 2020-01-01 2020-12-31 200 100.0
3 2020-01-04 2020-04-30 6 2.0
4 2020-01-07 2020-12-31 8 2.0
5 2020-10-05 2020-12-31 44 11.0
6 2020-11-27 2020-12-31 88 22.0
7 2020-12-30 2020-12-31 32 16.0
8 2020-12-31 2020-12-31 12 6.0
9 2020-12-31 2020-12-31 100 25.0
10 2020-12-31 2020-12-31 22 11.0
11 2021-01-01 2021-12-31 100 25.0
12 2021-01-01 2021-12-31 88 22.0
13 2021-01-01 2021-12-31 44 11.0
14 2021-01-01 2021-01-01 32 16.0
15 2021-01-01 2021-01-01 22 11.0
16 2021-01-01 2021-01-20 12 6.0
17 2021-01-01 2021-12-31 2 3.0
18 2021-01-01 2021-12-31 200 100.0
19 2022-01-01 2022-12-31 88 22.0
20 2022-01-01 2022-12-31 100 25.0
21 2022-01-01 2022-12-31 44 11.0
22 2023-01-01 2023-09-28 44 11.0
23 2023-01-01 2023-12-31 88 22.0
24 2023-01-01 2023-12-31 100 25.0
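For reference, the year offsets that drive split_years behave like this (a quick sketch):
t = pd.Timestamp("2020-10-05")
print(t + pd.offsets.YearBegin(1))  # 2021-01-01, start of the next year
print(t + pd.offsets.YearEnd(1))    # 2020-12-31, end of the current year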
A bit of a different approach, adding new columns instead of new rows, but I think this accomplishes what you want to do.
df["years_apart"] = (
(df["end_date"] - df["start_date"]).dt.days / 365
).astype(int)
for years in range(1, df["years_apart"].max().astype(int)):
df[f"{years}_end_date"] = pd.NaT
df.loc[
df["years_apart"] == years, f"{years}_end_date"
] = df.loc[
df["years_apart"] == years, "start_date"
] + dt.timedelta(days=365*years)
df["B_bis"] = df["B"] / df["years_apart"]
Output
start_date end_date years_apart 1_end_date 2_end_date ...
2018-01-01 2018-01-02 0 NaT NaT
2018-01-02 2019-01-02 1 2019-01-02 NaT
2018-01-03 2020-01-03 2 NaT 2020-01-03
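One caveat: B / years_apart divides by zero for rows that stay within a single year, which yields inf. A hypothetical guard is to leave those rows untouched:
df["B_bis"] = df["B"] / df["years_apart"].replace(0, 1)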
I solved it by creating a year difference and a counter that adds years to the replicated rows:
from datetime import timedelta
import numpy as np

# calculate the difference between start and end year
table['diff'] = (table['end'] - table['start']) // timedelta(days=365)
table['diff'] = table['diff'] + 1
# replicate rows depending on the number of years
table = table.reindex(table.index.repeat(table['diff']))
# counter that increases for diff>1; assign increasing years to the replicated rows
table['count'] = table['diff'].groupby(table['diff']).cumsum() // table['diff']
table['start'] = np.where(table['diff'] > 1, table['start'] + table['count'] - 1, table['start'])
table['end'] = table['start']
# split B among the years
table['B'] = table['B'] // table['diff']
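The key idiom here is duplicating rows with reindex on a repeated index; a toy sketch of just that step (illustrative values, not the question's data):
t = pd.DataFrame({'diff': [1, 3], 'B': [10, 9]})
print(t.reindex(t.index.repeat(t['diff'])))  # row 0 appears once, row 1 three times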
I have a dataframe as shown below:
df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 12:50:00', '2173-04-05 12:59:00',
               '2173-05-04 13:14:00', '2173-05-05 13:37:00', '2173-07-06 13:39:00',
               '2173-07-08 11:30:00', '2173-04-08 16:00:00', '2173-04-09 22:00:00',
               '2173-04-11 04:00:00', '2173-04-13 04:30:00', '2173-04-14 08:00:00'],
    'val': [5, 5, 5, 5, 1, 6, 5, 5, 8, 3, 4, 6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
As you can see from the dataframe above, there are a few missing dates in between. I would like to create new records for those dates and fill in the values from the immediately preceding row:
def dt(df):
    r = pd.date_range(start=df.date.min(), end=df.date.max())
    return df.set_index('date').reindex(r)

new_df = df.groupby(['subject_id', 'month']).apply(dt)
This generates all the dates. I only want to find the missing dates within the input date interval for each subject, for each month.
I did try the code from this related post. Though it helped, it doesn't give me the expected output for this updated/new requirement. Since it does a left join, it copies all records. I can't do an inner join either, because that would drop the non-matching records. I want a mix of left join and inner join.
Currently it creates new records for all 365 days in a year, which I don't want. This is not the expected output.
I only wish to add missing dates within the input date interval, as shown below. For example, subject = 1 in the 4th month has records for the 3rd and the 5th, but the 4th is missing, so we add a record for the 4th day alone. We don't need the 6th, 7th, etc., unlike the current output. Similarly, in the 7th month the record for the 7th day is missing, so we just add a new record for that.
I expect my output to be as shown below.
The problem is that you need resample to append the new days, so it is necessary.
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df1 = (df.set_index('date')
.groupby('subject_id')
.resample('d')
.last()
.index
.to_frame(index=False))
print (df1)
subject_id date
0 1 2173-04-03
1 1 2173-04-04
2 1 2173-04-05
3 1 2173-04-06
4 1 2173-04-07
.. ... ...
99 2 2173-04-10
100 2 2173-04-11
101 2 2173-04-12
102 2 2173-04-13
103 2 2173-04-14
[104 rows x 2 columns]
The idea is to remove unnecessary missing rows - you can create a threshold for the minimum number of consecutive missing values (here 5) and remove those rows (a new column is created for easy testing):
df2 = df1.merge(df, how='left')

thresh = 5
# rows that already have data
mask = df2['day'].notna()
# label each run of consecutive missing rows and look up its length
s = mask.cumsum().mask(mask)
df2['count'] = s.map(s.value_counts())
# keep existing rows and runs of missing rows shorter than the threshold
df2 = df2[(df2['count'] < thresh) | (df2['count'].isna())]
print (df2)
subject_id date time_1 val day count
0 1 2173-04-03 2173-04-03 12:35:00 5.0 3.0 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5.0 3.0 NaN
2 1 2173-04-04 NaT NaN NaN 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5.0 5.0 NaN
32 1 2173-05-04 2173-05-04 13:14:00 5.0 4.0 NaN
33 1 2173-05-05 2173-05-05 13:37:00 1.0 5.0 NaN
95 1 2173-07-06 2173-07-06 13:39:00 6.0 6.0 NaN
96 1 2173-07-07 NaT NaN NaN 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5.0 8.0 NaN
98 2 2173-04-08 2173-04-08 16:00:00 5.0 8.0 NaN
99 2 2173-04-09 2173-04-09 22:00:00 8.0 9.0 NaN
100 2 2173-04-10 NaT NaN NaN 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3.0 11.0 NaN
102 2 2173-04-12 NaT NaN NaN 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4.0 13.0 NaN
104 2 2173-04-14 2173-04-14 08:00:00 6.0 14.0 NaN
Last, use the previous solution:
df2 = df2.groupby(df2['subject_id']).ffill()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id date time_1 val day count
0 1 2173-04-03 2173-04-03 12:35:00 5 3 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5 3 NaN
2 1 2173-04-04 2173-04-04 12:50:00 5 4 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5 5 1.0
32 1 2173-05-04 2173-05-04 13:14:00 5 4 NaN
33 1 2173-05-05 2173-05-05 13:37:00 1 5 NaN
95 1 2173-07-06 2173-07-06 13:39:00 6 6 NaN
96 1 2173-07-07 2173-07-07 13:39:00 6 7 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5 8 1.0
98 2 2173-04-08 2173-04-08 16:00:00 5 8 1.0
99 2 2173-04-09 2173-04-09 22:00:00 8 9 1.0
100 2 2173-04-10 2173-04-10 22:00:00 8 10 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3 11 1.0
102 2 2173-04-12 2173-04-12 04:00:00 3 12 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4 13 1.0
104 2 2173-04-14 2173-04-14 08:00:00 6 14 1.0
EDIT: Solution with reindex for each month:
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df['month'] = df['time_1'].dt.month
df1 = (df.drop_duplicates(['date','subject_id'])
.set_index('date')
.groupby(['subject_id', 'month'])
.apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max())))
.rename_axis(('subject_id','month','date'))
.index
.to_frame(index=False)
)
print (df1)
subject_id month date
0 1 4 2173-04-03
1 1 4 2173-04-04
2 1 4 2173-04-05
3 1 5 2173-05-04
4 1 5 2173-05-05
5 1 7 2173-07-06
6 1 7 2173-07-07
7 1 7 2173-07-08
8 2 4 2173-04-08
9 2 4 2173-04-09
10 2 4 2173-04-10
11 2 4 2173-04-11
12 2 4 2173-04-12
13 2 4 2173-04-13
14 2 4 2173-04-14
df2 = df1.merge(df, how='left')
df2 = df2.groupby(df2['subject_id']).ffill()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id month date time_1 val day
0 1 4 2173-04-03 2173-04-03 12:35:00 5 3
1 1 4 2173-04-03 2173-04-03 12:50:00 5 3
2 1 4 2173-04-04 2173-04-04 12:50:00 5 4
3 1 4 2173-04-05 2173-04-05 12:59:00 5 5
4 1 5 2173-05-04 2173-05-04 13:14:00 5 4
5 1 5 2173-05-05 2173-05-05 13:37:00 1 5
6 1 7 2173-07-06 2173-07-06 13:39:00 6 6
7 1 7 2173-07-07 2173-07-07 13:39:00 6 7
8 1 7 2173-07-08 2173-07-08 11:30:00 5 8
9 2 4 2173-04-08 2173-04-08 16:00:00 5 8
10 2 4 2173-04-09 2173-04-09 22:00:00 8 9
11 2 4 2173-04-10 2173-04-10 22:00:00 8 10
12 2 4 2173-04-11 2173-04-11 04:00:00 3 11
13 2 4 2173-04-12 2173-04-12 04:00:00 3 12
14 2 4 2173-04-13 2173-04-13 04:30:00 4 13
15 2 4 2173-04-14 2173-04-14 08:00:00 6 14
Does this help?
import pandas as pd
from datetime import timedelta

def fill_dates(df):
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
    result = pd.DataFrame()
    for i, row in df.iterrows():
        if i == 0:
            result = result.append(row)
        else:
            start_date = result.iloc[-1]['time_1']
            end_date = row['time_1']
            # print(start_date, end_date)
            delta = (end_date - start_date).days
            # print(delta)
            if delta > 0 and start_date.month == end_date.month:
                for j in range(delta):
                    day = start_date + timedelta(days=j + 1)
                    new_row = result.iloc[-1].copy()
                    new_row['time_1'] = day
                    new_row['remarks'] = 'added'
                    if new_row['time_1'].date() != row['time_1'].date():
                        result = result.append(new_row)
                result = result.append(row)
            else:
                result = result.append(row)
    result.reset_index(inplace=True)
    return result
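A hedged usage sketch for the function above (it relies on DataFrame.append, so it assumes a pandas version older than 2.0; each group's index is reset so the i == 0 check fires once per subject):
filled = (df.sort_values('time_1')
            .groupby('subject_id', group_keys=False)
            .apply(lambda g: fill_dates(g.reset_index(drop=True)))
            .reset_index(drop=True))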
I would like to group by the variable "cod_id" of my df and then apply this function:
[df.loc[df['dt_op'].between(d, d + pd.Timedelta(days = 7)), 'quantity'].sum() \
for d in df['dt_op']]
Moving from this df:
print(df)
dt_op quantity cod_id
20/01/18 1 613
21/01/18 8 611
21/01/18 1 613
...
To this one:
print(final_df)
n = 7
dt_op quantity product_code Final_Quantity
20/01/18 1 613 2
21/01/18 8 611 8
25/01/18 1 613 1
...
I tried with:
def lookforward(x):
    L = [x.loc[x['dt_op'].between(row.dt_op, row.dt_op + pd.Timedelta(days=7)),
               'quantity'].sum() for row in x.itertuples(index=False)]
    return pd.Series(L, index=x.index)

s = df.groupby('cod_id').apply(lookforward)
s.index = s.index.droplevel(0)
df['Final_Quantity'] = s
print(df)
dt_op quantity cod_id Final_Quantity
0 2018-01-20 1 613 2
1 2018-01-21 8 611 8
2 2018-01-21 1 613 1
But it is not an efficient solution, since it is computationally slow.
How can I improve its performance?
I would accept even new code or a new function that leads to the same result.
EDIT:
Here is a subset of the original dataset, with just one product (cod_id == 2), on which I ran the code provided by "w-m":
print(df)
cod_id dt_op quantita final_sum
0 2 2017-01-03 1 54.0
1 2 2017-01-04 1 53.0
2 2 2017-01-13 1 52.0
3 2 2017-01-23 2 51.0
4 2 2017-01-26 1 49.0
5 2 2017-02-03 1 48.0
6 2 2017-02-27 1 47.0
7 2 2017-03-05 1 46.0
8 2 2017-03-15 1 45.0
9 2 2017-03-23 1 44.0
10 2 2017-03-27 2 43.0
11 2 2017-03-31 3 41.0
12 2 2017-04-04 1 38.0
13 2 2017-04-05 1 37.0
14 2 2017-04-15 2 36.0
15 2 2017-04-27 2 34.0
16 2 2017-04-30 1 32.0
17 2 2017-05-16 1 31.0
18 2 2017-05-18 1 30.0
19 2 2017-05-19 1 29.0
20 2 2017-06-03 1 28.0
21 2 2017-06-04 1 27.0
22 2 2017-06-07 1 26.0
23 2 2017-06-13 2 25.0
24 2 2017-06-14 1 23.0
25 2 2017-06-20 1 22.0
26 2 2017-06-22 2 21.0
27 2 2017-06-28 1 19.0
28 2 2017-06-30 1 18.0
29 2 2017-07-03 1 17.0
30 2 2017-07-06 2 16.0
31 2 2017-07-07 1 14.0
32 2 2017-07-13 1 13.0
33 2 2017-07-20 1 12.0
34 2 2017-07-28 1 11.0
35 2 2017-08-06 1 10.0
36 2 2017-08-07 1 9.0
37 2 2017-08-24 1 8.0
38 2 2017-09-06 1 7.0
39 2 2017-09-16 2 6.0
40 2 2017-09-20 1 4.0
41 2 2017-10-07 1 3.0
42 2 2017-11-04 1 2.0
43 2 2017-12-07 1 1.0
Edit 181017: this approach doesn't work, as forward-rolling functions on sparse time series are not currently supported by pandas; see the comments.
Using for loops can be a performance killer when doing pandas operations.
The for loop around the rows plus their timedelta of 7 days can be replaced with a .rolling("7D"). To get a forward-rolling time delta (current date + 7 days), we reverse the df by date, as shown here.
Then no custom function is required anymore, and you can just take .quantity.sum() from the groupby.
quant_sum = df.sort_values("dt_op", ascending=False).groupby("cod_id") \
.rolling("7D", on="dt_op").quantity.sum()
cod_id dt_op
611 2018-01-21 8.0
613 2018-01-21 1.0
2018-01-20 2.0
Name: quantity, dtype: float64
result = df.set_index(["cod_id", "dt_op"])
result["final_sum"] = quant_sum
result.reset_index()
cod_id dt_op quantity final_sum
0 613 2018-01-20 1 2.0
1 611 2018-01-21 8 8.0
2 613 2018-01-21 1 1.0
Implementing the exact behavior from the question is difficult due to two shortcomings in pandas: neither groupby/rolling/transform nor forward-looking rolling on sparse dates is implemented (see the other answer for more details).
This answer attempts to work around both by resampling the data, filling in all days, and then joining the quant_sums back with the original data.
# Create a temporary df with all in between days filled in with zeros
filled = df.set_index("dt_op").groupby("cod_id") \
.resample("D").asfreq().fillna(0) \
.quantity.to_frame()
# Reverse and sum
filled["quant_sum"] = filled.reset_index().set_index("dt_op") \
.iloc[::-1] \
.groupby("cod_id") \
.rolling(7, min_periods=1) \
.quantity.sum().astype(int)
# Join with original `df`, dropping the filled days
result = df.set_index(["cod_id", "dt_op"]).join(filled.quant_sum).reset_index()
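For a quick check, the toy frame from the question can be rebuilt like this (dates assumed day-first) and fed through the snippet above:
df = pd.DataFrame({
    'cod_id': [613, 611, 613],
    'dt_op': pd.to_datetime(['20/01/18', '21/01/18', '21/01/18'], dayfirst=True),
    'quantity': [1, 8, 1],
})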
Code:
import pandas as pd
df = pd.read_csv('xyz.csv', usecols=['transaction_date', 'amount'])
df=pd.concat(g for _, g in df.groupby("amount") if len(g) > 3)
df=df.reset_index(drop=True)
print(df)
Output:
transaction_date amount
0 2016-06-02 50.0
1 2016-06-02 50.0
2 2016-06-02 50.0
3 2016-06-02 50.0
4 2016-06-02 50.0
5 2016-06-02 50.0
6 2016-07-04 50.0
7 2016-07-04 50.0
8 2016-09-29 225.0
9 2016-10-29 225.0
10 2016-11-29 225.0
11 2016-12-30 225.0
12 2017-01-30 225.0
13 2016-05-16 1000.0
14 2016-05-20 1000.0
I need to add another column next to the amount column which gives the difference between corresponding rows of transaction_date, e.g.:
transaction_date amount delta(days)
0 2016-06-02 50.0 -
1 2016-06-02 50.0 0
2 2016-06-02 50.0 0
3 2016-06-02 50.0 0
4 2016-06-02 50.0 0
5 2016-06-02 50.0 0
6 2016-07-04 50.0 32
7 2016-07-04 50.0 .
8 2016-09-29 225.0 .
9 2016-10-29 225.0 .
10 2016-11-29 225.0
There are probably better methods, but you can use pandas.Series.shift:
>>> df.transaction_date.shift(-1) - df.transaction_date
0 0 days
1 0 days
2 0 days
3 0 days
4 0 days
5 32 days
6 0 days
7 87 days
8 30 days
9 31 days
10 31 days
11 31 days
12 -259 days
13 4 days
14 NaT
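To match the integer day counts in the requested output, the same expression can be passed through dt.days (assuming transaction_date is already datetime, as above):
(df.transaction_date.shift(-1) - df.transaction_date).dt.days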
I think you need diff + dt.days:
df['delta(days)'] = df['transaction_date'].diff().dt.days
print (df)
transaction_date amount delta(days)
0 2016-06-02 50.0 NaN
1 2016-06-02 50.0 0.0
2 2016-06-02 50.0 0.0
3 2016-06-02 50.0 0.0
4 2016-06-02 50.0 0.0
5 2016-06-02 50.0 0.0
6 2016-07-04 50.0 32.0
7 2016-07-04 50.0 0.0
8 2016-09-29 225.0 87.0
9 2016-10-29 225.0 30.0
10 2016-11-29 225.0 31.0
11 2016-12-30 225.0 31.0
12 2017-01-30 225.0 31.0
13 2016-05-16 1000.0 -259.0
14 2016-05-20 1000.0 4.0
But if you need to count it per group, add groupby:
df['delta(days)'] = df.groupby('amount')['transaction_date'].diff().dt.days
print (df)
transaction_date amount delta(days)
0 2016-06-02 50.0 NaN
1 2016-06-02 50.0 0.0
2 2016-06-02 50.0 0.0
3 2016-06-02 50.0 0.0
4 2016-06-02 50.0 0.0
5 2016-06-02 50.0 0.0
6 2016-07-04 50.0 32.0
7 2016-07-04 50.0 0.0
8 2016-09-29 225.0 NaN
9 2016-10-29 225.0 30.0
10 2016-11-29 225.0 31.0
11 2016-12-30 225.0 31.0
12 2017-01-30 225.0 31.0
13 2016-05-16 1000.0 NaN
14 2016-05-20 1000.0 4.0
To get the exact output you've requested (sorting optional), use shift to get the timedelta and dt.days to get the integer:
df.transaction_date = pd.to_datetime(df.transaction_date)
df.sort_values('transaction_date', inplace=True)
df['delta(days)'] = (df['transaction_date'] - df['transaction_date'].shift(1)).dt.days
Output:
transaction_date amount delta(days)
13 2016-05-16 1000.0 NaN
14 2016-05-20 1000.0 4.0
0 2016-06-02 50.0 13.0
1 2016-06-02 50.0 0.0
2 2016-06-02 50.0 0.0
3 2016-06-02 50.0 0.0
4 2016-06-02 50.0 0.0
5 2016-06-02 50.0 0.0
6 2016-07-04 50.0 32.0
7 2016-07-04 50.0 0.0
8 2016-09-29 225.0 87.0
9 2016-10-29 225.0 30.0
10 2016-11-29 225.0 31.0
11 2016-12-30 225.0 31.0
12 2017-01-30 225.0 31.0