Problem in the selection of a part of data in pandas

I have the following DataFrames, which are loaded with pandas from CSV files:
df1=pd.read_csv(path,parse_dates=True)
Printing df1 gives:
control Avg_return
2019-09-07 True 0
2019-06-06 True 0
2019-02-19 True 0
2019-01-17 True 0
2018-12-20 True 0
2018-11-27 True 0
2018-10-12 True 0
... ... ...
Then I load the second CSV file:
df2=pd.read_csv(path,parse_dates=True)
Printing df2 gives:
return
2010-01-01 NaN
2010-04-01 0.010920
2010-05-01 -0.004404
2010-06-01 -0.025209
2010-07-01 -0.023280
... ...
The aim of my code is:
1. Take a date from df1.
2. Subtract 6 days from the date taken in point 1.
3. Subtract 244 days from the date taken in point 1.
4. Take all the returns between these two dates in df2.
5. Compute the mean of these returns and store it in Avg_return.
I did this :
for i in range(0,df1_row):
    # I go through my data df1
    if (control.iloc[i]==True):
        # I check if control_1 is true
        date_1=df1.index[i]-pd.to_timedelta(6, unit='d')
        # I remove 6 days from my date
        date_2=df1.index[i]-pd.to_timedelta(244, unit='d')
        # I remove 244 days from my date
        df1.loc[i,"Average_return"] = df2[[date_1:date_2],["return"]].mean()
        # I want to make the mean of the return between my date-6 days and my date-244 days
Unfortunately it gives me this error :
df1.loc[i,"Average_return"] = df2[[date1:date2],["return"]].mean()
^
SyntaxError: invalid syntax
Is someone able to help me? :)

The following looks a bit ugly, but I think it works :)
Dummy df's:
import datetime as dt
import numpy as np
import pandas as pd
cols = ['date', 'control', 'Avg_return']
data = [
[pd.to_datetime('2019-09-07'), True, 0],
[pd.to_datetime('2019-06-06'), True, 0]
]
df1 = pd.DataFrame(data, columns=cols)
cols2 = ['date', 'return']
data2 = [
[pd.to_datetime('2010-01-01'), np.nan],
[pd.to_datetime('2010-04-01'), 0.010920],
[pd.to_datetime('2019-09-01'), 1]
]
df2 = pd.DataFrame(data2, columns=cols2)
Drafted solution:
df1['date_minus_6'] = df1['date'] - dt.timedelta(days=6)
df1['date_minus_244'] = df1['date'] - dt.timedelta(days=244)
for i in range(0, df1.shape[0]):
    for j in range(0, df2.shape[0]):
        if df2['date'].iloc[j] == df1['date_minus_6'].iloc[i]:
            df1['Avg_return'].iloc[i] = (
                df1['Avg_return'].iloc[i] + df2['return'].iloc[j]
            ).mean()
        elif df2['date'].iloc[j] == df1['date_minus_244'].iloc[i]:
            df1['Avg_return'].iloc[i] = (
                df1['Avg_return'].iloc[i] + df2['return'].iloc[j]
            ).mean()
Output:
date control Avg_return date_minus_6 date_minus_244
0 2019-09-07 True 1.0 2019-09-01 2019-01-06
1 2019-06-06 True 0.0 2019-05-31 2018-10-05

import pandas as pd
df1 = pd.read_csv('dsf1.csv')
df2 = pd.read_csv('dsf2.csv')
df1.columns = ['date', 'control', 'return']
df2.columns = ['date', 'return']
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
for i in range(0, df1.shape[0]):
    if df1['control'][i]:
        # both bounds are derived from the current df1 date
        date_1 = df1['date'][i] - pd.to_timedelta(6, unit='d')
        date_2 = df1['date'][i] - pd.to_timedelta(244, unit='d')
        # I'm not sure if average_return has the correct condition, but adjust as you see fit:
        # here it is the mean of the df2 returns between date - 244 days and date - 6 days
        in_window = (df2['date'] >= date_2) & (df2['date'] <= date_1)
        df1.loc[i, 'average_return'] = df2.loc[in_window, 'return'].mean()
print(df1)

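This is a different approach that avoids an explicit for loop over the rows:
# make sure both indexes are datetime indexes and that df2 is sorted by date
df1.index = pd.to_datetime(df1.index)
df2.index = pd.to_datetime(df2.index)
df2 = df2.sort_index()
df1['date_1'] = df1.index - pd.to_timedelta(6, unit='d')
df1['date_2'] = df1.index - pd.to_timedelta(244, unit='d')
# date_2 (244 days back) is the earlier bound, so it comes first in the slice
df1['Average_return'] = df1.apply(lambda r: df2.loc[r['date_2']: r['date_1'], 'return'].mean(), axis=1)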
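For reference, the SyntaxError in the question comes from writing a slice inside a list literal: [date_1:date_2] is not valid Python, whereas .loc[start:end, column] accepts a slice directly. Below is a minimal sketch of the windowed mean under some assumptions: the two small frames are made-up stand-ins for the CSV files, both are indexed by date, and the target column is called Avg_return as in the printed df1.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two CSV files, both indexed by date.
df1 = pd.DataFrame(
    {"control": [True, False], "Avg_return": [0.0, 0.0]},
    index=pd.to_datetime(["2019-09-07", "2019-06-06"]),
)
df2 = pd.DataFrame(
    {"return": [np.nan, 0.02, 0.04, 0.10]},
    index=pd.to_datetime(["2010-01-01", "2019-03-01", "2019-08-30", "2019-09-05"]),
).sort_index()  # label slicing by date needs a sorted index

# Only visit the rows where control is True.
for date in df1.index[df1["control"].to_numpy()]:
    start = date - pd.Timedelta(days=244)  # earlier bound
    end = date - pd.Timedelta(days=6)      # later bound
    # .loc label slices include both ends; mean() skips NaN by default.
    df1.loc[date, "Avg_return"] = df2.loc[start:end, "return"].mean()

print(df1)
```

Note that the earlier bound (date minus 244 days) has to come first in the slice; with the bounds reversed, a sorted index yields an empty selection and the mean is NaN.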

Related

use groupby() and for loop to count column values with conditions

The logic of what I am trying to do I think is best explained with code:
import pandas as pd
import numpy as np
from datetime import timedelta
np.random.seed(365)
#some data
start_date = pd.date_range(start = "2015-01-09", end = "2022-09-11", freq = "6D")
end_date = [d + timedelta(days = np.random.exponential(scale = 100)) for d in start_date]
df = pd.DataFrame(
    {"start_date":start_date,
     "end_date":end_date}
)
#randomly remove some end dates
df["end_date"] = df["end_date"].sample(frac = 0.7).reset_index(drop = True)
df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")
I first create a pd.Series with the 1st day of every month in the entire history of the data:
dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time
What I then want to do is count the number of df["start_date"] values which are less than the 1st day of each month in the series and where the df["end_date"] values are null (recorded as NaT)
I would think I would use a for loop to do this and somehow groupby the dates series so that the resulting output looks something like this:
month_start  count
2015-01-01   5
2015-02-01   10
2015-03-01   35
The count column in the resulting output is a count of the number of df rows where the df["start_date"] values are less than the 1st of each month in the series and where the df["end_date"] values are null - this occurs for every value in the series
Here is the logic of what I am trying to do:
df.groupby(by = dates)[["start_date", "end_date"]].apply(
lambda x: [x["start_date"] < date for date in dates] & x["end_date"].isnull == True
)

Is this what you want?
df2 = df[df['end_date'].isnull()]
dates_count = dates.apply(lambda x: df2[df2['start_date'] < x]['start_date'].count())
print(pd.concat([dates, dates_count], axis=1))

IIUC, group by period (shifted by 1 month) and count the NaT, then cumsum to accumulate the counts:
(df['end_date'].isna()
.groupby(df['start_date'].dt.to_period('M').add(1).dt.start_time)
.sum()
.cumsum()
)
Output:
start_date
2015-02-01 0
2015-03-01 0
2015-04-01 0
2015-05-01 0
2015-06-01 0
...
2022-06-01 122
2022-07-01 127
2022-08-01 133
2022-09-01 138
2022-10-01 140
Name: end_date, Length: 93, dtype: int64
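Another option for the count described in the question, sketched here on a tiny made-up frame rather than the random data above: sort the start dates of the rows whose end_date is NaT and let np.searchsorted count how many fall strictly before each month start. The frame contents and the names open_starts and result are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the question's data.
df = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-01-09", "2015-01-15", "2015-02-20", "2015-03-05"]),
    "end_date": pd.to_datetime([None, "2015-02-01", None, None]),
})
dates = pd.Series(pd.to_datetime(["2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01"]))

# Start dates of the rows that have no end date, in ascending order.
open_starts = np.sort(df.loc[df["end_date"].isna(), "start_date"].to_numpy())

# side="left" returns, for each month start, the number of sorted values
# that are strictly smaller than it.
counts = np.searchsorted(open_starts, dates.to_numpy(), side="left")

result = pd.DataFrame({"month_start": dates, "count": counts})
print(result)
```

This avoids filtering the frame once per month, which may matter when there are many months or many rows.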

How to assign hours to data that is exceeding the hours?

Consider the following:
timeline = pd.date_range(start="2027-01-01",
                         end="2061-01-01",
                         freq="H")
timeline = timeline[:-1]
df1 = pd.DataFrame()
for i in range(0, 34):
    df2 = pd.DataFrame()
    df2['value'] = np.random.randint(1, 6, 8900)
    df2['year'] = 2027 + i
    df1 = pd.concat([df1, df2])
Note that 8900 is always larger than the number of hours in a year (at most 366 * 24 = 8784). The objective is to combine timeline and df1 so that, for each year, the first n rows of df1 fill that year's hours in the timeline. The remaining rows of that year are dropped and we continue with the next year.
The problem I am running into is that not all years have the same number of hours, because leap years have an extra day.
Is there an efficient way to perform the merge that takes the different number of hours per year into account?

Code
df1 = df1.reset_index(drop=True)
timeline = timeline.to_frame()
timeline = timeline.rename(columns={(0):'date'})
timeline['tlyear'] = timeline.date.dt.year
timeline = timeline.reset_index(drop=True)
pd.concat([timeline,df1], join='inner', axis=1).drop(columns='tlyear')
Complete code
timeline = pd.date_range(start="2027-01-01",
                         end="2061-01-01",
                         freq="H")
timeline = timeline[:-1]
df1 = pd.DataFrame()
for i in range(0, 34):
    df2 = pd.DataFrame()
    df2['value'] = np.random.randint(1, 6, 8900)
    df2['year'] = 2027 + i
    df1 = pd.concat([df1, df2])
df1 = df1.reset_index(drop=True)
timeline = timeline.to_frame()
timeline = timeline.rename(columns={(0):'date'})
timeline['tlyear'] = timeline.date.dt.year
timeline = timeline.reset_index(drop=True)
pd.concat([timeline,df1], join='inner', axis=1).drop(columns='tlyear')
Edit
for i in range(0, 34):
    df2 = pd.DataFrame()
    df2['value'] = np.random.randint(1, 6, 8900)
    df2['year'] = 2027 + i
    df1 = pd.concat([df1, df2])
df1 = df1.reset_index(drop=True)
timeline = timeline.to_frame()
timeline = timeline.rename(columns={(0):'date'})
timeline['year'] = timeline.date.dt.year
timeline = timeline.reset_index(drop=True)
pd.merge_asof(df1, timeline, on='year', direction='nearest')
Output Sample
date value year
0 2027-01-01 00:00:00 5 2027
1 2027-01-01 01:00:00 2 2027
2 2027-01-01 02:00:00 3 2027
3 2027-01-01 03:00:00 4 2027
4 2027-01-01 04:00:00 1 2027
... ... ... ...
298051 2060-12-31 19:00:00 1 2060
298052 2060-12-31 20:00:00 3 2060
298053 2060-12-31 21:00:00 2 2060
298054 2060-12-31 22:00:00 1 2060
298055 2060-12-31 23:00:00 3 2060

A slightly different approach comes to mind. We can number the rows within each year on both sides:
df1['Row'] = df1.groupby(['year']).cumcount()
timeline = timeline.to_frame()
timeline = timeline.rename(columns={(0):'date'})
timeline['year'] = timeline.date.dt.year
timeline['Row'] = timeline.groupby(['year']).cumcount()
and then merge on them both:
result = timeline.merge(df1, on=['year', 'Row'])
This will force the row order, I believe.
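A scaled-down sketch of this idea, assuming only two years (2027 and the leap year 2028) so that it runs quickly. The random generator, the column name Row and the frame name tl are illustrative choices. Merging on both year and the within-year row number keeps exactly as many rows per year as the timeline has hours and discards the surplus.

```python
import numpy as np
import pandas as pd

# Two years of hourly timestamps: 2027 has 8760 hours, 2028 (leap year) has 8784.
timeline = pd.date_range(start="2027-01-01", end="2029-01-01", freq="h")[:-1]

# 8900 values per year, more than any year has hours.
rng = np.random.default_rng(0)
df1 = pd.concat(
    [pd.DataFrame({"value": rng.integers(1, 6, 8900), "year": year})
     for year in (2027, 2028)],
    ignore_index=True,
)
df1["Row"] = df1.groupby("year").cumcount()  # 0, 1, 2, ... within each year

tl = pd.DataFrame({"date": timeline})
tl["year"] = tl["date"].dt.year
tl["Row"] = tl.groupby("year").cumcount()    # hour number within each year

# Inner merge: rows of df1 beyond the last hour of a year find no partner.
result = tl.merge(df1, on=["year", "Row"])
print(result.groupby("year").size())
```

The first value of each year in df1 lands on 1 January 00:00 of that year, regardless of how many hours the previous year had.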

Get the overlap duration between date intervals based on condition

I have two dataframes, they have a start/end datetime and a value. Not the same number of rows. The intervals which overlap may not be in the same row/index.
df1
start_datetime end_datetime value
08:50 09:50 5
09:52 10:10 6
10:50 11:30 2
df2
start_datetime end_datetime value
08:51 08:59 3
09:52 10:02 9
10:03 10:30 1
11:03 11:39 1
13:10 13:15 0
I would like to calculate the sum of duration time when df1 and df2 overlap only if df1.value > df2.value.
A single df2 interval can overlap several df1 intervals, and the condition may hold for only some of them.
I tried something like this:
time = timedelta()
for i, row1 in df1.iterrows():
    t1 = pd.Interval(row1.start_datetime, row1.end_datetime)
    for j, row2 in df2.iterrows():
        t2 = pd.Interval(row2.start_datetime, row2.end_datetime)
        if t1.overlaps(t2) and row1.value > row2.value:
            # the overlap runs from the later start to the earlier end
            latest_start = max(row1.start_datetime, row2.start_datetime)
            earliest_end = min(row1.end_datetime, row2.end_datetime)
            delta = earliest_end - latest_start
            time += delta
This loops over every df1 row and tests it against all of df2, which works but is not optimized.
expected output (example):
Timedelta('0 days 00:99:99')

Here is my solution:
Create DataFrames:
df1 = pd.DataFrame(
{"start_datetime1": ['08:50' ,'09:52' ,'10:50 ' ],
'end_datetime1' : ['09:50','10:10','11:30'] ,
'value1': [5,6,2]})
df2 = pd.DataFrame(
{"start_datetime2": ['08:51' ,'09:52' ,'10:03 ','11:03 ','13:10 ' ],
'end_datetime2' : ['08:59','10:02','10:30','11:39', '13:15'] ,
'value2': [3,9,1,1,0]})
df2["start_datetime2"]= pd.to_datetime(df2["start_datetime2"])
df2["end_datetime2"]= pd.to_datetime(df2["end_datetime2"])
df1["start_datetime1"]= pd.to_datetime(df1["start_datetime1"])
df1["end_datetime1"]= pd.to_datetime(df1["end_datetime1"])
Combine dataframes to make comparison easier. Combined dataframe has all possible matches :
df1['temp'] = 1 #temporary keys to create all combinations
df2['temp'] = 1
df_combined = pd.merge(df1,df2,on='temp').drop('temp',axis=1)
Compare values with a lambda function. For each overlapping pair, the overlap is the earliest end minus the latest start:
df_combined['Result'] = df_combined.apply(lambda row: min(row["end_datetime1"],row["end_datetime2"]) -
                                          max(row["start_datetime1"],row["start_datetime2"])
                                          if pd.Interval(row['start_datetime1'], row['end_datetime1']).overlaps(
                                              pd.Interval(row['start_datetime2'], row['end_datetime2'])) and
                                          row["value1"] > row["value2"]
                                          else pd.Timedelta(0), axis = 1 )
df_combined
Result:
total_timedelta = df_combined['Result'].sum()
0 days 00:42:00
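The same pairwise comparison can also be sketched without a row-wise apply. This version assumes pandas 1.2 or later (for merge(how="cross")) and uses shortened column names (start, end, value) that are not in the original frames. Each overlap is taken as the earlier end minus the later start, and pairs with no overlap are dropped.

```python
import pandas as pd

# Same intervals as in the question, parsed as times on a common dummy date.
df1 = pd.DataFrame({
    "start": pd.to_datetime(["08:50", "09:52", "10:50"], format="%H:%M"),
    "end": pd.to_datetime(["09:50", "10:10", "11:30"], format="%H:%M"),
    "value": [5, 6, 2],
})
df2 = pd.DataFrame({
    "start": pd.to_datetime(["08:51", "09:52", "10:03", "11:03", "13:10"], format="%H:%M"),
    "end": pd.to_datetime(["08:59", "10:02", "10:30", "11:39", "13:15"], format="%H:%M"),
    "value": [3, 9, 1, 1, 0],
})

# Every df1 row paired with every df2 row.
pairs = df1.merge(df2, how="cross", suffixes=("_1", "_2"))

# Later of the two starts and earlier of the two ends, element-wise.
latest_start = pairs["start_1"].where(pairs["start_1"] > pairs["start_2"], pairs["start_2"])
earliest_end = pairs["end_1"].where(pairs["end_1"] < pairs["end_2"], pairs["end_2"])

# Keep pairs that really overlap and satisfy value_1 > value_2.
keep = (earliest_end > latest_start) & (pairs["value_1"] > pairs["value_2"])
total = (earliest_end - latest_start)[keep].sum()
print(total)
```

The cross join still builds len(df1) * len(df2) rows, so for large inputs an interval-based join may be preferable; this is only meant to show the overlap arithmetic.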

Cumulative sum over days in python

I have the following dataframe:
date money
0 2018-01-01 20
1 2018-01-05 30
2 2018-02-15 7
3 2019-03-17 150
4 2018-01-05 15
...
2530 2019-03-17 350
And I need:
[(2018-01-01,20),(2018-01-05,65),(2018-02-15,72),...,(2019-03-17,572)]
So I need to do a cumulative sum of money over all days.
So far I have tried many things, and the closest I think I've got is:
graph_df.date = pd.to_datetime(graph_df.date)
temporary = graph_df.groupby('date').money.sum()
temporary = temporary.groupby(temporary.index.to_period('date')).cumsum().reset_index()
But this gives me ValueError: Invalid frequency: date
Could anyone help please?
Thanks

I don't think you need the second groupby. You can simply add a column with the cumulative sum.
This does the trick for me:
import pandas as pd
df = pd.DataFrame({'date': ['01-01-2019','04-06-2019', '07-06-2019'], 'money': [12,15,19]})
df['date'] = pd.to_datetime(df['date']) # this is not strictly needed
tmp = df.groupby('date')['money'].sum().reset_index()
tmp['money_sum'] = tmp['money'].cumsum()
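Converting the date column to actual datetimes is not needed for this to work, provided the date strings already sort chronologically (for example ISO YYYY-MM-DD). groupby sorts its keys, and plain strings are sorted as text.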
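A small illustration of when parsing the dates matters, using made-up DD/MM/YYYY strings: grouped as text, the keys come out in lexicographic order, so the running total follows the wrong sequence. After pd.to_datetime the same code accumulates in calendar order.

```python
import pandas as pd

# Hypothetical data with day-first date strings.
df = pd.DataFrame({
    "date": ["02/01/2019", "15/02/2018", "01/01/2018"],
    "money": [150, 7, 20],
})

# Keys sorted as text: '01/01/2018', '02/01/2019', '15/02/2018'
as_text = df.groupby("date")["money"].sum().cumsum()

# Keys sorted as dates: 2018-01-01, 2018-02-15, 2019-01-02
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
as_dates = df.groupby("date")["money"].sum().cumsum()

print(as_text)
print(as_dates)
```

The final total is the same in both cases; only the intermediate values differ, which is exactly what a cumulative sum over days is meant to get right.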

list(map(tuple, df.groupby('date', as_index=False)['money'].sum().values))
Edit:
df = pd.DataFrame({'date': ['2018-01-01', '2018-01-05', '2018-02-15', '2019-03-17', '2018-01-05'],
'money': [20, 30, 7, 150, 15]})
#df['date'] = pd.to_datetime(df['date'])
#df = df.sort_values(by='date')
temporary = df.groupby('date', as_index=False)['money'].sum()
temporary['money_cum'] = temporary['money'].cumsum()
Result:
>>> list(map(tuple, temporary[['date', 'money_cum']].values))
[('2018-01-01', 20),
('2018-01-05', 65),
('2018-02-15', 72),
('2019-03-17', 222)]

You can try df.groupby('date').sum() followed by cumsum(). Parse the dates first so that the groups are accumulated in chronological order.
Example data frame:
df
         date  money
0  01/01/2018     20
1  05/01/2018     30
2  15/02/2018      7
3  17/03/2019    150
4  05/01/2018     15
5  17/03/2019    550
6  15/02/2018     13
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
out = df.groupby('date')['money'].sum().cumsum()
list(zip(out.index.strftime('%d/%m/%Y'), out.tolist()))
[('01/01/2018', 20),
('05/01/2018', 65),
('15/02/2018', 85),
('17/03/2019', 785)]

"Iterative" Window function on subset of dataframe

I am looking for a way to create the column 'min_value' from the dataframe df below. For each row i, we subset from the entire dataframe all the records that correspond to the grouping ['Date_A', 'Date_B'] of the row i and having the condition 'Advance' less than 'Advance' of row i, and finally we pick the minimum of the column 'Amount' from this subset to set 'min_value' for the row i:
Initial dataframe:
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2018-1-25','2018-1-25','2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-2-1','2018-2-1','2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df = pd.DataFrame({'Date_A':Date_A,
'Date_B':Date_B,
'Advance' : [10,103,200,5,8,150],
'Amount' : [180,220,200,230,220,240]})
df = df [['Date_A', 'Date_B', 'Advance', 'Amount']]
df
Desired output:
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2018-1-25','2018-1-25','2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-2-1','2018-2-1','2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df_out = pd.DataFrame({'Date_A':Date_A,
'Date_B':Date_B,
'Advance' : [10,103,200,5,8,150],
'Amount' : [180,220,200,230,220,240],
'min_value': [180,180,180,230,230,220] })
df_out = df_out [['Date_A', 'Date_B', 'Advance', 'Amount','min_value']]
df_out
I wrote the following loop that I think would do the job but it is much too long to run, I guess there must be much more efficient ways to accomplish this.
for i in range(len(df)):
    date1=df['Date_A'][i] #select the date A of the row i
    date2=df['Date_B'][i] #select the date B of the row i
    advance= df['Advance'][i] #select the advance of the row i
    # subset the entire dataframe to meet dates and advance conditions
    df.loc[i,'min_value'] = df[(df['Date_A']==date1) & (df['Date_B']==date2) & (df['Advance']<advance)]['Amount'].min()
df.loc[df['min_value'].isnull(),'min_value']=df['Amount'] # for the smallest advance value, set min to its own amount
df
I hope it is clear enough, thanks for your help.
Improvement question
Thanks a lot for the answer. For the last part, the NA rows, I'd like to replace the amount of the row by the overall amount of the Date_A,Date_B,advance grouping so that I have the overall minimum of the last day before date_A
Improvement desired output (two records for the smallest advance value)
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2017-12-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-1-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df_out = pd.DataFrame({'Date_A':Date_A,
'Date_B':Date_B,
'Advance' : [5,8,150,5],
'Amount' : [230,220,240,225],
'min_value': [225,230,220,225] })
df_out = df_out [['Date_A', 'Date_B', 'Advance', 'Amount','min_value']]
df_out
Thanks

You can sort the values by 'Advance', group by 'Date_A' and 'Date_B', and apply cummin followed by shift to the column 'Amount' within each group. Using transform keeps the original row index, so the result lines up with df. Then use fillna with the value from the column 'Amount', such as:
df['min_value'] = (df.sort_values('Advance').groupby(['Date_A','Date_B'])['Amount']
                     .transform(lambda ser_g: ser_g.cummin().shift()).fillna(df['Amount']))
and you get:
Date_A Date_B Advance Amount min_value
0 2017-12-25 2018-01-01 10 180 180.0
1 2017-12-25 2018-01-01 103 220 180.0
2 2017-12-25 2018-01-01 200 200 180.0
3 2018-01-25 2018-02-01 5 230 230.0
4 2018-01-25 2018-02-01 8 220 230.0
5 2018-01-25 2018-02-01 150 240 220.0
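As a sanity check, here is a sketch that compares the grouped cummin/shift idea with a direct per-row computation on the sample data. It assumes the dates are kept as plain strings and that 'Advance' values are unique within each group (with ties, shift would also count a row with an equal advance, unlike the strict "less than" rule). The helper brute_force is a hypothetical name used only here.

```python
import pandas as pd

df = pd.DataFrame({
    "Date_A": ["2017-12-25"] * 3 + ["2018-01-25"] * 3,
    "Date_B": ["2018-01-01"] * 3 + ["2018-02-01"] * 3,
    "Advance": [10, 103, 200, 5, 8, 150],
    "Amount": [180, 220, 200, 230, 220, 240],
})

def brute_force(row):
    """Minimum Amount among same-group rows with a strictly smaller Advance."""
    same_group = (df["Date_A"] == row["Date_A"]) & (df["Date_B"] == row["Date_B"])
    smaller = df.loc[same_group & (df["Advance"] < row["Advance"]), "Amount"]
    # The row with the smallest Advance falls back to its own Amount.
    return smaller.min() if len(smaller) else row["Amount"]

expected = df.apply(brute_force, axis=1)

# Running minimum within each group, shifted by one row so that a row
# only sees the rows with a smaller Advance.
fast = (
    df.sort_values("Advance")
      .groupby(["Date_A", "Date_B"])["Amount"]
      .transform(lambda s: s.cummin().shift())
      .fillna(df["Amount"])
      .sort_index()
)

print(pd.DataFrame({"expected": expected, "fast": fast}))
```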

We Keep Coding
