How to assign hours to data that exceeds the available hours? - python

Consider the following:
timeline = pd.date_range(start="2027-01-01",
                         end="2061-01-01",
                         freq="H")
timeline = timeline[:-1]

df1 = pd.DataFrame()
for i in range(0, 34):
    df2 = pd.DataFrame()
    df2['value'] = np.random.randint(1, 6, 8900)
    df2['year'] = 2027 + i
    df1 = pd.concat([df1, df2])
Note that 8900 is always larger than 366 * 24 = 8784, the number of hours in a leap year. The objective is to combine the timeline and df1 such that, within each year, the first n rows of df1 fill that year's hours of the timeline; the leftover rows for that year are omitted and we continue with the next year.
The problem I am encountering is that not all years have the same number of hours, because some are leap years, which is quite troublesome. I was wondering whether there is an effective way to deal with that.
Is there a way to perform the merge, taking into account the intricacies of different hours per year?
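For reference, the number of hours in a given year can be computed directly with the standard calendar module (the helper name below is just for illustration):
import calendar

def hours_in_year(year):
    # 8784 hours in a leap year, 8760 otherwise
    return (366 if calendar.isleap(year) else 365) * 24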

Code
df1 = df1.reset_index(drop=True)
timeline = timeline.to_frame()
timeline = timeline.rename(columns={0: 'date'})
timeline['tlyear'] = timeline.date.dt.year
timeline = timeline.reset_index(drop=True)
pd.concat([timeline, df1], join='inner', axis=1).drop(columns='tlyear')
Complete code
import numpy as np
import pandas as pd

timeline = pd.date_range(start="2027-01-01",
                         end="2061-01-01",
                         freq="H")
timeline = timeline[:-1]

df1 = pd.DataFrame()
for i in range(0, 34):
    df2 = pd.DataFrame()
    df2['value'] = np.random.randint(1, 6, 8900)
    df2['year'] = 2027 + i
    df1 = pd.concat([df1, df2])

df1 = df1.reset_index(drop=True)
timeline = timeline.to_frame()
timeline = timeline.rename(columns={0: 'date'})
timeline['tlyear'] = timeline.date.dt.year
timeline = timeline.reset_index(drop=True)
pd.concat([timeline, df1], join='inner', axis=1).drop(columns='tlyear')
Edit
df1 = pd.DataFrame()
for i in range(0, 34):
    df2 = pd.DataFrame()
    df2['value'] = np.random.randint(1, 6, 8900)
    df2['year'] = 2027 + i
    df1 = pd.concat([df1, df2])

df1 = df1.reset_index(drop=True)
timeline = timeline.to_frame()
timeline = timeline.rename(columns={0: 'date'})
timeline['year'] = timeline.date.dt.year
timeline = timeline.reset_index(drop=True)
pd.merge_asof(df1, timeline, on='year', direction='nearest')
Output Sample
date value year
0 2027-01-01 00:00:00 5 2027
1 2027-01-01 01:00:00 2 2027
2 2027-01-01 02:00:00 3 2027
3 2027-01-01 03:00:00 4 2027
4 2027-01-01 04:00:00 1 2027
... ... ... ...
298051 2060-12-31 19:00:00 1 2060
298052 2060-12-31 20:00:00 3 2060
298053 2060-12-31 21:00:00 2 2060
298054 2060-12-31 22:00:00 1 2060
298055 2060-12-31 23:00:00 3 2060

There is a slightly different approach that comes to mind; we can just do the following:
df1['Row'] = df1.groupby(['year']).cumcount()
timeline = timeline.to_frame()
timeline = timeline.rename(columns={0: 'date'})
timeline['year'] = timeline.date.dt.year
timeline['Row'] = timeline.groupby(['year']).cumcount()
and then merge on them both:
result = timeline.merge(df1, on=['year', 'Row'])
This will force the row order, I believe.
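For completeness, here is a minimal end-to-end sketch of that cumcount-and-merge idea, reusing the df1/timeline construction from the question (the final groupby size check is only illustrative):
import numpy as np
import pandas as pd

# hourly timeline, 2027-2060 inclusive
timeline = pd.date_range(start="2027-01-01", end="2061-01-01", freq="H")[:-1]
timeline = timeline.to_frame(index=False, name='date')
timeline['year'] = timeline['date'].dt.year
timeline['Row'] = timeline.groupby('year').cumcount()

# 8900 values per year, always more than the 8784 hours of a leap year
df1 = pd.concat(
    [pd.DataFrame({'value': np.random.randint(1, 6, 8900), 'year': 2027 + i})
     for i in range(34)],
    ignore_index=True,
)
df1['Row'] = df1.groupby('year').cumcount()

# the inner merge keeps exactly as many rows per year as that year has hours
result = timeline.merge(df1, on=['year', 'Row']).drop(columns='Row')
print(result.groupby('year').size().head())  # 8760 or 8784 per year
Within each year, the inner join keeps only as many df1 rows as that year has hours, so leap years consume 8784 values and other years 8760, which is exactly the omit-and-continue behaviour described above.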

Related

Inserting rows in specific location using pandas

I have a CSV file containing the following data structure:
2015-01-02,09:30:00,64.815
2015-01-02,09:35:00,64.8741
2015-01-02,09:55:00,65.0255
2015-01-02,10:00:00,64.9269
Using pandas in Python, I would like to quadruple the 2nd row and insert the new rows after it (filling the missing 5-minute intervals with the 2nd row's value). Eventually, it should look like:
2015-01-02,09:30:00,64.815
2015-01-02,09:35:00,64.8741
2015-01-02,09:40:00,64.8741
2015-01-02,09:45:00,64.8741
2015-01-02,09:50:00,64.8741
2015-01-02,09:55:00,65.0255
2015-01-02,10:00:00,64.9269
2015-01-02,10:05:00,64.815
I have the following code:
df = pd.read_csv("csv.file", header=0, names=['date', 'minute', 'price'])
for i in range(len(df)):
if i != len(df)-1:
next_i = i+1
if df.loc[next_i, 'date'] == df.loc[i, 'date'] and df.loc[i, 'minute'] != "16:00:00":
now = int(df.loc[i, "minute"][:2]+df.loc[i, "minute"][3:5])
future = int(df.loc[next_i, "minute"][:2]+df.loc[next_i, "minute"][3:5])
while now + 5 != future and df.loc[next_i, "minute"][3:5] != "00" and df.loc[next_i, "minute"][3:5] != "60":
newminutes = str(int(df.loc[i, "minute"][3:5])+5*a)
newtime = df.loc[next_i, "minute"][:2] +":"+newminutes+":00"
df.loc[next_i-0.5] = [df.loc[next_i, 'date'], newtime , df.loc[i, 'price']]
df = df.sort_index().reset_index(drop=True)
now = int(newtime[:2]+newtime[3:5])
future = int(df.loc[next_i+1, "minute"][:2]+df.loc[next_i+1, "minute"][3:5])
However, it's not working.
I see there is an extra row in the expected output: 2015-01-02,10:05:00,64.815.
To accommodate that as well, you can reindex using pd.date_range.
Creating data
data = {
    'date' : ['2015-01-02', '2015-01-02', '2015-01-02', '2015-01-02'],
    'time' : ['09:30:00', '09:35:00', '09:55:00', '10:00:00'],
    'val' : [64.815, 64.8741, 65.0255, 64.9269]
}
df = pd.DataFrame(data)
Creating datetime column for reindexing
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
df.set_index('datetime', inplace=True)
Generating output
df = df.resample('5min').asfreq().reindex(pd.date_range('2015-01-02 09:30:00', '2015-01-02 10:05:00', freq='5 min')).ffill()
df[['date', 'time']] = df.index.astype(str).to_series().str.split(' ', expand=True).values
df.reset_index(drop=True)
Output
This gives us the expected output
date time val
0 2015-01-02 09:30:00 64.8150
1 2015-01-02 09:35:00 64.8741
2 2015-01-02 09:40:00 64.8741
3 2015-01-02 09:45:00 64.8741
4 2015-01-02 09:50:00 64.8741
5 2015-01-02 09:55:00 65.0255
6 2015-01-02 10:00:00 64.9269
7 2015-01-02 10:05:00 64.9269
However, if that was a typo and you don't want the last row, you can do this:
df = df.resample('5min').asfreq().reindex(pd.date_range(df.index[0], df.index[len(df)-1], freq='5 min')).ffill()
df[['date', 'time']] = df.index.astype(str).to_series().str.split(' ', expand=True).values
df.reset_index(drop=True)
which gives us
date time val
0 2015-01-02 09:30:00 64.8150
1 2015-01-02 09:35:00 64.8741
2 2015-01-02 09:40:00 64.8741
3 2015-01-02 09:45:00 64.8741
4 2015-01-02 09:50:00 64.8741
5 2015-01-02 09:55:00 65.0255
6 2015-01-02 10:00:00 64.9269
Try pandas' merge_ordered function.
Create the original data frame:
data = {
    'date' : ['2015-01-02', '2015-01-02', '2015-01-02', '2015-01-02'],
    'time' : ['09:30:00', '09:35:00', '09:55:00', '10:00:00'],
    'val' : [64.815, 64.8741, 65.0255, 64.9269]
}
df = pd.DataFrame(data)
df['datetime']=pd.to_datetime(df['date']+' '+df['time'])
Create a second data frame df2 with 5-minute time intervals from the min to the max of df:
df2=pd.DataFrame(pd.date_range(df['datetime'].min(), df['datetime'].max(), freq='5 min').rename('datetime'))
Using pandas' merge_ordered function:
result = pd.merge_ordered(df2, df, on='datetime', how='left')
result['date'] = result['datetime'].dt.date
result['time'] = result['datetime'].dt.time
result['val'] = result['val'].ffill()
result = result.drop('datetime', axis=1)
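As a side note, merge_ordered can also forward-fill during the merge itself via its fill_method argument; a small sketch, assuming the same df and df2 as above:
result = pd.merge_ordered(df2, df, on='datetime', how='left', fill_method='ffill')
# fill_method also forward-fills the original date/time strings, so recompute them
result['date'] = result['datetime'].dt.date
result['time'] = result['datetime'].dt.time
result = result.drop('datetime', axis=1)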

Create a list of years with pandas

I have a dataframe with a column of dates of the form
2004-01-01
2005-01-01
2006-01-01
2007-01-01
2008-01-01
2009-01-01
2010-01-01
2011-01-01
2012-01-01
2013-01-01
2014-01-01
2015-01-01
2016-01-01
2017-01-01
2018-01-01
2019-01-01
Given an integer number k, let's say k=5, I would like to generate an array of the next k years after the maximum date of the column. The output should look like:
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01
Let's use pd.to_datetime + max to compute the largest date in the date column, then use pd.date_range to generate the dates with a one-year offset frequency and the number of periods equal to k=5:
strt, offs = pd.to_datetime(df['date']).max(), pd.DateOffset(years=1)
dates = pd.date_range(strt + offs, freq=offs, periods=k).strftime('%Y-%m-%d').tolist()
print(dates)
['2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01', '2024-01-01']
Here you go:
import pandas as pd

# this is your k
k = 5

# Creating a test DF
array = {'dt': ['2018-01-01', '2019-01-01']}
df = pd.DataFrame(array)

# Extracting column of year
df['year'] = pd.DatetimeIndex(df['dt']).year
year1 = df['year'].max()

# creating a new DF and populating it with k years
years_df = pd.DataFrame()
for i in range(1, k+1):
    row = {'dates': [str(year1 + i) + '-01-01']}
    years_df = pd.concat([years_df, pd.DataFrame(row)], ignore_index=True)
years_df
The output:
dates
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01

Problem in the selection of a part of data in pandas

I have the following data, extracted with pandas from CSV files:
df1 = pd.read_csv(path, parse_dates=True)
Printing df1 gives:
control Avg_return
2019-09-07 True 0
2019-06-06 True 0
2019-02-19 True 0
2019-01-17 True 0
2018-12-20 True 0
2018-11-27 True 0
2018-10-12 True 0
... ... ...
Then I load the second CSV file:
df2 = pd.read_csv(path, parse_dates=True)
Printing df2 gives:
return
2010-01-01 NaN
2010-04-01 0.010920
2010-05-01 -0.004404
2010-06-01 -0.025209
2010-07-01 -0.023280
... ...
The aim of my code is :
Take a date from df1
Subtract 6 days from the date taken in point 1.
Subtract 244 days from the date taken in point 1.
Take all the returns between these two dates from df2
Compute the mean of these returns and store it in Avg_return
I did this :
for i in range(0, df1_row):
    # I go through my data df1
    if (control.iloc[i] == True):
        # I check if control_1 is true
        date_1 = df1.index[i] - pd.to_timedelta(6, unit='d')
        # I remove 6 days from my date
        date_2 = df1.index[i] - pd.to_timedelta(244, unit='d')
        # I remove 244 days from my date
        df1.loc[i, "Average_return"] = df2[[date_1:date_2], ["return"]].mean()
        # I want to take the mean of the return between my date minus 6 days and my date minus 244 days
Unfortunately it gives me this error :
df1.loc[i,"Average_return"] = df2[[date1:date2],["return"]].mean()
^
SyntaxError: invalid syntax
Is someone able to help me? :)
The following looks a bit ugly, but I think it works :)
Dummy df's:
import numpy as np
import pandas as pd

cols = ['date', 'control', 'Avg_return']
data = [
    [pd.to_datetime('2019-09-07'), True, 0],
    [pd.to_datetime('2019-06-06'), True, 0]
]
df1 = pd.DataFrame(data, columns=cols)

cols2 = ['date', 'return']
data2 = [
    [pd.to_datetime('2010-01-01'), np.nan],
    [pd.to_datetime('2010-04-01'), 0.010920],
    [pd.to_datetime('2019-09-01'), 1]
]
df2 = pd.DataFrame(data2, columns=cols2)
Drafted solution:
import datetime as dt

df1['date_minus_6'] = df1['date'] - dt.timedelta(days=6)
df1['date_minus_244'] = df1['date'] - dt.timedelta(days=244)

for i in range(0, df1.shape[0]):
    for j in range(0, df2.shape[0]):
        if df2['date'].iloc[j] == df1['date_minus_6'].iloc[i]:
            df1['Avg_return'].iloc[i] = (
                df1['Avg_return'].iloc[i] + df2['return'].iloc[j]
            ).mean()
        elif df2['date'].iloc[j] == df1['date_minus_244'].iloc[i]:
            df1['Avg_return'].iloc[i] = (
                df1['Avg_return'].iloc[i] + df2['return'].iloc[j]
            ).mean()
Output:
date control Avg_return date_minus_6 date_minus_244
0 2019-09-07 True 1.0 2019-09-01 2019-01-06
1 2019-06-06 True 0.0 2019-05-31 2018-10-05
import csv
import pandas as pd

df1 = pd.read_csv('dsf1.csv', parse_dates=True)
df2 = pd.read_csv('dsf2.csv', parse_dates=True)

df1.columns = ['date', 'control', 'return']
df2.columns = ['date', 'return']

df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

for i in range(0, df1.shape[0]):
    if df1['control'][i] == True:
        date_1 = df1['date'][0] - pd.to_timedelta(6, unit='d')
        date_2 = df2['date'][0] - pd.to_timedelta(244, unit='d')
        # I'm not sure if average_return has the correct condition, but adjust as you see fit
        df1.loc[i, 'average_return'] = (df1[df1['date'] > date_1]['return'] - df2[df2['date'] > date_2]['return']).mean()

print(df1)
This is a different approach without looping over all rows:
# make sure your index is a datetime index
df1.index = pd.to_datetime(df1.index)

df1['date_1'] = df1.index - pd.to_timedelta(6, unit='d')
df1['date_2'] = df1.index - pd.to_timedelta(244, unit='d')

# slice df2 from the earlier date (date_2) up to the later one (date_1) and average the returns
df1['Average_return'] = df1.apply(lambda r: df2.loc[r['date_2']: r['date_1'], 'return'].mean(), axis=1)
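For the label-based slice above to work, df2 also needs a sorted DatetimeIndex; a small preparatory step, assuming df2 is indexed by its date column as in the question:
# parse and sort the index so that .loc date slicing behaves as expected
df2.index = pd.to_datetime(df2.index)
df2 = df2.sort_index()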

Cumulative sum over days in python

I have the following dataframe:
date money
0 2018-01-01 20
1 2018-01-05 30
2 2018-02-15 7
3 2019-03-17 150
4 2018-01-05 15
...
2530 2019-03-17 350
And I need:
[(2018-01-01,20),(2018-01-05,65),(2018-02-15,72),...,(2019-03-17,572)]
So I need to do a cumulative sum of money over all days.
So far I have tried many things, and the closest I think I've got is:
graph_df.date = pd.to_datetime(graph_df.date)
temporary = graph_df.groupby('date').money.sum()
temporary = temporary.groupby(temporary.index.to_period('date')).cumsum().reset_index()
But this gives me ValueError: Invalid frequency: date
Could anyone help please?
Thanks
I don't think you need the second groupby. You can simply add a column with the cumulative sum.
This does the trick for me:
import pandas as pd
df = pd.DataFrame({'date': ['01-01-2019','04-06-2019', '07-06-2019'], 'money': [12,15,19]})
df['date'] = pd.to_datetime(df['date']) # this is not strictly needed
tmp = df.groupby('date')['money'].sum().reset_index()
tmp['money_sum'] = tmp['money'].cumsum()
Converting the date column to an actual date is not needed for this to work.
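If the final goal is the list of (date, cumulative sum) tuples from the question, a small follow-up on the tmp frame above could be:
# plain (date, cumulative sum) tuples, as in the desired output
pairs = list(tmp[['date', 'money_sum']].itertuples(index=False, name=None))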
list(map(tuple, df.groupby('date', as_index=False)['money'].sum().values))
Edit:
df = pd.DataFrame({'date': ['2018-01-01', '2018-01-05', '2018-02-15', '2019-03-17', '2018-01-05'],
                   'money': [20, 30, 7, 150, 15]})
#df['date'] = pd.to_datetime(df['date'])
#df = df.sort_values(by='date')
temporary = df.groupby('date', as_index=False)['money'].sum()
temporary['money_cum'] = temporary['money'].cumsum()
Result:
>>> list(map(tuple, temporary[['date', 'money_cum']].values))
[('2018-01-01', 20),
('2018-01-05', 65),
('2018-02-15', 72),
('2019-03-17', 222)]
you can try using df.groupby('date').sum():
example data frame:
df
date money
0 01/01/2018 20
1 05/01/2018 30
2 15/02/2018 7
3 17/03/2019 150
4 05/01/2018 15
5 17/03/2019 550
6 15/02/2018 13
df['cumsum'] = df.money.cumsum()
list(zip(df.groupby('date').tail(1)['date'], df.groupby('date').tail(1)['cumsum']))
[('01/01/2018', 20),
('05/01/2018', 222),
('17/03/2019', 772),
('15/02/2018', 785)]

"Iterative" Window function on subset of dataframe

I am looking for a way to create the column 'min_value' in the dataframe df below. For each row i, we take all records of the dataframe that share row i's ['Date_A', 'Date_B'] grouping and have 'Advance' less than row i's 'Advance', and we set 'min_value' for row i to the minimum of 'Amount' over this subset:
Initial dataframe:
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2018-1-25','2018-1-25','2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-2-1','2018-2-1','2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df = pd.DataFrame({'Date_A': Date_A,
                   'Date_B': Date_B,
                   'Advance': [10,103,200,5,8,150],
                   'Amount': [180,220,200,230,220,240]})
df = df[['Date_A', 'Date_B', 'Advance', 'Amount']]
df
Desired output:
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2018-1-25','2018-1-25','2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-2-1','2018-2-1','2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df_out = pd.DataFrame({'Date_A': Date_A,
                       'Date_B': Date_B,
                       'Advance': [10,103,200,5,8,150],
                       'Amount': [180,220,200,230,220,240],
                       'min_value': [180,180,180,230,230,220]})
df_out = df_out[['Date_A', 'Date_B', 'Advance', 'Amount', 'min_value']]
df_out
I wrote the following loop that I think does the job, but it takes far too long to run; I guess there must be much more efficient ways to accomplish this.
for i in range(len(df)):
    date1 = df['Date_A'][i]     # select the Date_A of row i
    date2 = df['Date_B'][i]     # select the Date_B of row i
    advance = df['Advance'][i]  # select the Advance of row i
    # subset the entire dataframe to meet the dates and advance conditions
    df.loc[i, 'min_value'] = df[df['Date_A']==date1][df['Date_B']==date2][df['Advance']<advance]['Amount'].min()

# for the smallest advance value, set min_value to its own amount
df.loc[df['min_value'].isnull(), 'min_value'] = df['Amount']
df
I hope it is clear enough, thanks for your help.
Improvement question
Thanks a lot for the answer. For the last part, the NA rows, I'd like to replace the row's amount by the overall amount of the (Date_A, Date_B, Advance) grouping, so that I have the overall minimum of the last day before Date_A.
Improvement desired output (two records for the smallest advance value)
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2017-12-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-1-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df_out = pd.DataFrame({'Date_A': Date_A,
                       'Date_B': Date_B,
                       'Advance': [5,8,150,5],
                       'Amount': [230,220,240,225],
                       'min_value': [225,230,220,225]})
df_out = df_out[['Date_A', 'Date_B', 'Advance', 'Amount', 'min_value']]
df_out
Thanks
You can use groupby on 'Date_A' and 'Date_B' after sorting the values by 'Advance', and apply cummin and shift to the column 'Amount'. Then use fillna with the values from the column 'Amount', such as:
df['min_value'] = (df.sort_values('Advance').groupby(['Date_A','Date_B'])['Amount']
                     .apply(lambda ser_g: ser_g.cummin().shift()).fillna(df['Amount']))
and you get:
Date_A Date_B Advance Amount min_value
0 2017-12-25 2018-01-01 10 180 180.0
1 2017-12-25 2018-01-01 103 220 180.0
2 2017-12-25 2018-01-01 200 200 180.0
3 2018-01-25 2018-02-01 5 230 230.0
4 2018-01-25 2018-02-01 8 220 230.0
5 2018-01-25 2018-02-01 150 240 220.0
