I have a dataframe, df, for which I wish to take the delta over every 7-day period and record it against the timestamp for that period
df:
Date Value
10/15/2020 75
10/14/2020 70
10/13/2020 65
10/12/2020 60
10/11/2020 55
10/10/2020 50
10/9/2020 45
10/8/2020 40
10/7/2020 35
10/6/2020 30
10/5/2020 25
10/4/2020 20
10/3/2020 15
10/2/2020 10
10/1/2020 5
Desired Output:
10/15/2020 to 10/9/2020 is a 7-day period, with the delta being: 75 - 45 = 30
so the 10/9/2020 timestamp would be: 30, and so on
Date Value
10/9/2020 30
10/2/2020 30
This is what I am doing:
df = df['Delta']=df.iloc[:,6].sub(df.iloc[:,0]),Date=pd.Series(pd.date_range(pd.Timestamp('2020-10-15'), periods=7, freq='7d')))[['Delta','Date']]
I am also thinking I may be able to do this:
Edit: I updated callDate to Date
for row in df.itertuples():
Date = datetime.strptime(row.Date, "%m/%d/%y %I:%M %p")
previousRecord = df['Date'].shift(-6).strptime(row.Date, "%m/%d/%y %I:%M %p")
Delta = Date - previousRecord
Any suggestion is appreciated
Don't iterate through the dataframe. You can use a merge:
(df.merge(df.assign(Date=df['Date'] - pd.to_timedelta('6D')),
on='Date')
.assign(Value = lambda x: x['Value_y']-x['Value_x'])
[['Date','Value']]
)
Output:
Date Value
0 2020-10-09 30
1 2020-10-08 30
2 2020-10-07 30
3 2020-10-06 30
4 2020-10-05 30
5 2020-10-04 30
6 2020-10-03 30
7 2020-10-02 30
8 2020-10-01 30
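A self-contained sketch of the merge approach above, starting from freshly built sample data (here in ascending date order, so the output is ascending too):

```python
import pandas as pd

# Sample data as in the question: 10/1/2020 through 10/15/2020, values 5..75
df = pd.DataFrame({
    'Date': pd.date_range('2020-10-01', '2020-10-15'),
    'Value': range(5, 80, 5),
})

# Shift every date back 6 days, then merge on Date: each merged row pairs
# the value at day t (Value_x) with the value at day t+6 (Value_y)
result = (df.merge(df.assign(Date=df['Date'] - pd.to_timedelta('6D')),
                   on='Date')
            .assign(Value=lambda x: x['Value_y'] - x['Value_x'])
            [['Date', 'Value']])

print(result)
```

Since the sample values rise by 5 per day, every 7-day delta comes out as 30, matching the desired output.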
The last block of code you wrote is the way I would do it. The only problem is in Delta = Date - previousRecord: there is nothing called Date here. You should instead be accessing the value associated with callDate.
Related
I want to convert all rows of my DataFrame that contain hours and minutes into minutes only.
I have a dataframe that looks like this:
df=
time
0 8h30
1 14h07
2 08h30
3 7h50
4 8h0
5 8h15
6 6h15
I'm using the following method to convert:
df['time'] = pd.eval(
df['time'].replace(['h'], ['*60+'], regex=True))
Output
SyntaxError: invalid syntax
I think the error comes from the format of the hour; maybe pd.eval can't accept 08h30 or 8h0. How do I solve this problem?
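The diagnosis is right: after the replace, 08h30 becomes 08*60+30, and the leading zero makes that an invalid Python integer literal, which is what triggers the SyntaxError. A minimal reproduction:

```python
import pandas as pd

# '08*60+30' is what '08h30'.replace('h', '*60+') produces;
# leading zeros are not valid in Python integer literals
try:
    pd.eval("08*60+30")
except SyntaxError as exc:
    print("SyntaxError:", exc)
```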
Pandas can already handle such strings if the units are included in the string. While 14h07 can't be parsed (why assume 07 is minutes?), 14h07m can be converted to a Timedelta:
>>> pd.to_timedelta("14h07m")
Timedelta('0 days 14:07:00')
Given this dataframe:
d1 = pd.DataFrame(['8h30m', '14h07m', '08h30m', '8h0m'],
columns=['time'])
You can convert the time series into a Timedelta series with pd.to_timedelta:
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d1
time tm
0 8h30m 0 days 08:30:00
1 14h07m 0 days 14:07:00
2 08h30m 0 days 08:30:00
3 8h0m 0 days 08:00:00
To handle the missing minutes unit in the original data, just append m:
d1['tm'] = pd.to_timedelta(d1['time'] + 'm')
Once you have a Timedelta you can calculate hours and minutes.
The components of the values can be retrieved with Timedelta.components
>>> d1.tm.dt.components.hours
0 8
1 14
2 8
3 8
Name: hours, dtype: int64
To get the total in minutes, seconds or hours, change the frequency, here to minutes:
>>> d1.tm.astype('timedelta64[m]')
0 510.0
1 847.0
2 510.0
3 480.0
Name: tm, dtype: float64
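Note that astype('timedelta64[m]') is no longer supported on recent pandas versions; dividing by a one-minute Timedelta gives the same totals and works across versions. A sketch:

```python
import pandas as pd

d1 = pd.DataFrame({'time': ['8h30m', '14h07m', '08h30m', '8h0m']})
d1['tm'] = pd.to_timedelta(d1['time'])

# Total minutes, without relying on astype('timedelta64[m]')
d1['total_minutes'] = d1['tm'] / pd.Timedelta(minutes=1)
print(d1['total_minutes'].tolist())  # [510.0, 847.0, 510.0, 480.0]
```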
Bringing all the operations together:
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d2 = (d1.assign(h=d1.tm.dt.components.hours,
... m=d1.tm.dt.components.minutes,
... total_minutes=d1.tm.astype('timedelta64[m]')))
>>>
>>> d2
time tm h m total_minutes
0 8h30m 0 days 08:30:00 8 30 510.0
1 14h07m 0 days 14:07:00 14 7 847.0
2 08h30m 0 days 08:30:00 8 30 510.0
3 8h0m 0 days 08:00:00 8 0 480.0
To avoid having to trim leading zeros, an alternative approach:
df[['h', 'm']] = df['time'].str.split('h', expand=True).astype(int)
df['total_min'] = df['h']*60 + df['m']
Result:
time h m total_min
0 8h30 8 30 510
1 14h07 14 7 847
2 08h30 8 30 510
3 7h50 7 50 470
4 8h0 8 0 480
5 8h15 8 15 495
6 6h15 6 15 375
Just to give an alternative approach, with roughly the same elements as above, you could do:
df = pd.DataFrame(data=["8h30", "14h07", "08h30", "7h50", "8h0 ", "8h15", "6h15"],
columns=["time"])
First split your column on the "h":
hm = df["time"].str.split("h", expand=True)
Then combine the columns again, zero-padding hours and minutes to make valid time strings:
df2 = hm[0].str.strip().str.zfill(2) + hm[1].str.strip().str.zfill(2)
Then convert the string column, now with proper values, to a datetime column:
df3 = pd.to_datetime(df2, format="%H%M")
Finally, calculate the number of minutes by subtracting a zero time (to make timedeltas) and dividing by a one-minute timedelta:
zerotime = pd.to_datetime("0000", format="%H%M")
df['minutes'] = (df3 - zerotime) / pd.Timedelta(minutes=1)
The results look like:
time minutes
0 8h30 510.0
1 14h07 847.0
2 08h30 510.0
3 7h50 470.0
4 8h0 480.0
5 8h15 495.0
6 6h15 375.0
how to convert time to week number
year_start = '2019-05-21'
year_end = '2020-02-22'
How do I get the week number based on the date that I set as first week?
For example 2019-05-21 should be Week 1 instead of 2019-01-01
If you do not have dates outside of year_start/year_end, use isocalendar().week and perform a simple subtraction with modulo:
year_start = pd.to_datetime('2019-05-21')
#year_end = pd.to_datetime('2020-02-22')
df = pd.DataFrame({'date': pd.date_range('2019-05-21', '2020-02-22', freq='30D')})
df['week'] = (df['date'].dt.isocalendar().week.astype(int)-year_start.isocalendar()[1])%52+1
Output:
date week
0 2019-05-21 1
1 2019-06-20 5
2 2019-07-20 9
3 2019-08-19 14
4 2019-09-18 18
5 2019-10-18 22
6 2019-11-17 26
7 2019-12-17 31
8 2020-01-16 35
9 2020-02-15 39
Try the following code.
import numpy as np
import pandas as pd
year_start = '2019-05-21'
year_end = '2020-02-22'
# Create a sample dataframe
df = pd.DataFrame(pd.date_range(year_start, year_end, freq='D'), columns=['date'])
# Add the week number
df['week_number'] = (
    ((df.date.view(np.int64) - pd.to_datetime([year_start]).view(np.int64)) / (1e9 * 60 * 60 * 24)
     - df.date.dt.day_of_week + 7) // 7 + 1
).astype(np.int64)
date        week_number
2019-05-21  1
2019-05-22  1
2019-05-23  1
2019-05-24  1
2019-05-25  1
2019-05-26  1
2019-05-27  2
2019-05-28  2
...
2020-02-18  40
2020-02-19  40
2020-02-20  40
2020-02-21  40
2020-02-22  40
If you just need a function to calculate the week number based on a given start and end date:
import pandas as pd
import numpy as np
start_date = "2019-05-21"
end_date = "2020-02-22"
start_datetime = pd.to_datetime(start_date)
end_datetime = pd.to_datetime(end_date)
def get_week_no(date):
    given_datetime = pd.to_datetime(date)
    # if date in range
    if start_datetime <= given_datetime <= end_datetime:
        x = given_datetime - start_datetime
        # adding 1 as it will return 0 for 1st week
        return int(x / np.timedelta64(1, 'W')) + 1
    raise ValueError(f"Date is not in range {start_date} - {end_date}")
print(get_week_no("2019-05-21"))
In the function, we calculate the week number by finding the difference between the given date and the start date in weeks.
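A vectorized equivalent of the same idea (a sketch: it counts completed 7-day blocks since the start date, not ISO calendar weeks):

```python
import pandas as pd

start = pd.Timestamp('2019-05-21')
dates = pd.Series(pd.to_datetime(['2019-05-21', '2019-05-27',
                                  '2019-05-28', '2020-02-22']))

# Whole 7-day blocks elapsed since the start, plus 1 so the first week is 1
week_no = (dates - start).dt.days // 7 + 1
print(week_no.tolist())  # [1, 1, 2, 40]
```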
I have a pandas dataframe with a date column
I'm trying to create a function and apply it to the dataframe, to create a column that returns the number of days in the month/year specified.
So far I have:
from calendar import monthrange
def dom(x):
    m = dfs["load_date"].dt.month
    y = dfs["load_date"].dt.year
    monthrange(y,m)
    days = monthrange[1]
    return days
This however does not work when I attempt to apply it to the date column.
Additionally, I would like to be able to identify whether or not it is the current month, and if so return the number of days up to the current date in that month as opposed to days in the entire month.
I am not sure of the best way to do this, all I can think of is to check the month/year against datetime's today and then use a delta
thanks in advance
For pt.1 of your question, you can cast to pd.Period and retrieve days_in_month:
import pandas as pd
# create a sample df:
df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
df['daysinmonths'] = df['date'].apply(lambda t: pd.Period(t, freq='S').days_in_month)
# df['daysinmonths']
# 0 31
# 1 29
# 2 31
# ...
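For comparison, the asker's monthrange idea also works once the function receives one timestamp at a time (a sketch, assuming load_date is already a datetime column):

```python
import pandas as pd
from calendar import monthrange

dfs = pd.DataFrame({'load_date': pd.to_datetime(['2020-01-15', '2020-02-10'])})

def dom(ts):
    # monthrange returns (weekday of the 1st, number of days in the month)
    return monthrange(ts.year, ts.month)[1]

dfs['days_in_month'] = dfs['load_date'].apply(dom)
print(dfs['days_in_month'].tolist())  # [31, 29] -- 2020 is a leap year
```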
For pt.2, you can take the timestamp of 'now' and create a boolean mask for your date column, i.e. where its year/month is less than "now". Then calculate the cumsum of the daysinmonths column for the section where the mask returns True, and invert the order of that series to get the days until now.
now = pd.Timestamp('now')
m = (df['date'].dt.year <= now.year) & (df['date'].dt.month < now.month)
df['daysuntilnow'] = df['daysinmonths'][m].cumsum().iloc[::-1].reset_index(drop=True)
Update after comment: to get the elapsed days per month, you can do
df['dayselapsed'] = df['daysinmonths']
m = (df['date'].dt.year == now.year) & (df['date'].dt.month == now.month)
if m.any():
df.loc[m, 'dayselapsed'] = now.day
df.loc[(df['date'].dt.year >= now.year) & (df['date'].dt.month > now.month), 'dayselapsed'] = 0
output
df
Out[13]:
date daysinmonths daysuntilnow dayselapsed
0 2020-01-31 31 213.0 31
1 2020-02-29 29 182.0 29
2 2020-03-31 31 152.0 31
3 2020-04-30 30 121.0 30
4 2020-05-31 31 91.0 31
5 2020-06-30 30 60.0 30
6 2020-07-31 31 31.0 31
7 2020-08-31 31 NaN 27
8 2020-09-30 30 NaN 0
9 2020-10-31 31 NaN 0
10 2020-11-30 30 NaN 0
11 2020-12-31 31 NaN 0
I have a pandas dataframe with id and date as the 2 columns; the date column has precision all the way down to seconds.
data = {'id':[17,17,17,17,17,18,18,18,18],'date':['2018-01-16','2018-01-26','2018-01-27','2018-02-11',
'2018-03-14','2018-01-28','2018-02-12','2018-02-25','2018-03-04'],
}
df1 = pd.DataFrame(data)
I would like to have a new column - (tslt) - 'time_since_last_transaction'. The first transaction for each unique user_id could be a number, say 1. Each subsequent transaction for that user should measure the difference between the first timestamp for that user and its current timestamp, to generate a time difference in seconds.
I used datetime and timedelta etc. but did not have much luck. Any help would be appreciated.
You can try groupby().transform():
df1['date'] = pd.to_datetime(df1['date'])
df1['diff'] = df1['date'].sub(df1.groupby('id').date.transform('min')).dt.total_seconds()
Output:
id date diff
0 17 2018-01-16 0.0
1 17 2018-01-26 864000.0
2 17 2018-01-27 950400.0
3 17 2018-02-11 2246400.0
4 17 2018-03-14 4924800.0
5 18 2018-01-28 0.0
6 18 2018-02-12 1296000.0
7 18 2018-02-25 2419200.0
8 18 2018-03-04 3024000.0
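The groupby().transform() line can be verified end-to-end on a small slice of the sample data:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [17, 17, 18, 18],
                    'date': ['2018-01-16', '2018-01-26',
                             '2018-01-28', '2018-02-12']})
df1['date'] = pd.to_datetime(df1['date'])

# Seconds elapsed since each id's first (minimum) date
df1['diff'] = df1['date'].sub(df1.groupby('id').date.transform('min')).dt.total_seconds()
print(df1['diff'].tolist())  # [0.0, 864000.0, 0.0, 1296000.0]
```

10 days is 864000 seconds and 15 days is 1296000 seconds, matching the output above.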
I have the following code:
import pandas as pd
from pandas import datetime
from pandas import DataFrame as df
import matplotlib
from pandas_datareader import data as web
import matplotlib.pyplot as plt
import datetime
import fxcmpy
import numpy as np
symbols = con.get_instruments()
ticker = 'NGAS'
start = datetime.datetime(2015,1,1)
end = datetime.datetime.today()
data = con.get_candles(ticker, period='m1', number=10000)
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d %hh:%mm %s')
data['hour'] = data.index.hour
data['minute'] = data.index.minute
data produces the following :
bidopen bidclose bidhigh bidlow askopen askclose askhigh asklow tickqty hour minute
date
2019-12-05 07:00:00 2.4230 2.4280 2.4300 2.422 2.4305 2.4360 2.439 2.4295 47 7 0
2019-12-05 07:01:00 2.4280 2.4265 2.4270 2.426 2.4360 2.4340 2.436 2.4340 10 7 1
2019-12-05 07:02:00 2.4265 2.4295 2.4300 2.426 2.4340 2.4370 2.438 2.4340 35 7 2
2019-12-05 07:03:00 2.4295 2.4285 2.4300 2.428 2.4370 2.4360 2.438 2.4360 20 7 3
2019-12-05 07:04:00 2.4285 2.4350 2.4360 2.428 2.4360 2.4425 2.444 2.4360 50 7 4
... ... ... ... ... ... ... ... ... ... ... ...
2019-12-17 15:07:00 2.3335 2.3340 2.3345 2.332 2.3410 2.3415 2.342 2.3395 94 15 7
2019-12-17 15:08:00 2.3340 2.3345 2.3355 2.334 2.3415 2.3420 2.344 2.3415 22 15 8
2019-12-17 15:09:00 2.3345 2.3335 2.3345 2.332 2.3420 2.3410 2.342 2.3410 15 15 9
2019-12-17 15:10:00 2.3335 2.3325 2.3345 2.331 2.3410 2.3400 2.342 2.3390 72 15 10
2019-12-17 15:11:00 2.3325 2.3270 2.3325 2.326 2.3400 2.3345 2.340 2.3335 99 15 11
In the table above, hours start from 7 and end at 15. However, when I run the following code, the hour starts from 0 and ends at 59. Why is that?
df = data.groupby(['hour', 'minute']).mean()
bidopen bidclose bidhigh bidlow askopen askclose askhigh asklow tickqty
hour minute
0 0 2.302786 2.303500 2.304286 2.302071 2.310571 2.311214 2.312000 2.310143 16.285714
1 2.294917 2.294333 2.295250 2.293583 2.302667 2.302000 2.303333 2.301333 14.500000
2 2.283000 2.283333 2.283833 2.282333 2.290667 2.290833 2.292000 2.290167 18.666667
3 2.298417 2.298833 2.299167 2.297833 2.305917 2.306333 2.307000 2.305917 14.833333
4 2.283583 2.284000 2.284250 2.283000 2.291083 2.291750 2.292167 2.291083 14.166667
... ... ... ... ... ... ... ... ... ... ...
23 55 2.285500 2.285800 2.286600 2.284700 2.293100 2.293400 2.294300 2.292600 10.400000
56 2.303800 2.304000 2.304600 2.303300 2.311400 2.311700 2.312500 2.311000 11.200000
57 2.268700 2.268400 2.268900 2.268100 2.276200 2.276100 2.276700 2.275900 5.800000
58 2.302857 2.303000 2.303286 2.302357 2.310571 2.310571 2.311214 2.310286 8.000000
59 2.321300 2.321000 2.321700 2.320400 2.328900 2.328900 2.329500 2.328700 8.400000
What I am trying to do is group the data by hour, which starts from 7 and ends at 15, and then take the mean() of that: so the mean() of all of hour 7 to hour 15.
--
Edit 1:
How can I set hour and minute as the index?
data.set_index('minute', inplace = True)
data.set_index('hour', inplace = True)
gives me an error
Perhaps data.index = pd.to_datetime(data.index, format='%Y-%m-%d %hh:%mm %s') should be changed to data.index = pd.to_datetime(data.index, format='%Y-%m-%d %H:%M:%S'), for hours, minutes and seconds!
The results you are seeing are correct:
The date of the first line is the 5th of December, the date of the last line is the 17th of December, and so there are many lines in between where the hour of the day is after 3pm or before 7am.
Try df[df['hour']>15].head() to see some of the lines which are later in the day than 3pm
Updated: to get the mean for hours 7 to 15, first see the example code below
df = pd.DataFrame()
df['hour']=np.array([15,12,10,6,4,19,15,12,10])
df['price']=np.array([1,2,3,4,5,6,7,8,9])
df[(df['hour']>=7)&(df['hour']<=15)].mean().price
which returns
5.0
or for mean by hour
df[(df['hour']>=7)&(df['hour']<=15)].groupby('hour').mean()
which returns
price
hour
10 6
12 5
15 4
First of all, what you're seeing is a multi-index. You're seeing hours ranging from 0 to 23 and minutes ranging from 0 to 59.
If you'd like the mean for each hour, you simply need:
data.groupby(['hour']).mean().
If you do choose to group by an additional quantity such as in data.groupby(['hour','minute']).mean() it may be helpful to call a .reset_index() to avoid the confusion of the multi-index.
(e.g. df = data.groupby(['hour','minute']).mean().reset_index())
%hh:%mm %s isn't supported in Python's datetime format codes. Instead of:
data.index = pd.to_datetime(data.index, format='%Y-%m-%d %hh:%mm %s')
use:
data.index = pd.to_datetime(data.index, format='%Y-%m-%d %H:%M:%S')
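A quick check of the corrected format string against an index shaped like the one in the question (note the data has a colon before the seconds, so %H:%M:%S is the matching pattern):

```python
import pandas as pd

idx = pd.Index(['2019-12-05 07:00:00', '2019-12-05 07:01:00'])
parsed = pd.to_datetime(idx, format='%Y-%m-%d %H:%M:%S')
print(parsed.hour.tolist(), parsed.minute.tolist())  # [7, 7] [0, 1]
```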