I'm trying to add the row-wise result from a function into my dataframe using
df.set_Value.
df in the format :
Count DTW
DateTime
2015-01-16 10 0
2015-01-17 28 0
Using df.setValue
dw.set_Value(idx, 'col', dtw) # idx and dtw are int values
TypeError: cannot insert DatetimeIndex with incompatible label
How do I solve this error or what alternative method with comparable efficiency is there?
I think you have Series, not DataFrame, so use Series.set_value with index converted to datetime
dw = pd.Series([-2374], index = [pd.to_datetime('2015-01-18')])
dw.index.name = 'DateTime'
print (dw)
DateTime
2015-01-18 -2374
dtype: int64
print (dw.set_value(pd.to_datetime('2015-01-19'), 1))
DateTime
2015-01-18 -2374
2015-01-19 1
dtype: int64
print (dw.set_value(pd.datetime(2015, 1, 19), 1))
DateTime
2015-01-18 -2374
2015-01-19 1
dtype: int64
More standard way is use ix or iloc:
print (dw)
Count DTW
DateTime
2015-01-16 10 0
2015-01-17 28 0
dw.ix[1, 'DTW'] = 10
#dw.DTW.iloc[1] = 10
print (dw)
Count DTW
DateTime
2015-01-16 10 0
2015-01-17 28 10
Related
I created a pandas df with columns named start_date and current_date. Both columns have a dtype of datetime64[ns]. What's the best way to find the quantity of business days between the current_date and start_date column?
I've tried:
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
projects_df['start_date'] = pd.to_datetime(projects_df['start_date'])
projects_df['current_date'] = pd.to_datetime(projects_df['current_date'])
projects_df['days_count'] = len(pd.date_range(start=projects_df['start_date'], end=projects_df['current_date'], freq=us_bd))
I get the following error message:
Cannot convert input....start_date, dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp
I'm using Python version 3.10.4.
pd.date_range's parameters need to be datetimes, not series.
For this reason, we can use df.apply to apply the function to each row.
In addition, pandas has bdate_range which is just date_range with freq defaulting to business days, which is exactly what you need.
Using apply and a lambda function, we can create a new Series calculating business days between each start and current date for each row.
projects_df['start_date'] = pd.to_datetime(projects_df['start_date'])
projects_df['current_date'] = pd.to_datetime(projects_df['current_date'])
projects_df['days_count'] = projects_df.apply(lambda row: len(pd.bdate_range(row['start_date'], row['current_date'])), axis=1)
Using a random sample of 10 date pairs, my output is the following:
start_date current_date bdays
0 2022-01-03 17:08:04 2022-05-20 00:53:46 100
1 2022-04-18 09:43:02 2022-06-10 16:56:16 40
2 2022-09-01 12:02:34 2022-09-25 14:59:29 17
3 2022-04-02 14:24:12 2022-04-24 21:05:55 15
4 2022-01-31 02:15:46 2022-07-02 16:16:02 110
5 2022-08-02 22:05:15 2022-08-17 17:25:10 12
6 2022-03-06 05:30:20 2022-07-04 08:43:00 86
7 2022-01-15 17:01:33 2022-08-09 21:48:41 147
8 2022-06-04 14:47:53 2022-12-12 18:05:58 136
9 2022-02-16 11:52:03 2022-10-18 01:30:58 175
I have a pandas timeline table containing dates objects and scores:
datetime score
2018-11-23 08:33:02 4
2018-11-24 09:43:30 2
2018-11-25 08:21:34 5
2018-11-26 19:33:01 4
2018-11-23 08:50:40 1
2018-11-23 09:03:10 3
I want to aggregate the score by hour without taking into consideration the date, the result desired is :
08:00:00 10
09:00:00 5
19:00:00 4
So basically I have to remove the date-month-year, and then group score by hour,
I tried this command
monthagg = df['score'].resample('H').sum().to_frame()
Which does work but takes into consideration the date-month-year, How to remove DD-MM-YYYY and aggregate by Hour?
One possible solution is use DatetimeIndex.floor for set minutes and seconds to 0 and then convert DatetimeIndex to strings by DatetimeIndex.strftime, then aggregate sum:
a = df['score'].groupby(df.index.floor('H').strftime('%H:%M:%S')).sum()
#if column datetime
#a = df['score'].groupby(df['datetime'].dt.floor('H').dt.strftime('%H:%M:%S')).sum()
print (a)
08:00:00 10
09:00:00 5
19:00:00 4
Name: score, dtype: int64
Or use DatetimeIndex.hour and aggregate sum:
a = df.groupby(df.index.hour)['score'].sum()
#if column datetime
#a = df.groupby(df['datetime'].dt.hour)['score'].sum()
print (a)
datetime
8 10
9 5
19 4
Name: score, dtype: int64
Setup to generate a frame with datetime objects:
import datetime
import pandas as pd
rows = [datetime.datetime.now() + datetime.timedelta(hours=i) for i in range(100)]
df = pd.DataFrame(rows,columns = ["date"])
You can now add a hour-column like this, and then group by it:
df["hour"] = df["date"].dt.hour
df.groupby("hour").sum()
import pandas as pd
df = pd.DataFrame({'datetime':['2018-11-23 08:33:02 ','2018-11-24 09:43:30',
'2018-11-25 08:21:34',
'2018-11-26 19:33:01','2018-11-23 08:50:40',
'2018-11-23 09:03:10'],'score':[4,2,5,4,1,3]})
df['datetime']=pd.to_datetime(df['datetime'], errors='coerce')
df["hour"] = df["datetime"].dt.hour
df.groupby("hour").sum()
Output:
8 10
9 5
19 4
I have tried to calculate the number of business days between two date (stored in separate columns in a dataframe ).
MonthBegin MonthEnd
0 2014-06-09 2014-06-30
1 2014-07-01 2014-07-31
2 2014-08-01 2014-08-31
3 2014-09-01 2014-09-30
4 2014-10-01 2014-10-31
I have tried to apply numpy.busday_count but I get the following error:
Iterator operand 0 dtype could not be cast from dtype('<M8[ns]') to dtype('<M8[D]') according to the rule 'safe'
I have tried to change the type into Timestamp as the following :
Timestamp('2014-08-31 00:00:00')
or datetime :
datetime.date(2014, 8, 31)
or to numpy.datetime64:
numpy.datetime64('2014-06-30T00:00:00.000000000')
Anyone knows how to fix it?
Note 1: I have passed tried np.busday_count in two way :
1. Passing dataframe columns, t['Days']=np.busday_count(t.MonthBegin,t.MonthEnd)
Passing arrays np.busday_count(dt1,dt2)
Note2: My dataframe has over 150K rows so I need to use an efficient algorithm
You can using bdate_range, also I corrected your input , since the most of MonthEnd is early than the MonthBegin
[len(pd.bdate_range(x,y))for x,y in zip(df['MonthBegin'],df['MonthEnd'])]
Out[519]: [16, 21, 22, 23, 20]
I think the best way to do is
df.apply(lambda row : np.busday_count(row['MBegin'],row['MEnd']),axis=1)
For my dataframe df as below:
MBegin MEnd
0 2011-01-01 2011-02-01
1 2011-01-10 2011-02-10
2 2011-01-02 2011-02-02
doing :
df['MBegin'] = df['MBegin'].values.astype('datetime64[D]')
df['MEnd'] = df['MEnd'].values.astype('datetime64[D]')
df['busday'] = df.apply(lambda row : np.busday_count(row['MBegin'],row['MEnd']),axis=1)
>>df
MBegin MEnd busday
0 2011-01-01 2011-02-01 21
1 2011-01-10 2011-02-10 23
2 2011-01-02 2011-02-02 22
You need to provide the template in which your dates are written.
a = datetime.strptime('2014-06-9', '%Y-%m-%d')
Calculate this for your
b = datetime.strptime('2014-06-30', '%Y-%m-%d')
Now their difference
c = b-a
c.days
which gives you the difference 21 days, You can now use list comprehension to get the difference between two dates as days.
will give you datetime.timedelta(21), to convert it into days, just use
You can modify your code to get the desired result as below:
df = pd.DataFrame({'MonthBegin': ['2014-06-09', '2014-08-01', '2014-09-01', '2014-10-01', '2014-11-01'],
'MonthEnd': ['2014-06-30', '2014-08-31', '2014-09-30', '2014-10-31', '2014-11-30']})
df['MonthBegin'] = df['MonthBegin'].astype('datetime64[ns]')
df['MonthEnd'] = df['MonthEnd'].astype('datetime64[ns]')
df['BDays'] = np.busday_count(df['MonthBegin'].tolist(), df['MonthEnd'].tolist())
print(df)
MonthBegin MonthEnd BDays
0 2014-06-09 2014-06-30 15
1 2014-08-01 2014-08-31 21
2 2014-09-01 2014-09-30 21
3 2014-10-01 2014-10-31 22
4 2014-11-01 2014-11-30 20
Additionally numpy.busday_count has few other optional arguments like weekmask, holidays ... which you can use according to your need.
I have a column I_DATE of type string(object) in a dataframe called train as show below.
I_DATE
28-03-2012 2:15:00 PM
28-03-2012 2:17:28 PM
28-03-2012 2:50:50 PM
How to convert I_DATE from string to datetime format & specify the format of input string.
Also, how to filter rows based on a range of dates in pandas?
Use to_datetime. There is no need for a format string since the parser is able to handle it:
In [51]:
pd.to_datetime(df['I_DATE'])
Out[51]:
0 2012-03-28 14:15:00
1 2012-03-28 14:17:28
2 2012-03-28 14:50:50
Name: I_DATE, dtype: datetime64[ns]
To access the date/day/time component use the dt accessor:
In [54]:
df['I_DATE'].dt.date
Out[54]:
0 2012-03-28
1 2012-03-28
2 2012-03-28
dtype: object
In [56]:
df['I_DATE'].dt.time
Out[56]:
0 14:15:00
1 14:17:28
2 14:50:50
dtype: object
You can use strings to filter as an example:
In [59]:
df = pd.DataFrame({'date':pd.date_range(start = dt.datetime(2015,1,1), end = dt.datetime.now())})
df[(df['date'] > '2015-02-04') & (df['date'] < '2015-02-10')]
Out[59]:
date
35 2015-02-05
36 2015-02-06
37 2015-02-07
38 2015-02-08
39 2015-02-09
Approach: 1
Given original string format: 2019/03/04 00:08:48
you can use
updated_df = df['timestamp'].astype('datetime64[ns]')
The result will be in this datetime format: 2019-03-04 00:08:48
Approach: 2
updated_df = df.astype({'timestamp':'datetime64[ns]'})
For a datetime in AM/PM format, the time format is '%I:%M:%S %p'. See all possible format combinations at https://strftime.org/. N.B. If you have time component as in the OP, the conversion will be done much, much faster if you pass the format= (see here for more info).
df['I_DATE'] = pd.to_datetime(df['I_DATE'], format='%d-%m-%Y %I:%M:%S %p')
To filter a datetime using a range, you can use query:
df = pd.DataFrame({'date': pd.date_range('2015-01-01', '2015-04-01')})
df.query("'2015-02-04' < date < '2015-02-10'")
or use between to create a mask and filter.
df[df['date'].between('2015-02-04', '2015-02-10')]
I have a dataframe in pandas called 'munged_data' with two columns 'entry_date' and 'dob' which i have converted to Timestamps using pd.to_timestamp.I am trying to figure out how to calculate ages of people based on the time difference between 'entry_date' and 'dob' and to do this i need to get the difference in days between the two columns ( so that i can then do somehting like round(days/365.25). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date-munged_data.dob i get the following :
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However i do not seem to be able to extract the days as an integer so that i can continue with my calculation.
Any help appreciated.
Using the Pandas type Timedelta available since v0.15.0 you also can do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
You need 0.11 for this (0.11rc1 is out, final prob next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns], coming in 0.12)
Not sure if you still need it, but in Pandas 0.14 i usually use .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
To convert any type of data into days just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494