Is it possible to set a date when converting with pandas.to_datetime? - python

I have an array that looks like this:
array([(b'03:35:05.397191'),
(b'03:35:06.184700'),
(b'03:35:08.642503'), ...,
(b'05:47:15.285806'),
(b'05:47:20.189460'),
(b'05:47:30.598514')],
dtype=[('Date', 'S15')])
I want to convert it into a dataframe, using to_datetime. I could do that by simply doing this:
df = pd.DataFrame( array )
df['Date'] = pd.to_datetime( df['Date'].str.decode("utf-8") )
>>> df.Date
0 2018-03-07 03:35:05.397191
1 2018-03-07 03:35:06.184700
2 2018-03-07 03:35:08.642503
3 2018-03-07 03:35:09.155030
4 2018-03-07 03:35:09.300029
5 2018-03-07 03:35:09.303031
The problem is that it automatically sets the date as today. Is it possible to set the date as a different day, for example, 2015-01-25?

Instead of using pd.to_datetime, use pd.to_timedelta and add a date.
pd.to_timedelta(df.Date.str.decode("utf-8")) + pd.to_datetime('2017-03-15')
0 2017-03-15 03:35:05.397191
1 2017-03-15 03:35:06.184700
2 2017-03-15 03:35:08.642503
3 2017-03-15 05:47:15.285806
4 2017-03-15 05:47:20.189460
5 2017-03-15 05:47:30.598514
Name: Date, dtype: datetime64[ns]
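For reference, here is a minimal self-contained sketch of the same idea. The two byte strings and the names `base_date` and `df` are stand-ins for the question's array, not the original data; the approach is simply "parse as a duration since midnight, then add the date you want".

```python
import pandas as pd

# Stand-in for the question's data: time-of-day values stored as bytes.
df = pd.DataFrame({'Date': [b'03:35:05.397191', b'05:47:30.598514']})

# The date we want every row to carry (assumed for this example).
base_date = pd.Timestamp('2015-01-25')

# Decode bytes to str, parse as timedeltas, then add the base date.
df['Date'] = pd.to_timedelta(df['Date'].str.decode('utf-8')) + base_date

print(df['Date'])
```

Because the times are parsed as timedeltas, the result never depends on the day the code is run.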

Try this:
df['Date'] = pd.to_datetime( df['Date'].str.decode("utf-8") ).apply(lambda x: x.replace(year=2015, month=1, day=25))
Incorporating @Wen's solution for correctness :)

You could create a string with the complete date-time and parse it, like:
df = pd.DataFrame( array )
df['Date'] = pd.to_datetime( '20150125 ' + df['Date'].str.decode("utf-8") )
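A small sketch of the string-prefix approach with an explicit format, assuming the times are already decoded to str. The names `times`, `target_date` and `stamped` are made up for the example; passing `format` is optional but avoids any guessing about the layout.

```python
import pandas as pd

# Assumed sample: time-of-day strings, already decoded from bytes.
times = pd.Series(['03:35:05.397191', '05:47:30.598514'])

# Prepend the wanted date, then parse the combined string in one go.
target_date = '2015-01-25'
stamped = pd.to_datetime(target_date + ' ' + times,
                         format='%Y-%m-%d %H:%M:%S.%f')

print(stamped)
```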

Hmm, this seems to work :-) (normalize() drops the time of day from 'today', so only the date part is shifted)
pd.to_datetime(df['Date'].str.decode("utf-8")) - (pd.to_datetime('today').normalize() - pd.to_datetime('2015-01-25'))
Out[376]:
0 2015-01-25 03:35:05.397191
1 2015-01-25 03:35:06.184700
2 2015-01-25 03:35:08.642503
3 2015-01-25 05:47:15.285806
4 2015-01-25 05:47:20.189460
5 2015-01-25 05:47:30.598514
Name: Date, dtype: datetime64[ns]

Related

How to homogenize date type in a pandas dataframe column?

I have a Date column in my dataframe having dates with 2 different types (YYYY-DD-MM 00:00:00 and YYYY-DD-MM) :
Date
0 2023-01-10 00:00:00
1 2024-27-06
2 2022-07-04 00:00:00
3 NaN
4 2020-30-06
(you can use pd.read_clipboard(sep=r'\s\s+') after copying the previous dataframe to get it in your notebook)
I would like to have only a YYYY-MM-DD type. Consequently, I would like to have :
Date
0 2023-10-01
1 2024-06-27
2 2022-04-07
3 NaN
4 2020-06-30
How can I do this?
Use Series.str.replace with to_datetime and format parameter:
df['Date'] = pd.to_datetime(df['Date'].str.replace(' 00:00:00',''), format='%Y-%d-%m')
print (df)
Date
0 2023-10-01
1 2024-06-27
2 2022-04-07
3 NaT
4 2020-06-30
Another idea is to match both formats:
d1 = pd.to_datetime(df['Date'], format='%Y-%d-%m', errors='coerce')
d2 = pd.to_datetime(df['Date'], format='%Y-%d-%m 00:00:00', errors='coerce')
df['Date'] = d1.fillna(d2)
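A third variant, as a rough sketch: since both layouts start with the same ten characters, you can slice those off and parse them with a single format. The DataFrame below just rebuilds the question's sample; slicing with `.str[:10]` is my own shortcut, not something from the answers above.

```python
import numpy as np
import pandas as pd

# Rebuild the sample column from the question (year-day-month order).
df = pd.DataFrame({'Date': ['2023-01-10 00:00:00', '2024-27-06',
                            '2022-07-04 00:00:00', np.nan, '2020-30-06']})

# Keep only the leading 'YYYY-DD-MM' part, then parse day before month.
# Missing values stay missing and come out as NaT.
df['Date'] = pd.to_datetime(df['Date'].str[:10], format='%Y-%d-%m')

print(df)
```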

Subtracting timedelta in pandas

I have a dataframe with two columns (date and days).
df = pd.DataFrame({'date': ['2020-01-31', '2020-01-21', '2020-01-11'], 'days': [1, 2, 3]})
I want to add a third column (date_2) that subtracts the number of days from the date. Therefore, date_2 would be [2020-01-30, 2020-01-19, 2020-01-08].
I know timedelta(days = i) but I cannot give it the content of df['days'] as i in pandas.
Use to_timedelta with unit='d' and subtract:
>>> pd.to_datetime(df['date']) - pd.to_timedelta(df['days'], unit='d')
0 2020-01-30
1 2020-01-19
2 2020-01-08
dtype: datetime64[ns]
Use to_datetime for datetimes and subtract by Series.sub with timedeltas created by to_timedelta:
df['new'] = pd.to_datetime(df['date']).sub(pd.to_timedelta(df['days'], unit='d'))
print (df)
date days new
0 2020-01-31 1 2020-01-30
1 2020-01-21 2 2020-01-19
2 2020-01-11 3 2020-01-08
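Put together as a runnable sketch, assuming the dates start out as strings (the column name `date_2` is taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2020-01-31', '2020-01-21', '2020-01-11'],
                   'days': [1, 2, 3]})

# Parse the strings once so the column has a real datetime dtype.
df['date'] = pd.to_datetime(df['date'])

# Turn the integer column into day-long timedeltas and subtract row by row.
df['date_2'] = df['date'] - pd.to_timedelta(df['days'], unit='D')

print(df)
```

The subtraction is vectorized, so there is no need to loop or call timedelta(days=i) for each row.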

Pandas datetimes with different formats in the same column

I have a pandas data frame which has datetimes with 2 different formats e.g.:
3/14/2019 5:15:32 AM
2019-08-03 05:15:35
2019-01-03 05:15:33
2019-01-03 05:15:33
2/28/2019 5:15:31 AM
2/27/2019 11:18:39 AM
...
I have tried various formats but get errors like ValueError: unconverted data remains: AM
I would like to get the format as 2019-02-28 and have the time removed
You can use pd.to_datetime().dt.strftime() to efficiently convert the entire column to datetime objects and then to strings, with pandas guessing the date format:
df = pd.Series('''3/14/2019 5:15:32 AM
2019-08-03 05:15:35
2019-01-03 05:15:33
2019-01-03 05:15:33
2/28/2019 5:15:31 AM
2/27/2019 11:18:39 AM'''.split('\n'), name='date', dtype=str).to_frame()
print(pd.to_datetime(df.date).dt.strftime('%Y-%m-%d'))
0 2019-03-14
1 2019-08-03
2 2019-01-03
3 2019-01-03
4 2019-02-28
5 2019-02-27
Name: date, dtype: object
If that doesn't give you what you want, you will need to identify the different kinds of formats and apply different settings when you convert them to datetime objects:
# Classify date column by format type
df['format'] = 1
df.loc[df.date.str.contains('/'), 'format'] = 2
df['new_date'] = pd.to_datetime(df.date)
# Convert to datetime with two different format settings
df.loc[df.format == 1, 'new_date'] = pd.to_datetime(df.loc[df.format == 1, 'date'], format = '%Y-%d-%m %H:%M:%S').dt.strftime('%Y-%m-%d')
df.loc[df.format == 2, 'new_date'] = pd.to_datetime(df.loc[df.format == 2, 'date'], format = '%m/%d/%Y %I:%M:%S %p').dt.strftime('%Y-%m-%d')
print(df)
date format new_date
0 3/14/2019 5:15:32 AM 2 2019-03-14
1 2019-08-03 05:15:35 1 2019-03-08
2 2019-01-03 05:15:33 1 2019-03-01
3 2019-01-03 05:15:33 1 2019-03-01
4 2/28/2019 5:15:31 AM 2 2019-02-28
5 2/27/2019 11:18:39 AM 2 2019-02-27
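If you would rather not rely on format guessing at all, one option is to parse the column once per known format with errors='coerce' and then combine the results. This is only a sketch: the names `us_style`, `iso_style` and `parsed` are made up, and it assumes the dashed rows are year-month-day (unlike the year-day-month reading used just above).

```python
import pandas as pd

s = pd.Series(['3/14/2019 5:15:32 AM',
               '2019-08-03 05:15:35',
               '2/27/2019 11:18:39 AM'], name='date')

# Each pass parses only the rows matching its format; the rest become NaT.
# %I (12-hour clock) is used together with %p so AM/PM is honoured.
us_style = pd.to_datetime(s, format='%m/%d/%Y %I:%M:%S %p', errors='coerce')
iso_style = pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Fill the gaps of one pass with the values from the other.
parsed = us_style.fillna(iso_style)

# Drop the time by formatting as a date-only string.
dates_only = parsed.dt.strftime('%Y-%m-%d')
print(dates_only)
```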
Assume that the column name in your DataFrame is DatStr.
The key to success is a proper conversion function, to be
applied to each date string:
def datCnv(src):
    return pd.to_datetime(src)
Then all you should do to create a true date column is to call:
df['Dat'] = df.DatStr.apply(datCnv)
When you print the DataFrame, the result is:
DatStr Dat
0 3/14/2019 5:15:32 AM 2019-03-14 05:15:32
1 2019-08-03 05:15:35 2019-08-03 05:15:35
2 2019-01-03 05:15:33 2019-01-03 05:15:33
3 2019-01-03 05:15:33 2019-01-03 05:15:33
4 2/28/2019 5:15:31 AM 2019-02-28 05:15:31
5 2/27/2019 11:18:39 AM 2019-02-27 11:18:39
Note that the to_datetime function is clever enough to recognize the
actual date format used in each case.
I had a similar issue. Luckily for me, the different format occurred every other row, so I could easily slice with .iloc. However, you could also slice the Series with .loc and a filter (detecting each format).
Then you can combine the rows with pd.concat. The order will be the same as for the rest of the DataFrame (if you assign it). If there are missing indices they will become NaN, if there are duplicated indices pandas will raise an error.
df["datetime"] = pd.concat([
pd.to_datetime(df["Time"].str.slice(1).iloc[1::2], format="%y-%m-%d %H:%M:%S.%f"),
pd.to_datetime(df["Time"].str.slice(1).iloc[::2], format="%y-%m-%d %H:%M:%S"),
])
I think it is a little late for an answer, but I discovered a simpler way to do the same:
df["date"] = pd.to_datetime(df["date"], format='%Y-%d-%m %H:%M:%S', errors='ignore').astype('datetime64[D]')
df["date"] = pd.to_datetime(df["date"], format='%m/%d/%Y %H:%M:%S %p', errors='ignore').astype('datetime64[D]')

Count business day between using pandas columns

I have tried to calculate the number of business days between two dates (stored in separate columns in a dataframe).
MonthBegin MonthEnd
0 2014-06-09 2014-06-30
1 2014-07-01 2014-07-31
2 2014-08-01 2014-08-31
3 2014-09-01 2014-09-30
4 2014-10-01 2014-10-31
I have tried to apply numpy.busday_count but I get the following error:
Iterator operand 0 dtype could not be cast from dtype('<M8[ns]') to dtype('<M8[D]') according to the rule 'safe'
I have tried to change the type into Timestamp as the following :
Timestamp('2014-08-31 00:00:00')
or datetime :
datetime.date(2014, 8, 31)
or to numpy.datetime64:
numpy.datetime64('2014-06-30T00:00:00.000000000')
Does anyone know how to fix it?
Note 1: I have tried np.busday_count in two ways:
1. Passing dataframe columns, t['Days']=np.busday_count(t.MonthBegin,t.MonthEnd)
2. Passing arrays, np.busday_count(dt1,dt2)
Note 2: My dataframe has over 150K rows, so I need to use an efficient algorithm
You can use bdate_range. I also corrected your input, since most of the MonthEnd values were earlier than MonthBegin.
[len(pd.bdate_range(x, y)) for x, y in zip(df['MonthBegin'], df['MonthEnd'])]
Out[519]: [16, 21, 22, 23, 20]
I think the best way to do it is
df.apply(lambda row : np.busday_count(row['MBegin'],row['MEnd']),axis=1)
For my dataframe df as below:
MBegin MEnd
0 2011-01-01 2011-02-01
1 2011-01-10 2011-02-10
2 2011-01-02 2011-02-02
doing :
df['MBegin'] = df['MBegin'].values.astype('datetime64[D]')
df['MEnd'] = df['MEnd'].values.astype('datetime64[D]')
df['busday'] = df.apply(lambda row : np.busday_count(row['MBegin'],row['MEnd']),axis=1)
>>> df
MBegin MEnd busday
0 2011-01-01 2011-02-01 21
1 2011-01-10 2011-02-10 23
2 2011-01-02 2011-02-02 22
You need to provide the template in which your dates are written.
from datetime import datetime
a = datetime.strptime('2014-06-9', '%Y-%m-%d')
Do the same for your second date:
b = datetime.strptime('2014-06-30', '%Y-%m-%d')
Now their difference
c = b - a
will give you datetime.timedelta(21). To convert it into days, just use
c.days
which gives you the difference, 21 days. You can now use a list comprehension to get the difference between each pair of dates as days.
You can modify your code to get the desired result as below:
df = pd.DataFrame({'MonthBegin': ['2014-06-09', '2014-08-01', '2014-09-01', '2014-10-01', '2014-11-01'],
'MonthEnd': ['2014-06-30', '2014-08-31', '2014-09-30', '2014-10-31', '2014-11-30']})
df['MonthBegin'] = df['MonthBegin'].astype('datetime64[ns]')
df['MonthEnd'] = df['MonthEnd'].astype('datetime64[ns]')
df['BDays'] = np.busday_count(df['MonthBegin'].tolist(), df['MonthEnd'].tolist())
print(df)
MonthBegin MonthEnd BDays
0 2014-06-09 2014-06-30 15
1 2014-08-01 2014-08-31 21
2 2014-09-01 2014-09-30 21
3 2014-10-01 2014-10-31 22
4 2014-11-01 2014-11-30 20
Additionally, numpy.busday_count has a few other optional arguments, like weekmask and holidays, which you can use according to your needs.
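For a large frame, it may be worth trying a fully vectorized call: cast both columns to day precision first (this is what the "could not be cast ... to dtype('<M8[D]')" error is asking for) and pass the arrays straight to np.busday_count. A minimal sketch with two assumed rows, where `begin` and `end` are just local names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'MonthBegin': pd.to_datetime(['2014-06-09', '2014-07-01']),
                   'MonthEnd': pd.to_datetime(['2014-06-30', '2014-07-31'])})

# busday_count wants day precision, so cast the underlying arrays explicitly.
begin = df['MonthBegin'].values.astype('datetime64[D]')
end = df['MonthEnd'].values.astype('datetime64[D]')

# One call handles every row; the end date itself is not counted.
df['BDays'] = np.busday_count(begin, end)

print(df)
```

Note that np.busday_count excludes the end date, while len(pd.bdate_range(x, y)) includes it, which is why the two approaches can differ by one.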

Extracting YYYY-MM from datetime column

I've a dataframe of this format -
var1 date
A 2017/01/01
A 2017/01/02
...
I want the date to be converted into YYYY-MM format but the df['date'].dtype is object.
How can I remove the day part from date while keeping the data type as datetime?
Expected Output -
A - 2017/01
Thanks
You can't have custom representation for the datetime dtype. But you have the following options:
use strings - you might have any representation (as you wish), but all datetime methods and attributes get lost
use datetime, but set the day part to 1 (as @Kopytok has already shown).
use period dtype, which still allows you to use some date arithmetic
Demo:
In [207]: df
Out[207]:
var1 date
0 A 2018-12-31
1 A 2017-09-07
2 B 2016-02-29
In [208]: df['new'] = df['date'].dt.to_period('M')
In [209]: df
Out[209]:
var1 date new
0 A 2018-12-31 2018-12
1 A 2017-09-07 2017-09
2 B 2016-02-29 2016-02
In [210]: df.dtypes
Out[210]:
var1 object
date datetime64[ns]
new object
dtype: object
In [211]: df['new'] + 8
Out[211]:
0 2019-08
1 2018-05
2 2016-10
Name: new, dtype: object
It is possible to replace every date with the first day of the month:
pd.to_datetime(d["date"], format="%Y/%m/%d").apply(lambda x: x.replace(day=1))
Result:
0 2017-01-01
1 2017-01-01
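To compare the options side by side, here is a short sketch built on the question's sample rows. The column names `month_period`, `month_label` and `month_start` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['A', 'A'],
                   'date': ['2017/01/01', '2017/01/02']})

# The column arrives as object (strings), so parse it first.
df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d')

# Option 1: period values, which print as YYYY-MM and still support arithmetic.
df['month_period'] = df['date'].dt.to_period('M')

# Option 2: plain strings, free-form but no longer date-like.
df['month_label'] = df['date'].dt.strftime('%Y-%m')

# Option 3: keep a datetime dtype, with every day snapped to the 1st.
df['month_start'] = df['date'].dt.to_period('M').dt.to_timestamp()

print(df)
```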
