Extracting YYYY-MM from datetime column - python

I've a dataframe of this format -
var1 date
A 2017/01/01
A 2017/01/02
...
I want the date to be converted into YYYY-MM format but the df['date'].dtype is object.
How can I remove the day part from date while keeping the data type as datetime?
Expected Output -
A - 2017/01
Thanks

You can't have a custom representation for the datetime dtype, but you have the following options:
use strings: you can have any representation you wish, but all datetime methods and attributes are lost
use datetime, but set the day part to 1 (as @Kopytok has already shown)
use the period dtype, which still allows you to use some date arithmetic
Demo:
In [207]: df
Out[207]:
var1 date
0 A 2018-12-31
1 A 2017-09-07
2 B 2016-02-29
In [208]: df['new'] = df['date'].dt.to_period('M')
In [209]: df
Out[209]:
var1 date new
0 A 2018-12-31 2018-12
1 A 2017-09-07 2017-09
2 B 2016-02-29 2016-02
In [210]: df.dtypes
Out[210]:
var1 object
date datetime64[ns]
new object
dtype: object
In [211]: df['new'] + 8
Out[211]:
0 2019-08
1 2018-05
2 2016-10
Name: new, dtype: object
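The string and period options can be compared side by side. This is a minimal sketch, assuming a small frame shaped like the one in the question; the variable names are illustrative only. Recent pandas versions report the period column's dtype as period[M] rather than object.

```python
import pandas as pd

# Sample data shaped like the question's frame (values are illustrative).
df = pd.DataFrame({"var1": ["A", "A"], "date": ["2017/01/01", "2017/01/02"]})
dates = pd.to_datetime(df["date"], format="%Y/%m/%d")

# Strings: any layout you like, but no datetime methods afterwards.
as_string = dates.dt.strftime("%Y-%m")

# Monthly periods: still support date arithmetic.
as_period = dates.dt.to_period("M")
eight_months_later = as_period + 8  # shifts each period by 8 months

print(as_string.tolist())
print(eight_months_later.astype(str).tolist())
```

The string column is the only one that literally prints as YYYY-MM while being plain text; the period column prints the same way but keeps month-aware arithmetic.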

It is possible to replace every date with the first day of the month:
pd.to_datetime(df["date"], format="%Y/%m/%d").apply(lambda x: x.replace(day=1))
Result:
0 2017-01-01
1 2017-01-01
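A vectorized variant of the same idea avoids the per-row lambda. This is only a sketch, assuming the dates parse with the format shown in the question; it goes through a monthly period and back, which is one of several ways to get the month start.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2017/01/01", "2017/01/02"]), format="%Y/%m/%d")

# Round-trip through a monthly period; to_timestamp() returns the period start,
# so every value lands on the first day of its month.
month_start = dates.dt.to_period("M").dt.to_timestamp()

print(month_start.dtype)     # still a datetime64 dtype
print(month_start.tolist())
```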


Convert (back and forth) UNIX timestamp to pandas.tslib.Timestamp and datetime for series

I am working with python 3.5.2, pandas 0.18.1 and sqlite3.
In my database, I have a column unix_time with INT for seconds since 1970. Ideally I want to read my dataframe from sqlite, and then create a time column corresponding to the datetime or pandas.tslib.Timestamp conversion of the unix_time column, which I would only use for some processing and then drop before saving the dataframe back.
The issue is that when parsing the unix_time column using :
df = pd.read_sql_query("SELECT * FROM test", con, parse_dates=['unix_time'])
I obtain pandas.tslib.Timestamp types which is fine for my processing, but then I have to recreate my original unix_time column using :
df['unix_time'][i] = (df['unix_time'][i] - datetime(1970,1,1)).total_seconds()
which is really 'dirty'
First question : Do you have a better way?
I thought about giving up the unix time format and only using the datetime format, but the to_datetime method from pandas in fact returns pandas.tslib.Timestamp ... And anyway, doing so would force me to iterate over all rows, which is a bad solution. (It is impossible to apply to_datetime to anything other than a view over a single cell of the dataframe.)
Second question : Is it possible to apply it on a series?
My last try was with directly using df['time'] = datetime.datetime.fromtimestamp(df['unix_time']) but surprisingly, it also returns pandas.tslib.Timestamp.
In the end, knowing that I can only save unix timestamps or datetimes, my only choices for the moment are :
parsing, but then having to convert them back to unix timestamps one by one;
or not parsing, but then having to convert them to pandas.tslib.Timestamp one by one.
It would be great if I could convert a whole series.
Last question : Is there a way to convert a series of unix timestamps to datetime (or at least pandas.tslib.Timestamp), or a series of pandas.tslib.Timestamp (or datetime) values to unix timestamps?
Thanks
EDIT:
During my processing, I extract a row that I want to append to my dataset. Apparently, the conversion to pandas.tslib.Timestamp happens implicitly when passing from a dataframe to a series:
df = pd.DataFrame({'UNX':pd.date_range('2016-01-01', freq='9999S', periods=10).astype(np.int64)//10**9})
df['Date'] = pd.to_datetime(df.UNX, unit='s')
print(df.Date.dtypes)
print(type(df['Date'][0]))
test = df.iloc[0]
print(type(test.Date))
new_df = test.to_frame().transpose() #from here, impossible to do : new_df.to_sql("test", con) because the type for 'Date' is not supported
print(new_df.Date.dtypes)
returns
datetime64[ns]
<class 'pandas.tslib.Timestamp'>
<class 'pandas.tslib.Timestamp'>
object
Is there a way to convert the 'Date' in new_df from pandas.tslib.Timestamp to datetime64[ns] or datetime.datetime (or simply str) ?
IIUC you can do it this way:
In [96]: df = pd.DataFrame({'UNX':pd.date_range('2016-01-01', freq='9999S', periods=10).astype(np.int64)//10**9})
In [97]: df
Out[97]:
UNX
0 1451606400
1 1451616399
2 1451626398
3 1451636397
4 1451646396
5 1451656395
6 1451666394
7 1451676393
8 1451686392
9 1451696391
Convert UNIX epoch to Python datetime:
In [98]: df['Date'] = pd.to_datetime(df.UNX, unit='s')
In [99]: df
Out[99]:
UNX Date
0 1451606400 2016-01-01 00:00:00
1 1451616399 2016-01-01 02:46:39
2 1451626398 2016-01-01 05:33:18
3 1451636397 2016-01-01 08:19:57
4 1451646396 2016-01-01 11:06:36
5 1451656395 2016-01-01 13:53:15
6 1451666394 2016-01-01 16:39:54
7 1451676393 2016-01-01 19:26:33
8 1451686392 2016-01-01 22:13:12
9 1451696391 2016-01-02 00:59:51
Convert datetime to UNIX epoch:
In [100]: df['UNX2'] = df.Date.astype('int64')//10**9
In [101]: df
Out[101]:
UNX Date UNX2
0 1451606400 2016-01-01 00:00:00 1451606400
1 1451616399 2016-01-01 02:46:39 1451616399
2 1451626398 2016-01-01 05:33:18 1451626398
3 1451636397 2016-01-01 08:19:57 1451636397
4 1451646396 2016-01-01 11:06:36 1451646396
5 1451656395 2016-01-01 13:53:15 1451656395
6 1451666394 2016-01-01 16:39:54 1451666394
7 1451676393 2016-01-01 19:26:33 1451676393
8 1451686392 2016-01-01 22:13:12 1451686392
9 1451696391 2016-01-02 00:59:51 1451696391
Check:
In [102]: df.UNX.eq(df.UNX2).all()
Out[102]: True
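The reverse conversion can also be written without relying on the internal nanosecond representation. The following is a sketch under that assumption: it subtracts the epoch and floor-divides by one second, so it does not depend on the column's storage unit. The sample values are the first three rows from the output above.

```python
import pandas as pd

unix_seconds = pd.Series([1451606400, 1451616399, 1451626398])

# Seconds since the epoch -> datetime64 column, whole series at once.
dates = pd.to_datetime(unix_seconds, unit="s")

# datetime64 column -> seconds since the epoch, via a timedelta from 1970-01-01.
epoch = pd.Timestamp("1970-01-01")
back = (dates - epoch) // pd.Timedelta(seconds=1)

print(back.equals(unix_seconds))
```

Both directions operate on the whole series, so no row-by-row loop is needed.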
Round trip between Pandas Timestamp and Unix Seconds (since 1970-01-01):
date_in = pd.to_datetime("2021-04-07")
# type(date_in) is: pandas._libs.tslibs.timestamps.Timestamp
unix_seconds = date_in.value//10**9
date_out = pd.to_datetime(unix_seconds, unit="s")
Output:
date_in
Out[1]: Timestamp('2021-04-07 00:00:00')
unix_seconds
Out[2]: 1617753600
date_out
Out[3]: Timestamp('2021-04-07 00:00:00')
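For the single-row case in the question's EDIT, where the Date column ends up with dtype object, two possible approaches are sketched below. This assumes the frame only has the UNX and Date columns from the example; it is not taken from the answers above.

```python
import pandas as pd

df = pd.DataFrame({"UNX": [1451606400, 1451616399]})
df["Date"] = pd.to_datetime(df["UNX"], unit="s")

# Extracting one row as a Series mixes int and Timestamp values,
# so the Series (and the frame rebuilt from it) has dtype object.
row = df.iloc[0]
new_df = row.to_frame().transpose()

# Approach 1: convert the object columns back to proper dtypes.
new_df["Date"] = pd.to_datetime(new_df["Date"])
new_df["UNX"] = new_df["UNX"].astype("int64")

# Approach 2: select with a list so the result stays a DataFrame
# and each column keeps its original dtype.
kept = df.iloc[[0]]

print(new_df.dtypes)
print(kept.dtypes)
```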

Convert strings to date format

I have a dataframe with a column of strings indicating month and year (MM-YY), but I need it to be like YYYY,MM,DD, e.g. 2015,10,01
for i in df['End Date (MM-YY)']:
    print(i)
Mar-16
Nov-16
Jan-16
Jan-16
    print(type(i))
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
I think you can use to_datetime with parameter format:
df = pd.DataFrame({'End Date (MM-YY)': {0: 'Mar-16',
1: 'Nov-16',
2: 'Jan-16',
3: 'Jan-16'}})
print(df)
End Date (MM-YY)
0 Mar-16
1 Nov-16
2 Jan-16
3 Jan-16
print(pd.to_datetime(df['End Date (MM-YY)'], format='%b-%y'))
0 2016-03-01
1 2016-11-01
2 2016-01-01
3 2016-01-01
Name: End Date (MM-YY), dtype: datetime64[ns]
df['date'] = pd.to_datetime(df['End Date (MM-YY)'], format='%b-%y')
If you need to convert the date column to the last day of the month, use MonthEnd:
df['date-end-month'] = df['date'] + pd.offsets.MonthEnd()
print(df)
End Date (MM-YY) date date-end-month
0 Mar-16 2016-03-01 2016-03-31
1 Nov-16 2016-11-01 2016-11-30
2 Jan-16 2016-01-01 2016-01-31
3 Jan-16 2016-01-01 2016-01-31
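If the goal is literally the YYYY,MM,DD text from the question, the parsed column can be formatted back to strings. This is a small sketch, assuming English month abbreviations (the %b directive is locale-dependent); the as_text column name is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"End Date (MM-YY)": ["Mar-16", "Nov-16", "Jan-16", "Jan-16"]})

# Parse "Mar-16" style strings; the day defaults to the first of the month.
parsed = pd.to_datetime(df["End Date (MM-YY)"], format="%b-%y")

# Format back to text in the layout the question asks for.
df["as_text"] = parsed.dt.strftime("%Y,%m,%d")

print(df["as_text"].tolist())
```

The result is a column of plain strings, so keep the parsed datetime column around if date arithmetic is still needed.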
You can use lambda and map functions combined with to_datetime and its format parameter.
Can you provide more information on the data that you are using? I can refine my answer further based on that information. Thanks!
If you are trying to do what I think you are...
Use the datetime.datetime.strptime method! It's a wonderful way to specify the format you expect dates to show up in a string, and it returns a nice datetime obj for you to do with what you will.
You can even turn it back into a differently formatted string with datetime.datetime.strftime!
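A minimal standard-library sketch of that round trip, assuming a single string in the question's format and an English locale for the month abbreviation:

```python
from datetime import datetime

# Parse the "Mon-YY" text; missing fields default to the first day, midnight.
parsed = datetime.strptime("Mar-16", "%b-%y")

# Turn it back into a differently formatted string.
formatted = parsed.strftime("%Y,%m,%d")

print(parsed)
print(formatted)
```

This works on one value at a time; for a whole dataframe column, pd.to_datetime with the same format string does the parsing in one call.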

Is the year-month part of a datetime variable still a time object in Pandas?

Consider this:
df=pd.DataFrame({'A':['20150202','20150503','20150503'],'B':[3, 3, 1],'C':[1, 3, 1]})
df.A=pd.to_datetime(df.A)
df['month']=df.A.dt.to_period('M')
df
Out[59]:
A B C month
0 2015-02-02 3 1 2015-02
1 2015-05-03 3 3 2015-05
2 2015-05-03 1 1 2015-05
and my month variable is:
df.month
Out[82]:
0 2015-02
1 2015-05
2 2015-05
Name: month, dtype: object
Now if I index my dataset by df.month, it seems that Pandas understands this is a date. In other words, I can draw a plot without having to sort my index first.
But is this actually correct? The dtype object (instead of some datetime format) worries me. Is there a proper date object type for this kind of monthly date?
It is a pandas Period object:
In [5]: df.month.map(type)
Out[5]:
0 <class 'pandas._period.Period'>
1 <class 'pandas._period.Period'>
2 <class 'pandas._period.Period'>
Name: month, dtype: object
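A short sketch of what a Period element offers, reusing the data from the question. Recent pandas versions show the column dtype as period[M] instead of object; the explicit format argument is an addition here so that the parse does not depend on format inference.

```python
import pandas as pd

df = pd.DataFrame({"A": ["20150202", "20150503", "20150503"],
                   "B": [3, 3, 1], "C": [1, 3, 1]})
df["A"] = pd.to_datetime(df["A"], format="%Y%m%d")
df["month"] = df["A"].dt.to_period("M")

first = df["month"].iloc[0]

# Each element is a pandas Period with calendar-aware attributes.
print(isinstance(first, pd.Period))
print(first.year, first.month)
print(first.start_time)  # Timestamp at the start of the month

# Periods compare chronologically, so sorting and plotting by them is safe.
print(df["month"].is_monotonic_increasing)
```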

Pandas column date transformation

I have a pandas dataframe with a date column whose data type is datetime64[ns]. There are over 1000 observations in the dataframe. I want to transform the following column:
date
2013-05-01
2013-05-01
to
date
05/2013
05/2013
or
date
05-2013
05-2013
EDIT//
This is my sample code as of now:
test = pd.DataFrame({'a':['07/2017','07/2017',pd.NaT]})
a
0 2017-07-13
1 2017-07-13
2 NaT
test['a'].apply(lambda x: x if pd.isnull(x) == True else x.strftime('%Y-%m'))
0 2017-07-01
1 2017-07-01
2 NaT
Name: a, dtype: datetime64[ns]
Why did only the date change and not the format?
You can convert datetime64 into whatever string format you like using the strftime method. In your case you would apply it like this:
df.date = df.date[df.date.notnull()].map(lambda x: x.strftime('%m/%Y'))
df.date
Out[111]:
0 05/2013
1 05/2013
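The same formatting can be done with the vectorized .dt.strftime accessor, which leaves missing dates as missing values instead of requiring a filter first. A sketch, assuming a datetime64 column that contains one NaT:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2013-05-01", "2013-05-01", None]))

# Format every non-missing date as MM/YYYY; NaT becomes a missing value.
formatted = dates.dt.strftime("%m/%Y")

print(formatted.tolist())
```

The result holds strings, so it no longer supports datetime operations; keep the original column if those are still needed.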

Pandas Timedelta in Days

I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_timestamp. I am trying to figure out how to calculate the ages of people based on the time difference between 'entry_date' and 'dob'. To do this I need to get the difference in days between the two columns (so that I can then do something like round(days/365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date-munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the Pandas type Timedelta available since v0.15.0 you also can do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
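Applied to the age calculation from the question, this could look like the sketch below. The column names follow the question, the dates are sample values, and truncating to completed years (rather than rounding) is an assumption about what "age" should mean.

```python
import pandas as pd

people = pd.DataFrame({
    "dob": pd.to_datetime(["2001-01-01", "2004-06-01"]),
    "entry_date": pd.to_datetime(["2013-04-19", "2013-04-19"]),
})

# Vectorized difference in whole days between the two columns.
days = (people["entry_date"] - people["dob"]).dt.days

# Approximate age in completed years, using 365.25 days per year.
people["age"] = (days / 365.25).astype(int)

print(days.tolist())
print(people["age"].tolist())
```

Dividing by 365.25 is an approximation, so results can be off by one close to a birthday.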
You need pandas 0.11 for this (0.11rc1 is out; the final release is probably next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns]; that is coming in 0.12).
Not sure if you still need it, but in Pandas 0.14 I usually use the .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
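An alternative that does not depend on .ix or on casting to a year unit is to divide the timedelta column by a fixed-length Timedelta. This is a sketch, assuming 365.25 days per year is an acceptable approximation; it yields fractional years rather than the whole-year value shown above. The column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "start": pd.to_datetime(["2001-01-01"]),
    "end": pd.to_datetime(["2004-06-05"]),
})

delta = df["start"] - df["end"]           # timedelta64 column, here -1251 days

whole_days = delta.dt.days                # integer days
years = delta / pd.Timedelta(days=365.25) # float, approximate years

print(whole_days.tolist())
print(years.round(2).tolist())
```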
Suppose you have a pandas series named time_difference whose dtype is timedelta64[ns].
One way of extracting just the days (or whatever attribute you need) is the following:
just_day = time_difference.apply(lambda x: pd.Timedelta(x).days)
The conversion to pd.Timedelta is used because a raw numpy.timedelta64 object does not have a 'days' attribute.
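To convert a duration expressed in another fixed-length unit into days, use pd.Timedelta(...).days. Stick to fixed-length units such as weeks or hours: a year is not a fixed length (recent pandas versions reject unit='Y'), and a Timedelta can only hold about 292 years.
pd.Timedelta(52, unit='W').days
364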
