Python string parsing for extraction of date and time

The datetime is given in the format YYYY-MM-DD HH:MM:SS in a dataframe. I want new Series for year, month and hour, for which I am trying the code below.
But the problem is that Month and Hour are getting the same value; Year is fine.
Can anyone help me with this? I am using IPython notebook with pandas and NumPy.
Here is the code:
from datetime import datetime
def extract_hour(X):
    cnv = datetime.strptime(X, '%Y-%m-%d %H:%M:%S')
    return cnv.hour
def extract_month(X):
    cnv = datetime.strptime(X, '%Y-%m-%d %H:%M:%S')
    return cnv.month
def extract_year(X):
    cnv = datetime.strptime(X, '%Y-%m-%d %H:%M:%S')
    return cnv.year
#month column
train['Month']=train['datetime'].apply((lambda x: extract_month(x)))
test['Month']=test['datetime'].apply((lambda x: extract_month(x)))
#year column
train['Year']=train['datetime'].apply((lambda x: extract_year(x)))
test['Year']=test['datetime'].apply((lambda x: extract_year(x)))
#Hour column
train['Hour']=train['datetime'].apply((lambda x: extract_hour(x)))
test['Hour']=test['datetime'].apply((lambda x: extract_hour(x)))

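You can use the .dt accessors instead: train['datetime'].dt.month, train['datetime'].dt.year, train['datetime'].dt.hour (see the full list below). The column must have a datetime dtype; if it holds strings, convert it first with pd.to_datetime.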
Demo:
In [81]: train = pd.DataFrame(pd.date_range('2016-01-01', freq='1999H', periods=10), columns=['datetime'])
In [82]: train
Out[82]:
datetime
0 2016-01-01 00:00:00
1 2016-03-24 07:00:00
2 2016-06-15 14:00:00
3 2016-09-06 21:00:00
4 2016-11-29 04:00:00
5 2017-02-20 11:00:00
6 2017-05-14 18:00:00
7 2017-08-06 01:00:00
8 2017-10-28 08:00:00
9 2018-01-19 15:00:00
In [83]: train.datetime.dt.year
Out[83]:
0 2016
1 2016
2 2016
3 2016
4 2016
5 2017
6 2017
7 2017
8 2017
9 2018
Name: datetime, dtype: int64
In [84]: train.datetime.dt.month
Out[84]:
0 1
1 3
2 6
3 9
4 11
5 2
6 5
7 8
8 10
9 1
Name: datetime, dtype: int64
In [85]: train.datetime.dt.hour
Out[85]:
0 0
1 7
2 14
3 21
4 4
5 11
6 18
7 1
8 8
9 15
Name: datetime, dtype: int64
In [86]: train.datetime.dt.day
Out[86]:
0 1
1 24
2 15
3 6
4 29
5 20
6 14
7 6
8 28
9 19
Name: datetime, dtype: int64
List of all .dt accessors:
In [77]: train.datetime.dt.
train.datetime.dt.ceil train.datetime.dt.hour train.datetime.dt.month train.datetime.dt.to_pydatetime
train.datetime.dt.date train.datetime.dt.is_month_end train.datetime.dt.nanosecond train.datetime.dt.tz
train.datetime.dt.day train.datetime.dt.is_month_start train.datetime.dt.normalize train.datetime.dt.tz_convert
train.datetime.dt.dayofweek train.datetime.dt.is_quarter_end train.datetime.dt.quarter train.datetime.dt.tz_localize
train.datetime.dt.dayofyear train.datetime.dt.is_quarter_start train.datetime.dt.round train.datetime.dt.week
train.datetime.dt.days_in_month train.datetime.dt.is_year_end train.datetime.dt.second train.datetime.dt.weekday
train.datetime.dt.daysinmonth train.datetime.dt.is_year_start train.datetime.dt.strftime train.datetime.dt.weekday_name
train.datetime.dt.floor train.datetime.dt.microsecond train.datetime.dt.time train.datetime.dt.weekofyear
train.datetime.dt.freq train.datetime.dt.minute train.datetime.dt.to_period train.datetime.dt.year
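A note on the original question: the column there appears to hold strings (it is passed to strptime), so it would need converting before .dt can be used. A minimal sketch of that step, assuming the strings follow the format given in the question; the sample values are invented for illustration:

```python
import pandas as pd

# Hypothetical sample data in the question's 'YYYY-MM-DD HH:MM:SS' format.
train = pd.DataFrame({
    "datetime": [
        "2011-01-01 05:00:00",
        "2011-07-15 18:30:00",
        "2012-12-31 23:59:59",
    ]
})

# Convert the string column to a datetime dtype once.
train["datetime"] = pd.to_datetime(train["datetime"], format="%Y-%m-%d %H:%M:%S")

# Each component is then a vectorised attribute lookup.
train["Year"] = train["datetime"].dt.year
train["Month"] = train["datetime"].dt.month
train["Hour"] = train["datetime"].dt.hour

print(train)
```

The same three lines can be repeated for test; no apply or lambda is needed.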

Related

convert month of dates into sequence

I want to combine months across years into a single sequence. For example, I have a dataframe like this:
stuff_id date
1 2015-02-03
2 2015-03-03
3 2015-05-19
4 2015-10-13
5 2016-01-07
6 2016-03-20
I want to number the months of the dates sequentially. The desired output is:
stuff_id date month
1 2015-02-03 1
2 2015-03-03 2
3 2015-05-19 4
4 2015-10-13 9
5 2016-01-07 12
6 2016-03-20 14
This means Feb 2015 is the first month in the date list, and Jan 2016 is the 12th month counting from Feb 2015.
If your date column is a datetime (if it's not, cast it to one), you can use the .dt.month and .dt.year properties for this!
https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.month.html
Recast the column to datetime:
(text copied as described in the answer to "Pasting data into a pandas dataframe")
>>> df = pd.read_table(io.StringIO(s), sep=r"\s+") # s holds the table text copied from SO
>>> df["date"] = pd.to_datetime(df["date"])
>>> df
stuff_id date
0 1 2015-02-03
1 2 2015-03-03
2 3 2015-05-19
3 4 2015-10-13
4 5 2016-01-07
5 6 2016-03-20
>>> df.dtypes
stuff_id int64
date datetime64[ns]
dtype: object
Convert years and months to an absolute month count, then make it relative to the earliest month:
>>> months = df["date"].dt.year * 12 + df["date"].dt.month # series
>>> df["months"] = months - min(months) + 1
>>> df
stuff_id date months
0 1 2015-02-03 1
1 2 2015-03-03 2
2 3 2015-05-19 4
3 4 2015-10-13 9
4 5 2016-01-07 12
5 6 2016-03-20 14
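The transcript above relies on a string s that is not shown. A self-contained sketch of the same idea, with the question's table inlined (the names text and abs_month are my own, not from the answer):

```python
import io

import pandas as pd

# The question's table, inlined so the example runs on its own.
text = """stuff_id date
1 2015-02-03
2 2015-03-03
3 2015-05-19
4 2015-10-13
5 2016-01-07
6 2016-03-20
"""

df = pd.read_csv(io.StringIO(text), sep=r"\s+")
df["date"] = pd.to_datetime(df["date"])

# Absolute month count: every year contributes 12 months.
abs_month = df["date"].dt.year * 12 + df["date"].dt.month

# Shift so that the earliest month in the data is numbered 1.
df["month"] = abs_month - abs_month.min() + 1

print(df)
```

Multiplying the year by 12 is what lets the count carry across the year boundary, so Jan 2016 lands on 12 rather than wrapping back to 1.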

Merging year and week column to create datetime and sorting in python

Sample Data
Year Week_No Value
2015 52 3
2016 2 7
2015 51 5
2016 1 6
2015 50 4
Below is the code that I have tried
import datetime
d = "2015-50"
r = datetime.datetime.strptime(d + '-1', "%Y-%W-%w")
print(r)
2015-12-14 00:00:00
How do I go from here to creating a datetime column?
I think in pandas it is best to use to_datetime, appending '-1' as the day of the week (Monday):
df['datetime'] = pd.to_datetime(df.Year.astype(str) + '-' +
                                df.Week_No.astype(str) + '-1', format="%Y-%W-%w")
print (df)
Year Week_No Value datetime
0 2015 52 3 2015-12-28
1 2016 2 7 2016-01-11
2 2015 51 5 2015-12-21
3 2016 1 6 2016-01-04
4 2015 50 4 2015-12-14
You can try this:
df['datetime'] = df.apply(lambda x: datetime.datetime.strptime(str(x.Year) + '-' + str(x.Week_No) + '-1', "%Y-%W-%w"), axis=1)
output:
Year Week_No Value datetime
0 2015 52 3 2015-12-28
1 2016 2 7 2016-01-11
2 2015 51 5 2015-12-21
3 2016 1 6 2016-01-04
4 2015 50 4 2015-12-14
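The title also asks about sorting, which the answers above stop short of. A minimal sketch, assuming the sample data from the question and the to_datetime approach shown earlier; df_sorted is a name I have introduced:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Year": [2015, 2016, 2015, 2016, 2015],
    "Week_No": [52, 2, 51, 1, 50],
    "Value": [3, 7, 5, 6, 4],
})

# Build 'YYYY-WW-1' strings; the trailing '-1' picks Monday of each week.
week_strings = df["Year"].astype(str) + "-" + df["Week_No"].astype(str) + "-1"
df["datetime"] = pd.to_datetime(week_strings, format="%Y-%W-%w")

# Sort chronologically and renumber the rows.
df_sorted = df.sort_values("datetime").reset_index(drop=True)

print(df_sorted)
```

Note that %W counts weeks starting on Monday, with days before the first Monday of the year falling in week 0; if the data uses ISO week numbers, the result may be off by a week in some years.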

Create Datetime Column from Integer Column - Hours and Minutes

I have a dataframe with a 4 digit int column:
df['time'].head(10)
0 1844
1 2151
2 1341
3 2252
4 2252
5 1216
6 2334
7 2247
8 2237
9 1651
Name: DepTime, dtype: int64
I have verified that the max is 2400 and the min is 1. I would like to convert this to a datetime column with hours and minutes. How would I do that?
If these are 4-digit HHMM values, a timedelta is more appropriate than a datetime:
pd.to_timedelta(df['time']//100 * 60 + df['time'] % 100, unit='m')
Output:
0 18:44:00
1 21:51:00
2 13:41:00
3 22:52:00
4 22:52:00
5 12:16:00
6 23:34:00
7 22:47:00
8 22:37:00
9 16:51:00
Name: time, dtype: timedelta64[ns]
If you have another column date, you may want to merge date and time to create a datetime column.
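A rough sketch of that merge, assuming a hypothetical date column of 'YYYY-MM-DD' strings (the question does not show one, so the dates below are invented):

```python
import pandas as pd

# Hypothetical frame: 'date' is an assumed column, 'time' is HHMM as an int.
df = pd.DataFrame({
    "date": ["2008-01-03", "2008-01-04", "2008-01-05"],
    "time": [1844, 216, 2400],
})

# Split HHMM into hours and minutes, then express as total minutes.
minutes = df["time"] // 100 * 60 + df["time"] % 100

# Add the time-of-day offset to midnight of each date.
df["datetime"] = pd.to_datetime(df["date"]) + pd.to_timedelta(minutes, unit="m")

print(df)
```

With this arithmetic a value of 2400 rolls over to midnight of the following day, which may or may not be what the data intends.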
If I understand correctly:
pd.to_datetime(df.time.astype(str),format='%H%M').dt.strftime('%H:%M')
Out[324]:
0 21:51
1 13:41
2 22:52
3 22:52
4 12:16
5 23:34
6 22:47
7 22:37
8 16:51
Name: col2, dtype: object
Try this!
df['conversion'] = (df['time'].apply(lambda x: pd.to_datetime(x, format = '%H%M')).dt.strftime('%H:%M'))
If you want the output as an HH:MM string, you just need to convert the column to a zero-padded string and use str.slice_replace to insert ':' (note: I changed your sample to include a 3-digit integer).
sample df:
time
0 1844
1 2151
2 1341
3 2252
4 2252
5 216
6 2334
7 2247
8 2237
9 1651
s = df['time'].map('{0:04}'.format)
out = s.str.slice_replace(2,2,':')
Out[666]:
0 18:44
1 21:51
2 13:41
3 22:52
4 22:52
5 02:16
6 23:34
7 22:47
8 22:37
9 16:51
Name: time, dtype: object
Or slice and concatenate with ':':
s = df['time'].map('{0:04}'.format)
out = s.str[:2] + ':' + s.str[2:]
Out[665]:
0 18:44
1 21:51
2 13:41
3 22:52
4 22:52
5 02:16
6 23:34
7 22:47
8 22:37
9 16:51
Name: time, dtype: object

Python pandas: how to create a column which is a fixed date + the # days in another column

I need to add a column to a dataframe, so that row 0 is 15-Feb-2019, row 1 is the 16th, etc. I have tried using the index:
import numpy as np
import pandas as pd
df=pd.DataFrame()
df['a']=np.arange(10,20)
df['date from index']=df.apply( lambda x: pd.to_datetime('15-2-2019') + pd.DateOffset(days=x.index), axis=1 )
but I get:
TypeError: ('must be str, not int', 'occurred at index 0')
which I admit I do not understand.
I tried creating an explicit column to use instead of the index:
df=pd.DataFrame()
df['a']=np.arange(10,20)
df['counter']=np.arange(0,df.shape[0])
df['date from counter']=df.apply( lambda x: pd.to_datetime('15-2-2019') + pd.DateOffset(days=x['counter']), axis=1 )
but this gives me:
TypeError: ('unsupported type for timedelta days component: numpy.int32', 'occurred at index 0')
What am I doing wrong?
Use to_timedelta to convert the values to day timedeltas, or use to_datetime with the origin parameter to specify the start day and the unit parameter set to days:
df['date from index']= pd.to_datetime('15-2-2019') + pd.to_timedelta(df.index, 'd')
df['date from counter']= pd.to_datetime('15-2-2019') + pd.to_timedelta(df['counter'], 'd')
df['date from index1']= pd.to_datetime(df.index, origin='15-02-2019', unit='d')
df['date from counter1']= pd.to_datetime(df['counter'], origin='15-02-2019', unit='d')
print(df.head())
a counter date from index date from counter date from index1 \
0 10 0 2019-02-15 2019-02-15 2019-02-15
1 11 1 2019-02-16 2019-02-16 2019-02-16
2 12 2 2019-02-17 2019-02-17 2019-02-17
3 13 3 2019-02-18 2019-02-18 2019-02-18
4 14 4 2019-02-19 2019-02-19 2019-02-19
date from counter1
0 2019-02-15
1 2019-02-16
2 2019-02-17
3 2019-02-18
4 2019-02-19
You can vectorise this with pd.to_timedelta:
# pd.to_timedelta(df.index, unit='d') + pd.to_datetime('15-2-2019') # whichever
pd.to_timedelta(df.a, unit='d') + pd.to_datetime('15-2-2019')
0 2019-02-25
1 2019-02-26
2 2019-02-27
3 2019-02-28
4 2019-03-01
5 2019-03-02
6 2019-03-03
7 2019-03-04
8 2019-03-05
9 2019-03-06
Name: a, dtype: datetime64[ns]
df['date_from_counter'] = (
pd.to_timedelta(df.a, unit='d') + pd.to_datetime('15-2-2019'))
df
a counter date_from_counter
0 10 0 2019-02-25
1 11 1 2019-02-26
2 12 2 2019-02-27
3 13 3 2019-02-28
4 14 4 2019-03-01
5 15 5 2019-03-02
6 16 6 2019-03-03
7 17 7 2019-03-04
8 18 8 2019-03-05
9 19 9 2019-03-06
As expected, you can call pd.to_timedelta on whatever column of integers with the right unit, and then use the resultant Timedelta column for date time arithmetic.
For your code to work, it seems you need to pass a plain Python int rather than a NumPy integer (the error above names numpy.int32); I am not sure why. This works:
dt = pd.to_datetime('15-2-2019')
df['date from counter'] = df.apply(
lambda x: dt + pd.DateOffset(days=x['counter'].item()), axis=1)
df
a counter date from counter
0 10 0 2019-02-15
1 11 1 2019-02-16
2 12 2 2019-02-17
3 13 3 2019-02-18
4 14 4 2019-02-19
5 15 5 2019-02-20
6 16 6 2019-02-21
7 17 7 2019-02-22
8 18 8 2019-02-23
9 19 9 2019-02-24
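On the first error in the question, which none of the answers explain: with axis=1 each x passed to the lambda is a row Series, so x.index is the set of column labels, not the row number. The row's own label is x.name. A minimal sketch, assuming the default integer index; I use pd.Timestamp("2019-02-15") instead of the '15-2-2019' string to avoid any day-first ambiguity:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(10, 20)})
start = pd.Timestamp("2019-02-15")

# Row-wise version: x.name is the row label (0, 1, 2, ...).
df["date from index"] = df.apply(
    lambda x: start + pd.Timedelta(days=int(x.name)), axis=1
)

# Vectorised equivalent, which avoids apply altogether.
vectorised = start + pd.to_timedelta(df.index, unit="D")

print(df.head())
```

The vectorised form is generally preferable; the apply version is shown only to make clear what the original attempt was missing.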

Python Pandas Series of Datetimes to Seconds Since the Epoch

Following in the spirit of this answer, I attempted the following to convert a DataFrame column of datetimes to a column of seconds since the epoch.
df['date'] = (df['date']+datetime.timedelta(hours=2)-datetime.datetime(1970,1,1))
df['date'].map(lambda td:td.total_seconds())
The second command causes the following error which I do not understand. Any thoughts on what might be going on here? I replaced map with apply and that didn't help matters.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-99-7123e823f995> in <module>()
----> 1 df['date'].map(lambda td:td.total_seconds())
/Users/cpd/.virtualenvs/py27-ipython+pandas/lib/python2.7/site-packages/pandas-0.12.0_937_gb55c790-py2.7-macosx-10.8-x86_64.egg/pandas/core/series.pyc in map(self, arg, na_action)
1932 return self._constructor(new_values, index=self.index).__finalize__(self)
1933 else:
-> 1934 mapped = map_f(values, arg)
1935 return self._constructor(mapped, index=self.index).__finalize__(self)
1936
/Users/cpd/.virtualenvs/py27-ipython+pandas/lib/python2.7/site-packages/pandas-0.12.0_937_gb55c790-py2.7-macosx-10.8-x86_64.egg/pandas/lib.so in pandas.lib.map_infer (pandas/lib.c:43628)()
<ipython-input-99-7123e823f995> in <lambda>(td)
----> 1 df['date'].map(lambda td:td.total_seconds())
AttributeError: 'float' object has no attribute 'total_seconds'
Update:
In 0.15.0 Timedeltas became a full-fledged dtype.
So this becomes possible (as well as the methods below)
In [45]: s = Series(pd.timedelta_range('1 day',freq='1S',periods=5))
In [46]: s.dt.components
Out[46]:
days hours minutes seconds milliseconds microseconds nanoseconds
0 1 0 0 0 0 0 0
1 1 0 0 1 0 0 0
2 1 0 0 2 0 0 0
3 1 0 0 3 0 0 0
4 1 0 0 4 0 0 0
In [47]: s.astype('timedelta64[s]')
Out[47]:
0 86400
1 86401
2 86402
3 86403
4 86404
dtype: float64
Original Answer:
I see that you are on master (and 0.13 is coming out very shortly), so, assuming you have numpy >= 1.7, do this. See here for the docs (this is frequency conversion).
In [5]: df = DataFrame(dict(date = date_range('20130101',periods=10)))
In [6]: df
Out[6]:
date
0 2013-01-01 00:00:00
1 2013-01-02 00:00:00
2 2013-01-03 00:00:00
3 2013-01-04 00:00:00
4 2013-01-05 00:00:00
5 2013-01-06 00:00:00
6 2013-01-07 00:00:00
7 2013-01-08 00:00:00
8 2013-01-09 00:00:00
9 2013-01-10 00:00:00
In [7]: df['date']+timedelta(hours=2)-datetime.datetime(1970,1,1)
Out[7]:
0 15706 days, 02:00:00
1 15707 days, 02:00:00
2 15708 days, 02:00:00
3 15709 days, 02:00:00
4 15710 days, 02:00:00
5 15711 days, 02:00:00
6 15712 days, 02:00:00
7 15713 days, 02:00:00
8 15714 days, 02:00:00
9 15715 days, 02:00:00
Name: date, dtype: timedelta64[ns]
In [9]: (df['date']+timedelta(hours=2)-datetime.datetime(1970,1,1)) / np.timedelta64(1,'s')
Out[9]:
0 1357005600
1 1357092000
2 1357178400
3 1357264800
4 1357351200
5 1357437600
6 1357524000
7 1357610400
8 1357696800
9 1357783200
Name: date, dtype: float64
The contained values are np.timedelta64[ns] objects; they don't have the same methods as timedelta objects, so there is no total_seconds().
In [10]: s = (df['date']+timedelta(hours=2)-datetime.datetime(1970,1,1))
In [11]: s[0]
Out[11]: numpy.timedelta64(1357005600000000000,'ns')
You can astype them to int, which gives you the value in nanoseconds.
In [12]: s[0].astype(int)
Out[12]: 1357005600000000000
You can do this as well (but only on an individual element).
In [18]: s[0].astype('timedelta64[s]')
Out[18]: numpy.timedelta64(1357005600,'s')
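For readers on a recent pandas, where the behaviour of astype on timedeltas differs from the transcripts above, a short sketch using Series.dt.total_seconds() and the division form from the original answer. The two-hour offset is carried over from the question; the names delta, seconds and seconds_alt are mine:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2013-01-01", periods=3)})

# Timedelta since the Unix epoch, with the question's two-hour shift.
delta = df["date"] + pd.Timedelta(hours=2) - pd.Timestamp("1970-01-01")

# Option 1: the .dt accessor on a timedelta Series.
seconds = delta.dt.total_seconds()

# Option 2: divide by a one-second timedelta (frequency conversion).
seconds_alt = delta / pd.Timedelta(seconds=1)

print(seconds)
```

Both options return floats; cast with astype("int64") if whole seconds are wanted.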
