Extract day and month from a datetime object - python

I have a column with dates in string format '2017-01-01'. Is there a way to extract day and month from it using pandas?
I have converted the column to datetime dtype but haven't figured out the later part:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df.dtypes:
Date datetime64[ns]
print(df)
Date
0 2017-05-11
1 2017-05-12
2 2017-05-13

Use dt.day and dt.month, which are available through the Series.dt accessor:
df = pd.DataFrame({'date':pd.date_range(start='2017-01-01',periods=5)})
df.date.dt.month
Out[164]:
0 1
1 1
2 1
3 1
4 1
Name: date, dtype: int64
df.date.dt.day
Out[165]:
0 1
1 2
2 3
3 4
4 5
Name: date, dtype: int64
You can also use dt.strftime, which returns zero-padded strings instead of integers:
df.date.dt.strftime('%m')
Out[166]:
0 01
1 01
2 01
3 01
4 01
Name: date, dtype: object
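Putting the pieces above together for the question's string column, a minimal sketch might look like the following. The column names day, month and month_str are chosen here for illustration and are not from the original posts.

```python
import pandas as pd

# Dates as strings, as in the question
df = pd.DataFrame({"Date": ["2017-05-11", "2017-05-12", "2017-05-13"]})

# Convert to datetime64 first; the .dt accessor only works on datetime-like columns
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")

# Integer day and month
df["day"] = df["Date"].dt.day
df["month"] = df["Date"].dt.month

# Zero-padded month as a string (hypothetical column name)
df["month_str"] = df["Date"].dt.strftime("%m")

print(df)
```

The integer columns are convenient for arithmetic and grouping, while the strftime variant is mainly useful for display or for building string keys.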

A simple form:
df['MM-DD'] = df['date'].dt.strftime('%m-%d')

Use dt to get the datetime attributes of the column (the session below assumes import datetime and import pandas as pd).
In [60]: df = pd.DataFrame({'date': [datetime.datetime(2018,1,1),datetime.datetime(2018,1,2),datetime.datetime(2018,1,3),]})
In [61]: df
Out[61]:
date
0 2018-01-01
1 2018-01-02
2 2018-01-03
In [63]: df['day'] = df.date.dt.day
In [64]: df['month'] = df.date.dt.month
In [65]: df
Out[65]:
date day month
0 2018-01-01 1 1
1 2018-01-02 2 1
2 2018-01-03 3 1
Timing the methods provided:
Using apply:
In [217]: %timeit(df['date'].apply(lambda d: d.day))
The slowest run took 33.66 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 210 µs per loop
Using dt.day:
In [218]: %timeit(df.date.dt.day)
10000 loops, best of 3: 127 µs per loop
Using dt.strftime:
In [219]: %timeit(df.date.dt.strftime('%d'))
The slowest run took 40.92 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 284 µs per loop
We can see that dt.day is the fastest
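The %timeit lines above only run inside IPython. As a rough sketch, the same comparison can be made in a plain script with the standard timeit module. The frame size (1000 rows) and repeat counts are arbitrary assumptions, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({"date": pd.date_range(start="2017-01-01", periods=1000)})

# The three approaches discussed above, each returning the day as integers
candidates = {
    "apply": lambda: df["date"].apply(lambda d: d.day),
    "dt.day": lambda: df["date"].dt.day,
    "dt.strftime": lambda: df["date"].dt.strftime("%d").astype(int),
}

# Keep the results so we can check that all approaches agree
results = {name: fn() for name, fn in candidates.items()}

# Best of 3 repeats, 20 calls each
runs = 20
timings = {
    name: min(timeit.repeat(fn, number=runs, repeat=3)) / runs
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} us per call")
```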

This should do it:
df['day'] = df['Date'].apply(lambda r:r.day)
df['month'] = df['Date'].apply(lambda r:r.month)

Related

Applying Date Operation to Entire Data Frame

import pandas as pd
import numpy as np
df = pd.DataFrame({'year': np.repeat(2018,12), 'month': range(1,13)})
In this data frame, I am interested in creating a field called 'year_month' such that each value looks like so:
datetime.date(df['year'][0], df['month'][0], 1).strftime("%Y%m")
I'm stuck on how to apply this operation to the entire data frame and would appreciate any help.
Join both columns converted to strings and for months add zfill:
df['new'] = df['year'].astype(str) + df['month'].astype(str).str.zfill(2)
Or add a new day column with assign, convert the columns with to_datetime, and finally format with strftime:
df['new'] = pd.to_datetime(df.assign(day=1)).dt.strftime("%Y%m")
If the DataFrame has other columns as well, select only the date-part columns first:
df['new'] = pd.to_datetime(df.assign(day=1)[['day','month','year']]).dt.strftime("%Y%m")
print (df)
month year new
0 1 2018 201801
1 2 2018 201802
2 3 2018 201803
3 4 2018 201804
4 5 2018 201805
5 6 2018 201806
6 7 2018 201807
7 8 2018 201808
8 9 2018 201809
9 10 2018 201810
10 11 2018 201811
11 12 2018 201812
Timings:
df = pd.DataFrame({'year': np.repeat(2018,12), 'month': range(1,13)})
df = pd.concat([df] * 1000, ignore_index=True)
In [212]: %timeit pd.to_datetime(df.assign(day=1)).dt.strftime("%Y%m")
10 loops, best of 3: 74.1 ms per loop
In [213]: %timeit df['year'].astype(str) + df['month'].astype(str).str.zfill(2)
10 loops, best of 3: 41.3 ms per loop
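As a quick check that the two approaches timed above produce the same strings, here is a small self-contained sketch on the question's 12-row frame. The variable names as_strings and via_datetime are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"year": np.repeat(2018, 12), "month": range(1, 13)})

# Approach 1: pure string concatenation, zero-padding the month
as_strings = df["year"].astype(str) + df["month"].astype(str).str.zfill(2)

# Approach 2: build real datetimes from year/month/day columns, then format
via_datetime = pd.to_datetime(
    df[["year", "month"]].assign(day=1)
).dt.strftime("%Y%m")

df["year_month"] = as_strings
print(df.head())
```

The string route skips datetime construction entirely, which is consistent with it being the faster of the two in the timings above.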
One way would be to create the datetime objects directly from the source data:
import pandas as pd
import numpy as np
from datetime import date
df = pd.DataFrame({'date': [date(i, j, 1) for i, j \
in zip(np.repeat(2018,12), range(1,13))]})
# date
# 0 2018-01-01
# 1 2018-02-01
# 2 2018-03-01
# 3 2018-04-01
# 4 2018-05-01
# 5 2018-06-01
# 6 2018-07-01
# 7 2018-08-01
# 8 2018-09-01
# 9 2018-10-01
# 10 2018-11-01
# 11 2018-12-01
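You could use an apply function such as the following (selecting the columns by name so the result does not depend on column order):
import datetime
df['year_month'] = df.apply(lambda row: datetime.date(row['year'], row['month'], 1).strftime("%Y%m"), axis=1)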

How can I group by month from a date field using Python and Pandas?

I have a dataframe, df, which is as follows:
| date | Revenue |
|-----------|---------|
| 6/2/2017 | 100 |
| 5/23/2017 | 200 |
| 5/20/2017 | 300 |
| 6/22/2017 | 400 |
| 6/21/2017 | 500 |
I need to group the above data by month to get output as:
| date | SUM(Revenue) |
|------|--------------|
| May | 500 |
| June | 1000 |
I tried this code, but it did not work:
df.groupby(month('date')).agg({'Revenue': 'sum'})
I want to only use Pandas or NumPy and no additional libraries.
Try this:
In [6]: df['date'] = pd.to_datetime(df['date'])
In [7]: df
Out[7]:
date Revenue
0 2017-06-02 100
1 2017-05-23 200
2 2017-05-20 300
3 2017-06-22 400
4 2017-06-21 500
In [59]: df.groupby(df['date'].dt.strftime('%B'))['Revenue'].sum().sort_values()
Out[59]:
date
May 500
June 1000
Try a groupby using a pandas Grouper:
df = pd.DataFrame({'date':['6/2/2017','5/23/2017','5/20/2017','6/22/2017','6/21/2017'],'Revenue':[100,200,300,400,500]})
df.date = pd.to_datetime(df.date)
dg = df.groupby(pd.Grouper(key='date', freq='1M')).sum() # groupby each 1 month
dg.index = dg.index.strftime('%B')
Output:
Revenue
May 500
June 1000
For a DataFrame with many rows, strftime is comparatively slow. If the date column already has dtype datetime64[ns] (convert it with pd.to_datetime(), or pass parse_dates when importing a CSV), you can use the datetime properties directly as groupby labels (Method 3). The speedup is substantial.
import numpy as np
import pandas as pd
T = pd.date_range(pd.Timestamp(0), pd.Timestamp.now()).to_frame(index=False)
T = pd.concat([T for i in range(1,10)])
T['revenue'] = pd.Series(np.random.randint(1000, size=T.shape[0]))
T.columns.values[0] = 'date'
print(T.shape) #(159336, 2)
print(T.dtypes) #date: datetime64[ns], revenue: int32
Method 1: strftime
%timeit -n 10 -r 7 T.groupby(T['date'].dt.strftime('%B'))['revenue'].sum()
1.47 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Method 2: Grouper
%timeit -n 10 -r 7 T.groupby(pd.Grouper(key='date', freq='1M')).sum()
#NOTE Manually map months as integer {01..12} to strings
56.9 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Method 3: datetime properties
%timeit -n 10 -r 7 T.groupby(T['date'].dt.month)['revenue'].sum()
#NOTE Manually map months as integer {01..12} to strings
34 ms ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
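The NOTE under Method 3 leaves the integer-to-name mapping to the reader. One possible way to do it, sketched on the question's data, is shown below. Using calendar.month_name from the standard library is a choice made here for illustration and is not specified in the answer.

```python
import calendar

import pandas as pd

df = pd.DataFrame({
    "date": ["6/2/2017", "5/23/2017", "5/20/2017", "6/22/2017", "6/21/2017"],
    "Revenue": [100, 200, 300, 400, 500],
})
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Group on the integer month (fast), which also keeps calendar order
monthly = df.groupby(df["date"].dt.month)["Revenue"].sum()

# Map 5 -> 'May', 6 -> 'June' only on the small aggregated result
monthly.index = monthly.index.map(lambda m: calendar.month_name[m])

print(monthly)
```

Mapping after aggregation touches at most 12 labels, so the formatting cost no longer scales with the number of rows.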
This will work better.
Try this:
# Explicitly convert to date
df['Date'] = pd.to_datetime(df['Date'])
# Set your date column as index
df.set_index('Date',inplace=True)
# 'M' means monthly; change the frequency string if you need another frequency.
df['Revenue'].resample('M').sum()
This gives the same result as shivsn's answer above, but resampling on a DatetimeIndex supports many more operations, so it is the recommended approach:
>>> df['Date'] = pd.to_datetime(df['Date'])
>>> df.set_index('Date',inplace=True)
>>> df['withdrawal'].resample('M').sum().sort_values()
Date
2019-10-31 28710.00
2019-04-30 31437.00
2019-07-31 39728.00
2019-11-30 40121.00
2019-05-31 46495.00
2020-02-29 57751.10
2019-12-31 72469.13
2020-01-31 76115.78
2019-06-30 76947.00
2019-09-30 79847.04
2020-03-31 97920.18
2019-08-31 205279.45
Name: withdrawal, dtype: float64
where shivsn's code does the same.
>>> df.groupby(df['Date'].dt.strftime('%B'))['withdrawal'].sum().sort_values()
Date
October 28710.00
April 31437.00
July 39728.00
November 40121.00
May 46495.00
February 57751.10
December 72469.13
January 76115.78
June 76947.00
September 79847.04
March 97920.18
August 205279.45
Name: withdrawal, dtype: float64
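The two outputs above match because each month occurs in only one year of that data. When the same month appears in several years, grouping by month name merges them, while a year-aware key keeps them apart. A small sketch with made-up numbers follows. It uses dt.to_period('M') as the year-aware key, which is a substitute chosen here rather than the resample call from the answer.

```python
import pandas as pd

# Hypothetical data: June appears in both 2019 and 2020
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-06-02", "2019-06-21", "2020-06-05"]),
    "withdrawal": [100.0, 200.0, 50.0],
})

# Month name only: both Junes collapse into one group
by_name = df.groupby(df["Date"].dt.strftime("%B"))["withdrawal"].sum()

# Year-month period: one group per calendar month of each year
by_period = df.groupby(df["Date"].dt.to_period("M"))["withdrawal"].sum()

print(by_name)
print(by_period)
```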
df['Month'] = pd.DatetimeIndex(df['date']).month_name()
Using this you should get
| date | Revenue | Month |
|-----------|---------|-------|
| 6/2/2017 | 100 | June |
| 5/23/2017 | 200 | May |
| 5/20/2017 | 300 | May |
| 6/22/2017 | 400 | June |
| 6/21/2017 | 500 | June |
Try this:
Convert the date column to datetime format:
df['date'] = pd.to_datetime(df['date'])
Add a new column holding the month name (e.g. 'May', 'June'); here x is each value taken from the date column:
df['months'] = df['date'].apply(lambda x: x.strftime('%B'))
Now group by the months column and sum the revenue:
response_data_frame = df.groupby('months')['Revenue'].sum()
print(response_data_frame)
Output:
| month | Revenue |
|-------|---------|
| May | 500 |
| June | 1000 |

How to compare a unicode date like u'2006-07-23' with a timestamp like 25-06-15 08:42:43.830000000 PM using Python pandas?

Basically, the unicode format comes from the datepicker, and the 25-06-15 08:42:43.830000000 PM format comes from one column.
my dataframe is:
query,status,received_date
a,closed,25-06-15 08:42:43.830000000 PM
b,pending,27-06-15 08:42:43.830000000 PM
ab,closed,28-06-15 08:42:43.830000000 PM
bb,pending,29-06-15 08:42:43.830000000 PM
and I get two dates from the datepicker in the following format: (u'2015-06-23',u'2015-06-29'). How do I compare these unicode dates with the received_date column?
I have to display the rows that fall between those two dates (which come from the datepicker).
I think you first need to convert the dates with to_datetime, then convert the received_date column too and extract the date. Finally, use boolean indexing with a mask for filtering:
#datetimes changed for better testing
print(df)
query status received_date
0 a closed 20-06-15 08:42:43.830000000 PM
1 b pending 27-06-15 08:42:43.830000000 PM
2 ab closed 28-06-15 08:42:43.830000000 PM
3 bb pending 30-06-15 08:42:43.830000000 PM
dates = (u'2015-06-23',u'2015-06-29')
dates = pd.to_datetime(dates).date
print(dates)
[datetime.date(2015, 6, 23) datetime.date(2015, 6, 29)]
df['received_date'] = pd.to_datetime(df['received_date']).dt.date
print(df)
query status received_date
0 a closed 2015-06-20
1 b pending 2015-06-27
2 ab closed 2015-06-28
3 bb pending 2015-06-30
print((df['received_date'] > dates[0]) & (df['received_date'] < dates[1]))
0 False
1 True
2 True
3 False
Name: received_date, dtype: bool
df = df[(df['received_date'] > dates[0]) & (df['received_date'] < dates[1])]
print(df)
query status received_date
1 b pending 2015-06-27
2 ab closed 2015-06-28
But a modified version of PhilChang's solution is faster:
dates = (u'2015-06-23',u'2015-06-29')
df['received_date'] = pd.to_datetime(df['received_date'])
df = df.set_index('received_date')
df = df[dates[0]:dates[1]]
TESTING (len(df) == 40k):
In [569]: %timeit a(df)
1 loops, best of 3: 12.2 s per loop
In [570]: %timeit b(df1)
10 loops, best of 3: 92.3 ms per loop
In [571]: %timeit c(df2)
100 loops, best of 3: 6.57 ms per loop
Code for testing:
#length is 40k
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()
def a(df):
    dates = (u'2015-06-23',u'2015-06-29')
    df = df.set_index('received_date')
    df.index = pd.DatetimeIndex(df.index)
    return df[dates[0]:dates[1]]
def b(df):
    dates = (u'2015-06-23',u'2015-06-29')
    dates = pd.to_datetime(dates).date
    df['received_date'] = pd.to_datetime(df['received_date']).dt.date
    df = df[(df['received_date'] > dates[0]) & (df['received_date'] < dates[1])]
    return df
def c(df):
    dates = (u'2015-06-23',u'2015-06-29')
    df['received_date'] = pd.to_datetime(df['received_date'])
    df = df.set_index('received_date')
    return df[dates[0]:dates[1]]
print(a(df))
print(b(df1))
print(c(df2))
Convert the index to a DatetimeIndex, then slice it with the two date strings:
dates = (u'2015-06-23',u'2015-06-29')
df = df.set_index('received_date')
df.index = pd.DatetimeIndex(df.index)
df[dates[0]:dates[1]]
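The answers above rely on pandas inferring the layout of received_date. As a hedged alternative sketch, the column can be parsed with an explicit format and then filtered with a boolean mask. The format string below assumes the values are day-month-year with a 12-hour clock, which matches the parsed output shown earlier but is not stated by the asker.

```python
import pandas as pd

df = pd.DataFrame({
    "query": ["a", "b", "ab", "bb"],
    "status": ["closed", "pending", "closed", "pending"],
    "received_date": [
        "20-06-15 08:42:43.830000000 PM",
        "27-06-15 08:42:43.830000000 PM",
        "28-06-15 08:42:43.830000000 PM",
        "30-06-15 08:42:43.830000000 PM",
    ],
})

# Assumed layout: DD-MM-YY, 12-hour clock with AM/PM, fractional seconds
df["received_date"] = pd.to_datetime(
    df["received_date"], format="%d-%m-%y %I:%M:%S.%f %p"
)

# Dates as they would arrive from the datepicker
start, end = pd.to_datetime(["2015-06-23", "2015-06-29"])

# Compare on the calendar day only, with strict bounds as in the first answer
day = df["received_date"].dt.normalize()
selected = df[(day > start) & (day < end)]

print(selected)
```

Keeping the column as datetime64 rather than converting to Python date objects keeps the comparison vectorized.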

Modify hour in datetimeindex in pandas dataframe

I have a dataframe that looks like this:
master.head(5)
Out[73]:
hour price
day
2014-01-01 0 1066.24
2014-01-01 1 1032.11
2014-01-01 2 1028.53
2014-01-01 3 963.57
2014-01-01 4 890.65
In [74]: master.index.dtype
Out[74]: dtype('<M8[ns]')
What I need to do is update the hour in the index with the hour in the column but the following approaches don't work:
In [82]: master.index.hour = master.index.hour(master['hour'])
TypeError: 'numpy.ndarray' object is not callable
In [83]: master.index.hour = [master.index.hour(master.iloc[i,0]) for i in len(master.index.hour)]
TypeError: 'int' object is not iterable
How to proceed?
IIUC I think you want to construct a TimedeltaIndex:
In [89]:
df.index += pd.TimedeltaIndex(df['hour'], unit='h')
df
Out[89]:
hour price
2014-01-01 00:00:00 0 1066.24
2014-01-01 01:00:00 1 1032.11
2014-01-01 02:00:00 2 1028.53
2014-01-01 03:00:00 3 963.57
2014-01-01 04:00:00 4 890.65
Just to compare against using apply:
In [87]:
%timeit df.index + pd.TimedeltaIndex(df['hour'], unit='h')
%timeit df.index + df['hour'].apply(lambda x: pd.Timedelta(x, 'h'))
1000 loops, best of 3: 291 µs per loop
1000 loops, best of 3: 1.18 ms per loop
You can see that using a TimedeltaIndex is significantly faster
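A self-contained sketch of the same idea on the first rows of the question's frame follows. It swaps in pd.to_timedelta for the TimedeltaIndex constructor, on the assumption that recent pandas releases discourage passing unit to that constructor. The result is the same kind of index.

```python
import pandas as pd

master = pd.DataFrame(
    {"hour": [0, 1, 2], "price": [1066.24, 1032.11, 1028.53]},
    index=pd.DatetimeIndex(["2014-01-01"] * 3, name="day"),
)

# Turn the hour column into timedeltas and add them to the midnight timestamps
offsets = pd.to_timedelta(master["hour"].to_numpy(), unit="h")
master.index = master.index + offsets

print(master)
```

This adds the hours to the existing index, so it assumes every index value starts at midnight, as in the question.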
master.index = pd.to_datetime(master.index.map(lambda x : x.strftime('%Y-%m-%d')) + '-' + master.hour.map(str) , format='%Y-%m-%d-%H.0')

Optimizing Pandas groupby/apply

I am writing a process which takes a semi-large file as input (~4 million rows, 5 columns)
and performs a few operations on it.
Columns:
- CARD_NO
- ID
- CREATED_DATE
- STATUS
- FLAG2
I need to create a file which contains 1 copy of each CARD_NO where STATUS = '1' and CREATED_DATE is the maximum of all CREATED_DATEs for that CARD_NO.
I succeeded but my solution is very slow (3h and counting as of right now.)
Here is my code:
file = 'input.csv'
input = pd.read_csv(file)
input = input.drop_duplicates()
card_groups = input.groupby('CARD_NO', as_index=False, sort=False).filter(lambda x: x['STATUS'] == 1)
def important(x):
latest_date = x['CREATED_DATE'].values[x['CREATED_DATE'].values.argmax()]
return x[x.CREATED_DATE == latest_date]
#where the major slowdown occurs
group_2 = card_groups.groupby('CARD_NO', as_index=False, sort=False).apply(important)
path = 'result.csv'
group_2.to_csv(path, sep=',', index=False)
# ~4 minutes for the 154k rows file
# 3+ hours for ~4m rows
I was wondering if you had any advice on how to improve the running time of this little process.
Thank you and have a good day.
Setup (FYI: make sure that you use parse_dates=True when reading your csv)
In [6]: n_groups = 10000
In [7]: N = 4000000
In [8]: dates = pd.date_range('20130101',periods=100)
In [9]: df = pd.DataFrame(dict(id = np.random.randint(0,n_groups,size=N), status = np.random.randint(0,10,size=N), date=np.random.choice(dates,size=N,replace=True)))
In [10]: pd.set_option('display.max_rows',10)
In [13]: df = pd.DataFrame(dict(card_no = np.random.randint(0,n_groups,size=N), status = np.random.randint(0,10,size=N), date=np.random.choice(dates,size=N,replace=True)))
In [14]: df
Out[14]:
card_no date status
0 5790 2013-02-11 6
1 6572 2013-03-17 6
2 7764 2013-02-06 3
3 4905 2013-04-01 3
4 3871 2013-04-08 1
... ... ... ...
3999995 1891 2013-02-16 5
3999996 9048 2013-01-11 9
3999997 1443 2013-02-23 1
3999998 2845 2013-01-28 0
3999999 5645 2013-02-05 8
[4000000 rows x 3 columns]
In [15]: df.dtypes
Out[15]:
card_no int64
date datetime64[ns]
status int64
dtype: object
Only status == 1, groupby card_no, then return the max date for that group
In [18]: df[df.status==1].groupby('card_no')['date'].max()
Out[18]:
card_no
0 2013-04-06
1 2013-03-30
2 2013-04-09
...
9997 2013-04-07
9998 2013-04-07
9999 2013-04-09
Name: date, Length: 10000, dtype: datetime64[ns]
In [19]: %timeit df[df.status==1].groupby('card_no')['date'].max()
1 loops, best of 3: 934 ms per loop
If you need a transform of this (i.e. the group maximum repeated on every row of its group), use transform. Note that with pandas < 0.14.1 you will need a separate workaround, otherwise this will be pretty slow.
In [20]: df[df.status==1].groupby('card_no')['date'].transform('max')
Out[20]:
4 2013-04-10
13 2013-04-10
25 2013-04-10
...
3999973 2013-04-10
3999979 2013-04-10
3999997 2013-04-09
Name: date, Length: 399724, dtype: datetime64[ns]
In [21]: %timeit df[df.status==1].groupby('card_no')['date'].transform('max')
1 loops, best of 3: 1.8 s per loop
I suspect you probably want to merge the final transform back into the original frame (res below is the transform result from In [20]):
In [24]: df.join(res.to_frame('max_date'))
Out[24]:
card_no date status max_date
0 5790 2013-02-11 6 NaT
1 6572 2013-03-17 6 NaT
2 7764 2013-02-06 3 NaT
3 4905 2013-04-01 3 NaT
4 3871 2013-04-08 1 2013-04-10
... ... ... ... ...
3999995 1891 2013-02-16 5 NaT
3999996 9048 2013-01-11 9 NaT
3999997 1443 2013-02-23 1 2013-04-09
3999998 2845 2013-01-28 0 NaT
3999999 5645 2013-02-05 8 NaT
[4000000 rows x 4 columns]
In [25]: %timeit df.join(res.to_frame('max_date'))
10 loops, best of 3: 58.8 ms per loop
The csv writing will actually take a fair amount of time relative to this. I used HDF5 for things like this, MUCH faster.
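To connect the transform approach back to the original task (one latest STATUS == 1 row per CARD_NO), a minimal sketch on a tiny made-up frame could look like this. The column names follow the question, while the data values and the names active and latest are invented for illustration.

```python
import pandas as pd

# Hypothetical sample with the question's column names
df = pd.DataFrame({
    "CARD_NO": [1, 1, 2, 2, 3],
    "CREATED_DATE": pd.to_datetime(
        ["2013-01-01", "2013-03-01", "2013-02-01", "2013-04-01", "2013-01-05"]
    ),
    "STATUS": [1, 1, 1, 0, 0],
})

# Keep only rows with STATUS == 1 before grouping
active = df[df["STATUS"] == 1]

# Broadcast each card's latest date to all of its rows, then keep the matches
max_date = active.groupby("CARD_NO")["CREATED_DATE"].transform("max")
latest = active[active["CREATED_DATE"] == max_date]

print(latest)
```

Like the asker's important function, this keeps every row that ties for the latest date. If exactly one row per card is required, a drop_duplicates on CARD_NO afterwards is one option.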
