Extract multi-year three month series (winter) from pandas dataframe - python

I've got a pandas DataFrame containing 70 years with hourly data, looking like this:
pressure
2015-06-01 18:00:00 945.6
2015-06-01 19:00:00 945.6
2015-06-01 20:00:00 945.4
2015-06-01 21:00:00 945.4
2015-06-01 22:00:00 945.3
I want to extract the winter months (D-J-F) from every year and generate a new DataFrame with a series of winters.
I found a lot of complicated stuff (e.g. extracting the df.index.month as a new column and then adress this one afterwards), but is there a way to get the winter months straightforward?

You can use map():
import pandas as pd
df = pd.DataFrame({'date' : [datetime.date(2015, 11, 1), datetime.date(2015, 12, 1), datetime.date(2015, 1, 1), datetime.date(2015, 2, 1)],
'pressure': [1,2,3,4]})
winter_months = [12, 1, 2]
print df
# date pressure
# 0 2015-11-01 1
# 1 2015-12-01 2
# 2 2015-01-01 3
# 3 2015-02-01 4
df = df[df["date"].map(lambda t: t.month in winter_months)]
print df
# date pressure
# 1 2015-12-01 2
# 2 2015-01-01 3
# 3 2015-02-01 4
EDIT: I noticed that in your example the dates are the dataframe's index. This still works:
df = df[df.index.map(lambda t: t.month in winter_months)]

I just found that
df[(df.index.month==12) | (df.index.month==1) | (df.index.month==2)]
works fine.

Related

Pandas change time values based on condition

I have a dataframe:
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
I would like to convert the time based on conditions: if the hour is less than 9, I want to set it to 9 and if the hour is more than 17, I need to set it to 17.
I tried this approach:
df['time'] = np.where(((df['time'].dt.hour < 9) & (df['time'].dt.hour != 0)), dt.time(9, 00))
I am getting an error: Can only use .dt. accesor with datetimelike values.
Can anyone please help me with this? Thanks.
Here's a way to do what your question asks:
df.time = pd.to_datetime(df.time)
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
Input:
time
0 2022-06-06 08:45:00
1 2022-06-06 09:30:00
2 2022-06-06 18:00:00
3 2022-06-06 15:00:00
Output:
time
0 2022-06-06 09:45:00
1 2022-06-06 09:30:00
2 2022-06-06 17:00:00
3 2022-06-06 15:00:00
UPDATE:
Here's alternative code to try to address OP's error as described in the comments:
import pandas as pd
import datetime
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print('', 'df loaded as strings:', df, sep='\n')
df.time = pd.to_datetime(df.time, format='%H:%M:%S')
print('', 'df converted to datetime by pd.to_datetime():', df, sep='\n')
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.time = [time.time() for time in pd.to_datetime(df.time)]
print('', 'df with time column adjusted to have hour between 9 and 17, converted to type "time":', df, sep='\n')
Output:
df loaded as strings:
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
df converted to datetime by pd.to_datetime():
time
0 1900-01-01 08:45:00
1 1900-01-01 09:30:00
2 1900-01-01 18:00:00
3 1900-01-01 15:00:00
df with time column adjusted to have hour between 9 and 17, converted to type "time":
time
0 09:45:00
1 09:30:00
2 17:00:00
3 15:00:00
UPDATE #2:
To not just change the hour for out-of-window times, but to simply apply 9:00 and 17:00 as min and max times, respectively (see OP's comment on this), you can do this:
df.loc[df['time'].dt.hour < 9, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[9]*len(df.index)}))
df.loc[df['time'].dt.hour > 17, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[17]*len(df.index)}))
df['time'] = [time.time() for time in pd.to_datetime(df['time'])]
Since your 'time' column contains strings they can kept as strings and assign new string values where appropriate. To filter for your criteria it is convenient to: create datetime Series from the 'time' column, create boolean Series by comparing the datetime Series with your criteria, use the boolean Series to filter the rows which need to be changed.
Your data:
import numpy as np
import pandas as pd
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print(df.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
Convert to datetime, make boolean Series with your criteria
dts = pd.to_datetime(df['time'])
lt_nine = dts.dt.hour < 9
gt_seventeen = (dts.dt.hour >= 17)
print(lt_nine)
print(gt_seventeen)
>>>
0 True
1 False
2 False
3 False
Name: time, dtype: bool
0 False
1 False
2 True
3 False
Name: time, dtype: bool
Use the boolean series to assign a new value:
df.loc[lt_nine,'time'] = '09:00:00'
df.loc[gt_seventeen,'time'] = '17:00:00'
print(df.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
Or just stick with strings altogether and create the boolean Series using regex patterns and .str.match.
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00','07:22:00','22:02:06']}
dg = pd.DataFrame(data)
print(dg.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
4 07:22:00
5 22:02:06
# regex patterns
pattern_lt_nine = '^00|01|02|03|04|05|06|07|08'
pattern_gt_seventeen = '^17|18|19|20|21|22|23'
Make boolean Series and assign new values
gt_seventeen = dg['time'].str.match(pattern_gt_seventeen)
lt_nine = dg['time'].str.match(pattern_lt_nine)
dg.loc[lt_nine,'time'] = '09:00:00'
dg.loc[gt_seventeen,'time'] = '17:00:00'
print(dg.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
4 09:00:00
5 17:00:00
Time series / date functionality
Working with text data

Pandas DataFrame Time index using .loc function error

I have created DataFrame with DateTime index, then I split the index into the Date index column and Time index column. Now, when I call for a row of a specific time by using pd.loc(), the system shows an error.
Here're an example of steps of how I made the DataFrame from beginning till reaching my consideration.
import pandas as pd
import numpy as np
df= pd.DataFrame({'A':[1, 2, 3, 4], 'B':[5, 6, 7, 8], 'C':[9, 10, 11, 12],
'DateTime':pd.to_datetime(['2021-09-01 10:00:00', '2021-09-01 11:00:00', '2021-09-01 12:00:00', '2021-09-01 13:00:00'])})
df=df.set_index(df['DateTime'])
df.drop('DateTime', axis=1, inplace=True)
df
OUT >>
A B C
DateTime
2021-09-01 10:00:00 1 5 9
2021-09-01 11:00:00 2 6 10
2021-09-01 12:00:00 3 7 11
2021-09-01 13:00:00 4 8 12
In this step, I'm gonna splitting DateTime index into multi-index Date & Time
df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.time], names=['Date','Time'])
df
OUT >>
A B C
Date Time
2021-09-01 10:00:00 1 5 9
11:00:00 2 6 10
12:00:00 3 7 11
13:00:00 4 8 12
##Here is the issue##
when I call this statement, The system shows an error
df.loc["11:00:00"]
How to fix that?
1. If you want to use .loc, you can just specify the time by:
import datetime
df.loc[(slice(None), datetime.time(11, 0)), :]
or use pd.IndexSlice similar to the solution by BENY, as follows:
import datetime
idx = pd.IndexSlice
df.loc[idx[:,datetime.time(11, 0)], :]
(defining a variable idx to use pd.IndexSlice gives us cleaner code and less typing if you are going to use pd.IndexSlice multiple times).
Result:
A B C
Date Time
2021-09-01 11:00:00 2 6 10
2. If you want to select just for one day, you can use:
import datetime
df.loc[(datetime.date(2021, 9, 1), datetime.time(11, 0))]
Result:
A 2
B 6
C 10
Name: (2021-09-01, 11:00:00), dtype: int64
3. You can also use .xs to access the MultiIndex row index, as follows:
import datetime
df.xs(datetime.time(11,0), axis=0, level='Time')
Result:
A B C
Date
2021-09-01 2 6 10
4. Alterative way if you haven't split DateTime index into multi-index Date & Time
Actually, if you haven't split the DatetimeIndex into separate date and time index, you can also use the .between_time() function to filter the time, as follows:
df.between_time("11:00:00", "11:00:00")
You can specify a range of time to filter, instead of just a point of time, if you specify different values for the start_time and end_time.
Result:
A B C
DateTime
2021-09-01 11:00:00 2 6 10
As you can see, .between_time() allows you to enter the time in simple string to filter, instead of requiring the use of datetime objects. This should be nearest to your tried ideal (but invalid) syntax of using df.loc["11:00:00"] to filter.
As a suggestion, if you split the DatetimeIndex into separate date and time index simply for the sake of filtering by time, you can consider using the .between_time() function instead.
We can just do the correct value slice with IndexSlice
import datetime
out = df.loc[pd.IndexSlice[:,datetime.time(11, 0)],:]
Out[76]:
A B C DateTime
Date Time
2021-09-01 11:00:00 2 6 10 2021-09-01 11:00:00
Why do you need to split your datetime into two parts?
You can use indexer_at_time
>>> df
A B C
DateTime
2021-09-01 10:00:00 1 5 9
2021-09-01 11:00:00 2 6 10
2021-09-01 12:00:00 3 7 11
2021-09-01 13:00:00 4 8 12
# Extract 11:00:00 from any day
>>> df.iloc[df.index.indexer_at_time('11:00:00')]
A B C
DateTime
2021-09-01 11:00:00 2 6 10
You can also create a proxy to save time typing:
T = df.index.indexer_at_time
df.iloc[T('11:00:00')]

Create a list of years with pandas

I have a dataframe with a column of dates of the form
2004-01-01
2005-01-01
2006-01-01
2007-01-01
2008-01-01
2009-01-01
2010-01-01
2011-01-01
2012-01-01
2013-01-01
2014-01-01
2015-01-01
2016-01-01
2017-01-01
2018-01-01
2019-01-01
Given an integer number k, let's say k=5, I would like to generate an array of the next k years after the maximum date of the column. The output should look like:
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01
Let's use pd.to_datetime + max to compute the largest date in the column date then use pd.date_range to generate the dates based on the offset frequency one year and having the number of periods equals to k=5:
strt, offs = pd.to_datetime(df['date']).max(), pd.DateOffset(years=1)
dates = pd.date_range(strt + offs, freq=offs, periods=k).strftime('%Y-%m-%d').tolist()
print(dates)
['2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01', '2024-01-01']
Here you go:
import pandas as pd
# this is your k
k = 5
# Creating a test DF
array = {'dt': ['2018-01-01', '2019-01-01']}
df = pd.DataFrame(array)
# Extracting column of year
df['year'] = pd.DatetimeIndex(df['dt']).year
year1 = df['year'].max()
# creating a new DF and populating it with k years
years_df = pd.DataFrame()
for i in range (1,k+1):
row = {'dates':[str(year1 + i) + '-01-01']}
years_df = years_df.append(pd.DataFrame(row))
years_df
The output:
dates
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01

Count business day between using pandas columns

I have tried to calculate the number of business days between two date (stored in separate columns in a dataframe ).
MonthBegin MonthEnd
0 2014-06-09 2014-06-30
1 2014-07-01 2014-07-31
2 2014-08-01 2014-08-31
3 2014-09-01 2014-09-30
4 2014-10-01 2014-10-31
I have tried to apply numpy.busday_count but I get the following error:
Iterator operand 0 dtype could not be cast from dtype('<M8[ns]') to dtype('<M8[D]') according to the rule 'safe'
I have tried to change the type into Timestamp as the following :
Timestamp('2014-08-31 00:00:00')
or datetime :
datetime.date(2014, 8, 31)
or to numpy.datetime64:
numpy.datetime64('2014-06-30T00:00:00.000000000')
Anyone knows how to fix it?
Note 1: I have passed tried np.busday_count in two way :
1. Passing dataframe columns, t['Days']=np.busday_count(t.MonthBegin,t.MonthEnd)
Passing arrays np.busday_count(dt1,dt2)
Note2: My dataframe has over 150K rows so I need to use an efficient algorithm
You can using bdate_range, also I corrected your input , since the most of MonthEnd is early than the MonthBegin
[len(pd.bdate_range(x,y))for x,y in zip(df['MonthBegin'],df['MonthEnd'])]
Out[519]: [16, 21, 22, 23, 20]
I think the best way to do is
df.apply(lambda row : np.busday_count(row['MBegin'],row['MEnd']),axis=1)
For my dataframe df as below:
MBegin MEnd
0 2011-01-01 2011-02-01
1 2011-01-10 2011-02-10
2 2011-01-02 2011-02-02
doing :
df['MBegin'] = df['MBegin'].values.astype('datetime64[D]')
df['MEnd'] = df['MEnd'].values.astype('datetime64[D]')
df['busday'] = df.apply(lambda row : np.busday_count(row['MBegin'],row['MEnd']),axis=1)
>>df
MBegin MEnd busday
0 2011-01-01 2011-02-01 21
1 2011-01-10 2011-02-10 23
2 2011-01-02 2011-02-02 22
You need to provide the template in which your dates are written.
a = datetime.strptime('2014-06-9', '%Y-%m-%d')
Calculate this for your
b = datetime.strptime('2014-06-30', '%Y-%m-%d')
Now their difference
c = b-a
c.days
which gives you the difference 21 days, You can now use list comprehension to get the difference between two dates as days.
will give you datetime.timedelta(21), to convert it into days, just use
You can modify your code to get the desired result as below:
df = pd.DataFrame({'MonthBegin': ['2014-06-09', '2014-08-01', '2014-09-01', '2014-10-01', '2014-11-01'],
'MonthEnd': ['2014-06-30', '2014-08-31', '2014-09-30', '2014-10-31', '2014-11-30']})
df['MonthBegin'] = df['MonthBegin'].astype('datetime64[ns]')
df['MonthEnd'] = df['MonthEnd'].astype('datetime64[ns]')
df['BDays'] = np.busday_count(df['MonthBegin'].tolist(), df['MonthEnd'].tolist())
print(df)
MonthBegin MonthEnd BDays
0 2014-06-09 2014-06-30 15
1 2014-08-01 2014-08-31 21
2 2014-09-01 2014-09-30 21
3 2014-10-01 2014-10-31 22
4 2014-11-01 2014-11-30 20
Additionally numpy.busday_count has few other optional arguments like weekmask, holidays ... which you can use according to your need.

how to get the datetimes before and after some specific dates in Pandas?

I have a Pandas DataFrame that looks like
col1
2015-02-02
2015-04-05
2016-07-02
I would like to add, for each date in col 1, the x days before and x days after that date.
That means that the resulting DataFrame will contain more rows (specifically, n(1+ 2*x), where n is the orignal number of dates in col1)
How can I do that in a proper Pandonic way?
Output would be (for x=1)
col1
2015-01-01
2015-01-02
2015-01-03
2015-04-04
etc
Thanks!
you can do it this way, but I'm not sure that it's the best / fastest way to do it:
In [143]: df
Out[143]:
col1
0 2015-02-02
1 2015-04-05
2 2016-07-02
In [144]: %paste
N = 2
(df.col1.apply(lambda x: pd.Series(pd.date_range(x - pd.Timedelta(days=N),
x + pd.Timedelta(days=N))
)
)
.stack()
.drop_duplicates()
.reset_index(level=[0,1], drop=True)
.to_frame(name='col1')
)
## -- End pasted text --
Out[144]:
col1
0 2015-01-31
1 2015-02-01
2 2015-02-02
3 2015-02-03
4 2015-02-04
5 2015-04-03
6 2015-04-04
7 2015-04-05
8 2015-04-06
9 2015-04-07
10 2016-06-30
11 2016-07-01
12 2016-07-02
13 2016-07-03
14 2016-07-04
Something like this takes a dataframe with a datetime.date column and then stacks another Series underneath with timedelta shifts to the original data.
import datetime
import pandas as pd
df = pd.DataFrame([{'date': datetime.date(2016, 1, 2)}, {'date': datetime.date(2016, 1, 1)}], columns=['date'])
df = pd.concat([df.date, df.date + datetime.timedelta(days=1)], ignore_index=True).to_frame()

Categories