If date is leap day (29-Feb) change to 28-Feb - python

In a Pandas dataframe I need to change all leap-day cells in a specific column to 28 Feb. So, for example, 2020/02/29 should become 2020/02/28.
I tried the following, but didn't work:
df.loc[((df['Date'].dt.month == 2) & (df['Date'].dt.day == 29)), 'Date'] = df['Date'] + timedelta-(1)
Any ideas?
Thanks

You can use np.where:
df["Date"] = np.where((df["Date"].dt.month == 2) &
                      (df["Date"].dt.day == 29),
                      df["Date"] - pd.DateOffset(days=1),
                      df["Date"])
If you look closely, you have written + timedelta-(1) which returns an error. Using your method works fine if you instead write + timedelta(-1).
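For reference, the masked-assignment approach from the question might look like the following once the typo is fixed. This is a minimal sketch on a made-up three-row frame: only the column name Date comes from the question, and the sample dates and the is_leap_day name are illustrative.

```python
import pandas as pd

# Illustrative data; only the column name "Date" comes from the question.
df = pd.DataFrame({"Date": pd.to_datetime(["2020-02-28", "2020-02-29", "2021-03-01"])})

# Boolean mask that is True only on 29 February.
is_leap_day = (df["Date"].dt.month == 2) & (df["Date"].dt.day == 29)

# Shift only the masked rows back by one day; all other rows are untouched.
df.loc[is_leap_day, "Date"] = df.loc[is_leap_day, "Date"] - pd.Timedelta(days=1)

print(df)
```

Building the mask once and reusing it on both sides keeps the assignment explicit about which rows change.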

Related

How to populate 2 columns in DataFrame with same "def" function?

I have a data frame that has a date column, and I need to create another 2 columns with the "start of week date" and "end of week date". The reason for this is that I will then need to group by an "isoweek" column, but also keep these two columns, "start_of_week_date" and "end_of_week_date".
I've created the below function:
def myfunc(dt, option):
    wkday = dt.isoweekday()
    if option == 'start':
        delta = datetime.timedelta(1 - wkday)
    elif option == 'end':
        delta = datetime.timedelta(7 - wkday)
    else:
        raise TypeError
    return dt + delta
Now I don't know how I would use the above function to populate the columns.
Probably don't even need my function to get what I need... which is... I have a DF that has the below columns
\>>> date, isoweek, qty
I will need to change it to:
\>>> isoweek, start_of_week_date, end_of_week_date, qty
this would then make my data go from 1.8 million rows to 300 thousand rows :D
can someone help me?
thank you
There may be built-in functions for this, and I can see that one of the answers proposes one.
However, if you wish to apply your own function (which is perfectly acceptable), you could use apply with a lambda.
Here is an example:
import datetime
import pandas as pd
# an example dataframe
d = {'some date': pd.to_datetime(['2021-03-01', '2021-03-03', '2021-03-06', '2021-03-07']),
     'other data': [2, 4, 6, 8]}
df = pd.DataFrame(d)
# user defined function from the question
def myfunc(dt, option):
    wkday = dt.isoweekday()
    if option == 'start':
        delta = datetime.timedelta(1 - wkday)
    elif option == 'end':
        delta = datetime.timedelta(7 - wkday)
    else:
        raise TypeError
    return dt + delta
# apply the function to each value of the date column, once per option
df['start_of_week_date'] = df['some date'].apply(lambda x: myfunc(x, 'start'))
df['end_of_week_date'] = df['some date'].apply(lambda x: myfunc(x, 'end'))
If I understand correctly, you can use dt.weekday to calculate the week start and week end. Here I've used 6 for Sunday; if your week ends on a different day, use the corresponding number.
The day of the week with Monday=0, Sunday=6
df['start_of_week_date'] = df['Date'] - pd.to_timedelta(df['Date'].dt.weekday, unit='D')
df['end_of_week_date'] = df['Date'] + pd.to_timedelta(6 - df['Date'].dt.weekday, unit='D')
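To get from the per-row dates to the grouped result the question asks for (isoweek, start_of_week_date, end_of_week_date, qty), something like the following sketch may help. The column names date, isoweek and qty come from the question; the sample rows and the since_monday and weekly names are made up for illustration, and summing qty is an assumption about the intended aggregation.

```python
import pandas as pd

# Made-up sample; column names follow the question (date, qty).
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01", "2021-03-03", "2021-03-07", "2021-03-08"]),
    "qty": [1, 2, 3, 4],
})

# Days elapsed since Monday (Monday=0 ... Sunday=6), as a timedelta.
since_monday = pd.to_timedelta(df["date"].dt.weekday, unit="D")

df["isoweek"] = df["date"].dt.isocalendar().week
df["start_of_week_date"] = df["date"] - since_monday
df["end_of_week_date"] = df["start_of_week_date"] + pd.Timedelta(days=6)

# One row per week, keeping both boundary dates and summing qty (assumed aggregation).
weekly = (
    df.groupby(["isoweek", "start_of_week_date", "end_of_week_date"], as_index=False)["qty"]
      .sum()
)
print(weekly)
```

Including the start date in the group keys also keeps the same ISO week number from different years apart.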

Filtering dataframe for previous week dates in Python

I have a dataframe like this--
Now, I have to filter this data only for the previous week dates (excluding Sat, Sun)
I wrote this Python code-
today = date.today() - timedelta(weeks = 1)
weekday = today.weekday()
prev_week_start = today - timedelta(days = weekday)
for i in range(0,5):
    prev_weekdate = today - timedelta(days = i)
    prev_weekend = prev_weekdate + timedelta(days = 4)
    next_weekstart = prev_weekdate + timedelta(days = 14)
    next_weekdateend = next_weekstart + timedelta(days = 4)
week_dates = pd.DataFrame({"LastWeekStartDate":pd.to_datetime([prev_weekdate]),
                           "LastWeekEndDate":pd.to_datetime([prev_weekend]),
                           "NextWeekStartDate":pd.to_datetime([next_weekstart]),
                           "NextWeekEndDate":pd.to_datetime([next_weekdateend])})
week_dates.head()
I am getting the correct dates for all the previous & next week start & end dates with these formulae.
Then I wrote the following code to get my data and filter it for previous week using the above dataframe - week_dates
df = pd.read_csv()
df['Date'] = pd.to_datetime(df.Date)
Now to filter rows of my dataframe - df , I wrote-
df = df[(df['Date'] >= week_dates.LastWeekStartDate & df['Date'] <= week_dates.LastWeekEndDate)][["Date","Actual Call Volume","Forecasted Call Volume"]]
I am getting an error -
TypeError: unsupported operand type(s) for &:'DatetimeArray' and
'DateTimeArray'
Please can somebody tell me where I am going wrong or is there any other way to write this code. Thanks in advance!
Use DataFrame.loc, wrap each condition in parentheses, and replace ][ with a comma so that rows and columns are selected in one call. Also use loc to select the first (scalar) value from the prev_week DataFrame:
df = df.loc[(df['Date'] >= prev_week.loc[0, 'Prev_week_start']) &
            (df['Date'] <= prev_week.loc[0, 'Prev_week_end']),
            ["Date","Actual Call Volume","Forecasted Call Volume"]]
For a simpler solution use:
Prev_week_start = date.today() - timedelta(weeks = 1)
Prev_week_end = Prev_week_start + timedelta(days = 4)
df = df.loc[(df['Date'] >= Prev_week_start) & (df['Date'] <= Prev_week_end),
["Date","Actual Call Volume","Forecasted Call Volume"]]
I found the solution to my problem. The values of column 'Date' in my dataframe were being compared with entire columns of the table built by my week_range(start) function, which is not possible. I needed a scalar value to filter my dataframe.
The simplest way to write it would be as follows:
df = df[(df['Date'] >= prev_week.Prev_week_start[0]) & (df['Date'] <= prev_week.Prev_week_end[0])][["Date","Actual Call Volume","Forecasted Call Volume"]]
I simply specified the index for Prev_week_start and Prev_week_end by adding [0].
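If the goal is specifically Monday to Friday of the previous calendar week, the two scalar bounds can be computed directly, without an intermediate DataFrame. The sketch below is one way to do that under stated assumptions: previous_week_bounds is a hypothetical helper, the sample rows are made up, and a fixed reference date is used so the result is reproducible.

```python
import pandas as pd

def previous_week_bounds(today):
    """Return (Monday, Friday) of the week before `today` as Timestamps."""
    today = pd.Timestamp(today).normalize()
    this_monday = today - pd.Timedelta(days=today.weekday())
    prev_monday = this_monday - pd.Timedelta(weeks=1)
    prev_friday = prev_monday + pd.Timedelta(days=4)
    return prev_monday, prev_friday

# Made-up data; column names follow the question.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-03-05", "2021-03-08", "2021-03-12", "2021-03-13", "2021-03-15"]),
    "Actual Call Volume": [10, 20, 30, 40, 50],
})

# Fixed reference date for reproducibility; pd.Timestamp.today() would be used in practice.
start, end = previous_week_bounds("2021-03-17")

# Both bounds are scalars, so each comparison yields a boolean Series.
last_week = df.loc[(df["Date"] >= start) & (df["Date"] <= end), ["Date", "Actual Call Volume"]]
print(last_week)
```

Using pd.Timestamp values for the bounds avoids comparing a datetime64 column against plain datetime.date objects, which recent pandas versions reject.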

Select just month and date in datetime.date

I'm working with the NYC MVA dataset. I've combined the CRASH DATE and CRASH TIME columns into a single column with format 2017-06-26 22:00:00. I'd now like to add a categorical column based on seasons. In order to do this, I'm looking to apply a mask to each season name and fill the column based on that, using the following basic structure:
df[df['CRASH TIME'].dt.date < dt.date(:,1,2)]
The problem is that the datetime date Timestamp requires a year input; the dataset spans a number of years. I would like to select all years, not any given year. In other words, I'd like to just select the month and date, and not the year. Is there a way to do that using datetime Timestamps?
Assuming you're using pandas to manipulate the data, you could do something like this:
df['day'] = df['CRASH TIME'].apply(lambda r:r.day)
df['month'] = df['CRASH TIME'].apply(lambda r:r.month)
Then you can combine them or work with them just as they are.
I'm not sure there's a way to directly compare only part of a date, but you can extract the month and day into a tuple and compare them that way:
month_day_left = (df['CRASH TIME'].dt.month, df['CRASH TIME'].dt.day)
month_day_right = (dt.month, dt.day)  # dt is the reference date here
Python compares tuples element by element, so for a single pair of values:
(2, 1) < (2, 2) # True
(1, 10) < (2, 1) # True
(2, 1) < (1, 30) # False
You can then wrap this comparison in a custom function and use it like this:
df[ is_earlier(df['CRASH TIME'].dt, dt)]
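The answer names is_earlier without defining it, so the following is only one possible sketch, with an assumed signature that takes the reference month and day as integers. Because tuples of whole Series cannot be compared directly, it encodes each (month, day) pair as month * 100 + day, which sorts in the same order as the tuple. The sample timestamps are made up.

```python
import pandas as pd

def is_earlier(dates, month, day):
    """True where the (month, day) of each date falls before (month, day), ignoring the year."""
    # month * 100 + day orders the same way as the tuple (month, day).
    return dates.dt.month * 100 + dates.dt.day < month * 100 + day

# Made-up timestamps spanning several years.
df = pd.DataFrame({"CRASH TIME": pd.to_datetime(
    ["2015-01-15 08:00", "2017-06-26 22:00", "2019-03-19 12:30", "2018-03-20 00:00"])})

# Rows whose month/day is before 20 March, regardless of year.
before_spring = df[is_earlier(df["CRASH TIME"], 3, 20)]
print(before_spring)
```

Masks built this way can be combined with & and | to label each season in turn.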

Creating a company week number & year in pandas

assume that we have the following df
import pandas as pd
data = {'Dates' : ['2018-10-15', '2018-02-01', '2018-04-01']}
data['Dates'] = pd.to_datetime(data.Dates)
print(df)
Dates
0 2018-10-15
1 2018-02-01
2 2018-04-01
In my current company, we have a financial week structure which I normally work out in Excel, and I'd like to do this in Python.
I use the DateTime module to work around my conditions which are as follows
if the month is >= 4 (April) the Week number is 1 (so I take the ISO week number and subtract 13)
if the month is < 4 I add 39.
I use the same logic for the YEAR if >= 4 then year + 1 else YEAR
I thought I could use a simple for loop that I could use over my dataframe
for x in data.Dates:
    if x.dt.month >= 4:
        df['Week'] = x.dt.week - 13
    else:
        df['Week'] = x.dt.week + 39
and for the year
for x in data.Dates:
    if x.dt.month >= 4:
        df['Year'] = FY & x.dt.year + 1
    else:
        df['Year'] = FY & x.dt.year
however, the >= 4 on both throws formula error.
File "<ipython-input-38-eadb99fdd9db>", line 4
df.Dates.dt.month > 4:
^
SyntaxError: invalid syntax
however, if I do
data['Week'] = data.Dates.dt.week
this gives all the week numbers, am I missing something basic or essential here?
I hope this is clear and concise, any advice (even how to ask better questions) is appreciated.
Don't use an explicit loop
Pandas specialises in vectorised operations. There's no need for a for loop. You can use, for example, numpy.where to create a series conditionally:
import numpy as np
data['Week'] = np.where(data['Dates'].dt.month >= 4, data['Dates'].dt.week - 13,
data['Dates'].dt.week + 39)
The reason your code doesn't work is that you are updating an entire series on each iteration rather than individual elements of the series. In other words, you are applying elementwise logic to a whole series.
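A self-contained version of the vectorised approach, covering both the week and the year rule from the question, might look like the sketch below. It assumes a recent pandas, where the .dt.week accessor is no longer available and .dt.isocalendar().week is used instead. The iso_week and from_april names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Dates": pd.to_datetime(["2018-10-15", "2018-02-01", "2018-04-01"])})

# ISO week number; .dt.isocalendar() replaces the older .dt.week accessor.
iso_week = df["Dates"].dt.isocalendar().week.astype(int)
from_april = df["Dates"].dt.month >= 4

# Rules from the question: subtract 13 from April onwards, otherwise add 39.
df["Week"] = np.where(from_april, iso_week - 13, iso_week + 39)
# Financial year: calendar year + 1 from April onwards, otherwise the calendar year.
df["Year"] = np.where(from_april, df["Dates"].dt.year + 1, df["Dates"].dt.year)
print(df)
```

Under the stated rule, 2018-04-01 (ISO week 13) maps to week 0. Whether that is intended depends on the company calendar, which the question does not specify.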
The issue arises because you are iterating through the values in df['Dates'], which are Timestamp objects. This is equivalent to going through df['Dates'][0], df['Dates'][1], and so on to extract the feature of interest. To extract a particular date-related feature such as month, day, or week from a single value, you can simply read the attribute:
df['Dates'][0].month
On the other hand, df['Dates'] in itself is a pandas timestamp Series object. To extract these date-related features from the entire Series, you would have to use something like:
df['Dates'].dt.month
This is similar to the functioning of a "string" Series object, where you have to call pd.Series.str.<method>, to perform the requisite string operation (such as extract, contains, get, etc) on the entire Series object.
The syntax error does not come from here but try to remove the 'dt' in your for loops:
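import pandas as pd
df = pd.DataFrame()
df['Dates'] = pd.to_datetime(['2018-10-15', '2018-02-01', '2018-04-01'])
# assign per row with .loc; df['Week'] = ... would overwrite the whole column
for i, x in df.Dates.items():
    if x.month >= 4:
        df.loc[i, 'Week'] = x.week - 13
    else:
        df.loc[i, 'Week'] = x.week + 39
# assumes FY is meant as a text prefix, as in Excel's FY & year
for i, x in df.Dates.items():
    if x.month >= 4:
        df.loc[i, 'Year'] = 'FY' + str(x.year + 1)
    else:
        df.loc[i, 'Year'] = 'FY' + str(x.year)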
The question is a bit confusing due to the use of both 'data' and 'df'. I hope I didn't misinterpret it.
If it does not work can you post the whole code so I can try it?
You are almost there, just drop dt like so:
for x in data.Dates:
    if x.month >= 4:
        df['Year'] = FY & x.year + 1
    else:
        df['Year'] = FY & x.year
Try this
def my_f(x):
    if x.month >= 4:
        return x.week - 13
    else:
        return x.week + 39
df['Week'] = df.Dates.apply(my_f)

Pandas dataframe get next (trading) day in dataframe

I have a date given that may or may not be a trading day, and I have a pandas dataframe indexed by trading days that has returns of each trading day.
This is my date
dt_query = datetime.datetime(2006, 12, 31, 16)
And I want to do something like this (returns is a pandas dataframe)
returns.ix[pd.Timestamp(dt_query + datetime.timedelta(days = 1))]
However, that may not work as one day ahead may or may not be a trading day. I could make a try block that loops and tries until we find something, but I'm wondering if there's a more elegant way that just uses pandas.
This might not be the most elegant solution, but it works.
Here's the idea: within some number of calendar days (say 10) after any date dt_query, there must be at least one trading day, and your next trading day is just the first of them. So you can find all days in returns between dt_query and dt_query + timedelta(days = 10), and then take the first one.
Using your example, it should look like
next_trading_date = returns.index[(returns.index > dt_query) & (returns.index <= dt_query + timedelta(days = 10))][0]
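An alternative that avoids picking a window size is to look up the insertion position of the query date in the sorted index. The sketch below assumes returns is indexed by a sorted DatetimeIndex. The sample values are made up, and a Series stands in for the DataFrame from the question.

```python
import datetime
import pandas as pd

# Made-up returns indexed by trading days (the weekend and New Year's Day are missing).
returns = pd.Series(
    [0.01, -0.02, 0.005],
    index=pd.to_datetime(["2006-12-29", "2007-01-03", "2007-01-04"]),
)

dt_query = datetime.datetime(2006, 12, 31, 16)

# Position of the first index entry strictly after dt_query (requires a sorted index).
pos = returns.index.searchsorted(pd.Timestamp(dt_query), side="right")
if pos < len(returns.index):
    next_trading_date = returns.index[pos]
    print(next_trading_date, returns.iloc[pos])
else:
    print("No trading day after", dt_query)
```

With side="right", a query that exactly matches a trading day still resolves to the following one; side="left" would return the matching day itself.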
You can compute the timedelta for the whole column with:
delta = returns.column - dt_query
then use np.timedelta64() to define a tolerance for selecting rows:
tol = np.timedelta64(2, 'D')
and:
returns[delta < tol]
will return the rows within the desired range...
Thank you! That's been plaguing me for hours.
I altered it a bit:
try:
    date_check = dja[start_day]
except KeyError:
    print("Start date not a trading day, fetching next trading day...")
    test = dja.index.searchsorted(start_day)
    next_date = dja.index[(dja.index > start_day)]
    start_date = next_date[0]
    print("New date:", start_date)
