How to populate 2 columns in DataFrame with same "def" function? - python

I have a data frame with a date column, and I need to create two more columns holding the "start of week date" and "end of week date". The reason is that I will then need to group by an "isoweek" column while also keeping the two columns "start_of_week_date" and "end_of_week_date".
I've created the function below:
import datetime
def myfunc(dt, option):
    wkday = dt.isoweekday()  # Monday=1 ... Sunday=7
    if option == 'start':
        delta = datetime.timedelta(1 - wkday)
    elif option == 'end':
        delta = datetime.timedelta(7 - wkday)
    else:
        raise TypeError
    return dt + delta
Now I don't know how to use the function above to populate the columns.
I probably don't even need my function to get what I need. I have a DataFrame with the columns below:
>>> date, isoweek, qty
I need to change it to:
>>> isoweek, start_of_week_date, end_of_week_date, qty
This would take my data from 1.8 million rows to 300 thousand rows :D
Can someone help me?
Thank you

There might be built-in functions that do this, and one of the other answers proposes one.
However, if you wish to apply your own function (which is perfectly acceptable), you can use apply with a lambda.
Here is an example:
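import datetime
import pandas as pd
# an example dataframe with a real datetime column
d = {'some date': pd.to_datetime(['2021-01-04', '2021-01-06', '2021-01-10', '2021-01-13']),
     'other data': [2, 4, 6, 8]}
df = pd.DataFrame(d)
# user-defined function from the question (returns dt, not the undefined name date)
def myfunc(dt, option):
    wkday = dt.isoweekday()
    if option == 'start':
        delta = datetime.timedelta(1 - wkday)
    elif option == 'end':
        delta = datetime.timedelta(7 - wkday)
    else:
        raise TypeError
    return dt + delta
# apply the function to each value of the date column, once per option
df['start_of_week_date'] = df['some date'].apply(lambda x: myfunc(x, 'start'))
df['end_of_week_date'] = df['some date'].apply(lambda x: myfunc(x, 'end'))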

I hope I understood correctly. You can use dt.weekday to calculate the week start and week end. Here I've used 6 for Sunday; if your week ends on another day, use the corresponding number.
dt.weekday returns the day of the week with Monday=0, Sunday=6.
df['start_of_week_date'] = df['Date'] - pd.to_timedelta(df['Date'].dt.weekday, unit='D')
df['end_of_week_date'] = df['Date'] + pd.to_timedelta(6 - df['Date'].dt.weekday, unit='D')
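The aggregation the question asks for (one row per week, keeping the week's start and end dates) could then look roughly like the sketch below. The sample rows are made up, the column names follow the question, and Monday-to-Sunday ISO weeks are assumed.

```python
import pandas as pd

# Made-up sample data using the question's column names.
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-04", "2021-01-06", "2021-01-10", "2021-01-11"]),
    "qty": [1, 2, 3, 4],
})

weekday = df["date"].dt.weekday  # Monday=0 ... Sunday=6
df["start_of_week_date"] = df["date"] - pd.to_timedelta(weekday, unit="D")
df["end_of_week_date"] = df["start_of_week_date"] + pd.Timedelta(days=6)
df["isoweek"] = df["date"].dt.isocalendar().week

# One row per week; the start and end dates are constant within a week,
# so adding them to the grouping keys keeps them in the result.
weekly = (
    df.groupby(["isoweek", "start_of_week_date", "end_of_week_date"], as_index=False)["qty"]
      .sum()
)
print(weekly)
```

Because the start date is part of the grouping key, the same ISO week number in two different years ends up in separate rows.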

Related

Choose specific calendar dates in python

I am trying to convert a column of dates from MonthYear form to mm/dd/yyyy. I can do it with string replacement, but that takes 157 lines of code to change all the data. I want to take the month and year and output the second Wednesday of the month in mm/dd/yyyy form. Is that possible?
I am currently using this code:
df['Column']=df['Column'].str.replace("December2009", "12/11/2009")
I don't know of a standard library tool for this, but it's easy to write your own, something like this:
from datetime import datetime, timedelta
import pandas as pd
test_arr = ['December2009', 'August2012', 'March2015']
def replacer(d):
    # take a datestring of format %B%Y and find the second wednesday
    dt = datetime.strptime(d, '%B%Y')
    x = 0
    # start at day 1 and increment through until conditions satisfied
    while True:
        s = dt.strftime('%A')
        if s == 'Wednesday':
            x += 1  # if a wednesday found, increment the counter
        if x == 2:
            break  # when two wednesdays found then break
        dt += timedelta(days=1)
    return dt.strftime('%m/%d/%Y')
df = pd.DataFrame(test_arr, columns=['a'])
df['a'].apply(replacer)  # .apply() applies the given python function to each element in the df column
Maybe the calendar module, as recommended in the other comments, could make the code look nicer, but I'm unfamiliar with it, so it might be something to look into to improve the solution.
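For reference, a calendar-based version might look like the sketch below. The function name second_wednesday is made up for illustration, and parsing with %B assumes an English locale.

```python
import calendar
from datetime import datetime

def second_wednesday(month_year):
    """Return the second Wednesday of a 'MonthYear' string as mm/dd/yyyy."""
    dt = datetime.strptime(month_year, "%B%Y")
    # monthcalendar returns one list per week (Monday first); days that fall
    # outside the month are 0, so they are filtered out here.
    weeks = calendar.monthcalendar(dt.year, dt.month)
    wednesdays = [week[calendar.WEDNESDAY] for week in weeks
                  if week[calendar.WEDNESDAY] != 0]
    return dt.replace(day=wednesdays[1]).strftime("%m/%d/%Y")

converted = [second_wednesday(s) for s in ["December2009", "August2012", "March2015"]]
print(converted)
```

It can be passed to df['a'].apply in the same way as replacer.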

Pandas select last friday of each month [duplicate]

I've written this function to get the last Thursday of the month
def last_thurs_date(date):
    month=date.dt.month
    year=date.dt.year
    cal = calendar.monthcalendar(year, month)
    last_thurs_date = cal[4][4]
    if month < 10:
        thurday_date = str(year)+'-0'+ str(month)+'-' + str(last_thurs_date)
    else:
        thurday_date = str(year) + '-' + str(month) + '-' + str(last_thurs_date)
    return thurday_date
But it's not working with the lambda function:
datelist['Date'].map(lambda x: last_thurs_date(x))
Where datelist is
datelist = pd.DataFrame(pd.date_range(start = pd.to_datetime('01-01-2014',format='%d-%m-%Y')
, end = pd.to_datetime('06-03-2019',format='%d-%m-%Y'),freq='D').tolist()).rename(columns={0:'Date'})
datelist['Date']=pd.to_datetime(datelist['Date'])
Jpp already posted the solution; this just adds a slightly more readable formatted string (see this awesome website).
import calendar
def last_thurs_date(date):
    year, month = date.year, date.month
    cal = calendar.monthcalendar(year, month)
    # rows are weeks (Monday first), so column 4 holds the Fridays;
    # use column 3 (calendar.THURSDAY) if Thursdays are wanted.
    # A month spans 4 to 6 rows and padding days are 0, so take the
    # largest day number in the column instead of a fixed row
    last_thurs_date = max(week[4] for week in cal)
    return f'{year}-{month:02d}-{last_thurs_date}'
I also added a bit of logic: the original returned 2019-02-0, because the fifth calendar row of February 2019 has no entry in that column.
Scalar datetime objects don't have a dt accessor, series do: see pd.Series.dt. If you remove this, your function works fine. The key is understanding that pd.Series.apply passes scalars to your custom function via a loop, not an entire series.
def last_thurs_date(date):
    month = date.month
    year = date.year
    cal = calendar.monthcalendar(year, month)
    last_thurs_date = cal[4][4]
    if month < 10:
        thurday_date = str(year)+'-0'+ str(month)+'-' + str(last_thurs_date)
    else:
        thurday_date = str(year) + '-' + str(month) + '-' + str(last_thurs_date)
    return thurday_date
You can rewrite your logic more succinctly via f-strings (Python 3.6+) and a conditional (ternary) expression:
def last_thurs_date(date):
    month = date.month
    year = date.year
    last_thurs_date = calendar.monthcalendar(year, month)[4][4]
    return f'{year}{"-0" if month < 10 else "-"}{month}-{last_thurs_date}'
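The scalar-versus-Series distinction described in this answer can be checked with a small sketch. The two sample dates are arbitrary.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2019-02-14", "2019-03-28"]))

# Whole Series: datetime fields are reached through the .dt accessor.
months_vectorised = dates.dt.month

# Element by element: map/apply hand the function one pd.Timestamp at a time,
# and a Timestamp exposes .month directly.
months_elementwise = dates.map(lambda ts: ts.month)

# A single Timestamp has no .dt attribute, which is why date.dt.month fails
# inside a function called through map.
has_dt = hasattr(dates.iloc[0], "dt")
print(months_vectorised.tolist(), months_elementwise.tolist(), has_dt)
```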
I know a lot of time has passed since this was posted, but I think it is worth adding another option in case someone comes across this thread.
Even though I use pandas every day at work, in this case my suggestion would be to just use the dateutil library. The solution is a simple one-liner, without unnecessary complications.
from dateutil.rrule import rrule, MONTHLY, FR, SA
from datetime import datetime as dt
import pandas as pd
# monthly options expiration dates (third Friday of each month) calculated for 2022
monthly_options = list(rrule(MONTHLY, count=12, byweekday=FR, bysetpos=3, dtstart=dt(2022,1,1)))
# last Saturday of each month
last_saturdays = list(rrule(MONTHLY, count=12, byweekday=SA, bysetpos=-1, dtstart=dt(2022,1,1)))
and then of course:
pd.DataFrame({'LAST_ST': last_saturdays})  # or whatever you need
This answers the question "Calculate Last Friday of Month in Pandas".
It can be adapted to another day of the week by changing the frequency, here freq='W-FRI'.
I think the easiest way is to create a pandas.DataFrame using pandas.date_range and specifying freq='W-FRI'.
W-FRI means weekly on Fridays.
pd.date_range(df.Date.min(), df.Date.max(), freq='W-FRI')
This creates all the Fridays in the date range between the min and max of the dates in df.
Use a .groupby on year and month, and select .last(), to get the last Friday of every month for every year in the date range.
Because this method finds all the Fridays for every month in the range and then chooses .last() for each month, there is no need to work out which week of the month contains the last Friday.
With this, use pandas boolean indexing to find values in the Date column of the dataframe that are in last_fridays_in_daterange.
Use the .isin method to determine containment.
See also: pandas DateOffset objects.
import pandas as pd
# test data: given a dataframe with a datetime column
df = pd.DataFrame({'Date': pd.date_range(start=pd.to_datetime('2014-01-01'), end=pd.to_datetime('2020-08-31'), freq='D')})
# create a dataframe with all Fridays in the date range between the min and max of df.Date
fridays = pd.DataFrame({'datetime': pd.date_range(df.Date.min(), df.Date.max(), freq='W-FRI')})
# use groupby and last to get the last Friday of each month into a list
last_fridays_in_daterange = fridays.groupby([fridays.datetime.dt.year, fridays.datetime.dt.month]).last()['datetime'].tolist()
# find the data for the last Friday of the month
df[df.Date.isin(last_fridays_in_daterange)]
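As a possible alternative to the groupby step, pandas also has a dedicated offset, pd.offsets.LastWeekOfMonth, that can generate the last Fridays directly. This is a sketch and not part of the original answer; the three-month range is only there to keep the output short.

```python
import pandas as pd

# Test data: one row per day for the first quarter of 2019.
df = pd.DataFrame({"Date": pd.date_range("2019-01-01", "2019-03-31", freq="D")})

# LastWeekOfMonth(weekday=4) steps through the last Friday of each month
# (weekday uses Monday=0, so 4 is Friday).
last_fridays = pd.date_range(
    df["Date"].min(),
    df["Date"].max(),
    freq=pd.offsets.LastWeekOfMonth(weekday=4),
)

result = df[df["Date"].isin(last_fridays)]
print(result)
```

One difference from the groupby approach: if the data stops before the last Friday of the final month, this version returns nothing for that month, whereas groupby(...).last() returns the last Friday that falls inside the range.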

Creating a company week number & year in pandas

Assume that we have the following df:
import pandas as pd
data = {'Dates' : ['2018-10-15', '2018-02-01', '2018-04-01']}
data['Dates'] = pd.to_datetime(data.Dates)
print(df)
Dates
0 2018-10-15
1 2018-02-01
2 2018-04-01
In my current company, we have a financial week structure which I normally work out in Excel, and I'd like to do this in Python.
I use the datetime module to work through my conditions, which are as follows:
If the month is >= 4 (April), the week numbering restarts at 1, so I take the ISO week number and subtract 13.
If the month is < 4, I add 39.
I use the same logic for the year: if the month is >= 4 then year + 1, else the year as it is.
I thought I could use a simple for loop over my dataframe:
for x in data.Dates:
    if x.dt.month >= 4:
        df['Week'] = x.dt.week - 13
    else:
        df['Week'] = x.dt.week + 39
and for the year
for x in data.Dates:
    if x.dt.month >= 4:
        df['Year'] = FY & x.dt.year + 1
    else:
        df['Year'] = FY & x.dt.year
However, the >= 4 in both loops throws a syntax error:
File "<ipython-input-38-eadb99fdd9db>", line 4
df.Dates.dt.month > 4:
^
SyntaxError: invalid syntax
however, if I do
data['Week'] = data.Dates.dt.week
this gives all the week numbers, am I missing something basic or essential here?
I hope this is clear and concise, any advice (even how to ask better questions) is appreciated.
Don't use an explicit loop
Pandas specialises in vectorised operations. There's no need for a for loop. You can use, for example, numpy.where to create a series conditionally:
import numpy as np
data['Week'] = np.where(data['Dates'].dt.month >= 4, data['Dates'].dt.week - 13,
                        data['Dates'].dt.week + 39)
The reason your code doesn't work is that each loop iteration assigns to an entire column rather than to a single element. In other words, you are applying element-wise logic to a whole series.
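A sketch of the same vectorised idea for both the week and the year follows. It assumes a recent pandas, where Series.dt.week is no longer available and the ISO week comes from Series.dt.isocalendar(), and it applies the question's rule literally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Dates": pd.to_datetime(["2018-10-15", "2018-02-01", "2018-04-01"])})

# ISO week number as plain integers (isocalendar() returns a UInt32 column).
iso_week = df["Dates"].dt.isocalendar().week.astype("int64")
in_new_fy = df["Dates"].dt.month >= 4  # April onwards belongs to the next financial year

df["Week"] = np.where(in_new_fy, iso_week - 13, iso_week + 39)
df["Year"] = np.where(in_new_fy, df["Dates"].dt.year + 1, df["Dates"].dt.year)
print(df)
```

Note that under this rule 2018-04-01, which still falls in ISO week 13, comes out as week 0, so the boundary may need extra handling depending on how the company calendar treats it.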
The issue arises because you are iterating through the values in df['Dates'], which are Timestamp objects. This is equivalent to going through df['Dates'][0], df['Dates'][1], and so on, to extract the feature of interest. To extract a particular date-related feature such as month, day, or week from one of these values, you can simply read the attribute as follows:
df['Dates'][0].month
On the other hand, df['Dates'] in itself is a pandas timestamp Series object. To extract these date-related features from the entire Series, you would have to use something like:
df['Dates'].dt.month
This is similar to the functioning of a "string" Series object, where you have to call pd.Series.str.<method>, to perform the requisite string operation (such as extract, contains, get, etc) on the entire Series object.
The syntax error does not come from here, but try removing the 'dt' in your for loops:
import pandas as pd
df = pd.DataFrame()
df['Dates'] = pd.to_datetime(['2018-10-15', '2018-02-01', '2018-04-01'])
for x in df.Dates:
    if x.month >= 4:
        df['Week'] = x.week - 13
    else:
        df['Week'] = x.week + 39
# FY is taken from the question and must be defined elsewhere
for x in df.Dates:
    if x.month >= 4:
        df['Year'] = FY & x.year + 1
    else:
        df['Year'] = FY & x.year
The question is a bit confusing due to the use of both 'data' and 'df'. I hope I didn't misinterpret it.
If it does not work, can you post the whole code so I can try it?
You are almost there, just drop dt like so:
for x in data.Dates:
    if x.month >= 4:
        df['Year'] = FY & x.year + 1
    else:
        df['Year'] = FY & x.year
Try this:
def my_f(x):
    if x.month >= 4:
        return x.week - 13
    else:
        return x.week + 39
df['Week'] = df.Dates.apply(lambda x: my_f(x))

Mapping Values in a pandas Dataframe column?

I am trying to filter some data and am running into errors.
Below is a copy of the code I have:
url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"
source = requests.get(url).text
s = StringIO(source)
election_data = pd.DataFrame.from_csv(s, index_col=None).convert_objects(
    convert_dates="coerce", convert_numeric=True)
election_data.head(n=3)
last_day = max(election_data["Start Date"])
filtered = election_data[((last_day-election_data['Start Date']).days <= 5)]
As you can see, last_day is the max of the "Start Date" column of election_data.
I would like to keep only the rows where the difference between the max and the row's Start Date is less than or equal to 5 days.
I have tried using for loops and various combinations of list comprehensions.
filtered = election_data[map(lambda x: (last_day - x).days <= 5, election_data["Start Date"]) ]
This line would normally work; however, Python 3 gives me the following error:
<map object at 0x10798a2b0>
Your first attempt is almost right. The issue is
(last_day - election_data['Start Date']).days
which should instead be
(last_day - election_data['Start Date']).dt.days
A Series has no days attribute; scalar Timedelta and TimedeltaIndex objects do, and a timedelta Series exposes it through the .dt accessor. A fully working example is below.
data = pd.read_csv(url, parse_dates=['Start Date', 'End Date', 'Entry Date/Time (ET)'])
data.loc[(data['Start Date'].max() - data['Start Date']).dt.days <= 5]
Note that I've used Series.max which is more performant than the built-in max. Also, data.loc[mask] is slightly faster than data[mask] since it is less-overloaded (has a more specialized use case).
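Since the original URL may no longer be reachable, the same filter can be tried on a small made-up frame. In this sketch the "Obama" column and all values are invented for illustration.

```python
import pandas as pd

# Made-up stand-in for the poll data.
election_data = pd.DataFrame({
    "Start Date": pd.to_datetime(["2012-10-25", "2012-10-31", "2012-11-03", "2012-11-05"]),
    "Obama": [47, 48, 49, 50],
})

last_day = election_data["Start Date"].max()
age = last_day - election_data["Start Date"]  # a timedelta Series

# Either compare whole days through the .dt accessor ...
recent_by_days = election_data.loc[age.dt.days <= 5]
# ... or compare the timedeltas against a Timedelta directly.
recent_by_timedelta = election_data.loc[age <= pd.Timedelta(days=5)]
print(recent_by_days)
```

Both masks select the same rows here; the second form also respects the time of day if Start Date carries one.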
If I understand your question correctly, you want to keep only the rows whose Start Date is at most 5 days before the last day. This is something pandas boolean indexing with .loc can handle easily.
If you want an entirely new DataFrame object with the filtered data:
# election_data is your frame
last_day = election_data["Start Date"].max()
max_age = pd.Timedelta(days=5)  # keep rows at most 5 days before the last day
new_df = election_data.loc[last_day - election_data["Start Date"] <= max_age]
Or if you just want the Start Date column post-filtering:
last_day = election_data["Start Date"].max()
max_age = pd.Timedelta(days=5)
filtered_dates = election_data.loc[last_day - election_data["Start Date"] <= max_age, "Start Date"]
Note that the difference between two datetime values is a timedelta, so the threshold has to be a timedelta (here pd.Timedelta(days=5)) rather than a date, and Start Date has to be parsed as datetime first. If you are unsure what the cutoff is, print(last_day) and count 5 days back.

Pandas - Python, deleting rows based on Date column

I'm trying to delete rows of a dataframe based on one date column, [Delivery Date].
I need to delete rows which are more than 6 months old, unless their year is 1970.
I've created 2 variables:
from datetime import date, timedelta
sixmonthago = date.today() - timedelta(188)
import time
nineteen_seventy = time.strptime('01-01-70', '%d-%m-%y')
but I don't know how to delete rows based on these two variables, using the [Delivery Date] column.
Could anyone provide the correct solution?
You can just filter them out:
df[(df['Delivery Date'].dt.year == 1970) | (df['Delivery Date'] >= pd.Timestamp(sixmonthago))]
This returns all rows where the year is 1970 or the date is less than 6 months old. Wrapping sixmonthago in pd.Timestamp avoids comparing a datetime column with a plain datetime.date.
You can use boolean indexing and pass multiple conditions to filter the df. For multiple conditions you need the element-wise operators, so | instead of or, with parentheses around each condition because of operator precedence.
Check the docs for an explanation of boolean indexing.
Be sure the calculation itself is accurate for "6 months" prior. You may not want to be hardcoding in 188 days. Not all months are made equally.
from datetime import date
from dateutil.relativedelta import relativedelta
#http://stackoverflow.com/questions/546321/how-do-i-calculate-the-date-six-months-from-the-current-date-using-the-datetime
six_months = date.today() - relativedelta( months = +6 )
Then you can apply the following logic.
import time
nineteen_seventy = time.strptime('01-01-70', '%d-%m-%y')
df = df[(df['Delivery Date'].dt.year == nineteen_seventy.tm_year) | (df['Delivery Date'] >= pd.Timestamp(six_months))]
If you would rather drop the unwanted rows explicitly, select the rows that are not from 1970 and are older than six months, and pass their index to drop:
df = df.drop(df[(df['Delivery Date'].dt.year != nineteen_seventy.tm_year) & (df['Delivery Date'] < pd.Timestamp(six_months))].index)
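The keep-versus-drop logic can be checked on a small made-up frame. This sketch swaps in pd.DateOffset(months=6) for relativedelta and fixes "today" to an arbitrary date so the result is reproducible; the "order" column and all values are invented.

```python
import pandas as pd

today = pd.Timestamp("2020-07-15")  # fixed instead of date.today() for reproducibility
six_months_ago = today - pd.DateOffset(months=6)

df = pd.DataFrame({
    "Delivery Date": pd.to_datetime(["1970-01-01", "2019-03-10", "2020-01-15", "2020-06-30"]),
    "order": ["a", "b", "c", "d"],
})

is_1970 = df["Delivery Date"].dt.year == 1970
is_recent = df["Delivery Date"] >= six_months_ago

# Keep rows that are from 1970 or at most six months old ...
kept = df[is_1970 | is_recent]
# ... which is the same as dropping rows that are neither.
dropped = df.drop(df[~is_1970 & ~is_recent].index)
print(kept)
```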
