Assume that we have the following df:
import pandas as pd
data = pd.DataFrame({'Dates': ['2018-10-15', '2018-02-01', '2018-04-01']})
data['Dates'] = pd.to_datetime(data['Dates'])
print(data)
Dates
0 2018-10-15
1 2018-02-01
2 2018-04-01
In my current company, we have a financial week structure which I normally work out in Excel, and I'd like to do this in Python.
I use the datetime module to work through my conditions, which are as follows:
if the month is >= 4 (April), the week count restarts at 1 (so I take the ISO week number and subtract 13);
if the month is < 4, I add 39.
I use the same logic for the year: if the month is >= 4, then year + 1, else year.
I thought I could use a simple for loop over my dataframe:
for x in data.Dates:
    if x.dt.month >= 4:
        df['Week'] = x.dt.week - 13
    else:
        df['Week'] = x.dt.week + 39
and for the year
for x in data.Dates:
    if x.dt.month >= 4:
        df['Year'] = FY & x.dt.year + 1
    else:
        df['Year'] = FY & x.dt.year
However, the >= 4 comparison in both throws an error:
File "<ipython-input-38-eadb99fdd9db>", line 4
df.Dates.dt.month > 4:
^
SyntaxError: invalid syntax
However, if I do
data['Week'] = data.Dates.dt.week
this gives all the week numbers. Am I missing something basic or essential here?
I hope this is clear and concise; any advice (even on how to ask better questions) is appreciated.
Don't use an explicit loop
Pandas specialises in vectorised operations. There's no need for a for loop. You can use, for example, numpy.where to create a series conditionally:
import numpy as np
data['Week'] = np.where(data['Dates'].dt.month >= 4,
                        data['Dates'].dt.week - 13,
                        data['Dates'].dt.week + 39)
(In pandas >= 2.0, Series.dt.week has been removed; use data['Dates'].dt.isocalendar().week instead.)
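The fiscal year column can be built the same way. Assuming the FY in the question's formula is meant as a literal string prefix (Excel's & operator is string concatenation), a sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Dates': pd.to_datetime(['2018-10-15', '2018-02-01', '2018-04-01'])})

year = df['Dates'].dt.year
# April onwards belongs to the fiscal year labelled year + 1, mirroring the
# question's rule; adjust the labelling to your own convention if needed.
fy = pd.Series(np.where(df['Dates'].dt.month >= 4, year + 1, year), index=df.index)
df['Year'] = 'FY' + fy.astype(str)
```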
The reason your code doesn't work is that on each loop iteration you assign a single element's result to the entire column, so the column ends up holding only the last element's value. In other words, you are applying elementwise logic to a whole series.
The issue arises because you are iterating through the values in df['Dates'], which are Timestamp objects. This is equivalent to going through df['Dates'][0], df['Dates'][1], ... to extract the feature of interest. To extract a particular date-related feature such as month, day, or week from a single Timestamp, you simply access the attribute:
df['Dates'][0].month
On the other hand, df['Dates'] itself is a pandas datetime Series. To extract these date-related features from the entire Series, you have to go through the .dt accessor:
df['Dates'].dt.month
This mirrors how a string Series works, where you call pd.Series.str.<method> to perform a string operation (such as extract, contains, or get) on the entire Series.
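A minimal, self-contained illustration of the two access patterns described above:

```python
import pandas as pd

df = pd.DataFrame({'Dates': pd.to_datetime(['2018-10-15', '2018-02-01'])})

# Element-wise: each value is a Timestamp, so the attribute is read directly.
first_month = df['Dates'][0].month

# Series-wise: go through the .dt accessor to get a whole Series of months.
all_months = df['Dates'].dt.month
```

Here first_month is the integer 10, while all_months is a Series holding 10 and 2.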
The syntax error does not come from there, but try removing the .dt in your for loops; each x is a single Timestamp, so its attributes are accessed directly (and assign per row, otherwise each iteration overwrites the whole column):
import pandas as pd
df = pd.DataFrame()
df['Dates'] = pd.to_datetime(['2018-10-15', '2018-02-01', '2018-04-01'])
for i, x in enumerate(df.Dates):
    if x.month >= 4:
        df.loc[i, 'Week'] = x.week - 13
    else:
        df.loc[i, 'Week'] = x.week + 39
for i, x in enumerate(df.Dates):
    if x.month >= 4:
        df.loc[i, 'Year'] = 'FY' + str(x.year + 1)  # Excel's & becomes + on strings
    else:
        df.loc[i, 'Year'] = 'FY' + str(x.year)
The question is a bit confusing due to the use of 'data' and 'df'; I hope I didn't misinterpret it.
If it does not work can you post the whole code so I can try it?
You are almost there; just drop the dt (and note that Excel's & concatenation becomes + on strings in Python), like so:
for x in data.Dates:
    if x.month >= 4:
        df['Year'] = 'FY' + str(x.year + 1)
    else:
        df['Year'] = 'FY' + str(x.year)
however, if I do
data['Week'] = data.Dates.dt.week
this gives all the week numbers, am I missing something basic or essential here?
Try this:
def my_f(x):
    if x.month >= 4:
        return x.week - 13
    else:
        return x.week + 39

df['Week'] = df.Dates.apply(my_f)
Related
In a Pandas dataframe I need to change all leap-day cells in a specific column to 28 Feb. So, for example, 2020/02/29 should become 2020/02/28.
I tried the following, but it didn't work:
df.loc[((df['Date'].dt.month == 2) & (df['Date'].dt.day == 29)), 'Date'] = df['Date'] + timedelta-(1)
Any ideas?
Thanks
You can use np.where:
df["Date"] = np.where((df["Date"].dt.month == 2) &
                      (df["Date"].dt.day == 29),
                      df["Date"] - pd.DateOffset(days=1),
                      df["Date"])
If you look closely, you have written + timedelta-(1), which raises an error. Your method works fine if you instead write + timedelta(-1).
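Putting both points together, a sketch of the corrected .loc approach (the example dates are illustrative); the assignment aligns on the index, so only the masked rows change:

```python
import pandas as pd
from datetime import timedelta

df = pd.DataFrame({'Date': pd.to_datetime(['2020-02-29', '2020-03-01'])})

leap_day = (df['Date'].dt.month == 2) & (df['Date'].dt.day == 29)
# Only the rows where leap_day is True are overwritten.
df.loc[leap_day, 'Date'] = df['Date'] + timedelta(-1)
```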
This must be really simple, but I can't find the answer: I have a dataframe with a date, and I just need to add a financial year column (1st April - 31st March); e.g. 15/5/19 would return 2019 or 2019-20.
I could probably do it with a for loop, but I'm guessing there is a much better way within the dataframe?
if base_date.month > 3:
    fin_year = base_date.year
else:
    fin_year = base_date.year - 1
Actually, there's a function for that:
s.dt.to_period('Q-MAR').dt.qyear - 1
('Q-MAR' anchors quarters to fiscal years ending in March; qyear labels each period by the year its fiscal year ends in, so subtract 1 to label it by the April in which it starts.)
This will work, assuming your date column is base_date and is a datetime object:
df['financial_year'] = df['base_date'].map(lambda x: x.year if x.month > 3 else x.year-1)
For the Australian financial year (July-June), which is labelled by its ending year:
df['financial_year'] = df['base_date'].map(lambda d: d.year + 1 if d.month > 6 else d.year)
Timestamps can be cast to annual periods with to_period, which lets you specify the month at which years are deemed to end:
df['financial_year'] = df['base_date'].dt.to_period('A-MAR').dt.year - 1
Subtract 1 to label each year by the April in which it starts rather than by the March in which it ends.
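A quick cross-check of the period-based idea against the question's month > 3 rule; this sketch assumes 'A-MAR' (years ending in March), where the period's year attribute labels the year in which the fiscal year ends, hence the subtraction:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2019-05-15', '2019-02-01', '2019-04-01']))

# Period labelled by ending year (e.g. Apr 2019 - Mar 2020 -> 2020), so
# subtract 1 to label the year by the April in which it starts.
fy_period = dates.dt.to_period('A-MAR').dt.year - 1
fy_rule = dates.map(lambda x: x.year if x.month > 3 else x.year - 1)
```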
I have a data frame with a date column, and I need to create another two columns with the "start of week date" and "end of week date". The reason is that I will then need to group by an "isoweek" column, while also keeping the two columns "start_of_week_date" and "end_of_week_date".
I've created the below function:
import datetime

def myfunc(dt, option):
    wkday = dt.isoweekday()
    if option == 'start':
        delta = datetime.timedelta(1 - wkday)
    elif option == 'end':
        delta = datetime.timedelta(7 - wkday)
    else:
        raise TypeError
    return dt + delta
Now I don't know how I would use the above function to populate the columns.
Probably I don't even need my function to get what I need. I have a DF that has the below columns:
date, isoweek, qty
I will need to change it to:
isoweek, start_of_week_date, end_of_week_date, qty
This would then take my data from 1.8 million rows to 300 thousand rows :D
Can someone help me?
Thank you
There might be builtin functions that one can use, and one of the other answers proposes such.
However, if you wish to apply your own function (which is perfectly acceptable), you can use apply with a lambda.
Here is an example:
import pandas as pd
import datetime

# an example dataframe with a datetime column (dates are illustrative)
d = {'some date': pd.to_datetime(['2021-03-01', '2021-03-02', '2021-03-03', '2021-03-04']),
     'other data': [2, 4, 6, 8]}
df = pd.DataFrame(d)

# user-defined function from the question
def myfunc(dt, option):
    wkday = dt.isoweekday()
    if option == 'start':
        delta = datetime.timedelta(1 - wkday)
    elif option == 'end':
        delta = datetime.timedelta(7 - wkday)
    else:
        raise TypeError
    return dt + delta

df['week_start'] = df['some date'].apply(lambda x: myfunc(x, 'start'))
df['week_end'] = df['some date'].apply(lambda x: myfunc(x, 'end'))
Hope I understand correctly. This uses dt.weekday to calculate the week start and week end. dt.weekday numbers the days of the week with Monday=0 and Sunday=6, so 6 is used here for a Sunday week end; if you need the week to end on another day, use the appropriate number.
df['start_of_week_date'] = df['Date'] - pd.to_timedelta(df['Date'].dt.weekday, unit='D')
df['end_of_week_date'] = df['Date'] + pd.to_timedelta(6 - df['Date'].dt.weekday, unit='D')
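An alternative sketch using weekly periods, which sidesteps the weekday arithmetic; pandas' default weekly frequency runs Monday-Sunday, and end_time carries a time-of-day component, so it is normalized back to midnight here:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2018-10-15', '2018-10-18'])})

weeks = df['Date'].dt.to_period('W')  # Monday-to-Sunday periods
df['start_of_week_date'] = weeks.dt.start_time
df['end_of_week_date'] = weeks.dt.end_time.dt.normalize()
```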
I have the following dataframe:
data = [
    ("10/10/2016", "A"),
    ("10/10/2016", "B"),
    ("09/12/2016", "B"),
    ("09/12/2016", "A"),
    ("08/11/2016", "A"),
    ("08/11/2016", "C")]

# Create DataFrame base
df = pd.DataFrame(data, columns=("Time", "User"))

# Convert time column to correct format for time calculations
df["Time"] = pd.to_datetime(df["Time"], format="%m/%d/%Y")
Each row represents when a user makes a specific action. I want to compute how frequently (in terms of days) each user makes that specific action.
Let's say user A transacted first time on 08/11/2016, and then he transacted again on 09/12/2016, i.e. around 30 days after. Then, he transacted again on 10/10/2016, around 29 days after his second transaction. So, his average frequency in days would be (29+30)/2.
What is the most efficient way to do that?
Thanks in advance!
Update
I wrote the following function that computes my desired output.
from datetime import timedelta
def averagetime(a):
    numdeltas = len(a) - 1
    sumdeltas = 0
    i = 1
    while i < len(a):
        delta = abs((a[i] - a[i-1]).days)
        sumdeltas += delta
        i += 1
    if numdeltas > 1:
        avg = sumdeltas / numdeltas
    else:
        avg = 'NaN'
    return avg
It works correctly, for example, when I pass the whole "Time" column:
averagetime(df["Time"])
But it gives me an error when I try to apply it after group by.
df.groupby('User')['Time'].apply(averagetime)
Any suggestions how I can fix the above?
You can use diff, convert to days by dividing by np.timedelta64(1, 'D'), and sum the absolute values:
print (averagetime(df["Time"]))
12.0
su = ((df["Time"].diff() / np.timedelta64(1,'D')).abs().sum())
print (su / (len(df) - 1))
12.0
Then I apply it per group with groupby, where the length condition is necessary, because a single-row group would otherwise raise:
ZeroDivisionError: float division by zero
print (df.groupby('User')['Time']
         .apply(lambda x: np.nan if len(x) == 1
                else (x.diff() / np.timedelta64(1, 'D')).abs().sum() / (len(x) - 1)))
User
A 30.0
B 28.0
C NaN
Name: Time, dtype: float64
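The same per-user averages can also be obtained from diff directly via dt.days and mean(), which yields NaN for single-transaction users without an explicit length check (dates parsed month-first, as in the question):

```python
import pandas as pd

data = [("10/10/2016", "A"), ("10/10/2016", "B"), ("09/12/2016", "B"),
        ("09/12/2016", "A"), ("08/11/2016", "A"), ("08/11/2016", "C")]
df = pd.DataFrame(data, columns=("Time", "User"))
df["Time"] = pd.to_datetime(df["Time"], format="%m/%d/%Y")

# Sort within each user so consecutive differences are non-negative;
# the mean of an all-NaT diff (single-row group) is NaN automatically.
avg = (df.sort_values("Time")
         .groupby("User")["Time"]
         .apply(lambda s: s.diff().dt.days.mean()))
```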
Building on @jezrael's answer:
If by "how frequently" you mean how much time passes between each user performing the action, then here's an approach:
import pandas as pd
import numpy as np
data = [
    ("10/10/2016", "A"),
    ("10/10/2016", "B"),
    ("09/12/2016", "B"),
    ("09/12/2016", "A"),
    ("08/11/2016", "A"),
    ("08/11/2016", "C"),
]
# Create DataFrame base
df = pd.DataFrame(data, columns=("Time","User"))
# Convert time column to correct format for time calculations
df["Time"] = pd.to_datetime(df["Time"], dayfirst=True)
# Group the DF by min, max and count the number of instances
grouped = (df.groupby("User").agg([np.max, np.min, np.count_nonzero])
             # This step is a bit messy and could be improved,
             # but we need the count as an int
             .assign(counter=lambda x: x["Time"]["count_nonzero"].astype(int))
             # Use apply to calculate the time between first and last,
             # then divide by the frequency
             .apply(lambda x: (x["Time"]["amax"] - x["Time"]["amin"]) / x["counter"].astype(int),
                    axis=1))
# Output the DF if using an interactive prompt
grouped
Output:
User
A 20 days
B 30 days
C 0 days
I have a dataframe with a DATE column, and I need to derive from it a column whose value is 1 if the date is a weekend day and 0 if it is not.
So far, I have converted the dates to weekdays:
df['WEEKDAY'] = pandas.to_datetime(df['DATE']).dt.dayofweek
Is there a way to create this "WEEKEND" column without functions?
Thanks!
Here's the solution I've come up with:
df['WEEKDAY'] = ((pd.DatetimeIndex(df.index).dayofweek) // 5 == 1).astype(float)
Essentially all it does is use integer division (//) to test whether the dayofweek attribute of the DatetimeIndex is 5 or more: 0-4 (Monday-Friday) divide down to 0, while 5 and 6 (Saturday and Sunday) divide to 1. Normally this would return just True or False, but tacking astype(float) on the end returns a 1.0 or 0.0 rather than a boolean.
One more way of getting weekend indicator is by where function:
df['WEEKDAY'] = np.where(df['DATE'].dt.dayofweek < 5, 0, 1)
One more way of getting weekend indicator is by first converting the date column to day of the week and then using those values to get weekend/not weekend. This can be implemented as follows:
df['WEEKDAY'] = pandas.to_datetime(df['DATE']).dt.dayofweek # monday = 0, sunday = 6
df['weekend_indi'] = 0 # Initialize the column with default value of 0
df.loc[df['WEEKDAY'].isin([5, 6]), 'weekend_indi'] = 1 # 5 and 6 correspond to Sat and Sun
Found the top-voted answer in some codebase. Please don't do it that way; it's completely unreadable. Instead do:
df['weekend'] = df['date'].dt.day_name().isin(['Saturday', 'Sunday'])
Note: df['date'] needs to be in datetime64 or similar format.
The simplest solution I found is:
df['WEEKEND'] = df['DATE'].dt.weekday > 4
The question asks for 0 or 1 instead of True and False; to get that, just multiply by 1:
df['WEEKEND'] = (df['DATE'].dt.weekday > 4) * 1
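Combining these ideas, one more sketch that goes straight from the weekday number to the 0/1 indicator:

```python
import pandas as pd

# a Saturday and a Monday, for illustration
df = pd.DataFrame({'DATE': pd.to_datetime(['2018-10-13', '2018-10-15'])})

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend days.
df['WEEKEND'] = df['DATE'].dt.dayofweek.isin([5, 6]).astype(int)
```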