I have a dataframe with a DATE column and I need to create a new column with value 1 if the date is a weekend day and 0 if it is not.
So far, I have converted the dates to weekday numbers:
df['WEEKDAY'] = pandas.to_datetime(df['DATE']).dt.dayofweek
Is there a way to create this "WEEKEND" column without writing a function?
Thanks!
Here's the solution I've come up with:
df['WEEKDAY'] = ((pd.DatetimeIndex(df.index).dayofweek) // 5 == 1).astype(float)
Essentially all it does is use integer division (//) to test whether the dayofweek attribute of the DatetimeIndex is 5 or greater: for Monday through Friday (0-4) the division yields 0, while for Saturday (5) and Sunday (6) it yields 1. Normally the comparison would return just True or False, but tacking astype(float) on the end returns 1.0 or 0.0 rather than a boolean.
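As a quick, self-contained check of the trick (applied here to a hypothetical DATE column rather than the index):

```python
import pandas as pd

# Fri, Sat, Sun: dayofweek gives 4, 5, 6, and // 5 maps weekdays to 0, weekend to 1
df = pd.DataFrame({"DATE": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-07"])})
df["WEEKEND"] = (df["DATE"].dt.dayofweek // 5 == 1).astype(float)
print(df["WEEKEND"].tolist())  # [0.0, 1.0, 1.0]
```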
One more way of getting a weekend indicator is NumPy's where function:
import numpy as np
df['WEEKEND'] = np.where(pd.to_datetime(df['DATE']).dt.dayofweek < 5, 0, 1)
One more way of getting weekend indicator is by first converting the date column to day of the week and then using those values to get weekend/not weekend. This can be implemented as follows:
df['WEEKDAY'] = pandas.to_datetime(df['DATE']).dt.dayofweek # monday = 0, sunday = 6
df['weekend_indi'] = 0 # Initialize the column with default value of 0
df.loc[df['WEEKDAY'].isin([5, 6]), 'weekend_indi'] = 1 # 5 and 6 correspond to Sat and Sun
Found the top voted answer in some codebase. Please don't do it that way, it's completely unreadable. Instead do:
df['weekend'] = df['date'].dt.day_name().isin(['Saturday', 'Sunday'])
Note: df['date'] needs to be in datetime64 or similar format.
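A minimal sketch with made-up dates (a Friday, a Saturday and a Sunday) to show the expected output:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-07"])})
# day_name() spells the weekday out, which keeps the intent obvious
df["weekend"] = df["date"].dt.day_name().isin(["Saturday", "Sunday"])
print(df["weekend"].tolist())  # [False, True, True]
```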
The simplest solution I found is:
df['WEEKEND'] = df['DATE'].dt.weekday > 4
The question asks for 0 or 1 instead of True and False, just multiply by 1:
df['WEEKEND'] = (df['DATE'].dt.weekday > 4) * 1
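For example, on a hypothetical DATE column covering a Friday through Sunday:

```python
import pandas as pd

df = pd.DataFrame({"DATE": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-07"])})
# weekday > 4 flags Saturday (5) and Sunday (6); * 1 casts the booleans to ints
df["WEEKEND"] = (df["DATE"].dt.weekday > 4) * 1
print(df["WEEKEND"].tolist())  # [0, 1, 1]
```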
Related
This must be really simple but can't find the answer - have a dataframe with a date, and i just need to add a financial year column (1st April - 31st March) - e.g. 15/5/19 would return 2019 or 2019-20.
I could probably do it with a for loop, but i'm guessing there is a much better way within the dataframe?
if base_date.month > 3:
    fin_year = base_date.year
else:
    fin_year = base_date.year - 1
Actually, there's a function for that:
s.dt.to_period('Q-OCT').dt.qyear
This will work, assuming your date column is base_date and is a datetime object:
df['financial_year'] = df['base_date'].map(lambda x: x.year if x.month > 3 else x.year-1)
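With a couple of illustrative dates, one on each side of the April boundary:

```python
import pandas as pd

df = pd.DataFrame({"base_date": pd.to_datetime(["2019-05-15", "2019-02-01"])})
# April onwards belongs to the year's own financial year; Jan-Mar to the previous one
df["financial_year"] = df["base_date"].map(lambda x: x.year if x.month > 3 else x.year - 1)
print(df["financial_year"].tolist())  # [2019, 2018]
```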
For Australian Financial Year:
df['financial_year'] = df['base_date'].map(lambda d: d.year + 1 if d.month > 6 else d.year)
Period objects have the asfreq method for recasting between frequencies, and to_period lets you specify the month at which years are deemed to end. For an April-March year the anchor is 'A-MAR':
df['financial_year'] = df['base_date'].dt.to_period('A-MAR').dt.year - 1
Subtract 1 because pandas labels an anchored annual period by the calendar year in which it ends, while the question wants the year in which it starts.
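A runnable sketch of the period-anchoring idea, using an 'A-MAR' anchor (annual periods ending in March) and illustrative dates:

```python
import pandas as pd

df = pd.DataFrame({"base_date": pd.to_datetime(["2019-05-15", "2019-02-01"])})
# an 'A-MAR' period is labelled by the year it ends in, so subtract 1
# to label the financial year by the year it starts in instead
df["financial_year"] = df["base_date"].dt.to_period("A-MAR").dt.year - 1
print(df["financial_year"].tolist())  # [2019, 2018]
```

May 2019 falls in the April 2019 - March 2020 period, while February 2019 belongs to the 2018-19 year.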
I have a DataFrame with a DatetimeIndex and I need to get a new column with values 0 (for a Saturday or Sunday) or 1 (if it was business day) based on the datetime index. How can I do it in a way like:
keytable['var']= if 'Saturday' or 'Sunday' == 0 else return 1
Thanks in advance for the amazing support this community gives to coders worldwide!
You can do:
keytable['var'] = (keytable.index.weekday < 5).astype(int)
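For example, with a small illustrative DatetimeIndex (a Friday, a Saturday and a Sunday):

```python
import pandas as pd

keytable = pd.DataFrame(index=pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-07"]))
# index.weekday is 4, 5, 6 here; < 5 marks business days, astype(int) gives 1/0
keytable["var"] = (keytable.index.weekday < 5).astype(int)
print(keytable["var"].tolist())  # [1, 0, 0]
```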
I'm struggling with this rather complex calculated column.
The cumulative sum is watts of light.
It changes to 0 when the system resets it to a new day. So the 24hour day is sunrise to sunrise.
I want to use this fact to calculate a 'Date 2' that I can then summarize in the future to report average 24-hour day temp, light, etc.
For the first 0 in Cumulative Sum of every Date, set Date 2 to Date + 1 day; otherwise, carry forward the last value of Date 2.
I have been playing around with the following, assuming Advanced Date is a copy of Cumulative Sum:
for i in range(1, len(ClimateDF)):
    j = ClimateDF.columns.get_loc('AdvancedDate')
    if ClimateDF.iat[i, j] == 0 and ClimateDF.iat[i - 1, j] != 0:
        print(ClimateDF.iat[i, j])
        # ClimateDF.iat[i, 'AdvancedDate'] = 'New Day'  # this doesn't work
        ClimateDF['AdvancedDate'].values[i] = 1
    else:
        print(ClimateDF.iat[i, j])
        # ClimateDF.iat[i, 'AdvancedDate'] = 'Not New Day'  # this doesn't work
        ClimateDF['AdvancedDate'].values[i] = 2
This doesn't quite do what I want, but I thought I was close. However when I change:
ClimateDF['AdvancedDate'].values[i] = 1
to
ClimateDF['AdvancedDate'].values[i] = ClimateDF['Date'].values[i]
I get a:
TypeError: float() argument must be a string or a number, not 'datetime.date'
Am I on the right track? How do I get past this error? Is there a more efficient way I could be doing this?
IIUC, you can first create a cumsum reflecting day change, and then calculate Date_2 by adding it to the first date:
s = (df["sum"].eq(0) & df["sum"].shift().ne(0)).cumsum()
df["Date_2"] = df["Datetime"][0] + pd.to_timedelta(s, unit="D")  # offset every row from the first day
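Here is the same idea on a small made-up frame (the sum and timestamp values are illustrative); note that the very first row also counts as a day change, because shift() yields NaN there:

```python
import pandas as pd

df = pd.DataFrame({
    "Datetime": pd.to_datetime(["2020-01-01"] * 8),  # placeholder timestamps
    "sum": [0, 5, 10, 0, 3, 0, 0, 2],                # resets to 0 at each "sunrise"
})
# a new day starts wherever sum is 0 and the previous row was non-zero
s = (df["sum"].eq(0) & df["sum"].shift().ne(0)).cumsum()
df["Date_2"] = df["Datetime"][0] + pd.to_timedelta(s, unit="D")
print(s.tolist())  # [1, 1, 1, 2, 2, 3, 3, 3]
```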
assume that we have the following df
import pandas as pd
data = pd.DataFrame({'Dates': ['2018-10-15', '2018-02-01', '2018-04-01']})
data['Dates'] = pd.to_datetime(data['Dates'])
print(data)
Dates
0 2018-10-15
1 2018-02-01
2 2018-04-01
in my current company, we have a financial week structure which I normally work out using an excel and I'd like to do this in Python
I use the DateTime module to work around my conditions which are as follows
if the month is >= 4 (April) the Week number is 1 (so I take the ISO week number and subtract 13)
if the month is < 4 I add 39.
I use the same logic for the year: if the month is >= 4 then YEAR + 1, else YEAR.
I thought I could use a simple for loop that I could use over my dataframe
for x in data.Dates:
    if x.dt.month >= 4:
        df['Week'] = x.dt.week - 13
    else:
        df['Week'] = x.dt.week + 39
and for the year
for x in data.Dates:
    if x.dt.month >= 4:
        df['Year'] = FY & x.dt.year + 1
    else:
        df['Year'] = FY & x.dt.year
however, the >= 4 comparison in both loops throws an error:
File "<ipython-input-38-eadb99fdd9db>", line 4
df.Dates.dt.month > 4:
^
SyntaxError: invalid syntax
however, if I do
data['Week'] = data.Dates.dt.week
this gives all the week numbers, am I missing something basic or essential here?
I hope this is clear and concise, any advice (even how to ask better questions) is appreciated.
Don't use an explicit loop
Pandas specialises in vectorised operations. There's no need for a for loop. You can use, for example, numpy.where to create a series conditionally:
import numpy as np
data['Week'] = np.where(data['Dates'].dt.month >= 4, data['Dates'].dt.week - 13,
data['Dates'].dt.week + 39)
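Note that Series.dt.week has since been removed (pandas 2.0); the same vectorised logic still works with dt.isocalendar().week. A sketch on the question's sample dates:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"Dates": pd.to_datetime(["2018-10-15", "2018-02-01", "2018-04-01"])})
week = data["Dates"].dt.isocalendar().week.astype(int)  # modern replacement for .dt.week
data["Week"] = np.where(data["Dates"].dt.month >= 4, week - 13, week + 39)
print(data["Week"].tolist())  # [29, 44, 0]
```

The 0 for 2018-04-01 shows an edge case: April 1 can still fall in ISO week 13, so the subtract-13 rule may need adjusting right at the year boundary.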
The reason your code doesn't work is because you are updating an entire series in each loop rather than elements in a series. In other words, you are applying elementwise logic to a series.
The issue arises because you are iterating through the values in df['Dates'], which are Timestamp objects. This is equivalent to going through df['Dates'][0], df['Dates'][1], ... to extract the feature of interest. To extract a particular date-related feature such as month, day, or week from a single Timestamp, you simply access the attribute:
df['Dates'][0].month
On the other hand, df['Dates'] itself is a pandas Series of timestamps. To extract the same date-related features from the entire Series, you have to go through the .dt accessor:
df['Dates'].dt.month
This is similar to the functioning of a "string" Series object, where you have to call pd.Series.str.<method>, to perform the requisite string operation (such as extract, contains, get, etc) on the entire Series object.
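The distinction in two lines, on illustrative dates:

```python
import pandas as pd

df = pd.DataFrame({"Dates": pd.to_datetime(["2018-10-15", "2018-02-01"])})
first_month = df["Dates"][0].month   # scalar: attribute straight off the Timestamp
all_months = df["Dates"].dt.month    # whole Series: goes through the .dt accessor
print(first_month, all_months.tolist())  # 10 [10, 2]
```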
The syntax error does not come from here, but try removing the 'dt' inside your for loops:
import pandas as pd
df = pd.DataFrame()
df['Dates'] = pd.to_datetime(['2018-10-15', '2018-02-01', '2018-04-01'])
for x in df.Dates:
    if x.month >= 4:
        df['Week'] = x.week - 13
    else:
        df['Week'] = x.week + 39

for x in df.Dates:
    if x.month >= 4:
        df['Year'] = FY & x.year + 1
    else:
        df['Year'] = FY & x.year
The question is a bit confusing due to the mixed use of 'data' and 'df'. I hope I haven't misinterpreted it.
If it does not work can you post the whole code so I can try it?
You are almost there, just drop dt like so:
for x in data.Dates:
    if x.month >= 4:
        df['Year'] = FY & x.year + 1
    else:
        df['Year'] = FY & x.year
Try this
def my_f(x):
    if x.month >= 4:
        return x.week - 13
    else:
        return x.week + 39

df['Week'] = df.Dates.apply(my_f)
I am trying to filter out some data and seem to be running into some errors.
Below is a replica of the code I have:
url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"
source = requests.get(url).text
s = StringIO(source)
election_data = pd.DataFrame.from_csv(s, index_col=None).convert_objects(
convert_dates="coerce", convert_numeric=True)
election_data.head(n=3)
last_day = max(election_data["Start Date"])
filtered = election_data[((last_day-election_data['Start Date']).days <= 5)]
As you can see, last_day is the max of the Start Date column of election_data.
I would like to filter out the data in which the difference between
the max and x is less than or equal to 5 days
I have tried using for - loops, and various combinations of list comprehension.
filtered = election_data[map(lambda x: (last_day - x).days <= 5, election_data["Start Date"]) ]
This line would work in Python 2, but in Python 3 map returns a lazy map object rather than a list, so indexing with it fails; printing the mask just shows:
&lt;map object at 0x10798a2b0&gt;
Your first attempt has it almost right. The issue is
(last_day - election_data['Start Date']).days
which should instead be
(last_day - election_data['Start Date']).dt.days
A timedelta Series does not expose days directly; you reach it through the .dt accessor (only TimedeltaIndex objects have a days attribute of their own). A fully working example is below.
data = pd.read_csv(url, parse_dates=['Start Date', 'End Date', 'Entry Date/Time (ET)'])
data.loc[(data['Start Date'].max() - data['Start Date']).dt.days <= 5]
Note that I've used Series.max which is more performant than the built-in max. Also, data.loc[mask] is slightly faster than data[mask] since it is less-overloaded (has a more specialized use case).
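The corrected filter can be checked on a small frame with made-up poll dates:

```python
import pandas as pd

election_data = pd.DataFrame({
    "Start Date": pd.to_datetime(["2012-10-01", "2012-10-25", "2012-10-28", "2012-11-01"])
})
last_day = election_data["Start Date"].max()
# differences are 31, 7, 4 and 0 days, so only the last two rows survive
recent = election_data.loc[(last_day - election_data["Start Date"]).dt.days <= 5]
print(len(recent))  # 2
```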
If I understand your question correctly, you just want to keep the rows whose Start Date value is <= 5 days away from the last day. This sounds like something pandas indexing could easily handle, using .loc.
If you want an entirely new DataFrame object with the filtered data:
election_data # your frame
last_day = max(election_data["Start Date"])
delta = pd.Timedelta(days=5)  # the cut-off window
new_df = election_data.loc[last_day - election_data["Start Date"] <= delta]
Or if you just want the Start Date column post-filtering:
last_day = max(election_data["Start Date"])
delta = pd.Timedelta(days=5)
filtered_dates = election_data.loc[last_day - election_data["Start Date"] <= delta, "Start Date"]
Note that last_day - election_data["Start Date"] yields timedeltas, so the cut-off has to be a Timedelta (here five days), not a date string. If you are unsure of the window, just print(last_day) and count 5 days back.