Create New Column in Pandas Using Values from Two Columns - python

I hope everyone is well. I am trying to create a new column ("Task Start Date") that requires the values from two other columns ("Task Due Date" and "Days"). I am trying to use the apply function, but I can't figure out the correct syntax.
from datetime import datetime, timedelta

def get_start_date(end_date, day):
    date_format = '%d-%b-%y'
    date = datetime.strptime(end_date, date_format)
    start_date = date - timedelta(days=day)
    return start_date

# my (non-working) attempt:
asana_filtered["Task Start Date"] = asana_filtered.apply(get_start_date(["Task Due Date"], ["Days"]))

Found it!
python pandas- apply function with two arguments to columns
asana_filtered["Task Start Date"] = asana_filtered.apply(lambda x: get_start_date(x["Task Due Date"], x["Days"]), axis=1)

You can use the native pandas time converters:
df["Task Start Date"] = pd.to_datetime(df["Task Due Date"]) - pd.to_timedelta(df["Days"], unit="D")

Related

The fastest way to create a new column in the pandas dataframe that satisfies two conditions

I need to create a new column ('new_date') in pandas based on the conditions on the other two columns ('date' and 'hour'), which are integers. My code is doing what I need but it's too SLOW for big dataframes. Please see my code below.
import pandas as pd
import time

df = pd.DataFrame(data={'date': [20150101, 20150102, 20150103, 20150104, 20150105],
                        'hour': [113000, 142500, 170000, 235999, 81500]})

def convert_date(row):
    if row['hour'] != 235999:
        val = pd.to_datetime(row['date'], format='%Y%m%d')  # convert the integer to date format
    else:
        val = pd.to_datetime(row['date'], format='%Y%m%d') + pd.offsets.BDay(1)  # convert and add one business day
    return val

start_time = time.time()
df['new_date'] = df.apply(convert_date, axis=1)
print(round(time.time() - start_time, 2), 'Seconds')
I also used this one-liner, which is just as slow:
df['new_date']= df.apply(lambda row: pd.to_datetime(row['date'], format='%Y%m%d') if row['hour']!=235999 else pd.to_datetime(row['date'], format='%Y%m%d')+pd.offsets.BDay(1), axis=1)
You can replace the function with the following approach using .loc[]. That way you don't have to loop through individual rows.
df['new_date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df.loc[df['hour'] == 235999, 'new_date'] += pd.offsets.BDay(1)
You can also use the df.where() method
df['new_date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df['new_date'] = df['new_date'].where(df['hour'] != 235999, df['new_date'] + pd.offsets.BDay(1))
Both approaches are more efficient than your custom function.
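A quick check on the question's own sample frame that the .loc and .where versions produce the same column:

```python
import pandas as pd

df = pd.DataFrame({"date": [20150101, 20150102, 20150103, 20150104, 20150105],
                   "hour": [113000, 142500, 170000, 235999, 81500]})

# .loc: convert everything once, then shift only the rows that match the condition
df["new_date"] = pd.to_datetime(df["date"], format="%Y%m%d")
df.loc[df["hour"] == 235999, "new_date"] += pd.offsets.BDay(1)

# .where: keep values where the condition holds, otherwise take the shifted ones
alt = pd.to_datetime(df["date"], format="%Y%m%d")
alt = alt.where(df["hour"] != 235999, alt + pd.offsets.BDay(1))

print((df["new_date"] == alt).all())  # True
```

2015-01-04 is a Sunday, so the one shifted row lands on Monday 2015-01-05 in both versions.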

Speed up the apply function in pandas (python)

I am working with a Dataframe containing date in string format. Dates look like this: 19620201 so with year first, then month, then day.
I want to convert those dates into Datetime. I tried to use this:
pd.to_datetime(df.Date)
But it doesn't work because some dates have the day set to "00"; sometimes it's the month, and sometimes even the year.
I don't want to drop those dates because I still want the year or month.
So I tried to write a function like this one:
def handle_the_00_case(date):
    try:
        if date.endswith("0000"):
            return pd.to_datetime(date[:-4], format="%Y")
        elif date.endswith("00"):
            return pd.to_datetime(date[:-2], format="%Y%m")
        return pd.to_datetime(date, format="%Y%m%d")
    except ValueError:
        return
And use the following statement:
df.Date.apply(handle_the_00_case)
But this really takes too long to compute.
Do you have an idea of how I can improve the speed of this?
I tried np.vectorize() and the swifter library, but those didn't work. I know I should change the way I wrote the function, but I don't know how.
Thank you if you can help me ! :)
You should first repair the invalid dates as strings, and then convert to datetime only once (regex=True is needed on newer pandas, where str.replace defaults to literal matching):
date = df['Date'].str.replace('0000$', '0101', regex=True)
date = date.str.replace('00$', '01', regex=True)
date = pd.to_datetime(date, format="%Y%m%d")
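A quick demo of that fix-then-convert approach on invented sample strings (full date, missing day, missing month and day):

```python
import pandas as pd

s = pd.Series(["19620201", "19620200", "19620000"])

# pin the missing month/day parts to 01 before a single to_datetime pass
fixed = s.str.replace("0000$", "0101", regex=True).str.replace("00$", "01", regex=True)
dates = pd.to_datetime(fixed, format="%Y%m%d")

print(dates.dt.strftime("%Y-%m-%d").tolist())  # ['1962-02-01', '1962-02-01', '1962-01-01']
```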
A first idea is a vectorized solution: pass the column to to_datetime (with errors='coerce') and generate the output column with numpy.where:
d1 = pd.to_datetime(df['Date'].str[:-4], format="%Y", errors='coerce')
d2 = pd.to_datetime(df['Date'].str[:-2], format="%Y%m", errors='coerce')
d3 = pd.to_datetime(df['Date'], format="%Y%m%d", errors='coerce')
m1 = df['Date'].str.endswith("0000")
m2 = df['Date'].str.endswith("00")
df['Date_out'] = np.where(m1, d1, np.where(m2, d2, d3))
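Run on a few invented sample strings in the same "YYYYmmdd with 00 padding" shape, the masked version combines the three coerced parses per row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Date": ["19620201", "19620200", "19620000"]})

# three candidate parses; invalid ones become NaT thanks to errors='coerce'
d1 = pd.to_datetime(df["Date"].str[:-4], format="%Y", errors="coerce")
d2 = pd.to_datetime(df["Date"].str[:-2], format="%Y%m", errors="coerce")
d3 = pd.to_datetime(df["Date"], format="%Y%m%d", errors="coerce")

# pick the parse that matches each row's trailing zeros
m1 = df["Date"].str.endswith("0000")
m2 = df["Date"].str.endswith("00")
df["Date_out"] = np.where(m1, d1, np.where(m2, d2, d3))
```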

python error: can only concatenate str (not "datetime.timedelta") to str

I am trying to get the weeks between two dates and split them into rows by week, and here is the error message I got:
can only concatenate str (not "datetime.timedelta") to str
Can anyone help with this one? Thanks!
import datetime
import pandas as pd

df = pd.read_csv(r'C:\Users\xx.csv')
print(df)

# Convert dataframe columns to dates
df['Start Date'] = pd.to_datetime(df['start_date'])
df['End Date'] = pd.to_datetime(df['end_date'])

df_out = pd.DataFrame()
week = 7

# Iterate over dataframe rows
for index, row in df.iterrows():
    date = row["start_date"]
    date_end = row["end_date"]
    dealtype = row["deal_type"]
    ppg = row["PPG"]
    # Get the weeks for the row
    while date < date_end:
        date_next = date + datetime.timedelta(week - 1)
        df_out = df_out.append([[dealtype, ppg, date, date_next]])
        date = date_next + datetime.timedelta(1)

# Remove extra index and assign columns as original dataframe
df_out = df_out.reset_index(drop=True)
df_out.columns = df.columns
df.to_csv(r'C:\Users\Output.csv', index=None)
date here is a string (read from the original start_date column), while datetime.timedelta(week - 1) is a datetime.timedelta object, and Python refuses to concatenate the two with +.
Both of these objects can be converted to a string by using str().
If you really do want string concatenation, wrap each with str():
date_next = str(date) + str(datetime.timedelta(week - 1))
(Note that this produces text, not date arithmetic.)
You converted the start_date and end_date column to datetime, but you added the converted columns as Start Date and End Date. Then, in the loop, you fetch row["start_date"], which is still a string. If you want to REPLACE the start_date column, then don't give it a new name. Spelling matters.
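A minimal sketch of that mismatch on an invented one-row frame: the original string column and the converted copy live side by side, and iterrows() hands back whichever one you ask for by name.

```python
import pandas as pd

df = pd.DataFrame({"start_date": ["2021-01-04"], "end_date": ["2021-01-31"]})
df["Start Date"] = pd.to_datetime(df["start_date"])  # new column, original untouched

row = next(df.iterrows())[1]
print(type(row["start_date"]).__name__)  # str       -- the original column
print(type(row["Start Date"]).__name__)  # Timestamp -- the converted copy

# converting in place removes the mismatch
df["start_date"] = pd.to_datetime(df["start_date"])
row = next(df.iterrows())[1]
print(row["start_date"] < pd.Timestamp("2021-01-31"))  # True -- comparisons now work
```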

Python3- Replace a String Value with a Date from DateTime class

I am trying to format the due date column of my dataframe from strings to dates from the datetime class. It seems to work within my for-loop, however, when I leave the loop, the values in my dataframe do not change.
df.replace() is obsolete, and .iat and .at do not work either. Code below. Thanks!
for x in df['due date']:
    if type(x) == str:
        date = x.split('/')
        month = int(date[0])
        day = int(date[1])
        year = int(date[2].split(' ')[0])
        formattedDate = dt.date(year, month, day)
        df.at[x, "due date"] = formattedDate
Unless I'm missing something here, you can just pass the column to the built in 'to_datetime' function.
df['due date'] = pd.to_datetime(df['due date'],format="%m/%d/%Y")
That is assuming your date format is something like: 02/24/2021
If you need to change the date format, see the following:
strftime and strptime behavior
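A small sketch of that one-call conversion, assuming the values look like 02/24/2021 with an optional trailing time part (as the question's split(' ') suggests; the sample values are invented):

```python
import pandas as pd

s = pd.Series(["02/24/2021", "12/01/2021 08:30", None])

# drop any trailing time part, then parse with an explicit format;
# missing values simply come out as NaT
out = pd.to_datetime(s.str.split(" ").str[0], format="%m/%d/%Y")
```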

Mapping Values in a pandas Dataframe column?

I am trying to filter out some data and seem to be running into some errors.
Below this statement is a replica of the following code I have:
url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"
source = requests.get(url).text
s = StringIO(source)
election_data = pd.DataFrame.from_csv(s, index_col=None).convert_objects(
convert_dates="coerce", convert_numeric=True)
election_data.head(n=3)
last_day = max(election_data["Start Date"])
filtered = election_data[((last_day-election_data['Start Date']).days <= 5)]
As you can see, last_day is the max within the Start Date column of election_data. I would like to filter out the data in which the difference between the max and x is less than or equal to 5 days. I have tried using for-loops and various combinations of list comprehension.
filtered = election_data[map(lambda x: (last_day - x).days <= 5, election_data["Start Date"]) ]
This line would normally work; however, Python 3 gives me the following instead of the filtered rows:
<map object at 0x10798a2b0>
Your first attempt has it almost right. The issue is
(last_day - election_data['Start Date']).days
which should instead be
(last_day - election_data['Start Date']).dt.days
Series objects do not have a days attribute; only TimedeltaIndex objects do. A fully working example is below.
data = pd.read_csv(url, parse_dates=['Start Date', 'End Date', 'Entry Date/Time (ET)'])
data.loc[(data['Start Date'].max() - data['Start Date']).dt.days <= 5]
Note that I've used Series.max which is more performant than the built-in max. Also, data.loc[mask] is slightly faster than data[mask] since it is less-overloaded (has a more specialized use case).
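A minimal runnable version of that pattern with invented dates:

```python
import pandas as pd

df = pd.DataFrame({"Start Date": pd.to_datetime(["2012-10-25", "2012-10-28", "2012-11-01"])})

# keep only rows within 5 days of the latest Start Date
last_day = df["Start Date"].max()
recent = df.loc[(last_day - df["Start Date"]).dt.days <= 5]

print(len(recent))  # 2 -- only the 7-day-old row is dropped
```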
If I understand your question correctly, you just want to filter your data where any Start Date value that is <=5 days away from the last day. This sounds like something pandas indexing could easily handle, using .loc.
If you want an entirely new DataFrame object with the filtered data:
election_data  # your frame
last_day = election_data["Start Date"].max()
cutoff = pd.Timedelta(days=5)
new_df = election_data.loc[last_day - election_data["Start Date"] <= cutoff]
Or if you just want the Start Date column post-filtering:
last_day = election_data["Start Date"].max()
cutoff = pd.Timedelta(days=5)
filtered_dates = election_data.loc[last_day - election_data["Start Date"] <= cutoff, "Start Date"]
Note that the left-hand side of the comparison is a time difference, so the cutoff has to be a Timedelta as well (here pd.Timedelta(days=5)), not a calendar date.
