Speed up the apply function in pandas (python) - python

I am working with a DataFrame whose dates are stored as strings. They look like this: 19620201, so year first, then month, then day.
I want to convert those dates to datetime. I tried this:
pd.to_datetime(df.Date)
But it doesn't work because some dates have the day set to "00", sometimes the month is "00" as well, and sometimes even the year.
I don't want to drop those dates, because I still want the year or the month.
So I tried to write a function like this one:
def handle_the_00_case(date):
    try:
        if date.endswith("0000"):
            return pd.to_datetime(date[:-4], format="%Y")
        elif date.endswith("00"):
            return pd.to_datetime(date[:-2], format="%Y%m")
        return pd.to_datetime(date, format="%Y%m%d")
    except ValueError:
        return
And use the following statement:
df.Date.apply(handle_the_00_case)
But this takes far too long to compute.
Do you have any idea how I can speed this up?
I tried np.vectorize() and the swifter library, but neither helped. I know I should change the way I wrote the function, but I don't know how.
Thank you if you can help me! :)

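You should first turn the column into valid date strings, and then convert to datetime only once:
# regex=True is needed in pandas 2.0+, where str.replace treats the pattern literally by default
date = df['Date'].str.replace('0000$', '0101', regex=True)
date = date.str.replace('00$', '01', regex=True)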
date = pd.to_datetime(date, format="%Y%m%d")
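The replace-then-parse idea above can be tried on a small frame. This is a minimal sketch, assuming a made-up three-row sample and a hypothetical output column name `Date_out`; `errors="coerce"` is an addition so that anything still unparseable becomes NaT instead of raising.

```python
import pandas as pd

# Made-up sample: a full date, a missing day, and a missing month and day.
df = pd.DataFrame({"Date": ["19620201", "19620200", "19620000"]})

# Patch the missing month+day first ("0000"), then a missing day ("00").
# regex=True is passed explicitly because the default differs between pandas versions.
date = df["Date"].str.replace(r"0000$", "0101", regex=True)
date = date.str.replace(r"00$", "01", regex=True)

# A single vectorised parse over the whole column.
df["Date_out"] = pd.to_datetime(date, format="%Y%m%d", errors="coerce")
print(df)
```

Note that a patched value such as 1962-01-01 can no longer be told apart from a genuine 1 January, so keep the original string column if that distinction matters.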

A first idea is a vectorized solution: pass the whole column to to_datetime, then build the output column with numpy.where:
d1 = pd.to_datetime(df['Date'].str[:-4], format="%Y", errors='coerce')
d2 = pd.to_datetime(df['Date'].str[:-2], format="%Y%m", errors='coerce')
d3 = pd.to_datetime(df['Date'], format="%Y%m%d", errors='coerce')
m1 = df['Date'].str.endswith("0000")
m2 = df['Date'].str.endswith("00")
df['Date_out'] = np.where(m1, d1, np.where(m2, d2, d3))
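For comparison, the same three-way selection can be written with `Series.mask` instead of nested `np.where` calls, which keeps the result as a pandas Series with its index. This is a sketch on made-up sample data, not a benchmark; the column name `Date_out` follows the snippet above.

```python
import pandas as pd

# Made-up sample: a full date, a missing day, and a missing month and day.
df = pd.DataFrame({"Date": ["19620201", "19620200", "19620000"]})
s = df["Date"]

# Three whole-column parses; values that do not fit a format become NaT.
d1 = pd.to_datetime(s.str[:-4], format="%Y", errors="coerce")    # year only
d2 = pd.to_datetime(s.str[:-2], format="%Y%m", errors="coerce")  # year + month
d3 = pd.to_datetime(s, format="%Y%m%d", errors="coerce")         # full date

m1 = s.str.endswith("0000")
m2 = s.str.endswith("00")

# Start from the full parse, fall back to year+month, then to year only.
# m1 is applied last because every "0000" row also matches m2.
df["Date_out"] = d3.mask(m2, d2).mask(m1, d1)
print(df)
```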

Related

The fastest way to create a new column in the pandas dataframe that satisfies two conditions

I need to create a new column ('new_date') in pandas based on the conditions on the other two columns ('date' and 'hour'), which are integers. My code is doing what I need but it's too SLOW for big dataframes. Please see my code below.
import pandas as pd
import time
df = pd.DataFrame(data={'date': [20150101, 20150102, 20150103, 20150104, 20150105], 'hour': [113000, 142500,170000,235999,81500]})
def convert_date(row):
    if row['hour'] != 235999:
        val = pd.to_datetime(row['date'], format='%Y%m%d')  # convert the integer to date format
    else:
        val = pd.to_datetime(row['date'], format='%Y%m%d') + pd.offsets.BDay(1)  # convert the integer to date format and add one business day
    return val
start_time = time.time()
df['new_date']= df.apply(convert_date, axis=1)
print(round(time.time() - start_time,2), 'Seconds')
I also tried this one-liner, which is too slow as well:
df['new_date']= df.apply(lambda row: pd.to_datetime(row['date'], format='%Y%m%d') if row['hour']!=235999 else pd.to_datetime(row['date'], format='%Y%m%d')+pd.offsets.BDay(1), axis=1)
You can replace the function with the following approach using .loc. That way you don't have to loop through individual rows.
df['new_date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df.loc[df['hour'] == 235999, 'new_date'] += pd.offsets.BDay(1)
You can also use the Series.where() method:
df['new_date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df['new_date'] = df['new_date'].where(df['hour'] != 235999, df['new_date'] + pd.offsets.BDay(1))
Both approaches are more efficient than your custom function.
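To check that the vectorised version gives the same answer as the row-wise function, the two can be compared on the question's sample frame. This is a small sketch; the names `expected`, `new_date` and `mask` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "date": [20150101, 20150102, 20150103, 20150104, 20150105],
    "hour": [113000, 142500, 170000, 235999, 81500],
})

# Row-by-row reference result (the slow path from the question).
def convert_date(row):
    val = pd.to_datetime(row["date"], format="%Y%m%d")
    return val + pd.offsets.BDay(1) if row["hour"] == 235999 else val

expected = df.apply(convert_date, axis=1)

# Vectorised path: parse the whole column once, then shift only the flagged rows.
new_date = pd.to_datetime(df["date"], format="%Y%m%d")
mask = df["hour"] == 235999
new_date.loc[mask] = new_date.loc[mask] + pd.offsets.BDay(1)

print(new_date.tolist() == expected.tolist())
```

Adding a business-day offset to a Series may emit a pandas PerformanceWarning; restricting the addition to the masked rows keeps that non-vectorised step as small as possible.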

How can I change the Date column format in pandas?

I need to convert the date to Day, Month and Year. I tried some alternatives, but I was unsuccessful.
import pandas as pd
df = pd.read_excel(r"C:\__Imagens e Planilhas Python\Instagram\Postagem.xlsx")
print(df)
It's very confusing, because you're using two different formats between the image and the expected result (and you write you want the same).
First make sure the data column is parsed as datetime:
df['data'] = pd.to_datetime(df['data'])
Once you have this, just change the format with:
my_format = '%m-%d-%Y'
df['data'] = df['data'].dt.strftime(my_format)
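Since the spreadsheet itself is not shown, here is a self-contained sketch of the two steps on made-up values. The input format `%Y-%m-%d`, the day/month/year output format, and the extra column name `data_fmt` are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Excel data.
df = pd.DataFrame({"data": ["2023-01-15", "2023-02-03"]})

# Step 1: parse the strings into real datetimes.
df["data"] = pd.to_datetime(df["data"], format="%Y-%m-%d")

# Step 2: render them as day/month/year text in a separate column,
# so the datetime column stays available for sorting and arithmetic.
df["data_fmt"] = df["data"].dt.strftime("%d/%m/%Y")
print(df)
```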

Dataframe date sorting is reversed. How to fix it?

So, I have a dataframe (mean_df) with a very messy date column. It's messy because it is in this format: 1/1/2018, 1/2/2018, 1/3/2018..., when it should be 01/01/2018, 02/01/2018, 03/01/2018... Not only is the format wrong, but the rows are ordered by the first day of every month, then the second day of every month, and so on.
So I wrote this code to fix the format:
mean_df["Date"] = mean_df["Date"].astype('datetime64[ns]')
mean_df["Date"] = mean_df["Date"].dt.strftime('%d-%m-%Y')
Then, from displaying this:
It's now showing this (I have to run the same cell 3 times to make it work; it always throws an error the first time):
Finally, for the last few hours I've been trying to sort the 'Date' column in ascending order, but it keeps sorting the wrong way:
mean_df = mean_df.sort_values(by='Date') # I tried this
But this is the output:
As you can see, it still sorts by day first.
Can someone guide me in the right direction?
Thank you in advance!
Convert it to a real datetime first:
mean_df["sort_date"] = pd.to_datetime(mean_df["Date"],format = '%d/%m/%Y')
mean_df = mean_df.sort_values(by='sort_date') # Try this now
You should sort right after converting to datetime, because dt.strftime converts the datetimes back to strings:
mean_df["Date"] = pd.to_datetime(mean_df["Date"], dayfirst=True)
mean_df = mean_df.sort_values(by='Date')
mean_df["Date"] = mean_df["Date"].dt.strftime('%d-%m-%Y')
Here is my sample code.
import pandas as pd
df = pd.DataFrame()
df['Date'] = "1/1/2018, 1/2/2018, 1/3/2018".split(", ")
df['Date1'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['Date2'] = df['Date1'].dt.strftime('%d/%m/%Y')
df.sort_values(by='Date1')  # sort on the datetime column, not on the formatted strings
First, I convert Date to datetime. As far as I can tell, your data follows the '%d/%m/%Y' format. If you want to display the dates in another form, try the following line, for example:
df['Date2'] = df['Date1'].dt.strftime('%d/%m/%Y')
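The difference between sorting datetimes and sorting formatted strings is easy to see on a tiny example. This sketch assumes, for illustration only, that the raw strings are month/day/year; swap the format string if your data is day-first. The sample values and the `Date_str` column are made up.

```python
import pandas as pd

mean_df = pd.DataFrame({"Date": ["2/1/2018", "1/2/2018", "1/1/2018", "2/2/2018"]})

# Parse with an explicit format (assumed month/day/year here).
mean_df["Date"] = pd.to_datetime(mean_df["Date"], format="%m/%d/%Y")

# Sort while the column is still datetime, so the order is chronological.
mean_df = mean_df.sort_values("Date").reset_index(drop=True)

# Only now build the display strings; sorting these would order by day first.
mean_df["Date_str"] = mean_df["Date"].dt.strftime("%d-%m-%Y")
print(mean_df)
print(sorted(mean_df["Date_str"]))  # string order differs from chronological order
```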

Convert month, day in string to date in Python

I have two columns, month and day, in my dataframe, both of dtype object. I want to sort them in ascending order (Jan, Feb, Mar), but to do that I need to convert them to a date format. I tried the following code, and some more, but nothing seems to work.
ff['month'] = dt.datetime.strptime(ff['month'],format='%b')
and
ff['month'] = pd.to_datetime(ff['month'], format="%b")
Data Frame
Any help would be appreciated. Thank you
This works to convert Month Names to Integers:
import datetime as dt
ff['month'] = [dt.datetime.strptime(m, "%b").month for m in ff['month']]
(Basically, you're just passing strings one by one to the first function you mentioned, to make it work.)
You can then manipulate (e.g. sort) them.
Working with dataframe:
ff['month'] = ff['month'].apply(lambda x: dt.datetime.strptime(x, "%b"))
ff = ff.sort_values(by=['month'])
ff['month'] = ff['month'].apply(lambda x: x.strftime("%b"))
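A vectorised variant of the same idea derives a numeric sort key with `pd.to_datetime(..., format="%b")` rather than calling `strptime` per row. This is a sketch under assumptions: the sample rows are made up, the month abbreviations are English, and the helper columns `month_num` and `day_num` are hypothetical names that are dropped again after sorting.

```python
import pandas as pd

# Made-up sample with month and day stored as strings (dtype object).
ff = pd.DataFrame({"month": ["Mar", "Jan", "Feb", "Jan"],
                   "day": ["3", "15", "7", "2"]})

# Numeric sort keys: month number from the abbreviation, day as an integer.
ff["month_num"] = pd.to_datetime(ff["month"], format="%b").dt.month
ff["day_num"] = ff["day"].astype(int)

# Sort on the keys, then drop them so the original columns are unchanged.
ff = (ff.sort_values(["month_num", "day_num"])
        .drop(columns=["month_num", "day_num"])
        .reset_index(drop=True))
print(ff)
```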

Convert date string YYYY-MM-DD to YYYYMM in pandas

Is there a way in pandas to convert my column date which has the following format '1997-01-31' to '199701', without including any information about the day?
I tried solution of the following form:
df['DATE'] = df['DATE'].apply(lambda x: datetime.strptime(x, '%Y%m'))
but I obtain this error : 'ValueError: time data '1997-01-31' does not match format '%Y%m''
Probably the reason is that I am not including the day in the format. Is there a better way to go from YYYY-MM-DD format to YYYYMM in pandas?
One way is to convert the column to datetime and then use strftime. Note that the result is a string, so you lose the datetime functionality:
df = pd.DataFrame({'date':['1997-01-31' ]})
df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'].dt.strftime('%Y%m')
date
0 199701
You might not need to go through the datetime conversion if the data are sufficiently clean (no incorrect strings like 'foo' or '001231'):
df = pd.DataFrame({'date':['1997-01-31', '1997-03-31', '1997-12-18']})
df['date'] = [''.join(x.split('-')[0:2]) for x in df.date]
# date
#0 199701
#1 199703
#2 199712
Or if you have null values:
df['date'] = df.date.str.replace('-', '').str[0:6]
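The reason the `.str` version copes with nulls is that pandas string methods pass missing values through unchanged, whereas the list comprehension would fail on them. A minimal sketch with a made-up missing entry and a hypothetical output column `yyyymm`:

```python
import pandas as pd

# Made-up sample with one missing value.
df = pd.DataFrame({"date": ["1997-01-31", None, "1997-12-18"]})

# Strip the dashes literally (regex=False), then keep the first six characters.
# The missing entry stays missing instead of raising an error.
df["yyyymm"] = df["date"].str.replace("-", "", regex=False).str[:6]
print(df)
```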
