Extracting dates in a pandas dataframe column using regex

I have a DataFrame with a column Campaign whose values follow the pattern "campaign name (start date - end date)". I need to create three new columns by extracting the start and end dates:
start_date, end_date, days_between_start_and_end_date.
The problem is that the Campaign values are not in a fixed format. My code works well for values like this one:
1. Season1 hero (18.02. -24.03.2021)
My snippet extracts the start date and end date from the Campaign column. As you can see, the start date has no year, so I add one by comparing the months: if the start month is later than the end month, the start date gets the previous year.
import pandas as pd
import re
import datetime
# read csv file
df = pd.read_csv("report.csv")
# extract start and end dates from the 'Campaign' column
dates = df['Campaign'].str.extract(r'(\d+\.\d+)\.\s*-\s*(\d+\.\d+\.\d+)')
df['start_date'] = dates[0]
df['end_date'] = dates[1]
# convert start and end dates to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], format='%d.%m')
df['end_date'] = pd.to_datetime(df['end_date'], format='%d.%m.%Y')
# Add year to start date
for index, row in df.iterrows():
    if pd.isna(row["start_date"]) or pd.isna(row["end_date"]):
        continue
    start_month = row["start_date"].month
    end_month = row["end_date"].month
    year = row["end_date"].year
    if start_month > end_month:
        year = year - 1
    dates_str = str(row["start_date"].strftime("%d.%m")) + "." + str(year)
    df.at[index, "start_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")
    dates_str = str(row["end_date"].strftime("%d.%m")) + "." + str(row["end_date"].year)
    df.at[index, "end_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")
However, I have many other column values where my regex fails and I get NaN values, for example:
1. Sales is on (30.12.21-12.01.2022)
2. Sn 2 Fol CAMPAIGN A (24.03-30.03.2023)
3. M SALE (19.04 - 04.05.2022)
4. NEW SALE (29.12.2022-11.01.2023)
5. Year End (18.12. - 12.01.2023)
6. XMAS 1 - THE TRIBE CELEBRATES XMAS (18.11.-08.12.2021) (gifting communities)
In all of the above examples, the date format is different.
Expected output:
start_date  end_date
2021-12-30  2022-01-12
2023-03-24  2023-03-30
2022-04-19  2022-05-04
2022-12-29  2023-01-11
2022-12-18  2023-01-12
2021-11-18  2021-12-08
Can someone please help me here?

Since the dates in the data don't have a fixed format (some are dd.mm.yy, some are dd.mm.YYYY), it might be better to apply a custom parser function that uses try-except. We could instead do two conversions with pd.to_datetime and choose between the results with np.where, but that is unlikely to save time given how much string manipulation is needed beforehand.
To append the missing years for some rows, we would need several pandas string methods (str.count(), str.cat() etc.), and these are not optimized, so it's probably better to use Python string methods in a loop.
Also, iterrows() is very slow; a plain Python loop (here, a list comprehension) over the extracted values is much faster.
pd.to_datetime converts each element into a datetime.datetime object anyway, so we can use datetime.strptime from the standard library to perform the conversions.
from datetime import datetime
def datetime_parser(date, end_date=None):
    # remove space around dates
    date = date.strip()
    # if the start date doesn't have year, append it from the end date
    dmy = date.split('.')
    if end_date and len(dmy) == 2:
        date = f"{date}.{end_date.rsplit('.', 1)[1]}"
    elif end_date and not dmy[-1]:
        edmy = end_date.split('.')
        if int(dmy[1]) > int(edmy[1]):
            date = f"{date}{int(edmy[-1])-1}"
        else:
            date = f"{date}{edmy[-1]}"
    try:
        # try 'dd.mm.YYYY' format (e.g. 29.12.2022) first
        return datetime.strptime(date, '%d.%m.%Y')
    except ValueError:
        # try 'dd.mm.yy' format (e.g. 30.12.21) if the above doesn't work out
        return datetime.strptime(date, '%d.%m.%y')
# extract dates into 2 columns (tentatively start and end dates)
splits = df['Campaign'].str.extract(r"\((.*?)-(.*?)\)").values.tolist()
# parse the dates
df[['start_date', 'end_date']] = [[datetime_parser(start, end), datetime_parser(end)] for start, end in splits]
# find difference
df['days_between_start_and_end_date'] = df['end_date'] - df['start_date']
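As an alternative sketch (a different technique from the parser above), the pieces of each date can be captured with a single regex in which the start year is optional, and the missing year can then be filled in from the end date. The sample data comes from the question; the group names (sd, sm, sy, ed, em, ey), the helper full_year, and the rule that two-digit years mean 20xx are assumptions made for illustration. The sketch also assumes that every row matches the pattern.

```python
import pandas as pd

# Sample values taken from the question
df = pd.DataFrame({"Campaign": [
    "Season1 hero (18.02. -24.03.2021)",
    "Sales is on (30.12.21-12.01.2022)",
    "Sn 2 Fol CAMPAIGN A (24.03-30.03.2023)",
    "M SALE (19.04 - 04.05.2022)",
    "NEW SALE (29.12.2022-11.01.2023)",
    "Year End (18.12. - 12.01.2023)",
    "XMAS 1 - THE TRIBE CELEBRATES XMAS (18.11.-08.12.2021) (gifting communities)",
]})

# Start: day.month, optional dot, optional year. End: day.month.year.
pattern = (
    r"\((?P<sd>\d{1,2})\.(?P<sm>\d{1,2})\.?(?P<sy>\d{2,4})?\s*-\s*"
    r"(?P<ed>\d{1,2})\.(?P<em>\d{1,2})\.(?P<ey>\d{2,4})\)"
)
nums = df["Campaign"].str.extract(pattern).apply(pd.to_numeric)

def full_year(year):
    # Assumption: two-digit years belong to the 2000s
    return year.where(year >= 100, year + 2000)

end_year = full_year(nums["ey"])
# If the start month is later than the end month, the start is in the previous year
inferred = end_year - (nums["sm"] > nums["em"]).astype(int)
start_year = full_year(nums["sy"]).fillna(inferred).astype("int64")

df["start_date"] = pd.to_datetime(
    pd.DataFrame({"year": start_year, "month": nums["sm"], "day": nums["sd"]})
)
df["end_date"] = pd.to_datetime(
    pd.DataFrame({"year": end_year, "month": nums["em"], "day": nums["ed"]})
)
df["days_between_start_and_end_date"] = (df["end_date"] - df["start_date"]).dt.days
print(df[["start_date", "end_date", "days_between_start_and_end_date"]])
```

Rows that do not match the pattern would produce NaN in the extracted groups, so real data would need those rows filtered out or handled before the integer conversion.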

I would do a basic regex with extract and then slice the result:
ser = df["Campaign"].str.extract(r"\((.*?)\)", expand=False)

# the end date always carries a full year, so it is the last 10 characters
end_date = ser.str.strip().str[-10:]
# or ser.str.strip().str.rsplit("-").str[-1]
# the start date is everything before the dash
start_date = ser.str.strip().str.split(r"\s*-\s*").str[0]

NB: You can assign the Series start_date and end_date to create your two new columns.
Output:
end_date, start_date
(1.0    12.01.2022  # <- end_date
2.0    30.03.2023
3.0    04.05.2022
4.0    11.01.2023
Name: Campaign, dtype: object,
1.0    30.12.21  # <- start_date
2.0    24.03
3.0    19.04
4.0    29.12.2022
Name: Campaign, dtype: object)

Related

How do I calculate time difference in days or months in python3

I've been working on a scraping and EDA project on Python3 using Pandas, BeautifulSoup, and a few other libraries and wanted to do some analysis using the time differences between two dates. I want to determine the number of days (or months or even years if that'll make it easier) between the start dates and end dates, and am stuck. I have two columns (air start date, air end date), with dates in the following format: MM-YYYY (so like 01-2021). I basically wanted to make a third column with the time difference between the end and start dates (so I could use it in later analysis).
# split air_dates column into start and end date
dateList = df["air_dates"].str.split("-", n = 1, expand = True)
df['air_start_date'] = dateList[0]
df['air_end_date'] = dateList[1]
df.drop(columns = ['air_dates'], inplace = True)
df.drop(columns = ['rank'], inplace = True)
# changing dates to numerical notation
df['air_start_date'] = pds.to_datetime(df['air_start_date'])
df['air_start_date'] = df['air_start_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df['air_end_date'] = pds.Series(df['air_end_date'])
df['air_end_date'] = pds.to_datetime(df['air_end_date'], errors = 'coerce')
df['air_end_date'] = df['air_end_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df.isnull().sum()
df.dropna(subset = ['air_end_date'], inplace = True)
def time_diff(time_series):
    return datetime.datetime.strptime(time_series, '%d')
df['time difference'] = df['air_end_date'].apply(time_diff) - df['air_start_date'].apply(time_diff)
The last four lines are my attempt at getting a time difference, but I got an error saying 'ValueError: unconverted data remains: -2021'. Any help would be greatly appreciated, as this has had me stuck for a good while now. Thank you!

As far as I understand, if you have a start date and time and an end date and time, you can use Python's datetime module, something like this:
from datetime import datetime
# variable = datetime(year, month, day, hour, minute, second)
start = datetime(2017, 5, 8, 18, 56, 40)
end = datetime(2019, 6, 27, 12, 30, 58)
print(end - start)  # prints the difference between the two datetimes as a timedelta
Hope this answer helps you.

Ok so I figured it out. In my second to last line, I replaced the %d with %m-%Y and now it populates the new column with the number of days between the two dates. I think the format needed to be consistent when running strptime so that's what was causing that error.
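A minimal sketch of that fix, using made-up sample values in the MM-YYYY format described in the question (the helper name parse_month_year and the days column are hypothetical):

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample data in MM-YYYY format
df = pd.DataFrame({
    "air_start_date": ["01-2021", "04-2009"],
    "air_end_date": ["03-2021", "07-2010"],
})

def parse_month_year(text):
    # The format string must describe the whole input, month and year
    return datetime.strptime(text, "%m-%Y")

df["time difference"] = (
    df["air_end_date"].apply(parse_month_year)
    - df["air_start_date"].apply(parse_month_year)
)
# Whole days as integers, taken from the timedelta column
df["days"] = df["time difference"].dt.days
print(df)
```

With only a month and a year, strptime fills in the first day of the month, so the differences are measured between the first days of the two months.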

Here's a slightly cleaned-up version: subtract the start date from the end date to get a timedelta, then take the days attribute from that.
Example:
import pandas as pd
df = pd.DataFrame({'air_dates': ["Apr 2009 - Jul 2010", "not a date - also not a date"]})
df['air_start_date'] = df['air_dates'].str.split(" - ", expand=True)[0]
df['air_end_date'] = df['air_dates'].str.split(" - ", expand=True)[1]
df['air_start_date'] = pd.to_datetime(df['air_start_date'], errors="coerce")
df['air_end_date'] = pd.to_datetime(df['air_end_date'], errors="coerce")
df['timediff_days'] = (df['air_end_date']-df['air_start_date']).dt.days
That will give you for the dummy example
df['timediff_days']
0 456.0
1 NaN
Name: timediff_days, dtype: float64
For the difference in months, there are several suggestions in other answers. I'd go with @piRSquared's approach:
df['timediff_months'] = ((df['air_end_date'].dt.year - df['air_start_date'].dt.year) * 12 +
                         (df['air_end_date'].dt.month - df['air_start_date'].dt.month))
df['timediff_months']
0 15.0
1 NaN
Name: timediff_months, dtype: float64

Deriving special period based on date from file - Python

I am new to scripting and need some help writing this code correctly. I have a CSV file with a date column. Based on the date, I need to create a new column named period that combines the year and month.
If the day of the month is between 1 and 25, the month is the date's own month.
If the day of the month is greater than 25, the month is the next month.
Sample file:
Date
10/21/2021
10/26/2021
01/26/2021
Expected results:
Date        Period (year+month)
10/21/2021  202110
10/26/2021  202111
01/26/2021  202102

Two ways I can think of:
1. Convert the incoming string into a date object and get the values you need from there. See "Converting string into datetime".
2. Use split("/") to split the date string into a list of three values and use those to do your calculations.
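The second option can be sketched as follows. The function name period_from_string is hypothetical, and the sketch assumes every input is a well-formed MM/DD/YYYY string as in the sample file.

```python
def period_from_string(date_text):
    """Return the YYYYMM period for an MM/DD/YYYY date string."""
    month, day, year = (int(part) for part in date_text.split("/"))
    if day > 25:
        # Days after the 25th belong to the next month
        month += 1
        if month > 12:
            # December rolls over to January of the following year
            month = 1
            year += 1
    return f"{year}{month:02d}"

for text in ["10/21/2021", "10/26/2021", "01/26/2021", "12/26/2021"]:
    print(text, period_from_string(text))
```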

Good question.
I've included the code that I wrote to do this below. The process we will follow is:
1. Load the data from a csv
2. Define a function that will calculate the period for each date
3. Apply the function to our data and store the result as a new column
import pandas as pd
# Step 1
# read in the data from a csv, parsing dates and store the data in a DataFrame
data = pd.read_csv("filepath.csv", parse_dates=["Date"])
# Create day, month and year columns in our DataFrame
data['day'] = data['Date'].dt.day
data['month'] = data['Date'].dt.month
data['year'] = data['Date'].dt.year
# Step 2
# Define a function that will get our periods from a given date
def get_period(date):
    day = date.day
    month = date.month
    year = date.year
    if day > 25:
        if month == 12:  # if december, increment year and change month to jan.
            year += 1
            month = 1
        else:
            month += 1
    # convert our year and month into zero-padded strings that we can concatenate easily
    year_string = str(year).zfill(4)
    month_string = str(month).zfill(2)
    period = year_string + month_string  # concat the strings together
    return period
# Step 3
# Apply our custom function (get_period) to each row of the DataFrame;
# row.day, row.month and row.year read the columns created in Step 1
data['period'] = data.apply(get_period, axis=1)
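The same rule can also be sketched without a row-wise apply. This variant is an assumption-laden alternative, not the method above: it relies on the observation that adding 7 days to any date after the 25th always lands in the following month, so formatting the shifted date gives the period directly. The sample dates are hypothetical apart from the three in the question.

```python
import pandas as pd

data = pd.DataFrame({"Date": pd.to_datetime(
    ["10/21/2021", "10/26/2021", "01/26/2021", "12/31/2021"],
    format="%m/%d/%Y",
)})

# True for dates that should count towards the next month
late = data["Date"].dt.day > 25
# Shift only the late dates by 7 days; day 26..31 plus 7 is always in the next month
shifted = data["Date"] + pd.to_timedelta(late.astype(int) * 7, unit="D")
data["period"] = shifted.dt.strftime("%Y%m")
print(data)
```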

problems with python Pandas converting int to float

I'm using pandas read_csv to extract data and reformat it. For example, "10/28/2018" from the column "HBE date" will be reformatted to read "eHome 10/2018"
It mostly works except I am getting reformatted values like "ehome 1.0/2015.0"
eHomeHBEdata['HBE date'] = pd.to_datetime(eHomeHBEdata['Course Completed'])
#extract month and year values
eMonths=[]
eYears =[]
eHomeDates = eHomeHBEdata['HBE date']
for eDate in eHomeDates:
    eMonth = eDate.month
    eYear = eDate.year
    eMonths.append(eMonth)
    eYears.append(eYear)
At this point, if I print(type(eMonth)) it returns as 'int.' And if I print the eYears list, I get values like 2013, 2014, 2015 etc.
But then I assign the lists to columns in the data frame . . .
eHomeHBEdata.insert(0,'workshop Month',eMonths)
eHomeHBEdata.insert(1,'workshop Year',eYears)
. . . after which print(ehomeHomeHBEdata['workshop Month']) returns values like 2013.0, 2014.0, 2015.0. That's type float, right?
When I try to use the following code I get the misformatted errors mentioned above
eHomeHBEdata['course session'] = "ehome " + eHomeHBEdata['workshop Month'].astype(str) + "/" + eHomeHBEdata['workshop Year'].astype(str)
eHomeHBEdata['start'] = eHomeHBEdata['workshop Month'].astype(str) + "/1/" + eHomeHBEdata['workshop Year'].astype(str) + " 12:00 PM"
Could someone explain what's going on here and help me fix it?

Solution
To convert (reformat) your date columns as MM/YYYY, all you need to do is:
df["Your_Column_Name"].dt.strftime('%m/%Y')
See Section-A and Section-B for two different use-cases.
A. Example
I have created some dummy data for this illustration with a column called Dates. To reformat this column as MM/YYYY I am using df.Dates.dt.strftime('%m/%Y'), which is equivalent to df["Dates"].dt.strftime('%m/%Y').
import pandas as pd
## Dummy Data
dates = pd.date_range(start='2020/07/01', end='2020/07/07', freq='D')
df = pd.DataFrame(dates, columns=['Dates'])
# Solution
df['Reformatted_Dates'] = df.Dates.dt.strftime('%m/%Y')
print(df)
## Output:
# Dates Reformatted_Dates
# 0 2020-07-01 07/2020
# 1 2020-07-02 07/2020
# 2 2020-07-03 07/2020
# 3 2020-07-04 07/2020
# 4 2020-07-05 07/2020
# 5 2020-07-06 07/2020
# 6 2020-07-07 07/2020
B. If your input data is in M/YYYY string format (as in the dummy data below)
In this case, first you could convert the date using .astype('datetime64[ns, US/Eastern]') on the column. This lets you apply pandas datetime specific methods on the column. Try running df.Dates.astype('datetime64[ns, US/Eastern]').dt.to_period(freq='M') now.
## Dummy Data
dates = [
'10/2018',
'11/2018',
'8/2019',
'5/2020',
]
df = pd.DataFrame(dates, columns=['Dates'])
print(df.Dates.dtype)
print(df)
## To convert the column to datetime and reformat
df['Dates'] = df.Dates.astype('datetime64[ns, US/Eastern]') #.dt.strftime('%m/%Y')
print(df.Dates.dtype)
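As a hedged alternative to the astype call, pd.to_datetime with an explicit format is a common way to parse such strings. The sketch below swaps in that technique; the column name Reformatted_Dates is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Dates": ["10/2018", "11/2018", "8/2019", "5/2020"]})

# Parse month/year strings; the day defaults to the first of the month
df["Dates"] = pd.to_datetime(df["Dates"], format="%m/%Y")
# Format back to zero-padded MM/YYYY strings
df["Reformatted_Dates"] = df["Dates"].dt.strftime("%m/%Y")
print(df)
```

This variant produces timezone-naive datetimes; whether a timezone is needed depends on the data.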
C. Avoid using the for loop
Try this. You can use the built-in vectorization of pandas on a column instead of looping over each row. I have used .dt.month and .dt.year on the column to get the month and year.
eHomeHBEdata['HBE date'] = pd.to_datetime(eHomeHBEdata['Course Completed'])
eHomeDates = eHomeHBEdata['HBE date'] # this should be in datetime format
## This is what I changed
eMonths = eHomeDates.dt.month
eYears = eHomeDates.dt.year
eHomeHBEdata.insert(0,'workshop Month',eMonths)
eHomeHBEdata.insert(1,'workshop Year',eYears)
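One plausible explanation for the floats, which the question does not confirm, is that some 'Course Completed' values are missing or unparseable. Such rows become NaT, their month and year become NaN, and a numeric column containing NaN is stored as float. The sketch below uses made-up dates to show the effect and two ways around it.

```python
import pandas as pd

# Hypothetical data with one missing date
dates = pd.to_datetime(
    pd.Series(["10/28/2018", None, "03/05/2015"]), format="%m/%d/%Y"
)

months = dates.dt.month
print(months.dtype)  # float dtype, because the missing date yields NaN

# Option 1: format the datetime directly, no numeric columns involved
session = "eHome " + dates.dt.strftime("%m/%Y")

# Option 2: keep whole numbers with pandas' nullable integer dtype
months_text = months.astype("Int64").astype(str)
print(session.tolist(), months_text.tolist())
```

In both options the row with the missing date stays missing rather than turning into a string like "1.0".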

Not all dates are captured when filtering by dates. Python Pandas

I am filtering a dataframe by dates to produce two separate versions:
Data from only today's date
Data from the last two years
However, when I try to filter on the date, it seems to miss dates that are within the last two years.
date_format = '%m-%d-%Y' # desired date format
today = dt.now().strftime(date_format) # today's date. Will always result in today's date
today = dt.strptime(today, date_format).date() # converting 'today' into a datetime object
today = today.strftime(date_format)
two_years = today - relativedelta(years=2) # date is today's date minus two years.
two_years = two_years.strftime(date_format)
# normalizing the format of the date column to the desired format
df_data['date'] = pd.to_datetime(df_data['date'], errors='coerce').dt.strftime(date_format)
df_today = df_data[df_data['date'] == today]
df_two_year = df_data[df_data['date'] >= two_years]
Which results in:
all dates ['07-17-2020' '07-15-2020' '08-01-2019' '03-25-2015']
today df ['07-17-2020']
two year df ['07-17-2020' '08-01-2019']
The 07-15-2020 date is missing from the two year, even though 08-01-2019 is captured.

You don't need to convert anything to string; simply work with the datetime dtype. Example:
import pandas as pd
df = pd.DataFrame({'date': pd.to_datetime(['07-17-2020','07-15-2020','08-01-2019','03-25-2015'])})
today = pd.Timestamp('now')
print(df[df['date'].dt.date == today.date()])
# date
# 0 2020-07-17
print(df[(df['date'].dt.year >= today.year-1) & (df['date'].dt.date != today.date())])
# date
# 1 2020-07-15
# 2 2019-08-01
What you get from the comparison operations (adjust them as needed...) are boolean masks - you can use them nicely to filter the df.

Your datatype conversions are the problem here. You could do this:
today = dt.now() # today's date. Will always result in today's date
two_years = today - relativedelta(years=2) # date is today's date minus two years.
This prints '2018-07-17 18:40:42.704395'. You can then convert it to the date only format.
two_years = two_years.strftime(date_format)
two_years = dt.strptime(two_years, date_format).date()
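To see why the string version drops 07-15-2020, note that strings are compared character by character, so '07-15-2020' sorts before '07-17-2018' even though it is the later date. A small sketch, using the dates from the question and a fixed cutoff of 07-17-2018 in place of "today minus two years":

```python
import pandas as pd

dates = pd.Series(["07-17-2020", "07-15-2020", "08-01-2019", "03-25-2015"])

# Comparing MM-DD-YYYY strings is lexicographic, not chronological
kept_as_text = dates[dates >= "07-17-2018"].tolist()

# Comparing parsed datetimes gives the chronological result
parsed = pd.to_datetime(dates, format="%m-%d-%Y")
kept_as_dates = dates[parsed >= pd.Timestamp("2018-07-17")].tolist()

print(kept_as_text)
print(kept_as_dates)
```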

NaNs when extracting no. of days between two dates in pandas

I have a dataframe that contains the columns company_id, seniority, join_date and quit_date. I am trying to extract the number of days between join date and quit date. However, I get NaNs.
If I drop off all the columns in the dataframe except for quit date and join date and run the same code again, I get what I expect. However with all the columns, I get NaNs.
Here's my code:
df['join_date'] = pd.to_datetime(df['join_date'])
df['quit_date'] = pd.to_datetime(df['quit_date'])
df['days'] = df['quit_date'] - df['join_date']
df['days'] = df['days'].astype(str)
df1 = pd.DataFrame(df.days.str.split(' ').tolist(), columns = ['days', 'unwanted', 'stamp'])
df['numberdays'] = df1['days']
This is what I get:
days numberdays
585 days 00:00:00 NaN
340 days 00:00:00 NaN
I want 585 from the 'days' column in the 'numberdays' column. Similarly for every such row.
Can someone help me with this?
Thank you!

Instead of converting to string, extract the number of days from the timedelta value using the dt accessor.
import pandas as pd
df = pd.DataFrame({'join_date': ['2014-03-24', '2013-04-29', '2014-10-13'],
                   'quit_date': ['2015-10-30', '2014-04-04', '']})
df['join_date'] = pd.to_datetime(df['join_date'])
df['quit_date'] = pd.to_datetime(df['quit_date'])
df['days'] = df['quit_date'] - df['join_date']
df['number_of_days'] = df['days'].dt.days
@Mohammad Yusuf Ghazi points out the difference between the two accessors: dt.days gives the number of days of a timedelta value, whereas dt.day applies to datetime data and gives the day of the month.
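A possible cause of the NaNs in the question, offered as an assumption since the question does not show the index, is index alignment: a helper DataFrame built from a list gets a fresh 0..n-1 index, and assigning one of its columns to df matches rows by label. The sketch below uses a hypothetical non-default index to show both the misaligned assignment and the dt.days approach.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "join_date": ["2014-03-24", "2013-04-29"],
        "quit_date": ["2015-10-30", "2014-04-04"],
    },
    index=[10, 20],  # hypothetical non-default index, e.g. left over from filtering
)
df["join_date"] = pd.to_datetime(df["join_date"])
df["quit_date"] = pd.to_datetime(df["quit_date"])

# dt.days works on the timedelta column and keeps df's own index
df["numberdays"] = (df["quit_date"] - df["join_date"]).dt.days

# A helper frame built from a list has index 0, 1, so no labels match
helper = pd.DataFrame({"days": ["585", "340"]})
df["misaligned"] = helper["days"]
print(df[["numberdays", "misaligned"]])
```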
