Cleaning date column in python with multiple date formats - python

I am cleaning up a dataframe that has date of birth and date of death as a string. There are multiple formats of dates in those columns. Some contain just year (which is all I need). These are the formats of dates:
Jan 10 2020
1913
10/8/2019
June 14th 1980
All I need is the year from each date. I have not been having any luck with pandas to_datetime since a significant portion of the rows only have year to begin with.
Is there a way for me to pull just year from the strings so that I can get each column to look like:
2020
1913
2019
1980

The simplest way is to use a parser which will accept these and other formats:
import pandas as pd
from dateutil import parser
df = pd.DataFrame({"mydates":["Jan 10 2020", "1913", "10/8/2019", "June 14th 1980"]})
df['years'] = df['mydates'].apply(parser.parse).dt.strftime('%Y')
print(df)

You can use str.extract:
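# str.extract needs a capture group; expand=False returns a Series instead of a DataFrame
df['BirthDate'] = df['BirthDate'].str.extract(r'(\d{4})', expand=False)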
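As a rough sketch of the regex approach on the four sample strings from the question: the column name BirthYear and the lookarounds are my own additions, used here so that the pattern only matches a standalone run of exactly four digits.

```python
import pandas as pd

df = pd.DataFrame(
    {"BirthDate": ["Jan 10 2020", "1913", "10/8/2019", "June 14th 1980"]}
)

# (?<!\d) and (?!\d) stop the match from landing inside a longer run of digits.
# The parentheses form the capture group that str.extract requires.
df["BirthYear"] = df["BirthDate"].str.extract(r"(?<!\d)(\d{4})(?!\d)", expand=False)

years = df["BirthYear"].tolist()
print(years)
```

The result is still a column of strings; rows with no four-digit year come back as NaN, so something like pd.to_numeric(df["BirthYear"]) may be a better fit than astype(int) if integers are needed.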

Related

How to convert date format (dd/mm/yyyy) to days in python csv

I need a function to count the total number of days in the 'days' column between a start date of 1st Jan 1995 and an end date of 31st Dec 2019 in a dataframe taking leap years into account as well.
Example: 1st Jan 1995 - Day 1, 1st Feb 1995 - Day 32 .......and so on all the way to 31st.
If you want to filter a pandas dataframe to the rows that fall between two dates, you can do this:
start_date = '1995/01/01'
end_date = '1995/02/01'
df = df[ (df['days']>=start_date) & (df['days']<=end_date) ]
len(df) then gives the number of rows in the filtered dataframe.
If instead you want to calculate the number of days between two dates, you can do that without pandas, using datetime:
from datetime import datetime
start_date = '1995/01/01'
end_date = '1995/02/01'
delta = datetime.strptime(end_date, '%Y/%m/%d') - datetime.strptime(start_date, '%Y/%m/%d')
print(delta.days)
Output:
31
Note that datetime subtraction already takes leap years into account. To number the days as in the question, with 1st Jan 1995 as Day 1, add 1 to delta.days.
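For the day numbering the question asks for, something along these lines might work. It is only a sketch: it assumes the 'days' column holds dd/mm/yyyy strings (as in the question title), and the sample rows and the day_number column name are made up for illustration. Date subtraction counts leap days on its own, so no special handling is needed.

```python
import pandas as pd

# Hypothetical sample rows in dd/mm/yyyy format.
df = pd.DataFrame({"days": ["01/01/1995", "01/02/1995", "29/02/1996", "31/12/2019"]})

start = pd.Timestamp("1995-01-01")
dates = pd.to_datetime(df["days"], format="%d/%m/%Y")

# Day 1 is the start date itself, hence the + 1.
df["day_number"] = (dates - start).dt.days + 1

day_numbers = df["day_number"].tolist()
print(day_numbers)
```

With this numbering 1st Feb 1995 is Day 32, as in the question, and 31st Dec 2019 is Day 9131 (25 years of 365 days plus 6 leap days).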

How to convert "Day_Name Month Day_No. Time Year" to date format?

I have a date column in a df with values like Fri Apr 01 16:41:32 +0000 2022. I want to convert it into proper date column format 01/04/2022 16:41:32. Where 01 is day and 04 is the month.
Any guidance please?
You can use pandas.to_datetime to get a datetime column, then convert it to the desired format with Series.dt.strftime.
import pandas as pd
# example df
df = pd.DataFrame({'date': ['Fri Apr 01 16:41:32 +0000 2022' ,
'Sat Apr 02 16:41:32 +0000 2022']})
df['date'] = pd.to_datetime(df['date']).dt.strftime('%d/%m/%Y %H:%M:%S')
print(df)
date
0 01/04/2022 16:41:32
1 02/04/2022 16:41:32
You can use this to get the datetime type.
from dateutil import parser
date=parser.parse("Fri Apr 01 16:41:32 +0000 2022")
If you want a specific string format, you can then use strftime()
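A minimal sketch of that last step, using the sample string from the question and the target format it asks for (the variable names are just for illustration):

```python
from dateutil import parser

parsed = parser.parse("Fri Apr 01 16:41:32 +0000 2022")

# Day first, then month, as requested in the question.
formatted = parsed.strftime("%d/%m/%Y %H:%M:%S")
print(formatted)
```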
Alternatively, parse the string manually. First create a dictionary that maps each month name to its number, for example the key "apr" has the value 04.
Then write a function that uses a regex to extract the month name, year, time and day, apply it to every row with the apply method, and store the output in a new column as a tuple.
Finally, use the apply method again to build the custom column as
datetime.datetime(year=..., month=..., ...)
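The manual approach could be sketched roughly as follows. The names MONTHS, PATTERN and to_datetime_manual are hypothetical, and the regex assumes every row has exactly the layout shown in the question; this version also builds the datetime in a single pass rather than going through an intermediate tuple column.

```python
import re
from datetime import datetime

import pandas as pd

# Month abbreviation -> month number (hypothetical helper mapping).
MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Assumed layout: "Fri Apr 01 16:41:32 +0000 2022"
PATTERN = re.compile(
    r"^\w{3} (\w{3}) (\d{2}) (\d{2}):(\d{2}):(\d{2}) [+-]\d{4} (\d{4})$"
)

def to_datetime_manual(text):
    """Build a naive datetime from the pieces captured by PATTERN."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError(f"unexpected date layout: {text!r}")
    mon, day, hour, minute, second, year = match.groups()
    return datetime(year=int(year), month=MONTHS[mon.lower()], day=int(day),
                    hour=int(hour), minute=int(minute), second=int(second))

df = pd.DataFrame({"date": ["Fri Apr 01 16:41:32 +0000 2022",
                            "Sat Apr 02 16:41:32 +0000 2022"]})
df["parsed"] = pd.to_datetime(df["date"].apply(to_datetime_manual))
df["formatted"] = df["parsed"].dt.strftime("%d/%m/%Y %H:%M:%S")
print(df["formatted"].tolist())
```

Note that this sketch ignores the UTC offset, which is fine only if every row is +0000 as in the example.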

Sort out dates with different formats Python

I am having a hard time trying to sort dates with different formats. I have a Series with inputs containing dates in many different formats and need to extract them and sort them chronologically.
So far I have setup different regex for fully numerical dates (01/01/1989), dates with month (either Mar 12 1989 or March 1989 or 12 Mar 1989) and dates where only the year is given (see code below)
pat1=r'(\d{0,2}[/-]\d{0,2}[/-]\d{2,4})' # matches mm/dd/yy and mm/dd/yyyy
pat2=r'((\d{1,2})?\W?(Jan|Feb|Mar|Apr|May|June|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\W+(\d{1,2})?\W?\d{4})'
pat3=r'((?<!\d)(\d{4})(?!\d))'
finalpat=pat1 + "|"+ pat2 + "|" + pat3
df2=df1.str.extractall(finalpat).groupby(level=0).first()
I now got a dataframe with the different regex expressions above in different columns that I need to transform in usable times.
The problem I have is that I got dates like Mar 12 1989 and 12 Mar 1989 and Mar 1989 (no day) in the same column of my dataframe.
Without two formats ( Month dd YYYY and dd Month YYYY) I can easily do this :
df3=df2.copy()
dico={"Jan":'01','Feb':'02','Mar':'03','Apr':'04','May':'05','Jun':'06','Jul':'07','Aug':'08','Sep':'09','Oct':'10','Nov':'11','Dec':'12'}
# Keep only the first 3 letters of the month name.
df3[1]=df3[1].str.replace(r"(?<=[A-Z]{1}[a-z]{2})\w*","",regex=True)
# Replace the month abbreviation by its number.
for key,item in dico.items():
    df3[1]=df3[1].str.replace(key,item)
# Add 01 if no day is given.
df3[1]=df3[1].str.replace(r"^(\d{1,2}/\d{4})",r'01/\g<1>',regex=True)
df3[1]=pd.to_datetime(df3[1],format='%d/%m/%Y').dt.strftime('%Y%m%d')
where df3[1] is the column of interest. I use a dictionary to change Month to their number and get my dates as I want them.
The problem is that with two formats of dates ( Mar 12 1989 and 12 Mar 1989), one of the two format will be wrongly transformed.
Is there a way to discriminate between the date formats and apply different transformations accordingly ?
Thanks a lot
"The problem I have is that I got dates like Mar 12 1989 and 12 Mar 1989 and Mar 1989 (no day) in the same column of my dataframe."
pandas.to_datetime can cope with that. Consider the following example:
import pandas as pd
df = pd.DataFrame({'d_str':["Mar 12 1989", "12 Mar 1989", "Mar 1989"]})
df['d_dt'] = pd.to_datetime(df.d_str)
print(df)
output
d_str d_dt
0 Mar 12 1989 1989-03-12
1 12 Mar 1989 1989-03-12
2 Mar 1989 1989-03-01
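Now you can sort using d_dt, as it has type datetime64[ns], but keep in mind that a missing day is treated as the 1st day of the given month. Note that in pandas 2.0 and later the format is inferred from the first value, so a column with mixed formats like this one needs pd.to_datetime(df.d_str, format='mixed'). Be warned, though, that it might fail if your data contains dates in middle-endian format (mm/dd/yy).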
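If you would rather not depend on how a particular pandas version infers formats, one option is to parse each string separately with dateutil and sort on the result. This is only a sketch: the sample values and the names raw, fallback and ordered are made up, and the fallback date is what makes a missing day or month come out as the 1st or January.

```python
from datetime import datetime

import pandas as pd
from dateutil import parser

raw = pd.Series(["12 Mar 1989", "Jan 5 1990", "Mar 1989", "1988"])

# Missing fields are filled from this fallback (hypothetical choice: 1 January).
fallback = datetime(1900, 1, 1)
parsed = pd.to_datetime(raw.apply(lambda s: parser.parse(s, default=fallback)))

# Reorder the original strings chronologically.
ordered = raw.loc[parsed.sort_values().index].tolist()
print(ordered)
```

Parsing row by row is slower than a single vectorized to_datetime call, so this is probably best kept for columns that really do mix formats.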

Split dataframe into two using Date as splitting point

I have a dataframe with 100,000 rows and 24 columns, representing crime over a one-year period, October 2019 - October 2020.
I'm trying to split my df into two: one dataframe with all rows from October 1st to March 31st, and a second with all rows from April 1st to October 31st.
Would anyone be able to kindly show me how to do this using pandas?
Assuming the column is of datetime type, you can do it like this:
import pandas as pd
# pd.datetime was removed in pandas 2.0, and 03 is not a valid integer literal in Python 3
split_date = pd.Timestamp(2020, 3, 31)
df_1 = df.loc[df['Date'] <= split_date]
df_2 = df.loc[df['Date'] > split_date]
If the column containing the dates is not of datetime type, you should first convert it:
df['Date'] = pd.to_datetime(df['Date'])
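Here is a small self-contained sketch of the split, with made-up rows and a hypothetical Crime column. It compares against the first day of the second period rather than the last day of the first, which may be safer if the Date column also carries a time of day (a row stamped 2020-03-31 10:00 is later than midnight on 31 March).

```python
import pandas as pd

# Made-up sample data; the real frame would have 24 columns.
df = pd.DataFrame({
    "Date": ["2019-10-01", "2020-03-31 10:00", "2020-04-01", "2020-10-31"],
    "Crime": ["a", "b", "c", "d"],
})
df["Date"] = pd.to_datetime(df["Date"].str.slice(0, 10))

# Everything strictly before 1 April 2020 goes into the first frame.
split_date = pd.Timestamp(2020, 4, 1)
first_half = df["Date"] < split_date

df_1 = df[first_half]
df_2 = df[~first_half]
print(len(df_1), len(df_2))
```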

ValueError in dataframe while trying to extract day, month and year using datetime python library

I have three columns in my dataframe: Tweet Posted Time (UTC), Tweet Content, and Tweet Location. The "Tweet Posted Time (UTC)" column has date object in the format: 31 Mar 2020 10:49:01
My objective is to reformat the dataframe in such a way that the 'Tweet Posted Time (UTC)' column displays only the day, month and the year alone (such as 31-03-2020), to be able to plot a time-series graph, but my attempts result in the error below.
ValueError: time data '0 31 Mar 2020 10:49:01\n1 31 Mar 2020 05:48:43\n2 30 Mar 2020 05:38:50\n3 29 Mar 2020 21:19:23\n4 29 Mar 2020 20:28:22\n ... \n2488 02 Jan 2018 13:36:07\n2489 02 Jan 2018 10:33:21\n2490 01 Jan 2018 12:23:47\n2491 01 Jan 2018 06:03:51\n2492 01 Jan 2018 02:09:15\nName: Tweet Posted Time (UTC), Length: 2451, dtype: object' does not match format '%d %b %Y %H:%M:%S'
My code is below, can you tell me what I am doing wrong, please?
from datetime import datetime
import pandas as pd
import re #regular expression
from textblob import TextBlob
import string
import preprocessor as p
pd.set_option("expand_frame_repr", False)
df1 = pd.read_csv("C:/tweet_data.csv")
dataType = df1.dtypes
print(dataType)
# convert datetime object to string
old_formatDate = str(df1['Tweet Posted Time (UTC)'])
# extract day, month, and year and convert back to datetime object
date_TimeObject = datetime.strptime(old_formatDate, '%d %b %Y %H:%M:%S')
new_formatDate = date_TimeObject.strftime('%d-%m-%Y')
print(new_formatDate)
I researched and solved the problem. The error occurred because str() on a whole column returns the printed representation of the entire Series, which strptime cannot parse. Instead, I took the column as a pandas Series, converted it to datetime format, and then applied dt.strftime.
df.columns = ['Tweet_Posted_Time', 'Tweet_Content', 'Tweet_Location']
print(df)
# Select the date and time column (Tweet_Posted_Time) as a pandas Series
df1 = df['Tweet_Posted_Time']
print(df1)
# Convert the pandas Series to datetime format
df1 = pd.to_datetime(df1)
print(df1)
# Convert the date column to the new date format
df1 = df1.dt.strftime('%d-%m-%Y')
print(df1)
# Replace the column "Tweet_Posted_Time" in the original data frame with the reformatted Series
# (assign returns a new data frame, so the result has to be stored)
df = df.assign(Tweet_Posted_Time=df1)
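The same conversion can probably be written more compactly. The sketch below uses two made-up rows in the format from the question and passes that format explicitly to pd.to_datetime; the new column name Tweet Posted Date is my own.

```python
import pandas as pd

df = pd.DataFrame({
    "Tweet Posted Time (UTC)": ["31 Mar 2020 10:49:01", "01 Jan 2018 02:09:15"],
})

# Parse the whole column at once instead of calling strptime on str(column).
posted = pd.to_datetime(df["Tweet Posted Time (UTC)"], format="%d %b %Y %H:%M:%S")

# Day-month-year strings for display.
df["Tweet Posted Date"] = posted.dt.strftime("%d-%m-%Y")
print(df["Tweet Posted Date"].tolist())
```

For a time-series plot it may be better to keep the datetime values (for example posted.dt.normalize()) rather than the formatted strings, since strings in dd-mm-yyyy form do not sort chronologically.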
