Creating a function to execute it on entire Dataframe - python

I have data that includes columns with date ranges:
col_1 col_2
'may 2021 - 2023' 'nov 2020 - feb 2021'
'jan 2022 - 2023' 'sep 2021- 2023'
With the code below I can create the required output, but I would like a function that takes a dataframe as input and produces the expected output:
s = df['col_1'].str.split(r'\s*-\s*')
df['year_1'] = (pd
.to_datetime(s.str[1])
.sub(pd.to_datetime(s.str[0])))
t = df['col_2'].str.split(r'\s*-\s*')
df['year_2'] = (pd
.to_datetime(t.str[1])
.sub(pd.to_datetime(t.str[0])))
To produce the output below I have to rerun the code for each column, changing the variable names every time. As explained, I need a function instead. Please note that there can be more columns, so the code should work for any number of them.
Expected Output
col_1 Year_1 col_2 Year_2
'may 2021 - 2023' 610 days 'sep 2017-dec 2017' 91 days
'jan 2022 - 2023' 365 days 'sep 2021- 2023' 487 days

You can use:
def compute_days(sr):
    parts = sr.str.strip("'").str.split('-', expand=True)
    start = pd.to_datetime(parts[0])
    end = pd.to_datetime(parts[1])
    return end - start
days = df.apply(compute_days).rename(columns=lambda x: f"Year_{x.split('_')[1]}")
out = pd.concat([df, days], axis=1)
Output:
col_1 col_2 Year_1 Year_2
0 'may 2021 - 2023' 'nov 2020 - feb 2021' 610 days 92 days
1 'jan 2022 - 2023' 'sep 2021- 2023' 365 days 487 days
2 '03/2017 - 08/2021' '2022 - 2023' 1614 days 365 days
3 '' '' NaT NaT
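As a hedged variation on the answer above, here is a minimal sketch of a single function that takes the whole dataframe and returns it with the extra columns. The names add_durations and _parse_part, and the prefix parameters, are made up for illustration. It parses each part element-wise with dateutil rather than calling pd.to_datetime on a whole column, on the assumption that one column may mix formats such as 'may 2021' and '2023'.

```python
from datetime import datetime

import pandas as pd
from dateutil import parser

# Missing fields (month, day) default to 1 January, so '2023' becomes 2023-01-01.
_DEFAULT = datetime(1900, 1, 1)


def _parse_part(token):
    """Parse one side of a range; return NaT for empty or missing text."""
    if not isinstance(token, str):
        return pd.NaT
    token = token.strip().strip("'").strip()
    if not token:
        return pd.NaT
    return pd.Timestamp(parser.parse(token, default=_DEFAULT))


def add_durations(df, prefix="col_", new_prefix="Year_"):
    """Return a copy of df with one duration column per '<prefix>N' column."""
    out = df.copy()
    for col in df.columns:
        if not col.startswith(prefix):
            continue
        # partition always yields three columns: start, separator, end
        parts = df[col].str.partition("-")
        start = pd.to_datetime(parts[0].map(_parse_part))
        end = pd.to_datetime(parts[2].map(_parse_part))
        out[new_prefix + col[len(prefix):]] = end - start
    return out


df = pd.DataFrame({
    "col_1": ["'may 2021 - 2023'", "'jan 2022 - 2023'"],
    "col_2": ["'nov 2020 - feb 2021'", "'sep 2021- 2023'"],
})
out = add_durations(df)
print(out)
```

Because the loop only looks at column names, the same call works unchanged when more col_N columns are added.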

Related

how to convert a column with string datetime to datetime format

I want to convert a column of date strings such as '19 Desember 2022' (the month name is in Indonesian) to a supported datetime format without translating it. How do I do that?
I already tried this:
df_train['date'] = pd.to_datetime(df_train['date'], format='%d %B %Y')
but got the error: time data '19 Desember 2022' does not match format '%d %B %Y' (match)
Try using dateparser
import dateparser
df_train = pd.DataFrame(['19 Desember 2022', '20 Desember 2022', '21 Desember 2022', '22 Desember 2022'], columns = ['date'])
df_train['date'] = [dateparser.parse(x) for x in df_train['date']]
df_train
Output:
date
0 2022-12-19
1 2022-12-20
2 2022-12-21
3 2022-12-22
Pandas doesn't recognize Indonesian (Bahasa Indonesia) month names. Try replacing the spelling of December (as pointed out, you can do this in one line and create a new column):
df_train["formatted_date"] = pd.to_datetime(df_train["date"].str.replace("Desember", "December"), format="%d %B %Y")
print(df_train)
Output:
user_type date formatted_date
0 Anggota 19 Desember 2022 2022-12-19
1 Anggota 19 Desember 2022 2022-12-19
2 Anggota 19 Desember 2022 2022-12-19
3 Anggota 19 Desember 2022 2022-12-19
4 Anggota 19 Desember 2022 2022-12-19
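The replace above only covers December. A minimal sketch that handles all twelve month names, assuming every value has the shape 'day month year' separated by spaces, could map the Indonesian name to a month number and let pandas assemble the date from its components. The names ID_MONTHS and parse_indonesian_dates are hypothetical; this also avoids depending on the locale used by %B.

```python
import pandas as pd

# Indonesian month names mapped to month numbers.
ID_MONTHS = {
    "Januari": 1, "Februari": 2, "Maret": 3, "April": 4,
    "Mei": 5, "Juni": 6, "Juli": 7, "Agustus": 8,
    "September": 9, "Oktober": 10, "November": 11, "Desember": 12,
}


def parse_indonesian_dates(dates):
    """Convert strings like '19 Desember 2022' to datetime64 values."""
    # Split on whitespace into day, month name, year
    parts = dates.str.split(expand=True)
    components = pd.DataFrame({
        "year": pd.to_numeric(parts[2]),
        "month": parts[1].map(ID_MONTHS),
        "day": pd.to_numeric(parts[0]),
    })
    # pd.to_datetime assembles dates from year/month/day columns
    return pd.to_datetime(components)


sample = pd.Series(["19 Desember 2022", "1 Mei 2023"])
parsed = parse_indonesian_dates(sample)
print(parsed)
```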

convert series of dates to int number of dates [duplicate]

I have a pandas Series that is of the following format
dates = [Nov 2022, Dec 2022, Jan 2023, Feb 2023 ..]
I want to create a dataframe that takes these values and holds the number of days in each month. Of course, I have to account for leap years.
I have created a small function that splits the dates into 2 dataframes, and 2 lists of months depending on whether they have 30 or 31 days, like the following:
month = [Nov, Dec, Jan, Feb ..] and
year = [2022, 2022, 2023, 2023 ..]
and then use the isin function: if the month is in listA, insert 31 days, and so on. I also check for leap years. However, I was wondering whether there is a way to automate this whole process with the pandas datetime tools.
If you want the number of days in this month:
dates = pd.Series(['Nov 2022', 'Dec 2022', 'Jan 2023', 'Feb 2023'])
out = (pd.to_datetime(dates, format='%b %Y')
.dt.days_in_month
)
# Or
out = (pd.to_datetime(dates, format='%b %Y')
.add(pd.offsets.MonthEnd(0))
.dt.day
)
Output:
0 30
1 31
2 31
3 28
dtype: int64
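Since the question asks about leap years specifically, here is a small check (a sketch, with sample dates chosen for illustration) that days_in_month already accounts for them. It cross-checks the result against calendar.monthrange from the standard library.

```python
import calendar

import pandas as pd

dates = pd.Series(["Feb 2023", "Feb 2024", "Nov 2022"])
parsed = pd.to_datetime(dates, format="%b %Y")

# Number of days per month according to pandas
days = parsed.dt.days_in_month.tolist()

# Same computation with the standard library, for comparison
expected = [calendar.monthrange(ts.year, ts.month)[1] for ts in parsed]

print(days)
print(days == expected)
```

February 2024 falls in a leap year, so it is reported with 29 days, with no manual check needed.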
previous interpretation
If I understand correctly, you want the day of year?
Assuming:
dates = pd.Series(['Nov 2022', 'Dec 2022', 'Jan 2023', 'Feb 2023'])
You can use:
pd.to_datetime(dates, format='%b %Y').dt.dayofyear
NB. The reference is the start of each month.
Output:
0 305
1 335
2 1
3 32
dtype: int64

Multiple Date Formats to a single date pattern in pandas dataframe

I have a pandas date column in the following format
Date
0 March 13, 2020, March 13, 2020
1 3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021
2 NaN
3 May 20, 2022, May 21, 2022
I tried to convert the pattern to a single pattern to store to a new column.
import pandas as pd
import dateutil.parser
# initialise data of lists.
data = {'Date':['March 13, 2020, March 13, 2020', '3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021', 'NaN','May 20, 2022, May 21, 2022']}
# Create DataFrame
df = pd.DataFrame(data)
df["FormattedDate"] = df.Date.apply(lambda x: dateutil.parser.parse(x.strftime("%Y-%m-%d") ))
But I am getting an error:
AttributeError: 'str' object has no attribute 'strftime'
Desired Output
Date DateFormatted
0 March 13, 2020, March 13, 2020 2020-03-13, 2020-03-13
1 3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021 2021-03-09, 2021-03-09, 2021-03-09, 2021-09-03
2 NaN NaN
3 May 20, 2022, May 21, 2022 2022-05-20, 2022-05-21
I was the author of the previous solution. Because , is used both as the separator and inside the date strings, one possible fix is to extract the dates with Series.str.extractall, convert them to datetimes, and finally aggregate them with join:
format_list = ["[0-9]{1,2}(?:\,|\.|\/|\-)(?:\s)?[0-9]{1,2}(?:\,|\.|\/|\-)(?:\s)?[0-9]{2,4}",
"[0-9]{1,2}(?:\.)(?:\s)?(?:(?:(?:j|J)a)|(?:(?:f|F)e)|(?:(?:m|M)a)|(?:(?:a|A)p)|(?:(?:m|M)a)|(?:(?:j|J)u)|(?:(?:a|A)u)|(?:(?:s|S)e)|(?:(?:o|O)c)|(?:(?:n|N)o)|(?:(?:d|D)e))\w*(?:\s)?[0-9]{2,4}",
"(?:(?:(?:j|J)an)|(?:(?:f|F)eb)|(?:(?:m|M)ar)|(?:(?:a|A)pr)|(?:(?:m|M)ay)|(?:(?:j|J)un)|(?:(?:j|J)ul)|(?:(?:a|A)ug)|(?:(?:s|S)ep)|(?:(?:o|O)ct)|(?:(?:n|N)ov)|(?:(?:d|D)ec))\w*(?:\s)?(?:\n)?[0-9]{1,2}(?:\s)?(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}",
"[0-9]{1,2}(?:\.)?(?:\s)?(?:\n)?(?:(?:(?:j|J)a)|(?:(?:f|F)e)|(?:(?:m|M)a)|(?:(?:a|A)p)|(?:(?:m|M)a)|(?:(?:j|J)u)|(?:(?:a|A)u)|(?:(?:s|S)e)|(?:(?:o|O)c)|(?:(?:n|N)o)|(?:(?:d|D)e))\w*(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}"]
import numpy as np
import pandas as pd
# initialise data of lists.
data = {'Name':['Today is 09 September 2021', np.nan, '25 December 2021 is christmas', '01/01/2022 is newyear and will be holiday on 02.01.2022 also']}
# Create DataFrame
df = pd.DataFrame(data)
import dateutil.parser
f = lambda x: dateutil.parser.parse(x).strftime("%Y-%m-%d")
df['DateFormatted'] = df['Name'].str.extractall(f'({"|".join(format_list)})')[0].apply(f).groupby(level=0).agg(','.join)
print (df)
Name DateFormatted
0 Today is 09 September 2021 2021-09-09
1 NaN NaN
2 25 December 2021 is christmas 2021-12-25
3 01/01/2022 is newyear and will be holiday on 0... 2022-01-01,2022-02-01
Another alternative is Series.str.findall: drop the missing values, then convert and join each list of matches with a generator expression:
import dateutil.parser
f = lambda x: dateutil.parser.parse(x).strftime("%Y-%m-%d")
df['Date'] = df['Name'].str.findall("|".join(format_list)).dropna().apply(lambda y: ','.join(f(x) for x in y))
print (df)
Name Date
0 Today is 09 September 2021 2021-09-09
1 NaN NaN
2 25 December 2021 is christmas 2021-12-25
3 01/01/2022 is newyear and will be holiday on 0... 2022-01-01,2022-02-01
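Note that in the outputs above '02.01.2022' was read as 1 February, because dateutil treats an ambiguous numeric date as month first by default. If the data is day first, parse accepts a dayfirst flag. A minimal sketch of the difference (the variable names are only for illustration):

```python
from dateutil import parser

ambiguous = "02.01.2022"

# Default: ambiguous numeric dates are read as month first
month_first = parser.parse(ambiguous).strftime("%Y-%m-%d")

# dayfirst=True reads the same string as day first
day_first = parser.parse(ambiguous, dayfirst=True).strftime("%Y-%m-%d")

print(month_first)
print(day_first)
```

To apply this to the solutions above, the lambda f could pass dayfirst=True to dateutil.parser.parse, assuming all ambiguous dates in the column follow the same convention.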

How can I get a substring of a row in a column returned by loc?

I have a dataframe (df) which has the column 'Date Created'.
I need to slice the string inside 'Date Created' so that I'm left with only the numerical day instead of the entire datetime string. For example, I want to turn 'Sun Mar 03 2020 11:52 pm' into "2020/03/" + 'string in Date Created'[8:10] (the 9th and 10th characters).
I tried this but I get a copy warning:
for x in range(len(df)):
    df.iloc[x]['date'] = "202003" + (df.iloc[x]['Date Created'])[8:10]
I went to the documentation, and it has instructions on how to use loc to get substrings, but only for a very specific example case that doesn't apply to my code.
I tried this then:
df['date'] = ''
df.loc[:,['Date Created']] = "202003"+ (df.loc[:,['Date Created']])[8:10]
But this also doesn't work. Can someone please help on how I can get the 9th and 10th character of each row of Date Created and assign that to a new column (or even replace the existing value in Date Created)? TIA!
I made up this dataframe.
df = pd.DataFrame({"Date Created": ["Sun Mar 03 2020 11:52 pm",
"Sun Mar 08 2020 11:52 pm",
"Sun Mar 09 2020 11:52 pm"]})
So with
df.loc[:, "Date Created"] = "202003" + df["Date Created"].str[8:10]
You'll get "20200303", "20200308" and "20200309" in the 'Date Created' column.
An alternative approach would be to access the day field of the datetime object:
import pandas as pd
df = pd.DataFrame({"Date Created": [
"Sun Mar 01 2020 11:52 pm",
"Sun Mar 08 2020 11:52 pm",
"Sun Mar 15 2020 11:52 pm"
]})
df
Output:
Date Created
0 Sun Mar 01 2020 11:52 pm
1 Sun Mar 08 2020 11:52 pm
2 Sun Mar 15 2020 11:52 pm
df['year'] = pd.DatetimeIndex(df['Date Created']).year
df['month'] = pd.DatetimeIndex(df['Date Created']).month
df['day'] = pd.DatetimeIndex(df['Date Created']).day
df['formatted'] = pd.DatetimeIndex(df['Date Created']).strftime('%Y/%m/%d')
df
Output:
Date Created year month day formatted
0 Sun Mar 01 2020 11:52 pm 2020 3 1 2020/03/01
1 Sun Mar 08 2020 11:52 pm 2020 3 8 2020/03/08
2 Sun Mar 15 2020 11:52 pm 2020 3 15 2020/03/15
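The code above lets pandas infer the layout of 'Date Created' and parses the column four times. As a hedged alternative, the column can be parsed once with an explicit format string. The format below is an assumption based on the sample values, and it relies on English day and month abbreviations being valid in the current locale.

```python
import pandas as pd

df = pd.DataFrame({"Date Created": [
    "Sun Mar 01 2020 11:52 pm",
    "Sun Mar 08 2020 11:52 pm",
    "Sun Mar 15 2020 11:52 pm",
]})

# weekday, month abbreviation, day, year, 12-hour time, am/pm marker
FMT = "%a %b %d %Y %I:%M %p"

# Parse once, then reuse the datetime Series through the .dt accessor
parsed = pd.to_datetime(df["Date Created"], format=FMT)
df["day"] = parsed.dt.day
df["formatted"] = parsed.dt.strftime("%Y/%m/%d")
print(df)
```

With an explicit format, a value that does not match raises an error instead of being guessed at, which makes malformed rows easier to spot.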

Date formatting in Pandas

I'm trying to format a column of dates to 'Month Year' format without changing non-date values.
input_df = pd.DataFrame({'Period' :['2017-11-01 00:00:00', '2019-02-01 00:00:00', 'Mar 2020', 'Pre-Nov 2017', '2019-10-01 00:00:00' , 'Nov 17-Nov 18'] } )
I tried the code below, which didn't work:
output_df['Period'] = input_df['Period'].apply(lambda x: x.strftime('%m %Y') if isinstance(x, datetime.date) else x)
Please help.
You can use errors='coerce' and fillna:
input_df['new_period'] = (pd.to_datetime(input_df['Period'], errors='coerce')
.dt.strftime('%b %Y')
.fillna(input_df['Period'])
)
Output:
Period new_period
0 2017-11-01 00:00:00 Nov 2017
1 2019-02-01 00:00:00 Feb 2019
2 Mar 2020 Mar 2020
3 Pre-Nov 2017 Pre-Nov 2017
4 2019-10-01 00:00:00 Oct 2019
5 Nov 17-Nov 18 Nov 17-Nov 18
Update: Second, safer option:
s = pd.to_datetime(input_df['Period'], errors='coerce')
input_df['new_period'] = np.where(s.isna(), input_df['Period'],
s.dt.strftime('%b %Y'))
