convert month_year value to month name and year columns in python - python

I have a sample dataframe:
year_month
202004
202005
202011
202012
How can I append a month_name column (month name + year) to the dataframe, like this?
year_month month_name
202004 April 2020
202005 May 2020
202011 Nov 2020
202012 Dec 2020

You can use datetime.strptime to convert your string into a datetime object, then you can use datetime.strftime to convert it back into a string with different format.
>>> import datetime as dt
>>> import pandas as pd
>>> df = pd.DataFrame(['202004', '202005', '202011', '202012'], columns=['year_month'])
>>> df['month_name'] = df['year_month'].apply(lambda x: dt.datetime.strptime(x, '%Y%m').strftime('%b %Y'))
>>> df
year_month month_name
0 202004 Apr 2020
1 202005 May 2020
2 202011 Nov 2020
3 202012 Dec 2020
The full list of format codes is in the Python documentation for strftime() and strptime().
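The same conversion can also be done on the whole column at once, without apply. The following is a minimal sketch, assuming the year_month values are six-character strings as in the example above; the extra month_name_full column is only there to illustrate the difference between %b and %B.

```python
import pandas as pd

df = pd.DataFrame({"year_month": ["202004", "202005", "202011", "202012"]})

# Parse the whole column in one call; each value becomes the first day of its month.
parsed = pd.to_datetime(df["year_month"], format="%Y%m")

# %b is the abbreviated month name, %B the full one (both follow the active locale).
df["month_name"] = parsed.dt.strftime("%b %Y")
df["month_name_full"] = parsed.dt.strftime("%B %Y")
print(df)
```

If the column holds integers rather than strings, converting it with astype(str) first keeps the format string unambiguous.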

Related

how to convert a column with string datetime to datetime format

I want to convert a column of string dates such as '19 Desember 2022' (the month name is in Indonesian) to a supported datetime format without translating it by hand. How do I do that?
I already tried this:
df_train['date'] = pd.to_datetime(df_train['date'], format='%d %B %Y')
but got the error:
time data '19 Desember 2022' does not match format '%d %B %Y' (match)
Try using dateparser
import dateparser
df_train = pd.DataFrame(['19 Desember 2022', '20 Desember 2022', '21 Desember 2022', '22 Desember 2022'], columns = ['date'])
df_train['date'] = [dateparser.parse(x) for x in df_train['date']]
df_train
Output:
date
0 2022-12-19
1 2022-12-20
2 2022-12-21
3 2022-12-22
Pandas doesn't recognize Indonesian (Bahasa Indonesia) month names. Try replacing the spelling of December; as pointed out, you can do this in one line and create a new column:
df_train["formatted_date"] = pd.to_datetime(df_train["date"].str.replace("Desember", "December"), format="%d %B %Y")
print(df_train)
Output:
user_type date formatted_date
0 Anggota 19 Desember 2022 2022-12-19
1 Anggota 19 Desember 2022 2022-12-19
2 Anggota 19 Desember 2022 2022-12-19
3 Anggota 19 Desember 2022 2022-12-19
4 Anggota 19 Desember 2022 2022-12-19
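The single replacement above only covers December. A minimal sketch that extends the same idea to every month follows; the mapping ID_TO_EN and the helper parse_indonesian_dates are hypothetical names introduced for illustration, and the sketch assumes the process runs in an English (or C) locale so that %B matches English month names.

```python
import pandas as pd

# Indonesian month names that differ from English ones
# (April, September and November are spelled the same in both).
ID_TO_EN = {
    "Januari": "January", "Februari": "February", "Maret": "March",
    "Mei": "May", "Juni": "June", "Juli": "July",
    "Agustus": "August", "Oktober": "October", "Desember": "December",
}

def parse_indonesian_dates(dates: pd.Series) -> pd.Series:
    """Swap Indonesian month names for English ones, then parse 'day month year'."""
    english = dates
    for indonesian, english_name in ID_TO_EN.items():
        english = english.str.replace(indonesian, english_name, regex=False)
    return pd.to_datetime(english, format="%d %B %Y")

df_train = pd.DataFrame({"date": ["19 Desember 2022", "3 Mei 2021", "17 Agustus 2020"]})
df_train["formatted_date"] = parse_indonesian_dates(df_train["date"])
print(df_train)
```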

Aggregate columns with same date (sum) in csv

My code is returning the following data in CSV
Quantity Date of purchase
1 17 May 2022 at 5:40:20PM BST
1 2 Apr 2022 at 7:41:29PM BST
1 2 Apr 2022 at 6:42:05PM BST
1 29 Mar 2022 at 12:34:56PM BST
1 29 Mar 2022 at 10:52:54AM BST
1 29 Mar 2022 at 12:04:52AM BST
1 28 Mar 2022 at 4:49:34PM BST
1 28 Mar 2022 at 11:13:37AM BST
1 27 Mar 2022 at 8:53:05PM BST
1 27 Mar 2022 at 5:10:21PM BST
I am trying to keep only the dates and sum the quantities that share the same date. This is the code I have so far:
data = read_csv("products_sold_history_data.csv")
data['Date of purchase'] = pandas.to_datetime(data['Date of purchase'] , format='%d-%m-%Y').dt.date
but it gives me an error. Can anyone please help? How can I take only the dates from the Date of purchase column and then sum the quantity values for each date?
The date format in your data is not the one you specified (format='%d-%m-%Y').
You could specify it explicitly, or let pandas infer the format for you by not providing the format:
pandas.to_datetime(data['Date of purchase']).dt.date
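If you want to specify the format explicitly, it has to match your data. Use %I rather than %H, because the times are on a 12-hour clock with AM/PM; note that %Z may not accept an abbreviation such as BST in every pandas version:
pandas.to_datetime(data['Date of purchase'], format='%d %b %Y at %I:%M:%S%p %Z')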
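Since only the calendar date matters for the sum, the time and timezone can be dropped before parsing. Here is a minimal sketch, assuming every value contains the literal separator " at " as in the sample data; purchase_date and daily_totals are names introduced for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "Quantity": [1, 1, 1, 1],
    "Date of purchase": [
        "17 May 2022 at 5:40:20PM BST",
        "2 Apr 2022 at 7:41:29PM BST",
        "2 Apr 2022 at 6:42:05PM BST",
        "29 Mar 2022 at 12:34:56PM BST",
    ],
})

# Keep only the text before " at ", e.g. "17 May 2022".
date_text = data["Date of purchase"].str.split(" at ").str[0]

# Parse the date part and reduce it to a plain calendar date.
data["purchase_date"] = pd.to_datetime(date_text, format="%d %b %Y").dt.date

# Sum the quantities for each calendar date.
daily_totals = data.groupby("purchase_date", as_index=False)["Quantity"].sum()
print(daily_totals)
```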
Here is one way to do it, where the date is created on the fly as a temporary field and does not become part of the DataFrame.
Also, IIUC, you're not concerned with the time part; the date is all you need for the sum.
Extract the date part with a regex, create a temporary field dte using DataFrame.assign, and then use groupby to sum the quantity:
df.assign(dte=pd.to_datetime(
        df['purchase'].str.extract(r'(.*)(at)')[0].str.strip())
    ).groupby('dte')['qty'].sum().reset_index()
dte qty
0 2022-02-06 3
1 2022-02-07 3
2 2022-02-08 2
3 2022-02-09 2
4 2022-02-10 2
5 2022-02-11 3
6 2022-02-14 1
7 2022-02-15 1
8 2022-02-19 1

Multiple Date Formats to a single date pattern in pandas dataframe

I have a pandas date column in the following format
Date
0 March 13, 2020, March 13, 2020
1 3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021
2 NaN
3 May 20, 2022, May 21, 2022
I tried to convert the pattern to a single pattern to store to a new column.
import pandas as pd
import dateutil.parser
# initialise data of lists.
data = {'Date':['March 13, 2020, March 13, 2020', '3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021', 'NaN','May 20, 2022, May 21, 2022']}
# Create DataFrame
df = pd.DataFrame(data)
df["FormattedDate"] = df.Date.apply(lambda x: dateutil.parser.parse(x.strftime("%Y-%m-%d") ))
But i am getting an error
AttributeError: 'str' object has no attribute 'strftime'
Desired Output
Date DateFormatted
0 March 13, 2020, March 13, 2020 2020-03-13, 2020-03-13
1 3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021 2021-03-09, 2021-03-09, 2021-03-09, 2021-09-03
2 NaN NaN
3 May 20, 2022, May 21, 2022 2022-05-20, 2022-05-21
I was the author of the previous solution. One possible change, so that a comma can act both as the separator and as a character inside the date strings, is to use Series.str.extractall, convert the matches to datetimes, and finally aggregate with join:
format_list = [r"[0-9]{1,2}(?:\,|\.|\/|\-)(?:\s)?[0-9]{1,2}(?:\,|\.|\/|\-)(?:\s)?[0-9]{2,4}",
r"[0-9]{1,2}(?:\.)(?:\s)?(?:(?:(?:j|J)a)|(?:(?:f|F)e)|(?:(?:m|M)a)|(?:(?:a|A)p)|(?:(?:m|M)a)|(?:(?:j|J)u)|(?:(?:a|A)u)|(?:(?:s|S)e)|(?:(?:o|O)c)|(?:(?:n|N)o)|(?:(?:d|D)e))\w*(?:\s)?[0-9]{2,4}",
r"(?:(?:(?:j|J)an)|(?:(?:f|F)eb)|(?:(?:m|M)ar)|(?:(?:a|A)pr)|(?:(?:m|M)ay)|(?:(?:j|J)un)|(?:(?:j|J)ul)|(?:(?:a|A)ug)|(?:(?:s|S)ep)|(?:(?:o|O)ct)|(?:(?:n|N)ov)|(?:(?:d|D)ec))\w*(?:\s)?(?:\n)?[0-9]{1,2}(?:\s)?(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}",
r"[0-9]{1,2}(?:\.)?(?:\s)?(?:\n)?(?:(?:(?:j|J)a)|(?:(?:f|F)e)|(?:(?:m|M)a)|(?:(?:a|A)p)|(?:(?:m|M)a)|(?:(?:j|J)u)|(?:(?:a|A)u)|(?:(?:s|S)e)|(?:(?:o|O)c)|(?:(?:n|N)o)|(?:(?:d|D)e))\w*(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}"]
import numpy as np
import pandas as pd
# initialise data of lists.
data = {'Name':['Today is 09 September 2021', np.nan, '25 December 2021 is christmas', '01/01/2022 is newyear and will be holiday on 02.01.2022 also']}
# Create DataFrame
df = pd.DataFrame(data)
import dateutil.parser
f = lambda x: dateutil.parser.parse(x).strftime("%Y-%m-%d")
df['DateFormatted'] = df['Name'].str.extractall(f'({"|".join(format_list)})')[0].apply(f).groupby(level=0).agg(','.join)
print (df)
Name DateFormatted
0 Today is 09 September 2021 2021-09-09
1 NaN NaN
2 25 December 2021 is christmas 2021-12-25
3 01/01/2022 is newyear and will be holiday on 0... 2022-01-01,2022-02-01
Another alternative is to process the lists, after removing missing values, in a generator expression with join:
import dateutil.parser
f = lambda x: dateutil.parser.parse(x).strftime("%Y-%m-%d")
df['Date'] = df['Name'].str.findall("|".join(format_list)).dropna().apply(lambda y: ','.join(f(x) for x in y))
print (df)
Name Date
0 Today is 09 September 2021 2021-09-09
1 NaN NaN
2 25 December 2021 is christmas 2021-12-25
3 01/01/2022 is newyear and will be holiday on 0... 2022-01-01,2022-02-01
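For the comma-separated cells shown in the question, a lighter approach may be enough. The sketch below is not the answer's method: it splits each cell on commas that are not followed by a four-digit year, then parses each piece with dateutil. It assumes that a comma inside a date is always followed by the year (as in "March 13, 2020") and that ambiguous numeric dates are month-first, which is dateutil's default; DATE_SEPARATOR and normalise_dates are hypothetical names.

```python
import re

import dateutil.parser
import pandas as pd

# Split on a comma unless it is followed by a four-digit year,
# so "March 13, 2020" stays in one piece.
DATE_SEPARATOR = re.compile(r",(?!\s*\d{4}\b)\s*")

def normalise_dates(cell):
    """Return the dates in `cell` as comma-separated YYYY-MM-DD strings."""
    if not isinstance(cell, str):
        return cell  # leave missing values untouched
    pieces = DATE_SEPARATOR.split(cell)
    return ", ".join(dateutil.parser.parse(p).strftime("%Y-%m-%d") for p in pieces)

df = pd.DataFrame({"Date": [
    "March 13, 2020, March 13, 2020",
    "3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021",
    None,
    "May 20, 2022, May 21, 2022",
]})
df["DateFormatted"] = df["Date"].apply(normalise_dates)
print(df)
```

Year-first inputs such as "2021-03-09, 2021-03-10" would break the splitting rule, so this only suits data shaped like the sample.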

Convert column 'day' to datetime with year specification

I have a dataframe that includes a column of day numbers for which the year is known:
print (df)
year day time
0 2012 227 800
15 2012 227 815
30 2012 227 830
... ... ... ...
194250 2013 226 1645
194265 2013 226 1700
I have attempted to convert the day numbers to datetime %m-%d using:
import pandas as pd
df['day'] = pd.to_datetime(df['day'], format='%j').dt.strftime('%m-%d')
which gives:
year day time
0 2012 08-15 800
15 2012 08-15 815
30 2012 08-15 830
... ... ... ...
194250 2013 08-14 1645
194265 2013 08-14 1700
but this conversion is incorrect because the 227th day of 2012 is August 14th (08-14). I believe this error is down to the lack of year specification in the conversion.
How can I specify the year in the conversion to get a) %Y-%m-%d ; b) %m-%d ; c)%Y-%m-%dT%H:%M from the dataframe I have?
Thank you
You can convert the columns to strings and feed them into pd.to_datetime, supplying the right parsing directives:
import pandas as pd
df = pd.DataFrame({'year': [2012, 2012], 'day' : [227, 228], 'time': [800, 0]})
df['datetime'] = pd.to_datetime(df.year.astype(str) + ' ' +
                                df.day.astype(str) + ' ' +
                                df.time.astype(str).str.zfill(4),
                                format='%Y %j %H%M')
df['datetime']
0 2012-08-14 08:00:00
1 2012-08-15 00:00:00
Name: datetime, dtype: datetime64[ns]
Formatting to string is just a call to strftime via dt accessor, e.g.
df['datetime'].dt.strftime('%Y-%m-%dT%H:%M')
0 2012-08-14T08:00
1 2012-08-15T00:00
Name: datetime, dtype: object
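The off-by-one day in the question is consistent with the year defaulting to 1900 when %j is parsed alone: 1900 is not a leap year, whereas 2012 is. A small sketch with the standard library's datetime.strptime illustrates the difference; pandas appears to behave the same way, judging by the output in the question.

```python
from datetime import datetime

# With no year in the format, strptime falls back to 1900 (not a leap year).
without_year = datetime.strptime("227", "%j")

# With the year supplied, day 227 is counted in the leap year 2012.
with_year = datetime.strptime("2012 227", "%Y %j")

print(without_year.strftime("%Y-%m-%d"))
print(with_year.strftime("%Y-%m-%d"))
```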
You can try converting year into a datetime and day into a timedelta. Remember to subtract one from the day, because January 1st is day 1:
dates = pd.to_datetime(df['year'], format='%Y') + \
        pd.to_timedelta(df['day'] - 1, unit='D')
Output:
0 2012-08-14
15 2012-08-14
30 2012-08-14
194250 2013-08-14
194265 2013-08-14
dtype: datetime64[ns]
Then extract the month and day with strftime (lowercase %m and %d; uppercase %M means minutes):
df['day'] = dates.dt.strftime('%m-%d')
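The timedelta approach can be extended to include the time column and to produce all three formats asked for in the question. This is a minimal sketch, assuming time is an integer in HHMM form (800 meaning 08:00); the output column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2012, 2012, 2013],
    "day": [227, 227, 226],
    "time": [800, 815, 1645],
})

timestamps = (
    pd.to_datetime(df["year"].astype(str), format="%Y")  # January 1st of each year
    + pd.to_timedelta(df["day"] - 1, unit="D")           # day-of-year offset
    + pd.to_timedelta(df["time"] // 100, unit="h")       # hours from HHMM
    + pd.to_timedelta(df["time"] % 100, unit="min")      # minutes from HHMM
)

df["date"] = timestamps.dt.strftime("%Y-%m-%d")              # a) %Y-%m-%d
df["month_day"] = timestamps.dt.strftime("%m-%d")            # b) %m-%d
df["iso_minute"] = timestamps.dt.strftime("%Y-%m-%dT%H:%M")  # c) %Y-%m-%dT%H:%M
print(df)
```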

Get correct datetime object from dataframe column with random string present with date and time

I have dataframe like this:
id Time
0 N01 Thu Sep 10 11:44:30 XYZ 2020
1 V33 Thu Sep 10 11:39:05 ABC 2020
2 N01 Thu Sep 10 11:44:30 XYZ 2020
I am trying to convert Time column to datetime object. If I'm using:
df1['Time'] = pd.to_datetime(df1['Time'])
It is throwing a warning message:
UnknownTimezoneWarning: tzname BRT identified but not understood. Pass `tzinfos` argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
category=UnknownTimezoneWarning)
I am aware that there is a format argument in pd.to_datetime() to pass the input format. But I don't know what to pass as format to bypass the random strings in the middle of the Time column.
Is there any way to correctly get the datetime object from the Time column so that the random strings don't have any effect?
If the characters you want to remove are always a run of consecutive uppercase letters, you can strip them with a regex replacement before parsing:
import pandas as pd
data={'id':['N01','V33','N01'],
'time':['Thu Sep 10 11:44:30 XYZ 2020','Thu Sep 10 11:39:05 ABC 2020','Thu Sep 10 11:44:30 XYZ 2020']}
df = pd.DataFrame(data)
df['time'] = pd.to_datetime(df['time'].str.replace(r'[A-Z]{3} ', '', regex=True), format='%a %b %d %H:%M:%S %Y')
print(df)
result:
id time
0 N01 2020-09-10 11:44:30
1 V33 2020-09-10 11:39:05
2 N01 2020-09-10 11:44:30
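If the token in the middle is not always uppercase letters, an alternative is to describe the parts you want to keep and drop whatever sits between them. The sketch below assumes every value has the shape "Day Mon DD HH:MM:SS <token> YYYY", with a single token that contains no spaces.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["N01", "V33"],
    "time": ["Thu Sep 10 11:44:30 XYZ 2020", "Thu Sep 10 11:39:05 BRT 2020"],
})

# Group 1 is everything up to the seconds, group 2 is the year;
# the single token between them (\S+) is discarded.
pattern = r"^(\w{3} \w{3} \d{1,2} \d{2}:\d{2}:\d{2}) \S+ (\d{4})$"
cleaned = df["time"].str.replace(pattern, r"\1 \2", regex=True)

df["time"] = pd.to_datetime(cleaned, format="%a %b %d %H:%M:%S %Y")
print(df)
```

The result is timezone-naive, since the middle token is thrown away rather than interpreted.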
