I am very new to pandas and am having trouble with groupby. Please help.
I have a dataframe with many columns, one of which is a date.
I need a list of the distinct month-year combinations from it.
df = pd.DataFrame(['02 Jan 2018', '02 Feb 2018', '02 Feb 2018', '02 Mar 2018'], columns=['date'])
datelist = pd.to_datetime(df.date)
datelist = datelist.groupby([datelist.dt.month, datelist.dt.year])
When I call datelist.all() I get the following:
date date
1 2018 True
2 2018 True
Name: date, dtype: bool
I need something like ['Jan 2018', 'Feb 2018']
I would really appreciate your help.
Thanks
Use to_datetime, convert to custom strings with strftime, get the unique values, and finally convert to a list:
datelist = pd.to_datetime(df.date).dt.strftime('%b %Y').unique().tolist()
print (datelist)
['Jan 2018', 'Feb 2018', 'Mar 2018']
Another solution, if the input format is always like 02 Jan 2018: split on the first whitespace, select the second part, and get the unique values:
datelist = df['date'].str.split(n=1).str[1].unique().tolist()
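Neither approach above guarantees chronological order, since unique() keeps values in order of first appearance. A minimal sketch, assuming the dates always follow the '02 Jan 2018' pattern and an English locale for %b, that sorts the parsed dates before formatting (the names dates and month_years are only illustrative):

```python
import pandas as pd

df = pd.DataFrame(['02 Mar 2018', '02 Jan 2018', '02 Feb 2018', '02 Feb 2018'],
                  columns=['date'])

# Parse with an explicit format so every row is interpreted the same way.
dates = pd.to_datetime(df['date'], format='%d %b %Y')

# Sort chronologically first; unique() then keeps that order.
month_years = dates.sort_values().dt.strftime('%b %Y').unique().tolist()
print(month_years)
```

Sorting the datetime values rather than the formatted strings matters here, because 'Feb 2018' would sort before 'Jan 2018' alphabetically.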
You can use to_period (for a Series this would be dt.to_period). Here datelist is a DatetimeIndex:
In [11]: datelist.to_period("M")
Out[11]:
PeriodIndex(['2019-01', '2019-01', '2019-01', '2019-01', '2019-01', '2019-01',
...
'2019-02', '2019-02', '2019-02', '2019-02', '2019-02'],
dtype='period[M]', freq='M')
In [12]: datelist.to_period("M").unique()
Out[12]: PeriodIndex(['2019-01', '2019-02'], dtype='period[M]', freq='M')
In [13]: [str(m) for m in datelist.to_period("M").unique()]
Out[13]: ['2019-01', '2019-02']
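For the Series from the question, the same idea could look like the sketch below. It assumes the question's sample data and uses Period.strftime to get labels such as 'Jan 2018' instead of '2018-01'; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(['02 Jan 2018', '02 Feb 2018', '02 Feb 2018', '02 Mar 2018'],
                  columns=['date'])

# On a Series, to_period is reached through the .dt accessor.
periods = pd.to_datetime(df['date'], format='%d %b %Y').dt.to_period('M')
unique_periods = periods.unique()

# Each element is a pandas Period; format it as month name plus year.
labels = [p.strftime('%b %Y') for p in unique_periods]
iso_labels = [str(p) for p in unique_periods]
print(labels)
print(iso_labels)
```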
I want to convert a column of date strings such as '19 Desember 2022' (the month name is in Indonesian) to a supported datetime format without translating it. How do I do that?
I already tried this:
df_train['date'] = pd.to_datetime(df_train['date'], format='%d %B %Y')
but got the error:
time data '19 Desember 2022' does not match format '%d %B %Y' (match)
Try using dateparser
import dateparser
df_train = pd.DataFrame(['19 Desember 2022', '20 Desember 2022', '21 Desember 2022', '22 Desember 2022'], columns = ['date'])
df_train['date'] = [dateparser.parse(x) for x in df_train['date']]
df_train
Output:
date
0 2022-12-19
1 2022-12-20
2 2022-12-21
3 2022-12-22
With an English locale, pandas doesn't recognize Indonesian (Bahasa Indonesia) month names. Try replacing the spelling of December; as pointed out, you can do this in one line and create a new column:
df_train["formatted_date"] = pd.to_datetime(df_train["date"].str.replace("Desember", "December"), format="%d %B %Y")
print(df_train)
Output:
user_type date formatted_date
0 Anggota 19 Desember 2022 2022-12-19
1 Anggota 19 Desember 2022 2022-12-19
2 Anggota 19 Desember 2022 2022-12-19
3 Anggota 19 Desember 2022 2022-12-19
4 Anggota 19 Desember 2022 2022-12-19
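The single replace above only handles December. A minimal sketch of the same idea for all twelve months follows; the ID_MONTHS mapping and the sample rows are assumptions added for illustration, not part of the original data, and the code assumes every value has the form "day month year" separated by whitespace.

```python
import pandas as pd

# Indonesian month names mapped to two-digit month numbers (illustrative helper).
ID_MONTHS = {
    'Januari': '01', 'Februari': '02', 'Maret': '03', 'April': '04',
    'Mei': '05', 'Juni': '06', 'Juli': '07', 'Agustus': '08',
    'September': '09', 'Oktober': '10', 'November': '11', 'Desember': '12',
}

df_train = pd.DataFrame({'date': ['19 Desember 2022', '05 Mei 2021', '17 Agustus 1945']})

# Split into day / month / year, then swap the month name for its number.
parts = df_train['date'].str.split(expand=True)
parts.columns = ['day', 'month', 'year']
numeric = parts['day'] + ' ' + parts['month'].map(ID_MONTHS) + ' ' + parts['year']

# A purely numeric format no longer depends on the locale's month names.
df_train['formatted_date'] = pd.to_datetime(numeric, format='%d %m %Y')
print(df_train)
```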
I have a df with dates in the format %B %Y (e.g. June 2021, December 2022 etc.)
Date      Price
Apr 2022  2
Dec 2021  8
I am trying to sort the dates from oldest to newest, but when I try:
.sort_values(by='Date', ascending=False)
it sorts them in alphabetical order.
The 'Date' column has dtype object.
ascending=False will sort from newest to oldest, but you are asking to sort oldest to newest, so you don't need that option;
there is a key option to specify how to parse the values before sorting them;
you may or may not want option ignore_index=True, which I included below.
We can use the key option to parse the values into datetime objects with pandas.to_datetime.
import pandas as pd
df = pd.DataFrame({'Date': ['Apr 2022', 'Dec 2021', 'May 2022', 'May 2021'], 'Price': [2, 8, 12, 15]})
df = df.sort_values(by='Date', ignore_index=True, key=pd.to_datetime)
print(df)
# Date Price
# 0 May 2021 15
# 1 Dec 2021 8
# 2 Apr 2022 2
# 3 May 2022 12
Relevant documentation:
DataFrame.sort_values;
to_datetime.
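If you would rather not rely on format inference, the key can pass an explicit format. A small sketch, assuming every value looks like 'Apr 2022' (abbreviated English month plus four-digit year); parse_month_year is a hypothetical helper name:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['Apr 2022', 'Dec 2021', 'May 2022', 'May 2021'],
                   'Price': [2, 8, 12, 15]})

def parse_month_year(col):
    # The key receives the whole column and must return a same-length Series.
    return pd.to_datetime(col, format='%b %Y')

oldest_first = df.sort_values(by='Date', key=parse_month_year, ignore_index=True)
newest_first = df.sort_values(by='Date', key=parse_month_year,
                              ascending=False, ignore_index=True)
print(oldest_first)
print(newest_first)
```

The 'Date' column itself stays as strings; only the sort order is computed from the parsed values.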
The objective is to extract the columns of df whose headers are month-year labels, omitting the others.
The code below is one way this can be achieved:
import pandas as pd
df = pd.DataFrame([['PP1', 'LN', 'T1', 'C11', 'C21', 'C31', 'C32']])
df.columns = ['dummy1', 'dummy2', 'Jan-20', 'Feb-20', 'Jan 2021', 'Feb 2080', 'Dec 1993']
extract_header_name=list(df.columns.values)
lookup_list= ['Jan', 'Feb', 'Mar','Apr', 'May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
month_year_list=[i for e in lookup_list for i in extract_header_name if e in i]
Output
['Jan-20', 'Jan 2021', 'Feb-20', 'Feb 2080', 'Dec 1993']
However, I wonder if there is a more efficient or built-in pandas way to achieve a similar result?
Use str.contains with the values joined by | (regex OR, meaning Jan or Feb or ...) and filter df.columns by boolean indexing:
month_year_list = df.columns[df.columns.str.contains('|'.join(lookup_list))].tolist()
print (month_year_list)
['Jan-20', 'Feb-20', 'Jan 2021', 'Feb 2080', 'Dec 1993']
Or use str.startswith after converting the list to a tuple:
month_year_list = df.columns[df.columns.str.startswith(tuple(lookup_list))].tolist()
Another idea, if there are only these 2 datetime formats:
s = df.columns.to_series()
s1 = pd.to_datetime(s, format='%b-%y', errors='coerce')
s2 = pd.to_datetime(s, format='%b %Y', errors='coerce')
month_year_list = df.columns[s1.fillna(s2).notna()].tolist()
print (month_year_list)
['Jan-20', 'Feb-20', 'Jan 2021', 'Feb 2080', 'Dec 1993']
Select columns using df.filter and extract their names.
list(df.filter(regex='|'.join(lookup_list)).columns)
['Jan-20', 'Feb-20', 'Jan 2021', 'Feb 2080', 'Dec 1993']
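The substring-based filters above would also match a header such as 'Marketing', because it contains 'Mar'. A stricter sketch, assuming headers are always a month abbreviation, then '-' or a space, then a two- or four-digit year (the 'Marketing' column is a made-up example to show the difference):

```python
import pandas as pd

lookup_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

columns = pd.Index(['dummy1', 'Marketing', 'Jan-20', 'Feb-20',
                    'Jan 2021', 'Feb 2080', 'Dec 1993'])

# Loose match: any header containing a month abbreviation anywhere.
loose = columns[columns.str.contains('|'.join(lookup_list))].tolist()

# Strict match: month abbreviation, separator, then a 2- or 4-digit year.
pattern = r'^(?:' + '|'.join(lookup_list) + r')[- ](?:\d{2}|\d{4})$'
month_year_list = columns[columns.str.match(pattern)].tolist()

print(loose)
print(month_year_list)
```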
A two-part question.
I'm attempting to convert a column to datetime, an easy task I assumed, as I've done it before on different dataframes using the documentation without much issue.
df = pd.DataFrame({'date' : ['24 October 2018', '23 April 2018', '18 January 2018']})
print(df)
date
0 24 October 2018
1 23 April 2018
2 18 January 2018
I was going through the datetime docs and I thought this piece of code would convert this column (which is an object) into a datetime
df.date = pd.to_datetime(df['date'], format="%d-%m-%Y",errors='ignore')
which gives the error :
ValueError: time data '24 April 2018' does not match format '%d-%m-%Y' (match)
I've attempted playing with formulas and going through documentation to no avail!
You are using the wrong format. '24 October 2018' matches format="%d %B %Y". The format specifiers are listed in the Python documentation for strftime() and strptime() format codes.
edit: -Demo-
>>> import pandas as pd
>>> df = pd.DataFrame({'date':['24 October 2018', '23 April 2018', '18 January 2018']})
>>> df.date = pd.to_datetime(df['date'], format="%d %B %Y")
>>>
>>> df
date
0 2018-10-24
1 2018-04-23
2 2018-01-18
>>>
>>> df['date'][0]
Timestamp('2018-10-24 00:00:00')
>>> df['date'][0].month
10
edit 2: second question
>>> df['status'] = ['complete', 'complete', 'requested']
>>> df
date status
0 2018-10-24 complete
1 2018-04-23 complete
2 2018-01-18 requested
>>>
>>> df[df['status'] != 'complete']
date status
2 2018-01-18 requested
You can use pd.to_datetime or the datetime library
import datetime as dt
df['date'].apply(lambda x: dt.datetime.strptime(x,'%d %B %Y'))
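When some rows might not match the format, errors='coerce' turns them into NaT instead of raising, which makes the bad rows easy to inspect. A minimal sketch; the extra 'not a date' row is an assumption added to show the behaviour:

```python
import pandas as pd

df = pd.DataFrame({'date': ['24 October 2018', '23 April 2018',
                            '18 January 2018', 'not a date']})

# Rows that do not match the format become NaT rather than raising ValueError.
parsed = pd.to_datetime(df['date'], format='%d %B %Y', errors='coerce')

# Use the NaT positions to look at the original strings that failed to parse.
bad_rows = df[parsed.isna()]
print(parsed)
print(bad_rows)
```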
I have a dataframe with a Date column. I want to remove the rows whose Date value doesn't contain a four-digit year (YYYY, e.g. 2018; it can be any year).
I tried the apply method with a regular expression, but it doesn't work:
df[df.Date.apply(lambda x: re.findall(r'[0-9]{4}', x))]
The Date column can have values such as,
12/3/2018
March 12, 2018
stackoverflow
Mar 12, 2018
no date text
3/12/2018
So here output should be
12/3/2018
March 12, 2018
Mar 12, 2018
3/12/2018
This is one approach, using pd.to_datetime with errors="coerce".
Example:
import pandas as pd
df = pd.DataFrame({"Col1": ['12/3/2018', 'March 12, 2018', 'stackoverflow', 'Mar 12, 2018', 'no date text', '3/12/2018']})
df["Col1"] = pd.to_datetime(df["Col1"], errors="coerce")
df = df[df["Col1"].notnull()]
print(df)
Output:
Col1
0 2018-12-03
1 2018-03-12
3 2018-03-12
5 2018-03-12
Or, if you want to keep the original data:
import pandas as pd
def validateDate(d):
    # Return the original string if it parses as a date, otherwise None.
    try:
        pd.to_datetime(d)
        return d
    except (ValueError, TypeError):
        return None
df = pd.DataFrame({"Col1": ['12/3/2018', 'March 12, 2018', 'stackoverflow', 'Mar 12, 2018', 'no date text', '3/12/2018']})
df["Col1"] = df["Col1"].apply(validateDate)
df.dropna(inplace=True)
print(df)
Output:
Col1
0 12/3/2018
1 March 12, 2018
3 Mar 12, 2018
5 3/12/2018
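The regex idea from the question can also be made to work without parsing dates at all, by building a boolean mask with str.contains. This is a sketch under the assumption that "has a year" simply means "contains a standalone four-digit number"; it does not check that the value is a valid date, and it does not depend on how to_datetime handles mixed formats.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['12/3/2018', 'March 12, 2018', 'stackoverflow',
                            'Mar 12, 2018', 'no date text', '3/12/2018']})

# True where the string contains exactly four digits bounded by non-word characters.
has_year = df['Date'].str.contains(r'\b\d{4}\b', na=False)

# Boolean indexing keeps the original strings untouched.
filtered = df[has_year]
print(filtered)
```

The attempt in the question fails because re.findall returns a list per row, while boolean indexing needs one True/False value per row.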