I have a DataFrame with a Date column. I want to remove the rows whose Date value does not contain a four-digit year (YYYY, e.g. 2018; it can be any year).
I tried the apply method with a regular expression, but it doesn't work:
df[df.Date.apply(lambda x: re.findall(r'[0-9]{4}', x))]
The Date column can have values such as:
12/3/2018
March 12, 2018
stackoverflow
Mar 12, 2018
no date text
3/12/2018
So the output here should be:
12/3/2018
March 12, 2018
Mar 12, 2018
3/12/2018
One approach is to use pd.to_datetime with errors="coerce", which converts values that cannot be parsed to NaT so that those rows can be dropped.
Example:
import pandas as pd
df = pd.DataFrame({"Col1": ['12/3/2018', 'March 12, 2018', 'stackoverflow', 'Mar 12, 2018', 'no date text', '3/12/2018']})
df["Col1"] = pd.to_datetime(df["Col1"], errors="coerce")
df = df[df["Col1"].notnull()]
print(df)
Output:
Col1
0 2018-12-03
1 2018-03-12
3 2018-03-12
5 2018-03-12
Or, if you want to keep the original strings:
import pandas as pd
def validateDate(d):
    # Return the original string if it parses as a date, otherwise None.
    try:
        pd.to_datetime(d)
        return d
    except Exception:
        return None
df = pd.DataFrame({"Col1": ['12/3/2018', 'March 12, 2018', 'stackoverflow', 'Mar 12, 2018', 'no date text', '3/12/2018']})
df["Col1"] = df["Col1"].apply(validateDate)
df.dropna(inplace=True)
print(df)
Output:
Col1
0 12/3/2018
1 March 12, 2018
3 Mar 12, 2018
5 3/12/2018
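If the requirement really is only "contains a four-digit year", the regex idea from the question can also be made to work. The original attempt returned the lists produced by re.findall, which are not a boolean mask, and that is presumably why the indexing failed. Below is a minimal sketch, assuming the column holds strings and that "a year" means a run of exactly four digits; the names has_year and filtered are only illustrative. Note that this does not validate the date itself.

```python
import pandas as pd

df = pd.DataFrame({"Date": ['12/3/2018', 'March 12, 2018', 'stackoverflow',
                            'Mar 12, 2018', 'no date text', '3/12/2018']})

# Boolean mask: True where the string contains a run of exactly four digits.
# na=False treats missing values as "no match" instead of propagating NaN.
has_year = df["Date"].str.contains(r"(?<!\d)\d{4}(?!\d)", na=False)

# Index with the mask to keep only the matching rows; the strings are untouched.
filtered = df[has_year]
print(filtered)
```

Unlike the pd.to_datetime approaches, this would also keep a value such as "order 1234", so it fits only when the four-digit check is all that is needed.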
Related
I have a df with dates in the format %B %Y (e.g. June 2021, December 2022 etc.)
Date      Price
Apr 2022  2
Dec 2021  8
I am trying to sort dates in order of oldest to newest but when I try:
.sort_values(by='Date', ascending=False)
it sorts in alphabetical order.
The 'Date' column has dtype object.
ascending=False will sort from newest to oldest, but you are asking to sort oldest to newest, so you don't need that option;
there is a key option to specify how to parse the values before sorting them;
you may or may not want option ignore_index=True, which I included below.
We can use the key option to parse the values into datetime objects with pandas.to_datetime.
import pandas as pd
df = pd.DataFrame({'Date': ['Apr 2022', 'Dec 2021', 'May 2022', 'May 2021'], 'Price': [2, 8, 12, 15]})
df = df.sort_values(by='Date', ignore_index=True, key=pd.to_datetime)
print(df)
# Date Price
# 0 May 2021 15
# 1 Dec 2021 8
# 2 Apr 2022 2
# 3 May 2022 12
Relevant documentation:
DataFrame.sort_values;
to_datetime.
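A variant of the same idea is to pass an explicit format to the key function, so that parsing does not depend on format inference. This is only a sketch: the helper name parse_month_year is made up, and the '%b %Y' format assumes abbreviated month names as in the sample data (for full names such as "June 2021" it would need to be '%B %Y').

```python
import pandas as pd

df = pd.DataFrame({'Date': ['Apr 2022', 'Dec 2021', 'May 2022', 'May 2021'],
                   'Price': [2, 8, 12, 15]})

def parse_month_year(col):
    # The key function receives the whole column (a Series) and must
    # return a Series of the same length to sort by.
    return pd.to_datetime(col, format='%b %Y')

# Sort oldest to newest; the Date column itself stays as strings.
sorted_df = df.sort_values(by='Date', key=parse_month_year, ignore_index=True)
print(sorted_df)
```

The key is applied only for ordering, so the displayed values are unchanged. If the dates are needed for more than sorting, converting the column once with pd.to_datetime may be simpler.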
I have a pandas date column in the following format
Date
0 March 13, 2020, March 13, 2020
1 3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021
2 NaN
3 May 20, 2022, May 21, 2022
I tried to convert the dates to a single format and store the result in a new column.
import pandas as pd
import dateutil.parser
# initialise data of lists.
data = {'Date':['March 13, 2020, March 13, 2020', '3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021', 'NaN','May 20, 2022, May 21, 2022']}
# Create DataFrame
df = pd.DataFrame(data)
df["FormattedDate"] = df.Date.apply(lambda x: dateutil.parser.parse(x.strftime("%Y-%m-%d") ))
But I am getting an error:
AttributeError: 'str' object has no attribute 'strftime'
Desired Output
Date DateFormatted
0 March 13, 2020, March 13, 2020 2020-03-13, 2020-03-13
1 3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021 2021-03-09, 2021-03-09, 2021-03-09, 2021-09-03
2 NaN NaN
3 May 20, 2022, May 21, 2022 2022-05-20, 2022-05-21
I was the author of the previous solution. Because a comma is used both as the separator and inside the date strings, one possible solution is to avoid splitting on commas: use Series.str.extractall to pull out each date, convert the matches to datetimes, and finally aggregate them with join:
import numpy as np
import pandas as pd
# Raw strings keep the regex backslashes from being treated as string escapes.
format_list = [r"[0-9]{1,2}(?:\,|\.|\/|\-)(?:\s)?[0-9]{1,2}(?:\,|\.|\/|\-)(?:\s)?[0-9]{2,4}",
               r"[0-9]{1,2}(?:\.)(?:\s)?(?:(?:(?:j|J)a)|(?:(?:f|F)e)|(?:(?:m|M)a)|(?:(?:a|A)p)|(?:(?:m|M)a)|(?:(?:j|J)u)|(?:(?:a|A)u)|(?:(?:s|S)e)|(?:(?:o|O)c)|(?:(?:n|N)o)|(?:(?:d|D)e))\w*(?:\s)?[0-9]{2,4}",
               r"(?:(?:(?:j|J)an)|(?:(?:f|F)eb)|(?:(?:m|M)ar)|(?:(?:a|A)pr)|(?:(?:m|M)ay)|(?:(?:j|J)un)|(?:(?:j|J)ul)|(?:(?:a|A)ug)|(?:(?:s|S)ep)|(?:(?:o|O)ct)|(?:(?:n|N)ov)|(?:(?:d|D)ec))\w*(?:\s)?(?:\n)?[0-9]{1,2}(?:\s)?(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}",
               r"[0-9]{1,2}(?:\.)?(?:\s)?(?:\n)?(?:(?:(?:j|J)a)|(?:(?:f|F)e)|(?:(?:m|M)a)|(?:(?:a|A)p)|(?:(?:m|M)a)|(?:(?:j|J)u)|(?:(?:a|A)u)|(?:(?:s|S)e)|(?:(?:o|O)c)|(?:(?:n|N)o)|(?:(?:d|D)e))\w*(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}"]
# initialise data of lists.
data = {'Name':['Today is 09 September 2021', np.nan, '25 December 2021 is christmas', '01/01/2022 is newyear and will be holiday on 02.01.2022 also']}
# Create DataFrame
df = pd.DataFrame(data)
import dateutil.parser
f = lambda x: dateutil.parser.parse(x).strftime("%Y-%m-%d")
df['DateFormatted'] = df['Name'].str.extractall(f'({"|".join(format_list)})')[0].apply(f).groupby(level=0).agg(','.join)
print (df)
Name DateFormatted
0 Today is 09 September 2021 2021-09-09
1 NaN NaN
2 25 December 2021 is christmas 2021-12-25
3 01/01/2022 is newyear and will be holiday on 0... 2022-01-01,2022-02-01
Another alternative is to use Series.str.findall, drop the missing values, and process each resulting list in a generator expression with join:
import dateutil.parser
f = lambda x: dateutil.parser.parse(x).strftime("%Y-%m-%d")
df['Date'] = df['Name'].str.findall("|".join(format_list)).dropna().apply(lambda y: ','.join(f(x) for x in y))
print (df)
Name Date
0 Today is 09 September 2021 2021-09-09
1 NaN NaN
2 25 December 2021 is christmas 2021-12-25
3 01/01/2022 is newyear and will be holiday on 0... 2022-01-01,2022-02-01
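Two details of the lambda f are worth spelling out. First, it parses the string and only then calls strftime on the resulting datetime, which is the reverse of the attempt in the question (calling strftime on a str is what raised the AttributeError). Second, dateutil reads an ambiguous all-numeric date such as 02.01.2022 month-first by default, which is why the output above shows 2022-02-01. The sketch below illustrates both points; to_iso is a hypothetical helper name, and whether dayfirst=True is appropriate depends on where the data comes from.

```python
import dateutil.parser

def to_iso(text, dayfirst=False):
    """Parse one date string, then format the datetime as YYYY-MM-DD."""
    parsed = dateutil.parser.parse(text, dayfirst=dayfirst)  # str -> datetime
    return parsed.strftime("%Y-%m-%d")                       # datetime -> str

print(to_iso("02.01.2022"))                 # 2022-02-01 (month-first default)
print(to_iso("02.01.2022", dayfirst=True))  # 2022-01-02 (day-first reading)
```

If the source uses day-first dates, the same keyword could be passed inside f, e.g. dateutil.parser.parse(x, dayfirst=True).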
As per the discussion, extracting the date, year, and quarter in pandas is done as below:
df = pd.DataFrame({'date_text': ['Jan 2020', 'May 2020', 'Jun 2020']})
df ['date'] = pd.to_datetime ( df.date_text ).dt.date
df ['year'], df ['month'],df['qtr'] = df ['date'].dt.year, df ['date'].dt.month, df ['date'].dt.quarter
However, the interpreter returns an error:
AttributeError: Can only use .dt accessor with datetimelike values
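May I know what I did wrong?
Fix it by removing the first .dt.date. It converts the column to plain Python date objects (dtype object), so the .dt accessor is no longer available on it:
df['date'] = pd.to_datetime(df.date_text)
df['year'], df['month'], df['qtr'] = df['date'].dt.year, df['date'].dt.month, df['date'].dt.quarter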
df
Out[43]:
date_text date year month qtr
0 Jan 2020 2020-01-01 2020 1 1
1 May 2020 2020-05-01 2020 5 2
2 Jun 2020 2020-06-01 2020 6 2
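If plain date objects are still wanted in the date column, one option is to derive the year, month, and quarter from the datetime values first and convert to dates only at the end. A minimal sketch, assuming the text always looks like "Jan 2020" (hence the explicit '%b %Y' format); the intermediate name ts is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'date_text': ['Jan 2020', 'May 2020', 'Jun 2020']})

# Keep a datetime64 Series around so the .dt accessor stays available.
ts = pd.to_datetime(df['date_text'], format='%b %Y')

df['year'] = ts.dt.year
df['month'] = ts.dt.month
df['qtr'] = ts.dt.quarter

# Convert to plain datetime.date objects last; this column no longer supports .dt.
df['date'] = ts.dt.date
print(df)
```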
I'm trying to convert the dates in a column to 'Month Year' format without changing the non-date values.
input_df = pd.DataFrame({'Period' :['2017-11-01 00:00:00', '2019-02-01 00:00:00', 'Mar 2020', 'Pre-Nov 2017', '2019-10-01 00:00:00' , 'Nov 17-Nov 18'] } )
I tried the code below, which didn't work:
output_df['Period'] = input_df['Period'].apply(lambda x: x.strftime('%m %Y') if isinstance(x, datetime.date) else x)
Please help.
You can do this with errors='coerce' and fillna: values that fail to parse become NaT, and fillna restores the original string for them:
input_df['new_period'] = (pd.to_datetime(input_df['Period'], errors='coerce')
.dt.strftime('%b %Y')
.fillna(input_df['Period'])
)
Output:
Period new_period
0 2017-11-01 00:00:00 Nov 2017
1 2019-02-01 00:00:00 Feb 2019
2 Mar 2020 Mar 2020
3 Pre-Nov 2017 Pre-Nov 2017
4 2019-10-01 00:00:00 Oct 2019
5 Nov 17-Nov 18 Nov 17-Nov 18
Update: Second, safer option:
s = pd.to_datetime(input_df['Period'], errors='coerce')
input_df['new_period'] = np.where(s.isna(), input_df['Period'],
s.dt.strftime('%b %Y'))
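For reference, here is the second option as a self-contained script. It is a sketch with one deliberate change that is my assumption, not part of the answer above: an explicit format is passed to pd.to_datetime, so that only full timestamps in the '%Y-%m-%d %H:%M:%S' layout are reformatted and every other value is left exactly as it was.

```python
import numpy as np
import pandas as pd

input_df = pd.DataFrame({'Period': ['2017-11-01 00:00:00', '2019-02-01 00:00:00',
                                    'Mar 2020', 'Pre-Nov 2017',
                                    '2019-10-01 00:00:00', 'Nov 17-Nov 18']})

# Assumption: only values in this exact timestamp layout count as dates.
# Anything else becomes NaT because of errors='coerce'.
s = pd.to_datetime(input_df['Period'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Where parsing failed keep the original string, otherwise use 'Mon YYYY'.
input_df['new_period'] = np.where(s.isna(), input_df['Period'],
                                  s.dt.strftime('%b %Y'))
print(input_df)
```

With the explicit format, a value such as 'Mar 2020' is passed through unchanged rather than being parsed and re-rendered; for this data the result is the same either way.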
I am very new to pandas and I want to do the following, but I am having some trouble with groupby. Please help.
I have a dataframe with many columns one of which is date.
I need a list of distinct month year from it.
df = pd.DataFrame(['02 Jan 2018', '02 Feb 2018', '02 Feb 2018', '02 Mar 2018'], columns=['date'])
datelist = pd.to_datetime(df.date)
datelist = datelist.groupby([datelist.dt.month, datelist.dt.year])
When I do datelist.all() I get the following:
date date
1 2018 True
2 2018 True
Name: date, dtype: bool
I need something like ['Jan 2018', 'Feb 2018']
I would really appreciate your help.
Thanks
Use to_datetime, convert to custom strings with strftime, get the unique values, and finally convert them to a list:
datelist = pd.to_datetime(df.date).dt.strftime('%b %Y').unique().tolist()
print (datelist)
['Jan 2018', 'Feb 2018', 'Mar 2018']
Another solution, if the input format is always like 02 Jan 2018, is to split on the first whitespace, select the second part, and get the unique values:
datelist = df['date'].str.split(n=1).str[1].unique().tolist()
You can use to_period (for a Series this would be dt.to_period):
In [11]: datelist.to_period("M")
Out[11]:
PeriodIndex(['2019-01', '2019-01', '2019-01', '2019-01', '2019-01', '2019-01',
...
'2019-02', '2019-02', '2019-02', '2019-02', '2019-02'],
dtype='period[M]', freq='M')
In [12]: datelist.to_period("M").unique()
Out[12]: PeriodIndex(['2019-01', '2019-02'], dtype='period[M]', freq='M')
In [13]: [str(m) for m in datelist.to_period("M").unique()]
Out[13]: ['2019-01', '2019-02']
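The session above works on a DatetimeIndex. For the Series from the question the same idea goes through the .dt accessor. This is a sketch under the assumption that every value follows the '02 Jan 2018' layout; the names months, iso_labels and pretty_labels are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(['02 Jan 2018', '02 Feb 2018', '02 Feb 2018', '02 Mar 2018'],
                  columns=['date'])

# Parse the strings, then truncate each timestamp to its month.
months = pd.to_datetime(df['date'], format='%d %b %Y').dt.to_period('M')

# unique() keeps the first occurrence of each month, in order of appearance.
unique_months = months.unique()

iso_labels = [str(m) for m in unique_months]                  # e.g. '2018-01'
pretty_labels = [m.strftime('%b %Y') for m in unique_months]  # e.g. 'Jan 2018'
print(iso_labels)
print(pretty_labels)
```

Working with periods rather than strings keeps the values sortable by time, which the 'Jan 2018' style strings are not.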