How to efficiently extract date year from dataframe header in Pandas? - python

The objective is to extract the df columns that fall under the month-year category while omitting the others.
The code below is one way this objective can be achieved:
import pandas as pd

df = pd.DataFrame([['PP1', 'LN', 'T1', 'C11', 'C21', 'C31', 'C32']])
df.columns = ['dummy1', 'dummy2', 'Jan-20', 'Feb-20', 'Jan 2021', 'Feb 2080', 'Dec 1993']

extract_header_name = list(df.columns.values)
lookup_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_year_list = [i for e in lookup_list for i in extract_header_name if e in i]
Output
['Jan-20', 'Jan 2021', 'Feb-20', 'Feb 2080', 'Dec 1993']
However, I wonder whether there is a more efficient approach or a pandas built-in that achieves a similar result?

Use str.contains with the values joined by | (regex "or", i.e. Jan or Feb ...) and filter df.columns by boolean indexing:
month_year_list = df.columns[df.columns.str.contains('|'.join(lookup_list))].tolist()
print (month_year_list)
['Jan-20', 'Feb-20', 'Jan 2021', 'Feb 2080', 'Dec 1993']
Or use Series.str.startswith after converting the list to a tuple:
month_year_list = df.columns[df.columns.str.startswith(tuple(lookup_list))].tolist()
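For the sample columns both variants return the same list; startswith only matches names that begin with a month abbreviation, so it will not pick up a column that merely contains one somewhere in the middle.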
Another idea, if only these 2 datetime formats occur:
s = df.columns.to_series()
s1 = pd.to_datetime(s, format='%b-%y', errors='coerce')
s2 = pd.to_datetime(s, format='%b %Y', errors='coerce')
month_year_list = df.columns[s1.fillna(s2).notna()].tolist()
print (month_year_list)
['Jan-20', 'Feb-20', 'Jan 2021', 'Feb 2080', 'Dec 1993']

Select columns using df.filter and extract their names.
list(df.filter(regex='|'.join(lookup_list)).columns)
['Jan-20', 'Feb-20', 'Jan 2021', 'Feb 2080', 'Dec 1993']
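Put together with the question's sample frame (nothing assumed beyond the columns and lookup_list defined above), the whole selection is a one-liner:
import pandas as pd

df = pd.DataFrame([['PP1', 'LN', 'T1', 'C11', 'C21', 'C31', 'C32']],
                  columns=['dummy1', 'dummy2', 'Jan-20', 'Feb-20', 'Jan 2021', 'Feb 2080', 'Dec 1993'])
lookup_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# filter(regex=...) keeps only the columns whose names match any month abbreviation
month_year_list = list(df.filter(regex='|'.join(lookup_list)).columns)
print(month_year_list)
# ['Jan-20', 'Feb-20', 'Jan 2021', 'Feb 2080', 'Dec 1993']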

Related

Python: Order Dates that are in the format: %B %Y

I have a df with dates in the format %B %Y (e.g. June 2021, December 2022 etc.)
Date      Price
Apr 2022      2
Dec 2021      8
I am trying to sort dates in order of oldest to newest but when I try:
.sort_values(by='Date', ascending=False)
it is ordering in alphabetical order.
The 'Date' column is an Object.
ascending=False will sort from newest to oldest, but you are asking to sort oldest to newest, so you don't need that option;
there is a key option to specify how to parse the values before sorting them;
you may or may not want the option ignore_index=True, which I included below.
We can use the key option to parse the values into datetime objects with pandas.to_datetime.
import pandas as pd
df = pd.DataFrame({'Date': ['Apr 2022', 'Dec 2021', 'May 2022', 'May 2021'], 'Price': [2, 8, 12, 15]})
df = df.sort_values(by='Date', ignore_index=True, key=pd.to_datetime)
print(df)
# Date Price
# 0 May 2021 15
# 1 Dec 2021 8
# 2 Apr 2022 2
# 3 May 2022 12
Relevant documentation:
DataFrame.sort_values;
to_datetime.
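If the to_datetime guess ever mis-parses the strings, an explicit format can be passed inside the key; a minimal sketch, assuming the abbreviated 'Apr 2022' style from the sample (use '%B %Y' for full month names):
df = df.sort_values(by='Date', ignore_index=True,
                    key=lambda s: pd.to_datetime(s, format='%b %Y'))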

How to find and list the months between 2 given months using Python

I am trying to find and list the months between two given months.
Input data
Month1 Month2 Month_list
Mar2020 Dec2020
Nov2020 Jan2021
Sep2020 Feb2021
Jun2020 Dec2020
Oct2020 Mar2021
Expected output
Month1 Month2 Month_list
Mar2020 Sep2020 Mar2020,Apr2020,May2020,Jun2020,Jul2020,Aug2020,Sep2020
Nov2020 Jan2021 Nov2020,Dec2020,Jan2021
Sep2020 Feb2021 Sep2020,Oct2020,Nov2020,Dec2020,Jan2021,Feb2021
Oct2020 Dec2020 Oct2020,Nov2020,Dec2020
Dec2020 Mar2021 Dec2020,Jan2021,Feb2021,Mar2021
Code snippet I have been using:
from datetime import datetime, timedelta
from collections import OrderedDict
dates = [df.Month1, df.Month2]
start, end = [datetime.strptime(_, "%Y-%m-%d") for _ in dates]
How can I find this list of months and years?
You can use apply -
def get_date_list(x):
    return ",".join(
        item.strftime("%b %Y")
        for item in pd.date_range(x['Month1'], x['Month2'], freq="MS")
    )

df['Month_list'] = df.apply(lambda x: get_date_list(x), axis=1)
Use date_range.
I am assuming you have already parsed your dates since you haven't mentioned it in your question. Here is a partial solution to your problem:
>>> start = pd.to_datetime("Mar 2021")
>>> end = pd.to_datetime("Sep 2021")
>>> [d.strftime("%b %Y") for d in pd.date_range(start, end, freq="MS")]
['Mar 2021',
'Apr 2021',
'May 2021',
'Jun 2021',
'Jul 2021',
'Aug 2021',
'Sep 2021']
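A complete sketch for the frame in the question, assuming the Month1/Month2 values are the plain MonYYYY strings shown there (format string and column names taken from the question, nothing else assumed):
import pandas as pd

df = pd.DataFrame({'Month1': ['Mar2020', 'Nov2020', 'Sep2020', 'Jun2020', 'Oct2020'],
                   'Month2': ['Dec2020', 'Jan2021', 'Feb2021', 'Dec2020', 'Mar2021']})

def month_list(row):
    # parse the MonYYYY strings explicitly so the day defaults to the 1st
    start = pd.to_datetime(row['Month1'], format='%b%Y')
    end = pd.to_datetime(row['Month2'], format='%b%Y')
    # freq='MS' yields the first day of every month from start to end
    return ','.join(d.strftime('%b%Y') for d in pd.date_range(start, end, freq='MS'))

df['Month_list'] = df.apply(month_list, axis=1)
print(df)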

Period list for grouped datetime

I am very new to pandas and I want to do the following, but I am running into trouble with groupby. Please help.
I have a dataframe with many columns one of which is date.
I need a list of distinct month year from it.
df = pd.DataFrame(['02 Jan 2018', '02 Feb 2018', '02 Feb 2018', '02 Mar 2018'], columns=['date'])
datelist = pd.to_datetime(df.date)
datelist = datelist.groupby([datelist.dt.month, datelist.dt.year])
When I do datelist.all() I get the following:
date date
1 2018 True
2 2018 True
Name: date, dtype: bool
I need something like ['Jan 2018', 'Feb 2018']
I would really appreciate your help.
Thanks
Use to_datetime, convert to custom strings with strftime, get the unique values and finally convert to a list:
datelist = pd.to_datetime(df.date).dt.strftime('%b %Y').unique().tolist()
print (datelist)
['Jan 2018', 'Feb 2018', 'Mar 2018']
Another solution, if the input datetime format is 02 Jan 2018: split on the first whitespace, select the second value and get the unique values:
datelist = df['date'].str.split(n=1).str[1].unique().tolist()
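For the 02 Jan 2018 sample this also gives ['Jan 2018', 'Feb 2018', 'Mar 2018'], but note it keeps first-appearance order and does not validate that the strings are actually dates.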
You can use to_period (for a Series this would be dt.to_period):
In [11]: datelist.to_period("M")
Out[11]:
PeriodIndex(['2019-01', '2019-01', '2019-01', '2019-01', '2019-01', '2019-01',
...
'2019-02', '2019-02', '2019-02', '2019-02', '2019-02'],
dtype='period[M]', freq='M')
In [12]: datelist.to_period("M").unique()
Out[12]: PeriodIndex(['2019-01', '2019-02'], dtype='period[M]', freq='M')
In [13]: [str(m) for m in datelist.to_period("M").unique()]
Out[13]: ['2019-01', '2019-02']
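Applied to the frame from the question and formatted back to the Mon YYYY style that was asked for (a minimal sketch; Period.strftime does the formatting):
periods = pd.to_datetime(df['date']).dt.to_period('M').unique()
datelist = [p.strftime('%b %Y') for p in periods]
print(datelist)
# ['Jan 2018', 'Feb 2018', 'Mar 2018']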

Python Keep row if YYYY present else remove the row

I have a dataframe with a Date column. I want to remove the rows whose Date value does not contain a YYYY year (e.g. 2018; it can be any year).
I used the apply method with a regex expression, but it doesn't work:
df[df.Date.apply(lambda x: re.findall(r'[0-9]{4}', x))]
The Date column can have values such as,
12/3/2018
March 12, 2018
stackoverflow
Mar 12, 2018
no date text
3/12/2018
So here output should be
12/3/2018
March 12, 2018
Mar 12, 2018
3/12/2018
This is one approach, using pd.to_datetime with errors="coerce".
Ex:
import pandas as pd
df = pd.DataFrame({"Col1": ['12/3/2018', 'March 12, 2018', 'stackoverflow', 'Mar 12, 2018', 'no date text', '3/12/2018']})
df["Col1"] = pd.to_datetime(df["Col1"], errors="coerce")
df = df[df["Col1"].notnull()]
print(df)
Output:
Col1
0 2018-12-03
1 2018-03-12
3 2018-03-12
5 2018-03-12
Or, if you want to keep the original strings:
import pandas as pd
def validateDate(d):
    try:
        pd.to_datetime(d)
        return d
    except Exception:
        return None
df = pd.DataFrame({"Col1": ['12/3/2018', 'March 12, 2018', 'stackoverflow', 'Mar 12, 2018', 'no date text', '3/12/2018']})
df["Col1"] = df["Col1"].apply(validateDate)
df.dropna(inplace=True)
print(df)
Output:
Col1
0 12/3/2018
1 March 12, 2018
3 Mar 12, 2018
5 3/12/2018
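If the goal is literally the title's rule (keep any row whose text contains a 4-digit year, whether or not the whole string parses as a date), str.contains with a small regex stays closer to the original attempt; a sketch, with the column name taken from the question:
import pandas as pd

df = pd.DataFrame({"Date": ['12/3/2018', 'March 12, 2018', 'stackoverflow',
                            'Mar 12, 2018', 'no date text', '3/12/2018']})

# keep rows whose Date string contains a standalone 4-digit number (the YYYY part)
mask = df["Date"].str.contains(r"\b\d{4}\b", na=False)
print(df[mask])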

Pandas: How to read an excel file defining several columns to be multi indexes?

I have a data frame in which every row represents an office location with several attributes like Global Region and Primary Function, followed by several columns of numerical energy-consumption data. The names of all the columns are as below:
['Global Region',
'Primary Function',
'Subsidiaries',
'T&D Loss Rate Category',
'Type',
'Ref',
'Acquisition date',
'Disposal date',
'Corporate Admin Approver',
'Data Providers',
'Initiative administrator',
'Initiative approver',
'Initiative user',
'Invoice owner',
'Apr to Jun 2012',
'Jul to Sep 2012',
'Oct to Dec 2012',
'Jan to Mar 2013',
'Apr to Jun 2013',
'Jul to Sep 2013',
'Oct to Dec 2013',
'Jan to Mar 2014',
'Apr to Jun 2014',
'Jul to Sep 2014',
'Oct to Dec 2014',
'Jan to Mar 2015',
'Apr to Jun 2015',
'Jul to Sep 2015',
'Oct to Dec 2015',
'Jan to Mar 2016']
How can I sort the locations and view the data based on different attributes, e.g. primary function or global region, so that I can see the average energy consumption or rank the energy intensity of all locations whose primary function is R&D?
I thought of a multi-index, but I didn't know how to do it. I tried this:
test = xls.parse('Sheet1',index_col=['Lenovo Global Region','Primary Function', 'Subsidiaries', 'Type','Acquisition date','Disposal date','Country'])
It didn't work; the error said I could only use numbers, not strings, so I tried this:
test = xls.parse('Sheet1',index_col=0,1,3,4,5,7,9,10)
Still didn't work. Does anyone have good suggestions?
You can use read_excel with the index_col parameter, which takes a list of the positions of the required columns:
Sample:
df = pd.read_excel('test.xlsx', sheetname='Sheet1', index_col=[0,1,3])
print (df)
                                                       Subsidiaries Type  Ref
Global Region Primary Function T&D Loss Rate Category
1             1                c                                  a    s   10
2             2                c                                  b    d   20
3             3                d                                  c    d   30
See also the pandas documentation on reading a MultiIndex.
So if you add [], it works:
test = xls.parse('Sheet1',index_col=[0,1,3,4,5,7,9,10])
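Once the MultiIndex is in place, slicing and aggregating by an attribute goes through the index levels. A rough sketch, assuming a hypothetical file offices.xlsx whose first two columns are Global Region and Primary Function, and using one of the quarterly columns listed above:
import pandas as pd

df = pd.read_excel('offices.xlsx', sheet_name='Sheet1', index_col=[0, 1])

# all locations whose primary function is R&D
rd = df.xs('R&D', level='Primary Function')

# average of one quarterly column per primary function
print(df.groupby(level='Primary Function')['Apr to Jun 2012'].mean())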
