Reset a column's MultiIndex levels - python

Is there a shorter way of dropping a column MultiIndex level (in my case, basic_amt) other than transposing the frame twice?
In [704]: test
Out[704]:
basic_amt
Faculty NSW QLD VIC All
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
In [705]: test.reset_index(level=0, drop=True)
Out[705]:
basic_amt
Faculty NSW QLD VIC All
0 1 1 2 4
1 0 1 0 1
2 1 0 2 3
In [711]: test.transpose().reset_index(level=0, drop=True).transpose()
Out[711]:
Faculty NSW QLD VIC All
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
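If the goal is simply to lose the top column level, a shorter route (a sketch, assuming the columns form a two-level MultiIndex as shown) is to drop the level on the columns index directly:
test.columns = test.columns.droplevel(0)
This keeps the remaining level, including its Faculty name, exactly as in Out[711].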

Another solution is to use MultiIndex.droplevel with rename_axis (new in pandas 0.18.0):
import pandas as pd
cols = pd.MultiIndex.from_arrays([['basic_amt']*4,
                                  ['NSW','QLD','VIC','All']],
                                 names=[None, 'Faculty'])
idx = pd.Index(['All', 'Full Time', 'Part Time'])
df = pd.DataFrame([(1,1,2,4),
                   (0,1,0,1),
                   (1,0,2,3)], index=idx, columns=cols)
print (df)
basic_amt
Faculty NSW QLD VIC All
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
df.columns = df.columns.droplevel(0)
# pandas 0.18.0 and higher
df = df.rename_axis(None, axis=1)
# pandas below 0.18.0
# df.columns.name = None
print (df)
NSW QLD VIC All
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
print (df.columns)
Index(['NSW', 'QLD', 'VIC', 'All'], dtype='object')
If you need both column names, use a list comprehension:
df.columns = ['_'.join(col) for col in df.columns]
print (df)
basic_amt_NSW basic_amt_QLD basic_amt_VIC basic_amt_All
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
print (df.columns)
Index(['basic_amt_NSW', 'basic_amt_QLD', 'basic_amt_VIC', 'basic_amt_All'], dtype='object')
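Since pandas 0.24.0, the same can be written without assigning to df.columns at all, using DataFrame.droplevel; a minimal sketch:
df = df.droplevel(0, axis=1)
df = df.rename_axis(None, axis=1)   # still needed if the Faculty name should go too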

Zip levels together
Here is an alternative that zips the levels together and joins them with an underscore.
It is derived from the answer above; it was what I wanted to do when I found this question, so I am sharing it even though it does not answer the exact question asked.
["_".join(pair) for pair in df.columns]
gives
['basic_amt_NSW', 'basic_amt_QLD', 'basic_amt_VIC', 'basic_amt_All']
Just set this as the columns:
df.columns = ["_".join(pair) for pair in df.columns]
basic_amt_NSW basic_amt_QLD basic_amt_VIC basic_amt_All
Faculty
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
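If a level contains non-string labels (integers, for example), "_".join raises a TypeError; a hedged variant that casts each part to str first:
df.columns = df.columns.map(lambda pair: "_".join(map(str, pair)))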

How about simply reassigning df.columns:
levels = df.columns.levels
labels = df.columns.labels   # MultiIndex.labels was renamed to .codes in pandas 0.24+
df.columns = levels[1][labels[1]]
For example:
import pandas as pd
columns = pd.MultiIndex.from_arrays([['basic_amt']*4,
                                     ['NSW','QLD','VIC','All']])
index = pd.Index(['All', 'Full Time', 'Part Time'], name='Faculty')
df = pd.DataFrame([(1,1,2,4),
                   (0,1,0,1),
                   (1,0,2,3)])
df.columns = columns
df.index = index
Before:
print(df)
basic_amt
NSW QLD VIC All
Faculty
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
After:
levels = df.columns.levels
labels = df.columns.labels   # .codes in pandas 0.24+
df.columns = levels[1][labels[1]]
print(df)
NSW QLD VIC All
Faculty
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
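The levels/labels indirection can be avoided entirely with get_level_values, which already returns the repeated labels of a single level:
df.columns = df.columns.get_level_values(1)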

Related

Sort column names using wildcard using pandas

I have a big dataframe with more than 100 columns. I am sharing a miniature version of my real dataframe below
ID rev_Q1 rev_Q5 rev_Q4 rev_Q3 rev_Q2 tx_Q3 tx_Q5 tx_Q2 tx_Q1 tx_Q4
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
I would like to do the below
a) sort the column names based on Quarters (ex:Q1,Q2,Q3,Q4,Q5..Q100..Q1000) for each column pattern
b) By column pattern, I mean the keyword that is before underscore which is rev and tx.
So, I tried the below but it doesn't work and it also shifts the ID column to the back
df = df.reindex(sorted(df.columns), axis=1)
I expect my output to be like as below. In real time, there are more than 100 columns with more than 30 patterns like rev, tx etc. I want my ID column to be in the first position as shown below.
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
For the provided example, df.sort_index(axis=1) should work fine.
If you have Q values higher than 9, use natural sorting with natsort:
from natsort import natsort_key
out = df.sort_index(axis=1, key=natsort_key)
Or using manual sorting with np.lexsort:
import numpy as np

idx = df.columns.str.split('_Q', expand=True, n=1)
order = np.lexsort([idx.get_level_values(1).astype(float), idx.get_level_values(0)])
out = df.iloc[:, order]
Something like:
new_order = list(df.columns)
new_order.remove("ID")          # list.remove mutates in place and returns None
new_order = ['ID'] + sorted(new_order)
df = df[new_order]
We manually put "ID" in front and then sort the remaining columns; see the numeric-key sketch below.
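If quarters above Q9 can occur, plain sorted() is lexicographic (Q10 lands before Q2); here is a sketch of a numeric sort key, assuming every non-ID column follows the prefix_Q<number> pattern:
def quarter_key(col):
    # 'rev_Q12' -> ('rev', 12), so Q10 sorts after Q2
    prefix, q = col.split('_Q')
    return (prefix, int(q))

rest = [c for c in df.columns if c != 'ID']
df = df[['ID'] + sorted(rest, key=quarter_key)]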
The idea is to create a dataframe from the column names, with two columns: one for the Variable and another for the Quarter number. Finally, sort this dataframe by values, then extract the index.
idx = (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
         .fillna(0).astype({'Q': int})
         .sort_values(by=['V', 'Q']).index)
df = df.iloc[:, idx]
Output:
>>> df
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
0 1 1 1 1 1 1 1 1 1 1 1
1 2 1 1 1 1 1 1 1 1 1 1
>>> (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
...    .fillna(0).astype({'Q': int})
...    .sort_values(by=['V', 'Q']))
V Q
0 0 0
1 rev 1
5 rev 2
4 rev 3
3 rev 4
2 rev 5
9 tx 1
8 tx 2
6 tx 3
10 tx 4
7 tx 5

How to merge two dataframes, updating the older one with the new one?

I am sorry for being a noob, but I couldn't find a solution to my problem even after hours of searching.
import pandas as pd
df1 = pd.read_excel('df1.xlsx')
df1.set_index('time')
print(df1)
df2 = pd.read_excel('df2.xlsx')
df2.set_index('time')
print(df2)
new_df = pd.merge(df1, df2,how='outer')
print(new_df)
df1
time bought
0 1 0
1 2 0
2 3 0
3 4 0
4 5 1
df2
time bought
0 3 0
1 4 0
2 5 0
3 6 0
4 7 0
new_df
time bought
0 1 0
1 2 0
2 3 0
3 4 0
4 5 1
5 5 0
6 6 0
7 7 0
What I want is
updating df1 (existing data) with df2 (new data feed); for the bought value, df1's data should come first
the new_df should have all unique time values from df1 and df2, without duplicates
I tried every method I found, but each one either missed my desired outcome or created unnecessary duplicates as above (two rows with a time value of 5).
The merge method created _x/_y suffixes or duplicates.
join() didn't work either.
What I desire should look like:
new_df
time bought
0 1 0
1 2 0
2 3 0
3 4 0
4 5 1
5 6 0
6 7 0
Thank you in advance
If you perform the join as you have done, all that is left is to remove the duplicate rows, keeping only the more recent data.
drop_duplicates() takes the kwarg subset, which accepts a list of columns to check, and keep, which sets which row to keep when there are duplicates.
In this case we only need to check for duplicates in the time column, and we keep the first row.
import pandas as pd

df1 = pd.read_excel('df1.xlsx')
print(df1)
df2 = pd.read_excel('df2.xlsx')
print(df2)
# note: the original set_index('time') calls were no-ops (the result was never
# assigned back), and merge/drop_duplicates need time as a column anyway
new_df = pd.merge(df1, df2, how='outer')
new_df = new_df.drop_duplicates(subset=['time'], keep='first')
print(new_df)
Output:
time bought
0 1 0
1 2 0
2 3 0
3 4 0
4 5 1
5 6 0
6 7 0
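For completeness, a merge-free sketch that gives df1 priority the same way: concatenate with df1 first, then drop duplicate times (keep='first' retains the df1 row):
import pandas as pd

new_df = (pd.concat([df1, df2], ignore_index=True)
            .drop_duplicates(subset=['time'], keep='first')
            .sort_values('time')
            .reset_index(drop=True))
print(new_df)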

Get count of particular values and the total based on another column value in dataframe using Pandas

I have the following dataframe df:
names     status
John      Completed
James     To Do
Jill      To Do
Robert    In Progress
Jill      To Do
Jill      To Do
Marina    Completed
Evy       Completed
Evy       Completed
Now I want the count of each type of status for each user. I can get it like this for all types of statuses.
df = pd.crosstab(df.names,df.status).reset_index("names")
So now the resulting df is
status   names  Completed  In Progress  To Do
0        James          0            0      1
1       Robert          0            1      0
2         John          1            0      0
3       Marina          1            0      0
4         Jill          0            0      3
5          Evy          2            0      0
So my problem is: how can I specify that only particular status values be counted? For example, I want only the values of In Progress and Completed, not To Do. And how can I add an extra column, called Total Statuses, holding the total number of rows for each name in the original dataframe?
Desired Dataframe:
status   names  Completed  In Progress  Total
0        James          0            0      1
1       Robert          0            1      1
2         John          1            0      1
3       Marina          1            0      1
4         Jill          0            0      3
5          Evy          2            0      2
Another way:
pass margins and margins_name parameters in pd.crosstab():
df = (pd.crosstab(df.names, df.status, margins=True, margins_name='Total').iloc[:-1]
        .reset_index().drop(columns='To Do'))
OR
via crosstab()+assign()
df = (pd.crosstab(df.names, df.status).assign(Total=lambda x: x.sum(axis=1))
        .reset_index().drop(columns='To Do'))
OR
In 2 steps:
df = pd.crosstab(df.names, df.status)
df = df.assign(Total=df.sum(axis=1)).drop(columns='To Do').reset_index()
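A runnable end-to-end sketch of the margins variant, with the sample data reconstructed from the question; note that the Total produced by margins counts every status, including the dropped To Do, which is exactly what the desired output shows:
import pandas as pd

df = pd.DataFrame({'names': ['John', 'James', 'Jill', 'Robert', 'Jill', 'Jill', 'Marina', 'Evy', 'Evy'],
                   'status': ['Completed', 'To Do', 'To Do', 'In Progress', 'To Do', 'To Do', 'Completed', 'Completed', 'Completed']})
out = (pd.crosstab(df.names, df.status, margins=True, margins_name='Total')
         .iloc[:-1]                     # drop the margins row, keep the Total column
         .reset_index()
         .drop(columns='To Do'))
print(out)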
You can create the total from the addition of the three previous columns:
df['Total'] = (df['Completed'] + df['In Progress'] + df['To Do'])
Then you can drop the To Do column from your new dataframe as follows:
df = df.drop(columns=['To Do'])
df = pd.DataFrame({'names': ['John', 'James', 'Jill', 'Robert', 'Jill', 'Jill', 'Marina', 'Evy', 'Evy'],
                   'status': ['Completed', 'To Do', 'To Do', 'In Progress', 'To Do', 'To Do', 'Completed', 'Completed', 'Completed']})
df = pd.crosstab(df.names,df.status).reset_index("names")
df['Total'] = df['Completed'] + df['In Progress'] + df['To Do']
df = df.drop(columns=['To Do'])
print(df)
Output:
status names Completed In Progress Total
0 Evy 2 0 2
1 James 0 0 1
2 Jill 0 0 3
3 John 1 0 1
4 Marina 1 0 1
5 Robert 0 1 1
I can't tell what sorting system you are using, but I think you will manage that part yourself.
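If "sorting system" refers to keeping rows in their original appearance order rather than alphabetical, one hedged option is to reindex by first appearance in the raw data (raw below is hypothetical, standing for the dataframe before the crosstab):
order = raw['names'].drop_duplicates().tolist()   # hypothetical: raw = data before crosstab
df = df.set_index('names').loc[order].reset_index()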

Count frequency of each word contained in column string values

For example, I have a dataframe like this:
data = {'id': [1,1,1,2,2],
        'value': ['red','red and blue','yellow','oak','oak wood']}
df = pd.DataFrame (data, columns = ['id','value'])
I want :
id value count
1 red 2
1 blue 1
1 yellow 1
2 oak 2
2 wood 1
Many thanks!
Solution for pandas 0.25+: split the strings into lists with Series.str.split, reshape with DataFrame.explode, and count with GroupBy.size:
df1 = (df.assign(value=df['value'].str.split())
         .explode('value')
         .groupby(['id','value'], sort=False)
         .size()
         .reset_index(name='count'))
print (df1)
id value count
0 1 red 2
1 1 and 1
2 1 blue 1
3 1 yellow 1
4 2 oak 2
5 2 wood 1
For lower pandas versions, use DataFrame.set_index with Series.str.split and expand=True to get a DataFrame, reshape with DataFrame.stack, create columns from the MultiIndex Series, and apply the same solution as above:
df1 = (df.set_index('id')['value']
         .str.split(expand=True)
         .stack()
         .reset_index(name='value')
         .groupby(['id','value'], sort=False)
         .size()
         .reset_index(name='count'))
print (df1)
id value count
0 1 red 2
1 1 and 1
2 1 blue 1
3 1 yellow 1
4 2 oak 2
5 2 wood 1
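A non-reshaping alternative, sketched with collections.Counter counting words per id (row order may differ from the groupby solutions):
import pandas as pd
from collections import Counter

df1 = (df.groupby('id')['value']
         .apply(lambda s: pd.Series(Counter(' '.join(s).split())))
         .rename_axis(['id', 'value'])
         .reset_index(name='count'))
print(df1)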

Pandas GroupBy Count of a given Column

I want to group by Date and Hour with the aggregation function count, splitting the result into a separate output column for each distinct ID.
df = pd.DataFrame({'GpID': [1,1,0,1,1,0,1,1,2,2,1,1,2,1,1,0,1,2,0,1,1],
                   'HR': [1,1,1,1,1,1,1, 2,2,2,2,1,1,1, 2,2,2,2,3,3,3],
                   'Date_': [1,1,1,2,2,2,2, 2,2,2,2,3,3,3, 3,3,3,3,3,3,3]})
The output format is like
df_out = pd.DataFrame({'HR': [1,2,3,1,2,3],
                       'Date_': [1,1,1,2,2,2],
                       'GpID_0': [1,2,5,1,4,2],
                       'GpID_1': [1,2,5,1,4,2],
                       'GpID_2': [4,2,5,1,4,2]})
Tried:
# 1st try
df_g = df.groupby(["Hr", "Date_"], observed=False).count().fillna(0).unstack()
# 2nd try
df_g = df.groupby(["Hr", "Date_","GpId"], observed=False).count().fillna(0).unstack(-1)
# 3rd try
df_g = df.groupby(["Hr", "Date_"], observed=False).count().fillna(0).unstack()
Nothing accurate yet. (Note the attempts above also misspell the columns: Hr vs HR, GpId vs GpID, which would raise a KeyError.)
I believe you were trying to do something like this:
In [1]:
import pandas as pd
df = pd.DataFrame({'GpID': [1,1,0,1,1,0,1,1,2,2,1,1,2,1,1,0,1,2,0,1,1],
                   'HR': [1,1,1,1,1,1,1, 2,2,2,2,1,1,1, 2,2,2,2,3,3,3],
                   'Date_': [1,1,1,2,2,2,2, 2,2,2,2,3,3,3, 3,3,3,3,3,3,3]})
df.loc[:,'Count']=1
pd.pivot_table(df, values='Count', index=['Date_', 'HR'], columns=['GpID'], aggfunc='count').fillna(0).reset_index()
Out [1]:
Date_ HR 0 1 2
0 1 1 1 2 0
1 2 1 1 3 0
2 2 2 0 2 2
3 3 1 0 2 1
4 3 2 1 2 1
5 3 3 1 2 0
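The same table can also be produced with groupby + size + unstack; a sketch that additionally applies the GpID_<n> column names from the desired output:
out = (df.groupby(['Date_', 'HR', 'GpID'])
         .size()                         # count rows per (Date_, HR, GpID)
         .unstack('GpID', fill_value=0)  # one column per GpID value
         .add_prefix('GpID_')
         .reset_index())
print(out)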
