How to reorder multi-index columns in Pandas? - python

import numpy as np
import pandas as pd
df = pd.DataFrame({'Country': ["AU","GB","KR","US","GB","US","KR","AU","US"], 'Region Manager': ['TL','JS','HN','AL','JS','AL','HN','TL','AL'], 'Curr_Sales': [453,562,236,636,893,542,125,561,371], 'Curr_Revenue': [4530,7668,5975,3568,2349,6776,3046,1111,4852], 'Prior_Sales': [235,789,132,220,569,521,131,777,898], 'Prior_Revenue': [1530,2668,3975,5668,6349,7776,8046,2111,9852]})
pd.pivot_table(df, values=['Curr_Sales', 'Curr_Revenue', 'Prior_Sales', 'Prior_Revenue'], index=['Country', 'Region Manager'], aggfunc=np.sum, margins=True)
Hi folks,
I have the following dataframe and I'd like to reorder the multi-index columns as
['Prior_Sales','Prior_Revenue','Curr_Sales', 'Curr_Revenue']
How can I do that in pandas?
The code is shown above
Thanks in advance for all the help!

Slice the resulting dataframe with the columns in the order you want:
pd.pivot_table(
df,
values=['Curr_Sales', 'Curr_Revenue', 'Prior_Sales', 'Prior_Revenue'],
index=['Country', 'Region Manager'],
aggfunc='sum',
margins=True
)[['Prior_Sales', 'Prior_Revenue', 'Curr_Sales', 'Curr_Revenue']]
Prior_Sales Prior_Revenue Curr_Sales Curr_Revenue
Country Region Manager
AU TL 1012 3641 1014 5641
GB JS 1358 9017 1455 10017
KR HN 263 12021 361 9021
US AL 1639 23296 1549 15196
All 4272 47975 4379 39875

# Here df is assumed to be the pivot table result, not the original frame
cols = ['Prior_Sales', 'Prior_Revenue', 'Curr_Sales', 'Curr_Revenue']
df = df[cols]
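As a self-contained sketch of the reordering described above: pivot_table usually returns the value columns in sorted order, so indexing the result with a list puts them back in the order you want. The names `order`, `pivot` and `reordered` are illustrative and not from the original posts.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ["AU", "GB", "KR", "US", "GB", "US", "KR", "AU", "US"],
    'Region Manager': ['TL', 'JS', 'HN', 'AL', 'JS', 'AL', 'HN', 'TL', 'AL'],
    'Curr_Sales': [453, 562, 236, 636, 893, 542, 125, 561, 371],
    'Curr_Revenue': [4530, 7668, 5975, 3568, 2349, 6776, 3046, 1111, 4852],
    'Prior_Sales': [235, 789, 132, 220, 569, 521, 131, 777, 898],
    'Prior_Revenue': [1530, 2668, 3975, 5668, 6349, 7776, 8046, 2111, 9852],
})

# Desired column order (hypothetical helper name)
order = ['Prior_Sales', 'Prior_Revenue', 'Curr_Sales', 'Curr_Revenue']

pivot = pd.pivot_table(
    df,
    values=order,
    index=['Country', 'Region Manager'],
    aggfunc='sum',
    margins=True,
)

# Selecting with a list of labels returns the columns in that order
reordered = pivot[order]
print(reordered)
```

`pivot.reindex(columns=order)` should give the same result and may read more clearly when the order is computed elsewhere.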

Related

Calculate and add up Data from a reference dataframe

I have two pandas dataframes. The first one contains some data I want to multiply with the second dataframe, which is a reference table.
So in my example I want to get a new column in df1 for every column in my reference table - but also add up every row in that column.
Like this (Index 205368421 with R21 17): (1205 * 0.526499) + (7562* 0.003115) + (1332* 0.000267) = 658
In Excel VBA I iterated through both tables and did it that way - but it took very long. I've read that pandas is way better for this without iterating.
df1 = pd.DataFrame({'Index': ['205368421', '206321177','202574796','200212811', '204376114'],
'L1.09A': [1205,1253,1852,1452,1653],
'L1.10A': [7562,7400,5700,4586,4393],
'L1.10C': [1332, 0, 700,1180,290]})
df2 = pd.DataFrame({'WorkerID': ['L1.09A', 'L1.10A', 'L1.10C'],
'R21 17': [0.526499,0.003115,0.000267],
'R21 26': [0.458956,0,0.001819]})
Index L1.09A L1.10A L1.10C
205368421 1205 7562 1332
206321177 1253 7400 0
202574796 1852 5700 700
200212811 1452 4586 1180
204376114 1653 4393 290
WorkerID R21 17 R21 26
L1.09A 0.526499 0.458956
L1.10A 0.003115 0
L1.10C 0.000267 0.001819
I want this:
Index L1.09A L1.10A L1.10C R21 17 R21 26
205368421 1205 7562 1332 658 555
206321177 1253 7400 0 683 575
202574796 1852 5700 700 993 851
200212811 1452 4586 1180 779 669
204376114 1653 4393 290 884 759
I would be okay with some hints. Someone told me this might be matrix multiplication, so .dot() would be helpful. Is this the right direction?
Edit:
I have now done the following:
df1 = df1.set_index('Index')
df2 = df2.set_index('WorkerID')
common_cols = list(set(df1.columns).intersection(df2.index))
df2 = df2.loc[common_cols]
df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.sort_index(axis=0)
df_multiplied = df1_sorted @ df2_sorted
This works with my example dataframes, but not with my real dataframes.
My real ones have these dimensions: df1_sorted(10429, 69) and df2_sorted(69, 18).
It should work, but my df_multiplied is full of NaN.
Alright, I did it!
I had to replace all nan with 0.
So the final solution is:
df1 = df1.set_index('Index')
df2 = df2.set_index('WorkerID')
common_cols = list(set(df1.columns).intersection(df2.index))
df2 = df2.loc[common_cols]
df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.sort_index(axis=0)
df1_sorted= df1_sorted.fillna(0)
df2_sorted= df2_sorted.fillna(0)
df_multiplied = df1_sorted @ df2_sorted
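A compact sketch of the same idea with the example data, using DataFrame.dot after aligning the reference table to the columns of df1. The names `weights`, `totals` and `result` are illustrative, and rounding to whole numbers is an assumption made to match the desired output shown in the question.

```python
import pandas as pd

df1 = pd.DataFrame({'Index': ['205368421', '206321177', '202574796', '200212811', '204376114'],
                    'L1.09A': [1205, 1253, 1852, 1452, 1653],
                    'L1.10A': [7562, 7400, 5700, 4586, 4393],
                    'L1.10C': [1332, 0, 700, 1180, 290]}).set_index('Index')
df2 = pd.DataFrame({'WorkerID': ['L1.09A', 'L1.10A', 'L1.10C'],
                    'R21 17': [0.526499, 0.003115, 0.000267],
                    'R21 26': [0.458956, 0, 0.001819]}).set_index('WorkerID')

# Align the reference rows to df1's columns; missing workers get weight 0
weights = df2.reindex(df1.columns).fillna(0)

# Matrix product: (rows x workers) dot (workers x rates) -> (rows x rates)
totals = df1.fillna(0).dot(weights)

# Attach the rounded totals as new columns next to the original data
result = df1.join(totals.round().astype(int))
print(result)
```

Filling NaN with 0 before the product matters because a single missing value in a row would otherwise make every total in that row NaN.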

Name with title a pandas dataframe

I am trying to add a name to my pandas df but I am failing. I want the two columns to be named "Job department" and "Amount"
df["sales"].value_counts()
Output:
sales 4140
technical 2720
support 2229
IT 1227
product_mng 902
marketing 858
RandD 787
accounting 767
hr 739
management 630
Name: sales, dtype: int64
Then I do:
job_frequency = pd.DataFrame(df["sales"].value_counts(), columns=['Job department','Amount'])
print(job_frequency)
but I get:
Empty DataFrame
Columns: [Job department, Amount]
Index: []
Use Series.rename_axis to name the index, then
Series.reset_index to convert the Series to a DataFrame:
job_frequency = (df["sales"].value_counts()
.rename_axis('Job department')
.reset_index(name='Amount'))
print(job_frequency)
Job department Amount
0 sales 4140
1 technical 2720
2 support 2229
3 IT 1227
4 product_mng 902
5 marketing 858
6 RandD 787
7 accounting 767
8 hr 739
9 management 630
job_frequency = pd.DataFrame(
data={
'Job department': df["sales"].value_counts().index,
'Amount': df["sales"].value_counts().values
}
)
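For a quick check of the rename_axis + reset_index approach on a small frame, here is a minimal sketch. The sample data is made up for illustration; only the column name `sales` comes from the question.

```python
import pandas as pd

# Tiny stand-in for the real data (hypothetical values)
df = pd.DataFrame({'sales': ['sales', 'sales', 'technical',
                             'sales', 'technical', 'support']})

job_frequency = (
    df['sales'].value_counts()
    .rename_axis('Job department')   # name the index holding the categories
    .reset_index(name='Amount')      # turn the index into a column, name the counts
)
print(job_frequency)
```

The original attempt most likely came back empty because the `columns=` argument of pd.DataFrame selects existing columns rather than renaming them, and the Series carried neither of the requested names.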

Merging two rows in pandas into one

I have a data frame like this
no, frc, val
1121,1,"John"
1121,0,236
3612,1,"Mary"
3612,0,545
I want to combine data like this
"John",236
"Mary",545
You can self-join two subsets of this DataFrame using the merge() method:
In [21]: (df[df['frc']==1]
.drop(columns='frc')
.rename(columns={'val':'name'})
.merge(df[df['frc']==0].drop(columns='frc')))
Out[21]:
no name val
0 1121 John 236
1 3612 Mary 545
df.set_index(['no', 'frc']).val.unstack().rename(columns={0:'val', 1:'name'})
frc val name
no
1121 236 John
3612 545 Mary
Or, to produce the OP's output:
print(
df.set_index(['no', 'frc']).val
.unstack()[[1, 0]]
.to_csv(index=False, header=False)
)
John,236
Mary,545
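The self-join approach can be sketched end to end as follows. The frame is rebuilt from the sample rows in the question, and the intermediate names `names`, `values` and `merged` are illustrative. The keyword form `drop(columns=...)` is used because recent pandas versions no longer accept the axis as a positional argument.

```python
import pandas as pd

df = pd.DataFrame({
    'no':  [1121, 1121, 3612, 3612],
    'frc': [1, 0, 1, 0],
    'val': ['John', 236, 'Mary', 545],
})

# Rows with frc == 1 carry the name, rows with frc == 0 carry the value
names = (df[df['frc'] == 1]
         .drop(columns='frc')
         .rename(columns={'val': 'name'}))
values = df[df['frc'] == 0].drop(columns='frc')

# Join the two halves on the shared key
merged = names.merge(values, on='no')
print(merged)
```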

Combine MultiIndex columns to a single index in a pandas dataframe

My code merges two databases into one. The problem is that when I add one more column to my databases, the result is not as expected. I am using Python 2.7.
code:
import pandas as pd
import pandas.io.formats.excel
import numpy as np
# Read both files and load them into DataFrames
df1 = pd.read_excel("archivo1.xlsx")
df2 = pd.read_excel("archivo2.xlsx")
df = (pd.concat([df1,df2])
.set_index(["Cliente",'Fecha'])
.stack()
.unstack(1)
.sort_index(ascending=(True, False)))
m = df.index.get_level_values(1) == 'Impresiones'
df.index = np.where(m, 'Impresiones', df.index.get_level_values(0))
# Create the output xlsx
pandas.io.formats.excel.header_style = None
with pd.ExcelWriter("Data.xlsx",
engine='xlsxwriter',
date_format='dd/mm/yyyy',
datetime_format='dd/mm/yyyy') as writer:
df.to_excel(writer, sheet_name='Sheet1')
archivo1:
Fecha Cliente Impresiones Impresiones 2 Revenue
20/12/17 Jose 1312 35 $12
20/12/17 Martin 12 56 $146
20/12/17 Pedro 5443 124 $1,256
20/12/17 Esteban 667 1235 $1
archivo2:
Fecha Cliente Impresiones Impresiones 2 Revenue
21/12/17 Jose 25 5 $2
21/12/17 Martin 6347 523 $123
21/12/17 Pedro 2368 898 $22
21/12/17 Esteban 235 99 $7,890
Expected results:
I tried with
m1 = df.index.get_level_values(1) == 'Impresiones 2'
df.index = np.where(m1, 'Impresiones 2', df.index.get_level_values(0))
but I get this error:
IndexError: Too many levels: Index has only 1 level, not 2
The first bit of the solution is similar to jezrael's answer to your previous question, using concat + set_index + stack + unstack + sort_index.
df = pd.concat([df1, df2])\
.set_index(['Cliente', 'Fecha'])\
.stack()\
.unstack(-2)\
.sort_index(ascending=[True, False])
Now comes the challenging part: we have to fold the names from the 0th level into the 1st level, and then reset the index.
I use np.insert to insert each name above its Revenue entry in the index.
i, j = df.index.get_level_values(0), df.index.get_level_values(1)
k = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())
Now I create a new MultiIndex, which I then use to reindex df:
idx = pd.MultiIndex.from_arrays([i.unique().repeat(len(df.index.levels[1]) + 1), k])
df = df.reindex(idx).fillna('')
Now drop the extra level:
df.index = df.index.droplevel()
df
Fecha 20/12/17 21/12/17
Esteban
Revenue $1 $7,890
Impresiones2 1235 99
Impresiones 667 235
Jose
Revenue $12 $2
Impresiones2 35 5
Impresiones 1312 25
Martin
Revenue $146 $123
Impresiones2 56 523
Impresiones 12 6347
Pedro
Revenue $1,256 $22
Impresiones2 124 898
Impresiones 5443 2368
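For comparison, here is a sketch of a different way to reach the same layout without np.insert: build one blank header row per client and concatenate it with that client's rows. This is not the method from the answer above, only an alternative under the assumption that the two Excel files look like the samples in the question. The inline frames stand in for archivo1.xlsx and archivo2.xlsx, and `wide`, `blocks` and `flat` are made-up names.

```python
import pandas as pd

# Inline stand-ins for the two Excel files (two clients only, for brevity)
df1 = pd.DataFrame({'Fecha': ['20/12/17', '20/12/17'],
                    'Cliente': ['Jose', 'Martin'],
                    'Impresiones': [1312, 12],
                    'Impresiones 2': [35, 56],
                    'Revenue': ['$12', '$146']})
df2 = pd.DataFrame({'Fecha': ['21/12/17', '21/12/17'],
                    'Cliente': ['Jose', 'Martin'],
                    'Impresiones': [25, 6347],
                    'Impresiones 2': [5, 523],
                    'Revenue': ['$2', '$123']})

# Same reshaping as in the answer: one row per (client, metric), one column per date
wide = (pd.concat([df1, df2])
        .set_index(['Cliente', 'Fecha'])
        .stack()
        .unstack('Fecha')
        .sort_index(ascending=[True, False]))

# For each client: a blank row labelled with the name, then the metric rows
blocks = []
for client, block in wide.groupby(level=0, sort=False):
    blocks.append(pd.DataFrame('', index=[client], columns=wide.columns))
    blocks.append(block.droplevel(0))

flat = pd.concat(blocks)
print(flat)
```

Because the metric labels repeat for every client, the resulting index is not unique, which is fine for writing to Excel but makes label-based lookups ambiguous.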

Stacking a pair of columns by looking at the first column

I struggle with making the move from Excel to Python since I'm so used to having everything be visible. Below, I'm trying to convert the table up top to the table below. Wanted to use pandas dataframes but if there's a different solution that's better then I'd love to hear it.
Also, as an added bonus, if someone can point me to some resources that are empathetic to visual Excel converts to Python, that would be awesome!
*Note, there are actually ~350 rows of this and we could go as far as ID12 and Code 12. Also, a state could repeat in my raw data source just like VA is doing here.
State ID Code ID2 Code2 ID3 Code3
VA RIC 733 FFX 787 NULL NULL
NC WIL 798 GSB 698 WSS 444
VA NPN 757 NULL NULL NULL NULL
Required Output:
State ID Code
VA RIC 733
VA FFX 787
VA NPN 757
NC WIL 798
NC GSB 698
NC WSS 444
I think lreshape would be ideal for this situation.
pd.lreshape(df, {'Code': ['Code', 'Code2', 'Code3'], 'ID': ['ID', 'ID2', 'ID3']}) \
.sort_values('State', ascending=False)
State Code ID
0 VA 733.0 RIC
2 VA 757.0 NPN
3 VA 787.0 FFX
1 NC 798.0 WIL
4 NC 698.0 GSB
5 NC 444.0 WSS
A more generic solution apart from @MaxU's would be:
code_list = [col for col in list(df) if col.startswith('Code')]
id_list = [col for col in list(df) if col.startswith('ID')]
pd.lreshape(df, {'Code': code_list, 'ID': id_list}).sort_values('State', ascending=False)
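A runnable sketch of the generic lreshape approach on the sample table, assuming the NULL cells are read in as missing values (None/NaN). The frame is rebuilt by hand here and `long_df` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'State': ['VA', 'NC', 'VA'],
    'ID':    ['RIC', 'WIL', 'NPN'],
    'Code':  [733, 798, 757],
    'ID2':   ['FFX', 'GSB', None],
    'Code2': [787, 698, np.nan],
    'ID3':   [None, 'WSS', None],
    'Code3': [np.nan, 444, np.nan],
})

# Collect the paired columns by prefix so it scales to ID12 / Code12
code_list = [col for col in df.columns if col.startswith('Code')]
id_list = [col for col in df.columns if col.startswith('ID')]

# lreshape stacks each group of columns into one and drops rows with missing values
long_df = (pd.lreshape(df, {'Code': code_list, 'ID': id_list})
           .sort_values('State', ascending=False)
           [['State', 'ID', 'Code']])
print(long_df)
```

The Code column comes back as floats because the wide columns contain NaN. If integers are wanted, `long_df['Code'].astype(int)` after the reshape is one option.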
