pd.MultiIndex: How do I add 1 more level (0) to a multi-index column? - python

This sounds trivial, but I just can't add 1 more level of index to the columns of a multi-level column df.
Current State
Category | Cat1         | Cat2 |
         | Total Assets | AUMs |
Firm 1   | 100          | 300  |
Firm 2   | 200          | 3400 |
Firm 3   | 300          | 800  |
Firm 4   | NaN          | 800  |
Desired State
Importance | H            | H    |
Category   | Cat1         | Cat2 |
           | Total Assets | AUMs |
Firm 1     | 100          | 300  |
Firm 2     | 200          | 3400 |
Firm 3     | 300          | 800  |
Firm 4     | NaN          | 800  |
When I use the code below:
Code 1: Error: isnull is not defined for MultiIndex
df.columns = pd.MultiIndex.from_arrays([['H', 'H'], df.columns])
Code 2: runs, but the original two levels get combined into tuples on a single level:
df.columns = pd.MultiIndex.from_arrays([['H', 'H'], df.columns.values])
Importance | H                    | H            |
Category   | (Cat1, Total Assets) | (Cat2, AUMs) |
Firm 1     | 100                  | 300          |
Firm 2     | 200                  | 3400         |
Firm 3     | 300                  | 800          |
Firm 4     | NaN                  | 800          |

Use concat():
df=pd.concat([df],keys=['H'],names=['Importance'],axis=1)
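A minimal runnable sketch of the concat() approach, reconstructing the example frame from the tables above:

```python
import pandas as pd

# Rebuild the two-level-column frame from the question
df = pd.DataFrame(
    {("Cat1", "Total Assets"): [100, 200, 300, None],
     ("Cat2", "AUMs"): [300, 3400, 800, 800]},
    index=["Firm 1", "Firm 2", "Firm 3", "Firm 4"],
)
df.columns = pd.MultiIndex.from_tuples(df.columns, names=["Category", None])

# Prepend a new top level named 'Importance' with value 'H' for every column
df = pd.concat([df], keys=["H"], names=["Importance"], axis=1)

print(df.columns.nlevels)  # 3
print(df.columns[0])       # ('H', 'Cat1', 'Total Assets')
```

Passing the whole frame as the single element of `objs` with a one-element `keys` list is what makes `concat` add the extra level rather than stack anything.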

Related

Groupby 2 columns and find .min of multiple other columns (python pandas)

My data frame looks like this:
|Months | Places | Sales_X | Sales_Y | Sales_Z |
|-------|--------|---------|---------|---------|
|month1 | Place1 | 10000   | 12000   | 13000   |
|month1 | Place2 | 300     | 200     | 1000    |
|month1 | Place3 | 350     | 1000    | 1200    |
|month2 | Place2 | 1400    | 12300   | 14000   |
|month2 | Place3 | 9000    | 8500    | 150     |
|month2 | Place1 | 90      | 4000    | 3000    |
|month3 | Place2 | 12350   | 8590    | 4000    |
|month3 | Place1 | 4500    | 7020    | 8800    |
|month3 | Place3 | 351     | 6500    | 4567    |
I need to find the highest number from the three sales columns by month and show the name of the place with the highest number.
I have been trying to solve it by using pandas.DataFrame.idxmax and groupby but it does not seem to work.
I created a new df with one number per row, which may help:
|Months | Places | Highest_sales |
|-------|--------|---------------|
|month1 | Place1 | 10000         |
|month1 | Place2 | 200           |
|month1 | Place3 | 350           |
|       |        |               |
|month2 | Place2 | 1400          |
|month2 | Place3 | 150           |
|month2 | Place1 | 90            |
|       |        |               |
|month3 | Place2 | 4000          |
|month3 | Place1 | 4500          |
|month3 | Place3 | 351           |
Now I just need the highest number per month and the name of the place. But when I group by the two columns and take the max of Highest_sales, this is the result:
df.groupby(['Months', 'Places'])['Highest_sales'].max()
when I run this
Months Places Highest Sales
1 Place1 1549.0
Place2 2214.0
Place3 2074.0
...
12 Place1 1500.0
Place2 8090.0
Place3 2074.0
the format I am looking for would be
|Months  |Places                    |Highest Sales |
|--------|--------------------------|--------------|
|Month1  |Place(*of highest sales*) |100000        |
|Month2  |Place(*of highest sales*) |900000        |
|Month3  |Place(*of highest sales*) |3232000       |
|Month4  |Place(*of highest sales*) |1300833       |
|....    |                          |              |
|Month12 |Place(*of highest sales*) |              |
12 rows and 3 columns
Use DataFrame.filter to select the Sales columns, create a Highest column, then aggregate with DataFrameGroupBy.idxmax grouped by Months only, and select rows and columns by list in DataFrame.loc:
# columns with substring 'Sales'
df1 = df.filter(like='Sales')
# or all columns from the third position onwards
# df1 = df.iloc[:, 2:]
# per-row minimum, matching the helper table above
df['Highest'] = df1.min(axis=1)
df = df.loc[df.groupby('Months')['Highest'].idxmax(), ['Months','Places','Highest']]
print (df)
Months Places Highest
0 month1 Place1 10000
3 month2 Place2 1400
7 month3 Place1 4500
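Putting the answer together as a self-contained sketch, with the data from the question's table:

```python
import pandas as pd

df = pd.DataFrame({
    "Months": ["month1", "month1", "month1", "month2", "month2", "month2",
               "month3", "month3", "month3"],
    "Places": ["Place1", "Place2", "Place3", "Place2", "Place3", "Place1",
               "Place2", "Place1", "Place3"],
    "Sales_X": [10000, 300, 350, 1400, 9000, 90, 12350, 4500, 351],
    "Sales_Y": [12000, 200, 1000, 12300, 8500, 4000, 8590, 7020, 6500],
    "Sales_Z": [13000, 1000, 1200, 14000, 150, 3000, 4000, 8800, 4567],
})

# Per-row minimum over the Sales columns (the 'Highest_sales' helper column)
df["Highest"] = df.filter(like="Sales").min(axis=1)

# For each month, keep the row whose helper value is largest
out = df.loc[df.groupby("Months")["Highest"].idxmax(),
             ["Months", "Places", "Highest"]]
print(out)  # rows 0, 3, 7: month1/Place1/10000, month2/Place2/1400, month3/Place1/4500
```

`idxmax()` returns the original row labels of the per-group maxima, which is why plain `.loc` can then pull back the other columns of those rows.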

pandas 0.20: df with columns of multi-level indexes - How do I filter with condition on multiple columns?

I want to find all rows where all three columns are > 0. How do I do so? Thanks! I know that using loc with an IndexSlice can return a column of True/False, but I can't get it to apply a condition across multiple columns, or to return a table of values.
Importance | A            | B    | C       |
Category   | Cat1         | Cat2 | Cat1    |
           | Total Assets | AUMs | Revenue |
Firm 1     | 100          | 300  | 300     |
Firm 2     | 200          | 3400 | 200     |
Firm 3     | 300          | 800  | 400     |
Firm 4     | NaN          | 800  | 350     |
idx=pd.IndexSlice
df.sort_index(ascending=True, inplace=True, axis=1)
df.loc[:,idx[:,'Cat1','Total Assets']]>0
Importance | A            |
Category   | Cat1         |
           | Total Assets |
Firm 1     | T            |
Firm 2     | T            |
Firm 3     | T            |
Firm 4     | F            |
Desired Output:
Importance | A            | B    | C       |
Category   | Cat1         | Cat2 | Cat1    |
           | Total Assets | AUMs | Revenue |
Firm 1     | 100          | 300  | 300     |
Firm 2     | 200          | 3400 | 200     |
Firm 3     | 300          | 800  | 400     |
IIUC:
>>> df[df.iloc[:, 1:].gt(0).all(axis=1)]
Importance A B C
Category Cat1 Cat2 Cat1
TotalAssets AUMs Revenue
0 Firm1 100.0 300 300
1 Firm2 200.0 3400 200
2 Firm3 300.0 800 400
Update
I only want the filtering for col 'TotalAssets'>0 & 'Revenue'>0?
# idx = pd.IndexSlice
>>> df[df.loc[:, idx[:, :, ['TotalAssets', 'Revenue']]].gt(0).all(axis=1)]
Importance A B C
Category Cat1 Cat2 Cat1
TotalAssets AUMs Revenue
0 Firm1 100.0 300 300
1 Firm2 200.0 3400 200
2 Firm3 300.0 800 400
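A self-contained sketch of both filters, reconstructing the three-level columns from the question (level names as shown above; note that NaN > 0 is False, so Firm 4 drops out automatically):

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [("A", "Cat1", "TotalAssets"), ("B", "Cat2", "AUMs"), ("C", "Cat1", "Revenue")],
    names=["Importance", "Category", None],
)
df = pd.DataFrame(
    [[100, 300, 300], [200, 3400, 200], [300, 800, 400], [float("nan"), 800, 350]],
    index=["Firm 1", "Firm 2", "Firm 3", "Firm 4"],
    columns=cols,
)

# Keep rows where every column is > 0
out = df[df.gt(0).all(axis=1)]
print(list(out.index))  # ['Firm 1', 'Firm 2', 'Firm 3']

# Restrict the condition to the 'TotalAssets' and 'Revenue' columns only
idx = pd.IndexSlice
mask = df.loc[:, idx[:, :, ["TotalAssets", "Revenue"]]].gt(0).all(axis=1)
print(list(df[mask].index))  # ['Firm 1', 'Firm 2', 'Firm 3']
```

The trick in both cases is that the boolean table produced by `gt(0)` is collapsed to a single row mask with `.all(axis=1)` before indexing.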

Python Pandas - Split Excel Spreadsheet By Empty Rows

Given the following input file ("ToSplit2.xlsx"):
+-----------------+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Section One | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 1 | 100 | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 2 | 100 | 200 | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 3 | 100 | 200 | 300 | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 4 | 100 | 200 | 300 | 400 | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 5 | 100 | 200 | 300 | 400 | 500 | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 6 | 100 | 200 | 300 | 400 | 500 | 600 | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 7 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 8 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 9 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 10 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| | | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Section Two | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 1 | 100 | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 2 | 100 | 200 | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 3 | 100 | 200 | 300 | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 4 | 100 | 200 | 300 | 400 | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 5 | 100 | 200 | 300 | 400 | 500 | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 6 | 100 | 200 | 300 | 400 | 500 | 600 | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 7 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 8 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 9 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 10 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| | | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Section Three | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 1 | 100 | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 2 | 100 | 200 | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 3 | 100 | 200 | 300 | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 4 | 100 | 200 | 300 | 400 | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 5 | 100 | 200 | 300 | 400 | 500 | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 6 | 100 | 200 | 300 | 400 | 500 | 600 | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 7 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 8 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 9 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 10 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
And the following Python code:
import pandas as pd
import numpy as np
spreadsheetPath = "ToSplit2.xlsx"
xls = pd.ExcelFile(spreadsheetPath)
# Iterate through worksheets in opened Excel file
for sheet in xls.sheet_names:
    # Create a pandas dataframe from the Excel worksheet (with no headers)
    excel_data_df = pd.read_excel(
        spreadsheetPath, sheet_name=sheet, header=None)
    # Return a list of dataframe index values where the entire row is blank
    indexList = excel_data_df[excel_data_df.isnull().all(1)].index.tolist()
    # Prints [11, 23]
    print(indexList)
    # Initiate a dictionary
    dataframeDictionary = {}
    # For every index value in the list
    for index in indexList:
        # Split and add the result to the dictionary of pandas dataframes
        dataframeDictionary = np.array_split(excel_data_df, index)
    # For every pandas dataframe in the dataframe dictionary
    for dataframe in dataframeDictionary:
        # Write the pandas dataframe to Excel with a worksheet name equal to dataframe address 0,0
        dataframe.to_excel("output.xlsx", sheet_name=str(dataframe.iloc[0][0]))
I am trying to split the Excel worksheet into multiple spreadsheets based on the blank rows. E.g.:
Section One: (there would also be Section Two and Section Three worksheets)
+-----------------+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Section One | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 1 | 100 | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 2 | 100 | 200 | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 3 | 100 | 200 | 300 | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 4 | 100 | 200 | 300 | 400 | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 5 | 100 | 200 | 300 | 400 | 500 | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 6 | 100 | 200 | 300 | 400 | 500 | 600 | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 7 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 8 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 9 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 10 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
I believe I am really close, but seem to be slipping up on the data frame splitting.
Adjust the file names to match your own.
import pandas as pd
import numpy as np

# Read the Excel file; blank rows come back as all-NaN rows
df = pd.read_excel('ToSplit2.xlsx', header=None)
# Split at the indices of the blank rows
df_list = np.split(df, df[df.isnull().all(1)].index)
# Create a new Excel file and write one section per sheet
writer = pd.ExcelWriter('Excel_one.xlsx', engine='xlsxwriter')
for i, section in enumerate(df_list, start=1):
    # Drop the blank separator row carried at the top of each chunk
    section = section.dropna(how='all')
    section.to_excel(writer, sheet_name='Sheet{}'.format(i), header=None, index=False)
# Save the excel file
writer.close()
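If relying on np.split with a pandas object is a concern (it routes through DataFrame.swapaxes, which newer pandas versions deprecate), the same split can be sketched with plain iloc slicing. The toy frame below stands in for a parsed worksheet:

```python
import pandas as pd

# Toy frame standing in for a parsed worksheet: row 2 is entirely blank
df = pd.DataFrame(
    [["Section One", 1.0], ["Label 1", 100.0], [None, None],
     ["Section Two", 2.0], ["Label 1", 100.0]]
)

# Indices of the all-blank rows are the split points
blank = df[df.isnull().all(axis=1)].index.tolist()
bounds = [0] + blank + [len(df)]

# Slice between consecutive split points and drop the separator rows
parts = [df.iloc[s:e].dropna(how="all") for s, e in zip(bounds, bounds[1:])]

print(len(parts))           # 2
print(parts[1].iloc[0, 0])  # Section Two
```

Each part starts with its section-title row, so `parts[i].iloc[0, 0]` can still be used as the sheet name when writing out.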

Merging two dataframes which has duplicated 'on' value on one side

I have two dataframes, and the standard dataframe has some same values(=id) which i have to use as merging point.
+----+------------+------------+------------+
| id | res_number | type | payment |
+----+------------+------------+------------+
| a | 1 | toys | 20000 |
| a | 2 | clothing | 30000 |
| a | 3 | food | 40000 |
| b | 4 | food | 40000 |
| c | 5 | laptop | 30000 |
+----+------------+------------+------------+
I want to merge this dataframe with the dataframe below.
+----+------------+------------+
| id | group | unique_num |
+----+------------+------------+
| a | 1 | 1231 |
| b | 2 | 1234 |
| c | 1 | 1241 |
+----+------------+------------+
and i want to make dataframe like this.
+----+------------+------------+------------+------------+------------+
| id | res_number | type | payment | group | unique_num |
+----+------------+------------+------------+------------+------------+
| a | 1 | toys | 20000 | 1 | 1231 |
| a | 2 | clothing | 30000 | 1 | 1231 |
| a | 3 | food | 40000 | 1 | 1231 |
| b | 4 | food | 40000 | 2 | 1234 |
| c | 5 | laptop | 30000 | 1 | 1241 |
+----+------------+------------+------------+------------+------------+
As you can see, I want to merge the dataframes on 'id', but the first dataframe has duplicated values in 'id'. I just want the second dataframe's values pasted onto every row with a matching 'id'.
Can you give me a good example for this problem?
I think you need merge with left join:
df = pd.merge(df1, df2, how='left')
Or, if the DataFrames possibly share more common column names than just 'id', specify the key explicitly:
df = pd.merge(df1, df2, how='left', on='id')
print (df)
id payment res_number type group unique_num
0 a 20000 1 toys 1 1231
1 a 30000 2 clothing 1 1231
2 a 40000 3 food 1 1231
3 b 40000 4 food 2 1234
4 c 30000 5 laptop 1 1241
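A runnable sketch with the question's data, showing the left join broadcasting each df2 row across the duplicated ids:

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": ["a", "a", "a", "b", "c"],
    "res_number": [1, 2, 3, 4, 5],
    "type": ["toys", "clothing", "food", "food", "laptop"],
    "payment": [20000, 30000, 40000, 40000, 30000],
})
df2 = pd.DataFrame({
    "id": ["a", "b", "c"],
    "group": [1, 2, 1],
    "unique_num": [1231, 1234, 1241],
})

# Left join: every df1 row is kept; df2's columns repeat for duplicated ids
out = pd.merge(df1, df2, how="left", on="id")

print(len(out))                  # 5
print(out.loc[2, "unique_num"])  # 1231
```

Because the 'id' values in df2 are unique, this is a many-to-one merge and the row count of df1 is preserved.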

How do I get the change from the same quarter in the previous year in a pandas datatable grouped by more than 1 column

I have a datatable that looks like this (but with more than 1 country and many more years worth of data):
| Country | Year | Quarter | Amount |
-------------------------------------------
| UK | 2014 | 1 | 200 |
| UK | 2014 | 2 | 250 |
| UK | 2014 | 3 | 200 |
| UK | 2014 | 4 | 150 |
| UK | 2015 | 1 | 230 |
| UK | 2015 | 2 | 200 |
| UK | 2015 | 3 | 200 |
| UK | 2015 | 4 | 160 |
-------------------------------------------
I want to get the change for each row from the same quarter in the previous year. So for the first 4 rows in the example the change would be null (because there is no previous data for that quarter). For 2015 quarter 1, the difference would be 30 (because quarter 1 for the previous year is 200, so 230 - 200 = 30). So the data table I'm trying to get is:
| Country | Year | Quarter | Amount | Change |
---------------------------------------------------|
| UK | 2014 | 1 | 200 | NaN |
| UK | 2014 | 2 | 250 | NaN |
| UK | 2014 | 3 | 200 | NaN |
| UK | 2014 | 4 | 150 | NaN |
| UK | 2015 | 1 | 230 | 30 |
| UK | 2015 | 2 | 200 | -50 |
| UK | 2015 | 3 | 200 | 0 |
| UK | 2015 | 4 | 160 | 10 |
---------------------------------------------------|
From looking at other questions I've tried using the .diff() method but I'm not quite sure how to get it to do what I want (or if I'll actually need to do something more brute force to work this out), e.g. I've tried:
df.groupby(by=["Country", "Year", "Quarter"]).sum().diff().head(10)
This yields the difference from the previous row in the table as a whole though, rather than the difference from the same quarter for the previous year.
Since you want the change within each Country and Quarter across years, remove Year from the grouping:
df['Change'] = df.groupby(['Country', 'Quarter']).Amount.diff()
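A runnable sketch with the question's data (diff() takes rows in order within each group, so this assumes the rows are already sorted by Year inside each Country/Quarter group, as they are here; sort first if not):

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["UK"] * 8,
    "Year": [2014] * 4 + [2015] * 4,
    "Quarter": [1, 2, 3, 4, 1, 2, 3, 4],
    "Amount": [200, 250, 200, 150, 230, 200, 200, 160],
})

# Within each (Country, Quarter) group, diff() subtracts the value from
# the same quarter one year earlier; the first year has nothing to
# compare against, so it comes out as NaN
df["Change"] = df.groupby(["Country", "Quarter"])["Amount"].diff()

print(df["Change"].tolist())  # [nan, nan, nan, nan, 30.0, -50.0, 0.0, 10.0]
```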