Pandas MultiIndex DataFrame Sorting - python

I am looking for a way to sort a column in a DataFrame with multiple index levels. In my DataFrame index level 0 is state name ("STNAME") and index level 1 is city name ("CTYNAME").
My initial DataFrame looks like this:
In:
df = census_df
df = df.set_index(["STNAME" ,"CTYNAME"])
df = df.loc[: ,["CENSUS2010POP"]]
print(df.head())
Out:
                        CENSUS2010POP
STNAME  CTYNAME
Alabama Alabama               4779736
        Autauga County          54571
        Baldwin County         182265
        Barbour County          27457
        Bibb County             22915
However, when I sort by the "CENSUS2010POP" column, the hierarchy is broken up:
In:
df = census_df
df = df.set_index(["STNAME" ,"CTYNAME"])
df = df.loc[: ,["CENSUS2010POP"]]
df = df.sort_values("CENSUS2010POP")
print (df.head())
Out:
                         CENSUS2010POP
STNAME   CTYNAME
Texas    Loving County              82
Hawaii   Kalawao County             90
Texas    King County               286
         Kenedy County             416
Nebraska Arthur County             460
I am wondering if there's a way to sort by the column while keeping the index levels grouped, so that counties are sorted by population within each state.
Any help would be much appreciated.

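You can add STNAME to the sort_values call, so that rows are sorted by state first and then by population within each state:
df.sort_values(['STNAME', 'CENSUS2010POP'])
On random data:
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({
    'STNAME': [0]*4 + [1]*4,
    'CTYNAME': [0, 1, 2, 3]*2,
    'CENSUS2010POP': np.random.randint(10, 100, 8)
}).set_index(['STNAME', 'CTYNAME'])
print(df.sort_values(['STNAME', 'CENSUS2010POP']))
Output is: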
                CENSUS2010POP
STNAME CTYNAME
0      3                   19
       1                   22
       0                   47
       2                   82
1      1                   15
       3                   74
       0                   85
       2                   89
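If the goal is largest counties first within each state, one option is to pass a list to ascending. This is only a minimal sketch on a few rows borrowed from the question's output, not the real census_df, and it assumes a pandas version where sort_values accepts index level names (0.23 or later):

```python
import pandas as pd

# A handful of illustrative rows; the real census_df is not reproduced here
df = pd.DataFrame({
    "STNAME": ["Texas", "Texas", "Alabama", "Alabama"],
    "CTYNAME": ["Loving County", "King County", "Autauga County", "Bibb County"],
    "CENSUS2010POP": [82, 286, 54571, 22915],
}).set_index(["STNAME", "CTYNAME"])

# Sort by the STNAME index level ascending, then by population descending
by_state_desc = df.sort_values(
    ["STNAME", "CENSUS2010POP"], ascending=[True, False]
)
print(by_state_desc)
```

Here the states stay in alphabetical order, and within each state the rows run from the largest to the smallest population.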

Related

adding dataframes to same csv with blank row in between

I have two dataframes:
In [31]: df1
Out[31]:
State Score
0 Arizona AZ 62
1 Georgia GG 47
2 Newyork NY 55
3 Indiana IN 74
4 Florida FL 31
and
In [30]: df3
Out[30]:
letter number animal
0 c 3 cat
1 d 4 dog
I want to obtain a csv like this:
1 State Score
2 Arizona AZ 62
3 Georgia GG 47
4 Newyork NY 55
5 Indiana IN 74
6 Florida FL 31
7
8 letter number animal
9 c 3 cat
10 d 4 dog
I was able to obtain it by creating an empty dataframe, appending it to the first dataframe and then adding the second dataframe to the csv like this:
empty_df = pd.Series([],dtype=pd.StringDtype())
df1.append(empty_df, ignore_index=True).to_csv('foo.csv', index=False)
df3.to_csv('foo.csv', mode='a', index=False)
but I am getting a warning that 'append' is deprecated and that I should use 'concat' instead.
I tried this with concat:
pd.concat([df1, empty_df], ignore_index=True).to_csv('foo.csv', index=False)
df3.to_csv('foo.csv', mode='a', index=False)
but I am not getting the empty line between the 2 sets of data.
Use pandas.DataFrame with np.nan to create the empty row:
import numpy as np
empty_df = pd.DataFrame([[np.nan] * len(df1.columns)], columns=df1.columns)
pd.concat([df1, empty_df], ignore_index=True).to_csv('foo.csv', index=False)
df3.to_csv('foo.csv', mode='a', index=False)
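One side effect of the NaN row is that integer columns such as Score become floats, so the CSV shows 62.0 instead of 62. A possible alternative, sketched here under the assumption that both frames can be written through one open handle, is to write the blank line yourself. The sketch uses an in-memory io.StringIO buffer and shortened data so it is self-contained; a real file handle should be usable in its place.

```python
import io

import pandas as pd

# Shortened versions of the frames from the question
df1 = pd.DataFrame({"State": ["Arizona AZ", "Georgia GG"], "Score": [62, 47]})
df3 = pd.DataFrame({"letter": ["c", "d"], "number": [3, 4], "animal": ["cat", "dog"]})

buffer = io.StringIO()            # stand-in for an open file handle
df1.to_csv(buffer, index=False)   # first table
buffer.write("\n")                # the blank separator line
df3.to_csv(buffer, index=False)   # second table, with its own header

csv_text = buffer.getvalue()
print(csv_text)
```

Because no NaN row is concatenated, the Score column keeps its integer dtype in the output.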

How to calculate the percentage of the sum value of the column?

I have a pandas dataframe which looks like this:
Country Sold
Japan 3432
Japan 4364
Korea 2231
India 1130
India 2342
USA 4333
USA 2356
USA 3423
I have used the code below to get the sum of the "Sold" column:
df1= df.groupby(df['Country'])
df2 = df1.sum()
I want to ask how to calculate the percentage of the sum of "sold" column.
You can get the percentage by adding this code
df2["percentage"] = df2['Sold']*100 / df2['Sold'].sum()
In the output dataframe, a column with the percentage of each country is added.
We can divide the original Sold column by the per-country sums, using transform so that the sums keep the same length as the original DataFrame:
df.assign(
    pct_per=df['Sold'] / df.groupby('Country')['Sold'].transform('sum')
)
Country Sold pct_per
0 Japan 3432 0.440226
1 Japan 4364 0.559774
2 Korea 2231 1.000000
3 India 1130 0.325461
4 India 2342 0.674539
5 USA 4333 0.428501
6 USA 2356 0.232991
7 USA 3423 0.338509
Simple Solution
You were almost there.
First, group by country.
Then create the new percentage column by dividing each country's grouped sales by the sum of all sales.
# reset_index() is only there because the groupby makes the grouped column the index
df_grouped_countries = df.groupby(df.Country).sum().reset_index()
df_grouped_countries['pct_sold'] = df_grouped_countries.Sold / df.Sold.sum()
Are you looking for the percentage after or before aggregation?
import pandas as pd
countries = [['Japan',3432],['Japan',4364],['Korea',2231],['India',1130], ['India',2342],['USA',4333],['USA',2356],['USA',3423]]
df = pd.DataFrame(countries,columns=['Country','Sold'])
df1 = df.groupby(df['Country'])
df2 = df1.sum()
df2['percentage'] = (df2['Sold']/df2['Sold'].sum()) * 100
df2
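To make the "before or after aggregation" distinction concrete, here is a small sketch that computes both on the question's data. The variable names share_of_total and share_within_country are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Japan", "Japan", "Korea", "India", "India", "USA", "USA", "USA"],
    "Sold": [3432, 4364, 2231, 1130, 2342, 4333, 2356, 3423],
})

# After aggregation: each country's share of the grand total, in percent
totals = df.groupby("Country")["Sold"].sum()
share_of_total = totals / totals.sum() * 100

# Before aggregation: each row's share of its own country's total, in percent
country_sum = df.groupby("Country")["Sold"].transform("sum")
share_within_country = df["Sold"] / country_sum * 100

print(share_of_total)
print(share_within_country)
```

The first result has one row per country and sums to 100 overall; the second has one row per original record and sums to 100 within each country.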

Pandas iterate over rows and find the column names

I have two dataframes:
df = pd.DataFrame({'America': ["Ohio", "Utah", "New York"],
                   'Italy': ["Rome", "Milan", "Venice"],
                   'Germany': ["Berlin", "Munich", "Jena"]})
df2 = pd.DataFrame({'Cities': ["Rome", "New York", "Munich"],
                    'Country': ["na", "na", "na"]})
I want to iterate over the df2 "Cities" column, find each city in df, and write the country of that city (the df column name) into the df2 "Country" column.
Use melt with map by dictionary:
df1 = df.melt()
print (df1)
variable value
0 America Ohio
1 America Utah
2 America New York
3 Italy Rome
4 Italy Milan
5 Italy Venice
6 Germany Berlin
7 Germany Munich
8 Germany Jena
df2['Country'] = df2['Cities'].map(dict(zip(df1['value'], df1['variable'])))
#alternative, thanks @Sandeep Kadapa
#df2['Country'] = df2['Cities'].map(df1.set_index('value')['variable'])
print (df2)
Cities Country
0 Rome Italy
1 New York America
2 Munich Germany
After melting and renaming the first dataframe:
df1 = df.melt().rename(columns={'variable': 'Country', 'value': 'Cities'})
the solution is a simple merge:
df2 = df2[['Cities']].merge(df1, on='Cities')
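Note that a plain merge is an inner join, so a city that does not appear in df would be dropped from df2. If those rows should be kept, a left join is one option. The sketch below assumes a hypothetical unmatched city ("Paris") that is not in the original question, added only to show the behaviour:

```python
import pandas as pd

df = pd.DataFrame({
    "America": ["Ohio", "Utah", "New York"],
    "Italy": ["Rome", "Milan", "Venice"],
    "Germany": ["Berlin", "Munich", "Jena"],
})
# "Paris" is a made-up entry with no match in df
df2 = pd.DataFrame({"Cities": ["Rome", "New York", "Paris"]})

# Long format: one row per (Country, Cities) pair
lookup = df.melt(var_name="Country", value_name="Cities")

# how="left" keeps every row of df2; unmatched cities get NaN as Country
result = df2.merge(lookup, on="Cities", how="left")
print(result)
```

The left join also preserves the row order of df2, which an inner join does not promise in general.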

multiple files combination in pandas

Assume the file1 is:
State Date
0 NSW 01/02/16
1 NSW 01/03/16
3 VIC 01/04/16
...
100 TAS 01/12/17
File 2 is:
State 01/02/16 01/03/16 01/04/16 .... 01/12/17
0 VIC 10000 12000 14000 .... 17600
1 NSW 50000
....
Now I would like to join these two files based on Date
In the other words, I want to combine the file1's Date column with file2 columns' date.
I believe you need melt with merge. The on parameter can be omitted, in which case merge joins on all columns that the two DataFrames have in common:
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df = df2.melt('State', var_name='Date', value_name='col').merge(df1, how='right')
print (df)
State Date col
0 NSW 01/02/16 50000.0
1 NSW 01/03/16 NaN
2 VIC 01/04/16 14000.0
3 TAS 01/12/17 NaN
Solution with left join:
df = df1.merge(df2.melt('State', var_name='Date', value_name='col'), how='left')
print (df)
State Date col
0 NSW 01/02/16 50000.0
1 NSW 01/03/16 NaN
2 VIC 01/04/16 14000.0
3 TAS 01/12/17 NaN
You can melt the second data frame to a long format, then merge with first data frame to get the values.
import pandas as pd
df1 = pd.DataFrame({'State': ['NSW', 'NSW', 'VIC'],
                    'Date': ['01/02/16', '01/03/16', '01/04/16']})
df2 = pd.DataFrame([['VIC', 10000, 12000, 14000],
                    ['NSW', 50000, 60000, 62000]],
                   columns=['State', '01/02/16', '01/03/16', '01/04/16'])
df1.merge(pd.melt(df2, id_vars=['State'], var_name='Date'), on=['State', 'Date'])
# returns:
Date State value
0 01/02/16 NSW 50000
1 01/03/16 NSW 60000
2 01/04/16 VIC 14000

Update Specific Pandas Rows with Value from Different Dataframe

I have a pandas dataframe that contains budget data but my sales data is located in another dataframe that is not the same size. How can I get my sales data updated in my budget data? How can I write conditions so that it makes these updates?
DF budget:
cust type loc rev sales spend
0 abc new north 500 0 250
1 def new south 700 0 150
2 hij old south 700 0 150
DF sales:
cust type loc sales
0 abc new north 15
1 hij old south 18
DF budget outcome:
cust type loc rev sales spend
0 abc new north 500 15 250
1 def new south 700 0 150
2 hij old south 700 18 150
Any thoughts?
Assuming the 'cust' column is unique in the sales DataFrame (df1 here), you can set 'cust' as its index and call map on the 'cust' column of the budget DataFrame (df here). This looks up the sales value for each 'cust' in the budget DataFrame. Customers with no match get NaN, so call fillna(0) to fill those values:
In [76]:
df['sales'] = df['cust'].map(df1.set_index('cust')['sales']).fillna(0)
df
Out[76]:
cust type loc rev sales spend
0 abc new north 500 15 250
1 def new south 700 0 150
2 hij old south 700 18 150
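For reference, a self-contained sketch of the same idea on the data from the question. The names budget and sales stand in for df and df1 above, and the astype(int) step is an addition here, since fillna(0) on its own leaves the column as floats:

```python
import pandas as pd

budget = pd.DataFrame({
    "cust": ["abc", "def", "hij"],
    "type": ["new", "new", "old"],
    "loc": ["north", "south", "south"],
    "rev": [500, 700, 700],
    "sales": [0, 0, 0],
    "spend": [250, 150, 150],
})
sales = pd.DataFrame({
    "cust": ["abc", "hij"],
    "type": ["new", "old"],
    "loc": ["north", "south"],
    "sales": [15, 18],
})

# Lookup Series: cust -> sales (assumes cust is unique in the sales frame)
sales_by_cust = sales.set_index("cust")["sales"]

# Map each budget customer to its sales value; unmatched customers become 0
budget["sales"] = budget["cust"].map(sales_by_cust).fillna(0).astype(int)
print(budget)
```

If 'cust' were not unique in the sales frame, map would raise an error on the duplicated index, so this approach relies on the uniqueness assumption stated above.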
