I have countries_of_the_world.csv. Basically, it's a table with the following columns:
Country Region GDP
Austria Western Europe 100
Chad Africa 30
I need to sort the GDP values in descending order within each region, with the countries listed inside their regions. It should look like:
Region Country GDP
Africa Egypt 42
Chad 30
Kongo 28
Oceania Australia 120
New Zealand 100
Indonesia 50
I tried 'groupby', but it doesn't work without an aggregation function applied, so I tried a lambda, but it didn't sort correctly:
countries.sort_values(['GDP'], ascending=False).groupby(['Region','Country']).aggregate(lambda x:x)
How can I handle it?
Use DataFrame.sort_values on both columns, then convert Region and Country to a MultiIndex with DataFrame.set_index:
df1 = (countries.sort_values(['Region','GDP'], ascending=[True, False])
                .set_index(['Region','Country']))
print(df1)
GDP
Region Country
Africa Egypt 42
Chad 30
Kongo 28
Oceania Australia 120
New Zealand 100
Indonesia 50
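If you need Region and Country back as ordinary columns afterwards (the MultiIndex above mainly affects the display), reset_index restores them while keeping the sorted order:
df2 = df1.reset_index()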
Given a dataset -
country year cases population
Afghanistan 1999 745 19987071
Brazil 1999 37737 172006362
China 1999 212258 1272915272
Afghanistan 2000 2666 20595360
Brazil 2000 80488 174504898
China 2000 213766 1280428583
The task is to get the ratio of cases to population using the pandas apply function, in a new column called "prevalence".
This is what I have written:
def calc_prevalence(G):
    assert 'cases' in G.columns and 'population' in G.columns
    G_copy = G.copy()
    G_copy['prevalence'] = G_copy['cases','population'].apply(lambda x: (x['cases']/x['population']))
    display(G_copy)
but I am getting a
KeyError: ('cases', 'population')
Here is a solution that applies a named function to the dataframe without using lambda:
def calculate_ratio(row):
    return row['cases']/row['population']

df['prevalence'] = df.apply(calculate_ratio, axis=1)
print(df)
#output:
country year cases population prevalence
0 Afghanistan 1999 745 19987071 0.000037
1 Brazil 1999 37737 172006362 0.000219
2 China 1999 212258 1272915272 0.000167
3 Afghanistan 2000 2666 20595360 0.000129
4 Brazil 2000 80488 174504898 0.000461
5 China 2000 213766 1280428583 0.000167
First, unless you've been explicitly told to use an apply function here for some reason, you can perform the operation on the columns themselves, which results in a much faster vectorized operation, i.e.:
G_copy['prevalence']=G_copy['cases']/G_copy['population']
Second, if you must use an apply for some reason, apply on the DataFrame instead of the two Series:
G_copy['prevalence']=G_copy.apply(lambda row: row['cases']/row['population'],axis=1)
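For reference, here is a self-contained sketch of the vectorized approach, reconstructing the question's data (the DataFrame construction is mine, not from the original post):

import pandas as pd

# the question's data, reconstructed
df = pd.DataFrame({
    'country': ['Afghanistan', 'Brazil', 'China', 'Afghanistan', 'Brazil', 'China'],
    'year': [1999, 1999, 1999, 2000, 2000, 2000],
    'cases': [745, 37737, 212258, 2666, 80488, 213766],
    'population': [19987071, 172006362, 1272915272, 20595360, 174504898, 1280428583],
})

# vectorized division; no apply needed
df['prevalence'] = df['cases'] / df['population']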
I have a df named population with a column named countries. I want to merge rows so they reflect regions (Africa, West Hem, Asia, Europe, Mideast). I have another df named regionref from Kaggle that has all the countries of the world and the region each one belongs to.
How do I create a new column in the population df that holds the corresponding region for each country in the country column, using the region column from the Kaggle dataset?
so essentially this is the population dataframe
CountryName 1960 1950 ...
US
Zambia
India
And this is the regionref dataset
Country Region GDP...
US West Hem
Zambia Africa
India Asia
And I want the population df to look like
CountryName Region 1960 1950 ...
US West Hem
Zambia Africa
India Asia
EDIT: I tried the concatenation, but for some reason the two columns are not being recognized as holding the same values:
population['Country Name'].isin(regionref['Country']).value_counts()
This returned False for every row, as if there were no values in common, even though the two columns clearly do share countries.
You just need a join, or in pandas terms, a concatenation.
Given two DataFrames pop, region:
pop = pd.DataFrame([['US', 1000, 2000], ['CN', 2000, 3000]], columns=['CountryName', 1950, 1960])
CountryName 1950 1960
0 US 1000 2000
1 CN 2000 3000
region = pd.DataFrame([['US', 'AMER', '5'], ['CN', 'ASIA', '4']], columns = ['Country', 'Region', 'GDP'])
Country Region GDP
0 US AMER 5
1 CN ASIA 4
You can do:
pd.concat([region.set_index('Country'), pop.set_index('CountryName')], axis=1)\
    .drop('GDP', axis=1)
Region 1950 1960
US AMER 1000 2000
CN ASIA 2000 3000
The axis=1 is for concatenating horizontally, and setting the index on both frames is what makes the rows join on the right country.
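If you'd rather not set indexes at all, a left merge gets the same result; a minimal sketch assuming the frames are named pop and region as above:

# keep every row of pop and attach the matching Region,
# then drop the duplicate key column from region
merged = pop.merge(region[['Country', 'Region']],
                   left_on='CountryName', right_on='Country',
                   how='left').drop(columns='Country')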
I have a pandas dataframe which looks like this:
Country Sold
Japan 3432
Japan 4364
Korea 2231
India 1130
India 2342
USA 4333
USA 2356
USA 3423
I have used the code below to get the sum of the "Sold" column:
df1= df.groupby(df['Country'])
df2 = df1.sum()
I want to ask how to calculate each country's percentage of the total of the "Sold" column.
You can get the percentage by adding this code
df2["percentage"] = df2['Sold']*100 / df2['Sold'].sum()
In the output dataframe, a column with the percentage of each country is added.
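For the sample data in the question, the result should look roughly like this (percentages rounded to four decimal places):
         Sold  percentage
Country
India    3472     14.7050
Japan    7796     33.0185
Korea    2231      9.4490
USA     10112     42.8275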
We can divide the original Sold column by the per-country sums, broadcast back to the length of the original DataFrame, by using transform:
df.assign(
    pct_per=df['Sold'] / df.groupby('Country')['Sold'].transform('sum')
)
Country Sold pct_per
0 Japan 3432 0.440226
1 Japan 4364 0.559774
2 Korea 2231 1.000000
3 India 1130 0.325461
4 India 2342 0.674539
5 USA 4333 0.428501
6 USA 2356 0.232991
7 USA 3423 0.338509
Simple Solution
You were almost there.
First you need to group by country.
Then create the new percentage column by dividing the grouped sales by the sum of all sales:
# reset_index() is only there because the groupby makes the grouped column the index
df_grouped_countries = df.groupby(df.Country).sum().reset_index()
df_grouped_countries['pct_sold'] = df_grouped_countries.Sold / df.Sold.sum()
Are you looking for the percentage after or before aggregation?
import pandas as pd
countries = [['Japan',3432],['Japan',4364],['Korea',2231],['India',1130], ['India',2342],['USA',4333],['USA',2356],['USA',3423]]
df = pd.DataFrame(countries,columns=['Country','Sold'])
df1 = df.groupby(df['Country'])
df2 = df1.sum()
df2['percentage'] = (df2['Sold']/df2['Sold'].sum()) * 100
df2
I am conducting data analysis for a project using Python and pandas, where I have the following data (the numbers are the counts):
USA: 5000
Canada: 7000
UK: 6000
France: 6500
Spain: 4000
Japan: 5
China: 7
Hong Kong: 10
Taiwan: 6
New Zealand: 8
South Africa: 11
My task is to make a pie chart that represents the counts:
df['Country'].value_counts().plot.pie()
This gives me a pie chart, but I would like to combine the countries with smaller counts into a single category like "other".
How can I do that?
IIUC, use np.where to set the boundary, then groupby + sum. Notice that here I am grouping the Series itself (pandas.Series.groupby):
import numpy as np

s = df['Country'].value_counts()
s.groupby(np.where(s >= 4000, s.index, 'other')).sum()#.plot.pie()
Out[64]:
Canada 7000
France 6500
Spain 4000
UK 6000
USA 5000
other 47
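A self-contained sketch of the same idea, building the counts from the question as a Series by hand (the construction is mine) and plotting the grouped result (requires matplotlib):

import numpy as np
import pandas as pd

s = pd.Series({'USA': 5000, 'Canada': 7000, 'UK': 6000, 'France': 6500,
               'Spain': 4000, 'Japan': 5, 'China': 7, 'Hong Kong': 10,
               'Taiwan': 6, 'New Zealand': 8, 'South Africa': 11})

# lump everything under 4000 into 'other', then plot
s.groupby(np.where(s >= 4000, s.index, 'other')).sum().plot.pie()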
I'm trying to fill in country names in my dataframe where they are null, based on city/country pairs that do exist elsewhere in it. For example, in the dataframe below I want to replace the NaN for the city Bangalore with the country India, since that city already appears with a country:
df1=
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore NaN
I am new to this so any help would be appreciated :).
You can create a series mapping after dropping nulls and duplicates.
Then use fillna with pd.Series.map:
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
df['Country'] = df['Country'].fillna(df['City'].map(g))
print(df)
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore India
This solution will also work if NaN occurs first within a group.
I believe
df1['Country'] = df1.groupby('City')['Country'].fillna(method='ffill')
should resolve your issue by forward filling missing values within each city group. Note that a plain forward fill only helps when a non-null value appears before the NaN in its group.
One of the ways could be -
non_null_cities = df1.dropna().drop_duplicates(['City']).rename(columns={'Country':'C'})
df1 = df1.merge(non_null_cities, on='City', how='left')
df1.loc[df1['Country'].isnull(), 'Country'] = df1['C']
del df1['C']
Hope this will be helpful!
Here is one nasty way to do it:
first use a forward fill and then a backward fill (for the case where the NaN occurs first in a group):
df = df.groupby('City')[['City','Country']].fillna(method='ffill')\
       .groupby('City')[['City','Country']].fillna(method='bfill')
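On recent pandas versions, fillna(method=...) is deprecated in favor of the dedicated ffill/bfill methods; a sketch of the grouped equivalent under that assumption:

# forward fill then backward fill within each city group
df['Country'] = df.groupby('City')['Country'].ffill()
df['Country'] = df.groupby('City')['Country'].bfill()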