Pandas Python - Grouping counts to others

I am conducting data analysis for a project using Python and pandas, and I have the following data (the numbers are the counts):
USA: 5000
Canada: 7000
UK: 6000
France: 6500
Spain: 4000
Japan: 5
China: 7
Hong Kong: 10
Taiwan: 6
New Zealand: 8
South Africa: 11
My task is to make a pie chart that represents the counts.
df['Country'].value_counts().plot.pie()
What I will get is a pie chart, but I would like to combine the countries with smaller counts into a single category such as "other".
How can I do that?

IIUC, use np.where to set the boundary, then groupby + sum. Notice that here I am using pandas.Series.groupby:
import numpy as np

s = df['Country'].value_counts()
s.groupby(np.where(s >= 4000, s.index, 'other')).sum()#.plot.pie()
Out[64]:
Canada 7000
France 6500
Spain 4000
UK 6000
USA 5000
other 47
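For completeness, here is a minimal runnable sketch under the assumption that the table above is already the per-country counts (i.e. what value_counts() would return); the 4000 threshold and the 'other' label are just illustrative choices:
import pandas as pd
import numpy as np

# Per-country counts copied from the table in the question
s = pd.Series({'USA': 5000, 'Canada': 7000, 'UK': 6000, 'France': 6500,
               'Spain': 4000, 'Japan': 5, 'China': 7, 'Hong Kong': 10,
               'Taiwan': 6, 'New Zealand': 8, 'South Africa': 11})

# Fold every country below the (illustrative) 4000 threshold into 'other'
grouped = s.groupby(np.where(s >= 4000, s.index, 'other')).sum()

# Plot the collapsed counts as a pie chart
grouped.plot.pie(autopct='%1.1f%%')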

Related

How to convert "event" data into country-year data by summating information in columns? Using python/pandas

I am trying to convert a dataframe where each row is a specific event, and each column has information about the event. I want to turn this into data in which each row is a country and year, with information about the number and characteristics of the events in the given year. In this data set, each event is an occurrence of terrorism, and I want to sum the columns nkill, nhostage, and nwounded per year. This data set has 16 countries in West Africa and covers the years 2000-2020, with a total of roughly 8000 events recorded. The data comes from the Global Terrorism Database, and this is for a thesis/independent research project (i.e. not a graded class assignment).
Right now my data looks like this (there are a ton of other columns but they aren't important for this):
eventID    iyear  country_txt  nkill  nwounded  nhostages
10000102   2000   Nigeria      3      10        0
10000103   2000   Mali         1      3         15
10000103   2000   Nigeria      15     0         0
10000103   2001   Benin        1      0         0
10000103   2001   Nigeria      1      3         15
...
And I would like it to look like this:
country_txt  iyear  total_nkill  total_nwounded  total_nhostages
Nigeria      2000   200          300             300
Nigeria      2001   250          450             15
So basically, I want to add up nkill, nwounded, and nhostages for each country-year group, so that I have a list of all the countries and years with the total number of deaths, injuries, and hostages taken per year. The countries also have an associated number if it is easier to write the code with a number instead of country_txt; the column with the country's number is just "country".
For a solution, I've been looking at the pandas "groupby" function, but I'm really new to coding, so I'm having trouble understanding the documentation. It also seems like the melt or pivot functions could be helpful.
This simplified example shows how you could use groupby -
import pandas as pd

df = pd.DataFrame({'country': ['Nigeria', 'Nigeria', 'Nigeria', 'Mali'],
                   'year': [2000, 2000, 2001, 2000],
                   'events1': [3, 4, 5, 2],
                   'events2': [1, 6, 3, 4]})
df2 = df.groupby(['country', 'year'])[['events1', 'events2']].sum()
print(df2)
which gives the total of each type of event by country and by year
events1 events2
country year
Mali 2000 2 4
Nigeria 2000 7 7
2001 5 3
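Applied to the column names in your sample data, a sketch might look like this (assuming the columns are named exactly country_txt, iyear, nkill, nwounded, and nhostages, and that you want the totals as ordinary columns rather than an index):
totals = (df.groupby(['country_txt', 'iyear'], as_index=False)
            [['nkill', 'nwounded', 'nhostages']]
            .sum()
            .rename(columns={'nkill': 'total_nkill',
                             'nwounded': 'total_nwounded',
                             'nhostages': 'total_nhostages'}))
print(totals)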

Comparing Values in Multiple Columns and Returning all Declining Regions

I have a data frame that is similar to the following, and let's say I have sales amounts for different regions for two different years:
Company    2021 Region 1 Sales  2021 Region 2 Sales  2020 Region 1 Sales  2020 Region 2 Sales
Company 1  300000               150000               250000               149000
Company 2  10000                17000                100000               80000
Company 3  12000                20000                22000                90000
I would like to compare each region for each year to determine which regions have declined in 2021. One caveat is that the regional sales have to be at least $25,000 to be counted. Therefore, I am looking to add a new column with all of the region names that had less than $25,000 in sales in 2021, but more than $25,000 in 2020. The output would look like this, although there will be more columns or "regions" to compare than 2.
Company    2021 Region 1 Sales  2021 Region 2 Sales  2020 Region 1 Sales  2020 Region 2 Sales  2021 Lost Regions
Company 1  300000               150000               250000               149000               None
Company 2  10000                17000                100000               80000                Region 1; Region 2
Company 3  12000                20000                22000                90000                Region 2
Thank you in advance for any assistance, and no rush on this. Hopefully there is a concise way to do this without using if-then and writing out a lot of combinations.
number_of_regions = 2  # You have to change this

def find_declined_regions(row):
    result = []
    for i in range(1, number_of_regions + 1):
        if row[f"2021 Region {i} Sales"] < 25000 and row[f"2020 Region {i} Sales"] > 25000:
            result.append(f"Region {i}")
    return "; ".join(result)

df.apply(find_declined_regions, axis=1)
df is your DataFrame and you have to change number_of_regions based on your problem.
EDIT:
If the column names are all different, there are two cases:
1- You have a list of all regions, so you can do this:
for region in all_regions:
    if row[f"2021 {region} Sales"] < 25000 and row[f"2020 {region} Sales"] > 25000:
        result.append(region)
2- You don't have a list of all regions, so you have to create one:
all_regions = [col[5:-6] for col in df.columns[1:int(len(df.columns)/2)+1]]
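Putting the edit together, a sketch of the generalized version might look like this (assuming, as in the example, that every region has a matching "2021 ... Sales" and "2020 ... Sales" column and that the first column is Company):
# Derive the region names from the 2021 columns (strip the "2021 " prefix and " Sales" suffix)
all_regions = [col[5:-6] for col in df.columns[1:int(len(df.columns) / 2) + 1]]

def find_declined_regions(row):
    result = []
    for region in all_regions:
        # A region is "lost" if it fell below 25,000 in 2021 but was above 25,000 in 2020
        if row[f"2021 {region} Sales"] < 25000 and row[f"2020 {region} Sales"] > 25000:
            result.append(region)
    return "; ".join(result)

df["2021 Lost Regions"] = df.apply(find_declined_regions, axis=1)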

Using 'groupby' without aggregation and sorting within groups

I have countries_of_the_world.csv. Basically, it's the table with the following bits of information:
Country Region GDP
Austria Western Europe 100
Chad Africa 30
I need to sort GDP values in descending order by region with countries inside these regions. It should look like:
Region Country GDP
Africa Egypt 42
Chad 30
Kongo 28
Oceania Australia 120
New Zealand 100
Indonesia 50
I tried 'groupby', but it doesn't work without an aggregation function applied, so I tried a lambda, but it didn't sort correctly:
countries.sort_values(['GDP'], ascending=False).groupby(['Region','Country']).aggregate(lambda x:x)
How can I handle it?
Use DataFrame.sort_values by both columns and then convert Region and Country to MultiIndex by DataFrame.set_index:
df1 = (countries.sort_values(['Region', 'GDP'], ascending=[True, False])
                .set_index(['Region', 'Country']))
print(df1)
GDP
Region Country
Africa Egypt 42
Chad 30
Kongo 28
Oceania Australia 120
New Zealand 100
Indonesia 50

Group and sum data by common prefix from column value with different length prefixes and inconsistent delimiters in a Pandas dataframe

I am new to using Python and Pandas, but have been trying to automate some of the data cleaning/merging for reports of mine. So far I've had success in building up the combined file of all information I need to feed into my reporting summary, but have gotten stuck with grouping and merging data with matching prefixes.
I have a data set that is structured similar to this in a pandas dataframe:
Company_Num Company_Name 2019_Amt 2020_Amt Code Flag Manager
1 ABC Company Ltd 2000 400 A Y John
1 ABC Company Ltd 2000 400 A Y John
2 DEFGHIJ Company (London) 480 100 B N James
3 DEFGHIJ Company (Bristol) 600 700 B N James
4 DEFGHIJ Company (York) 1500 1000 B N James
5 KLM Services 9000 7000 A Y Jane
6 NOPQ Industries 300 400 C Y Jen
7 NOPQ Industries - London 7000 8000 C Y Jen
I'm wanting to get a summary set of data where there are no duplicates in my data, and instead of having rows for each office I have one summarised value for each company. Ultimately with a dataframe like:
Company_Name 2019_Amt 2020_Amt Code Flag
ABC Company Ltd 2000 400 A Y
DEFGHIJ Company 2580 1800 B N
KLM Services 9000 7000 A Y
NOPQ Industries 7300 8400 C Y
So far I have managed to drop the duplicates using:
df.drop_duplicates(subset=['Company_Num', 'Company_Name', 'Code', '2019_Amt', '2020_Amt'])
With the resulting table:
Company_Num Company_Name 2019_Amt 2020_Amt Code Flag Manager
1 ABC Company Ltd 2000 400 A Y John
2 DEFGHIJ Company (London) 480 100 B N James
3 DEFGHIJ Company (Bristol) 600 700 B N James
4 DEFGHIJ Company (York) 1500 1000 B N James
5 KLM Services 9000 7000 A Y Jane
6 NOPQ Industries 300 400 C Y Jen
7 NOPQ Industries - London 7000 8000 C Y Jen
The solution that I have tried is to substring the first 9 characters of each company name and use a groupby and sum on those, but that leaves me with the column being saved as the substring. This has also dropped the columns Code and Flag from my dataframe, leaving me with a table like this:
df['SubString_Company_Name'] = df['Company_Name'].str.slice(0,9)
df.groupby([df.SubString_Company_Name]).sum().reset_index()
SubString_Company_Name 2019_Amt 2020_Amt
ABC Compa 2000 400
DEFGHIJ C 2580 1800
KLM Servi 9000 7000
NOPQ Indu 7300 8400
I have tried to use the os.path.commonprefix function to get the company names, but can't find a way to use it in a dataframe and for multiple values. My understanding is that it will look at the list as a whole and return the longest common prefix of the whole list, which wouldn't work. I have also considered extracting all duplicate substrings into new dataframes, summing and renaming there, and then merging back into one data set, but I'm not sure if that would work. The solutions I've found online have been centred around uniform data where lambda can be used with a delimiter or the prefix is always the same size, whereas my data is not uniform and the prefixes are varying sizes.
My data changes every month, so I want to design a dynamic solution that doesn't rely on substrings, since I could run into issues with only taking 9 characters. My final consideration is to extract the SubString_Company_Name into a list, convert that to the os.path.commonprefix of the Company_Name, save the unique commonprefix value of each Company_Name into a new list, and for each item in that list create a new summary table. But I don't know if this would work, and I want to know if there's a better or more efficient way of doing this before trying.
You can use groupby.agg after dropping duplicates, and use Series.str.split with the first word from the split (.str[0]) as the grouper:
d = {'Company_Name': 'first', '2019_Amt': 'sum',
     '2020_Amt': 'sum', 'Code': 'first', 'Flag': 'first'}
grouper = df['Company_Name'].str.split().str[0]
out = df.drop_duplicates().groupby(grouper).agg(d).reset_index(drop=True)
print(out)
Company_Name 2019_Amt 2020_Amt Code Flag
0 ABC Company Ltd 2000 400 A Y
1 DEFGHIJ Company (London) 2580 1800 B N
2 KLM Services 9000 7000 A Y
3 NOPQ Industries 7300 8400 C Y
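If you also want the displayed name collapsed to the common part (e.g. "DEFGHIJ Company" rather than "DEFGHIJ Company (London)"), one sketch is to aggregate the names with os.path.commonprefix, which you mentioned, instead of 'first'; this assumes the shared prefix is meaningful for every group:
import os

d = {'Company_Name': lambda s: os.path.commonprefix(list(s)).strip(' -('),
     '2019_Amt': 'sum', '2020_Amt': 'sum', 'Code': 'first', 'Flag': 'first'}
grouper = df['Company_Name'].str.split().str[0]
out = df.drop_duplicates().groupby(grouper).agg(d).reset_index(drop=True)
print(out)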

How to calculate the percentage of the sum value of the column?

I have a pandas dataframe which looks like this:
Country Sold
Japan 3432
Japan 4364
Korea 2231
India 1130
India 2342
USA 4333
USA 2356
USA 3423
I have used the code below to get the sum of the "Sold" column:
df1= df.groupby(df['Country'])
df2 = df1.sum()
I want to ask how to calculate the percentage of the sum of the "Sold" column.
You can get the percentage by adding this code
df2["percentage"] = df2['Sold']*100 / df2['Sold'].sum()
In the output dataframe, a column with the percentage of each country is added.
We can divide the original Sold column by a new column consisting of the grouped sums, kept at the same length as the original DataFrame, by using transform:
df.assign(
    pct_per=df['Sold'] / df.groupby('Country').transform(pd.DataFrame.sum)['Sold']
)
Country Sold pct_per
0 Japan 3432 0.440226
1 Japan 4364 0.559774
2 Korea 2231 1.000000
3 India 1130 0.325461
4 India 2342 0.674539
5 USA 4333 0.428501
6 USA 2356 0.232991
7 USA 3423 0.338509
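An equivalent, arguably more common way to write the same per-row percentage is the string-based transform, which avoids passing the unbound DataFrame method; this is just an alternative spelling, not a different result:
df['pct_per'] = df['Sold'] / df.groupby('Country')['Sold'].transform('sum')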
Simple Solution
You were almost there.
First you need to group by country.
Then create the new percentage column (by dividing the grouped sales by the sum of all sales).
# reset_index() is only there because the groupby makes the grouped column the index
df_grouped_countries = df.groupby(df.Country).sum().reset_index()
df_grouped_countries['pct_sold'] = df_grouped_countries.Sold / df.Sold.sum()
Are you looking for the percentage after or before aggregation?
import pandas as pd
countries = [['Japan',3432],['Japan',4364],['Korea',2231],['India',1130], ['India',2342],['USA',4333],['USA',2356],['USA',3423]]
df = pd.DataFrame(countries,columns=['Country','Sold'])
df1 = df.groupby(df['Country'])
df2 = df1.sum()
df2['percentage'] = (df2['Sold']/df2['Sold'].sum()) * 100
df2
