Summarize values in pandas data frames - python

I want to calculate the maximum value for each year and show the sector and that value. For example, from the screenshot, I would like to display:
2010: Telecom 781
2011: Tech 973
I have tried using:
df.groupby(['Year', 'Sector'])['Revenue'].max()
but this does not give me the name of the Sector that has the highest value.

Try using idxmax and loc:
df.loc[df.groupby(['Sector','Year'])['Revenue'].idxmax()]
MCVE:
import pandas as pd
import numpy as np

np.random.seed(123)
df = pd.DataFrame({'Sector': ['Telecom', 'Tech', 'Financial Service',
                              'Construction', 'Heath Care'] * 3,
                   'Year': [2010, 2011, 2012, 2013, 2014] * 3,
                   'Revenue': np.random.randint(101, 999, 15)})
df.loc[df.groupby(['Sector','Year'])['Revenue'].idxmax()]
Output:
Sector Year Revenue
3 Construction 2013 423
12 Financial Service 2012 838
9 Heath Care 2014 224
1 Tech 2011 466
5 Telecom 2010 843

Alternatively, use .sort_values + .tail, grouping on just Year. Data from @Scott Boston:
df.sort_values('Revenue').groupby('Year').tail(1)
Output:
Sector Year Revenue
9 Heath Care 2014 224
3 Construction 2013 423
1 Tech 2011 466
12 Financial Service 2012 838
5 Telecom 2010 843
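Since the question only asks for the maximum per year, grouping on Year alone also works. A minimal sketch reusing the same seeded sample data; idxmax returns the row label of each year's maximum, and .loc turns those labels back into full rows, Sector included:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'Sector': ['Telecom', 'Tech', 'Financial Service',
                              'Construction', 'Heath Care'] * 3,
                   'Year': [2010, 2011, 2012, 2013, 2014] * 3,
                   'Revenue': np.random.randint(101, 999, 15)})

# one row per year: the row holding that year's maximum Revenue
top_per_year = df.loc[df.groupby('Year')['Revenue'].idxmax()]
print(top_per_year)
```

This returns exactly one row per Year, which matches the 2010/2011 display format asked for.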

Related

Adding columns and index to sum up values using pandas in Python

I have a .csv file; after reading it with pandas I have this output:
Year Month Brunei Darussalam ... Thailand Viet Nam Myanmar
348 2007 Jan 3813 ... 25863 12555 4887
349 2007 Feb 3471 ... 22575 11969 3749
350 2007 Mar 4547 ... 33087 14060 5480
351 2007 Apr 3265 ... 34500 15553 6838
352 2007 May 3641 ... 30555 14995 5295
.. ... ... ... ... ... ... ...
474 2017 Jul 5625 ... 48620 71153 12619
475 2017 Aug 4610 ... 40993 51866 10934
476 2017 Sep 5387 ... 39692 40270 9888
477 2017 Oct 4202 ... 61448 39013 11616
478 2017 Nov 5258 ... 39304 36964 11402
I use this to sum each country over all the years, to display the top 3:
top3_country = new_df.iloc[0:, 2:9].sum(axis=0).sort_values(ascending=False).nlargest(3)
though my output is this
Indonesia 27572424
Malaysia 11337420
Philippines 6548622
I want to add column names and an index to the summed values, as if it were a new dataframe, like this:
Countries Visitors
0 Indonesia 27572424
1 Malaysia 11337420
2 Philippines 6548622
Sorry, I am just starting to learn pandas; any help will be gladly appreciated.
Use Series.reset_index for a 2-column DataFrame and then set the new column names from a list:
top3_country = top3_country.reset_index()
top3_country.columns = ['Countries', 'Visitors']
Or use Series.rename_axis with Series.reset_index:
top3_country = top3_country.rename_axis('Countries').reset_index(name='Visitors')
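A self-contained sketch of the rename_axis route, using a hand-built stand-in for the nlargest(3) result (the totals are the ones shown in the question):

```python
import pandas as pd

# stand-in for the nlargest(3) result: country names in the index,
# visitor totals as values
top3_country = pd.Series({'Indonesia': 27572424,
                          'Malaysia': 11337420,
                          'Philippines': 6548622})

# name the index, then lift both index and values into columns
top3_country = top3_country.rename_axis('Countries').reset_index(name='Visitors')
print(top3_country)
```

reset_index(name='Visitors') names the value column in the same call, so no separate columns assignment is needed.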
You can convert back to a pd.DataFrame, use reset_index, and rename. Change your code to:
import pandas as pd

top3_country = pd.DataFrame(
    df.iloc[0:, 2:9].sum(axis=0).sort_values(ascending=False).nlargest(3)
).reset_index(
).rename(columns={'index': 'Countries', 0: 'visitors'})
top3_country
Countries visitors
0 Indonesia 27572424
1 Malaysia 11337420
2 Philippines 6548622

Find largest 2 values for each year in the returned pandas groupby object after sorting each group

My dataframe has 3 columns: Year, Leading Cause, Deaths. I want to find the total number of deaths by leading cause in each year. I did the following:
totalDeaths_Cause = df.groupby(["Year", "Leading Cause"])["Deaths"].sum()
which results in:
The total number of deaths for :
Year Leading Cause
2009 Hypertension 26
2010 All Other Causes 2140
2011 Cerebrovascular Disease 281
Immunodeficiency 70
Parkinson Disease 180
2012 Cerebrovascular Disease 102
Disease1 183
Diseases of Heart 76
2013 Cerebrovascular Disease 386
Parkinson Disease 372
Self-Harm 17
Name: Deaths, dtype: int64
Now I want to get the largest 2 values (for Deaths) in each year, along with the Leading Cause, such that:
The total number of deaths for :
Year Leading Cause
2009 Hypertension 26
2010 All Other Causes 2140
2011 Cerebrovascular Disease 281
Parkinson Disease 180
2012 Disease1 183
Cerebrovasular disease 102
2013 Cerebrovascular Disease 386
Parkinson Disease 372
Thanks in advance for your help!
Let us do:
totalDeaths_Cause = totalDeaths_Cause.sort_values().groupby(level=0).tail(2)
sort_values() puts each year's largest sums last, so tail(2) keeps the two largest causes per Year.
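Spelled out on a small stand-in for the summed Series (same years and causes as a slice of the question's output): sorting ascending puts each year's largest values last, so tail(2) per Year group keeps the top two.

```python
import pandas as pd

# stand-in for df.groupby(["Year", "Leading Cause"])["Deaths"].sum()
idx = pd.MultiIndex.from_tuples(
    [(2011, 'Cerebrovascular Disease'), (2011, 'Immunodeficiency'),
     (2011, 'Parkinson Disease'), (2012, 'Cerebrovascular Disease'),
     (2012, 'Disease1'), (2012, 'Diseases of Heart')],
    names=['Year', 'Leading Cause'])
totalDeaths_Cause = pd.Series([281, 70, 180, 102, 183, 76],
                              index=idx, name='Deaths')

# ascending sort, then keep the last (largest) 2 rows of each Year group
top2 = totalDeaths_Cause.sort_values().groupby(level=0).tail(2)
print(top2)
```

Each year keeps exactly its two largest causes; smaller ones such as Immunodeficiency (70) drop out.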

How to find name of column along with the maximum value

I have data (df_movies2) with columns Year, production companies, and the revenue generated in that particular year. I want to return, for each year, the maximum revenue along with the name of the production company. For example, in 2016 Studio Babelsberg has the maximum revenue. This is the data
Here is what I have tried
import pandas as pd
df_movies2.groupby(['year','production_companies']).revenue.max()
But it's not working; it returns all the production companies for each year.
Thanks for your help
I'm not entirely sure what you're hoping to return. If your output is sorted as you want but you're missing values, it's because .max() drops duplicates within each year. Please see Edit 1 to return all values in descending order from max to min.
If it's a sorting issue where you want to return the max value to min value and aren't worried about dropping duplicate production_companies for each year then refer to edit 2:
import pandas as pd
d = {'year': ['2016', '2016', '2016', '2016', '2016', '2015', '2015', '2015', '2015', '2014', '2014', '2014', '2014'],
     'production_companies': ['Walt Disney Pictures', 'Universal Pictures', 'DC Comics', 'Twentieth Century', 'Studio Babelsberg', 'DC Comics', 'Twentieth Century', 'Twentieth Century', 'Universal Pictures', 'The Kennedy/Marshall Company', 'Twentieth Century', 'Village Roadshow Pictures', 'Columbia Pictures'],
     'revenue': [966, 875, 873, 783, 1153, 745, 543, 521, 433, 415, 389, 356, 349]}
df = pd.DataFrame(data=d)
Edit 1:
df = df.sort_values(['revenue', 'year'], ascending=[0, 1])
df = df.set_index(['year', 'production_companies'])
Output:
revenue
year production_companies
2016 Studio Babelsberg 1153
Walt Disney Pictures 966
Universal Pictures 875
DC Comics 873
Twentieth Century 783
2015 DC Comics 745
Twentieth Century 543
Twentieth Century 521
Universal Pictures 433
2014 The Kennedy/Marshall Company 415
Twentieth Century 389
Village Roadshow Pictures 356
Columbia Pictures 349
Edit 2:
df = df.groupby(['year','production_companies'])[['revenue']].max()
idx = df['revenue'].max(level=0).sort_values().index
i = pd.CategoricalIndex(df.index.get_level_values(0), ordered=True, categories=idx)
df.index = [i, df.index.get_level_values(1)]
df = df.sort_values(['year','revenue'], ascending=False)
Output:
revenue
year production_companies
2016 Studio Babelsberg 1153
Walt Disney Pictures 966
Universal Pictures 875
DC Comics 873
Twentieth Century 783
2015 DC Comics 745
Twentieth Century 543
Universal Pictures 433
2014 The Kennedy/Marshall Company 415
Twentieth Century 389
Village Roadshow Pictures 356
Columbia Pictures 349
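If only the single top company per year is needed, rather than the full descending listing, a Year-only grouping with idxmax is a shorter route. A sketch using the same sample d:

```python
import pandas as pd

d = {'year': ['2016', '2016', '2016', '2016', '2016', '2015', '2015',
              '2015', '2015', '2014', '2014', '2014', '2014'],
     'production_companies': ['Walt Disney Pictures', 'Universal Pictures',
                              'DC Comics', 'Twentieth Century',
                              'Studio Babelsberg', 'DC Comics',
                              'Twentieth Century', 'Twentieth Century',
                              'Universal Pictures',
                              'The Kennedy/Marshall Company',
                              'Twentieth Century',
                              'Village Roadshow Pictures',
                              'Columbia Pictures'],
     'revenue': [966, 875, 873, 783, 1153, 745, 543, 521, 433,
                 415, 389, 356, 349]}
df = pd.DataFrame(data=d)

# one row per year: the row holding that year's maximum revenue
top_per_year = df.loc[df.groupby('year')['revenue'].idxmax()]
print(top_per_year)
```

This keeps the company name attached to the max, which is what a plain groupby(...).revenue.max() loses.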

Transpose subset of pandas dataframe into multi-indexed data frame

I have the following dataframe:
df.head(14)
I'd like to transpose just the yr and the ['WA_','BA_','IA_','AA_','NA_','TOM_']
variables by Label. The resulting dataframe should then be a Multi-indexed frame with Label and the WA_, BA_, etc. and the columns names will be 2010, 2011, etc. I've tried,
transpose(), groupby(), pivot_table(), long_to_wide(),
and before I roll my own nested loop going line by line through this df I thought I'd ping the community. Something like this by every Label group:
I feel like the answer is in one of those functions but I'm just missing it. Thanks for your help!
From what I can tell by your illustrated screenshots, you want WA_, BA_ etc as rows and yr as columns, with Label remaining as a row index. If so, consider stack() and unstack():
# sample data
import numpy as np
import pandas as pd

labels = ["Albany County", "Big Horn County"]
n_per_label = 7
n_rows = n_per_label * len(labels)
years = np.arange(2010, 2017)
min_val = 10000
max_val = 40000
data = {"Label": sorted(np.array(labels * n_per_label)),
        "WA_": np.random.randint(min_val, max_val, n_rows),
        "BA_": np.random.randint(min_val, max_val, n_rows),
        "IA_": np.random.randint(min_val, max_val, n_rows),
        "AA_": np.random.randint(min_val, max_val, n_rows),
        "NA_": np.random.randint(min_val, max_val, n_rows),
        "TOM_": np.random.randint(min_val, max_val, n_rows),
        "yr": np.append(years, years)}
df = pd.DataFrame(data)
AA_ BA_ IA_ NA_ TOM_ WA_ Label yr
0 27757 23138 10476 20047 34015 12457 Albany County 2010
1 37135 30525 12296 22809 27235 29045 Albany County 2011
2 11017 16448 17955 33310 11956 19070 Albany County 2012
3 24406 21758 15538 32746 38139 39553 Albany County 2013
4 29874 33105 23106 30216 30176 13380 Albany County 2014
5 24409 27454 14510 34497 10326 29278 Albany County 2015
6 31787 11301 39259 12081 31513 13820 Albany County 2016
7 17119 20961 21526 37450 14937 11516 Big Horn County 2010
8 13663 33901 12420 27700 30409 26235 Big Horn County 2011
9 37861 39864 29512 24270 15853 29813 Big Horn County 2012
10 29095 27760 12304 29987 31481 39632 Big Horn County 2013
11 26966 39095 39031 26582 22851 18194 Big Horn County 2014
12 28216 33354 35498 23514 23879 17983 Big Horn County 2015
13 25440 28405 23847 26475 20780 29692 Big Horn County 2016
Now set Label and yr as indices.
df.set_index(["Label","yr"], inplace=True)
From here, unstack() will pivot the inner-most index to columns. Then, stack() can swing our value columns down into rows.
df.unstack().stack(level=0)
yr 2010 2011 2012 2013 2014 2015 2016
Label
Albany County AA_ 27757 37135 11017 24406 29874 24409 31787
BA_ 23138 30525 16448 21758 33105 27454 11301
IA_ 10476 12296 17955 15538 23106 14510 39259
NA_ 20047 22809 33310 32746 30216 34497 12081
TOM_ 34015 27235 11956 38139 30176 10326 31513
WA_ 12457 29045 19070 39553 13380 29278 13820
Big Horn County AA_ 17119 13663 37861 29095 26966 28216 25440
BA_ 20961 33901 39864 27760 39095 33354 28405
IA_ 21526 12420 29512 12304 39031 35498 23847
NA_ 37450 27700 24270 29987 26582 23514 26475
TOM_ 14937 30409 15853 31481 22851 23879 20780
WA_ 11516 26235 29813 39632 18194 17983 29692
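The same reshape on a tiny two-label, two-year toy frame (made-up numbers), to make the unstack/stack mechanics easy to trace:

```python
import pandas as pd

toy = pd.DataFrame({'Label': ['A', 'A', 'B', 'B'],
                    'yr': [2010, 2011, 2010, 2011],
                    'WA_': [1, 2, 3, 4],
                    'BA_': [5, 6, 7, 8]})
toy = toy.set_index(['Label', 'yr'])

# unstack() moves yr from the row index up into the columns;
# stack(level=0) then drops the value names (WA_, BA_) down into the rows
out = toy.unstack().stack(level=0)
print(out)
```

The result has (Label, variable) as the row MultiIndex and the years as columns, mirroring the larger example above.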

How to Create One Category from 2 Datasets in Python

I have two data sets that both have data in it like follows:
Data set 1:
Country Year Cause Gender Deaths
2090 2011 A000 1 70340
2090 2010 A001 2 53449
2090 2009 A002 1 1731
2090 2008 A003 2 1270
2090 2007 A004 1 148
2310 2011 A000 2 172
2310 2010 A001 1 24
2310 2009 A002 2 20
2310 2008 A003 1 27
2660 2013 A004 2 21
2660 2012 A005 1 88
2660 2011 A006 2 82
Data set 2:
Country Year Cause Gender Deaths
2090 1999 B529 1 557
2090 1995 A001 2 234
2090 1996 B535 1 29
2090 1997 A002 2 33
2090 1998 B546 1 3224
2090 1999 B556 2 850
2310 1995 B555 1 319
2310 1996 A003 2 143
2310 1997 B563 1 251
2310 1998 B573 2 117
2660 1997 B561 1 244
2660 1998 A002 2 115
2660 1999 A001 1 10
2660 2000 B569 2 2
I need to create categories from the codes in the Cause column, which are codes for causes of death. But I need to build each category from the causes of the two data sets separately, e.g.:
Road Traffic Accidents Category: From Data set 1: A001, A003
Road Traffic Accidents Category: From Data set 2: B569, B555
and the causes from both of these must be included in the Road Traffic Accidents Category.
They must be included in each category for each data set (not combined) like: Road Traffic Accidents: A001, A003, B569, B555
This is because say for example A001. In Data set 1 A001 is for Car Accidents, but in Data set 2 A001 means Heart Attack and I don't want Heart Attack in the Road Traffic Accidents category. But when the category is made from both data sets (i.e. Road Traffic Accidents: A001, A003, B569, B555) then A001 from both data sets is included in the Road Traffic Accidents category.
The purpose of this question is to see how different categories differ over the years in terms of deaths - I am not allowed to combine both data sets manually not on Python. I am also not allowed to use any of the common packages such as Pandas, Numpy, etc.
Thank you in advance for your help.
So my understanding of your problem is (correct me if I'm wrong): you have two datasets that both have a "Cause" column/variable, but the encoding of this "Cause" column differs between the two datasets.
In Dataset1, perhaps the encoding says:
Road Traffic Accidents Category: A001, A003
Heart Attack Category: C001, C002 #made up encoding
In Dataset2, perhaps the encoding says:
Road Traffic Accidents Category: B569, B555
Heart Attack Category: A001
Hurricane Cause of Death Category: E941 # made up encoding
What you want is to create a consistent category to cause encoding mapping which works for two datasets.
I personally think python dictionary is the right data structure for this task. I assume you can load the category-cause mapping for both datasets.
data1_cat_cause = {'Road Traffic Accidents': ['A001', 'A003'],
                   'Heart Attack': ['C001', 'C002']}
data2_cat_cause = {'Road Traffic Accidents': ['B569', 'B555'],
                   'Heart Attack': ['A001'],
                   'Hurricane Cause of Death': ['E941']}

category_combined = set(data1_cat_cause.keys()) | set(data2_cat_cause.keys())
cat_cause_combined = {}
for category in category_combined:
    cat_cause_combined[category] = {'Dataset1': data1_cat_cause.get(category),
                                    'Dataset2': data2_cat_cause.get(category)}
This yields the following information, stored in the cat_cause_combined variable:
Dataset1 encoding Dataset2 encoding
Road Traffic Accidents : ['A001', 'A003'] ['B569', 'B555']
Heart Attack : ['C001', 'C002'] ['A001']
Hurricane Cause of Death: None ['E941']
I hope I understand your problem correctly and I hope this solution solves your problem.
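A small runnable extension of the idea: inverting each per-dataset mapping into a code-to-category lookup, so a raw code resolves correctly depending on which file it came from (the category names and codes are the made-up examples above):

```python
# assumed per-dataset encodings, as sketched in the answer
data1_cat_cause = {'Road Traffic Accidents': ['A001', 'A003'],
                   'Heart Attack': ['C001', 'C002']}
data2_cat_cause = {'Road Traffic Accidents': ['B569', 'B555'],
                   'Heart Attack': ['A001'],
                   'Hurricane Cause of Death': ['E941']}

def invert(cat_cause):
    """Turn {category: [codes]} into {code: category} for direct lookup."""
    return {code: cat for cat, codes in cat_cause.items() for code in codes}

lookup = {'Dataset1': invert(data1_cat_cause),
          'Dataset2': invert(data2_cat_cause)}

# the same code resolves differently depending on its source file
print(lookup['Dataset1']['A001'])  # Road Traffic Accidents
print(lookup['Dataset2']['A001'])  # Heart Attack
```

This keeps the two encodings separate, which is exactly the A001 ambiguity the question is worried about, while still letting both feed one shared set of category names.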
