I have data (df_movies2) with columns: Year, production companies, and revenue generated in that particular year. For each year, I want to return the maximum revenue along with the name of the production company. For example, in 2016 Studio Babelsberg has the maximum revenue.
Here is what I have tried
import pandas as pd
df_movies2.groupby(['year','production_companies']).revenue.max()
But it's not working; it returns all the production companies for each year.
Thanks for your help
I'm not entirely sure what you're hoping to return. If your output is sorted as you want but you're missing values, that's because .max() is dropping duplicates within each year. See Edit 1 to return all values in descending order (max to min).
If it's a sorting issue where you want to go from the max value to the min value and you aren't worried about dropping duplicate production_companies within each year, then refer to Edit 2:
import pandas as pd
d = ({
'year' : ['2016','2016','2016','2016','2016','2015','2015','2015','2015','2014','2014','2014','2014'],
'production_companies' : ['Walt Disney Pictures','Universal Pictures','DC Comics','Twentieth Century','Studio Babelsberg','DC Comics','Twentieth Century','Twentieth Century','Universal Pictures','The Kennedy/Marshall Company','Twentieth Century','Village Roadshow Pictures','Columbia Pictures'],
'revenue' : [966,875,873,783,1153,745,543,521,433,415,389,356,349],
})
df = pd.DataFrame(data = d)
Edit 1:
df = df.sort_values(['revenue', 'year'], ascending=[0, 1])
df = df.set_index(['year', 'production_companies'])
Output:
revenue
year production_companies
2016 Studio Babelsberg 1153
Walt Disney Pictures 966
Universal Pictures 875
DC Comics 873
Twentieth Century 783
2015 DC Comics 745
Twentieth Century 543
Twentieth Century 521
Universal Pictures 433
2014 The Kennedy/Marshall Company 415
Twentieth Century 389
Village Roadshow Pictures 356
Columbia Pictures 349
Edit 2:
df = df.groupby(['year','production_companies'])[['revenue']].max()
idx = df.groupby(level=0)['revenue'].max().sort_values().index
i = pd.CategoricalIndex(df.index.get_level_values(0), ordered=True, categories=idx)
df.index = [i, df.index.get_level_values(1)]
df = df.sort_values(['year','revenue'], ascending=False)
Output:
revenue
year production_companies
2016 Studio Babelsberg 1153
Walt Disney Pictures 966
Universal Pictures 875
DC Comics 873
Twentieth Century 783
2015 DC Comics 745
Twentieth Century 543
Universal Pictures 433
2014 The Kennedy/Marshall Company 415
Twentieth Century 389
Village Roadshow Pictures 356
Columbia Pictures 349
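If the goal is only the single highest-revenue company per year, as in the original question, a minimal sketch on the same sample data using idxmax, which returns the row label of each group's maximum so .loc can pull back the full row including the company name:

```python
import pandas as pd

# Same sample data as above
d = {
    'year': ['2016', '2016', '2016', '2016', '2016', '2015', '2015',
             '2015', '2015', '2014', '2014', '2014', '2014'],
    'production_companies': ['Walt Disney Pictures', 'Universal Pictures',
                             'DC Comics', 'Twentieth Century', 'Studio Babelsberg',
                             'DC Comics', 'Twentieth Century', 'Twentieth Century',
                             'Universal Pictures', 'The Kennedy/Marshall Company',
                             'Twentieth Century', 'Village Roadshow Pictures',
                             'Columbia Pictures'],
    'revenue': [966, 875, 873, 783, 1153, 745, 543, 521, 433, 415, 389, 356, 349],
}
df = pd.DataFrame(d)

# idxmax gives the index label of the max revenue within each year;
# .loc then selects those full rows, keeping the company name
top = df.loc[df.groupby('year')['revenue'].idxmax()]
print(top)
```

This yields one row per year: The Kennedy/Marshall Company (415) for 2014, DC Comics (745) for 2015, and Studio Babelsberg (1153) for 2016.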
I'm having trouble avoiding negative values in interpolation. I have the following data in a DataFrame:
current_country =
idx Country Region Rank Score GDP capita Family Life Expect. Freedom Trust Gov. Generosity Residual Year
289 South Sudan Sub-Saharan Africa 143 3.83200 0.393940 0.185190 0.157810 0.196620 0.130150 0.258990 2.509300 2016
449 South Sudan Sub-Saharan Africa 147 3.59100 0.397249 0.601323 0.163486 0.147062 0.116794 0.285671 1.879416 2017
610 South Sudan Sub-Saharan Africa 154 3.25400 0.337000 0.608000 0.177000 0.112000 0.106000 0.224000 1.690000 2018
765 South Sudan Sub-Saharan Africa 156 2.85300 0.306000 0.575000 0.295000 0.010000 0.091000 0.202000 1.374000 2019
And I want to interpolate values for an additional year (2015), shown below, using pandas' df.interpolate():
new_row =
idx Country Region Rank Score GDP capita Family Life Expect. Freedom Trust Gov. Generosity Residual Year
593 South Sudan Sub-Saharan Africa 0 np.nan np.nan np.nan np.nan np.nan np.nan np.nan np.nan 2015
I create the row containing null values in all columns to be interpolated (as above) and append it to the original dataframe, then interpolate to populate the NaN cells.
interpol_subset = current_country.append(new_row)
interpol_subset = interpol_subset.interpolate(method = "pchip", order = 2)
This produces the following df
idx Country Region Rank Score GDP capita Family Life Expect. Freedom Trust Gov. Generosity Residual Year
289 South Sudan Sub-Saharan Africa 143 3.83200 0.393940 0.185190 0.157810 0.196620 0.130150 0.258990 2.509300 2016
449 South Sudan Sub-Saharan Africa 147 3.59100 0.397249 0.601323 0.163486 0.147062 0.116794 0.285671 1.879416 2017
610 South Sudan Sub-Saharan Africa 154 3.25400 0.337000 0.608000 0.177000 0.112000 0.106000 0.224000 1.690000 2018
765 South Sudan Sub-Saharan Africa 156 2.85300 0.306000 0.575000 0.295000 0.010000 0.091000 0.202000 1.374000 2019
4 South Sudan Sub-Saharan Africa 0 2.39355 0.313624 0.528646 0.434473 -0.126247 0.072480 0.238480 0.963119 2015
The issue: In the last row, the value in "Freedom" is negative. Is there a way to parameterize the df.interpolate function such that it doesn't produce negative values? I can't find anything in the documentation. I'm fine with the estimates besides that negative value (Although they're a bit skewed)
I considered simply flipping the negative to a positive, but the "Score" value is a sum of all the other continuous features and I would like to keep it that way. What can I do here?
Here's a link to the actual code snippet. Thanks for reading.
I doubt this is an issue with interpolation itself. The main reason is the method you were using: 'pchip' will return a negative value for 'Freedom' here anyway. If we take the values from your dataframe:
import numpy as np
import scipy.interpolate

# 'Freedom' values for 2016-2019, at positions 0-3
y = np.array([0.196620, 0.147062, 0.112000, 0.010000])
x = np.array([0, 1, 2, 3])
pchip_obj = scipy.interpolate.PchipInterpolator(x, y)
# evaluate one step beyond the data, as the appended row effectively does
print(pchip_obj(4))
The result is -0.126. I think if you want a positive result you should better change the method you are using.
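If negative values are unacceptable regardless of method, one workaround (not a parameter of interpolate itself) is to clip the affected columns at zero afterwards with DataFrame.clip. A minimal sketch on a hypothetical 'Freedom' column standing in for the interpolated frame above:

```python
import pandas as pd

# Hypothetical column standing in for the interpolated result above
df = pd.DataFrame({'Freedom': [0.196620, 0.147062, 0.112000, 0.010000, -0.126247]})

# clip(lower=0) floors negatives at zero and leaves positive values untouched
df['Freedom'] = df['Freedom'].clip(lower=0)
print(df['Freedom'].tolist())
```

Note that clipping breaks the "Score is the sum of the other columns" property mentioned in the question, so Score would need recomputing afterwards; interpolating a transformed variable (e.g. its logarithm) and transforming back is another option.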
My dataframe has 3 columns: Year, Leading Cause, Deaths. I want to find the total number of deaths by leading cause in each year. I did the following:
totalDeaths_Cause = df.groupby(["Year", "Leading Cause"])["Deaths"].sum()
which results in:
The total number of deaths for :
Year Leading Cause
2009 Hypertension 26
2010 All Other Causes 2140
2011 Cerebrovascular Disease 281
Immunodeficiency 70
Parkinson Disease 180
2012 Cerebrovascular Disease 102
Disease1 183
Diseases of Heart 76
2013 Cerebrovascular Disease 386
Parkinson Disease 372
Self-Harm 17
Name: Deaths, dtype: int64
Now I want to get the largest 2 values(for deaths) each year and the leading Cause such that:
The total number of deaths for :
Year Leading Cause
2009 Hypertension 26
2010 All Other Causes 2140
2011 Cerebrovascular Disease 281
Parkinson Disease 180
2012 Disease1 183
Cerebrovasular disease 102
2013 Cerebrovascular Disease 386
Parkinson Disease 372
Thanks in advance for your help!
Let us do (on the grouped Series, keeping the two largest per year):
totalDeaths_Cause = totalDeaths_Cause.sort_values().groupby(level=0).tail(2)
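A self-contained sketch on data shaped like the question's (column names taken from the question, the rows are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2011, 2011, 2011, 2013, 2013, 2013],
    'Leading Cause': ['Cerebrovascular Disease', 'Immunodeficiency', 'Parkinson Disease',
                      'Cerebrovascular Disease', 'Parkinson Disease', 'Self-Harm'],
    'Deaths': [281, 70, 180, 386, 372, 17],
})

totals = df.groupby(['Year', 'Leading Cause'])['Deaths'].sum()
# After an ascending sort, tail(2) keeps the two largest entries in each year group
top2 = totals.sort_values().groupby(level=0).tail(2)
print(top2)
```

This keeps Parkinson Disease (180) and Cerebrovascular Disease (281) for 2011, and Parkinson Disease (372) and Cerebrovascular Disease (386) for 2013.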
I have an extremely long dataframe with a lot of data which I have to clean before I can proceed with data visualization. There are several things that need to be done, and I can do each of them to a certain extent, but I don't know how, or whether it's even possible, to do them together.
This is what I have to do:
Find the highest arrival count every year and see if the mode of transport is by air, sea or land.
period arv_count Mode of arrival
0 2013-01 984350 Air
1 2013-01 129074 Sea
2 2013-01 178294 Land
3 2013-02 916372 Air
4 2013-02 125634 Sea
5 2013-02 179359 Land
6 2013-03 1026312 Air
7 2013-03 143194 Sea
8 2013-03 199385 Land
... ... ... ...
78 2015-03 940077 Air
79 2015-03 133632 Sea
80 2015-03 127939 Land
81 2015-04 939370 Air
82 2015-04 118120 Sea
83 2015-04 151134 Land
84 2015-05 945080 Air
85 2015-05 123136 Sea
86 2015-05 154620 Land
87 2015-06 930642 Air
88 2015-06 115631 Sea
89 2015-06 138474 Land
This is an example of what the data looks like. I don't know if it's necessary but I have created another column just for year like so:
def year_extract(year):
    return year.split('-')[0].strip()

df1 = pd.DataFrame(df['period'])
df1 = df1.rename(columns={'period': 'Year'})
df1 = df1['Year'].apply(year_extract)
df1 = pd.DataFrame(df1)
df = pd.merge(df, df1, left_index=True, right_index=True)
I know how to use groupby and I know how to find a maximum, but I don't know if it is possible to find the maximum within a group, like the highest arrival count in 2013, 2014, 2015, etc.
The data above is the total arrival count for all countries based on the mode of transport and period, but the original data also had hundreds of additional rows in which region and country are stated, which I dropped because I don't know how to use or clean them. It looks like this:
period region country moa arv_count
2013-01 Total Total Air 984350
2013-01 Total Total Sea 129074
2013-01 Total Total Land 178294
2013-02 Total Total Air 916372
... ... ... ... ...
2015-12 AMERICAS USA Land 2698
2015-12 AMERICAS Canada Land 924
2013-01 ASIA China Air 136643
2013-01 ASIA India Air 55369
2013-01 ASIA Japan Air 51178
I would also like to make use of the region data if possible. I'm hoping to create a clustered column chart with the 7 regions as the x-axis and arrival count as the y-axis, with each region showing the arrival count via land, sea and air, but I feel there is too much excess data that I don't know how to deal with right now.
For example, I don't know how to deal with the period and the country, because all I need is the total arrival count by land, sea and air for each region and year, regardless of country and month.
I used this dataframe to test the code (the one in your question):
df = pd.DataFrame([['2013-01', 'Total', 'Total', 'Air', 984350],
['2013-01', 'Total', 'Total', 'Sea', 129074],
['2013-01', 'Total', 'Total', 'Land', 178294],
['2013-02', 'Total', 'Total', 'Air', 916372],
['2015-12', 'AMERICAS', 'USA', 'Land', 2698],
['2015-12', 'AMERICAS', 'Canada', 'Land', 924],
['2013-01', 'ASIA', 'China', 'Air', 136643],
['2013-01', 'ASIA', 'India', 'Air', 55369],
['2013-01', 'ASIA', 'Japan', 'Air', 51178]],
columns = ['period', 'region', 'country', 'moa', 'arv_count'])
Here is the code to get the sum of arrival counts, by year, region and type (sea, land air):
First add a 'year' column:
df['year'] = pd.to_datetime(df['period']).dt.year
Then group by (region, year, moa) and sum arv_count in each group:
df.groupby(['region', 'year', 'moa']).arv_count.sum()
Here is the output:
region year moa
AMERICAS 2015 Land 3622
ASIA 2013 Air 243190
Total 2013 Air 1900722
Land 178294
Sea 129074
I hope this is what you were looking for!
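For the clustered column chart mentioned in the question, the grouped sums can be pivoted with unstack so that each mode of arrival becomes a column; DataFrame.plot(kind='bar') then renders one cluster of bars per region. A sketch on the same sample frame, collapsing year and country since the question only needs region and mode:

```python
import pandas as pd

# Same sample frame as above
df = pd.DataFrame([['2013-01', 'Total', 'Total', 'Air', 984350],
                   ['2013-01', 'Total', 'Total', 'Sea', 129074],
                   ['2013-01', 'Total', 'Total', 'Land', 178294],
                   ['2013-02', 'Total', 'Total', 'Air', 916372],
                   ['2015-12', 'AMERICAS', 'USA', 'Land', 2698],
                   ['2015-12', 'AMERICAS', 'Canada', 'Land', 924],
                   ['2013-01', 'ASIA', 'China', 'Air', 136643],
                   ['2013-01', 'ASIA', 'India', 'Air', 55369],
                   ['2013-01', 'ASIA', 'Japan', 'Air', 51178]],
                  columns=['period', 'region', 'country', 'moa', 'arv_count'])

# One row per region, one column per mode of arrival
by_region = df.groupby(['region', 'moa'])['arv_count'].sum().unstack(fill_value=0)
print(by_region)
# by_region.plot(kind='bar')  # clustered columns: one group of bars per region
```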
I want to calculate the maximum value for each year and show the sector and that value. For example, from the screenshot, I would like to display:
2010: Telecom 781
2011: Tech 973
I have tried using:
df.groupby(['Year', 'Sector'])['Revenue'].max()
but this does not give me the name of the Sector with the highest value.
Try using idxmax and loc, grouping on Year alone so that only the single highest-Revenue row per year is kept:
df.loc[df.groupby('Year')['Revenue'].idxmax()]
MVCE:
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Sector':['Telecom','Tech','Financial Service','Construction','Heath Care']*3,
'Year':[2010,2011,2012,2013,2014]*3,
'Revenue':np.random.randint(101,999,15)})
df.loc[df.groupby('Year')['Revenue'].idxmax()]
Output:
Sector Year Revenue
5 Telecom 2010 843
1 Tech 2011 466
12 Financial Service 2012 838
3 Construction 2013 423
9 Heath Care 2014 224
Also .sort_values + .tail, grouping on just Year (data from @Scott Boston):
df.sort_values('Revenue').groupby('Year').tail(1)
Output:
Sector Year Revenue
9 Heath Care 2014 224
3 Construction 2013 423
1 Tech 2011 466
12 Financial Service 2012 838
5 Telecom 2010 843
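On the same seeded sample data, another variant uses SeriesGroupBy.nlargest, which generalizes directly to the top n rows per year:

```python
import pandas as pd
import numpy as np

np.random.seed(123)
df = pd.DataFrame({'Sector': ['Telecom', 'Tech', 'Financial Service',
                              'Construction', 'Heath Care'] * 3,
                   'Year': [2010, 2011, 2012, 2013, 2014] * 3,
                   'Revenue': np.random.randint(101, 999, 15)})

# nlargest(1) keeps the single largest Revenue per year; raise n for more rows
top = df.groupby('Year')['Revenue'].nlargest(1)
print(top)
```

The result is a Series indexed by (Year, original row index), so the original rows can still be recovered with df.loc if the Sector column is needed.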
I'm new to Python. I have to plot graphs from a CSV file that I've created:
a) Monthly sales vs Product Price
b) Geographic Region vs No of customer
The code that I've implemented was
import pandas as pd
import matplotlib.pyplot as plot
data = pd.read_csv('dataset_books.csv')
data.hist(bins=90)
plot.xlim([0,115054])
plot.title("Data")
x = plot.xlabel("Monthly Sales")
y = plot.ylabel("Product Price")
plot.show()
The output that I'm getting is not what I expected and was not approved.
I need a horizontal histogram with a line plot.
Book ID Product Name Product Price Monthly Sales Shipping Type Geographic Region No Of Customer Who Bought the Product Customer Type
1 The Path to Power 486 2566.08 Free Gatton 4 Old
2 Touching Darkness (Midnighters, #2) 479 1264.56 Paid Hooker Creek 2 New
3 Star Wars: Lost Stars 456 1203.84 Paid Gladstone 2 New
4 Winter in Madrid 454 599.28 Paid Warruwi 1 New
5 Hairy Maclary from Donaldson's Dairy 442 2333.76 Free Mount Gambier 4 Old
6 Stand on Zanzibar 413 3816.12 Free Cessnock 7 Old
7 Marlfox 411 3797.64 Free Edinburgh 7 Old
8 The Matlock Paper 373 3446.52 Free Gladstone 7 Old
9 Tears of a Tiger 361 1906.08 Free Melbourne 4 Old
10 Star Wars: Vision of the Future 355 937.2 Paid Wagga Wagga 2 New
11 Nefes Nefese 344 454.08 Paid Gatton 1 New
this is my CSV file.
Can anyone help me?
Try this to check your column names:
df.columns
>> Index(['Book ID', 'Product Name', 'Product Price', 'Monthly Sales',
'Shipping Type', 'Geographic Region',
'No Of Customer Who Bought the Product', 'Customer Type'],
dtype='object')
Next, to get a horizontal plot you need 'barh':
df['Product Price'].plot(kind='barh')
Another option to choose the column is 'iloc':
df.iloc[:, 2].plot(kind='barh')
It will generate the same output.
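For the requested horizontal bars with a line overlay, a minimal sketch; the column names are taken from the CSV above, and the three sample rows stand in for dataset_books.csv:

```python
import pandas as pd
import matplotlib.pyplot as plot

# A few rows shaped like dataset_books.csv
data = pd.DataFrame({
    'Product Name': ['The Path to Power', 'Touching Darkness (Midnighters, #2)',
                     'Star Wars: Lost Stars'],
    'Product Price': [486, 479, 456],
    'Monthly Sales': [2566.08, 1264.56, 1203.84],
})

# Horizontal bars for Monthly Sales, one bar per book
ax = data.set_index('Product Name')['Monthly Sales'].plot(kind='barh')
# Overlay Product Price as a line on the same axes (bar rows sit at y = 0, 1, 2, ...)
ax.plot(data['Product Price'], range(len(data)), marker='o', label='Product Price')
ax.set_xlabel('Monthly Sales / Product Price')
ax.legend()
plot.tight_layout()
plot.show()
```

The two series are on very different scales here, so a secondary x-axis (ax.twiny()) may be worth considering for real data.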