cast year revenue title
id
135397 Chris Pratt 2015 1.392446e+09 Jurassic World
135397 Bryce Dallas Howard 2015 1.392446e+09 Jurassic World
135397 Irrfan Khan 2015 1.392446e+09 Jurassic World
135397 Nick Robinson 2015 1.392446e+09 Jurassic World
Given the above DataFrame, I would like to find the highest-earning actors per year (based on the combined revenue of the movies they acted in that year). This is what I have so far:
#get the total revenue associated with each cast for each year
f = {'revenue': sum}
#revenue by year for each cast
df_actor_yr = df_actor_yr.groupby(['year', 'cast']).agg(f)
df_actor_yr
year cast
1960 Anthony Perkins 2.359350e+08
Charles Laughton 4.423780e+08
Fred MacMurray 1.843242e+08
Jack Kruschen 1.843242e+08
Jean Simmons 4.423780e+08
John Gavin 2.359350e+08
Kirk Douglas 4.423780e+08
Vera Miles 2.359350e+08
1961 Anthony Quayle 2.108215e+08
Anthony Quinn 2.108215e+08
Ben Wright 1.574815e+09
Betty Lou Gerson 1.574815e+09
...
Next, to get the highest-earning cast member for each year, I did the following:
df_actor_yr.reset_index(inplace=True)
g = {"revenue": max}
df_actor_yr = df_actor_yr.groupby('year').agg(g)
df_actor_yr
revenue
year
1960 4.423780e+08
1961 1.574815e+09
1962 5.045914e+08
1963 5.617734e+08
1964 8.780804e+08
1965 1.129535e+09
1967 1.345551e+09
1968 4.187094e+08
1969 6.081511e+08
...
This only gives me the year and the maximum revenue for that year. I would also like to get the name of the cast member associated with that revenue. How do I go about doing this?
You can split your logic into 2 steps. First sum by cast and year using GroupBy + sum. Then find the maximum revenue by year using GroupBy + idxmax:
# sum by cast and year
df_summed = df.groupby(['cast', 'year'])['revenue'].sum().reset_index()
# maximums by year
res = df_summed.loc[df_summed.groupby('year')['revenue'].idxmax()]
print(res)
cast year revenue
3 NickRobinson 2012 3.401340e+09
0 BryceDallasHoward 2015 1.568978e+09
For the above output, I've used more interesting data:
id cast year revenue title
135397 ChrisPratt 2015 1.392446e+09 JurassicWorld
135397 BryceDallasHoward 2015 1.568978e+09 SomeMovie
135397 IrrfanKhan 2012 1.392446e+09 JurassicWorld
135397 NickRobinson 2012 1.046987e+09 JurassicWorld
135398 NickRobinson 2012 2.354353e+09 SomeOtherMovie
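The two-step approach above can be reproduced end-to-end; here is a minimal, self-contained sketch using the sample data from this answer:

```python
import pandas as pd

# sample data modeled on the table above
df = pd.DataFrame({
    'cast': ['ChrisPratt', 'BryceDallasHoward', 'IrrfanKhan',
             'NickRobinson', 'NickRobinson'],
    'year': [2015, 2015, 2012, 2012, 2012],
    'revenue': [1.392446e+09, 1.568978e+09, 1.392446e+09,
                1.046987e+09, 2.354353e+09],
})

# step 1: total revenue per cast member per year
df_summed = df.groupby(['cast', 'year'])['revenue'].sum().reset_index()

# step 2: within each year, keep the row holding the maximum revenue
res = df_summed.loc[df_summed.groupby('year')['revenue'].idxmax()]
print(res)
```

An equivalent alternative (not from the answer above) is `df_summed.sort_values('revenue').drop_duplicates('year', keep='last')`, which also keeps one top row per year.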
Related
I have the following dataframe where each row is a unique state-city pair:
State City
NY Albany
NY NYC
MA Boston
MA Cambridge
I want to add a column of years ranging from 2000 to 2018:
State City Year
NY Albany 2000
NY Albany 2001
NY Albany 2002
...
NY Albany 2018
NY NYC 2000
NY NYC 2018
...
MA Cambridge 2018
I know I can create a list of numbers using Year = list(range(2000,2019))
Does anyone know how to put this list as a column in the dataframe for each state-city?
You could try adding the full list of years to each row and then calling explode (available in pandas 0.25+). I think it should work:
df['Year'] = [list(range(2000,2019))] * len(df)
df = df.explode('Year')
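As a quick sanity check, here is the same explode approach run on the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'State': ['NY', 'NY', 'MA', 'MA'],
                   'City': ['Albany', 'NYC', 'Boston', 'Cambridge']})

# one list of years per row, then explode into one row per (city, year)
df['Year'] = [list(range(2000, 2019))] * len(df)
df = df.explode('Year')

print(df.head())
print(len(df))  # 4 cities x 19 years = 76 rows
```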
One way is to use the DataFrame.stack() method.
Here is sample of your current data:
import pandas as pd
import numpy as np

data = [['NY', 'Albany'],
['NY', 'NYC'],
['MA', 'Boston'],
['MA', 'Cambridge']]
cities = pd.DataFrame(data, columns=['State', 'City'])
print(cities)
# State City
# 0 NY Albany
# 1 NY NYC
# 2 MA Boston
# 3 MA Cambridge
First, make this into a multi-level index (this will end up in the final dataframe):
cities_index = pd.MultiIndex.from_frame(cities)
print(cities_index)
# MultiIndex([('NY', 'Albany'),
# ('NY', 'NYC'),
# ('MA', 'Boston'),
# ('MA', 'Cambridge')],
# names=['State', 'City'])
Now, make a dataframe with all the years in it (I only use 3 years for brevity):
years = list(range(2000, 2003))
n_cities = len(cities)
years_data = np.repeat(years, n_cities).reshape(len(years), n_cities).T
years_data = pd.DataFrame(years_data, index=cities_index)
years_data.columns.name = 'Year index'
print(years_data)
# Year index 0 1 2
# State City
# NY Albany 2000 2001 2002
# NYC 2000 2001 2002
# MA Boston 2000 2001 2002
# Cambridge 2000 2001 2002
Finally, use stack to transform this dataframe into a vertically-stacked series, which I think is what you want:
years_by_city = years_data.stack().rename('Year')
print(years_by_city.head())
# State City Year index
# NY Albany 0 2000
# 1 2001
# 2 2002
# NYC 0 2000
# 1 2001
# Name: Year, dtype: int64
If you want to remove the index and have all the values in a dataframe, just do:
cities_and_years = years_by_city.reset_index()
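A third option, not shown in the answers above, is a cross join via `merge(how='cross')` (requires pandas >= 1.2), which skips the intermediate wide frame entirely; a sketch on the same cities frame:

```python
import pandas as pd

cities = pd.DataFrame({'State': ['NY', 'NY', 'MA', 'MA'],
                       'City': ['Albany', 'NYC', 'Boston', 'Cambridge']})
years = pd.DataFrame({'Year': range(2000, 2019)})

# cross join: every (State, City) pair combined with every year
out = cities.merge(years, how='cross')
print(out.head())
```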
I'm doing a simple sentiment analysis and am stuck on something that I feel should be very simple. I'm trying to add a new column with a set of values, in this example compound values. But after the for loop iterates, it has assigned the same value to all the rows rather than a value per iteration. The compound values are the last column in the DataFrame. There should be a quick fix. Thanks!
for i, row in real.iterrows():
    real['compound'] = sid.polarity_scores(real['title'][i])['compound']
title text subject date compound
0 As U.S. budget fight looms, Republicans flip t... WASHINGTON (Reuters) - The head of a conservat... politicsNews December 31, 2017 0.2263
1 U.S. military to accept transgender recruits o... WASHINGTON (Reuters) - Transgender people will... politicsNews December 29, 2017 0.2263
2 Senior U.S. Republican senator: 'Let Mr. Muell... WASHINGTON (Reuters) - The special counsel inv... politicsNews December 31, 2017 0.2263
3 FBI Russia probe helped by Australian diplomat... WASHINGTON (Reuters) - Trump campaign adviser ... politicsNews December 30, 2017 0.2263
4 Trump wants Postal Service to charge 'much mor... SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews December 29, 2017 0.2263
IIUC:
real['compound'] = real.apply(lambda row: sid.polarity_scores(row['title'])['compound'], axis=1)
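For context, `sid` here is presumably an NLTK VADER `SentimentIntensityAnalyzer`. A minimal self-contained sketch with a stand-in scorer (so it runs without NLTK) shows why `apply` assigns one value per row, whereas the loop in the question overwrote the whole column on every iteration:

```python
import pandas as pd

# stand-in for sid.polarity_scores; the real one comes from
# nltk.sentiment.vader.SentimentIntensityAnalyzer
def polarity_scores(text):
    return {'compound': 0.5 if 'good' in text else -0.5}

real = pd.DataFrame({'title': ['good news today', 'bad news today']})

# apply evaluates the lambda once per row, so each row gets its own score
real['compound'] = real.apply(
    lambda row: polarity_scores(row['title'])['compound'], axis=1)
print(real)
```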
Let's say this is my data frame:
country Edition sports Athletes Medal Firstname Score
Germany 1990 Aquatics HAJOS, Alfred gold Alfred 3
Germany 1990 Aquatics HIRSCHMANN, Otto silver Otto 2
Germany 1990 Aquatics DRIVAS, Dimitrios silver Dimitrios 2
US 2008 Athletics MALOKINIS, Ioannis gold Ioannis 1
US 2008 Athletics HAJOS, Alfred silver Alfred 2
US 2009 Athletics CHASAPIS, Spiridon gold Spiridon 3
France 2010 Athletics CHOROPHAS, Efstathios gold Efstathios 3
France 2010 Athletics CHOROPHAS, Efstathios gold Efstathios 3
France 2010 golf HAJOS, Alfred Bronze Alfred 1
France 2011 golf ANDREOU, Joannis silver Joannis 2
Spain 2011 golf BURKE, Thomas gold Thomas 3
I am trying to find out which athlete's first name has the largest sum of scores.
I have tried the following:
df.groupby('Firstname')['Score'].sum().idxmax()
This returns the first name of the athlete, but I want to display the athlete's full name. Can anyone help me with this?
For example: I am getting 'Otto' as output, but I want to display 'HIRSCHMANN, Otto' instead!
Note: I have noticed that in my original data set, when I groupby('Athlete') the answer is different.
idxmax only gives you the index of the first row with the maximal value. If multiple Firstname values share the maximum total score, it will fail to find all of them.
Try this instead:
sum_score = df.groupby('Firstname')['Score'].sum()
max_score = sum_score.max()
names = sum_score[sum_score == max_score].index
df[df['Firstname'].isin(names)]
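On the sample data this actually matters: Alfred's scores sum to 3+2+1 = 6 and Efstathios's to 3+3 = 6, a tie that idxmax would silently collapse to one name. A runnable sketch of the answer's approach (keeping only the relevant columns):

```python
import pandas as pd

df = pd.DataFrame({
    'Athlete': ['HAJOS, Alfred', 'HIRSCHMANN, Otto', 'DRIVAS, Dimitrios',
                'MALOKINIS, Ioannis', 'HAJOS, Alfred', 'CHASAPIS, Spiridon',
                'CHOROPHAS, Efstathios', 'CHOROPHAS, Efstathios',
                'HAJOS, Alfred', 'ANDREOU, Joannis', 'BURKE, Thomas'],
    'Firstname': ['Alfred', 'Otto', 'Dimitrios', 'Ioannis', 'Alfred',
                  'Spiridon', 'Efstathios', 'Efstathios', 'Alfred',
                  'Joannis', 'Thomas'],
    'Score': [3, 2, 2, 1, 2, 3, 3, 3, 1, 2, 3],
})

# sum scores per first name, find the maximum, then keep every
# first name that reaches it (this is what handles ties)
sum_score = df.groupby('Firstname')['Score'].sum()
max_score = sum_score.max()
names = sum_score[sum_score == max_score].index
winners = df[df['Firstname'].isin(names)]
print(winners[['Athlete', 'Firstname', 'Score']])
```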
I was attempting to find the movies of January to March 2018 from a Wikipedia page using pandas read_html.
Here is my code:
import pandas as pd
import numpy as np
link = "https://en.wikipedia.org/wiki/2018_in_film"
tables = pd.read_html(link)
jan_march = tables[5].iloc[1:]
jan_march.columns = ['Opening1','Opening2','Title','Studio','Cast','Genre','Country','Ref']
jan_march.head()
There is some error in reading the columns. If anybody has already scraped Wikipedia tables, maybe they can help me solve the problem.
Thanks a lot.
Related links:
Scraping Wikipedia tables with Python selectively
https://roche.io/2016/05/scrape-wikipedia-with-python
Scraping paginated web table with python pandas & beautifulSoup
The output I am getting does not match the table layout I expect (screenshots of both omitted).
Because of how the table is designed, it is not as simple as pd.read_html(). That is a start, but you will need to do some manipulation to get it into a desirable format:
import pandas as pd
link = "https://en.wikipedia.org/wiki/2018_in_film"
tables = pd.read_html(link,header=0)[5]
# find rows shifted left by merged cells and shift them right
# (run twice: some rows are missing both the month and the day cell)
for _ in range(2):
    row_shift = tables[tables['Unnamed: 7'].isnull()].index
    tables.iloc[row_shift, :] = tables.iloc[row_shift, :].shift(1, axis=1)
# create new column names
tables.columns = ['Month', 'Day', 'Title', 'Studio', 'Cast and crew', 'Genre', 'Country', 'Ref.']
# forward fill values
tables['Month'] = tables['Month'].ffill()
tables['Day'] = tables['Day'].ffill()
out:
Month Day Title Studio Cast and crew Genre Country Ref.
0 JANUARY 5 Insidious: The Last Key Universal Pictures / Blumhouse Productions Adam Robitel (director); Leigh Whannell (scree... Horror, Thriller US [33]
1 JANUARY 5 The Strange Ones Vertical Entertainment Lauren Wolkstein (director); Christopher Radcl... Drama US [34]
2 JANUARY 5 Stratton Momentum Pictures Simon West (director); Duncan Falconer, Warren... Action, Thriller IT, UK [35]
3 JANUARY 10 Sweet Country Samuel Goldwyn Films Warwick Thornton (director); David Tranter, St... Drama AUS [36]
4 JANUARY 12 The Commuter Lionsgate / StudioCanal / The Picture Company Jaume Collet-Serra (director); Byron Willinger... Action, Crime, Drama, Mystery, Thriller US, UK [37]
5 JANUARY 12 Proud Mary Screen Gems Babak Najafi (director); John S. Newman, Chris... Action, Thriller US [38]
6 JANUARY 12 Acts of Violence Lionsgate Premiere Brett Donowho (director); Nicolas Aaron Mezzan... Action, Thriller US [39]
...
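The key trick in the answer above is the row shift: in the Wikipedia table the month (and sometimes the day) cell spans several rows, so rows under a merged cell come out of read_html one or two columns short of the full width. Shifting those rows right realigns them. A toy illustration of the same repair, with hypothetical column names standing in for the real table:

```python
import pandas as pd

# rows that lost a leading merged cell arrive shifted left,
# leaving a NaN in the last column ('Ref' here)
tables = pd.DataFrame({
    'Month': ['JANUARY', '12', 'Title C'],
    'Day': ['5', 'Title B', '[3]'],
    'Title': ['Title A', '[2]', None],
    'Ref': ['[1]', None, None],
})

# shift affected rows right until the last column is filled
# (twice handles rows missing both the month and the day cell)
for _ in range(2):
    row_shift = tables[tables['Ref'].isnull()].index
    tables.iloc[row_shift, :] = tables.iloc[row_shift, :].shift(1, axis=1)

# re-fill the leading columns the merged cells left empty
tables['Month'] = tables['Month'].ffill()
tables['Day'] = tables['Day'].ffill()
print(tables)
```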
I have four columns of data that I am trying to consolidate based on two conditions. The data are formatted as follows:
CountyName Year Oil Gas
ANDERSON 2010 1358 0
ANDERSON 2010 621746 4996766
ANDERSON 2011 1587 0
ANDERSON 2011 633120 5020877
ANDERSON 2012 55992 387685
ANDERSON 2012 1342 0
ANDERSON 2013 635572 3036578
ANDERSON 2013 4873 0
ANDERSON 2014 656440 2690333
ANDERSON 2014 12332 0
ANDERSON 2015 608454 2836272
ANDERSON 2015 23339 0
ANDERSON 2016 551728 2682261
ANDERSON 2016 12716 0
ANDERSON 2017 132466 567874
ANDERSON 2017 1709 0
ANDREWS 2010 25701725 1860063
ANDREWS 2010 106351 0
ANDREWS 2011 97772 0
ANDREWS 2011 28818329 1377865
ANDREWS 2012 105062 0
...
I'm interested in combining the respective oil and then gas values for entries that are repeated. For example, I'd like to add all the oil entries for Anderson County for the year 2010 and have that value replace the existing entries in just one row. The code I am using now is summing all the values in the respective county regardless of year, giving me a condensed output like this:
CountyName Year Oil Gas
ANDERSON 3954774
ANDREWS 206472698
...
Here's the code I am using:
import csv

with open('Texas.csv', 'r') as Texas:  # opening Texas csv file
    TexasReader = csv.reader(Texas)
    counties = {}
    years = {}
    index = 0 and 1
    for row in TexasReader:
        if index == 0 and 1:
            header = row
        else:
            county = row[0]
            year = row[1]
            oil = row[2]
            gas = row[3]
            if county in counties:
                counties[county] += int(oil)
            else:
                counties[county] = int(oil)
        index += 1

with open('TexasConsolidated.csv', 'w') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=header, delimiter=',', lineterminator='\n')
    writer.writeheader()
    for k, v in counties.items():
        writer.writerow({header[0]: k, header[2]: v})
This is the line that is doing what you complain of:
if county in counties:
    counties[county] += int(oil)
If you want a dict that stores sums over two keys, both values need to be part of the dict key.
Add the line
counties_years = {}
then sum like this, using the tuple (county,year) as the key:
if (county, year) in counties_years:
    counties_years[(county, year)] += int(oil)
else:
    counties_years[(county, year)] = int(oil)
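Since the rest of this thread uses pandas, it is worth noting the same consolidation is essentially a one-liner there; a sketch assuming the same four columns, with a few sample rows:

```python
import pandas as pd

df = pd.DataFrame({
    'CountyName': ['ANDERSON', 'ANDERSON', 'ANDERSON', 'ANDERSON'],
    'Year': [2010, 2010, 2011, 2011],
    'Oil': [1358, 621746, 1587, 633120],
    'Gas': [0, 4996766, 0, 5020877],
})

# sum Oil and Gas within each (county, year) pair,
# mirroring the tuple-keyed dict above
out = df.groupby(['CountyName', 'Year'], as_index=False)[['Oil', 'Gas']].sum()
print(out)
```

In practice you would load the file with `pd.read_csv('Texas.csv')` instead of building the frame by hand.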