When I run the code below in pandas:
gold.groupby(['Games','country'])['Medal'].value_counts()
I get the result below. How can I extract the top medal winner for each Games? The result should list every Games, the country with the most medals, and its medal tally.
Games country Medal
1896 Summer Australia Gold 2
Austria Gold 2
Denmark Gold 1
France Gold 5
Germany Gold 25
...
2016 Summer UK Gold 64
USA Gold 139
Ukraine Gold 2
Uzbekistan Gold 4
Vietnam Gold 1
Name: Medal, Length: 1101, dtype: int64
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal country notes
68 17294 Cai Yalin M 23.0 174.0 60.0 China CHN 2000 Summer 2000 Summer Sydney Shooting Shooting Men's Air Rifle, 10 metres Gold China NaN
77 17299 Cai Yun M 32.0 181.0 68.0 China-1 CHN 2012 Summer 2012 Summer London Badminton Badminton Men's Doubles Gold China NaN
87 17995 Cao Lei F 24.0 168.0 75.0 China CHN 2008 Summer 2008 Summer Beijing Weightlifting Weightlifting Women's Heavyweight Gold China NaN
104 18005 Cao Yuan M 17.0 160.0 42.0 China CHN 2012 Summer 2012 Summer London Diving Diving Men's Synchronized Platform Gold China NaN
105 18005 Cao Yuan M 21.0 160.0 42.0 China CHN 2016 Summer 2016 Summer Rio de Janeiro Diving Diving Men's Springboard Gold China NaN
Your data only included Chinese gold medal winners, so I added a row:
ID Name Sex Age Height Weight Team NOC \
0 17294 Cai Yalin M 23.0 174.0 60.0 China CHN
1 17299 Cai Yun M 32.0 181.0 68.0 China-1 CHN
2 17995 Cao Lei F 24.0 168.0 75.0 China CHN
3 18005 Cao Yuan M 17.0 160.0 42.0 China CHN
4 18005 Cao Yuan M 21.0 160.0 42.0 China CHN
5 292929 Serge de Gosson M 52.0 178.0 69.0 France FR
Games Year Season City Sport \
0 2000 Summer 2000 Summer Sydney Shooting
1 2012 Summer 2012 Summer London Badminton
2 2008 Summer 2008 Summer Beijing Weightlifting
3 2012 Summer 2012 Summer London Diving
4 2016 Summer 2016 Summer Rio de Janeiro Diving
5 2022 Summer 2022 Summer Stockholm Calisthenics
Event Medal country notes
0 Shooting Men's Air Rifle, 10 metres Gold China NaN
1 Badminton Men's Doubles Gold China NaN
2 Weightlifting Women's Heavyweight Gold China NaN
3 Diving Men's Synchronized Platform Gold China NaN
4 Diving Men's Springboard Gold China NaN
5 Planche Gold France NaN
You want to do exactly what you did, but sort the counts in descending order and keep the top row for each Games:
gold.groupby(['Games','country'])['Medal'].value_counts().sort_values(ascending=False).groupby(level=0, group_keys=False).head(1).sort_index()
Which returns:
Games country Medal
2000 Summer China Gold 1
2008 Summer China Gold 1
2012 Summer China Gold 2
2016 Summer China Gold 1
2022 Summer France Gold 1
Name: Medal, dtype: int64
Or as a DataFrame:
GOLD_TOP = pd.DataFrame(gold.groupby(['Games','country'])['Medal'].value_counts().sort_values(ascending=False).groupby(level=0, group_keys=False).head(1).sort_index())
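Because the sample above has only one country per Games, it cannot show the "pick the winner" step. Here is a minimal sketch on made-up data (the rows and the `medals` column name are assumptions for illustration, not taken from your dataset) that counts medals per Games and country, then keeps the highest count for each Games:

```python
import pandas as pd

# Hypothetical sample: two countries win gold medals in each Games
gold = pd.DataFrame({
    "Games": ["2012 Summer"] * 5 + ["2016 Summer"] * 4,
    "country": ["USA", "USA", "USA", "China", "China", "UK", "UK", "UK", "USA"],
    "Medal": ["Gold"] * 9,
})

# One row per (Games, country) with the number of medals won
counts = gold.groupby(["Games", "country"]).size().reset_index(name="medals")

# Sort by tally, keep the first (largest) row for each Games, restore Games order
top = (
    counts.sort_values("medals", ascending=False)
    .drop_duplicates("Games")
    .sort_values("Games")
    .reset_index(drop=True)
)
print(top)
```

If two countries tie within one Games, this keeps only one of them; whether that is acceptable depends on what you need.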
df_gold = df[df["Medal"]=="Gold"].groupby("Team").Medal.count().reset_index()
df_gold = df_gold.sort_values(by="Medal",ascending=False)[:8]
df_gold
I have a pandas dataframe with one column like this:
Merged_Cities
New York, Wisconsin, Atlanta
Tokyo, Kyoto, Suzuki
Paris, Bordeaux, Lyon
Mumbai, Delhi, Bangalore
London, Manchester, Bermingham
And I want a new dataframe with the output like this:
Merged_Cities                     Cities
New York, Wisconsin, Atlanta      New York
New York, Wisconsin, Atlanta      Wisconsin
New York, Wisconsin, Atlanta      Atlanta
Tokyo, Kyoto, Suzuki              Tokyo
Tokyo, Kyoto, Suzuki              Kyoto
Tokyo, Kyoto, Suzuki              Suzuki
Paris, Bordeaux, Lyon             Paris
Paris, Bordeaux, Lyon             Bordeaux
Paris, Bordeaux, Lyon             Lyon
Mumbai, Delhi, Bangalore          Mumbai
Mumbai, Delhi, Bangalore          Delhi
Mumbai, Delhi, Bangalore          Bangalore
London, Manchester, Bermingham    London
London, Manchester, Bermingham    Manchester
London, Manchester, Bermingham    Bermingham
In short I want to split all the cities into different rows while maintaining the 'Merged_Cities' column.
Here's a replicable version of df:
df = pd.DataFrame({'Merged_Cities':['New York, Wisconsin, Atlanta',
'Tokyo, Kyoto, Suzuki',
'Paris, Bordeaux, Lyon',
'Mumbai, Delhi, Bangalore',
'London, Manchester, Bermingham']})
Use .str.split() and .explode():
df = df.assign(Cities=df["Merged_Cities"].str.split(", ")).explode("Cities")
print(df)
Prints:
Merged_Cities Cities
0 New York, Wisconsin, Atlanta New York
0 New York, Wisconsin, Atlanta Wisconsin
0 New York, Wisconsin, Atlanta Atlanta
1 Tokyo, Kyoto, Suzuki Tokyo
1 Tokyo, Kyoto, Suzuki Kyoto
1 Tokyo, Kyoto, Suzuki Suzuki
2 Paris, Bordeaux, Lyon Paris
2 Paris, Bordeaux, Lyon Bordeaux
2 Paris, Bordeaux, Lyon Lyon
3 Mumbai, Delhi, Bangalore Mumbai
3 Mumbai, Delhi, Bangalore Delhi
3 Mumbai, Delhi, Bangalore Bangalore
4 London, Manchester, Bermingham London
4 London, Manchester, Bermingham Manchester
4 London, Manchester, Bermingham Bermingham
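If the separators in your real data are not always exactly ", " (an assumption on my part, your sample is clean), splitting on the bare comma and stripping afterwards is a bit more forgiving. A small sketch with made-up, irregularly spaced rows:

```python
import pandas as pd

# Hypothetical input with inconsistent spacing around the commas
df = pd.DataFrame({"Merged_Cities": ["New York,Wisconsin ,  Atlanta", "Tokyo, Kyoto"]})

# Split on the comma only, explode to one city per row, then renumber the rows
out = (
    df.assign(Cities=df["Merged_Cities"].str.split(","))
    .explode("Cities")
    .reset_index(drop=True)
)

# Remove the leftover whitespace around each city name
out["Cities"] = out["Cities"].str.strip()
print(out)
```

The `reset_index(drop=True)` is optional; drop it if you want to keep the repeated original index as in the output above.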
This is really similar to @AndrejKesely's answer, except it merges df and the cities on their index.
# Create pandas.Series from splitting the column on ', '
s = df['Merged_Cities'].str.split(', ').explode().rename('Cities')
# Merge df with s on their index
df = df.merge(s, left_index=True, right_index=True)
# Result
print(df)
Merged_Cities Cities
0 New York, Wisconsin, Atlanta New York
0 New York, Wisconsin, Atlanta Wisconsin
0 New York, Wisconsin, Atlanta Atlanta
1 Tokyo, Kyoto, Suzuki Tokyo
1 Tokyo, Kyoto, Suzuki Kyoto
1 Tokyo, Kyoto, Suzuki Suzuki
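For what it's worth, `DataFrame.join` aligns on the index by default, so the same index-based merge can probably be written a little shorter. A sketch on a two-row sample (the shortened data is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Merged_Cities": ["Paris, Bordeaux, Lyon", "Mumbai, Delhi"]})

# One city per row, still carrying the original row index
s = df["Merged_Cities"].str.split(", ").explode().rename("Cities")

# join() matches on the index, so each original row repeats once per city
joined = df.join(s)
print(joined)
```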
I have a loop that should cycle through every row of a data frame containing a list of teams. It should go through all 42 rows (indices 0 to 41), but it only processes 2 rows and then stops, and I have no idea why. It seems to me the loop should cover the entire team list, but it stops after looking up two teams.
import pandas as pd
excel_data_df = pd.read_excel('New_Schedule.xlsx', sheet_name='Sheet1', engine='openpyxl')
print(excel_data_df)
print('Data Frame Above')
yahoot = len(excel_data_df)
print('Length Of Dataframe Below')
print(yahoot)
for games in excel_data_df:
    yahoot -= 1
    print(yahoot)
    searching = excel_data_df.iloc[yahoot, 0]
    print(searching)
    excel_data_df2 = pd.read_excel('allstats.xlsx', sheet_name='Sheet1', engine='openpyxl')
    print(excel_data_df2)
    finding = excel_data_df2[excel_data_df2['TEAM:'] == searching].index
    print(finding)
Here is the run log
HOME TEAM: AWAY TEAM:
0 Portland St. Weber St.
1 Nevada Air Force
2 Utah Idaho
3 San Jose St. Santa Clara
4 Southern Utah SAGU American Indian
5 West Virginia Iowa St.
6 Missouri Prairie View
7 Southeast Mo. St. UT Martin
8 Little Rock Champion Chris.
9 Tennessee St. Belmont
10 Wichita St. Emporia St.
11 Tennessee Tennessee Tech
12 FGCU Webber Int'l
13 Jacksonville St. Ga. Southwestern
14 Northern Ill. Chicago St.
15 Col. of Charleston Western Caro.
16 Georgia Tech Florida A&M
17 Rider Iona
18 Tulsa Northwestern St.
19 Rhode Island Davidson
20 Washington St. Montana St.
21 Montana Dickinson St.
22 Robert Morris Bowling Green
23 South Dakota Drake
24 Richmond Loyola Chicago
25 Coastal Carolina Alice Lloyd
26 Presbyterian South Carolina St.
27 Morehead St. SIUE
28 San Diego St. BYU
29 Siena Canisius
30 Monmouth Saint Peter's
31 Howard Hampton
32 App State Columbia Int'l
33 Southern Ill. North Dakota
34 Norfolk St. UNCW
35 Niagara Fairfield
36 N.C. A&T Greensboro
37 Western Mich. Central Mich.
38 DePaul Xavier
39 Georgia St. Carver
40 Northern Ariz. Eastern Wash.
41 Gardner-Webb VMI
Data Frame Above
Length Of Dataframe Below
42
41
Gardner-Webb
TEAM: TOTAL POINTS: ... TURNOVER RATIO: ASSIST TO TURNOVER RANK
0 Mount St. Marys 307 ... 65 239.0
1 Saint Josephs 163 ... 28 81.0
2 Saint Marys (CA) 518 ... 78 114.0
3 Saint Peters 399 ... 86 145.0
4 St. John's (NY) 656 ... 115 73.0
.. ... ... ... ... ...
314 Wofford 327 ... 54 113.0
315 Wright St. 220 ... 47 206.0
316 Wyoming 517 ... 64 27.0
317 Xavier 582 ... 84 12.0
318 Youngstown St. 231 ... 30 79.0
[319 rows x 18 columns]
Int64Index([85], dtype='int64')
40
Northern Ariz.
TEAM: TOTAL POINTS: ... TURNOVER RATIO: ASSIST TO TURNOVER RANK
0 Mount St. Marys 307 ... 65 239.0
1 Saint Josephs 163 ... 28 81.0
2 Saint Marys (CA) 518 ... 78 114.0
3 Saint Peters 399 ... 86 145.0
4 St. John's (NY) 656 ... 115 73.0
.. ... ... ... ... ...
314 Wofford 327 ... 54 113.0
315 Wright St. 220 ... 47 206.0
316 Wyoming 517 ... 64 27.0
317 Xavier 582 ... 84 12.0
318 Youngstown St. 231 ... 30 79.0
[319 rows x 18 columns]
Int64Index([180], dtype='int64')
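Iterating over a DataFrame directly yields its column labels, so your loop runs once per column (twice here) rather than once per row. Use for index, data in excel_data_df.iterrows(): instead.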
pandas.DataFrame.iterrows
DataFrame.iterrows()
Iterate over DataFrame rows as (index, Series) pairs.
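To make the difference concrete, here is a small sketch using the first three rows of your schedule (the `schedule` name and the list comprehensions are just for illustration):

```python
import pandas as pd

# First three rows of the schedule from the question
schedule = pd.DataFrame({
    "HOME TEAM:": ["Portland St.", "Nevada", "Utah"],
    "AWAY TEAM:": ["Weber St.", "Air Force", "Idaho"],
})

# Iterating the DataFrame itself walks over the column labels: two iterations
column_labels = [label for label in schedule]

# iterrows() walks over the rows as (index, Series) pairs: three iterations
home_teams = [row["HOME TEAM:"] for index, row in schedule.iterrows()]

print(column_labels)
print(home_teams)
```

With `iterrows()` you also no longer need the manual `yahoot` counter, since each row is handed to you directly.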
I'm working on my python skills and I'm trying to scrape only the "Results" table from this page https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results . I'm new to web scraping, could anyone help me with an elegant solution for scraping the Results wikitable? Thanks!
The easiest way is to use Pandas to load the tables:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results')
# print second table (index 1):
print(tables[1])
Prints:
Date Venue Home team Away team Score Competition Winner Match report
0 7 March 2020 Twickenham Stadium England Wales 33–30 2020 Six Nations England BBC
1 22 February 2020 Principality Stadium Wales France 23–27 2020 Six Nations France BBC
2 8 February 2020 Aviva Stadium Ireland Wales 24–14 2020 Six Nations Ireland BBC
3 1 February 2020 Principality Stadium Wales Italy 42–0 2020 Six Nations Wales BBC
4 30 November 2019 Principality Stadium Wales Barbarians 43–33 Tour Match Wales BBC
.. ... ... ... ... ... ... ... ...
741 5 January 1884 Cardigan Fields England Wales 1G 2T–1G 1884 Home Nations Championship England NaN
742 8 January 1883 Raeburn Place Scotland Wales 3G–1G 1883 Home Nations Championship Scotland NaN
743 16 December 1882 St Helen's Wales England 0–2G 4T 1883 Home Nations Championship England NaN
744 28 January 1882 Lansdowne Road Ireland Wales 0–2G 2T NaN Wales NaN
745 19 February 1881 Richardson's Field England Wales 7G 6T 1D–0 NaN England NaN
[746 rows x 8 columns]
I am trying to web scrape, by using Python 3, a chart off of this website into a .csv file: 2013-14 NBA National TV Schedule
The chart starts out like:
Game/Time Network Matchup
Oct. 29, 8 p.m. ET TNT Chicago vs. Miami
Oct. 29, 10:30 p.m. ET TNT LA Clippers vs. LA Lakers
I am using these packages:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby
I imported the data by:
pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule")[0]
The output sample is:
0 1 2
0 Game/Time Network Matchup
1 Oct. 29, 8 p.m. ET TNT Chicago vs. Miami
2 Oct. 29, 10:30 p.m. ET TNT LA Clippers vs. LA Lakers
The output I want in a .csv file looks like this:
I am unsure how I can split the game/time up into separate columns. Notice how the date is formatted like 10/29/13. I also am unsure how to split matchup into away (first team) and home (second team) into separate columns. I know pd.to_datetime and str.split() should be used. How do I implement the scraper to get this output?
Here's my take:
df = pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule")[0]
# set the correct column names
df = df.T.set_index([0]).T
# separate date and time
datetime = df['Game/Time'].str.extract('(?P<Date>.*), (?P<Time>.*) ET$')
# extract Home and Away
home_away = df['Matchup'].str.extract(r'^(?P<Away>.*) vs\. (?P<Home>.*)$')
# join the data
final_df = pd.concat([datetime, home_away, df[['Network']]], axis=1)
Output:
Date Time Away Home Network
1 Oct. 29 8 p.m. Chicago Miami TNT
2 Oct. 29 10:30 p.m. LA Clippers LA Lakers TNT
3 Oct. 31 8 p.m. New York Chicago TNT
4 Oct. 31 10:30 p.m. Golden State LA Clippers TNT
5 Nov. 1 8 p.m. Miami Brooklyn ESPN
.. ... ... ... ... ...
141 Apr. 13 1 p.m. Chicago New York ABC
142 Apr. 15 8 p.m. New York Brooklyn TNT
143 Apr. 15 10:30 p.m. Denver LA Clippers TNT
144 Apr. 16 8 p.m. Atlanta Milwaukee ESPN
145 Apr. 16 10:30 p.m. Golden State Denver ESPN
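The Date column above is still text without a year. A possible way to get the 10/29/13 style from the question is sketched below; it assumes (my assumption, based on this being the 2013-14 season) that October to December games are in 2013 and everything else is in 2014, and it uses three sample dates from the output rather than the scraped table:

```python
import pandas as pd

# Three sample values in the same shape as the Date column above
df = pd.DataFrame({"Date": ["Oct. 29", "Nov. 1", "Apr. 13"]})

# Pull out the three-letter month abbreviation and the day number
month = df["Date"].str.extract(r"^([A-Za-z]+)", expand=False).str[:3]
day = df["Date"].str.extract(r"(\d+)$", expand=False)

# Assumed season rule: Oct-Dec belong to 2013, the remaining months to 2014
year = month.isin(["Oct", "Nov", "Dec"]).map({True: "2013", False: "2014"})

# Parse with an explicit format, then render as MM/DD/YY
parsed = pd.to_datetime(month + " " + day + " " + year, format="%b %d %Y")
df["Date"] = parsed.dt.strftime("%m/%d/%y")
print(df)
```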
df['Date']=df['Date'].dt.strftime('%m/%d/%Y')
This line should help you format the date in the exact way you want
import pandas as pd
import numpy as np
df = pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule",header=0)[0]
df['Date'] = df['Game/Time'].str.extract(r'(.*),', expand=False)
df['Time'] = df['Game/Time'].str.extract(r',(.*) ET', expand=False).str.strip()
df['Time'] = df['Time'].str.replace('p.m.', 'PM', regex=False)
# October-December games are in 2013; the rest of the season is in 2014
df['Date'] = np.where(df['Date'].str.startswith(('Oct', 'Nov', 'Dec')), df['Date'] + ' 2013', df['Date'] + ' 2014')
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = df['Date'].dt.strftime('%m/%d/%Y')
# the first team listed is the away team, the second is the home team
df['Away'] = df['Matchup'].str.extract(r'(.*)vs\.', expand=False).str.strip()
df['Home'] = df['Matchup'].str.extract(r'vs\.(.*)', expand=False).str.strip()
df = df.drop(columns=['Game/Time','Matchup'])
print(df)
Network Date Time Away Home
0 TNT 10/29/2013 8 PM Chicago Miami
1 TNT 10/29/2013 10:30 PM LA Clippers LA Lakers
2 TNT 10/31/2013 8 PM New York Chicago
3 TNT 10/31/2013 10:30 PM Golden State LA Clippers
4 ESPN 11/01/2013 8 PM Miami Brooklyn
I hope this is what you were looking for.
You can use regex to split out your columns. Your times come in different formats, so we can handle them by trying each specific format in turn and coercing the values that don't match into NaT.
df = pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule")[0]
# set column
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)
#set date and time column.
df['date'] = pd.to_datetime((df['Game/Time'].str.split(',',expand=True)[0] + ' 2019')
,format='%b. %d %Y')
df['time'] = df['Game/Time'].str.split(',',expand=True)[1]
#time column has different formats, let's handle those.
# %I (12-hour clock) is required for %p to take effect, so "8 pm" becomes 20:00
cleaned = df['time'].str.strip('ET').str.replace('.', '', regex=False).str.strip()
s = pd.to_datetime(cleaned, format='%I %p', errors='coerce')
s = s.fillna(pd.to_datetime(cleaned, format='%I:%M %p', errors='coerce'))
df['time'] = s.dt.time
#away and home columns (the first team listed is the away team).
teams = df['Matchup'].str.extract(r'(.*)vs\.?(.*)')
df['away'] = teams[0].str.strip()
df['home'] = teams[1].str.strip()
# slice dataframe.
df2 = df[['date','time','away','home','Network']]
print(df2)
0 date time away home Network
0 2019-10-29 20:00:00 Chicago Miami TNT
1 2019-10-29 22:30:00 LA Clippers LA Lakers TNT
2 2019-10-31 20:00:00 New York Chicago TNT
3 2019-10-31 22:30:00 Golden State LA Clippers TNT
4 2019-11-01 20:00:00 Miami Brooklyn ESPN
.. ... ... ... ... ...
140 2019-04-13 13:00:00 Chicago New York ABC
141 2019-04-15 20:00:00 New York Brooklyn TNT
142 2019-04-15 22:30:00 Denver LA Clippers TNT
143 2019-04-16 20:00:00 Atlanta Milwaukee ESPN
144 2019-04-16 22:30:00 Golden State Denver ESPN
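Since the question asks for a .csv file, the last step would be `to_csv`. A minimal sketch, with two hand-written rows standing in for the scraped table and `nba_schedule.csv` as a hypothetical file name:

```python
import pandas as pd

# Two hand-written rows in the target layout (not scraped; for illustration only)
final_df = pd.DataFrame({
    "Date": ["10/29/13", "10/29/13"],
    "Time": ["8:00 PM", "10:30 PM"],
    "Away": ["Chicago", "LA Clippers"],
    "Home": ["Miami", "LA Lakers"],
    "Network": ["TNT", "TNT"],
})

# Without a path, to_csv returns the CSV text; index=False drops the row numbers
csv_text = final_df.to_csv(index=False)
print(csv_text)

# To write a file instead, pass a path (hypothetical name):
# final_df.to_csv("nba_schedule.csv", index=False)
```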