I am scraping some NBA data with Python. I have the following script:
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import pandas as pd

def scrape_data():
    url = "https://basketball-reference.com/leagues/NBA_2020_advanced.html"
    html = urlopen(url)
    soup = bs(html, 'html.parser')
    # The first <tr> holds the column headers; drop the leading 'Rk' header
    headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
    headers = headers[1:]
    rows = soup.findAll('tr')[1:]
    player_stats = [[td.getText() for td in row.findAll('td')] for row in rows]
    stats = pd.DataFrame(player_stats, columns=headers)
    return stats
Which returns this:
Player Pos Age Tm G ... OBPM DBPM BPM VORP
0 Steven Adams C 26 OKC 43 ... 1.6 3.3 4.9 2.0
1 Bam Adebayo PF 22 MIA 47 ... 1.2 3.8 5.0 2.8
2 LaMarcus Aldridge C 34 SAS 43 ... 1.7 0.6 2.4 1.6
3 Nickeil Alexander-Walker SG 21 NOP 38 ... -3.4 -2.3 -5.6 -0.4
4 Grayson Allen SG 24 MEM 30 ... -0.7 -2.8 -3.5 -0.2
.. ... .. .. ... .. ... .. ... ... ... ...
537 Thaddeus Young PF 31 CHI 49 ... -2.2 0.9 -1.3 0.2
538 Trae Young PG 21 ATL 44 ... 7.8 -2.3 5.5 2.9
539 Cody Zeller C 27 CHO 45 ... 0.0 -0.6 -0.6 0.4
540 Ante Žižić C 23 CLE 16 ... -2.3 -1.4 -3.6 -0.1
541 Ivica Zubac C 22 LAC 48 ... 0.4 2.3 2.7 1.0
I want to scrape a second URL, where the table is formatted exactly the same, and append the players' stats from that table to this one, if that makes sense. The problem is that on the second URL there will be a few stats that appear in both tables, and I don't want to add these in again when I'm "merging" the two tables. How do I go about this?
You're doing a ton of work to put a <table> tag into a table. Let pandas do that for you with read_html (it uses BeautifulSoup under the hood). Then, to merge, there are two ways you can do it:
1) Make one of the dataframes contain only what is not in the other (but keep the columns you will merge on).
2) Drop columns from the second dataframe that are already in the first (again, making sure not to drop the columns you will merge on).
import pandas as pd

def scrape_data(url):
    stats = pd.read_html(url)[0]
    return stats
df1 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_advanced.html")
df1 = df1[df1['Rk'] != 'Rk']  # drop the repeated header rows embedded in the table
df2 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_per_poss.html")
df2 = df2[df2['Rk'] != 'Rk']
uniqueCols = [ col for col in df2.columns if col not in df1.columns ]
# Below will do the same as above line
#uniqueCols = list(df2.columns.difference(df1.columns))
df2 = df2[uniqueCols + ['Player', 'Tm']]
df = df1.merge(df2, how='left', on=['Player', 'Tm'])
OR
import pandas as pd

def scrape_data(url):
    stats = pd.read_html(url)[0]
    return stats
df1 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_advanced.html")
df1 = df1[df1['Rk'] != 'Rk']
df2 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_per_poss.html")
df2 = df2[df2['Rk'] != 'Rk']
dropCols = [ col for col in df1.columns if col in df2.columns and col not in ['Player','Tm']]
df2 = df2.drop(dropCols, axis=1)
df = df1.merge(df2, how='left', on=['Player', 'Tm'])
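Either way, the two snippets should produce the same merged frame. A quick sanity check, assuming you ran both and kept the results under the hypothetical names df_approach1 and df_approach2:

import pandas as pd

# Sort columns first in case the two approaches order them differently
pd.testing.assert_frame_equal(
    df_approach1.sort_index(axis=1),
    df_approach2.sort_index(axis=1),
)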
I think you want to use drop_duplicates(). Here's a simplified example:
import pandas as pd
df = pd.DataFrame([["foo", "bar"],["foo2", "bar2"],["foo3", "bar3"]], columns=["first_column", "second_column"])
df2 = pd.DataFrame([["foo3", "bar4"],["foo4", "bar5"],["foo5", "bar6"]], columns=["first_column", "second_column"])
print(pd.concat([df, df2], ignore_index=True).drop_duplicates(subset="first_column"))
Output:
first_column second_column
0 foo bar
1 foo2 bar2
2 foo3 bar3
4 foo4 bar5
5 foo5 bar6
As you can see, the "foo3" row from the second dataframe gets filtered out because it is already contained in the first dataframe.
In your case you would use something like:
pd.concat([stats, stats2], ignore_index=True).drop_duplicates(subset="Player")
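One caveat, assuming the Basketball-Reference layout shown above: a player traded mid-season appears on several rows (one per team, plus a TOT row), so deduplicating on "Player" alone can drop rows you want to keep. A minimal sketch that keys on both Player and Tm instead:

import pandas as pd

# Hypothetical frames standing in for the two scraped tables
stats = pd.DataFrame({"Player": ["Steven Adams", "Bam Adebayo"],
                      "Tm": ["OKC", "MIA"]})
stats2 = pd.DataFrame({"Player": ["Bam Adebayo", "Trae Young"],
                       "Tm": ["MIA", "ATL"]})

# Keying on both columns keeps a traded player's per-team rows distinct
merged = pd.concat([stats, stats2], ignore_index=True)
print(merged.drop_duplicates(subset=["Player", "Tm"]))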
Related
I'm downloading football data with pandas' read_html function, but struggling to clean the player names with all the accented characters. This is what I have so far:
import pandas as pd
from unidecode import unidecode
shooting = pd.read_html("https://widgets.sports-reference.com/wg.fcgi?css=1&site=fb&url=%2Fen%2Fcomps%2F9%2Fshooting%2FPremier-League-Stats&div=div_stats_shooting")
for idx, table in enumerate(shooting):
    print("***************************")
    print(idx)
    print(table)
    shooting = table
    for col in [('Unnamed: 1_level_0', 'Player')]:
        shooting[col] = shooting[col].apply(unidecode)
    table.to_csv(r'C:\Users\khabs\OneDrive\Documents\Python Testing\shooting.csv', index=False, header=True)
print(shooting)
I think the issue is that the encoding is messed up before I even do the cleaning, but I'm really not sure.
Any help would be greatly appreciated!!
Just use the encoding parameter in pandas.
import pandas as pd
url = "https://widgets.sports-reference.com/wg.fcgi?css=1&site=fb&url=%2Fen%2Fcomps%2F9%2Fshooting%2FPremier-League-Stats&div=div_stats_shooting"
shooting = pd.read_html(url, header=1, encoding='utf8')[0]
However (and I'm assuming here), that will not get you what you want, as there are extra HTML characters in the response returned from that widget.
Just go after the actual HTML. The table is within the comments.
import requests
import pandas as pd
url = 'https://fbref.com/en/comps/9/shooting/Premier-League-Stats'
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
shooting = pd.read_html(html, header=1)[-1]
shooting = shooting[shooting['Rk'].ne('Rk')]
Output:
print(shooting.head(10))
Rk Player Nation Pos ... npxG/Sh G-xG np:G-xG Matches
0 1 Brenden Aaronson us USA FW,MF ... 0.03 -0.1 -0.1 Matches
1 2 Che Adams sct SCO FW ... 0.09 +1.6 +1.6 Matches
2 3 Tyler Adams us USA MF ... 0.01 0.0 0.0 Matches
3 4 Tosin Adarabioyo eng ENG DF ... NaN 0.0 0.0 Matches
4 5 Rayan Aït Nouri fr FRA DF ... 0.08 -0.1 -0.1 Matches
5 6 Nathan Aké nl NED DF ... 0.05 -0.2 -0.2 Matches
6 7 Thiago Alcántara es ESP MF ... NaN 0.0 0.0 Matches
7 8 Trent Alexander-Arnold eng ENG DF ... 0.05 -0.2 -0.2 Matches
8 9 Alisson br BRA GK ... NaN 0.0 0.0 Matches
9 10 Dele Alli eng ENG FW,MF ... NaN 0.0 0.0 Matches
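If the string replace feels too blunt, here is a sketch of the same idea using BeautifulSoup's Comment objects instead (assuming bs4, lxml and requests are installed; the site hides its stats tables inside HTML comments):

import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

url = 'https://fbref.com/en/comps/9/shooting/Premier-League-Stats'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Parse every commented-out fragment that contains a table
tables = []
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    if '<table' in comment:
        tables.extend(pd.read_html(comment, header=1))

shooting = tables[-1]
shooting = shooting[shooting['Rk'].ne('Rk')]  # drop repeated header rows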
Let's say that I have this dataframe with three columns: "Name", "Account" and "Ccy".
import pandas as pd
Name = ['Dan', 'Mike', 'Dan', 'Dan', 'Sara', 'Charles', 'Mike', 'Karl']
Account = ['100', '30', '50', '200', '90', '20', '65', '230']
Ccy = ['EUR','EUR','USD','USD','','CHF', '','DKN']
df = pd.DataFrame({'Name':Name, 'Account' : Account, 'Ccy' : Ccy})
Name Account Ccy
0 Dan 100 EUR
1 Mike 30 EUR
2 Dan 50 USD
3 Dan 200 USD
4 Sara 90
5 Charles 20 CHF
6 Mike 65
7 Karl 230 DKN
I would like to represent this data differently. I would like to write a script that finds all the duplicates in the Name column and regroups them with their different accounts, and, if there is a currency ("Ccy"), adds a new column next to it with the associated currencies.
So something like this:
Dan Ccy1 Mike Ccy2 Sara Charles Ccy3 Karl Ccy4
0 100 EUR 30 EUR 90 20 CHF 230 DKN
1 50 USD 65
2 200 USD
I don't really know how to start! So I simplified the problem to do it step by step. I tried to regroup the duplicates by name with a list, however it did not identify the duplicates.
x_len, y_len = df.shape
new_data = []
for i in range(x_len):
    if df.iloc[i,0] not in new_data:
        print(str(df.iloc[i,0]) + '\t' + str(df.iloc[i,1]) + '\t' + str(bool(df.iloc[i,0] not in new_data)))
        new_data.append([df.iloc[i,0], df.iloc[i,1]])
    else:
        new_data[str(df.iloc[i,0])].append(df.iloc[i,1])
Then I thought that it would be easier to use a dictionary. So I tried this loop, but there is an error, and maybe it is not the best way to get to the expected final result:
from collections import defaultdict

dico = defaultdict(list)
x_len, y_len = df.shape
for i in range(x_len):
    if df.iloc[i,0] not in dico:
        print(str(df.iloc[i,0]) + '\t' + str(df.iloc[i,1]) + '\t' + str(bool(df.iloc[i,0] not in dico)))
        dico[str(df.iloc[i,0])] = df.iloc[i,1]
        print(dico)
    else:
        dico[df.iloc[i,0]].append(df.iloc[i,1])
Does anyone have an idea how to start, or how to write the code if it is simple?
Thank you
Use GroupBy.cumcount for a counter, reshape with DataFrame.set_index and DataFrame.unstack, and last flatten the column names:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
Account_Charles Ccy_Charles Account_Dan Ccy_Dan Account_Karl Ccy_Karl \
0 20 CHF 100 EUR 230 DKN
1 NaN NaN 50 USD NaN NaN
2 NaN NaN 200 USD NaN NaN
Account_Mike Ccy_Mike Account_Sara Ccy_Sara
0 30 EUR 90
1 65 NaN NaN
2 NaN NaN NaN NaN
If you need custom column names, use if-else in a list comprehension:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
L = [b if a == 'Account' else f'{a}{i // 2}' for i, (a, b) in enumerate(df.columns)]
df.columns = L
print (df)
Charles Ccy0 Dan Ccy1 Karl Ccy2 Mike Ccy3 Sara Ccy4
0 20 CHF 100 EUR 230 DKN 30 EUR 90
1 NaN NaN 50 USD NaN NaN 65 NaN NaN
2 NaN NaN 200 USD NaN NaN NaN NaN NaN NaN
I am new in this field and stuck on this problem. I have two datasets
all_batsman_df, this df has 5 columns ('years', 'team', 'pos', 'name', 'salary'):
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df, this df has 31 columns
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year' and 'name'. But the problem is that the two data frames spell some names differently; for example, the first dataset has 'Glenn Davis' but the second has 'Glen Davis'.
Now, I want to know how I can merge them using the difflib library even though the names differ.
Any help will be appreciated ...
Thanks in advance.
I have used this code, which I got from a question asked on this platform, but it is not working for me. I am adding a new column after matching names in both of the datasets. I know this is not a good approach. Kindly suggest if I can do it in a better way.
import cdifflib
import pandas as pd

df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year']  # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']
for comp_a, addr_a in df_a[['Year','Name']].values:
    for ixb, (comp_b, addr_b) in enumerate(df_b[['years','name']].values):
        if cdifflib.CSequenceMatcher(None, comp_a, comp_b).ratio() > .6:
            df_b.loc[ixb,'merge_year'] = comp_a  # creates a merge key in df_b
        if cdifflib.CSequenceMatcher(None, addr_a, addr_b).ratio() > .6:
            df_b.loc[ixb,'merge_name'] = addr_a  # creates a merge key in df_b
merged_df = pd.merge(df_a, df_b, on=['merge_name','merge_years'], how='inner')
You can do
import difflib
df_b['name'] = df_b['name'].apply(
    lambda x: difflib.get_close_matches(x, df_a['name'])[0])
to replace the names in df_b with the closest match from df_a, then do your merge. See also this post.
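One caveat with that one-liner: get_close_matches returns an empty list when nothing clears its similarity cutoff, so indexing [0] unconditionally can raise an IndexError. A minimal sketch with a hypothetical closest_or_self helper that falls back to the original name (column names follow the question):

import difflib

def closest_or_self(name, candidates, cutoff=0.6):
    # Keep the original name when no candidate clears the cutoff
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else name

df_b['name'] = df_b['name'].apply(
    lambda x: closest_or_self(x, df_a['name'].tolist()))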
Let me get to your problem by assuming that you have to make a data set with 2 columns, the 2 columns being 1. 'year' and 2. 'name'.
1. First, we will rename all the names which are wrong.
I hope you know all the wrong names in all_batting_statistics_df. Use this:
all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
Once you have corrected all the spellings, choose the smaller data set, the one whose names you know, so it doesn't take long.
2. We need both data sets to have the same columns, i.e. only 'year' and 'name'. Use this to drop the columns we don't need:
all_batsman_df_1 = all_batsman_df.drop(['team','pos','salary'], axis=1)
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk','Name','Age','Tm','Lg','G','PA','AB','R','Summary'], axis=1)
I cannot see all the 31 columns so I left them, you have to add to the above code
3. We need to change the column names to match, i.e. 'year' and 'name', using pandas' DataFrame.rename:
df_new_1 = all_batting_statistics_df.rename(columns={'Year': 'year', 'Name': 'name'})
4. Next, to merge them, we will use this:
all_batsman_df.merge(df_new_1, on=['year', 'name'])
FINAL THOUGHTS:
If you don't want to do all this, find a way to export the data sets to Google Sheets or Microsoft Excel and edit them there. If you like pandas, it's not that difficult; you will find a way. All the best!
How do I turn a table like this (a batting gamelogs table) into a CSV file using Python and BeautifulSoup?
I want the first header where it says Rk, Gcar, Gtm, etc. and not any of the other headers within the table (the ones for each month of the season).
Here is the code I have so far:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv
def stir_the_soup():
    player_links = open('player_links.txt', 'r')
    player_ID_nums = open('player_ID_nums.txt', 'r')
    id_nums = [x.rstrip('\n') for x in player_ID_nums]
    idx = 0
    for url in player_links:
        print url
        soup = BeautifulSoup(urlopen(url), "lxml")
        p_type = ""
        if url[-12] == 'p':
            p_type = "pitching"
        elif url[-12] == 'b':
            p_type = "batting"
        table = soup.find(lambda tag: tag.name == 'table' and tag.has_attr('id') and tag['id'] == (p_type + "_gamelogs"))
        header = [[val.text.encode('utf8') for val in table.find_all('thead')]]
        rows = []
        for row in table.find_all('tr'):
            rows.append([val.text.encode('utf8') for val in row.find_all('th')])
            rows.append([val.text.encode('utf8') for val in row.find_all('td')])
        with open("%s.csv" % id_nums[idx], 'wb') as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(row for row in rows if row)
        idx += 1
    player_links.close()

if __name__ == "__main__":
    stir_the_soup()
The id_nums list contains all of the ID numbers for each player, to use as the names for the separate CSV files.
For each row, the leftmost cell is a <th> tag and the rest of the row is <td> tags. In addition to the header, how do I put all of those into one row?
This code gets you the big table of stats, which is what I think you want.
Make sure you have lxml, beautifulsoup4 and pandas installed.
import pandas as pd

df = pd.read_html(r'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010')
print(df[4])
Here is the output of the first 5 rows. You may need to clean it slightly, as I don't know what your exact end goal is:
df[4].head(5)
Rk Gcar Gtm Date Tm Unnamed: 5 Opp Rslt Inngs PA ... CS BA OBP SLG OPS BOP aLI WPA RE24 Pos
0 1 66 2 (1) Apr 6 ARI NaN SDP L,3-6 7-8 1 ... 0 1.000 1.000 1.000 2.000 9 .94 0.041 0.51 PH
1 2 67 3 Apr 7 ARI NaN SDP W,5-3 7-8 1 ... 0 .500 .500 .500 1.000 9 1.16 -0.062 -0.79 PH
2 3 68 4 Apr 9 ARI NaN PIT W,9-1 8-GF 1 ... 0 .667 .667 .667 1.333 2 .00 0.000 0.13 PH SS
3 4 69 5 Apr 10 ARI NaN PIT L,3-6 CG 4 ... 0 .500 .429 .500 .929 2 1.30 -0.040 -0.37 SS
4 5 70 7 (1) Apr 13 ARI # LAD L,5-9 6-6 1 ... 0 .429 .375 .429 .804 9 1.52 -0.034 -0.46 PH
To select certain columns within this DataFrame: df[4]['COLUMN_NAME_HERE'].head(5). Example: df[4]['Gcar'].
Also, if typing df[4] is getting annoying, you can always assign it to another dataframe: df2 = df[4].
import pandas as pd
from bs4 import BeautifulSoup
import urllib2

url = 'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010'
html = urllib2.urlopen(url)
bs = BeautifulSoup(html, 'lxml')
table = str(bs.find('table', {'id': 'batting_gamelogs'}))
dfs = pd.read_html(table)
This uses Pandas, which is pretty useful for stuff like this. It also puts it in a pretty reasonable format to do other operations on.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html
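If you're on Python 3, where urllib2 no longer exists, a minimal equivalent sketch using requests instead:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Isolate the gamelogs table, then let pandas parse it
table = str(soup.find('table', {'id': 'batting_gamelogs'}))
dfs = pd.read_html(table)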
The ordering of my age, height and weight columns is changing with each run of the code. I need to keep the order of my agg columns static because I ultimately refer to this output file according to the column locations. What can I do to make sure age, height and weight are output in the same order every time?
import numpy as np
import pandas as pd

df = pd.read_csv(input_file, na_values=[''])  # read_csv already returns a DataFrame
df.index_col = ['name', 'address']
df_out = df.groupby(df.index_col).agg({'age': np.mean, 'height': np.sum, 'weight': np.sum})
df_out.to_csv(output_file, sep=',')
I think you can use subset:
df_out = df.groupby(df.index_col) \
           .agg({'age': np.mean, 'height': np.sum, 'weight': np.sum})[['age','height','weight']]
Also you can use pandas functions:
df_out = df.groupby(df.index_col) \
           .agg({'age': 'mean', 'height': sum, 'weight': sum})[['age','height','weight']]
Sample:
df = pd.DataFrame({'name': ['q','q','a','a'],
                   'address': ['a','a','s','s'],
                   'age': [7,8,9,10],
                   'height': [1,3,5,7],
                   'weight': [5,3,6,8]})
print (df)
address age height name weight
0 a 7 1 q 5
1 a 8 3 q 3
2 s 9 5 a 6
3 s 10 7 a 8
df.index_col = ['name', 'address']
df_out = df.groupby(df.index_col) \
           .agg({'age': 'mean', 'height': sum, 'weight': sum})[['age','height','weight']]
print (df_out)
age height weight
name address
a s 9.5 12 14
q a 7.5 4 8
EDIT, by suggestion: add reset_index (as_index=False does not work here if you need the index values as regular columns too):
df_out = df.groupby(df.index_col) \
           .agg({'age': 'mean', 'height': sum, 'weight': sum})[['age','height','weight']] \
           .reset_index()
print (df_out)
name address age height weight
0 a s 9.5 12 14
1 q a 7.5 4 8
If you care mostly about the order when written to a file, and not while it's still in a DataFrame object, you can set the columns parameter of the to_csv() method:
>>> df = pd.DataFrame(
...     {'age': [28,63,28,45],
...      'height': [183,156,170,201],
...      'weight': [70.2, 62.5, 65.9, 81.0],
...      'name': ['Kim', 'Pat', 'Yuu', 'Sacha']},
...     columns=['name','age','weight', 'height'])
>>> df
name age weight height
0 Kim 28 70.2 183
1 Pat 63 62.5 156
2 Yuu 28 65.9 170
3 Sacha 45 81.0 201
>>> df_out = df.groupby(['age'], as_index=False).agg(
...     {'weight': sum, 'height': sum})
>>> df_out
age height weight
0 28 353 136.1
1 45 201 81.0
2 63 156 62.5
>>> df_out.to_csv('out.csv', sep=',', columns=['age','height','weight'])
out.csv then looks like this:
,age,height,weight
0,28,353,136.10000000000002
1,45,201,81.0
2,63,156,62.5
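As a side note: on newer pandas versions (0.25+), named aggregation also pins the output column order, since the result columns follow the order of the keyword arguments. A minimal sketch reusing the frame above:

df_out = df.groupby(['age'], as_index=False).agg(
    height=('height', 'sum'),  # output columns appear in exactly this order
    weight=('weight', 'sum'),
)
df_out.to_csv('out.csv', sep=',')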