python pd.read_html gives error

I've been using pandas and requests to pull some tables to get NFL statistics. It's been going pretty well, and I've been able to pull tables from other sites, until I tried to get the NFL combine table from this one particular site.
It gives me an error after df_list = pd.read_html(html).
The error I get is:
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('<U1')
Here's the code I've been using at other sites that worked really well.
import requests
import pandas as pd
df = pd.DataFrame()
url = 'http://nflcombineresults.com/nflcombinedata_expanded.php?year=1987&pos=&college='
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
I've read and seen a little bit about BeautifulSoup, but the simplicity of the pd.read_html() is just so nice and compact. So I don't know if there's a quick fix that I am not aware of, or if I need to indeed dive into BeautifulSoup to get these tables from 1987 - 2017.

This isn't shorter, but may be more robust:
import requests
import pandas as pd
from bs4 import BeautifulSoup
A convenience function:
def souptable(table):
    for row in table.find_all('tr'):
        yield [col.text for col in row.find_all('td')]
Return a DataFrame with data loaded for a given year:
def getyear(year):
    url = 'http://nflcombineresults.com/nflcombinedata_expanded.php?year=%d&pos=&college=' % year
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    data = list(souptable(soup.table))
    df = pd.DataFrame(data[1:], columns=data[0])
    df = df[pd.notnull(df['Name'])]
    return df.apply(pd.to_numeric, errors="ignore")
This function slices out the heading row when the DataFrame is created, uses the first row for column names, and filters out any rows with an empty Name value.
Finally, concatenate up as many years as you need into a single DataFrame:
dfs = pd.concat([getyear(year) for year in range(1987, 1990)])
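One small follow-up, not in the original answer: each per-year frame keeps its own 0-based index, so the concatenated result contains duplicate index labels. pd.concat accepts ignore_index=True if you would rather have one clean running index:
dfs = pd.concat([getyear(year) for year in range(1987, 1990)], ignore_index=True)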

OK, after doing some more research, it looks like the issue is that the last row is one merged cell, and that's possibly where the error is coming from. So I went with BeautifulSoup to pull the data. Here is my solution:
import requests
import pandas as pd
from bs4 import BeautifulSoup
I wanted to pull the data for each year from 1987 to 2017:
seasons = list(range(1987, 2018))
df = pd.DataFrame()
temp_df = pd.DataFrame()
It runs through each year, appending each cell as a new row. Then, knowing the last cell is a "blank", I eliminate that last row by redefining the dataframe as df[:-1] before the loop appends the next year's data.
for i in seasons:
    df = df[:-1]
    url = 'http://nflcombineresults.com/nflcombinedata_expanded.php?year=%s&pos=&college=' % (i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    for tr in soup.table.find_all('tr'):
        row = [td.text for td in tr.find_all('td')]
        temp_df = row
        df = df.append(temp_df, ignore_index = True)
Finally, since there is no new year to append, I need to eliminate the last row myself. Then I reshape the dataframe into 16 columns, set the column names from the first row, and eliminate the repeated header rows within the dataframe.
df = df[:-1]
df = (pd.DataFrame(df.values.reshape(-1, 16)))
df.columns = df.iloc[0]
df = df[df.Name != 'Name']
I'm still learning Python, so any input, advice, or respectful constructive criticism is always welcome. Maybe there is a better, more appropriate solution?

Related

web scraping a dataframe

I'm currently trying to scrape a table (stock exchange data for a company) from a website in order to build a new dataframe in Python from this data.
I've tried to scrape the rows of the table, store them in a CSV file, and then use pandas.read_csv().
I ran into some trouble because the CSV file is not as good as I expected.
How can I get exactly the same dataframe in Python by web scraping it?
Here's my code:
from bs4 import BeautifulSoup
import urllib.request as ur
import csv
import pandas as pd
url_danone = "https://www.boursorama.com/cours/1rPBN/"
our_url = ur.urlopen(url_danone)
soup = BeautifulSoup(our_url, 'html.parser')
with open('danone.csv', 'w') as filee:
    for ligne in soup.find_all("table", {"class": "c-table c-table--generic"}):
        row = ligne.find("tr", {"class": "c-table__row"}).get_text()
        writer = csv.writer(filee)
        writer.writerow(row)
(Screenshots omitted: the table on the website and the resulting CSV file.)
You can use pd.read_html to read the required table:
import pandas as pd
url = "https://www.boursorama.com/cours/1rPBN/"
df = pd.read_html(url)[1].rename(columns={"Unnamed: 0": ""}).set_index("")
print(df)
df.to_csv("data.csv")
This prints the table and saves data.csv (LibreOffice screenshot omitted).
Please try this for loop instead:
rows = []
headers = []
# loop to get the values
for tr in soup.find_all("tr", {"class": "c-table__row"})[13:18]:
    row = [td.text.strip() for td in tr.select('td') if td.text.strip()]
    rows.append(row)
# get the header
for th in soup.find_all("th", {"class": "c-table__cell c-table__cell--head c-table__cell--dotted c-table__title / u-text-uppercase"}):
    head = th.text.strip()
    headers.append(head)
This would get your values and headers in the way you want. Note that, since the tables don't have ids or any other unique identifiers, you need to properly establish which rows you want, considering all the tables on the page (see the [13:18] slice in the code above).
You can check your content making a simple dataframe from the headers and rows as below:
# build a dataframe from the scraped rows and headers
df = pd.DataFrame(rows, columns=headers)
print(df.head())
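Since the original goal was a CSV file, the frame can then be written out directly (standard pandas, nothing specific to this page):
df.to_csv('danone.csv', index=False)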
Hope this helps.

Having trouble putting data into a pandas dataframe

I am new to coding, so take it easy on me! I recently started a pet project which scrapes data from a table and will create a csv of the data for me. I believe I have successfully pulled the data, but trying to put it into a dataframe returns the error "Shape of passed values is (31719, 1), indices imply (31719, 23)". I have tried looking at the length of my headers and my rows and those numbers are correct, but when I try to put it into a dataframe it appears that it is only pulling one column into the dataframe. Again, I am very new to all of this but would appreciate any help! Code below
from bs4 import BeautifulSoup
from pandas.core.frame import DataFrame
import requests
import pandas as pd
url = 'https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2018&month=0&season1=2018&ind=0&page=1_1500'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
#pulling table from HTML
Table1 = soup.find('table', id = 'LeaderBoard1_dg1_ctl00')
#finding and filling table columns
headers = []
for i in Table1.find_all('th'):
    title = i.text
    headers.append(title)
#finding and filling table rows
rows = []
for j in Table1.find_all('td'):
    data = j.text
    rows.append(data)
#filling dataframe
df = pd.DataFrame(rows, columns = headers)
#show dataframe
print(df)
You are creating a new dataframe of 692 rows and 23 columns. However, looking at the rows array, you only have a one-dimensional list, so the shape of the passed values does not match the indices: you are passing 692 x 1 to a dataframe that expects 692 x 23, which won't work.
If you want to create a dataframe with the data as you currently have it (a single flat column of values), you could just use:
df = pd.DataFrame(rows, columns=headers[1:2])
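If you would rather keep all 23 columns, one option is to collect the cells one list per <tr> instead of flattening every <td> into a single flat list. A rough sketch only; the FanGraphs table also contains pager and header rows, so the length check below is an assumption to verify against the page:
data = []
for tr in Table1.find_all('tr'):
    cells = [td.text for td in tr.find_all('td')]
    if len(cells) == len(headers):  # keep only rows that match the header width
        data.append(cells)
df = pd.DataFrame(data, columns=headers)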
Alternatively, you can achieve your goal directly by using pandas.read_html, which processes the data with BeautifulSoup for you:
pd.read_html(url, attrs={'id':'LeaderBoard1_dg1_ctl00'}, header=[1])[0].iloc[:-1]
attrs={'id':'LeaderBoard1_dg1_ctl00'} selects table by id
header=[1] adjusts the header, because there are multiple header rows
.iloc[:-1] removes the table footer with pagination
Example
import pandas as pd
pd.read_html('https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2018&month=0&season1=2018&ind=0&page=1_1500',
             attrs={'id':'LeaderBoard1_dg1_ctl00'},
             header=[1])[0]\
    .iloc[:-1]

How to retrieve and store 2nd and 3rd row elements in a dataframe

I am new to pandas, web scraping, and BeautifulSoup in Python.
As I was learning to do some basic web scraping in Python by using requests and BeautifulSoup to scrape a webpage, I got confused by the task of assigning the 2nd and 3rd elements of each row of an HTML table to a pandas dataframe.
Suppose I have this table (screenshot omitted; it is the "By market capitalization" table on the Wikipedia page):
Here is my code so far:
import pandas as pd
from bs4 import BeautifulSoup
import requests
html_data = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_data, 'html.parser')
data = pd.DataFrame(columns=["Name", "Market Cap (US$ Billion)"])
for row in soup.find_all('tbody')[3].find_all('tr'): # get to the third table, the "By market capitalization" table on the webpage, and find all of its rows
    col = row.find_all('td') # target the individual column values in a particular row of the table
    for j, cell in enumerate(col):
        pass # Further code here
As can be seen, I want to target the 2nd and 3rd column values of each row and append them to the empty dataframe, data, so that data contains the bank names and market cap values. How can I achieve that kind of functionality?
For tables I would suggest pandas:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_largest_banks'
tables = pd.read_html(url)
df = tables[1]
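If you then only need the bank names and market caps (the 2nd and 3rd columns the question asks about), you can slice them out by position; a small sketch, since the exact column labels depend on the current state of the Wikipedia page:
data = df.iloc[:, [1, 2]]  # 2nd and 3rd columns by position
print(data.head())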
When you prefer using beautifulsoup, you can try this to accomplish the same:
url = 'https://en.wikipedia.org/wiki/List_of_largest_banks'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser').find_all('table')
table = soup[1]
table_rows = table.find_all('tr')
table_header = [th.text.strip() for th in table_rows[0].find_all('th')]
table_data = []
for row in table_rows[1:]:
    table_data.append([td.text.strip() for td in row.find_all('td')])
df = pd.DataFrame(table_data, columns=table_header)
When needed, you can set Rank as the index with df.set_index('Rank', inplace=True). (The screenshot of the unmodified dataframe is omitted.)

How can I webscrape a Wikipedia table with lists of data instead of rows?

I am trying to get data from the Localities table on the Wikipedia page https://en.wikipedia.org/wiki/Districts_of_Warsaw.
I would like to collect this data and put it into a dataframe with two columns ["Districts"] and ["Neighbourhoods"].
My code so far looks like this:
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Districts_of_Warsaw"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "html")
table = soup.find_all('table')[2]
A = []
B = []
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 2:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
df = pd.DataFrame(A, columns=['Neighbourhood'])
df['District'] = B
print(df)
This gives the following dataframe (screenshot omitted).
Certainly, scraping the Neighbourhood column this way is not right, since the neighbourhoods are contained in lists, but I don't know how it should be done, so I would be glad for any tips.
In addition, I would appreciate any hints on why scraping gives me only 10 districts instead of 18.
Are you sure that you are scraping the right table? I understood that you need a second table with 18 districts and listed neighbourhoods.
Also, I'm not sure how you want to have districts and neighbourhoods arranged in a DataFrame, I've set districts as columns and neighbourhoods as rows. You can change it as you want.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://en.wikipedia.org/wiki/Districts_of_Warsaw"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
table = soup.find_all("table")[1]
def process_list(tr):
    result = []
    for td in tr.findAll("td"):
        result.append([x.string for x in td.findAll("li")])
    return result

districts = []
neighbourhoods = []
for row in table.findAll("tr"):
    if row.find("ul"):
        neighbourhoods.extend(process_list(row))
    else:
        districts.extend([x.string.strip() for x in row.findAll("th")])

# Check and arrange as you wish
for i in range(len(districts)):
    print(f'District {districts[i]} has neighbourhoods: {", ".join(neighbourhoods[i])}')

df = pd.DataFrame()
for i in range(len(districts)):
    df[districts[i]] = pd.Series(neighbourhoods[i])
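If you prefer the two-column layout the question asks for (one District column and one Neighbourhood column), the wide frame above can be reshaped with melt; a small sketch using standard pandas:
long_df = df.melt(var_name='District', value_name='Neighbourhood').dropna()  # drop the NaN padding added for shorter columns
print(long_df)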
Some tips:
Use element.string to get the text from an element
Use string.strip() to remove any leading (at the beginning) and trailing (at the end) characters (whitespace is the default set to remove), i.e. to clean the text
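A tiny illustration of those two tips (a throwaway snippet, not taken from the page above):
from bs4 import BeautifulSoup
cell = BeautifulSoup("<td>  Bemowo </td>", "html.parser").td
print(cell.string)          # prints '  Bemowo ' - the text node inside the tag, whitespace intact
print(cell.string.strip())  # prints 'Bemowo' - leading/trailing whitespace removed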
You can use the fact that odd rows are the Districts and even rows are the Neighbourhoods: walk the odd rows and use findNext to grab the neighbourhoods from the row below, whilst iterating the District columns within the odd rows:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from itertools import zip_longest
soup = bs(requests.get('https://en.wikipedia.org/wiki/Districts_of_Warsaw').content, 'lxml')
table = soup.select_one('h2:contains("Localities") ~ .wikitable') #isolate table of interest
results = []
for row in table.select('tr')[0::2]: # walk the odd rows
    for i in row.select('th'): # walk the districts
        r = list(zip_longest([i.text.strip()], [i.text for i in row.findNext('tr').select('li')], fillvalue=i.text.strip())) # zip the current district to the list of neighbourhoods in the row below; fill with the district name to get lists of equal length
        results.append(r)
results = [i for j in results for i in j] # flatten list of lists
df = pd.DataFrame(results, columns=['District', 'Neighbourhood'])
print(df)
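To see what the zip_longest(..., fillvalue=...) trick is doing, here is a minimal standalone example with made-up values: the one-element district list is padded with its own name until it is as long as the neighbourhood list.
from itertools import zip_longest
district = ['Bemowo']
neighbourhoods = ['Boernerowo', 'Chrzanow', 'Jelonki']
print(list(zip_longest(district, neighbourhoods, fillvalue='Bemowo')))
# [('Bemowo', 'Boernerowo'), ('Bemowo', 'Chrzanow'), ('Bemowo', 'Jelonki')]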

Web scraping golf data from ESPN. I am receiving 3 outputs of the same table and only want 1. How can I limit this?

I am new to Python and am stuck. I can't figure out how to output only one of the tables given. The output contains the desired table, but three versions of it. The first two are awfully formatted, and the last table is the one I want.
I have tried running a for loop and counting so that only the third table is printed.
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
for df in dfs:
    print(df[0:])
Just use the index to print the table you want.
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
print(dfs[2])
OR
print(dfs[-1])
Or, if you want to use a loop, then try this:
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
for df in range(len(dfs)):
    if df == 2:
        print(dfs[df])
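Another option worth knowing about is read_html's match parameter, which keeps only tables whose text matches a given string or regex. A hedged sketch, assuming 'PLAYER' appears in the leaderboard table; other tables on the page could match too, so check len(dfs):
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, match='PLAYER', header=0)
print(len(dfs))
print(dfs[0].head())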
