Can't read HTML table with pd.read_html - Python

On this page: https://www.basketball-reference.com/teams/MIA/2022.html
I want to read the Shooting table. I use this code:
import pandas as pd
url="https://www.basketball-reference.com/teams/MIA/2022.html"
pd.read_html(url,match="Shooting")
But it raises: ValueError: No tables found matching pattern 'Shooting'.
If I try pd.read_html(url, match="Roster") or pd.read_html(url, match="Totals"), those tables are found.

It's the second table that you want to read. You can simply do:
import pandas as pd
url="https://www.basketball-reference.com/teams/MIA/2022.html"
pd.read_html(url)[1]

I've discovered that the HTML commented out inside each div#all_* is identical to the content of the actual tables. So it looks like the tables are somehow generated from those comments with JavaScript after the page loads. Obviously it's some kind of scraping protection.
Well, the only solution I see for now is to first load the whole HTML of the page, then strip the HTML comment markers from the response text with replace(), and finally get the table you want using pandas:
import requests
import pandas as pd
url = "https://www.basketball-reference.com/teams/MIA/2022.html"
req = requests.get(url)
req = req.text.replace('<!--', '')
# req = req.replace('-->', '') # not necessary in this case
pd.read_html(req, match="Shooting")
Since the HTML no longer contains comment markers, I recommend getting the tables by index.
For Shooting - Regular Season tab:
pd.read_html(req)[15]
and for Shooting - Playoffs tab:
pd.read_html(req)[16]
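If you'd rather not strip the comment markers globally, here is a sketch of a more targeted approach with BeautifulSoup: walk the HTML comments explicitly and parse only the one containing the table you want (the id="shooting" marker is my assumption; verify it against the page source):
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup, Comment
url = "https://www.basketball-reference.com/teams/MIA/2022.html"
soup = BeautifulSoup(requests.get(url).text, "lxml")
# the hidden tables live inside HTML comments; check each comment for the
# assumed id="shooting" marker and parse only that fragment
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if 'id="shooting"' in comment:
        shooting = pd.read_html(StringIO(str(comment)))[0]
        break
This avoids relying on fragile table indexes like [15] and [16].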

pd.read_html() isn't finding all the table tags; only 7 are returned: Roster, Per Game, Totals, Advanced, and 3 others. Shooting is not among them, so pd.read_html(url, match="Shooting") is going to give you an error.
import pandas as pd
url = 'https://www.basketball-reference.com/teams/MIA/2022.html'
x = pd.read_html(url)
print(len(x)) #7

Related

Is there an efficient way to optimize this web-scraping script?

Here I have a web-scraping script that uses the requests and BeautifulSoup modules to extract the movie names and ratings from https://www.imdb.com/chart/top/. I also extracted a short description of each movie from the link provided in "td.posterColumn a" in each row to form the third column. To do so, I had to create a secondary soup object and extract the summary text from it for each movie. Even though the method works and I'm able to form a table, the runtime is too long, which is understandable since a new soup object is created for each row. Could anyone suggest a faster, more efficient way to perform this operation? Also, how do I make all the rows and columns appear in their entirety in the DataFrame output? Thanks!
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time
import pdb
start_time=time.time()
response=requests.get("https://www.imdb.com/chart/top/")
soup = BeautifulSoup(response.content,"lxml")
body=soup.select("tbody.lister-list")[0]
titles=[]
ratings=[]
summ=[]
for row in body.select("tr"):
    title = row.select("td.titleColumn a")[0].get_text().strip()
    titles.append(title)
    rating = row.select("td.ratingColumn.imdbRating")[0].get_text().strip()
    ratings.append(rating)
    innerlink = row.select("td.posterColumn a")[0]["href"]
    link = "https://imdb.com" + innerlink
    #pdb.set_trace()
    response2 = requests.get(link).content
    soup2 = BeautifulSoup(response2, "lxml")
    summary = soup2.select("div.summary_text")[0].get_text().strip()
    summ.append(summary)
df=pd.DataFrame({"Title":titles,"IMDB Rating":ratings, "Movie Summary":summ})
df.to_csv("imdbmovies.csv")
end_time=time.time()
finish=end_time-start_time
print("Runtime is {f:1.4f} secs".format(f=finish))
print(df)
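One way to cut the runtime, since the per-movie requests are independent: fetch the summary pages concurrently with a thread pool. This is a sketch using the selectors from the question; IMDb's markup may have changed since:
from concurrent.futures import ThreadPoolExecutor
import requests
from bs4 import BeautifulSoup
import pandas as pd
def fetch_summary(link):
    # each worker downloads one movie page and pulls out its summary text
    soup = BeautifulSoup(requests.get(link).content, "lxml")
    return soup.select("div.summary_text")[0].get_text().strip()
response = requests.get("https://www.imdb.com/chart/top/")
soup = BeautifulSoup(response.content, "lxml")
rows = soup.select("tbody.lister-list tr")
titles = [r.select("td.titleColumn a")[0].get_text().strip() for r in rows]
ratings = [r.select("td.ratingColumn.imdbRating")[0].get_text().strip() for r in rows]
links = ["https://imdb.com" + r.select("td.posterColumn a")[0]["href"] for r in rows]
# the 250 sequential summary requests dominate the runtime; a modest pool
# of threads overlaps the network waits
with ThreadPoolExecutor(max_workers=20) as pool:
    summaries = list(pool.map(fetch_summary, links))
df = pd.DataFrame({"Title": titles, "IMDB Rating": ratings, "Movie Summary": summaries})
As for displaying everything: pd.set_option('display.max_rows', None) and pd.set_option('display.max_colwidth', None) make pandas print all rows and the full summary text.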

Scraping from a dropdown menu using beautifulsoup

I am trying to scrape a list of dates from: https://ca.finance.yahoo.com/quote/AAPL/options
The dates are located within a drop-down menu right above the option chain. I've scraped text from this website before, but this text sits inside 'select' and 'option' tags. How would I adjust my code to gather this type of text? I have tried many variations of the code below but am having no luck.
Thank you very much.
from bs4 import BeautifulSoup
import requests
datesLink = ('https://ca.finance.yahoo.com/quote/AAPL/options')
datesPage = requests.get(datesLink)
datesSoup = BeautifulSoup(datesPage.text, 'lxml')
datesQuote = datesSoup.find('div', {'class': 'Cf Pt(18px)controls'}).find('option').text
The reason you can't seem to extract this drop-down list is that it is generated dynamically; the easiest way to see this is to save your HTML content into a file and give it a manual look in a text editor.
You CAN, however, parse those dates out of the script source code in the same HTML file, using an ugly regex. For example, this seems to work:
import requests, re
from datetime import *
content = requests.get('https://ca.finance.yahoo.com/quote/AAPL/options').content.decode()
match = re.search(r'"OptionContractsStore".*?"expirationDates".*?\[(.*?)\]', content)
dates = [datetime.fromtimestamp(int(x), tz=timezone.utc) for x in match.group(1).split(',')]
for d in dates:
    print(d.strftime('%Y-%m-%d'))
It should be obvious that parsing stuff in such a nasty way isn't fool-proof and is likely to break sooner rather than later. But the same can be said about web scraping in general.
You can simply read the HTML directly into pandas:
import pandas as pd
URI = 'https://ca.finance.yahoo.com/quote/AAPL/options'
df = pd.read_html(URI)[0] #[1] depending on the table you wish for

Pandas using RegEx with a list in lxml

I am trying to scrape all the URLs from a website that meet a certain criterion. My code so far is as follows:
import pandas as pd
from urllib.request import urlopen
import lxml.html
links = []
connection = urlopen("http://www.open.ac.uk/courses/modules")
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    links.append(link)
This is getting me the URLs in a list. However, I only want the ones that end with /[some letters][3 numbers]. I have the following expression which works at www.regex101.com:
\/[a-z]*[0-9][0-9][0-9]
Ideally I would like to amend the scrape so it only returns the required information. How can I use the expression on the list to filter it?
I have found a few things that kind of answer my question but nothing that is the same as my problem.
An example of the data I am getting is
/courses/modules/tm352
/courses/modules/a332
/courses/modules/ke322
/courses/modules/e318
/postgraduate
#int-site
http://www.open.ac.uk/contact/
http://www2.open.ac.uk/tutors/help/who-to-contact
http://www.open.ac.uk/about/employment/
http://www.open.ac.uk/about/main/management/policies-and-
statements/website-accessibility-open-university
http://www.open.ac.uk/wales/cy
So of that, the first 4 lines match what I want; the rest do not.
try this:
df = pd.DataFrame(links)
df[0] = df[0].str.extract(r'(.+[a-z]+\d\d\d$)', expand=False)
df.loc[df[0].notnull()]
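For completeness, the same filtering works on the plain list without pandas; a minimal sketch using the pattern from the question:
import re
# keep only links whose path ends in letters followed by three digits
pattern = re.compile(r'/[a-z]*\d{3}$')
filtered = [link for link in links if pattern.search(link)]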

Using urllib with Python 3

I'm trying to write a simple application that reads the HTML from a webpage, converts it to a string, and displays certain slices of that string to the user.
However, it seems like these slices change themselves! Each time I run my code I get a different output! Here's the code.
# import urllib so we can get HTML source
from urllib.request import urlopen
# import time, so we can choose which date to read from
import time
# save HTML to a variable
content = urlopen("http://www.islamicfinder.org/prayerDetail.php?country=canada&city=Toronto&state=ON&lang")
# make HTML readable and covert HTML to a string
content = str(content.read())
# select part of the string containing the prayer time table
table = content[24885:24935]
print(table) # print to test what is being selected
I'm not sure what's going on here.
You should really be using something like BeautifulSoup. Something along the lines of the following should help. From looking at the source code for that URL, there is no id/class on the table, which makes it a little trickier to find.
from bs4 import BeautifulSoup
import requests
url = "http://www.islamicfinder.org/prayerDetail.php?country=canada&city=Toronto&state=ON&lang"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for table in soup.find_all('table'):
    # here you can find the table you want and deal with the results
    print(table)
You shouldn't be looking for the part you want by slicing at hard-coded indexes; websites are often dynamic, and the content won't sit at the exact same indexes each time.
What you want to do is search for the table you want. Say the table started with a marker like class="prayer_table"; you could find that with str.find().
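A minimal sketch of that idea (class="prayer_table" is hypothetical, since as noted above the real table has no such attribute):
start = content.find('class="prayer_table"')  # -1 if the marker is absent
end = content.find('</table>', start)
table = content[start:end] if start != -1 else None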
Better yet, extract the tables from the webpage instead of relying on str.find(). The code below is adapted from a question about extracting tables from a webpage:
from lxml import etree
from urllib.request import urlopen
web = urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()
html = etree.HTML(s)
## Get all 'tr'
tr_nodes = html.xpath('//table[@id="Report1_dgReportDemographic"]/tr')
## 'th' is inside the first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]
## Get text from all remaining 'tr'
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
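From there the header and rows can be combined, for example into a DataFrame (assuming every row has as many cells as the header):
import pandas as pd
df = pd.DataFrame(td_content, columns=header)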

Python Regex scraping data from a webpage

My idea was to explore Groupon's website to extract the URLs of the deals. The problem is that I'm trying to do a findall on the Groupon page (http://www.groupon.de/alle-deals/muenchen/restaurant-296) to find data like this:
"category":"RESTAURANT1","dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330", and I'd like to get 'deals/muenchen-special/Casa-Lavecchia/24788330'.
I tried the whole night but I'm unable to find a correct regex. I tried:
import urllib2
import re
Page_Web = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
for m in re.findall('category*RESATAURANT1*dealPermaLink*:?/*/*/*/*\d$',Page_Web):
    print m
But it doesn't print anything.
In order to extract the block that interests you, I would do it this way:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
soup = BeautifulSoup(html)
scriptResults = soup('script',{'type' : 'text/javascript'})
js_block = scriptResults[12]
Starting from this you can parse with a regex if you want, or try to interpret the JS (there are some threads on Stack Overflow about that).
Anyway, like the others said, you should use the Groupon API...
P.S.
The block that you are parsing can easily be parsed as a dictionary; it's already a list of dictionaries if you look closely...
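If all you need are the permalinks, a targeted regex against the raw HTML also works; a sketch based on the snippet quoted in the question:
import re
# capture the path from each quoted "dealPermaLink" entry
for m in re.findall(r'"dealPermaLink":"(/deals/[^"]+)"', html):
    print(m)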
How about changing RESATAURANT1 to RESTAURANT1, for starters?
