Scraping from a dropdown menu using beautifulsoup

I am trying to scrape a list of dates from: https://ca.finance.yahoo.com/quote/AAPL/options
The dates are located within a drop down menu right above the option chain. I've scraped text from this website before but this text is using a 'select' & 'option' syntax. How would I adjust my code to gather this type of text? I have used many variations of the code below to try and scrape the text but am having no luck.
Thank you very much.
from bs4 import BeautifulSoup
import requests
datesLink = ('https://ca.finance.yahoo.com/quote/AAPL/options')
datesPage = requests.get(datesLink)
datesSoup = BeautifulSoup(datesPage.text, 'lxml')
datesQuote = datesSoup.find('div', {'class': 'Cf Pt(18px)controls'}).find('option').text

You can't extract this dropdown list because it is generated dynamically by JavaScript, so the option elements are not in the HTML that requests receives. The easiest way to confirm this is to save the HTML content to a file and inspect it manually in a text editor.
You can, however, parse those dates out of the script source embedded in the same HTML file, using a somewhat ugly regex. For example, this seems to work:
import re
import requests
from datetime import datetime, timezone
content = requests.get('https://ca.finance.yahoo.com/quote/AAPL/options').content.decode()
# Capture the comma-separated Unix timestamps inside the expirationDates array
match = re.search(r'"OptionContractsStore".*?"expirationDates".*?\[(.*?)\]', content)
dates = [datetime.fromtimestamp(int(x), tz=timezone.utc) for x in match.group(1).split(',')]
for d in dates:
    print(d.strftime('%Y-%m-%d'))
It should be obvious that parsing the page this way isn't foolproof and is likely to break sooner rather than later. But the same can be said of web scraping in general.
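A slightly less brittle variant of the same idea is to capture the whole bracketed array and hand it to json.loads instead of splitting on commas. The sketch below runs on a made-up fragment: sample_html, its timestamps, and the helper name extract_expiration_dates are assumptions for illustration, not real Yahoo output, and the live page layout may differ.

```python
import json
import re
from datetime import datetime, timezone

# Hypothetical fragment imitating the embedded script data; not real page output.
sample_html = (
    '<script>{"OptionContractsStore":{"meta":{},'
    '"expirationDates":[1700179200,1700784000]}}</script>'
)

def extract_expiration_dates(html):
    """Return expiration dates (UTC) found in the embedded JSON, or an empty list."""
    match = re.search(
        r'"OptionContractsStore".*?"expirationDates"\s*:\s*(\[.*?\])',
        html,
        re.DOTALL,
    )
    if match is None:
        return []
    # The captured group is a JSON array of Unix timestamps.
    timestamps = json.loads(match.group(1))
    return [datetime.fromtimestamp(ts, tz=timezone.utc) for ts in timestamps]

dates = [d.strftime('%Y-%m-%d') for d in extract_expiration_dates(sample_html)]
print(dates)
```

Returning an empty list when nothing matches avoids the AttributeError that match.group(1) would raise if the page structure changes.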

You can simply read the HTML tables directly into pandas:
import pandas as pd
URI = 'https://ca.finance.yahoo.com/quote/AAPL/options'
df = pd.read_html(URI)[0]  # or [1], depending on which table you want

Related

Can't read html table with pd.read_html

On this page: https://www.basketball-reference.com/teams/MIA/2022.html
I want to read the Shooting table.
I use this code:
import pandas as pd
url="https://www.basketball-reference.com/teams/MIA/2022.html"
pd.read_html(url,match="Shooting")
But it says: ValueError: No tables found matching pattern 'Shooting'.
If I try pd.read_html(url, match="Roster") or pd.read_html(url, match="Totals"), it finds those tables.

It's the second table that you want to read. You can simply do:
import pandas as pd
url="https://www.basketball-reference.com/teams/MIA/2022.html"
pd.read_html(url)[1]

I've discovered that the HTML commented out inside each div#all_* is the same as the actual table content. So it looks like the tables are generated from the comments using JavaScript after the page loads. Presumably it's some kind of scraping protection.
The only solution I see for now is to load the whole HTML of the page first, strip the HTML comment markers from the response text with replace, and then get the table you want using pandas:
import requests
import pandas as pd
url = "https://www.basketball-reference.com/teams/MIA/2022.html"
req = requests.get(url)
req = req.text.replace('<!--', '')
# req = req.replace('-->', '') # not necessary in this case
pd.read_html(req, match="Shooting")
Since the HTML no longer contains comments, I recommend getting the tables by index.
For Shooting - Regular Season tab:
pd.read_html(req)[15]
and for Shooting - Playoffs tab:
pd.read_html(req)[16]
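Removing every comment marker also exposes comments that have nothing to do with tables. As a rough alternative, the sketch below unwraps only those comments whose body contains a table tag. The sample_html fragment and the helper name uncomment_tables are made up for illustration; the real page markup may differ.

```python
import re

# Hypothetical page fragment: one visible table and one hidden inside a comment.
sample_html = """
<div id="all_roster"><table id="roster"><tr><td>Roster</td></tr></table></div>
<!-- just a note -->
<div id="all_shooting"><!--
<table id="shooting"><tr><td>Shooting</td></tr></table>
--></div>
"""

COMMENT_RE = re.compile(r'<!--(.*?)-->', re.DOTALL)

def uncomment_tables(html):
    """Unwrap only those HTML comments whose body contains a <table> tag."""
    def unwrap(match):
        body = match.group(1)
        # Keep unrelated comments untouched.
        return body if '<table' in body else match.group(0)
    return COMMENT_RE.sub(unwrap, html)

cleaned = uncomment_tables(sample_html)
print(cleaned.count('<!--'))               # number of comments left in place
print('<table id="shooting">' in cleaned)  # whether the hidden table is now exposed
```

The cleaned string can then be given to pd.read_html; recent pandas versions prefer it wrapped in io.StringIO rather than passed as a literal string.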

pd.read_html() isn't finding all the table tags. Only 7 are returned:
Roster, Per Game, Totals, Advanced and 3 others. Shooting is not among them, so pd.read_html(url, match="Shooting") raises an error.
import pandas as pd
url = 'https://www.basketball-reference.com/teams/MIA/2022.html'
x = pd.read_html(url)
print(len(x)) #7

How to substring with specific start and end positions where a set of characters appear?

I am trying to clean the data I scraped from their links. I have over 100 links in a CSV I'm trying to clean.
This is what a link looks like in the CSV:
"https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
I've observed that scraping this for HTML data doesn't go well and I have to get the URL present inside this.
I want to get the substring which starts with &url= and ends at &ct as that's where the real URL resides.
I've read posts like this but couldn't find one for ending str too. I've tried an approach from this using the substring package but it doesn't work for more than one character.
How do I do this? Preferably without using third party packages?

I don't fully understand the problem.
If you have a string, you can use string methods like .find() and a slice [start:end]:
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
start = text.find('url=') + len('url=')
end = text.find('&ct=')
text[start:end]
But url= and ct= may appear in a different order, so it is better to search for the first & after url=:
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
start = text.find('url=') + len('url=')
end = text.find('&', start)
text[start:end]
EDIT:
There is also the standard module urllib.parse for working with URLs, for example to split or join them.
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
import urllib.parse
url, query = urllib.parse.splitquery(text)
data = urllib.parse.parse_qs(query)
data['url'][0]
data then holds a dictionary:
{'cd': ['SldisGkopisopiasenjA6Y28Ug'],
'ct': ['ga'],
'rct': ['j'],
'sa': ['t'],
'url': ['https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428'],
'usg': ['AFQjaskdfYJkasKugowe896fsdgfsweF']}
EDIT:
Python warns that splitquery() is deprecated as of 3.8 and that code should use urlparse() instead:
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
import urllib.parse
parts = urllib.parse.urlparse(text)
data = urllib.parse.parse_qs(parts.query)
data['url'][0]
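Since the question mentions more than 100 links, the urlparse() approach can be wrapped in a small helper and applied to a whole list. This is a minimal sketch: the function name extract_target_url and the second sample link are hypothetical, and links without a url parameter are returned unchanged.

```python
from urllib.parse import urlparse, parse_qs

def extract_target_url(link, param='url'):
    """Return the value of the `param` query parameter, or the link itself if absent."""
    query = urlparse(link).query
    values = parse_qs(query).get(param)
    return values[0] if values else link

links = [
    "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF",
    "https://www.example.com/already-clean",  # hypothetical link without a redirect wrapper
]

cleaned = [extract_target_url(link) for link in links]
for url in cleaned:
    print(url)
```

The same function can be applied to each row read with the standard csv module, so no third-party package is needed.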

Using urllib with Python 3

I'm trying to write a simple application that reads the HTML from a webpage, converts it to a string, and displays certain slices of that string to the user.
However, it seems like these slices change themselves! Each time I run my code I get a different output! Here's the code.
# import urllib so we can get HTML source
from urllib.request import urlopen
# import time, so we can choose which date to read from
import time
# save HTML to a variable
content = urlopen("http://www.islamicfinder.org/prayerDetail.php?country=canada&city=Toronto&state=ON&lang")
# make HTML readable and convert HTML to a string
content = str(content.read())
# select part of the string containing the prayer time table
table = content[24885:24935]
print(table) # print to test what is being selected
I'm not sure what's going on here.

You should really be using something like Beautiful Soup. Something along the lines of the following should help. Looking at the source of that URL, the table has no id or class, which makes it a little trickier to find.
from bs4 import BeautifulSoup
import requests
url = "http://www.islamicfinder.org/prayerDetail.php?country=canada&city=Toronto&state=ON&lang"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
for table in soup.find_all('table'):
    # here you can find the table you want and deal with the results
    print(table)

You shouldn't look for the part you want by slicing the string at fixed indexes. Websites are often dynamic, so the string won't contain exactly the same content each time.
What you want to do is search for the table you want. Say the table started with the keyword class="prayer_table"; you could find it with str.find().
Better yet, extract the tables from the webpage instead of relying on str.find(). The code below is from a question on extracting tables from a webpage:
from lxml import etree
from urllib.request import urlopen
web = urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()
html = etree.HTML(s)
## Get all 'tr'
tr_nodes = html.xpath('//table[@id="Report1_dgReportDemographic"]/tr')
## 'th' is inside first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]
## Get text from rest all 'tr'
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
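The str.find() idea mentioned above can be sketched without any third-party library: locate a start marker, then the next end marker after it, and slice between them. The page string, the class name prayer_table, and the helper name slice_between are hypothetical and only illustrate the technique.

```python
def slice_between(text, start_marker, end_marker):
    """Return the text from start_marker through end_marker, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    # Search for the end marker only after the start position.
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end + len(end_marker)]

# Hypothetical page source; the class name "prayer_table" is only an example.
page = (
    '<html><body><p>intro</p>'
    '<table class="prayer_table"><tr><td>Fajr</td><td>5:12</td></tr></table>'
    '<p>footer</p></body></html>'
)

table = slice_between(page, '<table class="prayer_table"', '</table>')
print(table)
```

Unlike a fixed slice such as content[24885:24935], this keeps working when the surrounding page grows or shrinks, as long as the markers stay the same.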

Python Regex scraping data from a webpage

My idea was to explore Groupon's website to extract the URLs of the deals. The problem is that I'm trying to run findall on a Groupon page (http://www.groupon.de/alle-deals/muenchen/restaurant-296) to find data like this:
"category":"RESTAURANT1","dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330", and I'd like to get the 'deals/muenchen-special/Casa-Lavecchia/24788330'.
I tried the whole night but I'm unable to find a correct regex. I tried:
import urllib2
import re
Page_Web = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
for m in re.findall('category*RESATAURANT1*dealPermaLink*:?/*/*/*/*\d$',Page_Web):
    print m
But it doesn't print anything.

To extract the block that interests you, I would do it this way:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
soup = BeautifulSoup(html)
scriptResults = soup('script',{'type' : 'text/javascript'})
js_block = scriptResults[12]
Starting from this, you can parse it with a regex if you want, or try to interpret the JS (there are some threads on Stack Overflow about that).
Anyway, as the others said, you should use the Groupon API...
P.S.
The block you are parsing can easily be parsed as a dictionary; if you look closely, it is already a list of dictionaries.

How about changing RESATAURANT1 to RESTAURANT1, for starters?
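Beyond the typo, the pattern in the question treats * as a wildcard, but in a regex * only repeats the preceding character, and the trailing $ anchors the match to the end of the input. A minimal sketch of a pattern that does match the fragment quoted in the question is shown below; the second entry in page is made up to show that other categories are skipped.

```python
import re

# Fragment quoted in the question, plus one made-up entry from another category.
page = (
    '{"category":"RESTAURANT1","dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330",'
    '"category":"WELLNESS1","dealPermaLink":"/deals/muenchen-special/Some-Spa/11111111"}'
)

# Capture everything after the leading slash up to the closing quote.
pattern = re.compile(r'"category":"RESTAURANT1","dealPermaLink":"/([^"]+)"')
links = pattern.findall(page)
print(links)
```

Using [^"]+ for the captured part stops the match at the closing quote, so the pattern cannot run on into the next field.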

Screen scraping in LXML with python-- extract specific data

I've been trying to write a program for the last several hours that does what I thought would be an incredibly simple task:
Program asks for user input (let's say they type 'happiness')
Program queries the website thinkexist using this format ("http://thinkexist.com/search/searchQuotation.asp?search=USERINPUT")
Program returns first quote from the website.
I've tried using Xpath with lxml, but have no experience and every single construction comes back with a blank array.
The actual meat of the quote appears to be contained in the class "sqq."
If I navigate the site via Firebug and click the DOM tab, it appears the quote is in a textNode attribute "wholeText" or "textContent", but I don't know how to use that knowledge programmatically.
Any ideas?

import lxml.html
from urllib.parse import urlencode
site = 'http://thinkexist.com/search/searchquotation.asp'
userInput = input('Search for: ').strip()
url = site + '?' + urlencode({'search': userInput})
root = lxml.html.parse(url).getroot()
# select every <a> element whose class attribute is "sqq"
quotes = root.xpath('//a[@class="sqq"]')
print(quotes[0].text_content())
... and if you enter 'Shakespeare', it returns
In real life, unlike in Shakespeare, the sweetness
of the rose depends upon the name it bears. Things
are not only what they are. They are, in very important
respects, what they seem to be.
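XPath selects elements by attribute with an @ predicate, as in //a[@class="sqq"]. That predicate can be tried offline with the standard library's xml.etree.ElementTree, which is not lxml and supports only a small XPath subset. The snippet below is made-up, well-formed markup; the real page is not guaranteed to look like this.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup; the real page is not guaranteed to look like this.
snippet = """
<div>
  <a class="sqq" href="/q/1">Things are not only <b>what</b> they are.</a>
  <a class="other" href="/q/2">Not a quote link.</a>
  <a class="sqq" href="/q/3">Second quote.</a>
</div>
"""

root = ET.fromstring(snippet)
# [@class='sqq'] keeps only <a> elements whose class attribute equals "sqq".
quotes = root.findall(".//a[@class='sqq']")
# itertext() also yields text nested in child elements, similar to lxml's text_content().
first_quote = "".join(quotes[0].itertext())
print(first_quote)
```

An empty result from an expression like this usually means the attribute name or value does not match the markup the server actually returned.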

If you don't need to implement this via XPath, you can use the BeautifulSoup library like this (let the myXml variable contain the page's HTML source):
from bs4 import BeautifulSoup
soup = BeautifulSoup(myXml, 'html.parser')
for a in soup.find_all('a', {'class': 'sqq'}):
    # this is your quote
    print(a.contents)
Anyway, read the BeautifulSoup documentation; it can be very useful for scraping needs that don't require the power of XPath.

You could open the HTML source to find out the exact class you are looking for. For example, to grab the first Stack Overflow username encountered on the page you could do:
#!/usr/bin/env python
from lxml import html
url = 'http://stackoverflow.com/questions/4710307'
tree = html.parse(url)
path = '//div[@class="user-details"]/a[@href]'
print(tree.findtext(path))
# -> Parseltongue
# OR to print text including the text in children
a = tree.find(path)
print(a.text_content())
# -> Parseltongue
