Python Web Scraping Not Getting the Desired Contents - python

Web scraping noob here. I am starting my journey into stock investing, so I am trying to get data from Morningstar. However, I am not getting the data I want.
Take https://www.morningstar.com/stocks/xnas/aapl/financials as an example.
These are the numbers I am interested in.
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://www.morningstar.com/stocks/xnas/aapl/financials"
page = requests.get(url)
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')
The status_code is 200, which is fine. However, the parsed page contents do not contain the numbers I want. For example, soup.find_all('Price/Book') returns no results.
I inspected the web page. The content stops at the tag I marked in red.
soup.find("div", class_="mdc-column mds-layout-grid__col stock__content-sal mds-layout-grid__col--12 mds-layout-grid__col--auto-at-1092").contents
The code above returns an empty list. It should contain something, but it does not. I don't think it has anything to do with JS, but I don't know why Beautiful Soup fails to capture it either. Any hints are appreciated.
Hope I made myself clear enough.
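One way to narrow this down is to check whether the label exists anywhere in the HTML that requests actually received. Note that soup.find_all('Price/Book') looks for a tag named Price/Book rather than for text; searching text uses the string argument. The sketch below is only an illustration: both HTML snippets are made-up stand-ins (not Morningstar's real markup), the value 1.23 is a placeholder, and label_in_static_html is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

# Made-up stand-in for what requests might receive: an empty container
# that a script would fill in later inside a browser.
STATIC_SHELL = (
    '<html><body><div class="stock__content-sal"></div>'
    '<script src="app.js"></script></body></html>'
)

# Made-up stand-in for what the browser shows after scripts have run.
RENDERED = (
    '<html><body><div class="stock__content-sal">'
    "<table><tr><td>Price/Book</td><td>1.23</td></tr></table>"
    "</div></body></html>"
)


def label_in_static_html(html, label):
    """Return True if `label` occurs in any text node of `html`."""
    soup = BeautifulSoup(html, "html.parser")
    return bool(soup.find_all(string=re.compile(re.escape(label))))


print(label_in_static_html(STATIC_SHELL, "Price/Book"))
print(label_in_static_html(RENDERED, "Price/Book"))
```

If the same check fails on page.text from the real site, the numbers are most likely added after the initial page load, and Beautiful Soup alone will not see them.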

Related

Webscraping Python BS4 issue not returning data

I am new here and have read through many of the historic posts, but I cannot find exactly what I am looking for.
I am new to web scraping and have successfully scraped data from a handful of sites.
However, I am having an issue with this code. I am trying to extract the titles of the products using Beautiful Soup, but something in the code is wrong because it is not returning the data. Any help would be appreciated:
from bs4 import BeautifulSoup
import requests
import pandas as pd
webpage = requests.get('https://groceries.asda.com/aisle/beer-wine-spirits/spirits/whisky/1215685911554-1215685911575-1215685911576')
sp = BeautifulSoup(webpage.content, 'html.parser')
title = sp.find_all('h3', class_='co-product__title')
print(title)
I assume my issue lies somewhere in the find_all call, but I cannot quite work out how to resolve it.
Regards
Milan
You could try this link; it seems to return the information you want:
from bs4 import BeautifulSoup
import requests
webpage = requests.get("https://groceries.asda.com/api/items/iconmetadata?request_origin=gi")
sp = BeautifulSoup(webpage.content, "html.parser")
print(sp)
Let me know if this helps.
Try this:
from bs4 import BeautifulSoup
import requests
import pandas as pd
webpage = requests.get('https://groceries.asda.com/aisle/beer-wine-spirits/spirits/whisky/1215685911554-1215685911575-1215685911576')
sp = BeautifulSoup(webpage.content, 'html.parser')
title = sp.find_all('h3', {'class':'co-product__title'})
print(title[0])
Also, I prefer
sp = BeautifulSoup(webpage.text, 'lxml')
Also note that find_all returns a list of all elements with that class. If you want just the first instance, use .find, i.e.:
title = sp.find('h3', {'class':'co-product__title'})
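To make the difference between the two calls concrete, here is a minimal sketch on made-up markup (the product names are placeholders, not data from the site). It assumes nothing beyond the find and find_all calls already used above.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the product names are placeholders.
html = (
    '<h3 class="co-product__title">Product A</h3>'
    '<h3 class="co-product__title">Product B</h3>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like result with every match.
all_titles = soup.find_all("h3", {"class": "co-product__title"})
# find returns only the first match.
first_title = soup.find("h3", {"class": "co-product__title"})
# find returns None when nothing matches.
missing = soup.find("h3", {"class": "no-such-class"})

print([t.get_text() for t in all_titles])
print(first_title.get_text())
print(missing)
```

When nothing matches, find_all gives an empty list, so indexing it with [0] raises an IndexError; checking the result before indexing avoids that.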
Sorry to rain on this parade, but you won't be able to scrape this data without a webdriver. Alternatively, you can call the API directly. You should research how to get the post-rendered (JavaScript-generated) HTML in Python.

Web Scraping masslottery.com using beautiful soup

I am trying to use Beautiful Soup to get every keno number that has ever come in. I mostly know how to do this, but I am running into an issue. I need to get the game numbers and the numbers that came in for each game. However, using Beautiful Soup, I cannot figure out how to access these. I'll include a screenshot of the HTML, a screenshot of what I've tried, and the link to the page I am inspecting.
html code
I am trying to access <div class="winning-number-ball-circle solid">, as you can see in that picture, but all the HTML is nested.
I've tried
soup.find_all('div',{'class':'winning-number-ball-circle solid'})
and that does not work. Does anyone know how to access the inner elements?
Here's my code:
from bs4 import BeautifulSoup
import urllib.request
mass = 'https://www.masslottery.com/tools/past-results/keno?draw_date=2021-06-17'
page = urllib.request.urlopen(mass)
soup = BeautifulSoup(page,'lxml')
div = soup.find('div',{'class':'winning-number-ball-circle solid'})
print(div)
Thanks in advance!
The data comes from a REST API call that the browser makes by running JavaScript. You need to make a request to that endpoint and then use the JSON it returns.
import requests
r = requests.get('https://www.masslottery.com/rest/keno/getDrawsByDateRange?startDate=2021-06-17&endDate=2021-06-17').json()
Thanks to @MendelG for the suggestion to use pandas for formatting:
import requests
import pandas as pd
r = requests.get('https://www.masslottery.com/rest/keno/getDrawsByDateRange?startDate=2021-06-17&endDate=2021-06-17').json()
pd.json_normalize(r['draws'])
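For readers who want plain Python structures instead of a DataFrame, the same idea can be sketched with the standard library. Only the top-level "draws" key comes from the answer above; the field names draw_id and winning_numbers, the values, and the helper draws_to_rows are invented for illustration and may not match the real response.

```python
import json

# Hypothetical payload: only the top-level "draws" key comes from the answer
# above; the field names and values inside each draw are invented.
SAMPLE_RESPONSE = json.loads("""
{
  "draws": [
    {"draw_id": 1, "winning_numbers": [3, 17, 42]},
    {"draw_id": 2, "winning_numbers": [8, 21, 64]}
  ]
}
""")


def draws_to_rows(payload):
    """Turn the payload into a list of (draw_id, numbers) tuples."""
    return [(d["draw_id"], d["winning_numbers"]) for d in payload.get("draws", [])]


rows = draws_to_rows(SAMPLE_RESPONSE)
for draw_id, numbers in rows:
    print(draw_id, numbers)
```

With the real endpoint you would pass the parsed JSON from requests in place of SAMPLE_RESPONSE and adjust the key names to whatever the response actually contains.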
divs = soup.find_all('div', {'class': ['winning-number-ball-circle', 'solid']})
for div in divs:
    print(div.text)
soup.find_all finds all the divs whose class list contains 'winning-number-ball-circle' or 'solid'.
It returns a list.
The for loop then prints the text of each of these divs.
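A small sketch on made-up markup shows how class matching behaves: passing a list of class names to find_all matches an element carrying any one of them, while a CSS selector with chained classes requires all of them. The sample HTML is invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration.
html = (
    '<div class="winning-number-ball-circle solid">7</div>'
    '<div class="solid">not a ball</div>'
    '<div class="other">skip</div>'
)
soup = BeautifulSoup(html, "html.parser")

# A list of class names matches a div carrying ANY of them.
any_class = soup.find_all("div", {"class": ["winning-number-ball-circle", "solid"]})

# A CSS selector with chained classes matches only divs carrying BOTH.
both_classes = soup.select("div.winning-number-ball-circle.solid")

print([d.text for d in any_class])
print([d.text for d in both_classes])
```

Either form can only find elements that are present in the HTML that was downloaded, so it does not help when the content is filled in by JavaScript.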

Cannot Scrape YouTube with BeautifulSoup

I am trying to do a very simple task: scraping a YouTube page.
But it keeps returning an empty list as the result.
Target: get the join date of a YouTube channel
Original link: channel page
Code:
from bs4 import BeautifulSoup
import requests
url="https://www.youtube.com/channel/"+"UCxX9wt5FWQUAAz4UrysqK9A"+"/about"
html=requests.get(url)
soup=BeautifulSoup(html.text,"lxml")
channel_name=soup.find_all(name="yt-formatted-string")
print(channel_name)
Result:
[]
Part of source code:
<yt-formatted-string class="style-scope ytd-channel-about-metadata-renderer">
Joined
Feb 25, 2016
</yt-formatted-string>
Please guide me. Why can't I get the correct answer?
Thanks.
You were really close; you just needed to make some minor changes. You do not need Selenium to do this. Below is a snippet that will do what you need. However, I wasn't too sure what particular data you were looking to scrape, so I just went for the main body. Also, if you begin to scrape more, try to get in the habit of sending headers with requests so that YouTube doesn't block your IP address. Good luck!
from bs4 import BeautifulSoup
import requests
url="https://www.youtube.com/channel/"+"UCxX9wt5FWQUAAz4UrysqK9A"+"/about"
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
for div in soup.find_all("div", class_="about-description"):
    print(div.text.replace("\n",""))
Sorry, upon reviewing your question again, it looks like you may have wanted the "joined" information. So below is a snippet for that as well.
for span in soup.find_all("span", class_="about-stat"):
    print(span.text.replace("\n",""))
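The advice about sending headers can be sketched as follows. The header values are assumptions (any realistic browser User-Agent string would do), and fetch is a hypothetical helper that is defined but not called here, so the snippet runs without network access.

```python
import requests

# Assumed header values: any realistic browser User-Agent string would do.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

url = "https://www.youtube.com/channel/" + "UCxX9wt5FWQUAAz4UrysqK9A" + "/about"

# Build the request without sending it, so the headers can be inspected.
prepared = requests.Request("GET", url, headers=HEADERS).prepare()
print(prepared.headers["User-Agent"])


def fetch(prepared_request, timeout=10):
    """Send the prepared request; call this only when you want network access."""
    with requests.Session() as session:
        return session.send(prepared_request, timeout=timeout)
```

For a one-off call, requests.get(url, headers=HEADERS) is the shorter equivalent.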

Python scraping website with flight tickets

I am trying to extract information about prices of flight tickets with a python script. Please take a look at the picture:
I would like to parse all the prices (such as "121" at the bottom of the tree). I have written a simple script, and my problem is that I am not sure how to get the right parts from the markup shown in the page's "inspect element" view. My code is below:
import urllib3
from bs4 import BeautifulSoup as BS
http = urllib3.PoolManager()
URL = "https://greatescape.co/?datesType=oneway&dateRangeType=exact&departDate=2019-08-19&origin=EAP&originType=city&continent=europe&flightType=3&city=WAW"
response = http.request('GET', URL)
soup = BS(response.data, "html.parser")
body = soup.find('body')
__next = body.find('div', {'id':'__next'})
ui_container = __next.find('div', {'class':'ui-container'})
bottom_container_root = ui_container.find('div', {'class':'bottom-container-root'})
print(bottom_container_root)
The problem is that I am stuck at the level of ui-container. bottom_container_root comes back empty, even though the element is a direct child of ui-container. Could someone please let me know how to parse this tree properly?
I have no experience in web scraping, but it happens to be one step in a bigger workflow I am building.
.find_next_siblings and .next_element can be useful for navigating through containers.
Here is some example usage:
from bs4 import BeautifulSoup
# "small.html" is a local sample file
html = open("small.html").read()
soup = BeautifulSoup(html, "html.parser")
print(soup.head.next_element)
print(soup.head.next_element.next_element)
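A self-contained variant that needs no local file may be easier to try. The markup below is made up: the class names echo the question, but the structure and text are invented, so treat it as a sketch of the navigation calls rather than of the real page.

```python
from bs4 import BeautifulSoup

# Made-up markup; the class names echo the question, the content is invented.
html = (
    '<div class="ui-container">'
    '<div class="top">A</div>'
    '<div class="middle">B</div>'
    '<div class="bottom-container-root">C</div>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

container = soup.find("div", class_="ui-container")
top = soup.find("div", class_="top")

# next_element walks the document in parse order: first child, then its text.
first_child = container.next_element
first_text = top.next_element

# find_next_siblings returns the later elements at the same nesting level.
sibling_classes = [s["class"][0] for s in top.find_next_siblings("div")]

print(first_child["class"], repr(first_text), sibling_classes)
```

These calls only navigate elements that exist in the downloaded HTML; they cannot reach content that the browser adds afterwards.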

Identifying DJIA data using Beautiful Soup

I'm fairly new to coding and am trying to write a script that pulls market data at timed intervals while running, then compares the delta between each pull and notifies the user of the change. I'm looking for simple shifts, let's say > 0.1% in any interval.
My initial approach is to run a Beautiful Soup script to obtain posted market data, using either Yahoo Finance or Barron's, as both seem to have the data available in the HTML code:
https://finance.yahoo.com/calendar
http://www.barrons.com/mdc/public/page/9_3000.html?mod=bol_mdc_topnav_9_3000
This is as far as I've gotten, and I'm not having much luck: the find function doesn't seem to return anything from the site. I'm looking for any nudge that might help me get on the right track with this.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd
url = 'https://finance.yahoo.com/calendar'
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
soup.find("span")
I would expect this to return the first span tag so I could later home in on the DJIA data:
<span class="Trsdu(0.3s) Fz(s) Mt(4px) Mb(0px) Fw(b) D(ib)" data-reactid="31">26,430.14</span>
but the script runs and returns nothing.
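Independently of where the quote comes from, the comparison step described in the question (flag any move larger than 0.1% between two pulls) can be sketched on its own. The function names are hypothetical, and the second reading is an invented value used only to exercise the code.

```python
def parse_quote(text):
    """Convert a quote string such as '26,430.14' to a float."""
    return float(text.replace(",", "").strip())


def pct_change(previous, current):
    """Percentage change from `previous` to `current`."""
    return (current - previous) / previous * 100.0


def should_notify(previous, current, threshold_pct=0.1):
    """True when the absolute change exceeds the threshold (in percent)."""
    return abs(pct_change(previous, current)) > threshold_pct


last = parse_quote("26,430.14")
now = parse_quote("26,500.00")  # invented second reading
print(should_notify(last, now))
```

In a running script, last would be updated after each pull, with a pause such as time.sleep between pulls.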
You can use the same URL that the second of your listed URLs uses to source the quote:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://quotes.wsj.com/index/DJIA?mod=mdc_uss_dtabnk')
soup = bs(r.content, 'lxml')
djia = soup.select_one('#quote_val').text
print(djia)
That source becomes clear when you inspect the network traffic of the second URL you listed and then focus on this URL:
http://www.barrons.com/mdc/public/js/9_3001_Refresh.js?
which contains the JS for refreshing that value. There you can see the source URL for the quote.
The response which contains:
