Web scraping with Python BS4 not returning data - python

I am new here and have read through many of the older posts, but cannot find exactly what I am looking for.
I am new to web scraping and have successfully scraped data from a handful of sites.
However, I am having an issue with this code. I am trying to extract the titles of the products using Beautiful Soup, but something in the code is not returning the data. Any help would be appreciated:
from bs4 import BeautifulSoup
import requests
import pandas as pd

# Fetch the category page and parse the returned HTML
webpage = requests.get('https://groceries.asda.com/aisle/beer-wine-spirits/spirits/whisky/1215685911554-1215685911575-1215685911576')
sp = BeautifulSoup(webpage.content, 'html.parser')

# Collect every product title heading
title = sp.find_all('h3', class_='co-product__title')
print(title)
I assume my issue lies somewhere in the find_all function, but I cannot quite work out how to resolve it.
Regards
Milan

You could try this link instead; it seems to return the information you want:
from bs4 import BeautifulSoup
import requests
webpage = requests.get("https://groceries.asda.com/api/items/iconmetadata?request_origin=gi")
sp = BeautifulSoup(webpage.content, "html.parser")
print(sp)
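Since that endpoint appears to serve JSON rather than HTML, here is a small sketch that skips BeautifulSoup and decodes the response directly (the payload structure is not shown here, so inspect it before relying on any field names):

import requests

# Hedged sketch: the endpoint above appears to serve JSON, so requests can
# decode it directly instead of treating it as HTML.
resp = requests.get("https://groceries.asda.com/api/items/iconmetadata?request_origin=gi")
data = resp.json()
print(type(data))  # inspect the top-level structure before drilling into fields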
Let me know if this helps.

Try this:
from bs4 import BeautifulSoup
import requests
import pandas as pd
webpage = requests.get('https://groceries.asda.com/aisle/beer-wine-spirits/spirits/whisky/1215685911554-1215685911575-1215685911576')
sp = BeautifulSoup(webpage.content, 'html.parser')
title = sp.find_all('h3', {'class':'co-product__title'})
print(title[0])
Also, I prefer:
sp = BeautifulSoup(webpage.text, 'lxml')
Also note that this will return a list with all elements of that class. If you want just the first instance, use .find, i.e.:
title = sp.find('h3', {'class':'co-product__title'})
Sorry to rain on this parade, but you won't be able to scrape this data without a webdriver, or you can call the API directly. You should research how to get JS-rendered content in Python.
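For the webdriver route, a minimal sketch using Selenium (this assumes a Chrome driver is available; the selector is reused from the question, and a real script may need an explicit wait for the products to render):

from bs4 import BeautifulSoup
from selenium import webdriver

# Render the page in a real browser so its JavaScript runs, then hand the
# finished HTML to BeautifulSoup.
driver = webdriver.Chrome()
driver.get('https://groceries.asda.com/aisle/beer-wine-spirits/spirits/whisky/'
           '1215685911554-1215685911575-1215685911576')
sp = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for title in sp.find_all('h3', class_='co-product__title'):
    print(title.get_text(strip=True))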

Related

Web Scraping masslottery.com using beautiful soup

I am trying to use Beautiful Soup to get all the keno numbers that ever came in. I mostly know how to do this, but I am running into an issue: I need the game numbers and the numbers that came in for each game, and I cannot figure out how to access them with Beautiful Soup. I'll include a screenshot of the HTML, a screenshot of what I've tried, and the link to the page I am inspecting.
[screenshot of the HTML]
I am trying to access <div class="winning-number-ball-circle solid"> as you can see in that picture, but all the HTML is nested.
I've tried
soup.find_all('div',{'class':'winning-number-ball-circle solid'})
and that does not work. Does anyone know how to access the inner elements?
Here's my code:
from bs4 import BeautifulSoup
import urllib.request

mass = 'https://www.masslottery.com/tools/past-results/keno?draw_date=2021-06-17'
page = urllib.request.urlopen(mass)
soup = BeautifulSoup(page, 'lxml')

# Note: the second argument must be a dict ({'class': ...}), not a set
div = soup.find('div', {'class': 'winning-number-ball-circle solid'})
print(div)
Thanks in advance!
The data comes from a REST API call that the browser makes by running JavaScript. You need to make a request to that endpoint and then use the JSON returned:
import requests
r = requests.get('https://www.masslottery.com/rest/keno/getDrawsByDateRange?startDate=2021-06-17&endDate=2021-06-17').json()
Thanks to @MendelG for the suggestion to use pandas for formatting:
import requests
import pandas as pd
r = requests.get('https://www.masslottery.com/rest/keno/getDrawsByDateRange?startDate=2021-06-17&endDate=2021-06-17').json()
pd.json_normalize(r['draws'])
divs = soup.findAll('div', {'class': ['winning-number-ball-circle', 'solid']})
for div in divs:
    print(div.text)
With soup.findAll you find all the divs with the class names 'winning-number-ball-circle' and 'solid'; this returns a list. The for loop then prints the text of each of these divs.
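One caveat worth knowing: passing a list of classes this way matches divs that have either class, not necessarily both. To require both classes on the same tag, a CSS selector is safer:

# Matches only <div> elements carrying both classes, in any order
for div in soup.select('div.winning-number-ball-circle.solid'):
    print(div.text)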

Python Web Scraping Not Getting the Desired Contents

Web scraping noob here. I am starting my journey into stock investing, so I am trying to get data from Morningstar. However, I am not getting the data I want.
Take https://www.morningstar.com/stocks/xnas/aapl/financials as an example.
These are the numbers I am interested in:
[screenshot of the figures of interest]
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://www.morningstar.com/stocks/xnas/aapl/financials"
page = requests.get(url)
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')
The status_code is 200, which is fine. However, the parsed page content does not contain the numbers I want. For example, no result is returned for soup.find_all('Price/Book').
I inspected the web page. The content stops at the tag I marked in red.
soup.find("div", class_="mdc-column mds-layout-grid__col stock__content-sal mds-layout-grid__col--12 mds-layout-grid__col--auto-at-1092").contents
The code above returns an empty list. It should contain something but does not. I don't think it has anything to do with JS, but I don't know why Beautiful Soup fails to capture it either. Any hints are appreciated.
Hope I made myself clear enough.
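A quick way to test whether those numbers are in the served HTML at all is to search the raw response body for a value visible in the browser. If it is absent, the page fills it in with JavaScript after load, and requests alone will never see it. A sketch building on the question's page object:

# If this prints False for text visible in the browser, the content is
# rendered client-side by JavaScript and is not present in the raw HTML.
print("Price/Book" in page.text)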

Scraping pdfs from a webpage

I would like to download all financial reports for a given company from the Danish company register (the CVR register). An example could be Chr. Hansen Holding in the link below:
https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da
Specifically, I would like to download all the PDFs under the tab "Regnskaber" (= financial reports). I do not have previous experience with web scraping in Python. I tried using BeautifulSoup, but given my non-existent experience, I cannot find the correct way to search through the response.
Below is what I tried, but no data is printed (i.e. it did not find any PDFs).
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

web_page = "https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da"
response = requests.get(web_page)
soup = BeautifulSoup(response.text)
soup.findAll('accordion-toggle')
for link in soup.select("a[href$='.pdf']"):
    print(link['href'].split('/')[-1])
All help and guidance will be much appreciated.
You should use select instead of findAll:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

web_page = "https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da"
response = requests.get(web_page)
soup = BeautifulSoup(response.text, 'lxml')

# Scope the search to the financial-reports accordion, then grab the PDF links
pdfs = soup.select('div[id="accordion-Regnskaber-og-nogletal"] a[data-type="PDF"]')
for link in pdfs:
    print(link['href'].split('/')[-1])
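To actually save the files rather than just print their names, a sketch building on the answer above (urljoin resolves hrefs that are relative to the page):

import requests
from urllib.parse import urljoin

for link in pdfs:
    pdf_url = urljoin(web_page, link['href'])   # handle relative links
    filename = pdf_url.split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(requests.get(pdf_url).content)  # write the raw PDF bytes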

A beginner trying to web scrape using Python and BeautifulSoup

I'm a complete beginner at all of coding.
I need to scrape a list of high school football players who won all-state awards from the site below.
I dived into the problem and was led towards Python and Beautiful Soup for web scraping.
I came up with the following code, but I am having difficulty figuring out how to get just the player information.
I am getting a bunch of titles, links, and ads, but not the information I want.
Any tips would be greatly appreciated. This is what I came up with so far. Be kind.
import urllib
import urllib.request
from bs4 import BeautifulSoup

theurl = "https://cumberlink.com/sports/high-school/football/pa-football-writers-all-state-team-class-a-a-and/article_4d286757-a501-5b5b-b3be-cfebc06ef455.html"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")
print(soup.title.text)

"""
for link in soup.findAll('p'):
    print(link.get('href'))
    print(link.text)
"""

print(soup.find('div', {"class": "subscriber-only"}))
Also, can anyone help me understand how to export it into an Excel file where I can have it automatically go into a chart format? I.e. (Player, Position, School, Height, Weight, Year, Award, etc.)
Basically, you don't need urllib, since Python already has a great module for this: requests.
If you use print(soup.title.text), it will give you the title of the page.
Here is the correct way to loop over your specific div by its class:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://cumberlink.com/sports/high-school/football/pa-football-writers-all-state-team-class-a-a-and/article_4d286757-a501-5b5b-b3be-cfebc06ef455.html').text
soup = BeautifulSoup(r, 'html.parser')

for item in soup.findAll('div', {"class": "subscriber-only"}):
    print(item.text)
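On the Excel part of the question: pandas can write rows straight to .xlsx via DataFrame.to_excel (it needs the openpyxl package installed). The rows below are hypothetical; the real ones would come from splitting item.text according to however the article formats each player line:

import pandas as pd

# Hypothetical rows; in practice, parse these out of item.text above
rows = [
    {"Player": "John Doe", "Position": "QB", "School": "Example High"},
    {"Player": "Jim Roe", "Position": "RB", "School": "Sample Prep"},
]

df = pd.DataFrame(rows)
df.to_excel("all_state_players.xlsx", index=False)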

Identifying DJIA data using Beautiful Soup

I'm fairly new to coding and am trying to write a script that pulls market data at timed intervals while running, then compares the delta between each pull and notifies the user of the change - looking for simple shifts, let's say >0.1% in any interval.
My initial approach is to run a Beautiful Soup script to obtain posted market data, using either Yahoo Finance or Barron's, as both seem to have the data available in the HTML code:
https://finance.yahoo.com/calendar
http://www.barrons.com/mdc/public/page/9_3000.html?mod=bol_mdc_topnav_9_3000
This is as far as I've gotten, and I'm not having much luck; the find function doesn't seem to be returning anything from the site. I'm looking for any nudge that might help me get on the right track with this:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd

url = 'https://finance.yahoo.com/calendar'
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
print(soup.find("span"))  # in a script, the result must be printed to be seen
I would expect this to return the first span tag so I could later hone in on the DJIA data:
<span class="Trsdu(0.3s) Fz(s) Mt(4px) Mb(0px) Fw(b) D(ib)" data-reactid="31">26,430.14</span>
but the script runs and returns nothing.
You can use the same URL that the second of your listed URLs uses to source the quote:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://quotes.wsj.com/index/DJIA?mod=mdc_uss_dtabnk')
soup = bs(r.content, 'lxml')
djia = soup.select_one('#quote_val').text
print(djia)
That is clear when you inspect the network traffic of the second URL you listed and focus on this request:
http://www.barrons.com/mdc/public/js/9_3001_Refresh.js?
which has the JS for refreshing that value. There you can see the source URL listed for the quote, and the response contains the value itself.

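Circling back to the stated goal (poll at intervals and flag moves over 0.1%), a rough sketch built around that quote endpoint; it assumes the #quote_val element keeps existing and that the site tolerates polling:

import time

import requests
from bs4 import BeautifulSoup

def get_djia():
    r = requests.get('https://quotes.wsj.com/index/DJIA?mod=mdc_uss_dtabnk')
    soup = BeautifulSoup(r.content, 'lxml')
    # e.g. '26,430.14' -> 26430.14
    return float(soup.select_one('#quote_val').text.replace(',', ''))

last = get_djia()
while True:
    time.sleep(60)  # poll once a minute
    current = get_djia()
    change = (current - last) / last
    if abs(change) > 0.001:  # the 0.1% threshold from the question
        print(f"DJIA moved {change:+.2%}: {last} -> {current}")
    last = current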