A beginner trying to web scrape using Python and BeautifulSoup

I'm a complete beginner to coding.
I need to scrape a list of high school football players who won all-state awards from that site.
I dived into the problem and was led towards Python and Beautiful Soup for web scraping.
I came up with the following code, but I am having difficulty figuring out how to get just the player information.
I am getting a bunch of titles, links, and ads, but not the information I want.
Any tips would be greatly appreciated. This is what I have come up with so far. Be kind.
import urllib
import urllib.request
from bs4 import BeautifulSoup
theurl = "https://cumberlink.com/sports/high-school/football/pa-football-writers-all-state-team-class-a-a-and/article_4d286757-a501-5b5b-b3be-cfebc06ef455.html"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")
print(soup.title.text)
""""""
for link in soup.findAll('p'):
    print(link.get('href'))
    print(link.text)
""""""
print(soup.find('div', {"class":"subscriber-only"}))
Also, could anyone help me understand how to import the data into an Excel file so that it automatically goes into a table format, i.e. with columns such as Player, Position, School, Height, Weight, Year, Award, etc.?

You don't need urllib here. The third-party requests library is a simpler way to fetch the page.
print(soup.title.text) only gives you the title of the page.
Here is how to loop over the specific div by its class:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://cumberlink.com/sports/high-school/football/pa-football-writers-all-state-team-class-a-a-and/article_4d286757-a501-5b5b-b3be-cfebc06ef455.html').text
soup = BeautifulSoup(r, 'html.parser')
for item in soup.find_all('div', {"class": "subscriber-only"}):
    print(item.text)
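For the Excel part of the question, one option is to write the scraped lines to a CSV file, which Excel opens as a table. The sketch below is only an illustration. It assumes each player appears on its own line as comma-separated fields in the order Player, Position, School, Height, Weight, Year, which may not match the real page. The helper names rows_from_lines and write_csv and the sample lines are made up.

```python
import csv
import io

# Column names taken from the question; the field order is an assumption.
COLUMNS = ["Player", "Position", "School", "Height", "Weight", "Year"]


def rows_from_lines(lines):
    """Split comma-separated lines into rows, skipping lines that do not fit."""
    rows = []
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == len(COLUMNS):
            rows.append(fields)
    return rows


def write_csv(rows, handle):
    """Write a header row followed by the player rows to an open text file."""
    writer = csv.writer(handle)
    writer.writerow(COLUMNS)
    writer.writerows(rows)


# Made-up sample text standing in for item.text.splitlines() from the loop above.
sample_lines = [
    "OFFENSE",
    "Jane Doe, QB, Example High, 6-1, 190, Sr.",
    "John Roe, RB, Sample Prep, 5-10, 185, Jr.",
]

buffer = io.StringIO()
write_csv(rows_from_lines(sample_lines), buffer)
print(buffer.getvalue())
```

To produce a real file, open one with open("players.csv", "w", newline="") and pass it to write_csv in place of the in-memory buffer.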

Related

Webscraping Python BS4 issue not returning data

I am new here and have read through many of the older posts, but I cannot find exactly what I am looking for.
I am new to web scraping and have successfully scraped data from a handful of sites.
However, I am having an issue with this code. I am trying to extract the product titles using Beautiful Soup, but something in the code is wrong because it does not return the data. Any help would be appreciated:
from bs4 import BeautifulSoup
import requests
import pandas as pd
webpage = requests.get('https://groceries.asda.com/aisle/beer-wine-spirits/spirits/whisky/1215685911554-1215685911575-1215685911576')
sp = BeautifulSoup(webpage.content, 'html.parser')
title = sp.find_all('h3', class_='co-product__title')
print(title)
I assume my issue lies somewhere in the find_all call, but I cannot quite work out how to resolve it.
Regards
Milan

You could try this link instead. It seems to return the information you want:
from bs4 import BeautifulSoup
import requests
webpage = requests.get("https://groceries.asda.com/api/items/iconmetadata?request_origin=gi")
sp = BeautifulSoup(webpage.content, "html.parser")
print(sp)
Let me know if this helps.

Try this:
from bs4 import BeautifulSoup
import requests
import pandas as pd
webpage = requests.get('https://groceries.asda.com/aisle/beer-wine-spirits/spirits/whisky/1215685911554-1215685911575-1215685911576')
sp = BeautifulSoup(webpage.content, 'html.parser')
title = sp.find_all('h3', {'class':'co-product__title'})
print(title[0])
I also prefer the lxml parser:
sp = BeautifulSoup(webpage.text, 'lxml')
Also note that this will return a list with all elements of that class. If you want just the first instance, use .find ie:
title = sp.find('h3', {'class':'co-product__title'})
Sorry to rain on this parade, but you won't be able to scrape this data with requests alone, because the page renders it with JavaScript. You need either a webdriver or a direct call to the site's API. It is worth researching how to get JavaScript-rendered content in Python.
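The point about JavaScript can be illustrated without touching the network. In this minimal sketch the HTML string is invented to mimic a page whose product titles are added by a script after load. Only the class name co-product__title comes from the question.

```python
from bs4 import BeautifulSoup

# Invented stand-in for what requests receives from a JavaScript-driven page:
# an empty container plus a script that would fill it in inside a browser.
static_html = """
<html><body>
  <div id="root"></div>
  <script>
    document.getElementById("root").innerHTML =
      "<h3 class='co-product__title'>Example Whisky</h3>";
  </script>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")

# BeautifulSoup parses markup but never runs scripts, so no h3 element exists.
titles = soup.find_all("h3", class_="co-product__title")
print(titles)

# The class name is still present as plain text inside the script body.
print("co-product__title" in static_html)
```

This is a plausible reason why a search of the raw text can find a string that find_all cannot reach as an element.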

Web Scraping masslottery.com using beautiful soup

I am trying to use Beautiful Soup to get all the Keno numbers that have ever been drawn. I mostly know how to do this, but I am running into an issue. I need the game numbers and the numbers drawn for each game, but I cannot figure out how to access them with Beautiful Soup. I'll include a screenshot of the HTML, a screenshot of what I've tried, and the link to the page I am inspecting.
[screenshot: html code]
I am trying to access <div class="winning-number-ball-circle solid">, as you can see in that picture, but all the HTML is nested.
I've tried
soup.find_all('div',{'class':'winning-number-ball-circle solid'})
and that does not work. Does anyone know how to access the inner elements?
Here's my code:
from bs4 import BeautifulSoup
import urllib.request
mass = 'https://www.masslottery.com/tools/past-results/keno?draw_date=2021-06-17'
page = urllib.request.urlopen(mass)
soup = BeautifulSoup(page,'lxml')
div = soup.find('div',{'class':'winning-number-ball-circle solid'})
print(div)
Thanks in advance!

The data comes from a REST API call that the browser makes by running JavaScript. You need to request that endpoint yourself and then use the JSON it returns:
import requests
r = requests.get('https://www.masslottery.com/rest/keno/getDrawsByDateRange?startDate=2021-06-17&endDate=2021-06-17').json()
Thanks to @MendelG for the suggestion to use pandas for formatting:
import requests
import pandas as pd
r = requests.get('https://www.masslottery.com/rest/keno/getDrawsByDateRange?startDate=2021-06-17&endDate=2021-06-17').json()
pd.json_normalize(r['draws'])
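If pandas is not available, the same JSON can be walked with plain Python. The payload below is invented for illustration. Only the top-level draws key comes from the answer above. The field names draw_id and winning_numbers are hypothetical and should be replaced with whatever the real response contains.

```python
import json

# Invented sample shaped like a response with a top-level "draws" list.
sample_response = json.loads("""
{
  "draws": [
    {"draw_id": 1001, "winning_numbers": [3, 17, 42]},
    {"draw_id": 1002, "winning_numbers": [8, 21, 65]}
  ]
}
""")


def numbers_by_game(payload):
    """Map each (hypothetical) game identifier to its list of drawn numbers."""
    return {draw["draw_id"]: draw["winning_numbers"] for draw in payload["draws"]}


for game, numbers in numbers_by_game(sample_response).items():
    print(game, numbers)
```

Printing one real draw first, for example r['draws'][0], is a quick way to see which keys are actually present.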

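divs = soup.find_all('div',{'class':['winning-number-ball-circle','solid']})
for div in divs:
    print(div.text)
find_all returns a list of every div whose class attribute includes 'winning-number-ball-circle' or 'solid'. A list of class names matches any one of them, not all of them.
To require both classes, a CSS selector such as soup.select('div.winning-number-ball-circle.solid') can be used instead.
The for loop then prints the text of each of these divs.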
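How Beautiful Soup treats multiple class names is easy to check on a small, made-up HTML snippet. This sketch compares a list of class names passed to find_all, which matches a div carrying any one of them, with the chained CSS selector form, which requires both.

```python
from bs4 import BeautifulSoup

# Made-up markup: one div with both classes, one with only "solid".
html = """
<div class="winning-number-ball-circle solid">7</div>
<div class="solid">not a ball</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A list of class names matches elements that have at least one of them.
any_match = soup.find_all("div", {"class": ["winning-number-ball-circle", "solid"]})

# A CSS selector with chained classes matches only elements that have both.
both_match = soup.select("div.winning-number-ball-circle.solid")

print([div.text for div in any_match])
print([div.text for div in both_match])
```

Either form only helps when the divs are present in the downloaded HTML in the first place.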

Python BeautifulSoup trouble extracting titles from a page with JS

I'm having some serious issues trying to extract the titles from a webpage. I've done this before on other sites, but this one seems to be a problem because of the JavaScript.
The test link is "https://www.thomasnet.com/products/adhesives-393009-1.html"
The first title I want extracted is "Toagosei America, Inc."
Here is my code:
import requests
from bs4 import BeautifulSoup
url = ("https://www.thomasnet.com/products/adhesives-393009-1.html")
r = requests.get(url).content
soup = BeautifulSoup(r, "html.parser")
print(soup.get_text())
If I run it like this, with get_text, I can find the titles in the result. However, as soon as I change it to find_all or find, the titles are lost. I can't locate them using the web browser's inspect tool, because it's all generated by JavaScript.
Any advice would be greatly appreciated.

You have to specify what to find. In this case, search for <h2> to get the first title:
import requests
from bs4 import BeautifulSoup
url = 'https://www.thomasnet.com/products/adhesives-393009-1.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
first_title = soup.find('h2')
print(first_title.text)
Prints:
Toagosei America, Inc.

Cannot Scrape Youtube with BeautifulSoup

I am trying to do the very simple task of scraping a YouTube page,
but it keeps returning an empty list as the result.
Target: get the join date of a YouTube channel
Original link: channel page
Code:
from bs4 import BeautifulSoup
import requests
url="https://www.youtube.com/channel/"+"UCxX9wt5FWQUAAz4UrysqK9A"+"/about"
html=requests.get(url)
soup=BeautifulSoup(html.text,"lxml")
channel_name=soup.find_all(name="yt-formatted-string")
print(channel_name)
Result:
[]
Part of source code:
<yt-formatted-string class="style-scope ytd-channel-about-metadata-renderer">
Joined
Feb 25, 2016
</yt-formatted-string>
Please guide me. Why can't I get the correct result?
Thanks.

You were really close; you just needed to make some minor changes. You do not need Selenium to do this. Below is a snippet that will do what you need. I wasn't sure which particular data you were looking to scrape, so I just went for the main body. Also, if you begin to scrape more, try to get in the habit of sending headers with requests so that YouTube doesn't block your IP. Good luck!
from bs4 import BeautifulSoup
import requests
url="https://www.youtube.com/channel/"+"UCxX9wt5FWQUAAz4UrysqK9A"+"/about"
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
for div in soup.find_all("div", class_="about-description"):
    print(div.text.replace("\n",""))
Sorry, upon reviewing your question again, it looks like you may have wanted the "Joined" information. Below is a snippet for that as well.
for span in soup.find_all("span", class_="about-stat"):
    print(span.text.replace("\n",""))

Identifying DJIA data using Beautiful Soup

I'm fairly new to coding and am trying to write a script that pulls market data at timed intervals while running, compares the delta between each pull, and notifies the user of the change. I'm looking for simple shifts, say greater than 0.1% in any interval.
My initial approach is to run a Beautiful Soup script to obtain posted market data, using either Yahoo Finance or Barron's, as both seem to have the data available in the HTML code:
https://finance.yahoo.com/calendar
http://www.barrons.com/mdc/public/page/9_3000.html?mod=bol_mdc_topnav_9_3000
This is as far as I've gotten, and I'm not having much luck: the find function doesn't seem to return anything from the site. I'm looking for any nudge that might help me get on the right track.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd
url = 'https://finance.yahoo.com/calendar'
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
soup.find("span")
I would expect this to return the first span tag, so that I could later home in on the DJIA data:
<span class="Trsdu(0.3s) Fz(s) Mt(4px) Mb(0px) Fw(b) D(ib)" data-reactid="31">26,430.14</span>
but the script runs and returns nothing.

You can request the same URL that the second page in your list (the Barron's one) uses to source the quote:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://quotes.wsj.com/index/DJIA?mod=mdc_uss_dtabnk')
soup = bs(r.content, 'lxml')
djia = soup.select_one('#quote_val').text
print(djia)
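The comparison step from the question can be handled separately from the scraping. The sketch below assumes the quote arrives as text such as 26,430.14 (the value quoted in the question) and uses the 0.1% threshold mentioned there. The function names and the second sample value are made up for illustration.

```python
def parse_quote(text):
    """Convert a quote string such as '26,430.14' to a float."""
    return float(text.replace(",", "").strip())


def percent_change(previous, current):
    """Return the change from previous to current as a percentage."""
    return (current - previous) / previous * 100


def should_notify(previous, current, threshold=0.1):
    """True when the absolute move between two pulls exceeds the threshold (%)."""
    return abs(percent_change(previous, current)) > threshold


previous = parse_quote("26,430.14")
current = parse_quote("26,460.00")  # made-up second reading
print(round(percent_change(previous, current), 3), should_notify(previous, current))
```

In a running script, the two readings would come from successive pulls, for example inside a loop that calls time.sleep between requests.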
The quotes.wsj.com URL used above was found by inspecting the network traffic of the Barron's page listed in the question and then focusing on this request:
http://www.barrons.com/mdc/public/js/9_3001_Refresh.js?
which holds the JavaScript for refreshing that value. There you can see the source URL listed for the quote.
The response which contains:
