How to find class inside nested div using BeautifulSoup python - python

I'm working on a web crawler, and I can't get an element by its class when it sits inside a div on a particular website. Below is my code. I'm using BeautifulSoup in Python 3.
import requests
from bs4 import BeautifulSoup as bs
response = requests.get('https://e27.co/startup/flipkart').text
soup = bs(response, 'html.parser')
content_div = soup.findAll('h1',class_ = 'profile-startup')
print(content_div)
I want to extract the text inside the h1 that has class "profile-startup". The code above returns nothing. Can you help me?

This website populates its data using JavaScript. If you look at the contents of response, you will see that there is no h1. Check whether the site has an API you can use to retrieve the information you need, or consider using a browser automation tool like Selenium: http://selenium-python.readthedocs.io/installation.html#introduction
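A quick way to confirm this before reaching for Selenium is to check whether the tag exists in the raw HTML at all. The sketch below uses only the standard library; `TagClassFinder`, `is_in_static_html`, and the two sample strings are hypothetical stand-ins for illustration, not the real e27.co markup.

```python
from html.parser import HTMLParser


class TagClassFinder(HTMLParser):
    """Records whether a given tag carrying a given class appears in the HTML."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several tokens
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.found = True


def is_in_static_html(html, tag, css_class):
    """Return True if <tag class="... css_class ..."> occurs in the raw HTML."""
    finder = TagClassFinder(tag, css_class)
    finder.feed(html)
    return finder.found


# Hypothetical samples standing in for what requests.get(...).text might return
js_shell = '<html><body><div id="app"></div><script src="bundle.js"></script></body></html>'
static_page = '<html><body><h1 class="profile-startup">Flipkart</h1></body></html>'

print(is_in_static_html(js_shell, "h1", "profile-startup"))     # False
print(is_in_static_html(static_page, "h1", "profile-startup"))  # True
```

If the check comes back False for the real response text, the element is most likely added by JavaScript after the page loads, and no parser will find it in that response.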

Related

Scraping webpage using Python and Requests Package

I am looking to scrape a certain number from a website.
When inspecting in chrome I see the following div I want to pull:
<div class="sc-18nh1jk-0 bTfoun css-1p6fq9y">2472.38</div>
This class name looks weird to me. Here is the simple code I use to try to pull the '2472.38' number:
from lxml import html
import requests
r = requests.get('MYWEBSITE')
tree = html.fromstring(r.content)
CurrentPrice = tree.xpath('//div[@class="sc-18nh1jk-0 bTfoun css-1p6fq9y"]')
print(CurrentPrice)
output is: []
Any suggestions? Thanks ahead of time!
It would have been nice if you had provided the website's URL, but I think the page you are trying to scrape uses generated class names, which means the class is dynamic and may change.

Web scraping youtube page

I'm trying to get the title of a YouTube video given a link.
But I'm unable to access the element that holds the title. I'm using bs4 to parse the HTML.
I noticed I'm unable to access any element inside the 'ytd-app' tag in the YouTube page.
import bs4
import requests
listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
a = soup.find_all(attrs={"class": "style-scope ytd-video-primary-info-renderer"})
print(a)
So how can I get the video title? Is there something I'm doing wrong, or did YouTube intentionally create a tag like this to prevent web scraping?
The class you are using is rendered through JavaScript and all the content is dynamic, so it is very difficult to find any data using bs4.
What you can do is inspect the soup manually and find the particular tag that holds the data you need.
You can also try pytube.
import bs4
import requests
listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
print(soup.find("title").get_text())

Web Scraping masslottery.com using beautiful soup

I am trying to use Beautiful Soup to get all the keno numbers that have ever come in. I mostly know how to do this, but I am running into an issue. I need to get the game numbers and the numbers that came in for each game. However, using Beautiful Soup, I cannot figure out how to access these. I'll include a screenshot of the HTML, a screenshot of what I've tried, and the link to the page I am inspecting.
html code
I am trying to access <div class="winning-number-ball-circle solid">, as you can see in that picture, but all the HTML is nested.
I've tried
soup.find_all('div',{'class':'winning-number-ball-circle solid'})
and that does not work. Does anyone know how to access the inner elements?
Here's my code:
from bs4 import BeautifulSoup
import urllib.request
mass = 'https://www.masslottery.com/tools/past-results/keno?draw_date=2021-06-17'
page = urllib.request.urlopen(mass)
soup = BeautifulSoup(page,'lxml')
div = soup.find('div',{'class':'winning-number-ball-circle solid'})
print(div)
Thanks in advance!
The data comes from a REST API call that the browser makes by running JavaScript. You need to make a request to that endpoint yourself and then use the JSON it returns.
import requests
r = requests.get('https://www.masslottery.com/rest/keno/getDrawsByDateRange?startDate=2021-06-17&endDate=2021-06-17').json()
Thanks to @MendelG for the suggestion to use pandas for formatting:
import requests
import pandas as pd
r = requests.get('https://www.masslottery.com/rest/keno/getDrawsByDateRange?startDate=2021-06-17&endDate=2021-06-17').json()
print(pd.json_normalize(r['draws']))
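Since the goal is to collect more than one day of results, the same endpoint can be wrapped in a small helper that takes the dates as arguments. This is only a sketch: `build_draws_url` and `fetch_draws` are hypothetical names, the endpoint, its `startDate`/`endDate` parameters, and the `draws` key are taken from the answer above, and whether the API accepts long date ranges in a single call is an untested assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the answer above
BASE_URL = "https://www.masslottery.com/rest/keno/getDrawsByDateRange"


def build_draws_url(start_date, end_date):
    """Build the query URL for draws between two dates (YYYY-MM-DD)."""
    query = urlencode({"startDate": start_date, "endDate": end_date})
    return f"{BASE_URL}?{query}"


def fetch_draws(start_date, end_date):
    """Fetch the JSON payload and return the list stored under 'draws'."""
    with urlopen(build_draws_url(start_date, end_date)) as response:
        payload = json.load(response)
    return payload["draws"]


# Only builds the URL; call fetch_draws(...) to actually hit the network
print(build_draws_url("2021-06-17", "2021-06-17"))
```

Calling `fetch_draws("2021-06-17", "2021-06-17")` would perform the actual request; the fields inside each draw are not documented here, so inspect one entry before relying on any particular key.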
divs = soup.findAll('div',{'class':['winning-number-ball-circle','solid']})
for div in divs:
    print(div.text)
soup.findAll returns a list of all the divs whose class includes 'winning-number-ball-circle' or 'solid' (passing a list matches any of the listed classes, not all of them).
The for loop then prints the text of each of these divs.
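To see how class matching behaves, here is a small sketch on a made-up HTML snippet (the sample markup is an assumption, not the real masslottery.com page). It compares passing a list of classes to find_all with a CSS selector that requires both classes on the same element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
SAMPLE_HTML = """
<div class="winning-number-ball-circle solid">7</div>
<div class="winning-number-ball-circle">12</div>
<div class="solid">not a ball</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A list of classes matches divs that carry ANY of the listed classes
any_match = [
    d.get_text()
    for d in soup.find_all("div", {"class": ["winning-number-ball-circle", "solid"]})
]

# A CSS selector with chained classes matches divs that carry BOTH classes
both_match = [d.get_text() for d in soup.select("div.winning-number-ball-circle.solid")]

print(any_match)   # ['7', '12', 'not a ball']
print(both_match)  # ['7']
```

Either form only helps when the divs are present in the HTML that was fetched; if the page builds them with JavaScript, both lists will be empty.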

Web scraping google flight prices

I am trying to learn to use the Python library BeautifulSoup, and I would like to, for example, scrape the price of a flight on Google Flights.
So I opened Google Flights, for example at this link, and I want to get the cheapest flight price.
So I want to get the value inside the div with the class "gws-flights-results__itinerary-price" (as in the figure).
Here is the simple code I wrote:
from bs4 import BeautifulSoup
import urllib.request
url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
div = soup.find('div', attrs={'class': 'gws-flights-results__itinerary-price'})
But the resulting div is None.
I also tried
find_all('div')
but the div I was interested in was not among the divs found this way.
Can someone help me?
It looks like JavaScript needs to run, so use a tool like Selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
driver = webdriver.Chrome()
driver.get(url)
# find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector, which newer Selenium 4 releases removed
print(driver.find_element(By.CSS_SELECTOR, '.gws-flights-results__cheapest-price').text)
driver.quit()
It's great that you are learning web scraping! The reason you are getting None as a result is that the website you are scraping loads its content dynamically. When the page is fetched with a plain HTTP request, the response contains only JavaScript, and the div with the class "gws-flights-results__itinerary-price" has not been rendered yet. So the approach you are using cannot scrape this website.
However, you can use other tools, such as Selenium or Splash, to render the JavaScript and then parse the content.
BeautifulSoup is a great tool for extracting parts of HTML or XML, but here it looks like you only need to find the URL of another GET request that returns a JSON object.
(I am not by a computer now, can update with an example tomorrow.)

I want to get all links from a certain webpage using python

I want to be able to pull all URLs from the following webpage using Python: https://yeezysupply.com/pages/all. I tried some other suggestions I found, but they didn't seem to work with this particular website. I would end up not finding any URLs at all.
import urllib.request
import lxml.html
connection = urllib.request.urlopen('https://yeezysupply.com/pages/all')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    print(link)
Perhaps it would be useful to make use of modules specifically designed for this. Here's a quick and dirty script that gets the relative links on the page:
#!/usr/bin/python3
import requests, bs4
res = requests.get('https://yeezysupply.com/pages/all')
soup = bs4.BeautifulSoup(res.text,'html.parser')
links = soup.find_all('a')
for link in links:
    print(link.attrs['href'])
it generates output like this:
/pages/jewelry
/pages/clothing
/pages/footwear
/pages/all
/cart
/products/womens-boucle-dress-bleach/?back=%2Fpages%2Fall
/products/double-sleeve-sweatshirt-bleach/?back=%2Fpages%2Fall
/products/boxy-fit-zip-up-hoodie-light-sand/?back=%2Fpages%2Fall
/products/womens-boucle-skirt-cream/?back=%2Fpages%2Fall
etc...
Is this what you are looking for? requests and Beautiful Soup are amazing tools for scraping.
There are no links in the page source; they are inserted using JavaScript after the page is loaded in the browser.
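The links printed by that script are relative. If full URLs are needed, a minimal sketch using the standard library's urljoin could look like this; the `relative_links` list simply reuses a few entries from the output shown above.

```python
from urllib.parse import urljoin

# Page the links were scraped from
PAGE_URL = "https://yeezysupply.com/pages/all"

# A few of the relative links from the output above
relative_links = [
    "/pages/jewelry",
    "/cart",
    "/products/womens-boucle-skirt-cream/?back=%2Fpages%2Fall",
]

# urljoin resolves each href against the page URL
absolute_links = [urljoin(PAGE_URL, href) for href in relative_links]

for url in absolute_links:
    print(url)
```

urljoin also leaves already absolute hrefs untouched, so it is safe to apply to every link on a page.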
