So I am using BeautifulSoup4 with Python and I am trying to get a div element by its class. But this element is nested under many divs, and when I try to use find with BeautifulSoup, it just returns None. The element I'm trying to get is shown with the class "WhatIWant" in the screenshot. Here is the screenshot of the website HTML:
[Screenshot]
And this is the code I use for getting that element:
import requests
from bs4 import BeautifulSoup

page = requests.get(URL)
soup = BeautifulSoup(page.content, "lxml")
element = soup.find_all("div", {"class": "WhatIWant"})  # comes back empty; find() returns None
The data is there if you request the page with a browser-like User-Agent header:
import requests
from bs4 import BeautifulSoup
url = 'https://www.leagueofgraphs.com/summoner/tr/AvaIanche'
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.find('div', {'class':'leagueTier'}).text.strip())
Output:
Platinum I
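One difference from the question's code is the browser-like User-Agent header; a quick way to check whether that header matters for a site is to compare both requests (a small sketch using the same URL; the short 'Mozilla/5.0' string is just an illustrative stand-in for a full UA):
import requests

url = 'https://www.leagueofgraphs.com/summoner/tr/AvaIanche'
# Compare the same request with and without a browser-like User-Agent;
# a 403 or a much smaller body without the header suggests the site
# blocks the default python-requests User-Agent.
plain = requests.get(url)
browser = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
print(plain.status_code, len(plain.text))
print(browser.status_code, len(browser.text))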
Maybe the web page you request does not load that element with a simple request. Some pages render their content with JavaScript, and you can't scrape that with BS4 alone; it may be better to use Selenium.
Test it and post the response in a comment; it would also help to share the URL here.
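If you suspect JavaScript rendering, one quick check is to search the raw HTML for the class name before parsing; a minimal sketch (URL is the same placeholder as in the question's code):
import requests

page = requests.get(URL)
# If the class name never appears in the raw HTML, the element is built
# client-side by JavaScript, and requests + BeautifulSoup alone cannot
# see it; that is when a browser-driven tool like Selenium helps.
print('WhatIWant' in page.text)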
I want to get all the images within a div, but every time I try, the output returns None or just an empty list. The issue only seems to happen when I try to scrape inside a div or by class, even when using different user-agents, .find, or .find_all.
from bs4 import BeautifulSoup
import requests
abcde = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36'}
r = requests.get('https://www.gettyimages.com.br/fotos/randon', headers=abcde)
soup = BeautifulSoup(r.content, 'html.parser')
check = soup.find_all('img', class_="GalleryItems-module__searchContent___DbMmK")
print(check)
I would recommend working with the API instead; there is one at https://developers.gettyimages.com/docs/
To answer your question concerning just the images: classes are not the best identifiers because they are often dynamic, and there is also both a gallery (fixed) and a mosaic view.
Simply select the <article> elements and their child <img> to reach your goal:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.gettyimages.com.br/fotos/randon?assettype=image&sort=mostpopular&phrase=randon',
headers = {'User-Agent': 'Mozilla/5.0'}
)
soup = BeautifulSoup(r.text, 'html.parser')
for e in soup.select('article img'):
    print(e.get('src'))
Output
https://media.gettyimages.com/photos/randon-norway-picture-id974597088?k=20&m=974597088&s=612x612&w=0&h=EIwbJNzCld1tbU7rTyt42pie2yCEk5z4e6L6Z4kWhdo=
https://media.gettyimages.com/photos/caption-patrick-roanhouse-a-266-member-chats-about-some-software-on-picture-id97112678?k=20&m=97112678&s=612x612&w=0&h=zmwqIlVv2f-M9Vz_qcpITPzj-3SON99G3P69h69J5Gs=
https://media.gettyimages.com/photos/12th-and-f-streets-nw-washington-dc-pedestrians-teofila-randon-left-picture-id97102402?k=20&m=97102402&s=612x612&w=0&h=potzNUgMo3gKab5eS_pwyNggS2YGn6sCnDQYxdGUHqc=
https://media.gettyimages.com/photos/randon-perdue-kari-barnhart-attend-the-other-nashville-society-one-picture-id969787702?k=20&m=969787702&s=612x612&w=0&h=kcaYmOKruLb57Vqga68xvEZB1V12wSPPYkC6GdvXO18=
https://media.gettyimages.com/photos/death-of-duguesclin-to-chateauneuf-de-randon-july-13-1380-during-the-picture-id959538894?k=20&m=959538894&s=612x612&w=0&h=lx3DHDSf3kBc_h-O2kjR2D6UYDjPPvhn8xJ_KM0cmMc=
https://media.gettyimages.com/photos/ski-de-randone-a-saintefoy-au-dessus-du-couloir-de-la-croix-savoie-mr-picture-id945817638?k=20&m=945817638&s=612x612&w=0&h=fRd3M2KCa5dd0z8ePnPw2IkAKhXYJpuCFuUTz7jpVPU=
...
import requests
import lxml
from bs4 import BeautifulSoup
LISTINGS_URL = 'https://shorturl.at/ceoAB'
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/95.0.4638.69 Safari/537.36 ",
"Accept-Language": "en-US,en;q=0.9"
}
response = requests.get(LISTINGS_URL, headers=headers)
listings = response.text
class DataScraper:
    def __init__(self):
        self.soup = BeautifulSoup(listings, "html.parser")

    def get_links(self):
        for a in self.soup.select(".list-card-top a"):
            print(a)
        # listing_text = [link.getText() for link in links]

    def get_address(self):
        pass

    def get_prices(self):
        pass
I have used the correct CSS selectors and even tried finding the elements using attrs in find_all().
What I am trying to achieve is to parse all the anchor tags and then fetch the href links for the specific listings; however, it only returns the first 10.
You can make a GET request to this endpoint and fetch the data you need.
https://www.zillow.com/search/GetSearchPageState.htm?searchQueryState={"pagination":{"currentPage":1},"mapBounds":{"west":-123.33522421253342,"east":-121.44008261097092,"south":37.041584214606814,"north":38.39290664366326},"isMapVisible":false,"filterState":{"price":{"max":872627},"beds":{"min":1},"isForSaleForeclosure":{"value":false},"monthlyPayment":{"max":3000},"isAuction":{"value":false},"isNewConstruction":{"value":false},"isForRent":{"value":true},"isForSaleByOwner":{"value":false},"isComingSoon":{"value":false},"isForSaleByAgent":{"value":false}},"isListVisible":true,"mapZoom":9}&wants={"cat1":["listResults"]}
Change the "currentPage" parameter value in the above URL to fetch data from different pages.
Since the response is JSON, you can easily parse it and extract the information with the json module.
The website is probably using lazy loading, so you can either use something like Selenium/Puppeteer or use the website's API (the easier way). To do this, make a GET request to a URL that starts with https://www.zillow.com/search/GetSearchPageState.htm (see the dev tools in your browser), parse the JSON response, and you have your href link under cat1.searchResults.listResults[index in array].detailUrl.
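A minimal sketch of that approach, assuming the response shape described above (the searchQueryState value below is a stub; copy the full value from the GetSearchPageState.htm request in your own dev tools):
import json
import requests

# Stub query state; in practice, copy the full searchQueryState and
# wants values from the request visible in your browser's dev tools.
params = {
    "searchQueryState": json.dumps({"pagination": {"currentPage": 1}}),
    "wants": json.dumps({"cat1": ["listResults"]}),
}
headers = {"User-Agent": "Mozilla/5.0"}
data = requests.get(
    "https://www.zillow.com/search/GetSearchPageState.htm",
    params=params,
    headers=headers,
).json()

# Path taken from the answer above.
for result in data["cat1"]["searchResults"]["listResults"]:
    print(result["detailUrl"])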
I'm new to web scraping and don't understand why my code doesn't work ^^'
No matter what data I try to find on Twitch, soup.find() and soup.find_all() return None or an empty array, but when I print all the data with print(soup), it works (and I'm sure the elements I search for exist on the page), so I don't understand what's going wrong. I had the same issue on another project; it was due to my parser, "html.parser", which I replaced with "lxml" and it worked! But that doesn't help for this one. Here is my code; hopefully you can see where the problem comes from :)
import requests
from bs4 import BeautifulSoup
URL = 'https://www.twitch.tv'
headers = {
"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "lxml")
#print(soup.prettify())
offline = soup.find('a', {'class':'side-nav-card__link'})
spans = soup.find_all('a')
for span in spans:
    print(span)
print(offline)
PS: I know there is a Twitch API that I can use to do the same job, but I just want to practice web scraping. Also, if you have other web scraping project ideas, feel free to suggest them ^.^
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
url = 'https://www.amazon.com/Sony-Alpha-a6400-Mirrorless-Camera/dp/B07MV3P7M8/ref=sr_1_4?keywords=sony+alpha&qid=1581656953&s=electronics&sr=1-4'
page = requests.get(url,headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle").get_text()
price = soup.find(id="priceblock_ourprice").get_text()
print(title)
print(price)
Your code works fine, but there is a robot check before the product page, so your request looks for the span tag in that robot-check page instead, fails, and returns None.
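A quick way to confirm that is to check which page you actually received before reading the price; a small sketch reusing the soup from the question's code (the robot-check heuristic is an assumption, since the exact interstitial markup varies):
# Reuses `soup` from the code above.
title_tag = soup.find(id="productTitle")
if title_tag is None:
    # The interstitial page has no productTitle element, so None here is
    # a strong signal you were served the robot check, not the product.
    print("Likely a robot check; page title:", soup.title.get_text(strip=True))
else:
    print(title_tag.get_text(strip=True))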
Here is a link which may help you: python requests & beautifulsoup bot detection
I'm trying to scrape the left side of this news site (= SENESTE NYT):
https://www.dr.dk/nyheder/
But it seems the data isn't anywhere to be found, neither in the HTML nor in a related API/JSON response. Is it some kind of push data?
Using Chrome's Network console I've found this API, but it doesn't contain the news items on the left side:
https://www.dr.dk/tjenester/newsapp-content/teasers?reqoffset=0&reqlimit=100
Can anyone help me? How do I scrape "SENESTE NYT"?
I first loaded the page with Selenium and then processed it with BeautifulSoup.
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://www.dr.dk/nyheder"
driver = webdriver.Chrome()
driver.get(url)
page_source = driver.page_source
soup = BeautifulSoup(page_source, "lxml")
div = soup.find("div", {"class":"timeline-container"})
headlines = div.find_all("h3")
print(headlines)
And it seems to find the headlines:
[<h3>Puigdemont: Debatterede spørgsmål af interesse for hele Europa</h3>,
<h3>Afblæser tsunami-varsel for Hawaii</h3>,
<h3>56.000 flygter fra vulkan i udbrud </h3>,
<h3>Pence: USA offentliggør snart plan for ambassadeflytning </h3>,
<h3>Østjysk motorvej genåbnet </h3>]
Not sure if this is what you wanted.
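If the timeline content loads a moment after the page, an explicit wait is more robust than reading page_source right away; a sketch of that variant (assuming the same timeline-container class as above):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.dr.dk/nyheder")
# Wait up to 10 seconds for the timeline block before grabbing the HTML;
# the class name is taken from the answer above.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "timeline-container"))
)
page_source = driver.page_source
driver.quit()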
----- EDITED -----
A more efficient way would be to create the request with some custom headers (though it has already been confirmed that this does not work here):
import requests
headers = {
"Accept":"*/*",
"Host":"www.dr.dk",
"Referer":"https://www.dr.dk/nyheder",
"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
r = requests.get(url="https://www.dr.dk/tjenester/newsapp-content/teasers?reqoffset=0&reqlimit=100", headers=headers)
r.json()