So I am using BeautifulSoup4 with Python and I am trying to get a div element by its class. But this element is nested under many divs, and when I try to use find with BeautifulSoup, it just returns None. The element I'm trying to get is shown with the class "WhatIWant" in the screenshot. Here is the screenshot of the website HTML:
[Screenshot]
And this is the code I use for getting that element:
import requests
from bs4 import BeautifulSoup

page = requests.get(URL)
soup = BeautifulSoup(page.content, "lxml")
element = soup.find_all("div", {"class": "WhatIWant"})  # comes back empty; find() returns None
The data is there if you request the page with a browser-like User-Agent header:
import requests
from bs4 import BeautifulSoup
url = 'https://www.leagueofgraphs.com/summoner/tr/AvaIanche'
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.find('div', {'class':'leagueTier'}).text.strip())
Output:
Platinum I
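One difference from the question's code is the browser-like User-Agent header; a quick way to check whether that header matters for a site is to compare both requests (a small sketch using the same URL; the short 'Mozilla/5.0' string is just an illustrative stand-in for a full UA):
import requests

url = 'https://www.leagueofgraphs.com/summoner/tr/AvaIanche'
# Compare the same request with and without a browser-like User-Agent;
# a 403 or a much smaller body without the header suggests the site
# blocks the default python-requests User-Agent.
plain = requests.get(url)
browser = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
print(plain.status_code, len(plain.text))
print(browser.status_code, len(browser.text))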
Maybe the web page you request does not load that element with a simple request. Some pages render their content with JavaScript, and you can't scrape that with BS4 alone; it may be better to use Selenium.
Test it and post the response in a comment; it would also help to share the URL here.
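If you suspect JavaScript rendering, one quick check is to search the raw HTML for the class name before parsing; a minimal sketch (URL is the same placeholder as in the question's code):
import requests

page = requests.get(URL)
# If the class name never appears in the raw HTML, the element is built
# client-side by JavaScript, and requests + BeautifulSoup alone cannot
# see it; that is when a browser-driven tool like Selenium helps.
print('WhatIWant' in page.text)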
I want to get all the images within a div, but every time I try, the output returns None or just an empty list. The issue only seems to happen when I try to scrape inside a div or by class, even when using different user-agents, .find, or .find_all.
from bs4 import BeautifulSoup
import requests
abcde = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36'}
r = requests.get('https://www.gettyimages.com.br/fotos/randon', headers=abcde)
soup = BeautifulSoup(r.content, 'html.parser')
check = soup.find_all('img', class_="GalleryItems-module__searchContent___DbMmK")
print(check)
I would recommend working with the API instead; there is one at https://developers.gettyimages.com/docs/
To answer your question concerning just the images: classes are not the best identifiers because they are often dynamic, and there is also both a gallery (fixed) and a mosaic view.
Simply select the <article> elements and their child <img> to reach your goal:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.gettyimages.com.br/fotos/randon?assettype=image&sort=mostpopular&phrase=randon',
headers = {'User-Agent': 'Mozilla/5.0'}
)
soup = BeautifulSoup(r.text, 'html.parser')
for e in soup.select('article img'):
    print(e.get('src'))
Output
https://media.gettyimages.com/photos/randon-norway-picture-id974597088?k=20&m=974597088&s=612x612&w=0&h=EIwbJNzCld1tbU7rTyt42pie2yCEk5z4e6L6Z4kWhdo=
https://media.gettyimages.com/photos/caption-patrick-roanhouse-a-266-member-chats-about-some-software-on-picture-id97112678?k=20&m=97112678&s=612x612&w=0&h=zmwqIlVv2f-M9Vz_qcpITPzj-3SON99G3P69h69J5Gs=
https://media.gettyimages.com/photos/12th-and-f-streets-nw-washington-dc-pedestrians-teofila-randon-left-picture-id97102402?k=20&m=97102402&s=612x612&w=0&h=potzNUgMo3gKab5eS_pwyNggS2YGn6sCnDQYxdGUHqc=
https://media.gettyimages.com/photos/randon-perdue-kari-barnhart-attend-the-other-nashville-society-one-picture-id969787702?k=20&m=969787702&s=612x612&w=0&h=kcaYmOKruLb57Vqga68xvEZB1V12wSPPYkC6GdvXO18=
https://media.gettyimages.com/photos/death-of-duguesclin-to-chateauneuf-de-randon-july-13-1380-during-the-picture-id959538894?k=20&m=959538894&s=612x612&w=0&h=lx3DHDSf3kBc_h-O2kjR2D6UYDjPPvhn8xJ_KM0cmMc=
https://media.gettyimages.com/photos/ski-de-randone-a-saintefoy-au-dessus-du-couloir-de-la-croix-savoie-mr-picture-id945817638?k=20&m=945817638&s=612x612&w=0&h=fRd3M2KCa5dd0z8ePnPw2IkAKhXYJpuCFuUTz7jpVPU=
...
import requests
import lxml
from bs4 import BeautifulSoup
LISTINGS_URL = 'https://shorturl.at/ceoAB'
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/95.0.4638.69 Safari/537.36 ",
"Accept-Language": "en-US,en;q=0.9"
}
response = requests.get(LISTINGS_URL, headers=headers)
listings = response.text
class DataScraper:
    def __init__(self):
        self.soup = BeautifulSoup(listings, "html.parser")

    def get_links(self):
        for a in self.soup.select(".list-card-top a"):
            print(a)
        # listing_text = [link.getText() for link in links]

    def get_address(self):
        pass

    def get_prices(self):
        pass
I have used the correct CSS selectors and even tried finding the elements using attrs in find_all().
What I am trying to achieve is to parse all the anchor tags and then fetch the href links for the specific listings; however, it only returns the first 10.
You can make a GET request to this endpoint and fetch the data you need.
https://www.zillow.com/search/GetSearchPageState.htm?searchQueryState={"pagination":{"currentPage":1},"mapBounds":{"west":-123.33522421253342,"east":-121.44008261097092,"south":37.041584214606814,"north":38.39290664366326},"isMapVisible":false,"filterState":{"price":{"max":872627},"beds":{"min":1},"isForSaleForeclosure":{"value":false},"monthlyPayment":{"max":3000},"isAuction":{"value":false},"isNewConstruction":{"value":false},"isForRent":{"value":true},"isForSaleByOwner":{"value":false},"isComingSoon":{"value":false},"isForSaleByAgent":{"value":false}},"isListVisible":true,"mapZoom":9}&wants={"cat1":["listResults"]}
Change the "currentPage" parameter value in the above URL to fetch data from different pages.
Since the response is JSON, you can easily parse it and extract the information with the json module.
The website is probably using lazy loading, so you can either use something like Selenium/Puppeteer or use the website's API (the easier way). To do this, make a GET request to a URL that starts with https://www.zillow.com/search/GetSearchPageState.htm (see the dev tools in your browser), parse the JSON response, and you have your href link under cat1.searchResults.listResults[index in array].detailUrl.
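A minimal sketch of that approach, assuming the response shape described above (the searchQueryState value below is a stub; copy the full value from the GetSearchPageState.htm request in your own dev tools):
import json
import requests

# Stub query state; in practice, copy the full searchQueryState and
# wants values from the request visible in your browser's dev tools.
params = {
    "searchQueryState": json.dumps({"pagination": {"currentPage": 1}}),
    "wants": json.dumps({"cat1": ["listResults"]}),
}
headers = {"User-Agent": "Mozilla/5.0"}
data = requests.get(
    "https://www.zillow.com/search/GetSearchPageState.htm",
    params=params,
    headers=headers,
).json()

# Path taken from the answer above.
for result in data["cat1"]["searchResults"]["listResults"]:
    print(result["detailUrl"])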
I'm new to web scraping and don't understand why my code doesn't work ^^'
No matter what data I try to find on Twitch, soup.find() and soup.find_all() return None or an empty array, but when I print all the data with print(soup), it works (and I'm sure the elements I search for exist on the page), so I don't understand what's going wrong. I had the same issue on another project; it was due to my parser, "html.parser", which I replaced with "lxml" and it worked! But that doesn't help for this one. Here is my code; hopefully you can see where the problem comes from :)
import requests
from bs4 import BeautifulSoup
URL = 'https://www.twitch.tv'
headers = {
"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "lxml")
#print(soup.prettify())
offline = soup.find('a', {'class':'side-nav-card__link'})
spans = soup.find_all('a')
for span in spans:
    print(span)
print(offline)
PS: I know there is a Twitch API that I can use to do the same job, but I just want to practice web scraping. Also, if you have other web scraping project ideas, feel free to suggest them ^.^
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
url = 'https://www.amazon.com/Sony-Alpha-a6400-Mirrorless-Camera/dp/B07MV3P7M8/ref=sr_1_4?keywords=sony+alpha&qid=1581656953&s=electronics&sr=1-4'
page = requests.get(url,headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle").get_text()
price = soup.find(id="priceblock_ourprice").get_text()
print(title)
print(price)
Your code works fine, but there is a robot check before the product page, so your request looks for the span tag in that robot-check page instead, fails, and returns None.
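A quick way to confirm that is to check which page you actually received before reading the price; a small sketch reusing the soup from the question's code (the robot-check heuristic is an assumption, since the exact interstitial markup varies):
# Reuses `soup` from the code above.
title_tag = soup.find(id="productTitle")
if title_tag is None:
    # The interstitial page has no productTitle element, so None here is
    # a strong signal you were served the robot check, not the product.
    print("Likely a robot check; page title:", soup.title.get_text(strip=True))
else:
    print(title_tag.get_text(strip=True))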
Here is a link which may help you: python requests & beautifulsoup bot detection
I'm trying to scrape the left side of this news site (= SENESTE NYT):
https://www.dr.dk/nyheder/
But it seems the data isn't anywhere to be found, neither in the HTML nor in a related API/JSON response. Is it some kind of push data?
Using Chrome's Network console I've found this API, but it doesn't contain the news items on the left side:
https://www.dr.dk/tjenester/newsapp-content/teasers?reqoffset=0&reqlimit=100
Can anyone help me? How do I scrape "SENESTE NYT"?
I first loaded the page with Selenium and then processed it with BeautifulSoup.
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://www.dr.dk/nyheder"
driver = webdriver.Chrome()
driver.get(url)
page_source = driver.page_source
soup = BeautifulSoup(page_source, "lxml")
div = soup.find("div", {"class":"timeline-container"})
headlines = div.find_all("h3")
print(headlines)
And it seems to find the headlines:
[<h3>Puigdemont: Debatterede spørgsmål af interesse for hele Europa</h3>,
<h3>Afblæser tsunami-varsel for Hawaii</h3>,
<h3>56.000 flygter fra vulkan i udbrud </h3>,
<h3>Pence: USA offentliggør snart plan for ambassadeflytning </h3>,
<h3>Østjysk motorvej genåbnet </h3>]
Not sure if this is what you wanted.
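If the timeline content loads a moment after the page, an explicit wait is more robust than reading page_source right away; a sketch of that variant (assuming the same timeline-container class as above):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.dr.dk/nyheder")
# Wait up to 10 seconds for the timeline block before grabbing the HTML;
# the class name is taken from the answer above.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "timeline-container"))
)
page_source = driver.page_source
driver.quit()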
----- EDITED -----
A more efficient way would be to create the request with some custom headers (though it has already been confirmed that this does not work here):
import requests
headers = {
"Accept":"*/*",
"Host":"www.dr.dk",
"Referer":"https://www.dr.dk/nyheder",
"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
r = requests.get(url="https://www.dr.dk/tjenester/newsapp-content/teasers?reqoffset=0&reqlimit=100", headers=headers)
r.json()