I want to use bs4 in my Flask app to search for a specific span.
I have never used bs4 before, so I'm a little confused about why my search returns no results.
import requests
from bs4 import BeautifulSoup
url = "https://www.mcfit.com/de/fitnessstudios/studiosuche/studiodetails/studio/berlin-lichtenberg/"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
spans = soup.find_all('span', {'class': 'sc-fzoXWK hnKkAN'})
print(spans)
Only one span on the page has the class 'sc-fzoXWK hnKkAN'.
When I run the code, I only get [] as the result.
That content is generated dynamically by JavaScript, so using requests to retrieve the HTML only returns the static content. You can combine BeautifulSoup with something like Selenium to achieve what you want:
Install selenium:
pip install selenium
And then retrieve the contents using the Firefox engine or any other that supports javascript:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.mcfit.com/de/fitnessstudios/studiosuche/studiodetails/studio/berlin-lichtenberg/')
html_content = driver.page_source
soup = BeautifulSoup(html_content, "lxml")
elems = soup.find_all('span', {'class': 'sc-fzoXWK hnKkAN'})
print(elems)
If you use Firefox, geckodriver needs to be accessible to your script. You can download it from https://github.com/mozilla/geckodriver/releases and put it in your PATH (or in c:/windows if you are on Windows) so it is available from everywhere.
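To see why the original search comes back empty, it can help to compare the HTML that a plain HTTP request receives with the HTML that exists after the browser has run its scripts. The sketch below is only an illustration: both HTML strings are invented stand-ins for the real page, the helper names are hypothetical, and it uses the standard-library HTMLParser instead of bs4 so that it runs without extra installs.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of one type whose class attribute includes a given class."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute is a space-separated list; it may also be missing.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_in_html(html, tag, class_name):
    """Return how many matching tags appear in the markup itself."""
    parser = ClassCounter(tag, class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Invented example: what requests would receive. The span is only created
# by the script at run time, so it never appears as markup.
STATIC_HTML = """
<html><body>
  <div id="root"></div>
  <script>
    var s = document.createElement("span");
    s.className = "sc-fzoXWK hnKkAN";
    document.getElementById("root").appendChild(s);
  </script>
</body></html>
"""

# Invented example: what a browser (and driver.page_source) would hold
# after the script has run.
RENDERED_HTML = """
<html><body>
  <div id="root"><span class="sc-fzoXWK hnKkAN">42</span></div>
</body></html>
"""

print(count_in_html(STATIC_HTML, "span", "hnKkAN"))
print(count_in_html(RENDERED_HTML, "span", "hnKkAN"))
```

If the count for the raw response is zero while the element is visible in the browser's inspector, the element is most likely being added by JavaScript, and a parser alone cannot recover it.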
I'm trying to scrape the link to the image on this Reddit page for practice, but BS4 seems to return None whenever I use find() to look up the element by its class. Any help?
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.reddit.com/r/wallpaper/comments/qswblq/the_frontier_by_me_5120x2880/")
# Keep the class name and the parsed document under different names
soup = BeautifulSoup(page.content, "html.parser")
print(soup.find(class_="ImageBox-image")['src'])
As mentioned in the comments, there is an alternative: you can use Selenium.
Unlike requests, it renders the site like a browser and gives you a page_source that you can inspect to find your element.
Example:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# Selenium 4 and later take the driver path through a Service object
driver = webdriver.Chrome(service=Service('YOUR PATH TO CHROMEDRIVER'))
driver.get('https://www.reddit.com/r/wallpaper/comments/qswblq/the_frontier_by_me_5120x2880/')
content = driver.page_source
soup = BeautifulSoup(content,'html.parser')
print(soup.find(class_="ImageBox-image")['src'])
For context, I am pretty new to Python. I am trying to use bs4 to parse some data out of https://bigfuture.collegeboard.org/college-university-search/university-of-california-los-angeles
To be exact, I want to obtain the 57% number in the "paying" section of the webpage.
My problem is that bs4 only returns the first layer of the HTML, while the data I want is deeply nested in the markup. I think it's under 17 divs.
Here is my Python code:
import requests
import bs4
url = 'https://bigfuture.collegeboard.org/college-university-search/university-of-california-los-angeles'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, "html.parser")
print(soup.find_all("div", {"id": "gwtDiv"}))
(This returns [<div class="clearfix margin60 marginBottomOnly" id="gwtDiv" style="min-height: 300px;height: 300px;height: auto;"></div>]. None of the elements inside it are shown.)
If the page uses JavaScript to render content inside the element, requests will not be able to get that content, because it is rendered on the client side in a browser. I'd recommend using ChromeDriver and Selenium along with BeautifulSoup.
You can download ChromeDriver from here: https://chromedriver.chromium.org/
Put it in the same folder from which you're running your program.
Try something like this:
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://bigfuture.collegeboard.org/college-university-search/university-of-california-los-angeles'
driver = webdriver.Chrome()
driver.get(url)
html = driver.execute_script("return document.documentElement.outerHTML")
sel_soup = BeautifulSoup(html, 'html.parser')
print(sel_soup.find_all("div", {"id": "gwtDiv"}))
I have been digging through the site for some time and I'm unable to find a solution to my issue. I'm fairly new to web scraping and am simply trying to extract some links from a web page using Beautiful Soup.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.sofascore.com/pt/futebol/2018-09-18"
page = urlopen(url).read()
soup = BeautifulSoup(page, "lxml")
print(soup)
At the most basic level, all I'm trying to do is access a specific tag within the website. I can work out the rest for myself, but the part I'm struggling with is that the tag I am looking for is not in the output.
For example, using the built-in find() I can grab the div with the following class:
class="l__grid js-page-layout"
However, what I'm actually looking for is the contents of a tag that is nested at a lower level in the tree:
js-event-list-tournament-events
When I perform the same find operation on the lower-level tag, I get no results.
I'm using an Azure-based Jupyter Notebook. I have tried a number of the solutions to similar problems on Stack Overflow, with no luck.
Thanks!
Kenny
The page uses JS to load the data dynamically, so you have to use Selenium. Check the code below.
Note that you have to install Selenium and ChromeDriver (unzip the file and copy it into your Python folder).
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.sofascore.com/pt/futebol/2018-09-18"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
# chrome_options= was the older keyword; Selenium 4 and later expect options=
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
container = soup.find_all('div', attrs={'class': 'js-event-list-tournament-events'})
print(container)
Or you can use their JSON API:
import requests
url = 'https://www.sofascore.com/football//2018-09-18/json'
r = requests.get(url)
print(r.json())
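Once a JSON endpoint responds, r.json() gives you ordinary Python dicts and lists, so no HTML parsing is needed. The layout of the real response is not shown here, so the sketch below works on an invented sample payload; the helper name find_values and every key in SAMPLE are assumptions made only for illustration. It walks an arbitrarily nested structure and collects every value stored under a given key, which is a convenient way to explore an unfamiliar response.

```python
import json


def find_values(node, key):
    """Yield every value stored under `key` anywhere in nested dicts and lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the same key may appear deeper down.
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Invented payload standing in for the result of r.json().
SAMPLE = json.loads("""
{"tournaments": [
  {"name": "League A",
   "events": [{"id": 1, "name": "X vs Y"}, {"id": 2, "name": "Z vs W"}]},
  {"name": "League B",
   "events": [{"id": 3, "name": "P vs Q"}]}
]}
""")

print(list(find_values(SAMPLE, "id")))
```

With a real response you would pass r.json() instead of SAMPLE and use whichever key names the API actually returns.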
I had the same problem and the following code worked for me. Chromedriver must be installed!
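import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
chromedriver_path = "/Users/.../chromedriver"
# Selenium 4 and later take the driver path through a Service object
driver = webdriver.Chrome(service=Service(chromedriver_path))
url = "https://yourURL.com"
driver.get(url)
time.sleep(3)  # wait 3 seconds for the page to load
page_source = driver.page_source
# BeautifulSoup was imported by name, so call it without the bs4. prefix
soup = BeautifulSoup(page_source, 'lxml')
You can use this soup as usual.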
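A fixed time.sleep(3) can be too short on a slow connection and wastes time on a fast one. One alternative is to poll the page source until the content you need shows up, giving up after a timeout. The helper below is a minimal sketch of that idea in plain Python; wait_for_marker and its parameters are hypothetical names, not part of Selenium. Selenium also ships its own explicit-wait helper, WebDriverWait, which serves the same purpose.

```python
import time


def wait_for_marker(get_source, marker, timeout=10.0, poll_interval=0.5):
    """Call get_source() repeatedly until `marker` appears in the result.

    Returns the source that contains the marker, or raises TimeoutError
    once `timeout` seconds have passed without a match.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = get_source()
        if marker in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a browser whose page fills in after a few polls.
fake_pages = iter([
    "<div id='app'></div>",
    "<div id='app'></div>",
    "<div id='app'><div class='js-event-list-tournament-events'></div></div>",
])
html = wait_for_marker(lambda: next(fake_pages),
                       "js-event-list-tournament-events",
                       timeout=2.0, poll_interval=0.01)
print(len(html))
```

With a real driver, the call would look like wait_for_marker(lambda: driver.page_source, "js-event-list-tournament-events"), and the returned string can be passed to BeautifulSoup as before.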
I am trying to fetch the download CSV file link from this: https://patents.google.com/?assignee=intel
This is my code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://patents.google.com/?assignee=intel")
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('a', class_='style-scope search-results')
soup.find_all('a', class_='style-scope')
But the last two lines return an empty list. What am I missing here?
Even this does not return anything:
soup.find(id="resultsLayout")
That's because the elements are generated by JavaScript. You can use Selenium to get the whole page source.
Here's an edited version of your code using selenium.
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://patents.google.com/?assignee=intel')
page = browser.page_source
browser.quit()
soup = BeautifulSoup(page, 'html.parser')
soup.find_all('a', class_='style-scope search-results')
soup.find_all('a', class_='style-scope')
Let me know if you need clarifications. Thanks!