I'm trying to scrape information from some URLs using Requests and BeautifulSoup in Python, but some sites only return a partial HTML response that is missing the content of the page.
This is the code that is not working:
import requests
from bs4 import BeautifulSoup
url = "http://www.exampleurl.com"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
Here is the incomplete response I get (screenshot omitted).
I tried using Selenium with the Chrome WebDriver instead, but ended up with the same issue.
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)
browser.get(url)
html = browser.page_source
Any ideas?
What happens
You do not get the expected HTML because it is inside an iframe.
Get the src of the iframe with soup.find('iframe')['src'] and request that URL instead.
Example
import requests
from bs4 import BeautifulSoup

url = "http://www.ingenieur-jobs.de/jobangebote/3075/"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# the actual content lives in an iframe; fetch its src separately
iframe = requests.get(soup.find('iframe')['src'])
soup = BeautifulSoup(iframe.content, 'html.parser')
print(soup)
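One caveat: soup.find('iframe')['src'] may return a relative URL, which requests cannot fetch on its own. A small variant of the same example that guards against that (and against pages with no iframe at all):
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

url = "http://www.ingenieur-jobs.de/jobangebote/3075/"
page = BeautifulSoup(requests.get(url).content, 'html.parser')

iframe_tag = page.find('iframe')
if iframe_tag is not None:
    # urljoin turns a relative src like "/embed/..." into an absolute URL
    iframe = requests.get(urljoin(url, iframe_tag['src']))
    soup = BeautifulSoup(iframe.content, 'html.parser')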
Related
I have to get <a class="last" aria-label="Last Page" href="https://webtoon-tr.com/webtoon/page/122/">Son »</a>.
From this site: https://webtoon-tr.com/webtoon/
But when I try to scrape it with this code:
from bs4 import BeautifulSoup
import requests
url = "https://webtoon-tr.com/webtoon/"
html = requests.get(url).content
soup = BeautifulSoup(html,"html.parser")
last = soup.find_all("a",{"class":"last"})
print(last)
It just returns an empty list, and when I try to scrape all "a" tags it only returns 2, which are completely different elements.
Can somebody help me with this? I'd really appreciate it.
Try using the requests_html library.
from bs4 import BeautifulSoup
import requests_html

url = "https://webtoon-tr.com/webtoon/"
# HTMLSession sends a browser-like User-Agent by default
s = requests_html.HTMLSession()
html = s.get(url)
soup = BeautifulSoup(html.content, "lxml")
last = soup.find_all("a", {"class": "last"})
print(last)
[<a aria-label="Last Page" class="last" href="https://webtoon-tr.com/webtoon/page/122/">Son »</a>]
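Part of why requests_html can succeed where plain requests fails is that it sends a browser-like User-Agent by default. If that is all the site checks, plain requests with an explicit header may be enough; a minimal sketch, assuming the block is purely User-Agent based:
import requests
from bs4 import BeautifulSoup

# assumption: the site only rejects the default "python-requests/x.y" User-Agent
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
html = requests.get("https://webtoon-tr.com/webtoon/", headers=headers)
soup = BeautifulSoup(html.content, "lxml")
print(soup.find_all("a", {"class": "last"}))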
The website is protected by Cloudflare. requests, cloudscraper, and requests_html didn't work for me; only Selenium did:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")
webdriver_service = Service("chromedriver/chromedriver")  # path to where you saved the chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
browser.get("https://webtoon-tr.com/webtoon/")
soup = BeautifulSoup(browser.page_source, 'html5lib')
browser.quit()
link = soup.select_one('a.last')
print(link)
This returns
<a aria-label="Last Page" class="last" href="https://webtoon-tr.com/webtoon/page/122/">Son »</a>
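If the Cloudflare challenge takes a moment to clear, reading page_source immediately can still return the interstitial page. An explicit wait before the parse is sturdier; a sketch to drop in before the page_source line above, waiting for the very link we want:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# wait up to 15 seconds for the pagination link to appear in the DOM
WebDriverWait(browser, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "a.last"))
)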
I'm very new to the web-scraping world, and I'm trying to scrape the names of shoes from a website. When I use Inspect on the website, there's a div tag that has basically the entire webpage inside it, but when I print out the HTML, the div tag is completely empty! Here's my current code:
from bs4 import BeautifulSoup
import requests

def findShoeNames():
    html_file = requests.get('https://www.goat.com/sneakers/brand/air-jordan').text
    soup = BeautifulSoup(html_file, 'lxml')
    print(soup)

if __name__ == "__main__":
    findShoeNames()
When I call my function and print(soup), the div tag looks like this:
<div id="root"></div>
But as previously mentioned, when I hit inspect on the website, this div tag has basically the entire webpage inside it. So I'm unable to scrape any data from the website.
Please help! Thanks
The website uses JavaScript to load its content, so you should use Selenium with ChromeDriver.
Install Selenium (pip install selenium), then download ChromeDriver, unzip it, and copy the binary into your Python folder (or anywhere on your PATH).
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://www.goat.com/sneakers/brand/air-jordan"

options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')

driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(1)  # give the page's JavaScript a moment to render
page = driver.page_source
driver.quit()

soup = BeautifulSoup(page, 'lxml')
print(soup.prettify())  # prettify is a method, so it needs the parentheses
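A fixed time.sleep(1) is fragile on slow connections. Since the question shows the page shipping an empty <div id="root"></div> that JavaScript then fills in, one sturdier option is to wait until that div actually has children before grabbing page_source; a sketch to use in place of the sleep:
from selenium.webdriver.support.ui import WebDriverWait

# wait up to 10 seconds for the client-side render to populate #root
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script("return document.querySelector('#root').children.length > 0")
)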
So I want to scrape all the dates from Clash of Stats. There are multiple pages, and when you turn the page, the URL does not change. How do I scrape all the dates on which the player has joined a new clan?
The website:
https://www.clashofstats.com/players/pink-panther-VL029CJ2/history/log
My code now:
from emoji import UNICODE_EMOJI
import requests
from bs4 import BeautifulSoup

name = "Pink Panther"
link = f'https://www.clashofstats.com/players/{name}-{str("#VL029CJ2").replace("#", "")}/history/log'
link = link.replace("#", "%2523")
link = link.replace("#", "%2540")
link = link.replace(" ", "-")
print(link)

# strip any emoji from the player-name part of the URL
for i in name:
    if i in UNICODE_EMOJI:
        link = link.replace(i, "")

page = requests.get(link)
soup = BeautifulSoup(page.content, 'html.parser')
dates = soup.find_all(class_="start date")
print(dates)
You should use Selenium:
pip install selenium
and download ChromeDriver, for example.
Then the code will look something like this:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException

options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome('chromedriver.exe', options=options)

link = 'https://www.clashofstats.com/players/pink-panther-VL029CJ2/history/log'
driver.get(link)

while True:
    try:
        # parse the page Selenium is currently showing; the URL never
        # changes, so requests.get(link) would always fetch the first page
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        dates = soup.find_all(class_="start date")
        print(dates)

        # locate the next-page control (fill in the real locator):
        next_page_link = driver.find_element_by_xpath('path_to_element')
        # next_page_link = driver.find_element_by_class_name('class_name')
        next_page_link.click()
        print("Navigating to Next Page")
    except (TimeoutException, WebDriverException):
        print("Last page reached")
        break

driver.quit()
But you need to find the XPath to the element, or locate the element via its class name.
Locate elements in Selenium
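For instance, if the next-page control turned out to be a link with class next (a hypothetical locator; check the real markup in DevTools), either of these would fill in the placeholder above:
# hypothetical locators: inspect the pagination controls for the real values
next_page_link = driver.find_element_by_css_selector('a.next')
# or: next_page_link = driver.find_element_by_class_name('next')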
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome(executable_path="C:/Users/USER/Desktop/chromedriver.exe")
url = ""
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
print(soup)
When I try to use Selenium, "window.onload=function(){process();}Please enable JavaScript and Cookies in your browser." is shown. How can I enable JavaScript?
Try enabling JavaScript:
options = webdriver.ChromeOptions()
options.add_argument("--enable-javascript")
I would like to scrape this website: https://www.projets-environnement.gouv.fr/pages/home/
More precisely, I would like to collect the table in the div with id = table-wrapper.
My problem is that I can't catch it with BeautifulSoup.
Here is my code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.projets-environnement.gouv.fr/pages/home/'
html = requests.get(url).text
soup = BeautifulSoup(html, "html5lib")
div_table = soup.findAll('div', id_='table-wrapper')
But div_table comes back empty.
Is Selenium the solution?
I think you should use Selenium:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

url = 'https://www.projets-environnement.gouv.fr/pages/home/'

options = Options()
options.headless = True
driver = webdriver.Firefox(options=options)
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
mytable = soup.find('div', id='table-wrapper')
and you get that table.
The correct way to call it is:
soup.find("div", {"id": "table-wrapper"})
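Equivalently, since id (unlike class) doesn't collide with a Python keyword, BeautifulSoup accepts it directly as a keyword argument:
div_table = soup.find('div', id='table-wrapper')  # same element, shorter call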