I'm very new to the web-scraping world, and I'm trying to scrape the names of shoes from a website. When I use inspect on the website, there's a div tag that has basically the entire webpage inside it, but when I print out the html code, the div tag is completely empty! Here's my current code:
from bs4 import BeautifulSoup
import requests
import time
def findShoeNames():
html_file = requests.get('https://www.goat.com/sneakers/brand/air-jordan').text
soup = BeautifulSoup(html_file, 'lxml')
print(soup)
if __name__ == "__main__":
findShoeNames()
When I call my function and print(soup), the div tag looks like this:
<div id="root"></div>
But as previously mentioned, when I hit inspect on the website, this div tag has basically the entire webpage inside it. So I'm unable to scrape any data from the website.
Please help! Thanks
website use js to load. so you should use selenium and chromedriver.
install selenium
install chromedriver from here (unzip and copy your python folder)
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.goat.com/sneakers/brand/air-jordan"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(chrome_options=options)
driver.get(url)
time.sleep(1)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'lxml')
print(soup.prettify)
Related
I have to get <a class="last" aria-label="Last Page" href="https://webtoon-tr.com/webtoon/page/122/">Son »</a>.
From this site:https://webtoon-tr.com/webtoon/
But when i try to scrape it with this code:
from bs4 import BeautifulSoup
import requests
url = "https://webtoon-tr.com/webtoon/"
html = requests.get(url).content
soup = BeautifulSoup(html,"html.parser")
last = soup.find_all("a",{"class":"last"})
print(last)
It just returns me an empty list, and when i try to scrape all "a" tags it only returns 2 which are completly different things.
Can somebody help me about it ? I really appreciate it.
Try using the request_html library.
from bs4 import BeautifulSoup
import requests_html
url = "https://webtoon-tr.com/webtoon/"
s = requests_html.HTMLSession()
html = s.get(url)
soup = BeautifulSoup(html.content, "lxml")
last = soup.findAll("a", {"class":"last"})
print(last)
[<a aria-label="Last Page" class="last" href="https://webtoon-tr.com/webtoon/page/122/">Son »</a>]
Website is protected by Cloudflare. requests, cloudscraper or request_html doesn't work for me, only selenium:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
browser.get("https://webtoon-tr.com/webtoon/")
soup = BeautifulSoup(browser.page_source, 'html5lib')
browser.quit()
link = soup.select_one('a.last')
print(link)
This returns
<a aria-label="Last Page" class="last" href="https://webtoon-tr.com/webtoon/page/122/">Son »</a>
I'm currently using selenium and BeautifulSoup to scrape a website but I'm running into two major issues, first of all, I can't get Chrome to launch in headless mode and it says there are multiple unexpected ends of inputs (photo of said errors). The other problem I have is that I keep getting an error on the line that contains "html.parser" saying that a 'str' is not a callable object. Any advice on these issues would be greatly appreciated thank you.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import urllib.request
import lxml
import html5lib
import time
from bs4 import BeautifulSoup
#config options
options = Options()
options.headless = True
# Set the URL you want to webscrape from
url = 'https://tokcount.com/?user=mrsam993'
# Connect to the URL
browser = webdriver.Chrome(options=options, executable_path='D:\chromedriver') #chrome_options=options
browser.get(url)
# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(browser.page_source(), "html.parser")
browser.quit()
# for i in range(10):
links = soup.findAll('span', class_= 'odometer-value')
print(links)
As for the headless you need to call this way:
from selenium import webdriver
options = webdriver.ChromeOptions()
...
the page_source is not a method. So you need to remove the brackets:
browser.page_source
I am trying to make 'Google Patent Crawler' by using python.
I used modules like Requests, BS4, and Selenium, But I am totally stuck on one thing.
The problem is my code is not parsing entire html sources. something is missing after parsing.
I found the error from 'driver.page_source'
It is not parsing all html.
So I wanna ask about another good way to solve it.
Thank you.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_driver_path = 'C:/Users/Kay/Documents/Python/Driver/chromedriver.exe'
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver_path)
URL = 'https://patents.google.com/?q=engine'
driver.get(URL)
html = driver.page_source
gp_soup = BeautifulSoup(html, 'html5lib')
I'm trying to download the PDF slides off this website using Python and selenium but I think the the links to the slides only appear after loading a script. I tried waiting for the javascript to load but it's still not finding anything. Any ideas?
import os, sys, time, random
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://mila.umontreal.ca/en/cours/deep-learning-summer-school-2017/slides'
browser = webdriver.Chrome()
browser.get(url)
browser.implicitly_wait(3)
html = browser.page_source
links = browser.find_elements_by_class_name('flip-entry')
print(links)
browser.quit()
The reason is that there are no links on the main page. You are getting links inside an IFrame. This IFrame points to https://drive.google.com/embeddedfolderview?hl=fr&id=0ByUKRdiCDK7-c0k1TWlLM1U1RXc#list
You can either directly browse that URL in your code instead of main page. Or you can switch to the frame
browser.switch_to_frame(browser.find_element_by_class_name("iframe-class"))
links = browser.find_elements_by_css_selector('.flip-entry a')
for link in links:
print(link.get_attribute("href"))
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://mila.umontreal.ca/en/cours/deep-learning-summer-school-2017/slides'
browser = webdriver.Chrome()
browser.get(url)
browser.switch_to_frame(browser.find_element_by_class_name('iframe-class'))
links = browser.find_elements_by_class_name('.flip-entry a')
for link in links:
print(link.get_attribute("href"))
browser.quit()
What's the best way to get all the HTML in a page that is created by Javascript to pass to BeautifulSoup?
I'm currently using:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from BeautifulSoup import BeautifulSoup
browser = webdriver.Firefox()
browser.get("http://www.yahoo.co.uk")
html = browser.find_elements_by_id("html")
But "html" is always an empty list. What am I doing wrong?
The correct way to pass the page source to Beautiful Soup from Selenium would be:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from BeautifulSoup import BeautifulSoup
browser = webdriver.Firefox()
browser.get("http://www.yahoo.co.uk")
html_source = browser.page_source
html = BeautifulSoup(html_source)
This way, the browser is loading the page, extracting the FULL html source and passing it to BeautifulSoup. The result can be parsed like any other Beautiful Soup object.
HTML is not an id. It should instead be like this:
html = browser.find_elements_by_tag_name("html")
since html is a tag.
The search you originally did would return all elements where the id has been set to "html". An example of an element that would be returned:
<p id="html">Lorem ipsum</p>
Th id of that element is "html" and the tag name is "p".
You could also use something like
html_source = browser.page_source
This a webdriver provided function call, precisely to collect the full source or "get all the html in a page"