How to scrape this using bs4

I need to get this element: <a class="last" aria-label="Last Page" href="https://webtoon-tr.com/webtoon/page/122/">Son »</a>
from this site: https://webtoon-tr.com/webtoon/
But when I try to scrape it with this code:
from bs4 import BeautifulSoup
import requests
url = "https://webtoon-tr.com/webtoon/"
html = requests.get(url).content
soup = BeautifulSoup(html,"html.parser")
last = soup.find_all("a",{"class":"last"})
print(last)
It just returns an empty list, and when I try to scrape all "a" tags, it returns only two, which are completely different things.
Can somebody help me with this? I would really appreciate it.

Try using the requests_html library.
from bs4 import BeautifulSoup
import requests_html
url = "https://webtoon-tr.com/webtoon/"
s = requests_html.HTMLSession()
html = s.get(url)
soup = BeautifulSoup(html.content, "lxml")
last = soup.find_all("a", {"class":"last"})
print(last)
[<a aria-label="Last Page" class="last" href="https://webtoon-tr.com/webtoon/page/122/">Son »</a>]

The website is protected by Cloudflare. Neither requests, cloudscraper, nor requests_html works for me; only Selenium does:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
browser.get("https://webtoon-tr.com/webtoon/")
soup = BeautifulSoup(browser.page_source, 'html5lib')
browser.quit()
link = soup.select_one('a.last')
print(link)
This returns
<a aria-label="Last Page" class="last" href="https://webtoon-tr.com/webtoon/page/122/">Son »</a>
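
Once the link has been retrieved, the number of pages can be read from its href. A minimal sketch, assuming the URL keeps the /page/N/ pattern shown in the output above; the helper name last_page_number is hypothetical:

```python
import re
from urllib.parse import urlparse


def last_page_number(href):
    """Return the trailing page number from a '.../page/N/' URL, or None."""
    path = urlparse(href).path  # e.g. '/webtoon/page/122/'
    match = re.search(r"/page/(\d+)/?$", path)
    return int(match.group(1)) if match else None


href = "https://webtoon-tr.com/webtoon/page/122/"
print(last_page_number(href))  # 122
```

With bs4, the href would come from the tag itself, for example link["href"] on the element returned by select_one.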

Related

Get 'None' from website by using beautifulSoup

I am new to web crawling and am studying it on my own. I tried to get information from the Hong Kong Disneyland site:
https://www.hongkongdisneyland.com/book/general-tickets/1day-tickets-j
I tried to scrape the price from the page, but it returns "None"; the result should be HK$639.
import requests
from bs4 import BeautifulSoup
url5 = 'https://www.hongkongdisneyland.com/book/general-tickets/1day-tickets-j'
r = requests.get(url5)
sp = BeautifulSoup(r.content, 'html.parser')
door = sp.find('div', class_='container')
price = door.find('p', class_='price')
print(price)
My understanding of BeautifulSoup: it parses the page's HTML, and I can use find/find_all to locate information by tag name, such as 'div' or 'p', and by class. Please correct me if this is wrong. Thank you.

The page content is loaded by JavaScript, so to pull out the desired data you can use Selenium together with bs4:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--no-sandbox")
options.add_argument("start-maximized")
options.add_experimental_option("detach", True)
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service,options=options)
url = 'https://www.hongkongdisneyland.com/book/general-tickets/1day-tickets-j'
driver.get(url)
#driver.maximize_window()
time.sleep(10)
soup=BeautifulSoup(driver.page_source, 'lxml')
price = soup.select_one('p:-soup-contains("General Admission:") > strong').text
print(price)
Output:
HK$639
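
The fixed time.sleep(10) above always waits the full ten seconds, even if the price appears sooner, and still fails if the page takes longer. One alternative is to poll until the content is present. Below is a sketch of that idea in plain Python. The helper wait_for is hypothetical (Selenium ships its own WebDriverWait for this purpose), and the list of snapshots is a made-up stand-in for successive reads of driver.page_source:

```python
import time


def wait_for(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Stand-in for a page whose content only appears on the third read.
snapshots = iter([
    "<div id='root'></div>",
    "<div id='root'></div>",
    "<p>General Admission: <strong>HK$639</strong></p>",
])


def price_loaded():
    html = next(snapshots)
    return html if "General Admission:" in html else None


html = wait_for(price_loaded, timeout=5, interval=0.01)
print("HK$639" in html)  # True
```

With a real driver, the condition could be something like lambda: "General Admission:" in driver.page_source.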

Missing Elements from HTML File Using BeautifulSoup

I'm very new to the web-scraping world, and I'm trying to scrape the names of shoes from a website. When I use inspect on the website, there's a div tag that has basically the entire webpage inside it, but when I print out the html code, the div tag is completely empty! Here's my current code:
from bs4 import BeautifulSoup
import requests
import time
def findShoeNames():
    html_file = requests.get('https://www.goat.com/sneakers/brand/air-jordan').text
    soup = BeautifulSoup(html_file, 'lxml')
    print(soup)
if __name__ == "__main__":
    findShoeNames()
When I call my function and print(soup), the div tag looks like this:
<div id="root"></div>
But as previously mentioned, when I hit inspect on the website, this div tag has basically the entire webpage inside it. So I'm unable to scrape any data from the website.
Please help! Thanks

The website uses JavaScript to load its content, so you should use Selenium and chromedriver.
Install selenium.
Install chromedriver from here (unzip it and copy it into your Python folder).
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.goat.com/sneakers/brand/air-jordan"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
# Selenium 4 takes the options object through the "options" keyword
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(1)  # give the page's JavaScript a moment to render
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'lxml')
# prettify is a method, so it has to be called
print(soup.prettify())
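
A quick way to tell whether a page is rendered client-side is to check whether its main container is empty in the raw response. A minimal sketch using the div id from the question; is_empty_container is a hypothetical helper, and both HTML strings are made-up samples rather than the site's real markup:

```python
from bs4 import BeautifulSoup


def is_empty_container(html, container_id="root"):
    """True if the element with the given id exists but has no content."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    if container is None:
        return False
    # No child tags and no non-whitespace text means nothing was rendered.
    return container.find(True) is None and not container.get_text(strip=True)


static_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
rendered_html = '<html><body><div id="root"><h1>Air Jordan</h1></div></body></html>'

print(is_empty_container(static_html))    # True
print(is_empty_container(rendered_html))  # False
```

If the check returns True for the text returned by requests.get, the content is most likely filled in by JavaScript and a browser-driven approach such as Selenium is needed.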

Incomplete HTML-response on some sites using Requests & BeautifulSoup or Selenium

I'm trying to scrape information from some URLs using Requests and BeautifulSoup in Python, but some sites return only a partial HTML response that is missing the content of the page.
This is the code that is not working:
import requests
from bs4 import BeautifulSoup
url = "http://www.exampleurl.com"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
Here is the incomplete response:
Picture
I tried to use Selenium with Chrome Webdriver instead, but ended up with the same issue.
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)
browser.get(url)
html = browser.page_source
Any ideas?

What happens
You do not get the expected HTML because it is inside an iframe.
Try to get the src of the iframe with soup.find('iframe')['src'] and make a second request with it.
Example
import requests
from bs4 import BeautifulSoup
url = "http://www.ingenieur-jobs.de/jobangebote/3075/"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
iframe = requests.get(soup.find('iframe')['src'])
soup = BeautifulSoup(iframe.content, 'html.parser')
print(soup)
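
The src of an iframe may be relative, or the page may contain no iframe at all; in either case passing soup.find('iframe')['src'] straight to requests.get fails. A sketch that guards against both, using a made-up sample page; iframe_url is a hypothetical helper:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iframe_url(page_url, html):
    """Return the absolute URL of the first iframe that has a src, or None."""
    soup = BeautifulSoup(html, "html.parser")
    iframe = soup.find("iframe", src=True)
    if iframe is None:
        return None
    # urljoin leaves absolute URLs alone and resolves relative ones.
    return urljoin(page_url, iframe["src"])


sample = '<html><body><iframe src="/frames/job.html"></iframe></body></html>'
print(iframe_url("http://www.example.com/jobs/1/", sample))
# http://www.example.com/frames/job.html
```

The returned URL can then be requested and parsed in the same way as in the example above.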

Python says "Please enable JavaScript and Cookies in your browser." in Selenium webdriver

from selenium import webdriver
import time
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path = "C:/Users/USER/Desktop/chromedriver.exe")
url=""
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
print(soup)
When I try to use Selenium, the page shows "window.onload=function(){process();}Please enable JavaScript and Cookies in your browser." How can I enable JavaScript?

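Try enabling JavaScript explicitly through the Chrome options and passing them to the driver:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("--enable-javascript")
driver = webdriver.Chrome(options=options)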

Is this website scrape-able with BeautifulSoup?

I would like to scrape this website: https://www.projets-environnement.gouv.fr/pages/home/
More precisely, I would like to collect the table in the div with id = table-wrapper.
My trouble is that I can't find it with BeautifulSoup.
Here is my code:
url = 'https://www.projets-environnement.gouv.fr/pages/home/'
html = requests.get(url).text
soup = BeautifulSoup(html, "html5lib")
div_table = soup.findAll('div', id_='table-wrapper')
But div_table comes back empty.
Is Selenium the solution?

I think you should use Selenium:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
url = 'https://www.projets-environnement.gouv.fr/pages/home/'
options = Options()
options.add_argument("-headless")  # run Firefox without a visible window
# Selenium 4 takes the options object through the "options" keyword
driver = webdriver.Firefox(options=options)
driver.get(url)
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, "html.parser")
mytable = soup.find('div', id='table-wrapper')
This gives you the table.

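The keyword id_ is not special in BeautifulSoup (only class_ is), so the call in the question matches nothing. The correct way to call it is: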
soup.find("div", {"id": "table-wrapper"})
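
The difference between the spellings can be checked on a small inline document. This is only a sketch; the HTML string is a made-up stand-in for the real page:

```python
from bs4 import BeautifulSoup

html = '<div id="table-wrapper"><table><tr><td>1</td></tr></table></div>'
soup = BeautifulSoup(html, "html.parser")

# id_ is treated as a literal attribute named "id_", so nothing matches.
print(len(soup.find_all("div", id_="table-wrapper")))  # 0

# Either of these matches on the real id attribute.
by_keyword = soup.find("div", id="table-wrapper")
by_dict = soup.find("div", {"id": "table-wrapper"})
print(by_keyword is by_dict)  # True
```

The trailing underscore is only needed for class_, because class is a reserved word in Python.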
