BeautifulSoup showing wrong results - python

Hi, I am trying to scrape various websites, but for some of them the output is a Cloudflare URL.
With the code below, I want the href value of every anchor tag that has one, and I want each value printed on a new line.
For example, given this markup:
<a href="https://www.google.com">some text</a>
it should print https://www.google.com. But for some websites it doesn't work at all, or it returns the output I mention below.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.swappa.com")
soup = BeautifulSoup(page.content, "html.parser")

# Collect every anchor tag that has an href attribute.
links = soup.find_all("a", href=True)
for link in links:
    print(link["href"])
and VS Code outputs this:
https://www.cloudflare.com/?utm_source=challenge&utm_campaign=j

This is not a problem with BeautifulSoup. The site sits behind Cloudflare, which first checks your IP address and other data, such as which browser you are using, and only then forwards you to the main site. This is a newer way of blocking web scraping.
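Since the question already imports Selenium and webdriver_manager, one workaround is to load the page in a real browser, which is more likely to pass Cloudflare's checks. A minimal sketch (not guaranteed to work, since Cloudflare's challenges evolve):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Drive a real Chrome instance; webdriver_manager downloads a matching driver.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://www.swappa.com")

# Parse the rendered page source instead of the raw requests response.
soup = BeautifulSoup(driver.page_source, "html.parser")
for link in soup.find_all("a", href=True):
    print(link["href"])
driver.quit()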

Related

HTML parsing with BeautifulSoup in Python unknown error

I know that this code works for other websites that end in .com.
However, I noticed that the code doesn't work when I try to parse websites that end in .kr.
Can somebody help me figure out why this is happening, and suggest an alternative way to parse these types of websites?
Following is my code.
import requests
from bs4 import BeautifulSoup
URL = 'https://everytime.kr/#nN4K1XC0weHnnM9VB5Qe'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='container')
print(results)
The URL here is a link to my timetable. I need to parse this website so that I can easily collect the information about my subjects and the data relevant to each one (duration, location, professor's name, etc.).
Thanks
The website serves its content dynamically, so you get an empty response back from requests. You can use Selenium instead.
Example
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = 'https://everytime.kr/#nN4K1XC0weHnnM9VB5Qe'
driver.get(url)

# Give the JavaScript time to render the page before grabbing the source.
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find(id='container')
print(results)
driver.close()
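As a follow-up, a fixed time.sleep(5) can be flaky on slow connections. An explicit wait is sketched below, reusing the driver from the example above and the id='container' from the question (assuming that element appears once rendering finishes):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the container element exists.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'container'))
)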

BeautifulSoup can't find HTML element by class

This is the website I'm trying to scrape with Python:
https://www.ebay.de/sch/i.html?_from=R40&_nkw=iphone+8&_sacat=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=3000
I want to access the 'ul' element with the class of 'srp-results srp-list clearfix'. This is what I tried with requests and BeautifulSoup:
from bs4 import BeautifulSoup
import requests

url = 'https://www.ebay.de/sch/i.html?_from=R40&_nkw=iphone+8&_sacat=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=3000'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
uls = soup.find_all('ul', attrs={'class': 'srp-results srp-list clearfix'})
And the output is always an empty list.
I also tried scraping the website with Selenium Webdriver and I got the same result.
At first I was a little confused by your error, but after a bit of debugging I figured out that eBay generates that ul dynamically with JavaScript.
So, since you can't execute JavaScript with BeautifulSoup, you have to use Selenium and wait until the JavaScript has loaded that ul.
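A minimal sketch of that approach, assuming chromedriver is available on your PATH; the class names are taken from the question:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = ('https://www.ebay.de/sch/i.html?_from=R40&_nkw=iphone+8&_sacat=0'
       '&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=3000')
driver = webdriver.Chrome()
driver.get(url)

# Wait (up to 10 seconds) until JavaScript has inserted the results list.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'ul.srp-results'))
)
soup = BeautifulSoup(driver.page_source, 'html.parser')
uls = soup.find_all('ul', attrs={'class': 'srp-results srp-list clearfix'})
driver.quit()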
It is probably because the content you are looking for is rendered by JavaScript after the page loads in a web browser. That means the browser loads the content by running JavaScript, which you cannot replicate with a plain requests.get call from Python.
I would suggest learning Selenium to scrape the data you want.

Unable to scrape a class

I am currently trying to make a web scraper using Python. My objective is for the web scraper to find the name and the price of a stock. Here is my code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://finance.yahoo.com/quote/MA?p=MA&.tsrc=fin-srch')
soup = BeautifulSoup(page.content, "html.parser")
stock_name = soup.find({ "class" : "D(ib) Fz(18px)"})
print(stock_name)
but when I run it, I get this:
C:\Users\baribal\Desktop>py web_scraper.py
None
Thank you in advance!
Your request just gives you the raw HTML of the webpage. The elements you are trying to retrieve are React components that are rendered in the browser after the HTML source is loaded.
You need to render the page in a real browser instead, for example a headless one driven by Selenium.
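A minimal sketch of that, assuming Chrome's headless mode and that the 'D(ib) Fz(18px)' class from the question still matches the element on the page:
from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://finance.yahoo.com/quote/MA?p=MA&.tsrc=fin-srch')

soup = BeautifulSoup(driver.page_source, 'html.parser')
# Note that find() matches a class via the class_ keyword (or attrs={...}),
# not by passing a dict as the first positional argument.
stock_name = soup.find(class_='D(ib) Fz(18px)')
print(stock_name)
driver.quit()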

Scraping all links using Python BeautifulSoup/lxml

http://www.snapdeal.com/
I was trying to scrape all the links from this site, and when I do, I get an unexpected result. I figured out that this is happening because of JavaScript.
Under the "See All Categories" tab you will find all the major product categories. If you hover the mouse over any category, it expands. I want those links from each major category.
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url)
page = BeautifulSoup(data)

#print data
for link in page.findAll('a'):
    l = link.get('href')
    print l
But this gave me a different result than what I expected. (I turned JavaScript off and looked at the page source, and the output came from that source.)
I just want to find all the sub-links under each major category. Any suggestions will be appreciated.
This is happening because you are letting BeautifulSoup choose its own best parser, and you might not have lxml installed.
The best option is to specify html.parser explicitly when parsing the page.
from bs4 import BeautifulSoup
import urllib2

url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data, 'html.parser')

for link in page.findAll('a'):
    l = link.get('href')
    print l
This worked for me. Make sure to install the dependencies.
I think you should try another library such as Selenium. It provides a web driver for you, and that is the advantage of this library; I myself couldn't handle JavaScript with bs4 alone.
Categories Menu is the URL you are looking for. Many websites generate their content dynamically using XHR (XMLHttpRequest).
In order to examine the components of a website, get familiar with the Firebug add-on in Firefox or the built-in Developer Tools in Chrome. You can inspect the XHR requests a website makes under the Network tab of either tool.
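Once you have found the XHR request in the Network tab, you can often call it directly with requests. A sketch of the idea; the endpoint path below is purely hypothetical, so substitute the one you actually find:
import requests

# Hypothetical XHR endpoint; replace with the real URL from the Network tab.
resp = requests.get('http://www.snapdeal.com/json/categories/all')
print(resp.json())  # many XHR endpoints return JSON rather than HTML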
Use a web scraping tool such as Scrapy or mechanize.
In mechanize, to get all the links on the Snapdeal homepage:
from mechanize import Browser

br = Browser()
br.set_handle_robots(False)  # mechanize obeys robots.txt by default; disable if it blocks the fetch
br.open("http://www.snapdeal.com")
for link in br.links():
    print link.text
    print link.url
I had been looking for a way to scrape links from webpages that are only rendered in an actual browser, but I wanted the scraping to run in a headless browser.
I was able to achieve this using PhantomJS, Selenium, and Beautiful Soup:
#!/usr/bin/python
import bs4
from selenium import webdriver

driver = webdriver.PhantomJS('phantomjs')
url = 'http://www.snapdeal.com/'
driver.get(url)

# Grab the source after PhantomJS has rendered the page.
content = driver.page_source
soup = bs4.BeautifulSoup(content, 'html.parser')

links = [a.attrs.get('href') for a in soup.find_all('a')]
for path in links:
    print path
driver.close()
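Worth noting: newer Selenium releases have dropped PhantomJS support, so the modern equivalent is headless Firefox or Chrome. A sketch of the driver setup (assuming geckodriver is on your PATH); the rest of the script stays the same:
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('-headless')  # run Firefox without a visible window
driver = webdriver.Firefox(options=options)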
The following examples will work for both HTTP and HTTPS. I'm writing this answer to show how this can be used in both Python 2 and Python 3.
Python 2
This is inspired by this answer.
from bs4 import BeautifulSoup
import urllib2

url = 'https://stackoverflow.com'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data, 'html.parser')

for link in page.findAll('a'):
    l = link.get('href')
    print l
Python 3
from bs4 import BeautifulSoup
from urllib.request import urlopen
import ssl

# To open up HTTPS URLs. Note that a bare SSLContext() does not verify
# certificates; use ssl.create_default_context() if you need verification.
gcontext = ssl.SSLContext()

# You can give any URL here. I have given the Stack Overflow homepage.
url = 'https://stackoverflow.com'
data = urlopen(url, context=gcontext).read()
page = BeautifulSoup(data, 'html.parser')

for link in page.findAll('a'):
    l = link.get('href')
    print(l)
Other Languages
For other languages, please see this answer.

How to scrape Instagram with BeautifulSoup

I want to scrape pictures from a public Instagram account. I'm pretty familiar with bs4, so I started with that. Using the element inspector in Chrome, I noted that the pictures are in an unordered list and each li has the class 'photo', so I figured it couldn't be that hard to scrape with findAll, right?
Wrong: it doesn't return anything (code below), and I soon noticed that the code shown in the element inspector and the code I got back from requests were not the same, i.e. there was no unordered list in the code I pulled with requests.
Any idea how I can get the code that shows up in the element inspector?
Just for the record, this was my code to start with, which didn't work because the unordered list was not there:
from bs4 import BeautifulSoup
import requests

r = requests.get('http://instagram.com/umnpics/')
soup = BeautifulSoup(r.text, 'html.parser')

for x in soup.findAll('li', {'class': 'photo'}):
    print x
Thank you for your help.
If you look at the source code for the page, you'll see that JavaScript generates much of the content. What you see in the element inspector is the webpage after the script has run, while BeautifulSoup only gets the raw HTML file. In order to parse the rendered webpage, you'll need to use something like Selenium to render it for you.
So, for example, this is how it would look with Selenium:
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver

url = 'http://instagram.com/umnpics/'
driver = webdriver.Firefox()
driver.get(url)

# Parse the rendered page source rather than the raw HTML.
soup = BeautifulSoup(driver.page_source, 'html.parser')
for x in soup.findAll('li', {'class': 'photo'}):
    print x
Now the soup should be what you are expecting.
