I'm trying to make a basic web scraper using BeautifulSoup in Python, but my target page is making it difficult.
When I make the request, I get a response with the HTML; however, the body only contains a single div:
'<div id="miniwidget" style="width:100%; height:100%;"></div>'
I've navigated through the website's HTML in Google Chrome, but I'm new enough to this that I don't quite understand why the page doesn't generate all of the content within that div.
How would I go about making a request that would generate the rest of the HTML?
Here's what I've written:
from bs4 import BeautifulSoup
from urllib.request import urlopen

def Call_Webpage(url):
    html = urlopen(url)
    bsObj = BeautifulSoup(html, features="html.parser")
    soup = bsObj.body.findAll('div')
    print(soup)
Response:
<div id="miniwidget" style="width:100%; height:100%;"></div>
It depends on what you want to find on that page; e.g. for inspecting meta tags you would use soup.find_all('meta'), etc.
You could also do:

import urllib.request

# domain_url, headers and timeout must be defined beforehand, e.g.:
# headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib.request.Request(domain_url, None, headers)
result = urllib.request.urlopen(request, timeout=timeout)
resulttext = result.read()

to get the whole page as text.
Related
This is the website I'm trying to scrape with Python:
https://www.ebay.de/sch/i.html?_from=R40&_nkw=iphone+8&_sacat=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=3000
I want to access the 'ul' element with the class of 'srp-results srp-list clearfix'. This is what I tried with requests and BeautifulSoup:
from bs4 import BeautifulSoup
import requests
url = 'https://www.ebay.de/sch/i.html?_from=R40&_nkw=iphone+8&_sacat=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=3000'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
uls = soup.find_all('ul', attrs = {'class': 'srp-results srp-list clearfix'})
And the output is always an empty list.
I also tried scraping the website with Selenium Webdriver and I got the same result.
At first I was a little bit confused about your error, but after a bit of debugging I figured out that eBay dynamically generates that ul with JavaScript.
So, since you can't execute JavaScript with BeautifulSoup, you have to use Selenium and wait until the JavaScript loads that ul.
It is probably because the content you are looking for is rendered by JavaScript after the page loads in a web browser. This means the browser loads that content by running JavaScript, which you cannot get with a requests.get call from Python.
I would suggest learning Selenium to scrape the data you want.
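As a rough sketch of that approach (the timeout is an assumption, Selenium and a matching browser driver must be installed, and the CSS selector is taken from the question and may change), waiting for the dynamically generated ul might look like this:

```python
def fetch_results_html(url, timeout=10):
    """Load a page in a real browser and wait for the results <ul> to appear.

    Sketch only: assumes Selenium and a Chrome driver are installed.
    """
    # Imports kept inside the function so the module can still be loaded
    # in environments where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until JavaScript has inserted the results list.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "ul.srp-results.srp-list.clearfix")
            )
        )
        return driver.page_source
    finally:
        driver.quit()
```

The returned page_source can then be fed to BeautifulSoup exactly as before, since by that point the ul exists in the HTML.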
I am trying to scrape some information from web pages. On one page it works fine, but on the other it does not, because I only get None as the return value.
This code / webpage is working fine:
# https://realpython.com/beautiful-soup-web-scraper-python/
import requests
from bs4 import BeautifulSoup
URL = "https://www.monster.at/jobs/suche/?q=Software-Devel&where=Graz"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
name_box = soup.findAll("div", attrs={"class": "company"})
print (name_box)
But with this code / webpage I only get None as the return value:
# https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/
import requests
from bs4 import BeautifulSoup
URL = "https://www.bloomberg.com/quote/SPX:IND"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
name_box = soup.find("h1", attrs={"class": "companyName__99a4824b"})
print (name_box)
Why is that?
(At first I thought that, because of the number in the class name on the second page, "companyName__99a4824b", the class name changes dynamically; but this is not the case, because when I refresh the page it is still the same class name...)
The reason you get None is that the Bloomberg page uses JavaScript to load its content while the user is on the page.
BeautifulSoup simply parses the HTML of the page as it is first served, which does not contain the companyName__99a4824b class-tag.
Only after the user has waited for the page to fully load does the HTML include the desired tag.
If you want to scrape that data, you'll need to use something like Selenium, which you can instruct to wait until the desired element of the page is ready.
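You can reproduce why find() comes back empty with a made-up snippet of server-sent HTML (the markup below is an illustration, not what Bloomberg actually serves): the tag the browser later renders simply isn't in the document that requests receives.

```python
from bs4 import BeautifulSoup

# What the server might send before any JavaScript runs: an empty
# mount point instead of the rendered <h1 class="companyName__...">.
server_html = """
<html><body>
  <div id="root"></div>  <!-- JS fills this in later -->
</body></html>
"""

soup = BeautifulSoup(server_html, "html.parser")
name_box = soup.find("h1", attrs={"class": "companyName__99a4824b"})
print(name_box)  # None: the tag only exists after JavaScript has run
```

Selenium solves this by driving a real browser, so page_source reflects the DOM after the scripts have executed.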
The website blocks scrapers; check the title:
print(soup.find("title"))
To bypass this, you must use a real browser that can run JavaScript.
A tool called Selenium can do that for you.
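A quick way to spot this kind of blocking is to inspect the page title of whatever the server actually returned (the HTML below is a hypothetical example of a bot-protection page, not the real response):

```python
from bs4 import BeautifulSoup

# Hypothetical response from a site that blocks scrapers: instead of
# the real content you get an interstitial page.
blocked_html = "<html><head><title>Access Denied</title></head><body></body></html>"

soup = BeautifulSoup(blocked_html, "html.parser")
title = soup.find("title")
print(title.text)  # prints "Access Denied" instead of the page's real title
```

If the title is an error or challenge message rather than the expected page title, you know the request was rejected before any scraping logic ran.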
I am trying to access the sequence on this webpage:
https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta
The sequence is stored under the div class="seq gbff". Each line is stored under
<span class="ff_line" id="gi_344258949_1"> *line 1 of sequence* </span>
When I try to search for the spans containing the sequence, beautiful soup returns None. Same problem when I try to look at the children or content of the div above the spans.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
# Create a variable with the url
url = 'https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta'
# Use requests to get the contents
r = requests.get(url)
# Get the text of the contents
html_content = r.text
# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')
# 'attrs' needs a dict ({'class': ...}), not a set; and find_all returns
# a list, so use find() to get a single element with .children
div = soup.find('div', attrs={'class': 'seq gbff'})
if div is not None:
    for each in div.children:
        print(each)
spans = soup.find_all('span', attrs={'class': 'ff_line'})
Neither method works and I'd greatly appreciate any help :D
This page uses JavaScript to load the data.
With DevTools in Chrome/Firefox I found this URL, which contains all the <span> elements:
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
Now the hard part: you have to find this URL in the HTML, because different pages will use different arguments in the URL. Or you could compare a few URLs and work out the schema, so you can generate the URL manually.
EDIT: if in the URL you change retmode=html to retmode=xml, then you get it as XML. If you use retmode=text, then you get it as text without HTML tags. retmode=json doesn't work.
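Assembling the query string with urllib.parse keeps that manual URL generation readable. A sketch, using the parameter names copied from the URL above (the id value would still have to be dug out of each page's HTML, as noted):

```python
from urllib.parse import urlencode

def build_viewer_url(seq_id, retmode="text"):
    """Assemble the sviewer URL for a given sequence id.

    The parameter set is copied from the URL found via DevTools;
    retmode can be "html", "xml" or "text" (json does not work).
    """
    base = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"
    params = {
        "id": seq_id,
        "db": "protein",
        "report": "fasta",
        "retmode": retmode,
    }
    return base + "?" + urlencode(params)

url = build_viewer_url("344258949")
print(url)
```

With retmode="text" the response is plain FASTA, so you may not even need BeautifulSoup to parse it.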
I want to scrape information from this page.
Specifically, I want to scrape the table which appears when you click "View all" under the "TOP 10 HOLDINGS" (you have to scroll down on the page a bit).
I am new to web scraping and have tried using BeautifulSoup to do this. However, there seems to be an issue because of the "onclick" function I need to take into account. In other words: the HTML code I scrape directly from the page doesn't include the table I want to obtain.
I am a bit confused about my next step: should I use something like selenium or can I deal with the issue in an easier/more efficient way?
Thanks.
My current code:
from bs4 import BeautifulSoup
import requests
Soup = BeautifulSoup
my_url = 'http://www.etf.com/SHE'
page = requests.get(my_url)
htmltxt = page.text
soup = Soup(htmltxt, "html.parser")
print(soup)
You can get a json response from the api: http://www.etf.com/view_all/holdings/SHE. The table you're looking for is located in 'view_all'.
import requests
from bs4 import BeautifulSoup as Soup
url = 'http://www.etf.com/SHE'
api = "http://www.etf.com/view_all/holdings/SHE"
headers = {'X-Requested-With':'XMLHttpRequest', 'Referer':url}
page = requests.get(api, headers=headers)
htmltxt = page.json()['view_all']
soup = Soup(htmltxt, "html.parser")
data = [[td.text for td in tr.find_all('td')] for tr in soup.find_all('tr')]
print('\n'.join(': '.join(row) for row in data))
I am an absolute newbie in the field of web scraping, and right now I want to extract the visible text from a web page. I found this piece of code online:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
web_page = urllib2.urlopen(url)
soup = BeautifulSoup(url , "lxml")
print (soup.prettify())
With the above code, I get the following result:
/usr/local/lib/python2.7/site-packages/bs4/__init__.py:282: UserWarning: "http://www.espncricinfo.com/" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
' that document to Beautiful Soup.' % decoded_markup
<html>
<body>
<p>
http://www.espncricinfo.com/
</p>
</body>
</html>
Is there any way I could get a more concrete result, and what is going wrong with the code? Sorry for being clueless.
Try passing the HTML document, not the URL, to BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
web_page = urllib2.urlopen(url)
soup = BeautifulSoup(web_page , 'html.parser')
print (soup.prettify().encode('utf-8'))
soup = BeautifulSoup(web_page, "lxml")
You should pass a file-like object to BeautifulSoup, not the URL.
The URL is handled by urllib2.urlopen(url) and the response is stored in web_page.