I want to scrap data in the <span/> attribute for a given website using BeautifulSoup. You can see at the screenshot where it locates. However, the code that I'm using is just returning an empty list. I can't find the data in the list that I want. What am I doing wrong?
from bs4 import BeautifulSoup
from urllib import request
url = "http://144.122.167.229"
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
data = opener.open(url).read()
soup = BeautifulSoup(data, 'html.parser')
your_data = list()
for line in soup.findAll('span', attrs={'id': 'mc1_legend_value'}):
your_data.append(line.text)
for line in soup.findAll('span'):
your_data.append(line.text)
ScreenShot : https://imgur.com/a/z0vNh
Thank you.
The dashboard from the screenshot looks to me like something javascript would generate. If you can't find the tag in the page source, that means it was later added by some javascript code or your browser tried to fix some html which it considered broken or out of place.
Keep in mind that right now you're sending a request to a server and it serves you the plain html back. A browser would parse the html and execute any javascript code if it finds any. In your case, beautiful soup or urllib doesn't execute any javascript code. urllib fetches the html and beautiful soup makes it easier to parse and extract relevant information.
If you want to get the value from that tag, I recommend using a headless browser to render your page and just after that parse it's html through beautiful soup or any other parser.
Give a try to selenium: http://selenium-python.readthedocs.io/.
You can control your own browser programmatically. You can make it request the page for you, render it, save the new html in a variable, parse it using beautifoul soup and extract the values you're interested in. I believe that it already has it's own parser implemented which you can use directly to search for that tag.
Or maybe even scrapinghub's splash: https://github.com/scrapinghub/splash
If the dashboard communicates with a server in real-time and that value is continuously received from the server, you could take a look at what requests are sent to the server in order to get that value. Take a look in developer console under the networks tab. Press F12 to open the developer console and click on Network. Refresh the page and you should get all the request send to the server along with the responses. Requests sent by the javascript are usually XMLHttpRequests. Click on XHR in the Network tab to filter out any other requests. (These are instructions for Google Chrome. Firefox might differ a bit).
Related
Novice web scraper here:
I am trying to scrape the name and address from this website https://propertyinfo.knoxcountytn.gov/Datalets/Datalet.aspx?sIndex=1&idx=1. I have attempted the following code which only returns 'None' or an empty array if I replace find() with find_all(). I would like it to return the html of this particular section so I can extract the text and later add it to a csv file. If the link doesn't work, or take to you where I'm working, simply go to the knox county tn website > property search > select a property.
Much appreciation in advance!
from splinter import Browser
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
from webdriver_manager.chrome import ChromeDriverManager
owner_soup = soup(html, 'html.parser')
owner_elem = owner_soup.find('td', class_='DataletData')
owner_elem
OR
# this being the tag and class of the whole section where the info is located
owner_soup = soup(html, 'html.parser')
owner_elem = owner_soup.find_all('div', class_='datalet_div_2')
owner_elem
OR when I try:
browser.find_by_css('td.DataletData')[15]
it returns:
<splinter.driver.webdriver.WebDriverElement at 0x11a763160>
and I can't pull the html contents from that element.
There's a few issues I see, but it could be that you didn't include your code as you actually have it.
Splinter works on its own to get page data by letting you control a browser. You don't need BeautifulSoup or requests if you're using splinter. You use requests if you want the raw response without running any of the things that browsers do for you automatically.
One of these automatic things is redirects. The link you provided does not provide the HTML that you are seeing. This link just has a response header that redirects you to https://propertyinfo.knoxcountytn.gov/, which redirects you again to https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop, which redirects again to https://propertyinfo.knoxcountytn.gov/Search/Disclaimer.aspx?FromUrl=../search/commonsearch.aspx?mode=realprop
On this page you have to hit the 'agree' button to get redirected to https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop, this time with these cookies set:
Cookie: ASP.NET_SessionId=phom3bvodsgfz2etah1wwwjk; DISCLAIMER=1
I'm assuming the session id is autogenerated, and the Disclaimer value just needs to be '1' for the server to know you agreed to their terms.
So you really have to study a page and understand what's going on to know how to do it on your own using just the requests and beautifulsoup libraries. Besides the redirects I mentioned, you still have to figure out what network request gives you that session id to manually add it to the cookie header you send on all future requests. You can avoid doing some requests, and so this way is a lot faster, but you do need to be able to follow along in the developer tools 'network' tab.
Postman is a good tool to help you set up requests yourself and see their result. Then you can bring all the set up from there into your code.
let me briefly describe the problem. When I use urllib3 to scrape the html from a website, it isn't the same as the html code that I get when I manually enter the website with chrome and use 'inspect element'
Here is an example from my code. The problem is that the html code I got here is different from the html code I would get when I use inspect element on chrome
#myUrl is the url of the website I'm trying to scrape
http = urllib3.PoolManager()
response = http.request('GET', myUrl)
soup = BeautifulSoup(response.data.decode('utf-8'), features="html.parser")
m = str(soup)
that problem, probably is due to: the content on the page is being loaded with javascript. To get the whole data, you have to use some library that runs javascript. I recommend using Selenium.
To verify that case, you can disable the browser's javascript and trying to load the page.
My issue I'm having is that I want to grab the related links from this page: http://support.apple.com/kb/TS1538
If I Inspect Element in Chrome or Safari I can see the <div id="outer_related_articles"> and all the articles listed. If I attempt to grab it with BeautifulSoup it will grab the page and everything except the related articles.
Here's what I have so far:
import urllib2
from bs4 import BeautifulSoup
url = "http://support.apple.com/kb/TS1538"
response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read())
print soup
This section is loaded using Javascript. Disable your browser's Javascript to see how BeautifulSoup "sees" the page.
From here you have two options:
Use a headless browser, that will execute the Javascript. See this questions about this: Headless Browser for Python (Javascript support REQUIRED!)
Try and figure out how the apple site loads the content and simulate it - it probably does an AJAX call to some address.
After some digging it seems it does a request to this address (http://km.support.apple.com/kb/index?page=kmdata&requestid=2&query=iOS%3A%20Device%20not%20recognized%20in%20iTunes%20for%20Windows&locale=en_US&src=support_site.related_articles.TS1538&excludeids=TS1538&callback=KmLoader.receiveSuccess) and uses JSONP to load the results with KmLoader.receiveSuccess being the name of the receiving function. Use Firebug of Chrome dev tools to inspect the page in more detail.
I ran into a similar problem, the html contents that are created dynamically may not be captured by BeautifulSoup. A very basic solution for this is to make it wait for few seconds before capturing the contents, or use Selenium instead that has the functionality to wait for an element and then proceed. So for the former, this worked for me:
import time
# .... your initial bs4 code here
time.sleep(5) #5 seconds, it worked with 1 second too
html_source = browser.page_source
# .... do whatever you want to do with bs4
I am working on a web scraping project and want to get a list of products from Dell's website. I found this link (https://www.dell.com/support/home/us/en/04/products/) which pulls up a box with a list of product categories (really just redirect urls. If it doesn't come up for you click the button which says "Browse all products"). I tried using Python Requests to GET the page and save the text to a file to parse through, but the response doesn't contain any of the categories/redirect urls. My code is as basic as it gets:
import requests
url = "https://www.dell.com/support/home/us/en/04/products/"
page = requests.get(url)
with open("laptops.txt", "w", encoding="utf-8") as outf:
outf.write(page.text)
outf.close()
Is there a way to get these redirect urls? I am essentially trying to make my own site map of their products so that I can scrape the details of each one. Thanks
This page uses JavaScript to get and display these links - but requests/urllib and BeautifulSoup/lxml can't run JavaScript.
Using DevTools in Firefox/Chrome (tab: Network) I found it reads it from url
https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl®ion=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743
so I use it to get links.
You may have to to change country=pl&language=pl in url to get it in different language.
import requests
from bs4 import BeautifulSoup as BS
url = "https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl®ion=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743"
response = requests.get(url)
soup = BS(response.text, 'html.parser')
all_items = soup.find_all('a')
for item in all_items:
print(item.text, item['href'])
BTW: Other method is it use Selenium to control real web browser which can run JavaScript.
try using selenium chrome driver it helps for handling dynamic data on website and also features like clicking buttons, handling page refresh etc.
Beginner guide to web scraping
I'm trying to crawl a website using the requests library. However, the particular website I am trying to access (http://www.vi.nl/matchcenter/vandaag.shtml) has a very intrusive cookie statement.
I am trying to access the website as follows:
from bs4 import BeautifulSoup as soup
import requests
website = r"http://www.vi.nl/matchcenter/vandaag.shtml"
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
htmlsoup = soup(html.text, "html.parser")
This returns a web page that consists of just the cookie statement with a big button to accept. If you try accessing this page in a browser, you find that pressing the button redirects you to the requested page. How can I do this using requests?
I considered using mechanize.Browser but that seems a pretty roundabout way of doing it.
Try setting:
cookies = dict(BCPermissionLevel='PERSONAL')
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies)
This will bypass the cookie consent page and will land you staight to the page.
Note: You could find the above by analyzing the javascript code that is run on the cookie concent page, it is a bit obfuscated but it should not be difficult. If you run into the same type of problem again, take a look at what kind of cookies does the javascript code that is executed upon a event's handling sets.
I have found this SO question which asks how to send cookies in a post using requests. The accepted answer states that the latest build of Requests will build CookieJars for you from simple dictionaries. Below is the POC code included in the original answer.
import requests
cookie = {'enwiki_session': '17ab96bd8ffbe8ca58a78657a918558'}
r = requests.post('http://wikipedia.org', cookies=cookie)