I tried to scrape an HTML page, https://streamelements.com/logna/leaderboard, but the HTML I can see in Firefox's Inspect Element is different from the page's HTML source code.
Is it possible to scrape pages like this, or is there a way to get the code you can see through Inspect Element?
The HTML shown in the inspector can differ from the original source code, because the inspector displays the live DOM after the browser has run the page's JavaScript. (Server-side code such as PHP has already run before the response is sent, so it does not cause the difference.) So when scraping, you should work with the HTML as rendered in the browser, not the original source code.
Hope this helps.
I'm currently using Beautiful Soup to scrape a website for data, but the Python module reads the source code of the page. The information I need isn't in the source code, yet it is there if I right-click the page in Chrome and choose Inspect Element.
I was wondering if there is any way a Python module could scrape the elements of the rendered webpage rather than its source code.
In Beautiful Soup I've tried to search for the elements, but they don't come up because it's searching the source code. I'm also not sure why they don't appear there.
When the content is loaded by JavaScript, you cannot get it with Beautiful Soup alone, because Beautiful Soup only parses the HTML it is given. In this situation the Selenium library is commonly used: it drives a real browser, so the dynamic content can be extracted after the page has rendered.
This is the code that I wrote. I watched a lot of tutorials, and they get the output with exactly the same code.
import requests
from bs4 import BeautifulSoup as bs
url="https://shop.punamflutes.com/pages/5150194068881408"
page=requests.get(url).text
soup=bs(page,'lxml')
#print(soup)
tag=soup.find('div',class_="flex xs12")
print(tag)
I always get None. Also, the class name seems strange. View Source shows different markup than Inspect Element does.
Bs4 is weird. Sometimes it returns different code than what is on the page...it alters it depending on the source. Try using selenium. It works great and has many more uses than bs4. Most of all...it is super easy to find elements on a site.
It's not a bs4 problem; bs4 is correctly parsing what requests returns. The cause is the webpage itself.
If you inspect the "soup", you will see that the source of the page is a set of links to scripts that render the content on the page. For these scripts to be executed, you need a browser: requests will only get you what the web server returns and won't execute the JavaScript for you. You can verify this yourself by disabling JavaScript in your browser's developer tools.
The solution is to use a web browser (e.g. headless chrome + chromedriver) and Selenium to control it. There are plenty of good tutorials out there on how to do this.
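Before reaching for a browser, it can help to look at what the server actually returned. The sketch below uses only the standard library to list the scripts and the visible text in a piece of raw HTML. The `RawPageSummary` class, the `summarize` helper and the sample markup are made up for illustration; they are not taken from the shop page in the question.

```python
from html.parser import HTMLParser


class RawPageSummary(HTMLParser):
    """Collect script URLs and visible text from HTML exactly as the server sent it."""

    def __init__(self):
        super().__init__()
        self.script_srcs = []   # external scripts a browser would download and run
        self.text_chunks = []   # text found outside <script> and <style>
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.script_srcs.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_chunks.append(data.strip())


def summarize(raw_html):
    """Return (script URLs, visible text chunks) for a raw HTML string."""
    parser = RawPageSummary()
    parser.feed(raw_html)
    parser.close()
    return parser.script_srcs, parser.text_chunks


# Hypothetical markup shaped like a script-rendered page:
# an empty mount point plus the scripts that would fill it in.
SAMPLE = """
<html><head><title>Shop</title></head>
<body>
  <div id="app"></div>
  <script src="/static/vendor.js"></script>
  <script src="/static/app.js"></script>
</body></html>
"""

scripts, text = summarize(SAMPLE)
print(scripts)  # ['/static/vendor.js', '/static/app.js']
print(text)     # ['Shop']
```

If the real response looks like this (many scripts, almost no text), the element you are searching for is most likely created in the browser, and no parser setting will make it appear.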
I am trying to get the full HTML of a webpage with Python for a little project I am doing. The HTML I get by printing the response is different from what I see in Chrome. urllib does the same thing. I have also tried Selenium WebDriver, but the info on the webpage is always updating and I don't want to have it constantly opening Chrome.
I am trying to crawl this link using Python's BeautifulSoup and urllib2 libraries. One problem I am running into is that the soup object does not match the page's HTML as shown in Google Chrome's Developer Tools. I checked multiple times and I am certain that I am passing the correct address. I know they are different because I printed the entire soup object into Sublime Text 2 and compared it against what Chrome's Developer Tools show. I also searched for very specific tags in the soup object. After debugging for hours, I am out of ideas. Does anyone know why this is happening? Is there some sort of redirection going on?
JavaScript runs in the browser and changes the page's DOM. A URL library such as urllib2 only downloads the HTML and does not execute included or linked JavaScript. That's why you see a difference.
So I'm performing a scrape of omegle trying to scrape the users online.
This is the HTML code:
<div id="onlinecount">
<strong>
30,000+
</strong>
</div>
Now I would presume that with lxml the XPath would be //div[@id="onlinecount"] to scrape any text within the div. I want to get the numbers from the strong tags, but when I try to scrape this, I just end up with an empty list.
Here's my relevant code:
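import requests
from lxml import html
# Fragment from a method of the scraper class; self.website holds the target URL.
print("\n Grabbing users online now from", self.website)
site = requests.get(self.website)
tree = html.fromstring(site.text)
users = tree.xpath('//div[@id="onlinecount"]')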
Note that the self.website variable is just http://www.omegle.com
Any ideas what I'm doing wrong? Note I can scrape other parts just not the number of online users.
I ended up using a different set of code which I learned from a friend.
Here's my full code for anyone interested.
http://pastebin.com/u1kTLZtJ
When you send a GET request to "http://www.omegle.com" using the requests module, what I observed is that there is no "onlinecount" in site.text. The reason is that this part is rendered by JavaScript. You should use a library that can execute the JavaScript and give you the final HTML as rendered in a browser. One such third-party library is Selenium, http://selenium-python.readthedocs.org/. The only downside is that it opens a real web browser.
Below is working code using Selenium:
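from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
browser.get("http://www.omegle.com")
# The find_element_by_* helpers are gone from current Selenium 4; use find_element with a By locator.
element = browser.find_element(By.ID, "onlinecount")
onlinecount = element.find_element(By.TAG_NAME, "strong")
print(onlinecount.text)
browser.quit()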
You can also send a GET request to http://front1.omegle.com/status,
which returns the count of online users and other details as JSON.
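Reading such a JSON endpoint needs no browser at all. Here is a minimal sketch using only the standard library. The `"count"` key is an assumption about the payload's shape, and both function names are invented for this example, so inspect a real response and adjust before relying on it.

```python
import json
from urllib.request import urlopen

STATUS_URL = "http://front1.omegle.com/status"  # endpoint named in the answer above


def parse_online_count(payload):
    """Return the user count from a status payload, or None if the field is missing.

    Assumes the JSON object stores the number under a "count" key (unverified).
    """
    data = json.loads(payload)
    value = data.get("count")
    return int(value) if value is not None else None


def fetch_online_count(url=STATUS_URL, timeout=10):
    """Download the status document and extract the count (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_online_count(response.read().decode("utf-8"))


# Offline check with a made-up payload, so nothing here depends on the live service.
print(parse_online_count('{"count": 30412, "servers": ["front1"]}'))  # 30412
```

Because this is a plain HTTP request, it can be polled on a timer without ever opening a browser window.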
I have done a bit of looking at this, and that particular part of the page is not static markup but is filled in by JavaScript.
Here is the source (this is what the requests library is returning in your program)
<div id="onlinecount"></div>
<script>
if (IS_MOBILE) {
$('sharebuttons').dispose();
$('onlinecount').dispose();
}
</script>
</div>
As you can see, from lxml's point of view the onlinecount div is empty and is followed only by a script.
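The same effect can be reproduced offline. This sketch parses two strings with the standard-library HTML parser (swapped in for lxml so it runs without extra packages): one modelled on the source quoted above, one on the markup the browser shows. `DivText` and `div_text` are hypothetical helper names.

```python
from html.parser import HTMLParser


class DivText(HTMLParser):
    """Gather the text nested inside the <div> whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # > 0 while inside the target div
        self.found = False   # True once the target div has been seen
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1  # track nested divs so we close at the right one
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1
            self.found = True

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(markup, target_id):
    """Return (div was found, stripped text inside it)."""
    parser = DivText(target_id)
    parser.feed(markup)
    parser.close()
    return parser.found, "".join(parser.chunks).strip()


# What the server sends: an empty div followed by a script.
AS_SERVED = '<div id="onlinecount"></div><script>if (IS_MOBILE) { $("onlinecount").dispose(); }</script>'
# What the browser's inspector shows after JavaScript has run.
AS_RENDERED = '<div id="onlinecount"><strong>30,000+</strong></div>'

print(div_text(AS_SERVED, "onlinecount"))    # (True, '')
print(div_text(AS_RENDERED, "onlinecount"))  # (True, '30,000+')
```

The div is found in both cases; only the rendered version has any text in it, which is why a query for the text comes back empty.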
I agree with Praveen.
If you want to avoid launching a visible browser, you could use PhantomJS, which also has a Selenium driver:
http://phantomjs.org/
PhantomJS is a headless WebKit scriptable with a JavaScript API
Instead of Selenium scripts, you could also write PhantomJS scripts in JavaScript (but I assume you prefer to stay in a Python environment ;))
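PhantomJS is no longer maintained and current Selenium releases do not ship a driver for it, so today the same idea is usually done with a regular browser in headless mode. The following is a rough sketch assuming Selenium 4 and geckodriver are installed; `headless_firefox` and `element_text` are names invented here, and the Selenium import is kept inside the function only so the second helper can be exercised without a browser.

```python
def headless_firefox(wait_seconds=10):
    """Start Firefox without a visible window (assumes Selenium 4 and geckodriver)."""
    # Imported lazily so element_text() below stays usable without Selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    # Let find_element retry for a while, since scripts may add elements late.
    driver.implicitly_wait(wait_seconds)
    return driver


def element_text(driver, url, element_id):
    """Load url in the given driver and return the text of the element with that id."""
    driver.get(url)
    # "id" is the string value of selenium.webdriver.common.by.By.ID.
    return driver.find_element("id", element_id).text


# Usage (starts a real browser process, so it is left commented out here):
# driver = headless_firefox()
# try:
#     print(element_text(driver, "http://www.omegle.com", "onlinecount"))
# finally:
#     driver.quit()
```

The implicit wait only covers an element that is not in the DOM yet. If the element exists but its text is filled in later, an explicit wait on the text would be the more reliable choice.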