Python - Getting HTML with DOM - python

I have a flash card making program for Spanish that pulls information from here: http://www.spanishdict.com/examples/zorro (this is just an example). I've set it up so it gets the translations fine, but now I want to add examples. I noticed however, that the examples on that page are dynamically generated so I installed Beautiful Soup and HTML5 parser. The tag I'm specifically interested in is:
<span class="megaexamples-pair-part">Los perros siguieron el rastro del <span
class="megaexamples-highlight">zorro</span>. </span>
The code I'm using to try and retrieve it is:
soup = BeautifulSoup(urlopen("http://www.spanishdict.com/examples/zorro").read(), 'html5lib')
example = soup.findAll("span", {"class": "megaexamples-pair-part"})
However, no matter what way I swing it, I can't seem to get it to pull down the dynamically generated code. I have confirmed I get the page by doing a search for megaexamples-container, which works fine (and you can see by just right clicking in google chrome and hitting View Page Source).
Any ideas?

What you're doing is just pull the HTML page, and it's likely loading more data from the server via a JavaScript call.
You have 2 options:
Use a webdriver such as selenium to control a web browser that correctly loads the entire page (you can then parse it with BeautifulSoup or find elements with selenium's own tools). This incurs in some overhead due to the browser usage.
Use the network tab of your browser's developer tools (usually accessed with F12) to analyze incoming and outgoing requests from dynamic loading and use the requests module to replicate them. This is more efficient but might also be more tricky.
Remember to do this only if you have permission from the site's owner, though. In many cases it's against the ToS.

I used Pedro's answer to get me moving in the right direction. Here is what I did to get it to work:
Download selenium with pip install selenium
Download the driver for the browser you want to emulate. You can download them from this page. The driver must be in the PATH variable or you will need to specify the path in the constructor for the webdriver.
Import selenium with from selenium import webdriver
Now use the following code:
browser = webdriver.Chrome()
browser.get(raw_input("Enter URL: "))
html_source = browser.page_source
Note: If you did not put your driver in path, you have to call the constructor with browser = webdriver.Chrome(<PATH_TO_DRIVER_HERE>)
Note 2: You can use something like webdriver.Firefox() if you want a different browser.
Now you can parse it with something like: soup = BeautifulSoup(html_source, 'html5lib')

Related

unable to parse a webpage using python

I am trying to parse below webpage to get name of stocks hitting now all time high or low in the exchange.
https://www.bseindia.com/markets/equity/EQReports/HighLow.html?Flag=H#
however, when i download the webpage using beautiful soup and check the data i do not find the stock name or price mentioned in the webpage.
I wish to write a function to download the stocks that hit a new all time high each day please help what am i missing?
Part of the HTML on the page is generated dynamically by JavaScript. You are most likely using the requests library, which cannot handle HTML generated in this way.
What you can do, instead, is use the Selenium library, which allows you to launch an instance of a web browser controlled by Python, and get the page source from there.
from selenium import webdriver
path = '...' # path to driver here
url = 'https://www.bseindia.com/markets/equity/EQReports/HighLow.html?Flag=H#'
driver = webdriver.Chrome(path)
page_source = driver.get(url).page_source
By parsing page_source with BeautifulSoup, you can get what you want.

Web scraping for divs inserted by scripts

Sorry if this is a silly question.
I am trying to use Beautifulsoup and urllib2 in python to look at a url and extract all divs with a particular class. However, the result is always empty even though I can see the divs when I "inspect element" in chrome's developer tools.
I looked at the page source and those divs were not there which means they were inserted by a script. So my question is how can i look for those divs (using their class name) using Beautifulsoup? I want to eventually read and follow hrefs under those divs.
Thanks.
[Edit]
I am currently looking at the H&M website: http://www.hm.com/sg/products/ladies and I am interested to get all the divs with class 'product-list-item'
Try using selenium to run the javascript
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
html = driver.page_source
check this link enter link description here
you can get all info by change the url, this link can be found in chrome dev tools > Network
The reason why you got nothing from that specific url is simply because, the info you need is not there.
So first let me explain a little bit about how that page is loaded in a browser: when you request for that page(http://www.hm.com/sg/products/ladies), the literal content will be returned in the very first phase(which is what you got from your urllib2 request), then the browser starts to read/parse the content, basically it tells the browser where to find all information it needs to render the whole page(e.g. CSS to control layout, additional javascript/urls/pages to populate certain area etc.), and the browser does all that behind the scene. When you "inspect element" in chrome, the page is already fully loaded, and those info you want is not in original url, so you need to find out which url is used to populate those area and go after that specific url instead.
So now we need to find out what happens behind the scene, and a tool is needed to capture all traffic when that page loads(I would recommend fiddler).
As you can see, lots of things happen when you open that page in a browser!(and that's only part of the whole page-loading process) So by educated guess, those info you need should be in one of those three "api.hm.com" requests, and the best part is they are alread JSON formatted, which means you might not even bother with BeautifulSoup, the built-in json module could do the job!
OK, now what? Use urllib2 to simulate those requests and get what you want.
P.S. requests is a great tool for this kind of job, you can get it here.
Try This one :
from bs4 import BeautifulSoup
import urllib2
page = urllib2.urlopen("http://www.hm.com/sg/products/ladies")
soup = BeautifulSoup(page.read(),'lxml')
scrapdiv = open('scrapdiv.txt','w')
product_lists = soup.findAll("div",{"class":"o-product-list"})
print product_lists
for product_list in product_lists:
print product_list
scrapdiv.write(str(product_list))
scrapdiv.write("\n\n")
scrapdiv.close()

Executing a page's JavaScript at a low level with Python?

When this page is scraped with urllib2:
url = https://www.geckoboard.com/careers/
response = urllib2.urlopen(url)
content = response.read()
the following element (the link to the job) is nowhere to be found in the source (content)
Taking a look at the full source that gets rendered in a browser:
So it would appear that the FRONT-END ENGINEER element is dynamically loaded by Javascript. Is it possible to have this Javascript executed by urllib2 (or other low-level library) without involving e.g. Selenium, BeautifulSoup, or other?
The pieces of information are loaded using some ajax request. You could use firebug extension for mozilla or google chrome has it's own tool to get theese details. Just hit f12 in google chrome while opening the URL. You can find the complete details there.
There you will find a request with url https://app.recruiterbox.com/widget/13587/openings/
Information from the above url is rendered in that web page.
From what I understand, you are building something generic for multiple web-sites and don't want to go deep down in how a certain site is loaded, what requests are made under-the-hood to construct the page. In this case, a real browser is your friend - load the page in a real browser automated via selenium - then, once the page is loaded, pass the .page_source to lxml.html (from what I see this is your HTML parser of choice) for further parsing.
If you don't want a browser to show up or you don't have a display, you can go headless - PhantomJS or a regular browser on a virtual display.
Here is a sample code to get you started:
from lxml.html import fromstring
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.set_page_load_timeout(15)
driver.get("https://www.geckoboard.com/careers/")
# TODO: you might need a delay here
tree = fromstring(driver.page_source)
driver.close()
# TODO: parse HTML
You should also know that, there are plenty of methods to locate elements in selenium and you might not even need a separate HTML parser here.
I think you're looking for something like this: https://github.com/scrapinghub/splash

How to get the content from web browser using python?

I have a webpage :
http://kff.org/womens-health-policy/state-indicator/ultrasound-requirements/#
and I need to extract the table from this webpage.
Problem Encountered : I have been using BeautifulSoup and requests to get the url content. The problem with these methods is that I am able to get the web content even before the table is being generated.
So I get empty table
< table>
< thead>
< /thead>
< tbody>
< /tbody>
< /table>
My approach : Now I am trying to open the url in the browser using
webbrowser.open_new_tab(url) and then get the content from the browser directly . This will give the server to update the table and then i will be able to get the content from the page.
Problem : I am not sure how to fetch information from Web browser directly .
Right now i am using Mozilla on windows system.
Closest link found website Link . But it gives which sites are opened and not the content
Is there any other way to let the table load in urllib2 or beautifulsoup and requests ? or is there any way to get the loaded content directly from the webpage.
Thanks
To add to Santiclause answer, if you want to scrape java-script populated data you need something to execute it.
For that you can use selenium package and webdriver such as Firefox or PhantomJS (which is headless) to connect to the page, execute the scripts and get the data.
example for your case:
from selenium import webdriver
driver = webdriver.Firefox() # You can replace this with other web drivers
driver.get("http://kff.org/womens-health-policy/state-indicator/ultrasound-requirements/#")
source = driver.page_source # Here is your populated data.
driver.quit() # don't forget to quit the driver!
of course if you can access direct json like user Santiclause mentioned, you should do that. You can find it by checking the network tab when inspecting the element on the website, which needs some playing around.
The reason the table isn't being filled is because Python doesn't process the page it receives with urllib2 - so there's no DOM, no Javascript that runs, et cetera.
After reading through the source, it looks like the information you're looking for can be found at http://kff.org/datacenter.json?post_id=32781 in JSON format.

omegle lxml scrape not working

So I'm performing a scrape of omegle trying to scrape the users online.
This is the HTML code:
<div id="onlinecount">
<strong>
30,000+
</strong>
</div>
Now I would presume that using LXML it would be //div[#id="onlinecount"] to scrape any text within the , I want to get the numbers from the tags, but when I try to scrape this, I just end up with an empty list
Here's my relevant code:
print "\n Grabbing users online now from",self.website
site = requests.get(self.website)
tree = html.fromstring(site.text)
users = tree.xpath('//div[#id="onlinecount"]')
Note that the self.website variable is just http://www.omegle.com
Any ideas what I'm doing wrong? Note I can scrape other parts just not the number of online users.
I ended up using a different set of code which I learned from a friend.
Here's my full code for anyone interested.
http://pastebin.com/u1kTLZtJ
When you send a GET request to "http://www.omegle.com" using requests python module,what I observed is that there is no "onlinecount" in site.text. The reason is that part gets rendered by a javascript. You should use a library that is able to execute the javascript and give you the final html source that is rendered in a browser. One such third party library is Selenium http://selenium-python.readthedocs.org/. The only downside is that it opens a real web browser.
Below is a working code using selenium and an attached screenshot:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://www.omegle.com")
element = browser.find_element_by_id("onlinecount")
onlinecount = element.find_element_by_tag_name("strong")
You can also use GET method on this http://front1.omegle.com/status
that will return the count of online users and other details in JSON form
I have done a bit of looking at this and that particular part of the page is not XML but Javascript.
Here is the source (this is what the requests library is returning in your program)
<div id="onlinecount"></div>
<script>
if (IS_MOBILE) {
$('sharebuttons').dispose();
$('onlinecount').dispose();
}
</script>
</div>
As you can see, in lxml's eyes there is nothing but a script in the onlinecount div.
I agree with Praveen.
If you want to avoid launching a visible browser, you could use PhantomJS which also has a selenium driver :
http://phantomjs.org/
PhantomJS is a headless WebKit scriptable with a JavaScript API
Instead of selenium scripts, you could also write PhantomJS js scripts (but I assume you prefer to stay in Python env ;))

Categories