When this page is scraped with urllib2:
import urllib2
url = "https://www.geckoboard.com/careers/"
response = urllib2.urlopen(url)
content = response.read()
the following element (the link to the job) is nowhere to be found in the source (content)
Taking a look at the full source that gets rendered in a browser:
So it would appear that the FRONT-END ENGINEER element is loaded dynamically by JavaScript. Is it possible to have this JavaScript executed by urllib2 (or another low-level library) without involving e.g. Selenium, BeautifulSoup, or similar?
The information is loaded through an AJAX request. You can use the Firebug extension for Firefox, or Google Chrome's own developer tools, to see the details. Just press F12 in Google Chrome while opening the URL. You can find the complete details there.
There you will find a request to the URL https://app.recruiterbox.com/widget/13587/openings/
Information from the above url is rendered in that web page.
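Once the request has been spotted in the network tab, it can be fetched directly. The following is a minimal sketch for Python 3 (where urllib2 became urllib.request). The helper names fetch_body and parse_if_json are made up for illustration, and since it is not stated above whether this endpoint returns JSON or an HTML fragment, the code checks instead of assuming:

```python
import json
from urllib.request import urlopen


def fetch_body(url, timeout=10):
    """Return the response body of `url` decoded as text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def parse_if_json(text):
    """Return the decoded JSON value, or None if `text` is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# Usage (performs a network request, so it is left commented out):
# body = fetch_body("https://app.recruiterbox.com/widget/13587/openings/")
# data = parse_if_json(body)
# print(type(data) if data is not None else body[:200])
```

If parse_if_json returns None, the body is probably an HTML or JavaScript fragment and needs an HTML parser or some string handling instead.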
From what I understand, you are building something generic for multiple web-sites and don't want to go deep down in how a certain site is loaded, what requests are made under-the-hood to construct the page. In this case, a real browser is your friend - load the page in a real browser automated via selenium - then, once the page is loaded, pass the .page_source to lxml.html (from what I see this is your HTML parser of choice) for further parsing.
If you don't want a browser to show up or you don't have a display, you can go headless - PhantomJS or a regular browser on a virtual display.
Here is some sample code to get you started:
from lxml.html import fromstring
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.set_page_load_timeout(15)
driver.get("https://www.geckoboard.com/careers/")
# TODO: you might need a delay here
tree = fromstring(driver.page_source)
driver.close()
# TODO: parse HTML
You should also know that there are plenty of ways to locate elements in Selenium, so you might not even need a separate HTML parser here.
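The delay marked as a TODO in the sample above could be handled by polling page_source until the expected text shows up, instead of sleeping for a fixed time. This is a rough Python 3 sketch: wait_for_source is a hypothetical helper, it only relies on the driver having a page_source attribute, and Selenium's own WebDriverWait is the built-in alternative for waiting on elements.

```python
import time


def wait_for_source(driver, marker, timeout=10.0, poll_interval=0.5):
    """Poll driver.page_source until `marker` appears, then return the source.

    Raises TimeoutError if the marker has not appeared after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = driver.page_source
        if marker in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError("marker %r not found in page source" % marker)
        time.sleep(poll_interval)


# Usage with a real driver (commented out because it needs a browser):
# driver.get("https://www.geckoboard.com/careers/")
# html = wait_for_source(driver, "FRONT-END ENGINEER", timeout=15)
```

Polling on a piece of text that only exists after rendering avoids both returning too early and waiting longer than necessary.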
I think you're looking for something like this: https://github.com/scrapinghub/splash
Related
This is the code that I wrote. I watched a lot of tutorials, and they get the expected output with exactly the same code.
import requests
from bs4 import BeautifulSoup as bs
url="https://shop.punamflutes.com/pages/5150194068881408"
page=requests.get(url).text
soup=bs(page,'lxml')
#print(soup)
tag=soup.find('div',class_="flex xs12")
print(tag)
I always get None. Also, the class name seems strange. The "view source" output shows different markup than "inspect element" does.
Bs4 is weird. Sometimes it returns different code than what is on the page...it alters it depending on the source. Try using selenium. It works great and has many more uses than bs4. Most of all...it is super easy to find elements on a site.
It's not a bs4 problem: bs4 is correctly parsing what requests returns. The issue lies with the webpage itself.
If you inspect the "soup", you will see that the source of the page is a set of links to scripts that render the content on the page. For these scripts to be executed, you need a browser: requests only gets you what the web server returns and won't execute the JavaScript for you. You can verify this yourself by disabling JavaScript in your browser's developer tools.
The solution is to use a web browser (e.g. headless chrome + chromedriver) and Selenium to control it. There are plenty of good tutorials out there on how to do this.
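One quick way to confirm that a page is script-driven, without opening a browser, is to list the scripts in the raw HTML that requests returned. The following Python 3 sketch uses only the standard library; ScriptCollector and summarize_scripts are illustrative names, not part of any library mentioned above.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the src of every <script> tag and count inline scripts."""

    def __init__(self):
        super().__init__()
        self.external = []  # values of src attributes, in document order
        self.inline = 0     # number of <script> tags without a src

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            self.external.append(src)
        else:
            self.inline += 1


def summarize_scripts(html):
    """Return (list of external script URLs, number of inline scripts)."""
    collector = ScriptCollector()
    collector.feed(html)
    return collector.external, collector.inline


# Usage: pass the text that requests returned, e.g.
# external, inline = summarize_scripts(page)
```

A response with many scripts and an almost empty body is a strong hint that the visible content is rendered client-side.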
I have a flash card making program for Spanish that pulls information from here: http://www.spanishdict.com/examples/zorro (this is just an example). I've set it up so it gets the translations fine, but now I want to add examples. I noticed however, that the examples on that page are dynamically generated so I installed Beautiful Soup and HTML5 parser. The tag I'm specifically interested in is:
<span class="megaexamples-pair-part">Los perros siguieron el rastro del <span
class="megaexamples-highlight">zorro</span>. </span>
The code I'm using to try and retrieve it is:
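from urllib2 import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen
from bs4 import BeautifulSoup
soup = BeautifulSoup(urlopen("http://www.spanishdict.com/examples/zorro").read(), 'html5lib')
example = soup.findAll("span", {"class": "megaexamples-pair-part"})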
However, no matter what way I swing it, I can't seem to get it to pull down the dynamically generated code. I have confirmed I get the page by doing a search for megaexamples-container, which works fine (and you can see by just right clicking in google chrome and hitting View Page Source).
Any ideas?
What you're doing is just pulling the HTML page, which is likely loading more data from the server via a JavaScript call.
You have 2 options:
Use a webdriver such as Selenium to control a web browser that correctly loads the entire page (you can then parse it with BeautifulSoup or find elements with Selenium's own tools). This incurs some overhead because a real browser is involved.
Use the network tab of your browser's developer tools (usually accessed with F12) to analyze incoming and outgoing requests from dynamic loading and use the requests module to replicate them. This is more efficient but might also be more tricky.
Remember to do this only if you have permission from the site's owner, though. In many cases it's against the ToS.
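For the second option, replicating a request seen in the network tab mostly means reproducing its URL, query string and a few headers. The sketch below (Python 3, standard library only) builds such a request without sending it. The helper name build_xhr_request and the example.com endpoint are hypothetical, and which headers a given site actually requires has to be read off the real request in the developer tools.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_xhr_request(endpoint, params, referer):
    """Build a GET request that resembles a browser's AJAX call."""
    query = urlencode(params)
    url = (endpoint + "?" + query) if query else endpoint
    headers = {
        "Accept": "application/json",
        # Many JavaScript libraries send this header with AJAX calls.
        "X-Requested-With": "XMLHttpRequest",
        "Referer": referer,
    }
    return Request(url, headers=headers)


# Hypothetical endpoint, for illustration only.
request = build_xhr_request(
    "https://example.com/api/examples",
    {"q": "zorro"},
    "https://example.com/examples/zorro",
)
```

The request can then be sent with urllib.request.urlopen(request); the same URL, parameters and headers can equally be passed to the requests module.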
I used Pedro's answer to get me moving in the right direction. Here is what I did to get it to work:
Download selenium with pip install selenium
Download the driver for the browser you want to emulate. You can download them from this page. The driver must be in the PATH variable or you will need to specify the path in the constructor for the webdriver.
Import selenium with from selenium import webdriver
Now use the following code:
browser = webdriver.Chrome()
browser.get(raw_input("Enter URL: "))
html_source = browser.page_source
Note: If you did not put your driver in path, you have to call the constructor with browser = webdriver.Chrome(<PATH_TO_DRIVER_HERE>)
Note 2: You can use something like webdriver.Firefox() if you want a different browser.
Now you can parse it with something like: soup = BeautifulSoup(html_source, 'html5lib')
Sorry if this is a silly question.
I am trying to use Beautifulsoup and urllib2 in python to look at a url and extract all divs with a particular class. However, the result is always empty even though I can see the divs when I "inspect element" in chrome's developer tools.
I looked at the page source and those divs were not there, which means they were inserted by a script. So my question is: how can I look for those divs (by class name) using BeautifulSoup? I eventually want to read and follow the hrefs under those divs.
Thanks.
[Edit]
I am currently looking at the H&M website: http://www.hm.com/sg/products/ladies and I am interested to get all the divs with class 'product-list-item'
Try using selenium to run the javascript
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
html = driver.page_source
You can get all the info by requesting a different URL. That URL can be found in Chrome dev tools > Network.
The reason you got nothing from that specific URL is simply that the info you need is not there.
First, let me explain a little about how that page is loaded in a browser. When you request that page (http://www.hm.com/sg/products/ladies), the literal content is returned in the very first phase (this is what you got from your urllib2 request). The browser then reads and parses that content, which tells it where to find everything else it needs to render the whole page (e.g. CSS to control layout, additional JavaScript/URLs/pages to populate certain areas), and the browser does all of that behind the scenes. When you "inspect element" in Chrome, the page is already fully loaded. The info you want is not in the original URL, so you need to find out which URL is used to populate those areas and go after that specific URL instead.
So now we need to find out what happens behind the scenes, and a tool is needed to capture all the traffic when that page loads (I would recommend Fiddler).
A capture of that traffic shows that a lot happens when you open that page in a browser (and that's only part of the whole page-loading process). So, as an educated guess, the info you need should be in one of the three "api.hm.com" requests, and the best part is that they are already JSON formatted, which means you might not even need BeautifulSoup; the built-in json module could do the job!
OK, now what? Use urllib2 to simulate those requests and get what you want.
P.S. requests is a great tool for this kind of job, you can get it here.
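When the data arrives as JSON, as described above, the layout of the response is often unknown at first. A small recursive search for a key name can help while exploring it. This is only a sketch in Python 3: find_values is an illustrative helper, and the sample payload is made up, since the real api.hm.com response is not shown here.

```python
import json


def find_values(node, key):
    """Yield every value stored under `key` anywhere in a decoded JSON tree."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            # Keep descending: the same key may appear at deeper levels.
            yield from find_values(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Made-up payload for illustration only.
sample = json.loads(
    '{"items": [{"name": "A", "link": "/a"}, {"name": "B", "link": "/b"}]}'
)
links = list(find_values(sample, "link"))
```

Once the structure is understood, direct indexing into the decoded dictionaries is simpler and faster than a recursive search.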
Try this one:
from bs4 import BeautifulSoup
import urllib2
page = urllib2.urlopen("http://www.hm.com/sg/products/ladies")
soup = BeautifulSoup(page.read(),'lxml')
scrapdiv = open('scrapdiv.txt','w')
product_lists = soup.findAll("div",{"class":"o-product-list"})
print product_lists
for product_list in product_lists:
    print product_list
    scrapdiv.write(str(product_list))
    scrapdiv.write("\n\n")
scrapdiv.close()
I am trying to scrape this page on Flipkart:
http://www.flipkart.com/moto-x-play/p/itmeajtqp9sfxgsk?pid=MOBEAJTQRH4CCRYM&ref=L%3A7224647610489585789&srno=p_1&query=moto+x+play&otracker=from-search
I am trying to find the div with class "fk-ui-ccarousel-supercontainer same-vreco-section reco-carousel-border-top sameHorizontalReco", but it returns an empty result.
from bs4 import BeautifulSoup
import requests
url = "http://www.flipkart.com/moto-x-play/p/itmeajtqp9sfxgsk?pid=MOBEAJTQRH4CCRYM&ref=L%3A7224647610489585789&srno=p_1&query=moto%20x%20play&otracker=from-search"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")  # name the parser explicitly
divs = soup.find_all("div",{"class":"fk-ui-ccarousel-supercontainer same-vreco-section reco-carousel-border-top sameHorizontalReco"})
print divs
divs is empty. I copied the class name using inspect element.
I found the answer in this question: http://stackoverflow.com/questions/22028775/tried-python-beautifulsoup-and-phantom-js-still-cant-scrape-websites
When you use requests.get(url) you load the HTML content of the url without JavaScript enabled. Without JavaScript enabled, the section of the page called 'customers who viewed this product also viewed' is never even rendered.
You can explore this behaviour by turning off JavaScript in your browser. If you scrape regularly, you might also want to download a JavaScript switcher plugin.
An alternative that you might want to look into is using a browser automation tool such as selenium.
requests.get(..) returns the content of a plain HTTP GET on that URL. The JavaScript files that the page references will not be downloaded, and any inline JavaScript will not be executed either.
If Flipkart uses JS to modify the DOM after it is loaded in the browser, those changes will not be reflected in the page.content or page.text values.
You could try a different parser instead of the default parser in Beautiful Soup. I tried html5lib and it worked for a different website; maybe it will for you too. It will be slower than the default parser, but could be faster than Selenium or other full-fledged headless browsers. Keep in mind that no parser executes JavaScript, so this only helps when the markup is already present in the response.
I have a webpage :
http://kff.org/womens-health-policy/state-indicator/ultrasound-requirements/#
and I need to extract the table from this webpage.
Problem encountered: I have been using BeautifulSoup and requests to get the URL content. The problem with these methods is that I get the web content before the table has been generated, so I get an empty table:
<table>
<thead>
</thead>
<tbody>
</tbody>
</table>
My approach: Now I am trying to open the URL in the browser using webbrowser.open_new_tab(url) and then get the content from the browser directly. This will give the server time to update the table, and then I will be able to get the content from the page.
Problem: I am not sure how to fetch information from the web browser directly.
Right now I am using Mozilla on a Windows system.
The closest thing I found is this link, but it only tells which sites are open, not their content.
Is there any other way to let the table load in urllib2 or BeautifulSoup and requests? Or is there any way to get the loaded content directly from the web page?
Thanks
To add to Santiclause's answer: if you want to scrape JavaScript-populated data, you need something to execute it.
For that you can use the selenium package and a webdriver such as Firefox or PhantomJS (which is headless) to connect to the page, execute the scripts, and get the data.
Example for your case:
from selenium import webdriver
driver = webdriver.Firefox() # You can replace this with other web drivers
driver.get("http://kff.org/womens-health-policy/state-indicator/ultrasound-requirements/#")
source = driver.page_source # Here is your populated data.
driver.quit() # don't forget to quit the driver!
Of course, if you can access the JSON directly as user Santiclause mentioned, you should do that. You can find it by checking the network tab when inspecting the element on the website, which takes some playing around.
The reason the table isn't being filled is that Python doesn't process the page it receives with urllib2, so there's no DOM, no JavaScript that runs, et cetera.
After reading through the source, it looks like the information you're looking for can be found at http://kff.org/datacenter.json?post_id=32781 in JSON format.
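Since the goal is a table, the decoded JSON could then be written out as CSV. The sketch below (Python 3) assumes the payload has already been reduced to a list of flat dictionaries, one per table row. The answer above does not show the layout of datacenter.json, so that shape, the helper name rows_to_csv and the sample keys are all assumptions.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of flat dicts to CSV text.

    The keys of the first row become the header; keys that only occur in
    later rows are ignored, and missing values are written as empty cells.
    """
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0].keys()),
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows for illustration only.
table_text = rows_to_csv([
    {"name": "a", "value": "1"},
    {"name": "b", "value": "2"},
])
```

The returned text can be written to a file or handed to any tool that reads CSV.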