I'd like to get the data shown by Inspect Element using Python. I'm able to download the source code with BeautifulSoup, but now I need the text shown in a webpage's Inspect Element view. I'd truly appreciate any advice on how to do it.
Edit:
By Inspect Element I mean: in Google Chrome, right-clicking gives an option called Inspect Element, which shows the code related to each element of that particular page. I'd like to extract that code, or just its text strings.
If you want to automatically fetch a web page from Python in a way that runs JavaScript, you should look into Selenium. It can automatically drive a web browser (even a headless one such as PhantomJS, so you don't have to have a window open).
In order to get the HTML, you'll need to evaluate some JavaScript. Simple sample code; alter to suit:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("http://google.com")
# This will get the initial html - before javascript
html1 = driver.page_source
# This will get the html after on-load javascript
html2 = driver.execute_script("return document.documentElement.innerHTML;")
Note 1: If you want a specific element or elements, you actually have a couple of options -- parse the HTML in Python, or write more specific JavaScript that returns what you want.
Note 2: if you actually need specific information from Chrome's tools that is not just dynamically generated HTML, you'll need a way to hook into Chrome itself. No way around that.
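To illustrate Note 1, here is a minimal sketch of both options, reusing the driver from the snippet above; the id "content" is a made-up placeholder for whatever element you're after:
from bs4 import BeautifulSoup
# Option 1: pull the rendered HTML into Python and parse it there
soup = BeautifulSoup(driver.page_source)
element_text = soup.find(id="content").get_text()
# Option 2: have the browser do the selecting and return only what you need
element_html = driver.execute_script(
    "return document.getElementById('content').outerHTML;")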
I would like to update the answer from Jason S. I wasn't able to start PhantomJS on OS X:
driver = webdriver.PhantomJS()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 50, in __init__
self.service.start()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/service.py", line 74, in start
raise WebDriverException("Unable to start phantomjs with ghostdriver.", e)
selenium.common.exceptions.WebDriverException: Message: Unable to start phantomjs with ghostdriver.
Resolved by the answer here: download the PhantomJS executable and pass its path explicitly:
driver = webdriver.PhantomJS("phantomjs-2.0.0-macosx/bin/phantomjs")
Inspect Element shows all the HTML of the page, which is the same as fetching the HTML using urllib.
Do something like this:
import urllib
from bs4 import BeautifulSoup as BS
html = urllib.urlopen(URL).read()
soup = BS(html)
for tag in soup.findAll(tag_name):
    print tag.get_text()
BeautifulSoup can be used to parse the HTML document and extract anything you want; it's not designed for downloading. You can find the elements you want by their class and id.
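For instance, a minimal sketch of looking elements up by id and class (the names page-title and result-item are made-up placeholders):
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
# a single element, looked up by its id
header = soup.find(id="page-title")
print header.get_text()
# all elements with a given class
for item in soup.findAll("div", {"class": "result-item"}):
    print item.get_text()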
Related
The issue I'm having is that I want to grab the related links from this page: http://support.apple.com/kb/TS1538
If I Inspect Element in Chrome or Safari, I can see the <div id="outer_related_articles"> and all the articles listed. If I attempt to grab it with BeautifulSoup, it grabs the page and everything except the related articles.
Here's what I have so far:
import urllib2
from bs4 import BeautifulSoup
url = "http://support.apple.com/kb/TS1538"
response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read())
print soup
This section is loaded using JavaScript. Disable your browser's JavaScript to see how BeautifulSoup "sees" the page.
From here you have two options:
Use a headless browser that will execute the JavaScript. See this question about it: Headless Browser for Python (Javascript support REQUIRED!)
Try to figure out how the Apple site loads the content and simulate it - it probably makes an AJAX call to some address.
After some digging it seems it makes a request to this address (http://km.support.apple.com/kb/index?page=kmdata&requestid=2&query=iOS%3A%20Device%20not%20recognized%20in%20iTunes%20for%20Windows&locale=en_US&src=support_site.related_articles.TS1538&excludeids=TS1538&callback=KmLoader.receiveSuccess) and uses JSONP to load the results, with KmLoader.receiveSuccess being the name of the receiving function. Use Firebug or Chrome dev tools to inspect the page in more detail.
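As a rough sketch of that second option, you could fetch that address directly and strip the JSONP wrapper yourself; this assumes the response really has the form KmLoader.receiveSuccess({...}) - adjust it to whatever the endpoint actually returns:
import json
import urllib2
jsonp_url = ("http://km.support.apple.com/kb/index?page=kmdata&requestid=2"
             "&query=iOS%3A%20Device%20not%20recognized%20in%20iTunes%20for%20Windows"
             "&locale=en_US&src=support_site.related_articles.TS1538"
             "&excludeids=TS1538&callback=KmLoader.receiveSuccess")
raw = urllib2.urlopen(jsonp_url).read()
# strip the "KmLoader.receiveSuccess(" prefix and the trailing ")"
payload = raw[raw.index("(") + 1:raw.rindex(")")]
data = json.loads(payload)
print data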
I ran into a similar problem: HTML content that is created dynamically may not be captured by BeautifulSoup. A very basic solution is to make the script wait a few seconds before capturing the contents, or to use Selenium's ability to wait for an element and then proceed. For the former, this worked for me:
import time
# .... your initial bs4 code here
time.sleep(5) #5 seconds, it worked with 1 second too
html_source = browser.page_source
# .... do whatever you want to do with bs4
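For the latter (letting Selenium wait for an element instead of sleeping blindly), a minimal sketch - the id "results" is a made-up placeholder for whatever element the JavaScript creates:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# wait up to 10 seconds for the dynamically created element to appear
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, "results")))
html_source = browser.page_source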
I have been struggling for a couple of days to download a CSV file using Selenium. Please advise - much appreciated!
I use the Selenium WebDriver language bindings (Python) 2.4 + the HtmlUnit browser.
Code:
browser.find_element_by_id("generate_csv").click()
csv_file = browser.page_source
On that webpage, if I use Firefox, clicking the "generate_csv" button generates a CSV file and usually downloads it. As I am using HtmlUnit, it is difficult to download files, so I use the page_source attribute to get the CSV content.
Sometimes it succeeds, but sometimes it throws an error:
org.openqa.selenium.NoSuchElementException: Returned node was not an HTML element
Could someone help me analyze why this happens? I am so confused - running the script feels like rolling dice.
Thank you.
Update: (Part of traceback)
14:29:15.913 INFO - Executing: [find element: By.selector: .controlbuttons > a > img[alt='CSV']])
14:29:16.404 WARN - Exception thrown
org.openqa.selenium.NoSuchElementException: Returned node was not an HTML element
For documentation on this error, please visit: ...
Driver info: driver.version: EventFiringWebDriver
at org.openqa.selenium.htmlunit.HtmlUnitDriver.findElementByCssSelector(HtmlUnitDriver.java:952)
at org.openqa.selenium.By$ByCssSelector.findElement(By.java:426)
at org.openqa.selenium.htmlunit.HtmlUnitDriver$5.call(HtmlUnitDriver.java:1565)
at org.openqa.selenium.htmlunit.HtmlUnitDriver$5.call(HtmlUnitDriver.java:1)
at org.openqa.selenium.htmlunit.HtmlUnitDriver.implicitlyWaitFor(HtmlUnitDriver.java:1241)
at org.openqa.selenium.htmlunit.HtmlUnitDriver.findElement(HtmlUnitDriver.java:1562)
at org.openqa.selenium.htmlunit.HtmlUnitDriver.findElement(HtmlUnitDriver.java:530)
at sun.reflect.GeneratedMethodAccessor129.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.openqa.selenium.support.events.EventFiringWebDriver$2.invoke(EventFiringWebDriver.java:101)
at com.sun.proxy.$Proxy14.findElement(Unknown Source)
at org.openqa.selenium.support.events.EventFiringWebDriver.findElement(EventFiringWebDriver.java:184)
at org.openqa.selenium.remote.server.handler.FindElement.call(FindElement.java:47)
at org.openqa.selenium.remote.server.handler.FindElement.call(FindElement.java:1)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at org.openqa.selenium.remote.server.DefaultSession$1.run(DefaultSession.java:169)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
14:29:16.405 WARN - Exception: Returned node was not an HTML element
Sounds like your HTML doesn't finish loading before you click the generate-CSV button. This happens a lot with Selenium when HTML is loaded from JavaScript - at least for me.
Not sure if this is the greatest way to handle it, but I would use a recursive method to click until you get it:
import time
from selenium.common.exceptions import NoSuchElementException
def generateCsv(browser):
    try:
        browser.find_element_by_id("generate_csv").click()
        csv_file = browser.page_source
        return csv_file
    except NoSuchElementException:
        # the button isn't in the DOM yet - wait and retry
        time.sleep(3)
        return generateCsv(browser)
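A non-recursive variant with a bounded number of retries might look like this (just a sketch; the retry count and delay are arbitrary):
import time
from selenium.common.exceptions import NoSuchElementException
def generate_csv_with_retries(browser, retries=10):
    for _ in range(retries):
        try:
            browser.find_element_by_id("generate_csv").click()
            return browser.page_source
        except NoSuchElementException:
            time.sleep(3)  # button not in the DOM yet - wait and try again
    raise RuntimeError("generate_csv button never appeared")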
Hope that helps
I'm scraping my site, which uses a Google Custom Search iframe. I am using Selenium to switch into the iframe and output the data, and BeautifulSoup to parse it.
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import html5lib
driver = webdriver.Firefox()
driver.get('http://myurl.com')
time.sleep(4)  # give the page's JavaScript some time to run
iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to_default_content()
driver.switch_to_frame(iframe)
output = driver.page_source
soup = BeautifulSoup(output, "html5lib")
print soup
I am successfully getting into the iframe and getting some of the data. At the very top of the output, there is text about JavaScript needing to be enabled, the page being reloaded, etc. The part of the page I'm looking for (which I can see in the source via developer tools) isn't there, so obviously some of it isn't loading.
So, my question: how do you get Selenium to run all of a page's JavaScript? Is it done automatically?
I see a lot of posts on SO about running an individual function, etc... but nothing about running all of the JS on the page.
Any help is appreciated.
Ahh, so it was in the tag that featured the "Javascript must be enabled" text.
I just posted a question on how to switch within the nested iframe here:
Python Selenum Swith into an iframe within an iframe
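For reference, a minimal sketch of switching into an iframe that is nested inside another iframe (the [0] indexes are placeholders for whichever frames you need):
# switch into the outer iframe first
outer = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to_frame(outer)
# the inner iframe only becomes reachable once we are inside the outer one
inner = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to_frame(inner)
# ... work with the inner document, then return to the top-level page
driver.switch_to_default_content()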
I am learning to use Python Selenium and BeautifulSoup for web scraping. Currently, I am trying to scrape the hot searches on Google search trends http://www.google.com/trends/hottrends#pn=p5
This is my current code. However, I realized the full HTML is not downloaded and I only have content from the most recent few dates. What can I do to rectify this problem?
from selenium import webdriver
from bs4 import BeautifulSoup
googleURL = "http://www.google.com/trends/hottrends#pn=p5"
browser = webdriver.Firefox()
browser.get(googleURL)
content = browser.page_source
soup = BeautifulSoup(content)
print soup
Users add more content to the page (from previous dates) by clicking the <div onclick="control.moreData()" id="moreLink">More...</div> element at the bottom of the page.
So to get your desired content, you could use Selenium to click the id="moreLink" element or execute some JavaScript to call control.moreData(); in a loop.
For example, if you want to get all content as far back as Friday, February 15, 2013 (it looks like a string of this format exists for every date of loaded content), your Python might look something like this:
content = browser.page_source
desired_content_is_loaded = False
while not desired_content_is_loaded:
    if "Friday, February 15, 2013" not in content:
        browser.execute_script("control.moreData();")
        content = browser.page_source
    else:
        desired_content_is_loaded = True
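The click-based alternative is essentially the same loop, just driving the More... element instead of calling the JavaScript directly (a sketch; the one-second pause is arbitrary):
import time
content = browser.page_source
while "Friday, February 15, 2013" not in content:
    browser.find_element_by_id("moreLink").click()
    time.sleep(1)  # give the newly requested content a moment to load
    content = browser.page_source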
EDIT:
If you disable JavaScript in your browser and reload the page, you will see that there is no "trends" content at all. That tells me those items are loaded dynamically - they are not part of the HTML document which is downloaded when you open the page. Selenium's .get() waits for the HTML document to load, but not for all JS to complete. There's no telling whether async JS will complete before or after any other event; it completes when it's ready, and that can be different every time. That would explain why you might get all, some, or none of that content when you call browser.page_source - it depends on how far the async JS has gotten at that moment.
So, after opening the page, you might try waiting a few seconds before getting the source - giving the JS which loads the content time to complete.
import time
browser.get(googleURL)
time.sleep(3)
content = browser.page_source
I've been googling this all day without finding the answer, so apologies in advance if this has already been answered.
I'm trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.
After a couple of days of research, I decided that Selenium was my best chance. I've found a way to grab all the text with Selenium; unfortunately, the same text is being grabbed multiple times:
from selenium import webdriver
import codecs
filen = codecs.open('outoput.txt', encoding='utf-8', mode='w+')
driver = webdriver.Firefox()
driver.get("http://www.examplepage.com")
allelements = driver.find_elements_by_xpath("//*")
ferdigtxt = []
for i in allelements:
    if i.text in ferdigtxt:
        pass
    else:
        ferdigtxt.append(i.text)
        filen.writelines(i.text)
filen.close()
driver.quit()
The if condition inside the for loop is an attempt to eliminate the problem of fetching the same text multiple times; however, it only works as planned on some webpages. (It also makes the script a lot slower.)
I'm guessing the reason for my problem is that - when asking for the inner text of an element - I also get the inner text of the elements nested inside the element in question.
Is there any way around this? Is there some sort of master element whose inner text I can grab? Or a completely different way that would enable me to reach my goal? Any help would be greatly appreciated, as I'm out of ideas for this one.
Edit: the reason I used Selenium and not Mechanize and Beautiful Soup is because I wanted JavaScript-rendered text
Using lxml, you might try something like this:
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean
url="http://www.yahoo.com"
ignore_tags=('script','noscript','style')
with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url) # Load page
    content=browser.page_source
    cleaner=clean.Cleaner()
    content=cleaner.clean_html(content)
    with open('/tmp/source.html','w') as f:
        f.write(content.encode('utf-8'))
    doc=LH.fromstring(content)
    with open('/tmp/result.txt','w') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags: continue
            text=elt.text or ''
            tail=elt.tail or ''
            words=' '.join((text,tail)).strip()
            if words:
                words=words.encode('utf-8')
                f.write(words+'\n')
This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).
Here's a variation on #unutbu's answer:
#!/usr/bin/env python
import sys
from contextlib import closing
import lxml.html as html # pip install 'lxml>=2.3.1'
from lxml.html.clean import Cleaner
from selenium.webdriver import Firefox # pip install selenium
from werkzeug.contrib.cache import FileSystemCache # pip install werkzeug
cache = FileSystemCache('.cachedir', threshold=100000)
url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"
# get page
page_source = cache.get(url)
if page_source is None:
    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        browser.get(url)
        page_source = browser.page_source
    cache.set(url, page_source, timeout=60*60*24*7) # week in seconds
# extract text
root = html.document_fromstring(page_source)
# remove flash, images, <script>,<style>, etc
Cleaner(kill_tags=['noscript'], style=True)(root) # lxml >= 2.3.1
print root.text_content() # extract text
I've separated your task into two parts:
get page (including elements generated by javascript)
extract text
The two parts are connected only through the cache. You can fetch pages in one process and extract text in another, or defer the extraction and do it later with a different algorithm.
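For example, a separate extraction-only script might look like this (a sketch, assuming the same .cachedir directory and that the URL was already fetched and cached by the script above):
import sys
import lxml.html as html
from lxml.html.clean import Cleaner
from werkzeug.contrib.cache import FileSystemCache
cache = FileSystemCache('.cachedir', threshold=100000)
url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"
page_source = cache.get(url)
if page_source is None:
    sys.exit("page is not cached yet - run the fetching script first")
root = html.document_fromstring(page_source)
Cleaner(kill_tags=['noscript'], style=True)(root)  # drop <script>, <style>, etc.
print root.text_content()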