I am following this video to get myself familiar with Selenium. My code is:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from pyvirtualdisplay import Display
import os
chromedriver = "/usr/bin/chromedriver"
os.environ['webdriver.chrome.driver'] = chromedriver
display = Display(visible=0, size=(800,600))
display.start()
br = webdriver.Chrome(chromedriver)
br.get("http://www.google.com")
Now, to print the results:
q = br.find_element_by_name('q')
q.send_keys('python')
q.send_keys(Keys.RETURN)
print(br.title)
results = br.find_elements_by_class_name('g')
print(results)
for result in results:
    print(result.text)
    print("-" * 140)
The output I am getting is just python for the title, and when I try to print results it is [].
When I try the code below in Chrome's JavaScript console, it works fine.
res = document.getElementsByClassName('g')[0]
<li class="g">…</li>
res.textContent
" Python Programming Language – Official Websitewww.python.org/Cached - SimilarShareShared on Google+. View the post.You +1'd this publicly. UndoHome page for Python, an interpreted, interactive, object-oriented, extensible programming language. It provides an extraordinary combination of clarity and ...CPython - Documentation - IDEs - GuiProgramming"
So, any idea why I am not getting any results with Selenium + Python?
Adding time.sleep(3) after q.send_keys(Keys.RETURN) seems to solve the problem. That's because when you press Keys.RETURN, an AJAX request starts, and when you try to collect the results they aren't on the page yet. As far as I know, Selenium has no straightforward way to determine whether scripts like this have finished executing.
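That said, an explicit wait can poll for the outcome instead of sleeping for a fixed time; here is a minimal sketch, assuming the result containers still carry the class name 'g':
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Wait up to 10 seconds for at least one result container to appear,
# rather than sleeping for a fixed amount of time.
results = WebDriverWait(br, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, 'g')))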
I think it would be more reliable to do:
br.get("http://www.google.com/search?q=python")
results = br.find_elements_by_class_name('g')
Related
Title sums it up, really. I'm completely new to using Selenium with Python. From a Python script, I'm trying to run a command in a website's console, and then retrieve that command's output for use in the script.
from selenium import webdriver
from selenium.webdriver.common.by import By
chromedriver = "path to driver"
driver = webdriver.Chrome(executable_path=chromedriver)
# load some site
driver.get('http://example.com')  # originally I used foo.com but Stack Overflow doesn't like that
driver.execute_script("console.log('hello world')")
# print messages
for entry in driver.get_log('browser'):
    print(entry)
But this doesn't return anything.
Upon inspecting the opened page and going to the console, I saw that my hello world message was indeed there.
But I have no idea why it isn't being returned by my code. Any help would be super appreciated, thank you!
Here's how to save the console logs and print them out:
from selenium import webdriver
driver = webdriver.Chrome()
driver.execute_script("""
    console.stdlog = console.log.bind(console);
    console.logs = [];
    console.log = function(){
        console.logs.push(Array.from(arguments));
        console.stdlog.apply(console, arguments);
    }
""")
driver.execute_script("console.log('hello world ABC')")
driver.execute_script("console.log('hello world 123')")
print(driver.execute_script("return console.logs"))
driver.quit()
Here's the output of that:
[['hello world ABC'], ['hello world 123']]
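As an aside, the usual reason driver.get_log('browser') comes back empty is that console capture isn't enabled. Here is a hedged sketch of turning it on via Chrome's logging capability (Selenium 4 syntax; exact behavior varies across Chrome and chromedriver versions):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Ask chromedriver to record all browser console messages so that
# driver.get_log('browser') returns them.
options = Options()
options.set_capability('goog:loggingPrefs', {'browser': 'ALL'})
driver = webdriver.Chrome(options=options)
driver.get('http://example.com')
driver.execute_script("console.log('hello world')")
for entry in driver.get_log('browser'):
    print(entry)
driver.quit()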
I am attempting to web-scrape info off of the following website: https://www.axial.net/forum/companies/united-states-family-offices/
I am trying to scrape the description for each family office, so "https://www.axial.net/forum/companies/united-states-family-offices/" + insert_company_name are the pages I need to scrape.
So I wrote the following code to test the program for just one page:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome('insert_path_here/chromedriver')
driver.get("https://network.axial.net/company/ansaco-llp")
page_source = driver.page_source
soup2 = soup(page_source,"html.parser")
soup2.findAll('axl-teaser-description')[0].text
This works for the single page, as long as the description doesn't have a "show full description" drop down button. I will save that for another question.
I wrote the following loop:
# Note: lst2 has all the names of the companies. I made sure they match the webpage.
lst3 = []
for key in lst2[1:]:
    driver.get("https://network.axial.net/company/" + key.lower())
    page_source = driver.page_source
    for handle in driver.window_handles:
        driver.switch_to.window(handle)
    word_soup = soup(page_source, "html.parser")
    if word_soup.findAll('axl-teaser-description') == []:
        lst3.append('null')
    else:
        c = word_soup.findAll('axl-teaser-description')[0].text
        lst3.append(c)
print(lst3)
When I run the loop, all of the values come out as "null", even the ones without "click for full description" buttons.
I edited the loop to print out word_soup instead, and the page source is different from what I get when I run the code without a loop; it does not contain the description text.
I don't understand why a loop would cause that, but apparently it does. Does anyone know how to fix this problem?
Found the solution: pause the program for 3 seconds after driver.get:
import time
lst3 = []
for key in lst2[1:]:
    driver.get("https://network.axial.net/company/" + key.lower())
    time.sleep(3)
    page_source = driver.page_source
    word_soup = soup(page_source, "html.parser")
    if word_soup.findAll('axl-teaser-description') == []:
        lst3.append('null')
    else:
        c = word_soup.findAll('axl-teaser-description')[0].text
        lst3.append(c)
print(lst3)
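A fixed three-second pause works, but an explicit wait fails less often on slow pages and wastes less time on fast ones. Here is a hedged sketch of the same loop, assuming the axl-teaser-description element is what the page's JavaScript eventually renders:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
wait = WebDriverWait(driver, 10)
lst3 = []
for key in lst2[1:]:
    driver.get("https://network.axial.net/company/" + key.lower())
    try:
        # Block until the JavaScript-rendered description element exists.
        elem = wait.until(EC.presence_of_element_located(
            (By.TAG_NAME, 'axl-teaser-description')))
        lst3.append(elem.text)
    except TimeoutException:
        lst3.append('null')
print(lst3)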
I see that the page uses JavaScript to generate the text, which means the description doesn't show up in the initial page source. I don't quite understand why you're iterating through and switching to every window handle Selenium has open, but either way you won't find the description in the raw page source with BeautifulSoup.
Honestly, I'd look for a better data source if you can; otherwise you'll have to do it with Selenium, which is slow for bulk scraping like this.
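If you do stay with Selenium, one option is to skip the BeautifulSoup re-parse entirely and read the rendered element straight from the live DOM; a minimal sketch:
# Read the description from the live DOM instead of re-parsing page_source.
elems = driver.find_elements_by_tag_name('axl-teaser-description')
description = elems[0].text if elems else 'null'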
I'm using the Selenium module to try to web-scrape, but when I print out the element, it seems to return a reference to where the data is stored on the Selenium server rather than the data itself? I'm not exactly sure how this works. Anyway, here's my code. I'm very confused. Can someone tell me what I'm doing wrong?
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://caribeexpress.com.do/') #get method
elem2 = browser.find_elements_by_css_selector('div.plan:nth-child(3) > div:nth-child(2) > span:nth-child(2)')
print(elem2)
elems3 = browser.find_elements_by_class_name('value')
print(elems3)
elem4 = browser.find_element_by_xpath('//*[@id="content-wrapper"]/div[2]/div[3]/div/span[2]')
print(elem4)
For some reason, what displays in my Python IDE doesn't display here, so I included it in my gist:
https://gist.github.com/jtom343
In case you want to extract the text between the span tags, replace this:
print(elem2)
with:
print(elem2[0].text.strip())
(note that find_elements_by_css_selector returns a list, so you need to take the first match), and replace this:
print(elem4)
with:
print(elem4.text.strip())
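For context, printing a WebElement shows its repr (the session and element IDs), not the page content, which is why the output looked like a reference into the Selenium server. For example:
# Printing a WebElement shows its internal reference, not the page text.
elem = browser.find_element_by_class_name('value')
print(elem)       # <selenium.webdriver.remote.webelement.WebElement (session="...", element="...")>
print(elem.text)  # the visible text inside the element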
I am trying to get the video URL from links on this page. The video link can be seen at https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html (open in Chrome).
For that I wrote the Chrome webdriver code below:
import os
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from pyvirtualdisplay import Display
chromedriver = '/usr/local/bin/chromedriver'
os.environ['webdriver.chrome.driver'] = chromedriver
display = Display(visible=0, size=(800,600))
display.start()
driver = webdriver.Chrome(chromedriver)
driver.get('https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html')
try:
    element = WebDriverWait(driver, 20).until(
        lambda driver: driver.find_elements_by_class_name('yvp-main'))
    self.yahoo_video_trend = []
    for s in driver.find_elements_by_class_name('yvp-main'):
        print("Processing link - ", item['link'])
        trend = item
        print(item['description'])
        trend['video_link'] = s.find_element_by_tag_name('video').get_attribute('src')
        print()
        print(s.find_element_by_tag_name('video').get_attribute('src'))
        self.yahoo_video_trend.append(trend)
except:
    return
This works fine on my local system, but when I run it on my Azure server it does not give any result for s.find_element_by_tag_name('video').get_attribute('src').
I have installed Chrome on my Azure server.
Update:
Please note, I already tried requests and BeautifulSoup, but since Yahoo loads the HTML content dynamically from JSON, I could not get the video URL with them.
And yes, the Azure server is a plain Linux system with command-line access, not a desktop environment.
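As an aside, on a display-less Linux server like that, Chrome's headless mode is a common alternative to pyvirtualdisplay; a minimal hedged sketch:
from selenium import webdriver
# Run Chrome without any display server -- an alternative to pyvirtualdisplay
# on a headless Linux machine.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
driver = webdriver.Chrome(options=options)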
I tried to reproduce your issue using your code. However, I found there was no tag named video on that page (https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html), testing with both IE and Chrome.
I used the browser developer tools to check the HTML code.
It seems that this page uses a Flash player to play the video, not an HTML5 video element.
For this reason, I suggest you check whether your code uses the right tag name.
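To see which player markup the page actually serves in your environment, here is a quick hedged probe from the Selenium session:
# Probe the rendered page for HTML5 vs. Flash-style player markup.
videos = driver.find_elements_by_tag_name('video')
flash = (driver.find_elements_by_tag_name('embed')
         + driver.find_elements_by_tag_name('object'))
print("video tags:", len(videos), "| embed/object tags:", len(flash))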
Any concerns, please feel free to let me know.
We tried to reproduce the error on our side. I was not able to get the Chrome driver to work, but I did try the Firefox driver and it worked fine: it was able to load the page and get the link via the URL.
Can you change your code to print the exception and send it to us, to see where the script is failing?
Change your code from:
except:
    return
to:
try:
    do
except Exception as e:
    print(str(e))
Send us the exception, so we can take a look.
How do I print a webpage using Selenium, please?
import time
from selenium import webdriver
# Initialise the webdriver
chromeOps = webdriver.ChromeOptions()
chromeOps.binary_location = "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe"
chromeOps.add_argument("--enable-internal-flash")
browser = webdriver.Chrome("C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe", port=4445, chrome_options=chromeOps)
time.sleep(3)
# Login to Webpage
browser.get('http://www.webpage.com')
Note: I am using the (at present) current version of Google Chrome: Version 32.0.1700.107 m.
While it's not directly printing the webpage, it is easy to take a screenshot of the entire current page:
browser.save_screenshot("screenshot.png")
Then the image can be printed using any image printing library. I haven't personally used any such library so I can't necessarily vouch for it, but a quick search turned up win32print which looks promising.
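Putting that together, here's a hedged, Windows-only sketch that saves the screenshot and hands it to the default printer through the shell's "print" verb (os.startfile is Windows-specific, and the actual print behavior depends on the image handler registered for PNG files):
import os
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('http://www.webpage.com')
# Capture the current viewport to a PNG file.
browser.save_screenshot("screenshot.png")
# Windows only: ask the shell to print the file with its default handler.
os.startfile("screenshot.png", "print")
browser.quit()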
The key "trick" is that we can execute JavaScript in the selenium browser window using the "execute_script" method of the selenium webdriver, and if you execute the JavaScript command "window.print();" it will activate the browsers print function.
Now, getting it to work elegantly requires setting a few preferences to print silently, remove print progress reporting, etc. Here is a small but functional example that loads up and prints whatever website you put in the last line (where 'http://www.cnn.com/' is now):
import time
from selenium import webdriver
import os
class printing_browser(object):
    def __init__(self):
        self.profile = webdriver.FirefoxProfile()
        self.profile.set_preference("services.sync.prefs.sync.browser.download.manager.showWhenStarting", False)
        self.profile.set_preference("pdfjs.disabled", True)
        self.profile.set_preference("print.always_print_silent", True)
        self.profile.set_preference("print.show_print_progress", False)
        self.profile.set_preference("browser.download.show_plugins_in_list", False)
        self.driver = webdriver.Firefox(self.profile)
        time.sleep(5)

    def get_page_and_print(self, page):
        self.driver.get(page)
        time.sleep(5)
        self.driver.execute_script("window.print();")

if __name__ == "__main__":
    browser_that_prints = printing_browser()
    browser_that_prints.get_page_and_print('http://www.cnn.com/')
The key command you were probably missing was self.driver.execute_script("window.print();"), but one needs some of that setup in __init__ to make it run smoothly, so I thought I'd give a fuller example. I think the trick alone is in a comment above, so some credit should go there too.
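For completeness, newer Selenium (4+) also exposes the WebDriver print endpoint directly, which renders the page to a PDF without any profile tweaking; a hedged sketch, assuming a driver that supports it:
import base64
from selenium import webdriver
from selenium.webdriver.common.print_page_options import PrintOptions
driver = webdriver.Chrome()
driver.get('http://www.cnn.com/')
# print_page returns the rendered page as a base64-encoded PDF string.
pdf_data = driver.print_page(PrintOptions())
with open('page.pdf', 'wb') as f:
    f.write(base64.b64decode(pdf_data))
driver.quit()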