I want to scrape Bing's search results. The idea is to use Selenium to click 'Next' automatically and scrape the URLs of the search results on each page. I got this working with the Chrome browser on my Ubuntu machine:
from selenium import webdriver
import os
import time

class bingURL(object):
    def __init__(self):
        self.driver = webdriver.Chrome(os.path.expanduser('./chromedriver'))

    def get_urls(self, url):
        driver = self.driver
        driver.get(url)
        elems = driver.find_elements_by_xpath("//a[@href]")
        href = []
        for elem in elems:
            link = elem.get_attribute("href")
            try:
                if 'bing.com' not in link and 'http' in link and 'microsoft.com' not in link and 'smashboards.com' not in link:
                    href.append(link)
            except:
                pass
        return list(set(href))

    def search_urls(self, keyword, pagenum):
        driver = self.driver
        searchurl = self.lookup(keyword)  ### url of the first page of Bing search results
        driver.get(searchurl)
        results = self.get_urls(searchurl)
        for i in range(pagenum):
            driver.find_elements_by_class_name("sb_pagN")[0].click()  # click 'Next' on the Bing results page
            time.sleep(5)  # wait for the page to load
            current_url = driver.current_url
            # print(current_url)
            # print(self.get_urls(current_url))
            results[0:0] = self.get_urls(current_url)
        driver.quit()
        return results

    def lookup(self, query):
        return "https://www.bing.com/search?q=" + query

if __name__ == "__main__":
    g = bingURL()
    result = g.search_urls('Stackoverflow is good', 10)
It works perfectly: when I run the code, it launches a Chrome browser, and I can see it go to the next page automatically and collect URLs for 10 pages of search results.
However, my goal is to run this code on AWS. There, the original code fails with the error 'Chrome failed to start'. After googling, it seems I need to use a headless browser such as PhantomJS on AWS. So I installed PhantomJS and changed def __init__(self): to:
def __init__(self):
    self.driver = webdriver.PhantomJS()
However, it can no longer click 'Next', and it cannot scrape URLs with the old code. The error message is:
File ".../SEARCH_BING_MODULE.py", line 70, in search_urls
driver.find_elements_by_class_name("sb_pagN")[0].click()
IndexError: list index out of range
It looks like changing the browser completely changes the rules. How should I modify the original code to make it work again? Or, more generally, how can I scrape Bing search results' URLs using Selenium + PhantomJS?
Thanks for your help!
Yes, you can perform all of these operations with a headless browser. Don't use HtmlUnit, as it has many configuration issues.
PhantomJS was another approach to a headless browser, but PhantomJS is buggy these days because it is poorly maintained.
You can use ChromeDriver itself for headless jobs.
You just need to pass one option to ChromeDriver, as below:
chromeOptions.addArguments("--headless");
The full code (in Java) will look like this:
System.setProperty("webdriver.chrome.driver","D:\\Workspace\\JmeterWebdriverProject\\src\\lib\\chromedriver.exe");
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.addArguments("--headless");
chromeOptions.addArguments("--start-maximized");
WebDriver driver = new ChromeDriver(chromeOptions);
driver.get("https://www.google.co.in/");
Hope it will help you :)
Related
I am trying to scrape this URL, but the URL I enter in driver.get() gets changed when the program runs and the Chrome page is opened.
What can be causing it to change?
I want to open this link and get specific things from it, but the URL changes, and the code raises an error because the class does not exist on the changed URL.
Here's my code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.get("https://www.booking.com/hotel/pk/one-bahawalpur.html?aid=378266&label=bdot-Os1%2AaFx2GVFdW3rxGd0MYQS541115605091%3Apl%3Ata%3Ap1%3Ap22%2C563%2C000%3Aac%3Aap%3Aneg%3Afi%3Atikwd-334108349%3Alp1011080%3Ali%3Adec%3Adm%3Appccp%3DUmFuZG9tSVYkc2RlIyh9YYriJK-Ikd_dLBPOo0BdMww&sid=0a2d2e37ba101e6b9547da95c4a30c48&all_sr_blocks=41645002_248999805_0_1_0;checkin=2022-11-10;checkout=2022-11-11;dest_id=-2755460;dest_type=city;dist=0;group_adults=2;group_children=0;hapos=1;highlighted_blocks=41645002_248999805_0_1_0;hpos=1;matching_block_id=41645002_248999805_0_1_0;no_rooms=1;req_adults=2;req_children=0;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=41645002_248999805_0_1_0__1120000;srepoch=1668070223;srpvid=e9da3e27383900b4;type=total;ucfs=1&#hotelTmpl")
print(driver.find_element(by=By.CLASS_NAME, value="d2fee87262"))
The URL after the Chrome page opens is as follows:
https://www.booking.com/searchresults.html?aid=378266&label=bdot-Os1%2AaFx2GVFdW3rxGd0MYQS541115605091%3Apl%3Ata%3Ap1%3Ap22%2C563%2C000%3Aac%3Aap%3Aneg%3Afi%3Atikwd-334108349%3Alp1011080%3Ali%3Adec%3Adm%3Appccp%3DUmFuZG9tSVYkc2RlIyh9YYriJK-Ikd_dLBPOo0BdMww&sid=63b32ef8c1d53ae0613d71baf62c3e56&checkin=2022-11-10&checkout=2022-11-11&city=-2755460&group_adults=2&group_children=0&highlighted_hotels=416450&hlrd=with_av&keep_landing=1&no_rooms=1&redirected=1&source=hotel&srpvid=e9da3e27383900b4&room1=A,A,;#hotelTmpl
The URL is changed by the site; you cannot change its behavior. It redirects users with new sessions to the search results instead of the hotel's page.
As a workaround, I can suggest clicking the hotel name in the search results:

driver.find_element(By.XPATH, '//div[text()="Hotel One Bahawalpur"]').click()

Put this line between the line that opens the page and the line that finds the element.
I also recommend adding the line driver.implicitly_wait(10) after the line driver = webdriver.Chrome(service=s). This will improve your scraper's stability, as it makes Selenium wait until an element appears on the page.
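Putting the pieces together, the whole flow would look roughly like this (a sketch; hotel_url stands for the long booking.com link from the question):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.implicitly_wait(10)  # wait up to 10 s for elements to appear
driver.get(hotel_url)       # the site may redirect this to the search page
driver.find_element(By.XPATH, '//div[text()="Hotel One Bahawalpur"]').click()  # click back into the hotel page
print(driver.find_element(By.CLASS_NAME, "d2fee87262").text)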
I am looking at a <div> tag that has the class text-msg-container along with the attributes _ngcontent-vex-c62 and data-e2e-text-message-content.
When I write the code

message = soup.find("div", {"class": "text-msg-container"})

it gives me None. What are _ngcontent-vex-c62 and data-e2e-text-message-content? Do I need to include them too? How should I write the query to get the div tag?
You can't, because the div isn't there when you send a GET request for the page's code.
That page is built with the Angular framework, which produces an SPA (Single-Page Application). That means you can't scrape the data with a plain GET request: the data isn't in the initial response.
The data is generated by JavaScript code, which needs to run first to add it to the page.
So you need another approach, one that lets the JavaScript run before you try to get the data you want.
If you want to find the class text-msg-container, try Selenium. It will find any locator easily.
import unittest
from selenium import webdriver

class PythonSearch(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()

    def test_search(self):
        driver = self.driver
        driver.get("http://www.yoursite.com")
        elem = driver.find_element_by_css_selector(".text-msg-container")

    def tearDown(self):
        self.driver.close()

if __name__ == "__main__":
    unittest.main()
Use driver = webdriver.Chrome('/path/to/chromedriver') if you are testing with Chrome. See https://chromedriver.chromium.org/getting-started for more info.
Getting started with Selenium: https://selenium-python.readthedocs.io/getting-started.html#simple-usage
Try this, please:

message = soup.find("div", class_="text-msg-container")

(BeautifulSoup uses the keyword class_ because class is a reserved word in Python.) I hope that works.
from selenium import webdriver

path = "C:/chromedriver.exe"  # path to the chromedriver you downloaded; change this if yours lives elsewhere
driver = webdriver.Chrome(path)  # change this if you are not using Chrome
driver.get("website link")
out = driver.find_element_by_class_name("text-msg-container")
print(out.text)
I've written a script in Python with Selenium to parse a table from a target page, which can be reached by following the steps I've described below for clarity. The script does reach the destination, but when it scrapes data from the table it throws an "Unable to locate element" error in the console. I tried an online XPath tester to see whether my XPath is wrong, but the XPath I've used for td_data looks right. I suppose what I'm missing here is beyond my knowledge. I hope somebody can take a look and provide a workaround.
Btw, the site link is given in my script.
Link to the HTML contents of the table: https://www.dropbox.com/s/kaom5qzk78xndqn/Partial%20Html%20content%20for%20the%20table.txt?dl=0
Steps to reach the target page, which my script already handles:
Selecting "I've read and understand above"
Putting the keyword "pump" in the input box located right below "Select medical devices"
Selecting the checkbox "Devices found for 'pump'"
Finally, pressing the search button
The script I've tried with so far:

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('http://apps.tga.gov.au/Prod/devices/daen-entry.aspx')
driver.find_element_by_id('disclaimer-accept').click()
time.sleep(5)
driver.find_element_by_id('medicine-name').send_keys('pump')
time.sleep(8)
driver.find_element_by_id('medicines-header-text').click()
driver.find_element_by_id('submit-button').click()
time.sleep(7)
for item in driver.find_elements_by_xpath('//div[@class="table-responsive"]'):
    for tr_data in item.find_elements_by_xpath('.//tr'):
        td_data = tr_data.find_element_by_xpath('.//span[@class="hovertext"]//a')
        print(td_data.text)
driver.close()
Why don't you just do this:
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('http://apps.tga.gov.au/Prod/devices/daen-entry.aspx')
driver.find_element_by_id('disclaimer-accept').click()
time.sleep(5)
driver.find_element_by_id('medicine-name').send_keys('pump')
time.sleep(8)
driver.find_element_by_id('medicines-header-text').click()
driver.find_element_by_id('submit-button').click()
time.sleep(7)
for item in driver.find_elements_by_xpath(
        '//table[@id]/tbody/tr/td[@class]/span[@class]/a[@id]'):
    print(item.text)
driver.close()
Output:
27233
27283
27288
27289
27390
27413
27441
27520
25445
27816
27866
27970
28033
28238
26999
28264
28407
28448
28437
28509
28524
28553
28647
28677
28646
Maybe you want to think about saving the page with driver.page_source, pulling out the table, and saving it as an HTML file. Then use pandas' read_html to load the table into a DataFrame.
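For that last idea, a minimal sketch (assuming the results table is the first <table> in the rendered page):

import pandas as pd

# after the Selenium steps above have put the table on screen
html = driver.page_source    # the fully rendered HTML, including the table
tables = pd.read_html(html)  # returns one DataFrame per <table> found
df = tables[0]               # assumption: the results table is the first one
print(df.head())

This sidesteps the per-cell XPath work entirely, at the cost of depending on pandas (and an HTML parser such as lxml) being installed.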
I'm a newbie with Selenium. I started to learn Selenium from a book, and I'm struggling with its unclear behavior. For educational purposes I use this site:
http://magento-demo.lexiconn.com/ - I'm trying to find the search button by its class name (which is class='button search button') or by its XPath:

search_button = self.driver.find_element_by_xpath('/html/body/div/div[2]/header/div/div[4]/form/div[1]/button')

or

search_button = self.driver.find_element_by_class_name('button')

but each time Selenium is unable to find it. Please help me understand the reason for this behavior. Thank you.
I used Selenium IDE, and it shows me the XPath //button[@type='submit'].
When I tried to find the element by that XPath, I got the same error, which is strange. Please advise.
My code is:

import unittest
from selenium import webdriver

class HomePageTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # create a new Firefox session
        cls.driver = webdriver.Firefox()
        cls.driver.implicitly_wait(30)
        cls.driver.maximize_window()
        # navigate to the application home page
        cls.driver.get('http://magento-demo.lexiconn.com/')

    def test_search__text_field_max_length(self):
        # get the search text box
        search_field = self.driver.find_element_by_id("search")
        # check that the maxlength attribute is set to 128
        self.assertEqual("128", search_field.get_attribute("maxlength"))

    def test_search_button_enabled(self):
        # get the Search button
        search_button = self.driver.find_element_by_class_name('button')
        # check that the Search button is enabled
        self.assertTrue(search_button.is_enabled())

    @classmethod
    def tearDownClass(cls):
        # close the browser window once all tests have run
        cls.driver.quit()

if __name__ == '__main__':
    unittest.main(verbosity=2)
Try this:

search_button = self.driver.find_element_by_xpath('//button[@class="button search-button"]')
Try downloading the Selenium IDE plugin, install it, and start recording. Click the button you want and see how its target is recorded in the IDE. Programmatically, Selenium will accept the same XPaths and other selectors as the IDE. After an action has been recorded in the IDE, there is a pull-down on the target field that lets you see all the different ways you can select that element, i.e. XPath vs. by class, etc.
http://www.seleniumhq.org/projects/ide/
You might try:
css=button.button.search-button
//button[#type='submit']
//form[#id='search_mini_form']/div/button
I think the issue is that your locator isn't specific enough: there is more than one button on the page, and more than one element with class button. This CSS selector is working for me:

self.driver.find_element_by_css_selector("button[title='Search']")
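If timing rather than specificity turns out to be the problem, the same selector can be combined with an explicit wait (a sketch, not from the original answer):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 s for the button to become clickable before using it
search_button = WebDriverWait(self.driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "button[title='Search']"))
)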
Try this approach using an XPath locator.
Explanation: use the title attribute of the <button> tag.

self.driver.find_element_by_xpath("//button[@title='Search']")

OR
Explanation: use the title and type attributes of the <button> tag.

self.driver.find_element_by_xpath("//button[@title='Search'][@type='submit']")
I am trying to get the video URL from links on this page. The video link can be seen at https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html (open it in Chrome).
For that I wrote the following Chrome WebDriver code:
import os
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from pyvirtualdisplay import Display

chromedriver = '/usr/local/bin/chromedriver'
os.environ['webdriver.chrome.driver'] = chromedriver
display = Display(visible=0, size=(800, 600))
display.start()
driver = webdriver.Chrome(chromedriver)
driver.get('https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html')
try:
    # this snippet is excerpted from a method, hence `self`, `item`, and `return`
    element = WebDriverWait(driver, 20).until(lambda driver: driver.find_elements_by_class_name('yvp-main'))
    self.yahoo_video_trend = []
    for s in driver.find_elements_by_class_name('yvp-main'):
        print "Processing link - ", item['link']
        trend = item
        print item['description']
        trend['video_link'] = s.find_element_by_tag_name('video').get_attribute('src')
        print
        print s.find_element_by_tag_name('video').get_attribute('src')
        self.yahoo_video_trend.append(trend)
except:
    return
This works fine on my local system, but when I run it on my Azure server it does not give any result for s.find_element_by_tag_name('video').get_attribute('src').
I have installed Chrome on my Azure server.
Update:
Please note, I have already tried requests and BeautifulSoup, but since Yahoo loads the HTML content dynamically from JSON, I could not get the video URL with them.
And yes, the Azure server is a plain Linux system with command-line access, not a desktop environment.
I tried to reproduce your issue using your code. However, I found there was no tag named video on that page ('https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html'), testing with both IE and Chrome.
I used the developer tools to check the HTML code.
It seems that this page uses a Flash player to play the video, not an HTML5 video control.
For this reason, I suggest you check whether your code is looking for the right tag name.
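A quick way to confirm this from the script itself is to check for the element's presence before reading its attribute (a sketch in the question's Python 2 style; it only reports what is on the page):

videos = driver.find_elements_by_tag_name('video')  # empty list if there is no HTML5 video element
if videos:
    print videos[0].get_attribute('src')
else:
    print 'No <video> tag found - probably a Flash-based player'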
Any concerns, please feel free to let me know.
We tried to reproduce the error on our side. I was not able to get the Chrome driver to work, but I did try the Firefox driver and it worked fine: it was able to load the page and get the link via the URL.
Can you change your code to print the exception and send it to us, so we can see where the script is failing? Change your code from:

except:
    return

to:

except Exception, e:
    print str(e)

Send us the exception, so we can take a look.