How to Pull Links from Google Search using Selenium, Python

I am trying to ask Google to pull up a query's relevant search links, in this case restricted to Wikipedia, and then parse the URLs of the first three via Selenium. So far I have only been able to do the first part, the Googling. Here's my code:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC # available since 2.26.0
query = raw_input("What do you wish to search on Wikipedia?\n")
query = " " + query
# Create a new instance of the Firefox driver
driver = webdriver.Firefox()
# go to the google home page
driver.get("https://www.google.com/search?q=site%3Awikipedia.com&ie=utf-8&oe=utf-8")
# the page is ajaxy so the title is originally this:
print driver.title
# find the element that's name attribute is q (the google search box)
inputElement = driver.find_element_by_name("q")
# type in the search
inputElement.send_keys(query)
# submit the form (although google automatically searches now without submitting)
inputElement.submit()
try:
    # we have to wait for the page to refresh, the last thing that seems to be updated is the title
    # You should see "cheese! - Google Search"
    print driver.title
    driver.find_element_by_xpath("//h3[contains(text(),'Wikipedia')]").click()
finally:
    driver.quit()
I am trying to use the example from Selenium's documentation, so please excuse the comments and, at times, unnecessary code.
The line of code I'm having trouble with is:
driver.find_element_by_xpath("//h3[contains(text(),'Wikipedia')]").click()
What I'm attempting to do is obtain the relevant Wikipedia link, or, more specifically, the link that the H3 'r' path directs to.
Here's a picture of the Google page I'm describing.
In this instance, I wish to pull the link http://en.wikipedia.com/wiki/salary
Sorry for the wall of text; I'm trying to be as specific as possible. Anyway, thank you in advance for the assistance.
Best Regards!

The problem is that this XPath is not correct: it is an a element that has "Wikipedia" in its text, not the h3 element. Fix it:
driver.find_element_by_xpath("//a[contains(text(), 'Wikipedia')]").click()
You can go even further and simplify it using:
driver.find_element_by_partial_link_text("Wikipedia").click()
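Since the original goal was to pull the URLs of the first three results rather than click one, here is a minimal follow-up sketch (run it before driver.quit()). It assumes Google renders each organic result as an a element wrapping an h3 heading; that markup changes often, so treat the XPath as a starting point:
# Collect the result links (assumption: each organic result is an <a>
# containing an <h3>) and print the first three URLs.
links = driver.find_elements_by_xpath("//a[h3]")
for link in links[:3]:
    print(link.get_attribute("href"))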

Related

No output while scraping Google search page

I am trying to scrape the blue highlighted portion of a Google search result, as shown below:
When I use inspect element, it shows: span class="YhemCb". I have tried various soup.find and soup.find_all commands, but everything I have tried has produced no output so far. What command should I use to scrape this part?
Google uses JavaScript to render most of its web elements, so using something like requests with BeautifulSoup is unfortunately not enough.
Instead, use Selenium! It essentially allows you to control a browser using code.
First, you will need to navigate to the Google page you wish to scrape:
google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)
Then, you have to wait until the review section loads in the browser.
This is done using WebDriverWait: you specify an element that needs to appear on the page. The [data-attrid="kc:/local:one line summary"] span CSS selector matches the review info about the hotel.
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)
And finally, print the rating:
print(review_element.get_attribute('innerHTML'))
Here's the full code in case you want to play around with it:
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
# navigate to google
google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)
# wait until the page loads
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)
# print the rating
print(review_element.get_attribute('innerHTML'))
Note: Google is notoriously defensive against anyone trying to scrape it. On the first few attempts you might be successful, but eventually you will have to deal with Google's CAPTCHA.
To work around that, I would suggest using a search engine scraper; something like the quickstart guide can get you started!
Disclaimer: I work at Oxylabs.io

How do I use selenium ChromeDriver to scroll the sidebar on Google maps to load more results?

I’ve run into a problem trying to use Selenium ChromeDriver to scroll down the sidebar of a Google Maps results page. I am trying to get to the 6th result down, but that result does not fully load until you scroll down. Using the find_element_by_xpath method, I can successfully access results 1-5 and click into them individually, but when I try to use the actions.move_to_element(link).perform() method to scroll to the 6th element, it does not work and throws an error.
The error that I get is:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element:
However, I know this element exists, because when I manually scroll and more results are loaded, the XPath works correctly. What am I doing wrong? I’ve spent many hours trying to solve this and haven’t been able to with the available content out there. I appreciate any help or insights you can offer, thank you!
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as soup
import time
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.google.com/maps")
time.sleep(7)
page = soup(driver.page_source, 'html.parser')
#find the searchbar, enter search, and hit return
search = driver.find_element_by_id('searchboxinput')
search.send_keys("dentists in Austin Texas")
search.send_keys(Keys.RETURN)
driver.maximize_window()
time.sleep(7)
#I want to get the 6th result down but it requires a sidebar scroll to load
link = driver.find_element_by_xpath("//*[@id='pane']/div/div[1]/div/div/div[4]/div[1]/div[13]/div/a")
actions = ActionChains(driver)
actions.move_to_element(link).perform()
link.click()
time.sleep(5)
driver.back()
I found a solution that works: target the element via XPath through Selenium's JavaScript interface, then execute two commands in a single instruction (targeting and scrolling):
driver.execute_script("var el = document.evaluate('/html/body/jsl/div[3]/div[10]/div[8]/div/div[1]/div/div/div[4]/div[1]', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue; el.scroll(0, 5000);")
This is the only solution that worked for me.
The search results in the Google map are located with the //div[contains(@aria-label,'dentists in Austin Texas')]//div[contains(@jsaction,'mouseover')] XPath.
So, to select the 6th element there, you can do the following:
from selenium.webdriver.common.action_chains import ActionChains
results = driver.find_elements_by_xpath('//div[contains(@aria-label,"dentists in Austin Texas")]//div[contains(@jsaction,"mouseover")]')
ActionChains(driver).move_to_element(results[5]).click().perform()  # index 5 is the 6th result
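If the 6th card has not been rendered at all yet, a hedged variation is to keep scrolling the last loaded card into view until enough results exist. This is only a sketch: it reuses the XPath above, and both selectors are assumptions about Google's current DOM:
import time

results_xpath = ('//div[contains(@aria-label, "dentists in Austin Texas")]'
                 '//div[contains(@jsaction, "mouseover")]')

results = driver.find_elements_by_xpath(results_xpath)
for _ in range(10):  # give up after 10 scroll attempts
    if len(results) >= 6:
        break
    # scrolling the last loaded card into view nudges the sidebar to fetch more
    driver.execute_script("arguments[0].scrollIntoView();", results[-1])
    time.sleep(2)
    results = driver.find_elements_by_xpath(results_xpath)

results[5].click()  # index 5 is the 6th result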
I was just implementing scrolling on the Google Maps sidebar, and it's working on my side. Check this code:
# selecting scroll body
driver.find_element_by_xpath('/html/body/div[3]/div[9]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div[1]').click()
#start scrolling your sidebar
html = driver.find_element_by_xpath('/html/body/div[3]/div[9]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div[1]')
html.send_keys(Keys.END)
also add the "KEYS" library
from selenium.webdriver.common.keys import Keys
I hope it would help you.
by the way I have implemented scrapping of google map with its available data and used above code to scroll. check if you have any problem, let me know then

Using Python to Scrape a JS Form

I'm currently working on a research project in which we are trying to collect saved image files from Brazil's Hemeroteca database. I've done web scraping on PHP pages before using C/C++ with HTML forms, but as this is a shared script, I need to switch to Python so that everyone in the group can use this tool.
The page which I'm trying to scrape is: http://bndigital.bn.gov.br/hemeroteca-digital/
There are three form fields which populate, the first being the newspaper/journal. Upon selecting this, the available time periods populate, and the final field is the search term. I've inspected the HTML page and the three IDs of these fields are, respectively: 'PeriodicoCmb1_Input', 'PeriodoCmb1_Input', and 'PesquisaTxt1'.
Some google searches on this topic led me to the Selenium package, and I've put together this sample code to attempt to read the page:
import webbrowser
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
print("Begin...")
browser = webdriver.Chrome()
url = "http://bndigital.bn.gov.br/hemeroteca-digital/"
browser.get(url)
print("Waiting to load page... (Delay 3 seconds)")
time.sleep(3)
print("Searching for elements")
journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")
print(journal)
print("Set fields, delay 3 seconds between input")
search_journal = "Relatorios dos Presidentes dos Estados Brasileiros (BA)"
search_timeRange = "1890 - 1899"
search_text = "Milho"
journal.send_keys(search_journal)
time.sleep(3)
timeRange.send_keys(search_timeRange)
time.sleep(3)
searchTerm.send_keys(search_text)
print("Perform search")
submitButton = browser.find_element_by_id("PesquisarBtn1_input")
submitButton.click()
The script runs to the print(journal) statement, where an error is thrown saying the element cannot be found.
Can anyone take a quick look at the page in question and check that I've got the general premise of this script right, or point me towards some examples to get me going on this problem?
Thanks!
The DOM elements you are trying to find are located in an iframe, so before using the find_element_by_id API you should switch to the iframe context.
Here is the code to switch to the iframe context:
# add your code
frame_ref = browser.find_elements_by_tag_name("iframe")[0]
browser.switch_to.frame(frame_ref)
journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")
# add your code
Here is a link describing how to switch to an iframe context.
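One detail worth adding: once you are done with the fields inside the frame, switch back to the top-level document before locating anything outside it, for example a submit button that lives outside the iframe:
# Return to the top-level document; element lookups after this call
# search the main page again instead of the iframe.
browser.switch_to.default_content()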

Unable to select an element via XPath in selenium Webdriver python

I am trying to run a script in Selenium WebDriver with Python, where I am trying to click on the search field, but it always throws the exception "An element could not be located on the page using the given search parameters."
Here is the script:
from selenium import webdriver
from selenium.webdriver.common.by import By
class Exercise:
    def safari(self):
        driver = webdriver.Safari()
        driver.maximize_window()
        url = "https://www.airbnb.com"
        driver.implicitly_wait(15)
        Title = driver.title
        driver.get(url)
        CurrentURL = driver.current_url
        print("Current URL is " + CurrentURL)
        SearchButton = driver.find_element(By.XPATH, "//*[@id='GeocompleteController-via-SearchBarV2-SearchBarV2']")
        SearchButton.click()

note = Exercise()
note.safari()
Please tell me where I am wrong?
There appear to be two matching elements. The one that matches the search bar is actually the second one, so edit your XPath as follows:
SearchButton = driver.find_element(By.XPATH, "(//*[@id='GeocompleteController-via-SearchBarV2-SearchBarV2'])[2]")
Or simply:
SearchButton = driver.find_element_by_xpath("(//*[@id='GeocompleteController-via-SearchBarV2-SearchBarV2'])[2]")
You can paste your XPath into Chrome's inspector tool by loading the same website in Google Chrome and hitting F12 (or right-clicking anywhere and choosing "Inspect"). This shows you the matching elements. If you step through to 2 of 2, it highlights the search bar, so we want the second match. XPath indices start at 1, unlike most languages (which start at 0), so to get the second match, wrap the entire original XPath in parentheses and append [2].
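As a side note on that indexing rule: (//x)[2] and //x[2] are not the same expression, and the parentheses matter. A small self-contained illustration with lxml (the class name is made up for the demo):
from lxml import etree

doc = etree.fromstring(
    "<root>"
    "<div><p class='x'>a</p></div>"
    "<div><p class='x'>b</p><p class='x'>c</p></div>"
    "</root>"
)
# With parentheses: the 2nd match in the entire document
print(doc.xpath("(//p[@class='x'])[2]/text()"))  # ['b']
# Without them: any p.x that is the 2nd such child of its own parent
print(doc.xpath("//p[@class='x'][2]/text()"))    # ['c']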

Submitting value in a Google powered form while scraping

I am trying to scrape an online food-ordering website using Mechanize & BS4. The problem I'm facing is that the website has a form, powered by Google, that takes a location as input. When I try filling it using this method:
from bs4 import BeautifulSoup as bs
import requests, lxml, mechanize
url = raw_input("Enter URL: ")
browser = mechanize.Browser()
browser.open(url)
# 'placeSelectionForm' is the name of the input-field
browser.select_form(name='placeSelectionForm')
control1 = browser.form.controls[0]
control1._value = 'Koramangala'
browser.submit()
soup = bs(browser.response().read(), "lxml")
print soup.prettify()
The script works fine for a normal Django form that I have made. But the problem here is that the Google-powered form uses an autocomplete API:
So when I type the first letters of a location, autocomplete suggestions appear, and as soon as I select one the form auto-submits and I'm taken to a new URL.
Now, the problem with the new page is that no matter which option I choose in the form, the URL remains the same; only the values that come with the response vary with the option chosen on the previous page.
How can I fill this form (powered by the Google Maps API) using tools like Mechanize or BS4 or anything similar?
This is quite a JavaScript-heavy website, which you may find difficult to automate with mechanize. Here is how you can make a search, choose one of the suggestions, and wait for the results to load with selenium:
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.maximize_window()
driver.get('http://www.swiggy.com/bangalore')
# wait for input to appear and make a search
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "pac-input"))).send_keys("Koramangala")
# wait for suggestions to appear
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.pac-container div.pac-item")))
# choose the first suggestion
suggestions = driver.find_elements_by_css_selector("div.pac-container div.pac-item")
suggestions[0].click()
# wait for results to load
wait.until(EC.visibility_of_element_located((By.ID, "restaurants")))
# TODO: extract results
I've added comments to make things clear. Let me know if you want me to expand on any part of the code.
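For the "# TODO: extract results" part, a minimal sketch of one way to finish. The .restaurant-name class below is a hypothetical selector, not taken from the real markup; the only confirmed hook is the restaurants id that the wait above relies on, so inspect the page and adjust:
# Hypothetical selector: "restaurant-name" is an assumed class name,
# replace it with whatever the actual result cards use.
restaurants = driver.find_elements_by_css_selector("#restaurants .restaurant-name")
for restaurant in restaurants:
    print(restaurant.text)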
