Python Selenium web scraping with the Google Translate extension

I am trying to scrape multiple web pages from around the world. So, I want to translate each website using the Google Translate extension and then scrape the page using Selenium.
I did some research and figured out how to add an extension while running Selenium:
1) download the Google Translate extension
2) create a .crx file
3) add the extension to Selenium
but I have no idea how to automatically execute the extension (by default, it does nothing):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
option = webdriver.ChromeOptions()
option.add_extension('./translate.crx')
driver = webdriver.Chrome(executable_path = "./chromedriver", chrome_options = option)
driver.get("naver.com")
WebDriverWait(driver, 3).until(EC.presence_of_element_located((By.TAG_NAME, "body")))
''' #### Here I want something like####
driver.execute_extension("translate this page")
'''
print(driver.find_element_by_tag_name("body").text)
driver.quit()
Also, I found that the extension doesn't translate the original HTML, so I might have to use a different method for crawling (maybe sending Ctrl-A, Ctrl-C, Ctrl-V instead of using find_element_by_tag_name("body")).
Could you give me any pointers on this?
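For the Ctrl-A/Ctrl-C idea, a minimal sketch might look like the following. It is only an illustration: it assumes the third-party pyperclip package for reading the clipboard, and that keyboard shortcuts actually reach the page (they may not in headless mode):
from selenium.webdriver.common.keys import Keys
import pyperclip  # assumed: pip install pyperclip

# Select all rendered (translated) text and copy it to the clipboard.
# Keys.CONTROL would be Keys.COMMAND on macOS.
body = driver.find_element(By.TAG_NAME, "body")
body.send_keys(Keys.CONTROL, "a")
body.send_keys(Keys.CONTROL, "c")
print(pyperclip.paste())  # the copied, translated text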

Related

Specific web page doesn't load (empty page, no HTML or CSS) with Selenium

I started working with Selenium. It works for every website I have tried except one (myvisit.com), which doesn't load the page: it opens Chrome but the page stays empty. I tried adding a number of delays, but it still doesn't load.
When I go to the website in regular Chrome (without Selenium), it loads everything.
Here is my simple code; I'm not sure how to continue from here:
import os
import random
import time
# selenium libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import ChromiumOptions
def delay():
    time.sleep(random.randint(2, 3))
driver = webdriver.Chrome(os.getcwd()+"\\webdriver\\chromedriver.exe")
driver.get("https://myvisit.com")
delay()
delay()
delay()
delay()
I also tried using ChromiumOptions with flags like --no-sandbox, but it didn't help:
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--disable-blink-features=AutomationControlled')
driver = webdriver.Chrome(os.getcwd()+"\\webdriver\\chromedriver.exe",options=options)
Simply add the --disable-blink-features=AutomationControlled argument, as shown above, to stop the site from detecting that the browser is automated.
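If the flag alone is not enough, a couple of experimental Chrome options are often combined with it to further reduce automation fingerprints; whether this particular site needs them is an assumption, not something verified here:
options = Options()
options.add_argument('--disable-blink-features=AutomationControlled')
# Hide the "Chrome is being controlled by automated software" infobar
# and disable Chrome's automation extension; both are standard Chrome options.
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
driver = webdriver.Chrome(os.getcwd()+"\\webdriver\\chromedriver.exe", options=options)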

No output while scraping Google search page

I am trying to scrape the blue highlighted portion (the hotel's review rating) from Google search results.
When I use inspect element, it shows: span class="YhemCb". I have tried various soup.find and soup.find_all commands, but everything I have tried has produced no output so far. What command should I use to scrape this part?
Google uses JavaScript to display most of its web elements, so using something like requests and BeautifulSoup is unfortunately not enough.
Instead, use Selenium! It essentially allows you to control a browser using code.
First, you will need to navigate to the Google page you wish to scrape:
google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)
Then, you have to wait until the review page loads in the browser.
This is done using WebDriverWait: you have to specify an element that needs to appear on the page. The [data-attrid="kc:/local:one line summary"] span CSS selector allows me to select the review info about the hotel.
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)
And finally, print the rating
print(review_element.get_attribute('innerHTML'))
Here's the full code in case you want to play around with it
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
# navigate to google
google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)
# wait until the page loads
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)
# print the rating
print(review_element.get_attribute('innerHTML'))
Note: Google is notoriously defensive against anyone trying to scrape it. On the first few attempts you might be successful, but eventually you will have to deal with a Google CAPTCHA.
To work around that, I would suggest using a dedicated search engine scraper; its quickstart guide can get you started!
Disclaimer: I work at Oxylabs.io

Web scraping with Selenium using Google Translate

This is the same question as at the top of the page: after adding the Google Translate extension to Selenium, how can the extension be triggered automatically so that the translated page can be scraped?
driver.execute_extension does not exist in Selenium's API. It seems, though, that you can open the extension with Selenium (see an example in C#) and then click the TRANSLATE THIS PAGE link.
Shortcut
Use the Google Translate API.
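A minimal sketch of that shortcut, assuming the google-cloud-translate package is installed and Google Cloud credentials are configured (the package, credential setup, and target language are assumptions, not part of the original answer):
# Sketch: translate the scraped text server-side instead of in the browser.
# Assumes `pip install google-cloud-translate` and that
# GOOGLE_APPLICATION_CREDENTIALS points at a valid service-account key.
from google.cloud import translate_v2 as translate

client = translate.Client()
page_text = driver.find_element(By.TAG_NAME, "body").text  # untranslated page text
result = client.translate(page_text, target_language="en")
print(result["translatedText"])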

Facing challenge with selenium webdriver on Win 10 system

I am trying to automate login on a website using Selenium (Windows 10, 64-bit OS).
Using the code below I am able to open a link, but once the webpage loads, the get command does not return control to my Python interpreter, so the next command,
browser.find_element_by_class_name('gb_g').click()
does not run.
I tried opening google.com, but the issue is the same. A different browser works, but my project URL only works with Internet Explorer.
I have tried both the 64-bit and 32-bit driver versions for Internet Explorer.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
link = 'https://www.google.com/'
browser = webdriver.Ie(executable_path=r'S:\work\Automation\Elog\IEDriverServer.exe')  # raw string for the Windows path
browser.get(link)
browser.find_element_by_class_name('gb_g').click()
You need to consider a few things:
If your use case is to use Internet Explorer, you need to use IEDriverServer.exe through the selenium.webdriver.ie.webdriver.WebDriver() class, as follows:
webdriver.Ie(executable_path=r'S:\path\to\IEDriverServer.exe')
To click() on the element with the text Gmail, you have to induce WebDriverWait for element_to_be_clickable(), and you can use either of the following Locator Strategies:
Using LINK_TEXT:
from selenium import webdriver
browser = webdriver.Ie(executable_path=r'S:\work\Automation\Elog\IEDriverServer.exe')
browser.get("https://www.google.com/")
WebDriverWait(browser,10).until(EC.element_to_be_clickable((By.LINK_TEXT, "Gmail"))).click()
Using CSS_SELECTOR:
from selenium import webdriver
browser = webdriver.Ie(executable_path=r'S:\work\Automation\Elog\IEDriverServer.exe')
browser.get("https://www.google.com/")
WebDriverWait(browser,10).until(EC.element_to_be_clickable((By.CSS_SELECTOR,"a.gb_g[href*='com/mail/?tab']"))).click()
Using XPATH:
from selenium import webdriver
browser = webdriver.Ie(executable_path=r'S:\work\Automation\Elog\IEDriverServer.exe')
browser.get("https://www.google.com/")
WebDriverWait(browser,10).until(EC.element_to_be_clickable((By.XPATH,"//a[text()='Gmail']"))).click()
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
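Putting those pieces together, a complete version of the LINK_TEXT variant might look like this (using the IEDriverServer path from the question):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Ie(executable_path=r'S:\work\Automation\Elog\IEDriverServer.exe')
browser.get("https://www.google.com/")
# Wait up to 10 seconds until the Gmail link is clickable, then click it.
WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, "Gmail"))).click()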
Update (from rahuldesai's comment): with the above-mentioned configuration and the 32-bit IEDriverServer binary, the issue somehow gets fixed.

Use Selenium to download multiple URLs as PDFs

I am trying to have Selenium download the URLs of a webpage as PDFs in Safari. So far I have been able to open the URL, but I can't get Safari to download it. All the solutions I found so far were either for another browser or didn't work. Ideally, I would like it to download all links on one page and then move on to the next page.
At first I thought that clicking each hyperlink and then downloading it was the way to go, but that would require switching windows each time, so I tried to find a way to download without clicking, and nothing worked.
I am quite new to programming, so I am sure I am missing something.
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pdfkit
browser = webdriver.Safari()
browser.get(a_base_url)
username_field = browser.find_element_by_name("tb_LoginName")
password_field = browser.find_element_by_name("tb_Password")
submit = browser.find_element_by_id("btn_Login")
username_field.send_keys(username)  # username and password strings are defined elsewhere
password_field.send_keys(password)
submit.click()
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="maincolumn"]/div/table/tbody/tr[2]/td[9]/a[2]'))).click()
browser.switch_to.window(browser.window_handles[0])
url = browser.current_url
I would go for the following approach:
1) Get the href attribute of the link you want to download via the WebElement.get_attribute() function
2) Use the urllib or requests library to retrieve the URL from step 1 without using the browser
3) Most probably you will also need to get the browser cookies via the WebDriver.get_cookies() function and add them to the Cookie header of your download request, as sketched below
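A rough sketch of those three steps (the link selector and the output file name are hypothetical placeholders):
import requests

link = browser.find_element(By.CSS_SELECTOR, "a.download-link")  # hypothetical selector
pdf_url = link.get_attribute("href")  # step 1: the link's href

session = requests.Session()
for cookie in browser.get_cookies():  # step 3: reuse the browser's authenticated cookies
    session.cookies.set(cookie["name"], cookie["value"])

response = session.get(pdf_url)  # step 2: fetch the file without the browser
with open("download.pdf", "wb") as f:
    f.write(response.content)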
