I'm trying to parse webpage with python selenium webdriver.
I've found something starnge in html content. It is different when I use robot browser instead of when I'm getting same page with human browser.
For example just part of webpage that I get:
<p>
<label>
<span>
Some text 1
<br>
<i>header 1</i>
Some text 2
<br>
<i>header 2</i>
Some text 3
<br>
<i>header 3</i>
Some text 4
</span>
</label>
</>
In human browser I get it as is, but in robot browser I get it without one section, I missed header 2 and Some text 3.
I was trying to analize request headers in human browser and robot browser to find difference and I've found one. In human request headers there is not cookie. But in robot browser in request headers I can see this
cookie: _ga=GA1.2.153230535.1622710383; _gid=GA1.2.1454651548.1622710383; __gads=ID=fb2caae82787b530-2265cda036c80043:T=1622710436:RT=1622710436:S=ALNI_MZ0bzRzYOmpiZrGnBzbdMQl7UHCRw
I don't understand why it is so. Can anyone explain? How can server distinguish my robot browser and send different content instead of human browser?
I've solved my problem by imitating the mouse movement. So now before click on element, I use selenium webdriver.ActionChains for imitate mouse movement.
search_input = browser.driver.find_element_by_xpath('//input[#class="search_input"]')
sleep(0.5)
browser.action_chain.move_to_element(search_input)
sleep(0.5)
browser.action_chain.click(search_input)
sleep(0.5)
search_input.clear()
sleep(0.5)
Now all content that I got like in human browser
I am creating a website scraper and have so far successfully automated logging on and navigating the website with the use of requests alone. But the next tasks the scraper needs to complete is to click a button to open a form within the website. this is what the button looks like as the web element (I've turned some text into gobbledygook):
<button class="btn btn-link">search criteria: 'hd eijfwor ok' (jeiofij ji wdpojq), from: 2020-10-01, to: 2020-12-31</button>
what action does my code have to perform to click this button? Preferably this would only require requests , but im open to all other options.
cheers.
I use Tor Browser with Selenium to automate a click on a button.
File script.py
from tbselenium.tbdriver import TorBrowserDriver
with TorBrowserDriver("/home/user/Selenium/tor-browser_en-US/") as driver:
driver.get('https://www.example.com/form.html')
How do I manage to perform a click on this button (excerpt from the HTML file)?
<form method="post" id="IdA" action="https://example.com/action.php"><input id='valid' name='valid' value='012.23945765955' type="hidden"><button class="g-recaptcha" data-sitekey="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" data-callback="onSubmit" id="IdA" style="background:url(https://www.example.com/button.gif);width:190px;height:58px;border:none;cursor:pointer;display:none;" type="submit"></button></form>
I tried this, but it did not work:
driver.findElement(By.Id("IdA")).click()
I'm assuming you are trying to bypass a CAPTCHA.
You can do this one of two ways. You can click the button by using a selector. For example, an XPath selector for a button with class "g-recpatcha". You can also just execute JavaScript code on the page to call the onSubmit() function.
So two options are:
driver.find_element_by_xpath("//button[#class='g-recaptcha']").click()
driver.execute_script("onSubmit("" + captchaToken + "")")
See the reCAPTCHA callback on 2captcha API, Solving Captchas.
I am trying to upload a file using python automation.
While I try to execute the code below python selenium throws an error.
Even I tried waiting for 10 seconds to avoid synchronisation issues.
driver.execute_script('window.open("https://ocr.space/" , "new window")')
Imagepath = r"C:\User\Desktop\banner.png"
field=driver.find_element_by_xpath('//input[#type="file"]')
field.send_keys(Imagepath)
NoSuchElementException: Message: no such element: Unable to locate
element: {"method":"xpath","selector":"//input[#type="file"]"}
Website url:
https://ocr.space/
HTML snippet:
<div class="span8">
<input type="file" id="imageFile" class="form-control choose valid">
</div>
Changing the code to launch the url with get seems to solve the issue.
from selenium import webdriver
driver = webdriver.Chrome("./chromedriver")
driver.get("https://ocr.space/")
image = r"C:\Users\Thanthu Nair\Desktop\soc360.png"
field=driver.find_element_by_xpath('//input[#type="file"]')
field.send_keys(image)
Also make sure the path provided C:\User\Desktop\banner.png is correct, otherwise you'll get another exception. It is just my assumption that this path might be wrong because usually Desktop folder is inside folder with user's name which is inside the User folder. In this case you've Desktop folder is inside User folder according to the path you've give.
To solve your problem, simply replace new window with _self in the below line of your code :
driver.execute_script('window.open("https://ocr.space/" , "_self")')
Your code is working fine but the reason for an error is, after running your code it launches browser with two tabs nothing but windows and the page will be launched in the second window so you need to switch to that window before uploading an image.
You can use window handles for switching to that window. Below is the code in Java, you can try doing same using Python :
// Using JavaScriptExecutor to launch the browser
JavascriptExecutor jse = (JavascriptExecutor) driver;
jse.executeScript("window.open(\"https://ocr.space/\" , \"new window\")");
// Fetching window handles and switching to the last window
Set<String> handles = driver.getWindowHandles();
for(String handle : handles) {
driver.switchTo().window(handle);
}
// Printing window title
System.out.println(driver.getTitle());
// Uploading an image
WebElement field = driver.findElement(By.xpath("//input[#type='file']"));
String imagePath = "some image";
field.sendKeys(imagePath);
If you use window.open() to launch an URL then it will do two things, first it will launch browser with default window then it will open URL in new tab even if you don't provide new window argument in your JavaScript function. You need to switch to a particular window to perform any operations on it if you choose this way.
To avoid an above problem, simply you can use driver.get(URL) or driver.navigate().to(URL) which launches the browser and navigates to a particular URL in the same launched browser window.
If you want to use JavaScriptExecutor only without doing switching, you can pass _self as a second argument to the JavaScript function like below instead of new window which avoids switching and launches an URL in the same window :
JavascriptExecutor jse = (JavascriptExecutor) driver;
jse.executeScript("window.open(\"https://ocr.space/\" , \"_self\")");
System.out.println(driver.getTitle());
WebElement field = driver.findElement(By.xpath("//input[#type='file']"));
String imagePath = "some image";
field.sendKeys(imagePath);
I hope it helps...
Generally, when the file upload related <input> tag contains the attribute type as file you can invoke send_keys() to populate the relevant text field with a character sequence. However, in your usecase the <input> tag though having type="file" but the class attributes are form-control choose which is as follows:
<input type="file" id="imageFile" class="form-control choose">
So, you may not able to able to send a character sequence invoking send_keys().
In these cases you need to use Auto IT based solutions. You can find a couple of relevant discussion in:
How to upload a file in Selenium with no text box
I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page of results I get back. Below is the code that gets me to the results I want:
from selenium import webdriver
URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
CHROME_WEBDRIVER_LOCATION = '/home/max/Downloads/chromedriver' # update this for your machine
# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome(executable_path=CHROME_WEBDRIVER_LOCATION)
driver.get(URL)
time.sleep(5)
# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)
blast_button = driver.find_element_by_id("b1")
blast_button.click()
time.sleep(60)
At that point I have a page that I can manually click "save as," and get a local file (with a corresponding folder of image/js assets) that lets me view the whole returned page locally (minus content which is generated dynamically from scrolling down the page, which is fine). I assumed there would be a simple way to mimic this 'save as' function in python/selenium but haven't found one. The code to save the page below just saves html, and does not leave me with a local file that looks like it does in the web browser, with images, etc.
content = driver.page_source
with open('webpage.html', 'w') as f:
f.write(content)
I've also found this question/answer on SO, but the accepted answer just brings up the 'save as' box, and does not provide a way to click it (as two commenters point out)
Is there a simple way to 'save [full page] as' using python? Ideally I'd prefer an answer using selenium since selenium makes the crawling part so straightforward, but I'm open to using another library if there's a better tool for this job. Or maybe I just need to specify all of the images/tables I want to download in code, and there is no shortcut to emulating the right-click 'save as' functionality?
UPDATE - Follow up question for James' answer
So I ran James' code to generate a page.html (and associated files) and compared it to the html file I got from manually clicking save-as. The page.html saved via James' script is great and has everything I need, but when opened in a browser it also shows a lot of extra formatting text that's hidden in the manually save'd page. See attached screenshot (manually saved page on the left, script-saved page with extra formatting text shown on right).
This is especially surprising to me because the raw html of the page saved by James' script seems to indicate those fields should still be hidden. See e.g. the html below, which appears the same in both files, but the text at issue only appears in the browser-rendered page on the one saved by James' script:
<p class="helpbox ui-ncbitoggler-slave ui-ncbitoggler" id="hlp1" aria-hidden="true">
These options control formatting of alignments in results pages. The
default is HTML, but other formats (including plain text) are available.
PSSM and PssmWithParameters are representations of Position Specific Scoring Matrices and are only available for PSI-BLAST.
The Advanced view option allows the database descriptions to be sorted by various indices in a table.
</p>
Any idea why this is happening?
As you noted, Selenium cannot interact with the browser's context menu to use Save as..., so instead to do so, you could use an external automation library like pyautogui.
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')
This code opens the Save as... window through its keyboard shortcut CTRL+S and then saves the webpage and its assets into the default downloads location by pressing enter. This code also names the file as the sequence in order to give it a unique name, though you could change this for your use case. If needed, you could additionally change the download location through some extra work with the tab and arrow keys.
Tested on Ubuntu 18.10; depending on your OS you may need to modify the key combination sent.
Full code, in which I also added conditional waits to improve speed:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui
URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)
# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)
blast_button = driver.find_element_by_id("b1")
blast_button.click()
# wait until results are loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))
# open 'Save as...' to save html and assets
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')
This is not a perfect solution, but it will get you most of what you need. You can replicate the behavior of "save as full web page (complete)" by parsing the html and downloading any loaded files (images, css, js, etc.) to their same relative path.
Most of the javascript won't work due to cross origin request blocking. But the content will look (mostly) the same.
This uses requests to save the loaded files, lxml to parse the html, and os for the path legwork.
from selenium import webdriver
import chromedriver_binary
from lxml import html
import requests
import os
driver = webdriver.Chrome()
URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA'
base = 'https://blast.ncbi.nlm.nih.gov/'
driver.get(URL)
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)
blast_button = driver.find_element_by_id("b1")
blast_button.click()
content = driver.page_source
# write the page content
os.mkdir('page')
with open('page/page.html', 'w') as fp:
fp.write(content)
# download the referenced files to the same path as in the html
sess = requests.Session()
sess.get(base) # sets cookies
# parse html
h = html.fromstring(content)
# get css/js files loaded in the head
for hr in h.xpath('head//#href'):
if not hr.startswith('http'):
local_path = 'page/' + hr
hr = base + hr
res = sess.get(hr)
if not os.path.exists(os.path.dirname(local_path)):
os.makedirs(os.path.dirname(local_path))
with open(local_path, 'wb') as fp:
fp.write(res.content)
# get image/js files from the body. skip anything loaded from outside sources
for src in h.xpath('//#src'):
if not src or src.startswith('http'):
continue
local_path = 'page/' + src
print(local_path)
src = base + src
res = sess.get(hr)
if not os.path.exists(os.path.dirname(local_path)):
os.makedirs(os.path.dirname(local_path))
with open(local_path, 'wb') as fp:
fp.write(res.content)
You should have a folder called page with a file called page.html in it with the content you are after.
Inspired by FThompson's answer above, I came up with the following tool that can download full/complete html for a given page url (see: https://github.com/markfront/SinglePageFullHtml)
UPDATE - follow up with Max's suggestion, below are steps to use the tool:
Clone the project, then run maven to build:
$> git clone https://github.com/markfront/SinglePageFullHtml.git
$> cd ~/git/SinglePageFullHtml
$> mvn clean compile package
Find the generated jar file in target folder: SinglePageFullHtml-1.0-SNAPSHOT-jar-with-dependencies.jar
Run the jar in command line like:
$> java -jar .target/SinglePageFullHtml-1.0-SNAPSHOT-jar-with-dependencies.jar <page_url>
The result file name will have a prefix "FP, followed by the hashcode of the page url, with file extension ".html". It will be found in either folder "/tmp" (which you can get by System.getProperty("java.io.tmp"). If not, try find it in your home dir or System.getProperty("user.home") in Java).
The result file will be a big fat self-contained html file that includes everything (css, javascript, images, etc.) referred to by the original html source.
I'll advise u to have a try on sikulix which is an image based automation tool for operate any widgets within PC OS, it supports python grammar and run with command line and maybe the simplest way to solve ur problem.
All u need to do is just give it a screenshot, call sikulix script in ur python automation script(with OS.system("xxxx") or subprocess...).