Python + Selenium: download image without extension

I'm using Python 3 with Selenium, and I need to download an image.
HTML:
<img id="labelImage" name="labelImage" border="0" width="672" height="456" alt="labelImage" src="/shipping/labelAction.handle?method=doGetLabelFromCache&isDecompressRequired=false&utype=null&cacheKey=774242409034SHIPPING_L">
Python code:
import urllib.request

found = browser.find_element_by_css_selector('img[alt="labelImage"]')
src = found.get_attribute('src')
urllib.request.urlretrieve(src, 'image.png')
The downloaded image file is empty. If I switch the extension to .html, it shows me the message below:
"We're sorry, we can't process your request right now. It appears you don't have permission to view this webpage"

The error you receive when attempting the download comes from the fact that the urllib call is a brand-new session for the server: it does not carry the cookies and authentication your browser has. It is the same as opening incognito mode in the browser and pasting the src attribute into the address bar - to the server you are a new client that hasn't filled in the form, logged in, etc.
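One common workaround, sketched here as an assumption rather than part of the original answer: copy the browser's cookies into a requests session, so the download request arrives with the same authentication. This can still fail if the server also checks headers (User-Agent, Referer) or other session state:

import requests

# Copy the authenticated cookies from the Selenium session into a
# requests session, then fetch the image with them.
session = requests.Session()
for cookie in browser.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

response = session.get(src)  # src from get_attribute('src') above
with open('image.png', 'wb') as f:
    f.write(response.content)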
You may want to try something else - taking a screenshot of just the <img> element within the Selenium/browser session. That operation has variable success: Chrome, for instance, added support for it only recently, and in some situations it still fails:
found = browser.find_element_by_css_selector('img[alt="labelImage"]')
try:
    found.screenshot('element.png')
except Exception as ex:  # FIXME: anti-pattern - I don't recall the exact exception; when you run the code, change it to the proper one
    print('The correct exception is {}'.format(ex))
    browser.get_screenshot_as_file('page.png')
If taking the element's screenshot fails, you'll get one of the whole page - which you can then trim to the element.
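A sketch of that trimming step, assuming Pillow is installed; the element's page coordinates come from Selenium, and the crop box is derived from them (device-pixel scaling can throw the coordinates off, so treat this as a starting point):

from PIL import Image  # Pillow

# Crop the full-page screenshot down to the element's bounding box.
location = found.location  # {'x': ..., 'y': ...} in page coordinates
size = found.size          # {'width': ..., 'height': ...}
box = (location['x'], location['y'],
       location['x'] + size['width'], location['y'] + size['height'])
Image.open('page.png').crop(box).save('element.png')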

Related

How to properly send file through selenium in python

I'm working with Selenium 3.141.0 and ChromeDriver 83.0.4103.
All the Selenium libraries are properly imported, and my script was working fine until I got this error.
I'm currently trying to upload a JSON file to an input:
<input type="file" class="file" id="ext-gen1563">
upload = self.driver.find_element(By.XPATH, '//input[@type="file"]')
upload.send_keys("C:\\absolutepathtofile.json")
I'm getting the same error every time:
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument:
I tried clicking the "Choose File" button on the form, and it works well until I need to pass the desired file; I understood that this isn't the right way to do it, so I worked on the input approach instead.
I cannot test with geckodriver or the Edge driver because my organisation does not allow me to use them.
Here is the complete code of the element :
<div class="uploader"><div class="import-file-form"><input type="file" class="file" id="ext-gen1563"></div><div class="filename">No file chosen.</div><div class="clickable btn" id="ext-gen1564">Choose File</div></div>
Can you give me some nudges to solve this problem?
Regards.
I've dealt with the same issue myself recently and it's possible that the solution I found for the website I was dealing with might work for yours.
I first identified the box where the file name ends up after you choose your file. This box never shows the full file path, nor can you type into it as a human in the browser. Once I had identified it, I simply sent the full path to the file to this box as keys... and it worked.
I was then able to just identify the 'submit' button and click it.
Here is the code I used, but you will obviously have to identify the elements of your own website.
The CSV variable is simply the path to the CSV file I'm passing in.
import os

ChooseFile = browser.find_element_by_name('files[field_import_file_und_0]')
ChooseFile.send_keys(os.path.abspath(CSV))

Upload = browser.find_element_by_xpath('/html/body/div[3]/div[1]/div[2]/div/div/div/div/div/div[2]/div/div/div/form/div/div[2]/div/div/div[1]/input[2]')
Upload.click()

Submit = browser.find_element_by_name('op')
Submit.click()
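Applied to the markup in the question, the same idea would look roughly like this (a sketch, untested against your page; send_keys must target the <input type="file"> itself, and the path string must not contain stray invisible characters, a known cause of InvalidArgumentException):

import os

# Send the absolute path straight to the hidden file input,
# not to the visible "Choose File" button.
file_input = self.driver.find_element(By.XPATH, '//input[@type="file"]')
file_input.send_keys(os.path.abspath("C:\\absolutepathtofile.json"))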
def base(request):
    if request.method == "GET":
        return render(request, 'base.html')
    else:
        title = request.POST.get('title')
        file = request.FILES.get('file')  # uploaded files arrive in request.FILES, not request.POST
        data = models.base(title=title, file=file)
        data.save()
        return render(request, 'base.html')

Python - Can't go through "Open URL: xxx Always open this type of links in the associated application"

This is my first topic and question. I'm trying to write a script in Python (PyCharm) that helps me log on to SAP via Citrix from the browser. My problem is that I can't get past the window that pops up:
Open URL: xxx Always open this type of links in the associated application.
My script:
*imports that I need*
driver = webdriver.Chrome(r'C:\User\Chromium\chromedriver.exe')  # raw string: \U would otherwise be an escape error
driver.maximize_window()
driver.get('https://XYZ.dk')
Detect_Reciver = driver.find_element_by_xpath('//*[@id="protocolhandler-welcome"]/div/div/div/div/a')
if Detect_Reciver:  # the original Detect_Reciver == Detect_Reciver is always True
    Detect_Reciver.click()
    time.sleep(2)
The problem appears here: after the new page loads, the "Open URL: xxx Always open this type of links in the associated application." dialog pops up, and I want to press Enter or Escape from Python. send_keys(Keys.ENTER) / send_keys(Keys.ESCAPE) doesn't work.
Can you help me out :)?
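send_keys only reaches DOM elements; this dialog is a native browser window, so it never sees the keystroke. One possible OS-level workaround, sketched under the assumption that the pyautogui package is installed and the dialog has keyboard focus:

import time
import pyautogui  # sends real OS-level keystrokes, unlike Selenium's send_keys

Detect_Reciver.click()
time.sleep(2)           # wait for the native "Open URL" dialog to appear
pyautogui.press('esc')  # or 'enter' to accept and open the associated application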

python webdriver os window

I need to upload a file using Python and Selenium. When I click the upload HTML element, a "File Upload" window opens, and the click() method does not return because it waits for the page to fully load. Therefore I cannot continue with the pywinauto code that controls the window.
The first method clicks the HTML element (an img) to upload a new file:
def add_file(self):
    return self.selenium.find_element(By.ID, "add_file").click()
and the second method uses pywinauto to type the path to the file and then click Open:
def upload(self):
    from pywinauto import application
    app = application.Application()
    app.connect_(title_re="File Upload")
    app.file_upload.TypeKeys("C:\\Path\\To\\File")
    app.file_upload.Open.Click()
How can I force add_file method to return and to be able to run the upload method?
Solved it. There was an iframe handling the upload, but it was hidden and I didn't see it at first. The iframe contains an input of type file, also hidden. To solve it, make the iframe visible using JavaScript:
selenium.execute_script("document.getElementById('iframe_id').style.display = 'block';")
then switch to the iframe and make the input visible also:
selenium.switch_to_frame(0)
selenium.execute_script("document.getElementById('input_field_id').style.visibility = 'visible';")
and simply send the path to the input:
selenium.find_element(By.ID, 'input_field_id').send_keys("path\\to\\file")
On Windows, escape each backslash in the Python string literal ("C:\\path\\to\\file") or use a raw string (r"C:\path\to\file").
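Put together, the whole sequence might look like this (a sketch; 'iframe_id' and 'input_field_id' are placeholders for the real IDs on your page):

# Unhide the iframe that wraps the upload form, switch into it,
# unhide the file input, and send the path directly to it.
selenium.execute_script("document.getElementById('iframe_id').style.display = 'block';")
selenium.switch_to_frame(0)
selenium.execute_script("document.getElementById('input_field_id').style.visibility = 'visible';")
selenium.find_element(By.ID, 'input_field_id').send_keys("C:\\path\\to\\file")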

Captchas in Scrapy

I'm working on a Scrapy app where I'm trying to log in to a site with a form that uses a captcha (it's not spam). I am using ImagesPipeline to download the captcha, and I am printing it to the screen for the user to solve. So far so good.
My question is how I can restart the spider to submit the captcha/form information. Right now my spider requests the captcha page, then returns an Item containing the image_url of the captcha. This is then processed/downloaded by the ImagesPipeline and displayed to the user. I'm unclear on how to resume the spider's progress and pass the solved captcha and the same session to the spider, as I believe the spider has to return the item (i.e. quit) before the ImagesPipeline goes to work.
I've looked through the docs and examples, but I haven't found any that make it clear how to do this.
This is how you might get it to work inside the spider.
self.crawler.engine.pause()
process_my_captcha()
self.crawler.engine.unpause()
Once you get the request, pause the engine, display the image, read the info from the user, and resume the crawl by submitting a POST request for login.
I'd be interested to know if the approach works for your case.
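A sketch of how that flow might sit inside a spider callback (parse_captcha_page and after_login are hypothetical names; the blocking input() stands in for however you display the image and collect the answer):

import scrapy

def parse_captcha_page(self, response):
    # Pause the engine while a human solves the captcha.
    self.crawler.engine.pause()
    captcha = input("enter the captcha shown on screen > ")
    self.crawler.engine.unpause()
    # Resume the crawl by submitting the login form with the solved captcha.
    yield scrapy.FormRequest.from_response(
        response,
        formdata={'captcha': captcha},
        callback=self.after_login,
    )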
I would not create an Item and use the ImagePipeline.
import urllib
import os
import subprocess
...

def start_requests(self):
    request = Request("http://webpagewithcaptchalogin.com/", callback=self.fill_login_form)
    return [request]

def fill_login_form(self, response):
    x = HtmlXPathSelector(response)
    img_src = x.select("//img/@src").extract()

    # delete the previous captcha file and use urllib to write the new one to disk
    os.remove("c:\\captcha.jpg")
    urllib.urlretrieve(img_src[0], "c:\\captcha.jpg")

    # I use a program here to show the jpg (actually send it somewhere)
    captcha = subprocess.check_output(r".\external_utility_solving_captcha.exe")

    # OR just get the input from the user from stdin
    captcha = raw_input("put captcha in manually>")

    # this request performs the login and calls process_home_page with the
    # response (this way you can chain pages from start_requests() to parse())
    return [FormRequest.from_response(response, formnumber=0,
                                      formdata={'user': 'xxx', 'pass': 'xxx', 'captcha': captcha},
                                      callback=self.process_home_page)]

def process_home_page(self, response):
    # check if you logged in etc. etc.
    ...
What I do here is use urllib.urlretrieve(url) (to store the image), os.remove(file) (to delete the previous image), and subprocess.check_output (to call an external command-line utility that solves the captcha). The whole Scrapy infrastructure is not used in this "hack", because solving a captcha like this is always a hack.
The whole external-subprocess call could have been done more nicely, but it works.
On some sites it's not possible to save the captcha image; instead you have to call up the page in a browser, call a screen-capture utility, and crop at an exact location to "cut out" the captcha. Now that is screen scraping.
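That screen-capture-and-crop step might look like this with Pillow (a sketch; the coordinates are hypothetical and have to be measured for the actual page):

from PIL import ImageGrab  # Pillow; works on Windows and macOS

# Hypothetical pixel coordinates of the captcha on screen.
left, top, right, bottom = 100, 200, 300, 260
ImageGrab.grab(bbox=(left, top, right, bottom)).save("captcha.png")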

Using Python/Selenium/Best Tool For The Job to get URI of image requests generated through JavaScript?

I have some JavaScript from a 3rd party vendor that is initiating an image request. I would like to figure out the URI of this image request.
I can load the page in my browser, and then monitor "Live HTTP Headers" or "Tamper Data" in order to figure out the image request URI, but I would prefer to create a command line process to do this.
My intuition is that it might be possible using python + qtwebkit, but perhaps there is a better way.
To clarify: I might have this (overly simplified) code:
<script>
    suffix = magicNumberFunctionIDontHaveAccessTo();
    url = "http://foobar.com/function?parameter=" + suffix;
    img = document.createElement('img');
    img.src = url;
    document.body.appendChild(img);
</script>
Then once the page is loaded, I can figure out the URL by sniffing the packets. But I can't just work it out from the source, because I can't predict the outcome of magicNumberFunction...().
Any help would be much appreciated!
Thank you.
The simplest thing to do might be to use something like HtmlUnit and skip a real browser entirely. By using Rhino, it can evaluate JavaScript and likely be used to extract that URL out.
That said, if you can't get that working, try out Selenium RC and use the captureNetworkTraffic command (which requires the Selenium instance be started with the option captureNetworkTraffic=true). This will launch Firefox with a proxy configured and then let you pull the request info back out as JSON/XML/plain text. Then you can parse that content and get what you want.
Try out the instant test tool that my company offers. If the data you're looking for is in our results (after you click View Details), you'll be able to get it from Selenium. I know, since I wrote the captureNetworkTraffic API for Selenium for my company, BrowserMob.
I would pick any one of the many http proxy servers written in Python -- probably one of the simplest ones at the very top of the list -- and tweak it to record all URLs requested (as well as proxy-serve them) e.g. appending them to a text file -- without loss of generality, call that text file 'XXX.txt'.
Now all you need is a script that: starts the proxy server in question; starts Firefox (or whatever) on your main desired URL with the proxy in question set as its proxy (see e.g. this SO question for how), though I'm sure other browsers would work just as well; waits a bit (e.g. until the proxy's XXX.txt file has not been altered for more than N seconds); reads XXX.txt to extract only the URLs you care about and records them wherever you wish; and shuts down the proxy and Firefox processes.
I think this will be much faster to put in place and make work correctly, for your specific requirements, than any more general solution based on qtwebkit, selenium, or other "automation kits".
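For illustration, a minimal URL-logging proxy along those lines, using only the Python standard library (a sketch: it handles plain-HTTP GET only, no CONNECT/HTTPS, and simply appends every requested URL to XXX.txt):

import http.server
import socketserver
import urllib.request

LOG_FILE = 'XXX.txt'

class LoggingProxy(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # When acting as a proxy, self.path holds the absolute request URL.
        with open(LOG_FILE, 'a') as log:
            log.write(self.path + '\n')
        try:
            with urllib.request.urlopen(self.path) as upstream:
                body = upstream.read()
                status = upstream.status
                ctype = upstream.headers.get('Content-Type', 'application/octet-stream')
            self.send_response(status)
            self.send_header('Content-Type', ctype)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except Exception as exc:
            self.send_error(502, str(exc))

# Point the browser's HTTP proxy at 127.0.0.1:8888 and watch XXX.txt grow.
socketserver.ThreadingTCPServer(('127.0.0.1', 8888), LoggingProxy).serve_forever()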
Use the Firebug Firefox plugin. It will show you all requests in real time, and you can even debug the JS in your browser or run it step by step.
Ultimately, I did it in Python using Selenium RC. This solution requires the Python files for Selenium RC, and you need to start the Java server ("java -jar selenium-server.jar").
from selenium import selenium
import unittest
import lxml.html

class TestMyDomain(unittest.TestCase):
    def setUp(self):
        self.selenium = selenium("localhost",
                                 4444, "*firefox", "http://www.MyDomain.com")
        self.selenium.start()

    def test_mydomain(self):
        htmldoc = open('site-list.html').read()
        url_list = [link for (element, attribute, link, pos) in lxml.html.iterlinks(htmldoc)]
        for url in url_list:
            try:
                sel = self.selenium
                sel.open(url)
                sel.select_window("null")
                js_code = '''
                    myDomainWindow = this.browserbot.getUserWindow();
                    for(obj in myDomainWindow) {
                        /* This code grabs the OMNITURE tracking pixel img */
                        if ((obj.substring(0,4) == 's_i_') && (myDomainWindow[obj].src)) {
                            var ret = myDomainWindow[obj].src;
                        }
                    }
                    ret;
                '''
                omniture_url = sel.get_eval(js_code)  # parse & process this however you want
            except Exception, e:
                print 'We ran into an error: %s' % (e,)
        self.assertEqual("expectedValue", observedValue)

    def tearDown(self):
        self.selenium.stop()

if __name__ == "__main__":
    unittest.main()
Why can't you just read suffix, or url for that matter? Is the image loaded in an iframe or in your page?
If it is loaded in your page, then this may be a dirty hack (substitute document.body with whatever element is involved):
var ac = document.body.appendChild;
var sources = [];
document.body.appendChild = function(child) {
    if (/^img$/i.test(child.tagName)) {
        sources.push(child.getAttribute('src'));
    }
    return ac.call(document.body, child); // call with the original binding, or it throws
};
