I need to download the images that are inside the custom-made CAPTCHA on this login site. How can I do it :(?
This is the login page; there are five images,
and this is the link: https://portalempresas.sb.cl/login.php
I've been trying with this code that another user (#EnriqueBet) helped me with:
from io import BytesIO
from PIL import Image

# Download image function
def downloadImage(element, imgName):
    img = element.screenshot_as_png
    stream = BytesIO(img)
    image = Image.open(stream).convert("RGB")
    image.save(imgName)

# Find all the web elements of the captcha images
image_elements = driver.find_elements_by_xpath("/html/body/div[1]/div/div/section/div[1]/div[3]/div/div/div[2]/div[*]")

# Output name for the images
image_base_name = "Imagen_[idx].png"

# Download each image
for i in range(len(image_elements)):
    downloadImage(image_elements[i], image_base_name.replace("[idx]", "%s" % i))
But when it tries to get all of the image elements
image_elements = driver.find_elements_by_xpath("/html/body/div[1]/div/div/section/div[1]/div[3]/div/div/div[2]/div[*]")
it fails and doesn't get any of them. Please, help! :(
Instead of defining an explicit path to the images, why not simply download all the images present on the page? This will work since the page itself only has 5 images and you want to download all of them. See the method below.
The following should extract all the images from a given page and write them to the directory where the script is being run.
import re
import requests
from bs4 import BeautifulSoup

site = ''  # set the page url here
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative;
            # if it is, provide the base url, which also happens
            # to be the site variable at the moment
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
The code is taken from here and credit goes to the respective owner.
This is a follow-on answer to my earlier post.
I have had no success getting my Selenium to run due to versioning issues between Selenium and my browser.
I have, however, thought of another way to download and extract all the images that appear in the captcha. As you can tell, the images change on each visit, so the best option is to automate the collection rather than manually saving the images from the site.
To automate it, follow the steps below.
First, navigate to the site using Selenium and take a screenshot of the page. For example:
from selenium import webdriver
DRIVER = 'chromedriver'
driver = webdriver.Chrome(DRIVER)
driver.get('https://www.spotify.com')
screenshot = driver.save_screenshot('my_screenshot.png')
driver.quit()
This saves the screenshot locally. You can then open the image using a library such as PIL and crop out the captcha images.
This would be done like so:
from PIL import Image

im = Image.open('my_screenshot.png').convert('L')
im = im.crop((1, 1, 98, 33))
im.save('0.png')
Hopefully you get the idea here. You will need to do this one by one for all the images, ideally in a for loop with the crop dimensions changed appropriately, as in the sketch below.
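For instance, a minimal sketch of such a loop, with purely illustrative crop boxes (you would need to measure the real positions of the five captcha tiles on your own screenshot):
from PIL import Image

# Hypothetical crop boxes (left, top, right, bottom) for each captcha tile;
# measure the real coordinates on your own screenshot.
crop_boxes = [
    (10, 100, 110, 200),
    (120, 100, 220, 200),
    (230, 100, 330, 200),
    (340, 100, 440, 200),
    (450, 100, 550, 200),
]

screenshot = Image.open('my_screenshot.png')
for i, box in enumerate(crop_boxes):
    tile = screenshot.crop(box)
    tile.save('captcha_tile_{}.png'.format(i))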
You can also try this. It will save the captcha image only:
from PIL import Image

def get_captcha_text(location, size):
    im = Image.open('screenshot.png')
    left = location['x']
    top = location['y']
    right = location['x'] + size['width']
    bottom = location['y'] + size['height']
    im = im.crop((left, top, right, bottom))  # defines crop points
    im.save('screenshot.png')
    return True

element = driver.find_element_by_id('captcha_image')  # div id or the captcha container id
location = element.location
# print(location)
size = element.size
driver.save_screenshot('screenshot.png')
get_captcha_text(location, size)
Is it possible to save an image from a site when Selenium is minimized?
At the moment I'm using this code:
img = driver.find_element_by_xpath('//*[@id="image_img"]')
img.screenshot('C:/foo.png')
This of course works, but this approach opens the browser when it takes the screenshot.
Is it possible to save an image from a given xpath in a minimized browser?
Unfortunately, downloading the photo via its URL is pointless, because the image is generated only once; when I download it via the URL, I get an empty file or a different image from the one that should be there.
site: https://thispersondoesnotexist.com/
You don't need Selenium to get pictures from this website; you can use the following code to download the image directly to your local machine.
import requests

r1 = requests.get("https://thispersondoesnotexist.com/image")
r1.raise_for_status()
print(r1.status_code, r1.reason)

tts_url = 'https://thispersondoesnotexist.com/image'
r2 = requests.get(tts_url, timeout=100, cookies=r1.cookies)
print(r2.status_code, r2.reason)

try:
    with open('test.jpeg', "w+b") as f:
        f.write(r2.content)
except IOError:
    print("IOError: could not write a file")
I'm new to Python and Selenium. My goal here is to download an image from the Google Image Search results page and save it as a file in a local directory, but so far I have been unable to download the image.
I'm aware there are other options (retrieving the image via its URL using requests, etc.), but I want to know if it's possible to do it using the image's "src" attribute, e.g., "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBxM..."
My code is below (I have removed all imports, etc., for brevity.):
# This creates the folder to store the image in
if not os.path.exists(save_folder):
    os.mkdir(save_folder)

driver = webdriver.Chrome(PATH)

# Goes to the given web page
driver.get("https://www.google.com/imghp?hl=en&ogbl")

# "q" is the name of the google search field input
search_bar = driver.find_element_by_name("q")

# Input the search term(s)
search_bar.send_keys("Ben Folds Songs for Silverman Album Cover")

# Returns the results (basically clicks "search")
search_bar.send_keys(Keys.RETURN)

# Wait 10 seconds for the images to load on the page before moving on to the next part of the script
try:
    # Returns a list
    search_results = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "islrg"))
    )
    # print(search_results.text)

    # Gets all of the images on the page (it should be a list)
    images = search_results.find_elements_by_tag_name("img")

    # I just want the first result
    image = images[0].get_attribute('src')

    ### Need help here ###
except:
    print("Error")
    driver.quit()

# Closes the browser
driver.quit()
I have tried:
urllib.request.urlretrieve(image, "00001.jpg")
and
urllib3.request.urlretrieve(image, f"{save_folder}/captcha.png")
But I've always hit the "except" block using those methods. After reading a promising post, I also tried:
bufferedImage = imageio.read(image)
outputFile = f"{save_folder}/image.png"
imageio.write(bufferedImage, "png", outputFile)
with similar results, though I believe the latter example used Java in the post and I may have made an error in translating it to Python.
I'm sure it's something obvious, but what am I doing wrong? Thank you for any help.
The URL you are dealing with in this case is a data URL, which is the image data itself encoded in base64.
Since Python 3.4 you can read this data and decode it to bytes with urllib.request.urlopen:
import urllib.request

data_url = "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBxM..."

with urllib.request.urlopen(data_url) as response:
    data = response.read()

with open("some_image.jpg", mode="wb") as f:
    f.write(data)
Alternatively you can decode the base64-encoded part of the data url yourself with base64:
import base64

data_url = "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBxM..."

base64_image_data = data_url.split(",")[1]
data = base64.b64decode(base64_image_data)

with open("some_image.jpg", mode="wb") as f:
    f.write(data)
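Plugged into your script, the part marked "### Need help here ###" might be filled with something like the sketch below; this assumes the src really is a data URL (a plain https URL would need urlretrieve instead) and reuses the image and save_folder names from your code:
import base64

# image holds the img's src attribute retrieved in your script
if image.startswith("data:image"):
    # strip the "data:image/jpeg;base64," prefix and decode the rest
    image_data = base64.b64decode(image.split(",", 1)[1])
    with open(os.path.join(save_folder, "00001.jpg"), "wb") as f:
        f.write(image_data)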
I'm trying to download a post from an Instagram page, but every time Selenium selects and downloads the profile photo instead.
def downloadPost(self, link):
    os.system("cls")
    self.link = link
    self.browser = webdriver.Chrome(self.drvPath, chrome_options=self.browserProfile)
    self.browser.get(link)
    time.sleep(2)
    img = self.browser.find_element_by_tag_name('img')
    src = img.get_attribute('src')
    urllib.request.urlretrieve(src, f"{self.imgPath}/igpost.png")
    self.browser.close()
The photo I want to capture is under the second img tag and I can't select it.
(HTML code that I'm trying to scrape)
There are lots of ways to do this, for example by grabbing the 2nd img tag or the first img that is a child of a div:
self.browser.find_elements_by_tag_name("img")[1]
self.browser.find_element_by_xpath("//div/img")
self.browser.find_element_by_xpath("//img[@alt='whateverattributevalueithas']")
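For instance, a minimal sketch of how the first selector could slot into your downloadPost method, reusing the self.browser, self.imgPath, and urllib.request names from your snippet:
# Grab the second <img> on the page (index 1) instead of the first one,
# which is the profile photo, then download it as before.
images = self.browser.find_elements_by_tag_name("img")
if len(images) > 1:
    src = images[1].get_attribute('src')
    urllib.request.urlretrieve(src, f"{self.imgPath}/igpost.png")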
How do I check if a hyperlink is an image link or a web link?
image_list = []
url = 'http://www.image.jpg/'

if any(x in '.jpg .gif .png .jpeg' for x in url):
    image_list.append(url)
else:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")
    for link in soup.find_all('img'):
        src = link.get('src')
        if src.startswith("https"):
            image_list.append(src)
The code above is meant to find out whether the hyperlink contains an image format; however, whenever I use a link that does not contain ".jpg", etc., it still appends the link to image_list and skips the else statement.
Let's look at this code:
any(x in '.jpg .gif .png .jpeg' for x in url):
This checks whether any character in the URL appears in the string '.jpg .gif .png .jpeg'. The 'p' from 'http' is in that string, so you will always get a true result.
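For example, a quick check in the REPL makes the problem visible:
>>> any(x in '.jpg .gif .png .jpeg' for x in 'http://example.com/page.html')
True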
Here's how you could check the extension of a URL:
import posixpath
import urllib.parse

IMAGE_EXTS = {'.png', '.jpg', '.jpeg', '.gif'}

url = 'http://example.com/'
if posixpath.splitext(urllib.parse.urlparse(url).path)[1] in IMAGE_EXTS:
    # Has image extension...
But that's a moot point, because the extension of a URL doesn't tell you whether it's an image. Unlike regular files, for URLs, the extension is completely irrelevant! You can have an .html URL which gives you a PNG image, or a .gif URL which is really an HTML web page. You need to check the Content-Type of the HTTP reply.
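A minimal sketch of that Content-Type check with requests might look like this, reusing the image_list name from your snippet (a HEAD request avoids downloading the body; some servers handle HEAD poorly, in which case a streamed GET works too):
import requests

url = 'http://example.com/some/link'

# Ask the server what it would return, without downloading the whole body.
resp = requests.head(url, allow_redirects=True)
content_type = resp.headers.get('Content-Type', '')

if content_type.startswith('image/'):
    image_list.append(url)  # it is an image, whatever the extension says
else:
    # treat it as a web page and scrape its <img> tags as before
    pass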
So I was doing some site scraping for my app. I need to download the captcha image to display it to the user, but every time I visit the captcha URL it generates a new captcha. How can I download the dynamically generated captcha for automated login?
e.g. https://academics.vit.ac.in/student/stud_login.asp
Here I download the captcha using the script below:
from bs4 import BeautifulSoup
import urllib2
import urllib
url = "https://academics.vit.ac.in/student/stud_login.asp"
content = urllib2.urlopen(url)
soup = BeautifulSoup(content)
img = soup.find('img',id ='imgCaptcha')
print img
urllib.urlretrieve(img['src'],'captcha.bmp')
But somehow this script doesn't seem to work.
1) One solution is to take a screenshot and crop out the captcha.
But I need a different solution, as I am going to work on devices with various screen sizes, so taking a screenshot would not serve the purpose.
img['src'] returns a relative url - captcha.asp. You have to make it into an absolute url before you can use it (https://academics.vit.ac.in/student/captcha.asp).
import urlparse
urllib.urlretrieve(urlparse.urljoin(url, img['src']), 'captcha.bmp')
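Your script uses urllib2/urlparse, so it is Python 2; if you move to Python 3, a rough equivalent of the same fix would be (a sketch, not tested against the site):
from urllib.parse import urljoin
from urllib.request import urlretrieve

# Resolve the relative src (e.g. "captcha.asp") against the page URL
# before downloading it.
urlretrieve(urljoin(url, img['src']), 'captcha.bmp')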