I'm trying to download a post from an Instagram page, but every time, Selenium selects and downloads the profile photo instead.
def downloadPost(self, link):
    os.system("cls")
    self.link = link
    self.browser = webdriver.Chrome(self.drvPath, chrome_options=self.browserProfile)
    self.browser.get(link)
    time.sleep(2)
    img = self.browser.find_element_by_tag_name('img')
    src = img.get_attribute('src')
    urllib.request.urlretrieve(src, f"{self.imgPath}/igpost.png")
    self.browser.close()
The photo I want to capture is in the second img tag, and I can't work out how to select it.
Here is the HTML I'm trying to scrape:
There are lots of ways to do this: grab the second img tag, grab the first img that is a child of a div, or match on one of the tag's attributes.
self.browser.find_elements_by_tag_name("img")[1]
self.browser.find_element_by_xpath("//div/img")
self.browser.find_element_by_xpath("//img[@alt='whateverattributevalueithas']")
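If you want to sanity-check the "second img tag" approach offline, the same selection can be reproduced against saved page HTML with just the standard library (a sketch; the sample HTML below is made up, not the real Instagram markup):

```python
from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self.sources.append(dict(attrs).get('src'))

html = (
    "<html><body>"
    "<img src='profile.jpg'>"          # first img: the profile photo
    "<div><img src='post.jpg'></div>"  # second img: the actual post
    "</body></html>"
)
p = ImgCollector()
p.feed(html)
print(p.sources[1])  # → post.jpg
```

Indexing `[1]` here corresponds to `find_elements_by_tag_name("img")[1]` in the browser.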
img = "<img src='C:/Users/semnome/Desktop/21291010002.jpg'>"  # the path is OK; it opens if I try it manually in the URL bar
elm = driver.find_element_by_xpath(".//*[@id='content']")
driver.execute_script("arguments[0].innerHTML = arguments[1];", elm, img)
This is a local run (Chrome in debugging mode).
The script creates the tag, but the image doesn't show in the browser (just the broken-image icon). What am I doing wrong?
I need to download the images inside the custom-made CAPTCHA on this login site. How can I do it? :(
This is the login site; there are five images,
and this is the link: https://portalempresas.sb.cl/login.php
I've been trying with this code that another user (@EnriqueBet) helped me with:
from io import BytesIO
from PIL import Image

# Download image function
def downloadImage(element, imgName):
    img = element.screenshot_as_png
    stream = BytesIO(img)
    image = Image.open(stream).convert("RGB")
    image.save(imgName)

# Find all the web elements of the captcha images
image_elements = driver.find_elements_by_xpath("/html/body/div[1]/div/div/section/div[1]/div[3]/div/div/div[2]/div[*]")

# Output name for the images
image_base_name = "Imagen_[idx].png"

# Download each image
for i in range(len(image_elements)):
    downloadImage(image_elements[i], image_base_name.replace("[idx]", "%s" % i))
But when it tries to get all of the image elements
image_elements = driver.find_elements_by_xpath("/html/body/div[1]/div/div/section/div[1]/div[3]/div/div/div[2]/div[*]")
It fails and doesn't get any of them. Please, help! :(
Instead of defining an explicit path to the images, why not simply download all images that are present on the page. This will work since the page itself only has 5 images and you want to download all of them. See the method below.
The following should extract all images from a given page and write them to the directory where the script is run.
import re
import requests
from bs4 import BeautifulSoup

site = ''  # set the page URL here
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')
urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if filename is None:
        continue  # skip sources that don't look like image files
    if 'http' not in url:
        # sometimes an image source can be relative;
        # if it is, prepend the base url, which also happens
        # to be the site variable atm.
        url = '{}{}'.format(site, url)
    response = requests.get(url)
    with open(filename.group(1), 'wb') as f:
        f.write(response.content)
The code is taken from here and credit goes to the respective owner.
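One fragile spot in the snippet above is the 'http' not in url check: plain string concatenation only handles one shape of relative source. The standard library's urllib.parse.urljoin resolves relative sources against the page URL more generally (a sketch; the image paths are illustrative):

```python
from urllib.parse import urljoin

page = 'https://portalempresas.sb.cl/login.php'

# Relative sources are resolved against the page's directory,
# root-relative ones against the host, and absolute URLs pass through.
print(urljoin(page, 'captcha/img1.png'))                  # → https://portalempresas.sb.cl/captcha/img1.png
print(urljoin(page, '/captcha/img2.png'))                 # → https://portalempresas.sb.cl/captcha/img2.png
print(urljoin(page, 'https://cdn.example.com/img3.png'))  # → https://cdn.example.com/img3.png
```

Swapping the concatenation for urljoin(site, url) would make the loop robust to all three cases.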
This is a follow-on answer from my earlier post.
I have had no success getting Selenium to run due to version mismatches between Selenium and my browser.
I have, though, thought of another way to download and extract all the images that appear in the captcha. As you can tell, the images change on each visit, so the best option is to automate collecting them rather than manually saving each image from the site.
To automate it, follow the steps below.
Firstly, navigate to the site using selenium and take a screenshot of the site. For example,
from selenium import webdriver
DRIVER = 'chromedriver'
driver = webdriver.Chrome(DRIVER)
driver.get('https://www.spotify.com')
screenshot = driver.save_screenshot('my_screenshot.png')
driver.quit()
This saves the screenshot locally. You can then open it with a library such as PIL and crop out the captcha images.
This would be done like so
from PIL import Image

im = Image.open('0.png').convert('L')  # open the screenshot in greyscale
im = im.crop((1, 1, 98, 33))           # (left, upper, right, lower)
im.save('my_screenshot.png')
Hopefully you get the idea here. You will need to do this one by one for all the images, ideally in a for loop with the crop dimensions changed appropriately.
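The crop boxes for that loop can be computed up front. Here is a sketch of the arithmetic for a row of equally sized tiles; all the coordinates and spacing below are made-up placeholders that you would measure from your own screenshot:

```python
def captcha_crop_boxes(left, top, width, height, gap, count):
    """Compute (left, upper, right, lower) boxes for `count`
    equally sized captcha tiles laid out in a row."""
    boxes = []
    for i in range(count):
        x = left + i * (width + gap)
        boxes.append((x, top, x + width, top + height))
    return boxes

# Five 100x60 tiles starting at (10, 200), 5 px apart (placeholder numbers)
for box in captcha_crop_boxes(10, 200, 100, 60, 5, 5):
    print(box)
```

Each box can then be passed straight to im.crop(box) inside the loop.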
You can also try this; it will save the captcha image only:
from PIL import Image

def get_captcha_text(location, size):
    im = Image.open('screenshot.png')
    left = location['x']
    top = location['y']
    right = location['x'] + size['width']
    bottom = location['y'] + size['height']
    im = im.crop((left, top, right, bottom))  # defines crop points
    im.save('screenshot.png')
    return True

element = driver.find_element_by_id('captcha_image')  # div id or the captcha container id
location = element.location
#print(location)
size = element.size
driver.save_screenshot('screenshot.png')
get_captcha_text(location, size)
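One caveat (an assumption based on common Selenium setups, not something stated above): on high-DPI displays the saved screenshot can be larger than the CSS-pixel coordinates that element.location reports, so the crop box may need scaling by the device pixel ratio. A sketch of just the box arithmetic:

```python
def crop_box(location, size, pixel_ratio=1.0):
    """Turn Selenium's element location/size (CSS pixels) into a
    PIL crop box, optionally scaled for high-DPI screenshots."""
    left = int(location['x'] * pixel_ratio)
    top = int(location['y'] * pixel_ratio)
    right = int((location['x'] + size['width']) * pixel_ratio)
    bottom = int((location['y'] + size['height']) * pixel_ratio)
    return (left, top, right, bottom)

print(crop_box({'x': 20, 'y': 50}, {'width': 100, 'height': 40}))       # 1x display
print(crop_box({'x': 20, 'y': 50}, {'width': 100, 'height': 40}, 2.0))  # 2x (Retina) display
```

If the cropped region looks shifted or half-sized, comparing the screenshot's pixel size to the page's CSS viewport size will tell you the ratio to pass in.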
I'm following the tutorial for Automate the Boring Stuff's web-scraping section, and want to scrape the images from https://swordscomic.com/ .
The script should 1) download and parse the html 2) download the comic image 3) click on the "previous comic" button 4) repeat 1 - 3
The script is able to download the first comic, but gets stuck either on hitting the "previous comic" button, or downloading the next comic image.
Possible issues for this may be:
Al's tutorial instructs to find the "rel" selector, but I am unable to find it; I believe this site uses a slightly different format than the site the tutorial scrapes. I think I'm using the correct selector, but the script still crashes.
It may also be that this site's landing page contains a comic image directly, while each "previous" comic lives under an additional path segment (in the form of /CCCLXVIII/ or similar).
I have tried:
adding the edition # for the comic for the initial page, but that only causes the script to crash earlier.
pointing the "previous button" part of the script to a different selector in the element, but still gives me the "Index out of range" error.
Here is the script as I have it:
#! python3
# swordscraper.py - Downloads all the swords comics.

import requests, os, bs4

os.chdir(r'C:\Users\bromp\OneDrive\Desktop\Python')
os.makedirs('swords', exist_ok=True)  # store comics in ./swords

url = 'https://swordscomic.com/'  # starting url
while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Find the URL of the comic image.
    comicElem = soup.select('#comic-image')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = comicElem[0].get('src')
        comicUrl = "http://" + comicUrl
        if 'swords' not in comicUrl:
            comicUrl = comicUrl[:7] + 'swordscomic.com/' + comicUrl[7:]

        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        res.raise_for_status()

        # Save the image to ./swords
        imageFile = open(os.path.join('swords', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()

    # Get the Prev button's url.
    prevLink = soup.select('a[id=navigation-previous]')[0]
    url = 'https://swordscomic.com/' + prevLink.get('href')

print('Done')
Here is the output the script does and the particular error message it gives:
Downloading page https://swordscomic.com/...
Downloading image http://swordscomic.com//media/Swords363bt.png...
Downloading page https://swordscomic.com//comic/CCCLXII/...
Could not find comic image.
Traceback (most recent call last):
File "C:\...\", line 39, in <module>
prevLink = soup.select('a[id=navigation-previous]')[0]
IndexError: list index out of range
The page is rendered with JavaScript. In particular, the link you extract has an onclick() event which presumably loads the next page, and the page also uses XHR. So your only option is a technology that renders JavaScript: try Selenium or requests-html (https://github.com/psf/requests-html).
I need to scroll through a web page (for example, Twitter) and scrape the new elements that appear as you move down the site. I'm trying to do this with Python 3.x, Selenium, and PhantomJS. This is my code:
import time
from selenium import webdriver
from bs4 import BeautifulSoup

user = 'ciroylospersas'

# Start web browser
#browser = webdriver.Firefox()
browser = webdriver.PhantomJS()
browser.set_window_size(1024, 768)
browser.get("https://twitter.com/")

# Fill username in login
element = browser.find_element_by_id("signin-email")
element.clear()
element.send_keys('your twitter user')

# Fill password in login
element = browser.find_element_by_id("signin-password")
element.clear()
element.send_keys('your twitter pass')

browser.save_screenshot('screen.png')  # save a screenshot to disk

# Submit the login
element.submit()
time.sleep(5)
browser.save_screenshot('screen1.png')  # save a screenshot to disk

# Move to the following url
browser.get("https://twitter.com/" + user + "/following")
browser.save_screenshot('screen2.png')  # save a screenshot to disk

scroll_script = "var h = document.body.scrollHeight; window.scrollTo(0, h); return h;"
newHeight = browser.execute_script(scroll_script)
print(newHeight)
browser.save_screenshot('screen3.png')  # save a screenshot to disk
The problem is I can't scroll to the bottom: screen2.png and screen3.png are identical. But if I change the webdriver from PhantomJS to Firefox, the same code works fine. Why?
I was able to get this to work in PhantomJS when trying to solve a similar problem:

check_height = browser.execute_script("return document.body.scrollHeight;")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = browser.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break
    check_height = height
It will scroll to the current "bottom", wait, see if the page loaded more, and bail if it did not (assuming everything got loaded if the heights match.)
In my original code I had a "max" value I checked alongside the matching heights because I was only interested in the first 10 or so "pages". If there were more I wanted it to stop loading and skip them.
Also, this is the answer I used as an example
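The "max" guard described above can be sketched with the browser calls stubbed out behind plain callables, so the loop logic is testable on its own (scroll_once and get_height are made-up names for illustration):

```python
def scroll_until_stable(scroll_once, get_height, max_scrolls=10):
    """Scroll until the page height stops growing or max_scrolls is hit.
    Returns the number of scrolls performed."""
    check_height = get_height()
    for n in range(max_scrolls):
        scroll_once()
        height = get_height()
        if height == check_height:
            return n + 1  # page stopped growing; assume fully loaded
        check_height = height
    return max_scrolls

# Simulate a page that grows twice and then stops loading more content
heights = iter([1000, 2000, 3000, 3000, 3000])
current = [next(heights)]
def scroll_once():
    current[0] = next(heights, current[0])

print(scroll_until_stable(scroll_once, lambda: current[0]))  # → 3
```

In a real run, scroll_once would be a browser.execute_script scroll and get_height would return document.body.scrollHeight, as in the loop above.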
Using Python, Selenium, and Firefox: I am clicking a link on a homepage that leads directly to a JPG file. I just want to verify that the image loads. The HTML of the image is this:
<img src="https://www.treasury.gov/about/organizational-structure/ig/Agency%20Documents/Organizational%20Chart.jpg" alt="https://www.treasury.gov/about/organizational-structure/ig/Agency%20Documents/Organizational%20Chart.jpg">
I am trying to use xpath for locating the element:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_element_visibility(self, waitTime, locatorMode, Locator):
    element = None
    if locatorMode == LocatorMode.XPATH:
        element = WebDriverWait(self.driver, waitTime).\
            until(EC.visibility_of_element_located((By.XPATH, Locator)))
    else:
        raise Exception("Unsupported locator strategy.")
    return element
Using this dictionary:
OrganizationalChartPageMap = dict(
    OrganizationalChartPictureXpath="//img[contains(@src, 'Chart.jpg')]",
)
This is the code I am running:
def _verify_page(self):
try:
self.wait_for_element_visibility(20,
"xpath",
OrganizationalChartPageMap['OrganizationalChartPictureXpath']
)
except:
raise IncorrectPageException
I get the IncorrectPageException thrown every time. Am I doing this all wrong? Is there a better way to verify images using Selenium?
Edit: here is the DOM of the elements:
Matching on the alt value should also work in XPath; I would suggest changing the dictionary to:
OrganizationalChartPageMap = dict(OrganizationalChartPictureXpath="//img[@alt='https://www.treasury.gov/about/organizational-structure/ig/Agency%20Documents/Organizational%20Chart.jpg' and contains(@src, 'Chart.jpg')]")
OR
alternatively use the full path to the image in the src:
OrganizationalChartPageMap = dict(OrganizationalChartPictureXpath="//img[@src='https://www.treasury.gov/about/organizational-structure/ig/Agency%20Documents/Organizational%20Chart.jpg']")
Edit:
According to the DOM shared in the image, you can also use the class of the img; the corresponding code in your project would be:
element = driver.find_element_by_class_name('shrinkToFit')