Web Scraping Images: Cannot find 'rel' selector - python

I'm following the tutorial for Automate the Boring Stuff's web-scraping section, and want to scrape the images from https://swordscomic.com/ .
The script should (1) download and parse the HTML, (2) download the comic image, (3) click the "previous comic" button, and (4) repeat steps 1-3.
The script is able to download the first comic, but gets stuck either on hitting the "previous comic" button, or downloading the next comic image.
Possible issues for this may be:
Al's tutorial instructs you to find the "rel" selector, but I am unable to find it. I believe this site uses a slightly different format than the site Al's tutorial scrapes. I believe I'm using the correct selector, but the script still crashes.
It may also be that this site's landing page contains a comic image directly, while each "previous" comic has an additional path segment (in the form of /CCCLXVIII/ or something similar).
I have tried:
adding the edition # of the comic to the initial page URL, but that only causes the script to crash earlier.
pointing the "previous button" part of the script at a different selector in the element, but that still gives me the "index out of range" error.
Here is the script as I have it:
#! python3
# swordscraper.py - Downloads all the Swords comics.

import requests, os, bs4

os.chdir(r'C:\Users\bromp\OneDrive\Desktop\Python')
os.makedirs('swords', exist_ok=True)  # store comics in ./swords

url = 'https://swordscomic.com/'  # starting url
while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Find the URL of the comic image.
    comicElem = soup.select('#comic-image')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = comicElem[0].get('src')
        comicUrl = "http://" + comicUrl
        if 'swords' not in comicUrl:
            comicUrl = comicUrl[:7] + 'swordscomic.com/' + comicUrl[7:]

        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        res.raise_for_status()

        # Save the image to ./swords
        imageFile = open(os.path.join('swords', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()

    # Get the Prev button's url.
    prevLink = soup.select('a[id=navigation-previous]')[0]
    url = 'https://swordscomic.com/' + prevLink.get('href')

print('Done')
print('Done')
Here is the output the script produces and the particular error message it gives:
Downloading page https://swordscomic.com/...
Downloading image http://swordscomic.com//media/Swords363bt.png...
Downloading page https://swordscomic.com//comic/CCCLXII/...
Could not find comic image.
Traceback (most recent call last):
File "C:\...\", line 39, in <module>
prevLink = soup.select('a[id=navigation-previous]')[0]
IndexError: list index out of range

The page is rendered with JavaScript. In particular, the link you extract has an onclick event which presumably loads the next page. In addition, the page uses XHR. So your only option is to use a technology that renders JavaScript; try Selenium or requests-html (https://github.com/psf/requests-html).
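For example, a minimal sketch with requests-html, assuming the same #comic-image and #navigation-previous ids are still present once the JavaScript has run (render() downloads a Chromium build on first use):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://swordscomic.com/')
r.html.render()  # executes the page's JavaScript in headless Chromium

# The usual CSS selectors should work against the rendered HTML
comic = r.html.find('#comic-image', first=True)
if comic:
    print(comic.attrs.get('src'))

prev = r.html.find('a#navigation-previous', first=True)
if prev:
    print(prev.attrs.get('href'))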

Related

How can I download images from a CAPTCHA with Python?

I need to download the images that are inside the custom-made CAPTCHA on this login site. How can I do it :(?
This is the login site, and there are five images.
Here is the link: https://portalempresas.sb.cl/login.php
I've been trying with this code that another user (#EnriqueBet) helped me with:
from io import BytesIO
from PIL import Image

# Download image function
def downloadImage(element, imgName):
    img = element.screenshot_as_png
    stream = BytesIO(img)
    image = Image.open(stream).convert("RGB")
    image.save(imgName)

# Find all the web elements of the captcha images
image_elements = driver.find_elements_by_xpath("/html/body/div[1]/div/div/section/div[1]/div[3]/div/div/div[2]/div[*]")

# Output name for the images
image_base_name = "Imagen_[idx].png"

# Download each image
for i in range(len(image_elements)):
    downloadImage(image_elements[i], image_base_name.replace("[idx]", "%s" % i))
But when it tries to get all of the image elements
image_elements = driver.find_elements_by_xpath("/html/body/div[1]/div/div/section/div[1]/div[3]/div/div/div[2]/div[*]")
It fails and doesn't get any of them. Please, help! :(
Instead of defining an explicit path to the images, why not simply download all the images present on the page? This will work since the page itself only has 5 images and you want to download all of them. See the method below.
The following should extract all images from a given page and write them to the directory where the script is being run.
import re
import requests
from bs4 import BeautifulSoup

site = ''  # set image url here
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')
urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative
            # if it is, provide the base url which also happens
            # to be the site variable atm.
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
The code is taken from here and credit goes to the respective owner.
This is a follow-on answer from my earlier post.
I have had no success getting Selenium to run due to versioning issues between Selenium and my browser.
I have, though, thought of another way to download and extract all the images appearing in the captcha. As you can tell, the images change on each visit, so to collect all of them the best option is to automate the process rather than manually saving the images from the site.
To automate it, follow the steps below.
Firstly, navigate to the site using selenium and take a screenshot of the site. For example,
from selenium import webdriver
DRIVER = 'chromedriver'
driver = webdriver.Chrome(DRIVER)
driver.get('https://www.spotify.com')
screenshot = driver.save_screenshot('my_screenshot.png')
driver.quit()
This saves the screenshot locally. You can then open the image using a library such as PIL and crop out the captcha images.
This would be done like so:
from PIL import Image

im = Image.open('0.png').convert('L')
im = im.crop((1, 1, 98, 33))
im.save('my_screenshot.png')
Hopefully you get the idea here. You will need to do this one by one for all the images, ideally in a for loop with the crop dimensions changed appropriately.
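For example, a minimal sketch of that loop; the crop boxes below are placeholders, and the real coordinates would have to be measured from your screenshot:

from PIL import Image

# Hypothetical (left, top, right, bottom) boxes for the five captcha images
crop_boxes = [
    (10, 100, 110, 200),
    (120, 100, 220, 200),
    (230, 100, 330, 200),
    (340, 100, 440, 200),
    (450, 100, 550, 200),
]

screenshot = Image.open('my_screenshot.png')
for i, box in enumerate(crop_boxes):
    screenshot.crop(box).save('Imagen_%s.png' % i)  # one file per captcha image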
You can also try this; it will save only the captcha image.
from PIL import Image

def get_captcha_text(location, size):
    im = Image.open('screenshot.png')
    left = location['x']
    top = location['y']
    right = location['x'] + size['width']
    bottom = location['y'] + size['height']
    im = im.crop((left, top, right, bottom))  # defines crop points
    im.save('screenshot.png')
    return True

element = driver.find_element_by_id('captcha_image')  # div id or the captcha container id
location = element.location
# print(location)
size = element.size
driver.save_screenshot('screenshot.png')
get_captcha_text(location, size)

Error while scraping image with beautifulsoup

The original code is here: https://github.com/amitabhadey/Web-Scraping-Images-using-Python-via-BeautifulSoup-/blob/master/code.py
So I am trying to adapt a Python script to collect pictures from a website to get better at web scraping.
I tried to get images from "https://500px.com/editors".
The first error was:
The code that caused this warning is on line 12 of the file /Bureau/scrapper.py. To get rid of this warning, pass the additional argument
'features="lxml"' to the BeautifulSoup constructor.
So I did :
soup = BeautifulSoup(plain_text, features="lxml")
I also adapted the class to reflect the tag on 500px.
But now the script stopped running and nothing happened.
In the end it looks like this:
import requests
from bs4 import BeautifulSoup
import urllib.request
import random

url = "https://500px.com/editors"

source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="lxml")

for link in soup.find_all("a", {"class": "photo_link "}):
    href = link.get('href')
    print(href)
    img_name = random.randrange(1, 500)
    full_name = str(img_name) + ".jpg"
    urllib.request.urlretrieve(href, full_name)
    print("loop break")
What did I do wrong?
Actually the website is loaded via JavaScript, using an XHR request to the following API,
so you can reach it directly via the API.
Note that you can increase the rpp=50 parameter to any number you want in order to get more than 50 results.
import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()

for item in r['photos']:
    print(item['url'])
You can also access the image URL itself in order to write it directly:
import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()

for item in r['photos']:
    print(item['image_url'][-1])
Note that the image_url key holds different image sizes, so you can choose your preferred one and save it. Here I've taken the big one.
Saving directly:
import requests

with requests.Session() as req:
    r = req.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
    result = []
    for item in r['photos']:
        print(f"Downloading {item['name']}")
        save = req.get(item['image_url'][-1])
        name = save.headers.get("Content-Disposition")[9:]
        with open(name, 'wb') as f:
            f.write(save.content)
Looking at the page you're trying to scrape I noticed something. The data doesn't appear to load until a few moments after the page finishes loading. This tells me that they're using a JS framework to load the images after page load.
Your scraper will not work with this page because it does not run JS on the pages it pulls. Running your script and printing out what plain_text contains proves this:
<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>
If you look at the href attribute on that tag you'll see it's actually a templating tag used by JS UI frameworks.
Your options now are either to see what APIs they're calling to get this data (check the network tab in your browser's inspector; if you're lucky the calls may not require authentication) or to use a tool that runs JS on pages. One tool I've seen recommended for this is Selenium, though I've never used it so I'm not fully aware of its capabilities; I imagine the tooling around it would noticeably increase the complexity of what you're trying to do.
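If you do go the Selenium route, a rough sketch might look like the following. It assumes chromedriver is on your PATH and simply lists every img src after a crude fixed wait; the real page structure and timing would need checking:

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://500px.com/editors")
time.sleep(5)  # crude wait for the JS framework to render the gallery

soup = BeautifulSoup(driver.page_source, "lxml")
for img in soup.find_all("img"):
    src = img.get("src")
    if src:
        print(src)

driver.quit()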

Python Selenium to script Medium rss feed

I'm trying to scrape some blogs with Python and Selenium.
However, the page source is limited to a few articles, so I need to scroll down to load the AJAX content.
Is there a way to get the full source in one call with Selenium?
The code would be something like:
# url and page source generating
url = url_constructor_medium_news(blog_name)
content = social_data_tools.selenium_page_source_generator(driver, url)
try:
# construct soup
soup = BeautifulSoup(content, "html.parser").rss.channel
# break condition
divs = soup.find_all('item')
except AttributeError as e:
print(e.__cause__)
# friendly
time.sleep(3 + random.randint(1, 5))
I don't believe there is a way to populate the driver with unloaded data that would otherwise be obtained by scrolling.
An alternative solution for getting the data would be driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
I've previously used this as a reference.
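A minimal sketch of scrolling until the page height stops growing; the blog URL and the 2-second pause below are illustrative, not values from the question:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://medium.com/some-blog")  # hypothetical blog URL

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the AJAX-loaded articles time to appear
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new loaded, stop scrolling
    last_height = new_height

content = driver.page_source  # now contains everything loaded so far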
I hope this helps!

Downloading Webpage as is (python 3.x.x)

I'm trying to take some links from a text file and download them onto my computer. However, I would like the downloaded pages to be exactly the same as they are in the browser. The wiki pages I downloaded are not the same: they don't display some of the pictures, and they are mostly just text when I open them.
How can I achieve what I want? I saw some things with Scrapy and Beautiful Soup, but I'm not experienced with them.
My code:
import urllib.request

links = []
fr = open('wiki_linkovi', 'r')
fw1 = open('imena_elemenata.txt', 'w')
link = fr.readlines()
j = 0
for i in link:
    base = 'https://en.wikipedia.org/wiki/'
    start = i.find(base) + len(base)
    end = i.find('\n', start)
    ime = i[start:end]
    fw1.write(ime + '\n')
    response = urllib.request.urlopen(i)  # save starts here-----
    webContent = response.read()
    f = open(ime + '.html', 'wb')
    f.write(webContent)
    f.close()
    j = j + 1
    print(str(j) + '. link\n')
So yeah, in short, I'd like to download the webpage completely.
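A sketch of one common approach, for illustration only: besides saving the HTML, also download the images each page references and point their src attributes at the local copies (the output names below are made up, and page assets other than images, such as CSS, are not handled):

import os
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
html = urllib.request.urlopen(page_url).read()
soup = BeautifulSoup(html, 'html.parser')

os.makedirs('files', exist_ok=True)
for n, img in enumerate(soup.find_all('img')):
    src = img.get('src')
    if not src:
        continue
    img_url = urljoin(page_url, src)  # resolves relative and protocol-relative URLs
    local_name = 'files/%d_%s' % (n, os.path.basename(img_url))
    try:
        urllib.request.urlretrieve(img_url, local_name)
        img['src'] = local_name  # point the tag at the local copy
    except Exception as e:
        print('skipped', img_url, e)

with open('wiki_page.html', 'w', encoding='utf-8') as f:
    f.write(str(soup))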

How to Download PDFs from Scraped Links [Python]?

I'm working on making a PDF Web Scraper in Python. Essentially, I'm trying to scrape all of the lecture notes from one of my courses, which are in the form of PDFs. I want to enter a url, and then get the PDFs and save them in a directory in my laptop. I've looked at several tutorials, but I'm not entirely sure how to go about doing this. None of the questions on StackOverflow seem to be helping me either.
Here is what I have so far:
import requests
from bs4 import BeautifulSoup
import shutil

bs = BeautifulSoup

url = input("Enter the URL you want to scrape from: ")
print("")

suffix = ".pdf"
link_list = []

def getPDFs():
    # Gets URL from user to scrape
    response = requests.get(url, stream=True)
    soup = bs(response.text)

    #for link in soup.find_all('a'):   # Finds all links
    #    if suffix in str(link):       # If the link ends in .pdf
    #        link_list.append(link.get('href'))
    #print(link_list)

    with open('CS112.Lecture.09.pdf', 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response
    print("PDF Saved")

getPDFs()
Originally, I had gotten all of the links to the PDFs, but did not know how to download them; the code for that is now commented out.
Now I've gotten to the point where I'm trying to download just one PDF; and a PDF does get downloaded, but it's a 0KB file.
If it's of any use, I'm using Python 3.4.2
If this is something that does not require being logged in, you can use urlretrieve():
from urllib.request import urlretrieve

for link in link_list:
    urlretrieve(link)
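If you want each file saved under its own name rather than a temporary file, a slightly fuller sketch (this assumes link_list holds absolute URLs; relative links would need to be joined with the page URL first):

import os
from urllib.request import urlretrieve

for link in link_list:
    filename = os.path.basename(link)  # e.g. CS112.Lecture.09.pdf
    urlretrieve(link, filename)        # passing a filename saves the PDF locally
    print("Saved", filename)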
