I am working on a project that requires me to auto-download images from a basic web search. I've written this script, which should download all the images it finds, but it isn't working as intended. I have consulted various forums and tutorials, but none seem to have the fix.
from bs4 import BeautifulSoup
import requests

# url is the search results page (defined earlier, not shown here)
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
image = soup.find_all("img")
for img in image:
    name = img['alt']
    link = img['src']
    with open(name.replace(" ", "-").replace("/", "") + ".jpg", "wb") as f:
        im = requests.get(link)
        f.write(im.content)
If I print the img links, it shows all the images that can be downloaded, but for some reason the script downloads 1-2 images and then stops. On top of that, the images it does download come from further down the list of links.
To solve this issue, you can append the links to a list, then download the images by iterating over that list of URLs:
list_of_urls = []
for img in image:
    link = img["src"]
    list_of_urls.append(link)

# use the position in the list as the file name
for index, link in enumerate(list_of_urls):
    with open(str(index) + ".jpg", "wb") as f:
        f.write(requests.get(link).content)
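If the loop still downloads only a couple of images after that change, one likely culprit (an assumption, since the original url and page aren't shown) is an img tag with a missing src attribute or a relative/protocol-relative one. A minimal hedged sketch that guards against both:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://example.com/search?q=cats"  # placeholder; use the search page from the question
soup = BeautifulSoup(requests.get(url).text, "html.parser")

list_of_urls = []
for img in soup.find_all("img"):
    link = img.get("src")        # .get() returns None instead of raising KeyError
    if not link:
        continue                 # skip <img> tags without a usable src
    list_of_urls.append(urljoin(url, link))  # resolves relative and //-style links

for index, link in enumerate(list_of_urls):
    with open(str(index) + ".jpg", "wb") as f:
        f.write(requests.get(link).content)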
I went through similar topics here but did not find anything helpful for my case.
I managed to get all the PDFs (for personal learning purposes) into a local folder, but I cannot open them. They also all have the same size (310 kB). Perhaps you can find a mistake in my code. Thanks.
import os
import requests
from bs4 import BeautifulSoup
# define the URL to scrape
url = 'https://www.apotheken-umschau.de/medikamente/arzneimittellisten/medikamente_i.html'
# define the folder to save the PDFs to
save_path = r'C:\PDFs'
# create the folder if it doesn't exist
if not os.path.exists(save_path):
    os.makedirs(save_path)
# make a request to the URL
response = requests.get(url)
# parse the HTML content of the page
soup = BeautifulSoup(response.content, 'html.parser')
# find all links on the page that contain 'href="/medikamente/beipackzettel/"'
links = soup.find_all('a', href=lambda href: href and '/medikamente/beipackzettel/' in href)
# loop through each link and download the PDF
for link in links:
    href = link['href']
    file_name = href.split('?')[0].split('/')[-1] + '.pdf'
    pdf_url = 'https://www.apotheken-umschau.de' + href + '&file=pdf'
    response = requests.get(pdf_url)
    with open(os.path.join(save_path, file_name), 'wb') as f:
        f.write(response.content)
        f.close()
    print(f'Downloaded {file_name} to {save_path}')
There are some issues here:
Select your elements from the list more specifically, using CSS selectors:
soup.select('article li a[href*="/medikamente/beipackzettel/"]')
Check the responses you get from your requests to see whether the expected elements are available and what the behavior looks like.
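For example, a minimal sketch of such a check, just printing the status code and how many links the selector finds:

import requests
from bs4 import BeautifulSoup

url = 'https://www.apotheken-umschau.de/medikamente/arzneimittellisten/medikamente_i.html'
r = requests.get(url)
print(r.status_code)  # expect 200

soup = BeautifulSoup(r.content, 'html.parser')
print(len(soup.select('article li a[href*="/medikamente/beipackzettel/"]')))  # number of candidate links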
You will notice that you have to iterate through more levels than you did:
for link in soup.select('article li a[href*="/medikamente/beipackzettel/"]'):
    soup_detail_page = BeautifulSoup(requests.get('https://www.apotheken-umschau.de' + link.get('href')).content)
    for file in soup_detail_page.select('a:-soup-contains("Original Beipackzettel")'):
        soup_file_page = BeautifulSoup(requests.get('https://www.apotheken-umschau.de' + file.get('href')).content)
You will notice that the PDF is displayed in an iframe and you have to scrape it via an external URL:
pdf_url = soup_file_page.iframe.get('src').split('?file=')[-1]
You will notice that not only Beipackzettel (package leaflets) are available for download.
Example
import os
import requests
from bs4 import BeautifulSoup
# define the URL to scrape
url = 'https://www.apotheken-umschau.de/medikamente/arzneimittellisten/medikamente_i.html'
# define the folder to save the PDFs to
save_path = r'C:\PDFs'
# create the folder if it doesn't exist
if not os.path.exists(save_path):
    os.makedirs(save_path)
# parse the HTML content of the page
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# loop through each link and download the PDF
for link in soup.select('article li a[href*="/medikamente/beipackzettel/"]'):
    soup_detail_page = BeautifulSoup(requests.get('https://www.apotheken-umschau.de' + link.get('href')).content, 'html.parser')
    for file in soup_detail_page.select('a:-soup-contains("Original Beipackzettel")'):
        soup_file_page = BeautifulSoup(requests.get('https://www.apotheken-umschau.de' + file.get('href')).content, 'html.parser')
        pdf_url = soup_file_page.iframe.get('src').split('?file=')[-1]
        file_name = file.get('href').split('.html')[0].split('/')[-1] + '.pdf'
        with open(os.path.join(save_path, file_name), 'wb') as f:
            f.write(requests.get(pdf_url).content)
        print(f'Downloaded {file_name} to {save_path}')
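As a hedged refinement (not part of the example above): if a detail page happens to have no iframe, soup_file_page.iframe is None and the .get('src') call raises an AttributeError. A small helper, with a name of my own choosing, that returns None instead:

def extract_pdf_url(soup_file_page):
    # Return the external PDF URL from a detail page, or None if the page
    # has no iframe or its src carries no ?file= parameter (assumed layout).
    iframe = soup_file_page.iframe
    if iframe is None:
        return None
    src = iframe.get('src') or ''
    return src.split('?file=')[-1] if '?file=' in src else None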
I am trying to download the show images from this page with BeautifulSoup.
When I run the code below, the only image that downloads is the spinning loading icon.
When I check the network requests tab for the page, I can see requests for all the other images, so I assume they should be downloadable as well. I am not sure why they don't download, since they are contained within img tags in the page's HTML.
import re
import requests
from bs4 import BeautifulSoup
site = 'https://www.tvnz.co.nz/categories/sci-fi-and-fantasy'
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
image_tags = soup.find_all('img')
urls = [img['src'] for img in image_tags]
for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        print("Regular expression didn't match with the url: {}".format(url))
        continue
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
print("Download complete, downloaded images can be found in current directory!")
You can try the API they seem to be using to populate the page:
api_url = 'https://apis-edge-prod.tech.tvnz.co.nz/api/v1/web/play/page/categories/sci-fi-and-fantasy'
r = requests.get(api_url)
try:
    embVals = r.json()['_embedded'].values()
except Exception as e:
    embVals = []
    print('failed to get embedded items\n', str(e))

urls = [img for images in [[
    v['src'] for k, v in ev.items() if
    k is not None and 'image' in k.lower()
    and v is not None and 'src' in v
] for ev in embVals] for img in images]

# for url in urls:  # should work the same as before
(Images seem to be in nested dictionaries with keys like 'portraitTileImage', 'image', 'tileImage', 'coverImage'. You can also use for-loops to go through embVals and extract other data if you want to include more in the filename/metadata/etc.; see the for-loop sketch below.)
I don't know if it will get you ALL the images on the page, but when I tried it, urls had 297 links.
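For reference, here is a plain for-loop version of that comprehension, assuming the _embedded values keep the same key/'src' layout (the variable names are mine):

import requests

api_url = 'https://apis-edge-prod.tech.tvnz.co.nz/api/v1/web/play/page/categories/sci-fi-and-fantasy'
embVals = requests.get(api_url).json().get('_embedded', {}).values()

urls = []
for ev in embVals:
    for k, v in ev.items():
        # same filter as the comprehension: image-like keys that carry a 'src'
        if k is not None and 'image' in k.lower() and v is not None and 'src' in v:
            urls.append(v['src'])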
I'm trying to create a program that scrapes a site for images using bs4. The site contains two types of images, low quality ones and high quality ones. The high quality files are named the same as their low quality versions, but contain the word "website" at the end, before the .png. I'd like to only download the "website" files. Here's what I tried.
from bs4 import BeautifulSoup
import requests
URL = "https://www.ssbwiki.com/Category:Head_icons_(SSBU)"
getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')
images = soup.find_all('img')
resolvedURLs = []
for image in images:
    src = image.get('src')
    resolvedURLs.append(requests.compat.urljoin(URL, src))
for image in resolvedURLs:
    if not image.endswith("Website.png"):
        continue
    if image.endswith("Website.png"):
        webs = requests.get(image)
        open('scraped_images/' + image.split('/')[-1], 'wb').write(webs.content)
I don't get any error messages, but no files download. Any tips?
You only check whether a URL ends with "Website.png" after you have already skipped every URL that doesn't, so the second check is redundant. Better to drop the negative check and keep just the positive one:
from bs4 import BeautifulSoup
import requests
URL = "https://www.ssbwiki.com/Category:Head_icons_(SSBU)"
getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')
images = soup.find_all('img')
resolvedURLs = []
for image in images:
    src = image.get('src')
    resolvedURLs.append(requests.compat.urljoin(URL, src))
for image in resolvedURLs:
    if image.endswith("Website.png"):
        webs = requests.get(image)
        open('scraped_images/' + image.split('/')[-1], 'wb').write(webs.content)
Actually, using list comprehensions you can make your code less procedural and prevent mistakes of this sort in the future:
from bs4 import BeautifulSoup
import requests
from requests.compat import urljoin
URL = "https://www.ssbwiki.com/Category:Head_icons_(SSBU)"
getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')
image_urls = [urljoin(URL,image.get('src')) for image in soup.find_all('img')]
# let's make this one a generator so we don't keep too many downloaded
# images in memory
images = (requests.get(url) for url in image_urls if url.endswith("Website.png"))
for image in images:
    # use the context manager so the files are closed after write;
    # image here is a Response object, so take the file name from image.url
    with open('scraped_images/' + image.url.split('/')[-1], 'wb') as f:
        f.write(image.content)
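One more hedged check, since no error was reported in the question: both versions assume a scraped_images/ folder already exists next to the script, because open() cannot create directories on its own. Creating it up front costs one line:

import os

os.makedirs('scraped_images', exist_ok=True)  # no-op if the folder is already there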
I am trying to download some images from the NHTSA Crash Viewer (CIREN cases). An example case: https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?xsl=main.xsl&CaseID=99817
If I try to download a Front crash image, no file is downloaded. I am using the beautifulsoup4 and requests libraries. This code works for other websites.
The image links are in the following format: https://crashviewer.nhtsa.dot.gov/nass-CIREN/GetBinary.aspx?Image&ImageID=555004572&CaseID=555003071&Version=0
I have also tried previous answers from SO, but no solution works. Error obtained:
No response from server
Code used for web scraping
from bs4 import *
import requests as rq
import os
r2 = rq.get("https://crashviewer.nhtsa.dot.gov/nass-CIREN/GetBinary.aspx?Image&ImageID=555004572&CaseID=555003071&Version=0")
soup2 = BeautifulSoup(r2.text, "html.parser")
links = []
x = soup2.select('img[src^="https://crashviewer.nhtsa.dot.gov"]')
for img in x:
    links.append(img['src'])
os.mkdir('ciren_photos')
i=1
for index, img_link in enumerate(links):
    if i<=200:
        img_data = rq.get(img_link).content
        with open("ciren_photos\\"+str(index+1)+'.jpg', 'wb+') as f:
            f.write(img_data)
        i += 1
    else:
        f.close()
        break
This is a task that would require Selenium, but luckily there is a shortcut. On the top of the page there is a "Text and Images Only" link that goes to a page like this one: https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?ViewText&CaseID=99817&xsl=textonly.xsl&websrc=true that contains all the images and text content in one page. You can select that link with soup.find('a', text='Text and Images Only').
That link and the image links are relative (links within the same site are usually relative), so you'll have to use urljoin() to get the full URLs.
from bs4 import BeautifulSoup
import requests as rq
from urllib.parse import urljoin
url = 'https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?xsl=main.xsl&CaseID=99817'
with rq.session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    url = urljoin(url, soup.find('a', text='Text and Images Only')['href'])
    r = s.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    links = [urljoin(url, i['src']) for i in soup.select('img[src^="GetBinary.aspx"]')]
    for link in links:
        content = s.get(link).content
        # write `content` to file
So, the site doesn't return valid pictures unless the request has valid cookies. There are two ways to get the cookies: either reuse cookies from a previous request or use a Session object. It's best to use a Session, because it also reuses the TCP connection and keeps other parameters across requests.
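A minimal sketch of that difference, reusing the image URL from the question; the exact responses depend on the site, so treat this as an illustration rather than a guarantee:

import requests as rq

case_url = 'https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?xsl=main.xsl&CaseID=99817'
img_url = 'https://crashviewer.nhtsa.dot.gov/nass-CIREN/GetBinary.aspx?Image&ImageID=555004572&CaseID=555003071&Version=0'

bare = rq.get(img_url)          # plain request, no cookies attached

with rq.Session() as s:
    s.get(case_url)             # visiting the case page stores the cookies
    img = s.get(img_url)        # same session, so the cookies are sent along

print(len(bare.content), len(img.content))  # compare what each request returned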
I'm trying to download some images from TripAdvisor using urllib, but all I get for the URL in the src field of the HTML is this.
I've done some research and found out that those are lazy-loaded images... Is there any way to download them?
You can extract a list of images from the JavaScript using the Beautiful Soup and json modules, then iterate over the list and retrieve the images you are interested in.
EDIT:
The problem was that the images have the same name, so they were getting overwritten. Fetching the first three images is trivial, but references to the other images in the carousel are not loaded until the carousel is opened, so that's trickier. For some images you can find a higher-resolution version by substituting "photo-s" in the path with "photo-w", but figuring out which ones requires diving deeper into the JavaScript logic.
import re, json
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup as bs

def img_data_filter(tag):
    # match the <script> block that defines the lazy-loaded image list
    if tag.name == "script" and tag.text.strip().startswith("var lazyImgs"):
        return True
    return False

response = urlopen("https://www.tripadvisor.it/Restaurant_Review-g3174493-d3164947-Reviews-Le_Ciaspole-Tret_Fondo_Province_of_Trento_Trentino_Alto_Adige.html")
soup = bs(response.read(), 'html.parser')
img_data = soup.find(img_data_filter)

# strip the surrounding JavaScript so only the JSON array remains
js = img_data.text
js = js.replace("var lazyImgs = ", '')
js = re.sub(r";\s+var lazyHtml.+", '', js, flags=re.DOTALL)
imgs = json.loads(js)

suffix = 1
for img in imgs:
    img_url = img["data"]
    if "media/photo-s" not in img_url:
        continue
    # add a counter so images with the same name don't overwrite each other
    img_name = img_url[img_url.rfind('/')+1:-4]
    img_name = "%s-%03d.jpg" % (img_name, suffix)
    suffix += 1
    urlretrieve(img_url, img_name)
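And, as mentioned in the edit above, a hedged helper for trying the higher-resolution variant; whether the "photo-w" file actually exists for a given image is not guaranteed, so keep the original URL as a fallback:

def hi_res(img_url):
    # Best-effort swap of the size segment in the path; fall back to the
    # original URL if the photo-w variant turns out not to exist.
    return img_url.replace("/media/photo-s/", "/media/photo-w/")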