I am trying to download the show images from this page with BeautifulSoup.
When I run the code below, the only image that downloads is the spinning loading icon.
When I check the network requests tab for the page I can see requests for all the other images, so I assume they should be downloadable as well. I am not sure why they don't download, since they are contained within img tags in the page's HTML.
import re
import requests
from bs4 import BeautifulSoup
site = 'https://www.tvnz.co.nz/categories/sci-fi-and-fantasy'
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
image_tags = soup.find_all('img')
urls = [img['src'] for img in image_tags]
for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        print("Regular expression didn't match with the url: {}".format(url))
        continue
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
print("Download complete, downloaded images can be found in current directory!")
You can try the API they seem to be using to populate the page:
api_url = 'https://apis-edge-prod.tech.tvnz.co.nz/api/v1/web/play/page/categories/sci-fi-and-fantasy'
r = requests.get(api_url)
try:
    embVals = r.json()['_embedded'].values()
except Exception as e:
    embVals = []
    print('failed to get embedded items\n', str(e))

urls = [
    img for images in [[
        v['src'] for k, v in ev.items()
        if k is not None and 'image' in k.lower()
        and v is not None and 'src' in v
    ] for ev in embVals]
    for img in images
]

# for url in urls: # should work the same as in your code
(The images seem to be in nested dictionaries with keys like 'portraitTileImage', 'image', 'tileImage' and 'coverImage'. You can also use explicit for-loops to go through embVals and extract other data if you want to include more in the filename, metadata, etc.; see the sketch below.)
I don't know if it will get you ALL the images on the page, but when I tried it, urls had 297 links.
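For example, here is a rough loop-based version of the same extraction (a sketch; the key names and the '_embedded' structure are taken from the API response above and may change):

import requests

api_url = 'https://apis-edge-prod.tech.tvnz.co.nz/api/v1/web/play/page/categories/sci-fi-and-fantasy'
embVals = requests.get(api_url).json().get('_embedded', {}).values()

urls = []
for ev in embVals:
    for key, val in ev.items():
        # keys like 'portraitTileImage', 'tileImage' and 'coverImage' hold image dicts
        if key and 'image' in key.lower() and isinstance(val, dict) and 'src' in val:
            urls.append(val['src'])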
Related
So I am working on a project that requires me to auto-download images from a basic web search. I've created this script, which should download all the images it finds, but the script isn't working as intended. I have consulted various forums and tutorials, but none of them has the fix.
from bs4 import BeautifulSoup
import requests
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
image = soup.find_all("img")
for img in image:
    name = img['alt']
    link = img['src']
    with open(name.replace(" ", "-").replace("/", "") + ".jpg", "wb") as f:
        im = requests.get(link)
        f.write(im.content)
If I print the img links, it shows all the images that could be downloaded, but for some reason it downloads 1-2 images and then stops. Not to mention the images that do download are from further down the list of links.
To solve this issue, you can append the links to a list first, then iterate over that list of URLs to download the images:
list_of_urls = []
for img in image:
    link = img["src"]
    list_of_urls.append(link)

# enumerate() avoids list_of_urls.index(link), which returns the first
# occurrence and so repeats filenames when the same link appears twice
for index, link in enumerate(list_of_urls):
    with open(str(index) + ".jpg", "wb") as f:
        f.write(requests.get(link).content)
I'm scraping images from websites. I'm locating them by src, but what if they don't have a src attribute? How should I get them? Right now I'm using this code:
import os
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
def url_to_page_name(url):
    parsed = urlparse(str(url))
    return parsed.netloc

def get_images_job(page_url):
    """Request the given page and extract its images"""
    directory_name = url_to_page_name(page_url)
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    img_tags = soup.find_all('img')
    urls = [img['src'] for img in img_tags]

    if not os.path.exists(directory_name):
        os.makedirs(directory_name)

    for url in urls:
        file_name = re.search(r'/([\w_-]+[.](jpg|jpeg|gif|png|bmp|webp|svg))$', url)
        if file_name:
            file_name = file_name.group(1)
            with open(os.path.join(directory_name, file_name), 'wb') as f:
                if 'http' not in url:
                    # sometimes an image source can be relative;
                    # if it is, prepend the base url, which also
                    # happens to be the page_url variable here
                    url = '{}{}'.format(page_url, url)
                response = requests.get(url)
                f.write(response.content)
get_images_job("https://pixabay.com/")
And what if:
- they are used as a background: background="img/tile.jpg"
- they are located inside CSS
- they are masked as base64
A sketch of handling these cases is below.
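For illustration, a minimal sketch under these assumptions: backgrounds appear either as legacy background attributes or as url(...) values in inline style attributes, and base64 images come as data: URIs. External CSS files would need to be fetched and parsed separately.

import base64
import re

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com/')  # hypothetical page
soup = BeautifulSoup(response.text, 'html.parser')
urls = []

# 1. legacy background="..." attributes (e.g. on <body> or <td>)
for tag in soup.find_all(background=True):
    urls.append(tag['background'])

# 2. url(...) references inside inline style attributes
for tag in soup.find_all(style=True):
    urls += re.findall(r'url\([\'"]?([^\'")]+)[\'"]?\)', tag['style'])

# 3. base64-encoded data URIs can be decoded and saved directly
for i, u in enumerate(urls):
    match = re.match(r'data:image/(\w+);base64,(.+)', u)
    if match:
        with open('inline_{}.{}'.format(i, match.group(1)), 'wb') as f:
            f.write(base64.b64decode(match.group(2)))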
I am trying to download some images from NHTSA Crash Viewer (CIREN cases). An example of the case https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?xsl=main.xsl&CaseID=99817
If I try to download a Front crash image, no file is downloaded. I am using the beautifulsoup4 and requests libraries. This code works for other websites.
The link of images are in the following format: https://crashviewer.nhtsa.dot.gov/nass-CIREN/GetBinary.aspx?Image&ImageID=555004572&CaseID=555003071&Version=0
I have also tried previous answers from SO, but no solution works. Error obtained:
No response from server
Code used for web scraping:
from bs4 import BeautifulSoup
import requests as rq
import os

r2 = rq.get("https://crashviewer.nhtsa.dot.gov/nass-CIREN/GetBinary.aspx?Image&ImageID=555004572&CaseID=555003071&Version=0")
soup2 = BeautifulSoup(r2.text, "html.parser")

links = []
x = soup2.select('img[src^="https://crashviewer.nhtsa.dot.gov"]')
for img in x:
    links.append(img['src'])

os.mkdir('ciren_photos')

for index, img_link in enumerate(links):
    if index < 200:
        img_data = rq.get(img_link).content
        with open("ciren_photos\\" + str(index + 1) + '.jpg', 'wb+') as f:
            f.write(img_data)
    else:
        break
This is a task that would require Selenium, but luckily there is a shortcut. On the top of the page there is a "Text and Images Only" link that goes to a page like this one: https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?ViewText&CaseID=99817&xsl=textonly.xsl&websrc=true that contains all the images and text content in one page. You can select that link with soup.find('a', text='Text and Images Only').
That link and the image links are relative (links within the same site are usually relative), so you'll have to use urljoin() to get the full URLs.
from bs4 import BeautifulSoup
import requests as rq
from urllib.parse import urljoin
url = 'https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?xsl=main.xsl&CaseID=99817'
with rq.session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

    url = urljoin(url, soup.find('a', text='Text and Images Only')['href'])
    r = s.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

    links = [urljoin(url, i['src']) for i in soup.select('img[src^="GetBinary.aspx"]')]
    for link in links:
        content = s.get(link).content
        # write `content` to file
So, the site doesn't return valid pictures unless the request carries valid cookies. There are two ways to get the cookies: reuse the cookies from a previous request, or use a Session object. It's best to use a Session, because it also handles the TCP connection and other parameters.
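For completeness, a minimal sketch of the cookie-reuse variant, passing cookies from an earlier response into a plain request (the URLs are taken from the question; whether the server accepts cookies outside a Session is an assumption):

import requests as rq

case_url = 'https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?xsl=main.xsl&CaseID=99817'
img_url = 'https://crashviewer.nhtsa.dot.gov/nass-CIREN/GetBinary.aspx?Image&ImageID=555004572&CaseID=555003071&Version=0'

# hit the case page first so the server issues its cookies
r = rq.get(case_url)

# then pass those cookies along with the image request
img = rq.get(img_url, cookies=r.cookies)
with open('555004572.jpg', 'wb') as f:
    f.write(img.content)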
I'm trying to write a script in Python that downloads the image on this site that is updated every day:
https://apod.nasa.gov/apod/astropix.html
I was trying to follow the top comment from this post:
How to extract and download all images from a website using beautifulSoup?
So, this is what my code currently looks like:
import re
import requests
from bs4 import BeautifulSoup
site = 'https://apod.nasa.gov/apod/astropix.html'
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')
urls = [img['src'] for img in img_tags]
for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative;
            # if it is, provide the base url, which also happens
            # to be the site variable atm.
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
However, when I run my program I get this error:
Traceback on line 17
    with open(filename.group(1), 'wb') as f:
AttributeError: 'NoneType' object has no attribute 'group'
So it looks like there is some problem with my Regex perhaps?
The regex group() you are after is 0, not 1; it contains the image path. Also, when the image source path is relative, the URL formatting is done incorrectly. I used the built-in urllib module to parse the site URL:
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
site = 'https://apod.nasa.gov/apod/astropix.html'
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')
urls = [img['src'] for img in img_tags]
for url in urls:
    filename = re.search(r'([\w_-]+[.](jpg|gif|png))$', url)
    filename = re.sub(r'\d{4,}\.', '.', filename.group(0))
    with open(filename, 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative;
            # if it is, build an absolute url from the
            # site's scheme and hostname
            hostname = urlparse(site).hostname
            scheme = urlparse(site).scheme
            url = '{}://{}/{}'.format(scheme, hostname, url)
        # for the full resolution image the last four digits need to be stripped
        url = re.sub(r'\d{4,}\.', '.', url)
        print('Fetching image from {} to {}'.format(url, filename))
        response = requests.get(url)
        f.write(response.content)
Outputs:
Fetching image from https://apod.nasa.gov/image/1807/FermiFinals.jpg to FermiFinals.jpg
And the image is saved as FermiFinals.jpg
I think the issue is the site variable. When it's all said and done, it's trying to append the image path to site, i.e. to https://apod.nasa.gov/apod/astropix.html. If you simply remove the astropix.html part, it works fine. What I have below is just a small modification of what you have; copy/paste and ship it!
import re
import requests
from bs4 import BeautifulSoup
site = "https://apod.nasa.gov/apod/astropix.html"
site_path_only = site.replace("astropix.html","")
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')
urls = [img['src'] for img in img_tags]
for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative;
            # if it is, provide the base url, which is the
            # site_path_only variable here
            url = '{}{}'.format(site_path_only, url)
        response = requests.get(url)
        f.write(response.content)
Note: if it downloads the image but the file is corrupt and only about 1 KB in size, you are probably getting a 404 for some reason. Just open the 'image' in Notepad and read the HTML it's giving back.
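If you want to catch that case in code rather than in Notepad, here is a small sketch (the save_image helper is illustrative, not part of the answer above): check the status code and Content-Type before writing.

import requests

def save_image(url, filename):
    # skip non-image responses such as HTML 404 pages
    response = requests.get(url)
    content_type = response.headers.get('Content-Type', '')
    if response.status_code == 200 and content_type.startswith('image/'):
        with open(filename, 'wb') as f:
            f.write(response.content)
        return True
    print('Skipped {}: HTTP {} ({})'.format(url, response.status_code, content_type))
    return False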
I've tried to download images from a webpage; what am I missing here, please?
import urllib
from urllib.request import urlopen, Request
import requests
from bs4 import BeautifulSoup
import os
urlpage ='https://www.google.com/search?site=imghp&tbm=isch&source=hp&biw=1414&bih=709&q=little+cofee'
header = {'User-Agent': 'Mozilla/5.0'}
page = urlopen(Request(urlpage,headers=header))
soup = BeautifulSoup(page)
images = soup.find_all("div", {"class":"thumb-pic"})
for image in images:
    imgUrl = image.a['href'].split("imgurl=")[1]
    urllib.request.urlretrieve(imgUrl, os.path.basename(imgUrl))
It's tricky. Sometimes sites use short URLs like "images/img.jpg", "/images/img.jpg", or "../images/img.jpg". But the Google page you are requesting has no HTML tags at all; it contains just JavaScript.
I made a quick and dirty example, just to show you how it might work, in Python 2.7. Alternatively, you can simply save the page opened in your browser and all the images will be saved neatly in a folder.
#!/usr/bin/python
import urllib

url = 'http://www.blogto.com/cafes/little-nickys-coffee-toronto'
ext = ['.jpg', '.png', '.gif']  # image types to download

response = urllib.urlopen(url)
html = response.read()

IMGs = []
L = html.split('src="')
for item in L:
    item = item[:item.find('"')]
    item = item.strip()
    if item.find('http') == -1:
        item = url[:url.find('/', 10)] + item
    for e in ext:
        if item.find(e) != -1:
            if item not in IMGs:
                IMGs.append(item)

n = len(IMGs)
print 'Found', n, 'images'

i = 1
for img in IMGs:
    ext = img[img.rfind('.'):]
    filename = '0' * (len(str(n)) - len(str(i))) + str(i)
    i += 1
    try:
        print img
        f = open(filename + ext, 'wb')
        f.write(urllib.urlopen(img).read())
        f.close()
    except:
        print "Unpredictable error:", img
print 'Done!'
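For reference, here is a rough Python 3 port of the same src-splitting approach (a sketch under the same assumptions; Python 2's urllib.urlopen becomes urllib.request.urlopen):

#!/usr/bin/python3
from urllib.request import urlopen

url = 'http://www.blogto.com/cafes/little-nickys-coffee-toronto'
ext = ('.jpg', '.png', '.gif')  # image types to download

html = urlopen(url).read().decode('utf-8', errors='ignore')

imgs = []
for item in html.split('src="')[1:]:
    item = item[:item.find('"')].strip()
    if not item.startswith('http'):
        item = url[:url.find('/', 10)] + item  # same crude base-url join as above
    if item.endswith(ext) and item not in imgs:
        imgs.append(item)

n = len(imgs)
print('Found', n, 'images')
for i, img in enumerate(imgs, 1):
    filename = str(i).zfill(len(str(n))) + img[img.rfind('.'):]
    try:
        print(img)
        with open(filename, 'wb') as f:
            f.write(urlopen(img).read())
    except OSError as e:
        print('Unpredictable error:', img, e)
print('Done!')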