I'm trying to write a script in Python that downloads the image from this site, which is updated every day:
https://apod.nasa.gov/apod/astropix.html
I was trying to follow the top comment from this post:
How to extract and download all images from a website using beautifulSoup?
So, this is what my code currently looks like:
import re
import requests
from bs4 import BeautifulSoup

site = 'https://apod.nasa.gov/apod/astropix.html'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]
for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative
            # if it is provide the base url which also happens
            # to be the site variable atm.
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
However, when I run my program I get this error:
Traceback on line 17:
    with open(filename.group(1), 'wb') as f:
AttributeError: 'NoneType' object has no attribute 'group'
So it looks like there is some problem with my regex, perhaps?
The regex group you are looking for is 0, not 1; it contains the image path. Also, when the image source path is relative, the URL formatting is done incorrectly. I used the built-in urllib module to parse the site URL:
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

site = 'https://apod.nasa.gov/apod/astropix.html'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]
for url in urls:
    filename = re.search(r'([\w_-]+[.](jpg|gif|png))$', url)
    filename = re.sub(r'\d{4,}\.', '.', filename.group(0))
    with open(filename, 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative
            # if it is, provide the base url, which also happens
            # to be the site variable atm.
            hostname = urlparse(site).hostname
            scheme = urlparse(site).scheme
            url = '{}://{}/{}'.format(scheme, hostname, url)
        # for the full-resolution image the last four digits need to be stripped
        url = re.sub(r'\d{4,}\.', '.', url)
        print('Fetching image from {} to {}'.format(url, filename))
        response = requests.get(url)
        f.write(response.content)
Outputs:
Fetching image from https://apod.nasa.gov/image/1807/FermiFinals.jpg to FermiFinals.jpg
And the image is saved as FermiFinals.jpg
I think the issue is the site variable. When all is said and done, it's trying to append the image path to https://apod.nasa.gov/apod/astropix.html. If you simply remove the astropix.html, it works fine. What I have below is just a small modification of what you have; copy/paste and ship it!
import re
import requests
from bs4 import BeautifulSoup

site = "https://apod.nasa.gov/apod/astropix.html"
site_path_only = site.replace("astropix.html", "")

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]
for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative
            # if it is, provide the base url which also happens
            # to be the site variable atm.
            url = '{}{}'.format(site_path_only, url)
        response = requests.get(url)
        f.write(response.content)
Note: if it downloads the image but says it's corrupt and it's only about 1 KB in size, you are probably getting a 404 for some reason. Just open the 'image' in Notepad and read the HTML it's giving back.
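A minimal sketch of that sanity check in code, assuming you would rather skip bad responses than inspect them by hand (url and filename stand in for the variables from the loop above):

import requests

response = requests.get(url)
# guard against saving an HTML error page as a .jpg
if response.ok and 'image' in response.headers.get('Content-Type', ''):
    with open(filename, 'wb') as f:
        f.write(response.content)
else:
    print('Skipping {}: HTTP {}'.format(url, response.status_code))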
I am trying to download the show images from this page with BeautifulSoup.
When I run the code below, the only image that downloads is the spinning loading icon.
When I check the requests tab on the page I can see requests for all the other images on the page, so I assume they should be downloaded as well. I am not sure why they do not download, as they are contained within img tags in the HTML on the page.
import re
import requests
from bs4 import BeautifulSoup

site = 'https://www.tvnz.co.nz/categories/sci-fi-and-fantasy'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
image_tags = soup.find_all('img')

urls = [img['src'] for img in image_tags]
for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        print("Regular expression didn't match with the url: {}".format(url))
        continue
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
print("Download complete, downloaded images can be found in current directory!")
You can try going via the API they seem to be using to populate the page:
api_url = 'https://apis-edge-prod.tech.tvnz.co.nz/api/v1/web/play/page/categories/sci-fi-and-fantasy'
r = requests.get(api_url)

try:
    embVals = r.json()['_embedded'].values()
except Exception as e:
    embVals = []
    print('failed to get embedded items\n', str(e))

urls = [img for images in [[
    v['src'] for k, v in ev.items() if
    k is not None and 'image' in k.lower()
    and v is not None and 'src' in v
] for ev in embVals] for img in images]

# for url in urls: # should work the same
(Images seem to be in nested dictionaries with keys like 'portraitTileImage', 'image', 'tileImage', 'coverImage'. You can also use for-loops to go through embVals and extract other data if you want to include more in the filename/metadata/etc.; a loop-based sketch follows below.)
I don't know if it will get you ALL the images on the page, but when I tried it, urls had 297 links.
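For readability, here is a rough loop-based equivalent of that nested comprehension; it assumes the same payload shape (each embedded item is a dict whose image-like keys map to dicts carrying a 'src' entry):

urls = []
for ev in embVals:
    for k, v in ev.items():
        # keep only keys that look like image entries and actually carry a 'src'
        if k is not None and 'image' in k.lower() and v is not None and 'src' in v:
            urls.append(v['src'])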
I'm trying to create a program that scrapes a site for images using bs4. The site contains two types of images, low quality ones and high quality ones. The high quality files are named the same as their low quality versions, but contain the word "website" at the end, before the .png. I'd like to only download the "website" files. Here's what I tried.
from bs4 import BeautifulSoup
import requests

URL = "https://www.ssbwiki.com/Category:Head_icons_(SSBU)"
getURL = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')

images = soup.find_all('img')

resolvedURLs = []
for image in images:
    src = image.get('src')
    resolvedURLs.append(requests.compat.urljoin(URL, src))

for image in resolvedURLs:
    if not image.endswith("Website.png"):
        continue
        if image.endswith("Website.png"):
            webs = requests.get(image)
            open('scraped_images/' + image.split('/')[-1], 'wb').write(webs.content)
I don't get any error messages, but no files download. Any tips?
You are only checking if it ends with "Website.png" after you have already established that it doesn't. Better not to even check if it doesn't:
from bs4 import BeautifulSoup
import requests

URL = "https://www.ssbwiki.com/Category:Head_icons_(SSBU)"
getURL = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')

images = soup.find_all('img')

resolvedURLs = []
for image in images:
    src = image.get('src')
    resolvedURLs.append(requests.compat.urljoin(URL, src))

for image in resolvedURLs:
    if image.endswith("Website.png"):
        webs = requests.get(image)
        open('scraped_images/' + image.split('/')[-1], 'wb').write(webs.content)
Actually, using list comprehensions makes your code less procedural and helps prevent mistakes of this sort in the future:
from bs4 import BeautifulSoup
import requests
from requests.compat import urljoin

URL = "https://www.ssbwiki.com/Category:Head_icons_(SSBU)"
getURL = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')

image_urls = [urljoin(URL, image.get('src')) for image in soup.find_all('img')]

# let's make this one a generator so we don't keep too many downloaded
# images in memory
images = (requests.get(url) for url in image_urls if url.endswith("Website.png"))
for image in images:
    # use the context manager so the files are closed after write;
    # image is a Response here, so take the file name from image.url
    with open('scraped_images/' + image.url.split('/')[-1], 'wb') as f:
        f.write(image.content)
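One practical caveat with both versions: they assume a scraped_images directory already exists next to the script, since open() will not create it. A line like this before the download loop avoids that:

import os

# create the output directory if it is missing (a no-op when it already exists)
os.makedirs('scraped_images', exist_ok=True)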
I'm using BeautifulSoup in my Python code to download an image from a website that changes regularly. It all works well.
However, on the page (https://apod.nasa.gov/apod/astropix.html) there is one lower resolution image (which my code currently downloads) but then if you click the image it takes you to a higher resolution version of that same image.
Can someone please suggest how I can change my code so that it downloads the higher resolution image?:
from bs4 import BeautifulSoup as BSHTML
import requests
import subprocess
import urllib2

page = urllib2.urlopen('https://apod.nasa.gov/apod/astropix.html')
soup = BSHTML(page, features="html.parser")
images = soup.findAll('img')
url = 'https://apod.nasa.gov/apod/' + images[0]['src']
r = requests.get(url, allow_redirects=True)
with open('/home/me/Downloads/apod.jpg', "w") as f:
    f.write(r.content)
You can select the <a> tag that contains the <img>; its "href" attribute then contains your image URL:
import requests
from bs4 import BeautifulSoup as BSHTML

page = requests.get("https://apod.nasa.gov/apod/astropix.html")
soup = BSHTML(page.content, features="html.parser")

image_url = (
    "https://apod.nasa.gov/apod/" + soup.select_one("a:has(>img)")["href"]
)

r = requests.get(image_url, allow_redirects=True)
with open("/home/paul/Downloads/apod.jpg", "wb") as f:
    f.write(r.content)
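A hedged variant: urllib.parse.urljoin can build the absolute URL instead of string concatenation, which also copes if the href were ever rooted or absolute (same page and selector as above):

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup as BSHTML

site = "https://apod.nasa.gov/apod/astropix.html"
page = requests.get(site)
soup = BSHTML(page.content, features="html.parser")

# resolve the link relative to the page it came from
image_url = urljoin(site, soup.select_one("a:has(>img)")["href"])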
You need to download and write to disk:
import requests
from bs4 import BeautifulSoup
from os.path import basename

r = requests.get("xxx")
soup = BeautifulSoup(r.content)
links = soup.find_all("img")  # gather the img tags first
for link in links:
    if "http" in link.get('src'):
        lnk = link.get('src')
        with open(basename(lnk), "wb") as f:
            f.write(requests.get(lnk).content)
You can also use a select to filter your tags to only get the ones with http links:
for link in soup.select("img[src^=http]"):
    lnk = link["src"]
    with open(basename(lnk), "wb") as f:
        f.write(requests.get(lnk).content)
I'm trying to scrape images from a site using beautifulsoup HTML parser.
There are 2 kinds of image tags for each image on the site. One is for the thumbnail and the other is the bigger size image that only appears after I click on the thumbnail and expand. The bigger size tag contains a class="expanded-image" attribute.
I'm trying to parse through the HTML and get the "src" attribute of the expanded image which contains the source for the image.
When I try to execute my code, nothing happens. It just says the process finished without scraping any image. But when I don't filter and just pass the tag as an argument, it downloads all the thumbnails.
Here's my code:
import webbrowser, requests, os
from bs4 import BeautifulSoup

def getdata(url):
    r = requests.get(url)
    return r.text

htmldata = getdata('https://boards.4chan.org/a/thread/30814')
soup = BeautifulSoup(htmldata, 'html.parser')

list = []
for i in soup.find_all("img", {"class": "expanded-thumb"}):
    list.append(i['src'].replace("//", "https://"))

def download(url, pathname):
    if not os.path.isdir(pathname):
        os.makedirs(pathname)
    filename = os.path.join(pathname, url.split("/")[-1])
    response = requests.get(url, stream=True)
    with open(filename, "wb") as f:
        f.write(response.content)

for a in list:
    download(a, "file")
You might be running into a problem by using list as a variable name; it's a built-in type in Python. Start with this (replacing TEST_4CHAN_URL with whatever thread you want), incorporating my suggestion from the comment above.
import requests
from bs4 import BeautifulSoup

TEST_4CHAN_URL = "https://boards.4chan.org/a/thread/<INSERT_THREAD_ID_HERE>"

def getdata(url):
    r = requests.get(url)
    return r.text

htmldata = getdata(TEST_4CHAN_URL)
soup = BeautifulSoup(htmldata, "html.parser")

src_list = []
for i in soup.find_all("a", {"class": "fileThumb"}):
    src_list.append(i['href'].replace("//", "https://"))
print(src_list)
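From there, a download step along the lines of the helper in the question should work; this is a sketch that assumes the src_list built above and an output directory named file:

import os

def download(url, pathname):
    # create the target directory on first use
    os.makedirs(pathname, exist_ok=True)
    filename = os.path.join(pathname, url.split("/")[-1])
    response = requests.get(url)
    with open(filename, "wb") as f:
        f.write(response.content)

for src in src_list:
    download(src, "file")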
I'm scraping images from websites. I'm locating them by src, but what if they don't have a src attribute? How should I get them? Right now I'm using this code:
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
import os

def url_to_page_name(url):
    parsed = urlparse(str(url))
    return parsed.netloc

def get_images_job(page_url):
    """Request given page and extract images"""
    directory_name = url_to_page_name(page_url)

    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    img_tags = soup.find_all('img')
    urls = [img['src'] for img in img_tags]

    if not os.path.exists(directory_name):
        os.makedirs(directory_name)

    for url in urls:
        file_name = re.search(r'/([\w_-]+[.](jpg|jpeg|gif|png|bmp|webp|svg))$', url)
        if file_name:
            file_name = file_name.group(1)
            with open(os.path.join(f'{directory_name}/' + file_name), 'wb') as f:
                if 'http' not in url:
                    # sometimes an image source can be relative
                    # if it is, provide the base url which also happens
                    # to be the site variable atm.
                    url = '{}{}'.format(page_url, url)
                response = requests.get(url)
                f.write(response.content)

get_images_job("https://pixabay.com/")
And what if:

- they are used as a background, e.g. background="img/tile.jpg"
- they are located inside CSS
- they are masked as base64?

A sketch of how the inline cases might be picked up follows below.
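A hedged sketch of the inline variants of those three cases (the html sample here is invented for illustration; images referenced from external CSS files would still need those stylesheets fetched and scanned separately):

import base64
import re
from bs4 import BeautifulSoup

html = '''
<body background="img/tile.jpg">
  <div style="background-image: url('img/hero.png')"></div>
  <img src="data:image/png;base64,iVBORw0KGgo=">
</body>
'''
soup = BeautifulSoup(html, 'html.parser')

# 1. background attributes on any tag
backgrounds = [tag['background'] for tag in soup.find_all(attrs={'background': True})]

# 2. url(...) references inside inline style attributes
style_urls = []
for tag in soup.find_all(style=True):
    style_urls += re.findall(r'url\([\'"]?([^\'")]+)[\'"]?\)', tag['style'])

# 3. base64 data URIs embedded directly in src attributes
for img in soup.find_all('img', src=re.compile(r'^data:image/')):
    payload = img['src'].partition(',')[2]
    raw_bytes = base64.b64decode(payload)  # the image bytes, no download needed

print(backgrounds, style_urls)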