I'm trying to make a web scraper that downloads images for searched keywords. The code works completely fine until it has to download the image from the extracted URL.
from bs4 import BeautifulSoup
import requests
import os
import urllib
search = raw_input("search for images: ")
params = {"q": search}
r = requests.get("http://wwww.bing.com/images/search", params=params)
dir_name = search.replace(" ", "_").lower()
if not os.path.isdir(dir_name):
    os.makedirs(dir_name)
soup = BeautifulSoup(r.text, "html.parser")
links = soup.findAll("a", {"class": "thumb"})
for items in links:
    img_obj = requests.get(items.attrs["href"])
    print "Getting: ", items.attrs["href"]
    title = items.attrs["href"].split("/")[-1]
    urllib.urlretrieve(items.attrs["href"], "./scraped_images/")
OUTPUT:
search for images: cats
Getting: http://c1.staticflickr.com/3/2755/4353908962_2a0003aebf.jpg
Traceback (most recent call last):
File "C:/Users/qazii/PycharmProjects/WebScraping/exm.py", line 21, in <module>
urllib.urlretrieve(items.attrs["href"], "./scraped_images/")
File "E:\anaconda\envs\WebScraping\lib\urllib.py", line 98, in urlretrieve
return opener.retrieve(url, filename, reporthook, data)
File "E:\anaconda\envs\WebScraping\lib\urllib.py", line 249, in retrieve
tfp = open(filename, 'wb')
IOError: [Errno 13] Permission denied: './scraped_images/'
You're attempting to save the image to a "file" named ./scraped_images/. Since this is a directory and not a file, you get a permissions error (you can't open a directory with write permissions). Instead, try saving to a specific file name.
urllib.urlretrieve(items.attrs["href"], os.path.join("./scraped_images", title))
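Putting it together, a minimal sketch of the download loop (assuming the same links list from the question and that the files should end up inside the ./scraped_images directory mentioned in the traceback) could look like this:
import os
import urllib

save_dir = "./scraped_images"
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)

for items in links:
    url = items.attrs["href"]
    print "Getting: ", url
    title = url.split("/")[-1]
    # pass a full file path (directory + filename), not just the directory
    urllib.urlretrieve(url, os.path.join(save_dir, title))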
Related
I'm trying to create a web scraper to download certain images from a webpage using Python and BeautifulSoup. I'm a beginner and have built this just through finding code online and trying to adapt it. My problem is that when I run the code, it produces this error:
line 24, in <module>
if len(nametemp) == 0:
TypeError: object of type 'NoneType' has no len()
This is what my code looks like:
i = 1
def makesoup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = makesoup("https://www.stiga.pl/sklep/koszenie-trawnika/agregaty-tnace/agregaty-tnace-park-villa/agregat-park-100-combi-3-el-qf")

for img in soup.findAll('img'):
    temp = img.get('src')
    if temp[:1] == "/":
        image = "https://www.stiga.pl/sklep/koszenie-trawnika/agregaty-tnace/agregaty-tnace-park-villa/agregat-park-100-combi-3-el-qf" + temp
    else:
        image = temp
    nametemp = img.get('alt', [])
    if len(nametemp) == 0:
        filename = str(i)
        i = i + 1
    else:
        filename = nametemp
This works now! Thanks for the replies!
Now when I run the code, only some of the images from the webpage appear in my folder. And it returns this:
Traceback (most recent call last):
File "scrape_stiga.py", line 31, in <module>
imagefile.write(urllib.request.urlopen(image).read())
File "/Users/opt/anaconda3/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/Users/opt/anaconda3/lib/python3.7/urllib/request.py", line 510, in open
req = Request(fullurl, data)
File "/Users/opt/anaconda3/lib/python3.7/urllib/request.py", line 328, in __init__
self.full_url = url
File "/Users/opt/anaconda3/lib/python3.7/urllib/request.py", line 354, in full_url
self._parse()
File "/Users/opt/anaconda3/lib/python3.7/urllib/request.py", line 383, in _parse
raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'assets/img/logo-white.png'
Replace nametemp = img.get('alt') with nametemp = img.get('alt', '').
Some <img> elements could be missing the alt attribute. In such a case, img.get('alt') will return None, and the len function doesn't work on None.
By using img.get('alt', ''), you get back an empty string when the image lacks an alt attribute. len('') returns 0, so your code will not break.
Looks like nametemp is being assigned None (that's the default behaviour of get when the key is missing).
In order to ensure nametemp is iterable, try changing your assignment line:
nametemp = img.get('alt',[])
This will ensure that if "alt" isn't found you get back a list, and thus you can call len on it.
To control which directory your file is stored in, simply make your filename contain the whole path, e.g. "C:/Desktop/mySpecialFile.jpeg".
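For example, building on the filename logic in the question (the directory here is only a placeholder):
import os

save_dir = "C:/Desktop/scraped_images"
filename = os.path.join(save_dir, nametemp + ".jpg")  # e.g. C:/Desktop/scraped_images/<alt text>.jpg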
You are taking the length of nametemp when the error is raised. It says you can't take the length of a NoneType object. This tells you that nametemp at that point must be None.
Why is it None? Let's go back to:
nametemp = img.get('alt')
OK. img is the current <img> tag, since you're iterating over image tags. At some point you iterate over an image tag which does not have an alt attribute. Therefore, img.get('alt') returns None, and None is assigned to nametemp.
Check the HTML you are parsing and confirm that all image tags have an alt attribute. If you only want to iterate over image tags with an alt attribute, you can use a CSS selector to find only those tags, or you could add a try/except to your loop and simply continue when you come across an image tag you don't like.
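For instance, either of these rough sketches would work (both assume the soup object from the question):
# Option 1: a CSS selector that only matches <img> tags that have an alt attribute
for img in soup.select("img[alt]"):
    nametemp = img["alt"]

# Option 2: iterate over everything and skip tags without the attribute
for img in soup.findAll("img"):
    try:
        nametemp = img["alt"]
    except KeyError:
        continue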
EDIT - You said you want to scrape product images, but it isn't really clear what page you are trying to scrape these images from exactly. You did update your post with a URL - thank you - but what exactly are you trying to achieve? Do you want to scrape the page that contains all (or some) of the products within a certain category, and simply scrape the thumbnails? Or do you want to visit each product page individually and download the higher resolution image?
Here's something I put together: it just looks at the first page of all products within a certain category, then scrapes and downloads the thumbnail (low-resolution) images into a downloaded_images folder, creating that folder automatically if it doesn't exist. This requires the third-party module requests, which you can install with pip install requests, though you should be able to do something similar using urllib.request if you don't want to install requests:
def download_image(image_url):
    import requests
    from pathlib import Path

    dir_path = Path("downloaded_images/")
    dir_path.mkdir(parents=True, exist_ok=True)

    image_name = image_url[image_url.rfind("/")+1:]
    image_path = str(dir_path) + "/" + image_name

    with requests.get(image_url, stream=True) as response:
        response.raise_for_status()
        with open(image_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
                file.flush()
    print(f"Finished downloading \"{image_url}\" to \"{image_path}\".\n")

def main():
    import requests
    from bs4 import BeautifulSoup

    root_url = "https://www.stiga.pl/"
    url = f"{root_url}sklep/koszenie-trawnika/agregaty-tnace/agregaty-tnace-park-villa"

    response = requests.get(url)
    response.raise_for_status()

    soup = BeautifulSoup(response.content, "html.parser")
    for product in soup.findAll("div", {"class": "products__item"}):
        image_url = root_url + product.find("img")["data-src"]
        download_image(image_url)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
To recap, you are using BeautifulSoup to find the URLs to the images, and then you use a simple requests.get to download the image.
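As mentioned above, if you would rather not install requests, a roughly equivalent download helper using only the standard library might look like this (just a sketch, mirroring the naming and Path handling of the function above):
import urllib.request
from pathlib import Path

def download_image(image_url):
    dir_path = Path("downloaded_images")
    dir_path.mkdir(parents=True, exist_ok=True)
    image_name = image_url[image_url.rfind("/") + 1:]
    image_path = dir_path / image_name
    # urlopen returns a file-like response object; copy its bytes to disk
    with urllib.request.urlopen(image_url) as response, open(image_path, "wb") as file:
        file.write(response.read())
    print(f"Finished downloading \"{image_url}\" to \"{image_path}\".")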
I have a Python script that searches for images on a web page and is supposed to download them to a folder named 'downloaded'. The last 2-3 lines are problematic; I don't know how to write the correct 'with open' code.
Most of the script is fine, but lines 42-43 give an error:
import os
import requests
from bs4 import BeautifulSoup
downloadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"
def getAbsoluteURL(baseUrl, source):
    if source.startswith("http://www."):
        url = "http://"+source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = source[4:]
        url = "http://"+source
    else:
        url = baseUrl+"/"+source
    if baseUrl not in url:
        return None
    return url

def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory+path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path

html = requests.get("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html.content, 'html.parser')
downloadList = bsObj.find_all(src=True)

for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl,download["src"])
    if fileUrl is not None:
        print(fileUrl)
        with open(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory), 'wb') as out_file:
            out_file.write(fileUrl.content)
It creates the downloaded folder on my computer and a misc folder within it, and then it gives a traceback error.
Traceback:
http://pythonscraping.com/misc/jquery.js?v=1.4.4
Traceback (most recent call last):
File "C:\Python36\kodovi\downloaded.py", line 43, in <module>
with open(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory), 'wb') as out_file:
TypeError: an integer is required (got type str)
It seems your downloadList includes some URLs that aren't images. You could instead look for any <img> tags in the HTML:
downloadList = bsObj.find_all('img')
Then use this to download those images:
for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl,download["src"])
    r = requests.get(fileUrl, allow_redirects=True)
    filename = os.path.join(downloadDirectory, fileUrl.split('/')[-1])
    open(filename, 'wb').write(r.content)
EDIT: I've updated the filename = ... line so that it writes a file of the same name into the directory given by downloadDirectory. By the way, the normal convention for Python variable names is snake_case rather than camelCase.
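For what it's worth, a snake_case version of that loop, with a guard for sources that getAbsoluteURL rejects, might look like this (just a sketch reusing the question's variable names):
for download in downloadList:
    file_url = getAbsoluteURL(baseUrl, download["src"])
    if file_url is None:
        continue  # getAbsoluteURL returns None for sources outside the base site
    r = requests.get(file_url, allow_redirects=True)
    file_name = os.path.join(downloadDirectory, file_url.split('/')[-1])
    with open(file_name, 'wb') as out_file:
        out_file.write(r.content)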
I am trying to download all the PDFs from the website provided, and I am using the following code:
import mechanize
from time import sleep
br = mechanize.Browser()
br.open('http://www.nerc.com/comm/CCC/Pages/AgendasHighlightsandMinutes-.aspx')
f=open("source.html","w")
f.write(br.response().read())
filetypes=[".pdf"]
myfiles=[]
for l in br.links():
    for t in filetypes:
        if t in str(l):
            myfiles.append(l)

def downloadlink(l):
    f=open(l.text,"w")
    br.click_link(l)
    f.write(br.response().read())
    print l.text," has been downloaded"

for l in myfiles:
    sleep(1)
    downloadlink(l)
I keep getting the following error and can't figure out why:
legal and privacy has been downloaded
Traceback (most recent call last):
File "downloads-pdfs.py", line 29, in <module>
downloadlink(l)
File "downloads-pdfs.py", line 21, in downloadlink
f=open(l.text,"w")
IOError: [Errno 13] Permission denied: u'/trademark policy'
The problem arises because you use the link text as a filename, and here that text contains a '/', which is not a valid character in a filename. Try modifying your downloadlink function to something like this:
def downloadlink(l):
    filename = l.text.split('/')[-1]
    with open(filename, "w") as f:
        br.click_link(l)
        f.write(br.response().read())
    print l.text," has been downloaded"
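If you want to be more defensive about other characters that are not allowed in filenames, and to write the PDF bytes in binary mode, one possible variation is sketched below (the safe_filename helper is only an illustration, not part of the original code):
import re

def safe_filename(text):
    # keep letters, digits, dots, dashes and underscores; replace everything else
    return re.sub(r'[^\w.\-]+', '_', text).strip('_') or 'unnamed'

def downloadlink(l):
    filename = safe_filename(l.text) + ".pdf"
    with open(filename, "wb") as f:  # "wb": PDF content is binary data
        br.click_link(l)
        f.write(br.response().read())
    print filename, "has been downloaded"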
I am a beginner in Python 3.
I want to copy a Java snippet file into the middle of another temp file, whose path I get by downloading a URL.
My problem is that when I execute my program I get this error:
RESTART: C:/Users/user/AppData/Local/Programs/Python/Python36/refactordwon.py
the Url is:
('C:\\Users\\user\\AppData\\Local\\Temp\\tmpq5m7m_og', <http.client.HTTPMessage object at 0x0000003A854879E8>)
Traceback (most recent call last):
File "C:/Users/user/AppData/Local/Programs/Python/Python36/refactordwon.py", line 14, in <module>
file_out = open("path_file" , "r")
FileNotFoundError: [Errno 2] No such file or directory: 'path_file'
>>>
I do not know why, because when I download the URL, it shows me this address:
the Url is:
('C:\\Users\\user\\AppData\\Local\\Temp\\tmpey3yovte', <http.client.HTTPMessage object at 0x0000002233347978>)
I tried to use this address in different ways, but I always got an error. I found the temp file and copied it into the Python directory. I am sure that I have this file and that the address is correct, but I still get an error saying the file cannot be found.
Could you help me, please? I hope my question is clear.
My code is:
import urllib.request
import os
import tempfile
#download URL
#[-------------------------
url = 'http://pages.di.unipi.it/corradini/Didattica/AP-17/PROG-ASS/03/assignment3.html'
gt_url = urllib.request.urlretrieve(url)
print("the Url is: ")
print(gt_url)
#--------------------------]
#copy snippet java file inside remote file
#[--------------------------
path_file =r'C:/Users/user/AppData/Local/Programs/Python/Python36/tmpokv2s_dw'
file_out = open("path_file" , "r")
file_in = open("snip1.java", "r")
file_out.readlines()
open("file_back", "w")
file_back.write(file_out)
pos_fileout = file_back.tell()
file_back.seek(pos_fileout)
file_back.write(file_in)
print("the content of file is: ")
file_back.close()
file_out.close()
file_in.close()
open("file_back", "r")
file_back.readlines()
print(file_back)
file_back.close()
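For what it's worth, the FileNotFoundError in the traceback comes from passing the literal string "path_file" to open() instead of the path_file variable; also, urlretrieve already returns the downloaded file's path as the first element of its result tuple. A minimal sketch of the idea, assuming snip1.java sits next to the script and simply appending the snippet rather than inserting it mid-file:
import urllib.request

url = 'http://pages.di.unipi.it/corradini/Didattica/AP-17/PROG-ASS/03/assignment3.html'
downloaded_path, headers = urllib.request.urlretrieve(url)

# Read both files, then write their contents together into a new file
with open(downloaded_path, "r") as file_out, open("snip1.java", "r") as file_in:
    downloaded_text = file_out.read()
    snippet_text = file_in.read()

with open("file_back.txt", "w") as file_back:
    file_back.write(downloaded_text)
    file_back.write(snippet_text)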
We have to extract a specified number of blogs (n) by reading them from a text file containing a list of blogs.
Then I extract the blog data and append it to a file.
This is just a part of the main assignment of applying nlp to the data.
So far I've done this:
import urllib2
from bs4 import BeautifulSoup
def create_data(n):
    blogs=open("blog.txt","r") #opening the file containing list of blogs
    f=file("data.txt","wt") #Create a file data.txt

    with open("blog.txt") as blogs:
        head = [blogs.next() for x in xrange(n)]
        page = urllib2.urlopen(head['href'])
        soup = BeautifulSoup(page)
        link = soup.find('link', type='application/rss+xml')
        print link['href']
        rss = urllib2.urlopen(link['href']).read()
        souprss = BeautifulSoup(rss)
        description_tag = souprss.find('description')

        f = open("data.txt","a") #data file created for applying nlp
        f.write(description_tag)
This code doesn't work. It worked when giving the link directly, like:
page = urllib2.urlopen("http://www.frugalrules.com")
I call this function from a different script where user gives the input n.
What am I doing wrong?
Traceback:
Traceback (most recent call last):
File "C:/beautifulsoup4-4.3.2/main.py", line 4, in <module>
create_data(2)#calls create_data(n) function from create_data
File "C:/beautifulsoup4-4.3.2\create_data.py", line 14, in create_data
page=urllib2.urlopen(head)
File "C:\Python27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 395, in open
req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
head is a list:
head = [blogs.next() for x in xrange(n)]
A list is indexed by integer indices (or slices). You can not use head['href'] when head is a list:
page = urllib2.urlopen(head['href'])
It's hard to say how to fix this without knowing what the contents of blog.txt look like. If each line of blog.txt contains a URL, then you could use:
with open("blog.txt") as blogs:
    for url in list(blogs)[:n]:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        ...
        with open('data.txt', 'a') as f:
            f.write(...)
Note that file is a deprecated form of open (it was removed in Python 3). Instead of using f = file("data.txt", "wt"), use the more modern with-statement syntax (as shown above).
For example,
import urllib2
import bs4 as bs
def create_data(n):
    with open("data.txt", "wt") as f:
        pass
    with open("blog.txt") as blogs:
        for url in list(blogs)[:n]:
            page = urllib2.urlopen(url)
            soup = bs.BeautifulSoup(page.read())
            link = soup.find('link', type='application/rss+xml')
            print(link['href'])

            rss = urllib2.urlopen(link['href']).read()
            souprss = bs.BeautifulSoup(rss)
            description_tag = souprss.find('description')

            with open('data.txt', 'a') as f:
                f.write('{}\n'.format(description_tag))

create_data(2)
I'm assuming that you are opening, writing to and closing data.txt with each pass through the loop because you want to save partial results -- maybe in case the program is forced to terminate prematurely.
Otherwise, it would be easier to just open the file once at the very beginning:
with open("blog.txt") as blogs, open("data.txt", "wt") as f: