Download and save many PDF files with Python - python

I am trying to download many PDF files from a website and save them.
import requests
url = "https://jawdah.qcc.abudhabi.ae/en/Registration/QCCServices/Services/Registration/Trade%20Licenses/"+id+".pdf"
r = requests.get(url, stream= TRUE)
for id in range(1,125):
    with open(id+'.pdf',"wb") as pdf:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                pdf.write(chunk)
The first URL of the PDFs is https://jawdah.qcc.abudhabi.ae/en/Registration/QCCServices/Services/Registration/Trade%20Licenses/1.pdf
and the last URL is https://jawdah.qcc.abudhabi.ae/en/Registration/QCCServices/Services/Registration/Trade%20Licenses/125.pdf
I want to download all these files.
When I execute this code I get this error:
Traceback (most recent call last):
File "c:\Users\king-\OneDrive\Bureau\pdfs\pdfs.py", line 6, in <module>
url = "https://jawdah.qcc.abudhabi.ae/en/Registration/QCCServices/Services/Registration/Trade%20Licenses/"+id+".pdf"
TypeError: can only concatenate str (not "builtin_function_or_method") to str

In the second line
url = "https://jawdah.qcc.abudhabi.ae/en/Registration/QCCServices/Services/Registration/Trade%20Licenses/"+id+".pdf"
you add a str object to something named id. At that point id is still Python's built-in function (type id in a Python console to see it), which is why the concatenation raises the TypeError. In line 4
for id in range(1,125):
you overwrite id with something else (a number), which is possible but not recommended.
Apart from that, you make just a single request, not one for every file, and range(1,125) stops at 124, so the last file would be skipped. Try this:
import requests
url = "https://jawdah.qcc.abudhabi.ae/en/Registration/QCCServices/Services/Registration/Trade%20Licenses/{}.pdf"
for num in range(1,126):
    # one request per file; the keyword value is True, not TRUE
    r = requests.get(url.format(num), stream=True)
    with open('{}.pdf'.format(num),"wb") as pdf:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                pdf.write(chunk)
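As a rough sketch of the bookkeeping only (no network access), the URL and file name for each number can be built up front. The helper name pdf_targets is hypothetical and not part of the answer above; it just makes the inclusive 1 to 125 range explicit and avoids reusing the built-in name id.

```python
BASE_URL = (
    "https://jawdah.qcc.abudhabi.ae/en/Registration/QCCServices/"
    "Services/Registration/Trade%20Licenses/{}.pdf"
)


def pdf_targets(first, last):
    """Return (url, filename) pairs for every number from first to last, inclusive."""
    # range() excludes its upper bound, hence last + 1
    return [
        (BASE_URL.format(num), "{}.pdf".format(num))
        for num in range(first, last + 1)
    ]


if __name__ == "__main__":
    for url, filename in pdf_targets(1, 3):
        print(filename, "<-", url)
```

Each pair could then be passed to requests.get(url, stream=True) and open(filename, "wb") as in the loop above.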

Related

How to fix error " object of type 'NoneType' has no len() " when creating a web scraper?

I'm trying to create a web scraper to download certain images from a webpage using Python and BeautifulSoup. I'm a beginner and have built this just through finding code online and trying to adapt it. My problem is that when I run the code, it produces this error:
line 24, in <module>
if len(nametemp) == 0:
TypeError: object of type 'NoneType' has no len()
This is what my code looks like:
i = 1
def makesoup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata
soup = makesoup("https://www.stiga.pl/sklep/koszenie-trawnika/agregaty-tnace/agregaty-tnace-park-villa/agregat-park-100-combi-3-el-qf")
for img in soup.findAll('img'):
    temp=img.get('src')
    if temp[:1]=="/":
        image = "https://www.stiga.pl/sklep/koszenie-trawnika/agregaty-tnace/agregaty-tnace-park-villa/agregat-park-100-combi-3-el-qf" + temp
    else:
        image = temp
    nametemp = img.get('alt', [])
    if len(nametemp) == 0:
        filename = str(i)
        i = i + 1
    else:
        filename = nametemp
This works now! Thanks for the replies!
Now when I run the code, only some of the images from the webpage appear in my folder. And it returns this:
Traceback (most recent call last):
File "scrape_stiga.py", line 31, in <module>
imagefile.write(urllib .request.urlopen(image).read())
File "/Users/opt/anaconda3/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/Users/opt/anaconda3/lib/python3.7/urllib/request.py", line 510, in open
req = Request(fullurl, data)
File "/Users/opt/anaconda3/lib/python3.7/urllib/request.py", line 328, in __init__
self.full_url = url
File "/Users/opt/anaconda3/lib/python3.7/urllib/request.py", line 354, in full_url
self._parse()
File "/Users/opt/anaconda3/lib/python3.7/urllib/request.py", line 383, in _parse
raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'assets/img/logo-white.png'
Replace nametemp = img.get('alt') with nametemp = img.get('alt', '').
Some <img> elements could be missing the alt attribute. In such a case, img.get('alt') will return None, and the len function doesn't work on None.
By using img.get('alt', ''), you return an empty string when the image lacks an alt attribute. len('') will return 0 and your code will not break.
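The effect of the default value can be sketched without any network access. This example swaps BeautifulSoup for the standard library's html.parser so it runs without third-party packages; the attribute dictionary it builds supports the same get(key, default) call. The class and function names (ImgCollector, filenames_for) and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the attributes of every <img> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def filenames_for(html):
    """Use the alt text as file name, falling back to a running counter."""
    collector = ImgCollector()
    collector.feed(html)
    names = []
    counter = 1
    for attrs in collector.images:
        # A missing alt attribute yields None; "or ''" turns that into an
        # empty string so len() is always safe.
        alt = attrs.get("alt", "") or ""
        if len(alt) == 0:
            names.append(str(counter))
            counter += 1
        else:
            names.append(alt)
    return names


if __name__ == "__main__":
    sample = '<img src="a.png" alt="Logo"><img src="b.png"><img src="c.png" alt="">'
    print(filenames_for(sample))
```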
Looks like nametemp is being assigned None (that's the default behaviour of get).
To ensure len can be called on nametemp, try changing your assignment line:
nametemp = img.get('alt',[])
This ensures that if "alt" isn't found you get an empty list back, and thus you can call len.
To control which directory your file is stored in, simply change your filename to contain the whole path, e.g. "C:/Desktop/mySpecialFile.jpeg".
You are taking the length of nametemp when the error is raised. It says you can't take the length of a NoneType object. This tells you that nametemp at that point must be None.
Why is it None? Let's go back to:
nametemp = img.get('alt')
OK. img is the current <img> tag, since you're iterating over image tags. At some point you iterate over an image tag which does not have an alt attribute. Therefore, img.get('alt') returns None, and None is assigned to nametemp.
Check the HTML you are parsing and confirm that all image tags have an alt attribute. If you only want to iterate over image tags with an alt attribute, you can use a CSS selector to find only those tags, or you could add a try/except to your loop and simply continue if you come across an image tag you don't like.
EDIT - You said you want to scrape product images, but it isn't really clear what page you are trying to scrape these images from exactly. You did update your post with a URL - thank you - but what exactly are you trying to achieve? Do you want to scrape the page that contains all (or some) of the products within a certain category, and simply scrape the thumbnails? Or do you want to visit each product page individually and download the higher resolution image?
Here's something I put together: It just looks at the first page of all products within a certain category, and then scrapes and downloads the thumbnails (low resolution) images into a downloaded_images folder. If the folder doesn't exist, it will create it automatically. This does require the third party module requests, which you can install using pip install requests - though you should be able to do something similar using urllib.request if you don't want to install requests:
def download_image(image_url):
    import requests
    from pathlib import Path
    dir_path = Path("downloaded_images/")
    dir_path.mkdir(parents=True, exist_ok=True)
    image_name = image_url[image_url.rfind("/")+1:]
    image_path = str(dir_path) + "/" + image_name
    with requests.get(image_url, stream=True) as response:
        response.raise_for_status()
        with open(image_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
            file.flush()
    print(f"Finished downloading \"{image_url}\" to \"{image_path}\".\n")
def main():
    import requests
    from bs4 import BeautifulSoup
    root_url = "https://www.stiga.pl/"
    url = f"{root_url}sklep/koszenie-trawnika/agregaty-tnace/agregaty-tnace-park-villa"
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    for product in soup.findAll("div", {"class": "products__item"}):
        image_url = root_url + product.find("img")["data-src"]
        download_image(image_url)
    return 0
if __name__ == "__main__":
    import sys
    sys.exit(main())
To recap, you are using BeautifulSoup to find the URLs to the images, and then you use a simple requests.get to download the image.
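The "unknown url type: 'assets/img/logo-white.png'" error in the question comes from passing a relative src straight to urlopen. One possible way to handle that, sketched here with the standard library's urljoin, is to resolve every src against the page URL first. The helper name absolute_image_url is hypothetical, and whether the resolved address actually exists on the site has not been checked.

```python
from urllib.parse import urljoin

PAGE_URL = (
    "https://www.stiga.pl/sklep/koszenie-trawnika/agregaty-tnace/"
    "agregaty-tnace-park-villa/agregat-park-100-combi-3-el-qf"
)


def absolute_image_url(src, page_url=PAGE_URL):
    """Resolve an <img> src (relative or absolute) against the page URL."""
    return urljoin(page_url, src)


if __name__ == "__main__":
    for src in ("assets/img/logo-white.png",
                "/assets/img/logo-white.png",
                "https://cdn.example.com/a.png"):
        print(src, "->", absolute_image_url(src))
```

This replaces the manual check on temp[:1] == "/" and also covers paths that do not start with a slash.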

Downloading XML files from a web services URL in python

Please correct me if I am wrong, as I am a beginner in Python.
I have a web services URL which returns an XML file:
http://abc.tch.xyz.edu:000/patientlabtests/id/1345
I have a list of values. I want to append each value in the list to the URL, download the file for that value, and name the downloaded file after the value.
Downloading one file at a time is possible, but I have thousands of values in the list. I was trying to write a function with a for loop and I am stuck.
x = [ 1345, 7890, 4729]
for i in x :
    url = http://abc.tch.xyz.edu:000/patientlabresults/id/{}.format(i)
    response = requests.get(url2)
    ****** Missing part of the code ********
    with open('.xml', 'wb') as file:
        file.write(response.content)
        file.close()
The files downloaded from URL should be like
"1345patientlabresults.xml"
"7890patientlabresults.xml"
"4729patientlabresults.xml"
I know there is a part of the code which is missing and I am unable to fill in that missing part. I would really appreciate if anyone can help me with this.
Your web service URL does not seem to be reachable. Check this:
import requests
x = [ 1345, 7890, 4729]
for i in x :
    url2 = "http://abc.tch.xyz.edu:000/patientlabresults/id/"
    response = requests.get(url2+str(i)) # i must be converted to a string
Note: When you use 'with' to open a file, you do not have to close the file, since it will be closed automatically.
with open(filename, mode) as file:
    file.write(data)
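A quick way to convince yourself of the note above is to inspect the file object after the with block has ended. This is only a small sketch using a temporary file; the function name write_and_check is made up for the example.

```python
import os
import tempfile


def write_and_check(data):
    """Write bytes inside a with block and report whether the file got closed."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)  # only the path is needed; the file is reopened below
    try:
        with open(path, "wb") as file:
            file.write(data)
        # The with block has ended, so no explicit file.close() is required.
        closed_after_with = file.closed
        with open(path, "rb") as file:
            round_trip = file.read()
    finally:
        os.remove(path)
    return closed_after_with, round_trip


if __name__ == "__main__":
    print(write_and_check(b"<result/>"))
```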
Since the URL you provided is not working, I am going to use a different URL. I hope you get the idea of how to write to a file using a custom name.
import requests
categories = ['fruit', 'car', 'dog']
for category in categories :
    url = "https://icanhazdadjoke.com/search?term="
    file_name = category + "_JOKES_2018" #Files will be saved as fruit_JOKES_2018
    r = requests.get(url + category)
    data = r.status_code #Storing the status code in 'data' variable
    with open(file_name+".txt", 'w+') as f:
        f.write(str(data)) # Writing the status code of each url in the file
After running this code, the status code of each request will be written to its own file. The files will be named as follows:
car_JOKES_2018.txt
dog_JOKES_2018.txt
fruit_JOKES_2018.txt
I hope this gives you an understanding of how to name the files and write into the files.
I think you just want to create a path using str.format, as you (almost) do for the URL. Maybe something like the following:
import os.path
import requests
x = [ 1345, 7890, 4729]
for i in x:
    path = '{}patientlabresults.xml'.format(i)
    # ignore this file if we've already got it
    if os.path.exists(path):
        continue
    # try and get the file, throwing an exception on failure
    url = 'http://abc.tch.xyz.edu:000/patientlabresults/id/{}'.format(i)
    res = requests.get(url)
    res.raise_for_status()
    # write the successful file out (binary mode, since res.content is bytes)
    with open(path, 'wb') as fd:
        fd.write(res.content)
I've added some error handling and better behaviour on retry.
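The naming and skip-if-present logic can be tried without the (unreachable) web service by passing the download step in as a function. This is a sketch under that assumption: save_results and its fetch parameter are hypothetical names, and fetch stands in for something like a wrapper around requests.get(url).content.

```python
from pathlib import Path


def save_results(ids, fetch, out_dir="."):
    """Save fetch(value) as '<value>patientlabresults.xml' for each value.

    Files that already exist are skipped, so a rerun only fills the gaps.
    Returns the names of the files written during this call.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for value in ids:
        path = out_dir / "{}patientlabresults.xml".format(value)
        if path.exists():
            continue
        path.write_bytes(fetch(value))  # fetch must return bytes
        written.append(path.name)
    return written


if __name__ == "__main__":
    import tempfile

    def fake_fetch(value):
        # Stand-in for the real download; returns a tiny XML document.
        return "<patient id='{}'/>".format(value).encode("utf-8")

    with tempfile.TemporaryDirectory() as folder:
        print(save_results([1345, 7890, 4729], fake_fetch, folder))
```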

Opening a csv file from an API with python

So I am trying to download a file from an API, which will be in csv format.
I generate a link with user inputs and store it in a variable exportLink:
import requests
#getProjectName
projectName = raw_input('ProjectName')
#getApiToken
apiToken = "mytokenishere"
#getStartDate
startDate = raw_input('Start Date')
#getStopDate
stopDate = raw_input('Stop Date')
url = "https://api.awrcloud.com/get.php?action=export_ranking&project=%s&token=%s&startDate=%s&stopDate=%s" % (projectName,apiToken,startDate,stopDate)
exportLink = requests.get(url).content
exportLink will store the generated link, which I must then call to download the csv file using another requests.get() command on exportLink.
When I click the link it opens the download in a browser. Is there any way to automate this so that it opens the zip and I can begin to edit the csv using Python, i.e. removing some stuff?
If you have a bytes object zipdata that you got with requests.get(url).content, you can extract it file by file into another bytes object:
import zipfile
import io
import csv
with zipfile.ZipFile(io.BytesIO(zipdata)) as z:
    for f in z.filelist:
        csvdata = z.read(f)
and then do something with csvdata:
reader = csv.reader(io.StringIO(csvdata.decode()))
...
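To try the in-memory approach without calling the API, a small zip archive can be built in memory first and then read back. The sketch below assumes the CSV is UTF-8 encoded; the function names and the sample file name and contents are invented for the example and say nothing about what the real export contains.

```python
import csv
import io
import zipfile


def rows_from_zip(zipdata, encoding="utf-8"):
    """Return {member name: list of CSV rows} for a zip archive held in bytes."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(zipdata)) as archive:
        for name in archive.namelist():
            text = archive.read(name).decode(encoding)
            # newline="" lets the csv module handle line endings itself
            result[name] = list(csv.reader(io.StringIO(text, newline="")))
    return result


def make_sample_zip():
    """Build a tiny zip in memory so the example needs no download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("ranking.csv", "keyword,position\npython,1\n")
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_from_zip(make_sample_zip()))
```

With real data, zipdata would be the bytes returned by the second requests.get call.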

Parsing data from JSON with python

I'm just starting out with Python and here is what I'm trying to do. I want to access Bing's API to get the picture of the day's url. I can import the json file fine but then I can't parse the data to extract the picture's url.
Here is my python script:
import urllib, json
url = "http://www.bing.com/HPImageArchive.aspx?format=js&idx=0&n=1&mkt=en-US"
response = urllib.urlopen(url)
data = json.loads(response.read())
print data
print data["images"][3]["url"]
I get this error:
Traceback (most recent call last):
File "/Users/Robin/PycharmProjects/predictit/api.py", line 9, in <module>
print data["images"][3]["url"]
IndexError: list index out of range
FYI, here is what the JSON file looks like:
http://jsonviewer.stack.hu/#http://www.bing.com/HPImageArchive.aspx?format=js&idx=0&n=1&mkt=en-US
print data["images"][0]["url"]
There is only one object in the "images" array, so index 0 is the only valid one.
Since there is only one element in the images list, you should have data['images'][0]['url'].
You can also see that under the "Viewer" tab in the "json viewer" that you linked to.
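Instead of hard-coding an index, the list can be iterated, which works however many entries come back. This is a Python 3 sketch; the sample payload is invented and only mimics the "images" list with a "url" key discussed above, not the full response.

```python
import json

# Hypothetical, trimmed-down payload with a single entry in "images".
SAMPLE = '{"images": [{"url": "/example/picture.jpg"}]}'


def image_urls(payload):
    """Return the "url" of every entry in the "images" list."""
    data = json.loads(payload)
    return [image["url"] for image in data.get("images", [])]


if __name__ == "__main__":
    print(image_urls(SAMPLE))
```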

Extracting blog data in python

We have to extract a specified number of blogs (n) by reading them from a text file containing a list of blogs.
Then I extract the blog data and append it to a file.
This is just a part of the main assignment of applying nlp to the data.
So far I've done this:
import urllib2
from bs4 import BeautifulSoup
def create_data(n):
    blogs=open("blog.txt","r") #opening the file containing list of blogs
    f=file("data.txt","wt") #Create a file data.txt
    with open("blog.txt")as blogs:
        head = [blogs.next() for x in xrange(n)]
    page = urllib2.urlopen(head['href'])
    soup = BeautifulSoup(page)
    link = soup.find('link', type='application/rss+xml')
    print link['href']
    rss = urllib2.urlopen(link['href']).read()
    souprss = BeautifulSoup(rss)
    description_tag = souprss.find('description')
    f = open("data.txt","a") #data file created for applying nlp
    f.write(description_tag)
This code doesn't work. It worked when I gave the link directly, like:
page = urllib2.urlopen("http://www.frugalrules.com")
I call this function from a different script where user gives the input n.
What am I doing wrong?
Traceback:
Traceback (most recent call last):
File "C:/beautifulsoup4-4.3.2/main.py", line 4, in <module>
create_data(2)#calls create_data(n) function from create_data
File "C:/beautifulsoup4-4.3.2\create_data.py", line 14, in create_data
page=urllib2.urlopen(head)
File "C:\Python27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 395, in open
req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
head is a list:
head = [blogs.next() for x in xrange(n)]
A list is indexed by integer indices (or slices). You cannot use head['href'] when head is a list:
page = urllib2.urlopen(head['href'])
It's hard to say how to fix this without knowing what the contents of blog.txt look like. If each line of blog.txt contains a URL, then you could use:
with open("blog.txt") as blogs:
    for url in list(blogs)[:n]:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        ...
        with open('data.txt', 'a') as f:
            f.write(...)
Note that file is a deprecated form of open (which was removed in Python3). Instead of using f=file("data.txt","wt"), use the more modern with-statement syntax (as shown above).
For example,
import urllib2
import bs4 as bs
def create_data(n):
    with open("data.txt", "wt") as f:
        pass
    with open("blog.txt") as blogs:
        for url in list(blogs)[:n]:
            page = urllib2.urlopen(url)
            soup = bs.BeautifulSoup(page.read())
            link = soup.find('link', type='application/rss+xml')
            print(link['href'])
            rss = urllib2.urlopen(link['href']).read()
            souprss = bs.BeautifulSoup(rss)
            description_tag = souprss.find('description')
            with open('data.txt', 'a') as f:
                f.write('{}\n'.format(description_tag))
create_data(2)
I'm assuming that you are opening, writing to and closing data.txt with each pass through the loop because you want to save partial results -- maybe in case the program is forced to terminate prematurely.
Otherwise, it would be easier to just open the file once at the very beginning:
with open("blog.txt") as blogs, open("data.txt", "wt") as f:
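The "first n lines" step on its own can be sketched in Python 3 with itertools.islice, which avoids reading the whole file into a list. This assumes one URL per line in the file; the function name first_urls is made up, and stripping the trailing newline is an addition not present in the code above.

```python
from itertools import islice


def first_urls(path, n):
    """Return the first n lines of the file, stripped, skipping blank ones."""
    with open(path) as handle:
        # islice stops after n lines without loading the rest of the file
        return [line.strip() for line in islice(handle, n) if line.strip()]


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("http://www.frugalrules.com\nhttp://blog.example.com\n")
    try:
        print(first_urls(tmp.name, 1))
    finally:
        os.remove(tmp.name)
```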
