the downloaded file is corrupt - python

I wrote a script to download certain files from multiple pages on the web. The downloads seem to work, but all of the files are corrupted. I have tried different ways to download the files, but they always come out corrupted and every file is only 4 KB in size.
Where do I need to change or revise my code to fix the download problem?
while pageCounter < 3:
    soup_level1 = BeautifulSoup(driver.page_source, 'lxml')
    for div in soup_level1.findAll('div', attrs={'class': 'financial-report-download ng-scope'}):
        links = div.findAll('a', attrs={'class': 'ng-binding'}, href=re.compile("FinancialStatement"))
        for a in links:
            driver.find_element_by_xpath("//div[@ng-repeat = 'attachments in res.Attachments']").click()
            files = [url + a['href']]
            for file in files:
                file_name = file.split('/')[-1]
                print("Downloading file: %s" % file_name)
                # create response object
                r = requests.get(file, stream=True)
                # download started
                with open(file_name, 'wb') as f:
                    for chunk in r.iter_content(chunk_size=1024*1024):
                        if chunk:
                            f.write(chunk)
                print("%s downloaded!\n" % file_name)

Related

How to save several files downloaded in a row in one folder with different extensions?

What is the best way to save files to a folder with their native extension? The idea is that files are downloaded from several URLs in turn and stored in three folders, depending on the status code, and all these files have different extensions.
import requests

def save_file(link):
    filename = link.split('/')[-1]
    print(filename)
    # proxies = {
    #     'https': 'http://5.135.240.70:8080'
    # }
    data = requests.get('https://ipinfo.io/json')
    print(data.text)
    r = requests.get(link, allow_redirects=True)
    print(r.status_code)
    while True:
        if():
            if(r.status_code == 200):
                with open('\\Users\\user\\Desktop\\good\\gp.txt', 'wb') as f:
                    f.write(r.content)
            if(r.status_code != 200):
                open(r'\Users\user\Desktop\bad\gp.zip', 'wb').write(r.content)
        break
    open(r'\Users\user\Desktop\general\gp.zip', 'wb').write(r.content)

link1 = '://...........................txt'
link2 = '://..............................jpeg'
link3 = '://..............................php'
link4 = '://........................rules'
In this form it is more suitable for downloading one specific file. Maybe this can be done through "glob" or "os". I am grateful for any suggestions and help.
I am interested in this particular part of the code:
while True:
    if():
        if(r.status_code == 200):
            with open('\\Users\\user\\Desktop\\good\\gp.txt', 'wb') as f:
                f.write(r.content)
        if(r.status_code != 200):
            open(r'\Users\user\Desktop\bad\gp.zip', 'wb').write(r.content)
    break
open(r'\Users\user\Desktop\general\gp.zip', 'wb').write(r.content)
As you suspected, the os module can come in handy in this case.
You can use the following to get the file name and extension:
file_name, file_ext = os.path.splitext(os.path.basename(link))
What is happening is that you first get the full file name using basename, e.g. if your URL is https://www.binarydrtyefense.com/banlist.txt, basename returns banlist.txt. Then splitext takes banlist.txt and produces the tuple ('banlist', '.txt').
Later in your code you can use this to make the file e.g.
download_name = f'gp{file_ext}'
download_dir = 'good' if r.status_code == 200 else 'bad'
download_path = os.path.join(r'\Users\user\Desktop', download_dir, download_name)
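Putting those pieces together, a minimal sketch of a save_file that keeps the native extension and routes by status code could look like the following (the Desktop paths and the gp naming come from the question; the rest is an illustrative assumption, not a tested solution):

import os
import requests

def save_file(link):
    # derive the native extension from the URL
    file_name, file_ext = os.path.splitext(os.path.basename(link))
    r = requests.get(link, allow_redirects=True)
    download_dir = 'good' if r.status_code == 200 else 'bad'
    download_path = os.path.join(r'\Users\user\Desktop', download_dir, f'gp{file_ext}')
    with open(download_path, 'wb') as f:
        f.write(r.content)
    # a copy always goes to the "general" folder, mirroring the original code
    with open(os.path.join(r'\Users\user\Desktop', 'general', f'gp{file_ext}'), 'wb') as f:
        f.write(r.content)
    return r.status_code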

Having trouble using requests to download images off of wiki

I am working on a project where I need to scrape images off of the web. To do this, I write the image links to a file and then download each of them to a folder with requests. At first I used Google as the scrape site, but due to several reasons I have decided that Wikipedia is a much better alternative. However, after my first attempt many of the images couldn't be opened, so I tried again with one change: when I downloaded the images, I saved them under names whose endings matched the endings of the links. More images could be accessed like this, but many still could not be opened. When I tested downloading the images myself (individually, outside of the function), they downloaded perfectly, and when I then used my function to download them, they kept downloading correctly (i.e. I could access them). I am not sure if it is important, but the image endings I generally come across are svg.png and png. I want to know why this is occurring and what I may be able to do to prevent it. I have left some of my code below. Thank you.
Function:
def download_images(file):
    object = file[0:file.index("IMAGELINKS") - 1]
    folder_name = object + "_images"
    dir = os.path.join("math_obj_images/original_images/", folder_name)
    if not os.path.exists(dir):
        os.mkdir(dir)
    with open("math_obj_image_links/" + file, "r") as f:
        count = 1
        for line in f:
            try:
                if line[len(line) - 1] == "\n":
                    line = line[:len(line) - 1]
                if line[0] != "/":
                    last_chunk = line.split("/")[len(line.split("/")) - 1]
                    endings = last_chunk.split(".")[1:]
                    image_ending = ""
                    for ending in endings:
                        image_ending += "." + ending
                    if image_ending == "":
                        continue
                    with open("math_obj_images/original_images/" + folder_name + "/" + object + str(count) + image_ending, "wb") as f:
                        f.write(requests.get(line).content)
                    file = object + "_IMAGEENDINGS.txt"
                    path = "math_obj_image_endings/" + file
                    with open(path, "a") as f:
                        f.write(image_ending + "\n")
                    count += 1
            except:
                continue
    f.close()
Doing this outside of it worked:
with open("test" + image_ending, "wb") as f:
f.write(requests.get(line).content)
Example of image link file:
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e
https://en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1
https:/static/images/footer/wikimedia-button.png
https:/static/images/footer/poweredby_mediawiki_88x31.png
If all the files are indeed in PNG format and the suffix is always .png, you could try something like this:
import requests
from pathlib import Path
u1 = "https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png"
r = requests.get(u1)
Path('u1.png').write_bytes(r.content)
My previous answer works for PNGs only.
For SVG files you need to check whether the file contents start with the string "<svg" and create a file with the .svg suffix.
The code below saves the downloaded files in the "downloads" subdirectory.
import requests
from pathlib import Path

# urls are stored in a file 'urls.txt'.
with open('urls.txt') as f:
    for i, url in enumerate(f.readlines()):
        url = url.strip()  # MUST strip the line-ending char(s)!
        try:
            content = requests.get(url).content
        except:
            print('Cannot download url:', url)
            continue
        # Check if this is an SVG file
        # Note that content is bytes hence the b in b'<svg'
        if content.startswith(b'<svg'):
            ext = 'svg'
        elif url.endswith('.png'):
            ext = 'png'
        else:
            print('Cannot process contents of url:', url)
            continue  # skip writing when the type cannot be determined
        # reuse the bytes already downloaded instead of fetching the url again
        Path('downloads', f'url{i}.{ext}').write_bytes(content)
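One thing to watch: the downloads directory has to exist before the loop writes into it. With the same pathlib import, one way to create it up front is:

Path('downloads').mkdir(parents=True, exist_ok=True)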
Contents of the urls.txt file:
(the last url is an svg)
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e

How to scrape pdf to local folder with filename = url and delay within iteration?

I scraped a website (url = "http://bla.com/bla/bla/bla/bla.txt") for all the links containing .pdf that were important to me.
These are now stored in relative_paths:
['http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/3333/jjjjj-99-0065.pdf',
'http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/3333/jjjjj-99-1679.pdf',
'http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/4444/jjjjj-99-9526.pdf',]
Now I want to store the PDFs "behind" the links in a local folder, with their filenames being their URLs.
None of the somewhat similar questions on the internet seems to help me towards my goal. The closest I got was when my code generated some weird file that didn't even have an extension. Here are some of the more promising code samples I already tried out.
for link in relative_paths:
    content = requests.get(link, verify=False)
    with open(link, 'wb') as pdf:
        pdf.write(content.content)

for link in relative_paths:
    response = requests.get(url, verify=False)
    with open(join(r'C:/Users/', basename(url)), 'wb') as f:
        f.write(response.content)

for link in relative_paths:
    filename = link
    with open(filename, 'wb') as f:
        f.write(requests.get(link, verify=False).content)

for link in relative_paths:
    pdf_response = requests.get(link, verify=False)
    filename = link
    with open(filename, 'wb') as f:
        f.write(pdf_response.content)
Now I am confused and don't know how to move forward. Can you transform one of the for loops and provide a small explanation, please? If the URLs are too long for a filename, a split at the 3rd last / is also OK. Thanks :)
Also, I was asked by the website host not to scrape all of the PDFs at once so that the server does not get overloaded, since there are thousands of PDFs behind the many links stored in relative_paths. That is why I am searching for a way to incorporate some sort of delay within my requests.
give this a shot:
import time

count_downloads = 25  # <--- wait after every 25 downloads
time_delay = 60       # <--- wait 60 seconds

for idx, link in enumerate(relative_paths):
    if idx % count_downloads == 0:
        print('Waiting %s seconds...' % time_delay)
        time.sleep(time_delay)
    filename = link.split('jjjjj-')[-1]  # <-- whatever that is is where you want to split
    try:
        with open(filename, 'wb') as f:
            f.write(requests.get(link).content)
        print('Saved: %s' % link)
    except Exception as ex:
        print('%s not saved. %s' % (link, ex))
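If you would rather name each file after the URL itself and pause after every single request instead of every batch, an alternative sketch (the target_dir path and the 2-second delay are arbitrary assumptions) could be:

import os
import time
import requests

target_dir = r'C:\Users\pdfs'  # assumed destination folder
os.makedirs(target_dir, exist_ok=True)

for link in relative_paths:
    filename = os.path.basename(link)   # e.g. 'jjjjj-99-0065.pdf'
    path = os.path.join(target_dir, filename)
    try:
        r = requests.get(link, verify=False)
        r.raise_for_status()
        with open(path, 'wb') as f:
            f.write(r.content)
        print('Saved: %s' % link)
    except Exception as ex:
        print('%s not saved. %s' % (link, ex))
    time.sleep(2)  # short delay after each request to go easy on the server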

How to batch read and then write a list of weblink .JSON files to specified locations on C drive in Python v2.7

I have a long list of .json files that I need to download to my computer. I want to download them as .json files (so no parsing or anything like that at this point).
I have some code that works for small files, but it is pretty buggy. Also it doesn't handle multiple links well.
Appreciate any advice to fix up this code:
import os
filename = 'test.json'
path = "C:/Users//Master"
fullpath = os.path.join(path, filename)
import urllib2
url = 'https://www.premierlife.com/secure/index.json'
response = urllib2.urlopen(url)
webContent = response.read()
f = open(fullpath, 'w')
f.write(webContent)
f.close
It's creating a blank file because the f.close at the end should be f.close().
I took your code, turned it into a little function, and then called it in a loop that goes through a .txt file of URLs called "list_of_urls.txt", one URL per line (you can change the delimiter in the split function if you want to format the list differently).
def save_json(url):
    import os
    filename = url.replace('/', '').replace(':', '')
    # this replaces / and : in urls
    path = "C:/Users/Master"
    fullpath = os.path.join(path, filename)
    import urllib2
    response = urllib2.urlopen(url)
    webContent = response.read()
    f = open(fullpath, 'w')
    f.write(webContent)
    f.close()
And then the loop:
f = open('list_of_urls.txt')
p = f.read()
url_list = p.split('\n') #here's where \n is the line break delimiter that can be changed
for url in url_list:
    save_json(url)
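Since the root cause was the missing parentheses on f.close, a with block sidesteps that class of mistake entirely; here is a minimal variant of save_json under the same assumptions (Python 2.7, urllib2, and the C:/Users/Master path from the answer):

import os
import urllib2

def save_json(url):
    # strip characters that are not valid in Windows filenames
    filename = url.replace('/', '').replace(':', '')
    fullpath = os.path.join("C:/Users/Master", filename)
    response = urllib2.urlopen(url)
    # 'wb' writes the bytes exactly as downloaded; the file is
    # closed automatically when the with block ends
    with open(fullpath, 'wb') as f:
        f.write(response.read())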

How to get python to successfully download large images from the internet

So I've been using
urllib.request.urlretrieve(URL, FILENAME)
to download images off the internet. It works great, but fails on some images. The ones it fails on seem to be the larger ones, e.g. http://i.imgur.com/DEKdmba.jpg. It downloads them fine, but when I try to open these files, Photo Viewer gives me the error "Windows Photo Viewer can't open this picture because the file appears to be damaged, corrupted or too large".
What might be the reason it can't download these, and how can I fix this?
EDIT: after looking further, I don't think the problem is large images: it manages to download larger ones. It just seems to be some random ones that it can never download, no matter how many times I run the script again. Now I'm even more confused.
In the past, I have used this code for copying from the internet. I have had no trouble with large files.
def download(url):
    file_name = raw_input("Name: ")
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)
    file_size_dl = 0
    block_size = 8192
    while True:
        buffer = u.read(block_size)
        if not buffer:
            break
        # write each block as it is read so large files never sit fully in memory
        file_size_dl += len(buffer)
        f.write(buffer)
    f.close()
Here's the sample code for Python 3 (tested in Windows 7):
import urllib.request

def download_very_big_image():
    url = 'http://i.imgur.com/DEKdmba.jpg'
    filename = 'C://big_image.jpg'
    conn = urllib.request.urlopen(url)
    output = open(filename, 'wb')  # binary flag needed for Windows
    output.write(conn.read())
    output.close()
For completeness sake, here's the equivalent code in Python 2:
import urllib2

def download_very_big_image():
    url = 'http://i.imgur.com/DEKdmba.jpg'
    filename = 'C://big_image.jpg'
    conn = urllib2.urlopen(url)
    output = open(filename, 'wb')  # binary flag needed for Windows
    output.write(conn.read())
    output.close()
This should work: use the requests module:
import requests
img_url = 'http://i.imgur.com/DEKdmba.jpg'
img_name = img_url.split('/')[-1]
img_data = requests.get(img_url).content
with open(img_name, 'wb') as handler:
    handler.write(img_data)
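For very large files it can also help to stream the response instead of holding it all in memory; a sketch using the stream option in requests (the 64 KB chunk size is an arbitrary choice):

import requests

img_url = 'http://i.imgur.com/DEKdmba.jpg'
img_name = img_url.split('/')[-1]

with requests.get(img_url, stream=True) as r:
    r.raise_for_status()  # fail loudly instead of silently saving an error page
    with open(img_name, 'wb') as handler:
        for chunk in r.iter_content(chunk_size=64 * 1024):
            handler.write(chunk)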
