corrupt zip download urllib2 - python

I am trying to download zip files from measuredhs.com using the following code:
import urllib2

url = 'https://dhsprogram.com/customcf/legacy/data/download_dataset.cfm?Filename=BFBR62DT.ZIP&Tp=1&Ctry_Code=BF'
request = urllib2.urlopen(url)
output = open("install.zip", "w")
output.write(request.read())
output.close()
However, the downloaded file does not open; I get a message saying the compressed zip folder is invalid.
To access the download link one needs to log in, which I have done. If I click the link it downloads the file automatically, and the same happens if I paste it into a browser.
Thanks

Try writing the local file in binary mode:
with open('install.zip', 'wb') as output:
    output.write(request.read())
Also, comparing the md5/sha1 hash of the downloaded file against the hash of a known-good copy will tell you whether the download was corrupted.
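The hash check can be sketched like this with the standard library; the expected digest would come from the server or a known-good copy, which the question doesn't provide:

```python
import hashlib

def sha1_of_file(path, chunk_size=65536):
    """Compute the SHA-1 hex digest of a file, reading in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare sha1_of_file("install.zip") against the digest of a known-good copy;
# a mismatch means the download was corrupted or truncated.
```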

Related

RAR files not downloading correctly using requests module

Recently, I was creating an installer for my project. It involves downloading a RAR file from my server, unRAR-ing it and putting those folders into the correct locations.
Once it's downloaded, the program is supposed to unRAR it, but instead it gives me an error in the console: Corrupt header of file.
I should note that this error is coming from the unRAR program bundled with WinRAR.
I also tried opening the file using the GUI of WinRAR, and it gave the same error.
I'm assuming while it's being downloaded, it's being corrupted somehow?
Also, when I download it manually using a web browser, it downloads fine.
I've tried this code:
import requests

KALI = "URL CENSORED"
kali_res = requests.get(KALI, stream=True)
for chunk in kali_res.iter_content(chunk_size=128):
    open(file_path, "wb").write(chunk)
...but it still gives the same error.
Could someone please help?
You keep re-opening the file for every chunk.
Not only does this leak file descriptors, it also means you keep overwriting the file.
Try this instead:
KALI = "URL CENSORED"
kali_res = requests.get(KALI, stream=True)
with open(file_path, "wb") as outfile:
    for chunk in kali_res.iter_content(chunk_size=128):
        outfile.write(chunk)
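A slightly more defensive version of the same idea, sketched here with placeholder URL and path names, also checks the HTTP status before writing anything, so a 404 error page never gets saved as if it were the archive:

```python
import requests

def download_file(url, dest_path, chunk_size=1 << 16):
    """Stream a remote file to disk, failing fast on HTTP errors."""
    with requests.get(url, stream=True, timeout=30) as resp:
        resp.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
        with open(dest_path, "wb") as outfile:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                outfile.write(chunk)  # chunks are appended to the one open handle
    return dest_path
```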

PDF manipulation with Python

I need to remove the last page of a pdf file. I have multiple pdf files in the same directory. So far I have the next code:
from PyPDF2 import PdfFileWriter, PdfFileReader
import os

def changefile(file):
    infile = PdfFileReader(file, "rb")
    output = PdfFileWriter()
    numpages = infile.getNumPages()
    for i in range(numpages - 1):
        p = infile.getPage(i)
        output.addPage(p)
    with open(file, 'wb') as f:
        output.write(f)

for file in os.listdir("C:\\Users\\Conan\\PycharmProjects\\untitled"):
    if file.endswith(".pdf") or file.endswith(".PDF"):
        changefile(file)
My script worked while testing it. I need this script for my work: each day I have to download several e-invoices from our main external supplier. The last page always contains the conditions of sale and is useless. Unfortunately, our supplier adds a signature to each PDF, which breaks my script.
When I am trying to run it on the invoices I receive the following error:
line 1901, in read
raise utils.PdfReadError("Could not find xref table at specified location")
PyPDF2.utils.PdfReadError: Could not find xref table at specified location
I was able to fix it on my Linux laptop by running qpdf invoice.pdf invoice-fix, but I can't install QPDF at work, where Windows is used.
I suppose this error is triggered by the signature left by our supplier on each PDF file.
Does anyone know how to fix this error? I am looking for an efficient method to fix the issue with a broken PDF file and its signature. There must be something better than opening each PDF file with Adobe and removing the signature manually ...
Automation would be nice, because I daily put multiple invoices in the same directory.
Thanks.
The problem is probably a corrupt PDF file. QPDF is indeed able to work around that. That's why I would recommend using the pikepdf library instead of PyPDF2 in this case. Pikepdf is based on QPDF. First install pikepdf (for example using pip or your package manager) and then try this code:
import pikepdf
import os

def changefile(file):
    print("Processing {0}".format(file))
    pdf = pikepdf.Pdf.open(file)
    lastPageNum = len(pdf.pages)
    pdf.pages.remove(p=lastPageNum)
    pdf.save(file + '.tmp')
    pdf.close()
    os.unlink(file)
    os.rename(file + '.tmp', file)

for file in os.listdir("C:\\Users\\Conan\\PycharmProjects\\untitled"):
    if file.lower().endswith(".pdf"):
        changefile(file)
Link to pikepdf docs: https://pikepdf.readthedocs.io/en/latest/
Let me know if that works for you.

python: extracting a .bz2 compressed file from a torrent file

I have a .torrent file whose payload is a .bz2 file. I am sure the .bz2 is actually referenced by the .torrent because I extracted it with uTorrent.
How can I do the same thing in Python instead of using uTorrent?
I have seen a lot of libraries for dealing with .torrent files in Python, but apparently none does what I need. Among my unsuccessful attempts I can mention:
import torrent_parser as tp
file_cont = tp.parse_torrent_file('RC_2015-01.bz2.torrent')
file_cont is now a dictionary and file_cont['info']['name'] == 'RC_2015-01.bz2', but if I try to open the file, i.e.
from bz2 import BZ2File

with BZ2File(file_cont['info']['name']) as f:
    what_I_want = f.read()
then the content of the dictionary is (obviously, I'd say) interpreted as a path, and I get
No such file or directory: 'RC_2015-01.bz2'
Other attempts have been even more ruinous.
A .torrent file is just metadata: it tells a BitTorrent client where to find the data and what the payload file is called. You cannot get the file's contents from the .torrent itself.
Only once you have downloaded the torrent's payload to disk (using torrent software) can you use BZ2File to open it (if it is in .bz2 format).
If you want to perform the actual download with Python, the only option I found was torrent-dl, which hasn't been updated for 2 years.
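Once a torrent client has fetched the payload, the decompression step is plain standard-library bz2; a sketch, where the payload filename is whatever the .torrent metadata names:

```python
import bz2
import os

def read_payload(name):
    """Open a torrent's payload file once a torrent client has downloaded it."""
    if not os.path.exists(name):
        # the .torrent itself holds only metadata, never the data
        raise FileNotFoundError("%s has not been downloaded yet" % name)
    with bz2.open(name, "rb") as f:  # bz2.open transparently decompresses
        return f.read()
```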

Unable to open saved excel file using urllib.request.urlretrieve (Sample link mentioned )

Currently, I'm using Flask with Python 3.
For sample purposes, here is a dropbox link
In order to fetch the file and save it, I'm doing the following.
urllib.request.urlretrieve("https://www.dropbox.com/s/w1h6vw2st3wvtfb/Sample_List.xlsx?dl=0", "Sample_List.xlsx")
The file is saved successfully to my project's root directory; however, there is a problem: when I try to open the file, I get an error.
What am I doing wrong over here?
Also, is there a way to get the file name and extension from the URL itself? For example, filename = Sample_List and extension = xlsx, or something like this.
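On the second part of the question, the filename and extension can be pulled from the URL with the standard library; a sketch using the Dropbox URL from the question:

```python
import os
from urllib.parse import urlparse

def name_and_ext(url):
    """Split the last path segment of a URL into (stem, extension)."""
    path = urlparse(url).path           # drops the ?dl=0 query string
    filename = os.path.basename(path)   # e.g. 'Sample_List.xlsx'
    return os.path.splitext(filename)   # ('Sample_List', '.xlsx')

# name_and_ext("https://www.dropbox.com/s/w1h6vw2st3wvtfb/Sample_List.xlsx?dl=0")
# → ('Sample_List', '.xlsx')
```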

Saving a .tar.gz file located on server to a FILE object

I'm currently working on a Python Flask API.
For demo purposes, I have a folder in the server containing .tar.gz files.
Basically I'm wondering how I can load one of these files, knowing its relative path name (say file.tar.gz), into a file object. I need it in a form that lets me run the following code on it, where f is the file object:
tar = tarfile.open(mode="r:gz", fileobj=f)
for member in tar.getnames():
    tf = tar.extractfile(member)
Thanks in advance!
I'm not very familiar with this, but just saving it normally with the .tar.gz extension should work. If you already have the compressed bytes, very simple code can do that:
compressed_data = ...  # the bytes of your file
with open('file.tar.gz', 'wb') as fo:
    fo.write(compressed_data)
The with block flushes and closes the file for you.
Will this do the job, or am I getting something wrong here?
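Since the file already exists on the server, the snippet from the question only needs an ordinary binary file handle as its fileobj; a minimal sketch, with file.tar.gz standing in for the real relative path:

```python
import tarfile

def list_and_extract(path):
    """Open a local .tar.gz via a plain binary file object and list its members."""
    with open(path, "rb") as f:              # an ordinary file handle works as fileobj
        tar = tarfile.open(mode="r:gz", fileobj=f)
        names = tar.getnames()
        for member in names:
            tf = tar.extractfile(member)     # file-like object, or None for directories
        tar.close()
    return names
```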
