RAR files not downloading correctly using requests module - python

Recently, I was creating an installer for my project. It involves downloading a RAR file from my server, unRAR-ing it and putting those folders into the correct locations.
Once it's downloaded, the program is supposed to unRAR it, but instead it gives me an error in the console: Corrupt header of file.
I should note that this error is coming from the unRAR program bundled with WinRAR.
I also tried opening the file using the GUI of WinRAR, and it gave the same error.
I'm assuming while it's being downloaded, it's being corrupted somehow?
Also, when I download it manually using a web browser, it downloads fine.
I've tried this code:
KALI = "URL CENSORED"
kali_res = requests.get(KALI, stream=True)
for chunk in kali_res.iter_content(chunk_size=128):
    open(file_path, "wb").write(chunk)
...but it still gives the same error.
Could someone please help?

You keep re-opening the file for every chunk.
Not only does this leak file descriptors, it also means you keep overwriting the file: mode "wb" truncates it on every open, so only the last chunk is left on disk.
Try this instead:
KALI = "URL CENSORED"
kali_res = requests.get(KALI, stream=True)
with open(file_path, "wb") as outfile:
    for chunk in kali_res.iter_content(chunk_size=128):
        outfile.write(chunk)
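To see the difference without touching the network, here is a minimal sketch that feeds the same chunks through both patterns. The chunk list stands in for kali_res.iter_content(...), and the function names and file names are made up for illustration:

```python
import os
import tempfile


def write_reopening(chunks, path):
    # Pattern from the question: "wb" truncates the file on every open,
    # so each chunk replaces the previous one.
    for chunk in chunks:
        with open(path, "wb") as outfile:
            outfile.write(chunk)


def write_once(chunks, path):
    # Pattern from the answer: open once, write every chunk to the same handle.
    with open(path, "wb") as outfile:
        for chunk in chunks:
            outfile.write(chunk)


# Stand-in for the streamed response body: three made-up chunks.
chunks = [b"Rar!\x1a\x07\x00", b"middle-of-archive", b"tail"]

workdir = tempfile.mkdtemp()
buggy_path = os.path.join(workdir, "buggy.rar")
fixed_path = os.path.join(workdir, "fixed.rar")
write_reopening(chunks, buggy_path)
write_once(chunks, fixed_path)

buggy_size = os.path.getsize(buggy_path)
fixed_size = os.path.getsize(fixed_path)
print("reopening each time:", buggy_size, "bytes")
print("opened once:", fixed_size, "bytes")
```

The first file ends up holding only the final chunk, which is why WinRAR cannot find a valid header in it.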

Related

Error BadZipFile inconsistently raised

I am using selenium to successively download a number of ZIP files which I subsequently rename and unzip.
os.rename(F"C:/Users/Info/Desktop/PR RAW 3000/Test.ZIP", F"C:/Users/Info/Desktop/PR RAW 3000/12345.ZIP") ##Rename downloaded file
time.sleep(2)
#Unzip File
target_zip = F"C:/Users/Info/Desktop/PR RAW 3000/12345.ZIP"
os.mkdir(F'C:/Users/Info/Desktop/PR RAW 3000/12345_Sample') ##Create path
handle = zipfile.ZipFile(target_zip)
handle.extractall(F'C:/Users/Info/Desktop/PR RAW 3000/12345_Sample') ##Unzip file into created path
handle.close()
The code works just fine most of the time. Sometimes, though, the error BadZipFile: File is not a zip file is raised and I have no idea why. The error pops up inconsistently: when the code breaks at one specific file and I restart it, it runs smoothly past the point where it previously failed.
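One possibility (an assumption, nothing in the post confirms it) is that the browser is occasionally still writing the download when the fixed 2-second sleep ends. A minimal sketch that polls until the file parses as a ZIP instead of sleeping for a fixed time; wait_for_zip and its parameters are hypothetical names, and the demo files are made up:

```python
import os
import tempfile
import time
import zipfile


def wait_for_zip(path, timeout=30.0, poll=0.5):
    """Return True once path is a readable ZIP, False if timeout seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        # is_zipfile is False for a missing file and for one that has no
        # end-of-central-directory record yet (e.g. a partial download).
        if os.path.exists(path) and zipfile.is_zipfile(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


# Demo: one complete archive and one file that only looks like the start of a ZIP.
workdir = tempfile.mkdtemp()
good_zip = os.path.join(workdir, "good.zip")
with zipfile.ZipFile(good_zip, "w") as archive:
    archive.writestr("data.txt", "hello")

partial = os.path.join(workdir, "partial.zip")
with open(partial, "wb") as f:
    f.write(b"PK\x03\x04 incomplete download")

good_ready = wait_for_zip(good_zip, timeout=1.0, poll=0.1)
partial_ready = wait_for_zip(partial, timeout=0.3, poll=0.1)
print(good_ready, partial_ready)
```

If this is the cause, calling such a check on the downloaded file before os.rename and zipfile.ZipFile would replace the time.sleep(2).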

Intermittent "No such file or directory" and permission errors when opening files (in a loop) on mounted FTP drive (linux)? Sync issue?

Getting errors like
FileNotFoundError: [Errno 2] No such file or directory: '/path/to/files/file.pdf'
when trying to loop through and open files in a mounted FTP drive (mounted via curlftpfs 'myuser:mypassword'#MY.SERVER.IP /path/to/files). I suspect sync issues, as the mounted drive is from another server on our network.
I can see that the file is there, can open it manually, can ls '/path/to/files/file.pdf' to see the file, but when executing...
FILES = os.listdir('/path/to/files')
FILES.sort()
.
.
.
for file in FILES:
    with open(os.path.join('/path/to/files', 'file.pdf'), 'rb') as fd:
        do stuff
... I sometimes get the FileNotFoundError.
More confusing, I can actually open this file (using the same path string that the error message tells me is not a file or directory) separately by just starting a Python interactive shell and running something like...
fd = open('/path/to/files/file.pdf', 'rb')
fd.read()
...so IDK what the issue could be when reading it in a list of files.
Any debugging ideas or ideas of what could be causing this? Could there be some kind of timing/sync issues between reading the files on the mounted FTP drive vs the script that is running locally (and how to fix)?
* UPDATE:
Oddly, printing the target path before trying to open the file like...
print(os.path.join('/path/to/files', 'file.pdf'))
time.sleep(2) # giving even more time after initial access
with open(os.path.join('/path/to/files', 'file.pdf'), 'rb') as fd:
    do stuff
...seems to help (kinda). It now also randomly throws PermissionErrors for random files that I had no problem reading before (and still occasionally throws FileNotFoundErrors), even though I can open those files when accessing them individually in a Python interactive shell. That makes me think even more that it is some kind of sync issue. Will need to investigate more.
os.path is a module, not a function, so calling it like os.path('/path/to/files/file.pdf') raises a TypeError.
That said, I don't think this is the cause of the FileNotFoundError.
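If the errors really are transient (the question suspects a sync issue on the mount, which is unconfirmed), retrying the open a few times is a common workaround rather than a fix for the underlying cause. A minimal sketch; open_with_retry, attempts and delay are hypothetical names, and the demo file is made up:

```python
import os
import tempfile
import time


def open_with_retry(path, mode="rb", attempts=5, delay=1.0):
    """Open path, retrying on the two errors reported in the question."""
    for attempt in range(1, attempts + 1):
        try:
            return open(path, mode)
        except (FileNotFoundError, PermissionError):
            if attempt == attempts:
                raise  # give up and surface the last error
            time.sleep(delay)


# Demo on a local temp file standing in for a file on the mounted drive.
workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, "file.pdf")
with open(demo_path, "wb") as f:
    f.write(b"%PDF-demo")

with open_with_retry(demo_path, attempts=3, delay=0.1) as fd:
    demo_bytes = fd.read()
print(demo_bytes)
```

Logging each failed attempt would also show whether the failures cluster around particular files or times, which may help narrow down the sync theory.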

PDF manipulation with Python

I need to remove the last page of a pdf file. I have multiple pdf files in the same directory. So far I have the following code:
from PyPDF2 import PdfFileWriter, PdfFileReader
import os
def changefile (file):
    infile = PdfFileReader(file, "rb")
    output = PdfFileWriter()
    numpages = infile.getNumPages()
    for i in range (numpages -1):
        p = infile.getPage(i)
        output.addPage(p)
    with open(file, 'wb') as f:
        output.write(f)
for file in os.listdir("C:\\Users\\Conan\\PycharmProjects\\untitled"):
    if file.endswith(".pdf") or file.endswith(".PDF"):
        changefile(file)
My script worked while testing it. I need this script for my work. Each day I need to download several e-invoices from our main external supplier. The last page always mentions the conditions of sales and is useless. Unfortunately our supplier left a signature, causing my script not to work properly.
When I am trying to run it on the invoices I receive the following error:
line 1901, in read
raise utils.PdfReadError("Could not find xref table at specified location")
PyPDF2.utils.PdfReadError: Could not find xref table at specified location
I was able to fix it on my Linux laptop by running qpdf invoice.pdf invoice-fix. I can't install QPDF at work, where Windows is used.
I suppose this error is triggered by the signature left by our supplier on each PDF file.
Does anyone know how to fix this error? I am looking for an efficient method to fix the issue with a broken PDF file and its signature. There must be something better than opening each PDF file with Adobe and removing the signature manually ...
Automation would be nice, because I put multiple invoices in the same directory every day.
Thanks.
The problem is probably a corrupt PDF file. QPDF is indeed able to work around that. That's why I would recommend using the pikepdf library instead of PyPDF2 in this case. Pikepdf is based on QPDF. First install pikepdf (for example using pip or your package manager) and then try this code:
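import pikepdf
import os

PDF_DIR = "C:\\Users\\Conan\\PycharmProjects\\untitled"

def changefile (file):
    print("Processing {0}".format(file))
    pdf = pikepdf.Pdf.open(file)
    lastPageNum = len(pdf.pages)
    pdf.pages.remove(p = lastPageNum)  # p is a 1-based page number
    pdf.save(file + '.tmp')
    pdf.close()
    # Swap the shortened copy into place of the original
    os.unlink(file)
    os.rename(file + '.tmp', file)

for name in os.listdir(PDF_DIR):
    if name.lower().endswith(".pdf"):
        # os.listdir returns bare file names, so join them with the directory
        changefile(os.path.join(PDF_DIR, name))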
Link to pikepdf docs: https://pikepdf.readthedocs.io/en/latest/
Let me know if that works for you.
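The save-to-temp-then-swap step in the code above can also be done with os.replace, which overwrites an existing destination on both Windows and POSIX, so the separate os.unlink is not needed. A minimal stdlib-only sketch of that step; replace_file and the demo file are hypothetical, and the PDF handling itself is left out:

```python
import os
import tempfile


def replace_file(path, new_bytes):
    """Write new_bytes to a sibling temp file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as tmp:
        tmp.write(new_bytes)
    # os.replace overwrites the destination in one call; os.rename would
    # fail on Windows when the destination already exists.
    os.replace(tmp_path, path)


# Demo with a made-up file standing in for an invoice.
workdir = tempfile.mkdtemp()
invoice = os.path.join(workdir, "invoice.pdf")
with open(invoice, "wb") as f:
    f.write(b"old contents")

replace_file(invoice, b"new contents")
with open(invoice, "rb") as f:
    result = f.read()
print(result)
```

Writing to a temporary file first means a crash in the middle of saving leaves the original invoice untouched.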

corrupt zip download urllib2

I am trying to download zip files from measuredhs.com using the following code:
url ='https://dhsprogram.com/customcf/legacy/data/download_dataset.cfm?Filename=BFBR62DT.ZIP&Tp=1&Ctry_Code=BF'
request = urllib2.urlopen(url)
output = open("install.zip", "w")
output.write(request.read())
output.close()
However the downloaded file does not open. I get a message saying the compressed zip folder is invalid.
To access the download link, one needs to log in, which I have done. If I click on the link, or even paste it into a browser, the file downloads automatically.
Thanks
Try writing to the local file in binary mode.
with open('install.zip', 'wb') as output:
    output.write(request.read())
Also, comparing the md5/sha1 hash of the downloaded file against the hash of a copy downloaded through the browser will tell you whether the file was corrupted.
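The hash comparison can be done with hashlib from the standard library. A minimal Python 3 sketch; file_hash is a hypothetical helper, and the payload below is made-up data standing in for the real download:

```python
import hashlib
import os
import tempfile


def file_hash(path, algorithm="sha1", block_size=65536):
    """Hash a file in blocks so large downloads need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Demo: write made-up binary data in binary mode and check it round-trips.
payload = b"PK\x03\x04" + bytes(range(256)) * 4
workdir = tempfile.mkdtemp()
downloaded = os.path.join(workdir, "install.zip")
with open(downloaded, "wb") as output:
    output.write(payload)

expected = hashlib.sha1(payload).hexdigest()
actual = file_hash(downloaded)
print(actual == expected)
```

In practice the expected value would come from hashing the copy that the browser downloaded correctly.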

Extracting .app from zip file in Python

(Python 2.7)
I have a program that will download a .zip file from a server, containing a .app file which I'd like to run. The .zip downloads fine from the server, and trying to extract it outside of Python works fine. However, when I try to extract the zip from Python, the .app doesn't run - it does not say the file is corrupted or damaged, it simply won't launch. I've tried this with other .app files, and I get the same problem, and was wondering if anyone else has had this problem before and a way to fix it?
The code I'm using:
for a in gArchives:
    if (a['fname'].endswith(".build.zip") or a['fname'].endswith(".patch.zip")):
        #try to extract: if not, delete corrupted zip
        try :
            zip_file = zipfile.ZipFile(a['fname'], 'r')
        except:
            os.remove(a['fname'])
        for files in zip_file.namelist() :
            #deletes local files in the zip that already exist
            if os.path.exists(files) :
                try :
                    os.remove(files)
                except:
                    print("Cannot remove file")
                try :
                    shutil.rmtree(files)
                except:
                    print("Cannot remove directory")
            try :
                zip_file.extract(files)
            except:
                print("Extract failed")
        zip_file.close()
I've also tried using zip_file.extractall(), and I get the same problem.
Testing on my macbook pro, the problem appears to be with the way Python extracts the files.
If you run
diff -r python_extracted_zip normal_extracted_zip
You will come across messages like this:
File Seashore.app/Contents/Frameworks/TIFF.framework/Resources is a directory while file here/Seashore.app/Contents/Frameworks/TIFF.framework/Resources is a regular file
So obviously the issue is with the filenames it's coming across as it's extracting them. You will need to implement some checking of the filenames as you extract them.
EDIT: This appears to be a bug in Python 2.7.*, as described in another question.
Managed to resolve this myself - the problem was not to do with directories not being extracted correctly, but in fact with permissions as eri mentioned above.
When the files were being extracted with Python, the permissions were not being kept as they were inside the .zip, so all executable files were set to be not executable. This problem was resolved with a call to the following on all files I extracted, where 'path' is the path of the file:
os.chmod(path, 0o755)  # 0o755 works on Python 2.6+ and 3; the bare 0755 literal is a syntax error on Python 3
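Instead of forcing 0755 on every file, the mode stored in the archive can be reapplied after extraction. A minimal Python 3 sketch, assuming the zip was created on a Unix-like system so that the upper 16 bits of ZipInfo.external_attr hold the Unix mode; extract_with_permissions and the demo archive are made up for illustration:

```python
import os
import stat
import tempfile
import zipfile


def extract_with_permissions(zip_path, dest_dir):
    """Extract every member, then reapply the Unix mode bits stored in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            extracted = archive.extract(info, dest_dir)
            # Upper 16 bits of external_attr carry the Unix mode;
            # 0 means the archive did not record one.
            mode = (info.external_attr >> 16) & 0o7777
            if mode:
                os.chmod(extracted, mode)


# Demo: a made-up archive with one member stored as executable (0755).
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "demo.zip")
info = zipfile.ZipInfo("Demo.app/Contents/MacOS/demo")
info.external_attr = (stat.S_IFREG | 0o755) << 16
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr(info, "#!/bin/sh\necho hello\n")

dest_dir = os.path.join(workdir, "out")
extract_with_permissions(zip_path, dest_dir)
target = os.path.join(dest_dir, "Demo.app", "Contents", "MacOS", "demo")
extracted_mode = stat.S_IMODE(os.stat(target).st_mode)
print(oct(extracted_mode))
```

This keeps non-executable files non-executable, which a blanket chmod does not. It does not recreate symlinks stored in the archive, which .app bundles often contain.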
