I have the following problem: I have a local .zip file and a .zip file located on a server. I need to check whether the .zip file on the server is different from the local one; if it is, I need to pull the new one from the server. My question is: how do I compare them without downloading the file from the server and comparing locally?
I could create an MD5 hash for the zip file on the server when creating the .zip file and then compare it with the MD5 of my local .zip file, but is there a simpler way?
Short answer: You can't.
Long answer: To compare with the zip file on the server, someone has to read that file. Either you can do that locally, which would involve pulling it, or you can ask the server to do it for you. Can you run code on the server?
Edit
If you can run Python on the server, why not hash the file and compare hashes?
import hashlib

with open(<path-to-file>, "rb") as theFile:
    m = hashlib.md5()
    for line in theFile:
        m.update(line)

with open(<path-to-hashfile>, "wb") as theFile:
    theFile.write(m.digest())
and then compare the contents of hashfile with a locally-generated hash?
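For illustration, a minimal sketch of the comparison step; the URLs and file names here are hypothetical, and it assumes the hash file written above is published next to the archive over HTTP:

import hashlib
import urllib.request

def local_md5(path):
    m = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            m.update(chunk)
    return m.digest()

# Hypothetical URLs/paths; the server publishes the digest written by the snippet above.
remote_digest = urllib.request.urlopen("https://example.com/data.zip.md5").read()
if remote_digest != local_md5("data.zip"):
    urllib.request.urlretrieve("https://example.com/data.zip", "data.zip")  # pull the new file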
Another edit
You asked for a simpler way. Think about this in an abstract way for a moment:
You don't want to download the entire zip file.
Hence, you can't process the entire file locally (because that would involve reading all of it from the server, which is equivalent to downloading it!).
Hence, you need to do some processing on the server. Specifically, you want to come up with some small amount of data that 'encodes' the file, so that you can fetch this small amount of data without fetching the whole file.
But this is a hash!
Therefore, you need to do some sort of hashing. Given that, I think the above is pretty simple.
I would like to know how you would compare them locally, if you had both files. You can apply the same logic to compare them remotely.
You can log in using ssh, compute an MD5 hash for the remote file and an MD5 hash for the current local file. If the MD5s match, the files are identical; otherwise they are different.
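A rough sketch of that approach; the host name, remote path and key-based ssh login are assumptions here:

import hashlib
import subprocess

# Assumes passwordless (key-based) ssh and md5sum installed on the server.
remote_hash = subprocess.check_output(
    ["ssh", "user@server", "md5sum", "/srv/files/archive.zip"]
).split()[0].decode()

with open("archive.zip", "rb") as f:
    local_hash = hashlib.md5(f.read()).hexdigest()

if remote_hash != local_hash:
    print("Files differ - pull the new archive")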
Related
I'm currently using ftplib in Python to get some files and write them to S3.
The approach I'm using is to use with open as shown below:
with open('file-name', 'wb') as fp:
    ftp.retrbinary('RETR file-name', fp.write)
to download files from the FTP server and save them in a temporary folder, then upload them to S3.
I wonder if this is best practice, because the shortcoming of this approach is:
if the files are too many and too big, I can download them, upload them to S3, then delete them from the temp folder,
but the question is: if I run this script once a day, I have to download everything again. So how can I check whether a file has already been downloaded and exists in S3, so that the script only processes the files newly added to the FTP server?
Hope this makes sense, would be great if anyone has an example or something, many thanks.
You cache the fact that you processed a given file path to persistent storage (say, a SQLite database). If the file may change after you processed it, you may be able to detect this by also caching the timestamp from FTP.dir() and/or the size from FTP.size(filename). If that doesn't work, you also cache a checksum (say, SHA-256) of the file, and then download the file again to recalculate the checksum and see whether it changed. S3 might support a conditional upload (ETag), in which case you would calculate the ETag of the file and upload it with that header set, ideally with an 'Expect: 100-continue' header, to see whether S3 already has the file before you try to upload the data.
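A minimal sketch of the "remember what was processed" part, using SQLite; the table layout and helper names are illustrative, not a fixed API:

import sqlite3

db = sqlite3.connect("processed.db")
db.execute("""CREATE TABLE IF NOT EXISTS processed
              (path TEXT PRIMARY KEY, size INTEGER, mtime TEXT)""")

def already_processed(path, size, mtime):
    # True if we have seen this path before with the same size and timestamp
    row = db.execute("SELECT size, mtime FROM processed WHERE path = ?", (path,)).fetchone()
    return row == (size, mtime)

def mark_processed(path, size, mtime):
    db.execute("INSERT OR REPLACE INTO processed VALUES (?, ?, ?)", (path, size, mtime))
    db.commit()

Before downloading each FTP entry you would call already_processed() with the size and timestamp reported by the server, and call mark_processed() only after a successful upload to S3.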
We often need to unzip extremely large (unencrypted) ZIP files that are hosted by partners over HTTPS. Generally, the ZIP file format needs to download in full to be able to see the "central directory" data to identify file entries; however, in our case, we can assume there's exactly one large text file that was zipped, and we could begin extracting and parsing data immediately without needing to wait for the ZIP file to buffer.
If we were using C#, we could use https://github.com/icsharpcode/SharpZipLib/wiki/Unpack-a-zip-using-ZipInputStream, which handles this pattern elegantly.
However, it seems that the Python standard library's zipfile module doesn't support this type of streaming; it assumes that the input file-like object is seekable, and all tutorials point to iterating first over namelist() which seeks to the central directory data, then open(name) which seeks back to the file entry.
Many other examples on StackOverflow recommend using BytesIO(response.content) which might appear to pipe the content in a streaming way; however, .content in the Requests library consumes the entire stream and buffers the entire thing to memory.
Is there an alternate way to use zipfile or a third-party Python library to do this in a completely streaming way?
Is there an alternate way to use zipfile or a third-party Python library to do this in a completely streaming way?
Yes: https://github.com/uktrade/stream-unzip can do it [full disclosure: essentially written by me].
We often need to unzip extremely large (unencrypted) ZIP files that are hosted by partners over HTTPS.
The example from the README shows how to do this, using stream-unzip and httpx:
from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Any iterable that yields a zip file
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
If you do just want the first file, you can use break after the first file:
for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
    break
Also
Generally, the ZIP file format needs to download in full to be able to see the "central directory" data to identify file entries
This isn't completely true.
Each file has a "local" header that contains its name, and it can be worked out when the compressed data for any member file ends (via information in the local header if it's there, or from the compressed data itself). While there is more information in the central directory at the end, if you just need the names and bytes of the files, then it is possible to start unzipping a ZIP file that contains multiple files as it's downloading.
I can't claim it's absolutely possible in all cases: technically, ZIP allows for many different compression algorithms and I haven't investigated them all. However, for DEFLATE, which is the one most commonly used, it is possible.
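As a rough illustration of the local-header point, here is a sketch that reads the first member's name from raw zip bytes without touching the central directory; it assumes the happy path with no ZIP64 extensions and ignores the flag that puts the sizes in a trailing data descriptor:

import struct

def first_member_name(zip_bytes):
    # A ZIP member starts with a local file header: a 4-byte signature,
    # 26 bytes of fixed fields, then the file name and extra field.
    assert zip_bytes[:4] == b'PK\x03\x04'
    name_len, extra_len = struct.unpack('<HH', zip_bytes[26:30])
    return zip_bytes[30:30 + name_len].decode('cp437')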
It's even possible to download one specific file from a .zip without downloading the whole archive. All you need is a server that allows reading byte ranges: fetch the end of central directory record (to know the size of the central directory), fetch the central directory (to know where the file starts and ends), and then fetch the proper bytes and handle them.
Using Onlinezip you can handle the file like a local file. Even the API is identical to ZipFile in Python.
[full disclosure: I'm the author of the library]
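For the curious, a bare-bones sketch of the range-request idea described above; it assumes the server honours Range headers and that the archive has no trailing comment and no ZIP64 records, and the URL is hypothetical:

import struct
import httpx  # any HTTP client with Range support would do

URL = 'https://www.example.com/my.zip'

# The End of Central Directory record is the last 22 bytes when there is no comment.
eocd = httpx.get(URL, headers={'Range': 'bytes=-22'}).content
assert eocd[:4] == b'PK\x05\x06'
cd_size, cd_offset = struct.unpack('<II', eocd[12:20])

# Fetch only the central directory; it lists every member's name,
# compressed size and the offset of its local header.
cd = httpx.get(URL, headers={'Range': f'bytes={cd_offset}-{cd_offset + cd_size - 1}'}).content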
I have a simple web-server written using Python Twisted. Users can log in and use it to generate certain reports (pdf-format), specific to that user. The report is made by having a .tex template file where I replace certain content depending on user, including embedding user-specific graphs (.png or similar), then use the command line program pdflatex to generate the pdf.
Currently the graphs are saved in a tmp folder, and that path is then put into the .tex template before calling pdflatex. But this probably opens up a whole pile of problems when the number of users increases, so I want to use temporary files (tempfile module) instead of a real tmp folder. Is there any way I can make pdflatex see these temporary files? Or am I doing this the wrong way?
Without any code it's hard to tell you exactly how, but
Is there any way I can make pdflatex see these temporary files?
yes, you can get the path to the temporary file by using a named temporary file:
>>> import tempfile
>>> with tempfile.NamedTemporaryFile() as temp:
...     print temp.name
...
/tmp/tmp7gjBHU
As commented, you can use tempfile.NamedTemporaryFile. The problem is that the file will be deleted as soon as it is closed, which means you have to run pdflatex while the file is still open (and therefore still referenced) within Python.
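A sketch of that ordering; the template file, placeholder string and graph-rendering helper are hypothetical names, not part of your code:

import subprocess
import tempfile

with open('report_template.tex') as f:           # hypothetical template file
    template = f.read()

with tempfile.NamedTemporaryFile(suffix='.png') as graph:
    graph.write(render_user_graph(user))          # hypothetical helper returning PNG bytes
    graph.flush()
    with open('report.tex', 'w') as f:
        f.write(template.replace('%%GRAPH_PATH%%', graph.name))  # hypothetical placeholder
    # pdflatex runs while the temporary file is still open, so it still exists on disk.
    subprocess.check_call(['pdflatex', 'report.tex'])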
As an alternative, you could just save the picture with a randomly generated name. The tempfile module is designed to let you create temporary files on various platforms in a consistent way, which is not really what you need here, since I guess you'll always run the script on the same web server.
You could generate random file names using the uuid module:
import uuid

for i in range(3):
    print(str(uuid.uuid4()))
Then you save the pictures explicitly under those random names and insert the paths into the .tex file.
After running pdflatex you have to delete the files explicitly, which is the drawback of this approach.
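A sketch of that flow with uuid-based names; the template file, placeholder and graph helper are again hypothetical:

import os
import subprocess
import uuid

graph_path = '/tmp/{}.png'.format(uuid.uuid4())
with open(graph_path, 'wb') as f:
    f.write(render_user_graph(user))              # hypothetical helper returning PNG bytes

with open('report_template.tex') as f:            # hypothetical template file
    tex = f.read().replace('%%GRAPH_PATH%%', graph_path)

tex_path = '/tmp/{}.tex'.format(uuid.uuid4())
with open(tex_path, 'w') as f:
    f.write(tex)

try:
    subprocess.check_call(['pdflatex', '-output-directory', '/tmp', tex_path])
finally:
    os.remove(graph_path)                          # explicit cleanup is the price of this approach
    os.remove(tex_path)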
I have no idea if this is possible...
Let's say I want to put test.html into a .zip archive, use ftplib to upload the file, and then, once uploaded, have it extracted, overwriting any existing files?
If that's not possible, what's the best way to upload a file, then rename it to overwrite the original file name? (Would I have to delete the original test.html from the FTP folder first?)
Any ideas?
ftp_session = ftplib.FTP('ftp.website.com', 'admin@website.com', 'password123')
ftp_file = open('output.html', 'rb')
ftp_session.cwd("/folder")
ftp_session.storlines('STOR output.html', ftp_file)
ftp_file.close()
ftp_session.quit()
The FTP server won't unzip your file, you'll have to have something running on the other side doing that.
If you want to replace a single file, upload it as test.html.tmp and then rename it to test.html. The rename (an FTP operation) should be atomic (filesystem-wise) and will overwrite the old file (it actually just deletes the old file and points the name at the new one). This way, anything reading the file gets either the old version or the new one in full; there is no danger of reading just half of the new file.
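A sketch of that upload-then-rename pattern, reusing the session from the question; ftplib's rename() maps to the FTP RNFR/RNTO commands:

import ftplib

ftp_session = ftplib.FTP('ftp.website.com', 'admin@website.com', 'password123')
ftp_session.cwd('/folder')

with open('output.html', 'rb') as ftp_file:
    # Upload under a temporary name first...
    ftp_session.storbinary('STOR output.html.tmp', ftp_file)

# ...then replace the live file in one rename, overwriting the old copy.
ftp_session.rename('output.html.tmp', 'output.html')
ftp_session.quit()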
I think that with cPanel you'll run unzip, which most likely will open the file for writing, truncate it, and then write the content. That is not atomic, so someone may read an invalid file. On the other hand, you can write a script that runs remotely and does things exactly the way you want.
I'm writing a script in python for deploying static sites to aws (s3, cloudfront, route53). Because I don't want to upload every file on every deploy, I check which files were modified by comparing their md5 hash with their e-tag (which s3 sets to be the object's md5 hash). This works well for all files except for those that my build script gzips before uploading. Taking a look inside the files, it seems like gzip isn't really a pure function; there are very slight differences in the output file every time gzip is run, even if the source file hasn't changed.
My question is this: is there any way to get gzip to reliably and repeatably output the exact same file given the exact same input? Or am I better off just checking if the file is gzipped, unzipping it and computing the md5 hash/manually setting the e-tag value for it instead?
The compressed data is the same each time. The only thing that differs is likely the modification time in the header. The fifth argument of GzipFile (if that's what you're using) allows you to specify the modification time in the header. The first argument is the file name, which also goes in the header, so you want to keep that the same. If you provide a fourth argument for the source data, then the first argument is used only to populate the file name portion of the header.
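A minimal sketch of pinning those two header fields with gzip.GzipFile; the file names are hypothetical, and it assumes the whole file fits in memory:

import gzip

with open('bundle.js', 'rb') as src, open('bundle.js.gz', 'wb') as out:
    # An empty filename keeps the name out of the gzip header, and mtime=0
    # fixes the timestamp, so identical input always yields identical output.
    with gzip.GzipFile(filename='', mode='wb', fileobj=out, mtime=0) as gz:
        gz.write(src.read())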
gzip is not stable as you figured out correctly:
[root@dev1 ~]# touch a b
[root@dev1 ~]# gzip a
[root@dev1 ~]# gzip b
[root@dev1 ~]# md5sum a.gz b.gz
8674e28eab49306b519ec7cd30128a5c  a.gz
4974585cf2e85113f1464dc9ea45c793  b.gz