Identify new files in FTP and write them to AWS S3 - python

I'm currently using ftplib in Python to fetch some files and write them to S3.
The approach I'm using is shown below:
with open('file-name', 'wb') as fp:
    ftp.retrbinary('RETR file-name', fp.write)
It downloads files from the FTP server and saves them to a temporary folder, then uploads them to S3.
I wonder whether this is best practice, because the shortcoming of this approach is:
if the files are numerous and large, I can download them, upload them to S3, and then delete them from the temp folder,
but if I run this script once a day I have to download everything again. How can I check whether a file has already been downloaded and exists in S3, so that the script only processes files newly added to the FTP server?
Hope this makes sense; it would be great if anyone has an example or something. Many thanks.

Cache the fact that you processed a given file path in persistent storage (say, a SQLite database). If a file may change after you have processed it, you may be able to detect that by also caching the timestamp from FTP.dir() and/or the size from FTP.size(filename). If that doesn't work, also cache a checksum (say, SHA-256) of the file, then download the file again and recalculate the checksum to see whether the file changed. S3 might support a conditional upload (ETag), in which case you would calculate the ETag of the file and upload it with that header set, ideally with an 'Expect: 100-continue' header, to find out whether S3 already has the file before you try to upload the data.
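Here is a minimal sketch of the caching idea, assuming boto3 and a SQLite cache; the host, credentials, bucket name and table layout are all illustrative, not part of the original answer:

import sqlite3
import ftplib
import boto3

ftp = ftplib.FTP('ftp.example.com', 'user', 'password')   # assumed server and credentials
s3 = boto3.client('s3')
bucket = 'my-bucket'                                       # assumed bucket name

db = sqlite3.connect('processed.sqlite3')
db.execute('CREATE TABLE IF NOT EXISTS processed (path TEXT PRIMARY KEY, size INTEGER)')

for name in ftp.nlst():
    size = ftp.size(name)                       # may fail on servers without SIZE support
    row = db.execute('SELECT size FROM processed WHERE path = ?', (name,)).fetchone()
    if row and row[0] == size:
        continue                                # already uploaded and unchanged (by size)

    with open(name, 'wb') as fp:                # download to a local file, then push to S3
        ftp.retrbinary('RETR ' + name, fp.write)
    s3.upload_file(name, bucket, name)

    db.execute('INSERT OR REPLACE INTO processed (path, size) VALUES (?, ?)', (name, size))
    db.commit()

If size and timestamp are not reliable enough, swap the size column for a checksum computed from the downloaded bytes, as described above.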

Related

Google Storage // Cloud Function // Python Modify CSV file in the Bucket

Thanks for reading.
I have a problem modifying a CSV file in a Bucket. I know how to copy/rename/move a file, but I have no idea how to modify a file without downloading it to the local machine.
My main idea so far is to download the blob (CSV file) as bytes, modify it, and upload it back to the Bucket as bytes. But I don't understand how to modify the bytes.
What I need to do to the CSV: add a new header, date, and add a value (today's date) to each row.
---INPUT---
CSV file in the Bucket:
a,b
1,2
---OUTPUT---
updated CSV file in the Bucket:
a,b,date
1,2,today
my code:
from datetime import date
from google.cloud import storage

storage_client = storage.Client()

def addDataToCsv(bucket, fileName):
    today = str(date.today())
    bucket = storage_client.get_bucket(bucket)
    blob = bucket.blob(fileName)
    fileNameText = blob.download_as_string()
    # This should be the magic bytes modification
    blobNew = bucket.blob(path + '/' + 'mod.csv')   # 'path' is defined elsewhere in my script
    blobNew.upload_from_string(fileNameText, content_type='text/csv')
Please help, and thank you for your time and effort.
If I understand, you want to modify the CSV file in the bucket without downloading it to the local machine file-system.
You cannot directly edit a file from a Cloud Storage Bucket, aside from its metadata, therefore you will need to download it to your local machine somehow and push changes to the bucket.
Objects are immutable, which means that an uploaded object cannot change throughout its storage lifetime.
However, an approach would be to use Cloud Storage FUSE, which mounts a Cloud Storage bucket as a file system so you can edit any file from there and changes are applied to your bucket.
Still, if this is not a suitable solution for you, the bytes can be downloaded and modified as you propose by decoding the bytes object (commonly with UTF-8, although this depends on your characters) and re-encoding it before uploading it.
# Create a list with every line of the CSV file
csv_array = fileNameText.decode("utf-8").split("\n")
# Add the new header column
csv_array[0] = csv_array[0] + ",date\n"
# Add the date to each remaining row (skip a trailing empty line, if any)
for i in range(1, len(csv_array)):
    if csv_array[i]:
        csv_array[i] = csv_array[i] + "," + today + "\n"
# Re-encode from list to bytes for the upload
fileNameText = ''.join(csv_array).encode("utf-8")
Take into account that if your local machine has serious storage or performance limitations, if your CSV is large enough that handling it as above might cause problems, or just for reference, you could use the compose command. For that you would modify the code above so that only some sections of the CSV file are edited and uploaded at a time, and the pieces are then joined by gsutil compose in Cloud Storage.
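For reference, a rough sketch of that compose step with the google-cloud-storage client; the bucket and blob names are illustrative:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my-bucket')          # assumed bucket name

# Per-chunk results that were uploaded separately (names are assumptions).
parts = [bucket.blob('mod_part_1.csv'), bucket.blob('mod_part_2.csv')]

combined = bucket.blob('mod.csv')
combined.compose(parts)                          # server-side concatenation, nothing is downloaded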
Sorry, I know I'm not in your shoes, but if I were you I would try to keep things simple. Indeed, most systems work best when they are kept simple, and they are easier to maintain and share (KISS principle). Given that you are using your local machine, I assume you have generous network bandwidth and enough disk space and memory, so I would not hesitate to download the file, modify it, and upload it again, even when dealing with big files.
Then, if you are willing to use another format for the file:
"download blob (csv file) as bytes"
In that case a better solution, both for size and for simple code, is to use / convert your file to the Parquet or Avro format. These formats will drastically reduce your file size, especially if you add compression. They also let you keep a structure for your data, which makes modifications much simpler. Finally, there are many resources on the net on how to use these formats with Python, and comparisons between CSV, Avro and Parquet.
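As a rough sketch of that conversion, assuming pandas and pyarrow are installed; fileNameText and bucket come from the question's code, and the output name is illustrative:

import io
from datetime import date
import pandas as pd

df = pd.read_csv(io.BytesIO(fileNameText))          # bytes from blob.download_as_string()
df['date'] = str(date.today())                      # add the new column to every row

buf = io.BytesIO()
df.to_parquet(buf, compression='snappy')            # needs pyarrow (or fastparquet)
blobNew = bucket.blob('mod.parquet')
blobNew.upload_from_string(buf.getvalue(), content_type='application/octet-stream')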

Reading image EXIF data from SFTP server without downloading the file

I'm writing a script that downloads files from an SFTP server. However, there are 10k files (~5 MB per file) in each folder, and I only want to download files that were taken, say, 12 hours apart (e.g. at 12:00 and 00:00).
But I seem to be able to read only the date of last modification, not the creation date; that date seems hidden until I have the file locally. I have an alternative strategy, but it is not as clean as getting the right files on the first download.
JPEG EXIF metadata is part of the file contents, not part of the file metadata as far as the filesystem/SFTP is concerned, so it is not part of a directory listing, at least not with any SFTP server I know of.
You cannot retrieve it without downloading the JPEG file, or at least not without downloading the part of the file that contains the EXIF data.
Related question: Check aspect ratio of image stored on an FTP server without downloading the file in Python
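If downloading only the leading part of each file is acceptable, here is a sketch using paramiko and the exifread package; the host, credentials, path, and the 64 KB guess are assumptions:

import io
import paramiko
import exifread

transport = paramiko.Transport(('sftp.example.com', 22))   # assumed host
transport.connect(username='user', password='password')
sftp = paramiko.SFTPClient.from_transport(transport)

with sftp.open('photos/IMG_0001.JPG', 'rb') as remote:
    head = remote.read(65536)                 # EXIF usually sits in the first few KB

tags = exifread.process_file(io.BytesIO(head), details=False)
print(tags.get('EXIF DateTimeOriginal'))      # capture time, if present

sftp.close()
transport.close()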

Manipulating and creating S3 files within Django/python when local system files are needed

I'm using django-storages to store media files in an S3 bucket. However, I am occasionally converting or otherwise fiddling with the file to create new files, and this fiddling has to use files actually on my server (most of the conversion happens using process calls). When done I'd like to save the files back to S3.
In an ideal world, I would not have to make any changes to my functions when moving from local storage to S3. However, I'm unsure how I would do this, considering that I have to create these intermediate local files to fiddle with, and then at the end know that the resultant file (which would also be stored on the local machine) needs to be copied over to S3.
The best I can come up with is a pair of context managers, one for the source file and one for the destination file. The source one would create a temporary file, copy the contents of the source file into it, and then let it be used and manipulated. The destination one would take the final desired destination path on S3 and create a temporary local file; on exit it would create a key in the S3 bucket, copy over the contents of the temporary file, and delete it.
But this seems pretty complicated to me. It also requires me to wrap every single function that manipulates these files in two "with" statements.
The only other solution I can think of is switching over to utilities that only deal with file-like objects rather than filenames, but that means I can't make subprocess calls.
Take a look at the built-in file storage API - this is exactly the use-case for it.
If you're using django-storages and uploading to S3, then you should have a line in your settings module that looks like
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
When you're developing locally and don't want to upload your media files to S3, in your local settings module, just leave it out so it defaults to django.core.files.storage.FileSystemStorage.
In your application code, for the media files that will get saved to S3 when you move from local development to staging, instantiate a Storage object using the class returned from the get_storage_class function, and use this object to manipulate the file. For the temporary local files you're "fiddling" with, don't use this Storage object (i.e. use Python's built-in file-handling functions) unless it's a file you're going to want to save to S3.
When you're ready to start saving stuff on S3, all you have to do is set DEFAULT_FILE_STORAGE = 'storages.backends.s3boto.S3BotoStorage' again, and your code will work without any other tweaks. When that setting is not set, those media files will get saved to the local filesystem under MEDIA_ROOT, again without any need to change your application logic.
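A hedged sketch of that pattern, assuming django-storages is configured through DEFAULT_FILE_STORAGE; the convert-tool command and the file names are placeholders for whatever subprocess call does the fiddling:

import subprocess
import tempfile
from django.core.files import File
from django.core.files.storage import default_storage

def convert_and_store(source_name, dest_name):
    # Pull the source out of whichever backend is configured (local FS or S3).
    with default_storage.open(source_name, 'rb') as src, \
         tempfile.NamedTemporaryFile() as local_in, \
         tempfile.NamedTemporaryFile() as local_out:
        local_in.write(src.read())
        local_in.flush()

        # The "fiddling" step: any tool that needs real file paths.
        subprocess.check_call(['convert-tool', local_in.name, local_out.name])

        # Save the result back through the same storage API.
        local_out.seek(0)
        return default_storage.save(dest_name, File(local_out))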

Possible to zip a file (files?) and then unzip once uploaded or is upload and rename the only option?

I have no idea if this is possible...
Let's say I want to put test.html into a .zip archive and then use ftplib to upload the file; once uploaded, can it be extracted on the server, overwriting any existing files?
If that's not possible, what's the best way to upload a file, then rename it and overwrite the original file name (would I have to delete the original test.html from the FTP folder)?
Any ideas?
ftp_session = ftplib.FTP('ftp.website.com', 'admin#website.com', 'password123')
ftp_file = open('output.html', 'rb')    # storlines expects a binary-mode file in Python 3
ftp_session.cwd("/folder")
ftp_session.storlines('STOR output.html', ftp_file)
ftp_file.close()
ftp_session.quit()
The FTP server won't unzip your file; you'll have to have something running on the other side to do that.
If you want to replace a single file, upload it as test.html.tmp and then rename it to test.html. The rename (an FTP operation) should be atomic (filesystem-wise) and will overwrite the old file (actually it just deletes it and points the name at the new file). This way, anything reading the file will get either the old version or the new one, but always a correct one; there is no danger of reading just half of the new file.
I think that using cPanel you'll run unzip, which most likely will open the file for writing, truncate it, and then fill in the content. That is not atomic; someone may read an invalid file. On the other hand, you can write a script that will run remotely and do things the way you want.
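A small sketch of the upload-then-rename approach, reusing the session details from the question (whether RNTO silently overwrites an existing name can depend on the server):

import ftplib

ftp_session = ftplib.FTP('ftp.website.com', 'admin#website.com', 'password123')
ftp_session.cwd('/folder')

with open('output.html', 'rb') as fp:
    ftp_session.storbinary('STOR test.html.tmp', fp)   # upload under a temporary name

ftp_session.rename('test.html.tmp', 'test.html')       # swap it into place in one step
ftp_session.quit()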

Comparing local file with remote file

I have the following problem: I have a local .zip file and a .zip file located on a server. I need to check whether the .zip file on the server is different from the local one; if it is, I need to pull the new one from the server. My question is: how do I compare them without downloading the file from the server and comparing locally?
I could create an MD5 hash for the zip file on the server when creating it and then compare it with the MD5 of my local .zip file, but is there a simpler way?
Short answer: You can't.
Long answer: To compare with the zip file on the server, someone has to read that file. Either you can do that locally, which would involve pulling it, or you can ask the server to do it for you. Can you run code on the server?
Edit
If you can run Python on the server, why not hash the file and compare hashes?
import hashlib

with open(<path-to-file>, "rb") as theFile:
    m = hashlib.md5()
    for line in theFile:
        m.update(line)

with open(<path-to-hashfile>, "wb") as theFile:
    theFile.write(m.digest())
and then compare the contents of hashfile with a locally-generated hash?
Another edit
You asked for a simpler way. Think about this in an abstract way for a moment:
You don't want to download the entire zip file.
Hence, you can't process the entire file locally (because that would involve reading all of it from the server, which is equivalent to downloading it!).
Hence, you need to do some processing on the server. Specifically, you want to come up with some small amount of data that 'encodes' the file, so that you can fetch this small amount of data without fetching the whole file.
But this is a hash!
Therefore, you need to do some sort of hashing. Given that, I think the above is pretty simple.
I would like to know how you intend to compare them locally, if that were the case; you can apply the same logic to compare them remotely.
You can log in using ssh and compute an MD5 hash for the remote file and an MD5 hash for the current local file. If the MD5s match, the files are identical; otherwise they are different.
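A rough sketch of that comparison, assuming you can run commands on the server through the system ssh client; the host and paths are illustrative:

import hashlib
import subprocess

def local_md5(path):
    m = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            m.update(chunk)
    return m.hexdigest()

def remote_md5(host, path):
    out = subprocess.check_output(['ssh', host, 'md5sum', path])
    return out.split()[0].decode()           # md5sum prints "<hash>  <path>"

if local_md5('archive.zip') != remote_md5('user@server', '/srv/files/archive.zip'):
    print('The server copy differs -- pull the new one.')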
