Force overwrite when saving to GCP bucket (pd.to_csv) - python

I have a little problem that I cannot solve and would really appreciate any input on this. You see, whenever I've used pd.to_csv() to write a file to a folder, it has done what I expected it to do, namely overwrite any existing file with the same name.
I am now working solely on GCP and I need to refresh tokens to access some APIs using existing ones. Doing this is not a problem, but what I've done is save copies of all the API tokens in a dataframe that I store in a bucket. All my files have a datestamp:
Saving
import datetime
import pandas as pd

base = datetime.date.today()
bucket_name_tokens = 'project-tokens'
bucket_path_tokens = 'gs://' + bucket_name_tokens
tokens.to_csv(bucket_path_tokens + '/tokens_{}.csv'.format(base))
Reading
bucket_name_tokens = 'project-tokens'
bucket_path_tokens = 'gs://' + bucket_name_tokens
tokens_new = pd.read_csv(bucket_path_tokens +'/tokens_{}.csv'.format(base))
I discovered that if I (for some reason) refresh my token and save the new one using the above code (on the same day, obviously), it does not overwrite the older one with the same name!
I've looked everywhere for an explanation, or for an option to pandas.to_csv() that I wasn't aware of, without any luck.
Any idea on how to ensure that files are overwritten? Have I missed something?
NOTE This is also true of all the files I save in any of the buckets I have created.
Thanks in advance

Don't create a storage bucket with a retention policy.
Retention is nice, but it won't allow overwriting or deleting of items in the bucket unless it is turned off.
Personally, I find it better not to have any retention at all.
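If you want to check whether a retention policy is the culprit, here is a minimal sketch using the google-cloud-storage client (assuming the bucket name from the question, and that the policy has not been locked):
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('project-tokens')

# A bucket with a retention policy reports its period in seconds;
# None means no policy is set.
print(bucket.retention_period)

# Clear the policy so objects can be overwritten again. Note that a
# locked retention policy is permanent and cannot be removed.
if bucket.retention_period and not bucket.retention_policy_locked:
    bucket.retention_period = None
    bucket.patch()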

Related

Can I restore my wallet access using just its pubkeys, ckeys, mkey and salt?

I recently managed to restore an old COLX wallet by importing a mobile backup file into the colx-qt core %appdata% folder.
After resyncing the blockchain I changed the rpcpassword and the wallet's encryption password, tested both of these and everything was in good order.
For testing purposes, I had decided to dump another encrypted backup file (wallet.dat) into the %appdata folder without removing the restored one first, resynced the blockchain, tried the passwords and noted they were now both incorrect.
I've removed both backups, resynced the blockchain anew, set the rpcpassword and wallet unlock password to those associated with the first backup, and imported said backup again, but noticed that both passwords were still incorrect, thereby losing my access to that mobile backup as well.
Since I happen to have an older wallet.json file containing multiple addresses, several pubkeys and ckeys as well as the mkey (in the format {nID=1; encrypted_key:"the encrypted key"; nDerivationIterations:"62116 or some similar number"; version:""; etc etc}), could I use this information to restore my access to the wallet?
If YES, how exactly must I go about doing it?
Thank you in advance for your assistance!
I haven't tried anything else so far because I am trying to understand how this happened, what caused the change, and how I will need to fix it, before I go ahead and add even more muddy details to the issue.

How to close a pdf opened with fitz if I've replaced its variable name?

This is a simple issue. I use Jupyter Notebook for Python and usually deal with PDFs using PyMuPDF.
I usually define pdf = fitz.open('dir/to/file.pdf'), but sometimes I forget to close the file before I redefine pdf = fitz.open('dir/to/other_file.pdf').
Sometimes I need to (for example) move or delete file.pdf (the original file) but I can't because python is using it.
Not being an expert, I don't know how to close this file once I have redefined the variable pdf, as obviously pdf.close() would close 'other_file.pdf', so I end up restarting my .ipynb kernel, which feels dumb.
How can I access an object whose variable name has been redefined?
If you do a Document.save("newname.pdf"), then that new file, newname.pdf, will be immediately available to other processes; it is not blocked by the process you are currently executing.
The original file however, oldname.pdf, from which you have created your Document object remains owned by your current process.
It will be freed if you do a Document.close().
But there is a way to work with the oldname.pdf under PyMuPDF without blocking it. It actually entails making a memory-resident copy:
import pathlib
import fitz
pdfbytes = pathlib.Path("oldname.pdf").read_bytes()
# from here on, file oldname.pdf is fully available
doc = fitz.open("pdf", pdfbytes)
# doc can be saved under any name, even the original one.
Writing up this issue made me think about globals().
Browsing through its keys, I found that objects whose variables have been reused are stored under dummy names (I don't know the term used for them). I found the object I was looking for and was able to close it.
If there's a better - more elegant solution, I'd be glad to hear about it.
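One approach that may be more systematic than digging through globals() is the gc module, which can enumerate all live objects. A sketch, assuming a recent PyMuPDF version (older releases spell the attribute isClosed instead of is_closed):
import gc
import fitz  # PyMuPDF

# Close every fitz.Document that is still open, including documents
# whose variable name has since been reassigned.
for obj in gc.get_objects():
    if isinstance(obj, fitz.Document) and not obj.is_closed:
        obj.close()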

Google Storage // Cloud Function // Python Modify CSV file in the Bucket

thanks for reading.
I have a problem with modifying a CSV file in a bucket. I know how to copy/rename/move a file, but I have no idea how to modify a file without downloading it to a local machine.
Actually I have a general idea: download the blob (CSV file) as bytes, then modify the bytes and upload them back to the bucket. But I don't understand how to modify the bytes.
What I need to do to the CSV: add a new header, date, and add a value (today's date) to each row of the CSV.
---INPUT---
CSV file in the bucket:
a,b
1,2
---OUTPUT---
updated CSV file in the bucket:
a,b,date
1,2,today
my code:
from datetime import date
from google.cloud import storage

storage_client = storage.Client()

def addDataToCsv(bucket, fileName):
    today = str(date.today())
    bucket = storage_client.get_bucket(bucket)
    blob = bucket.blob(fileName)
    fileNameText = blob.download_as_string()
    # This is where the magic bytes modification should happen
    blobNew = bucket.blob(path + '/' + 'mod.csv')  # path is defined elsewhere
    blobNew.upload_from_string(fileNameText, content_type='text/csv')
Please help, and thank you for your time and effort.
If I understand, you want to modify the CSV file in the bucket without downloading it to the local machine file-system.
You cannot directly edit a file from a Cloud Storage Bucket, aside from its metadata, therefore you will need to download it to your local machine somehow and push changes to the bucket.
Objects are immutable, which means that an uploaded object cannot change throughout its storage lifetime.
However, an approach would be to use Cloud Storage FUSE, which mounts a Cloud Storage bucket as a file system so you can edit any file from there and changes are applied to your bucket.
Still, if this is not a suitable solution for you, the bytes can be downloaded and modified as you propose by decoding the bytes object (commonly using UTF-8, although it depends on your characters) and re-encoding it before uploading it.
# Create a list of every CSV file line
# (fileNameText and today come from the function above)
csv_array = fileNameText.decode("utf-8").split("\n")
# Add the new header column
csv_array[0] = csv_array[0] + ",date\n"
# Add the date to each data row, skipping any trailing empty line
for i in range(1, len(csv_array)):
    if csv_array[i]:
        csv_array[i] = csv_array[i] + "," + today + "\n"
# Re-encode from list to bytes for the upload
fileNameText = ''.join(csv_array).encode("utf-8")
Take into account that if your local machine has serious storage or performance limitations, or if your CSV is large enough that handling it as above might cause problems, or just for reference, you could use the compose command. For this you would need to modify the code above so that only some sections of the CSV file are edited at a time and uploaded, and then joined by gsutil compose in Cloud Storage.
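As a rough sketch of that server-side join, assuming the edited parts have already been uploaded under hypothetical names, the Python client exposes this as Blob.compose:
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my-bucket')  # hypothetical bucket name

# Hypothetical part objects, each holding an edited slice of the CSV
parts = [bucket.blob('parts/mod_part1.csv'), bucket.blob('parts/mod_part2.csv')]

# Server-side concatenation into one object; nothing is downloaded
combined = bucket.blob('mod.csv')
combined.compose(parts)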
Sorry, I know I'm not in your shoes, but if I were you I would try to keep things simple. Indeed, most systems work best when they are kept simple, and they are easier to maintain and share (the KISS principle). So given that you are using your local machine, I assume you have generous network bandwidth and enough disk space and memory. I would not hesitate to download the file, modify it, and upload it again, even when dealing with big files.
Then, if you are willing to use another format for the file:
download blob (csv file) as bytes
In this case a better solution for size and simple code is to use / convert your file to the Parquet or Avro format. These formats will drastically reduce your file size, especially if you add compression. They also let you keep a structure for your data, which makes modifications much simpler. Finally, there are many resources on the net on how to use these formats with Python, and comparisons between CSV, Avro, and Parquet.
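For instance, a minimal sketch of a CSV-to-Parquet conversion with pandas (assuming the gcsfs and pyarrow packages are installed, and with hypothetical file names):
import datetime
import pandas as pd

df = pd.read_csv('gs://my-bucket/data.csv')
# Add the date column, as in the original task
df['date'] = str(datetime.date.today())
# Snappy compression keeps the file small while staying fast to read
df.to_parquet('gs://my-bucket/data.parquet', compression='snappy')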

Django unique filename method

I am looking for the method Django uses to generate unique filename when we upload a file.
For example, if I upload a file called test.csv twice to the same directory, the first one will be saved as test.csv and the second will be saved as test_2.csv. I already tried to find out how Django manages that, but I only found django.utils.text.get_valid_filename, which could be useful, but that's not what I am looking for...
I already saw other topics with random-naming solutions; that's not what I am looking for here :) I am really trying to understand how Django manages that problem.
With your help I actually took a closer look and found something :)
So basically I have to do something like:
from django.core.files.storage import FileSystemStorage
fss = FileSystemStorage()
filepath = fss.get_available_name(filepath)
Thank you all :)
PS: If you are interested, the comment in django.core.files.storage.FileSystemStorage._save says:
There's a potential race condition between get_available_name and
saving the file; it's possible that two threads might return the
same name, at which point all sorts of fun happens. So we need to
try to create the file, but if it already exists we have to go back
to get_available_name() and try again.
If you look at the implementation of the class django.core.files.storage.Storage, you will see how Django 1.6 manages file names.
Look into the save method of this class. In this the line
name = self.get_available_name(name)
is doing the trick.
This is the default implementation of getting a new file name before saving the file. If you want to write your own version (e.g. so that the file is overwritten instead), consider writing your own custom storage system.
Actually, you were on the right track.
From the docs,
Internally, Django uses a django.core.files.File instance any time it
needs to represent a file.
And also,
Behind the scenes, Django delegates decisions about how and where to
store files to a file storage system
Which means that, when the file is uploaded using the default storage (FileSystemStorage), Django delegates the naming (or rather, finding an available name) for the file to the storage behind the scenes, which then uses get_available_name(name).
So, If you want to change the way files are named when uploaded, you need to add a custom file storage, which would basically only override get_available_name. The documentation on the matter is here.
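For instance, a minimal sketch of such a storage that overwrites instead of renaming (the OverwriteStorage name is mine, not Django's):
from django.core.files.storage import FileSystemStorage

class OverwriteStorage(FileSystemStorage):
    """Delete any existing file so the uploaded one keeps its name."""

    def get_available_name(self, name, max_length=None):
        if self.exists(name):
            self.delete(name)
        return name
Note that this reintroduces the race condition mentioned in the _save comment above, so it is only safe when concurrent uploads to the same name cannot happen.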

Python ftplib: Best way to check if a file has changed?

I'm looking to schedule FTP file transfers, but to conserve bandwidth, I would like to only upload files that have changed. What's a good reliable way to do this that will work on a variety of different hosting providers?
First, checking to see whether a local file has changed really doesn't have anything to do with FTP. You're stating that you're only going to open an FTP connection to upload a file if/when it has changed.
At a high level, the basic strategy you're going to need to employ is to keep track of when your application last checked for changes (the previous execution timestamp) and compare that to the modification timestamps of the files you are interested in uploading. If a file's timestamp is more recent, it has most likely changed. I say most likely because it is possible to update only the timestamp (e.g. touch on unix/linux).
Here's a quick example showing you how to check the modification time for all of the items in a specific directory:
import os

checkdir = "./"
for item in os.listdir(checkdir):
    item_path = os.path.join(checkdir, item)
    # Modification time as seconds since the epoch
    mtime = os.path.getmtime(item_path)
    print("%s: %s" % (item_path, mtime))
Note that this does not differentiate between file types (e.g. regular file, directory, symlink). Read the docs on os.path to discover how to determine file type so you can skip certain types, if you so choose.
You'll still need to come up with the logic to store the time of the previous 'scan' so that you refer to it in subsequent scans. A really simple way to do this would be to store a value in a file.
Make sure you use a locking strategy in case two 'scans' overlap. FTP uploads will take some time to complete.
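As a minimal sketch of that state-keeping (the state file name and helper are hypothetical):
import json
import os
import time

STATE_FILE = 'last_scan.json'  # hypothetical state file

def files_changed_since_last_scan(checkdir='./'):
    # Load the timestamp of the previous scan, defaulting to 'never'
    last_scan = 0.0
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            last_scan = json.load(f)['last_scan']
    # Collect everything modified since then
    changed = [
        os.path.join(checkdir, item)
        for item in os.listdir(checkdir)
        if os.path.getmtime(os.path.join(checkdir, item)) > last_scan
    ]
    # Record this scan for next time
    with open(STATE_FILE, 'w') as f:
        json.dump({'last_scan': time.time()}, f)
    return changed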