Dropbox Python API Upload multiple files - python

I'm trying to upload a set of pd.DataFrames as CSV files to a folder in Dropbox using the Dropbox Python SDK (v2). The files are not particularly big, but there are a lot of them. Using batches should help reduce the number of API calls and comply with the developer recommendations outlined in the documentation:
"The idea is to group concurrent file uploads into batches, where files
in each batch are uploaded in parallel via multiple API requests to
maximize throughput, but the whole batch is committed in a single,
asynchronous API call to allow Dropbox to coordinate the acquisition
and release of namespace locks for all files in the batch as
efficiently as possible."
Following several answers on SO (see the most relevant to my problem here) and this answer from the SDK maintainers on the Dropbox forum, I tried the following code:
commit_info = []
for df in list_pandas_df:
    df_raw_str = df.to_csv(index=False)
    upload_session = dbx.files_upload_session_start(df_raw_str.encode())
    commit_info.append(
        dropbox.files.CommitInfo(path='/path/to/db/folder.csv')
    )
dbx.files_upload_session_finish_batch(commit_info)
Nonetheless, when reading the files_upload_session_finish_batch docstring, I noticed that the function only takes a list of CommitInfo objects as an argument (documentation), which is confusing, since the non-batch version (files_upload_session_finish) takes both a CommitInfo object with a path and a cursor object with data about the session.
I'm fairly lost in the documentation, and even the source code is not much help for understanding how batching works for uploading several files (as opposed to uploading a single large file). What am I missing here?
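For what it's worth, based on the forum answer linked above, I suspect the intended flow pairs each session cursor with its CommitInfo via UploadSessionFinishArg, roughly like this (untested; the per-file paths and the close=True flag are my guesses from the docs):

import dropbox  # dbx and list_pandas_df are defined as above

entries = []
for i, df in enumerate(list_pandas_df):
    data = df.to_csv(index=False).encode()
    # One upload session per file; close=True since each file fits in one request
    session = dbx.files_upload_session_start(data, close=True)
    cursor = dropbox.files.UploadSessionCursor(
        session_id=session.session_id, offset=len(data))
    commit = dropbox.files.CommitInfo(path='/path/to/db/file_{}.csv'.format(i))
    entries.append(dropbox.files.UploadSessionFinishArg(cursor=cursor, commit=commit))

# Single asynchronous call that commits every session in the batch
job = dbx.files_upload_session_finish_batch(entries)

If I read the docs correctly, the returned job can then be polled with files_upload_session_finish_batch_check, but I'm not sure this is the intended way to batch many small files.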

Related

Python-based PDF parser integrated with Zapier

I am working for a company which is currently storing PDF files into a remote drive and subsequently manually inserting values found within these files into an Excel document. I would like to automate the process using Zapier, and make the process scalable (we receive a large amount of PDF files). Would anyone know any applications useful and possibly free for converting PDFs into Excel docs and which integrate with Zapier? Alternatively, would it be possible to create a Python script in Zapier to access the information and store it into an Excel file?
This option came to mind. I'm using Google Drive as an example; you didn't say what you were using as storage, but Zapier should have an option for it.
Use CloudConvert or Docparser (it depends on what you want to pay; CloudConvert at least gives you some free conversion time per month, so that may be the closest you can get).
Create a Zap with these steps:
Trigger on new file in drive (Name: Convert new Google Drive files with CloudConvert)
Convert file with CloudConvert
Those are two options from Zapier that I can find. But you could also do it in Python from your desktop by following something like this idea, then set up an event trigger in Windows to kick off the upload/download.
Unfortunately, it doesn't seem that you can import JS/Python libraries into Zapier, though I may be wrong on that. If you can, or find a way to do so, then just use PDFMiner with "Code by Zapier". A technician might have to confirm this, though; I've never gotten libraries to work in Zaps.
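If you end up doing it from your desktop in Python instead, a minimal sketch of the extraction step could look like this (assumes pdfminer.six and openpyxl; the paths and the parse_value_of_interest helper are placeholders you would adapt to your PDFs):

# Minimal sketch: extract text from a PDF and append one row to an Excel file.
# Assumes `pip install pdfminer.six openpyxl`; paths and parsing are placeholders.
from pdfminer.high_level import extract_text
from openpyxl import load_workbook

def parse_value_of_interest(text):
    # Placeholder: replace with the parsing your PDFs actually need.
    return text.splitlines()[0] if text else ''

def pdf_to_excel_row(pdf_path, xlsx_path):
    text = extract_text(pdf_path)          # raw text of the whole PDF
    value = parse_value_of_interest(text)  # placeholder parsing logic
    wb = load_workbook(xlsx_path)
    ws = wb.active
    ws.append([pdf_path, value])           # one row per processed PDF
    wb.save(xlsx_path)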
Hope that helps!

How can I set up an automated import to Google Data Prep?

When using Google Data Prep, I am able to create automated schedules to run jobs that update my BigQuery tables.
However, this seems pointless when considering that the data used in Prep is updated by manually dragging and dropping CSVs (or JSON, xlsx, whatever) into the data storage bucket.
I have attempted to search for a definitive way of updating this bucket automatically with files that are regularly updated on my PC, but there seems to be no best-practice solution that I can find.
How should one go about doing this efficiently and effectively?
So, in order to upload files from your computer to Google Cloud Storage, there are a few possibilities. If you run a daemon process that handles any change in that shared directory, you can code the automatic upload in any of these languages: C#, Go, Java, Node.js, PHP, Python or Ruby.
There are code examples here for uploading objects, but be aware that there is also a detailed Cloud Storage Client Libraries reference, and you can also find the GitHub links under "Additional Resources".
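If you go with Python, a minimal sketch of the upload step with the google-cloud-storage client library could look like this (the bucket name, object name and local path are placeholders; detecting changes in the folder is left to your daemon or scheduler):

# Minimal sketch: upload one local file to a Cloud Storage bucket.
# Assumes `pip install google-cloud-storage` and application default credentials.
from google.cloud import storage

def upload_to_bucket(local_path, bucket_name, destination_name):
    client = storage.Client()
    bucket = client.bucket(bucket_name)    # placeholder bucket name
    blob = bucket.blob(destination_name)   # object name inside the bucket
    blob.upload_from_filename(local_path)  # uploads the local file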

How to get s3 metadata for all keys in a bucket via boto3

I want to fetch all metadata for a bucket with a prefix via Boto. There are a few SO questions that imply this isn't possible via the AWS API. So, two questions:
Is there a good reason this shouldn't be possible via the AWS API?
Although I can't find one in the docs, is there a convenience method for this in boto3?
I'm currently doing this using multithreading, but that seems like overkill, and I'd really rather avoid it if at all possible.
While there isn't a way to do this directly through boto, you could add an inventory configuration on the bucket(s) which generates a daily CSV / ORC file with all file metadata.
Once this has been generated you can then process the output rather than multithreading or any other method that requires a huge number of requests.
See: put_bucket_inventory_configuration
It's worth noting that it can take up to 48 hours for the first report to be generated.
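As a rough sketch (the bucket names, configuration Id and prefix below are placeholders), enabling a daily inventory report with boto3 looks something like this:

# Rough sketch: enable a daily S3 inventory report scoped to one prefix.
# Bucket names, the Id and the prefix are placeholders.
import boto3

s3 = boto3.client('s3')
s3.put_bucket_inventory_configuration(
    Bucket='source-bucket',
    Id='daily-metadata',
    InventoryConfiguration={
        'Id': 'daily-metadata',
        'IsEnabled': True,
        'IncludedObjectVersions': 'Current',
        'Filter': {'Prefix': 'my/prefix/'},
        'Schedule': {'Frequency': 'Daily'},
        'OptionalFields': ['Size', 'LastModifiedDate', 'StorageClass', 'ETag'],
        'Destination': {
            'S3BucketDestination': {
                'Bucket': 'arn:aws:s3:::inventory-destination-bucket',
                'Format': 'CSV',
                'Prefix': 'inventory-reports'
            }
        }
    }
)

Once the report starts appearing, you can download and parse the CSV (for example with the csv module or pandas) instead of issuing one request per key.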

Writing big CSV files to GCS on App Engine

I'm extracting huge amounts of data from the Google App Engine Datastore (using Python) and I need to write it to a csv file on GCS (Google Cloud Storage).
I do this task by fetching ~10k entities with iter query and deferring the task.
Unfortunately GCS doesn't support appending to files, and because of that in each run of the task I'm forced to open and read the whole file, close it, then write the content to a new file and add the newly fetched batch of data to it.
I'm using the UnicodeWriter/UnicodeReader for handling csv files similar to:
https://docs.python.org/2/library/csv.html#examples
My problem is that as the file gets bigger, it tends to eat a lot of the instance's memory and sometimes exceeds the limit. Is there any way to minimize this extensive memory usage?
Any examples of handling big csv files > 32MB on GCS are quite welcome.
Google Cloud Storage can happily accept objects of basically unlimited size, but your problem is a little different: it is constructing the object in the first place.
You can use Google Cloud Storage's composition support to help. However, compose has limits. You can compose up to 1024 objects in total (32 objects per call, but the result of that call can itself be composed, as can its result, and so on, until up to 1024 original source objects have been composed together). Thus, composition will only work if breaking the total size up into 1024 pieces makes them sufficiently small for your use case.
However, maybe that's good enough. If so, here are some resources:
Documentation of the compose feature: https://cloud.google.com/storage/docs/composite-objects#_Compose
I'm not sure if you're using the App Engine Cloud Storage library, but if you are, it unfortunately doesn't support compose. You'll have to grab the more generic Google API Python client and invoke the objects#compose method, documented here: https://cloud.google.com/storage/docs/json_api/v1/objects/compose
Here's the relevant example of using it:
composite_object_resource = {
    'contentType': 'text/csv',  # required
    'contentLanguage': 'en',
}
compose_req_body = {
    'sourceObjects': [
        {'name': source_object_name_1},
        {'name': source_object_name_2}],
    'destination': composite_object_resource
}
req = client.objects().compose(
    destinationBucket=bucket_name,
    destinationObject=composite_object_name,
    body=compose_req_body)
resp = req.execute()
When you write something like:
with gcs.open(gcs_filename, 'w', content_type=b'multipart/x-zip') as gf:
    ....
Here gf is a cloudstorage.storage_api.StreamingBuffer, which can be pickled so that you can append data in a chained task. I have not tried this yet, though.
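To make that idea concrete, here is a rough, untested sketch of the chained-task approach (assuming the App Engine cloudstorage library and the deferred API; fetch_entities is a placeholder for the iterative Datastore query, and the buffer is opened without a with block so it stays open across tasks):

# Untested sketch: chained deferred tasks sharing one open StreamingBuffer.
# fetch_entities is a placeholder wrapping the iterative Datastore query.
import cloudstorage as gcs
from google.appengine.ext import deferred

def start_export(gcs_filename):
    gf = gcs.open(gcs_filename, 'w', content_type='text/csv')
    deferred.defer(write_batch, gf, None)

def write_batch(gf, cursor):
    rows, next_cursor, more = fetch_entities(cursor, limit=10000)
    for row in rows:
        gf.write(row)  # each row is an already-encoded CSV line
    if more:
        # deferred.defer pickles its arguments, so the open StreamingBuffer
        # travels to the next task and writing resumes where it left off.
        deferred.defer(write_batch, gf, next_cursor)
    else:
        gf.close()  # finalizes the object in GCS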

Python Azure blob storage upload file bigger than 64 MB

From the sample code, I can upload 64MB, without any problem:
myblob = open(r'task1.txt', 'r').read()
blob_service.put_blob('mycontainer', 'myblob', myblob, x_ms_blob_type='BlockBlob')
What if I want to upload a bigger file?
Thank you
I ran into the same problem a few days ago, and was lucky enough to find this. It breaks up the file into chunks and uploads it for you.
I hope this helps. Cheers!
I'm not a Python programmer. But a few extra tips I can offer (my stuff is all in C):
Use HTTP PUT operations (comp=block option) for as many blocks (4 MB each) as required for your file, and then a final Put Block List (comp=blocklist option) that coalesces the blocks. If your block uploads fail or you need to abort, the cleanup for deleting the partial set of previously uploaded blocks is a DELETE command for the file you are trying to create, but this appears to be supported only by the 2013-08-15 version (someone from Azure support should confirm this).
If you need to add meta information, an additional PUT operation (with the comp=metadata option) is what I do when using the Block List method. There might be a more efficient way to tag the meta information without requiring an additional PUT, but I'm not aware of it.
This is a good question. Unfortunately, I don't see a real implementation for uploading arbitrarily large files. So, from what I can see, there is much more work to be done on the Python SDK, unless I am missing something really crucial.
The sample code provided in the documentation indeed uses just a single text file and uploads it at once. There is no real code implemented yet (from what I see in the SDK source code) to support uploads of larger files.
So, to work with blobs from Python, you need to understand how Azure Blob Storage works. Start here.
Then take a quick look at the REST API documentation for PutBlob operation. It is mentioned in the remarks:
The maximum upload size for a block blob is 64 MB. If your blob is
larger than 64 MB, you must upload it as a set of blocks. For more
information, see the Put Block (REST API) and Put Block List (REST
API) operations. It's not necessary to call Put Blob if you upload the
blob as a set of blocks.
The good news is that PutBlock and PutBlockList are implemented in the Python SDK, but no sample is provided for how to use them. What you have to do is manually split your file into chunks (blocks) of up to 4 MB each, and then use the put_block(self, container_name, blob_name, block, blockid, content_md5=None, x_ms_lease_id=None) function from the Python SDK to upload the blocks. Ideally, you will upload the blocks in parallel. Do not forget, however, that you also have to call put_block_list(self, container_name, blob_name, block_list, content_md5=None, x_ms_blob_cache_control=None, ...) at the end to commit all the uploaded blocks.
Unfortunately, I'm not enough of a Python expert to help you further, but at least this gives you a good picture of the situation.
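For illustration only, here is a rough, untested sketch of that split-and-commit flow, using just the put_block / put_block_list signatures quoted above (the Base64 encoding of block IDs and the sequential loop are assumptions, not verified against the SDK):

# Untested sketch: split a local file into 4 MB blocks, upload each with
# put_block, then commit the ordered list with put_block_list.
# Block IDs are assumed to need to be Base64-encoded strings of equal length.
import base64

MAX_BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB per block

def upload_large_blob(blob_service, container_name, blob_name, file_path):
    block_ids = []
    with open(file_path, 'rb') as f:
        index = 0
        while True:
            chunk = f.read(MAX_BLOCK_SIZE)
            if not chunk:
                break
            block_id = base64.b64encode('block-{0:08d}'.format(index).encode()).decode()
            blob_service.put_block(container_name, blob_name, chunk, block_id)
            block_ids.append(block_id)
            index += 1
    # Commit the blocks in order; only now does the blob become visible.
    blob_service.put_block_list(container_name, blob_name, block_ids)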
