how to efficiently rename a lot of blobs in GCS - python

Let's say that on Google Cloud Storage I have a bucket, bucket1, and inside this bucket I have thousands of blobs I want to rename in this way:
Original blob:
bucket1/subfolder1/subfolder2/data_filename.csv
to: bucket1/subfolder1/subfolder2/data_filename/data_filename_backup.csv
subfolder1, subfolder2 and data_filename.csv can all have different names, but every blob should be renamed following the pattern above.
What is the most efficient way to do this? Can I use Python for that?

You can use any programming language for which Google offers an SDK for working with Cloud Storage. There is not going to be much of an advantage to any particular language you choose.
There is not really an "efficient" way of doing this. What you will end up doing in your code is pretty standard:
List the objects that you want to rename.
Iterate that list.
For each object, change the name.
You will get better performance overall if you run the code in a Google Cloud Shell or other Google Cloud compute environment in the same region as your bucket.
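As a rough illustration, here is what that list/iterate/rename loop might look like with the google-cloud-storage client library; the bucket name, prefix, and path-splitting logic are assumptions based on the pattern in the question, not a definitive implementation.

# Sketch only: assumes the google-cloud-storage package and appropriate credentials.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("bucket1")

for blob in client.list_blobs("bucket1", prefix="subfolder1/"):
    if not blob.name.endswith(".csv"):
        continue
    folder, filename = blob.name.rsplit("/", 1)   # e.g. "subfolder1/subfolder2", "data_filename.csv"
    stem = filename[:-len(".csv")]                # e.g. "data_filename"
    new_name = f"{folder}/{stem}/{stem}_backup.csv"
    bucket.rename_blob(blob, new_name)            # performs a copy followed by a delete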

If you have a lot of renames to perform, I recommend doing them concurrently (use several threads rather than renaming sequentially).
Keep in mind how Cloud Storage works: a native rename operation doesn't exist. If you look into the Python library, you can see what actually happens: a copy followed by a delete.
The copy can take time if your files are large; the delete is fast. But in both cases these are API calls, and each call takes time (about 50 ms if you are in the same region).
If you can perform 200 or 500 operations concurrently, you will significantly reduce the overall processing time. This is easier in Go or Node, but you can do the same in Python with asyncio (the await keyword) or a thread pool.
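Since the google-cloud-storage client is synchronous, a thread pool is the simpler concurrency route in Python. A hedged sketch follows; the worker count is arbitrary rather than a tuned value, and the renaming logic mirrors the pattern from the question.

# Sketch only: concurrent copy+delete renames via a thread pool.
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("bucket1")

def rename(blob):
    folder, filename = blob.name.rsplit("/", 1)
    stem = filename[:-len(".csv")]
    bucket.rename_blob(blob, f"{folder}/{stem}/{stem}_backup.csv")

blobs = [b for b in client.list_blobs("bucket1")
         if "/" in b.name and b.name.endswith(".csv")]
with ThreadPoolExecutor(max_workers=50) as pool:   # raise for more parallel API calls
    list(pool.map(rename, blobs))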

Related

Python and ETL: Job and workflow/task management, triggers and other concepts

I am currently writing a very simple ETL(T) pipeline:
look at the FTP server to see if new CSV files exist
if yes, then download them
some initial transformations
bulk insert the individual CSVs into an MS SQL DB
some additional transformations
There can be a lot of CSV files. The script runs OK for the moment, but I have no concept of how to actually create a "management" layer around this. Currently my pipeline runs linearly: I have a list of the filenames that need to be loaded, and (in a loop) I load them into the DB.
If something fails, the whole pipeline has to rerun. I do not manage the state of the pipeline (i.e. has a specific file already been downloaded and transformed/changed?).
There is no way to start from an intermediate point. How could I break this down into individual tasks that need to be performed?
I roughly know of tools like Airflow, but I feel that this is only part of the necessary tooling, and frankly I am too uneducated in this area to even ask the right questions.
It would be really nice if somebody could point me in the right direction regarding what I am missing and what tools are available.
Thanks in advance
I'm actually using Airflow to run ETL pipelines with steps similar to those you describe.
The whole workflow can be partitioned into individual tasks, and Airflow provides an operator for almost every kind of task.
For
look at the FTP server to see if new CSV files exist
you could use a file sensor with an underlying FTP connection:
FileSensor
For
if yes, then download them
you could use the BranchPythonOperator:
BranchPythonOperator
All subsequent tasks could be wrapped in a Python function and then executed via the PythonOperator.
I would definitely recommend using Airflow, but if you are looking for alternatives, there are plenty:
airflow-alternatives
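For illustration, a minimal DAG sketch along these lines might look as follows. It assumes Airflow 2.x; the connection ID, watched path, schedule, and the placeholder callables are all hypothetical, and the real download/transform/load logic would go inside the functions.

# Sketch only: a skeleton DAG wiring together the sensor, branch, and Python tasks.
from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.operators.empty import EmptyOperator


def decide_branch(**_):
    # Placeholder: decide whether there is anything new to process.
    return "download_csvs"  # or "no_new_files"


def download_csvs(**_):
    # Placeholder: download the new CSV files from the FTP server.
    pass


def transform_and_load(**_):
    # Placeholder: initial transformations plus bulk insert into MS SQL.
    pass


with DAG(
    dag_id="csv_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    wait_for_files = FileSensor(
        task_id="wait_for_files",
        fs_conn_id="fs_default",        # connection pointing at the watched directory
        filepath="incoming/*.csv",
        poke_interval=60,
    )
    branch = BranchPythonOperator(task_id="branch", python_callable=decide_branch)
    download = PythonOperator(task_id="download_csvs", python_callable=download_csvs)
    load = PythonOperator(task_id="transform_and_load", python_callable=transform_and_load)
    no_new_files = EmptyOperator(task_id="no_new_files")

    wait_for_files >> branch >> [download, no_new_files]
    download >> load

With this structure, each step has its own state in Airflow, so a failed run can be retried or resumed from the failed task instead of rerunning the whole pipeline.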

Python-based PDF parser integrated with Zapier

I am working for a company that currently stores PDF files on a remote drive and subsequently manually inserts values found within these files into an Excel document. I would like to automate the process using Zapier and make it scalable (we receive a large number of PDF files). Would anyone know of any applications, ideally free, for converting PDFs into Excel documents that integrate with Zapier? Alternatively, would it be possible to create a Python script in Zapier to access the information and store it in an Excel file?
This option came to mind. I'm using Google Drive as an example; you didn't say what you were using as storage, but Zapier should have an option for it.
Use CloudConvert or a doc parser (it depends on what you want to pay; CloudConvert at least gives you some free conversion time per month, so that may be the closest you can get).
Create a Zap with these steps:
Trigger on new file in drive (Name: Convert new Google Drive files with CloudConvert)
Convert file with CloudConvert
Those are the two Zapier options I can find. But you could also do it in Python from your desktop by following something like this idea, and then set up an event trigger in the Windows event manager to kick off the upload/download.
Unfortunately it doesn't seem that you can import JS/Python libraries into Zapier, though I may be wrong on that. If you can, or find a way to do so, then just use PDFMiner with "Code by Zapier". Someone would have to confirm this though; I've never gotten external libraries to work in Zaps.
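If you do end up running it locally in Python instead, a rough sketch with pdfminer.six and openpyxl might look like this; the file names and the "Total" heuristic are made up purely for illustration.

# Sketch only: extract text from a PDF and write one row per file into an Excel sheet.
from pdfminer.high_level import extract_text
from openpyxl import Workbook

text = extract_text("invoice.pdf")          # pull all text out of the PDF

# Naive example: grab the first line containing "Total" as the value of interest.
value = next((line for line in text.splitlines() if "Total" in line), "")

wb = Workbook()
ws = wb.active
ws.append(["invoice.pdf", value])           # filename, extracted value
wb.save("extracted_values.xlsx")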
Hope that helps!

How to get s3 metadata for all keys in a bucket via boto3

I want to fetch metadata for all objects under a given prefix in a bucket via boto3. There are a few SO questions that imply this isn't possible via the AWS API. So, two questions:
Is there a good reason this shouldn't be possible via the AWS API?
Although I can't find one in the docs, is there a convenience method for this in boto3?
I'm currently doing this using multithreading, but that seems like overkill, and I'd really rather avoid it if at all possible.
While there isn't a way to do this directly through boto3, you could add an inventory configuration on the bucket(s), which generates a daily CSV/ORC file with metadata for every object.
Once this has been generated you can then process the output rather than multithreading or any other method that requires a huge number of requests.
See: put_bucket_inventory_configuration
It's worth noting that it can take up to 48 hours for the first report to be generated.
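For reference, a hedged sketch of enabling such an inventory report with boto3 might look like this; the bucket names, configuration Id, prefix, and optional fields are placeholders you would replace with your own.

# Sketch only: configure a daily S3 inventory report for a prefix.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_inventory_configuration(
    Bucket="source-bucket",
    Id="daily-metadata-inventory",
    InventoryConfiguration={
        "Id": "daily-metadata-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Filter": {"Prefix": "my/prefix/"},        # only inventory keys under this prefix
        "Schedule": {"Frequency": "Daily"},
        "OptionalFields": ["Size", "LastModifiedDate", "ETag", "StorageClass"],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::inventory-destination-bucket",
                "Format": "CSV",
                "Prefix": "inventory-reports",
            }
        },
    },
)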

Google Dataflow / Apache Beam Python - Side-Input from PCollection kills performance

We are running logfile parsing jobs in Google Dataflow using the Python SDK. Data is spread over several hundred daily logs, which we read via a file pattern from Cloud Storage. The data volume for all files is about 5-8 GB (gz files), with 50-80 million lines in total.
loglines = p | ReadFromText('gs://logfile-location/logs*-20180101')
In addition, we have a simple (small) mapping CSV that maps logfile entries to human-readable text. It has about 400 lines and is about 5 KB in size.
For example, a logfile entry with [param=testing2] should be mapped to "Customer requested 14day free product trial" in the final output.
We do this in a simple beam.Map with a side input, like so:
customerActions = loglines | beam.Map(map_logentries,mappingTable)
where map_logentries is the mapping function and mappingTable is said mapping table.
However, this only works if we read the mapping table in native Python via open() / read(). If we instead read it through the Beam pipeline via ReadFromText() and pass the resulting PCollection as a side input to the Map, like so:
mappingTable = p | ReadFromText('gs://side-inputs/category-mapping.csv')
customerActions = loglines | beam.Map(map_logentries,beam.pvalue.AsIter(mappingTable))
performance breaks down completely, to about 2-3 items per second.
Now, my questions:
Why would performance break down so badly? What is wrong with passing a PCollection as a side input?
If using PCollections as side input is perhaps not recommended, how is one supposed to build a pipeline that needs mappings which cannot/should not be hard-coded into the mapping function?
For us, the mapping does change frequently, and I need to find a way to have "normal" users provide it. The idea was to have the mapping CSV available in Cloud Storage and simply incorporate it into the pipeline via ReadFromText(). Reading it locally involves shipping the mapping to the workers, so only the tech team can do this.
I am aware that there are caching issues with side inputs, but surely this should not apply to a 5 KB input.
All code above is pseudo code to explain the problem. Any ideas and thoughts on this would be highly appreciated!
For more efficient side inputs (of small to medium size) you can use
beam.pvalue.AsList(mappingTable)
since AsList causes Beam to materialize the data, so you can be sure that you will get an in-memory list for that PCollection.
Intended for use in side-argument specification---the same places
where AsSingleton and AsIter are used, but forces materialization of
this PCollection as a list.
Source: https://beam.apache.org/documentation/sdks/pydoc/2.2.0/apache_beam.pvalue.html?highlight=aslist#apache_beam.pvalue.AsList
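A rough sketch of the AsList variant, reusing the names from the pseudocode above; the "param,label" layout of the mapping CSV is an assumption about your file.

# Sketch only: pass the mapping PCollection as a materialized list side input.
import apache_beam as beam


def map_logentries(line, mapping_rows):
    # mapping_rows arrives as an in-memory list of CSV lines.
    mapping = dict(row.split(',', 1) for row in mapping_rows if row)
    for param, label in mapping.items():
        if param in line:
            return label
    return line


with beam.Pipeline() as p:
    loglines = p | 'ReadLogs' >> beam.io.ReadFromText('gs://logfile-location/logs*-20180101')
    mappingTable = p | 'ReadMapping' >> beam.io.ReadFromText('gs://side-inputs/category-mapping.csv')
    customerActions = loglines | 'MapEntries' >> beam.Map(
        map_logentries, beam.pvalue.AsList(mappingTable))

In a real pipeline you would not rebuild the dict for every element; parsing the side input once (for example via AsDict, as the next answer suggests) avoids that overhead.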
The code looks fine. However, since mappingTable is a mapping, wouldn't beam.pvalue.AsDict be more appropriate for your use case?
Your mappingTable is small enough that a side input is a good use case here.
Given that mappingTable is also static, you can load it from GCS in the start_bundle method of your DoFn. See the answer to this post for more details. If mappingTable becomes very large in the future, you can also consider converting map_logentries and mappingTable into PCollections of key-value pairs and joining them using CoGroupByKey.
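As a sketch of the start_bundle approach; the GCS path, CSV layout, and matching logic are assumptions, not your actual mapping rules.

# Sketch only: load the small mapping CSV once per bundle instead of using a side input.
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class MapLogEntries(beam.DoFn):
    """Maps raw log lines to human-readable labels using a CSV loaded from GCS."""

    def start_bundle(self):
        # Read and cache the mapping CSV once per bundle.
        with FileSystems.open('gs://side-inputs/category-mapping.csv') as f:
            rows = f.read().decode('utf-8').splitlines()
        # Assumed layout: "param,human readable label" per line.
        self.mapping = dict(row.split(',', 1) for row in rows if row)

    def process(self, line):
        for param, label in self.mapping.items():
            if param in line:
                yield label
                return
        yield line


with beam.Pipeline() as p:
    loglines = p | 'ReadLogs' >> beam.io.ReadFromText('gs://logfile-location/logs*-20180101')
    customerActions = loglines | 'MapEntries' >> beam.ParDo(MapLogEntries())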

S3 and Filepicker.io - Multi-file, zipped download on the fly

I am using Ink (Filepicker.io) to perform multi-file uploads and it is working brilliantly.
However, a quick look around the internet shows that multi-file downloads are more complicated. While I know it is possible to spin up an EC2 instance to zip the files on the fly, this would entail some wait time on the user's part, and the newly created file would also not be available on my CloudFront distribution immediately.
Has anybody done this before, and what are the practical UX implications - is the wait time significant enough to negatively affect the user experience?
The obvious solution would be to create the zipped files ahead of time, but this would result in some (unnecessary?) redundancy.
What is the best way to avoid redundant storage while reducing wait times for on-the-fly folder compression?
You can create the ZIP archive on the client side using JavaScript. Check out:
http://stuk.github.io/jszip/
