Heroku - CSV Files and TXT Logfiles - python

I want to deploy a Python bot to Heroku.
The bot writes all logging data to TXT files and also exports a CSV file to the filesystem. The CSV holds data that is needed for the next run of the bot and also makes it possible to track the bot's past performance.
Since, as far as I know, it is not possible to store files persistently on a Heroku dyno, the question is: how/where should I store this data?
A database for the data in the CSV file is not suitable for me, because I sometimes have to edit the file between two runs, and doing that through a database would be too much effort.
Any suggestions?

You need to save the file(s) on external storage such as S3, Dropbox, GitHub, etc.
Check Files on Heroku to see options and examples.
You can decide to read/write the files directly from the storage (i.e. no local copy), or keep the files locally and make sure they are saved back at some point (e.g. every 10 minutes, or before every restart).
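For instance, here is a minimal sketch of the second approach using boto3 and S3 (the bucket name, object keys, and local paths are placeholders; credentials are assumed to come from environment variables / Heroku config vars):

```python
import boto3

s3 = boto3.client("s3")  # credentials picked up from env vars / Heroku config vars

BUCKET = "my-bot-bucket"       # placeholder bucket name
KEY = "state/bot_state.csv"    # placeholder object key
LOCAL = "/tmp/bot_state.csv"   # dyno filesystem is ephemeral, so /tmp is only a scratch copy

# At startup: pull the CSV written by the previous run.
s3.download_file(BUCKET, KEY, LOCAL)

# ... run the bot, appending to the local CSV and log files ...

# Periodically and/or before the dyno restarts: push the updated files back.
s3.upload_file(LOCAL, BUCKET, KEY)
s3.upload_file("/tmp/bot.log", BUCKET, "logs/bot.log")
```

Because the CSV lives in the bucket between runs, you can still download it from the S3 console, edit it by hand, and upload it again before the next run.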

Related

Transfer file from S3 to Windows server

I have just been introduced to Python (PySpark). I have a requirement to achieve the following steps:
1. Extract data from a Hive table (on EMR) into a CSV file on AWS S3.
2. Transfer the CSV file created on S3 (EMR cluster running Spark on YARN) to a remote Windows server (at a certain folder path).
3. Once the file has been transferred, trigger a batch file that exists on the Windows server at a certain folder path.
4. The Windows batch script, when triggered, updates/enriches the transferred file with additional information; transfer/copy the updated CSV file back to S3.
5. The final step is to load the updated file into a Hive table once it is back on S3.
I have figured out how to extract the data from the table into a CSV file on S3 and also how to load the file into the table. However, I am struggling to get a bearing on how to perform the file transfer/copy between the servers and, most importantly, how to trigger the Windows batch script on the remote machine.
Could someone please help me and point me in the right direction, and hint as to where I should start? I searched the internet but couldn't get a concrete answer. I understand that I have to use the Boto3 library to interact with S3, but if there is any other established solution please share it with me (code snippets, articles, etc.), along with any specific configuration I might have to incorporate to achieve the result.
Thanks
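For the transfer and the batch-script trigger specifically, one possible direction (only a sketch, assuming WinRM is enabled on the Windows server and the AWS CLI is installed there; the host, credentials, bucket, and paths below are placeholders) is to drive the Windows server from the EMR side with the pywinrm library:

```python
import winrm

# Placeholder host and credentials; in practice these would come from a secret store.
session = winrm.Session("windows-server.example.com", auth=("svc_user", "password"))

# Ask the Windows box to pull the CSV from S3 into the target folder (requires AWS CLI there).
copy_result = session.run_ps(
    r"aws s3 cp s3://my-bucket/exports/data.csv C:\data\incoming\data.csv"
)
if copy_result.status_code != 0:
    raise RuntimeError(copy_result.std_err.decode())

# Trigger the batch script that enriches the file.
batch_result = session.run_cmd(r"C:\scripts\enrich.bat")
print(batch_result.std_out.decode())

# Push the enriched file back to S3 so it can be loaded into Hive.
session.run_ps(r"aws s3 cp C:\data\incoming\data.csv s3://my-bucket/enriched/data.csv")
```

If WinRM is not an option, an SMB share or SSH/SFTP access on the Windows server would be alternative ways to move the file, with the trigger handled separately.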

How to stream a very large file to Dropbox using python v2 api

Background
I finally convinced someone to share his full 5868 GiB archival node database for free (which now has to be built in RAM, and thus requires $100,000 worth of RAM to build, but can be run from an SSD once done).
However, he only wants to send it as a single tar file over raw TCP, using a rather slow (400 Mbps) connection for this task.
I need to get it onto Dropbox, and he doesn't want to use the https://www.dropbox.com/request/[my upload key here] link that allows uploading files through a web browser without a Dropbox account (it really annoyed him that I talked about using another method or compressing the database, to the point that he is on the verge of changing his mind about sharing it).
On my side, Dropbox allows 10 TiB of storage for free for 30 days, and I haven't received the required SSD yet (once it arrives I will be able to download the data at a faster speed).
The problem
I'm fully aware of "upload file to my dropbox from python script", but in my case the file doesn't fit into a memory buffer, nor even on disk.
Also, in API v1 it wasn't possible to append data to an existing file (and I didn't find the answer for v2).
To upload a large file to the Dropbox API using the Dropbox Python SDK, you would use upload sessions to upload it in pieces. There's a basic example here.
Note that the Dropbox API only supports files up to 350 GB though.
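A rough sketch of an upload session with the official SDK (the access token, chunk size, and destination path are placeholders; in your situation the reader would wrap the incoming TCP socket rather than a local file):

```python
import dropbox

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB per append call (placeholder value)

dbx = dropbox.Dropbox("ACCESS_TOKEN")  # placeholder token

def stream_to_dropbox(reader, dest_path):
    """reader: any object with a .read(n) method returning bytes (file, socket wrapper, ...)."""
    first = reader.read(CHUNK_SIZE)
    session = dbx.files_upload_session_start(first)
    cursor = dropbox.files.UploadSessionCursor(
        session_id=session.session_id, offset=len(first)
    )
    commit = dropbox.files.CommitInfo(path=dest_path)

    while True:
        chunk = reader.read(CHUNK_SIZE)
        if not chunk:
            # Stream exhausted: finish the session with an empty final append.
            dbx.files_upload_session_finish(b"", cursor, commit)
            break
        dbx.files_upload_session_append_v2(chunk, cursor)
        cursor.offset += len(chunk)

# Usage, e.g. for a local file:
#   with open("huge.tar", "rb") as f:
#       stream_to_dropbox(f, "/huge.tar")
```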

I want to trigger a python script using a cloud function whenever a specified file is created on the google cloud storage

One CSV file is uploaded to Cloud Storage every day around 02:00 hrs, but sometimes, due to a job failure or system crash, the upload happens very late. So I want to create a Cloud Function that triggers my Python BigQuery load script whenever the file is uploaded to the bucket.
file_name : seller_data_{date}
bucket name : sale_bucket/
The question doesn't give much detail about the desired use case or any issues the OP has faced. However, here are a few possible approaches that you might choose from, depending on the use case.
The simple way: Cloud Functions with a Storage trigger.
This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket.
The most basic tutorial is this.
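As a rough illustration (not a complete solution: the dataset/table id is a placeholder and this uses the 1st-gen background-function signature), the function receives the object metadata from the google.storage.object.finalize event and can start the BigQuery load:

```python
from google.cloud import bigquery

def on_file_upload(event, context):
    """Background Cloud Function triggered when an object is finalized in the bucket."""
    bucket = event["bucket"]
    name = event["name"]

    # Only react to the daily seller file, e.g. seller_data_2023-01-31
    if not name.startswith("seller_data_"):
        return

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    uri = f"gs://{bucket}/{name}"
    # "my_dataset.seller_data" is a placeholder table id.
    load_job = client.load_table_from_uri(uri, "my_dataset.seller_data", job_config=job_config)
    load_job.result()  # wait for the load job to finish
```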
The hard way: App Engine with a few tricks.
Host a basic Flask application on GAE (Standard or Flex), with an endpoint specifically meant to check whether the file exists, download the object, manipulate it and then do something with it.
This route can act as a custom HTTP-triggered function: once it receives a request (which could come from a simple curl call, a visit from the browser, a Pub/Sub push, or even another Cloud Function), it downloads the object into the /tmp dir, processes it and then does whatever is needed.
The small benefit of GAE over CF is that you can set a minimum of one instance to stay alive at all times, which means you avoid cold starts and the risk of the request timing out before the job is done.
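A minimal sketch of such an endpoint (the route, object name, and processing step are placeholders; /tmp handling is covered in section a) further down):

```python
from flask import Flask
from google.cloud import storage

app = Flask(__name__)

@app.route("/load-seller-data", methods=["GET", "POST"])
def load_seller_data():
    client = storage.Client()
    # Placeholder object name; in practice you would look up the latest seller_data_{date} file.
    blob = client.bucket("sale_bucket").blob("seller_data_2023-01-31")
    local_path = "/tmp/seller_data.csv"
    blob.download_to_filename(local_path)
    # ... process the file / trigger the BigQuery load here ...
    return "ok", 200
```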
The brutal/overkill way: Cloud Run.
This is a similar approach to App Engine, but with Cloud Run you'll also need to work with a Dockerfile, keep in mind that Cloud Run scales down to zero when there's no usage, and deal with other minor things that apply to building any application on Cloud Run.
########################################
For all of the above approaches, some of the additional things you might want to do are the same:
a) Downloading the object and doing some processing on it:
You will have to download it to the /tmp directory, as that's the directory both GAE and CF use for temporary files. Cloud Run is a bit different here, but let's not go deep into it, as it's overkill by itself.
However, keep in mind that if your file is large you might cause high memory usage.
And ALWAYS clean that directory after you have finished with the file. Also, when opening a file, always use with open(...) as ..., as that makes sure files aren't left open.
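For instance, the download-process-cleanup cycle could look roughly like this (bucket, object, and path names are placeholders):

```python
import csv
import os
from google.cloud import storage

def process_object(bucket_name, object_name):
    local_path = f"/tmp/{os.path.basename(object_name)}"
    storage.Client().bucket(bucket_name).blob(object_name).download_to_filename(local_path)
    try:
        with open(local_path, newline="") as f:   # closed automatically when the block ends
            for row in csv.reader(f):
                ...                               # process each row
    finally:
        os.remove(local_path)                     # always clean /tmp afterwards
```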
b) Downloading the latest object in the bucket:
This is a bit tricky and needs some extra custom code. There are many ways to achieve it, but the one I use (though always paying close attention to memory usage) is this: when creating the object I upload to the bucket, I take the current time and use a regex to turn it into a name like results_22_6.
What happens now is that once I list the objects from my other script, they are already listed in ascending order, so the last element in the list is the latest object.
So basically what I do then is check whether the filename I have in /tmp is the same as the name of the last object in the bucket. If yes, I do nothing; if not, I delete the old one and download the latest one from the bucket.
This might not be optimal, but for me it's preferable.
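A small sketch of that check (bucket name and prefix are placeholders; it relies on the object names sorting chronologically, as described above):

```python
import os
from google.cloud import storage

def latest_object_name(bucket_name, prefix="results_"):
    client = storage.Client()
    blobs = list(client.list_blobs(bucket_name, prefix=prefix))
    # Objects are listed in lexicographic order, so the last one is the newest
    # as long as the naming scheme sorts chronologically.
    return blobs[-1].name if blobs else None

def refresh_local_copy(bucket_name, local_dir="/tmp"):
    name = latest_object_name(bucket_name)
    if name is None:
        return None
    local_path = os.path.join(local_dir, os.path.basename(name))
    if not os.path.exists(local_path):
        # Old copies could be removed here before downloading the new one.
        storage.Client().bucket(bucket_name).blob(name).download_to_filename(local_path)
    return local_path
```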

Architecture for syncing s3/cloudfront with database

I'm building a Django app. The app allows the user to upload files, and have them served publicly to other users.
I'm thinking of using S3 or CloudFront to manage and serve these files. (Let's call it S3 for the sake of discussion.) What bugs me is the fact that S3 is going to hold a lot of state. My Python code will create, rename and delete files on S3 based on user actions. But we already have all the state in our database, and keeping state in two separate datastores could lead to synchronization problems and confusion. In other words, the two are "not supposed to" go out of sync. For example, if someone were to delete a record in the database from the Django admin, the file on S3 would stay behind, orphaned. (I could write code to deal with that scenario, but I can't catch every scenario.)
So what I'm thinking is: Is there a solution for having your S3 sync automatically with data in your Postgres database? (I have no problem storing the files as blobs in the database, they're not big, as long as they're not served directly from there.) I'm talking about having an active program that always maintains sync between them, so if say someone deletes a record in the database, the corresponding file in s3 gets deleted, and if someone deletes a file from the S3 interface, it gets recreated from the database. That way my mind could rest at ease regarding synchronization issues.
Is there something like that? Most preferably in Python.
I ran into the same problem in the past; maybe this isn't the best advice, but here's what I did.
I wrote the upload/modify/remove-from-S3 logic in the models and used model signals to keep it updated. For example, you can use the post_delete signal to delete the image from S3 and avoid orphans.
I also had a management command to check that everything is synchronized and fix any problems it finds. Unfortunately I wrote this for a client and I cannot share it.
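A minimal sketch of the post_delete idea (UserFile, its s3_key field, and the bucket name are hypothetical):

```python
import boto3
from django.db.models.signals import post_delete
from django.dispatch import receiver

from myapp.models import UserFile  # hypothetical model storing the object key in `s3_key`

@receiver(post_delete, sender=UserFile)
def delete_s3_object(sender, instance, **kwargs):
    """Remove the corresponding S3 object when the database record is deleted."""
    boto3.client("s3").delete_object(Bucket="my-uploads-bucket", Key=instance.s3_key)
```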
Edit: I found django-cb-storage-s3 and django-s3sync; they could be helpful.

GAE better output information of appcfg.py bulkupload on daily routine

I have a web service on Google App Engine (programmed in Python), and every day I have to update it with data from an FTP source.
My daily job, which runs outside of GAE, downloads the data from the FTP server, then parses and enriches it with other information sources; this process takes nearly 2 hours.
After all this, I upload the data to my server using the bulk upload function of appcfg.py (command line).
Since I want better reports of this process, I need to know how many records were actually uploaded by each call to appcfg (there are more than 10 calls).
My question is: can I get the number of lines uploaded from appcfg.py without having to parse its output?
Bonus question: does anyone else do this kind of daily routine, or is it bad practice?
