I have a web service on Google App Engine (written in Python), and every day I have to update it with data from an FTP source.
My daily job, which runs outside of GAE, downloads the data from the FTP server, then parses and enriches it with other information sources; this process takes nearly 2 hours.
After all this, I upload the data to my server using the bulk upload function of appcfg.py (command line).
Since I want better reporting on this process, I need to know how many records were actually uploaded by each call to appcfg (there are more than 10 calls).
My question is: can I get the number of uploaded records from appcfg.py without having to parse its output?
Bonus question: does anyone else do this kind of daily routine, or is it a bad practice?
I have a Python script on my local machine that reads a CSV file and outputs some metrics. The end goal is to create a web interface, hosted entirely on Azure, where the user uploads the CSV file and the metrics are displayed.
I want to use a VM on Azure to run this Python script.
The script takes the CSV file and outputs metrics, which are stored in Cosmos DB.
A web interface reads from this DB and displays graphs from the data generated by the script.
Can someone elaborate on the steps I need to follow to achieve this? Detailed steps are not strictly required, but a brief overview with links to relevant learning resources would be helpful.
There's an article that lists the primary options for hosting sites in Azure: https://learn.microsoft.com/en-us/azure/developer/python/quickstarts-app-hosting
As Sadiq mentioned, Functions is probably your best choice: it will likely be less expensive, require less maintenance, and can handle both the script and the web interface. Here is a Python tutorial for that method: https://learn.microsoft.com/en-us/azure/developer/python/tutorial-vs-code-serverless-python-01
Option 2 would be to run a traditional website on an App Service plan, with background tasks handled either by Functions or a WebJob; they both use the WebJobs SDK, so the code is very similar: https://learn.microsoft.com/en-us/learn/paths/deploy-a-website-with-azure-app-service/
VMs are an option if neither of those two works, but they come with significantly more administration. This learning path has info on how to do it; the example website is built on the MEAN stack, but the approach applies to Python as well: https://learn.microsoft.com/en-us/learn/paths/deploy-a-website-with-azure-virtual-machines/
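If you go with the Functions route, a minimal sketch of an HTTP-triggered function could look like the following. This is only an illustration: it assumes the v1 Python programming model (so it would be paired with a function.json HTTP trigger), and the environment variable names, database/container names and the metric calculation are placeholders for your own script.

```python
# Sketch of an HTTP-triggered Azure Function (v1 programming model).
# COSMOS_URL / COSMOS_KEY, the database and container names, and the
# metric logic are all placeholders -- adapt them to your actual script.
import csv
import io
import os
import uuid

import azure.functions as func
from azure.cosmos import CosmosClient


def main(req: func.HttpRequest) -> func.HttpResponse:
    # The user uploads the CSV file in the request body.
    rows = list(csv.DictReader(io.StringIO(req.get_body().decode("utf-8"))))

    # Placeholder metric: just count the rows. Replace with your real calculations.
    metrics = {"id": str(uuid.uuid4()), "row_count": len(rows)}

    # Store the metrics document in Cosmos DB so the web interface can read it back.
    client = CosmosClient(os.environ["COSMOS_URL"], credential=os.environ["COSMOS_KEY"])
    container = client.get_database_client("metricsdb").get_container_client("metrics")
    container.upsert_item(metrics)

    return func.HttpResponse("Metrics stored", status_code=200)
```

The web interface can then query the same Cosmos DB container to render the graphs, which keeps the script and the UI decoupled.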
One CSV file is uploaded to Cloud Storage every day around 02:00, but sometimes, due to a job failure or system crash, the upload happens much later. So I want to create a Cloud Function that can trigger my Python BigQuery load script whenever the file is uploaded to the bucket.
file name: seller_data_{date}
bucket name: sale_bucket/
The question lacks a detailed description of the desired use case and of any issues the OP has faced. However, here are a few possible approaches you might choose from, depending on the use case.
The simple way: Cloud Functions with Storage trigger.
This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket.
The most basic tutorial is this.
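As a rough illustration (not a drop-in solution), a 1st-gen, storage-triggered Cloud Function that kicks off a BigQuery load could look like this. The dataset/table names, runtime and deploy flags are assumptions you would adapt to your project.

```python
# main.py for a 1st-gen Cloud Function, deployed with something like:
#   gcloud functions deploy load_seller_data --runtime python39 \
#       --trigger-resource sale_bucket --trigger-event google.storage.object.finalize
# Dataset/table names below are placeholders.
from google.cloud import bigquery


def load_seller_data(event, context):
    """Runs whenever an object is finalized (uploaded) in the bucket."""
    bucket = event["bucket"]
    name = event["name"]

    # Only react to the daily seller file; ignore anything else in the bucket.
    if not name.startswith("seller_data_"):
        return

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    uri = "gs://{}/{}".format(bucket, name)
    load_job = client.load_table_from_uri(uri, "my_dataset.seller_data", job_config=job_config)
    load_job.result()  # Wait so that any load errors surface in the function logs.
```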
The hard way: App Engine with a few tricks.
Host a basic Flask application on GAE (Standard or Flex), with an endpoint specifically to check whether the file exists, download the object, manipulate it, and then do something with it.
This route can act as a custom HTTP-triggered function; the request could come from a simple curl call, a visit from the browser, a Pub/Sub push, or even another Cloud Function.
Once it receives a GET (or POST) request, it downloads the object into the /tmp directory, processes it, and then does something with the result.
The small benefit of GAE over CF is that you can set a minimum of one instance to always stay alive, which means you will not have cold starts or risk the request timing out before the job is done.
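A bare-bones sketch of that Flask endpoint might look like the following; the route, bucket name, default object name and the processing step are placeholders.

```python
# Rough sketch of the GAE/Flask variant. Bucket name, object name and the
# processing step are placeholders for your own logic.
from flask import Flask, request
from google.cloud import storage

app = Flask(__name__)


@app.route("/process", methods=["GET", "POST"])
def process_file():
    # The request can come from curl, a browser, Pub/Sub push, or another function.
    object_name = request.args.get("name", "seller_data_latest.csv")

    client = storage.Client()
    blob = client.bucket("sale_bucket").blob(object_name)

    local_path = "/tmp/" + object_name
    blob.download_to_filename(local_path)

    # ... manipulate the file here, then do something with the result ...

    return "processed {}".format(object_name), 200
```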
The brutal/overkill way: Cloud Run.
This is a similar approach to App Engine, but with Cloud Run you'll also need to work with a Dockerfile, keep in mind that Cloud Run scales down to zero when there's no usage, and handle other minor things that apply to building any application on Cloud Run.
########################################
For all the above approaches, some additional things you might want to do are the same:
a) Downloading the object and doing some processing on it:
You will have to download it to the /tmp directory, as that's where both GAE and CF store temporary files. Cloud Run is a bit different here, but let's not go deep into it since it's overkill by itself.
However, keep in mind that if your file is large you might cause high memory usage.
And ALWAYS clean that directory after you have finished with the file. Also, when opening a file, always use with open(...), as it makes sure files are not left open.
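For item a), a small sketch of that download/clean-up pattern could be the following (placeholder names, assuming the google-cloud-storage client):

```python
# Sketch of the /tmp handling described above: download, process with
# "with open", and always clean up afterwards. Names are placeholders.
import os

from google.cloud import storage


def process_object(bucket_name, object_name):
    local_path = os.path.join("/tmp", os.path.basename(object_name))
    storage.Client().bucket(bucket_name).blob(object_name).download_to_filename(local_path)
    try:
        # "with open" guarantees the file handle is closed even on errors.
        with open(local_path) as f:
            for line in f:
                pass  # ... do the actual processing here ...
    finally:
        # ALWAYS remove the temporary file so /tmp (which counts against memory
        # on CF/GAE) does not fill up across invocations.
        if os.path.exists(local_path):
            os.remove(local_path)
```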
b) Downloading the latest object in the bucket:
This is a bit tricky and needs some extra custom code. There are many ways to achieve it, but the one I use (while always paying close attention to memory usage) is this: when creating the object I upload to the bucket, I get the current time and use a regex to turn the name into something like results_22_6.
What happens now is that when I list the objects from my other script, they already come back in ascending order, so the last element in the list is the latest object.
So basically what I do then is check whether the filename I have in /tmp is the same as the name of the last object in the bucket listing. If it is, do nothing; if not, delete the old one and download the latest one from the bucket.
This might not be optimal, but for me it's preferable.
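For item b), a sketch of that comparison could look like this. It assumes the timestamped naming scheme described above (so a lexical sort matches chronological order); the bucket name, local directory and results_ prefix are placeholders.

```python
# Sketch of the "latest object" check described above. It relies on the
# timestamped naming scheme so that sorting by name matches chronological order.
# Bucket name, directory and prefix are placeholders.
import os

from google.cloud import storage


def sync_latest(bucket_name="sale_bucket", local_dir="/tmp"):
    client = storage.Client()
    blobs = sorted(client.list_blobs(bucket_name), key=lambda b: b.name)
    latest = blobs[-1]  # last element == latest object, given the naming scheme

    local_path = os.path.join(local_dir, latest.name)
    if os.path.exists(local_path):
        return local_path  # already have the newest file, nothing to do

    # Remove any stale local copies, then download the latest object.
    for stale in os.listdir(local_dir):
        if stale.startswith("results_") and stale != latest.name:
            os.remove(os.path.join(local_dir, stale))
    latest.download_to_filename(local_path)
    return local_path
```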
The main purpose of my Python script is to parse a website and then save the results as an HTML or TXT file on the server. I also want the script to repeat this operation every 15 minutes without any action on my part.
Google App Engine doesn't allow saving files on the server; instead I should use a database. Is it feasible to save TXT or HTML in the database? And how do I make the script run without stopping?
Thanks in advance for your help.
You are correct in saying that it is impossible to save files directly to the server. Your only option is the datastore, as you say. The data type best suited to you is probably the "Text string (long)" type; however, you are limited to 1 MB per entity. See https://developers.google.com/appengine/docs/python/datastore/entities for more information.
Regarding the scheduling, you are looking for cron jobs. You can set up a cron job to run at any configurable interval. See https://developers.google.com/appengine/docs/python/config/cron for details on how cron jobs work.
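To make that concrete, here is a minimal sketch for the legacy Python 2.7 runtime that those docs describe. The handler path, model name and target URL are placeholders, and the /tasks/scrape URL would also need a matching cron.yaml entry with a schedule of every 15 minutes.

```python
# Sketch only: legacy GAE Python 2.7 runtime (webapp2 + db), matching the linked docs.
# The target URL, model and handler names are placeholders.
import webapp2
from google.appengine.api import urlfetch
from google.appengine.ext import db


class ScrapeResult(db.Model):
    # db.TextProperty maps to the "Text string (long)" type, limited to 1 MB.
    content = db.TextProperty()
    fetched_at = db.DateTimeProperty(auto_now_add=True)


class ScrapeHandler(webapp2.RequestHandler):
    def get(self):
        # Cron issues a plain GET to this URL every 15 minutes.
        result = urlfetch.fetch("http://example.com/page-to-parse")
        ScrapeResult(content=db.Text(result.content, encoding="utf-8")).put()
        self.response.write("saved")


app = webapp2.WSGIApplication([("/tasks/scrape", ScrapeHandler)])
```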
I have a running GAE app that has been collecting data for a while. I am now at the point where I need to run some basic reports on this data and would like to download a subset of the live data to my dev server. Downloading all entities of a kind will simply be too big a data set for the dev server.
Does anyone know of a way to download a subset of entities of a particular kind? Ideally it would be based on entity attributes like date or client ID, but any method would work. I've even tried a regular full download and then arbitrarily killing the process when I thought I had enough data, but it seems the data is locked up in the .sql3 files generated by the bulkloader.
It looks like the default utilities for downloading from and uploading to the GAE datastore (appcfg.py and bulkloader.py) don't support filtering.
It seems reasonable to do one of two things:
write a utility (select + export + save to a local file) and execute it locally, accessing the remote GAE datastore through the remote API shell
write an admin web function for select + export + zip: a new URL in a handler, uploaded to GAE and called over HTTP
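For the first option, a rough sketch of what you might run inside a remote_api_shell.py session (where the connection to the remote datastore is already configured) could be something like this; the kind, property names and date filter are placeholders for your actual model.

```python
# Sketch only, intended to be pasted into a remote_api_shell.py session.
# Kind, property names and the cutoff are placeholders.
import csv
import datetime

from google.appengine.ext import db


class Record(db.Expando):
    pass  # Expando lets us query the kind without its original model class.


cutoff = datetime.datetime.now() - datetime.timedelta(days=7)
query = Record.all().filter("created >=", cutoff)

with open("subset.csv", "wb") as out:
    writer = csv.writer(out)
    for entity in query:
        # Write only the attributes you need for your local reports.
        writer.writerow([entity.key().id_or_name(), entity.created, entity.client_id])
```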
I have an application I am developing on top of GAE, using the Python APIs. I am using the local development server right now. The application involves parsing large blocks of XML data received from an outside service.
So the question is: is there an easy way to export this XML data out of the GAE application? In a regular app I would just write it to a temp file, but in a GAE app I cannot do that. So what could I do instead? I cannot easily run the code that produces the service call outside of GAE, since it uses some GAE functions to create the call, but it would be much easier if I could take the XML result out, develop and test the parser part outside, and then put it back into the GAE app.
I tried logging the data and then extracting it from the console, but when the XML gets big that doesn't work well. I know there are bulk data import/export APIs, but it seems overkill to write this one piece of information to the datastore just so I can export the whole store. What is the best way to do it?
How about writing the XML data to the blobstore and then writing a handler that uses send_blob to download it to your local file system?
You can use the files API to write to the blobstore from your application.
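As an illustration only, a sketch of that combination using the legacy files API plus a BlobstoreDownloadHandler could look like this; the handler path, saved file name and helper name are placeholders.

```python
# Sketch of the suggestion above using the legacy files API and a download handler.
# Handler path and file names are placeholders.
from google.appengine.api import files
from google.appengine.ext.webapp import blobstore_handlers


def save_xml_to_blobstore(xml_string):
    """Write the raw XML into the blobstore and return its blob key."""
    file_name = files.blobstore.create(mime_type="application/xml")
    with files.open(file_name, "a") as f:
        f.write(xml_string)
    files.finalize(file_name)
    return files.blobstore.get_blob_key(file_name)


class DownloadXmlHandler(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, blob_key):
        # Hitting this URL in a browser (or with curl -O) saves the XML locally.
        self.send_blob(blob_key, save_as="dump.xml")
```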