Google AppEngine and Threaded Workers - python

I am currently trying to develop something using Google AppEngine. I am using Python as my runtime and need some advice on setting up the following.
I am running a web server that provides JSON data to clients. The data comes from an external service that I have to pull it from.
What I need is a background system that checks memcache to see if there are any required IDs; if there is an ID, I need to fetch some data for that ID from the external source and place the data in memcache.
If there are multiple IDs (say more than 30), I need to be able to pull all 30 requests as quickly and efficiently as possible.
I am new to Python development and AppEngine, so any advice you guys could give would be great.
Thanks.

You can use "backends" or "task queues" to run processes in the background. Tasks have a 10-minute run time limit, and backends have no run time limit. There's also a cronjob mechanism which can trigger requests at regular intervals.
You can fetch the data from external servers with the "URLFetch" service.

Note that using memcache as the communication mechanism between front-end and back-end is unreliable -- the contents of memcache may be partially or fully erased at any time (and it does happen from time to time).
Also note that you can't query memcache if you don't know the exact keys ahead of time. It's probably better to use the task queue to queue up requests instead of memcache, or to use the datastore as the storage mechanism.
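
As a rough sketch of the task-queue approach on the Python runtime (the handler routes, cache keys and external URL below are placeholders, not anything from your project):

import webapp2
from google.appengine.api import memcache, taskqueue, urlfetch

class EnqueueHandler(webapp2.RequestHandler):
    def post(self):
        # Instead of parking IDs in memcache, push one task per ID onto a queue.
        for item_id in self.request.get_all('id'):
            taskqueue.add(url='/fetch_worker', params={'id': item_id})

class FetchWorker(webapp2.RequestHandler):
    def post(self):
        item_id = self.request.get('id')
        # Pull the data for this ID from the external service.
        result = urlfetch.fetch('https://external.example.com/data/%s' % item_id)
        if result.status_code == 200:
            # Cache the payload so the front end can serve it quickly.
            memcache.set('data:%s' % item_id, result.content)

app = webapp2.WSGIApplication([
    ('/enqueue', EnqueueHandler),
    ('/fetch_worker', FetchWorker),
])

Because push queues execute tasks concurrently (within the queue's configured rate), 30 queued IDs get fetched more or less in parallel rather than one after another.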

Related

web service for recommendation system

I'm trying to build a recommendation system in Python using the lightfm library and an API created with the Flask framework.
My question is more design related than coding related.
The web service, which will be called when a user logs in to the website, receives a JSON with a userid and returns a JSON with the userid and 5 product SKUs to be recommended.
I want to save those recommendations in a DB, so I can compare that table with other tables in the DB and find out whether a user has purchased a product that I recommended.
My concern (maybe it's a stupid one) is that everything will slow down if I open a connection to the DB and write data to it.
Potentially the service can be called between 5k to 7k times per day.
Thanks
What I've understood from your explanation is that you will be comparing what the user actually selected against what you recommended. Assuming you run that comparison, say, once a week, it won't add much to your processing.
Your concern is whether everything will slow down if a DB connection is opened.
It won't slow the service down. At around 5k calls per day, other factors are far more likely to slow the service down or cause it to stop, for example a single Python process falling over when the number of concurrent users gets too high.
What you need to do here is use a WSGI application server such as Gunicorn or uWSGI (see: Using Gunicorn with Flask).
Gunicorn starts multiple Python processes running Flask, so it can support a high number of concurrent users.
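
As a minimal sketch of the endpoint plus the DB write (the recommend() helper, table name and SQLite file are placeholders to keep it self-contained; in practice you would write to your real product DB and the table would already exist):

import datetime
import sqlite3
from flask import Flask, request, jsonify

app = Flask(__name__)

def recommend(user_id):
    # Placeholder for the lightfm model call; returns 5 SKUs.
    return ['sku1', 'sku2', 'sku3', 'sku4', 'sku5']

@app.route('/recommendations', methods=['POST'])
def recommendations():
    user_id = request.get_json()['userid']
    skus = recommend(user_id)
    # A handful of INSERTs per request is negligible at 5k-7k calls per day.
    conn = sqlite3.connect('recommendations.db')
    conn.executemany(
        'INSERT INTO recommendations (userid, sku, created_at) VALUES (?, ?, ?)',
        [(user_id, sku, datetime.datetime.utcnow().isoformat()) for sku in skus])
    conn.commit()
    conn.close()
    return jsonify({'userid': user_id, 'recommendations': skus})

Then run it under Gunicorn with several workers, e.g. gunicorn -w 4 app:app (assuming the module is called app.py).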

automating 3rd party api pull and push into AWS RDS SQL using Python

I wrote a Python script that pulls data from a 3rd party API and pushes it into a SQL table I set up in AWS RDS. I want to automate this script so that it runs every night (it only takes about a minute to run), and I need to find a good place and way to set that up.
I could set up an EC2 instance, and a cron job on that instance, and run it from there, but it seems expensive to keep an EC2 instance alive all day for only 1 minute of run-time per night. Would AWS data pipeline work for this purpose? Are there other better alternatives?
(I've seen similar topics discussed when googling around but haven't seen recent answers.)
Thanks
Based on your case, I think you can try using ShellCommandActivity in Data Pipeline. It will launch an EC2 instance and execute the command you give to Data Pipeline on your schedule. After finishing the task, the pipeline will terminate the EC2 instance.
Here are the docs:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-shellcommandactivity.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-ec2resource.html
Alternatively, you could use a 3rd-party service like Crono. Crono is a simple REST API to manage time-based jobs programmatically.
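
As a rough sketch of what such a pipeline definition might look like when created with boto3 (the names, region, schedule, instance type and command are placeholders, and you would still need the Data Pipeline IAM roles set up in your account):

import boto3

dp = boto3.client('datapipeline', region_name='us-east-1')

pipeline_id = dp.create_pipeline(
    name='nightly-api-to-rds', uniqueId='nightly-api-to-rds')['pipelineId']

objects = [
    {'id': 'Default', 'name': 'Default', 'fields': [
        {'key': 'scheduleType', 'stringValue': 'cron'},
        {'key': 'schedule', 'refValue': 'NightlySchedule'},
        {'key': 'role', 'stringValue': 'DataPipelineDefaultRole'},
        {'key': 'resourceRole', 'stringValue': 'DataPipelineDefaultResourceRole'},
    ]},
    {'id': 'NightlySchedule', 'name': 'NightlySchedule', 'fields': [
        {'key': 'type', 'stringValue': 'Schedule'},
        {'key': 'period', 'stringValue': '1 day'},
        {'key': 'startDateTime', 'stringValue': '2017-01-01T03:00:00'},
    ]},
    {'id': 'NightlyEc2', 'name': 'NightlyEc2', 'fields': [
        {'key': 'type', 'stringValue': 'Ec2Resource'},
        {'key': 'instanceType', 'stringValue': 't2.micro'},
        {'key': 'terminateAfter', 'stringValue': '15 Minutes'},
    ]},
    {'id': 'RunScript', 'name': 'RunScript', 'fields': [
        {'key': 'type', 'stringValue': 'ShellCommandActivity'},
        {'key': 'runsOn', 'refValue': 'NightlyEc2'},
        # Pull the script from S3 (or bake it into an AMI) and run it.
        {'key': 'command', 'stringValue':
         'aws s3 cp s3://my-bucket/pull_and_push.py . && python pull_and_push.py'},
    ]},
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
dp.activate_pipeline(pipelineId=pipeline_id)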

Google AppEngine - How To Perform a Partial Datastore Download

I have a running GAE app that has been collecting data for a while. I am now at the point where I need to run some basic reports on this data and would like to download a subset of the live data to my dev server. Downloading all entities of a kind will simply be too big a data set for the dev server.
Does anyone know of a way to download a subset of entities from a particular kind? Ideally it would be based on entity attributes like date or client ID, but any method would work. I've even tried a regular full download and then arbitrarily killing the process once I thought I had enough data, but the data seems to be locked up in the .sql3 files generated by the bulkloader.
It looks like the default download/upload utilities for the GAE datastore (appcfg.py and bulkloader.py) don't support filtering.
It seems reasonable to do one of two things:
write a small utility (select + export + save to a local file) and run it locally against the remote GAE datastore from the remote API shell, or
write an admin web handler that does the select + export + zip: a new URL in a handler, uploaded to GAE and called over HTTP.
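
A rough sketch of the first option, run from inside remote_api_shell.py (which already sets up the remote datastore stub) and assuming your model classes are importable; the model class, property names and filter values below are placeholders:

import datetime
import json
from models import ClientRecord   # placeholder: your kind's model class

# Only entities for one client since a given date -- adjust the filters to taste.
query = ClientRecord.all() \
    .filter('client_id =', 'some-client') \
    .filter('created >=', datetime.date(2013, 1, 1))

with open('subset.jsonl', 'w') as out:
    for entity in query.run(batch_size=500):
        out.write(json.dumps({
            'key': str(entity.key()),
            'client_id': entity.client_id,
            'created': entity.created.isoformat(),
        }) + '\n')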

Web application: Hold large object between requests

I'm working on a web application related to genome searching. This application makes use of this suffix tree library through Cython bindings. Objects of this type are large (hundreds of MB up to ~10GB) and take as long to load from disk as it takes to process them in response to a page request. I'm looking for a way to load several of these objects once on server boot and then use them for all page requests.
I have tried using a remote manager / client setup using the multiprocessing module, modeled after this demo, but it fails when the client connects with an error message that says the object is not picklable.
I would suggest writing a small Flask (or even raw WSGI… But it's probably simpler to use Flask, as it will be easier to get up and running quickly) application which loads the genome database then exposes a simple API. Something like this:
from flask import Flask, jsonify

app = Flask(__name__)
database = load_database()  # load the large suffix tree once, at import time

@app.route('/get_genomes')
def get_genomes():
    # assuming all_genomes() returns something JSON-serialisable
    return jsonify(database.all_genomes())

if __name__ == '__main__':
    app.run(debug=True)
Or, you know, something a bit more sensible.
Also, if you need to be handling more than one request at a time (I believe that app.run will only handle one at a time), start by threading… And if that's too slow, you can os.fork() after the database is loaded and run multiple request handlers from there (that way they will all share the same database in memory).
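
A very rough sketch of the fork-after-load idea (Unix only; genome_app is a placeholder for the module above, and the port is arbitrary):

import os
from wsgiref.simple_server import make_server

from genome_app import app          # importing this loads the database once, in the parent

server = make_server('', 8000, app) # create the listening socket before forking

for _ in range(3):                  # fork a few extra workers; they inherit the socket
    if os.fork() == 0:
        break                       # child process: stop forking and start serving

# Parent and children all accept on the shared socket; the loaded database
# pages are shared between them via copy-on-write.
server.serve_forever()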

Python Web Backend

I am an experienced Python developer starting to work on web service
backend system. The system feeds data (constantly) from the web to a
MySQL database. This data is later displayed by a frontend side (there
is no connection between the frontend and the backend). The backend
system constantly downloads flight information from the web (some of
the data is fetched via APIs, and some by downloading and parsing
text / xls files). I already have a script that downloads the data,
parses it, and inserts it to the MySQL db - all in a big loop. The
frontend side is just a bunch of php pages that properly display the
data by querying the MySQL server.
It is crucial that this web service be robust, strong and reliable.
Therefore, I have been looking into the proper ways to design it, and came across the following parts to comprise my system:
1) django as a framework (for HTTP connections and for using Piston)
2) Piston as an API provider (this is great because then my front-end can use the API instead of actually running queries)
3) SQLAlchemy as the DB layer (I don't like the little control you get when using django ORM, I want to be able to run a more complex DB framework)
4) Apache with mod_wsgi to run everything
5) And finally, Celery (or django-cron) to actually run my infinite loop that pulls the data off the web - hopefully in some sort of organized tasks format. This is the part I am least sure of, and any pointers are appreciated.
This all sounds great. I used django before to write websites (aka
request handlers that return data). However, other than using Celery or django-cron, I can't really see how it fits the role of a constant data-feeding backend.
I just wanted to run this by you guys to hear your ideas / comments. Any input you have / pointers to documentation and/or other libraries would be greatly greatly appreciated!
If you are about to use SQLAlchemy, I would refrain from using Django: Django is fine if you are using the whole stack, but since you are about to rip the models layer out, I do not see much value in using it, and I would take a look at another option (perhaps Pylons or plain old CherryPy would do).
Even more so if the front end will not run queries but only talk to the API providers.
As for robustness, I am happier starting separate FastCGI processes under supervise and putting a more lightweight web server (lighttpd / nginx) in front, but that's a matter of taste.
For the "infinite loop" part, it depends on what behaviour you want: if there is a problem with the source, would you rather just skip that step, or retry it multiple times once the source is back up?
Periodic Tasks might be good for the former, while a cron job that just spawns scraping tasks is better for the latter.
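
For instance, a minimal Celery sketch of the "retry until the source is back" variant might look like this (the broker URL, module/task names and the fetch/insert helpers are placeholders):

from celery import Celery
from celery.schedules import crontab

app = Celery('feeder', broker='redis://localhost:6379/0')

app.conf.beat_schedule = {
    # Run the pull every 15 minutes instead of one big infinite loop.
    'pull-flight-data': {
        'task': 'feeder.pull_flight_data',
        'schedule': crontab(minute='*/15'),
    },
}

@app.task(bind=True, max_retries=5, default_retry_delay=300)
def pull_flight_data(self):
    try:
        data = fetch_from_source()      # placeholder: your existing download/parse code
        insert_into_mysql(data)         # placeholder: your existing insert code
    except IOError as exc:
        # Source is down: retry up to 5 times, 5 minutes apart.
        raise self.retry(exc=exc)

Run a worker plus the beat scheduler (e.g. celery -A feeder worker -B) and the "infinite loop" becomes a schedule of small, retryable tasks.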
