I'm trying to build a recommendation system in Python using the LightFM library, with an API built on the Flask framework.
My question is more design related than coding.
The web service, which will be called when a user logs in to the website, receives a JSON payload with a userid and returns a JSON response with the userid and 5 product SKUs to recommend.
I would like to save those recommendations in a DB, so that I can compare this table with other tables in the DB and find out whether a user actually purchased a product that I recommended.
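Roughly, what I have in mind looks like this (the table name, the column names, and the get_top_skus() helper are just placeholders, not my actual code):

    import sqlite3  # stand-in for the real DB; any DB-API driver or SQLAlchemy engine would do
    from datetime import datetime

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def get_top_skus(user_id, n=5):
        # placeholder for the LightFM model call that produces the SKUs
        return ["SKU-%d" % i for i in range(n)]

    @app.route("/recommend", methods=["POST"])
    def recommend():
        user_id = request.get_json()["userid"]
        skus = get_top_skus(user_id)

        # log the recommendations so they can later be joined against purchases
        conn = sqlite3.connect("recommendations.db")
        with conn:
            conn.executemany(
                "INSERT INTO recommendations (user_id, sku, created_at) VALUES (?, ?, ?)",
                [(user_id, sku, datetime.utcnow().isoformat()) for sku in skus],
            )
        conn.close()

        return jsonify({"userid": user_id, "skus": skus})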
My concern (maybe it's a silly one) is that everything will slow down if I open a DB connection and write data to it on every request.
The service could potentially be called between 5k and 7k times per day.
Thanks
What I've understood from your explanation is that you will be comparing what the user actually purchased against what you recommended. Assuming you run that comparison only once a week or so, it won't affect your processing much.
Your concern is: will everything slow down if a DB connection is opened?
It won't slow down the service. At roughly 5k calls per day, other factors are far more likely to slow the service down or bring it to a halt, for example a single Python process failing to keep up when the number of concurrent users gets too high.
What you need to do here is use a WSGI application server such as Gunicorn or uWSGI (see "Using Gunicorn with Flask").
This way, Gunicorn starts multiple Python processes running Flask, so the service can support a high number of concurrent users.
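For example, a minimal gunicorn.conf.py might look like this (the worker count is just a common starting point, not a tuned value):

    # gunicorn.conf.py -- minimal sketch; start with: gunicorn -c gunicorn.conf.py app:app
    import multiprocessing

    bind = "0.0.0.0:8000"
    # a common rule of thumb: a couple of workers per CPU core
    workers = multiprocessing.cpu_count() * 2 + 1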
Related
I have a processing engine built in Python and a driver program with several components that use this engine to process some files.
See the pictorial representation here.
The Engine is used for math calculations.
The Driver program has several components.
The Scanner keeps scanning a folder for new files; when one is found, it makes an entry in the DB by calling an API.
The Scheduler picks up new entries made by the Scanner and schedules them for processing (by making an entry in the 'jobs' table in the DB).
The Executer picks entries from the jobs table, executes them using the engine, and outputs new files.
All the components run continuously as separate Python processes (roughly like the sketch below). This is very inefficient; how can I improve it? Django is used to provide a DB (so the multiple processes can communicate) and to keep a record of how many files have been processed.
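To make it concrete, the Executer is essentially a polling loop like this (the API URL, field names, and the engine call are simplified placeholders, not my actual code):

    import time
    import requests

    API = "http://localhost:8000/api"

    def engine_process(path):
        # placeholder for the call into the math engine
        return path + ".out"

    def run_executer(poll_seconds=10):
        # standalone process polling the jobs table through the Django API
        while True:
            jobs = requests.get(API + "/jobs", params={"status": "scheduled"}).json()
            for job in jobs:
                output = engine_process(job["input_file"])
                requests.patch(API + "/jobs/%s" % job["id"],
                               json={"status": "done", "output_file": output})
            time.sleep(poll_seconds)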
Then came a new requirement to manually check the processed files for errors, so a UI was developed for this. Access to the engine was also to be made API based. See the new block diagram here.
Now the entire thing is a huge mess in my opinion. For a start, Django now has to serve two different sets of APIs, one for the UI and one for the driver program. If the server stops, both the UI and the driver program stop working.
Since the engine is API based, a huge amount of data is passed to it in the request. The engine takes several minutes (3 to 4) to process the files, and most of the time the request to the engine times out. The driver program is started separately from the terminal, and it fails if the Django server is not running, since the DB APIs are required to schedule and execute the jobs.
I want to ask what is the best way to structure such projects.
Should I keep the engine and driver program logic inside Django? In that case, how do I start the driver program?
Or should I keep both of them outside Django? In that case, how do I communicate with Django so that I can keep processing files even if the Django server is down?
I would really appreciate any sort of improvement ideas in any of the areas.
This is more of an architecture question, which I can't solve properly as I don't have enough experience with this kind of architecture. I'm currently running the solution with Python and SQLAlchemy, but the question is generic and the answer doesn't have to address those technologies.
I will try to explain it with the example of a public library. Imagine a public library with a server holding tables for all the books, scans (large binary images), and users. I've already built the client and server parts, and they work great, but only locally for a single library.
Now I would like to have this kind of server and clients for another public library (and more public libraries later). Having a local server for each library is desirable, as a lot of data has to be transferred to and from the local server.
The complication comes from the requirement to share users (with their member cards) between libraries: if a user registers at library A, he should be able to go to library B without registering again. There is no need to see the user's other data in a library he didn't originally register at, just his member account (id, login, and password).
The simple solution would be:
keeping the large data on the local server
keeping the users in the cloud (some public server on the internet)
The problem is that there are queries (for statistics, views, and so on) which run on the local server and need access to the users, so I can't simply keep the users on a different server and database, because then I couldn't do a SELECT + JOIN across such an architecture.
The solution left behind by the previous developer, and which other developers think is wrong, is to set up the users table as a replicated table (MariaDB + Galera), so the users table would be identical in the cloud and at each library site. The existing code would then work as if everything were local, while the users would be shared with other libraries in the background.
One of the problems with this is that the current version of our database (MariaDB) doesn't support (or has broken) partial replication (replicating only some tables or some databases), so it would require patching MariaDB and distributing this patched version of the database server to the cloud and the other sites, which smells of various problems now and in the future, whenever a new version of MariaDB comes out.
What would be the proper way of sharing these users between sites, while retaining the ability to do local SELECTs and JOINs with the users table?
(Maybe there's a known design / architecture pattern for this, but I just don't know what to search for as I'm new to this.)
Thanks,
Miro
schema - sharing table between sites
Start with a single source of truth for the user registrations: one server (or a Galera cluster, for HA) somewhere (at HQ, in the cloud, wherever). Login queries access that server remotely.
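As a rough sketch (the question mentions SQLAlchemy; the hostnames, table, and column names here are just assumptions):

    from sqlalchemy import create_engine, text

    # one engine per database: the shared users DB (cloud/HQ) and the local library DB
    users_engine = create_engine("mysql+pymysql://app:secret@users.example.org/users")
    local_engine = create_engine("mysql+pymysql://app:secret@localhost/library")

    def authenticate(login, password_hash):
        # logins always go to the single source of truth
        with users_engine.connect() as conn:
            row = conn.execute(
                text("SELECT id FROM users WHERE login = :l AND password_hash = :p"),
                {"l": login, "p": password_hash},
            ).fetchone()
        return row.id if row else None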
Think about any place you log in -- you are going to some central site. My point is, that is the way everyone does it, because with today's networks it is fast, reliable, efficient, etc.
Next, what about images, etc? If they are shared across your sites, you may as well do them the same way. Look at any search engine for the last two decades -- images (etc) are fetched from a single site. (Actually a small number of sites, for redundancy, etc). Even the biggest web providers have no more than perhaps a dozen datacenters to service the entire world.
After that, you need to decide on Cloud vs dedicated (or even run your own datacenter).
For HA, Cloud providers do a lot. For do-it-yourself, there are various replication scenarios, Galera being one of the best (today). For true HA, you need two copies of your data geographically separated -- to protect from hurricanes, fires, floods, earthquakes, etc. Consider a WAN deployment of Galera, or some asynchronous replication (possibly even between two Galera clusters).
Another choice is whether the Users and Images tables need to be on separate servers. Only if the traffic and size are high do you need to consider separating them. For a huge Image library, you may need a large number of servers, at which point they should probably live on servers whose sole purpose is delivering images -- no Users, no HTML pages, etc. Even the "meta" info about the images could live elsewhere in MySQL; the images themselves are files, served by nothing more than a web server tuned to deliver images. (I can think of multiple 'big guys' that do it this way.)
I have a Flask application. It works with a database; I'm using SQLAlchemy for this. So I have one question:
Flask handles requests one by one. Now suppose I have two users, A and B, who are modifying the same record in a database table concurrently.
How can I tell user B that user A has changed this record? There must be some message to user B.
With the development server, when you do app.run(), you get a single synchronous process, which means at most one request is processed at a time. So you cannot serve multiple users at the same time.
However, gunicorn is a solid, easy-to-use WSGI server that will let you spawn multiple workers (separate processes), and even comes with asynchronous workers when you need to deploy your application.
However, to answer your question: since the requests run in separate workers, whatever data exists in the database at the specific time a query runs in that worker is what will be used/returned.
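If you actually want to surface "somebody else changed this record" to user B, one common approach (a sketch, not something in your code) is optimistic locking; SQLAlchemy supports it with a version counter column:

    from sqlalchemy import Column, Integer, String
    from sqlalchemy.orm import declarative_base
    from sqlalchemy.orm.exc import StaleDataError

    Base = declarative_base()

    class Record(Base):
        __tablename__ = "records"

        id = Column(Integer, primary_key=True)
        body = Column(String(200))
        version_id = Column(Integer, nullable=False)

        # SQLAlchemy bumps version_id on every UPDATE and adds it to the
        # WHERE clause, so a stale write fails instead of silently winning
        __mapper_args__ = {"version_id_col": version_id}

    # in the request handler:
    #     try:
    #         session.commit()
    #     except StaleDataError:
    #         session.rollback()
    #         # tell user B that the record was changed by someone else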
I hope this answers your query.
I am currently trying to develop something using Google App Engine. I am using Python as my runtime and need some advice on setting up the following.
I am running a web server that provides JSON data to clients. The data comes from an external service that I have to pull it from.
What I need to be able to do is run a background system that checks memcache to see whether there are any required IDs; if there is an ID, I need to fetch some data for that ID from the external source and place the data in memcache.
If there are multiple IDs (say > 30), I need to be able to issue all 30 requests as quickly and efficiently as possible.
I am new to Python development and App Engine, so any advice you guys could give would be great.
Thanks.
You can use "backends" or "task queues" to run processes in the background. Tasks have a 10-minute run time limit, and backends have no run time limit. There's also a cronjob mechanism which can trigger requests at regular intervals.
You can fetch the data from external servers with the "URLFetch" service.
Note that using memcache as the communication mechanism between front-end and back-end is unreliable -- the contents of memcache may be partially or fully erased at any time (and it does happen from time to time).
Also note that you can't query memcache if you don't know the exact keys ahead of time. It's probably better to use the task queue to queue up requests instead of using memcache, or to use the datastore as a storage mechanism.
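A rough sketch of the task-queue approach, on the Python 2 runtime with webapp2 (the routes and parameter names are just assumptions):

    from google.appengine.api import taskqueue, urlfetch
    import webapp2

    class EnqueueHandler(webapp2.RequestHandler):
        def post(self):
            # queue one task per ID instead of parking the IDs in memcache
            for item_id in self.request.get_all("id"):
                taskqueue.add(url="/worker/fetch", params={"id": item_id})

    class FetchWorker(webapp2.RequestHandler):
        def post(self):
            item_id = self.request.get("id")
            result = urlfetch.fetch("https://external.example.com/data/" + item_id)
            if result.status_code == 200:
                # store result.content in the datastore (or memcache, as a cache)
                pass

    app = webapp2.WSGIApplication([
        ("/enqueue", EnqueueHandler),
        ("/worker/fetch", FetchWorker),
    ])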
I am an experienced Python developer starting to work on a web service backend system. The system constantly feeds data from the web into a MySQL database. This data is later displayed by a frontend side (there is no connection between the frontend and the backend). The backend system constantly downloads flight information from the web (some of the data is fetched via APIs, and some by downloading and parsing text / xls files). I already have a script that downloads the data, parses it, and inserts it into the MySQL db -- all in one big loop. The frontend side is just a bunch of PHP pages that properly display the data by querying the MySQL server.
It is crucial that this web service be robust, strong and reliable.
Therefore, I have been looking into the proper ways to design it, and came across the following parts to comprise my system:
1) Django as the framework (for HTTP connections and for using Piston)
2) Piston as an API provider (this is great because then my front-end can use the API instead of actually running queries)
3) SQLAlchemy as the DB layer (I don't like how little control you get with the Django ORM; I want to be able to run a more complex DB setup)
4) Apache with mod_wsgi to run everything
5) And finally, Celery (or django-cron) to actually run my infinite loop that pulls the data off the web (hopefully in some sort of organized task format). This is the part I am least sure of, and any pointers are appreciated.
This all sounds great. I have used Django before to write websites (i.e. request handlers that return data). However, other than using Celery or django-cron, I can't really see how it fits the role of a constant data-feeding backend.
I just wanted to run this by you guys to hear your ideas / comments. Any input you have / pointers to documentation and/or other libraries would be greatly appreciated!
If you are about to use SQLAlchemy, I would refrain from using Django: Django is fine if you are using the whole stack, but since you are about to rip the models off, I do not see much value in using it, and I would take a look at another option (perhaps Pylons or plain old CherryPy would do).
Even more so if the frontends will not run queries, but only talk to the API provider.
As for robustness, I am happier starting separate FastCGI processes with supervise and using a more lightweight web server (lighttpd / nginx), but that's a matter of taste.
For the "infinite loop" part, it depends on what behavior you want: if there is a problem with the source, would you just like to skip the step or repeat it multiple times when source is back up?
Periodic Tasks might be good for former, while cron that would just spawn scraping tasks is better for latter.
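With Celery, a periodic task for the scraping loop could look roughly like this (the broker URL, task name, and interval are just placeholders):

    from celery import Celery

    app = Celery("feeder", broker="redis://localhost:6379/0")

    # run the scraper every 5 minutes via celery beat (interval is illustrative)
    app.conf.beat_schedule = {
        "pull-flight-data": {
            "task": "feeder.pull_flight_data",
            "schedule": 300.0,
        },
    }

    @app.task(name="feeder.pull_flight_data")
    def pull_flight_data():
        # download, parse, and insert into MySQL -- the body of the existing
        # "big loop", split into a single scheduled run
        pass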