I'm working on a personal project to segment some data from the Sendinblue API (a CRM service). Basically, what I'm trying to achieve is to generate a new score attribute for each user based on their emailing behavior. For that purpose, the process I've planned is as follows:
Get data from the API
Store in database
Analyze and segment the data with Python
Create and update the score attribute in Sendinblue every 24 hours
The API has a rate limit of 400 requests per minute. We are talking about 100k records right now, which means I have to spend about 3 hours getting all the initial data (currently I'm using concurrent.futures for multiprocessing). After that, I plan to store and update only the records that have changed. I'm wondering if this is the best way to do it and which combination of tools is best for this job.
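To make the rate-limit constraint concrete, here is a minimal sketch of throttling concurrent requests so they stay under a per-minute limit. It is only an illustration: `fetch_contact` is a hypothetical placeholder for the real API call, `RateLimiter` and `fetch_all` are names invented here, and nothing in it is specific to Sendinblue.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Spaces calls evenly so at most `max_per_minute` start per minute."""

    def __init__(self, max_per_minute):
        self.interval = 60.0 / max_per_minute
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self.lock:
            now = time.monotonic()
            slot = max(self.next_slot, now)
            self.next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def fetch_contact(contact_id):
    # Hypothetical placeholder: replace with the real HTTP request.
    return {"id": contact_id}


def fetch_all(contact_ids, limiter, workers=8):
    def task(contact_id):
        limiter.wait()  # every worker thread shares the same limiter
        return fetch_contact(contact_id)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as contact_ids.
        return list(pool.map(task, contact_ids))
```

Something like `fetch_all(ids, RateLimiter(400))` would then start requests no faster than the limit, however many worker threads are used.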
Right now I have all my scripts in Jupyter notebooks, and I recently finished my first Django project, so I don't know whether I need a Django app for this one or can simply connect the notebook to a database (PostgreSQL?). If the latter is possible, which library do I have to learn to run my script every 24 hours? (I'm a beginner.) Thanks!
I don't think you need Django unless you want a web interface to view your data. Even then, you can write a web application to view your statistics with any framework or language. So I think a simpler approach is:
1. Create your Python project. The entry point (the main function) runs the logic to fetch data from the API. Once that's done, run the analysis and save the results in the database.
2. If you can view your final results with SQL queries, you don't need to build a web application. Otherwise, you might want to build a small web application that pulls data from the database to show the statistics in charts or export them in whatever format you prefer.
3. Set up a Linux cron job to execute the Python code from step 1, and let it run every 24 hours at the particular time you want. Link: https://phoenixnap.com/kb/set-up-cron-job-linux
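The approach above could look roughly like the following sketch. It is only an illustration under several assumptions: `fetch_contacts` is a hypothetical stand-in for the API download, the scoring rule is made up, and `sqlite3` is used in place of PostgreSQL so the example has no external dependencies. The paths in the crontab comment are placeholders.

```python
import sqlite3
from datetime import datetime, timezone

# Example crontab entry (hypothetical paths), running daily at 03:00:
#   0 3 * * * /usr/bin/python3 /path/to/main.py


def fetch_contacts():
    # Hypothetical placeholder for the API download step.
    return [
        {"email": "a@example.com", "opens": 12, "clicks": 3},
        {"email": "b@example.com", "opens": 1, "clicks": 0},
    ]


def score(contact):
    # Illustrative scoring rule only; the real analysis goes here.
    return contact["opens"] + 5 * contact["clicks"]


def save_scores(conn, contacts):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scores ("
        "email TEXT PRIMARY KEY, score INTEGER, updated_at TEXT)"
    )
    now = datetime.now(timezone.utc).isoformat()
    rows = [(c["email"], score(c), now) for c in contacts]
    # INSERT OR REPLACE keeps one row per contact across daily runs.
    conn.executemany("INSERT OR REPLACE INTO scores VALUES (?, ?, ?)", rows)
    conn.commit()


def main(db_path="scores.db"):
    conn = sqlite3.connect(db_path)
    try:
        save_scores(conn, fetch_contacts())
    finally:
        conn.close()


if __name__ == "__main__":
    main()
```

Because the script does all its work in `main()` and then exits, cron only has to start it once a day; no long-running process or web framework is involved.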
Related
Not able to articulate this well, but I will try. I have a Python script that acts as a middleman between two APIs. It pulls info from API 1, uses that info to get more information from API 2, then sends a request back to API 1 with the updated information. It only performs this update routine when a specific field is updated by the user in the integration. An ID is stored for the integration.
Currently, I am testing this with one ID (my test integration). I have it running in an infinite loop so that it is constantly checking to update.
I plan on making this public so that anyone can integrate with it. The IDs will be stored in a Postgres database on Heroku. The problem is the architecture of my program: the more people integrate with it, the more IDs the program needs to cycle through and check in a linear fashion.
What would be the ideal way to go through every ID and check for updates every few seconds without losing speed as the number of integrations and IDs increases?
I have a processing engine built in python and a driver program with several components that uses this engine to process some files.
See the pictorial representation here.
The Engine is used for math calculations.
The Driver program has several components.
Scanner keeps scanning a folder to check for new files; if one is found, it makes an entry in the DB by calling an API.
Scheduler picks new entries made by scanner and schedules them for processing (makes entry into 'jobs' table in DB)
Executer picks entries from job table and executes them using the engine and outputs new files.
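To make the three components concrete, here is a minimal sketch of how they could share state through database tables. It is not the actual implementation: `sqlite3` stands in for the Django-backed DB and its API, the table and column names are invented for the example, and `engine` is any callable that processes one file path.

```python
import sqlite3


def init_db(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, scheduled INTEGER DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY, path TEXT, status TEXT DEFAULT 'pending')"
    )


def scan(conn, found_paths):
    # Scanner: record files that have not been seen before.
    conn.executemany(
        "INSERT OR IGNORE INTO files (path) VALUES (?)",
        [(p,) for p in found_paths],
    )
    conn.commit()


def schedule(conn):
    # Scheduler: turn every unscheduled file into a pending job.
    rows = conn.execute(
        "SELECT path FROM files WHERE scheduled = 0 ORDER BY path"
    ).fetchall()
    for (path,) in rows:
        conn.execute("INSERT INTO jobs (path) VALUES (?)", (path,))
        conn.execute("UPDATE files SET scheduled = 1 WHERE path = ?", (path,))
    conn.commit()
    return len(rows)


def execute_pending(conn, engine):
    # Executor: run the engine on each pending job and record the outcome.
    rows = conn.execute(
        "SELECT id, path FROM jobs WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for job_id, path in rows:
        try:
            engine(path)
            status = "done"
        except Exception:
            status = "failed"
        conn.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, job_id))
        conn.commit()
    return len(rows)
```

Written this way, the three steps are plain functions over a shared connection, so they could be called in sequence from one loop or kept in separate processes.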
All the components run continuously as separate Python processes. This is very inefficient; how can I improve it? Django is used to provide a DB (so the multiple processes can communicate) and to keep a record of how many files have been processed.
Next came a new requirement to manually check the processed files for errors, so a UI was developed for this. Also, access to the engine was to be made API-based. See the new block diagram here.
Now the entire thing is a huge mess in my opinion. For a start, Django now has to serve two different sets of APIs: one for the UI and another for the driver program. If the server stops, both the UI and the driver program stop working.
Since the engine is API-based, a huge amount of data is passed to it in each request. The engine takes several minutes (3 to 4) to process the files, and most of the time the request to the engine times out. The driver program is started separately from a terminal, and it fails if the Django server is not running, because the DB APIs are required to schedule and execute the jobs.
I want to ask what the best way to structure such projects is.
Should I keep the engine and driver program logic inside Django? In this case, how do I start the driver program?
Should I keep both of them outside Django? In that case, how do I communicate with Django so that I can keep processing files even if the Django server is down?
I would really appreciate any sort of improvement ideas in any of the areas.
I'm trying to build a recommendation system in Python using the LightFM library and an API created with the Flask framework.
My question is more design related than coding.
The web service, which will be called when a user logs in to the website, receives a JSON with a userid and returns a JSON with the userid and 5 product SKUs to recommend.
I would like to save those recommendations in a DB. That way I can compare this table with other tables in the DB and find out whether a user has purchased a product that I recommended.
My concern (maybe it's stupid) is that everything will slow down if I open a connection to the DB and write data to it.
Potentially the service can be called between 5k and 7k times per day.
Thanks
What I've understood from your explanation is that you will be comparing what the user actually selected with what you recommended. So, assuming you run that comparison once a week, it won't affect your processing much.
Your concern is: would everything slow down if a DB connection is opened?
It won't slow down the service. At around 5k calls per day, other factors are more likely to slow the service down or cause it to stop. For example, when the number of concurrent users is too high, a single Python process will fail.
What you need to do here is use a WSGI application server like Gunicorn or uWSGI (see "Using Gunicorn with Flask").
Gunicorn starts multiple Python processes running Flask, so it can support a high number of concurrent users.
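For a sense of how small the DB write per request is, here is a minimal sketch of saving the recommendations alongside the response. It rests on several assumptions: `sqlite3` stands in for the real database, the table layout is invented for the example, and `recommend` is a hypothetical callable standing in for the LightFM model. In a Flask app, the body of `handle_request` would sit inside the view function.

```python
import sqlite3
import time


def init_db(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recommendations ("
        "user_id TEXT, sku TEXT, created_at REAL)"
    )


def save_recommendations(conn, user_id, skus):
    # One short transaction per request: a handful of small rows.
    now = time.time()
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO recommendations VALUES (?, ?, ?)",
            [(user_id, sku, now) for sku in skus],
        )


def handle_request(conn, payload, recommend):
    # Mirrors the web service: user id in, user id plus recommended SKUs out.
    user_id = payload["userid"]
    skus = recommend(user_id)
    save_recommendations(conn, user_id, skus)
    return {"userid": user_id, "skus": skus}
```

When the app runs under Gunicorn with several workers (for example `gunicorn -w 4 app:app`, where `app:app` is a placeholder module and application name), each worker process would open its own database connection.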
I have a Google App Engine app that has to deal with a lot of data collection. The data I gather amounts to millions of records per day. As I see it, there are two simple approaches to dealing with this in order to be able to analyze the data:
1. Use the logger API to generate App Engine logs, and then try to load these into BigQuery (or, more simply, export to CSV and do the analysis with Excel).
2. Save the data in the App Engine datastore (ndb), and then download that data later or try to load it into BigQuery.
Is there any preferable method of doing this?
Thanks!
BigQuery has a new Streaming API, which they claim was designed for high-volume real-time data collection.
Advice from practice: we are currently logging 20M+ multi-event records a day via method 1 as described above. It works pretty well, except when the batch uploader is not called (normally it runs every 5 minutes); then we need to detect this and re-run the importer.
Also, we are currently in the process of migrating to the new Streaming API, but it is not yet in production, so I can't say how reliable it is.
I am an experienced Python developer starting to work on a web service
backend system. The system feeds data (constantly) from the web to a
MySQL database. This data is later displayed by a frontend side (there
is no connection between the frontend and the backend). The backend
system constantly downloads flight information from the web (some of
the data is fetched via APIs, and some by downloading and parsing
text / xls files). I already have a script that downloads the data,
parses it, and inserts it to the MySQL db - all in a big loop. The
frontend side is just a bunch of php pages that properly display the
data by querying the MySQL server.
It is crucial that this web service be robust, strong and reliable.
Therefore, I have been looking into the proper ways to design it, and came across the following parts to comprise my system:
1) django as a framework (for HTTP connections and for using Piston)
2) Piston as an API provider (this is great because then my front-end can use the API instead of actually running queries)
3) SQLAlchemy as the DB layer (I don't like the little control you get when using django ORM, I want to be able to run a more complex DB framework)
4) Apache with mod_wsgi to run everything
5) And finally, Celery (or django-cron) to actually run my infinite loop that pulls the data off the web, hopefully in some sort of organized task format. This is the part I am least sure of, and any pointers are appreciated.
This all sounds great. I used django before to write websites (aka
request handlers that return data). However, other than using Celery or django-cron I can't really see how it fits a role of a constant data feeding backend.
I just wanted to run this by you guys to hear your ideas / comments. Any input you have / pointers to documentation and/or other libraries would be greatly greatly appreciated!
If you are about to use SQLAlchemy, I would refrain from using Django: Django is fine if you are using the whole stack, but as you are about to rip the models out, I do not see much value in using it, and I would take a look at another option (perhaps Pylons or plain old CherryPy would do).
Even more so if the front ends will not run queries, but only call API providers.
As for robustness, I am more satisfied with starting separate FastCGI processes under supervise and using a more lightweight web server (lighttpd / nginx), but that's a matter of taste.
For the "infinite loop" part, it depends on what behavior you want: if there is a problem with the source, would you like to just skip the step, or repeat it multiple times once the source is back up?
Periodic tasks might be good for the former, while a cron job that just spawns scraping tasks is better for the latter.
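The two failure behaviors mentioned above can be sketched in plain Python, independent of Celery or cron. This is only an illustration: `task` is a hypothetical scraping function, and the function names, retry count, and backoff delays are arbitrary choices for the example.

```python
import time


def run_once_or_skip(task):
    # "Skip" behavior: if the source is down, give up until the next tick.
    try:
        return task()
    except Exception:
        return None


def run_with_retries(task, attempts=5, base_delay=1.0, sleep=time.sleep):
    # "Repeat" behavior: retry with exponential backoff until it succeeds,
    # re-raising the last error if every attempt fails.
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Which one fits depends on whether a missed fetch can simply wait for the next scheduled run or has to be recovered as soon as the source comes back.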