Django - when best to calculate statistics on large amounts of data

Django - when best to calculate statistics on large amounts of data - python

I'm working on a Django application that consists of a scraper that scrapes thousands of store items (price, description, seller info) per day and a django-template frontend that allows the user to access the data and view various statistics.
For example: the user is able to click on 'Item A', and gets a detail view that lists various statistics about 'Item A' (Like linegraphs about price over time, a price distribution, etc)
The user is also able to click on reports of the individual 'scrapes' and get details about the number of items scraped, average price. Etc.
All of these statistics are currently calculated in the view itself.
This all works well when working locally, on a small development database with +/100 items. However, when in production this database will eventually consist of 1.000.000+ lines. Which leads me to wonder if calculating the statistics in the view wont lead to massive lag in the future. (Especially as I plan to extend the statistics with more complicated regression-analysis, and perhaps some nearest neighbour ML classification)
The advantage of the view based approach is that the graphs are always up to date. I could offcourse also schedule a CRONJOB to make the calculations every few hours (perhaps even on a different server). This would make accessing the information very fast, but would also mean that the information could be a few hours old.
I've never really worked with data of this scale before, and was wondering what the best practises are.

As with anything performance-related, do some testing and profile your application. Don't get lured into the premature optimization trap.
That said, given the fact that these statistics don't change, you could perform them asynchronously each time you do a scrape. Like the scrape process itself, this calculation process should be done asynchronously, completely separate from your Django application. When the scrape happens it would write to the database directly and set some kind of status field to processing. Then kick off the calculation process which, when completed, will fill in the stats fields and set the status to complete. This way you can show your users how far along the processing chain they are.
People love feedback over immediate results and they'll tolerate considerable delays if they know they'll eventually get a result. Strand a user and they'll get frustrate more quickly than any computer can finish processing; Lead them on a journey and they'll wait for ages to hear how the story ends.

Related

Notion API quickly delete and repopulate entire DB

Background
I'm creating a Notion DB that will contain data about different analyzers my team uses (analyzer name, location, last time the analyzer sent data, etc.). Since I'm using live data I need to have a way to quickly update the data of all analyzers in the notion db.
I'm currently using a python script to get the analyzers data and upload it to the Notion DB. Currently I read each row, get it's ID that I use to update the row's data - but this is too slow: it takes more than 30 seconds to update 100 rows.
The Question
I'd like to know if there's a way to quickly update the data of many rows (maybe in one big bulk operation). The goal is perhaps 100 row updates per second (instead of 30 seconds).

There are multiple things one could do here - sadly none of it will improve the updates drastically. Currently there is no way to update multiple rows, or to be more precise pages. I am not sure what "read each row" refers to, but you can retrieve multiple pages of a database at once - up to 100. If you are retrieving them one by one, this could be updated.
Secondly, I'd like to know how often the analyzers change and if, will they be altered by the Python script or updated in Notion? If this does not happen too often, you might be able to cache the page_ids and retrieve the ids not every time you update. Sadly the last_edited_time of the database does not reflect any addition or removal of it's children, so simply checking this is not an option.
The third and last way to improve performance is multi-threading. You can send multiple requests at the same time as the amount of requests is usually the bottleneck.
I know none of these will really help you, but sadly no efficient method to update multiple pages exists.

There is also the rate limit of 3 requests per second, which is enforced by Notion to ensure fair performance for all users. If you send more requests, you will start receiving responses with an HTTP 429 code. Your integration should be written in such a way that this response will be respected and should prevent any requests to be sent before the time indicated in the indicated number of seconds as per this page on the notion developer API guidelines.

Flask website backend structure guidance assistance?

I have a basic personal project website that I am looking to learn some web dev fundamentals with and database (SQL) fundamentals as well (If SQL is even the right technology to use??).
I have the basic skeleton up and running but as I am new to this, I want to make sure I am doing it in the most efficient and "correct" way possible.
Currently the site has a main index (landing) page and from there the user can select one of a few subpages. For the sake of understanding, each of these sub pages represents a different surf break and they each display relevant info about that particular break i.e. wave height, wind, tide.
As I have already been able to successfully scrape this data, my main questions revolve around how would I go about inserting this data into a database for future use (historical graphs, trends)? How would I ensure data is added to this database in a continuous manner (once/day)? How would I use data that was scraped from an earlier time, say at noon, to be displayed/used at 12:05 PM rather than scraping it again?
Any other tips, guidance, or resources you can point me to are much appreciated.

This kind of data is called time series. There are specialized database engines for time series, but with a not-extreme volume of observations - (timestamp, wave heigh, wind, tide, which break it is) tuples - a SQL database will be perfectly fine.
Try to model your data as a table in Postgres or MySQL. Start by making a table and manually inserting some fake data in a GUI client for your database. When it looks right, you have your schema. The corresponding CREATE TABLE statement is your DDL. You should be able to write SELECT queries against your table that yield the data you want to show on your webapp. If these queries are awkward, it's a sign that your schema needs revision. Save your DDL. It's (sort of) part of your source code. I imagine two tables: a listing of surf breaks, and a listing of observations. Each row in the listing of observations would reference the listing of surf breaks. If you're on a Mac, Sequel Pro is a decent tool for playing around with a MySQL database, and playing around is probably the best way to learn to use one.
Next, try to insert data to the table from a Python script. Starting with fake data is fine, but mold your Python script to read from your upstream source (the result of scraping) and insert into the table. What does your scraping code output? Is it a function you can call? A CSV you can read? That'll dictate how this script works.
It'll help if this import script is idempotent: you can run it multiple times and it won't make a mess by inserting duplicate rows. It'll also help if this is incremental: once your dataset grows large, it will be very expensive to recompute the whole thing. Try to deal with importing a specific interval at a time. A command-line tool is fine. You can specify the interval as a command-line argument, or figure out out from the current time.
The general problem here, loading data from one system into another on a regular schedule, is called ETL. You have a very simple case of it, and can use very simple tools, but if you want to read about it, that's what it's called. If instead you could get a continuous stream of observations - say, straight from the sensors - you would have a streaming ingestion problem.
You can use the Linux subsystem cron to make this script run on a schedule. You'll want to know whether it ran successfully - this opens a whole other can of worms about monitoring and alerting. There are various open-source systems that will let you emit metrics from your programs, basically a "hey, this happened" tick, see these metrics plotted on graphs, and ask to be emailed/texted/paged if something is happening too frequently or too infrequently. (These systems are, incidentally, one of the main applications of time-series databases). Don't get bogged down with this upfront, but keep it in mind. Statsd, Grafana, and Prometheus are some names to get you started Googling in this direction. You could also simply have your script send an email on success or failure, but people tend to start ignoring such emails.
You'll have written some functions to interact with your database engine. Extract these in a Python module. This forms the basis of your Data Access Layer. Reuse it in your Flask application. This will be easiest if you keep all this stuff in the same Git repository. You can use your chosen database engine's Python client directly, or you can use an abstraction layer like SQLAlchemy. This decision is controversial and people will have opinions, but just pick one. Whatever database API you pick, please learn what a SQL injection attack is and how to use user-supplied data in queries without opening yourself up to SQL injection. Your database API's documentation should cover the latter.
The / page of your Flask application will be based on a SQL query like SELECT * FROM surf_breaks. Render a link to the break-specific page for each one.
You'll have another page like /breaks/n where n identifies a surf break (an integer that increments as you insert surf break rows is customary). This page will be based on a query like SELECT * FROM observations WHERE surf_break_id = n. In each case, you'll call functions in your Data Access Layer for a list of rows, and then in a template, iterate through those rows and render some HTML. There are various Javascript and Python graphing libraries you can feed this list of rows into and get graphs out of (client side or server side). If you're interested in something like a week-over-week change, you should be able to express that in one SQL query and get that dataset directly from the database engine.
For performance, try not to get in a situation where more than one SQL query happens during a page load. By default, you'll be doing some unnecessary work by going back to the database and recomputing the page every time someone requests it. If this becomes a problem, you can add a reverse proxy cache in front of your Flask app. In your case this is easy, since nothing users do to the app cause its content to change. Simply invalidate the cache when you import new data.

Storing queryset after fetching it once

I am new to django and web development.
I am building a website with a considerable size of database.
Large amount of data should be shown in many pages, and a lot of this data is repeated. I mean I need to show the same data in many pages.
Is it a good idea to make a query to the database asking for the data in every GET request? it takes many seconds to get the data every time I refresh the page or request another page that has the same data shown.
Is there a way to fetch the data once and store it somewhere and just display it in every page, and only refetch it when some updates are being done.
I thought about the session but I found that it is limited to 5MB which is small for my data.
Any suggestions?
Thank you.

Django's cache - as mentionned by Leistungsabfall - can help, but like most cache systems it has some drawbacks too if you use it naively for this kind of problems (long queries/computations): when the cache expires, the next request will have to recompute the whole thing - which might take some times durring which every new request will trigger a recomputation... Also, proper cache invalidation can be really tricky.
Actually there's no one-size-fits-all answer to your question, the right solution is often a mix of different solutions (code optimisation, caching, denormalisation etc), based on your actual data, how often they change, how much visitors you have, how critical it is to have up-to-date data etc, but the very first steps would be to
check the code fetching the data and find out if there are possible optimisations at this level using QuerySet features (.select_related() / prefetch_related(), values() and/or values_list(), annotations etc) to avoid issues like the "n+1 queries" problem, fetching whole records and building whole model instances when you only need a single field's value, doing computations at the Python level when they could be done at the database level etc
check your db schema's indexes - well used indexes can vastly improve performances, badly used ones can vastly degrade performances...
and of course use the right tools (db query logging, Python's profiler etc) to make sure you identify the real issues.

Persistent object with Django?

So I have a site that on a per-user basis, and it is expected to query a very large database, and flip through the results. Due to the size of the number of entries returned, I run the query once (which takes some time...), store the result in a global, and let folks iterate through the results (or download them) as they want.
Of course, this isn't scalable, as the globals are shared across sessions. What is the correct way to do this in Django? I looked at session management, but I always ran into the "xyz is not serializeable on json" issue. Do I look into how I do this correctly using sessions, or is there another preferred way to do this?

If the user is flipping through the results, you probably don't want to pull back and render any more than you have to. Most SQL dialects have TOP and LIMIT clauses that will let you pull back a limited range of results, as long as your data is ordered consistently. Django's Pagination classes are a nice abstraction of this on top of Django Model classes: https://docs.djangoproject.com/en/dev/topics/pagination/
I would be careful of storing large amounts of data in user sessions, as it won't scale as your number of users grows, and user sessions can stay around for a while after the user has left the site. If you're set on this option, make sure you read about clearing the expired sessions. Django doesn't do it for you:
https://docs.djangoproject.com/en/1.7/topics/http/sessions/#clearing-the-session-store

Caching online prices fetched via API unless they change

I'm currently working on a site that makes several calls to big name online sellers like eBay and Amazon to fetch prices for certain items. The issue is, currently it takes a few seconds (as far as I can tell, this time is from making the calls) to load the results, which I'd like to be more instant (~10 seconds is too much in my opinion).
I've already cached other information that I need to fetch, but that information is static. Is there a way that I can cache the prices but update them only when needed? The code is in Python and I store info in a mySQL database.
I was thinking of somehow using chron or something along that lines to update it every so often, but it would be nice if there was a simpler and less intense approach to this problem.
Thanks!

You can use memcache to do the caching. The first request will be slow, but the remaining requests should be instant. You'll want a cron job to keep it up to date though. More caching info here: Good examples of python-memcache (memcached) being used in Python?

Have you thought about displaying the cached data, then updating the prices via an ajax callback? You could notify the user if the price changed with a SO type notification bar or similar.
This way the users get results immediately, and updated prices when they are available.
Edit
You can use jquery:
Assume you have a script names getPrices.php that returns a json array of the id of the item, and it's price.
No error handling etc here, just to give you an idea
My necklace: <div id='1'> $123.50 </div><br>
My bracelet: <div id='1'> $13.50 </div><br>
...
<script>
$(document).ready(function() {
$.ajax({ url: "getPrices.php", context: document.body, success: function(data){
for (var price in data)
{
$(price.id).html(price.price);
}
}}));
</script>

You need to handle the following in your application:
get the price
determine if the price has changed
cache the price information
For step 1, you need to consider how often the item prices will change. I would go with your instinct to set a Cron job for a process which will check for new prices on the items (or on sets of items) at set intervals. This is trivial at small scale, but when you have tens of thousands of items the architecture of this system will become very important.
To avoid delays in page load, try to get the information ahead of time as much as possible. I don't know your use case, but it'd be good to prefetch the information as much as possible and avoid having the user wait for an asynchronous JavaScript call to complete.
For step 2, if it's a new item or if the price has changed, update the information in your caching system.
For step 3, you need to determine the best caching system for your needs. As others have suggested, memcached is a possible solution. There are a variety of "NoSQL" databases you could check out, or even cache the results in MySQL.

How are you getting the price? If you are scrapping the data from the normal HTML page using a tool such as BeautifulSoup, that may be slowing down the round-trip time. In this case, it might help to compute a fast checksum (such as MD5) from the page to see if it has changed, before parsing it. If you are using a API which gives a short XML version of the price, this is probably not an issue.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Django - when best to calculate statistics on large amounts of data - python

Related

Notion API quickly delete and repopulate entire DB

Flask website backend structure guidance assistance?

Storing queryset after fetching it once

Persistent object with Django?

Caching online prices fetched via API unless they change

Categories

Resources