Document requests and memory issue using MongoDB - Python

I have documents in Mongo that contain questionnaire-like data (company name, address, phone numbers, etc.). I need to display this data in a web interface.
Which is better: fetch the whole document as a dict into a variable and fill the web form from it, or make lots of database requests to fill each field of the web form?

If you are going to display all of the data on the web interface anyway, start simple: get the whole document. It reduces your latency by requiring fewer round trips to the database.

Agree with GPS - if you need all, or most, of the data then it is better to retrieve the document in one request rather than hitting the database x number of times.
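For illustration, a minimal sketch of the single-request approach, assuming pymongo and illustrative database/collection/field names:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
companies = client.mydb.companies   # hypothetical database and collection names

def load_company_form_data(company_id):
    # One round trip: fetch the whole document as a dict...
    doc = companies.find_one({"_id": company_id}) or {}
    # ...then fill every form field from it in memory.
    return {
        "name": doc.get("name", ""),
        "address": doc.get("address", ""),
        "phones": doc.get("phone_numbers", []),
    }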

Related

Flask website backend structure guidance assistance?

I have a basic personal project website that I am looking to learn some web dev fundamentals with and database (SQL) fundamentals as well (If SQL is even the right technology to use??).
I have the basic skeleton up and running but as I am new to this, I want to make sure I am doing it in the most efficient and "correct" way possible.
Currently the site has a main index (landing) page and from there the user can select one of a few subpages. For the sake of understanding, each of these sub pages represents a different surf break and they each display relevant info about that particular break i.e. wave height, wind, tide.
As I have already been able to successfully scrape this data, my main questions revolve around how would I go about inserting this data into a database for future use (historical graphs, trends)? How would I ensure data is added to this database in a continuous manner (once/day)? How would I use data that was scraped from an earlier time, say at noon, to be displayed/used at 12:05 PM rather than scraping it again?
Any other tips, guidance, or resources you can point me to are much appreciated.
This kind of data is called time series. There are specialized database engines for time series, but with a not-extreme volume of observations - (timestamp, wave height, wind, tide, which break it is) tuples - a SQL database will be perfectly fine.
Try to model your data as a table in Postgres or MySQL. Start by making a table and manually inserting some fake data in a GUI client for your database. When it looks right, you have your schema. The corresponding CREATE TABLE statement is your DDL. You should be able to write SELECT queries against your table that yield the data you want to show on your webapp. If these queries are awkward, it's a sign that your schema needs revision. Save your DDL. It's (sort of) part of your source code.
I imagine two tables: a listing of surf breaks, and a listing of observations. Each row in the listing of observations would reference the listing of surf breaks.
If you're on a Mac, Sequel Pro is a decent tool for playing around with a MySQL database, and playing around is probably the best way to learn to use one.
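A minimal sketch of that two-table schema, with illustrative table and column names (sqlite3 stands in for Postgres/MySQL here so the snippet runs without a server; the DDL itself is plain SQL):

import sqlite3

SURF_BREAKS_DDL = """
CREATE TABLE IF NOT EXISTS surf_breaks (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
)
"""

OBSERVATIONS_DDL = """
CREATE TABLE IF NOT EXISTS observations (
    id            INTEGER PRIMARY KEY,
    surf_break_id INTEGER NOT NULL REFERENCES surf_breaks(id),
    observed_at   TIMESTAMP NOT NULL,
    wave_height_m REAL,
    wind_kts      REAL,
    tide_m        REAL,
    UNIQUE (surf_break_id, observed_at)  -- one observation per break per timestamp
)
"""

conn = sqlite3.connect("surf.db")
conn.execute(SURF_BREAKS_DDL)
conn.execute(OBSERVATIONS_DDL)
conn.commit()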
Next, try to insert data into the table from a Python script. Starting with fake data is fine, but mold your Python script to read from your upstream source (the result of scraping) and insert into the table. What does your scraping code output? Is it a function you can call? A CSV you can read? That'll dictate how this script works.
It'll help if this import script is idempotent: you can run it multiple times and it won't make a mess by inserting duplicate rows. It'll also help if it is incremental: once your dataset grows large, it will be very expensive to recompute the whole thing. Try to deal with importing a specific interval at a time. A command-line tool is fine. You can specify the interval as a command-line argument, or figure it out from the current time.
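A rough sketch of such an import script, assuming the schema above; scrape() here is a placeholder standing in for whatever your existing scraping code outputs:

import sqlite3
from datetime import datetime, timedelta

def scrape(start, end):
    # Placeholder for the existing scraping code; should return
    # (surf_break_id, observed_at as ISO-8601 text, wave_height, wind, tide) tuples.
    return []

def import_interval(conn, start, end):
    rows = scrape(start, end)
    # INSERT OR IGNORE plus the UNIQUE(surf_break_id, observed_at) constraint
    # makes this idempotent: re-running the same interval adds no duplicates.
    conn.executemany(
        "INSERT OR IGNORE INTO observations "
        "(surf_break_id, observed_at, wave_height_m, wind_kts, tide_m) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()

if __name__ == "__main__":
    # Incremental: each run only recomputes the last day, not the whole history.
    end = datetime.utcnow()
    import_interval(sqlite3.connect("surf.db"), end - timedelta(days=1), end)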
The general problem here, loading data from one system into another on a regular schedule, is called ETL. You have a very simple case of it, and can use very simple tools, but if you want to read about it, that's what it's called. If instead you could get a continuous stream of observations - say, straight from the sensors - you would have a streaming ingestion problem.
You can use cron, the standard Unix job scheduler, to make this script run on a schedule. You'll want to know whether it ran successfully - this opens a whole other can of worms about monitoring and alerting. There are various open-source systems that will let you emit metrics from your programs, basically a "hey, this happened" tick, see these metrics plotted on graphs, and ask to be emailed/texted/paged if something is happening too frequently or too infrequently. (These systems are, incidentally, one of the main applications of time-series databases.) Don't get bogged down with this upfront, but keep it in mind. Statsd, Grafana, and Prometheus are some names to get you started Googling in this direction. You could also simply have your script send an email on success or failure, but people tend to start ignoring such emails.
You'll have written some functions to interact with your database engine. Extract these in a Python module. This forms the basis of your Data Access Layer. Reuse it in your Flask application. This will be easiest if you keep all this stuff in the same Git repository. You can use your chosen database engine's Python client directly, or you can use an abstraction layer like SQLAlchemy. This decision is controversial and people will have opinions, but just pick one. Whatever database API you pick, please learn what a SQL injection attack is and how to use user-supplied data in queries without opening yourself up to SQL injection. Your database API's documentation should cover the latter.
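As a sketch, the Data Access Layer might start out as small as this (sqlite3 again standing in for your chosen engine); note the ? placeholders, which let the driver handle user-supplied values and are what keeps you safe from SQL injection:

import sqlite3

def get_connection():
    return sqlite3.connect("surf.db")

def list_surf_breaks(conn):
    return conn.execute("SELECT id, name FROM surf_breaks").fetchall()

def list_observations(conn, surf_break_id):
    # Parameterized query: the driver escapes surf_break_id for us.
    return conn.execute(
        "SELECT observed_at, wave_height_m, wind_kts, tide_m "
        "FROM observations WHERE surf_break_id = ? "
        "ORDER BY observed_at",
        (surf_break_id,),
    ).fetchall()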
The / page of your Flask application will be based on a SQL query like SELECT * FROM surf_breaks. Render a link to the break-specific page for each one.
You'll have another page like /breaks/n where n identifies a surf break (an integer that increments as you insert surf break rows is customary). This page will be based on a query like SELECT * FROM observations WHERE surf_break_id = n. In each case, you'll call functions in your Data Access Layer for a list of rows, and then in a template, iterate through those rows and render some HTML. There are various Javascript and Python graphing libraries you can feed this list of rows into and get graphs out of (client side or server side). If you're interested in something like a week-over-week change, you should be able to express that in one SQL query and get that dataset directly from the database engine.
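Put together, the two pages might look roughly like this, assuming the functions above live in a module named dal and that you write the two templates yourself:

from flask import Flask, render_template
import dal  # the Data Access Layer sketched above

app = Flask(__name__)

@app.route("/")
def index():
    breaks = dal.list_surf_breaks(dal.get_connection())
    return render_template("index.html", breaks=breaks)  # render a link per break

@app.route("/breaks/<int:surf_break_id>")
def surf_break(surf_break_id):
    observations = dal.list_observations(dal.get_connection(), surf_break_id)
    return render_template("break.html", observations=observations)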
For performance, try not to get into a situation where more than one SQL query happens during a page load. By default, you'll be doing some unnecessary work by going back to the database and recomputing the page every time someone requests it. If this becomes a problem, you can add a reverse proxy cache in front of your Flask app. In your case this is easy, since nothing users do to the app causes its content to change. Simply invalidate the cache when you import new data.

Storing queryset after fetching it once

I am new to django and web development.
I am building a website with a database of considerable size.
A large amount of data has to be shown on many pages, and a lot of this data is repeated; I mean I need to show the same data on many pages.
Is it a good idea to query the database for the data on every GET request? It takes many seconds to get the data every time I refresh the page or request another page that shows the same data.
Is there a way to fetch the data once, store it somewhere, display it on every page, and only refetch it when some updates are made?
I thought about the session but I found that it is limited to 5MB which is small for my data.
Any suggestions?
Thank you.
Django's cache - as mentioned by Leistungsabfall - can help, but like most cache systems it has some drawbacks too if you use it naively for this kind of problem (long queries/computations): when the cache expires, the next request will have to recompute the whole thing - which might take some time, during which every new request will trigger a recomputation... Also, proper cache invalidation can be really tricky.
Actually there's no one-size-fits-all answer to your question. The right solution is often a mix of different approaches (code optimisation, caching, denormalisation, etc.), based on your actual data, how often it changes, how many visitors you have, how critical it is to have up-to-date data, and so on. But the very first steps would be to:
check the code fetching the data and find out if there are possible optimisations at this level, using QuerySet features (.select_related() / .prefetch_related(), .values() and/or .values_list(), annotations, etc.) to avoid issues like the "n+1 queries" problem, fetching whole records and building whole model instances when you only need a single field's value, or doing computations at the Python level when they could be done at the database level (see the sketch after this list);
check your db schema's indexes - well-used indexes can vastly improve performance, badly used ones can vastly degrade it...
and of course use the right tools (db query logging, Python's profiler etc) to make sure you identify the real issues.
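As a sketch of the first point, assuming a hypothetical Author/Article pair of models living in a Django app (none of these names come from the question):

from django.db import models
from django.db.models import Count

class Author(models.Model):
    name = models.CharField(max_length=100)

class Article(models.Model):
    title = models.CharField(max_length=200)
    author = models.ForeignKey(Author, on_delete=models.CASCADE)

def examples():
    # "n+1 queries": one query for the articles, then one more per article.
    for article in Article.objects.all():
        print(article.title, article.author.name)

    # select_related() fetches the same data in a single query with a join.
    for article in Article.objects.select_related("author"):
        print(article.title, article.author.name)

    # values_list() avoids building full model instances for a single field.
    titles = Article.objects.values_list("title", flat=True)

    # Annotations push the computation down to the database.
    per_author = Author.objects.annotate(article_count=Count("article"))
    return titles, per_author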

Should I load the whole database at initialization in Flask web application?

I'm developing a web application using Flask. I have two approaches to returning pages for a user's request.
Load the requested data from the database, then return it.
Load the whole database into a Python dictionary at initialization and return the related page when requested (the whole database is not too big).
I'm curious which approach will have better performance?
Of course it will be faster to get data from a cache that is stored in memory. But you've got to be sure that the amount of data won't get too large, and that you're updating your cache every time you update the database. Depending on your exact goal you may choose a Python dict, a cache (like memcached) or something else, such as tries.
There's also a "middle" way for this. You can store in memory not the whole records from the database, but just the correspondence between the search params in a request and the ids of the matching records. That way, when a user makes a request, you quickly look up the ids of the records needed and query your database by id, which is pretty fast.
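A tiny sketch of that middle way, with hypothetical table and key names (a sqlite3-style connection is assumed):

# Lightweight in-memory index: search params -> record ids.
# Built once at initialization, e.g. {"category:books": [1, 5, 42], ...}
search_index = {}

def build_index(db):
    for record_id, category in db.execute("SELECT id, category FROM items"):
        search_index.setdefault("category:%s" % category, []).append(record_id)

def lookup(db, category):
    ids = search_index.get("category:%s" % category, [])
    if not ids:
        return []
    # Fetching by primary key is fast; only the matching rows leave the database.
    placeholders = ",".join("?" * len(ids))
    return db.execute(
        "SELECT * FROM items WHERE id IN (%s)" % placeholders, ids
    ).fetchall()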

How do I pass around data between pages in a GAE app?

I have a fairly basic GAE app that takes some input, fetches some data from a webpage, parses, then presents it to the user. Right now, the fairly spare input HTML form POSTs the arguments to the output 'file' which is wholly generated by the handler for that URL.
I'd like to do a couple things with the data (e.g. graph it at a landing page perhaps, then write it to an output file), but I don't know how I should pass the parsed data between the different handlers. I could maybe encode it then successively POST it to other handlers, but my gut says that I shouldn't need to HTTP the data back and forth within my app—it seems terribly inefficient (my gut is also hungry...).
In fairly broad swaths (or maybe a link to an example), how should my handlers handle this?
Later thoughts (edit)
My very rough idea now is to have the form submitted to a page that 1) enters the subsequent query into a database (datastore?) keyed to some hash, then uses that to 2) grab and parse all the data. The parsed data would be stored in memory (memcache?) for near-immediate use to graph it and/or process it into a variety of tabular formats for download. The script that does said parsing redirects to a unique URL based on the hash which can be referred to in order to get the data.
The thought would be that you could save the URL, then if you visit it later when the data has been lost, it can re-query the source to get it back/update it.
Reasonable? Should I be looking at other things?
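For what it's worth, a very rough sketch of that flow on classic (Python 2) GAE, with memcache keyed by a hash of the query; fetch_and_parse() is a placeholder for the existing scraping/parsing code:

import hashlib
from google.appengine.api import memcache

def query_hash(query_params):
    # Stable key derived from the submitted query parameters.
    return hashlib.sha1(repr(sorted(query_params.items()))).hexdigest()

def get_parsed_data(query_params):
    key = query_hash(query_params)
    data = memcache.get(key)
    if data is None:
        # Lost from cache (or first visit): re-query the source and re-parse.
        data = fetch_and_parse(query_params)  # placeholder for existing code
        memcache.set(key, data, time=3600)
    return key, data  # key doubles as the unique URL fragment to redirect to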

Hosting a small database aside from Google App Engine?

I asked another question about doing large queries in GAE, to which the answer was pretty much not possible.
What I want to do is this: from an iOS device, I get all the user's contacts phone numbers. So now I have a list of say 250 phone numbers. I want to send these phone numbers back to the server and check to see which of these phone numbers belong to a User account.
So I need to do a query: query = User.query(User.phone.IN(phones_list))
However, with GAE, this is quite an expensive query. It will cost 250 reads for just this one query, and I expect to do this type of query often.
So I came up with a crazy idea. Why don't I host the phone numbers on another host, on another database, where this type of query is cheaper. Then I can have GAE send a HTTP request to my other server to get the desired info.
So I have two questions:
Are there any databases more streamlined to handle these kinds of queries, where it would be cheaper to do so? Or will it all be the same as GAE?
Is this overkill? Is it a good idea? Should I suck it up and pay the cost?
GAE's datastore should be good enough for your service, since your application looks like it could be parallelized very well.
1. Use the phone number as the key_name of User.
If you set the number as the key_name of User, the following code will increase query speed and reduce read operations.
from google.appengine.api import memcache

cached = memcache.get_multi(phones_list)                  # check memcache first
missing = [n for n in phones_list if n not in cached]
users = User.get_by_key_name(missing)                     # one batched datastore read
memcache.set_multi(dict((n, u) for n, u in zip(missing, users) if u))
2. Store multiple numbers in one datastore entity.
GAE's operation cost is not directly related to the entity's size, so a large entity storing multiple numbers is another way to save on operation cost.
For example, store several phone numbers which have the same number_prefix together.
from google.appengine.ext import db

class Number(db.Model):
    # key_name is the prefix itself, e.g. "01" or "03"
    number_prefix = db.StringProperty()
    numbers = db.StringListProperty(indexed=False)

# check numbers 01234567, 032123124
groups = Number.get_by_key_name(["01", "03"])
# check "01234567" in groups[0].numbers ?
# check "032123124" in groups[1].numbers ?
This method could be further improved with memcache.
Generalizing slightly on other ideas offered... assuming that all your search keys are unique to a single User (e.g. email, phone, twitter handle, etc.)
At User write time, you can generate a set of SearchIndex(...) and persist that. Each SearchIndex has the key of the User.
Then at search time you can construct the keys for any SearchIndex and do two ndb.get_multi_async calls. The first to get matching SearchIndex entities, and the second to get the Users associated with those index entities.
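A rough sketch of that pattern with ndb (names are illustrative; the synchronous get_multi is used here for brevity where the async variants would work the same way):

from google.appengine.ext import ndb

class User(ndb.Model):
    phone = ndb.StringProperty()
    email = ndb.StringProperty()

class SearchIndex(ndb.Model):
    # The entity id is the search key itself, e.g. "phone:+15551234567";
    # the entity stores the key of the matching User.
    user_key = ndb.KeyProperty(kind=User)

def index_user(user):
    # At User write time, persist one SearchIndex per unique search key.
    SearchIndex(id="phone:%s" % user.phone, user_key=user.key).put()
    SearchIndex(id="email:%s" % user.email, user_key=user.key).put()

def find_users_by_phone(phones):
    index_keys = [ndb.Key(SearchIndex, "phone:%s" % p) for p in phones]
    indexes = [e for e in ndb.get_multi(index_keys) if e]  # matching SearchIndex entities
    return ndb.get_multi([e.user_key for e in indexes])    # the Users they point to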
