Well, my question is a little complicated, but here goes:
I have a Python server that stores sessions for clients (written in JavaScript) and has complete knowledge of what each client currently has stored in its state.
The server constantly fetches data from the database and checks for changes against the client state. The data is JSON, consisting mostly of lists and dicts. I need a way to send a response to the client telling it to alter its data to match what the server has.
I have considered:
Sending a JSON-serialised, recursively diffed dict of changed elements and not ever using lists - not bad, but I can't use lists
Send the entire server version of the client state to the client - costly and inefficient
Find some convoluted way to diff lists - painful and messy
Text-based diff of the two after dumping as JSON - plain silly
I'm pretty stumped on this, and I'd appreciate any help.
UPDATE
I'm considering sending nulls to the client to remove data it no longer requires and that the server has removed from its version of the client state.
Related question
How to push diffs of data (possibly JSON) to a server?
See
http://ajaxian.com/archives/json-diff-released
http://michael.hinnerup.net/blog/2008/01/15/diffing_json_objects/
There are a couple of possible approaches:
Do an actual tree-parsing recursive diff (a sketch follows after this list);
Encapsulate your JSON updates such that they generate the diff at the same time;
Generate change-only JSON directly from your data.
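For the first approach, a minimal recursive diff might look like the sketch below. It assumes the top-level state is a dict, marks removed keys with null (None) as in your update, and simply replaces lists (and any other non-dict value) wholesale rather than diffing them:

REMOVED = None  # serialises to null; the client interprets it as "delete this key"

def json_diff(old, new):
    """Return a dict of the changes needed to turn `old` into `new`."""
    diff = {}
    for key in old.keys() | new.keys():
        if key not in new:
            diff[key] = REMOVED                    # key was deleted server-side
        elif key not in old:
            diff[key] = new[key]                   # key was added
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            sub = json_diff(old[key], new[key])
            if sub:
                diff[key] = sub                    # only the nested changes
        elif old[key] != new[key]:
            diff[key] = new[key]                   # scalar or list: replace wholesale
    return diff

# json_diff({"a": 1, "b": [1, 2]}, {"a": 1, "b": [1, 2, 3], "c": 5})
# -> {"b": [1, 2, 3], "c": 5}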
What are your expected mean and max sizes for client-state JSON?
What are your expected mean and max sizes for diff updates?
How often will updates be requested?
How quickly does the base data change?
Why can't you use lists?
You could store just a last-known client state timestamp and query the database for items which have changed since then - effectively, let the database do the diff for you. This would require a last-changed timestamp and item-deleted flag on each table item; instead of directly deleting items, set the item-deleted flag, and have a cleanup query that deletes all records with item-deleted flag set more than two full update cycles ago.
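A rough sketch of that idea, assuming SQLite-style placeholders and made-up table/column names:

import time

def changes_since(conn, last_sync_ts):
    # let the database compute the diff: everything touched since the last sync
    now = time.time()
    rows = conn.execute(
        "SELECT id, payload, item_deleted FROM items WHERE last_changed > ?",
        (last_sync_ts,),
    ).fetchall()
    # null tells the client to delete the item; otherwise send the new payload
    update = {row_id: (None if deleted else payload)
              for row_id, payload, deleted in rows}
    return update, now  # the client stores `now` as its new last-known timestamp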
It might be helpful to see some sample data - two sets of JSON client-state data and the diff between them.
Background
I'm creating a Notion DB that will contain data about the different analyzers my team uses (analyzer name, location, last time the analyzer sent data, etc.). Since I'm using live data, I need a way to quickly update the data of all analyzers in the Notion DB.
I'm currently using a Python script to get the analyzers' data and upload it to the Notion DB. Currently I read each row and get its ID, which I use to update the row's data - but this is too slow: it takes more than 30 seconds to update 100 rows.
The Question
I'd like to know if there's a way to quickly update the data of many rows (maybe in one big bulk operation). The goal is something like 100 row updates per second (instead of 100 updates taking 30 seconds).
There are multiple things one could do here - sadly, none of them will improve the updates drastically. Currently there is no way to update multiple rows, or, to be more precise, pages, in a single request. I am not sure what "read each row" refers to, but you can retrieve multiple pages of a database at once - up to 100 per request. If you are retrieving them one by one, this could be improved.
Secondly, I'd like to know how often the set of analyzers changes, and whether they are altered by the Python script or updated in Notion. If this does not happen too often, you might be able to cache the page_ids instead of retrieving the IDs on every update. Sadly, the last_edited_time of the database does not reflect any addition or removal of its children, so simply checking this is not an option.
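For reference, a sketch covering both points using the official notion-client package - retrieving pages 100 at a time and caching their IDs (the token and database ID are placeholders):

from notion_client import Client  # official Notion SDK for Python

notion = Client(auth="secret_xxx")  # placeholder integration token

def fetch_all_page_ids(database_id):
    # pull the whole database 100 pages per request and collect the IDs
    ids, cursor = [], None
    while True:
        kwargs = {"database_id": database_id, "page_size": 100}
        if cursor:
            kwargs["start_cursor"] = cursor
        resp = notion.databases.query(**kwargs)
        ids.extend(page["id"] for page in resp["results"])
        if not resp.get("has_more"):
            return ids
        cursor = resp["next_cursor"]

page_ids = fetch_all_page_ids("your-database-id")  # cache this between runs if the set of analyzers is stable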
The third and last way to improve performance is multi-threading. You can send multiple requests at the same time, since the number of sequential requests is usually the bottleneck.
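A minimal sketch of that, reusing the notion client from above; the properties payload depends entirely on your schema, and the worker count should stay modest because of the rate limit discussed below:

from concurrent.futures import ThreadPoolExecutor

def update_page(page_id, props):
    # one request per page - there is no bulk update endpoint
    notion.pages.update(page_id=page_id, properties=props)

def update_many(updates, workers=3):
    # `updates` is a list of (page_id, properties) tuples
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(update_page, pid, props) for pid, props in updates]
        for f in futures:
            f.result()  # re-raise any request errors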
I know none of these will really help you, but sadly no efficient method to update multiple pages exists.
There is also a rate limit of 3 requests per second, which is enforced by Notion to ensure fair performance for all users. If you send more requests than that, you will start receiving responses with an HTTP 429 code. Your integration should respect this response and hold off sending further requests for the number of seconds indicated in the response, as described in the Notion developer API guidelines.
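A sketch of honouring that, written against the raw REST endpoint so the Retry-After value is visible directly (the endpoint, version header and payload shape follow the public API docs, but treat this as a sketch rather than production code):

import time
import requests

def patch_page(token, page_id, properties, max_retries=5):
    url = "https://api.notion.com/v1/pages/%s" % page_id
    headers = {
        "Authorization": "Bearer %s" % token,
        "Notion-Version": "2022-06-28",
        "Content-Type": "application/json",
    }
    for _ in range(max_retries):
        resp = requests.patch(url, headers=headers, json={"properties": properties})
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # wait for the number of seconds Notion asks us to wait
        time.sleep(float(resp.headers.get("Retry-After", 1)))
    raise RuntimeError("still rate limited after %d retries" % max_retries)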
I am new to django and web development.
I am building a website with a fairly large database.
A large amount of data has to be shown on many pages, and a lot of this data is repeated - I mean I need to show the same data on many pages.
Is it a good idea to query the database for the data on every GET request? It takes many seconds to get the data every time I refresh the page or request another page that shows the same data.
Is there a way to fetch the data once, store it somewhere, display it on every page, and only refetch it when the data is updated?
I thought about the session, but I found that it is limited to 5 MB, which is too small for my data.
Any suggestions?
Thank you.
Django's cache - as mentioned by Leistungsabfall - can help, but like most cache systems it has some drawbacks too if you use it naively for this kind of problem (long queries/computations): when the cache expires, the next request will have to recompute the whole thing - which might take some time, during which every new request will trigger a recomputation... Also, proper cache invalidation can be really tricky.
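For completeness, the naive version of that caching pattern might look like the sketch below (the model, key and timeout are made up; the expiry problem described above still applies):

from django.core.cache import cache

def get_dashboard_data():
    # SomeModel is a placeholder for whatever you are displaying;
    # get_or_set recomputes only when the key is missing or expired
    return cache.get_or_set(
        "dashboard_data",
        lambda: list(SomeModel.objects.values("id", "name", "score")),
        timeout=60 * 5,
    )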
Actually there's no one-size-fits-all answer to your question; the right solution is often a mix of different solutions (code optimisation, caching, denormalisation etc.), based on your actual data, how often it changes, how many visitors you have, how critical it is to have up-to-date data etc. But the very first steps would be to:
check the code fetching the data and find out if there are possible optimisations at this level, using QuerySet features (.select_related() / .prefetch_related(), .values() and/or .values_list(), annotations etc.) to avoid issues like the "n+1 queries" problem, fetching whole records and building whole model instances when you only need a single field's value, or doing computations at the Python level when they could be done at the database level (a sketch follows after this list)
check your db schema's indexes - well-used indexes can vastly improve performance, badly used ones can vastly degrade it...
and of course use the right tools (db query logging, Python's profiler etc) to make sure you identify the real issues.
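To illustrate the first point, assuming hypothetical Article and Author models where Article has a ForeignKey to Author:

from django.db.models import Count

# "n+1 queries": one query for the articles, then one extra query per article
for article in Article.objects.all():
    print(article.author.name)

# a single query with a JOIN instead
for article in Article.objects.select_related("author"):
    print(article.author.name)

# only the column you actually need, no model instances built at all
titles = Article.objects.values_list("title", flat=True)

# computation done by the database rather than in Python
authors = Author.objects.annotate(article_count=Count("article"))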
I have some programming background, but I'm in the process of both learning Python and making a web app, and I'm a long-time lurker but first-time poster on Stack Overflow, so please bear with me.
I know that SQLite (or another database, seems like PostgreSQL is popular) is the way to store data between sessions. But what's the most efficient way to store large amounts of data during a session?
I'm building a script to identify the strongest groups of employees to work on various projects in a company. I have received one SQLite database per department containing employee data including skill sets, achievements, performance, and pay.
My script currently runs one SQL query on each database in response to an initial query by the user, pulling all the potentially-relevant employees and their data. It stores all of that data in a list of Python dicts so the end-user can mix-and-match relevant people.
I see two other options: I could still run the comprehensive initial queries, but instead of storing the results in Python dicts, dump them all into SQLite temporary tables; my guess is that this would save some space and computation because I wouldn't have to store all the joins with each record. Or I could just load employee names and column/row references, which would save a lot of joins on the first pass, then pull the data on the fly from the original databases as the user requests additional data, storing little if any data in Python data structures.
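If the temporary-table route is what you end up trying, the standard library already covers it; a rough sketch with invented column names:

import sqlite3

# an in-memory database lives only for the session and needs no cleanup
session_db = sqlite3.connect(":memory:")
session_db.execute(
    "CREATE TABLE candidates (name TEXT, department TEXT, skill TEXT, pay REAL)"
)

def load_candidates(rows):
    # `rows` is the flattened list of tuples pulled from the per-department databases
    session_db.executemany("INSERT INTO candidates VALUES (?, ?, ?, ?)", rows)
    session_db.commit()

# later, mix-and-match queries run against the already-flattened data
strong_python = session_db.execute(
    "SELECT name FROM candidates WHERE skill = ?", ("python",)
).fetchall()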
What's going to be the most efficient? Or, at least, what is the most common/proper way of handling large amounts of data during a session?
Thanks in advance!
Aren't you over-optimizing? You don't need the best solution, you need a solution which is good enough.
Implement the simplest one, using dicts; it has a fair chance to be adequate. If you test it and then find it inadequate, try SQLite or Mongo (both have downsides) and see if it suits you better. But I suspect that buying more RAM instead would be the most cost-effective solution in your case.
(Not-a-real-answer disclaimer applies.)
Folks,
I'm retrieving all items from a DynamoDB table, and I would like to replace the scan operation with a query.
Currently I am pulling in all the table's data via the following (python):
from boto.dynamodb2.table import Table  # assumed import - the boto 2.x DynamoDB interface

drivertable = Table(url['dbname'])
all_drivers = []
all_drivers_query = drivertable.scan()
for x in all_drivers_query:
    all_drivers.append(x['number'])
How would I change this to use the query API?
Thanks!
There is no way to use Query to get the entire contents of the table. As of right now, you have a few options if you want to get all of your data out of a DynamoDB table, and all of them involve actually reading the data out of DynamoDB:
Scan the table. It can be done faster, at the expense of using much more read capacity, by using a parallel scan (see the sketch after this list).
Export your data using AWS Data Pipeline. You can configure the export job for where and how it should store your data.
Use one of the AWS event platforms to capture new data and denormalize it. For all new data you can get a time-ordered stream of all updates to the table from DynamoDB Streams, or process the events using AWS Lambda.
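Your snippet uses the older boto 2 interface; with boto3 a parallel scan might look roughly like this (table name and segment count are placeholders):

import boto3
from concurrent.futures import ThreadPoolExecutor

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("drivers")  # placeholder table name
TOTAL_SEGMENTS = 4

def scan_segment(segment):
    # each worker scans its own slice of the table, following pagination
    items, kwargs = [], {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
    while True:
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            return items
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    all_items = [item for seg in pool.map(scan_segment, range(TOTAL_SEGMENTS))
                 for item in seg]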
You can't query an entire table. Query is used to retrieve a set of items by supplying a hash key (the hash part of the table's composite hash-range primary key).
One cannot use Query without knowing the hash keys.
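To illustrate, a Query only works once you can supply a hash key value; a minimal boto3 sketch (the attribute name and value are placeholders, and `table` is as in the parallel-scan sketch above):

from boto3.dynamodb.conditions import Key

# works: the hash key ("driver_id" here) is known up front
resp = table.query(KeyConditionExpression=Key("driver_id").eq("D-42"))
drivers = resp["Items"]

# there is no equivalent for "give me every driver_id" - that is what Scan is for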
EDIT: a bounty was added to this old question, which asks:
How do I get a list of hashes from DynamoDB?
Well - as of Dec 2014, you still can't ask a single API call for all the hash keys of a table.
Even if you add a GSI, you still can't get a DISTINCT count of hash keys.
The way I would solve this is with de-normalization. Keep another table with no range key and write every hash key there alongside the main table. This adds housekeeping overhead at the application level (mainly when removing items), but solves the problem you asked about.
I have a distributed application that sends and receives data from a specific service on the Internet. When a node receives data, sometimes it needs to validate that that data is correlated with data it or another node previously sent. The value also needs to be unique enough so that I can practically expect never to generate identical values within 24 hours.
In the current implementation I have been using a custom header containing a value from uuid.uuid1(). I can easily validate that that value comes from the single running node by comparing the received UUID's node field to uuid.getnode(), but this implementation was written before we had a requirement that this app should be multi-node.
I still think that some uuid version is the right answer, but I can't seem to figure out how to validate an incoming uuid value.
>>> received = uuid.uuid5(uuid.NAMESPACE_URL, 'http://example.org')
>>> received
UUID('c57c6902-3774-5f11-80e5-cf09f92b03ac')
Is there some way to validate that received was generated with 'http://example.org'?
Is uuid the right approach at all? If not, what is?
If so, should I even be using uuid5 here?
If the goal is purely to create a unique value across your nodes, couldn't you just give each node a unique name and append that to the UUID you are generating?
It wasn't clear to me whether you are trying to do this for security reasons or whether you simply want a guaranteed-unique value across the nodes.
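If uniqueness (rather than security) is the goal, a minimal sketch of that idea - the node name and the delimiter are assumptions, not an established scheme:

import uuid

NODE_NAME = "node-eu-west-1"  # hypothetical per-node identifier set in configuration

def make_correlation_id():
    # a random UUID for uniqueness, the node name for provenance
    return "%s:%s" % (NODE_NAME, uuid.uuid4())

def sending_node(correlation_id):
    # the receiver can tell which node produced the value without a lookup table
    name, _, _ = correlation_id.partition(":")
    return name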