How can I prove that some data came from my app? - python

I have a distributed application that sends and receives data from a specific service on the Internet. When a node receives data, sometimes it needs to validate that that data is correlated with data it or another node previously sent. The value also needs to be unique enough so that I can practically expect never to generate identical values within 24 hours.
In the current implementation I have been using a custom header containing a value of uuid.uuid1(). I can easily validate that the value comes from the single running node by comparing the node field of the received uuid to uuid.getnode(), but this implementation was written before we had a requirement that this app should be multi-node.
I still think that some uuid version is the right answer, but I can't seem to figure out how to validate an incoming uuid value.
>>> received = uuid.uuid5(uuid.NAMESPACE_URL, 'http://example.org')
>>> received
UUID('c57c6902-3774-5f11-80e5-cf09f92b03ac')
Is there some way to validate that received was generated with 'http://example.org'?
Is uuid the right approach at all? If not, what is?
If so, should I even be using uuid5 here?

If the goal is purely to create a unique value across your nodes couldn't you just give each node a unique name and append that to the uuid you are generating?
Wasn't clear to me if you are trying to do this for security reasons or you simply just want a guaranteed unique value across the nodes.
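A minimal sketch of the "node name plus uuid" idea, assuming each node is given a name from a known list. The names KNOWN_NODES, make_token and parse_token are hypothetical, and uuid4 is used here only as one possible choice. Note that this only gives uniqueness and correlation, not security: anyone who knows a node name can build a token that passes this check.

```python
import uuid

# Hypothetical set of node names that belong to the application.
KNOWN_NODES = {"node-a", "node-b"}


def make_token(node_name):
    """Build a header value of the form '<node name>:<random uuid>'."""
    return f"{node_name}:{uuid.uuid4()}"


def parse_token(token):
    """Return (node_name, UUID) if the token looks like one of ours, else None."""
    node_name, _, raw_uuid = token.partition(":")
    if node_name not in KNOWN_NODES:
        return None
    try:
        return node_name, uuid.UUID(raw_uuid)
    except ValueError:
        # The part after the colon is not a well-formed UUID.
        return None
```

A receiving node would call parse_token on the incoming header and then look the UUID up in whatever record of previously sent values it keeps.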


Keep form result in memory

The image above gives an example of what I hope to achieve with flask.
For now I have a list of tuples such as [(B,Q), (A,B,C), (T,R,E,P), (M,N)].
The list can be any length, and so can the tuples. When I submit or pass my form, I receive the data on the server side, all good.
However, I am now asked to remember the state of previously submitted and passed forms so that the user can go back to them and eventually modify the information.
What would be the best way to remember the state of the forms?
Python dictionary with the key being the form number as displayed at the bottom (1 to 4)
Store the result in an SQL table and query it every time I need to access a form again
Other ideas?
Notes: The raw data should be kept for max one day, however, the data are to be processed to generate meaningful information to be stored permanently. Hence, if a modification is made to the form, the final database should reflect it.
This will very much depend on how the application is built.
One option is to simply always return all the answers posted, with each request, but that won't work well if you have a lot of data.
However, you say that the data needs to stay accessible for a day, so it seems reasonable to store it in a database. The cost of a select query on an indexed key is insignificant in most cases.
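As a rough sketch of the database option, assuming SQLite and a JSON-encoded payload per form: the table name form_state, the session_id column and the helper names are all made up for illustration, and a real Flask app would use its own connection handling.

```python
import json
import sqlite3

# In-memory database for the sketch; a real app would use a file or a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE form_state ("
    " session_id TEXT NOT NULL,"
    " form_number INTEGER NOT NULL,"
    " payload TEXT NOT NULL,"
    " PRIMARY KEY (session_id, form_number))"
)


def save_form(conn, session_id, form_number, values):
    """Insert or overwrite the stored state of one form."""
    conn.execute(
        "INSERT OR REPLACE INTO form_state VALUES (?, ?, ?)",
        (session_id, form_number, json.dumps(values)),
    )
    conn.commit()


def load_form(conn, session_id, form_number):
    """Return the stored values for a form, or None if nothing was saved."""
    row = conn.execute(
        "SELECT payload FROM form_state WHERE session_id = ? AND form_number = ?",
        (session_id, form_number),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Because the primary key covers the session and the form number, going back to form 2 and resubmitting it simply overwrites the earlier row.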

How can records put to Kinesis be efficiently batched when smaller than the 25KB minimum payload unit?

Update:
To give more detail on the problem, put_records are charged based on the number of records (partition keys) submitted and the size of the records. Any record that is smaller than 25KB is charged as one PU (Payload Unit). Our individual records average about 100 Bytes per second. If we put them individually we will spend a couple orders of magnitude more money on PUs than we need to.
Regardless of the solution we want a given UID to always end up in the same shard to simplify the work on the other end of Kinesis. This happens naturally if the UID is used as the partition key.
One way to deal with this would be to continue to do puts for each UID, but buffer them in time. But to use PUs efficiently, each UID would have to accumulate about 25KB at roughly 100 bytes per second, which would introduce a delay of about 250 seconds in the stream.
The combination of the answer given here and this question gives me a strategy for mapping multiple user IDs to static (predetermined) partition keys for each shard.
This would allow multiple UIDs to be batched into one Payload Unit (using the shared partition key for the target shard) so they can be written out as they come in each second while ensuring a given UID ends up in the correct shard.
Then I just need a buffer for each shard and as soon as enough records are present totaling just under 25KB or 500 records are reached (max per put_records call) the data can be pushed.
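The per-shard buffer could look roughly like the sketch below. The class name ShardBuffer and its methods are hypothetical, records are assumed to be bytes objects, and 25KB is assumed to mean 25 * 1024 bytes. It does not call Kinesis; it only decides when a batch is ready to be pushed.

```python
MAX_BATCH_BYTES = 25 * 1024  # assumption: 25KB taken as 25 * 1024 bytes
MAX_BATCH_RECORDS = 500      # per-batch record limit from the question


class ShardBuffer:
    """Collects small records for one shard until a batch is full."""

    def __init__(self, max_bytes=MAX_BATCH_BYTES, max_records=MAX_BATCH_RECORDS):
        self.max_bytes = max_bytes
        self.max_records = max_records
        self.records = []
        self.size = 0

    def add(self, record):
        """Add one record (bytes). Returns a full batch to push, or None."""
        flushed = None
        would_overflow = self.size + len(record) > self.max_bytes
        at_record_limit = len(self.records) >= self.max_records
        if self.records and (would_overflow or at_record_limit):
            flushed = self.flush()
        self.records.append(record)
        self.size += len(record)
        return flushed

    def flush(self):
        """Return everything buffered so far and start a new batch."""
        batch, self.records, self.size = self.records, [], 0
        return batch
```

Each shard would get its own ShardBuffer, and whatever add() returns would be written out under that shard's shared partition key. A timer that calls flush() would still be needed so that a quiet shard does not hold data forever.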
That just leaves figuring out ahead of time which shard a given UID would naturally map to if it was used as a partition key.
The AWS Kinesis documentation says this is the method:
Partition keys are Unicode strings with a maximum length limit of 256
bytes. An MD5 hash function is used to map partition keys to 128-bit
integer values and to map associated data records to shards.
Unless someone has done this before I'll try and see if the method in this question generates valid mappings. I'm wondering if I need to convert a regular Python string into a unicode string before doing the MD5.
There are probably other solutions, but this should work and I'll accept the existing answer here if no challenger appears.
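A sketch of what that mapping might look like in Python, following the quoted documentation. Two things here are assumptions rather than confirmed facts: that the partition key is encoded as UTF-8 before hashing, and that the shards split the 128-bit hash space evenly (even_shard_ranges is a hypothetical stand-in; the real hash key range of each shard should be read from the stream description instead).

```python
import hashlib


def partition_hash(partition_key):
    """Map a partition key to a 128-bit integer via MD5 (UTF-8 assumed)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_shard_ranges(shard_count):
    """Hypothetical stand-in: split the 128-bit space into equal ranges."""
    space = 2 ** 128
    step = space // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last range absorbs any remainder so the whole space is covered.
        end = space - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_index_for(partition_key, ranges):
    """Return the index of the range that contains the key's hash."""
    value = partition_hash(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("hash value not covered by any range")
```

Whether this matches what the service does can be checked by putting a few records and comparing against the shard each one actually lands in.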
Excerpt from a previous answer:
Try generating a few random partition_keys, and send a distinct value with each of them to the stream.
Run a consumer application and see which shard delivered which value.
Then map the partition keys which you used to send each record with the corresponding shard.
So, now that you know which partition key to use while sending data to
a specific shard, you can use this map while sending those special "to
be multiplexed" records...
It's hacky and brute force, but it will work.
Also see previous answer regarding partition keys and shards:
https://stackoverflow.com/a/31377477/1622134
Hope this helps.
PS: If you use low level Kinesis APIs and create a custom PutRecord
request, in the response you can find which shard the data is placed
upon. PutRecordResponse contains shardId information;
http://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecord.html
Source: https://stackoverflow.com/a/34901425/1622134

Use sorted set to notifications system

I am using Redis sorted sets to save user notifications. But as I have never built a notification system, I am asking about my logic.
I need to save 4 things for each notification.
post_id
post_type - A/B
visible - Y/N
checked - Y/N
My question is how can I store this type of structure in sorted sets?
ZADD users_notifications:1 10 1_A_Y_Y
ZADD users_notifications:1 20 2_A_Y_N
....
Is there a better way to do this kind of thing in Redis? In the case above I am saving the four things in each element, and I need to split on the underscore in the server-side language.
It really depends on how you need to query the data.
The most common way to approach this problem is to use a sorted set for the order and a hash for each object.
So:
ZADD notifications:<user-id> <timestamp> <post-id>
HMSET notifications:<user-id>:<post-id> type <type> visible <visible> checked <checked>
You'd use ZRANGE (or ZREVRANGE, to get the newest first) to get the notifications in order and then a pipelined call to HMGET to get the attributes for each object.
As I mentioned, it depends on how you need to access the data. If, for example, you always show visible and unchecked notifications to a user, then you probably want to store those IDs in a different sorted set, so that you don't have to query for the status.
Assuming you have such a sorted set, when a user dismisses a notification you'd do:
HSET notifications:<user-id>:<post-id> visible 0
ZREM notifications:<user-id>:visible <post-id>
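For what it's worth, the sorted set plus hash layout could be wrapped in Python along these lines. This is only a sketch: the function names are made up, client is assumed to be a redis-py client created with decode_responses=True, and HGETALL is used in place of HMGET to keep the example short.

```python
def add_notification(client, user_id, post_id, post_type, timestamp):
    """Store one notification: order in a sorted set, fields in a hash."""
    client.zadd(f"notifications:{user_id}", {str(post_id): timestamp})
    client.hset(
        f"notifications:{user_id}:{post_id}",
        mapping={"type": post_type, "visible": "1", "checked": "0"},
    )


def latest_notifications(client, user_id, count=10):
    """Return (post_id, fields) pairs, newest first."""
    post_ids = client.zrevrange(f"notifications:{user_id}", 0, count - 1)
    # Queue one hash read per notification and send them in a single round trip.
    pipe = client.pipeline()
    for post_id in post_ids:
        pipe.hgetall(f"notifications:{user_id}:{post_id}")
    return list(zip(post_ids, pipe.execute()))
```

This avoids packing the four values into one member string and splitting on underscores afterwards.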

Is it safe to pass Google App Engine Entity Keys into web pages to maintain context?

I have a simple GAE system that contains models for Account, Project and Transaction.
I am using Django to generate a web page that has a list of Projects in a table that belong to a given Account and I want to create a link to each project's details page. I am generating a link that converts the Project's key to string and includes that in the link to make it easy to lookup the Project object. This gives a link that looks like this:
My Project Name
Is it secure to create links like this? Is there a better way? It feels like a bad way to keep context.
The key string shows up in the linked page and is ugly. Is there a way to avoid showing it?
Thanks.
There are a few examples in the GAE docs that use the same approach, and keys use only characters that are safe to include in URLs. So there is probably no problem.
BTW, I prefer to use the numeric ID (obj_key.id()) when my model uses a number as its identifier, just because it looks less ugly.
Whether or not this is 'secure' depends on what you mean by that, and how you implement your app. Let's back off a bit and see exactly what's stored in a Key object. Take your key, go to shell.appspot.com, and enter the following:
db.Key(your_key)
This returns something like the following:
datastore_types.Key.from_path(u'TestKind', 1234, _app=u'shell')
As you can see, the key contains the App ID, the kind name, and the ID or name (along with the kind/id pairs of any parent entities - in this case, none). Nothing here you should be particularly concerned about concealing, so there shouldn't be any significant risk of information leakage here.
You mention as a concern that users could guess other URLs - that's certainly possible, since they could decode the key, modify the ID or name, and re-encode the key. If your security model relies on them not guessing other URLs, though, you might want to do one of a couple of things:
Reconsider your app's security model. You shouldn't rely on 'secret URLs' for any degree of real security if you can avoid it.
Use a key name, and set it to a long, random string that users will not be able to guess.
A final concern is what else users could modify. If you handle keys by passing them to db.get, the user could change the kind name, and cause you to fetch a different entity kind to that which you intended. If that entity kind happens to have similarly named fields, you might do things to the entity (such as revealing data from it) that you did not intend. You can avoid this by passing the key to YourModel.get instead, which will check the key is of the correct kind before fetching it.
All this said, though, a better approach is to pass the key ID or name around. You can extract this by calling .id() on the key object (for an ID - .name() if you're using key names), and you can reconstruct the original key with db.Key.from_path('kind_name', id) - or just fetch the entity directly with YourModel.get_by_id.
After doing some more research, I think I can now answer my own question. I wanted to know if using GAE keys or ids was inherently unsafe.
It is, in fact, unsafe without some additional code, since a user could modify URLs in the returned webpage or visit URLs that they build manually. This would potentially let an authenticated user edit another user's data just by changing a key ID in a URL.
So for every resource that you allow access to, you need to ensure that the currently authenticated user has the right to be accessing it in the way they are attempting.
This involves writing extra queries for each operation, since it seems there is no built-in way to just say "Users only have access to objects that are owned by them".
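The kind of check described here might be sketched as follows. This is plain Python rather than the GAE datastore API: the Project dataclass, the AccessDenied exception and the dict standing in for the datastore are all hypothetical, and only the shape of the ownership check is the point.

```python
from dataclasses import dataclass


@dataclass
class Project:
    project_id: int
    owner_id: str
    name: str


class AccessDenied(Exception):
    """Raised when the current user may not access the requested object."""


def get_project_for_user(projects, project_id, user_id):
    """Fetch a project only if it exists and belongs to the given user."""
    project = projects.get(project_id)
    # Treat "missing" and "not yours" the same way so IDs cannot be probed.
    if project is None or project.owner_id != user_id:
        raise AccessDenied(f"user {user_id!r} cannot access project {project_id}")
    return project
```

Every view that takes an ID or key from the URL would go through a helper like this instead of fetching the entity directly.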
I know this is an old post, but I want to clarify one thing. Sometimes you NEED to work with KEYs.
When you have an entity with a #Parent relationship, you can't get it by its ID; you need to use the whole KEY to get it back from the Datastore. In these cases you need to work with the KEY all the time if you want to retrieve your entity.
The IDs aren't simply increasing; I only have 10 entries in my Datastore and I've already reached 7001.
As long as there is some form of protection so users can't simply guess them, there is no reason not to do it.

Diffing a JSON document

Well, my question is a little complicated, but here goes:
I have a Python server that stores client (written in JavaScript) sessions, and has complete knowledge of what the client currently has stored in its state.
The server will constantly fetch data from the database and check for any changes against the client state. The data is JSON; consisting mostly of lists and dicts. I need a way to send a response to the client telling it to alter its data to match what the server has.
I have considered:
Sending a JSON-serialised recursively diffed dict of changed elements and not ever using lists - not bad, but I can't use lists
Send the entire server version of the client state to the client - costly and inefficient
Find some convoluted way to diff lists - painful and messy
Text-based diff of the two after dumping as JSON - plain silly
I'm pretty stumped on this, and I'd appreciate any help with this.
UPDATE
I'm considering sending nulls to the client to remove data it no longer requires and that the server has removed from its version of the client state.
Related question
How to push diffs of data (possibly JSON) to a server?
See
http://ajaxian.com/archives/json-diff-released
http://michael.hinnerup.net/blog/2008/01/15/diffing_json_objects/
There are a couple of possible approaches:
Do an actual tree-parsing recursive diff;
Encapsulate your JSON updates such that they generate the diff at the same time;
Generate change-only JSON directly from your data.
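A minimal sketch of the first approach, the recursive diff, under a few assumptions of my own: the function names are made up, removed keys are marked with None (which becomes null in JSON), lists are treated as opaque values and resent whole when they change, and the state itself never uses None as a real value.

```python
def diff_dicts(old, new):
    """Return only what changed between two dicts; removed keys map to None."""
    changes = {}
    for key in old.keys() | new.keys():
        if key not in new:
            changes[key] = None  # key was removed on the server
        elif key not in old:
            changes[key] = new[key]  # key was added
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            nested = diff_dicts(old[key], new[key])
            if nested:
                changes[key] = nested
        elif old[key] != new[key]:
            changes[key] = new[key]  # lists and scalars are replaced whole
    return changes


def apply_diff(state, changes):
    """Apply a diff produced by diff_dicts and return the updated state."""
    result = dict(state)
    for key, value in changes.items():
        if value is None:
            result.pop(key, None)
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = apply_diff(result[key], value)
        else:
            result[key] = value
    return result
```

The client would need the same apply logic written in JavaScript; apply_diff is shown here only to make the round trip checkable.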
What are your expected mean and max sizes for client-state JSON?
What are your expected mean and max sizes for diff updates?
How often will updates be requested?
How quickly does the base data change?
Why can't you use lists?
You could store just a last-known client state timestamp and query the database for items which have changed since then - effectively, let the database do the diff for you. This would require a last-changed timestamp and item-deleted flag on each table item; instead of directly deleting items, set the item-deleted flag, and have a cleanup query that deletes all records with item-deleted flag set more than two full update cycles ago.
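To make the timestamp idea concrete, here is a small sketch using SQLite. The items table, its column names and the helper functions are invented for the example, and timestamps are plain numbers passed in by the caller.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    " id INTEGER PRIMARY KEY,"
    " body TEXT,"
    " last_changed REAL NOT NULL,"
    " deleted INTEGER NOT NULL DEFAULT 0)"
)


def changes_since(conn, client_timestamp):
    """Return (updated items by id, ids of removed items) since a timestamp."""
    rows = conn.execute(
        "SELECT id, body, deleted FROM items WHERE last_changed > ? ORDER BY id",
        (client_timestamp,),
    ).fetchall()
    updated = {row[0]: row[1] for row in rows if not row[2]}
    removed = [row[0] for row in rows if row[2]]
    return updated, removed


def soft_delete(conn, item_id, now):
    """Flag an item as deleted instead of removing the row."""
    conn.execute(
        "UPDATE items SET deleted = 1, last_changed = ? WHERE id = ?",
        (now, item_id),
    )


def purge_deleted(conn, now, update_cycle_seconds):
    """Drop rows that were flagged more than two full update cycles ago."""
    conn.execute(
        "DELETE FROM items WHERE deleted = 1 AND last_changed < ?",
        (now - 2 * update_cycle_seconds,),
    )
```

The server then only has to remember one timestamp per client session rather than a full copy of the client state.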
It might be helpful to see some sample data - two sets of JSON client-state data and the diff between them.
