Which way of fetching is more Efficient in Redis? - python

Hi, I'm fairly new to Redis and am currently facing a problem: I don't know which of the following approaches performs better.
Way #1: Cache all the data in Redis and then query it. (I don't know whether it is possible to query Redis. If it is, how?)
For example, for the following table, cache all the data under a single key (so the table takes 1 key) and then query for users with the same city.
Way #2: Cache all users with the same city under a separate key (so the table takes 4 keys) and then fetch each key separately.

Caching all users with the same city under a separate key is the Redis way: fast inserts and fast gets, at the cost of higher memory consumption and some data redundancy.
In general you can't follow your way #1 example. Why not? Redis has no built-in way to query data in SQL terms. With most Redis data structures you can't do something like select something from somewhere where criteria. You can write a Lua script for a complex map/reduce solution over your data, but nothing like that comes out of the box.
Remember: each time you want to join this data with that data, you can do it only in the client application or in a Redis Lua script. Sets and sorted sets (SET and ZSET) do offer some join-like operations, but that is not what you need here.
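As a rough sketch of way #2, the snippet below keeps one Redis set per city plus one hash per user. It assumes a redis-py style client created with decode_responses=True (so ids come back as str). The key names user:<id> and city:<name> and the fields id, name and city are made up for illustration and are not taken from the question's table.

```python
def cache_user(client, user):
    """Store one user as a hash and index the user's id under its city."""
    client.hset(
        f"user:{user['id']}",
        mapping={"name": user["name"], "city": user["city"]},
    )
    # One set per city: this is the "separate key per city" from way #2.
    client.sadd(f"city:{user['city']}", user["id"])


def users_in_city(client, city):
    """Fetch every cached user for a city: one SMEMBERS, then one HGETALL per id."""
    ids = client.smembers(f"city:{city}")
    return [client.hgetall(f"user:{uid}") for uid in sorted(ids)]


# With a running server and redis-py installed, the client could be created as:
#   import redis
#   client = redis.Redis(decode_responses=True)
```

The filtering by city happens at write time, when the id is added to the city's set, so reading a city never has to look at users from other cities.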

Related

DynamoDB best solution to making queries without using the primary key

I have a table in DynamoDB similar to this:
StaffID, Name, Email, Office
1514923 Winston Smith, SmithW#company.com, 101
It only has around 100 rows.
I'm experimenting with Amazon's Alexa and the possibility of using it for voice-based queries such as
'Where is Winston Smith?'
The problem is that when using an Alexa function to pull results from the table, it would never be through the primary key StaffID - because you wouldn't have users asking:
'Where is 1514923?'
From what I've read, querying the non-primary key values is extremely slow... Is there a suitable solution to this when using Python with DynamoDB?
I know that with only 100 rows it is negligible - but I'd like to do things in the correct, industry standard way. Or is the best solution with cases like this, to simply scan the tables - splitting them up for different user groups when they get too large?
There are two approaches here, depending on your application:
If you only ever want to query this table via the Name field, then change the table so that it has a partition key of Name instead of StaffID. DynamoDB isn't SQL - there's no need to have everything keyed on an ID field unless you actually use it. (Note you can't actually "change" the partition key on an existing DynamoDB table - you'll have to rebuild the table).
If you want to query efficiently via both StaffID and Name, create a global secondary index for the table using the Name field. Be aware that global secondary indexes need both their own provisioned throughput and storage, both of which of course equal money.
Minor aside: this is nothing to do with the fact you're using the Python interface, it applies to all DynamoDB access.
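For the second approach, a query against the secondary index might look roughly like the sketch below. It only builds the request parameters for the low-level Query call. The table name "Staff" and the index name "Name-index" are hypothetical, and the attribute is assumed to be stored as a string.

```python
def build_name_query(table_name, index_name, name):
    """Build keyword arguments for a DynamoDB Query against a secondary index."""
    return {
        "TableName": table_name,
        "IndexName": index_name,
        # "Name" is a DynamoDB reserved word, so refer to it through a placeholder.
        "KeyConditionExpression": "#n = :name",
        "ExpressionAttributeNames": {"#n": "Name"},
        "ExpressionAttributeValues": {":name": {"S": name}},
    }


# With boto3 installed and AWS credentials configured, this could be used as:
#   import boto3
#   params = build_name_query("Staff", "Name-index", "Winston Smith")
#   items = boto3.client("dynamodb").query(**params)["Items"]
```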

AWS DynamoDB retrieve entire table

Folks,
Retrieving all items from a DynamoDB table, I would like to replace the scan operation with a query.
Currently I am pulling in all the table's data via the following (python):
drivertable = Table(url['dbname'])
all_drivers = []
all_drivers_query = drivertable.scan()
for x in all_drivers_query:
    all_drivers.append(x['number'])
How would I change this to use the query API?
Thanks!
There is no way to query and get the entire contents of the table. As of right now, you have a few options if you want to get all of your data out of DynamoDB, and all of them involve actually reading the data out of DynamoDB:
Scan the table. This can be done faster, at the expense of using much more read capacity, by using a parallel scan.
Export your data using AWS Data Pipeline. You can configure the export job for where and how it should store your data.
Use one of the AWS event platforms for new data and denormalize it. For all new data you can get a time-ordered stream of all updates to the table from DynamoDB Streams, or process events using AWS Lambda.
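For the first option, a full scan has to follow pagination, because a single scan call returns at most one page of results. The helper below is a minimal sketch that assumes a boto3-style Table object whose scan() returns a dict with "Items" and, when more pages remain, "LastEvaluatedKey". This differs from the older boto interface used in the question, and the table name in the usage comment is hypothetical.

```python
def scan_all(table, **scan_kwargs):
    """Collect every item from a table by following scan pagination."""
    items = []
    while True:
        page = table.scan(**scan_kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            # No more pages: the whole table has been read.
            return items
        # Resume the next scan call where the previous page stopped.
        scan_kwargs["ExclusiveStartKey"] = last_key


# With boto3 installed and AWS credentials configured:
#   import boto3
#   table = boto3.resource("dynamodb").Table("drivers")
#   all_drivers = [item["number"] for item in scan_all(table)]
```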
You can't query an entire table. Query is used to retrieve a set of items by supplying a hash key (part of the complex primary key hash-range of the table).
You cannot use query without knowing the hash key.
EDIT as a bounty was added to this old question that asks:
How do I get a list of hashes from DynamoDB?
Well - In Dec 2014 you still can't ask via a single API for all hash keys of a table.
Even if you go and put a GSI you still can't get a DISTINCT hash count.
The way I would solve this is with de-normalization. Keep another table with no range key and write every hash key there whenever you write to the main table. This adds housekeeping overhead at the application level (mainly when removing items), but solves the problem you asked about.

Can I use SQLAlchemy with Cassandra CQL?

I use Python with SQLAlchemy for some relational tables. For the storage of some larger data-structures I use Cassandra. I'd prefer to use just one technology (cassandra) instead of two (cassandra and PostgreSQL). Is it possible to store the relational data in cassandra as well?
No, Cassandra is a NoSQL storage system, and doesn't support fundamental SQL semantics like joins, let alone SQL queries. SQLAlchemy works exclusively with SQL statements. CQL is only SQL-like, not actual SQL itself.
To quote from the Cassandra CQL documentation:
Although CQL has many similarities to SQL, there are some fundamental differences. For example, CQL is adapted to the Cassandra data model and architecture so there is still no allowance for SQL-like operations such as JOINs or range queries over rows on clusters that use the random partitioner.
You are of course free to store all your data in Cassandra, but that means you have to rethink how you store that data and find it again. You cannot use SQLAlchemy to map that data into Python objects.
As mentioned, Cassandra does not support JOIN by design. Use Pycassa mapping instead: http://pycassa.github.com/pycassa/api/pycassa/columnfamilymap.html
playOrm supports JOIN on noSQL so that you CAN put relational data into noSQL but it is currently in java. We have been thinking of exposing a S-SQL language from a server for programs like yours. Would that be of interest to you?
The S-SQL would look like this(if you don't use partitions, you don't even need anything before the SELECT statement piece)...
PARTITIONS t(:partId) SELECT t FROM TABLE as t INNER JOIN t.security as s WHERE s.securityType = :type and t.numShares = :shares")
This allows relational data in a noSQL environment AND IF you partition your data, you can scale as well very nicely with fast queries and fast joins.
If you like, we can quickly code up a prototype server that exposes an interface where you send in S-SQL requests and we return some form of json back to you. We would like it to be different than SQL result sets which was a very bad idea when left joins and inner joins are in the picture.
ie. we would return results on a join like so (so that you can set a max results that actually works)...
tableA row A - tableB row45
             - tableB row65
             - tableB row78
tableA row C - tableB row46
             - tableB row93
NOTICE that we do not return multiple row A's, so if you set max results to 2 you get row A and row C, whereas with ODBC/JDBC you would get ONLY row A twice, with row45 and row65, because that is what the table looks like when it is returned (which is kind of stupid when you are in an OO language of any kind).
just let playOrm team know if you need anything on the playOrm github website.
Dean

What are some ways to maintain data consistency at the application layer of NoSQL?

My Python web application uses DynamoDB as its datastore, but this is probably applicable to other NoSQL tables where index consistency is handled at the application layer. I'm de-normalizing data and creating indices in several tables to facilitate lookups.
For example, for my users table:
Table 1: (user_id) email, employee_id, first name, last name, etc ...
Table 2: (email) user_id
Table 3: (employee_id) user_id
Table 1 is my "primary table" where user info is stored. If the user_id is known, all info about a user can be retrieved in a single GET query.
Table 2 and 3 enable lookups by email or employee_id, requiring a query to those tables first to get the user_id, then a second query to Table 1 to retrieve the rest of the information.
My concern is with the de-normalized data -- what is the best way to handle deletions from Table 1 to ensure the matching data gets deleted from Tables 2 + 3? Also ensuring inserts?
Right now my chain of events is something like:
1. Insert row in table 1
2. Insert row in table 2
3. Insert row in table 3
Does it make sense to add "checks" at the end? Some thing like:
4. Check that all 3 rows have been inserted.
5. If a row is missing, remove rows from all tables and raise an error.
Any other techniques?
The short answer is: there is no way to ensure consistency. This is the price you agreed to pay when moving to NoSQL in exchange for performance and scalability.
DynamoDB-mapper has a "transaction engine". Transaction objects are plain DynamoDB items and may be persisted. This way, if a logical group of actions (a transaction) has succeeded, we can be sure of it by looking at the persisted status. But we have no means to be sure that it has not...
To do a bit of advertisement :) , the dynamodb-mapper transaction engine supports
single/multiple targets
sub transactions
transaction creating objects (not released yet)
If you are rolling your own mapper (which is an enjoyable task), feel free to have a look at our source code: https://bitbucket.org/Ludia/dynamodb-mapper/src/52c75c5df921/dynamodb_mapper/transactions.py
Disclaimer: I am one of the main developers of the dynamodb-mapper project. Feel free to contribute :)
Disclaimer: I haven't actually used DynamoDB, just looked through the data model and API, so take this for what it's worth.
The use case you're giving is one primary table for the data, with other tables for hand-rolled indices. This really sounds like work for an RDBMS (maybe with some sharding for growth). But if that won't cut it, here are a couple of ideas which may or may not work for you.
A. Leave it as it is. If you'll never serve data from your index tables, then maybe you can afford to have lazy deletion and insertion as long as you handle the primary table first. Say this happens:
1) Delete JDoe from Main table
xxxxxxxxxx Process running code crashes xxxxxxx
2) Delete from email index // Never gets here
3) Delete from employee_id index // Never gets here
Well, if an "email" query comes in, you'll resolve the corresponding user_id from the index (now stale), but it won't show up on the main table. You know that something is wrong, so you can return a failure/error and clean up the indexes. In other words, you just live with some stale data and save yourself the trouble, cleaning it up as necessary. You'll have to figure out how much stale data to expect, and maybe write a script that does some housekeeping daily.
B. If you really want to simulate locks and transactions, you could consider using something like Apache Zookeeper, which is a distributed system for managing shared resources like locks. It'd be more work and overhead, but you could probably set it up to do what you want.
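Option A (lazy cleanup of stale index entries) could be sketched as follows. Plain dicts stand in for the primary table and the email index here, so this only illustrates the control flow and does not touch the DynamoDB API. The function name and the choice to report a stale entry as "not found" are assumptions.

```python
def get_user_by_email(users, email_index, email):
    """Resolve email -> user_id -> user, cleaning up a stale index entry if found."""
    user_id = email_index.get(email)
    if user_id is None:
        return None  # the email was never indexed
    user = users.get(user_id)
    if user is None:
        # The index points at a user that no longer exists in the primary
        # table: remove the stale entry now instead of at delete time.
        email_index.pop(email, None)
        return None
    return user
```

The primary table stays the single source of truth. An index entry is only ever a hint that gets verified on read.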

Efficient way to combine results of two database queries

I have two tables on different servers, and I'd like some help finding an efficient way to combine and match the datasets. Here's an example:
From server 1, which holds our stories, I perform a query like:
query = """SELECT author_id, title, text
FROM stories
ORDER BY timestamp_created DESC
LIMIT 10
"""
results = DB.getAll(query)
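# Collect the author_ids, e.g. [1314, 4134, 2624, 2342]
# (assumes author_id is the first column of each row)
uid_list = [row[0] for row in results]
But, I'd like to fetch some info about each author_id from server 2:
# One %s placeholder per id, so that each value is passed as its own parameter
placeholders = ','.join(['%s'] * len(uid_list))
query = """SELECT id, avatar_url
FROM members
WHERE id IN (%s)
""" % placeholders
values = tuple(uid_list)
results = DB.getAll(query, values)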
Now I need some way to combine these two queries so I have a dict that has the story as well as avatar_url and member_id.
If this data were on one server, it would be a simple join that would look like:
SELECT *
FROM members, stories
WHERE members.id = stories.author_id
But since we store the data on multiple servers, this is not possible.
What is the most efficient way to do this? I understand the merging probably has to happen in my application code ... any efficient sample code that minimizes the number of dict loops would be greatly appreciated!
Thanks.
If memory isn't a problem, you could use a dictionary.
results1_dict = dict((row[0], list(row[1:])) for row in results1)
results2_dict = dict((row[0], list(row[1:])) for row in results2)
for key, value in results2_dict.items():
    if key in results1_dict:
        results1_dict[key].extend(value)
    else:
        results1_dict[key] = value
This makes a single pass over each result set (dictionary lookups take constant time on average), at the cost of holding both in memory. It is relatively simple and you can tweak it to do precisely what you need.
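A variant of the dictionary approach that keeps every story, even when one author wrote several of the ten, is to build a lookup from the members rows only and then walk the stories once. This sketch assumes both result sets are sequences of tuples in the column order of the two SELECT statements in the question. The function name is made up.

```python
def attach_avatars(stories, members):
    """Merge (author_id, title, text) rows with (id, avatar_url) rows."""
    # Lookup table: member id -> avatar_url, built in one pass over members.
    avatar_by_id = {member_id: avatar_url for member_id, avatar_url in members}
    return [
        {
            "member_id": author_id,
            "title": title,
            "text": text,
            # None when server 2 returned no row for this author.
            "avatar_url": avatar_by_id.get(author_id),
        }
        for author_id, title, text in stories
    ]
```

Each story costs one dictionary lookup, so the total work grows linearly with the number of rows.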
The only option looks to be a database link, but that is unfortunately unavailable in MySQL.
You'll have to do the merging in your application code. Better to keep the data in same database.
You will have to bring the data together somehow.
There are things like server links (though that is probably not the correct term in a MySQL context) that might allow querying across different DBs. This opens up another set of problems (security!)
The easier solution is to bring the data together in one DB.
The last (least desirable) solution is to join in code as Padmarag suggests.
Is it possible to set up replication of the needed tables from one server to a database on the other?
That way you could have all your data on one server.
Also, see FEDERATED storage engine, available since mysql 5.0.3.
