Can I use SQLAlchemy with Cassandra CQL? - python

I use Python with SQLAlchemy for some relational tables. For the storage of some larger data structures I use Cassandra. I'd prefer to use just one technology (Cassandra) instead of two (Cassandra and PostgreSQL). Is it possible to store the relational data in Cassandra as well?

No. Cassandra is a NoSQL storage system; it doesn't support fundamental SQL semantics like joins, and it doesn't accept SQL queries at all. SQLAlchemy works exclusively with SQL statements, and CQL is only SQL-like, not actual SQL itself.
To quote from the Cassandra CQL documentation:
Although CQL has many similarities to SQL, there are some fundamental differences. For example, CQL is adapted to the Cassandra data model and architecture so there is still no allowance for SQL-like operations such as JOINs or range queries over rows on clusters that use the random partitioner.
You are of course free to store all your data in Cassandra, but that means you have to re-think how you store that data and find it again. You cannot use SQLAlchemy to map that data into Python objects.

As mentioned, Cassandra does not support JOIN by design. Use Pycassa mapping instead: http://pycassa.github.com/pycassa/api/pycassa/columnfamilymap.html
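For example, a rough sketch of the kind of class mapping described in those docs (the keyspace, column family, and column types here are made up for illustration, and exact type names can differ between pycassa versions):

import uuid
from pycassa.pool import ConnectionPool
from pycassa.columnfamilymap import ColumnFamilyMap
from pycassa.types import LexicalUUIDType, UTF8Type, IntegerType

# A plain class whose attributes describe the columns of the column family
class User(object):
    key = LexicalUUIDType()
    name = UTF8Type()
    age = IntegerType()

pool = ConnectionPool('MyKeyspace')           # hypothetical keyspace
users = ColumnFamilyMap(User, pool, 'users')  # hypothetical column family

user = User()
user.key = uuid.uuid4()
user.name = u'alice'
user.age = 30
users.insert(user)
print(users.get(user.key).name)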

playOrm supports JOIN on NoSQL so that you CAN put relational data into NoSQL, but it is currently Java-only. We have been thinking of exposing an S-SQL language from a server for programs like yours. Would that be of interest to you?
The S-SQL would look like this (if you don't use partitions, you don't even need anything before the SELECT statement):
PARTITIONS t(:partId) SELECT t FROM TABLE as t INNER JOIN t.security as s WHERE s.securityType = :type and t.numShares = :shares
This allows relational data in a NoSQL environment, and if you partition your data you can also scale very nicely with fast queries and fast joins.
If you like, we can quickly code up a prototype server that exposes an interface where you send in S-SQL requests and we return some form of JSON back to you. We would like it to be different from SQL result sets, which are a very bad idea when left joins and inner joins are in the picture.
i.e. we would return results on a join like so (so that you can set a max results limit that actually works):
tableA row A - tableB row45
             - tableB row65
             - tableB row 78
tableA row C - tableB row46
             - tableB row93
NOTICE that we do not return multiple copies of row A, so with a max results of 2 you get row A and row C, whereas in ODBC/JDBC you would get ONLY row A twice (once with row45 and once with row65), because that is what the flattened result table looks like when it is returned (which is kind of stupid when you are in an OO language of any kind).
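To illustrate that grouping (purely a hypothetical shape, not an actual playOrm response format), the returned JSON could map to something like this in Python:

# Each parent row carries its joined child rows, so a max results of 2
# counts parent rows (row A and row C), not flattened parent/child pairs.
results = [
    {'tableA': 'row A', 'tableB': ['row45', 'row65', 'row78']},
    {'tableA': 'row C', 'tableB': ['row46', 'row93']},
]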
Just let the playOrm team know on the playOrm GitHub site if you need anything.
Dean

Related

Using Impala to select multiple tables with wildcard pattern and concatenate them

I'm starting with Impala SQL and Hadoop and have a (probably simple) question.
I have a Hadoop database with hundreds of tables that share the same schema and naming convention (e.g. process_1, process_2, process_3 and so on). How would I query all the tables and concatenate them into one big table or dataframe? Is it possible to do so using just Impala SQL, returning one dataframe in Python?
Something like:
SELECT * FROM 'process_*';
Or do I need to run SHOW TABLES 'process_*', use a loop in Python and query each table separately?
If you are looking for a purely Impala solution, then one approach would be to create a view on top of all of the tables, something like below:
create view process_view_all_tables as
select * from process1
union all
select * from process2
union all
...
select * from processN;
The disadvantages with this approach are as below:
You need to union multiple tables together. UNION is an expensive operation in terms of memory utilisation. It works OK if you have a small number of tables, say in the range of 2-5.
You need to add all the tables manually. If you add a new process table in the future, you would need to ALTER the view to add the new table. This is a maintenance headache.
The view assumes that all the PROCESS tables are of the same schema.
In the second approach, as you said, you could query the list of tables from Impala using SHOW TABLES LIKE 'process*' and write a small program to iterate over the list of tables and create the files.
Once you have the file generated, you could port the file back to HDFS and create a table on top of it.
The only disadvantage with the second approach is that every iteration issues a separate Impala request, which is particularly disadvantageous in a multi-tenant database environment.
In my opinion, you should try the second approach.
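As a rough sketch of that second approach (assuming the impyla and pandas packages; the host, port, and table pattern below are placeholders):

import pandas as pd
from impala.dbapi import connect
from impala.util import as_pandas

conn = connect(host='impala-host', port=21050)  # placeholder connection details
cur = conn.cursor()

# List all tables matching the naming convention
cur.execute("SHOW TABLES LIKE 'process_*'")
tables = [row[0] for row in cur.fetchall()]

# Query each table and concatenate everything into one dataframe
frames = []
for table in tables:
    cur.execute('SELECT * FROM {0}'.format(table))
    frames.append(as_pandas(cur))

combined = pd.concat(frames, ignore_index=True)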
Hope this helps :)

How to compare hashes of two table columns across SQL Server and Postgres?

I have a table in SQL Server 2017 which has many rows, and that table was migrated to Postgres 10.5 along with its data (my colleagues did it using the Talend tool).
I want to compare if the data is correct after migration. I want to compare the values in a column in SQL Server vs Postgres.
I could try reading the columns into NumPy series from SQL Server and Postgres and comparing both.
But neither DB is on my local machine. They're hosted on servers that I need to access over the network, which means data retrieval is going to take a lot of time.
Instead, I want to do something like this:
Perform a SHA-256 or MD5 hash on the column values, ordered by primary key, and compare the hash values from both databases; that way I don't need to retrieve the results from the databases to my local machine for comparison.
The hash function should return the same value in both databases if the column has exactly the same values.
I'm not even sure if this is possible, or whether there is a better way to do it.
Can someone please point me in the right direction?
If a foreign data wrapper (FDW) isn't going to work out for you, the hash comparison is a reasonable approach. MD5 is probably the right choice, simply because you ought to get consistent results from different software.
Obviously, you'll need the columns to be in the same order in the two databases for the hash comparison to work. If the layouts are different, you can create a view in Postgres to match the column order in SQL Server.
Once you've got tables/views to compare, there's a shortcut to the hashing on the Postgres side. Imagine a table named facility:
SELECT MD5(facility::text) FROM facility;
If that's not obvious, here's what's going on there. Postgres has the ability to cast any compound type to text, like:
select your_table_here::text from your_table_here
The result is like this example:
(2be4026d-be29-aa4a-a536-de1d7124d92d,2200d1da-73e7-419c-9e4c-efe020834e6f,"Powder Blue",Central,f)
Notice the (parens) around the result. You'll need to take that into account when generating the hash on the SQL Server side. This pithy piece of code strips the parens:
SELECT MD5(substring(facility::text, 2, length(facility::text) - 2)) FROM facility;
Alternatively, you can concatenate columns as strings manually and hash that. Chances are you'll need to do that, or use a view, if you've got ID or timestamp fields that were automatically changed during the import.
The :: casting operator can also cast a row to another type, if you've got a conversion in place. And where I've listed a table above, you can use a view just as well.
On the SQL Server side, I have no clue. HASHBYTES?
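As a rough end-to-end sketch of the comparison from Python (the connection strings, the facility table, and the id/name columns are placeholders; it hashes a manually concatenated, primary-key-ordered string on both sides and assumes the concatenated text comes out byte-identical, which can break for NULLs or non-ASCII data in NVARCHAR columns):

import psycopg2
import pyodbc

pg = psycopg2.connect('host=pg-host dbname=mydb user=me password=secret')
ms = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                    'SERVER=ms-host;DATABASE=mydb;UID=me;PWD=secret')

# Postgres: concatenate the columns ordered by primary key, then MD5 the result
pg_cur = pg.cursor()
pg_cur.execute("""
    SELECT md5(string_agg(id::text || '|' || name, ',' ORDER BY id))
    FROM facility
""")
pg_hash = pg_cur.fetchone()[0]

# SQL Server 2017: build the same string with STRING_AGG and hash it with HASHBYTES
ms_cur = ms.cursor()
ms_cur.execute("""
    SELECT CONVERT(VARCHAR(32),
                   HASHBYTES('MD5',
                       STRING_AGG(CONVERT(VARCHAR(MAX), id) + '|' + name, ',')
                           WITHIN GROUP (ORDER BY id)),
                   2)
    FROM facility
""")
ms_hash = ms_cur.fetchone()[0].lower()  # CONVERT(..., 2) returns uppercase hex

print('match' if pg_hash == ms_hash else 'mismatch')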

Which way of fetching is more efficient in Redis?

Hi, I'm fairly new to Redis and currently face a problem. My problem is: I don't know which way gives better performance.
Way #1: Cache all data in Redis and then query it (I don't know whether it is even possible to query Redis; if so, how?).
For example, for the following table, cache all the data under a single key (this way my table uses 1 key) and then query for users with the same city.
Way #2: Cache all users with the same city under a separate key (this way my table uses 4 keys) and then fetch each key separately.
Cache all users with the same city under a separate key - that is the Redis way. Fast inserts and fast gets, at the cost of more memory consumption and some data redundancy.
In general you can't follow your way #1 example. Why not? Redis does not have any out-of-the-box solution for querying data in SQL terms. You can't do something like SELECT something FROM somewhere WHERE criteria against most Redis data structures. You can write a Lua script for a complex map/reduce-style solution over your data, but nothing like that comes out of the box.
You should remember: each time you want to "join this and that data", you can only do it in client application space or in a Redis Lua script. Yes, you have some join-like capability with ZSETs and SETs (set intersections), but it is not what you require.
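A minimal sketch of way #2 with redis-py (the key names and fields are made up; the hset mapping= form assumes a reasonably recent redis-py):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Store each user as a hash, and index users by city in one set per city
def add_user(user_id, name, city):
    r.hset('user:%s' % user_id, mapping={'name': name, 'city': city})
    r.sadd('city:%s:users' % city, user_id)

add_user(1, 'Alice', 'Paris')
add_user(2, 'Bob', 'Paris')
add_user(3, 'Carol', 'Berlin')

# "Querying" users with the same city is one set lookup plus one hash read per user
def users_in_city(city):
    ids = r.smembers('city:%s:users' % city)
    return [r.hgetall('user:%s' % uid.decode()) for uid in ids]

print(users_in_city('Paris'))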

Efficient way to combine results of two database queries

I have two tables on different servers, and I'd like some help finding an efficient way to combine and match the datasets. Here's an example:
From server 1, which holds our stories, I perform a query like:
query = """SELECT author_id, title, text
FROM stories
ORDER BY timestamp_created DESC
LIMIT 10
"""
results = DB.getAll(query)
for i in range(len(results)):
    # Build a string of author_ids, e.g. '1314,4134,2624,2342'
But, I'd like to fetch some info about each author_id from server 2:
query = """SELECT id, avatar_url
FROM members
WHERE id IN (%s)
"""
values = (uid_list)
results = DB.getAll(query, values)
Now I need some way to combine these two queries so I have a dict that has the story as well as avatar_url and member_id.
If this data were on one server, it would be a simple join that would look like:
SELECT *
FROM members, stories
WHERE members.id = stories.author_id
But since we store the data on multiple servers, this is not possible.
What is the most efficient way to do this? I understand the merging probably has to happen in my application code ... any efficient sample code that minimizes the number of dict loops would be greatly appreciated!
Thanks.
If memory isn't a problem, you could use a dictionary.
results1_dict = dict((row[0], list(row[1:])) for row in results1)
results2_dict = dict((row[0], list(row[1:])) for row in results2)

for key, value in results2_dict.items():
    if key in results1_dict:
        results1_dict[key].extend(value)
    else:
        results1_dict[key] = value
This isn't particularly memory-efficient (both result sets are held in dicts), but it is relatively simple and you can tweak it to do precisely what you need.
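Applied to the stories/members example above, the merge can also be done with a single lookup dict and one pass over each result set (column order is taken from the queries shown; the dict keys are just illustrative names):

# results1: (author_id, title, text) rows from server 1
# results2: (id, avatar_url) rows from server 2
avatars = dict((member_id, avatar_url) for member_id, avatar_url in results2)

stories = [
    {
        'member_id': author_id,
        'title': title,
        'text': text,
        'avatar_url': avatars.get(author_id),  # None if no matching member
    }
    for author_id, title, text in results1
]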
The only option looks to be a database link, but that is unfortunately unavailable in MySQL.
You'll have to do the merging in your application code. It is better to keep the data in the same database.
You will have to bring the data together somehow.
There are things like server links (though that is probably not the correct term in a MySQL context) that might allow querying across different DBs. This opens up another set of problems (security!).
The easier solution is to bring the data together in one DB.
The last (least desirable) solution is to join in code as Padmarag suggests.
Is it possible to setup replication of the needed tables from one server to a database on the other?
That way you could have all your data on one server.
Also, see the FEDERATED storage engine, available since MySQL 5.0.3.

How to make table partitions?

I am not very familiar with databases, and so I do not know how to partition a table using SQLAlchemy.
Your help would be greatly appreciated.
There are two kinds of partitioning: Vertical Partitioning and Horizontal Partitioning.
From the docs:
Vertical Partitioning
Vertical partitioning places different kinds of objects, or different tables, across multiple databases:
engine1 = create_engine('postgres://db1')
engine2 = create_engine('postgres://db2')
Session = sessionmaker(twophase=True)
# bind User operations to engine 1, Account operations to engine 2
Session.configure(binds={User:engine1, Account:engine2})
session = Session()
Horizontal Partitioning
Horizontal partitioning partitions the rows of a single table (or a set of tables) across multiple databases. See the “sharding” example in attribute_shard.py.
Just ask if you need more information on those, preferably providing more information about what you want to do.
It's quite an advanced subject for somebody not familiar with databases, but try Essential SQLAlchemy (you can read the key parts on Google Book Search, p. 122 to 124; the example on p. 125-126 is not freely readable online, so you'd have to purchase the book or read it on a commercial service such as O'Reilly's Safari, perhaps on a free trial, if you want to read the example).
Perhaps you can get better answers if you mention whether you're talking about vertical or horizontal partitioning, why you need partitioning, and what underlying database engines you are considering for the purpose.
Automatic partitioning is a very database-engine-specific concept, and SQLAlchemy doesn't provide any generic tools to manage partitioning, mostly because such tools wouldn't provide anything really useful while being another API to learn. If you want to do database-level partitioning, issue the CREATE TABLE statements using custom Oracle DDL (see the Oracle documentation on how to create partitioned tables and migrate data to them). You can use a partitioned table in SQLAlchemy just like you would use a normal table; you just need the table declaration so that SQLAlchemy knows what to query. You can reflect the definition from the database, or duplicate the table declaration in SQLAlchemy code.
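For example, a rough sketch of that workflow (the connection URL, table, columns, and partition DDL below are made up, and the reflection call assumes a reasonably recent SQLAlchemy):

from sqlalchemy import create_engine, MetaData, Table, text

engine = create_engine('oracle+cx_oracle://user:password@host/dbname')  # placeholder URL

# Create the partitioned table with database-specific DDL; SQLAlchemy just passes it through
with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE measurements (
            id        NUMBER PRIMARY KEY,
            taken_at  DATE,
            reading   NUMBER
        )
        PARTITION BY RANGE (taken_at) (
            PARTITION p2009 VALUES LESS THAN (DATE '2010-01-01'),
            PARTITION p2010 VALUES LESS THAN (DATE '2011-01-01')
        )
    """))

# Reflect the definition; from here on the table is used like any other table
metadata = MetaData()
measurements = Table('measurements', metadata, autoload_with=engine)

with engine.connect() as conn:
    rows = conn.execute(measurements.select()).fetchall()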
Very large datasets are usually time-based, with older data becoming read-only or read-mostly and queries usually only look at data from a time interval. If that describes your data, you should probably partition your data using the date field.
There's also application-level partitioning, or sharding, where you use your application to split data across different database instances. This isn't all that popular in the Oracle world due to the exorbitant pricing models. If you do want to use sharding, look at the SQLAlchemy documentation and examples to see how SQLAlchemy can support you, but be aware that application-level sharding will affect how you need to build your application code.
