Using Impala to select multiple tables with wildcard pattern and concatenate them - python

I'm starting with Impala SQL and Hadoop and have a (probably simple) question.
I have a Hadoop database with hundreds of tables with the same schema and naming convention (e.g. process_1, process_2, process_3 and so on). How would I query all the tables and concatenate them into one big table or dataframe? Is it possible to do so using just Impala SQL, returning one dataframe in python?
Something like:
SELECT * FROM 'process_*';
Or do I need to run SHOW TABLES 'process_*', use a loop in python and query each table separately?

If you are looking for a purely Impala solution, then one approach would be to create a view on top of all of the tables. Something like the following:
create view process_view_all_tables as
select * from process_1
union all
select * from process_2
union all
...
select * from process_N;
The disadvantages of this approach are as below:
You need to union multiple tables together. UNION ALL is an expensive operation in terms of memory utilisation; it works OK if you have a small number of tables, say in the range of 2-5.
You need to add all the tables manually. If you add a new process table in the future, you would need to ALTER the view to include the new table. This is a maintenance headache.
The view assumes that all the PROCESS tables are of the same schema.
In the second approach, as you said, you could query the list of tables from Impala using SHOW TABLES LIKE 'process*', then write a small program to iterate over the list of tables and export each one's data to files.
Once you have the file generated, you could port the file back to HDFS and create a table on top of it.
The only disadvantage of the second approach is that every iteration issues a separate Impala request, which is particularly disadvantageous in a multi-tenant database environment.
In my opinion, you should try the second approach.
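For illustration, a minimal sketch of that loop in Python that pulls everything straight into one pandas dataframe instead of going through files (assuming the impyla client and pandas are available; host, port and database name are placeholders):
from impala.dbapi import connect
import pandas as pd

# hypothetical connection details
conn = connect(host='impala-host', port=21050, database='my_db')
cur = conn.cursor()

# list all tables that match the naming convention
cur.execute("SHOW TABLES LIKE 'process_*'")
tables = [row[0] for row in cur.fetchall()]

# query each table and collect the results
frames = []
for t in tables:
    cur.execute("SELECT * FROM {}".format(t))
    cols = [d[0] for d in cur.description]
    frames.append(pd.DataFrame(cur.fetchall(), columns=cols))

# one big dataframe
df = pd.concat(frames, ignore_index=True)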
Hope this helps :)

Related

Relational DB - separate joined tables

Is there any way to join tables from a relational database and then separate them again?
I'm working on a project that involves modifying the data after having joined it. Unfortunately, modifying the data before the join is not an option. I would then want to separate the data according to the original schema.
I'm stuck at the separating part. I have metadata (python dictionary) with the information on the tables (primary keys, foreign keys, fields, etc.).
I'm working with python. So, if you have a solution in python, it would be greatly appreciated. If an SQL solution works, that also helps.
Edit: Perhaps the question was unclear. To summarize, I would like to create a new database with a schema identical to the old one. I do not want to make any modifications to the original database. The data that makes up the new database must originally be in a single table (the result of a join of the old tables), as the operations that need to be run must be run on a single table; I cannot run these operations on individual tables, as the outcome would not be as desired.
I would like to know if this is possible and, if so, how can I achieve this?
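For concreteness, here is a rough sketch of what I mean by the separation step, using pandas; all table and column names below are hypothetical placeholders, and the metadata dictionary is shaped like mine:
import pandas as pd

metadata = {
    "customers": {"columns": ["customer_id", "name"], "pk": ["customer_id"]},
    "orders": {"columns": ["order_id", "customer_id", "total"], "pk": ["order_id"]},
}

def split_joined(joined, metadata):
    # Project the joined rows onto each table's columns and drop the
    # duplicate rows introduced by the join, keyed on the primary key.
    tables = {}
    for name, info in metadata.items():
        tables[name] = (joined[info["columns"]]
                        .drop_duplicates(subset=info["pk"])
                        .reset_index(drop=True))
    return tables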
Thanks!

Multiple identical databases

First let me begin by saying that I'm somewhat new to sql (but have been doing python for a long while). I've been having trouble finding a good solution to my problem on the net.
The problem:
I have an undefined number (although probably less than 100) of identically structured sqlite databases that I need to query and merge results from. The databases themselves are not particularly huge.
I've been looking at the ATTACH command and following things in this tutorial:
http://souptonuts.sourceforge.net/readme_sqlite_tutorial.html
import sqlite3 as lite

Database = "library.db"
con = lite.connect(Database)
con.row_factory = lite.Row
cur = con.cursor()

# give the already-open database a named alias, then attach the second file
cur.execute("ATTACH DATABASE 'library.db' AS db1")
cur.execute("ATTACH DATABASE 'library2.db' AS db2")

# 'books' stands in for whatever the (identically named) table is in each db
cur.execute("""
SELECT 'db1', * FROM db1.books
UNION
SELECT 'db2', * FROM db2.books
""")
but it seems like there should be a better way than to explicitly spell out each database in the execute command. Also, it looks like there's a limit to the number of databases that I can attach to? https://sqlite.org/limits.html
I've also looked at something like merging them together into a large database:
How can I merge many SQLite databases?
but it seems inefficient to merge the databases together each time a query needs to be made or one of the many individual databases changes.
Before I continue down some path, I was wondering if there were any better options for this type of thing that I'm not aware of?
Other potentially useful information about the project:
There are two main tables.
The tables from db to db can have duplicates.
I need to be able to grab unique values in columns for the "merged" databases.
Knowing which database what data came from is not essential.
The individual databases are updated frequently.
Queries to the "merged" database are made frequently.
You could avoid spelling out all databases in every query by using views:
CREATE VIEW MyTable_all AS
SELECT 'db1', db1.* FROM db1.MyTable
UNION ALL
SELECT 'db2', db2.* FROM db2.MyTable
...
However, SQLite limits the number of attached databases (10 by default), so if there are too many databases, you cannot use ATTACH.
In that case, you have to merge all the databases together.
If doing this every time for all databases is too slow, you can synchronize a single database at a time by keeping the source for each record:
DELETE FROM MyTable WHERE SourceDB = 1;
INSERT INTO MyTable SELECT 1, * FROM db1.MyTable;
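For what it's worth, a minimal Python sketch of attaching the databases and building such a view dynamically, assuming the files are named library*.db and each contains an identically named table MyTable (again, SQLite caps the number of attached databases at 10 by default):
import sqlite3
import glob

con = sqlite3.connect(":memory:")   # scratch connection to attach everything to
cur = con.cursor()

selects = []
for i, path in enumerate(sorted(glob.glob("library*.db")), start=1):
    alias = "db{}".format(i)
    cur.execute("ATTACH DATABASE ? AS {}".format(alias), (path,))
    selects.append("SELECT '{0}' AS SourceDB, * FROM {0}.MyTable".format(alias))

# view spanning all attached databases
cur.execute("CREATE TEMP VIEW MyTable_all AS " + " UNION ALL ".join(selects))

# example: count rows across every attached database
print(cur.execute("SELECT COUNT(*) FROM MyTable_all").fetchone()[0])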

Large text database: Convert to SQL or use as is

My python project involves an externally provided database: A text file of approximately 100K lines.
This file will be updated daily.
Should I load it into an SQL database, and deal with the diff daily? Or is there an effective way to "query" this text file?
ADDITIONAL INFO:
Each "entry", or line, contains three fields - any one of which can be used as an index.
The update is in the form of the entire database - I would have to generate a diff manually
The queries are just looking up records and displaying the text.
Querying the database will be a fundamental task of the application.
How often will the data be queried? At one extreme, if only once per day, a sequential search might be more efficient than maintaining a database or index.
For more queries and a daily update, you could build and maintain your own index for more efficient queries. Most likely, it would be worth a negligible (if any) sacrifice in speed to use an SQL database (or other database, depending on your needs) in return for simpler and more maintainable code.
What I've done before is create SQLite databases from txt files which were created from database extracts, one SQLite db for each day.
One can query across the SQLite dbs to check the values etc. and create additional tables of data.
I added an additional column of data that was the SHA1 of the text line so that I could easily identify lines that were different.
It worked in my situation and hopefully may form the barest sniff of an acorn of an idea for you.
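In case it is useful, a bare-bones sketch of that approach: loading a daily extract into SQLite with a SHA1 per line for spotting changed lines (the field names, tab delimiter and file names are all assumptions):
import sqlite3
import hashlib

con = sqlite3.connect("extract_today.db")
cur = con.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS entries "
            "(field1 TEXT, field2 TEXT, field3 TEXT, line_sha1 TEXT)")

with open("extract.txt", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        field1, field2, field3 = line.split("\t")   # assumed tab-delimited
        sha1 = hashlib.sha1(line.encode("utf-8")).hexdigest()
        cur.execute("INSERT INTO entries VALUES (?, ?, ?, ?)",
                    (field1, field2, field3, sha1))

# any of the three fields can serve as an index for lookups
cur.execute("CREATE INDEX IF NOT EXISTS idx_field1 ON entries (field1)")
con.commit()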

Can I use SQLAlchemy with Cassandra CQL?

I use Python with SQLAlchemy for some relational tables. For the storage of some larger data-structures I use Cassandra. I'd prefer to use just one technology (cassandra) instead of two (cassandra and PostgreSQL). Is it possible to store the relational data in cassandra as well?
No, Cassandra is a NoSQL storage system, and doesn't support fundamental SQL semantics like joins, let alone SQL queries. SQLAlchemy works exclusively with SQL statements. CQL is only SQL-like, not actual SQL itself.
To quote from the Cassandra CQL documentation:
Although CQL has many similarities to SQL, there are some fundamental differences. For example, CQL is adapted to the Cassandra data model and architecture so there is still no allowance for SQL-like operations such as JOINs or range queries over rows on clusters that use the random partitioner.
You are of course free to store all your data in Cassandra, but that means you have to re-think how you store that data and find it again. You cannot use SQLAlchemy to map that data onto Python objects.
As mentioned, Cassandra does not support JOIN by design. Use Pycassa mapping instead: http://pycassa.github.com/pycassa/api/pycassa/columnfamilymap.html
playOrm supports JOIN on NoSQL so that you CAN put relational data into NoSQL, but it is currently in Java. We have been thinking of exposing an S-SQL language from a server for programs like yours. Would that be of interest to you?
The S-SQL would look like this (if you don't use partitions, you don't even need anything before the SELECT statement piece)...
PARTITIONS t(:partId) SELECT t FROM TABLE as t INNER JOIN t.security as s WHERE s.securityType = :type and t.numShares = :shares
This allows relational data in a noSQL environment AND IF you partition your data, you can scale as well very nicely with fast queries and fast joins.
If you like, we can quickly code up a prototype server that exposes an interface where you send in S-SQL requests and we return some form of JSON back to you. We would like it to be different from SQL result sets, which are a very bad idea when left joins and inner joins are in the picture.
i.e. we would return results on a join like so (so that you can set a max results limit that actually works)...
tableA row A - tableB row45
             - tableB row65
             - tableB row78
tableA row C - tableB row46
             - tableB row93
NOTICE that we do not return multiple row A's, so if you have max results 2 you get row A and row C, whereas in ODBC/JDBC you would get ONLY row A two times, with row45 and row65, because that is what the table looks like when it is returned (which is kind of stupid when you are in an OO language of any kind).
Just let the playOrm team know on the playOrm github website if you need anything.
Dean

Select Data from Table and Insert into a different DB

I'm using python and psycopg2 to remotely query some psql databases, and I'm trying to figure out the best way to select the data I need from the remote table, and insert it into a table on a separate DB (local application server).
Most of the stuff I've read has directed me to avoid executemany and look toward COPY operations, but I'm unsure how to implement this on a specific select statement as opposed to the entire table. Should I be headed this way or am I completely off?
but I'm unsure how to implement this on a specific select statement as opposed to the entire table
COPY isn't limited to tables; you can use a query as the source as well. Check out the examples in the manual, which show how to use COPY to create a text file based on a query:
http://www.postgresql.org/docs/current/static/sql-copy.html#AEN59055
(3rd example)
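A minimal sketch of this with psycopg2's copy_expert, which accepts a full COPY statement, so the source can be a query rather than a whole table (connection strings, the query and the target table are placeholders):
import io
import psycopg2

remote = psycopg2.connect("host=remote-db dbname=source user=me")
local = psycopg2.connect("host=localhost dbname=target user=me")

buf = io.StringIO()
with remote.cursor() as cur:
    # COPY (query) TO STDOUT streams only the selected rows
    cur.copy_expert(
        "COPY (SELECT id, name FROM big_table WHERE updated > now() - interval '1 day') "
        "TO STDOUT WITH CSV",
        buf)

buf.seek(0)
with local.cursor() as cur:
    cur.copy_expert("COPY staging_table FROM STDIN WITH CSV", buf)
local.commit()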
Take a look at http://ryrobes.com/featured-articles/using-a-simple-python-script-for-end-to-end-data-transformation-and-etl-part-1/
Granted, this is pulling from Oracle and inserting into SQL Server, but the concepts should be the same.
