Different WHERE clauses depending on table schema - python

I'm trying to make a mysql dump for a database using the "mysqldump" command line tool and I'm looking to do something kind of unique. Essentially, if the table contains a certain column, dump according to those constraints, otherwise, dump the whole table. But due to the nature of the mysqldump command line options, this all has to be in one WHERE clause.
In pseudocode, this is what I have to do:
if "user_id" in table.columns:
    where = "user_id=1"  # dump rows where user_id=1
else:
    where = "TRUE"  # dump whole table
But it all has to fit into a single WHERE clause that mysqldump will apply to every table it comes across.

The --where option for mysqldump can be used to pass an expression to the SELECT for each table, but it passes the same expression to all tables.
If you need different WHERE clauses for different subsets of tables, you'll need to run more than one mysqldump command and execute each one against a different set of tables.
Re your comment:
No, the references to columns in an SQL query must be fixed at the time the query is parsed, and that means any column reference must refer to a column that exists in the table. There's no way to write an SQL expression that uses a column if it exists and ignores the reference if it doesn't.
To do what you're describing, you would have to query the INFORMATION_SCHEMA to see which columns exist in the table, and then build the SQL for the dump conditionally, based on the columns you find.
But in mysqldump, you don't have the opportunity to do that. It would have to be implemented in the code for mysqldump.
Feel free to get the source code for mysqldump and add your own logic to it. Seems like a lot of work though.
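If you'd rather not touch mysqldump's source, a small wrapper script can do the conditional part outside of mysqldump: check INFORMATION_SCHEMA for each table and invoke mysqldump per table with the appropriate --where. This is only a rough sketch; the connection details, database name, and the mysql-connector-python dependency are assumptions:
import subprocess
import mysql.connector  # assumed driver; any MySQL client library would do

DB = "mydb"  # placeholder database name

conn = mysql.connector.connect(host="localhost", user="root", password="secret")
cur = conn.cursor()

# List every base table in the target schema.
cur.execute(
    "SELECT table_name FROM information_schema.tables "
    "WHERE table_schema = %s AND table_type = 'BASE TABLE'",
    (DB,),
)
tables = [row[0] for row in cur.fetchall()]

with open("dump.sql", "wb") as out:
    for table in tables:
        # Check whether this table has a user_id column.
        cur.execute(
            "SELECT COUNT(*) FROM information_schema.columns "
            "WHERE table_schema = %s AND table_name = %s AND column_name = 'user_id'",
            (DB, table),
        )
        where = "user_id=1" if cur.fetchone()[0] else "TRUE"
        # One mysqldump invocation per table, each with its own --where.
        subprocess.run(
            ["mysqldump", "--user=root", "--password=secret",
             "--where=" + where, DB, table],
            stdout=out,
            check=True,
        )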

Related

python/SQL - SQL a run list, then create a loop that will SQL another table for each runID/Timestamp?

I have two (HUGE) tables for transactions and run data. One table is the contextual stuff, like timestamps, machine, recipe, lot#, runID, etc. The second table is a tall key-value table that has hundreds of keys for each tool/recipe/runID.
Currently, I'm searching the first table, joining the second table, and then doing a case-when to look for a specific key and return the associated value as a new specific column. This works reasonably OK for a small date range and only a few runIDs, but if I open it up to something like t-60 days and hundreds of runIDs, it basically hangs and craps out.
Both tables are timestamp partitioned.
My question:
How can I make this more efficient or work correctly? I'm thinking of something like pulling the run data from the first table, then looping over that result and querying the 2nd table one entry at a time. That way each query against the 2nd table is for an exact runID and exact timestamp, looped n times and stacked.
Any advice for doing this directly in SQL? Or is it better to do this in Python, executing the SQL from there and handling the looping and dynamic SQL creation in code?
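A rough sketch of that pull-then-loop approach, assuming psycopg2 as the driver and placeholder table/column names, since the actual schema and database aren't shown:
import psycopg2  # assumed driver; the real database/driver isn't specified

# Placeholder connection string and table/column names.
conn = psycopg2.connect("dbname=mydb user=me host=db-host")
cur = conn.cursor()

# Step 1: pull the run list (contextual table) for the date range of interest.
cur.execute("""
    SELECT run_id, run_ts
    FROM run_context
    WHERE run_ts >= now() - interval '60 days'
""")
runs = cur.fetchall()

# Step 2: hit the key-value table once per run, so each query is an exact
# run_id/timestamp lookup, then stack the results.
rows = []
for run_id, run_ts in runs:
    cur.execute(
        "SELECT run_id, key, value FROM run_keyvalues WHERE run_id = %s AND run_ts = %s",
        (run_id, run_ts),
    )
    rows.extend(cur.fetchall())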

Using Impala to select multiple tables with wildcard pattern and concatenate them

I'm starting with Impala SQL and Hadoop and have a (probably simple) question.
I have a Hadoop database with hundreds of tables with the same schema and naming convention (e.g. process_1, process_2, process_3 and so on). How would I query all the tables and concatenate them into one big table or dataframe? Is it possible to do so using just Impala SQL, returning one dataframe in python?
Something like:
SELECT * FROM 'process_*';
Or do I need to run SHOW TABLES 'process_*', use a loop in python and query each table separately?
If you are looking for a purely Impala solution, then one approach would be to create a view on top of all of the tables. Something like below:
create view process_view_all_tables as
select * from process_1
union all
select * from process_2
union all
...
select * from process_N;
The disadvantages of this approach are as below:
You need to union multiple tables together. UNION is an expensive operation in terms of memory utilisation. It works OK if you have a small number of tables, say in the range of 2-5 tables.
You need to add all the tables manually. If you add a new process table in the future, you would need to ALTER the view to include the new table. This is a maintenance headache.
The view assumes that all the PROCESS tables are of the same schema.
In the second approach, as you said, you could query the list of tables from Impala using SHOW TABLES LIKE 'process*' and write a small program to iterate over the list of tables and create the files.
Once you have the file generated, you could port the file back to HDFS and create a table on top of it.
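A rough sketch of that iteration, assuming the impyla package (impala.dbapi) for connectivity; the host, port, and output path are placeholders:
from impala.dbapi import connect  # assumed client library

conn = connect(host="impala-host", port=21050)  # placeholder host/port
cur = conn.cursor()

# Get every table matching the naming convention.
cur.execute("SHOW TABLES LIKE 'process*'")
tables = [row[0] for row in cur.fetchall()]

# Dump each table into one combined local file.
with open("process_all.csv", "w") as out:
    for t in tables:
        cur.execute("SELECT * FROM {}".format(t))
        for row in cur.fetchall():
            out.write(",".join(str(v) for v in row) + "\n")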
The only disadvantage of the second approach is that every iteration issues its own Impala database request, which is particularly disadvantageous in a multi-tenant database environment.
In my opinion, you should try the second approach.
Hope this helps :)

SQLAlchemy distinct ignoring order_by

I'm trying to run a query from my flask application using SQLAlchemy where I order the results by the timestamp in descending order. From the returned order I want to pull the distinct sender_ids. Unfortunately distinct ignores my requested order_by and pulls the data in the default table order.
messagesReceived = Message.query.filter_by(recipient_id=user_id).order_by(Message.timestamp.desc()).group_by(Message.sender_id).distinct()
I'm still new to SQLAlchemy and haven't encountered this in the tutorials and lessons I've done. I tried googling, but I don't think I'm phrasing it correctly to get the answer. I'm currently trying to wrap my head around subqueries, as I think that might be a way to make it work, but I'm asking here in the meantime.
Your SQL query is illogical. You select the entire message, but group by sender_id. Unless it is unique, it is indeterminate which row within the group the other values are selected from. ORDER BY is logically performed after GROUP BY, and since timestamp is now indeterminate, so is the resulting order. Some SQL DBMSs do not even allow such a query to run, as it is not allowed by the SQL standard.
To fetch distinct sender_ids ordered by their latest timestamp per sender_id, do:
messagesReceived = db.session.query(Message.sender_id).\
filter_by(recipient_id=user_id).\
group_by(Message.sender_id).\
order_by(db.func.max(Message.timestamp))
If you're using PostgreSQL,
messagesReceived = Message.query.filter_by(recipient_id=user_id).order_by(Message.sender_id.asc(), Message.timestamp.desc()).distinct(Message.sender_id).all()
There may not be a need to group by sender_id. Distinct can also be applied at the Query level (it affects the entire query, not just one column), as described here, so the order_by needs to start with sender_id, after which the distinct can be applied on that specific column.
This may be specific to PostgreSQL, however, so if you're using another DB, I would recommend the distinct expression as outlined here.
from sqlalchemy import distinct, func, select
stmt = select([func.count(distinct(users_table.c.name))])
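Adapted to the models in the question, that expression-level distinct would look roughly like this (Message, db, and user_id are assumed from the question; note this returns the distinct sender_ids but without the latest-timestamp ordering, which the group_by/max query above takes care of):
from sqlalchemy import distinct

# Hypothetical adaptation: distinct sender_ids for one recipient,
# using the column-level distinct() expression.
sender_ids = [
    row[0]
    for row in db.session.query(distinct(Message.sender_id))
    .filter(Message.recipient_id == user_id)
    .all()
]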

How to compare hashes of two table columns across SQL Server and Postgres?

I have a table in SQL Server 2017 which has many rows and that table was migrated to Postgres 10.5 along with data (my colleagues did it using Talend tool).
I want to compare if the data is correct after migration. I want to compare the values in a column in SQL Server vs Postgres.
I could try reading the columns into NumPy series from SQL Server and Postgres and comparing both.
But neither DB is on my local machine. They're hosted on a server that I have to access over the network, which means the data retrieval is going to take a lot of time.
Instead, I want to do something like this.
Perform an sha256 or md5 hash on the column values, ordered by primary key, and compare the hash values from both databases, which means I don't need to retrieve the results from the database to my local machine for comparison.
The hash function should return the same value if the column has exactly the same values.
I'm not even sure if it's possible, or whether there is a better way to do it.
Can someone please point me in the right direction?
If an FDW isn't going to work out for you, maybe the hash comparison is a good approach. MD5 is probably the right choice, only because you ought to get consistent results from different software.
Obviously, you'll need the columns to be in the same order in the two databases for the hash comparison to work. If the layouts are different, you can create a view in Postgres to match the column order in SQL Server.
Once you've got tables/views to compare, there's a shortcut to the hashing on the Postgres side. Imagine a table named facility:
SELECT MD5(facility::text) FROM facility;
If that's not obvious, here's what's going on there. Postgres has the ability to cast any compound type to text. Like:
select your_table_here::text from your_table_here
The result is like this example:
(2be4026d-be29-aa4a-a536-de1d7124d92d,2200d1da-73e7-419c-9e4c-efe020834e6f,"Powder Blue",Central,f)
Notice the (parens) around the result. You'll need to take that into account when generating the hash on the SQL Server side. This pithy piece of code strips the parens:
SELECT MD5(substring(facility::text, 2, length(facility::text) - 2)) FROM facility;
Alternatively, you can concatenate columns as strings manually, and hash that. Chances are, you'll need to do that, or use a view, if you've got ID or timestamp fields that automatically changed during the import.
The :: casting operator can also cast a row to another type, if you've got a conversion in place. And where I've listed a table above, you can use a view just as well.
On the SQL Server side, I have no clue. HASHBYTES?
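If you do go the manual-concatenation route and hash on both sides, a small Python script can just compare the results, so only the 32-character hashes cross the network. Everything below is a sketch under assumptions: psycopg2 and pyodbc as drivers, a primary key named id, and hand-concatenated columns col_a and col_b so both databases hash the same string (NULLs, type formatting, and character encodings still need care):
import psycopg2  # assumed Postgres driver
import pyodbc    # assumed SQL Server driver

pg = psycopg2.connect("dbname=mydb user=me host=pg-host")  # placeholder
ms = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                    "SERVER=ms-host;DATABASE=mydb;UID=me;PWD=secret")  # placeholder

pg_cur = pg.cursor()
pg_cur.execute(
    "SELECT MD5(col_a::text || '|' || col_b::text) FROM facility ORDER BY id"
)
pg_hashes = [r[0] for r in pg_cur.fetchall()]

ms_cur = ms.cursor()
# CONVERT(..., 2) renders the varbinary hash as hex without the 0x prefix;
# LOWER() matches Postgres's lowercase md5() output.
ms_cur.execute(
    "SELECT LOWER(CONVERT(varchar(32), HASHBYTES('MD5', "
    "CAST(col_a AS varchar(100)) + '|' + CAST(col_b AS varchar(100))), 2)) "
    "FROM facility ORDER BY id"
)
ms_hashes = [r[0] for r in ms_cur.fetchall()]

print("columns match" if pg_hashes == ms_hashes else "columns differ")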

I need to filter a large database (2 billion entries) using python and sql

The data is in bytea format. How do I query the PostgreSQL database? It is indexed on the bytea column, and I need to query by the first 4 bytes. I have tried
SELECT * FROM table WHERE addr LIKE '%8ac5c320____'
but it takes too long to find anything. Any suggestions? If I query the whole string then it works fast, but there are about 2 billion entries and I can't use wildcards...
To get matches based on the first four bytes, I'd recommend the following query:
SELECT * FROM table WHERE substring(addr from 0 for 5) = '\x8ac5c320'::bytea;
The documentation for substring can be found on the bytea functions page, though that's admittedly minimal.
The query as written will likely perform a sequential scan across the entire table. To remedy that, create the following index:
CREATE INDEX ON table (substring(addr from 0 for 5));
That creates an index specifically designed for the query you need to run frequently. It's a functional index -- it's indexing a function result, rather than a column.
That should get you the performance that you want.
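From Python, the same lookup can be parameterised with a bytes value; psycopg2 and the table name big_table here are assumptions (psycopg2 adapts Python bytes to bytea automatically):
import psycopg2  # assumed driver

conn = psycopg2.connect("dbname=mydb user=me host=db-host")  # placeholder
cur = conn.cursor()

prefix = bytes.fromhex("8ac5c320")  # the first four bytes to match

# Same expression as the functional index, so the index can be used.
cur.execute(
    "SELECT * FROM big_table WHERE substring(addr from 0 for 5) = %s",
    (prefix,),
)
rows = cur.fetchall()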
All that said, though, your example query does not actually query for the first four bytes. If the query is more correct than your description of the query, then this approach won't work.
