I have created this table in python 2.7 . I use it to store unique pairs name and value. In some queries I search for names and in others I search for values. Lets say that SELECT queries are 50-50. Is there any way to create a table that will be double index (one index on names and another for values) so my program will seek faster the data ?
Here is the database and table creation:
import sqlite3
#-------------------------db creation ---------------------------------------#
db1 = sqlite3.connect('/my_db.db')
cursor = db1.cursor()
cursor.execute("DROP TABLE IF EXISTS my_table")
sql = '''CREATE TABLE my_table (
name TEXT DEFAULT NULL,
value INT
);'''
cursor.execute(sql)
sql = ("CREATE INDEX index_my_table ON my_table (name);")
cursor.execute(sql)
Or is there any other faster struct for faster value seek ?
You can create another index...
sql = ("CREATE INDEX index_my_table2 ON my_table (value);")
cursor.execute(sql)
I think the best way for faster research is to create a index on the 2 fields.
like: sql = ("CREATE INDEX index_my_table ON my_table (Field1, field2)")
Multi-Column Indices or Covering Indices.
see the (great) doc here: https://www.sqlite.org/queryplanner.html
Related
When I'm trying to remove all tables with:
base.metadata.drop_all(engine)
I'm getting following error:
ERROR:libdl.database_operations:Cannot drop table: (psycopg2.errors.DependentObjectsStillExist) cannot drop sequence <schema>.<sequence> because other objects depend on it
DETAIL: default for table <schema>.<table> column id depends on sequence <schema>.<sequence>
HINT: Use DROP ... CASCADE to drop the dependent objects too.
Is there an elegant one-line solution for that?
import psycopg2
from psycopg2 import sql
cnn = psycopg2.connect('...')
cur = cnn.cursor()
cur.execute("""
select s.nspname as s, t.relname as t
from pg_class t join pg_namespace s on s.oid = t.relnamespace
where t.relkind = 'r'
and s.nspname !~ '^pg_' and s.nspname != 'information_schema'
order by 1,2
""")
tables = cur.fetchall() # make sure they are the right ones
for t in tables:
cur.execute(
sql.SQL("drop table if exists {}.{} cascade")
.format(sql.Identifier(t[0]), sql.Identifier(t[1])))
cnn.commit() # goodbye
I have a massive table (over 100B records), that I added an empty column to. I parse strings from another field (string) if the required string is available, extract an integer from that field, and want to update it in the new column for all rows that have that string.
At the moment, after data has been parsed and saved locally in a dataframe, I iterate on it to update the Redshift table with clean data. This takes approx 1sec/iteration, which is way too long.
My current code example:
conn = psycopg2.connect(connection_details)
cur = conn.cursor()
clean_df = raw_data.apply(clean_field_to_parse)
for ind, row in clean_df.iterrows():
update_query = build_update_query(row.id, row.clean_integer1, row.clean_integer2)
cur.execute(update_query)
where update_query is a function to generate the update query:
def update_query(id, int1, int2):
query = """
update tab_tab
set
clean_int_1 = {}::int,
clean_int_2 = {}::int,
updated_date = GETDATE()
where id = {}
;
"""
return query.format(int1, int2, id)
and where clean_df is structured like:
id . field_to_parse . clean_int_1 . clean_int_2
1 . {'int_1':'2+1'}. 3 . np.nan
2 . {'int_2':'7-0'}. np.nan . 7
Is there a way to update specific table fields in bulk, so that there is no need to execute one query at a time?
I'm parsing the strings and running the update statement from Python. The database is stored on Redshift.
As mentioned, consider pure SQL and avoid iterating through billions of rows by pushing the Pandas data frame to Postgres as a staging table and then run one single UPDATE across both tables. With SQLAlchemy you can use DataFrame.to_sql to create a table replica of data frame. Even add an index of the join field, id, and drop the very large staging table at end.
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://myuser:mypwd!#myhost/mydatabase")
# PUSH TO POSTGRES (SAME NAME AS DF)
clean_df.to_sql(name="clean_df", con=engine, if_exists="replace", index=False)
# SQL UPDATE (USING TRANSACTION)
with engine.begin() as conn:
sql = "CREATE INDEX idx_clean_df_id ON clean_df(id)"
conn.execute(sql)
sql = """UPDATE tab_tab t
SET t.clean_int_1 = c.int1,
t.clean_int_2 = c.int2,
t.updated_date = GETDATE()
FROM clean_df c
WHERE c.id = t.id
"""
conn.execute(sql)
sql = "DROP TABLE IF EXISTS clean_df"
conn.execute(sql)
engine.dispose()
This is my query using code found perusing this site:
query="""SELECT Family
FROM Table2
INNER JOIN Table1 ON Table1.idSequence=Table2.idSequence
WHERE (Table1.Chromosome, Table1.hg19_coordinate) IN ({seq})
""".format(seq=','.join(['?']*len(matchIds_list)))
matchIds_list is a list of tuples in (?,?) format.
It works if I just ask for one condition (ie just Table1.Chromosome as oppose to both Chromosome and hg_coordinate) and matchIds_list is just a simple list of single values, but I don't know how to get it to work with a composite key or both columns.
Since you're running SQLite 3.7.17, I'd recommend to just use a temporary table.
Create and populate your temporary table.
cursor.executescript("""
CREATE TEMP TABLE control_list (
Chromosome TEXT NOT NULL,
hg19_coordinate TEXT NOT NULL
);
CREATE INDEX control_list_idx ON control_list (Chromosome, hg19_coordinate);
""")
cursor.executemany("""
INSERT INTO control_list (Chromosome, hg19_coordinate)
VALUES (?, ?)
""", matchIds_list)
Just constrain your query to the control list temporary table.
SELECT Family
FROM Table2
INNER JOIN Table1
ON Table1.idSequence = Table2.idSequence
-- Constrain to control_list.
WHERE EXISTS (
SELECT *
FROM control_list
WHERE control_list.Chromosome = Table1.Chromosome
AND control_list.hg19_coordinate = Table1.hg19_coordinate
)
And finally perform your query (there's no need to format this one).
cursor.execute(query)
# Remove the temporary table since we're done with it.
cursor.execute("""
DROP TABLE control_list;
""")
Short Query (requires SQLite 3.15): You actually almost had it. You need to make the IN ({seq}) a subquery
expression.
SELECT Family
FROM Table2
INNER JOIN Table1
ON Table1.idSequence = Table2.idSequence
WHERE (Table1.Chromosome, Table1.hg19_coordinate) IN (VALUES {seq});
Long Query (requires SQLite 3.8.3): It looks a little complicated, but it's pretty straight forward. Put your
control list into a sub-select, and then constrain that main select by the control
list.
SELECT Family
FROM Table2
INNER JOIN Table1
ON Table1.idSequence = Table2.idSequence
-- Constrain to control_list.
WHERE EXISTS (
SELECT *
FROM (
SELECT
-- Name the columns (must match order in tuples).
"" AS Chromosome,
":1" AS hg19_coordinate
FROM (
-- Get control list.
VALUES {seq}
) AS control_values
) AS control_list
-- Constrain Table1 to control_list.
WHERE control_list.Chromosome = Table1.Chromosome
AND control_list.hg19_coordinate = Table1.hg19_coordinate
)
Regardless of which query you use, when formatting the SQL replace {seq} with (?,?) for each compsite
key instead of just ?.
query = " ... ".format(seq=','.join(['(?,?)']*len(matchIds_list)))
And finally flatten matchIds_list when you execute the query because it is a list of tuples.
import itertools
cursor.execute(query, list(itertools.chain.from_iterable(matchIds_list)))
I am new with MySQL and I need some help please. I am using MySQL connector to write scripts.
I have database contain 7K tables and I am trying to select some values from some of these tables
cursor.execute( "SELECT SUM(VOLUME) FROM stat_20030103 WHERE company ='Apple'")
for (Volume,) in cursor:
print(Volume)
This works for one table e.g (stats_20030103). However I want to sum all volume of all tables .startwith (stats_2016) where the company name is Apple. How I can loop over my tables?
I'm not an expert in MySQL, but here is something quick and simple in python:
# Get all the tables starting with "stats_2016" and store them
cursor.execute("SHOW TABLES LIKE 'stats_2016%'")
tables = [v for (v, ) in cursor]
# Iterate over all tables, store the volumes sum
all_volumes = list()
for t in tables:
cursor.execute("SELECT SUM(VOLUME) FROM %s WHERE company = 'Apple'" % t)
# Get the first row as is the sum, or 0 if None rows found
all_volumes.append(cursor.fetchone()[0] or 0)
# Return the sum of all volumes
print(sum(all_volumes))
You can probably use select * from information_schema.tables to get all tables name into your query.
I'd try to left-join.
SELECT tables.*, stat.company, SUM(stat.volume) AS volume
FROM information_schema.tables AS tables LEFT JOIN mydb.stat_20030103 AS stat
WHERE tables.schema = "mydb" GROUP BY stat.company;
This will give you all results at once. Maybe MySQL doesn't support joining from metatables, in which case you might select it into a temporary table.
CREATE TEMPORARY TABLE mydb.tables SELECT name FROM information_schema.tables WHERE schema = "mydb"
See MySQL doc on information_schema.table.
I need to update a field with a value from another table in MySQL, using Python Connector (not that important though). I need to select a value from one table based on a matching criteria and insert the extracted column back into the previous table based on the same matching criteria.
I have the following, which doesn't work of cource.
for match_field in list:
cursor_importer.execute(UPDATE table1 SET table1_field =
(SELECT field_new FROM table2 WHERE match_field = %s)
WHERE match_field = %s LIMIT 1,
(match_field, match_field ))
You can use UPDATE with JOINS.
Below is an example in MySQL:
UPDATE table1 a JOIN table2 b ON a.match_field = b.match_field
SET a.table1_field = b.field_new
WHERE a.match_field = 'filter criteria'