1. This code - raw SQL - takes 2.6 sec:
all_feeds = Feed.objects.all()
for feed in all_feeds:
    q_sku = MainData.objects.raw(
        f'SELECT id as id, COUNT(DISTINCT sku) AS "count" FROM imports_maindata WHERE feed_id={feed.id}')
    q_loc = MainData.objects.raw(
        f'SELECT id as id, COUNT(DISTINCT locale) AS "count" FROM imports_maindata WHERE feed_id={feed.id}')
    q_spec = MapSpecs.objects.raw(
        f'SELECT id as id, COUNT(DISTINCT f_feat_id) AS "count" FROM imports_mapspecs WHERE feed_id={feed.id}')
    q_mapped = MapSpecs.objects.raw(
        f'SELECT id as id, COUNT(DISTINCT ic_feat_id) AS "count" FROM imports_mapspecs WHERE feed_id={feed.id} AND ic_feat_id IS NOT NULL')
    q_date = MainData.objects.raw(
        f'SELECT id as id, MAX(last_update) AS "last_date" FROM imports_maindata WHERE feed_id={feed.id}')
    print(q_sku[0].count, q_loc[0].count, q_spec[0].count, q_mapped[0].count, q_date[0].last_date)
2. While this one - ORM only - takes 3.1 sec:
f = Feed.objects.all()
for feed in f:
    prods_count = f.filter(maindata__feed_id=feed.id).values('maindata__sku').distinct().count()
    locales_count = f.filter(maindata__feed_id=feed.id).values_list('maindata__locale', flat=True).distinct()
    total_specs = f.filter(mapspecs__feed_id=feed.id).count()
    mapped_specs = f.filter(mapspecs__feed_id=feed.id, mapspecs__ic_feat_id__isnull=False).all().count()
    try:
        last_update = f.filter(maindata__feed_id=feed.id).values('maindata__last_update').distinct().order_by('-maindata__last_update').first()['maindata__last_update']
    except TypeError:
        pass
3. And this one, using the ORM but a different approach, returns in 3.1-3.2 sec:
f = Feed.objects.all()
prods = f.annotate(num_prods=Count('maindata__sku', distinct=True))
locs = f.annotate(num_locs=Count('maindata__locale', distinct=True))
total_sp_count = f.annotate(num_books=Count('mapspecs__f_feat_id', distinct=True))
total_sp_mapped = f.filter(mapspecs__ic_feat_id__isnull=False).annotate(
    num_books=Count('mapspecs__ic_feat_id', distinct=True))
dates = f.annotate(num_books=Max('maindata__last_update'))
So how come the Django ORM is so inefficient and slow? The timings are for a low number of rows in the DB (below 50K)... So it's not only slower than raw SQL, it also has a more confusing (and sometimes too vague) syntax. I guess some other Python frameworks should be considered...
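For reference, a single-queryset variant of the third snippet is sketched below; it assumes the same Feed / MainData / MapSpecs models and is untested, so whether it actually reduces the total time would need measuring. Annotating across two reverse relations joins both tables at once, which is why distinct=True matters for the counts.

from django.db.models import Count, Max

feeds = Feed.objects.annotate(
    num_prods=Count('maindata__sku', distinct=True),
    num_locs=Count('maindata__locale', distinct=True),
    num_specs=Count('mapspecs__f_feat_id', distinct=True),
    num_mapped=Count('mapspecs__ic_feat_id', distinct=True),  # COUNT ignores NULLs, so only mapped specs are counted
    last_date=Max('maindata__last_update'),
)
for feed in feeds:
    print(feed.num_prods, feed.num_locs, feed.num_specs, feed.num_mapped, feed.last_date)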
df_table contains metadata for a list of columns, with information like table_schema, table_name and column_name.
For each column in column_name, I would like to calculate the entropy (bits), the shannon_entropy and the count of values.
The following code works fine in Python, but it is not parallel.
I wonder if there is a more efficient way to run this:
job_config_True = bigquery.QueryJobConfig(use_legacy_sql=True)
job_config_False = bigquery.QueryJobConfig(use_legacy_sql=False)
for i, j in df_table[df_table['shannon_entropy'].isna()].iterrows():
    table_schema = j['table_schema']
    table_name = j['table_name']
    column_name = j['column_name']
    try:
        q1 = f'''select -sum(p*log2(p)) as shannon_entropy from (
        select RATIO_TO_REPORT(c) over() p from (
        select {column_name}, count(*) c FROM {table_schema}.{table_name} group by 1))
        '''
        query_job = bqclient.query(q1, job_config=job_config_True)  # Make an API request.
        shannon_entropy = query_job.result().to_dataframe()['shannon_entropy'][0]
    except Exception:
        shannon_entropy = np.nan
    q = f'''UPDATE `myproject.info_tabels_all` t1
    set t1.entropy = t2.entropy, t1.values = t2.total, t1.unique = t2.distinct_total, t1.shannon_entropy = {shannon_entropy}
    from (
    SELECT
        LOG(2, COUNT(DISTINCT {column_name})) as entropy,
        count({column_name}) as total,
        COUNT(DISTINCT {column_name}) as distinct_total
    FROM `datateam-248616.{table_schema}.{table_name}` ) t2
    where table_schema = '{table_schema}' and table_name = '{table_name}' and column_name = '{column_name}'
    '''
    print(table_name, shannon_entropy)
    query_job = bqclient.query(q, job_config=job_config_False)  # Make an API request.
I used this code in the process:
BigQuery: compute entropy of a column
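As for making it parallel: the per-column entropy SELECTs are independent of each other, so one option (a sketch only, not tested against this dataset) is to submit them from a thread pool and fill df_table as the results arrive. It reuses bqclient, job_config_True and df_table from above, assumes the shared client may be called from several threads, and the helper name entropy_for_column is made up; the UPDATE statements are left sequential.

from concurrent.futures import ThreadPoolExecutor, as_completed

def entropy_for_column(row):
    # Run the shannon_entropy query for one (table_schema, table_name, column_name) row.
    q1 = f'''select -sum(p*log2(p)) as shannon_entropy from (
    select RATIO_TO_REPORT(c) over() p from (
    select {row['column_name']}, count(*) c FROM {row['table_schema']}.{row['table_name']} group by 1))
    '''
    query_job = bqclient.query(q1, job_config=job_config_True)
    return row.name, query_job.result().to_dataframe()['shannon_entropy'][0]

rows = [row for _, row in df_table[df_table['shannon_entropy'].isna()].iterrows()]
with ThreadPoolExecutor(max_workers=8) as pool:  # pool size is a guess, tune as needed
    futures = [pool.submit(entropy_for_column, row) for row in rows]
    for future in as_completed(futures):
        try:
            idx, entropy = future.result()
            df_table.loc[idx, 'shannon_entropy'] = entropy
        except Exception:
            pass  # leave NaN for any column whose query failed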
The function below takes the parameters endTime, startTime, list1 and column_filter; I am trying to run a query with the WHERE clause conditions parameterized.
endT = endTime
startT = startTime
myList = ",".join("'" + str(i) + "'" for i in list1)
queryArgs = {'db': devDB,
'schema': dbo,
'table': table_xyz,
'columns': ','.join(column_filter)}
query = '''
WITH TIME_SERIES AS
(SELECT ROW_NUMBER() OVER (PARTITION BY LocId ORDER BY Created_Time DESC) RANK, {columns}
from {schema}.{table}
WHERE s_no in ? AND
StartTime >= ? AND
EndTime <= ? )
SELECT {columns} FROM TIME_SERIES WHERE RANK = 1
'''.format(**queryArgs)
args = (myList, startT, endT)
return self.read(query, args)
Below is my read method, which connects to the DB to fetch records; a condition is also added to check whether the call is parameterized or not.
def read(self, query, parameterValues=None):
    cursor = self.connect(cursor=True)
    if parameterValues is not None:
        rows = cursor.execute(query, parameterValues)
    else:
        rows = cursor.execute(query)
    df = pd.DataFrame.from_records(rows.fetchall())
    if len(df.columns) > 0:
        df.columns = [x[0] for x in cursor.description]
    cursor.close()
    return df
The query args are getting picked up, but the parameterized values are not. In my case, the read method is called with the parameter values (myList, startT, endT) as a tuple. The WHERE clause of the query remains unchanged (the parameters are not substituted for the ?), and as a result I am not able to fetch any records. Can you point out where I might be going wrong?
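One likely cause (assuming a DB-API style driver such as pyodbc behind self.read): each ? placeholder binds exactly one value, so passing the whole comma-joined myList string to s_no in ? makes the driver bind it as a single string literal instead of a list, and the IN test then matches nothing. A sketch of the usual workaround is to build one placeholder per element and pass the values individually:

placeholders = ', '.join('?' for _ in list1)  # one "?" per element of list1
query = '''
WITH TIME_SERIES AS
(SELECT ROW_NUMBER() OVER (PARTITION BY LocId ORDER BY Created_Time DESC) RANK, {columns}
from {schema}.{table}
WHERE s_no in ({placeholders}) AND
StartTime >= ? AND
EndTime <= ? )
SELECT {columns} FROM TIME_SERIES WHERE RANK = 1
'''.format(placeholders=placeholders, **queryArgs)
args = (*list1, startT, endT)  # values passed one by one, not pre-joined
return self.read(query, args)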
I have the following Python code:
params = {}
query = 'SELECT * FROM LOGS '
if(date_from and date_to):
    query += ' WHERE LOG_DATE BETWEEN TO_DATE(:date_start, "MM-DD-YYYY") AND LOG_DATE <= TO_DATE(:date_end, "MM-DD-YYYY")'
    params['date_start'] = date_from
    params['date_end'] = date_to
if(structure):
    query += ' AND STRUCTURE=:structure_val'
    params['structure_val'] = structure
if(status):
    query += ' AND STATUS =:status'
    params['status'] = status
cursor.execute(query, params)
Here I am conditionally adding the WHERE clause to the query. But there is an issue when I don't have values for the dates: the WHERE is never added, and the other conditions are appended with AND but without WHERE. If I always include the WHERE clause in the base query and there is no filter at all, the query is also wrong. Is there any better way to do this? I have been using Laravel for some time, and its query builder has a method when, which helps to add conditional where clauses. Is there anything like this in Python for cx_Oracle?
params = {}
query = 'SELECT * FROM LOGS'
query_conditions = []
if(date_from and date_to):
    query_conditions.append("LOG_DATE BETWEEN TO_DATE(:date_start, 'MM-DD-YYYY') AND TO_DATE(:date_end, 'MM-DD-YYYY')")
    params['date_start'] = date_from
    params['date_end'] = date_to
if(structure):
    query_conditions.append('STRUCTURE = :structure_val')
    params['structure_val'] = structure
if(status):
    query_conditions.append('STATUS = :status')
    params['status'] = status
if query_conditions:
    query += ' WHERE ' + ' AND '.join(query_conditions)
cursor.execute(query, params)
Add the conditions to a list and join them with AND, prefixing WHERE only when the list is non-empty.
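For illustration (made-up input), with the dates absent and only structure and status set, the code above builds:

# query  -> "SELECT * FROM LOGS WHERE STRUCTURE = :structure_val AND STATUS = :status"
# params -> {'structure_val': structure, 'status': status}
cursor.execute(query, params)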
I am interested in finding the most efficient manner to query the following:
For a list of table names, return the table name if it contains at least one record that meets the conditions.
Essentially, something similar to the following Python code in a single query:
dfs = [pd.read_sql('SELECT name FROM {} WHERE a=1 AND b=2'.format(table), engine) for table in tables]
tables = [table for table, df in zip(tables, dfs) if not df.empty]
Is this possible in MySQL?
Assuming you trust the table names in tables not to contain any surprises leading to SQL injection, you could devise something like:
from sqlalchemy import text
selects = [f'SELECT :table_{i} FROM {table} WHERE a = 1 AND b = 2'
for i, table in enumerate(tables)]
stmt = ' UNION '.join(selects)
stmt = text(stmt)
results = engine.execute(
stmt, {f'table_{i}': table for i, table in enumerate(tables)})
or you could use SQLAlchemy constructs to build the same query safely:
from sqlalchemy import table, column, union, and_, select, Integer, literal
tbls = [table(name,
column('a', Integer),
column('b', Integer)) for name in tables]
stmt = union(*[select([literal(name).label('name')]).
select_from(tbl).
where(and_(tbl.c.a == 1, tbl.c.b == 2))
for tbl, name in zip(tbls, tables)])
results = engine.execute(stmt)
You can use a UNION of queries that search each table.
(SELECT 'table1' AS table_name
FROM table1
WHERE a = 1 AND b = 2
LIMIT 1)
UNION
(SELECT 'table2' AS table_name
FROM table2
WHERE a = 1 AND b = 2
LIMIT 1)
UNION
(SELECT 'table3' AS table_name
FROM table3
WHERE a = 1 AND b = 2
LIMIT 1)
...
When I run the following query, it gives me the first id. I want the last id and the data in that row; how can I do it?
cursor = mysql.connection.cursor()
sorgu = "Select * From site2"
result = cursor.execute(sorgu)
if result > 0:
    articles = cursor.fetchone()
    return render_template("site.html", articles=articles)
else:
    return render_template("site.html")
First retrieve the max ID from the table (assuming your IDs increment upward as rows are added, so the largest one is the last one), then use it in your final query:
sorgu = "SELECT * FROM site2 WHERE ID = (SELECT MAX(ID) FROM site2)"
If you have a timestamp column in your database, you could use MAX(timestamp) to get the latest row as well.
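A minimal sketch of that variant, plugged into the code above; the column name created_at is an assumption:

# "created_at" is an assumed timestamp column; fetch the most recently inserted row by it.
sorgu = "SELECT * FROM site2 WHERE created_at = (SELECT MAX(created_at) FROM site2)"
result = cursor.execute(sorgu)
articles = cursor.fetchone() if result > 0 else None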