MySQL, join vs string parsing - python

Fair warning: this is a somewhat long question, so please bear with me. One of my projects (which I started very recently) had a table that looked like this:
name (varchar)
scores (text)
An example value looks like this:
['AAA', '[{"score": 3.0, "subject": "Algebra"}, {"score": 5.0, "subject": "geography"}]']
As you can see, the second field is the string representation of a JSON array.
In good faith, I redesigned this table into the following two tables:
table-name:
id - Int, auto_inc, primary_key
name - varchar
table-scores:
id - int, auto_inc, primary_key
subject - varchar
score - float
name - int, FK to table-name
I have the following code in my Python file to represent the tables (at this point I assume you are familiar with Python and SQLAlchemy, so I will skip the specific imports to keep it short):
Base = declarative_base()

class Name(Base):
    __tablename__ = "name_table"

    id = Column(Integer, primary_key=True)
    name = Column(String(255), index=True)

class Score(Base):
    __tablename__ = "score_table"

    id = Column(Integer, primary_key=True)
    subject = Column(String(255), index=True)
    score = Column(Float)
    # ForeignKey takes the table.column name, not the class name
    name = Column(ForeignKey('name_table.id'), nullable=False, index=True)
    Name = relationship(u'Name')
The first table has ~ 778284 rows whereas the second table has ~ 907214 rows.
After declaring the models and populating them with the initial data, I ran an experiment. The goal: find all the subjects whose score is > 5.0 for a given name (here, for a moment, please assume that name is unique across the DB), run the same query 100 times, and take the average to find out how long the query takes. This is what I am doing (please assume that session is a valid DB session obtained before calling this function):
def test_time():
    for i in range(0, 100):
        scores = session.query(Score, Name.name).join(Name).filter(Name.name == 'AAA').filter(Score.score > 5.0).all()
        array = []
        for score in scores:
            array.append((score[0].subject, score[0].score))
I am not doing anything with the array I build; I just call this function, which runs the query 100 times, and use default_timer from timeit to measure the elapsed time. These are the results of three runs:
Average - 0.10969632864
Average - 0.105748419762
Average - 0.105768380165
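The harness itself is omitted above; a minimal sketch of what it might look like, using default_timer from timeit as mentioned:

from timeit import default_timer

# Hypothetical harness: time one call of test_time() (100 queries)
# and divide by the query count to get the per-query average.
for run in range(3):
    start = default_timer()
    test_time()
    print('Average - %s' % ((default_timer() - start) / 100.0))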
Now, as I was curious, I created another quick-and-dirty Python file and declared the following class there:
class DirtyTable(Base):
    __tablename__ = "dirty_table"

    name = Column(String(255), primary_key=True)
    scores = Column(Text)
Then I created the following function to achieve the same goal, but this time reading the data from the second field, parsing it back into a Python list of dicts, looping over all the elements of that list, and adding to the array only those elements whose score value is > 5.0. Here it goes:
def dirty_timer():
    for i in range(0, 100):
        scores = session.query(DirtyTable).filter_by(name='AAA').all()
        for score in scores:
            x = json.loads(score.scores)
            array = []
            for member in x:
                if member['score'] > 5.0:
                    array.append((member['subject'], member['score']))
These are the times of three runs:
Average - 0.0288228917122
Average - 0.0296836185455
Average - 0.0298663306236
Am I missing something? Normalizing the DB (which, I believe, is all I tried to do by breaking the original table into two tables) gave me a worse result. How is that possible? What is wrong with my approach?
Please let me know your thoughts. Sorry for the long post, but I had to explain everything properly.
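For what it's worth, one diagnostic the timings above do not include is the query plan; a hypothetical sketch (assuming the same session and tables) that asks MySQL how it executes the join:

from sqlalchemy import text

# Hypothetical check: EXPLAIN the join to see whether the indexes are used.
plan = session.execute(text(
    "EXPLAIN "
    "SELECT score_table.subject, score_table.score "
    "FROM score_table JOIN name_table ON score_table.name = name_table.id "
    "WHERE name_table.name = :n AND score_table.score > :s"
), {'n': 'AAA', 's': 5.0}).fetchall()
for row in plan:
    print(row)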

Related

SQLAlchemy: partial unique constraint where a field has a certain value

In my Flask project I need a table with a unique constraint on a column if the values in another column are identical. So I am trying to do something like this:
if premiumuser_id = "a value I don't know in advance" then track_id=unique
This is similar to Creating partial unique index with sqlalchemy on Postgres, but I use SQLite (where partial indexes should also be possible: https://docs.sqlalchemy.org/en/13/dialects/sqlite.html?highlight=partial%20indexes#partial-indexes) and the condition is different.
So far my code looks like this:
class Queue(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    track_id = db.Column(db.Integer)
    premiumuser_id = db.Column(
        db.Integer, db.ForeignKey("premium_user.id"), nullable=False
    )
    # __table_args__ must be a tuple
    __table_args__ = (
        db.Index(
            "idx_partially_unique_track",
            "track_id",
            unique=True,
            sqlite_where="and here I'm lost",
        ),
    )
All examples I've found operate with boolean or fixed values. What should the syntax of sqlite_where look like for the condition premiumuser_id = "a value I don't know in advance"?
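For reference, sqlite_where takes a SQL expression rather than a string placeholder, and a partial index condition must be a constant known when the index is created, so a truly dynamic value cannot be expressed this way. A minimal sketch with an assumed fixed value:

__table_args__ = (
    db.Index(
        "idx_partially_unique_track",
        "track_id",
        unique=True,
        # Assumed example: only rows belonging to premium user 1 are constrained.
        sqlite_where=db.text("premiumuser_id = 1"),
    ),
)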

How to return specific dictionary keys from within a nested list from a jsonb column in sqlalchemy

I am attempting to return some named columns from a jsonb data set that is stored with PostgreSQL.
I am able to run a raw query that meets my needs directly, however I am trying to run the query utilising SQLAlchemy, in order to ensure that my code is 'pythonic' and easy to read.
The query that returns the correct result (two columns) is:
SELECT
tmp.item->>'id',
tmp.item->>'name'
FROM (SELECT jsonb_array_elements(t.data -> 'users') AS item FROM tpeople t) as tmp
Example JSON (each user has 20+ fields):
{ "results":247, "users": [
{"id":"202","regdate":"2015-12-01","name":"Bob Testing"},
{"id":"87","regdate":"2014-12-12","name":"Sally Testing"},
{"id":"811", etc etc}
...
]}
The table is simple enough, with a PK, a datetime of the JSON extraction, and the jsonb column for the extract:
CREATE TABLE tpeople
(
    record_id bigint NOT NULL DEFAULT nextval('"tpeople_record_id_seq"'::regclass),
    scrape_time timestamp without time zone NOT NULL,
    data jsonb NOT NULL,
    CONSTRAINT "tpeople_pkey" PRIMARY KEY (record_id)
);
Additionally I have a People Class that looks as follows:
class people(Base):
    __tablename__ = 'tpeople'

    record_id = Column(BigInteger, primary_key=True, server_default=text("nextval('\"tpeople_record_id_seq\"'::regclass)"))
    scrape_time = Column(DateTime, nullable=False)
    data = Column(JSONB(astext_type=Text()), nullable=False)
Presently my code to return the two columns looks like this:
from db.db_conn import get_session  # generic connector for my db
from model.models import people
from sqlalchemy import func

sess = get_session()
sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
test = sess.query(sub.c.item).select_entity_from(sub).all()
SQLAlchemy generates the following SQL:
SELECT anon_1.item AS anon_1_item
FROM (SELECT jsonb_array_elements(tpeople.data -> %(data_1)s) AS item
FROM tpeople) AS anon_1
{'data_1': 'users'}
But nothing I do seems to let me get only certain keys within the item itself, the way I can in raw SQL. Some of the approaches I have tried are as follows (they all error out):
test = sess.query("sub.item.id").select_entity_from(sub).all()
test = sess.query(sub.item.["id"]).select_entity_from(sub).all()
aas = func.jsonb_to_recordset(people.data["users"])
res = sess.query("id").select_from(aas).all()
sub = select(func.jsonb_array_elements(people.data["users"]).label("item"))
Presently I can extract the columns I need in a simple for loop, but this seems like a hacky way to do it, and I'm sure there is something dead obvious I'm missing.
for row in test:
    print(row.item['id'])
I searched for a few hours and eventually found someone who did this accidentally while trying to get another result.
sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
tmp = sub.c.item.op('->>')('id')
tmp2 = sub.c.item.op('->>')('name')
test = sess.query(tmp, tmp2).all()
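If nicer names are wanted on the result rows, the same expressions can be labelled (a small sketch building on the query above):

sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
test = sess.query(
    sub.c.item.op('->>')('id').label('id'),
    sub.c.item.op('->>')('name').label('name'),
).all()
for row in test:
    print(row.id, row.name)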

Identify what values in a list doesn't exist in a Table column using SQLAlchemy

I have a list cities = ['Rome', 'Barcelona', 'Budapest', 'Ljubljana']
Then I have a SQLAlchemy model as follows:
class Fly(Base):
    __tablename__ = 'fly'

    pkid = Column('pkid', INTEGER(unsigned=True), primary_key=True, nullable=False)
    city = Column('city', VARCHAR(45), unique=True, nullable=False)
    country = Column('country', VARCHAR(45))
    flight_no = Column('Flight', VARCHAR(45))
I need to check whether ALL the values in the given cities list exist in the fly table, using SQLAlchemy. Return true only if ALL the cities exist in the table. If even a single city doesn't exist, I need to return false along with the list of cities that don't exist. How can I do that? Any ideas/hints/suggestions? I'm using MySQL.
One way would be to create a (temporary) relation based on the given list and take the set difference between it and the cities from the fly table. In other words, create a union of the values from the list¹:
from sqlalchemy import exists, literal, select, union

cities_union = union(*[select([literal(v)]) for v in cities])
Then take the difference:
sq = cities_union.select().except_(select([Fly.city]))
and check that no rows are left after the difference:
res = session.query(~exists(sq)).scalar()
For the list of cities missing from the fly table, omit the (NOT) EXISTS and fetch the rows instead:
res = session.execute(sq).fetchall()
¹ Other database vendors may offer alternative methods for producing relations from arrays, such as PostgreSQL and its unnest().
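Putting the pieces together, a sketch of the whole check as a single function (assuming the Fly model and session from above):

from sqlalchemy import exists, literal, select, union

def all_cities_exist(session, cities):
    # Relation of the candidate city names.
    cities_union = union(*[select([literal(v)]) for v in cities])
    # Candidate names minus the names already present in fly.
    sq = cities_union.select().except_(select([Fly.city]))
    missing = [row[0] for row in session.execute(sq)]
    return len(missing) == 0, missing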

SQLAlchemy ignoring specific fields in a query

I am using SQLAlchemy with Flask to talk to a Postgres DB. I have a Customer model with a date_of_birth field defined like this:
class Customer(Base):
    __tablename__ = 'customer'

    id = Column(Integer, primary_key=True)
    date_of_birth = Column(Date)
Now, I am trying to filter these by a minimum age like this:
q.filter(
    date.today().year
    - extract('year', Customer.date_of_birth)
    - cast(
        (today.month, today.day) < (extract('month', Customer.date_of_birth),
                                    extract('day', Customer.date_of_birth)),
        Integer,
    )
    >= 5
)
But the generated SQL seems to ignore the day part and replace it with a constant. It looks like this:
SELECT customer.date_of_birth AS customer_date_of_birth
FROM customer
WHERE (2017 - EXTRACT(year FROM customer.date_of_birth)) - CAST(EXTRACT(month FROM customer.date_of_birth) > 2 AS INTEGER) >= 5
The generated SQL is exactly the same when I remove the day part from the query. Why is SQLAlchemy ignoring it?
This is because you're comparing two tuples:
(today.month, today.day) < (extract('month', Customer.date_of_birth), extract('day', Customer.date_of_birth))
The way tuples compare is: compare the first elements, and if they're not equal, return that result; otherwise return the comparison of the second elements. So, in your case, it's the same as comparing only the first elements.
Instead of two plain tuples, you should compare using the tuple_ construct, like this:
(today.month, today.day) < tuple_(extract('month', Customer.date_of_birth), extract('day', Customer.date_of_birth))
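For completeness, a sketch of the full corrected filter (assuming from sqlalchemy import tuple_ and that today = date.today() as in the question):

from datetime import date
from sqlalchemy import Integer, cast, extract, tuple_

today = date.today()

q = q.filter(
    today.year
    - extract('year', Customer.date_of_birth)
    - cast(
        (today.month, today.day) < tuple_(extract('month', Customer.date_of_birth),
                                          extract('day', Customer.date_of_birth)),
        Integer,
    )
    >= 5  # minimum age from the question
)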

How can I prevent sqlalchemy from prefixing the column names of a CTE?

Consider the following query codified via SQLAlchemy.
# Create a CTE that performs a join and gets some values
x_cte = session.query(SomeTable.col1,
                      OtherTable.col5) \
    .select_from(SomeTable) \
    .join(OtherTable, SomeTable.col2 == OtherTable.col3) \
    .filter(OtherTable.col6 == 34) \
    .cte(name='x')
# Create a subquery that splits the CTE based on the value of col1
# and computes the quartile for positive col1, assigning a dummy
# "quartile" for negative and zero col1
# (note: columns of a CTE are accessed via .c)
subquery = session.query(x_cte,
                         literal('-1', sqlalchemy.INTEGER).label('quartile')) \
    .filter(x_cte.c.col1 <= 0) \
    .union_all(session.query(x_cte,
                             sqlalchemy.func.ntile(4).over(order_by=x_cte.c.col1).label('quartile'))
               .filter(x_cte.c.col1 > 0)) \
    .subquery()
# Compute some aggregate values for each quartile
result = session.query(sqlalchemy.func.avg(subquery.columns.x_col1),
                       sqlalchemy.func.avg(subquery.columns.x_col5),
                       subquery.columns.x_quartile) \
    .group_by(subquery.columns.x_quartile) \
    .all()
Sorry for the length, but this is similar to my real query. In my real code, I've given a more descriptive name to my CTE, and my CTE has far more columns for which I must compute the average. (It's also actually a weighted average weighted by a column in the CTE.)
The real "problem" is purely one of trying to keep my code more clear and shorter. (Yes, I know. This query is already a monster and hard to read, but the client insists on this data being available.) Notice that in the final query, I must refer to my columns as subquery.columns.x_[column name]; this is because SQLAlchemy is prefixing my column name with the CTE name. I would just like for SQLAlchemy to leave off my CTE's name when generating column names, but since I have many columns, I would prefer not to list them individually in my subquery. Leaving off the CTE name would make my column names (which are long enough on their own) shorter and slightly more readable; I can guarantee that the columns are unique. How can I do this?
Using Python 2.7.3 with SQLAlchemy 0.7.10.
You're not being too specific about what "x_" is here, but if that's the final result, use label() to give the result columns whatever name you want:
row = session.query(func.avg(foo).label('foo_avg'), func.avg(bar).label('bar_avg')).first()
foo_avg = row['foo_avg'] # indexed access
bar_avg = row.bar_avg # attribute access
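If there are many columns, the labels could also be applied in bulk rather than one by one (a sketch; it assumes the prefixed subquery from the question and that stripping the prefix leaves unique names):

# Hypothetical bulk relabel: strip the CTE prefix from every subquery column.
prefix = 'x_'
cols = [c.label(c.name[len(prefix):]) if c.name.startswith(prefix) else c
        for c in subquery.columns]
result = session.query(*cols).all()
# Columns are now reachable as row.col1, row.col5, row.quartile.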
Edit: I'm not able to reproduce the "x_" here. Here's a test:
from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class A(Base):
    __tablename__ = "a"

    id = Column(Integer, primary_key=True)
    x = Column(Integer)
    y = Column(Integer)

s = Session()

subq = s.query(A).cte(name='x')
subq2 = s.query(subq, (subq.c.x + subq.c.y)).filter(A.x == subq.c.x).subquery()

print s.query(A).join(subq2, A.id == subq2.c.id).\
    filter(subq2.c.x == A.x, subq2.c.y == A.y)
Above, you can see that I can refer to subq2.c.<colname> without issue; there is no "x_" prepended. If you can specify your SQLAlchemy version information and fill out your example fully, I can run it as is in order to reproduce your issue.
