I have a table with 4 columns (1 PK) from which I need to select 30 rows.
For these rows, two columns (col. A and B) must also exist in another table (8 columns, 1 PK, 2 of which are A and B).
The second table is large, containing millions of records, and it's enough for me to know that even a single row exists there with the values of col. A and B from the first table.
I am using the code below:
query = db.Session.query(db.Table_1).filter(
    exists().
    where(db.Table_2.col_a == db.Table_1.col_a).
    where(db.Table_2.col_b == db.Table_1.col_b)
).limit(30).all()
This query gets me the results I desire, however I'm afraid it might be a bit slow, since it does not apply a limit to the exists() subquery, nor does it do a SELECT 1 but a SELECT *.
exists() does not accept a .limit(1).
How can I put a limit on exists() so it does not scan the whole table, hence making this query run faster?
I need n rows from Table_1 whose two columns exist in a record in Table_2.
Thank you
You can do the "select 1" thing using a more explicit form, as mentioned here, that is:
exists([1]).where(...)
However, while I've been a longtime diehard "select 1" kind of guy, I've since learned that the usage of "1" vs. "*" for performance is now a myth (more / more).
exists() is also a wrapper around select(), so you can get a limit() by constructing the select() first:
s = select([1]).where(
    table1.c.col_a == table2.c.col_a
).where(
    table1.c.col_b == table2.c.col_b
).limit(30)
s = exists(s)
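A usage sketch, assuming the model names from the question (using the ORM column attributes db.Table_1.col_a etc. in place of the Core table1.c columns above):
# assumes: from sqlalchemy import select, exists
s = exists(
    select([1]).
    where(db.Table_2.col_a == db.Table_1.col_a).
    where(db.Table_2.col_b == db.Table_1.col_b).
    limit(30)
)
query = db.Session.query(db.Table_1).filter(s).limit(30).all()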
query = select([db.Table_1])
query = query.where(
    and_(
        db.Table_2.col_a == db.Table_1.col_a,
        db.Table_2.col_b == db.Table_1.col_b
    )
).limit(30)
result = session.execute(query)
I have an 11 column x 13,470,621 row PyTables table. The first column of the table contains a unique identifier for each row (this identifier is always only present once in the table).
This is how I select rows from the table at the moment:
my_annotations_table = h5r.root.annotations
# Loop through table and get rows that match gene identifiers (column labeled gene_id).
for record in my_annotations_table.where("(gene_id == b'gene_id_36624') | (gene_id == b'gene_id_14701') | (gene_id == b'gene_id_14702')"):
    # Do something with the data.
    ...
Now this works fine with small datasets, but I will need to routinely perform queries in which I have many thousands of unique identifiers to match in the table's gene_id column. For these larger queries, the query string quickly gets very large and I get an exception:
File "/path/to/my/software/python/python-3.9.0/lib/python3.9/site-packages/tables/table.py", line 1189, in _required_expr_vars
cexpr = compile(expression, '<string>', 'eval')
RecursionError: maximum recursion depth exceeded during compilation
I've looked at this question (What is the PyTables counterpart of a SQL query "SELECT col2 FROM table WHERE col1 IN (val1, val2, val3...)"?), which is somewhat similar to mine, but was not satisfactory.
I come from an R background, where we often do these kinds of queries (i.e. my_data_frame[my_data_frame$gene_id %in% c("gene_id_1234", "gene_id_1235"),]), and was wondering if there is a comparable solution that I could use with PyTables.
Thanks very much,
Another approach to consider is combining 2 functions: Table.get_where_list() with Table.read_coordinates()
Table.get_where_list(): gets the row coordinates fulfilling the given condition.
Table.read_coordinates(): Gets a set of rows given their coordinates (in a list), and returns as a (record) array.
The code would look something like this:
my_annotations_table = h5r.root.annotations
gene_name_list = ['gene_id_36624', 'gene_id_14701', 'gene_id_14702']
# Loop through gene names and get rows that match gene identifiers (column labeled gene_id)
gene_row_list = []
for gene_name in gene_name_list:
    gene_rows = my_annotations_table.get_where_list("gene_id == gene_name")
    gene_row_list.extend(gene_rows)
# Retrieve all of the data in one call
gene_data_arr = my_annotations_table.read_coordinates(gene_row_list)
Okay, I managed to make some satisfactory improvements on this.
1st: optimize the table (with the help of the documentation: https://www.pytables.org/usersguide/optimization.html).
Create the table. Make sure to specify the expectedrows=<int> argument, as it has the potential to increase the query speed.
# tb comes from: import tables as tb
table = h5w.create_table("/", 'annotations',
                         DataDescr, "Annotation table unindexed",
                         expectedrows=self._number_of_genes,
                         filters=tb.Filters(complevel=9, complib='blosc'))
I also modified the input data so that the gene_id_12345 fields are simple integers (gene_id_12345 becomes 12345).
Once the table is populated with its 13,470,621 entries (i.e. rows),
I created a complete sorted index based on the gene_id column (Column.create_csindex()) and sorted it.
table.cols.gene_id.create_csindex()
table.copy(overwrite=True, sortby='gene_id', newname="Annotation table", checkCSI=True)
# Just make sure that the index is usable. Will print an empty set if not.
print(table.will_query_use_indexing('(gene_id == 57403)'))
2nd: the table is optimized, but I still couldn't query thousands of gene_ids at a time, so I simply split them into chunks of 31 gene_ids (yes, 31 was the absolute maximum; 32 was apparently too much). A sketch of the chunking is below.
I did not perform benchmarks, but querying ~8000 gene_ids now takes approximately 10 seconds, which is acceptable for my needs.
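A rough sketch of that chunking (the helper name and the use of Table.read_where() here are my own assumptions; the gene_ids are the integer IDs mentioned above):
import numpy as np

CHUNK_SIZE = 31  # largest OR-chain the expression compiler accepted here

def query_gene_ids(table, gene_ids, chunk_size=CHUNK_SIZE):
    results = []
    for start in range(0, len(gene_ids), chunk_size):
        chunk = gene_ids[start:start + chunk_size]
        # Build "(gene_id == 36624) | (gene_id == 14701) | ..." for this chunk only
        condition = " | ".join("(gene_id == %d)" % gid for gid in chunk)
        results.append(table.read_where(condition))
    return np.concatenate(results) if results else np.array([])

# e.g. rows = query_gene_ids(h5r.root.annotations, [36624, 14701, 14702])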
I want to get all the columns of a table with max(timestamp) and group by name.
What I have tried so far is:
normal_query = "Select max(timestamp) as time from table"
event_list = normal_query \
    .distinct(Table.name) \
    .filter_by(**filter_by_query) \
    .filter(*queries) \
    .group_by(*group_by_fields) \
    .order_by('').all()
The query I get:
SELECT DISTINCT ON (schema.table.name) , max(timestamp)....
This query basically returns two columns, name and timestamp.
Whereas the query I want is:
SELECT DISTINCT ON (schema.table.name) * from table order by ....
which returns all the columns in that table, which is the expected behavior, and I am able to get all the columns. How could I write this down in Python to get to that statement? Basically the asterisk is missing.
Can somebody help me?
What you seem to be after is the DISTINCT ON ... ORDER BY idiom in Postgresql for selecting greatest-n-per-group results (N = 1). So instead of grouping and aggregating, just:
event_list = Table.query.\
    distinct(Table.name).\
    filter_by(**filter_by_query).\
    filter(*queries).\
    order_by(Table.name, Table.timestamp.desc()).\
    all()
This will end up selecting rows "grouped" by name, having the greatest timestamp value.
You do not want to use the asterisk most of the time, not in your application code anyway, unless you're doing manual ad-hoc queries. The asterisk is basically "all columns from the FROM table/relation", which might then break your assumptions later, if you add columns, reorder them, and such.
In case you'd like to order the resulting rows based on timestamp in the final result, you can use for example Query.from_self() to turn the query to a subquery, and order in the enclosing query:
event_list = Table.query.\
    distinct(Table.name).\
    filter_by(**filter_by_query).\
    filter(*queries).\
    order_by(Table.name, Table.timestamp.desc()).\
    from_self().\
    order_by(Table.timestamp.desc()).\
    all()
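If you want to double-check the SQL such a query emits before calling .all(), one possibility (the dialect import and the stmt variable here are my own additions, not part of your code) is to compile the statement yourself:
from sqlalchemy.dialects import postgresql

stmt = Table.query.\
    distinct(Table.name).\
    order_by(Table.name, Table.timestamp.desc())
print(stmt.statement.compile(dialect=postgresql.dialect()))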
I have an SQLite 3 database which has the columns ID, name and time.
I have the last row, placed in a var LAST_PERSON using Python:
your_rank = "SELECT usr_name,time FROM rank WHERE ID = (SELECT MAX(ID) FROM rank)"
I also have a var ROW which loops through each row, sorted in order by time:
sql = "SELECT usr_name,time FROM rank ORDER BY time "
for row in cur.execute(sql):
I want to compare your_rank with the rows sorted by time and get that last person's rank.
I tried
for row in cur.execute(sql):
    sql_list.append(row)
    if (row is your_rank):
        this_is_your_rank = rank_number
    rank_number += 1
But I cannot use the if statement like this with SQLite 3 and I have not been able to find any solution to compare these. Can anyone give me a clue?
If you cannot, thanks for taking the time to read this.
You want to select count(ID) from rank where time < your_time or similar.
Looping over SQL results to find out what you want is clunky when you can just ask the database to give you the answer you want.
Edit:
Your first query, where you join the table to itself to get the user with the highest ID, can be:
SELECT MAX(ID),usr_name,time FROM rank
And you can combine them both together into "get the most recent user name, time, and their position" with:
SELECT
    MAX(ID), usr_name, time,
    (SELECT COUNT(ID)+1 FROM rank WHERE time < r.time) [Pos]
FROM
    rank r
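A minimal sketch of running that combined query from Python's sqlite3 module (the database file name is an assumption):
import sqlite3

conn = sqlite3.connect("rank.db")  # assumed file name
cur = conn.cursor()
cur.execute("""
    SELECT MAX(ID), usr_name, time,
           (SELECT COUNT(ID) + 1 FROM rank WHERE time < r.time) AS Pos
    FROM rank r
""")
last_id, usr_name, time_, pos = cur.fetchone()
print(usr_name, time_, pos)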
Edit again, OK. It's not clear what you mean by "I cannot use the if statement", but here's some speculation:
If you actually typed in row is your_rank, assuming you actually executed the your_rank SQL query and saved the result over the top in the same variable name, then it fails because is is a Python keyword for testing whether two things are the same thing (that is, one thing with two names). It does not test whether two separate things have the same value. == is the equality test.
It also might fail because the result of a SQL query is effectively a list of tuples. Each row is a tuple and, depending on what you did to put the result in your_rank, they won't ever match when compared.
This might work, if you want to keep the same approach:
last_user = cursor.execute('select max(id),usr_name,time from rank').fetchone()
last_user_rank = 1
for row in cursor.execute('select id,usr_name,time from rank order by time asc'):
    if last_user[2] > row[2]:
        last_user_rank += 1
    else:
        break
print(last_user, last_user_rank)
I am trying to query a table in an existing sqlite database. The data must first be subsetted as such, from a user input:
query(Data.num == input)
Then I want to find the max and min of another field, date, in this subset.
I have tried using func.min/max, as well as union, but received an error saying the columns do not match. One of the issues here is that func.min/max need to be used as query arguments, not in filter.
ids = session.query(Data).filter(Data.num == input)
q = session.query(func.max(Data.date),
                  func.min(Data.date))
ids.union(q).all()
ArgumentError: All selectables passed to CompoundSelect must have identical numbers of columns; select #1 has 12 columns, select #2 has 2
Similarly, if I use func.max and min separately, the error says #2 has 1 column.
I think seeing this query in SQL might help as well.
Thanks
The following solution works. You first need to set up the query, then filter the data down afterwards.
query = session.query(Data.num, func.min(Data.date),
                      func.max(Data.date), Data.date)
query = query.filter(Data.num == input)
query = query.all()
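As a variant of the same idea (names taken from the question; dropping the non-aggregate columns is my own simplification), you can also filter first and select only the aggregates, which avoids the union entirely:
from sqlalchemy import func

min_date, max_date = (
    session.query(func.min(Data.date), func.max(Data.date))
    .filter(Data.num == input)
    .one()
)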
Hi, I have a PostgreSQL table "TABLE1" with 2.7 million records. Each record has a field "FIELD1" that may be empty or may have data. I want a SELECT statement or method that a) returns the first 1000 results from TABLE1 with FIELD1 empty, and b) randomly picks one of those records to return to a Python variable. Help???
For selecting the first 1000 results you can use LIMIT in your query:
SELECT * FROM table1 WHERE field1 IS NULL ORDER BY id LIMIT 1000;
The result will be a list in Python, so you can use Python's random module to operate on the result list, for example:
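A minimal sketch with psycopg2 (the connection string and table/column names are assumptions):
import random
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # assumed connection details
cur = conn.cursor()
cur.execute("SELECT * FROM table1 WHERE field1 IS NULL ORDER BY id LIMIT 1000")
rows = cur.fetchall()
# Pick one of the fetched records at random (None if nothing matched)
random_row = random.choice(rows) if rows else None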
If performance is not a concern:
SELECT *
FROM (
SELECT *
FROM tbl
WHERE field1 IS NULL
ORDER BY id --?? unclear from question
LIMIT 1000
) sub
ORDER BY random()
LIMIT 1;
This returns 1 perfectly random row from the "first" 1000 empty rows.
"Empty" meaning NULL, and "first" meaning smallest id.
If performance is a concern, you need to be a lot more specific.
If your circumstances match, this related answer might be of help:
Best way to select random rows PostgreSQL