I'm using SQLAlchemy (being relatively new both to it and SQL) and I want to get a list of all comments posted to a set of things, but I'm only interested in comments that have been posted since a certain date, and the date is different for each thing:
To clarify, here's what I'm doing now: I begin with a dictionary that maps the ID code of each thing I'm interested in to the date I'm interested in for that thing. I do a quick list comprehension to get a list of just the codes (thingCodes) and then do this query:
things = meta.Session.query(Thing) \
    .filter(Thing.objType.in_(['fooType', 'barType'])) \
    .filter(Thing.data.any(and_(Data.key == 'thingCode', Data.value.in_(thingCodes)))) \
    .all()
which returns a list of the thing objects (I do need those in addition to the comments). I then iterate through this list, and for each thing do a separate query:
comms = meta.Session.query(Thing) \
    .filter_by(objType='comment') \
    .filter(Thing.data.any(wc('thingCode', code))) \
    .filter(Thing.date >= date) \
    .order_by(Thing.date.desc()).all()
This works, but it seems horribly inefficient to be doing all these queries separately. So, I have 2 questions:
a) Rather than running the second query n times for an n-length list of things, is there a way I could do it in a single query while still returning a separate set of results for each ID (presumably in the form of a dictionary of IDs to lists)? I suppose I could do a single value.in_(listOfIds) filter to get one list of all the comments I want, then iterate through that and build the dictionary manually, but I have a feeling there's a way to use JOINs for this (see the sketch after these questions).
b) Am I over-optimizing here? Would I be better off with the second approach I just mentioned? And is it even that important to roll them all into a single transaction? The bulk of my experience is with Neo4j, which is pretty good at transparently nesting many small transactions into larger ones; does SQL/SQLAlchemy have similar functionality, or is it definitely in my interest to minimize the number of queries?
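For what it's worth, here's a rough sketch of the single-query idea from question a): fetch every matching comment at once, then bucket the results per thing code in Python. The code_of helper and the dateByCode dictionary are hypothetical stand-ins for however you extract a comment's thingCode and look up its cutoff date:

from collections import defaultdict

comms = meta.Session.query(Thing) \
    .filter_by(objType='comment') \
    .filter(Thing.data.any(and_(Data.key == 'thingCode', Data.value.in_(thingCodes)))) \
    .order_by(Thing.date.desc()) \
    .all()

commentsByCode = defaultdict(list)
for comment in comms:
    code = code_of(comment)               # hypothetical: extract this comment's thingCode
    if comment.date >= dateByCode[code]:  # per-thing cutoff date from the original dict
        commentsByCode[code].append(comment)

This trades the n separate queries for one round-trip plus an O(n) pass in Python, with the per-thing date filter moving out of SQL and into the loop.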
Related
In my app I need to run a fast query, but I don't know which of these is faster:
materials = Material.objects.only('name')
Or doing the filtering in the view:
materials = Material.objects.all()
and then using a for loop to show the list of values from the 'name' column.
I think the first is better, but is there a better way to do this?
It can't be done with .filter() because I need to show all of the fields in the row.
If you only want the names, you can use a .values_list(..):
materials = list(Material.objects.values_list('name', flat=True))
This will avoid wrapping the records in Material objects. That being said, unless some of the columns contain (very) large amounts of data, using .only(..) will not significantly speed up the process. Furthermore, software-design-wise it is often better to fetch Material objects, since that means you can define behavior in your Material model.
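As a small illustration of that last point, behavior defined on the model is only available when you fetch full Material instances; the field and method below are made up for the example:

from django.db import models

class Material(models.Model):
    name = models.CharField(max_length=100)

    def display_name(self):
        # hypothetical helper: available on fetched Material objects,
        # but not if you only pulled bare name strings with values_list
        return self.name.title()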
I'm trying to write a query in Peewee for MySQL and I'd like to do something similar to the solution offered here: Sort by order of values in a select statement "in" clause in mysql
That is, I'd like to select a table using a WHERE clause and an IN operator.
However, rather than having the results ordered based on values found in the table, I'd like them arranged in the same order they appear in the list I pass to the IN operator.
The alternative I'm using now is to simply loop through and accumulate on another list, but this takes much longer (~50-70% more time than just a simple query with an order_by).
Is there a way to do this more elegantly in Peewee?
Assuming that information isn't stored in your database, you probably want to just re-order the rows into a new list. This can be done in O(n) for the size of your result-set, so it should not be any slower than just iterating over the rows in the first place.
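A minimal sketch of that O(n) re-ordering, assuming a hypothetical Peewee model MyModel with an integer primary key:

ids = [7, 3, 9, 1]                               # the order you want back
rows = MyModel.select().where(MyModel.id.in_(ids))
by_id = {row.id: row for row in rows}            # one pass to index the results
ordered = [by_id[i] for i in ids if i in by_id]  # one pass to re-order them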
I'm not sure if this has been answered before, I didn't get anything on a quick search.
My table is built in a random order, but thereafter it is modified very rarely. I do frequent selects from the table and in each select I need to order the query by the same column. Now is there a way to sort a table permanently by a column so that it does not need to be done again for each select?
You can add an index on the column you want to sort by. The index keeps that ordering precomputed, so the engine can return the rows in order without re-sorting them for every select.
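For example, in SQLAlchemy (the model here is made up), marking the sort column with index=True creates such an index, and queries ordered by that column can then be served from it rather than sorted each time:

from sqlalchemy import Column, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Record(Base):  # hypothetical table
    __tablename__ = 'records'
    id = Column(Integer, primary_key=True)
    rank = Column(Integer, index=True)  # index=True builds a sorted index on rank

# session.query(Record).order_by(Record.rank) can now use the index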
You can have only one place where you define it, and re-use that for every query:
def base_query(session, what_for):
    return session.query(what_for).order_by(what_for.rank_or_whatever)
Expand that as needed, then for all but very complex queries you can use that like so:
some_query = base_query(session(), Employee).filter(Employee.feet > 3)
The resulting query will be ordered by Employee.rank_or_whatever. If you are always querying for the same model, you won't have to pass it as an argument, of course.
EDIT: If you could somehow define a "permanent" order on your table that the engine observes without being issued an ORDER BY, that would be an implementation feature specific to your RDBMS, and just a convenience. Internally, it makes no sense for a DBMS to be coerced into how it stores the data, since retrieving the data in a specific order is accomplished easily and efficiently by using an INDEX; forcing a specific storage order would probably decrease overall performance.
I currently have a function that writes one to four entries into a database every 12 hours. When certain conditions are met, the function is called again to write another 1-4 entries based on the previous ones. Since time isn't the only factor, I have to check whether or not the conditions are met, and because the entries all live in the same database I have to differentiate them based on the time they were posted (there is a DateTimeField in the code).
How could I achieve this? Is there a function built into Django that I just couldn't find, or would I have to look at a rather complicated solution?
As a sketch, I'd expect something like this:
latest = []
allData = myManyToManyField.objects.filter(externalId=2)  # .get() would return a single object, not an iterable
for data in allData:
    if data.Timestamp.checkIfLatest():  # checkIfLatest returns True/False
        latest.append(data)
or even better something like this (although I don't think that's implemented)
latest = myManyToManyField.objects.get.latest.filter(externalId=2)
The Django documentation is very good, especially with regard to querysets and model-layer functions; it's usually the first place you should look. It sounds like you want .latest(), but it's hard to tell with your requirements regarding conditions.
latest_entry = m2mfield.objects.latest('mydatefield')
if latest_entry.somefield:
    # do something
Or perhaps you wanted:
latest_entry = m2mfield.objects.filter(somefield=True).latest('mydatefield')
You might also be interested in order_by(), which will order the rows according to a field you specify. You could then iterate on all the m2m fields until you find the one that matches a condition.
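A rough sketch of that order_by() suggestion, with meets_condition standing in for whatever your actual conditions are:

for entry in m2mfield.objects.order_by('-mydatefield'):  # newest first
    if meets_condition(entry):  # hypothetical: your condition check
        # entry is the most recent row satisfying the conditions
        break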
But without more information on what these conditions are, it's hard to be more specific.
It's just a thought: we could keep an epoch-time field (the current time of the entry) in the database as a primary key, and compare it with the previous entries to differentiate them.
I have a very large dataset - millions of records - that I want to store in Python. I might be running on 32-bit machines so I want to keep the dataset down in the hundreds-of-MB range and not ballooning much larger than that.
These records represent an M:M relationship: two IDs (foo and bar) and some simple metadata like timestamps (baz).
Some foos have nearly all bars in them, and some bars have nearly all foos. But there are many bars that have almost no foos, and many foos that have almost no bars.
If this were a relational database, an M:M relationship would be modelled as a table with a compound key. You can, of course, comfortably search on either component key individually.
If you store the rows in a hashtable, however, you need to maintain three hashtables, since the compound key is hashed as a whole and you can't search on its component keys.
If you have some kind of sorted index, you can abuse lexical sorting to iterate over the first key in the compound key, and you need a second index for the other key; but it's less obvious to me which actual data structure in the standard Python collections this equates to.
I am considering a dict of foo where each value is automatically moved from tuple (a single row) to list (of row tuples) to dict depending on some thresholds, and another dict of bar where each is a single foo, or a list of foo.
Are there more efficient - speedwise and spacewise - ways of doing this? Any kind of numpy for indices or something?
(I want to store them in Python because I am having performance problems with databases, both SQL and NoSQL varieties: you end up being IPC-, memcpy- and serialisation-bound. That is another story; the key point is that I want to move the data into the application, rather than get recommendations to move it out of the application ;) )
Have you considered using a NoSQL database that runs in memory, such as Redis? Redis supports a decent number of familiar data structures.
I realize you don't want to move outside of the application, but not reinventing the wheel can save time and quite frankly it may be more efficient.
If you need to query the data in a flexible way, and maintain various relationships, I would suggest looking further into using a database, of which there are many options. How about an in-memory database, like SQLite (using ":memory:" as the file)? You're not really moving the data "outside" of your program, and you will have much more flexibility than with multi-layered dicts.
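A minimal sketch of that in-memory SQLite idea, with a made-up mm table holding the compound key and an extra index so both component keys can be queried efficiently:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE mm (foo INTEGER, bar INTEGER, baz REAL, PRIMARY KEY (foo, bar))')
conn.execute('CREATE INDEX mm_bar ON mm (bar)')  # the primary key already covers lookups by foo

conn.execute('INSERT INTO mm VALUES (?, ?, ?)', (1, 2, 1330000000.0))
bars_for_foo = conn.execute('SELECT bar, baz FROM mm WHERE foo = ?', (1,)).fetchall()
foos_for_bar = conn.execute('SELECT foo, baz FROM mm WHERE bar = ?', (2,)).fetchall()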
Redis is also an interesting alternative, as it has other data-structures to play with, rather than using a relational model with SQL.
What you describe sounds like a sparse matrix, where the foos are along one axis and the bars along the other one. Each non-empty cell represents a relationship between one foo and one bar, and contains the "simple metadata" you describe.
There are efficient sparse matrix packages for Python (scipy.sparse, PySparse) that you should look at. I found these two just by Googling "python sparse matrix".
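As a rough sketch with scipy.sparse (sizes and values here are made up), a dok_matrix allows incremental construction and per-cell access, with foos as rows and bars as columns:

from scipy.sparse import dok_matrix

n_foos, n_bars = 1000000, 1000000  # assumed ID ranges
m = dok_matrix((n_foos, n_bars))   # only non-empty cells take memory

m[3, 7] = 1330000000.0                      # foo 3 <-> bar 7, baz stored as the cell value
bars_for_foo_3 = m.tocsr()[3].nonzero()[1]  # column indices of all bars related to foo 3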
As to using a database, you claim that you've had performance problems. I'd like to suggest that you may not have chosen an optimal representation, but without more details on what your access patterns look like, and what database schema you used, it's awfully hard for anybody to contribute useful help. You might consider editing your post to provide more information.
NoSQL systems like Redis don't provide M:M tables.
In the end, the best I could come up with was a Python dict keyed by (A, B) pairs to hold the values, plus a dict of the set of pairings for each term.
class MM:
    def __init__(self):
        self._a = {}   # Bs for each A
        self._b = {}   # As for each B
        self._ab = {}  # value stored for each (A, B) pair
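A guess at how the rest of the class might continue (this add method is mine, not the original author's):

    def add(self, a, b, value):
        # keep both directional indexes and the pair -> value map in sync
        self._a.setdefault(a, set()).add(b)
        self._b.setdefault(b, set()).add(a)
        self._ab[(a, b)] = value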