Multiple linked lists with SQLAlchemy and MySQL - Python

I want to store multiple linked lists in one SQL table, using MySQL and SQLAlchemy (0.7). Each list starts with a node whose parent is 0 and ends with a node whose child is 0. The id represents the list, not the individual element; the element itself is identified by the PK.
With some syntax omitted (not relevant to the problem), it should look something like this:
id (INT, PK)
content (TEXT)
parent (INT, FK(id), PK)
child (INT, FK(id), PK)
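In SQLAlchemy terms, a rough declarative sketch of what I have in mind (untested, syntax still simplified):
# rough declarative sketch of the table above (untested)
from sqlalchemy import Column, Integer, Text, ForeignKey
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Node(Base):
    __tablename__ = 'node'
    id = Column(Integer, primary_key=True)                              # identifies the list
    content = Column(Text)
    parent = Column(Integer, ForeignKey('node.id'), primary_key=True)   # 0 for the first node
    child = Column(Integer, ForeignKey('node.id'), primary_key=True)    # 0 for the last node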
As the table holds multiple linked lists, how can I return an entire list from the database when I select a specific id where parent is 0?
For example:
SELECT * FROM ... WHERE id = 3 AND parent = 0

Given that you have multiple linked lists stored in the same table, I assume that you store the HEAD and/or the TAIL of each of them in some other table. A few ideas:
1) Keep the linked list:
The first big improvement (also proposed in the comments) from the data-querying perspective would be to have some common identifier (let's call it ListID) shared by all the nodes in the same list. Here there are a few options:
If each list is referenced only from one object (data row) [I would even phrase the question as "Does the list belong to a single object?"], then this ListID could simply be the (primary) identifier of the holder object, with a ForeignKey on top to ensure data integrity.
In this case, querying the whole list is very simple. In fact, you can define the relationship and navigate it like my_object.my_list_items.
If the list is used/referenced by multiple objects, then one could create another table consisting of only one column, ListID (PK), and each Node/Item would again have a ForeignKey to it, or something similar.
Either way, large lists can then be loaded in two queries/SQL statements:
query the HEAD/TAIL by its ID
query the whole list based on received ListID of the HEAD/TAIL
In fact, this can be done with one query like the one below (Single-query example), which is more efficient from the IO perspective, but doing it in two steps has the advantage that you immediately have a reference to the HEAD (or TAIL) node.
Single-query example:
# single-query version using a self-join on ListID (not tested)
from sqlalchemy.orm import aliased

Head = aliased(Node)
qry = session.query(Node).join(Head, Node.ListID == Head.ListID).filter(Head.ID == head_node_id)
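For comparison, the two-step version described above might look like this (also not tested):
# step 1: get the HEAD by its ID; step 2: load the whole list via its ListID
head = session.query(Node).filter(Node.ID == head_node_id).one()
items = session.query(Node).filter(Node.ListID == head.ListID).all()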
In any case, in order to traverse the linked list, you would have to get the HEAD/TAIL by its ID, then traverse as usual.
Note: Here I am not certain whether SA would recognize that the referenced objects are already loaded into the session, or would issue additional SQL statements for each of them, which would defeat the purpose of bulk loading.
2) Replace linked list with Ordering List extension:
Please read the Ordering List documentation. It may well be that the Ordering List implementation is good enough for you to use instead of the linked list.
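For illustration, a minimal sketch of what that could look like (table and column names here are made up):
# minimal Ordering List sketch (names are illustrative only)
from sqlalchemy import Column, Integer, Text, ForeignKey
from sqlalchemy.orm import relationship
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.ext.orderinglist import ordering_list

Base = declarative_base()

class ItemList(Base):
    __tablename__ = 'item_list'
    id = Column(Integer, primary_key=True)
    # the extension keeps 'position' up to date as items are inserted/removed
    items = relationship("Item", order_by="Item.position",
                         collection_class=ordering_list('position'))

class Item(Base):
    __tablename__ = 'item'
    id = Column(Integer, primary_key=True)
    list_id = Column(Integer, ForeignKey('item_list.id'))
    position = Column(Integer)
    content = Column(Text)
With this, my_list.items.insert(2, Item(content='...')) renumbers the following items automatically.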

Related

Finding document containing array of nested names in pymongo (CrossRef data)

I have a dataset of CrossRef works records stored in a collection called works in MongoDB and I am using a Python application to query this database.
I am trying to find documents based on one author's name. Removing extraneous details, a document might look like this:
{'DOI': 'some-doi',
 'author': [{'given': 'Albert', 'family': 'Einstein', 'affiliation': []},
            {'given': 'R.', 'family': 'Feynman', 'affiliation': []},
            {'given': 'Paul', 'family': 'Dirac', 'affiliation': ['University of Florida']}]
}
It isn't clear to me how to combine the queries to get just Albert Einstein's papers.
I have indexes on author.family and author.given. I've tried:
cur = works.find({'author.family':'Einstein','author.given':'Albert'})
This returns all of the documents by people called 'Albert' and all of those by people called 'Einstein'. I can filter this manually, but it's obviously less than ideal.
I also tried:
cur = works.find({'author':{'given':'Albert','family':'Einstein','affiliation':[]}})
But this returns nothing (after a very long delay). I've tried this with and without 'affiliation'. There are a few questions on SO about querying nested fields, but none seem to concern the case where we're looking for 2 specific things in 1 nested field.
Your issue is that author is a list.
You can use an aggregate query to unwind this list to objects, and then your query would work:
cur = works.aggregate([{'$unwind': '$author'},
                       {'$match': {'author.family': 'Einstein', 'author.given': 'Albert'}}])
Alternatively, use $elemMatch, which matches documents whose array contains at least one element satisfying all of the specified criteria.
cur = works.find({"author": {'$elemMatch': {'family': 'Einstein', 'given': 'Albert'}}})
Also consider using multikey indexes.
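For example, a compound index covering both embedded fields could be created like this (connection details are assumptions; indexes on fields inside an array are multikey automatically, and whether a compound or two single-field indexes serves you better depends on your queries):
# hypothetical index creation; database/collection names are assumptions
from pymongo import MongoClient, ASCENDING

works = MongoClient()['crossref']['works']
works.create_index([('author.family', ASCENDING), ('author.given', ASCENDING)])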

SQLAlchemy: how to refer to filtered fields

I am doing a relatively simple ETL project using SQLAlchemy.
There is a large existing PostgreSQL database with multiple 'schemas' (in the PostgreSQL sub-database sense), one of which is new and the project is to convert the data from schema 'old' to schema 'new'.
I have one set of two 'old' source tables that I have to join together to get the new table ... I can't see how to refer to the fields in the joined/filtered superset of the two tables. For example, if I just loop over one table:
allp = session.query(Permit).all()
for p in allp:
    print p.permit_id
... works as expected.
But if I set up a filter to combine the two tables:
prmp = session.query(Permit,Permit_master).filter(Permit_master.id == Permit.mast_id).all()
for p in prmp:
    print p.permit_id
gives
'result' object has no attribute 'permit_id'
This must be something simple, but I've tried inspecting the object with dir() to no avail.
Help please ...
The results of your query are keyed 2-tuples of Permit and Permit_master. You can access the result entities using either their position or key:
for p in prmp:
    print p.Permit.permit_id
    # or
    print p[0].permit_id
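Since each row is a tuple, you can also unpack it directly in the loop (a small untested variation):
for permit, master in prmp:
    print permit.permit_id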

Peewee dynamically generate where clause to check for duplicates

I am currently using peewee as an ORM in a python project.
I am trying to determine, given a set of objects, whether any of these objects already exist in the database. For objects where uniqueness is based on a single key this is simple -- I can generate a list of the keys and do a select in the database:
Model.select().where(Model.column << ids)
However, in some cases the uniqueness is determined by two columns. (Note that I don't have the primary key in hand at this moment, which is why I can't just rely on id.)
I tried to generalize the logic so that a list of all the column names that determine uniqueness can be passed in. Here is my code:
import operator

clauses = []
for obj in db_objects:
    # _get_unique_key returns a tuple of all the column values
    # that determine uniqueness for this object.
    uniq_key = self._get_unique_key(obj)
    subclause = [getattr(self.model_class, uniq_column) == value
                 for uniq_column, value in zip(self.uniq_columns, uniq_key)]
    clauses.append(reduce(operator.and_, subclause))

dups = self.model_class.select().where(reduce(operator.or_, clauses)).execute()
Note that self.uniq_columns contains the names of all the columns that together determine uniqueness, and _get_unique_key returns a tuple of those column values.
When I run this I get an error that max recursion depth has been exceeded. I suppose this is due to how peewee resolves expressions. One way around it might be to break up my clauses into some max amount of objects (i.e. create a clause for every 100 objects and then issue the query, and do this until all the objects have been processed).
I wanted to see if there is a better way instead.
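For reference, the batching workaround mentioned above might look roughly like this (a sketch only, not tested against peewee; the chunk size is arbitrary):
import operator
from functools import reduce  # reduce is also a builtin on Python 2

def select_in_chunks(model_class, clauses, chunk_size=100):
    # issue the duplicate check in batches so each expression tree stays small
    dups = []
    for start in range(0, len(clauses), chunk_size):
        batch = clauses[start:start + chunk_size]
        dups.extend(model_class.select().where(reduce(operator.or_, batch)))
    return dups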

Find parent with certain combination of child rows - SQLite with Python

There are several parts to this question. I am working with sqlite3 in Python 2.7, but I am less concerned with the exact syntax, and more with the methods I need to use. I think the best way to ask this question is to describe my current database design, and what I am trying to accomplish. I am new to databases in general, so I apologize if I don't always use correct nomenclature.
I am modeling refrigeration systems (using Modelica--not really important to know), and I am using the database to manage input data, results data, and models used for that data.
My top parent table is Model, which contains the columns:
id, name, version, date_created
My child table under Model is called Design. It is used to create a unique id for each combination of design input parameters and the model used. The columns it contains are:
id, model_id, date_created
I then have two child tables under Design, one called Input, and the other called Result. We can just look at Input for now, since one example should be enough. The columns for input are:
id, value, design_id, parameter_id, component_id
parameter_id and component_id are foreign keys to their own tables. The Parameter table has the following columns:
id, name, units
Some example rows for Parameter under name are: length, width, speed, temperature, pressure (there are many dozens more). The Component table has the following columns:
id, name
Some example rows for Component under name are: compressor, heat_exchanger, valve.
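For reference, the schema described above corresponds roughly to the following (a sketch; column types and the file name are assumptions):
# rough sketch of the schema described above (column types are assumptions)
import sqlite3

db = sqlite3.connect('designs.db')   # hypothetical file name
db.executescript("""
    CREATE TABLE model     (id INTEGER PRIMARY KEY, name TEXT, version TEXT, date_created TEXT);
    CREATE TABLE design    (id INTEGER PRIMARY KEY, model_id INTEGER REFERENCES model(id), date_created TEXT);
    CREATE TABLE parameter (id INTEGER PRIMARY KEY, name TEXT, units TEXT);
    CREATE TABLE component (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE input     (id INTEGER PRIMARY KEY, value REAL,
                            design_id INTEGER REFERENCES design(id),
                            parameter_id INTEGER REFERENCES parameter(id),
                            component_id INTEGER REFERENCES component(id));
""")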
Ultimately, in my program I want to search the database for a specific design. I want to be able to look up a specific design so I can grab specific results for it, or to know whether a model simulation with that design has already been run, to avoid re-running the same data point.
I also want to be able to grab all the parameters for a given design, and insert it into a class I have created in Python, which is then used to provide inputs to my models. In case it helps for solving the problem, the classes I have created are based on the components. So, for example, I have a compressor class, with attributes like compressor.speed, compressor.stroke, compressor.piston_size. Each of these attributes should have their own row in the Parameter table.
So, how would I query this database efficiently to find if there is a design that matches a long list (let's assume 100+) of parameters with specific values? Just as a side note, my friend helped me design this database. He knows databases, but not my application super well. It is possible that I designed it poorly for what I want to accomplish.
Here is a simple picture trying to map a certain combination of parameters with certain values to a design_id, where I have taken out component_id for simplicity:
Picture of simplified tables
Simply join the necessary tables. Your schema properly reflects normalization (separating tables into logical groupings) and can scale for one-to-many relationships. Specifically, to answer your question -- "So, how would I query this database efficiently to find if there is a design that matches a long list (let's assume 100+) of parameters with specific values?" -- consider the approaches below:
Inner Join with Where Clause
For a handful of parameters, use an inner join with a WHERE...IN() clause. The query below returns design fields joined to the input and parameters tables, filtered on specific parameter names, which Python can pass in as parameterized values (even iteratively in a loop):
SELECT d.id, d.model_id, d.date_created
FROM design d
INNER JOIN input i ON d.id = i.design_id
INNER JOIN parameters p ON p.id = i.parameter_id
WHERE p.name IN ('param1', 'param2', 'param3', 'param4', 'param5', ...)
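From Python's sqlite3, the placeholders for the IN() clause can be generated from the list itself (parameter names below are only illustrative):
# build the IN (...) placeholders from a Python list (names are illustrative)
param_names = ['length', 'width', 'speed']
placeholders = ', '.join('?' for _ in param_names)
sql = ("SELECT d.id, d.model_id, d.date_created "
       "FROM design d "
       "INNER JOIN input i ON d.id = i.design_id "
       "INNER JOIN parameters p ON p.id = i.parameter_id "
       "WHERE p.name IN ({})").format(placeholders)
cur.execute(sql, param_names)
rows = cur.fetchall()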
Inner Join with Temp Table
Should the list be long (100+ values), consider a temp table that filters the parameters table to the specific parameter values:
# CREATE EMPTY TABLE (SAME STRUCTURE AS parameters)
sql = "CREATE TABLE tempparams AS SELECT id, name, units FROM parameters WHERE 0;"
cur.execute(sql)
db.commit()

# ITERATIVELY APPEND TO TEMP
sql = """INSERT INTO tempparams (id, name, units)
         SELECT p.id, p.name, p.units
         FROM parameters p
         WHERE p.name = ?;"""
for i in paramslist:            # LIST OF 100+ ITEMS
    cur.execute(sql, (i,))      # PARAMETER MUST BE PASSED AS A SEQUENCE/TUPLE
    db.commit()                 # DB OBJECT COMMIT ACTION
Then, join the main design and input tables with the new temp table holding the specific parameters:
SELECT d.id, d.model_id, d.date_created
FROM design d
INNER JOIN input i ON d.id = i.design_id
INNER JOIN tempparams t ON t.id = i.parameter_id
The same process can work with the components table as well.

Which one is more efficient?

I have a Python program for deleting duplicates from a list of names.
But I'm in a dilemma, trying to work out which of the two approaches is more efficient.
I have uploaded a list of names to a SQLite DB, into a column in a table.
Is it better to compare the names and delete the duplicates within the DB, or to load them into Python, delete the duplicates there, and push them back to the DB?
I'm confused and here is a piece of code to do it on SQLite:
dup_killer (member_id, date) SELECT * FROM talks GROUP BY member_id,
If you use the names as a key in the database, the database will make sure they are not duplicated. So there would be no reason to ship the list to Python and de-dup there.
If you haven't inserted the names into the database yet, you might as well de-dup them in Python first. It is probably faster to do it in Python using the built-in features than to incur the overhead of repeated attempts to insert to the database.
(By the way: you can really speed up the insertion of many names if you wrap all the inserts in a single transaction. Start a transaction, insert all the names, and finish the transaction. The database does some work to make sure that the database is consistent, and it's much more efficient to do that work once for a whole list of names, rather than doing it once per name.)
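A minimal sketch of that with the sqlite3 module (table and column names are assumptions):
# all inserts grouped into one transaction; table/column names are assumptions
import sqlite3

db = sqlite3.connect('names.db')
cur = db.cursor()
rows = [(1, 'Alice'), (2, 'Bob')]   # example data
# sqlite3 opens a transaction implicitly on the first INSERT;
# committing once at the end makes the whole batch a single transaction
cur.executemany('INSERT INTO talks (member_id, name) VALUES (?, ?)', rows)
db.commit()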
If you have the list in Python, you can de-dup it very quickly using built-in features. The two common features that are useful for de-duping are the set and the dict.
I have given you three examples below. The simplest case is where you have a list that contains just names and you want a list of only the unique names; you can simply put the list into a set. The second case is that your list contains records and you need to extract the name part to build the set. The third case shows how to build a dict that maps a name onto a complete record and then insert the records into a database; like a set, a dict only allows unique keys, so when the dict is built, the last record seen for a given name wins.
# list already contains names
unique_names = set(list_of_all_names)
unique_list = list(unique_names)  # unique_list now contains only unique names

# extract the name field from each record and make a set
unique_names = set(x.name for x in list_of_all_records)
unique_list = list(unique_names)  # unique_list now contains only unique names

# make a dict mapping name to a complete record
d = dict((x.name, x) for x in list_of_records)
# insert each complete record into the database using name as key
for name in d:
    insert_into_database(d[name])
