I have a Python program for deleting duplicates from a list of names.
But I'm torn between two approaches and am looking for the more efficient one.
I have uploaded a list of names to a SQLite DB, into a column in a table.
Is it better to compare the names and delete the duplicates directly in the DB, or to load them into Python, remove the duplicates there, and push them back to the DB?
I'm confused; here is a piece of code to do it in SQLite:
INSERT INTO dup_killer (member_id, date)
SELECT member_id, date FROM talks GROUP BY member_id, date;
If you use the names as a key in the database, the database will make sure they are not duplicated. So there would be no reason to ship the list to Python and de-dup there.
If you haven't inserted the names into the database yet, you might as well de-dup them in Python first. It is probably faster to do it in Python using the built-in features than to incur the overhead of repeated attempts to insert to the database.
(By the way: you can really speed up the insertion of many names if you wrap all the inserts in a single transaction. Start a transaction, insert all the names, and finish the transaction. The database does some work to make sure that the database is consistent, and it's much more efficient to do that work once for a whole list of names, rather than doing it once per name.)
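For instance, with the sqlite3 module that ships with Python, the whole batch can go through one transaction (a minimal sketch; the names table and its UNIQUE constraint are assumptions for illustration):

import sqlite3

conn = sqlite3.connect("names.db")

# one transaction for the whole batch: BEGIN once, COMMIT once
with conn:  # commits on success, rolls back on an exception
    conn.executemany(
        "INSERT OR IGNORE INTO names (name) VALUES (?)",  # UNIQUE(name) assumed
        ((n,) for n in list_of_all_names))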
If you have the list in Python, you can de-dup it very quickly using built-in features. The two common features that are useful for de-duping are the set and the dict.
I have given you three examples. The simplest case is where you have a list that just contains names, and you want to get a list with just unique names; you can just put the list into a set. The second case is that your list contains records and you need to extract the name part to build the set. The third case shows how to build a dict that maps a name onto a record, then inserts the record into a database; like a set, a dict will only allow unique values to be used as keys. When the dict is built, it will keep the last value from the list with the same name.
# list already contains names
unique_names = set(list_of_all_names)
unique_list = list(unique_names) # unique_list now contains only unique names
# extract the name field from each record and make a set
unique_names = set(x.name for x in list_of_all_records)
unique_list = list(unique_names) # unique_list now contains only unique names
# make dict mapping name to a complete record
d = dict((x.name, x) for x in list_of_records)
# insert complete record into database using name as key
for name in d:
    insert_into_database(d[name])
I am currently using peewee as an ORM in a python project.
I am trying to determine given a set of objects if any of these objects already exist in the database. For objects where the uniqueness is based on one key that is simple -- I can generate a list of the keys and do a select in the database:
Model.select().where(Model.column << ids)
However, in some cases the uniqueness is determined by two columns. (Note that I don't have the primary key in hand at this moment, which is why I can't just rely on id.)
I tried to genericize the logic, where a list of all the column names that determined uniqueness could be passed in. Here is my code:
clauses = []
for obj in db_objects:
    # _get_unique_key returns a tuple of all the column values
    # that determine uniqueness for this object.
    uniq_key = self._get_unique_key(obj)
    subclause = [getattr(self.model_class, uniq_column) == value
                 for uniq_column, value in zip(self.uniq_columns, uniq_key)]
    clauses.append(reduce(operator.and_, subclause))

dups = self.model_class.select().where(reduce(operator.or_, clauses)).execute()
Note that self.uniq_columns contains the names of all the columns that together determine uniqueness, and _get_unique_key returns a tuple of those column values.
When I run this I get an error that max recursion depth has been exceeded. I suppose this is due to how peewee resolves expressions. One way around it might be to break up my clauses into some max amount of objects (i.e. create a clause for every 100 objects and then issue the query, and do this until all the objects have been processed).
Wanted to see if there was a better way instead.
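A rough sketch of the chunking workaround described above (the clauses list is the one built earlier; the batch size of 100 is arbitrary):

from functools import reduce
import operator

BATCH_SIZE = 100
dups = []
for start in range(0, len(clauses), BATCH_SIZE):
    batch = clauses[start:start + BATCH_SIZE]
    # each query ORs together at most BATCH_SIZE clauses,
    # keeping peewee's expression tree shallow
    dups.extend(self.model_class.select().where(reduce(operator.or_, batch)))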
I create a Berkeley database, and operate with it using bsddb module. And I need to store there information in a style, like this:
username = '....'
notes = {'name_of_note1': {
             'password': '...',
             'comments': '...',
             'title': '...'
         },
         'name_of_note2': {
             # same keys as previous, but different values
         }}
This is how I open the database:
db = bsddb.btopen['data.db','c']
How do I do that?
So, first, I guess you should open your database using parentheses:
db = bsddb.btopen('data.db','c')
Keep in mind that Berkeley's pattern is key -> value, where both key and value are string objects (not unicode). The best way in your case would be to use:
db[str(username)] = json.dumps(notes)
since your notes are compatible with the json syntax.
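A round trip might look like this (a minimal sketch under that string-keys, string-values constraint):

import bsddb
import json

db = bsddb.btopen('data.db', 'c')

# store: serialize the notes dict to a JSON string
db[str(username)] = json.dumps(notes)

# load: parse the JSON string back into a dict
notes = json.loads(db[str(username)])

db.close()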
However, this is not a very good choice, say, if you want to query only usernames' comments. You should use a relational database, such as sqlite, which is also built-in in Python.
A simple solution was described by @Falvian.
For a start, there is a column pattern in ordered key/value stores, so the plain key/value pattern is not the only option.
I think that bsddb is a viable solution when you don't want to rely on sqlite. The first approach is to create a documents = bsddb.btopen('documents.db', 'c') and store json values inside. Regarding the keys, you have several options:
Name the keys yourself, like you do "name_of_note_1", "name_of_note_2"
Generate random identifiers using uuid.uuid4 (don't forget to check it's not already used ;)
Or use a row inside this documents with key=0 to store a counter that you will use to create uids (unique identifiers).
If you use integers, don't forget to pack them with struct.pack('>q', uid) before storing them, so they sort correctly as byte strings.
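A minimal sketch of the counter idea from the last bullet, combined with the packing (the pack_uid/next_uid helpers are made up for illustration):

import json
import struct

def pack_uid(uid):
    # big-endian 8-byte integers sort correctly as byte strings
    return struct.pack('>q', uid)

COUNTER_KEY = pack_uid(0)  # reserve key=0 for the counter row

def next_uid(db):
    try:
        counter = struct.unpack('>q', db[COUNTER_KEY])[0]
    except KeyError:
        counter = 0
    counter += 1
    db[COUNTER_KEY] = pack_uid(counter)
    return pack_uid(counter)

# usage: store a new document (one note dict) under a fresh uid
documents[next_uid(documents)] = json.dumps(note)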
If you need to create an index, I recommend having a look at my other answer introducing composite keys to build indexes in bsddb.
I'm using wx.ListCtrl for live reporting in my app: the status updates continuously, with a new row inserted when a task starts and the related row deleted when the task ends. Since the list gets sorted every now and then, you cannot simply delete a row by the row index you started with. You can assign a unique id using SetItemData, and that way you know exactly which row to delete when a task is done, but there does NOT seem to be any method for deleting a row by that unique id, not even a method to get the row index from the unique id; the only method I found is GetItemData, which returns the unique id for a given row.
So the only way that came to my mind is to iterate over all rows, checking each row's unique id against the given id, and deleting the row when it matches. But this sounds way too clumsy, so is there a better way to delete a specific row after sorting?
If you can upgrade your wxPython to the 2.9.x series then there is a simple answer: use a DataViewListCtrl, in which the display reflects your data rather than actually containing the data. As a result, if your data model changes because a data item (line) is deleted, your display will lose that line regardless of how the display is sorted. If you can't upgrade, then I suspect you will have to tag lines with a unique id and then find them for deletion.
If you need to do the latter I would suggest having a, possibly hidden, RowID column containing your unique ID and either a dictionary that you maintain with the subprocess PID as the key and the unique ID as the value or a function that maps the process id to the unique row ID.
Obviously adding the new row on the process create is not a problem, just remember to include your unique ID. When your process ends get the row ID and do something like:
def FindRow(ID):
    """Find the row that matches the ID."""
    match = None
    for index in range(self.TheGrid.GetNumberRows()):
        if self.TheGrid.GetCellValue(index, IdColNo) == ID:
            match = index
            break
    return match

# In your process end handler
Line = FindRow(GetID(pid))
if Line is not None:  # row 0 is a valid match, so don't test truthiness
    self.TheGrid.DeleteRows(Line, 1)
I ended up using ObjectListView to do the job. Basically you build an index for your objects in the list, and then you are able to operate on any row you want. It's way more convenient than wx.ListCtrl.
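For the record, a minimal sketch of that approach (assuming the third-party ObjectListView package; the Task class, column names, and handlers are made up for illustration):

import wx
from ObjectListView import ObjectListView, ColumnDefn  # third-party package

class Task(object):
    def __init__(self, uid, status):
        self.uid = uid
        self.status = status

class Report(wx.Frame):
    def __init__(self):
        wx.Frame.__init__(self, None, title="Live report")
        self.olv = ObjectListView(self, style=wx.LC_REPORT)
        self.olv.SetColumns([
            ColumnDefn("ID", "left", 80, "uid"),
            ColumnDefn("Status", "left", 120, "status"),
        ])
        self.tasks = {}  # uid -> Task: the index into the list

    def task_started(self, uid):
        task = Task(uid, "running")
        self.tasks[uid] = task
        self.olv.AddObject(task)

    def task_ended(self, uid):
        # remove by object identity; the current sort order is irrelevant
        self.olv.RemoveObject(self.tasks.pop(uid))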
I want to have multiple linked lists in a SQL table, using MySQL and SQLAlchemy (0.7). Each list starts with a node whose parent is 0 and ends with a node whose child is 0. The id represents the list, not the individual element; the element is identified by the PK.
With some omitted syntax (not relevant to the problem) it should look something like this:
id(INT, PK)
content (TEXT)
parent(INT, FK(id), PK)
child(INT, FK(id), PK)
As the table holds multiple linked lists, how can I return an entire list from the database when I select a specific id where parent is 0?
For example:
SELECT * FROM ... WHERE id = 3 AND parent = 0
Given that you have multiple linked lists stored in the same table, I assume that you store the HEAD and/or the TAIL of each of them in some other table. A few ideas:
1) Keep the linked list:
The first big improvement (also proposed in the comments) from the data-querying perspective would be to have some common identifier (let's call it ListID) shared by all the nodes in the same list. Here there are a few options:
If each list is referenced only from one object (data row) [I would even phrase the question as "Does the list belong to a single object?"], then this ListID could simply be the (primary) identifier of the holder object, with a ForeignKey on top to ensure data integrity.
In this case, querying the whole list is very simple. In fact, you can define the relationship and navigate it like my_object.my_list_items.
If the list is used/referenced by multiple objects, then one could create another table which will consist only of one column ListID (PK), and each Node/Item will again have a ForeignKey to it, or something similar
Else, large lists can be loaded in two queries/SQL statements:
query the HEAD/TAIL by its ID
query the whole list based on received ListID of the HEAD/TAIL
In fact, this can be done with one query like the one below (Single-query example), which is more efficient from the IO perspective, but doing it in two steps has the advantage that you immediately have a reference to the HEAD (or TAIL) node.
Single-query example:
# single query using a self-join (not tested)
from sqlalchemy.orm import aliased

Head = aliased(Node)
qry = (session.query(Node)
       .join(Head, Node.ListID == Head.ListID)
       .filter(Head.ID == head_node_id))
In any case, in order to traverse the linked list, you would have to get the HEAD/TAIL by its ID, then traverse as usual, as sketched below.
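Traversal might look like this (a sketch; the ParentID/ChildID attribute names are assumptions based on the question's schema, and no node is assumed to have ID 0):

# fetch every node of the list in one query, then walk it in memory
nodes = session.query(Node).filter(Node.ListID == list_id).all()
by_id = dict((n.ID, n) for n in nodes)

node = next(n for n in nodes if n.ParentID == 0)  # the list starts where parent == 0
ordered = []
while node is not None:
    ordered.append(node)
    node = by_id.get(node.ChildID)  # child == 0 ends the list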
Note: here I am not certain whether SA would recognize that the referenced objects are already loaded into the session, or would issue further SQL statements for each of them, which would defeat the purpose of bulk loading.
2) Replace linked list with Ordering List extension:
Please read the Ordering List documentation. It might well be that the Ordering List implementation is good enough for you to use instead of the linked list.
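A minimal sketch of what that could look like (model and column names are made up; positions are renumbered automatically on insert/remove):

from sqlalchemy import Column, Integer, Text, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.ext.orderinglist import ordering_list
from sqlalchemy.orm import relationship

Base = declarative_base()

class TodoList(Base):
    __tablename__ = 'todo_list'
    id = Column(Integer, primary_key=True)
    # items are kept sorted by position, which the extension maintains
    items = relationship("Item",
                         order_by="Item.position",
                         collection_class=ordering_list('position'))

class Item(Base):
    __tablename__ = 'item'
    id = Column(Integer, primary_key=True)
    list_id = Column(Integer, ForeignKey('todo_list.id'))
    position = Column(Integer)
    content = Column(Text)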
I want 3 columns to have 9 different values, like a list in Python.
Is it possible? If not in SQLite, then on another database engine?
You must serialize the list (or other Python object) into a string of bytes, aka "BLOB";-), through your favorite means (marshal is good for lists of elementary values such as numbers or strings &c, cPickle if you want a very general solution, etc), and deserialize it when you fetch it back. Of course, that basically carries the list (or other Python object) as a passive "payload" -- can't meaningfully use it in WHERE clauses, ORDER BY, etc.
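For example, with cPickle and the built-in sqlite3 module (a minimal sketch; the table and column names are made up):

import sqlite3
import cPickle as pickle  # Python 2; use pickle on Python 3

conn = sqlite3.connect('payload.db')
conn.execute("CREATE TABLE IF NOT EXISTS mytable "
             "(id INTEGER PRIMARY KEY, payload BLOB)")

# serialize: the list goes in as an opaque blob
data = [3, 6, 9]
conn.execute("INSERT INTO mytable (payload) VALUES (?)",
             (sqlite3.Binary(pickle.dumps(data, 2)),))
conn.commit()

# deserialize: fetch the blob back and unpickle it
row = conn.execute("SELECT payload FROM mytable ORDER BY id DESC LIMIT 1").fetchone()
restored = pickle.loads(bytes(row[0]))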
Relational databases just don't deal all that well with non-atomic values and would prefer other, normalized alternatives (store the list's items in a different table which includes a "listID" column, put the "listID" in your main table, etc). NON-relational databases, while they typically have limitations wrt relational ones (e.g., no joins), may offer more direct support for your requirement.
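The normalized alternative might look like this (a sketch with hypothetical table names):

import sqlite3

conn = sqlite3.connect('normalized.db')
conn.executescript("""
    -- one row per list item instead of one opaque blob per list
    CREATE TABLE IF NOT EXISTS list_items (
        list_id  INTEGER NOT NULL,   -- which list the item belongs to
        position INTEGER NOT NULL,   -- order within the list
        value    INTEGER NOT NULL,
        PRIMARY KEY (list_id, position)
    );
    -- the main table just carries a reference to its list
    CREATE TABLE IF NOT EXISTS main_table (
        id      INTEGER PRIMARY KEY,
        list_id INTEGER
    );
""")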
Some relational DBs do have non-relational extensions. For example, PostgreSQL supports an array data type (not quite as general as Python's lists -- PostgreSQL's arrays are intrinsically homogeneous).
Generally, you do this by stringifying the list (with repr()), and then saving the string. On reading the string from the database, use eval() to re-create the list. Be careful, though that you are certain no user-generated data can get into the column, or the eval() is a security risk.
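As a safer variant of that approach, ast.literal_eval only accepts Python literals, so it avoids the eval() risk (a sketch of the round trip):

import ast

lst = [3, 6, 9]
stored = repr(lst)  # "[3, 6, 9]" is what goes into the TEXT column

restored = ast.literal_eval(stored)  # rejects anything that isn't a literal
assert restored == lst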
Your question is difficult to understand. Here it is again:
I want 3 columns to have 9 different values, like a list in Python. Is it possible? If not in SQLite, then on another database engine?
Here is what I believe you are asking: is it possible to take a Python list of 9 different values, and save the values under a particular column in a database?
The answer to this question is "yes". I suggest using a Python ORM library instead of trying to write the SQL code yourself. This example code uses Autumn:
import autumn
import autumn.util
from autumn.util import create_table
# get a database connection object
my_test_db = autumn.util.AutoConn("my_test.db")
# code to create the database table
_create_sql = """\
DROP TABLE IF EXISTS mytest;
CREATE TABLE mytest (
id INTEGER PRIMARY KEY AUTOINCREMENT,
value INTEGER NOT NULL,
UNIQUE(value)
);"""
# create the table, dropping any previous table of same name
create_table(my_test_db, _create_sql)
# create ORM class; Autumn introspects the database to find out columns
class MyTest(autumn.model.Model):
    db = my_test_db

lst = [3, 6, 9, 2, 4, 8, 1, 5, 7]  # list of 9 unique values
for n in lst:
    row = MyTest(value=n)  # create a MyTest() row instance with value initialized
    row.save()             # write the data to the database
Run this code, then exit Python and run sqlite3 my_test.db. Then run this SQL command inside SQLite: select * from mytest; Here is the result:
1|3
2|6
3|9
4|2
5|4
6|8
7|1
8|5
9|7
This example pulls values from one list, and uses the values to populate one column from the database. It could be trivially extended to add additional columns and populate them as well.
If this is not the answer you are looking for, please rephrase your request to clarify.
P.S. This example uses autumn.util. The setup.py included with the current release of Autumn does not install util.py in the correct place; you will need to finish the setup of Autumn by hand.
You could use a more mature ORM such as SQLAlchemy or the ORM from Django. However, I really do like Autumn, especially for SQLite.