select all IDs and insert missing data - python

I have a database where I store some values with an auto-generated identity key. I also have an n:m mapping table, like this:
create table data(id int not null identity(1,1), col1 int not null, col2 varchar(256) not null);
create table otherdata(id int not null identity(1,1), value varchar(256) not null);
create table data_map(dataid int not null, otherdataid int not null);
Every day the data table needs to be updated with a list of new values; many of them are already present but still need to be inserted into data_map (the key in otherdata is generated on each insert, so in that table the data is always new).
One way of doing it would be to first try to insert all values, then select the generated ids, then insert into data_map:
mydata = [] # list of tuples
cursor.executemany("if not exists (select * from data where col1 = %d and col2 = %d) insert into data (col1, col2) values (%d, %d)", mydata);
# now select the id's
# [...]
But that is obviously quite bad, because both the existence check and the later select have to run without using the key, so I need an index on the data first; otherwise everything is very slow.
My next approach was to use a hash function (like MD5 or CRC64) to generate my own hash over col1 and col2, so that I can insert all values without a select and use an indexed key when inserting the missing ones.
Can this be optimized, or is it the best I can do?
The number of rows is >500k per change, of which maybe ~20-50% will already be in the database.
Timing-wise, it looks like calculating the hashes is much faster than inserting the data into the database.
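Roughly what I have in mind (the helper name and the sample rows below are only an illustration):
import hashlib

# derive a deterministic key from (col1, col2) so existing rows can be matched
# on an indexed hash column instead of a select over the raw values
def row_hash(col1, col2):
    return hashlib.md5((u"%d|%s" % (col1, col2)).encode("utf-8")).hexdigest()

mydata = [(1, u"foo"), (2, u"bar")]  # sample (col1, col2) tuples
hashed = [(row_hash(c1, c2), c1, c2) for c1, c2 in mydata]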

As far as I can tell, you are using mysql.connector. If so, when you run cursor.execute() you should not use %d placeholders. Everything should be just %s, and the connector will take care of the type conversion.
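For example, a rough sketch of the question's statement rewritten with %s placeholders (this assumes cursor is the mysql.connector cursor from the question, and that each parameter tuple repeats col1 and col2 once for the existence check and once for the VALUES clause):
mydata = [(1, "foo"), (2, "bar")]  # sample (col1, col2) tuples
params = [(c1, c2, c1, c2) for c1, c2 in mydata]  # one value per %s placeholder
cursor.executemany(
    "if not exists (select * from data where col1 = %s and col2 = %s) "
    "insert into data (col1, col2) values (%s, %s)",
    params)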

Related

INSERT SQL records from one database to a second (where 2nd has an additional column)

I'd like to insert selected records from Table A --> Table B (in this example, different "databases" == different tables, so as not to worry about ATTACH), where Table A has fewer columns than Table B. The additional B_Table column (col3) should also be populated.
I've tried this sequence in raw-SQL (through SQLAlch.):
1.) INSERTing A_Table into Table B using an engine.connect().execute(text)
text("INSERT INTO B_Table (col1, col2) SELECT col1, col2 FROM A_Table")
2.) UPDATEing B_Table w/ col3 info with an engine.connect()ion (all newly inserted records are populated/updated w/ the same identifier, NewInfo)
text("UPDATE B_Table SET col3 = NewInfo WHERE B_Table.ID >= %s" % (starting_ID#_of_INSERT'd_records))
More efficient alternative?
But this is incredibly inefficient. It takes 4x longer to UPDATE a single column than to INSERT. This seems like it should be a fraction of the INSERT time. I'd like to reduce the total time to ~just the insertion time.
What's a better way to copy data from one table to another w/out INSERTing followed by an UPDATE? I was considering a:
1.) SQLAlchemy session.query(A_Table), but wasn't sure how to then edit that object (for col3) and then insert that updated object w/out loading all the A_Table queried info into RAM (which I understand raw-SQL's INSERT does not do).
You can use 'NewInfo' as a string literal in the SELECT statement:
INSERT INTO B_Table (col1, col2, col3)
SELECT col1, col2, 'NewInfo'
FROM A_Table;
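A small sketch of running that single statement through SQLAlchemy (assuming an existing Engine named engine, as in the question; engine.begin() wraps the statement in a transaction):
from sqlalchemy import text

with engine.begin() as conn:  # engine is assumed to exist, as in the question
    conn.execute(text(
        "INSERT INTO B_Table (col1, col2, col3) "
        "SELECT col1, col2, 'NewInfo' FROM A_Table"))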

How do I pass variables in SQL3 python? [duplicate]

I create a table with a primary key and autoincrement:
with open('RAND.xml', "rb") as f, sqlite3.connect("race.db") as connection:
    c = connection.cursor()
    c.execute(
        """CREATE TABLE IF NOT EXISTS race(RaceID INTEGER PRIMARY KEY AUTOINCREMENT, R_Number INT, R_KEY INT,
           R_NAME TEXT, R_AGE INT, R_DIST TEXT, R_CLASS, M_ID INT)""")
I then want to insert a tuple, which of course has one fewer value than the total number of columns because the first column is the autoincrement.
sql_data = tuple(b)
c.executemany('insert into race values(?,?,?,?,?,?,?)', sql_data)
How do I stop this error?
sqlite3.OperationalError: table race has 8 columns but 7 values were supplied
It's extremely bad practice to assume a specific ordering on the columns. Some DBA might come along and modify the table, breaking your SQL statements. Secondly, an autoincrement value will only be used if you don't specify a value for the field in your INSERT statement - if you give a value, that value will be stored in the new row.
If you amend the code to read
c.executemany('''insert into
race(R_number, R_KEY, R_NAME, R_AGE, R_DIST, R_CLASS, M_ID)
values(?,?,?,?,?,?,?)''',
sql_data)
you should find that everything works as expected.
From the SQLite documentation:
If the column-name list after table-name is omitted then the number of values inserted into each row must be the same as the number of columns in the table.
RaceID is a column in the table, so it is expected to be present when you're doing an INSERT without explicitly naming the columns. You can get the desired behavior (assign RaceID the next autoincrement value) by passing an SQLite NULL value in that column, which in Python is None:
sql_data = tuple((None,) + a for a in b)
c.executemany('insert into race values(?,?,?,?,?,?,?,?)', sql_data)
The above assumes b is a sequence of sequences of parameters for your executemany statement and attempts to prepend None to each sub-sequence. Modify as necessary for your code.

How can I get a MySQL database to insert a default value if there's an attempt to insert a null value with Python?

I've read answers that do something similar but not exactly what I'm looking for, which is: attempting to insert a row with a NULL value in a column will result instead in that column's DEFAULT value being inserted.
I'm trying to process a large number of inserts with the MySQL Python connector, with a large number of column values that I don't want to deal with individually, and none of the typical alternatives work here. Here is a sketch of my code:
qry = "INSERT INTO table (col1, col2, ...) VALUES (%s, %s, ...)"
row_data_dict = defaultdict(lambda : None, {...})
params = []
for col in [col1, col2, ...]:
    params.append(row_data_dict[col])
cursor.execute(qry, tuple(params))
My main problem is that setting None as the default in the dictionary results in either NULL being inserted, or an error if the column is declared NOT NULL. I have a large number of columns that might change in the future, so I'd like to avoid setting different 'default' values for different entries if at all possible.
I can't use the typical way of inserting DEFAULT by skipping columns in the insert, because while those columns might have the DEFAULT value, I can't guarantee it, and since I'm doing a large number of inserts I don't want to change the query string each time depending on whether a value is default or not.
The other way of inserting DEFAULT seems to be to have DEFAULT as one of the parameters (e.g. INSERT INTO table (col1,...) VALUES (DEFAULT,...)), but in my case setting the default in the dictionary to 'DEFAULT' results in an error (MySQL complains about it being an incorrect integer value when inserting into an integer column, which makes it seem like it interprets the default as a string and not a keyword).
This seems like it would be a relatively common use case, so it kind of shocks me that I can't figure out a way to do this. I'd appreciate any way to do this or get around it that I haven't already listed here.
EDIT: All the of the relevant columns are already labeled with a DEFAULT value, it doesn't seem to actually replace NULL (or python's None) when it's inserted.
EDIT 2: The reason why I want to avoid NULL so badly is because NULL != NULL and I want to have unique rows, so that if there's one row (1, 2, 3, 'Unknown'), INSERT IGNORE'ing a row (1, 2, 3, 'Unknown') won't insert it. With NULL you end up with a bunch of copies of the same record because one of the values is unknown.
You can use the DEFAULT() function in the VALUES list to specify that the column's default value should be used. And you can put this in an IFNULL() call so it is used when the supplied value is NULL.
qry = """INSERT INTO table (col1, col2, ...)
VALUES (IFNULL(%s, DEFAULT(col1)), IFNULL(%s, DEFAULT(col2)), ...)"""
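A small usage sketch (assuming a mysql.connector cursor and a two-column version of the question's query): passing None makes the parameter SQL NULL, which IFNULL() then replaces with the column's default:
# 'table' is the question's placeholder table name
qry = ("INSERT INTO table (col1, col2) "
       "VALUES (IFNULL(%s, DEFAULT(col1)), IFNULL(%s, DEFAULT(col2)))")
cursor.execute(qry, (42, None))  # col1 gets 42, col2 falls back to its DEFAULT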
Welcome to Stack Overflow. What you need to do is add a default value in your database for the column that should have one. When you create your table, just use DEFAULT followed by the value after the column definition, like this:
CREATE TABLE `yourTable` (`id` INT DEFAULT 0, .....)
If you have already created the table and need to alter the existing column, you would do something like this:
ALTER TABLE `yourTable` MODIFY `id` INT DEFAULT 0
So in your insert statement coming from Python, as long as you pass in either NULL or None for the value of that column, the default value will be populated for that column when the row is inserted into your database.
Another thing to keep in mind is that you have to pass in the proper number of values when you have a default set up for a column. Say you have a table with 3 columns; we'll call them colA, colB and colC.
If you want to insert a row with colA_value for colA, nothing for colB (so it uses its default value) and colC_value for colC, then you still need to pass in 3 values for your insert. If you just passed in colA_value and colC_value, then colA would get colA_value, colB would get colC_value, and colC would be left as null. You need to pass in values that will be interpreted by MySQL like this:
INSERT INTO `yourTable` (`colA`, `colB`, `colC`)
VALUES
('colA_value', null, 'colC_value')
Even though you are not passing in anything meaningful for colB, you still need to pass a null value from your Python program (null or None) for colB in order to get colB populated with its default value.
If you only pass in 2 values to MySQL to insert a row in your table, the insert statement under the hood will look like this:
INSERT INTO `yourTable` (`colA`, `colB`, `colC`)
VALUES
('colA_value', 'colC_value')
which would result in colA getting set to colA_value, colB getting set to colC_value, and colC being left as null.
If you are passing in the right number of values (which means including null or None for the column with the default value), then that is another story. Please let me know whether you are passing in the right number of values so I can help you troubleshoot further if needed.

Bulk insert with returning IDs performance

I've got a table into which I want to insert around 1000 items per query, and get their PKs after creation, for later use as FKs in other tables.
I've tried inserting them using the RETURNING syntax in PostgreSQL,
but it takes around 10 seconds to insert:
INSERT INTO table_name (col1, col2, col3) VALUES (a1,a2,a3)....(a(n-2),a(n-1),a(n)) returning id;
By removing RETURNING I get much better performance (~50 ms).
I think that if I could get the first id and insert the rows in a single atomic operation, I could keep the performance of the version without RETURNING,
but I don't understand whether that is possible.
Generate the ids using nextval:
http://www.postgresql.org/docs/9.1/static/sql-createsequence.html
CREATE TEMP TABLE temp_val AS(
VALUES(nextval('table_name_id_seq'),a1,a2,a3),
(nextval('table_name_id_seq'),a1,a2,a3)
);
INSERT INTO table_name (id, col1, col2, col3)(
SELECT column1,column2,column3,column4
FROM temp_val
);
SELECT column1 FROM temp_val;
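For completeness, a rough sketch of driving this from Python with psycopg2 (the connection parameters, table name and sequence name below are assumptions carried over from the answer):
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # connection parameters are a placeholder
cur = conn.cursor()
rows = [("a1", "a2", "a3"), ("b1", "b2", "b3")]  # sample (col1, col2, col3) rows

# build the VALUES list, drawing one id per row from the sequence
values_sql = ",".join(
    cur.mogrify("(nextval('table_name_id_seq'), %s, %s, %s)", r).decode()
    for r in rows)
cur.execute("CREATE TEMP TABLE temp_val AS (VALUES " + values_sql + ")")
cur.execute("INSERT INTO table_name (id, col1, col2, col3) "
            "SELECT column1, column2, column3, column4 FROM temp_val")
cur.execute("SELECT column1 FROM temp_val")
ids = [r[0] for r in cur.fetchall()]  # the generated ids, ready to use as FKs
conn.commit()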

Saving Tuples as blob data types in Sqlite3 in Python

I have a dictionary in Python. The keys are tuples of varying size containing unicode characters, and the values are single int numbers. I want to insert this dictionary into a sqlite db with a 2-column table.
The first column is for the key and the second column should hold the corresponding int value. Why do I want to do this? Well, I have a very large dictionary and I used cPickle, even setting the protocol to 2, but the file is still big and saving and loading it takes a lot of time. So I decided to save it in a db instead. This dictionary is only loaded into memory once, at the beginning of the program, so there is no extra operation.
Now the problem is that I want to save the tuples exactly as tuples (not strings), so whenever I load my table into memory, I can immediately build my dictionary with no problem.
Does anyone know how I can do this?
A couple of things. First, SQLite doesn't let you store Python data structures directly. Second, I'm guessing you want the ability to query the value by the tuple key on demand, so you don't want to pickle and unpickle the whole dict and then search the keys in it.
The problem is, you can't query with a tuple, and you can't break the tuple entries into their own columns because the tuples are of varying sizes. If you must use SQLite, you pretty much have to concatenate the unicode characters in the tuple, possibly with a delimiter that is not one of the characters used in the tuple values. Use that as the key, and store it in a SQLite column as the primary key.
def tuple2key(t, delimiter=u':'):
    return delimiter.join(t)

import sqlite3
conn = sqlite3.connect('/path/to/your/db')
cur = conn.cursor()
cur.execute('''create table tab (k text primary key, value integer)''')
# store the dict into a table
for k, v in my_dict.iteritems():
    cur.execute('''insert into tab values (?, ?)''', (tuple2key(k), v))
conn.commit()
# query the values
v = cur.execute('''select value from tab where k = ?''', (tuple2key((u'a', u'b')),)).fetchone()
It is possible to store tuples in a sqlite db and to create indices on them; it just needs some extra code to get it done.
Whether storing tuples in the db is an appropriate solution in this particular case is another question (probably a two-key solution is better suited).
import sqlite3
import pickle
def adapt_tuple(tuple):
    return pickle.dumps(tuple)

sqlite3.register_adapter(tuple, adapt_tuple)  # cannot use pickle.dumps directly because of inadequate argument signature
sqlite3.register_converter("tuple", pickle.loads)

def collate_tuple(string1, string2):
    return cmp(pickle.loads(string1), pickle.loads(string2))
con = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
con.create_collation("cmptuple", collate_tuple)
cur = con.cursor()
cur.execute("create table test(p tuple unique collate cmptuple) ")
cur.execute("create index tuple_collated_index on test(p collate cmptuple)")
#insert
p = (1,2,3)
p1 = (1,2)
cur.execute("insert into test(p) values (?)", (p,))
cur.execute("insert into test(p) values (?)", (p1,))
#ordered select
cur.execute("select p from test order by p collate cmptuple")
I think it is better to create 3 columns in your table - key1, key2 and value.
If you prefer to save the key as a tuple, you can still use pickle, but apply it to the key only. Then you can save it as a blob.
>>> pickle.dumps((u"\u20AC",u"\u20AC"))
'(V\\u20ac\np0\ng0\ntp1\n.'
>>> pickle.loads(_)
(u'\u20ac', u'\u20ac')
>>>
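Putting that together, a minimal sketch (assuming Python 3 and the standard sqlite3 module; the table and column names are made up) of storing the pickled tuple key as a BLOB and rebuilding the dictionary on load:
import pickle
import sqlite3

my_dict = {(u"\u20ac", u"\u20ac"): 1}  # sample data

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table tab (k blob primary key, value integer)")
for key, val in my_dict.items():
    cur.execute("insert into tab values (?, ?)",
                (sqlite3.Binary(pickle.dumps(key, 2)), val))
conn.commit()

# rebuild the dictionary when loading
restored = {pickle.loads(bytes(k)): v
            for k, v in cur.execute("select k, value from tab")}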
