What is the best way to tune Python code that generates unique codes for MariaDB? - python

I have this code for generating unique codes:
import mysql.connector as mariadb
import time
import random

mariadb_connection = mariadb.connect(user='XX', password='XX', database='XX', port='3306', host='192.168.XX.XX')
cursor = mariadb_connection.cursor()

FullChar = 'CDHKPMQRVXY123456789'
total = 20
count = 7
entries = []
uq_id = 0
flg_id = None
get_id = None
bcd = None

def inputDatabase(data):
    try:
        maria_insert_query = "INSERT INTO BB_UNIQUE_CODE(unique_code, flag_code, get_id, barcode) VALUES (%s, %s, %s, %s)"
        cursor.executemany(maria_insert_query, data)
        mariadb_connection.commit()
        print("Commiting " + str(total) + " entries..")
    except Exception:
        maria_alter_query = "ALTER TABLE PromoBigrolls.BB_UNIQUE_CODE AUTO_INCREMENT=0"
        mariadb_connection.rollback()
        print("UniqueCode Rollbacked")
        cursor.execute(maria_alter_query)
        print("UniqueCode Increment Altered")

while (0 < 1):
    for i in range(total):
        unique_code = ''.join(random.sample(FullChar, count))
        entry = (unique_code, flg_id, get_id, bcd)
        entries.append(entry)
    inputDatabase(entries)
    #print(entries)
    entries.clear()
    time.sleep(0.1)
My code output:
1 K4C1D9M null null null
2 K2R9XH3 null null null
3 5M3V9R2 null null null
This code runs correctly, but after the number of generated unique codes reaches 30 M there are too many rollbacks, because whenever the same unique code already exists in the database the newest batch is rolled back. Any suggestions?
Thanks

In my opinion UUID is definitely the way to go. By the way your usage of random.sample() is rather unusual - you may end up generating the same "unique" code twice (or more) in a row.
I don't know why you would like to customize a UUID, since it is intended to be a meaningless unique identifier; but if you really need your ID to be a string made of the chars in FullChar, then you can generate the UUID, convert it to a list of indexes, and use it to build your final string:
import uuid

def int2code(n, codestring):
    base = len(codestring)
    numbers = []
    while n > 0:
        x = n % base
        numbers.append(x)
        n //= base
    chars = [codestring[c] for c in reversed(numbers)]
    return ''.join(chars)

unique_code = int2code(int(str(uuid.uuid1().int)), FullChar)
EDIT
As has been noted by @shoaib30, you were generating 7-character codes.
While this is not difficult to handle (the easiest, although probably not the smartest, way would be to just calculate uuid.uuid1().int % 20**7 and use it in the function above), it can easily lead to collisions: the UUID is a 128-bit integer, or about 3.4e+38 possible values, while the permutations of 7 out of 20 items are just 3.9e+08, or 390 million. So you have about 1.0e+30 different UUIDs which will be translated to the same code.
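For illustration, here is a minimal sketch of that modulo-based shortening, reusing the int2code helper and the FullChar alphabet from above (the result can come out shorter than 7 characters for small values, and the collision risk described above remains):

import uuid

FullChar = 'CDHKPMQRVXY123456789'   # the 20 allowed characters from the question
count = 7                           # desired code length

def int2code(n, codestring):
    # Encode an integer as a string, using codestring as the digit alphabet.
    base = len(codestring)
    numbers = []
    while n > 0:
        numbers.append(n % base)
        n //= base
    return ''.join(codestring[c] for c in reversed(numbers))

# Fold the 128-bit UUID into the 20**7 space, then encode it.
short_code = int2code(uuid.uuid1().int % (len(FullChar) ** count), FullChar)
print(short_code)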

What does for sub_ids in grouped_slice(ids) mean in Tryton?

I want to understand an instruction that I have seen a lot. What does for sub_ids in grouped_slice(ids) mean in Tryton?
Here is a fragment of a method where such instruction is used:
@classmethod
def get_duration(cls, works, name):
    pool = Pool()
    Line = pool.get('timesheet.line')
    transaction = Transaction()
    cursor = transaction.connection.cursor()
    context = transaction.context

    table_w = cls.__table__()
    line = Line.__table__()
    ids = [w.id for w in works]
    durations = dict.fromkeys(ids, None)
    where = Literal(True)
    if context.get('from_date'):
        where &= line.date >= context['from_date']
    if context.get('to_date'):
        where &= line.date <= context['to_date']
    if context.get('employees'):
        where &= line.employee.in_(context['employees'])

    query_table = table_w.join(line, 'LEFT',
        condition=line.work == table_w.id)
    for sub_ids in grouped_slice(ids):
        red_sql = reduce_ids(table_w.id, sub_ids)
        cursor.execute(*query_table.select(table_w.id, Sum(line.duration),
                where=red_sql & where,
                group_by=table_w.id))
        for work_id, duration in cursor:
            # SQLite uses float for SUM
            if duration and not isinstance(duration, datetime.timedelta):
                duration = datetime.timedelta(seconds=duration)
            durations[work_id] = duration
    return durations
grouped_slice is a function we use in Tryton to process a list of records in several sub-sets (the default size depends on the database backend; for example, the value is 2000 for the PostgreSQL backend).
This is used to limit the number of parameters we pass to the database, because issues may arise when the number of input records is bigger than the maximum number of parameters accepted by the SQL driver.
Here is a simple example which provides a visual representation of how everything works:
>>> from trytond.tools import grouped_slice
>>> ids = list(range(0, 30))
>>> for i, sub_ids in enumerate(grouped_slice(ids, count=10)):
...     print("Loop", i, ','.join(map(str, sub_ids)))
...
Loop 0 0,1,2,3,4,5,6,7,8,9
Loop 1 10,11,12,13,14,15,16,17,18,19
Loop 2 20,21,22,23,24,25,26,27,28,29
As you can see, we have a list of 30 ids, which are processed in sets of 10 (since I passed 10 as the count argument).
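For intuition only, a helper with similar behaviour could be sketched like this (a simplification, not Tryton's actual implementation; the default of 2000 is just the PostgreSQL figure mentioned above):

from itertools import islice

def grouped_slice_sketch(records, count=2000):
    # Yield successive chunks of at most `count` items from `records`.
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, count))
        if not chunk:
            break
        yield chunk

# Each chunk keeps the IN (...) clause below the driver's parameter limit.
for sub_ids in grouped_slice_sketch(range(30), count=10):
    print(sub_ids)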

How can I update a specific value in my JSON type column

I have a web application and I keep statistics about my users in a JSON type column, for example: {'current': {'friends': 5, 'wins': 2, 'loses': 10}}. I would like to update only a specific field, because of a possible race condition. Until now I was simply updating the whole dictionary, but when a user plays two games at the same moment a race condition could occur.
For now I am doing it like this:
class User:
    name = Column(Unicode(1024), nullable=False)
    username = Column(Unicode(128), nullable=False, unique=True, default='')
    password = Column(Unicode(256), nullable=True, default='')
    counters = Column(
        MutableDict.as_mutable(JSON), nullable=False,
        server_default=text('{}'), default=lambda: copy.deepcopy(DEFAULT_COUNTERS))

    def current_counter(self, feature, number):
        current = self.counters.get('current', {})[feature]
        if current + number < 0:
            return
        self.counters.get('current', {})[feature] = current + number
        self.counters.changed()
But this will update the whole counters column after changing the value, and if two games finish at the same time I expect a race condition.
I was thinking about something like a session.query, but I am not that good at it:
def update_counter(self, session, feature, number):
    current = self.counters.get('current', {})[feature]
    if current + number < 0:
        return
    session.query(User) \
        .filter(User.id == self.id) \
        .update({
            "current": func.jsonb_set(
                User.counters['current'][feature],
                column(current) + column(number),
                'true')
        },
        synchronize_session=False
    )
This code produces NotImplementedError: Operator 'getitem' is not supported on this expression for the Event.counters['current'][feature] line, but I don't know how to make this work.
Thanks for the help.
The error is produced from chaining item access, instead of using a tuple of indexes as a single operation:
User.counters['current', feature]
This would produce a path index operation. But if you did it that way, you would be setting the value inside the nested JSON only, not in the whole column value. In addition, the value you are indexing from your JSON is an integer (not a collection), so jsonb_set() would not even know what to do. That is why jsonb_set() accepts a path as its second argument: an array of text that describes which value you want to set in your JSON:
func.jsonb_set(User.counters, ['current', feature], ...)
As for race conditions, there might be one still. You first get the count from the current model object in
current = self.counters.get('current', {})[feature]
and then proceed to use that value in an update, but what if another transaction has managed to perform a similar update in between? You would possibly overwrite that update's changes:
Transaction A         | Transaction B
SELECT, counter = 42  |
                      | SELECT, counter = 42
UPDATE counter = 52   |                         # +10
                      | UPDATE counter = 32     # -10
COMMIT                |
                      | COMMIT                  # 32 instead of 42
The solution then is to either make sure that you fetch the current model object using FOR UPDATE, or use SERIALIZABLE transaction isolation (and be ready to retry on serialization failures), or ignore the fetched value and let the DB calculate the update:
# Note that create_missing is true by default
func.jsonb_set(
    User.counters,
    ['current', feature],
    func.to_jsonb(
        func.coalesce(User.counters['current', feature].astext.cast(Integer), 0) +
        number))
and if you want to be sure that you don't update the value if the result would turn out negative (remember that the value you've read before might've changed already), add a check using the DB calculated value as a predicate:
def update_counter(self, session, feature, number):
    current_count = User.counters['current', feature].astext.cast(Integer)
    # Coalesce in case the count has not been set yet and is NULL
    new_count = func.coalesce(current_count, 0) + number
    session.query(User) \
        .filter(User.id == self.id, new_count >= 0) \
        .update({
            User.counters: func.jsonb_set(
                func.to_jsonb(User.counters),
                ['current', feature],
                func.to_jsonb(new_count)
            )
        }, synchronize_session=False)
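A hypothetical usage sketch, assuming a configured engine and the User model above (the engine, the user id and the 'wins' feature are placeholders, not part of the original code):

from sqlalchemy.orm import Session

with Session(engine) as session:              # engine is assumed to exist
    user = session.get(User, some_user_id)    # some_user_id is a placeholder
    user.update_counter(session, 'wins', +1)  # emits a single atomic UPDATE
    session.commit()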

Python: Efficient way to split list of strings into smaller chunks by concatenated size

I am communicating with a Google API via batch requests through its google-api-python-client. Batch requests have the following limitations:
A batch request can not contain more than 1000 requests,
A batch request can not contain more than 1MB in the payload.
I have a random number of random-length strings in a list, from which I need to construct a batch request while keeping the aforementioned limitations in mind.
Does anyone know a good way to efficiently build chunks of that original list that can be submitted to the Google API? By 'efficiently' I mean not iterating through all the elements one by one while counting the payload size.
So far, this is what I had in mind: take at most 1000 of the items, build the request, and check the payload size. If it's bigger than 1 MB, take 500 and check the size. If the payload is still bigger, take the first 250 items; if it is smaller, take 750 items. And so on, you get the logic. This way one could find the right number of elements in fewer iterations than by building the payload and checking it after each addition.
I really don't want to reinvent the wheel, so if anyone knows an efficient builtin/module for that, please let me know.
The body payload size can be calculated by calling _serialize_request, when you've added the right amount of requests to the instantiated BatchHttpRequest.
See also the Python API Client Library documentation on making batch requests.
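For example, a rough sketch of that idea (hedged: _serialize_request is a private method of BatchHttpRequest, so its exact behaviour may vary between library versions, and the MIME boundary overhead between parts is not counted here):

from googleapiclient.http import BatchHttpRequest

MAX_REQUESTS = 1000          # per-batch request limit from the question
MAX_PAYLOAD = 1024 * 1024    # ~1 MB payload limit from the question

def build_batches(http_requests, callback=None):
    # Split HttpRequest objects into BatchHttpRequest instances that respect both limits.
    batch, size, count = BatchHttpRequest(), 0, 0
    for request in http_requests:
        body = batch._serialize_request(request)   # private helper mentioned above
        if count and (count + 1 > MAX_REQUESTS or size + len(body) > MAX_PAYLOAD):
            yield batch
            batch, size, count = BatchHttpRequest(), 0, 0
        batch.add(request, callback=callback)
        size += len(body)
        count += 1
    if count:
        yield batch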
Okay, it seems I created something that solves this issue; here's a draft of the idea in Python:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import random
import string
import sys

MAX_LENGTH = 20
MAX_SIZE = 11111


def get_random():
    return ''.join([
        random.choice(string.ascii_letters) for i in range(
            random.randrange(10, 1000))])


def get_random_list():
    return [get_random() for i in range(random.randrange(50, 1000))]


def get_str_length(rnd_list, item_idx):
    return len(''.join(rnd_list[:item_idx]))


rnd_list = get_random_list()


def calculate_ideal_amount(rnd_list):
    list_bounds = {
        'first': 1,
        'last': len(rnd_list)
    }
    print('ORIG_SIZE: %s, ORIG_LEN: %s' % (
        get_str_length(rnd_list, len(rnd_list)), len(rnd_list)))
    if get_str_length(rnd_list, list_bounds['first']) > MAX_SIZE:
        return 0
    if get_str_length(rnd_list, list_bounds['last']) <= MAX_SIZE and \
            list_bounds['last'] <= MAX_LENGTH:
        return list_bounds['last']
    while True:
        difference = round((list_bounds['last'] - list_bounds['first']) / 2)
        middle_item_idx = list_bounds['first'] + difference
        str_len = get_str_length(
            rnd_list, middle_item_idx)
        print(
            'MAX_SIZE: %s, list_bounds: %s, '
            'middle_item_idx: %s, diff: %s, str_len: %s,' % (
                MAX_SIZE, list_bounds, middle_item_idx, difference, str_len))
        # sys.stdin.readline()
        if str_len > MAX_SIZE:
            list_bounds['last'] = middle_item_idx
            continue
        if middle_item_idx > MAX_LENGTH:
            return MAX_LENGTH
        list_bounds['first'] = middle_item_idx
        if difference == 0:
            if get_str_length(rnd_list, list_bounds['last']) <= MAX_SIZE:
                if list_bounds['last'] > MAX_LENGTH:
                    return MAX_LENGTH
                return list_bounds['last']
            return list_bounds['first']


ideal_idx = calculate_ideal_amount(rnd_list)
print(
    len(rnd_list), get_str_length(rnd_list, len(rnd_list)),
    get_str_length(rnd_list, ideal_idx), ideal_idx,
    get_str_length(rnd_list, ideal_idx + 1))
This code does exactly what I tried to describe: it finds and adjusts the bounds of the list while measuring the concatenated size of the slice, and then returns the index at which the list should be sliced to get the largest payload that still fits. This method avoids the CPU overhead of building and measuring the list item by item. Running this code will show you the iterations it performs on the list.
The get_str_length, the lists and the other functions can be replaced with the corresponding functionality in the API client, but this is the rough idea behind it.
The code is not foolproof, however; the solution should be something along these lines.

Best way to insert ~20 million rows using Python/MySQL

I need to store a defaultdict object containing ~20M objects into a database. The dictionary maps a string to a string, so the table has two columns, no primary key because it's constructed later.
Things I've tried:
executemany, passing in the set of keys and values in the dictionary. Works well when number of values < ~1M.
Executing single statements. Works, but slow.
Using transactions
con = sqlutils.getconnection()
cur = con.cursor()
print len(self.table)
cur.execute("SET FOREIGN_KEY_CHECKS = 0;")
cur.execute("SET UNIQUE_CHECKS = 0;")
cur.execute("SET AUTOCOMMIT = 0;")
i = 0
for k in self.table:
    cur.execute("INSERT INTO " + sqlutils.gettablename(self.sequence) + " (key, matches) values (%s, %s);", (k, str(self.hashtable[k])))
    i += 1
    if i % 10000 == 0:
        print i
#cur.executemany("INSERT INTO " + sqlutils.gettablename(self.sequence) + " (key, matches) values (%s, %s)", [(k, str(self.table[k])) for k in self.table])
cur.execute("SET UNIQUE_CHECKS = 1;")
cur.execute("SET FOREIGN_KEY_CHECKS = 1;")
cur.execute("COMMIT")
con.commit()
cur.close()
con.close()
print "Finished", self.sequence, "in %.3f sec" % (time.time() - t)
This is a recent conversion from SQLite to MySQL. Oddly enough, I'm getting much better performance when I use SQLite (30s to insert 3M rows in SQLite, 480s in MySQL). Unfortunately, MySQL is a necessity because the project will be scaled up in the future.
Edit
Using LOAD DATA INFILE works like a charm. Thanks to all who helped! Inserting 3.2M rows takes me ~25s.
MySQL can insert multiple values with one query: INSERT INTO table (key1, key2) VALUES ("value_key1", "value_key2"), ("another_value_key1", "another_value_key2"), ("and_again", "and_again...");
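If you prefer to stay with executemany, a hedged alternative (a common workaround, not part of the original answer) is to feed it fixed-size chunks so no single batch grows too large; the table and column names come from the question (key needs backticks because it is a reserved word in MySQL), and the 10,000 chunk size is arbitrary:

def insert_in_chunks(cur, table, mapping, chunk_size=10000):
    # Insert the dict in chunks; many MySQL drivers rewrite executemany()
    # into one multi-row INSERT per call.
    sql = "INSERT INTO " + table + " (`key`, matches) VALUES (%s, %s)"
    rows = [(k, str(v)) for k, v in mapping.items()]
    for start in range(0, len(rows), chunk_size):
        cur.executemany(sql, rows[start:start + chunk_size])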
Also, you could try writing your data to a file and use LOAD DATA from MySQL, which is designed to insert with "very high speed" (dixit MySQL).
I don't know whether "file writing" + "MySQL LOAD DATA" will be faster than inserting multiple values in one query (or in several queries, if MySQL has a limit on it).
It depends on your hardware (writing a file is "fast" with an SSD), on your file system configuration, on the MySQL configuration, etc. So you have to test in your "prod" environment to see which solution is the fastest for you.
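A hedged sketch of the file-plus-LOAD DATA route (the file path, connection settings and helper name are made up for illustration; LOCAL INFILE also has to be enabled on both client and server):

import csv
import mysql.connector   # assumes mysql-connector-python; other drivers are similar

def bulk_load(connection, table, mapping):
    # 1. Dump the dict to a tab-separated file.
    with open('/tmp/bulk_rows.tsv', 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        for k, v in mapping.items():
            writer.writerow([k, v])
    # 2. Let MySQL ingest the whole file in one statement.
    cur = connection.cursor()
    cur.execute(
        "LOAD DATA LOCAL INFILE '/tmp/bulk_rows.tsv' "
        "INTO TABLE " + table + " "
        "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
        "(`key`, matches)")
    connection.commit()
    cur.close()

# con = mysql.connector.connect(user='XX', password='XX', database='XX',
#                               allow_local_infile=True)
# bulk_load(con, 'my_table', {'some_key': 'some_matches'})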
Instead of inserting directly, generate an SQL file (using extended inserts etc.) and then feed it to MySQL; this will save you quite a lot of overhead.
NB: you'll still save some execution time if you avoid recomputing constant values in your loop, i.e.:
for k in self.table:
    xxx = sqlutils.gettablename(self.sequence)
    do_something_with(xxx, k)

=>

xxx = sqlutils.gettablename(self.sequence)
for k in self.table:
    do_something_with(xxx, k)

Take a few elements from an iterable, do something, take a few elements more, and so on

Here is some Python code moving data from a database on one server to a database on another server:
cursor1.execute("""
SELECT d1.Doc_Id , d2.Doc_Id
FROM Document d1
INNER JOIN Reference r ON d1.Doc_Id = r.Doc_Id
INNER JOIN Document d2 ON r.R9 = d2.T9
""")
cursor2.execute("START TRANSACTION")
cursor2.executemany("INSERT IGNORE INTO citation_t(citing_doc_id, cited_doc_id) VALUES (?,?)",
                    cursor1)
cursor2.execute("COMMIT")
Now, for the sake of exposition, let's say that the transaction runs out of space on the target hard drive before the commit, and thus the commit is lost. But I'm using the transaction for performance reasons, not for atomicity. So I would like to fill the hard drive with committed data so that it remains full and I can show it to my boss. Again, this is for the sake of exposition; the real question is below. In that scenario, I would rather do:
cursor1.execute("""
SELECT d1.Doc_Id , d2.Doc_Id
FROM Document d1
INNER JOIN Reference r ON d1.Doc_Id = r.Doc_Id
INNER JOIN Document d2 ON r.R9 = d2.T9
""")

MAX_ELEMENTS_TO_MOVE_TOGETHER = 1000

dark_spawn = some_dark_magic_with_iterable(cursor1, MAX_ELEMENTS_TO_MOVE_TOGETHER)
for partial_iterable in dark_spawn:
    cursor2.execute("START TRANSACTION")
    cursor2.executemany("INSERT IGNORE INTO citation_t(citing_doc_id, cited_doc_id) VALUES (?,?)",
                        partial_iterable)
    cursor2.execute("COMMIT")
My question is: what is the right way to fill in some_dark_magic_with_iterable, that is, to create some sort of iterator with pauses in between?
Just create a generator! :P
def some_dark_magic_with_iterable(curs, nelems):
    res = curs.fetchmany(nelems)
    while res:
        yield res
        res = curs.fetchmany(nelems)
Ok, ok... for generic iterators...
def some_dark_magic_with_iterable(iterable, nelems):
    iterator = iter(iterable)   # accept any iterable, not only iterators
    try:
        while True:
            res = []
            while len(res) < nelems:
                res.append(next(iterator))  # next() instead of .next(), so it also runs on Python 3
            yield res
    except StopIteration:
        if res:
            yield res
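For instance, a quick self-contained check of the generic version (the sample rows are made up):

rows = [(i, i * 10) for i in range(7)]
for chunk in some_dark_magic_with_iterable(rows, 3):
    print(chunk)
# [(0, 0), (1, 10), (2, 20)]
# [(3, 30), (4, 40), (5, 50)]
# [(6, 60)]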
