Algorithm for A/B testing - python

I need to develop an A/B testing method for my users. Basically I need to split my users into a number of groups - for example 40% and 60%.
I have around 1,000,000 users and I need to know what would be my best approach. Random numbers are not an option because the users would get different results each time. My second option is to alter my database so each user has a predefined number (randomly generated). The downside is that if a user gets 50, for example, they will always have that number unless I create a new user. I don't mind that, but I'm not sure altering the database is a good idea for this purpose.
Are there any other solutions so I can avoid that?

Run a simple algorithm against the primary key. For instance, if you have an integer for user id, separate by even and odd numbers.
Use a mod function if you need more than 2 groups.
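For the 40/60 split mentioned in the question, a minimal sketch of the mod approach might look like this (assuming an integer user id; the constants are just an example):

def assign_group(user_id, a_percent=40):
    # The same id always lands in the same bucket, so assignments are stable.
    bucket = user_id % 100          # bucket in the range 0-99
    return 'A' if bucket < a_percent else 'B'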

Well, you are using MySQL, so whether altering the database is a good idea or not is hard to tell: altering tables can be costly, it could affect performance in the long run as the table grows, and you would have to make sure every new user gets that number assigned as well. You have tagged this as a Python question, so here is another way of doing it without making any changes to the database. Since you are talking about users, you probably have a unique identifier for all of them - let's say their e-mail address. Instead of real e-mails I'll be using UUIDs in the demo below.
import hashlib

def calculateab(email):
    maxhash = 16**40
    emailhash = int(hashlib.sha1(email).hexdigest(), 16)
    div = (maxhash/100)-1
    return int(float(emailhash/div))

# A small demo
if __name__ == '__main__':
    import uuid, time, json
    emails = []
    verify = {}
    for i in range(1000000):
        emails.append(str(uuid.uuid4()))
    starttime = time.time()
    for i in emails:
        ab = calculateab(i)
        if ab not in verify:
            verify[ab] = 1
        else:
            verify[ab] += 1
    # json for your eye's pleasure
    print json.dumps(verify, indent=4)
    # If you look at the numbers, you'll see that they are well distributed, so
    # unless you are going to do that every second for all users, it should work fine.
    print "total calculation time {0} seconds".format(time.time() - starttime)
Not that much to do with Python, more of a math solution. You could use md5, sha1 or anything along those lines, as long as the digest has a fixed length and can be read as a hex number. The -1 in the div calculation is optional - it sets the range from 0 to 99 instead of 1 to 100. You could also modify this to use floats, which gives you greater flexibility.
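For the 40/60 split from the question, a hypothetical wrapper around the function above could look like this:

def ab_group(email, a_percent=40):
    # calculateab() maps an e-mail to a stable bucket of roughly 0-99.
    return 'A' if calculateab(email) < a_percent else 'B'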

I would add an auxiliary table with just the userId and the A/B group. You do not change the existing table, and it is easy to change the percentage per class if you ever need to. It is minimally invasive.
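A minimal sketch of that idea (using sqlite3 as a stand-in for your actual database; the table and column names are made up, and the split is the 40/60 from the question):

import random
import sqlite3

conn = sqlite3.connect('example.db')
conn.execute("CREATE TABLE IF NOT EXISTS ab_assignment (user_id INTEGER PRIMARY KEY, ab TEXT)")

def group_for(user_id, a_percent=40):
    # Return the stored assignment, or create one the first time we see this user.
    row = conn.execute("SELECT ab FROM ab_assignment WHERE user_id = ?", (user_id,)).fetchone()
    if row:
        return row[0]
    group = 'A' if random.random() * 100 < a_percent else 'B'
    conn.execute("INSERT INTO ab_assignment (user_id, ab) VALUES (?, ?)", (user_id, group))
    conn.commit()
    return group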

Here is the JS one liner:
const AB = (str) => parseInt(sha1(str).slice(0, 1), 16) % 2 === 0 ? 'A': 'B';
and the result for 10 million random emails:
{ A: 5003530, B: 4996470 }
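A rough Python equivalent of that one-liner, for completeness (hashlib is in the standard library; the parity of the first hex digit of the SHA-1 gives the same A/B split):

import hashlib

def AB(s):
    return 'A' if int(hashlib.sha1(s.encode('utf-8')).hexdigest()[0], 16) % 2 == 0 else 'B'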

Related

Django/Python: generate a unique gift-card code from a UUID

I'm using Django, and in one of my models I'm using UUID v4 as the primary key.
I'm using this UUID to generate a QR code for a sort of gift card.
Now the customer has asked for a 10-character gift card code as well, so that a gift card can be redeemed either by scanning the QR code (the current version, based on the UUID) or by typing the 10-character code in manually.
Now I need to find a way to generate this gift code. Obviously this code must be unique.
I found an article where the author suggests embedding the auto-generated integer id in the generated code (for example at the end of a random string). I'm not sure about this because I only have 10 characters: for a long id I would basically burn most of the available characters just to concatenate this unique part.
For example, if my id is 609234 I will have {random-string with length 4} + 609234.
I also don't like this solution because I don't think it's very secure; it would be better to have a completely random code. From a malicious user's point of view there is a sort of regular format to it.
Do you know a way to generate a unique random string, for example from a unique input key (in my case the UUID v4)?
Otherwise, do you know some algorithm/approach to generate voucher codes?
import string
import secrets
unique_digits = string.digits
password = ''.join(secrets.choice(unique_digits) for i in range(6))
print(password)
The above snippet generates a random numeric code of whatever length you want; in this case it prints a 6-digit integer code. Note that it is random rather than guaranteed unique, so you still need to check new codes against the ones already issued.
If that doesn't cover what you need, let me know exactly what you are after.
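If you need the 10-character alphanumeric codes from the question instead, here is a sketch along the same lines (check_exists() is a placeholder for however you look up existing codes in your table):

import string
import secrets

ALPHABET = string.ascii_uppercase + string.digits  # 36 easy-to-type symbols

def new_gift_code(length=10):
    return ''.join(secrets.choice(ALPHABET) for _ in range(length))

def unique_gift_code(check_exists, length=10):
    # 36**10 is about 3.6e15 possible codes, so collisions are rare, but check anyway.
    code = new_gift_code(length)
    while check_exists(code):
        code = new_gift_code(length)
    return code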

Python random character string repeated 7/2000 records

I am using the below to generate a random set of characters and numbers:
tag = ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(36)])
I thought that this was a decent method. 36 characters long, with each character being one of 62 possible options. Should be a good amount of randomness, right?
Then, I was running a query off an instance with what I thought was a unique tag. Turns out, there were SEVEN (7) records with the same "random" tag. So, I opened the DB, and ran a query to see the repeatability of my tags.
Turns out that not only does mine show up 7 times, but there are a number of tags that repeatedly appear over and over again. With approximately 2000 rows, it clearly should not be happening.
Two questions:
(1) What is wrong with my approach, and why would it be repeating the same tag so often?
(2) What would be a better approach to get unique tags for each record?
Here is the code I am using to save this to the DB. While it is written in Django, clearly this is not a django related question.
class Note(models.Model):
    ...
    def save(self, *args, **kwargs):
        import random
        import string
        self.tag = ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(36)])
        super(Note, self).save(*args, **kwargs)
The problem with your approach:
true randomness/crypto is hard, you should try to use tested existing solutions instead of implementing your own.
Uniqueness isn't guaranteed - while 'unlikely', there is nothing preventing the same string from being generated more than once.
A better solution would be to not reinvent the wheel, and use the uuid module, a common solution to generating unique identifiers:
import uuid
tag = uuid.uuid1()
Use a cryptographically secure PRNG with random.SystemRandom(). It will use the PRNG of whatever system you are on.
tag = ''.join(random.SystemRandom().choice(string.ascii_letters + string.digits) for n in xrange(36))
Note that there is no need to pass this as a list comprehension to join().
There are 62^36 possible combinations, a number with 65 digits, so duplicates should be extremely rare, even if you take the birthday paradox into consideration.
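As a rough sanity check on that claim, the birthday-paradox approximation p ≈ n^2 / (2N) for 2000 tags drawn from a space of 62^36 values gives:

n = 2000        # number of tags generated
N = 62 ** 36    # number of possible 36-character tags
print(n * n / (2.0 * N))   # about 6e-59, so genuine randomness would essentially never repeat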

Understanding a custom encryption method in python

As part of an assignment I've been given some code written in python that was used to encrypt a message, and I have to try and understand the code and decrypt the ciphertext. I've never used python before and am somewhat out of my depth.
I understand most of it and the overall gist of what the code is trying to accomplish, however there are a few lines near the end tripping me up. Here's the entire thing (the &&& denotes sections of code which are supposed to be "damaged", while testing the code I've set secret to "test" and count to 3):
import string
import random
from base64 import b64encode, b64decode

secret = '&&&&&&&&&&&&&&' # We don't know the original message or length
secret_encoding = ['step1', 'step2', 'step3']

def step1(s):
    _step1 = string.maketrans("zyxwvutsrqponZYXWVUTSRQPONmlkjihgfedcbaMLKJIHGFEDCBA",
                              "mlkjihgfedcbaMLKJIHGFEDCBAzyxwvutsrqponZYXWVUTSRQPON")
    return string.translate(s, _step1)

def step2(s): return b64encode(s)

def step3(plaintext, shift=4):
    loweralpha = string.ascii_lowercase
    shifted_string = loweralpha[shift:] + loweralpha[:shift]
    converted = string.maketrans(loweralpha, shifted_string)
    return plaintext.translate(converted)

def make_secret(plain, count):
    a = '2{}'.format(b64encode(plain))
    for count in xrange(count):
        r = random.choice(secret_encoding)
        si = secret_encoding.index(r) + 1
        _a = globals()[r](a)
        a = '{}{}'.format(si, _a)
    return a

if __name__ == '__main__':
    print make_secret(secret, count=&&&)
Essentially, I assume the code is meant to choose randomly from the three encryption methods step1, step2 and step3, then apply them to the cleartext a number of times, as governed by the value of "count".
The "make_secret" method is the part that's bothering me, as I'm having difficulty working out how it ties everything together and what the overall purpose of it is. I'll go through it line by line and give my reasons on each part, so someone can correct me if I'm mistaken.
a = '2{}'.format(b64encode(plain))
This takes the base64 encoding of whatever the "plain" variable corresponds to and prepends a 2 to it, resulting in something like "2VGhpcyBpcyBhIHNlY3JldA==" when using "this is a secret" for plain as a test. I'm not sure what the 2 is for.
r = random.choice(secret_encoding)
si = secret_encoding.index(r) + 1
r is a random selection from the secret_encoding array, while si corresponds to the next array element after r.
_a = globals()[r](a)
This is one of the parts that has me stumped. From researching global() it seems that the intention here is to turn "r" into a global dictionary consisting of the characters found in "a", ie somewhere later in the code a's characters will be used as a limited character set to choose from. Is this correct or am I way off base?
I've tried printing _a, which gives me what appears to be the letters and numbers found in the final output of the code.
a = '{}{}'.format(si, _a)
It seems as if this is creating a string which is a concatenation of the si and _a variables, however I'll admit I don't understand the purpose of doing this.
I realize this is a long question, but I thought it would be best to put the parts that are bothering me into context.
I will refrain from commenting on the readability of the code. I daresay
it was all intentional, anyway, for purposes of obfuscation. Your
professor is an evil bastard and I want to take his or her course :)
r = random.choice(secret_encoding)
...
_a = globals()[r](a)
You're way off base. This is essentially an ugly and hard-to-read way to
randomly choose one of the three functions and run it on a. The
function globals() returns a dict that maps names to identifiers; it
includes the three functions and other things. globals()[r] looks up
one of the three functions based on the name r. Putting (a) after
that runs the function with a as the argument.
a = '{}{}'.format(si, _a)
The idea here is to prepend each interim result with the number of the
function that encrypted it, so you know which function you need to
reverse to decrypt that step. They all accumulate at the beginning, and
get encrypted and re-encrypted with each step, except for the last one.
a = '2{}'.format(b64encode(plain))
Essentially, this is applying step2 first. Each encryption with
step2 prepends a 2.
So, the program applies count encryptions to the plaintext, with each
step using a randomly-chosen transformation, and the choice appears in
plaintext before the ciphertext. Your task is to read each prepended
number and apply the inverse transformation to the rest of the message.
You stop when the first character is not in "123".
One problem I see is that if the plaintext begins with a digit in
"123", it will look like we should perform another decryption step. In
practice, however, I feel sure that the professor's choice of plaintext
does not begin with such a digit (unless they're really evil).
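To make that concrete, here is a minimal sketch of such a decryption loop, reusing the assignment's own building blocks (Python 2, like the original; step1 is its own inverse, step2 is undone by b64decode, and step3 is undone by shifting back the other way):

import string
from base64 import b64decode

def unstep1(s):
    # step1 swaps letters in pairs, so applying the same substitution again undoes it.
    _step1 = string.maketrans("zyxwvutsrqponZYXWVUTSRQPONmlkjihgfedcbaMLKJIHGFEDCBA",
                              "mlkjihgfedcbaMLKJIHGFEDCBAzyxwvutsrqponZYXWVUTSRQPON")
    return string.translate(s, _step1)

def unstep3(s, shift=4):
    loweralpha = string.ascii_lowercase
    shifted = loweralpha[shift:] + loweralpha[:shift]
    return s.translate(string.maketrans(shifted, loweralpha))

inverses = {'1': unstep1, '2': b64decode, '3': unstep3}

def recover(ciphertext):
    s = ciphertext
    # Peel off the leading step number and undo that step; stop once the first
    # character is no longer 1, 2 or 3 (assuming the plaintext does not start with one).
    while s and s[0] in '123':
        step, s = s[0], s[1:]
        s = inverses[step](s)
    return s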

MongoDB, pymongo: algorithm to speed up identification of consecutive documents with matching values

I'm new to MongoDB and pymongo and looking for some guidance in terms of algorithms and performance for a specific task described below. I have posted a link to an image of the data sample and also my sample python code below.
I have a single collection that grows by about 5 to 10 million documents every month. It receives all this info from other systems, which I have no access to modify in any way (they are in different companies). Each document represents a sort of financial transaction. I need to group documents that are part of the same "transaction group".
Each document has hundreds of keys. Almost all keys vary between documents (which is why they moved from MySQL to MongoDB - no easy way to align schema). However, I found out that three keys are guaranteed to always be in all of them. I'll call these keys key1, key2 and key3 in this example. These keys are my only option to identify the transactions that are part of the same transaction group.
The basic rule is:
- If consecutive documents have the same key1, and the same key2, and the same key3, they are all in the same "transaction group". Then I must give it some integer id in a new key named 'transaction_group_id'
- Else, consecutive documents that do not match key1, key2 and key3 are each in their own individual "transaction group".
It's really easy to understand it by looking at a screenshot of a data sample (better than my explanation anyway).
As you can see in the sample:
- Documents 1 and 2 are in the same group, because they match key1, key2 and key3;
- Documents 3 and 4 also match and are in their own group;
- Following the same logic, documents 18 and 19 are a group obviously. However, even though they match the values of documents 1 and 3, they are not in the same group (because the documents are not consecutive).
I created a very simplified version of the current python function, to give you guys an idea of the current implementation:
from pymongo import MongoClient, ASCENDING

def groupTransactions(mongo_host,
                      mongo_port,
                      mongo_db,
                      mongo_collection):
    """
    Group transactions if Keys 1, 2 and 3 all match in consecutive docs.
    """
    mc = MongoClient(mongo_host, mongo_port)
    db = mc['testdb']
    coll = db['test_collection']
    # The first document transaction group must always be equal to 1.
    first_doc_id = coll.find_one()['_id']
    coll.update({'_id': first_doc_id},
                {"$set": {"transaction_group_id": 1}},
                upsert=False, multi=False)
    # Cursor order is undetermined unless we use sort(), no matter what the _id is. We learned it the hard way.
    cur = coll.find().sort('subtransaction_id', ASCENDING)
    doc_count = cur.count()
    unique_data = []
    unique_data.append((cur[0]['key1'], cur[0]['key2'], cur[0]['key3']))
    transaction_group_id = 1
    i = 1
    while i < doc_count:
        doc_id = cur[i]['_id']
        unique_data.append((cur[i]['key1'], cur[i]['key2'], cur[i]['key3']))
        if unique_data[i] != unique_data[i-1]:
            # New group found, increase group id by 1
            transaction_group_id = transaction_group_id + 1
        # Update the group id in the database
        coll.update({'_id': doc_id},
                    {"$set": {"transaction_group_id": transaction_group_id}},
                    upsert=False, multi=False)
        i = i + 1
    print "%d subtransactions were grouped into %d transaction groups." % (doc_count, transaction_group_id)
    return 1
This is the code, more or less, and it works. But it takes between 2 and 3 days to finish, which is starting to become unacceptable. The hardware is good (VMs on last-generation Xeons, a local MongoDB on SSD, 128 GB RAM). It would probably run faster if we decided to run it on AWS, use threading/subprocesses, etc., which are all obviously good options to try at some point.
However, I'm not convinced this is the best algorithm. It's just the best I could come up with. There must be obvious ways to improve it that I'm not seeing.
Moving to c/c++ or out of NoSQL is out of the question at this point. I have to make it work the way it is.
So basically the question is: Is this the best possible algorithm (using MongoDB/pymongo) in terms of speed? If not, I'd appreciate it if you could point me in the right direction.
EDIT: Just so you can have an idea of how slow this code is: the last time I measured it, it took 22 hours to run on 1,000,000 documents. As a quick workaround, I wrote something else to load the data into a Pandas DataFrame first and then apply more or less the same logic as this code. It took 3 to 4 minutes to group everything, using the same hardware. I mean, I know Pandas is efficient, etc., but there can't be such a huge gap between the two solutions' performance (4 min vs 1,320 min).
It is the case that most of the time is spent writing to the database, which includes the round trip of sending work to the DB, plus the DB doing the work. I will point out a couple of places where you can speed up each of those.
Speeding up the back-and-forth of sending write requests to the DB:
One of the best ways to improve the latency of the requests to the DB is to minimize the number of round trips. In your case, it's actually possible because multiple documents will get updated with the same transaction_group_id. If you accumulate their values and only send a "multi" update for all of them, then it will cut down on the back-and-forth. The larger the transaction groups the more this will help.
In your code, you would replace the current update statement:
coll.update({'_id': doc_id},
            {"$set": {"transaction_group_id": transaction_group_id}},
            upsert=False, multi=False)
With an accumulator of doc_id values (appending them to a list should be just fine). When you detect the "pattern" change and the transaction group moves on to the next one, you would then run one update for the whole group:
coll.update({'_id': {'$in': doc_id_list}},
            {"$set": {"transaction_group_id": transaction_group_id}},
            upsert=False, multi=True)
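Putting that together with the question's loop, a rough sketch of the batching (reusing coll and the same sort order as the question's code; the variable names are just illustrative):

doc_ids = []            # _ids belonging to the current transaction group
prev_key = None
transaction_group_id = 1

for doc in coll.find().sort('subtransaction_id', ASCENDING):
    key = (doc['key1'], doc['key2'], doc['key3'])
    if prev_key is not None and key != prev_key:
        # The group just ended: flush it with a single multi update and start the next one.
        coll.update({'_id': {'$in': doc_ids}},
                    {"$set": {"transaction_group_id": transaction_group_id}},
                    upsert=False, multi=True)
        transaction_group_id += 1
        doc_ids = []
    doc_ids.append(doc['_id'])
    prev_key = key

if doc_ids:             # flush the final group
    coll.update({'_id': {'$in': doc_ids}},
                {"$set": {"transaction_group_id": transaction_group_id}},
                upsert=False, multi=True)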
A second way of increasing the parallelism of this process and speeding up the end-to-end work would be to split the job between more than one client. The downside is that you need a single unit of pre-work to calculate how many transaction_group_id values there will be and where the split points are. Then you can have multiple clients like this one, each handling only a range of subtransaction_id values, with a transaction_group_id starting value that is not 1 but whatever the "pre-work" process hands to them.
Speeding up the actual write on the DB:
The reason I asked about existence of the transaction_group_id field is because if a field that's being $set does not exist, it will be created and that increases the document size. If there is not enough space for the increased document, it has to be relocated and that's less efficient than the in-place update.
MongoDB stores documents in BSON format. Different BSON values have different sizes. As a quick demonstration, here's a shell session that shows total document size based on the type and size of value stored:
> db.sizedemo.find()
{ "_id" : ObjectId("535abe7a5168d6c4735121c9"), "transaction_id" : "" }
{ "_id" : ObjectId("535abe7d5168d6c4735121ca"), "transaction_id" : -1 }
{ "_id" : ObjectId("535abe815168d6c4735121cb"), "transaction_id" : 9999 }
{ "_id" : ObjectId("535abe935168d6c4735121cc"), "transaction_id" : NumberLong(123456789) }
{ "_id" : ObjectId("535abed35168d6c4735121cd"), "transaction_id" : " " }
{ "_id" : ObjectId("535abedb5168d6c4735121ce"), "transaction_id" : " " }
> db.sizedemo.find().forEach(function(doc) { print(Object.bsonsize(doc)); })
43
46
46
46
46
53
Note how the empty string takes up three bytes fewer than a double or NumberLong does. The three-character string "   " takes the same amount as a number, and longer strings take proportionally more. To guarantee that your updates that $set the transaction group never cause the document to grow, you want to set transaction_group_id on initial load to a value of the same size as (or larger than) the one it will be updated to. This is why I suggested -1 or some other agreed-upon "invalid" or "unset" value.
You can check if the updates have been causing document moves by looking at the value in db.serverStatus().metrics.record.moves - this is the number of document moves caused by growth since the last time server was restarted. You can compare this number before and after your process runs (or during) and see how much it goes up relative to the number of documents you are updating.
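From pymongo, the same counter can be read with the serverStatus command (this assumes the db handle from the question's code and a storage engine that reports metrics.record.moves):

moves_before = db.command('serverStatus')['metrics']['record']['moves']
# ... run the grouping job ...
moves_after = db.command('serverStatus')['metrics']['record']['moves']
print(moves_after - moves_before)   # document moves caused by growth during the run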

What data is cached during a "select" in sqlite3/Python, and can this be done manually from the start?

Suppose you have a sqlite database with several thousand rows -- each of which either contains or references a sizable, unique blob -- and you want to sparsely sample this collection, pulling rows based on rowid or some equivalent primary key. I find that the first time I attempt to fetch several (500) datapoints after connecting (out of 20k rows), the call takes over 10 seconds to return; and, with every successive iteration, the calls get shorter and shorter, until converging to around 100 milliseconds after 50-100 such queries.
Clearly, either sqlite or its python wrapper must be caching... something. If I clear out inactive memory (I'm in OS X, but I think Linux has a comparable if-not-identical "purge" command?), the behavior can be replicated exactly. The question is, what is it caching that an index doesn't address? And furthermore, is it possible to automatically pull whatever information is accelerating these queries into memory from the start? Or is there something else I've missed entirely?
A few notes in case someone doesn't immediately know the answer...
Each blob is around 40 kB, and they are a large (ha) source of the problem. I've put some code below for anyone who wants to play along at home, but I've had better luck keeping separate tables for sortable information and data. This introduces an inner join, but it's generally been better than keeping it all together (although if anyone feels this is wrong, I'm keen to hear it). Without the inner join / data fetch, things start at 4 seconds and drop to 3 ms in a hurry.
I feel like this might be a PRAGMA thing, but I fiddled with some settings suggested by others in the wilderness of the web and didn't really see any benefit.
In-memory databases are not an option. For one, I'm trying to share across threads (which might not actually be a problem for in-mems...? not sure), but more importantly the database files are typically on the order of 17 GB. So, that's out.
That being said, there's no problem caching a reasonable amount of information. After a few dozen calls, inactive memory gets somewhat bloated anyways, but I'd rather do it (1) right and (2) efficiently.
Okay, now some code for anyone who wants to try to replicate things. You should be able to copy and paste it into a stand-alone script (that's basically what I did, save for formatting).
import sqlite3
import numpy as np
import time

ref_uid_index = """CREATE INDEX ref_uid_idx
                   ON data(ref_uid)"""

def populate_db_split(db_file, num_classes=10, num_points=20000, VERBOSE=False):
    def_schema_split0 = """
        CREATE TABLE main (
            uid INTEGER PRIMARY KEY,
            name TEXT,
            label INTEGER,
            ignore INTEGER default 0,
            fold INTEGER default 0)"""
    def_schema_split1 = """
        CREATE TABLE data (
            uid INTEGER PRIMARY KEY,
            ref_uid INTEGER REFERENCES main(uid),
            data BLOB)"""
    def_insert_split0 = """
        INSERT INTO main (name, label, fold)
        VALUES (?,?,?)"""
    def_insert_split1 = """
        INSERT INTO data (ref_uid, data)
        VALUES (?,?)"""
    blob_size = 5000
    k_folds = 5
    some_names = ['apple', 'banana', 'cherry', 'date']
    dbconn = sqlite3.connect(db_file)
    dbconn.execute(def_schema_split0)
    dbconn.execute(def_schema_split1)
    rng = np.random.RandomState()
    for n in range(num_points):
        if n % 1000 == 0 and VERBOSE:
            print n
        # Make up some data
        data = buffer(rng.rand(blob_size).astype(float))
        fold = rng.randint(k_folds)
        label = rng.randint(num_classes)
        rng.shuffle(some_names)
        # And add it
        dbconn.execute(def_insert_split0, [some_names[0], label, fold])
        ref_uid = dbconn.execute("SELECT uid FROM main WHERE rowid=last_insert_rowid()").fetchone()[0]
        dbconn.execute(def_insert_split1, [ref_uid, data])
    dbconn.execute(ref_uid_index)
    dbconn.commit()
    return dbconn

def timeit_join(dbconn, n_times=10, num_rows=500):
    qmarks = "?," * (num_rows - 1) + "?"
    q_join = """SELECT data.data, main.uid, main.label
                FROM data INNER JOIN main ON main.uid=data.ref_uid
                WHERE main.uid IN (%s)""" % qmarks
    row_max = dbconn.execute("SELECT MAX(rowid) from main").fetchone()[0]
    tstamps = []
    for n in range(n_times):
        now = time.time()
        uids = np.random.randint(low=1, high=row_max, size=num_rows).tolist()
        res = dbconn.execute(q_join, uids).fetchall()
        tstamps += [time.time() - now]
        print tstamps[-1]
Now, if you want to replicate things, do the following. On my machine, this creates an 800MB database and produces something like below.
>>> db = populate_db_split('/some/file/path.db')
>>> timeit_join(db)
12.0593519211
5.56209111214
3.51154184341
2.20699000359
1.73895692825
1.18351387978
1.27329611778
0.934082984924
0.780968904495
0.834318161011
So... what say you, knowledgeable sages?
Database files of GB size are never loaded into memory entirely. They are split into a tree of so-called pages. These pages are cached in memory; the default is 2000 pages.
You can use the following statement to, for example, double the number of cached pages of 1 kB size:
conn.execute("""PRAGMA cache_size = 4000""")
The connection also keeps a cache of the last 100 statements, as you can see in the function signature:
sqlite3.connect(database[, timeout, detect_types, isolation_level, check_same_thread, factory, cached_statements])
cached_statements expects an integer and defaults to 100.
Apart from setting the cache size, you are not likely to benefit from actively caching statements or pages at application start.
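For reference, applying both settings from Python looks something like this (the numbers are just examples):

import sqlite3

conn = sqlite3.connect('/some/file/path.db', cached_statements=200)
conn.execute("PRAGMA cache_size = 4000")   # 4000 pages at the default 1 kB page size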
