Python random character string repeated 7/2000 records

I am using the below to generate a random set of characters and numbers:
tag = ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(36)])
I thought that this was a decent method. 36 character length, with each character being one of 36 unique options. Should be a good amount of randomness, right?
Then, I was running a query off an instance with what I thought was a unique tag. Turns out, there were SEVEN (7) records with the same "random" tag. So, I opened the DB, and ran a query to see the repeatability of my tags.
Turns out that not only does mine show up 7 times, but there are a number of tags that repeatedly appear over and over again. With approximately 2000 rows, it clearly should not be happening.
Two questions:
(1) What is wrong with my approach, and why would it be repeating the same tag so often?
(2) What would be a better approach to get unique tags for each record?
Here is the code I am using to save this to the DB. While it is written for Django, this is clearly not a Django-related question.
class Note(models.Model):
    ...
    def save(self, *args, **kwargs):
        import random
        import string
        self.tag = ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(36)])
        super(Note, self).save(*args, **kwargs)

The problem with your approach:
True randomness/crypto is hard; you should use tested existing solutions instead of implementing your own.
Randomness isn't a uniqueness guarantee: while 'unlikely', nothing prevents the same string from being generated more than once.
A better solution would be to not reinvent the wheel, and use the uuid module, a common solution for generating unique identifiers:
import uuid
tag = uuid.uuid1()
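For illustration, a minimal sketch of both variants; note that uuid1() embeds the host's MAC address and a timestamp (which makes it guessable), while uuid4() is generated from random bits:

```python
import uuid

# uuid4() is built from random bits; uuid1() embeds the host MAC
# address and a timestamp, which may leak information.
tag = str(uuid.uuid4())        # 36 characters including hyphens
hex_tag = uuid.uuid4().hex     # 32 hex characters, no hyphens
```

The hex form fits nicely in a 32-character CharField if the hyphens are unwanted.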

Use a cryptographically secure PRNG with random.SystemRandom(). It draws its randomness from the operating system's entropy source (os.urandom()).
tag = ''.join(random.SystemRandom().choice(string.ascii_letters + string.digits) for n in xrange(36))
Note that there is no need to pass this as a list comprehension to join().
There are 62^36 possible combinations, a number with 65 digits, so duplicates should be extremely rare, even if you take the birthday paradox into consideration.
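On Python 3.6+, the same idea is available through the secrets module (which wraps SystemRandom); a sketch:

```python
import secrets
import string

# 52 letters + 10 digits = 62 possible characters per position
alphabet = string.ascii_letters + string.digits
tag = ''.join(secrets.choice(alphabet) for _ in range(36))
```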

Related

Django/Python: generate a unique gift-card code from a UUID

I'm using Django, and in my model I'm using UUID v4 as the primary key.
I'm using this UUID to generate a QR code for a sort of gift card.
Now the customer also wants a 10-character gift-card code, so the gift card can be redeemed either by scanning the QR code (the current UUID-based version) or by typing the 10-character code in manually.
So I need to find a way to generate this gift code. Obviously the code must be unique.
I found an article where the author suggests including the auto-generated integer id in the generated code (for example, at the end of a random string). I'm not sure about this because I only have 10 characters: for a long id, I would burn most of the available characters just concatenating this unique section.
For example, if my id is 609234 I will have {random-string with length 4} + 609234.
I also don't like this solution because I don't think it's very secure; it's better to have a completely random code. From a malicious user's point of view, the codes would follow a recognizable format.
Do you know a way to generate a unique random string from an input key (in my case the UUIDv4)?
Otherwise, do you know of an algorithm/approach for generating voucher codes?
import string
import secrets
unique_digits = string.digits
password = ''.join(secrets.choice(unique_digits) for i in range(6))
print(password)
The snippet above generates a random numeric code of whatever length you want; as written, it prints a 6-digit integer code.
If that's not what you want, let me know exactly what you need.

Algorithm for A/B testing

I need to develop an A/B testing method for my users. Basically I need to split my users into a number of groups - for example 40% and 60%.
I have around 1,000,000 users and I need to know what would be my best approach. Random numbers are not an option because the users would get different results each time. My second option is to alter my database so each user has a predefined number (randomly generated). The downside is that if a user gets 50, for example, they will always have that number unless I create a new user. I don't mind, but I'm not sure that altering the database is a good idea for that purpose.
Are there any other solutions so I can avoid that?
Run a simple algorithm against the primary key. For instance, if you have an integer for user id, separate by even and odd numbers.
Use a mod function if you need more than 2 groups.
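A sketch of that idea with weighted buckets (the function name and the 40/60 weights are illustrative):

```python
def assign_group(user_id, weights=(40, 60)):
    """Deterministically map an integer user id into weighted groups.

    `weights` must sum to 100; returns the index of the group.
    """
    bucket = user_id % 100          # stable 0..99 value per user
    cumulative = 0
    for group, weight in enumerate(weights):
        cumulative += weight
        if bucket < cumulative:
            return group
```

Because the mapping depends only on the id, the same user always lands in the same group, with no schema change. Note this only distributes evenly if the ids themselves are roughly uniform modulo 100 (sequential auto-increment ids are).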
Well, you are using MySQL, so whether it's a good idea or not is hard to tell. Altering databases can be costly, and it could affect performance in the long run as the table grows. You would also have to modify your system to write that number for every new user. You tagged this as a Python question, so here is another way of doing it without making any changes to the database. Since you are talking about users, you probably have a unique identifier for all of them, say an e-mail address. Instead of e-mails, I'll be using UUIDs.
import hashlib

def calculateab(email):
    maxhash = 16**40
    emailhash = int(hashlib.sha1(email).hexdigest(), 16)
    div = (maxhash/100)-1
    return int(float(emailhash/div))

# A small demo
if __name__ == '__main__':
    import uuid, time, json
    emails = []
    verify = {}
    for i in range(1000000):
        emails.append(str(uuid.uuid4()))
    starttime = time.time()
    for i in emails:
        ab = calculateab(i)
        if ab not in verify:
            verify[ab] = 1
        else:
            verify[ab] += 1
    # json for your eye's pleasure
    print json.dumps(verify, indent=4)
    # if you look at the numbers, you'll see that they are well distributed, so
    # unless you are going to do this every second for all users, it should work fine
    print "total calculation time {0} seconds".format((time.time() - starttime))
Not that much to do with Python; it's more of a math solution. You could use md5, sha1, or anything along those lines, as long as it has a fixed length and is a hex number. The -1 in the div calculation is optional: it sets the range from 0 to 99 instead of 1 to 100. You could also modify it to use floats, which gives you greater flexibility.
I would add an auxiliary table with just userId and the A/B group. You do not change the existing table, and it is easy to change the percentage per class if you ever need to. It is minimally invasive.
Here is the JS one liner:
const AB = (str) => parseInt(sha1(str).slice(0, 1), 16) % 2 === 0 ? 'A': 'B';
and the result for 10 million random emails:
{ A: 5003530, B: 4996470 }
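The same one-liner translated to Python (using hashlib, since the JS version assumes a sha1 helper is in scope):

```python
import hashlib

def ab(email):
    # parity of the first hex digit of the SHA-1 digest picks the group
    first_digit = int(hashlib.sha1(email.encode('utf-8')).hexdigest()[0], 16)
    return 'A' if first_digit % 2 == 0 else 'B'
```

The assignment is deterministic: the same e-mail always maps to the same group, so no per-user state needs to be stored.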

Why is this Python method leaking memory?

This method iterates over a list of terms from the database, checks whether each term appears in the text passed as an argument, and if one does, replaces it with a link to the search page with the term as a parameter.
The number of terms is high (about 100,000), so the process is pretty slow, but that's OK since it is performed as a cron job. However, it causes the script's memory consumption to skyrocket and I can't find why:
class SearchedTerm(models.Model):

    [...]

    @classmethod
    def add_search_links_to_text(cls, string, count=3, queryset=None):
        """
        Take a list of all researched terms and search them in the
        text. If they exist, turn them into links to the search
        page.
        This process is limited to `count` replacements maximum.

        WARNING: because the sites got different URL schemas, we don't
        provide direct links, but we inject the {% url %} tag
        so it must be rendered before display. You can use the `eval`
        tag from `libs` for this. Since they got different namespaces as
        well, we enter a generic 'namespace' and delegate to the
        template to change it to the proper one as well.

        If you have a batch process to do, you can pass a queryset
        that will be used instead of getting all searched terms on
        each call.
        """
        found = 0
        terms = queryset or cls.on_site.all()
        # to avoid duplicate searched terms being replaced twice,
        # keep a list of already-linkified content;
        # we add the words we are going to insert with the link so they
        # won't match in case of multiple passes
        processed = set((u'video', u'streaming', u'title',
                         u'search', u'namespace', u'href', u'title',
                         u'url'))
        for term in terms:
            text = term.text.lower()
            # skip small words, and make a
            # quick check to avoid all the rest of the matching
            if len(text) < 3 or text not in string:
                continue
            if found and cls._is_processed(text, processed):
                continue
            # match the search word with accents, in any case;
            # ensure this is not part of a word by including
            # two 'non-letter' characters on both ends of the word
            pattern = re.compile(ur'([^\w]|^)(%s)([^\w]|$)' % text,
                                 re.UNICODE | re.IGNORECASE)
            if re.search(pattern, string):
                found += 1
                # create the link string and
                # replace the word in the description;
                # use back references (\1, \2, etc.) to preserve the original
                # formatting;
                # use raw unicode strings (ur"string" notation) to avoid
                # problems with accents and escaping
                query = '-'.join(term.text.split())
                url = ur'{%% url namespace:static-search "%s" %%}' % query
                replace_with = ur'\1<a title="\2 video streaming" href="%s">\2</a>\3' % url
                string = re.sub(pattern, replace_with, string)
                processed.add(text)
                if found >= count:
                    break
        return string
You'll probably want this code as well:
class SearchedTerm(models.Model):

    [...]

    @classmethod
    def _is_processed(cls, text, processed):
        """
        Check if the text is part of an already processed string.
        We don't use `in` on the set, but `in` on each string of the set,
        to avoid substring matching that would destroy the tags.
        This is mainly a utility function, so you probably won't use
        it directly.
        """
        if text in processed:
            return True
        return any((text in string) for string in processed)
I really have only two objects holding references that could be the suspects here: terms and processed. But I can't see any reason for them not to be garbage collected.
EDIT:
I think I should say that this method is called inside a Django model method itself. I don't know if it's relevant, but here is the code:
class Video(models.Model):

    [...]

    def update_html_description(self, links=3, queryset=None):
        """
        Take a list of all researched terms and search them in the
        description. If they exist, turn them into links to the search
        engine. Put the result into `html_description`.

        This uses `add_search_links_to_text` and therefore has the same
        limitations.

        It DOESN'T call save().
        """
        queryset = queryset or SearchedTerm.objects.filter(sites__in=self.sites.all())
        text = self.description or self.title
        self.html_description = SearchedTerm.add_search_links_to_text(text,
                                                                      links,
                                                                      queryset)
I can imagine that Python's automatic regex caching eats up some memory. But that should happen only once, whereas the memory consumption goes up on every call of update_html_description.
The problem is not just that it consumes a lot of memory; the problem is that it does not release it: every call takes about 3% of the RAM, eventually filling it up and crashing the script with 'cannot allocate memory'.
The whole queryset is loaded into memory once you iterate over it; that is what eats up your memory. You want to fetch chunks of results if the result set is that large. It might mean more hits on the database, but it will also mean a lot less memory consumption.
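One way to do that is to fetch in slices so only one chunk is materialized at a time. A generic sketch (it works on anything sliceable; Django querysets translate the slice into SQL LIMIT/OFFSET, and on newer Django versions `queryset.iterator()` serves a similar purpose):

```python
def in_chunks(queryset, chunk_size=1000):
    """Yield items slice by slice so only one chunk lives in memory."""
    start = 0
    while True:
        chunk = list(queryset[start:start + chunk_size])
        if not chunk:
            break
        for item in chunk:
            yield item
        start += chunk_size
```

In the method above you would then write `for term in in_chunks(terms):` instead of `for term in terms:`.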
I was completely unable to find the cause of the problem, but for now I'm bypassing it by isolating the infamous snippet in a separate script (using subprocess) that contains this method call. The memory goes up, but of course it goes back to normal after the Python process dies.
Talk about dirty.
But that's all I got for now.
Make sure that you aren't running with DEBUG = True: in debug mode, Django keeps a record of every executed SQL query in memory (django.db.connection.queries), which grows without bound in a long-running process.
I think I should say that this method is called inside a Django model method itself.
@classmethod
Why? Why is this "class level"?
Why aren't these ordinary methods that can have ordinary scope rules and -- in the normal course of events -- get garbage collected?
In other words (in the form of an answer):
Get rid of @classmethod.

Django query phone numbers excluding brackets

I am trying to build a Django query that filters on phone numbers (a CharField) while ignoring brackets and spaces,
eg.
if I search for 0123456789 it would find (01) 234 567 89
Thanks
Well, you can either use regex, or you can reformat your search:
pn = '0123456789'
Model.objects.filter(phone='(%s) %s %s %s' % (pn[:2], pn[2:5], pn[5:8], pn[8:]))
Ideally you normalize all phone numbers and search for them in that format. If you check out django.contrib.localflavor.us's PhoneNumberField, it forces all new phone numbers to be saved in XXX-XXX-XXXX format, for instance. If you aren't normalizing the phone numbers somehow, you should be. Dealing with multiple potential formats would not be fun.
You can also use regular expressions in your lookup. See: https://docs.djangoproject.com/en/dev/ref/models/querysets/#s-regex
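For the regex route, one sketch is to build a pattern that tolerates any run of non-digit characters between the digits of the search term (the model and field names in the comment are placeholders):

```python
import re

pn = '0123456789'
# allow any non-digits between consecutive digits of the search term
pattern = '[^0-9]*'.join(re.escape(d) for d in pn)
# In Django this would be used as:
#   Model.objects.filter(phone__regex=pattern)
# Locally, the pattern matches formatted variants of the number:
match = re.search(pattern, '(01) 234 567 89')
```

Note that `__regex` is evaluated by the database's regex engine, so stick to portable constructs like the character class used here.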
I never like to answer my own questions, but this solution might be helpful to others doing a similar thing:
I defined a function on the model using the @property decorator.
class MyModel(models.model):
    ....
    phoneNumber = CharField...

    @property
    def raw_phone_number(self):
        # strip the formatting characters and return just the digits
        return re.sub(r'[^0-9]', '', self.phoneNumber)

Search range of int values using djapian

I'm using djapian as my search backend, and I'm looking to search for a range of values. For example:
query = 'comments:(0..10)'
Post.indexer.search(query)
would search for Posts with between 0 and 10 comments. I cannot find a way to do this in djapian, though I have found this issue, and a patch, to implement some kind of date-range searching. I also found this page from the official Xapian docs describing range queries. However, I lack the knowledge to formulate my own raw Xapian query and/or feed a raw Xapian query into djapian. So help me, SO: how can I query a djapian index for a range of int values?
Thanks,
Laurie
Ok, I worked it out. I'll leave the answer here for posterity.
The first thing to do is to attach a NumberValueRangeProcessor to the QueryParser. You can do this by extending the djapian Indexer._get_query_parser. Note the leading underscore. Below is a code snippet showing how I did it.
from djapian import Indexer
from xapian import NumberValueRangeProcessor

class RangeIndexer(Indexer):
    def _get_query_parser(self, *args, **kwargs):
        query_parser = Indexer._get_query_parser(self, *args, **kwargs)
        valno = self.free_values_start_number + 0
        nvrp = NumberValueRangeProcessor(valno, 'value_range:', True)
        query_parser.add_valuerangeprocessor(nvrp)
        return query_parser
Lines to note:
valno = self.free_values_start_number + 0
self.free_values_start_number is an int used as the value number; it is the index of the first column where fields start being defined. I added 0 to it to mark where you should add the index of the field you want the range search for.
nvrp = NumberValueRangeProcessor(valno, 'value_range:', True)
We pass valno to tell the processor which field to deal with. 'value_range:' is the prefix for the processor, so we can search with 'value_range:(0..100)'. The True simply indicates that 'value_range:' should be treated as a prefix, not a suffix.
query_parser.add_valuerangeprocessor(nvrp)
This simply adds the NumberValueRangeProcessor to the QueryParser.
Hope that helps anyone who has any problems with this matter. Note that you will need to add a new NumberValueRangeProcessor for each field you want to be able to range search.