Assuming I have the following sequences:
A-B-C-D
A-C-C-E
B-B-B-D
A-A-E-D
...
I need to assign a unique numerical ID to every element, e.g. A=0, B=1 and so on, and work with those IDs. At the moment I generate the IDs with the following functions:
id = -1
ids = dict()
def getid():
    global id
    id += 1
    return id
def genid(s):
    global id
    if not s in ids:
        ids[s] = getid()
    return ids[s]
I'm a beginner, so it may not be the perfect solution, but it works. However, I worry that it will be very slow/inefficient for a large number of sequences and elements (imagine that instead of A, B etc. the elements are combinations of letters such as ABCD, XYZ and so on). I believe Python has mechanisms to achieve this in a more compact way. Maybe the collections library has something that can do this in 1-2 lines?
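The uuid module will generate a unique id which can be represented as an int, bytes, or hex.
Just import uuid and then use uuid.uuid1().bytes or uuid.uuid1().int or uuid.uuid1().hex to get your id. Note that uuid1() builds the id from the host and the current time; use uuid.uuid4() if you want a random one. Each call returns a new id, so you still need to store the id assigned to each element.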
You can avoid global altogether, and as suggested use count:
from itertools import count
id_counter = count()
ids = dict()
def getid():
    return next(id_counter)
def genid(s):
    if s not in ids:
        ids[s] = getid()
    return ids[s]
You could use some "python magic" to make it shorter:
from itertools import count
def genid(s, id_counter=count(), ids={}):
    if s not in ids:
        ids[s] = next(id_counter)
    return ids[s]
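Since the question asks about the collections library, here is a minimal sketch of one possible two-line variant. It assumes a collections.defaultdict whose factory is the __next__ method of an itertools.count; the names ids, sequences and encoded are only illustrative.

```python
from collections import defaultdict
from itertools import count

# Looking up a missing key calls the factory, which hands out the next integer.
ids = defaultdict(count().__next__)

sequences = ["A-B-C-D", "A-C-C-E", "B-B-B-D", "A-A-E-D"]
# Translate every sequence into its list of numeric IDs.
encoded = [[ids[element] for element in seq.split("-")] for seq in sequences]

print(dict(ids))   # {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
print(encoded[1])  # [0, 2, 2, 4]
```

One thing to keep in mind with this approach: merely reading ids[x] for an unseen x creates a new entry, so use `x in ids` when you only want to test membership.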
I need a solution where I can generate a unique alphanumeric ID column for my dataframe. The IDs must remain the same even if I run the script again in the future.
Name
Sam
Pray
Brad
I can generate the IDs based on this post, but I need 5-character alphanumeric values which will always remain the same.
This is desired output:
Name ID
Sam X25TR
Peter WE558
Pepe TR589
One way would be to generate a hash of the name, by whatever hashing algorithm, and keep the first five characters of the hash. But you should keep in mind that with such a short hash, collisions (the same output for multiple different inputs) become likely once you have enough data.
Something along these lines:
import hashlib
def get_id(name: str) -> str:
    hash = hashlib.md5(name.encode())
    return hash.hexdigest()[:5]
Now for a given input string, get_id returns an alphanumeric 5-character string which is always the same for the same input.
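A hex digest only uses the characters 0-9 and a-f. As a rough sketch of a variation on the same idea, the version below encodes a SHA-256 digest in Base32 (A-Z and 2-7) and refuses to continue when two names collide. The names stable_id and assign_ids are hypothetical helpers, not part of any library.

```python
import base64
import hashlib


def stable_id(name: str, length: int = 5) -> str:
    """Return the first `length` Base32 characters of the SHA-256 of `name`."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # Base32 output uses only A-Z and 2-7, so every character is alphanumeric.
    return base64.b32encode(digest).decode("ascii")[:length]


def assign_ids(names):
    """Map each name to its ID, raising ValueError if two names collide."""
    owner = {}   # ID -> first name that produced it
    result = {}  # name -> ID
    for name in names:
        new_id = stable_id(name)
        if owner.setdefault(new_id, name) != name:
            raise ValueError(f"{name!r} and {owner[new_id]!r} both map to {new_id}")
        result[name] = new_id
    return result


table = assign_ids(["Sam", "Pray", "Brad"])
print(table)
```

Because the ID depends only on the name, rerunning the script later gives the same values; the collision check makes the short-hash risk visible instead of silently producing duplicate IDs.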
This function generates a random alphanumeric string of a given length:
import math
import secrets
def random_alphanum(length: int) -> str:
    text = secrets.token_hex(nbytes=math.ceil(length / 2))
    is_even = length % 2 == 0
    return text if is_even else text[1:]
# note: this assigns the same value to every row
df['ID'] = random_alphanum(5)
To generate a separate value for each row:
df2['ID'] = df2.apply(lambda x: random_alphanum(5), axis=1, result_type="expand")
Here's my attempt
import pandas as pd
import secrets
ids = []
while len(ids) < df.shape[0]:
    temp = secrets.token_hex(5)[:5]
    if temp not in ids:
        ids.append(temp)
df.merge(pd.DataFrame(ids).reset_index(), left_on = df.groupby(['Name']).ngroup(), right_on = 'index')
I have the following class that keeps track of an OrderedDict:
from collections import OrderedDict
class LexDict:
    def __init__(self):
        self.m_map = OrderedDict() # maps string which is case-sensitive to int
    def set(self,id,seqNo):
        self.m_map[id] = seqNo
    def get(self,id): # seqNo returned
        return self.m_map[id] if self.has(id) else 0
    def has(self,id): # bool value
        return ( id in self.m_map.keys() )
    def to_str(self):
        stream = ""
        for key,value in self.m_map.items():
            stream = stream + key + ":" + str(value) + " "
        return stream.rstrip()
My goal is to change the set() method so that the dictionary stays in lexicographic order at all times; no matter when to_str() is called, the output will be in lexicographic order. We can safely assume no mapping in this dictionary will be removed. This will be used in a network setting, so efficiency is key: sorting the entire list on every call, rather than inserting each entry in the correct spot, would hurt performance.
An example of how this could be used.
a = LexDict()
a.set("/Justin",1) # the id will have "/"s (maybe even many) in it, we can imagine them without "/"s for sorting
a.set("/James",600)
a.set("/Austin",-123)
print( a.to_str() )
Output /Austin:-123 /James:600 /Justin:1
I am a little confused. It sounds like you are referring to the OrderedDict class from the sortedcollections module; that module also contains what you are looking for, namely a SortedDict. In general, the sortedcollections module contains many containers for working efficiently with large lists and dictionaries. A SortedDict keeps its keys in sorted order as items are added, so inserting a new key costs roughly O(log(n)) instead of re-sorting everything, while looking up a key stays O(1) on average, as with a normal Python dict().
from sortedcollections import SortedDict
D = SortedDict([("/James",600),("/Justin",1),("/Austin",-123)])
print(D)
In general, SortedDict and SortedList stay fast for inserts and lookups even when they hold millions of values.
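If a third-party dependency is not an option, the same idea can be approximated with the standard library. This is only a sketch: it assumes a plain dict for the values plus a separate key list kept sorted with bisect.insort, and the attribute names _map and _keys are made up for the example.

```python
from bisect import insort


class LexDict:
    def __init__(self):
        self._map = {}    # id -> seqNo
        self._keys = []   # ids, always kept in lexicographic order

    def set(self, id, seq_no):
        if id not in self._map:
            # Binary search for the position, then insert in place.
            insort(self._keys, id)
        self._map[id] = seq_no

    def get(self, id):
        return self._map.get(id, 0)

    def has(self, id):
        return id in self._map

    def to_str(self):
        return " ".join(f"{key}:{self._map[key]}" for key in self._keys)


a = LexDict()
a.set("/Justin", 1)
a.set("/James", 600)
a.set("/Austin", -123)
print(a.to_str())  # /Austin:-123 /James:600 /Justin:1
```

The binary search is O(log(n)), but inserting into a Python list still shifts the elements after the insertion point, so this is best suited to moderate sizes; to_str() never has to sort.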
I have a function like below:
def fun(content):
    for i in content:
        id = i.split('\"')[0]
        yield id
    return id
The problem is that there are some duplicated values in content.
Is there any way to know if the value 'id' is already in the generator 'id'? Rather than get the final generator then use set()?
You can use a set inside fun to keep track of the ids that have been seen already:
def fun(content):
    observed = set()
    for i in content:
        id = i.split('\"')[0]
        if id not in observed:
            observed.add(id)
            yield id
Also, since you're yielding ids you don't need to return at the end.
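To illustrate how the set-based generator behaves, here is a small usage sketch. The sample content list is invented for the example; the dict.fromkeys line is an alternative that also preserves first-seen order but builds the whole result at once instead of yielding lazily.

```python
def fun(content):
    observed = set()
    for line in content:
        id_ = line.split('"')[0]
        if id_ not in observed:
            observed.add(id_)
            yield id_


# Hypothetical input: the text before the first double quote is the id.
content = ['a1"x', 'b2"y', 'a1"z', 'c3"w', 'b2"v']

print(list(fun(content)))  # ['a1', 'b2', 'c3']

# Eager alternative: dict keys are unique and keep insertion order.
eager = list(dict.fromkeys(line.split('"')[0] for line in content))
print(eager)               # ['a1', 'b2', 'c3']
```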
For our python project we have to solve multiple questions. We are however stuck at this one:
"Write a function that, given a FASTA file name, returns a dictionary with the sequence IDs as keys, and a tuple as value. The value denotes the minimum and maximum molecular weight for the sequence (sequences can be ambiguous)."
import collections
from Bio import Seq
from itertools import product
def ListMW(file_name):
    seq_records = SeqIO.parse(file_name, 'fasta',alphabet=generic_dna)
    for record in seq_records:
        dictionary = Seq.IUPAC.IUPACData.ambiguous_dna_values
        result = []
        for i in product(*[dictionary[j] for j in record]):
            result.append("".join(i))
        molw = []
        for sequence in result:
            molw.append(SeqUtils.molecular_weight(sequence))
        tuple= (min(molw),max(molw))
        if min(molw)==max(molw):
            dict={record.id:molw}
        else:
            dict={record.id:(min(molw), max(molw))}
        print(dict)
Using this code we manage to get this output:
{'seq_7009': (6236.9764, 6367.049999999999)}
{'seq_418': (3716.3642000000004, 3796.4124000000006)}
{'seq_9143_unamb': [4631.958999999999]}
{'seq_2888': (5219.3359, 5365.4089)}
{'seq_1101': (4287.7417, 4422.8254)}
{'seq_107': (5825.695099999999, 5972.8073)}
{'seq_6946': (5179.3118, 5364.420900000001)}
{'seq_6162': (5531.503199999999, 5645.577399999999)}
{'seq_504': (4556.920899999999, 4631.959)}
{'seq_3535': (3396.1715999999997, 3446.1969999999997)}
{'seq_4077': (4551.9108, 4754.0073)}
{'seq_1626_unamb': [3724.3894999999998]}
As you can see, this is not one dictionary but multiple dictionaries, one under the other. Is there any way we can change our code, or add an extra command, to get the output in this format:
{'seq_7009': (6236.9764, 6367.049999999999),
'seq_418': (3716.3642000000004, 3796.4124000000006),
'seq_9143_unamb': (4631.958999999999),
'seq_2888': (5219.3359, 5365.4089),
'seq_1101': (4287.7417, 4422.8254),
'seq_107': (5825.695099999999, 5972.8073),
'seq_6946': (5179.3118, 5364.420900000001),
'seq_6162': (5531.503199999999, 5645.577399999999),
'seq_504': (4556.920899999999, 4631.959),
'seq_3535': (3396.1715999999997, 3446.1969999999997),
'seq_4077': (4551.9108, 4754.0073),
'seq_1626_unamb': (3724.3894999999998)}
Or can we in some way make clear that it should use the sequence ID as the key and the molecular weight as the value of one dictionary?
Create a dictionary right before your for loop, then update it inside the loop, like this:
import collections
from Bio import Seq, SeqIO, SeqUtils
from Bio.Alphabet import generic_dna  # only available in Biopython < 1.78
from itertools import product
def ListMW(file_name):
    seq_records = SeqIO.parse(file_name, 'fasta',alphabet=generic_dna)
    retDict = {}
    for record in seq_records:
        dictionary = Seq.IUPAC.IUPACData.ambiguous_dna_values
        result = []
        for i in product(*[dictionary[j] for j in record]):
            result.append("".join(i))
        molw = []
        for sequence in result:
            molw.append(SeqUtils.molecular_weight(sequence))
        if min(molw)==max(molw):
            retDict[record.id] = molw
        else:
            retDict[record.id] = (min(molw), max(molw))
    # instead of printing inside the loop, return the dictionary
    # and print it at the end of your function / script
    return retDict
Right now, you're creating a new dict on each iteration of your loop and printing it, so printing many separate dicts is simply the expected behaviour of your code.
You're creating a dictionary with 1 entry at each iteration.
You want to:
define a dict variable (better use dct to avoid reusing built-in type name) before your loop
rewrite the assignment to dict in the loop
So before the loop:
dct = {}
and in the loop (instead of your if + dict = code), in a ternary expression, with min & max computed only once:
minval = min(molw)
maxval = max(molw)
dct[record.id] = molw if minval == maxval else (minval,maxval)
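The accumulation pattern itself does not depend on Biopython. Below is a minimal sketch assuming the per-sequence weights are already available in a mapping from record ID to a list of weights; weight_ranges is a hypothetical helper, and the numbers are taken from the output shown in the question.

```python
def weight_ranges(weights_by_id):
    """Collect a (min, max) tuple per record ID in one dictionary."""
    ranges = {}  # created once, before the loop
    for record_id, molw in weights_by_id.items():
        # Compute min and max once; always store a tuple, as the task asks.
        ranges[record_id] = (min(molw), max(molw))
    return ranges


sample = {
    "seq_504": [4556.920899999999, 4631.959],
    "seq_1626_unamb": [3724.3894999999998],
}
result = weight_ranges(sample)
print(result)
```

Storing a tuple even for unambiguous sequences is a design choice made here so that every value has the same shape; the answers above keep the list form for that case instead.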
I am trying to get a random object from a model A
For now, it is working well with this code:
random_idx = random.randint(0, A.objects.count() - 1)
random_object = A.objects.all()[random_idx]
But I feel this code is better:
random_object = A.objects.order_by('?')[0]
Which one is best? Could deleted objects cause a problem with the first snippet? For example, I may have 10 objects, but the object with id 10 no longer exists. Have I misunderstood something about A.objects.all()[random_idx]?
Just been looking at this. The line:
random_object = A.objects.order_by('?')[0]
has reportedly brought down many servers.
Unfortunately, Erwan's code caused an error when accessing non-sequential ids.
There is another short way to do this:
import random
items = list(Product.objects.all())
# change 3 to how many random items you want
random_items = random.sample(items, 3)
# if you want only a single random item
random_item = random.choice(items)
The good thing about this is that it handles non-sequential ids without error.
Improving on all of the above:
from random import choice
pks = A.objects.values_list('pk', flat=True)
random_pk = choice(pks)
random_obj = A.objects.get(pk=random_pk)
We first get a list of potential primary keys without loading any Django object, then we randomly choose one primary key, and then we load the chosen object only.
The second bit of code is correct, but can be slower, because in SQL that generates an ORDER BY RANDOM() clause that shuffles the entire set of results, and then takes a LIMIT based on that.
The first bit of code still has to evaluate the entire set of results. E.g., what if your random_idx is near the last possible index?
A better approach is to pick a random ID from your database and fetch that object (a primary key lookup, so it's fast). We can't assume that every id between 1 and MAX(id) exists, since you may have deleted something. The following approximation works out well:
import random
# grab the max id in the database
max_id = A.objects.order_by('-id')[0].id
# grab a random possible id. we don't know if this id does exist in the database, though
# (randint is inclusive on both ends, so max_id is the upper bound)
random_id = random.randint(1, max_id)
# return an object with that id, or the first object with an id greater than that one
# this is a fast lookup, because your primary key is indexed.
random_object = A.objects.filter(id__gte=random_id).order_by('id')[0]
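It is an approximation because objects that sit right after a gap are chosen more often. The following plain-Python sketch imitates the "first id greater than or equal to a random id" lookup on a sorted list, with no database involved; existing_ids and pick_gte are hypothetical names used only for this illustration.

```python
import bisect
import random
from collections import Counter

# Hypothetical table: ids 4 to 9 have been deleted.
existing_ids = [1, 2, 3, 10]


def pick_gte(ids, rng):
    """Return the first id >= a random value in 1..max(ids); ids must be sorted."""
    random_id = rng.randint(1, ids[-1])
    return ids[bisect.bisect_left(ids, random_id)]


rng = random.Random(0)
counts = Counter(pick_gte(existing_ids, rng) for _ in range(10_000))
print(counts)
```

Here id 10 is returned whenever the random value falls anywhere in 4..10, so it comes up far more often than ids 1, 2 or 3. Whether that skew matters depends on how many gaps the table has.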
How about calculating maximal primary key and getting random pk?
The book ‘Django ORM Cookbook’ compares execution time of the following functions to get random object from a given model.
import random
from django.db.models import Max
from myapp.models import Category
def get_random():
    return Category.objects.order_by("?").first()
def get_random3():
    max_id = Category.objects.all().aggregate(max_id=Max("id"))['max_id']
    while True:
        pk = random.randint(1, max_id)
        category = Category.objects.filter(pk=pk).first()
        if category:
            return category
Test was made on a million DB entries:
In [14]: timeit.timeit(get_random3, number=100)
Out[14]: 0.20055226399563253
In [15]: timeit.timeit(get_random, number=100)
Out[15]: 56.92513192095794
See source.
After seeing those results I started using the following snippet:
from django.db.models import Max
import random
def get_random_obj_from_queryset(queryset):
    max_pk = queryset.aggregate(max_pk=Max("pk"))['max_pk']
    while True:
        obj = queryset.filter(pk=random.randint(1, max_pk)).first()
        if obj:
            return obj
So far it has done the job, as long as the model has an integer id.
Notice that the get_random3 (get_random_obj_from_queryset) function won’t work if you replace model id with uuid or something else. Also, if too many instances were deleted the while loop will slow the process down.
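The effect of many deleted rows on the retry loop can be seen without a database. This is a rough sketch under the assumption that the existing primary keys are available as a set; pick_by_rejection and its max_attempts fallback are hypothetical additions, not part of the snippet above.

```python
import random


def pick_by_rejection(existing, max_pk, rng, max_attempts=100):
    """Try random keys in 1..max_pk until one exists; fall back after max_attempts."""
    for _ in range(max_attempts):
        pk = rng.randint(1, max_pk)
        if pk in existing:
            return pk
    # Too many gaps: choose directly among the keys that do exist.
    return rng.choice(sorted(existing))


rng = random.Random(42)
dense = set(range(1, 1001))        # no gaps: the first attempt always hits
sparse = {1, 500_000, 1_000_000}   # almost everything deleted

print(pick_by_rejection(dense, 1000, rng) in dense)          # True
print(pick_by_rejection(sparse, 1_000_000, rng) in sparse)   # True
```

With the sparse set each attempt succeeds only about three times in a million, which is why a cap on attempts (or a different strategy) is worth considering when many rows have been removed.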
Yet another way:
from random import randint
pks = A.objects.values_list('pk', flat=True)
random_idx = randint(0, len(pks)-1)
random_obj = A.objects.get(pk=pks[random_idx])
Works even if there are larger gaps in the pks, for example if you want to filter the queryset before picking one of the remaining objects at random.
EDIT: fixed the call to randint (thanks to @Quique). The stop argument is inclusive.
https://docs.python.org/3/library/random.html#random.randint
I'm sharing my latest test result with Django 2.1.7, PostgreSQL 10.
students = Student.objects.all()
for i in range(500):
    student = random.choice(students)
    print(student)
# 0.021996498107910156 seconds
for i in range(500):
    student = Student.objects.order_by('?')[0]
    print(student)
# 0.41299867630004883 seconds
With the queryset loaded once and reused, random fetching with random.choice() is roughly 19 times faster in this test (about 0.022 s versus 0.413 s for 500 picks).
In Python, to get a random member of a sequence such as a list or tuple, you can use the random module. (A set has to be converted to a list or tuple first, because it cannot be indexed.)
The random module has a function named choice; it takes a sequence and returns one of its members at random.
Because a Django queryset supports len() and indexing, you can also pass a queryset to random.choice.
first import the random module:
import random
then create a list:
my_iterable_object = [1, 2, 3, 4, 5, 6]
or create a query_set like this:
my_iterable_object = mymodel.objects.filter(name='django')
and for getting a random member of your iterable object use choice method:
random_member = random.choice(my_iterable_object)
print(random_member) # my_iterable_object is [1, 2, 3, 4, 5, 6]
3
full code:
import random
my_list = [1, 2, 3, 4, 5, 6]
random.choice(my_list)
2
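random.choice needs something that supports len() and indexing, which is worth seeing once in practice. The sketch below uses made-up data and a seeded random.Random instance (an assumption made here only so the example is repeatable) to show a set being rejected and two simple workarounds.

```python
import random

rng = random.Random(1)
members = {"ann", "bob", "cy"}   # hypothetical data; a set has no positions

# A set cannot be indexed, so choice() raises TypeError.
try:
    rng.choice(members)
    set_worked = True
except TypeError:
    set_worked = False
print(set_worked)  # False

# Convert to a sequence first (sorted() also makes the order deterministic).
picked = rng.choice(sorted(members))

# A generator has no length either; materialise it before choosing.
lazy = (n * n for n in range(5))
picked_square = rng.choice(list(lazy))
print(picked in members, picked_square in {0, 1, 4, 9, 16})  # True True
```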
import random
def get_random_obj(model, length=-1):
    if length == -1:
        length = model.objects.count()
    return model.objects.all()[random.randint(0, length - 1)]
# to use this function
random_obj = get_random_obj(A)