How to add unique alphanumeric id for pandas dataframe? - python

I need a solution where I can generate a unique alphanumeric ID column for my dataframe. The IDs must remain the same even if I run the script again in the future.
Name
Sam
Pray
Brad
I can generate the ids based on this post, but I need 5-character alphanumeric values that will always remain the same.
This is desired output:
Name ID
Sam X25TR
Peter WE558
Pepe TR589

One way would be to generate a hash of the name, with whatever hashing algorithm, and keep the first five characters of the hash. But you should keep in mind that with such a short hash this is likely to cause collisions (the same output for multiple different inputs) if you have enough data.
Something along these lines:
import hashlib
def get_id(name: str) -> str:
    hash = hashlib.md5(name.encode())
    return hash.hexdigest()[:5]
Now for a given input string, get_id returns an alphanumeric 5-character string which is always the same for the same input.
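Applied to a dataframe like the one in the question (column names assumed from there), the hash-based approach might look like this sketch. Note the IDs come out as lowercase hex characters, not mixed-case strings like the desired output shows:

```python
import hashlib

import pandas as pd

def get_id(name: str) -> str:
    # MD5 of the name is deterministic, so reruns give the same ID
    return hashlib.md5(name.encode()).hexdigest()[:5]

df = pd.DataFrame({"Name": ["Sam", "Peter", "Pepe"]})
df["ID"] = df["Name"].apply(get_id)
print(df)
```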

This function generates a random alphanumeric string of a given length:
import math
import secrets
def random_alphanum(length: int) -> str:
    text = secrets.token_hex(nbytes=math.ceil(length / 2))
    isEven = length % 2 == 0
    return text if isEven else text[1:]

df['ID'] = random_alphanum(5)  # note: a single call assigns the same value to every row
Apply to whole rows:
df2['ID'] = df2.apply(lambda x: random_alphanum(5), axis=1, result_type="expand")

Here's my attempt
import secrets
ids = []
while len(ids) < df.shape[0]:
    temp = secrets.token_hex(5)[:5]
    if temp not in ids:
        ids.append(temp)

df.merge(pd.DataFrame(ids).reset_index(), left_on=df.groupby(['Name']).ngroup(), right_on='index')
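The same idea can be sketched without merge: draw random candidates until there is one unique 5-character id per distinct name, then map them back onto the dataframe. The sample names here are made up, and unlike the hash approach, these IDs will change on every run:

```python
import secrets

import pandas as pd

df = pd.DataFrame({"Name": ["Sam", "Pray", "Brad", "Sam"]})

ids = {}    # name -> unique 5-char id
used = set()
for name in df["Name"].unique():
    while True:
        candidate = secrets.token_hex(3)[:5]  # 6 hex chars, trimmed to 5
        if candidate not in used:
            used.add(candidate)
            ids[name] = candidate
            break

df["ID"] = df["Name"].map(ids)
```

Duplicate names share an id, and distinct names are guaranteed distinct ids.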

Related

ZIP function on Dictionaries

I am struggling with an assignment for a course I have entered.
Create a function which returns a list of countries where the number of cases is equal to one:
Hint: you can use the zip() function in Python to iterate over two lists at the same time.
So the prior question was to get the number of countries which had a single case of corona.
There were 7 countries in the output, and the following worked for that.
# Add your code below
def single_case_country_count(data):
    item = data['Total Cases']
    count = item.count(1)
    if count == 0:
        print('None found')
    return count
I am, however, struggling with the second portion: returning the names of these said 7 countries.
type(latest) is showing dict.
I wrote this code assuming I will have a dictionary of only the cases equal to 1 and the original list; group them through the zip() function and then finally only show the list of countries.
def single_case_countries(data):
    cases = data['Total Cases'] == 1
    names = data['Country']
    zipped = zip(names, cases)
    final = list(zipped)
    return final['Country']
TypeError: 'bool' object is not iterable
The clear issue here is that I cannot filter the dictionary using cases = data['Total Cases'] == 1, as it just returns a boolean.
I was wondering if there is some advice (especially on filtering a dictionary for a specific value).
I managed to solve this with the following code:
def single_case_countries(data):
    countries = []
    for country, cases in zip(data["Country"], data["Total Cases"]):
        if cases == 1:
            countries.append(country)
    return countries
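The same zip-and-filter loop collapses into a single list comprehension; the sample data below is made up for illustration:

```python
def single_case_countries(data):
    # keep each country whose paired case count is exactly 1
    return [country for country, cases
            in zip(data["Country"], data["Total Cases"])
            if cases == 1]

data = {"Country": ["Aruba", "Belize", "Chad"], "Total Cases": [1, 5, 1]}
print(single_case_countries(data))  # ['Aruba', 'Chad']
```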

Does standard library support generating unique IDs?

Assuming I have the following sequences:
A-B-C-D
A-C-C-E
B-B-B-D
A-A-E-D
...
I need to assign unique numerical IDs to every element, e.g. A=0, B=1 and so on and work with those IDs. At the moment I generate ID with the following function:
id = -1
ids = dict()

def getid():
    global id
    id += 1
    return id

def genid(s):
    global id
    if not s in ids:
        ids[s] = getid()
    return ids[s]
I'm a beginner, so it may not be the perfect solution, but it works. However, I worry that it will be very slow/inefficient for a large number of sequences and elements (imagine that instead of A, B, etc. it has combinations of letters like ABCD, XYZ and so on). I believe Python has mechanisms to achieve this in a more compact way. Maybe the collections library has something that can achieve this in 1-2 lines?
uuid will generate a unique random id which can be represented as an int, bytes, or hex.
Just import uuid and then use uuid.uuid1().bytes or uuid.uuid1().int or uuid.uuid1().hex to get your id.
You can avoid global altogether, and as suggested use count:
from itertools import count

id_counter = count()
ids = dict()

def getid():
    return next(id_counter)

def genid(s):
    if s not in ids:
        ids[s] = getid()
    return ids[s]
You could use some "python magic" to make it shorter:
from itertools import count

def genid(s, id_counter=count(), ids={}):
    if s not in ids:
        ids[s] = next(id_counter)
    return ids[s]
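Another compact standard-library option (an alternative sketch, not from the answers above) is a defaultdict whose factory is the counter itself: each key gets the next integer the first time it is looked up, which makes encoding a whole sequence a one-liner:

```python
from collections import defaultdict
from itertools import count

# first lookup of a missing key calls count().__next__, assigning 0, 1, 2, ...
ids = defaultdict(count().__next__)

sequences = ["A-B-C-D", "A-C-C-E"]
encoded = [[ids[el] for el in seq.split("-")] for seq in sequences]
print(encoded)  # [[0, 1, 2, 3], [0, 2, 2, 4]]
```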

Function that takes two ShortUUIDs and outputs a consistently unique string for identification

I'm trying to store a bidirectional relationship in a database and to minimise duplicity of storing two records per relationship, I'm trying to find a way to take two UUIDs in either order and return the same unique id regardless of which UUID was supplied first.
F(a,b) should return the same value as F(b,a)
Examples of ShortUUID output:
wpsWLdLt9nscn2jbTD3uxe
vytxeTZskVKR7C7WgdSP3d
Could something like this work for you?
The function takes two strings as input, orders them, concatenates them into one string, encodes that string and finally returns the hashed result.
import hashlib
import hashlib

def F(a, b):
    data = ''.join(sorted([a, b])).encode()
    return hashlib.sha1(data).hexdigest()
The output is
>>> a = 'string_1'
>>> b = 'string_2'
>>> print(F(a, b))
376598c12bb7949427f4c037070fff76fe932a66
>>> print(F(b, a))
376598c12bb7949427f4c037070fff76fe932a66
Interesting! What do you think of this, which will retain your ShortUUID format?
def F(a, b):
    l = (len(a) // 2) + 1
    each_half = zip(a[:l], b[:l]) if a < b else zip(b[:l], a[:l])
    return ''.join([x + y for x, y in each_half])[:len(a)]
The first line ensures that F also works if you change your ShortUUID to have an odd length.
The second line zips one char at a time from the first halves of a and b, ordered.
The last line returns the joined string, capped at the length of a.
Just tried:
a = 'wpsWLdLt9nscn2jbTD3uxe'
b = 'vytxeTZskVKR7C7WgdSP3d'
assert F(a, b) == F(b, a)
print(F(a, b))  # vwyptsxWeLTdZLstk9VnKs

Increasing efficiency of a string filter

I have a long text file containing a number of strings. Here is the part of the file:
tyh89= 13
kb2= 0
78%= yes
###bb1= 7634.0
iih54= 121
fgddd= no
#aa1= 0
#aa2= 1
#$ac3= 0
yt##hh= 0
#j= 12.1
##hf= no
So, basically all elements have a common structure of: header= value. My goal is to search for elements, whose headers contain specific string parts and read out those elements' values.
At the moment I do it with a rather straightforward approach: open/read the whole file as a string, split it into a list of elements, and run if/elif conditions over all elements in a for loop. I provide my code below.
Is it the most efficient way to do it? Or is there a more efficient way to do it with not implementing the loop?
def main():
    print(list(import_param()))

def import_param():
    fl = open('filename', 'r')
    cn = fl.read()
    cn = cn.split('\n')
    fl.close()
    for st in cn:
        if 'fgddd' in st:
            el = st.split(' ')
            yield float(el[1])
        elif '#j' in st:
            el = st.split(' ')
            yield float(el[1])

if __name__ == '__main__':
    main()
Yes, there is. You should avoid testing whether a string contains a substring, and instead rely on string equality.
Once you settle on equality, you can build a set of the known keywords, split each line on =, and test whether your key is in the set (an O(1) lookup):
key_set = {"fgddd", "#j"}
for st in cn:
    if '=' in st:
        key, value = st.split("=", 1)
        if key in key_set:
            el = value.strip()
            yield float(el)
If you have different types, use a dictionary to convert each value to the proper type according to its key:
key_set = {"fgddd": float, "#j": float, "whatever": int, "something": str}
for st in cn:
    if '=' in st:
        key, value = st.split("=", 1)
        if key in key_set:
            el = value.strip()
            yield key_set[key](el)  # apply type conversion
note that if you don't want any conversion, str will do the job as it returns itself when passed a string.
Final note: if you have a say on the input format, I suggest using JSON instead of a custom format. Parsing becomes trivial using the json module, and filtering can be achieved the same way I've shown.
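For example, if the same data arrived as JSON (a made-up equivalent of the sample file above), parsing plus filtering collapses to a dict comprehension:

```python
import json

# hypothetical JSON version of a few lines from the sample input
raw = '{"tyh89": 13, "fgddd": "no", "#j": 12.1}'
data = json.loads(raw)

wanted = {"fgddd", "#j"}
filtered = {k: v for k, v in data.items() if k in wanted}
print(filtered)  # {'fgddd': 'no', '#j': 12.1}
```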

Find Nth item in comma separated list in Python

I have a large CSV with comma-separated lines of varying length. While sorting another set of data I used split(',') in a loop to separate fields, but this method requires each line to have the same number of entries. Is there a way I can look at a line and, independent of the total number of entries, just pull the Nth item? For reference, the method I was using will only work with a line that looks like AAA,BBB,CCC,DDD
entry = 'A,B,C,D'
a, b, c, d = entry.split(',')
print(a, b, c, d)
But I would like to pull A and C even if it looks like A,B,C,D,E,F or A,B,C.
Use a list instead of separate variables.
values = entry.split(',')
print(values[0], values[2])
Just use a list:
xyzzy = entry.split(",")
print(xyzzy[0], xyzzy[2])
But be aware that, once you allow the possibility of variable element counts, you'd probably better allow for too few:
entry = 'A,B'
xyzzy = entry.split(",")
a, c = ('?', '?')
if len(xyzzy) > 0:
    a = xyzzy[0]
if len(xyzzy) > 2:
    c = xyzzy[2]
print(a, c)
If you don't want to index the results, it's not difficult to write your own function to deal with the situation where there are either too few or too many values. Although it requires a few more lines of code to set up, an advantage is that you can give the results meaningful names instead of anonymous ones like results[0] and results[2].
def splitter(s, take, sep=',', default=None):
    r = s.split(sep)
    if len(r) < take:
        r.extend(default for _ in range(take - len(r)))
    return r[:take]

entry = 'A,B,C'
a, b, c, d = splitter(entry, 4)
print(a, b, c, d)  # --> A B C None

entry = 'A,B,C,D,E,F'
a, b, c, d = splitter(entry, 4)
print(a, b, c, d)  # --> A B C D
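None of the answers mention it, but for a real CSV file the stdlib csv module is a safer splitter than split(',') because it handles quoted commas; the short-row check still applies. A sketch with made-up rows:

```python
import csv

# csv.reader accepts any iterable of lines, so a list stands in for a file here
rows = ["A,B,C,D,E,F", "A,B,C", "A,B"]
picked = []
for fields in csv.reader(rows):
    # pull the 1st and 3rd items, tolerating rows that are too short
    a = fields[0] if len(fields) > 0 else None
    c = fields[2] if len(fields) > 2 else None
    picked.append((a, c))
print(picked)  # [('A', 'C'), ('A', 'C'), ('A', None)]
```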
