Creating a short unique ID based on other values in Python? - python

I have a number of variables in Python that I want to use to generate a unique ID (one that is always the same for the same set of values).
I have used .encode('hex','strict') to produce an ID, which seems to work, but the output is very long. Is there a way to produce a shorter ID from these variables?
myname = 'Midavalo'
mydate = '5 July 2017'
mytime = '8:19am'
codec = 'hex'
print "{}{}{}".format(myname, mydate, mytime).encode(codec,'strict')
This outputs
4d69646176616c6f35204a756c792032303137383a3139616d
I realise that with hex the output length probably depends on the length of the three variables, so I'm wondering if there is another codec that can/will produce shorter values without excluding any of the variables?
So far I have tested base64, bz2, hex, quopri, uu, zip from 7.8.4. Python Specific Encodings, but I'm unsure how to get any of these to produce shorter values without removing variables.
Is there another codec I could use, or a way to shorten the values from any of them without removing the uniqueness, or even a completely different way to produce what I require?
All I am trying to do is produce an ID so I can identify those rows when loading them into a database. If the same value already exists it will not create a new row in the database. There is no security requirement, just a unique ID. The values are generated elsewhere into python, so I can't just use a database issued ID for these values.

You could use some hashing algorithm from the hashlib package: https://docs.python.org/3/library/hashlib.html or for python 2: https://docs.python.org/2.7/library/hashlib.html
import hashlib
s = "some string"
hash = hashlib.sha1(str.encode(s)).hexdigest() # you need to encode the strings into bytes here
This hash would be the same for the same string.
Your choice of algorithm depends on the number of characters you want and the risk of collision (two different strings yielding the same hash) you can accept.
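Applied to the three variables from the question, this could look like the sketch below. The helper name make_row_id, the "\x1f" separator and the default length of 12 are assumptions made for illustration, not part of the original answer; truncating the digest shortens the ID at the cost of a higher collision risk.

```python
import hashlib

def make_row_id(*fields, length=12):
    """Return a short, repeatable ID for the given field values (hypothetical helper)."""
    # Join with a separator that is unlikely to occur in the data, so that
    # ("ab", "c") and ("a", "bc") do not collapse into the same input string.
    joined = "\x1f".join(str(f) for f in fields)
    digest = hashlib.sha1(joined.encode("utf-8")).hexdigest()
    # Keeping only the first `length` hex characters shortens the ID,
    # but makes collisions more likely than with the full digest.
    return digest[:length]

row_id = make_row_id("Midavalo", "5 July 2017", "8:19am")
print(row_id)
```

The same inputs always give the same ID, so it can serve as the "already loaded?" key in the database as long as the chosen length keeps collisions acceptably unlikely for the number of rows.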

If you are not tied to hashing and just want a short value based on two or more strings, the following concatenates the first character of every word in each string and outputs the result as an ID:
#prints HKRC1LB for string_1 and string_2
#Concatenate first char of every word in all strings to get a uniq id
def get_uniq_val(*args):
    id = ""
    for i in args:
        for j in i.split():
            id += j[0]
    return id
def main():
    string_1 = "Howard Kid Recreation Centre"
    string_2 = "150 Lantern Blvd"
    uid = get_uniq_val(string_1,string_2)
    print(uid)
if __name__ == "__main__":
    main()

Related

function that takes directory of .txt file and returns a dictionary based on set parameters?

I want to make a function that takes the directory of a .txt file as an input and returns a dictionary based on specific parameters. If the .txt file is empty,
then the function will return nothing. When writing this function, I request that no imports, no list comprehension, and only for/while and if statements are used.
This is for the sake of the content I am learning right now, and I would like to be able to learn and interpret the function step-by-step.
An example of a .txt file is below. The amount of lines can vary but every line is formatted such that they appear in the order:
word + a string of 3 numbers connected by commas.
terra,4,5,6
cloud,5,6,7
squall,6,0,8
terra,4,5,8
cloud,6,5,7
First I would like to break down the steps of the function
Each component of the string that is separated by a comma serves a specific purpose:
The last number in the string will be subtracted by the second to last number in a string to form a value in the dictionary.
for example, the last two characters of terra,4,5,6 will be subtracted to form a value of [1] in the dictionary
The alphabetical words will form the keys of the dictionary. If there are multiple entries of the same word in a .txt file then a single key will be formed
and it will contain all the values of the duplicate keys.
for example, terra,4,5,6 , terra,4,4,6 , and terra,4,4,7 will output ('terra', 4):[1,2,3] as a key and value respectively.
However, in order for a key to be marked as a duplicate, the first values of the keys must be the same. If they are not, then they will be separate values.
For example, terra,4,5,6 and terra,5,4,6 will appear separately from each other in the dictionary as ('terra', 4):[1] and ('terra', 5):[2] respectively.
Example input
If we use the example .txt file mentioned above, the call should look like create_dict("***files/example.txt") and should output the dictionary
{('terra', 4):[1,3],('cloud', 5):[1],('squall', 6):[8],('cloud', 6):[2]}. I will add a link to the .txt file for the sake of recreating this example. (note that *** are placeholders for the rest of the directory)
What I'm Trying:
testfiles = (open("**files/example.txt").read()).split('\n')
int_list = []
alpha_list = []
for values in testfiles:
    ao = values.split(',') #returns only a portion of the list. why?
for values in ao:
    if values.isnumeric():
        int_list.append(values) #retrieves list of ints from list
for values in ao:
    if values.isalpha():
        alpha_list.append(values) #retrieves a list of words
{((alpha_list[0]), int(int_list[0])):(int(int_list[2])-(int(int_list[1])))} #each line will always have 3 number values so I used index
this returns {('squall', 6): 1} which is mainly just a proof of concept and not a solution to the function. I wanted to see if it was possible to use the numbers and words I found in int_list and alpha_list using indexes to generate entries in the dictionary. If possible, the same could be applied to the rest of the strings in the .txt file.
Your input is in CSV format.
You really should be using one of these
https://docs.python.org/3/library/csv.html#csv.reader
https://docs.python.org/3/library/csv.html#csv.DictReader
since "odd" characters within a comma-separated field
are non-trivial to handle.
Better to let the library worry about such details.
Using defaultdict(list) is the most natural way,
the most readable way, to implement your dup key requirement.
https://docs.python.org/3/library/collections.html#collections.defaultdict
I know, I know, "no import";
now on to a variant solution.
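def create_dict(path):
    d = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines, e.g. a trailing newline
                continue
            word, nums = line.split(',', maxsplit=1)
            a, b, c = map(int, nums.split(','))
            delta = c - b  # last number minus second-to-last number
            key = (word, a)
            if key not in d:
                d[key] = []
            d[key].append(delta)
    return d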
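For comparison, the library-based approach recommended at the start of this answer (csv.reader plus defaultdict(list)) might look like the following sketch. It deliberately ignores the "no imports" constraint and assumes every non-empty row has exactly one word followed by three integers.

```python
import csv
from collections import defaultdict

def create_dict(path):
    """Group (last - second-to-last) differences by (word, first number)."""
    groups = defaultdict(list)  # missing keys start out as an empty list
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:  # csv.reader yields [] for blank lines
                continue
            word = row[0]
            a, b, c = int(row[1]), int(row[2]), int(row[3])
            groups[(word, a)].append(c - b)
    return dict(groups)  # plain dict, so the result prints like the example
```

With the example file from the question, this should return the same dictionary as the import-free version.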

Trying to print keys based on their value type (int or str) from a dictionary of lists

I'm learning to access dictionary keys-values and work with list comprehensions. My assignment asks me to:
"Use a while loop that prints only variant names located in chromosomes that do not have numbers (e.g., X)."
And I'm working with this dictionary of lists, where the keys are variant names, and the zeroth elements in the list values (the character sets on the left of the colon([0])) are chromosome names, while the characters to the right of the colon ([1])are their chromosome location, and the [2] values are gene names.
cancer_variations={"rs13283416": ["9:116539328-116539328+","ASTN2"],\
"rs17610181":["17:61590592-61590592+","NACA2"],\
"rs1569113445":["X:12906527-12906527+","TLR8TLR8-AS1"],\
"rs143083812":["7:129203569-129203569+","SMO"],\
"rs5009270":["7:112519123-112519123+","IFRD1"],\
"rs12901372":["15:67078168-67078168+","SMAD3"],\
"rs4765540":["12:124315096-124315096+","FAM101A"],\
"rs3815148":["CHR_HG2266_PATCH:107297975-107297975+","COG5"],\
"rs12982744":["19:2177194-2177194+","DOT1L"],\
"rs11842874":["13:113040195-113040195+","MCF2L"]}
I have found how to print the variant names based on the length of the zeroth element in the lists (the chromosome names):
for rs, info in cancer_variations.items():
    tmp_info=info[0].split(":")
    if (len(tmp_info[0])>3):
        print(rs)
But I'm having trouble printing the key values, the variant names, based on the TYPE of the chromosome name, the zeroth element in the list values. To that end, I've devised this code, but I'm not sure how to phrase the Boolean values to print only if the chromosome name is one particular type, (Str) or (int).
for rs, info in cancer_variations.items():
    tmp_info=info[0].split(":")
    if tmp_info[0] = type.str
        print(rs)
I am not sure exactly what I'm not seeing here with my syntax.
Any help will be greatly appreciated.
If I understand you right, you want to check if the first part before : contains a number or not.
You can iterate the string character-by-character and use str.isnumeric() to check if the character is number or not. If any character is a number, continue to next item:
cancer_variations = {
"rs13283416": ["9:116539328-116539328+", "ASTN2"],
"rs17610181": ["17:61590592-61590592+", "NACA2"],
"rs1569113445": ["X:12906527-12906527+", "TLR8TLR8-AS1"],
"rs143083812": ["7:129203569-129203569+", "SMO"],
"rs5009270": ["7:112519123-112519123+", "IFRD1"],
"rs12901372": ["15:67078168-67078168+", "SMAD3"],
"rs4765540": ["12:124315096-124315096+", "FAM101A"],
"rs3815148": ["CHR_HG2266_PATCH:107297975-107297975+", "COG5"],
"rs12982744": ["19:2177194-2177194+", "DOT1L"],
"rs11842874": ["13:113040195-113040195+", "MCF2L"],
}
for k, (v, *_) in cancer_variations.items():
    if not any(ch.isnumeric() for ch in v.split(":")[0]):
        print(k)
Prints:
rs1569113445
You need to look up how to determine your desired classification of the data. In this case, all you need is to differentiate alphabetic data from numeric:
if tmp_info[0].isalpha():
    print(rs)
Should get you on your way.
First you need to make sure what you want to do.
If what you want is to distinguish a numeric string from a normal string, then you may want to know that a numeric string is strictly formed of numbers; if you add any other character, it's not considered numeric by python. You can prove this making this experiment:
print('23123'.isnumeric())
print('2312ds3'.isnumeric())
Results in:
True
False
Numeric strings are what you are looking to exclude; any other chromosome name, which in this case stays a plain str, will qualify, if I'm understanding correctly.
So, in that manner, we are going to iterate over the dict, using the loop you've made:
for rs, info in cancer_variations.items():
    tmp_info=info[0].split(":")
    if not tmp_info[0].isnumeric():
        print(rs)
Which results in:
rs1569113445
rs3815148
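The assignment quoted in the question asks specifically for a while loop, which none of the snippets in this thread use. A minimal sketch of that form is shown here, on a shortened copy of the dictionary and with a hypothetical helper name; it uses the strictest reading, "the chromosome name contains no digit at all", so CHR_HG2266_PATCH is excluded.

```python
# Shortened copy of the dictionary from the question, for illustration only.
cancer_variations = {
    "rs13283416": ["9:116539328-116539328+", "ASTN2"],
    "rs1569113445": ["X:12906527-12906527+", "TLR8TLR8-AS1"],
    "rs3815148": ["CHR_HG2266_PATCH:107297975-107297975+", "COG5"],
    "rs12982744": ["19:2177194-2177194+", "DOT1L"],
}

def variants_without_digits(variations):
    """Return variant names whose chromosome name contains no digit."""
    names = list(variations)  # dict keys as a list, so they can be indexed
    found = []
    i = 0
    while i < len(names):
        chromosome = variations[names[i]][0].split(":")[0]
        if not any(ch.isdigit() for ch in chromosome):
            found.append(names[i])
        i += 1  # advance the index, otherwise the loop never ends
    return found

for name in variants_without_digits(cancer_variations):
    print(name)
```

Swapping the condition for `not chromosome.isnumeric()` would also keep names such as CHR_HG2266_PATCH, which mix letters and digits.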

Finding row in Dataframe when dataframe is both int or string?

minor problem doing my head in. I have a dataframe similar to the following:
Number Title
12345678 A
34567890-S B
11111111 C
22222222-L D
This is read from an excel file using pandas in python, then the index set to the first column:
db = db.set_index(['Number'])
I then lookup Title based on Number:
lookup = "12345678"
title = str(db.loc[lookup, 'Title'])
However... Whilst anything postfixed with "-Something" works, anything without it doesn't find a location (eg. 12345678 will not find anything, 34567890-S will). My only hunch is it's to do with looking up as either strings or ints, but I've tried a few things (converting the table to all strings, changing loc to iloc,ix,etc) but so far no luck.
Any ideas? Thanks :)
UPDATE: So trying this from scratch doesn't exhibit the same behaviour (creating a test db presumably just sets everything as strings), however importing from CSV is resulting in the above, and...
Searching "12345678" (as a string) doesn't find it, but 12345678 as an int will. Likewise the opposite for the others. So the dataframe is only matching the pure numbers in the index with ints, but anything else with strings.
Also, I can't not search for the postfix, as I have multiple rows with differing postfix eg 34567890-S, 34567890-L, 34567890-X.
If you want to cast all entries to one particular type, you can use pandas.Series.astype:
db["Number"] = db["Number"].astype(str)
db = db.set_index(['Number'])
lookup = "12345678"
title = db.loc[lookup, 'Title']
Interestingly this is actually slower than using pandas.Index.map:
import numpy as np
import pandas as pd
x1 = [pd.Series(np.arange(n)) for n in np.logspace(1, 4, dtype=int)]
x2 = [pd.Index(np.arange(n)) for n in np.logspace(1, 4, dtype=int)]
def series_astype(x1):
    return x1.astype(str)
def index_map(x2):
    return x2.map(str)
Treat all the indices as strings, since at least some of them are not numbers. If you want to look up a specific item that could have a postfix, you can match it by comparing the start of the strings with .str.startswith:
lookup = db.index.str.startswith("34567890")
title = db.loc[lookup, "Title"]
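The underlying mismatch can be reproduced without pandas. In this sketch a plain dict is a hypothetical stand-in for the index as loaded from the file, where purely numeric entries became ints and the rest stayed strings; it illustrates why the lookup fails, not how pandas stores its index internally.

```python
# Hypothetical stand-in for the mixed-type index from the question.
raw_index = {12345678: "A", "34567890-S": "B", 11111111: "C", "22222222-L": "D"}

# The string "12345678" is not the int 12345678, so the lookup misses.
found_as_string = "12345678" in raw_index

# Converting every key to str makes all lookups uniform.
normalized = {str(key): title for key, title in raw_index.items()}
title = normalized["12345678"]

# Prefix matching, analogous to .str.startswith on a string index.
prefix_matches = [t for key, t in normalized.items() if key.startswith("34567890")]

print(found_as_string, title, prefix_matches)
```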

Arabic words in Python

I have a problem printing Arabic text in Python. I wrote code that converts English characters into Arabic ones (so-called chat language or Franco-Arabic) and then builds combinations of the different results to offer suggestions based on user input.
def transliterate(francosentence, verbose=False):
    francowords = francosentence.split()
    arabicconvertedwords = []
    for i in francowords:
        rankeddata=[]
        rankeddata=transliterate_word(i)
        arabicconvertedwords.append(rankeddata)
        for index in range(len(rankeddata)):
            print rankeddata[index]
    ran=list(itertools.product(*arabicconvertedwords))
    for I in range(len(ran)):
        print ran[I]
The first print (print rankeddata[index]) gives Arabic words, but after the combination process is executed the second print (print ran[I]) gives something like that: (u'\u0627\u0646\u0647', u'\u0631\u0627\u064a\u062d', u'\u0627\u0644\u062c\u0627\u0645\u0639\u0647')
How can I print Arabic words?
Your second loop is operating over tuples of unicode (product yields a single product at a time as a tuple), not individual unicode values.
While print uses the str form of the object printed, tuple's str form uses the repr of the contained objects, it doesn't propagate "str-iness" (technically, tuple lacks __str__ entirely, so it's falling back to __repr__).
If you want to see the Arabic, you need to print the elements individually or concatenate them so you're printing strings, not tuple. For example, you could change:
print ran[I]
to something like:
print u', '.join(ran[I])
which will convert to a single comma-separated unicode value that print will format as expected (the str form), rather than using the repr form with escapes for non-ASCII values.
Side-note: As a point of style (and memory use), use the iterator protocol directly, don't listify everything then use C-style indexing loops. The following code has to store a ton of stuff in memory if the inputs are large (the total size of the output is the multiplicative product of the lengths of each input):
ran=list(itertools.product(*arabicconvertedwords))
for I in range(len(ran)):
    print u', '.join(ran[I])
where it could easily produce just one item at a time on demand, producing results faster with no memory overhead:
# Don't listify...
ran = itertools.product(*arabicconvertedwords)
for r in ran: # Iterate items directly, no need for list or indexing
    print u', '.join(r)
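The snippets above are Python 2. A rough Python 3 equivalent of the lazy version is sketched below; the candidate lists are hypothetical stand-ins for what transliterate_word returns (the second spelling of the middle word is invented for illustration), and the generator name suggestions is likewise an assumption.

```python
import itertools

# Hypothetical ranked candidates, one list per input word.
candidates = [
    ["\u0627\u0646\u0647"],
    ["\u0631\u0627\u064a\u062d", "\u0631\u0627\u064a\u062d\u0647"],
    ["\u0627\u0644\u062c\u0627\u0645\u0639\u0647"],
]

def suggestions(word_candidates):
    """Yield one joined sentence per combination, without building a list."""
    for combo in itertools.product(*word_candidates):
        # combo is a tuple of strings; joining gives a single string to print
        yield " ".join(combo)

for sentence in suggestions(candidates):
    print(sentence)
```

Each combination is produced on demand, so only one joined sentence is held in memory at a time.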

Python: skip every nth element in hashcheck, conditional mismatch?

So I currently have a script that generates hashes from the contents of a text file and saves them to a dictionary, and it then goes into a second text file and generates hashes from there and compares them to said dictionary. I'm trying to implement some sort of incomplete matching; for example, I want to program some tolerance: for example, I'd like to make it so that every third element in the hash is unimportant to the matching protocol, so if there is a mismatch, it will continue iterating unimpeded. Is it possible to do this?
Furthermore, and this is a separate case, would it be possible to determine a conditional mismatch? For example, if there is a mismatch, there are several elements that would still qualify as "matching", like if I wanted a vowel at a certain position, but it didn't matter which vowel showed up.
In summary, I'm trying to make it so that my script either goes
check,check,disregard,check,check,disregard,etc.
OR
check,check,conditional mismatch?,check,check,conditional mismatch?,etc.
along the hashes. Is this doable?
EDIT: I suppose it's not really hashchecking, but more of string comparison. Here's the relevant code I'm trying to tweak:
# hash table for finding hits
lookup = defaultdict(list)
# store sequence hashes in hash table
for i in xrange(len(file1) - hashlen + 1):
    key = file1[i:i+hashlen]
    lookup[key].append(i)
# look up hashes in hash table
hits = []
for i in xrange(len(file2) - hashlen + 1):
    key = file2[i:i+hashlen]
    # store hits to hits list
    for hit in lookup.get(key, []):
        hits.append((i, hit))
where hashlen is the length of the hash I want to generate (and thus the buffer so I don't go off the end of the file).
As noted in the comments, hashes (dictionaries) have no guaranteed order, so you could consider using an OrderedDict.
But maybe this code helps you.
skip_rate = 3
for index, (key, value) in enumerate(your_hash.items()):
    if index % skip_rate != 0:
        do_something(key, value)
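Since the code in the question compares fixed-length string windows, another option is to normalize each window before using it as a dictionary key, so that tolerant positions no longer affect the match. The sketch below is written for Python 3 and rests on several assumptions: the names mask_key and find_hits are hypothetical, "every third element" is counted from the start of each window, and the placeholder characters "*" and "V" do not occur in the data.

```python
from collections import defaultdict

VOWELS = set("aeiouAEIOU")

def mask_key(window, mode="disregard"):
    """Normalize a window so every third character is tolerant."""
    out = []
    for pos, ch in enumerate(window):
        if pos % 3 == 2:  # third, sixth, ... character of the window
            if mode == "disregard":
                out.append("*")  # any character matches here
            elif ch in VOWELS:
                out.append("V")  # conditional: any vowel matches any vowel
            else:
                out.append(ch)   # non-vowels must still match exactly
        else:
            out.append(ch)       # checked positions are kept as-is
    return "".join(out)

def find_hits(seq1, seq2, hashlen, mode="disregard"):
    """Return (position in seq2, position in seq1) pairs of tolerant matches."""
    lookup = defaultdict(list)
    for i in range(len(seq1) - hashlen + 1):
        lookup[mask_key(seq1[i:i + hashlen], mode)].append(i)
    hits = []
    for i in range(len(seq2) - hashlen + 1):
        for hit in lookup.get(mask_key(seq2[i:i + hashlen], mode), []):
            hits.append((i, hit))
    return hits
```

For example, find_hits("abcdef", "abxdey", 6) treats the two windows as equal because they differ only at the third and sixth characters, while mode="vowel" accepts a difference at those positions only when both characters are vowels.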
