How do I solve my program's counting problem? - python

(Apologies this is gonna be a long question)
I just have a bug in my code that I have not been able to resolve for a very long time. I would really appreciate if someone could help me find out what the problem is.
Context:
I have a long string of letters - lets call this subject - containing the letters A, G, T and C (like DNA) and the whole point of my algorithms is to correctly count how many of each of the following STRs are found within subject. The STRs are:
AGATC
TTTTTTCT
AATG
TCTAG
GATA
TATC
GAAA
TCTG
I must count how many of each are within subject. Counting works by going sequentially letter by letter until the start of one of above STRs are found. If the rest of the STR follows, the program should update the counter of the respective STR and then boost the searching index to account of the length of the STR and then keep going. It should stop when it reaches the end of subject.
(Hope it makes sense).
My Code:
STRs = ['AGATC','TTTTTTCT','AATG','TCTAG','GATA','TATC','GAAA','TCTG']
subject = "GCTAAATTTGTTCAGCCAGATGTAGGCTTACAAATCAAGCTGTCCGCTCGGCACGGCCTACACACGTCGTGTAACTACAACAGCTAGTTAATCTGGATATCACCATGACCGAATCATAGATTTCGCCTTAAGGAGCTTTACCATGGCTTGGGATCCAATACTAAGGGCTCGACCTAGGCGAATGAGTTTCAGGTTGGCAATCAGCAACGCTCGCCATCCGGACGACGGCTTACAGTTAGTAGCATAGTACGCGATTTTCGGGAAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCCGTCAACTCATTCACACCGCATCCTTTCCTGCCACTGTAACTAGTCGACTGGGGAACCTCATCATCCATACTCTCCCACATTATGCCTCCCAACCTTGTTAAGCGTGGCATGCTTGGGATTGCATTGATGCTTCTTGGAGAGGACGCTTTCGTTTTGGAGATTACAGGGATCCAATTTTATCATCGGTTCGACTCCCGTAACGACTTAGCAGTAAGGGTGCTAGTTCCTGGTTAGAATCTTAATAAATCACGTCGCTTGGAGCAAGACAAAGATCGTCGTAATGCCAAGTGCACGACCACCTTCAGACTTGCAGGACCCGTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTCGATAGCTATGCGGTTCAATACAATCTTAACGCAATGCAGCGATGTGGTTTCGTACACTTAGCATAAAACCCCCCACATTAAATCGATGTACCCGCCCTCTTAGACGCCAATTTCAATGCCGAACCTCCGGCGGGTATCTCTGCACTAGGAGAAGTAGCACGTCGCTGTAGCGAACTCCTATCGTGAGATAATTTGTAGAGCTGCTCTTATAATACAATAGCTCAGATGGATTATTCCATGGACATCCCCGTGCGTTGTTTCGAGGATGGTAGGTGGAAATTTTGCCAGACCTCTAGTCTTAAACATGGTTGACGTTATAGGCGCTATCTCTTGCGTCTGGAAGTGTTAATCCGTGAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAACACGCAACTCTGGAGGAGGGCACTGCACTGCAAACTTGCGTAATATCCTTCACCCACACTTGCCTGGCCTCCTTGCTTAAAGCTCTGGCGATGCGATTTTTCGGCCCAGTAGCTGAATAGGTCATGAAATGGGCACCGAACTGGAAAGACCCATATATTCGATACTCACAACTTAATGATAGCGCGATTAAGAGCGACACCAAAAACCAAATTACGTTCACGAACCTTTGAGAGTCAAGGAGACTTAGACCGAATTGAATGATCACTGATGCGCCCGCTGATACTGAGCCTCACCATTAATCGCCGACCAATACGGCGTGTACCGGGCGCGGCCTTGCCGCATAACGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTACACAGCCCCGTCCTCATTGCTAAGTGCACTGGCAACTGGACCTAAAGATTTTTCGAGTATGGCCCTCGAATCAAGCGCCCACCCAGAAACCTACGAGCCAGTAACCCCAGTAAACAAGCATTAGTGCTATATGCTTGCTGCCCACTAGGACCCTTATGGTTCATACCAGGGTGACGTGTCTTGCGGGCCAAGGATGAACCAGAAGCAAGATCCTTAGATGGACGACTGTCTCATTGCTTAAACTCCACATACCAAAGGGCGCGGTAAACGATAGTTTTAGGTAATGTTAGTCGGATGGTTGTCTGCAGCTACCAATACAGCCTGGCACCCAGGGTCTGAACAATAACGCGTGAGAGCAGCTCTCCCGCGTGTGGTGGATTTGCCGTCTATGAAATTGAGGCTCTTGCAACTATTCGCACTCGGAATGCCCTCATATCTGGTGCCTAGCGGCCTTTGCCCCGTGCCGGTAGGACTAAACTCTACGGATCGTTGACGGATCTCGATGTGGAAGATGGTTATGAAAGATAACAACGCGTGTGCTAATTGATTTAGACAAGTATTGCGGCAGTAAAAGATAATCGGCTGCAGAGTTACGAAAGACTTCCATGCATGGATTCCATTCCTTCTAGTATAGGACCCACTCTGAATACACGTCTTGCGGGCCGATCATCTCCACCGCTGCGGAAGAAAGCAATTAAGAATCTATGCTCATTAAGAGTGCGACTATAATGCGGATCTTACAGTGCTAATGATCAGGACGTCGTCCAAGCAGGCTGCATGCCGAATTTAGCTTACGTCAGGATCAGGCGTTATAGCCTGGGAATCGGACTATGAGGACGCCACGACCTCTGGGAGAAAGCTATATACATTGAGGATCGCGCCATCTTTATGAGACTCAAATGAATCTAGATAGGTAGCATTGCGGACTTGAGTTAGCACATCGGTATTGGAAGGTGAGGGTCCTGCCGCTCGTTCTATGTTCGGTTTATAGTATACAAATAGGTCATCCCGAACGTTGAAGTTAAACTCATGACACGTTGTCGTAATGAAACGGGCCTGTTATTAGGGATACAGACAAAAGGCACAAGCTGGCTTGCACATTAAGGCGCACTAGAGATCCTCACAACCGTTGCCCGCACGGAGGTCGTGTCTAACAGACAGTGAACCAGCCGTATTGGGGTGGATGACCTGAGCTTCTTGGGGCCTGTTGTACACCGCGTGTGGTTCAACTGGTACACATACTACGAATATTCGAAATCATTGTACTGTGCTCTTCGGTGCTACTGACTGTGAGCGAATGCATCCCAATCCCAAACAATGCTTGTGGTAGGAGAATTGAAACTCTCGAAGCCTGGCCCAATGTCATCTACTTTTAACATGTCGGGCCAGGAGTTACGGGCATTGCTTACTTACTTTGCCCCCTTACACCACAGCAGCGCGATTCTTGTTGTAGTAGATTTTATACGACTCGCGAATTAAATGGAACTTGTCTGTCCCATATCGATCGTGTCCATCGTAAGATGAGATTGTAGGAGCATTCGGAAGTCTATGCGGCCCAGGGACTACTACGTTAAATCTGGTCAGACGTGGTTTACAAGGCGTCCCGATCTTCTCAGAACATATGGGAAAGCACTACCGTTCCTTCACGCATACAGTTGTTCGTGCCGAACGAGTAAGCTTGCGACCAGCCCACCCGCTAGGGCTATGCAGCGGGTCATGGCTGGCGCCATACTGTGCGGACAACCCACGCTCTGGCAGAAAGCGTCTTGTGTTTTGTAGTAGCTCCAACGGTTAGACCTTCGATATCTATTCAGAGCGCGAGCGACCACTATTAGACGGCATGTAAACAATGTGTATTTGTTCGGCCCAACCGGTATATGGGTAAGACCGCGAAGGGCCTGCGCGAATACCAGCGTCCAAAAATTCCTCACCCGAGATATGCGGTTAGTACCCCTTGGGTAACGGTCCGCTACGGGTAGCGACGCGAGCCGGCCGCATCGGTTGGAGCCGAGTTGTCGGGCAGGCGAGTAACGTGTGCAATTTGATGGGCCCAAGCCTCCGGCACTATCCACCTCATACATCGACAAAAGCACCAAATATGGGGAAAAGCTGAGCGTCGATATGTACATCTACCCAGGAACCGGCCCGAACATTAGGCGGACGTGAATTTCCGACCTAGGTTCGGCTACATTTCTACGATCCAAGCACACGTGAAGGAGGAGGGGTGTTCCGACCGTAAATGAACGAGGTGCGCAGTGACCCGATGGCGTTTAGCGGATAGCCTTCCTATGCCGGCCTATGCTGTATGGTAGTTGGTTGGTGCCTCCAGAGCCACTGCACCCAATCATAGGGTCTACAGCAGCGTACTTATAAAATTGTACGGGTGACCCATATCCATTACGGGTTGCGACCAGTATAGGAGAGTATAACTGCGTGAACTAATGCGTTATGACGCTTCAGAGTTTGCTCGGGCCCGAGTTCTAGGGCTATAATGTGTTAGGGCGCAAGTATGCCAAGCTAAGATGTGGCGTGCACACTAGGAGTTGTGTTCCTCTGCAAGCAGACACGAGCACTCTGGCAGTAGTTTGACCACACCCGGGTATCACTGCTACTCCATTTCGAACAAGCTATTGGAGCGGACAAAATATGCTACTCAAGAGCATTAGTTATAGGTCTACGAGACAGAAGCAGTTACTGAGTCTGAATATTCGATATAAGTAGGCATGGAGGCGGAGCAAAACAACGTCTGCGATCAATCGTGTTGATGACGTATGGCGACTGGAAGGTAAGGACTATGGCCGGACGGAATGATTCATGTTCTGTTCAAAGCTATATTTCGAAGGGGTATATTAGCGGTCCTACACTTGGTTAGCACCCTCCCCCCTCTGGATCCTGCACTAATTCGAGCTGGCCTCCATCGGTATCAGTCCGGAAGCTCCACTCTCTATCGTAGTCCTAATCAACAGGGTGCCAGTTTGCTCACGTGGAAGTTTGAGGCCCTTTGTGCTCCATAGCCAATCACTAACCATGCACGCGCGACCCACTCTACGTCCAGATCGGCTATAATAGTTGCGCCCGGGACTGGCAGAGTAGACATGTAAGCTAGATAGAGCCCCGACATCGGCCAAGAGATCCTACGCTGCTTCCAGATAATGAGAGACATTCTAGCATTAGACATGCAAGTCGGCAGGGACTCCCCTTATCTAGTAATTTCGATGAATTGGTTTTTCGGCTAGCATCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGACCATGCCGACCTCATCATAGAAGGAATGCTCTAAACTTAGAGTGCTACTAGGAAAACTATTAATCAATGATCGTCCTGCTTACATAGCTGGACGGCGAAAGTTCTTATACTGCGGAGGTTGCTGACGTAGAGTGCGCTGGGTACAGCGGATAAGTTGATCAGGGTGGGGATAGGGTGGCTCACCGTTTATACTCATATAGATTCCTGGCGTCGACGCTGTGACAGGGTCGAGATCGAGGGGGAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGCGGAGCGGAGGGAAAATTATCACCAGAGGGTAGGGGCTCGCGACATTCTATTCAATGCATTTCAAGCTACTTACGTATTTCGGCACAGTGACTACTGCCTGCGCGGCAGCCGTAAGGTTTCCCGTCAATAGGTGGCACGTATCATTGATGAAAGTGTCAGCTAATCATTCAGGCCTTA"
x = 0 # Searching index.
dataSTR = { # All the STRs to seach for.
"AGATC":0,
"TTTTTTCT":0,
"AATG":0,
"TCTAG":0,
"GATA":0,
"TATC":0,
"GAAA":0,
"TCTG":0,
}
# This dict will hold all the count values of STR's in the text-file.
# Scanning STR's from the txt file.
total = len(subject)
limit = 8
while x < total:
currentString = subject[x:x+limit] # A temporary variable to hold the next few letters from the text-file at index x.
for STR in STRs:
if STR in currentString: # The STR is found within this set of letters?
lSTR = len(STR) - 1
if STR[0:lSTR] == currentString[0:lSTR]: # In order to minimise the risk of duplication...
dataSTR[STR] += 1 # ...the STR must be at the start of currentString.
#print(currentString, STR, x, dataSTR[STR])
x += lSTR # The index must be boosted each time a new STR is read. In the event that an STR is at the end of a stand...
x += 1 # The index counts up by 1 by default. (From above) ...so that no duplicates are added.
print(dataSTR.items())
print("The correct result is: AGATC - 22, TTTTTTCT - 33, AATG - 43, TCTAG - 12, GATA - 26, TATC - 18, GAAA - 47, TCTG - 41")
(Sorry its very long, it might be helpful to copy into a separate python file).
As you will see from running it, the result my program brings up from counting is incorrect. The correct results are in the final print statement of the program, but the program does not match this (yes I know that these results are 100% correct since this is part of a problem set from an online computer science course).
However, I cannot seem to find the bug or logic error that seems to be causing my program to count wrong and I have been trying for quite a while now. Does anyone know what the solution is?
Please feel free to ask me anything about the program, thank you all.

Your problem statement doesn't agree with the "correct results" given in your example code. Either you've misunderstood the problem, or you've taken the correct results from a different problem. (The "correct results" appear to be for the problem of finding the maximum number of consecutive repeats of each query string.) [The latter possibility is the point that Chris Charley makes in a comment on the original post.]
You can convince yourself by doing the problem "by hand": look at the subject string in a text editor, pick a query string, do a search on it, and step through the occurrences.
E.g., for the query string "GAAA", you'll count ~67 occurrences, but most of them are in a block of 47 repeats in subject[1449:1637]. (This is more obvious if you use a text editor that highlights all occurrences of the search string, as 188 characters of consecutive highlighting should jump out at you.) And 47 agrees with the "correct result" for GAAA.

Does this help?
count_results = dict()
STRs = ['AGATC','TTTTTTCT','AATG','TCTAG','GATA','TATC','GAAA','TCTG']
subject = "loooong string..."
for search_string in STRs:
count_results[search_string] = subject.count(search_string)
print(count_results)
{'AGATC': 28, 'TTTTTTCT': 33, 'AATG': 69, 'TCTAG': 18, 'GATA': 46, 'TATC': 36, 'GAAA': 67, 'TCTG': 60}
I realize the results are sometimes different to your expected counts, but I didn't go through the intricacies of your search algo and wonder if the expected output might be wrong? If not, check out the docs for the str.count() function, to see how & why it gets different output, and adapt what it does to your needs.

Try like this:
import re
# Define STRs and subject here
dic = {}
for x in STRs:
tv = len([m.start() for m in re.finditer(x,subject)])
tv += 1
dic[x] = tv
for y in dic.keys():
print(y,dic[y])

The results in the last print statement are incorrect. I checked it with python's built in method .count(), if you are allowed to use this method just use this one instead, but if not, I would recommend to do the following:
total = len(subject)
while x < total:
for STR in STRs:
limit = len(STR)
currentString = subject[x:x+limit]
if STR == currentString:
dataSTR[STR] += 1
x += 1
that way, you set the limit to the string's length so the STR is either exactly the string or not, so you don't have to check for duplicates. I don't know why your code didn't work, but I hope this will help you.

Related

Extract words from random strings

Below I have some strings in a list:
some_list = ['a','l','p','p','l','l','i','i','r',i','r','a','a']
Now I want to take the word april from this list. There are only two april in this list. So I want to take that two april from this list and append them to another extract list.
So the extract list should look something like this:
extract = ['aprilapril']
or
extract = ['a','p','r','i','l','a','p','r','i','l']
I tried many times trying to get the everything in extract in order, but I still can't seems to get it.
But I know I can just do this
a_count = some_list.count('a')
p_count = some_list.count('p')
r_count = some_list.count('r')
i_count = some_list.count('i')
l_count = some_list.count('l')
total_count = [a_count,p_count,r_count,i_count,l_count]
smallest_count = min(total_count)
extract = ['april' * smallest_count]
Which I wouldn't be here If I just use the code above.
Because I made some rules for solving this problem
Each of the characters (a,p,r,i and l) are some magical code elements, these code elements can't be created out of thin air; they are some unique code elements, that has some uniquw identifier, like a secrete number that is associated with them. So you don't know how to create this magical code elements, the only way to get the code elements is to extract them to a list.
Each of the characters (a,p,r,i and l) must be in order. Imagine they are some kind of chains, they will only work if they are together. Meaning that we got to put p next to and in front of a, and l must come last.
These important code elements are some kind of top secrete stuff, so if you want to get it, the only way is to extract them to a list.
Below are some examples of a incorrect way to do this: (breaking the rules)
import re
word = 'april'
some_list = ['aaaaaaappppppprrrrrriiiiiilll']
regex = "".join(f"({c}+)" for c in word)
match = re.match(regex, text)
if match:
lowest_amount = min(len(g) for g in match.groups())
print(word * lowest_amount)
else:
print("no match")
from collections import Counter
def count_recurrence(kernel, string):
# we need to count both strings
kernel_counter = Counter(kernel)
string_counter = Counter(string)
effective_counter = {
k: int(string_counter.get(k, 0)/v)
for k, v in kernel_counter.items()
}
min_recurring_count = min(effective_counter.values())
return kernel * min_recurring_count
This might sounds really stupid, but this is actually a hard problem (well for me). I originally designed this problem for myself to practice python, but it turns out to be way harder than I thought. I just want to see how other people solve this problem.
If anyone out there know how to solve this ridiculous problem, please help me out, I am just a fourteen-year-old trying to do python. Thank you very much.
I'm not sure what do you mean by "cannot copy nor delete the magical codes" - if you want to put them in your output list you will need to "copy" them somehow.
And btw your example code (a_count = some_list.count('a') etc) won't work since count will always return zero.
That said, a possible solution is
worklist = [c for c in some_list[0]]
extract = []
fail = False
while not fail:
lastpos = -1
tempextract = []
for magic in magics:
if magic in worklist:
pos = worklist.index(magic, lastpos+1)
tempextract.append(worklist.pop(pos))
lastpos = pos-1
else:
fail = True
break
else:
extract.append(tempextract)
Alternatively, if you don't want to pop the elements when you find them, you may compute the positions of all the occurences of the first element (the "a"), and set lastpos to each of those positions at the beginning of each iteration
May not be the most efficient way, although code works and is more explicit to understand the program logic:
some_list = ['aaaaaaappppppprrrrrriiiiiilll']
word = 'april'
extract = []
remove = []
string = some_list[0]
for x in range(len(some_list[0])//len(word)): #maximum number of times `word` can appear in `some_list[0]`
pointer = i = 0
while i<len(word):
j=0
while j<(len(string)-pointer):
if string[pointer:][j] == word[i]:
extract.append(word[i])
remove.append(pointer+j)
i+=1
pointer = j+1
break
j+=1
if i==len(word):
for r_i,r in enumerate(remove):
string = string[:r-r_i] + string[r-r_i+1:]
remove = []
elif j==(len(string)-pointer):
break
print(extract,string)

Extracting multiple data from a single list

I working on a text file that contains multiple information. I converted it into a list in python and right now I'm trying to separate the different data into different lists. The data is presented as following:
CODE/ DESCRIPTION/ Unity/ Value1/ Value2/ Value3/ Value4 and then repeat, an example would be:
P03133 Auxiliar helper un 203.02 417.54 437.22 675.80
My approach to it until now has been:
Creating lists to storage each information:
codes = []
description = []
unity = []
cost = []
Through loops finding a code, based on the code's structure, and using the code's index as base to find the remaining values.
Finding a code's easy, it's a distinct type of information amongst the other data.
For the remaining values I made a loop to find the next value that is numeric after a code. That way I can delimitate the rest of the indexes:
The unity would be the code's index + index until isnumeric - 1, hence it's the first information prior to the first numeric value in each line.
The cost would be the code's index + index until isnumeric + 2, the third value is the only one I need to store.
The description is a little harder, the number of elements that compose it varies across the list. So I used slicing starting at code's index + 1 and ending at index until isnumeric - 2.
for i, carc in enumerate(txtl):
if carc[0] == "P" and carc[1].isnumeric():
codes.append(carc)
j = 0
while not txtl[i+j].isnumeric():
j = j + 1
description.append(" ".join(txtl[i+1:i+j-2]))
unity.append(txtl[i+j-1])
cost.append(txtl[i+j])
I'm facing some problems with this approach, although there will always be more elements to the list after a code I'm getting the error:
while not txtl[i+j].isnumeric():
txtl[i+j] list index out of range.
Accepting any solution to debug my code or even new solutions to problem.
OBS: I'm also going to have to do this to a really similar data font, but the code would be just a sequence of 7 numbers, thus harder to find amongst the other data. Any solution that includes this facet is also appreciated!
A slight addition to your code should resolve this:
while i+j < len(txtl) and not txtl[i+j].isnumeric():
j += 1
The first condition fails when out of bounds, so the second one doesn't get checked.
Also, please use a list of dict items instead of 4 different lists, fe:
thelist = []
thelist.append({'codes': 69, 'description': 'random text', 'unity': 'whatever', 'cost': 'your life'})
In this way you always have the correct values together in the list, and you don't need to keep track of where you are with indexes or other black magic...
EDIT after comment interactions:
Ok, so in this case you split the line you are processing on the space character, and then process the words in the line.
from pprint import pprint # just for pretty printing
textl = 'P03133 Auxiliar helper un 203.02 417.54 437.22 675.80'
the_list = []
def handle_line(textl: str):
description = ''
unity = None
values = []
for word in textl.split()[1:]:
# it splits on space characters by default
# you can ignore the first item in the list, as this will always be the code
# str.isnumeric() doesn't work with floats, only integers. See https://stackoverflow.com/a/23639915/9267296
if not word.replace(',', '').replace('.', '').isnumeric():
if len(description) == 0:
description = word
else:
description = f'{description} {word}' # I like f-strings
elif not unity:
# if unity is still None, that means it has not been set yet
unity = word
else:
values.append(word)
return {'code': textl.split()[0], 'description': description, 'unity': unity, 'values': values}
the_list.append(handle_line(textl))
pprint(the_list)
str.isnumeric() doesn't work with floats, only integers. See https://stackoverflow.com/a/23639915/9267296

Return the rank of word in Gensim Word2vec

I am now working on a project using Gensim.word2vec, and I am a total freshman for this field.
Actually I already got a model. Are there any way that I can get the similarity rank of a word for another word. For example, the top 2 most similar words for the word 'girl' is 'lady' and then 'woman'. Are there any functions I can use if i enter 'lady' is can return 1, if i enter 'woman' it can return 2?
Thanks!
There's no gensim API for this, but you can use basic Python code to find which position (if any) a word appears in a longer sequence – such as the list of results given by gensim's most_similar().
For example:
origin_word = 'apple'
query_word = 'orange'
all_sims = w2v_model.most_similar(origin_word, topn=0) # topn=0 gets all results
query_index = -1
for i, sim_tuple in enumerate(all_sims):
if sim_tuple[0] == query_word:
query_index = i
break
At the end of this code, query_index will either be the (0-based) position of 'orange' in the list-of-all-similars, or -1 if not found.
Note that the most expensive step is the creation of the all_sims ordered-list of all similar words; if you are going to be checking the ranks of multiple query words against one origin word, you'd definitely want to keep the all_sims around rather than re-compute it each time.
In fact, if you were sure you were going to do lots of such lookups, potentially down through the very-deepest words, you might do a single pass to change the results into a dict:
word_to_sims_index = {}
for i, sim_tuple in enumerate(all_sims):
word_to_sims_index[i] = sim_tuple[0]
After that, finding the index of a word would be a (quick constant-time) dict lookup...
query_index = word_to_sims_index[query_word]
...that will throw a KeyError if the query word isn't in the dict. (You could use word_to_sims_index.get(query_word, -1) if you instead wanted a default -1 response when the key is not present.)
I think this is a duplicate, and as they say in the other answer you can use model.rank('girl', 'lady')==1.

Making string series in Python

I have a problem in Python I simply can't wrap my head around, even though it's fairly simple (I think).
I'm trying to make "string series". I don't really know what it's called, but it goes like this:
I want a function that makes strings that run in series, so that every time the functions get called it "counts" up once.
I have a list with "a-z0-9._-" (a to z, 0 to 9, dot, underscore, dash). And the first string I should receive from my method is aaaa, next time I call it, it should return aaab, next time aaac etc. until I reach ----
Also the length of the string is fixed for the script, but should be fairly easy to change.
(Before you look at my code, I would like to apologize if my code doesn't adhere to conventions; I started coding Python some days ago so I'm still a noob).
What I've got:
Generating my list of available characters
chars = []
for i in range(26):
chars.append(str(chr(i + 97)))
for i in range(10):
chars.append(str(i))
chars.append('.')
chars.append('_')
chars.append('-')
Getting the next string in the sequence
iterationCount = 0
nameLen = 3
charCounter = 1
def getString():
global charCounter, iterationCount
name = ''
for i in range(nameLen):
name += chars[((charCounter + (iterationCount % (nameLen - i) )) % len(chars))]
charCounter += 1
iterationCount += 1
return name
And it's the getString() function that needs to be fixed, specifically the way name gets build.
I have this feeling that it's possible by using the right "modulu hack" in the index, but I can't make it work as intended!
What you try to do can be done very easily using generators and itertools.product:
import itertools
def getString(length=4, characters='abcdefghijklmnopqrstuvwxyz0123456789._-'):
for s in itertools.product(characters, repeat=length):
yield ''.join(s)
for s in getString():
print(s)
aaaa
aaab
aaac
aaad
aaae
aaaf
...

Python - Converting a Number to a Letter without an if statement

I am making a program for my own purposes (a naming program) that completely generates a random name. The problem is I cannot assign a number to a letter, so as a being 1 and z being 26, or a being 0 and z being 25. It gives me a SyntaxError. I need to assign this because the random integer (1,26) triggers a letter (if the random integer is 1, select A) and prints the name.
EDIT:
I have implemented your advice, and it works, I am grateful for this, but I wish to have my program create readable names, or more procedural. Here is an example of a name after I tweaked my program: ddjau. Now that doesn't look like a name, so I want it my program to work as if it were creating REAL names, like Samuel or other common names. Thanks!
EDIT (2):
Thanks, Adam, but I need a sort of 'seed' for the user to enter for the start of the name is. (Seed = A, Name = Adam. Seed = G, Name = George.) Should I do this by searching the file line by line, at the very beginning? If so, how do I do this?
Short Answer
Look into Python dictionaries to allow the 1 = 'a' type assignments. Below I have working example that would generate a random name based on gender and a 'litter'.
Disclaimer
I do not fully understand (via the code) what you're trying to accomplish with char/ord and a random letter. Also note having absolutely no idea of your design goals or requirements, I have made the example more complex than it may need to be for instructional purposes.
Additional Resources
* Python Docs for dictionary
* Using Python dictionary relationship to search both ways
In response to the last edit
If you are looking to build random 'real' names, I think your best bet will be to use a large list of names and just pick a random one. If I were you I'd look into something linking to the census results: males and females. Note that male_names.txt and female_names.txt are a copy of the list found at the census website. As a disclaimer, I'm sure there is a more efficient way to load / read the file. Just use this example as a proof on concept.
Update
Here's a quick and dirty way to seed the random values. Again I am not sure that this is the most pythonic way or most efficient way, but it works.
Example
import random
import time
def get_random_name(gender, seed):
if(gender == 'male'):
file = 'male_names.txt'
elif(gender == 'female'):
file = 'female_names.txt'
fid = open(file,'r')
names = []
total_names = 0
for line in fid:
if(line.lower().startswith(seed)):
names.append(line)
total_names = total_names + 1
random_index = random.randint(0,total_names)
return names[random_index]
if (__name__ == "__main__"):
print 'Welcome to Name Database 2.2\n'
print '1. Boy'
print '2. Girl'
bog = raw_input('\nGender: ')
print 'What should the name start with?'
print 'A, Ab, Abc, B, Ba, Br, etc...'
print ''
l = raw_input('Leter(s): ').lower()
new_name = ''
if bog == '1': # Boy
print get_random_name('male',l)
elif bog == '2':
print get_random_name('female',l)
Output
Welcome to Name Database 2.2
1. Boy
2. Girl
Gender: 2
What should the name start with?
A, Ab, Abc, B, Ba, Br, etc...
Leter(s): br
BRITTA
chr (see here) and ord (see here) are the two functions you're looking for (though you already seem to know about the latter). Follow those links for a more detailed explanation.
The first gives you a one-character string based on the integer, the second does the reverse operaion (technically, it handles Unicode as well, which chr doesn't, though you have unichr for that if you need it).
You can base your code on the following:
ch = "E"
print ord (ch) - ord ("A") + 1 # should give 5 for the fifth letter
val = 7
print chr (val + ord ("A") - 1) # should give G, the seventh letter
I'm not entirely sure what you're trying to do, but you can convert a number into a letter with the chr() function. chr() takes an ASCII code, so if you want to use the range [0, 25] instead you can adapt it like so:
chr(25 + ord('a')) # 'z'

Categories