Good morning guys :)
I am currently writing a vocabulary trainer.
I have a dictionary in which all the vocabulary words and their translations are stored. I also have a query that tells me which word I should translate.
If I now enter the translation correctly, the probability of that word being queried again should decrease. How can I do that? I wondered whether this is possible by making another list which gets drawn from less often than the first one, and moving a word into that list when its translation is answered correctly.
Here is my code:
import random

vokabeln = {
    "Haus": "house",
    "Garten": "garden",
    "Freund": "friend",
    "Freundin": "friend"
}

versuche = int(input("Anzahl der Versuche: "))
i = 0
while i < versuche:
    x = random.choice(list(vokabeln))
    y = vokabeln.get(x)
    i += 1
    versuch = input("Übersetze " + x)
    if versuch == y:
        print("Korrekt!")
    else:
        print("Falsch, richtig war " + y)
You could do the following. Each time the user gets the translation right, add that word to a separate list. The next time a word is randomly chosen and it's in the new list, allow it to be used with a certain probability, like 50%; otherwise, choose another word. You'll need to put this logic inside its own loop in case another "correct" word is randomly chosen.
import random

vokabeln = {
    "Haus": "house",
    "Garten": "garden",
    "Freund": "friend",
    "Freundin": "friend"
}

korrekt = []
versuche = int(input("Anzahl der Versuche: "))
i = 0
while i < versuche:
    ok = False
    while not ok:
        x = random.choice(list(vokabeln))
        y = vokabeln.get(x)
        if x in korrekt:
            if random.random() < 0.5:
                ok = True
        else:
            ok = True
    i += 1
    versuch = input("Übersetze " + x)
    if versuch == y:
        korrekt.append(x)
        print("Korrekt!")
    else:
        print("Falsch, richtig war " + y)
Quick idea.
import random

# (word, translation, weight) triples; lower a weight to make that word rarer
words = [
    ("Haus", "house", 1),
    ("Garten", "garden", 0.5),
    ("Freund", "friend", 1),
    ("Freundin", "friend", 1)
]

def get_word():
    total_probability = sum(x[2] for x in words)
    selected = random.random() * total_probability
    current_probability = 0
    for word, translation, probability in words:
        current_probability += probability
        if selected < current_probability:
            return word, translation
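You also need some rule for lowering a word's weight once it is answered correctly. As a minimal sketch (the halving rule here is just an example of mine, assuming words is stored as a list as above):

def lower_weight(word):
    # Example rule: halve the weight after a correct answer, with a floor of 0.1.
    for i, (w, t, p) in enumerate(words):
        if w == word:
            words[i] = (w, t, max(p * 0.5, 0.1))

x, y = get_word()
if input("Übersetze " + x) == y:
    print("Korrekt!")
    lower_weight(x)
else:
    print("Falsch, richtig war " + y)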
You can use random.choices, which allows specifying sample weights. I'd also store the vocabulary as a list, to preserve the ordering with respect to the weights. How the weights get updated on right or wrong answers is up to you, but you could use inverse scaling, for example:
import random

vocab = [
    ("Haus", "house"),
    ("Garten", "garden"),
    ("Freund", "friend"),
    ("Freundin", "friend")
]
weights = [1] * len(vocab)

while ...:
    # random.choices returns a list, so take the single sampled element
    index, (x, y) = random.choices(list(enumerate(vocab)), weights)[0]
    attempt = input("Translate " + x)
    if attempt == y:
        weights[index] = 1 / (1/weights[index] + 1)
    else:
        weights[index] = 1 / max(1/weights[index] - 1, 1)
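With this update rule a word's weight behaves like 1/(number of net correct answers + 1): it starts at 1, drops to 1/2 after one correct answer and to 1/3 after two, while a wrong answer moves it back up one step (but never above 1), so words you keep getting wrong stay at full weight.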
Python newbie here.
I have a library function that is expecting a formula object;
solver.Add(x + y < 5)
However my formula is dynamic and being provided from another system as a string "x + y < 5"
The library function doesn't accept a string. My question: is there a way to create a function that can evaluate my string and return the formula as an object? e.g.
solver.Add(evaluateFormula("x + y < 5"))
or
solver.Add(evaluateFormula("x + y") < 5)
Using the built in eval() does not work for this type of formula since that tries to execute the formula which is not what I want to do.
Here's some full sample code;
from ortools.linear_solver import pywraplp

def main():
    # Create the MIP solver BOP/SAT/CBC
    solver = pywraplp.Solver.CreateSolver('BOP')

    # Sample Input Data
    req_json = {
        "Variables": [
            {
                "Name": "x",
                "Max": 10
            },
            {
                "Name": "y",
                "Max": 5
            }
        ],
        "Constraints": [
            {
                "Name": "c",
                "Formula": "x + y < 5"
            }
        ]
    }

    # Create the variables
    v = {}
    for i in range(len(req_json['Variables'])):
        v[i] = solver.IntVar(0, req_json['Variables'][i]['Max'], req_json['Variables'][i]['Name'])

    # Create the constraints
    solver.Add(v[0] + v[1] < 5)
    # solver.Add(evaluateFormula(req_json['Constraints'][0]['Formula']))

    # Maximize
    solver.Maximize(v[0] + v[1])

    # Solve
    status = solver.Solve()
    if status == pywraplp.Solver.OPTIMAL:
        print('Solution:')
        print('Objective value =', solver.Objective().Value())
        print('x =', v[0].solution_value())
        print('y =', v[1].solution_value())
The idea is that x and y are variables, e.g.,
x = solver.IntVar(0, xmax, 'x')
y = solver.IntVar(0, ymax, 'y')
so that when you call solver.Add(x + y < 5), Python will look up the names x and y and substitute in the objects they refer to. That is, IntVar + IntVar < 5, which evaluates to a constraint object.
When you do eval('x + y < 5'), it's as if you're doing x + y < 5, so you just need to ensure that the names x and y exist in the current scope. There are two ways of achieving that, namely
var = req_json['Variables'][0]
exec('{0} = solver.IntVar(0, {1}, "{0}")'.format(var['Name'], var['Max']))  # first
locals()[var['Name']] = solver.IntVar(0, var['Max'], var['Name'])           # second
The first one creates the string 'x = solver.IntVar(0, 10, "x")', which it executes as a literal Python statement. The second one creates the IntVar programmatically and then stores it under the name x. locals()['x'] = 1 is the equivalent of x = 1 at module scope (inside a function, writes to locals() are not guaranteed to take effect).
All in all, the solution could be
# You don't need to manually store the variables, as they're added in `solver.variables()`
for var in req_json['Variables']:
    name = var['Name']
    locals()[name] = solver.IntVar(0, var['Max'], name)

for constraint in req_json['Constraints']:
    solver.Add(eval(constraint['Formula']), constraint['Name'])

# However you decide what the expression is to maximise. Whether it's another key
# in your `req_json` dict, or the LHS of the constraint formula. I'll just hardcode this.
solver.Maximize(x + y)

status = solver.Solve()
if status == pywraplp.Solver.OPTIMAL:
    print('Solution:')
    print('Objective value =', solver.Objective().Value())
    for v in solver.variables():
        print('{} = {}'.format(v.name(), v.solution_value()))
THIS ASSUMES THAT eval(constraint['Formula']) WILL NEVER DO ANYTHING MALICIOUS, which you say is your case. If you can't guarantee that, your other option is to parse the string manually for the variable names, operations and relations and build a safe string which can then be evaluated.
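If you cannot fully trust the formula strings, a slightly more defensive variant (a sketch of my own, not part of the answer above; it reduces but does not eliminate the risks of eval) is to give eval an explicit, restricted namespace containing only the solver variables:

# Build a namespace that contains only the solver variables.
namespace = {}
for var in req_json['Variables']:
    namespace[var['Name']] = solver.IntVar(0, var['Max'], var['Name'])

for constraint in req_json['Constraints']:
    # An empty __builtins__ dict removes access to built-in functions inside the formula.
    expr = eval(constraint['Formula'], {'__builtins__': {}}, namespace)
    solver.Add(expr, constraint['Name'])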
Finally, if you run this as is, you'll get an error saying
ValueError: Operators "<" and ">" not supported with the linear solver
But if you change the constraint formula to 'x + y <= 5', it'll work just fine.
I have put together some code from this link, which is nicely colour coded, with 4 minor changes to fix some errors. I also used some code from 2 previous forum threads.
What the code is supposed to do is calculate the semantic similarity between consecutive sentences across a whole text, then display all the similarity values obtained, like this:
'the yellow door.', 'The red hammer' 0.65
'pink fox in the woods.', 'commander fox is blue.' 0.32
Here is the code;
import math
import sys

import nltk
import numpy as np
from nltk.corpus import brown
from nltk.corpus import wordnet as wn

ALPHA = 0.2
BETA = 0.45
ETA = 0.4
PHI = 0.2
DELTA = 0.85

brown_freqs = dict()
N = 0
######################### word similarity ##########################
def get_best_synset_pair(word_1, word_2):
    """
    Choose the pair with highest path similarity among all pairs.
    Mimics pattern-seeking behavior of humans.
    """
    max_sim = -1.0
    synsets_1 = wn.synsets(word_1)
    synsets_2 = wn.synsets(word_2)
    if len(synsets_1) == 0 or len(synsets_2) == 0:
        return None, None
    else:
        max_sim = -1.0
        best_pair = None, None
        for synset_1 in synsets_1:
            for synset_2 in synsets_2:
                sim = wn.path_similarity(synset_1, synset_2)
                if sim > max_sim:
                    max_sim = sim
                    best_pair = synset_1, synset_2
        return best_pair

def length_dist(synset_1, synset_2):
    l_dist = sys.maxint
    if synset_1 is None or synset_2 is None:
        return 0.0
    if synset_1 == synset_2:
        # if synset_1 and synset_2 are the same synset return 0
        l_dist = 0.0
    else:
        wset_1 = set([str(x.name()) for x in synset_1.lemmas()])
        wset_2 = set([str(x.name()) for x in synset_2.lemmas()])
        if len(wset_1.intersection(wset_2)) > 0:
            # if synset_1 != synset_2 but there is word overlap, return 1.0
            l_dist = 1.0
        else:
            # just compute the shortest path between the two
            l_dist = synset_1.shortest_path_distance(synset_2)
            if l_dist is None:
                l_dist = 0.0
    # normalize path length to the range [0,1]
    return math.exp(-ALPHA * l_dist)

def hierarchy_dist(synset_1, synset_2):
    h_dist = sys.maxint
    if synset_1 is None or synset_2 is None:
        return h_dist
    if synset_1 == synset_2:
        # return the depth of one of synset_1 or synset_2
        h_dist = max([x[1] for x in synset_1.hypernym_distances()])
    else:
        # find the max depth of least common subsumer
        hypernyms_1 = {x[0]: x[1] for x in synset_1.hypernym_distances()}
        hypernyms_2 = {x[0]: x[1] for x in synset_2.hypernym_distances()}
        lcs_candidates = set(hypernyms_1.keys()).intersection(
            set(hypernyms_2.keys()))
        if len(lcs_candidates) > 0:
            lcs_dists = []
            for lcs_candidate in lcs_candidates:
                lcs_d1 = 0
                if lcs_candidate in hypernyms_1:
                    lcs_d1 = hypernyms_1[lcs_candidate]
                lcs_d2 = 0
                if lcs_candidate in hypernyms_2:
                    lcs_d2 = hypernyms_2[lcs_candidate]
                lcs_dists.append(max([lcs_d1, lcs_d2]))
            h_dist = max(lcs_dists)
        else:
            h_dist = 0
    return ((math.exp(BETA * h_dist) - math.exp(-BETA * h_dist)) /
            (math.exp(BETA * h_dist) + math.exp(-BETA * h_dist)))

def word_similarity(word_1, word_2):
    synset_pair = get_best_synset_pair(word_1, word_2)
    return (length_dist(synset_pair[0], synset_pair[1]) *
            hierarchy_dist(synset_pair[0], synset_pair[1]))
######################### sentence similarity ##########################
def most_similar_word(word, word_set):
    max_sim = -1.0
    sim_word = ""
    for ref_word in word_set:
        sim = word_similarity(word, ref_word)
        if sim > max_sim:
            max_sim = sim
            sim_word = ref_word
    return sim_word, max_sim

def info_content(lookup_word):
    global N
    if N == 0:
        # poor man's lazy evaluation
        for sent in brown.sents():
            for word in sent:
                word = word.lower()
                if not word in brown_freqs:
                    brown_freqs[word] = 0
                brown_freqs[word] = brown_freqs[word] + 1
                N = N + 1
    lookup_word = lookup_word.lower()
    n = 0 if not lookup_word in brown_freqs else brown_freqs[lookup_word]
    return 1.0 - (math.log(n + 1) / math.log(N + 1))

def semantic_vector(words, joint_words, info_content_norm):
    sent_set = set(words)
    semvec = np.zeros(len(joint_words))
    i = 0
    for joint_word in joint_words:
        if joint_word in sent_set:
            # if word in union exists in the sentence, s(i) = 1 (unnormalized)
            semvec[i] = 1.0
            if info_content_norm:
                semvec[i] = semvec[i] * math.pow(info_content(joint_word), 2)
        else:
            # find the most similar word in the joint set and set the sim value
            sim_word, max_sim = most_similar_word(joint_word, sent_set)
            semvec[i] = PHI if max_sim > PHI else 0.0
            if info_content_norm:
                semvec[i] = semvec[i] * info_content(joint_word) * info_content(sim_word)
        i = i + 1
    return semvec

def semantic_similarity(sentence_1, sentence_2, info_content_norm):
    words_1 = nltk.word_tokenize(sentence_1)
    words_2 = nltk.word_tokenize(sentence_2)
    joint_words = set(words_1).union(set(words_2))
    vec_1 = semantic_vector(words_1, joint_words, info_content_norm)
    vec_2 = semantic_vector(words_2, joint_words, info_content_norm)
    return np.dot(vec_1, vec_2.T) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))
######################### word order similarity ##########################
def word_order_vector(words, joint_words, windex):
    wovec = np.zeros(len(joint_words))
    i = 0
    wordset = set(words)
    for joint_word in joint_words:
        if joint_word in wordset:
            # word in joint_words found in sentence, just populate the index
            wovec[i] = windex[joint_word]
        else:
            # word not in joint_words, find most similar word and populate
            # word_vector with the thresholded similarity
            sim_word, max_sim = most_similar_word(joint_word, wordset)
            if max_sim > ETA:
                wovec[i] = windex[sim_word]
            else:
                wovec[i] = 0
        i = i + 1
    return wovec

def word_order_similarity(sentence_1, sentence_2):
    """
    Computes the word-order similarity between two sentences as the normalized
    difference of word order between the two sentences.
    """
    words_1 = nltk.word_tokenize(sentence_1)
    words_2 = nltk.word_tokenize(sentence_2)
    joint_words = list(set(words_1).union(set(words_2)))
    windex = {x[1]: x[0] for x in enumerate(joint_words)}
    r1 = word_order_vector(words_1, joint_words, windex)
    r2 = word_order_vector(words_2, joint_words, windex)
    return 1.0 - (np.linalg.norm(r1 - r2) / np.linalg.norm(r1 + r2))
######################### overall similarity ##########################
def similarity(sentence_1, sentence_2, info_content_norm):
    """
    Calculate the semantic similarity between two sentences. The last
    parameter is True or False depending on whether information content
    normalization is desired or not.
    """
    return DELTA * semantic_similarity(sentence_1, sentence_2, info_content_norm) + \
        (1.0 - DELTA) * word_order_similarity(sentence_1, sentence_2)
THIS IS THE LOOPING PART
with open("C:\\Users\\Lenovo2\\Desktop\\Test123.txt", "r") as sentence_file:
    # Initialize a list to hold the results
    results = []

    # Loop until we hit the end of the file
    while True:
        # Read two lines
        x = sentence_file.readline()
        y = sentence_file.readline()

        # Check if we've reached the end of the file, if so, we're done
        if not y:
            # Break out of the infinite loop
            break
        else:
            # The .rstrip('\n') removes the newline character from each line
            x = x.rstrip('\n')
            y = y.rstrip('\n')

            # Calculate your similarity value
            similarity_value = similarity(x, y, True)

            # Add the two lines and similarity value to the results list
            results.append([x, y, similarity_value])

# Loop through the pairs in the results list and print them
for pair in results:
    print(pair)
When I run the code on a text file I get a warning and, instead of obtaining a number for the similarity between sentences, I get nan:
Warning (from warnings module):
File "C:\Users\Lenovo2\Desktop\Semantic Analysis (1).py", line 191
return np.dot(vec_1, vec_2.T) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))
RuntimeWarning: invalid value encountered in double_scalars
In a previous forum thread, I learned that this warning probably means I am dividing by zero, i.e. one of the vectors is a zero vector. I am pretty much stuck there, and with limited Python experience I don't know how to fix the program easily and without changing too much.
My guess is that you are passing an empty string. Do you have any blank lines in your text? You don't strip the newline until after checking for empty string, so a string containing only a newline will not be caught.
Since you appear to be on Windows, there also might be '\r\n' style newlines, so your rstrip might not work as expected.
I'd recommend adding the following modification (also do a print for debugging):
# Loop until we hit the end of the file
while True:
    # Read two lines, removing trailing whitespace
    x = sentence_file.readline().rstrip()
    y = sentence_file.readline().rstrip()

    # Check if we've reached the end of the file, if so, we're done
    if not x or not y:
        # Break out of the infinite loop
        break
    else:
        print(x, y)

        # Calculate your similarity value
        similarity_value = similarity(x, y, True)

        # Add the two lines and similarity value to the results list
        results.append([x, y, similarity_value])
Note that the code appears to have a bug, because you are not comparing sentences pairwise. That is, if you had sentences [a, b, c, d], you are only comparing (a, b) and (c, d), but you really want to compare (a, b), (b, c), (c, d).
You can clean this up a bit by using the itertools library:
from itertools import pairwise  # available since Python 3.10

with open("C:\\Users\\Lenovo2\\Desktop\\Test123.txt", "r") as lines:
    for a, b in pairwise(lines):
        x = a.rstrip()
        y = b.rstrip()
        # ... rest unchanged
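If you're on a Python version older than 3.10 (where itertools.pairwise was added), a small sketch of the same consecutive-pairs idea using zip works as well; this version also skips blank lines, which avoids the zero-vector problem described above:

results = []
with open("C:\\Users\\Lenovo2\\Desktop\\Test123.txt", "r") as f:
    # keep only non-empty lines, stripped of trailing whitespace
    lines = [line.rstrip() for line in f if line.strip()]

for x, y in zip(lines, lines[1:]):
    results.append([x, y, similarity(x, y, True)])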
I am trying to search a list (DB) for possible matches of fragments of text. For instance, I have a DB with text "evilman". I want to use user inputs to search for any possible matches in the DB and give the answer with a confidence. If the user inputs "hello", then there are no possible matches. If the user inputs "evil", then the possible match is evilman with a confidence of 57% (4 out of 7 characters match) and so on.
However, I also want a way to match input text such as "evxxman". 5 out of 7 characters of evxxman match the text "evilman" in the DB. But a simple check in python will say no match since it only outputs text that matches consecutively. I hope it makes sense. Thanks
Following is my code:
db = []
possible_signs = []
count = 0
db.append("evilman")

text = raw_input()
for s in db:
    if text in s:
        if len(text) >= len(s)/2:
            possible_signs.append(s)
            count += 1
            confidence = (float(len(text)) / float(len(s))) * 100
            print "Confidence:", '%.2f' % (confidence), "<possible match:>", possible_signs[0]
This first version seems to comply with your examples. It makes the strings "slide" against each other and counts the number of identical characters.
The ratio is computed by dividing that character count by the reference string length. Add a max and voila.
Call it for each string in your DB.
def commonChars(txt, ref):
    txtLen = len(txt)
    refLen = len(ref)
    r = 0
    for i in range(refLen + (txtLen - 1)):
        rStart = abs(min(0, txtLen - i - 1))
        tStart = txtLen - i - 1 if i < txtLen else 0
        l = min(txtLen - tStart, refLen - rStart)
        c = 0
        for j in range(l):
            if txt[tStart + j] == ref[rStart + j]:
                c += 1
        r = max(r, c / refLen)
    return r

print(commonChars('evxxman', 'evilman'))  # 0.7142857142857143
print(commonChars('evil', 'evilman'))     # 0.5714285714285714
print(commonChars('man', 'evilman'))      # 0.42857142857142855
print(commonChars('batman', 'evilman'))   # 0.42857142857142855
print(commonChars('batman', 'man'))       # 1.0
This second version produces the same results, but uses the difflib module mentioned in other answers.
It computes the matching blocks, sums their lengths, and computes the ratio against the reference length.
import difflib

def commonBlocks(txt, ref):
    matcher = difflib.SequenceMatcher(a=txt, b=ref)
    matchingBlocks = matcher.get_matching_blocks()
    matchingCount = sum([b.size for b in matchingBlocks])
    return matchingCount / len(ref)

print(commonBlocks('evxxman', 'evilman'))    # 0.7142857142857143
print(commonBlocks('evxxxxman', 'evilman'))  # 0.7142857142857143
As shown by the calls above, the behavior is slightly different. "holes" between matching blocks are ignored, and do not change the final ratio.
For finding matches with a quality-estimation, have a look at difflib.SequenceMatcher.ratio and friends - these functions might not be the fastest match-checkers but they are easy to use.
Example copied from difflib docs
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.quick_ratio()
0.75
>>> s.real_quick_ratio()
1.0
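Applied to the strings from the question, it looks roughly like this (ratio is documented as 2*M/T, where M is the number of matched characters and T is the combined length of both strings, which is where these values come from):
>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None, "evil", "evilman").ratio()
0.7272727272727273
>>> SequenceMatcher(None, "evxxman", "evilman").ratio()
0.7142857142857143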
Based on your description and examples, it seems to me that you're actually looking for something like the Levenshtein (or edit) distance. Note that it does not quite give the scores you specify, but I think it gives the scores you actually want.
There are several packages implementing this efficiently, e.g., distance:
In [1]: import distance
In [2]: distance.levenshtein('evilman', 'hello')
Out[2]: 6L
In [3]: distance.levenshtein('evilman', 'evil')
Out[3]: 3L
In [4]: distance.levenshtein('evilman', 'evxxman')
Out[4]: 2L
Note that the library contains several measures of similarity; e.g., jaccard and sorensen return a normalized value by default:
>>> distance.sorensen("decide", "resize")
0.5555555555555556
>>> distance.jaccard("decide", "resize")
0.7142857142857143
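If you want a confidence percentage like the one in your example, one option (this normalization is my own suggestion, not something the library provides) is to divide the edit distance by the length of the longer string and invert:

import distance

def confidence(query, candidate):
    # 1 - normalized edit distance, expressed as a percentage
    dist = distance.levenshtein(query, candidate)
    return (1 - float(dist) / max(len(query), len(candidate))) * 100

print(confidence('evxxman', 'evilman'))  # 2 edits over 7 characters -> ~71.4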
Create a while loop and track two iterators, one for your key word ("evil") and one for your query word ("evilman"). Here is some pseudocode:
key = "evil"
query = "evilman"
key_iterator = 0
query_iterator = 0
confidence_score = 0
while( key_iterator < key.length && query_iterator < query.length ) {
if (key[key_iterator] == query[query_iterator]) {
confidence_score++
key_iterator++
}
query_iterator++
}
// If we didnt reach the end of the key
if (key_iterator != key.length) {
confidence_score = 0
}
print ("Confidence: " + confidence_score + " out of " + query.length)
So I'm currently working on code which computes simple derivatives. For now my code looks something like this:
def diff():
    coeffs = []

    # checking the rank of the function
    lvl = int(raw_input("Tell me a rank of your function: "))
    if lvl == 0:
        print "\nIf the rank is 0, a differential of a function will always be 0"

    # Asking the user to write coefficients (like 4x^2 - he writes 4)
    for i in range(0, lvl):
        coeff = int(raw_input("Tell me a coefficient: "))
        coeffs.append(coeff)

    # Printing all coefficients
    print "\nSo your coefficients are: "
    for item in coeffs:
        print item
So what do I want to do next? I have every coefficient in my coeffs[] list. Now I want to take every single one from there and assign it to a different variable, just to make use of it. How can I do that? I suppose I will have to use a loop, but I have tried for hours and nothing has worked. So, how can I do this? It would be like: a = coeffs[0], b = coeffs[1], ..., x = coeffs[lvl].
Just access the coefficients directly from the list via their indices.
If you want to use the values in a different context that involves changing them, but you want to keep the original list unchanged, then copy the list to a new list:
import copy
mutableCoeffs = copy.copy(coeffs)
You do not need new variables.
You already have all you need to compute the coefficients for the derivative function.
print "Coefficients for the derivative:"
l = len(coeffs) -1
for item in coeffs[:-1]:
print l * item
l -=1
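For example, with coeffs = [4, 3, 2] (i.e. 4x^2 + 3x + 2, highest power first), this prints 8 and 3, the coefficients of the derivative 8x + 3.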
Or if you want to put them in a new list:
deriv_coeffs = []
l = len(coeffs) - 1
for item in coeffs[:-1]:
    deriv_coeffs.append(l * item)
    l -= 1
I guess from there you want to differentiate, no? So you just assign each coefficient times its rank to the index one below it:
def diff():
    coeffs = []

    # checking the rank of the function
    lvl = int(raw_input("Tell me a rank of your function: "))
    if lvl == 0:
        print "\nIf the rank is 0, a differential of a function will always be 0"

    # Asking the user to write coefficients (like 4x^2 - he writes 4)
    for i in range(0, lvl):
        coeff = int(raw_input("Tell me a coefficient: "))
        coeffs.append(coeff)

    # Printing all coefficients
    print "\nSo your coefficients are: "
    for item in coeffs:
        print item

    answer_coeff = [0] * (lvl - 1)
    for i in range(0, lvl - 1):
        answer_coeff[i] = coeffs[i + 1] * (i + 1)

    print "The derivative is:"
    string_answer = "%d" % answer_coeff[0]
    for i in range(1, lvl - 1):
        string_answer = string_answer + (" + %d * X^%d" % (answer_coeff[i], i))
    print string_answer
If you REALLY want to assign the list elements to individual variables you could do so by accessing the globals() dict. For example:
for j in range(len(coeffs)):
    globals()["elm{0}".format(j)] = coeffs[j]
Then you'll have your coefficients in the global variables elm0, elm1 and so on.
Please note that this is most probably not what you really want (but only what you asked for).
Instead of a complete shuffle, I am looking for a partial shuffle function in python.
Example : "string" must give rise to "stnrig", but not "nrsgit"
It would be better if I can define a specific "percentage" of characters that have to be rearranged.
The purpose is to test string comparison algorithms. I want to determine the "percentage of shuffle" beyond which my algorithm will mark two (shuffled) strings as completely different.
Update :
Here is my code. Improvements are welcome !
import random

percent_to_shuffle = int(raw_input("Give the percent value to shuffle : "))
to_shuffle = list(raw_input("Give the string to be shuffled : "))
num_of_chars_to_shuffle = int((len(to_shuffle) * percent_to_shuffle) / 100)

for i in range(0, num_of_chars_to_shuffle):
    x = random.randint(0, (len(to_shuffle) - 1))
    y = random.randint(0, (len(to_shuffle) - 1))
    z = to_shuffle[x]
    to_shuffle[x] = to_shuffle[y]
    to_shuffle[y] = z

print ''.join(to_shuffle)
This is a simpler problem than it looks. And, as usual, the language has the right tools so nothing gets between you and the idea:
import random

def pashuffle(string, perc=10):
    data = list(string)
    for index, letter in enumerate(data):
        if random.randrange(0, 100) < perc / 2:
            new_index = random.randrange(0, len(data))
            data[index], data[new_index] = data[new_index], data[index]
    return "".join(data)
Your problem is tricky, because there are some edge cases to think about:
Strings with repeated characters (e.g., how would you shuffle "aaaab"?)
How do you measure chained character swaps or rearranged blocks?
In any case, the metric you define for shuffling strings up to a certain percentage is likely to be the same one your algorithm uses to judge how close they are.
My code to shuffle n characters:
import random

def shuffle_n(s, n):
    idx = list(range(len(s)))
    random.shuffle(idx)
    idx = idx[:n]
    mapping = dict((idx[i], idx[i-1]) for i in range(n))
    return ''.join(s[mapping.get(x, x)] for x in range(len(s)))
Basically it chooses n positions to swap at random, and then exchanges each of them with the next one in the list... This way it ensures that no inverse swaps are generated and exactly n characters are swapped (if there are repeated characters, bad luck).
Explained run with 'string', 3 as input:
idx is [0, 1, 2, 3, 4, 5]
we shuffle it, now it is [5, 3, 1, 4, 0, 2]
we take just the first 3 elements, now it is [5, 3, 1]
those are the characters that we are going to swap
s t r i n g
  ^   ^   ^
t (1) will be i (3)
i (3) will be g (5)
g (5) will be t (1)
the rest will remain unchanged
so we get 'sirgnt'
The bad thing about this method is that it does not generate all the possible variations, for example, it could not make 'gnrits' from 'string'. This could be fixed by making partitions of the indices to be shuffled, like this:
import random

def randparts(l):
    n = len(l)
    s = random.randint(0, n-1) + 1
    if s >= 2 and n - s >= 2:  # the split makes two valid parts
        yield l[:s]
        for p in randparts(l[s:]):
            yield p
    else:  # the split would make a single cycle
        yield l

def shuffle_n(s, n):
    idx = list(range(len(s)))
    random.shuffle(idx)
    mapping = dict((x[i], x[i-1])
                   for x in randparts(idx[:n])
                   for i in range(len(x)))
    return ''.join(s[mapping.get(x, x)] for x in range(len(s)))
import random

def partial_shuffle(a, part=0.5):
    # which characters are to be shuffled:
    idx_todo = random.sample(xrange(len(a)), int(len(a) * part))

    # what are the new positions of these to-be-shuffled characters:
    idx_target = idx_todo[:]
    random.shuffle(idx_target)

    # map all "normal" character positions {0:0, 1:1, 2:2, ...}
    mapper = dict((i, i) for i in xrange(len(a)))

    # update with all shuffles in the string: {old_pos:new_pos, old_pos:new_pos, ...}
    mapper.update(zip(idx_todo, idx_target))

    # use mapper to modify the string:
    return ''.join(a[mapper[i]] for i in xrange(len(a)))

for i in xrange(5):
    print partial_shuffle('abcdefghijklmnopqrstuvwxyz', 0.2)
prints
abcdefghljkvmnopqrstuxwiyz
ajcdefghitklmnopqrsbuvwxyz
abcdefhwijklmnopqrsguvtxyz
aecdubghijklmnopqrstwvfxyz
abjdefgcitklmnopqrshuvwxyz
Evil and using a deprecated API:
import random

# adjust constant to taste
# 0 -> no effect, 0.5 -> completely shuffled, 1.0 -> reversed
# Of course this assumes your input is already sorted ;)
# (Python 2 only: the cmp parameter of sorted() was removed in Python 3.)
''.join(sorted(
    'abcdefghijklmnopqrstuvwxyz',
    cmp=lambda a, b: cmp(a, b) * (-1 if random.random() < 0.2 else 1)
))
maybe like so:
>>> s = 'string'
>>> shufflethis = list(s[2:])
>>> random.shuffle(shufflethis)
>>> s[:2]+''.join(shufflethis)
'stingr'
Building on fortran's idea, I'm adding this one to the collection. It's pretty fast:
import random

def partial_shuffle(st, p=20):
    p = int(round(p / 100.0 * len(st)))
    idx = range(len(st))
    sample = random.sample(idx, p)

    res = str()
    samptrav = 1
    for i in range(len(st)):
        if i in sample:
            res += st[sample[-samptrav]]
            samptrav += 1
            continue
        res += st[i]
    return res