Infer Spaces: Ignore Numbers and Special Characters - python

I'm trying to pass over everything that isn't a letter (apostrophes, etc), and then continue on afterwards. The number should be in its respective place in the result. This is from this accepted answer, and the word list is here.
The string is "thereare7deadlysins"
The code below outputs "there are 7 d e a d l y s i n s"
I'm trying to get "there are 7 deadly sins"
I tried (below), but I receive IndexError: 'string index out of range'
# Backtrack to recover the minimal-cost string.
out = []
i = len(s)
while i>0:
if isinstance(s[i], int):
continue
c,k = best_match(i)
assert c == cost[i]
out.append(s[i-k:i])
i -= k
The entire thing is:
from math import log
import string
# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("/Users/.../Desktop/wordlist.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
table = string.maketrans("","")
l = "".join("thereare7deadlysins".split()).lower()
def infer_spaces(s):
"""Uses dynamic programming to infer the location of spaces in a string
without spaces."""
# Find the best match for the i first characters, assuming cost has
# been built for the i-1 first characters.
# Returns a pair (match_cost, match_length).
def best_match(i):
candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)
# Build the cost array.
cost = [0]
for i in range(1,len(s)+1):
c,k = best_match(i)
cost.append(c)
# Backtrack to recover the minimal-cost string.
out = []
i = len(s)
while i>0:
c,k = best_match(i)
assert c == cost[i]
out.append(s[i-k:i])
i -= k
return " ".join(reversed(out))
def test_trans(s):
return s.translate(table, string.punctuation)
s = test_trans(l)
print(infer_spaces(s))
EDIT: Based on the accepted answer the following solved my problem:
1. Remove single letters from the wordlist (except a, e, i)
2. Added the following below wordcost.
nums = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
for n in nums:
wordcost[n] = log(2)
The suggestion to change wordcost to (below) did not produce optimal results.
wordcost = dict( (k, (i+1)*log(1+len(k))) for i,k in enumerate(words) )
Example:
String: "Recall8importantscreeningquestions"
Original wordcost: "recall 8 important screening questions"
Suggested wordcost: "re call 8 important s c re e n in g question s"

Note that the word list contains all 26 individual letters as words.
With just the following modifications, your algorithm will correctly infer the spaces for the input string "therearesevendeadlysins" (i.e. "7" changed to "seven"):
Remove the single letter words from the word list (perhaps except for "a" and "i".)
As #Pm 2Ring noted, change the definition of wordcost
to:
wordcost = wordcost = dict( (k, (i+1)*log(1+len(k))) for i,k in enumerate(words) )
So there is something about non-letters that is goofing up your algorithm. Since you have already removed punctuation, perhaps you should treat a string of non-letters as a single word.
For instance, if you add:
wordcost["7"] = log(2)
(in addition to changes 1 and 2 above) your algorithm works on the original test string.

i = len(s) -1
to avoid IndexError: 'string index out of range'
and
if s[i].isdigit():
is the test you're looking for.

Related

How to get an array of all possible binary numbers given a binary number of the same length

goal: I have a string which usually looks like this "010" and I need to replace the zeros by 1 in all the possible ways like this ["010", "110", "111", "011"]
problem when I replace the zeros with 1s I iterate through the letters of the string from left to right then from right to left. As you can see in the code where I did number = number[::-1]. Now, this method does not actually cover all the possibilities.
I also need to maybe start from the middle or maybe use the permutation method But not sure how to apply in python.
mathematically there is something like factorial of the number of places/(2)!
A = '0111011110000'
B = '010101'
C = '10000010000001101'
my_list = [A,B,C]
for number in [A,B,C]:
number = number[::-1]
for i , n in enumerate(number):
number = list(number)
number[i] = '1'
number = ''.join(number)
if number not in my_list: my_list.append(number)
for number in [A,B,C]:
for i , n in enumerate(number):
number = list(number)
number[i] = '1'
number = ''.join(number)
if number not in my_list: my_list.append(number)
print(len(my_list))
print(my_list)
You can use separate out the zeros and then use itertools.product -
from itertools import product
x = '0011'
perm_elements = [('0', '1') if digit == '0' else ('1', ) for digit in x]
print([''.join(x) for x in product(*perm_elements)])
['0011', '0111', '1011', '1111']
If you only need the number of such combinations, and not the list itself - that should just be 2 ** x.count('0')
Well, you will definitely get other answers with a traditional implementations of combinations with fixed indexes, but as we're working with just "0" and "1", you can use next hack:
source = "010100100001100011"
pattern = source.replace("0", "{}")
count = source.count("0")
combinations = [pattern.format(*f"{i:0{count}b}") for i in range(1 << count)]
Basically, we count amount of zeros in source, then iteration over range where limit is number with this amount of set bits and unpack every number in binary form into a pattern.
It should be slightly faster if we predefine pattern for binary transformation too:
source = "010100100001100011"
pattern = source.replace("0", "{}")
count = source.count("0")
fmt = f"{{:0{count}b}}"
result = [pattern.format(*fmt.format(i)) for i in range(1 << count)]
Upd. It's not clear do you need to generate all possible combinations or just get number, so originally I provided code to generate them, but if you will look closely in my method I'm getting number of all possible combinations using 1 << count, where count is amount of '0' chars in source string. So if you need just number, code is next:
source = "010100100001100011"
number_of_combinations = 1 << source.count("0")
Alternatively, you can also use 2 ** source.count("0"), but generally power is much more slower than binary shift, so I'd recommend to use option I originally advised.
We also can use recursive solution for this problem, we iterate over string and if saw a "0" change it to "1" and begin another branch on this new string:
s = "010100100001100011"
def perm(s, i=0, result=[]):
if i < len(s):
if s[i] == "0":
t = s[:i]+"1"+s[i+1:]
result.append(t)
perm(t, i+1, result)
perm(s, i+1, result)
res = [s]
perm(s, 0, res)
print(res)
For each position in the string that has a zero, you can either replace it with a 1 or not. This creates the combinations. So you can progressively build the resulting list of strings by adding the replacements of each '0' position with a '1' based on the previous replacement results:
def zeroTo1(S):
result = [S] # start with no replacement
for i,b in enumerate(S):
if b != '0': continue # only for '0' positions
result += [r[:i]+'1'+r[i+1:] for r in result] # add replacements
return result
print(zeroTo1('010'))
['010', '110', '011', '111']
If you're allowed to use libraries, the product function from itertools can be used to combine the zero replacements directly for you:
from itertools import product
def zeroTo1(S):
return [*map("".join,product(*("01"[int(b):] for b in S)))]
The tuples of 1s and 0s generated by the product function are assembled into individual strings by mapping the string join function onto its output.
Based on your objective you can do this to obtain the expected results.
A = '0111011110000'
B = '010'
C = '10000010000001101'
my_list = [A, B, C]
new_list = []
for key, number in enumerate(my_list):
for key_item, num in enumerate(number):
item_list = [i for i in number]
item_list[key_item] = "1"
new_list.append(''.join(item_list))
print(len(new_list))
print(new_list)

Count of sub-strings that contain character X at least once. E.g Input: str = “abcd”, X = ‘b’ Output: 6

This question was asked in an exam but my code (given below) passed just 2 cases out of 7 cases.
Input Format : single line input seperated by comma
Input: str = “abcd,b”
Output: 6
“ab”, “abc”, “abcd”, “b”, “bc” and “bcd” are the required sub-strings.
def slicing(s, k, n):
loop_value = n - k + 1
res = []
for i in range(loop_value):
res.append(s[i: i + k])
return res
x, y = input().split(',')
n = len(x)
res1 = []
for i in range(1, n + 1):
res1 += slicing(x, i, n)
count = 0
for ele in res1:
if y in ele:
count += 1
print(count)
When the target string (ts) is found in the string S, you can compute the number of substrings containing that instance by multiplying the number of characters before the target by the number of characters after the target (plus one on each side).
This will cover all substrings that contain this instance of the target string leaving only the "after" part to analyse further, which you can do recursively.
def countsubs(S,ts):
if ts not in S: return 0 # shorter or no match
before,after = S.split(ts,1) # split on target
result = (len(before)+1)*(len(after)+1) # count for this instance
return result + countsubs(ts[1:]+after,ts) # recurse with right side
print(countsubs("abcd","b")) # 6
This will work for single character and multi-character targets and will run much faster than checking all combinations of substrings one by one.
Here is a simple solution without recursion:
def my_function(s):
l, target = s.split(',')
result = []
for i in range(len(l)):
for j in range(i+1, len(l)+1):
ss = l[i] + l[i+1:j]
if target in ss:
result.append(ss)
return f'count = {len(result)}, substrings = {result}'
print(my_function("abcd,b"))
#count = 6, substrings = ['ab', 'abc', 'abcd', 'b', 'bc', 'bcd']
Here you go, this should help
from itertools import combinations
output = []
initial = input('Enter string and needed letter seperated by commas: ') #Asking for input
list1 = initial.split(',') #splitting the input into two parts i.e the actual text and the letter we want common in output
text = list1[0]
final = [''.join(l) for i in range(len(text)) for l in combinations(text, i+1)] #this is the core part of our code, from this statement we get all the available combinations of the set of letters (all the way from 1 letter combinations to nth letter)
for i in final:
if 'b' in i:
output.append(i) #only outputting the results which have the required letter/phrase in it

Replacing a character in a string with a set of two possible characters

a = ["0$%","0%%%","0$%$%","0$$"]
The above is a corrupted communication code where the first element of each sequence has been disguised as 0. I want to recover the original and correct code by computing a list of all possible sequences by replacing 0 with either $ or % and then checking which of the sequences is valid. Think of each sequence as corresponding to an alphabet if correct. For instance, "$$$" could correspond to the alphabet "B".
This is what I've done so far
raw_decoded = []
word = []
for i in a:
for j in i:
if j == "0":
x = list(itertools.product(["$", "%"], *i[1:]))
y = ("".join(i) for i in x)
for i in y:
raw_decoded.append(i)
for i in raw_decoded:
letter = code_dict[i] #access dictionary for converting to alphabet
word.append(letter)
return word
Try that:
output = []
for elem in a:
replaced_dollar = elem.replace('0', '$', 1)
replaced_percent = elem.replace('0', '%', 1)
# check replaced_dollar and replaced_percent
# and then write to output
output.append(replaced_...)
Not sure what you mean, perhaps you could add a desired output. What I got from your question could be solved in the following way:
b = []
for el in a:
if el[0] == '0':
b.append(el.replace('0', '%', 1))
b.append(el.replace('0', '$', 1))
else:
b.append(el)

Python string split join 4

import re
string = "is2 Thi1s T4est 3a"
def order(sentence):
res = ''
count = 1
list = sentence.split()
for i in list:
for i in list:
a = re.findall('\d+', i)
if a == [str(count)]:
res += " ".join(i)
count += 1
print(res)
order(string)
Above there is a code which I have problem with. Output which I should get is:
"Thi1s is2 3a T4est"
Instead I'm getting the correct order but with spaces in the wrong places:
"T h i 1 si s 23 aT 4 e s t"
Any idea how to make it work with this code concept?
You are joining the characters of each word:
>>> " ".join('Thi1s')
'T h i 1 s'
You want to collect your words into a list and join that instead:
def order(sentence):
number_words = []
count = 1
words = sentence.split()
for word in words:
for word in words:
matches = re.findall('\d+', word)
if matches == [str(count)]:
number_words.append(word)
count += 1
result = ' '.join(number_words)
print(result)
I used more verbose and clear variable names. I also removed the list variable; don't use list as a variable name if you can avoid it, as that masks the built-in list name.
What you implemented comes down to a O(N^2) (quadratic time) sort. You could instead use the built-in sort() function to bring this to O(NlogN); you'd extract the digit and sort on its integer value:
def order(sentence):
digit = re.compile(r'\d+')
return ' '.join(
sorted(sentence.split(),
key=lambda w: int(digit.search(w).group())))
This differs a little from your version in that it'll only look at the first (consecutive) digits, it doesn't care about the numbers being sequential, and will break for words without digits. It also uses a return to give the result to the caller rather than print. Just use print(order(string)) to print the return value.
If you assume the words are numbered consecutively starting at 1, then you can sort them in O(N) time even:
def order(sentence):
digit = re.compile(r'\d+')
words = sentence.split()
result = [None] * len(words)
for word in words:
index = int(digit.search(word).group())
result[index - 1] = word
return ' '.join(result)
This works by creating a list of the same length, then using the digits from each word to put the word into the correct index (minus 1, as Python lists start at 0, not 1).
I think the bug is simply in the misuse of join(). You want to concatenate the current sorted string. i is simply a token, hence simply add it to the end of the string. Code untested.
import re
string = "is2 Thi1s T4est 3a"
def order(sentence):
res = ''
count = 1
list = sentence.split()
for i in list:
for i in list:
a = re.findall('\d+', i)
if a == [str(count)]:
res = res + " " + i # your bug here
count += 1
print(res)
order(string)

The Alphabet and Recursion

I'm almost done with my program, but I've made a subtle mistake. My program is supposed to take a word, and by changing one letter at a time, is eventually supposed to reach a target word, in the specified number of steps. I had been trying at first to look for similarities, for example: if the word was find, and the target word lose, here's how my program would output in 4 steps:
['find','fine','line','lone','lose]
Which is actually the output I wanted. But if you consider a tougher set of words, like Java and work, the output is supposed to be in 6 steps.
['java', 'lava', 'lave', 'wave', 'wove', 'wore', 'work']
So my mistake is that I didn't realize you could get to the target word, by using letters that don't exist in the target word or original word.
Here's my Original Code:
import string
def changeling(word,target,steps):
alpha=string.ascii_lowercase
x=word##word and target has been changed to keep the coding readable.
z=target
if steps==0 and word!= target:##if the target can't be reached, return nothing.
return []
if x==z:##if target has been reached.
return [z]
if len(word)!=len(target):##if the word and target word aren't the same length print error.
print "error"
return None
i=1
if lookup
if lookup(z[0]+x[1:]) is True and z[0]+x[1:]!=x :##check every letter that could be from z, in variations of, and check if they're in the dictionary.
word=z[0]+x[1:]
while i!=len(x):
if lookup(x[:i-1]+z[i-1]+x[i:]) and x[:i-1]+z[i-1]+x[i:]!=x:
word=x[:i-1]+z[i-1]+x[i:]
i+=1
if lookup(x[:len(x)-1]+z[len(word)-1]) and x[:len(x)-1]+z[len(x)-1]!=x :##same applies here.
word=x[:len(x)-1]+z[len(word)-1]
y = changeling(word,target,steps-1)
if y :
return [x] + y##used to concatenate the first word to the final list, and if the list goes past the amount of steps.
else:
return None
Here's my current code:
import string
def changeling(word,target,steps):
alpha=string.ascii_lowercase
x=word##word and target has been changed to keep the coding readable.
z=target
if steps==0 and word!= target:##if the target can't be reached, return nothing.
return []
if x==z:##if target has been reached.
return [z]
holderlist=[]
if len(word)!=len(target):##if the word and target word aren't the same length print error.
print "error"
return None
i=1
for items in alpha:
i=1
while i!=len(x):
if lookup(x[:i-1]+items+x[i:]) is True and x[:i-1]+items+x[i:]!=x:
word =x[:i-1]+items+x[i:]
holderlist.append(word)
i+=1
if lookup(x[:len(x)-1]+items) is True and x[:len(x)-1]+items!=x:
word=x[:len(x)-1]+items
holderlist.append(word)
y = changeling(word,target,steps-1)
if y :
return [x] + y##used to concatenate the first word to the final list, and if the/
list goes past the amount of steps.
else:
return None
The differences between the two is that the first checks every variation of find with the letters from lose. Meaning: lind, fond, fisd, and fine. Then, if it finds a working word with the lookup function, it calls changeling on that newfound word.
As opposed to my new program, which checks every variation of find with every single letter in the alphabet.
I can't seem to get this code to work. I've tested it by simply printing what the results are of find:
for items in alpha:
i=1
while i!=len(x):
print (x[:i-1]+items+x[i:])
i+=1
print (x[:len(x)-1]+items)
This gives:
aind
fand
fiad
fina
bind
fbnd
fibd
finb
cind
fcnd
ficd
finc
dind
fdnd
fidd
find
eind
fend
fied
fine
find
ffnd
fifd
finf
gind
fgnd
figd
fing
hind
fhnd
fihd
finh
iind
find
fiid
fini
jind
fjnd
fijd
finj
kind
fknd
fikd
fink
lind
flnd
fild
finl
mind
fmnd
fimd
finm
nind
fnnd
find
finn
oind
fond
fiod
fino
pind
fpnd
fipd
finp
qind
fqnd
fiqd
finq
rind
frnd
fird
finr
sind
fsnd
fisd
fins
tind
ftnd
fitd
fint
uind
fund
fiud
finu
vind
fvnd
fivd
finv
wind
fwnd
fiwd
finw
xind
fxnd
fixd
finx
yind
fynd
fiyd
finy
zind
fznd
fizd
finz
Which is perfect! Notice that each letter in the alphabet goes through my word at least once. Now, what my program does is use a helper function to determine if that word is in a dictionary that I've been given.
Consider this, instead of like my first program, I now receive multiple words that are legal, except when I do word=foundword it means I'm replacing the previous word each time. Which is why I'm trying holderlist.append(word).
I think my problem is that I need changeling to run through each word in holderlist, and I'm not sure how to do that. Although that's only speculation.
Any help would be appreciated,
Cheers.
I might be slightly confused about what you need, but by borrowing from this post I belive I have some code that should be helpful.
>>> alphabet = 'abcdefghijklmnopqrstuvwxyz'
>>> word = 'java'
>>> splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
>>> splits
[('', 'java'), ('j', 'ava'), ('ja', 'va'), ('jav', 'a'), ('java', '')]
>>> replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
>>> replaces
['aava', 'bava', 'cava', 'dava', 'eava', 'fava', 'gava', 'hava', 'iava', 'java', 'kava', 'lava', 'mava', 'nava', 'oava', 'pava', 'qava', 'rava', 'sava', 'tava', 'uava', 'vava', 'wav
a', 'xava', 'yava', 'zava', 'java', 'jbva', 'jcva', 'jdva', 'jeva', 'jfva', 'jgva', 'jhva', 'jiva', 'jjva', 'jkva', 'jlva', 'jmva', 'jnva', 'jova', 'jpva', 'jqva', 'jrva', 'jsva', '
jtva', 'juva', 'jvva', 'jwva', 'jxva', 'jyva', 'jzva', 'jaaa', 'jaba', 'jaca', 'jada', 'jaea', 'jafa', 'jaga', 'jaha', 'jaia', 'jaja', 'jaka', 'jala', 'jama', 'jana', 'jaoa', 'japa'
, 'jaqa', 'jara', 'jasa', 'jata', 'jaua', 'java', 'jawa', 'jaxa', 'jaya', 'jaza', 'java', 'javb', 'javc', 'javd', 'jave', 'javf', 'javg', 'javh', 'javi', 'javj', 'javk', 'javl', 'ja
vm', 'javn', 'javo', 'javp', 'javq', 'javr', 'javs', 'javt', 'javu', 'javv', 'javw', 'javx', 'javy', 'javz']
Once you have a list of all possible replaces, you can simply do
valid_words = [valid for valid in replaces if lookup(valid)]
Which should give you all words that can be formed by replacing 1 character in word. By placing this code in a separate method, you could take a word, obtain possible next words from that current word, and recurse over each of those words. For example:
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def next_word(word):
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
return [valid for valid in replaces if lookup(valid)]
Is this enough help? I think your code could really benefit by separating tasks into smaller chunks.
Fixed your code:
import string
def changeling(word, target, steps):
alpha=string.ascii_lowercase
x = word #word and target has been changed to keep the coding readable.
z = target
if steps == 0 and word != target: #if the target can't be reached, return nothing.
return []
if x == z: #if target has been reached.
return [z]
holderlist = []
if len(word) != len(target): #if the word and target word aren't the same length print error.
raise BaseException("Starting word and target word not the same length: %d and %d" % (len(word),
i = 1
for items in alpha:
i=1
while i != len(x):
if lookup(x[:i-1] + items + x[i:]) is True and x[:i-1] + items + x[i:] != x:
word = x[:i-1] + items + x[i:]
holderlist.append(word)
i += 1
if lookup(x[:len(x)-1] + items) is True and x[:len(x)-1] + items != x:
word = x[:len(x)-1] + items
holderlist.append(word)
y = [changeling(pos_word, target, steps-1) for pos_word in holderlist]
if y:
return [x] + y #used to concatenate the first word to the final list, and if the list goes past the amount of steps.
else:
return None
Where len(word) and len(target), it'd be better to raise an exception than print something obscure, w/o a stack trace and non-fatal.
Oh and backslashes(\), not forward slashes(/), are used to continue lines. And they don't work on comments

Categories