Find all matches of permutations within allotted time - python

I'm writing a program that takes 9 characters, creates all possible permutations, loads a dictionary file for each character, and builds a set of all possible words. What I need to do is compare all permutations to the words and return the matches.
import os, itertools
def parsed(choices):
    mySet = set()
    location = os.getcwd()
    for item in choices:
        filename = location + "\\dicts\\%s.txt" % (item)
        mySet.update(open(filename).read().splitlines())
    return mySet
def permutations(input):
    possibilities = []
    pospos = []
    for x in range(3,9):
        pospos.append([''.join(i) for i in itertools.permutations(input, x)])
    for pos in pospos:
        for i in pos:
            possibilities.append(i)
    return possibilities
The problematic function is this one:
def return_matches():
    matches = []
    words = parsed(['s','m','o','k','e', 'j', 'a', 'c', 'k'])
    pos = permutations(['s','m','o','k','e', 'j', 'a', 'c', 'k'])
    for item in pos:
        if item in words:
            matches.append(item)
    return matches
This code should return:
matches = ['a', 'om', 'ja', 'jo', ..., 'jacks', 'cokes', 'kecks', 'jokes', 'cakes', 'smoke', 'comes', 'makes', 'cameos']
When I get this code to work properly, it takes 10 to 15 minutes to complete. Every attempt at making it run within the allotted time either only works with 5 or fewer characters or returns the wrong result.
So my question is how to optimize this code so that it returns the right result within 30 seconds.
Edit
http://www.mso.anu.edu.au/~ralph/OPTED/v003 this is the website I'm scraping the dictionary files from.

It wastes RAM and time storing all the permutations in a list before you test if they're valid. Instead, test the permutations as you generate them, and save the valid ones into a set to eliminate duplicates.
Duplicates are possible because of the way itertools.permutations works:
Elements are treated as unique based on their position, not on their
value. So if the input elements are unique, there will be no repeat
values in each permutation.
Your input word "SMOKEJACK" contains 2 Ks, so every permutation containing K gets generated twice.
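A small illustration of that position-based behaviour, using a made-up three-letter input ('ABA') rather than your nine letters. The variable names are only for this sketch:

```python
from itertools import permutations

# 'ABA' contains two As, so swapping them produces the same string twice.
letters = 'ABA'
all_perms = [''.join(p) for p in permutations(letters, 2)]
unique_perms = set(all_perms)

print(all_perms)             # six strings, some of them repeated
print(sorted(unique_perms))  # only the distinct strings remain
```

Collecting results in a set, as the code below does, removes the repeats without any extra bookkeeping.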
Anyway, here's some code that uses the SOWPODS Scrabble word list for English.
from itertools import permutations
# Get all the words from the SOWPODS file
all_words = set('AI')
fname = 'scrabble_wordlist_sowpods.txt'
with open(fname) as f:
    all_words.update(f.read().splitlines())
print(len(all_words))
choices = 'SMOKEJACK'
# Generate all permutations of `choices` from length 3 to 8
# and save them in a set to eliminate duplicates.
matches = set()
for n in range(3, 9):
    for t in permutations(choices, n):
        s = ''.join(t)
        if s in all_words:
            matches.add(s)
for i, s in enumerate(sorted(matches)):
    print('{:3} {}'.format(i, s))
output
216555
0 ACE
1 ACES
2 ACME
3 ACMES
4 AESC
5 AKE
6 AKES
7 AMOK
8 AMOKS
9 ASK
10 CAKE
11 CAKES
12 CAM
13 CAME
14 CAMEO
15 CAMEOS
16 CAMES
17 CAMS
18 CASE
19 CASK
20 CEAS
21 COKE
22 COKES
23 COMA
24 COMAE
25 COMAKE
26 COMAKES
27 COMAS
28 COME
29 COMES
30 COMS
31 COS
32 COSE
33 COSMEA
34 EAS
35 EKKA
36 EKKAS
37 EMS
38 JACK
39 JACKS
40 JAK
41 JAKE
42 JAKES
43 JAKS
44 JAM
45 JAMES
46 JAMS
47 JOCK
48 JOCKS
49 JOE
50 JOES
51 JOKE
52 JOKES
53 KAE
54 KAES
55 KAM
56 KAME
57 KAMES
58 KAS
59 KEA
60 KEAS
61 KECK
62 KECKS
63 KEKS
64 KOA
65 KOAS
66 KOS
67 MAC
68 MACE
69 MACES
70 MACK
71 MACKS
72 MACS
73 MAE
74 MAES
75 MAK
76 MAKE
77 MAKES
78 MAKO
79 MAKOS
80 MAKS
81 MAS
82 MASE
83 MASK
84 MES
85 MESA
86 MOA
87 MOAS
88 MOC
89 MOCK
90 MOCKS
91 MOCS
92 MOE
93 MOES
94 MOKE
95 MOKES
96 MOS
97 MOSE
98 MOSK
99 OAK
100 OAKS
101 OCA
102 OCAS
103 OES
104 OKA
105 OKAS
106 OKE
107 OKES
108 OMS
109 OSE
110 SAC
111 SACK
112 SAE
113 SAKE
114 SAM
115 SAME
116 SAMEK
117 SCAM
118 SEA
119 SEAM
120 SEC
121 SECO
122 SKA
123 SKEO
124 SMA
125 SMACK
126 SMOCK
127 SMOKE
128 SOAK
129 SOC
130 SOCA
131 SOCK
132 SOJA
133 SOKE
134 SOMA
135 SOME
This code runs in around 2.5 seconds on my rather ancient 32 bit 2GHz machine running Python 3.6.0 on Linux. It's slightly faster on Python 2 (since Python2 strings are ASCII, not Unicode).

Instead of generating all the permutations of your letters, you should use a Prefix Tree, or Trie, to keep track of all the prefixes to valid words.
def make_trie(words):
    res = {}
    for word in words:
        d = res
        for c in word:
            d = d.setdefault(c, {})
        d["."] = None
    return res
We are using d["."] = None here to signify where a prefix actually becomes a valid word. Creating the tree can take a few seconds, but you only have to do this once.
Now, we can go through our letters in a recursive function, checking for each letter whether it contributes to a valid prefix in the current stage of the recursion: (That rest = letters[:i] + letters[i+1:] part is not very efficient, but as we will see it does not matter much.)
def find_words(trie, letters, prefix=""):
    if "." in trie: # found a full valid word
        yield prefix
    for i, c in enumerate(letters):
        if c in trie: # contributes to valid prefix
            rest = letters[:i] + letters[i+1:]
            for res in find_words(trie[c], rest, prefix + c):
                yield res # all words starting with that prefix
Minimal example:
>>> trie = make_trie(["cat", "cats", "act", "car", "carts", "cash"])
>>> trie
{'a': {'c': {'t': {'.': None}}}, 'c': {'a': {'r': {'t': {'s':
{'.': None}}, '.': None}, 's': {'h': {'.': None}}, 't':
{'s': {'.': None}, '.': None}}}}
>>> set(find_words(trie, "acst"))
{'cat', 'act', 'cats'}
Or with your 9 letters and the words from sowpods.txt:
with open("sowpods.txt") as words:
    trie = make_trie(map(str.strip, words)) # ~1.3 s on my system, only once
res = set(find_words(trie, "SMOKEJACK")) # ~2 ms on my system
You have to pipe the result through a set because you have duplicate letters. This yields 153 words, after a total of 623 recursive calls to find_words (measured with a counter variable). Compare that to 216,555 words in the sowpods.txt file and a total of 986,409 permutations of all the 1-9 letter combinations that could make up a valid word. Thus, once the trie is initially generated, res = set(find_words(...)) takes only a few milliseconds.
You could also change the find_words function to use a mutable dictionary of letter counts instead of a string or list of letters. This way, no duplicates are generated and the function is called fewer times, but the overall running time does not change much.
def find_words(trie, letters, prefix=""):
    if "." in trie:
        yield prefix
    for c in letters:
        if letters[c] and c in trie:
            letters[c] -= 1
            for res in find_words(trie[c], letters, prefix + c):
                yield res
            letters[c] += 1
Then call it like this: find_words(trie, collections.Counter("SMOKEJACK"))
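Putting the pieces together, here is a self-contained sketch of the Counter-based variant run against the small word list from the minimal example. It uses `yield from`, so it assumes Python 3.3 or later; the local names are only for illustration:

```python
from collections import Counter

def make_trie(words):
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node["."] = None  # marks the end of a complete word
    return trie

def find_words(trie, letters, prefix=""):
    if "." in trie:
        yield prefix
    for ch in letters:
        if letters[ch] and ch in trie:
            letters[ch] -= 1   # use up one copy of this letter
            yield from find_words(trie[ch], letters, prefix + ch)
            letters[ch] += 1   # put it back before trying the next letter

trie = make_trie(["cat", "cats", "act", "car", "carts", "cash"])
found = list(find_words(trie, Counter("acst")))
print(sorted(found))
```

Only the counts change while the Counter is being iterated, never its keys, so mutating it inside the loop is safe here.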

Related

How can I put text efficient in sub lists inside a list? (Python)

I wrote some code to calculate the maximum path sum of a triangle. This is the triangle:
75
95 64
17 47 82
18 35 87 10
20 04 82 47 65
So the maximum path sum of this triangle is: 75+95+82+87+82 = 418
This is my code to calculate it:
lst = [[72],
       [95,64],
       [17,47,82],
       [18,35,87,10],
       [20,4,82,47,65]]
something = 1
i = 0
mid = 0
while something != 0:
    for x in lst:
        new = max(lst[i])
        print(new)
        i += 1
        mid += new
    something = 0
print(mid)
As you can see, I put every row of the triangle in a list and put those lists in a (head) list. This is not a lot of numbers, but what if I have a bigger triangle? Doing it manually is a lot of work. So my question is: how can I efficiently put the numbers from the triangle into sublists inside a head list?
If you have input starting with a line containing the number of rows in the triangle, followed by all the numbers on that many rows, read the first number to get the limit in a range(). Then use a list comprehension to create the list of sublists.
rows = int(input())
lst = [list(map(int, input().split())) for _ in range(rows)]
For instance, to read your sample triangle, the input would be:
5
75
95 64
17 47 82
18 35 87 10
20 04 82 47 65
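If the input has no leading row count (for example, the triangle pasted into a text file as-is), a sketch along these lines should also work. The `triangle_text` variable is only a stand-in for whatever `open(...).read()` would return:

```python
# Stand-in for the contents of a file holding the triangle.
triangle_text = """75
95 64
17 47 82
18 35 87 10
20 04 82 47 65
"""

# One sublist per non-empty line; int() accepts zero-padded tokens such as "04".
lst = [[int(token) for token in line.split()]
       for line in triangle_text.splitlines()
       if line.strip()]

print(lst)
```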

Remove numbers and user's stop words from pandas data frame

I would like to know how to remove some variables from a dataset, specifically numbers and list of strings. For example.
     Text      Num
0    bam       132
1    -         65
2    creation  47
3    MAN       32
4    41        831
...  ...       ...
460  Luchino   21
461  42 4126   7
462  finger    43
463  washing   1
I would like to have something like
     Text      Num
0    bam       132
2    creation  47
...  ...       ...
460  Luchino   21
462  finger    43
463  washing   1
where I removed (manually) MAN (it should be included in a list of strings, like a stop word), -, and numbers.
I have tried with isdigit but it is not working so I am sure that there are errors in my code:
df['Text'].where(~df['Text'].str.isdigit())
and for my stop words:
my_stop=['MAN','-']
df['Text'].apply(lambda lst: [x for x in lst if x in my_stop])
If you want to filter you could use .loc
df = df.loc[~df.Text.str.isdigit() & ~df.Text.isin(['MAN']), :]
.where(cond, other) returns a dataframe or series of the same shape as self, but keeps the original values where cond is true and replaces with other where it is false.
Read more in the docs
You should try this code:
df[df['Text']!='MAN']
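Neither snippet above removes '-' or an entry such as '42 4126', since `isdigit()` is false for a string containing a space. A possible sketch that covers the stop words and the space-separated numbers together follows. The sample frame is rebuilt by hand from the rows shown in the question, and the regular expression is an assumption about what should count as a "number":

```python
import pandas as pd

# Hand-built sample based on the rows shown in the question.
df = pd.DataFrame({
    "Text": ["bam", "-", "creation", "MAN", "41", "Luchino", "42 4126", "finger", "washing"],
    "Num": [132, 65, 47, 32, 831, 21, 7, 43, 1],
})
my_stop = ["MAN", "-"]

# True where Text is made up only of digits and whitespace, e.g. "41" or "42 4126".
is_number = df["Text"].str.match(r"^[\d\s]+$")
# True where Text is one of the user's stop words.
is_stop = df["Text"].isin(my_stop)

filtered = df.loc[~is_number & ~is_stop]
print(filtered)
```

If the column can contain missing values, they would need to be handled first (for example with `fillna('')`), because the string match returns NaN for them.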

How to do kernel function on an array of strings in Numba cuda?

I have an array of strings that I read from a file, and I want to compare each line of the file to a specific string. The file is large (about 200 MB of lines).
I have followed this tutorial https://nyu-cds.github.io/python-numba/05-cuda/ but it doesn't show exactly how to deal with an array of strings/characters.
import numpy as np
from numba import cuda
@cuda.jit
def my_kernel(io_array):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    bw = cuda.blockDim.x
    pos = tx + ty * bw
    if pos < io_array.size: # Check array boundaries
        io_array[pos] # i want here to compare each line of the string array to a specific line
def main():
    a = open("test.txt", 'r') # open file in read mode
    print("the file contains:")
    data = country = np.array(a.read())
    # Set the number of threads in a block
    threadsperblock = 32
    # Calculate the number of thread blocks in the grid
    blockspergrid = (data.size + (threadsperblock - 1)) // threadsperblock
    # Now start the kernel
    my_kernel[blockspergrid, threadsperblock](data)
    # Print the result
    print(data)
if __name__ == '__main__':
    main()
I have two problems.
First: how do I pass the sentence (string) that I want to compare each line of my file against to the kernel function (in the io_array, without affecting the threads' computation)?
Second: how do I deal with a string array? I get this error when I run the above code:
this error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: typing of intrinsic-call at test2.py (18)
File "test2.py", line 18:
def my_kernel(io_array):
<source elided>
if pos < io_array.size: # Check array boundaries
io_array[pos] # do the computation
P.S. I'm new to CUDA and have just started learning it.
First of all this:
data = country = np.array(a.read())
doesn't do what you think it does. It does not yield a numpy array that you can index like this:
io_array[pos]
If you don't believe me, just try that in ordinary python code with something like:
print(data[0])
and you'll get an error. If you want help with that, just ask your question on the python or numpy tag.
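To make that concrete, here is a small CPU-only check that needs no GPU. The `text` value is a stand-in for what `a.read()` returns:

```python
import numpy as np

# Stand-in for a.read(): the whole file as one Python string.
text = "first line\nsecond line\n"
data = np.array(text)

print(data.shape)  # an empty tuple: a zero-dimensional array holding one string
print(data.size)   # a single element, not one element per line or per character

try:
    data[0]
except IndexError as exc:
    print("indexing failed:", exc)
```

Because the array has no dimensions, the grid calculation based on `data.size` would launch work for only one element anyway.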
So we need a different method to load the string data from disk. For simplicity, I choose to use numpy.fromfile(). This method will require that all lines in your file are of the same width. I like that concept. There's more information you would have to describe if you want to handle lines of varying lengths.
If we start out that way, we can load the data as an array of bytes, and use that:
$ cat test.txt
the quick brown fox.............
jumped over the lazy dog........
repeatedly......................
$ cat t43.py
import numpy as np
from numba import cuda
@cuda.jit
def my_kernel(str_array, check_str, length, lines, result):
    col,line = cuda.grid(2)
    pos = (line*(length+1))+col
    if col < length and line < lines: # Check array boundaries
        if str_array[pos] != check_str[col]:
            result[line] = 0
def main():
    a = np.fromfile("test.txt", dtype=np.byte)
    print("the file contains:")
    print(a)
    print("array length is:")
    print(a.shape[0])
    print("the check string is:")
    b = a[33:65]
    print(b)
    i = 0
    while a[i] != 10: # 10 is the newline byte
        i=i+1
    line_length = i
    print("line length is:")
    print(line_length)
    print("number of lines is:")
    # integer division so the counts stay ints on Python 3 as well
    line_count = a.shape[0]//(line_length+1)
    print(line_count)
    res = np.ones(line_count)
    # Set the number of threads in a block
    threadsperblock = (32,32)
    # Calculate the number of thread blocks in the grid
    blocks_x = (line_length//32)+1
    blocks_y = (line_count//32)+1
    blockspergrid = (blocks_x,blocks_y)
    # Now start the kernel
    my_kernel[blockspergrid, threadsperblock](a, b, line_length, line_count, res)
    # Print the result
    print("matching lines (match = 1):")
    print(res)
if __name__ == '__main__':
    main()
$ python t43.py
the file contains:
[116 104 101 32 113 117 105 99 107 32 98 114 111 119 110 32 102 111
120 46 46 46 46 46 46 46 46 46 46 46 46 46 10 106 117 109
112 101 100 32 111 118 101 114 32 116 104 101 32 108 97 122 121 32
100 111 103 46 46 46 46 46 46 46 46 10 114 101 112 101 97 116
101 100 108 121 46 46 46 46 46 46 46 46 46 46 46 46 46 46
46 46 46 46 46 46 46 46 10]
array length is:
99
the check string is:
[106 117 109 112 101 100 32 111 118 101 114 32 116 104 101 32 108 97
122 121 32 100 111 103 46 46 46 46 46 46 46 46]
line length is:
32
number of lines is:
3
matching lines (match = 1):
[ 0. 1. 0.]
$

Project Euler Problem #18 Python - getting the wrong result. Why?

I've been working through the Project Euler problems as an exercise to learn Python for the last few days after work, and am now at Problem 18.
I looked at the problem, and thought it could be solved by using Dijkstra's algorithm, with the values of the nodes as negative integers, so as to find the "longest" path.
My solution seems to be almost correct (I get 1068) - which is to say wrong. It prints a path, but from what I can tell, it's not the right one. But having looked at it for some time, I can't tell why.
Perhaps this problem cannot be solved by my approach, and I need some other approach, like dynamic programming - or maybe my implementation of Dijkstra is faulty?
I'm pretty confident the parsing from file to graph is working as intended.
This is the data-set:
75
95 64
17 47 82
18 35 87 10
20 04 82 47 65
19 01 23 75 03 34
88 02 77 73 07 63 67
99 65 04 28 06 16 70 92
41 41 26 56 83 40 80 70 33
41 48 72 33 47 32 37 16 94 29
53 71 44 65 25 43 91 52 97 51 14
70 11 33 28 77 73 17 78 39 68 17 57
91 71 52 38 17 14 91 43 58 50 27 29 48
63 66 04 68 89 53 67 30 73 16 69 87 40 31
04 62 98 27 23 09 70 98 73 93 38 53 60 04 23
This is the code. It is a fully "working example", as long as the path to the file with the content above is correct.
class Graph:
    def __init__(self):
        self.nodes = []
        self.edges = []
    def add_node(self, node):
        self.nodes.append(node)
    def add_edge(self, edge):
        self.edges.append(edge)
    def edges_to_node(self, n):
        edges = [edge for edge in self.edges if edge.node1.id == n.id]
        return edges
class Node:
    def __init__(self, id, value, goal):
        self.id = id
        self.value = value
        self.goal = goal
        self.visited = False
        self.distance = 10000
        self.previous = None
    def __str__(self):
        return "{} - {}".format(self.value, self.goal)
    def __repr__(self):
        return "{} - {}".format(self.value, self.goal)
class Edge:
    def __init__(self, node1, node2):
        self.node1 = node1
        self.node2 = node2
f = open("problem18.data", "r")
content = f.read()
lines = content.split("\n")
data = []
graph = Graph()
index_generator = 1
last_line = len(lines) - 1
for i in range(len(lines)):
    data.append([])
    numbers = lines[i].split()
    for number in numbers:
        goal = i == last_line
        data[-1].append(Node(index_generator, -int(number), goal))
        index_generator += 1
for i in range(len(data)):
    for j in range(len(data[i])):
        node = data[i][j]
        graph.add_node(node)
        if i != last_line:
            node2 = data[i+1][j]
            node3 = data[i+1][j+1]
            edge1 = Edge(node, node2)
            edge2 = Edge(node, node3)
            graph.add_edge(edge1)
            graph.add_edge(edge2)
def dijkstra(graph, start):
    start.distance = 0
    queue = [start]
    while len(queue):
        queue.sort(key=lambda x: x.value, reverse=True)
        current = queue.pop()
        current.visited = True
        if current.goal:
            return reconstrcut_path(start, current)
        edges = graph.edges_to_node(current)
        for edge in edges:
            neighbour = edge.node2
            if neighbour.visited:
                continue
            queue.append(neighbour)
            new_distance = current.distance + neighbour.value
            if new_distance < neighbour.distance:
                neighbour.distance = new_distance
                neighbour.previous = current
    return []
def reconstrcut_path(start, n):
    path = []
    current = n
    while current.id is not start.id:
        path.append(current)
        current = current.previous
    path.append(start)
    return path
path = dijkstra(graph, graph.nodes[0])
tally = 0
for node in path:
    number = max(node.value, -node.value)
    print(number)
    tally += number
print(tally)
Can you help me troubleshoot what is wrong with this solution?
EDIT: The console output of the run:
98
67
91
73
43
47
83
28
73
75
82
87
82
64
75
1068
Actually, dynamic programming will knock this off neatly. My solution for this and problem 67 is less than 20 lines.
The focus here is very much a Dijkstra approach: work your way down the triangle, maintaining the maximum path cost at each node. Row 1 is trivial:
75
Row 2 is similarly trivial, as both values are ends: each has only one possible path:
95+75 64+75
which evaluates to
170 139
Row 3 has two ends, but the middle value gives us the critical logic: keep the larger of the two paths:
17+170 47+max(170, 139) 82+139
187 217 221
Row 4 has two middles ... just keep going with the process:
18+187 35+max(187, 217) 87+max(217, 221) 10+221
205 252 308 231
Can you take it from here?
As a check for you, the correct answer is quite close to the one you originally got.
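The row-by-row process above can be sketched as follows. The function name `max_path_totals` is made up for this illustration, and it is run only on the four rows worked through by hand, so the rest is still left to you:

```python
def max_path_totals(rows):
    """Return, for each cell of the last row, the best path sum reaching it from the top."""
    best = list(rows[0])
    for row in rows[1:]:
        new_best = []
        for j, value in enumerate(row):
            # A cell at index j can be reached from index j-1 or j in the row above.
            candidates = []
            if j > 0:
                candidates.append(best[j - 1])
            if j < len(best):
                candidates.append(best[j])
            new_best.append(value + max(candidates))
        best = new_best
    return best

top_rows = [[75], [95, 64], [17, 47, 82], [18, 35, 87, 10]]
print(max_path_totals(top_rows))  # [205, 252, 308, 231], matching row 4 above
```

Taking `max()` of the list returned for the full triangle gives the answer to the problem.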
Your solution fails because you didn't apply Dijkstra's algorithm. That requires that you maintain the best path to each node you've reached in your search. Instead, you used a row-by-row greedy algorithm: you kept only the best path so far in the entire pass.
Specifically, when you found the 98 near the right side of the bottom row, you forced an assumption that it was part of the optimum path. You continued this, row by row. The data set is configured specifically to make this approach fail. The best path starts with the 93 + 73 + 58 sequence.
You have to keep all paths in mind; there's a path that is not the best sum for the bottom couple of rows, but catches up in the middle rows while the "fat" path gets starved with some lower numbers in the middle.
Consider this alternative data set:
01
00 01
00 00 01
00 00 00 01
99 00 00 00 01
At least with negated costs, Dijkstra would explore that path of 1s and the zeroes that are "just off the path", but nothing else. The node that takes another step down that path of 1s is always the best node in the queue, and it ends in a goal node, so the algorithm terminates without exploring the rest of the triangle. It would never even see that there is a 99 hiding in the bottom left corner.

python using data from a list, where by you call the data from its index number

I must open a file, compute the averages of a row and column and then the max of the data sheet. The data is imported from a text file. When I am done with the program, it should look like an excel sheet, only printed on my terminal.
Data file must be seven across by six down.
88 90 94 98 100 110 120
75 77 80 86 94 103 113
80 83 85 94 111 111 121
68 71 76 85 96 122 125
77 84 91 102 105 112 119
81 85 90 96 102 109 134
Later, I will have to print the above data. The math is easy; my problem is selecting a number from the indexed list. Example:
selecting index 0, 8, 16, 24, 32, 40. Which should be numbers 88, 75, 80, 68, 77, 81.
What I get when I input the index number is 0 = 8, 1 = 8, 2 = " ", etc.
What have I done wrong here? I have another program where I typed the list directly into the code, and it works the way I want this one to work. That program used the index numbers to select a month: 0 = a blank index, 1 = January, 2 = February, etc.
I hope this example made clear what I intended to do, but cannot seem to do. Again, the only difference between my months program and this program is that for the below code, I open a file to fill the list. Have I loaded the data poorly? Split and stripped the list poorly? Help is more useful than answers, as I can learn rather than be given the answer.
def main():
    print("Program to output a report of noise for certain model cars.")
    print("Written by censored.")
    print()
    fileName = input("Enter the name of the data file: ")
    infile = open(fileName, "r")
    infileData = infile.read()
    line = infileData
    #for line in infile:
    list = line.split(',')
    list = line.strip("\n")
    print(list)
    n = eval(input("Enter a index number: ", ))
    print("The index is", line[n] + ".")
    print("{0:>38}".format(str("Speed (MPH)")))
    print("{0:>6}".format(str("Car")), ("{0:>3}".format(str(":"))),
          ("{0:>6}".format(str("30"))), ("{0:>4}".format(str("40"))),
          ("{0:>4}".format(str("50"))))
main()
Thank you for your time.
You keep overwriting your variables, and I wouldn't recommend masking a built-in (list).
infileData = infile.read()
line = infileData
#for line in infile:
list = line.split(',')
list = line.strip("\n")
should be:
infileData = list(map(int, infile.read().strip().split()))
This reads the file contents into a string, strips off the leading and trailing whitespace, splits it up into a list separated by whitespace, maps each element as an int, and creates a list out of that.
Or:
stringData = infile.read()
stringData = stringData.strip()
stringData = stringData.split()
infileData = []
for item in stringData:
    infileData.append(int(item))
Storing each element as an integer lets you easily do calculations on it, such as if item > 65 or exceed = item - 65. When you want to treat them as strings, such as for printing, cast them as such with str():
print("The index is", str(infileData[n]) + ".")
Just to be clear, it looks like your data is space-separated, not comma-separated. So when you call,
list = line.split(',')
the list looks like this,
['88 90 94 98 100 110 120 75 77 80 86 94 103 113 80 83 85 94 111 111 121 68 71 76 85 96 122 125 77 84 91 102 105 112 119 81 85 90 96 102 109 134']
So the split on ',' does nothing useful, and because the code then indexes the raw string (line[n]), index 0 gives '8', not '88', and index 2 gives ' ', not '94'.
list = line.split() # this is what you should call (space-separated)
Again this answer is based on how your data is presented.
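Since the end goal is row and column averages, keeping one sublist per line may be more convenient than a single flat list. A rough sketch, assuming the file contents are exactly the six lines shown in the question (`data_text` stands in for `infile.read()`, and the other names are illustrative):

```python
# Stand-in for infile.read().
data_text = """88 90 94 98 100 110 120
75 77 80 86 94 103 113
80 83 85 94 111 111 121
68 71 76 85 96 122 125
77 84 91 102 105 112 119
81 85 90 96 102 109 134
"""

# One list of ints per line of the file.
rows = [[int(token) for token in line.split()]
        for line in data_text.strip().splitlines()]

first_column = [row[0] for row in rows]                      # first number of every line
row_averages = [sum(row) / len(row) for row in rows]         # one average per line
column_averages = [sum(col) / len(col) for col in zip(*rows)]  # zip(*rows) walks the columns
largest = max(max(row) for row in rows)

print(first_column)
print(row_averages)
print(column_averages)
print(largest)
```

With this layout, `rows[r][c]` selects the value in row r and column c directly, so no index arithmetic on a flat list is needed.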
