python check string contains all characters - python

I'm reading in a long list of words, and I made a node for every word in the list. Each node has a 'word' attribute, and the node's index corresponds to the word's position in the list.
I am trying to connect a node to the next node if the next node's word is the previous node's word with exactly one letter added.
I also alphabetically ordered each word per character, so that CAT -> ACT
I want to draw an edge from each unique starting word, to all of the possible chains, so I can see all the possible chains in the list.
For example
A -> AN -> TAN -> RANT
However A --x-> T
This is my attempt
for i in range(0, G.number_of_nodes()-1):
    if (len(G.node[i]['word']) + 1 == len(G.node[i+1]['word'])) and (G.node[i]['word'] in G.node[i+1]['word']):
        print G.node[i]['word'], G.node[i+1]['word']
Gave me this,
DGO DGOS
DGOS DGOSS
I IN
ELLMS ELLMSS
AEPRS AEPRSS
INW DINW
DINW DINWY
This is what the word list and the alphabetically sorted list look like.
Why do I not see IN INW?
Also, AGNRT AGNRST should be in the output but isn't, and I don't understand why; the same goes for a lot of other pairs.
Where do you think I went wrong?

The problem is that you are only comparing words that appear right next to each other in the list, i.e. words i and i+1, e.g. I and IN appear next to each other, as do WIN and WIND, but IN and WIND are far apart. It seems you want to compare all possible words, which requires a more sophisticated algorithm. Here's an idea:
Make a dictionary where the keys are sorted words and the values are lists of actual words, e.g. {"ACT": ["CAT", "ACT", "TAC"], ...}. A collections.defaultdict(list) will be useful for this.
Sort the full input list of words by length. You can use list.sort(key=len) assuming you have just a list of words.
Iterate through the list sorted by length. For each word, go through every subset of length n-1. Something like for i in range(len(word)): process(word[:i] + word[i+1:]). You may want to be careful about duplicates here.
For each subset, sort the subset and look it up in the dictionary. Make a link from every word in the dictionary's value (a list of actual words) to the bigger word.
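A rough sketch of those steps might look like this (the short words list is just a stand-in for your real input):
import collections

# words stands in for your real input list
words = ["A", "AN", "TAN", "RANT", "CAT", "ACT"]

# 1. Map each sorted letter string to the actual words made of those letters
by_letters = collections.defaultdict(list)
for w in words:
    by_letters[''.join(sorted(w))].append(w)

# 2. Work through the words from shortest to longest
edges = []
for w in sorted(words, key=len):
    # 3. Drop one letter at a time to get every length n-1 subset
    seen = set()
    for i in range(len(w)):
        shorter = ''.join(sorted(w[:i] + w[i+1:]))
        if shorter in seen:  # skip duplicate subsets
            continue
        seen.add(shorter)
        # 4. Link every word with those letters to the longer word
        for smaller in by_letters.get(shorter, []):
            edges.append((smaller, w))

print(edges)  # [('A', 'AN'), ('AN', 'TAN'), ('TAN', 'RANT')]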

Looks like a formal languages problem. How do you handle looping nodes?
IN INW is in the list you gave.
AGNRT AGNRST are not in the list because you started out with a single letter, and that letter has to be in the next word. For example, I -> IN works, but IN is not in AGNRT or AGNRST.

You seem to be comparing each node with just one other node, so
"IN" directly follows "I" in your wordlist, but "INW" is not directly after "IN"

You can use the third-party Python library python-levenshtein to calculate the Levenshtein distance, which is the string edit distance. In your case, the only allowed 'edit' is the insertion of a character to get the next string/word in your list, so you will also need to verify that the next word is exactly one character longer than the previous word.
Here is sample code that does this:
import Levenshtein as lvst

if len(word2) - len(word1) == 1 and lvst.distance(word1, word2) == 1:
    print(word1, word2)
You can install python-levenshtein by either apt-get (systemwide) or pip:
sudo apt-get install python-levenshtein
or
sudo apt-get install python3-levenshtein
or
pip install python-levenshtein
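For completeness, a small sketch of how that check could be applied over consecutive words in a list (the words here are only placeholders for your own data):
import Levenshtein as lvst

# placeholder list, assumed to be ordered the way the question describes
words = ["I", "IN", "INW", "DINW", "DINWY"]

for word1, word2 in zip(words, words[1:]):
    # the next word must be exactly one character longer and one insertion away
    if len(word2) - len(word1) == 1 and lvst.distance(word1, word2) == 1:
        print(word1, word2)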

Related

How to search for a set of words in a text file?

I'm writing a project on extracting a semantic orientation from a review stored in a text file.
I have a 400*2 array; each row contains a word and its weight. I want to check which of these words are in the text file, and calculate the weight of the whole content.
My question is -
what is the most efficient way to do it? Should I search for each word separately, for example with a for loop?
Do I get any benefit from storing the content of the text file in a string object?
https://docs.python.org/3.6/library/mmap.html
This may work for you. You can use find
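For example, a minimal sketch of mmap with find (the file name and search term are only placeholders):
import mmap

with open('review.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # find() returns the byte offset of the first match, or -1 if absent
    if mm.find(b'excellent') != -1:
        print('word found')
    mm.close()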
This may be out of the box thinking, but if you don't care for semantic/grammatic connection of the words:
sort all words from the text by length
sort your array by length
Write a for-loop:
Call len() (length) on each word from the text.
Then only check against those words which have the same length.
With some tinkering it might give you a good performance boost instead of the "naive" search.
Also look into search algorithms if you want an additional boost (for example, find the first word of the 400 with 6 letters, then go "down" the list until the first word with 5 letters comes up, then stop).
Alternatively, you could build an index array with the indexes of the first and last of all 5-letter words (and analogously for the other lengths), assuming your words don't change.
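A rough sketch of that length-bucketing idea, where word_weights stands in for the 400x2 array and review.txt for the text file:
from collections import defaultdict

word_weights = [('good', 1.0), ('bad', -1.0), ('excellent', 2.0)]

# bucket the scored words by length, so each word from the text is only
# compared against candidates of the same length
by_length = defaultdict(dict)
for word, weight in word_weights:
    by_length[len(word)][word] = weight

total = 0.0
for w in open('review.txt').read().split():
    total += by_length[len(w)].get(w, 0.0)
print(total)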

Memory error while solving an anagram

I am trying to solve the below question:
An anagram is a type of word play, the result of rearranging the letters of a word or phrase to produce a new word or phrase, using all the original letters exactly once; e.g., orchestra = carthorse. Using the word list at http://www.puzzlers.org/pub/wordlists/unixdict.txt, write a program that finds the sets of words that share the same characters that contain the most words in them.
It's failing even with just 1000 bytes of file size. Also, a new list is created on every iteration, so why does Python keep the old list in memory? I am getting the error below.
l=list(map(''.join, itertools.permutations(i)))
gives me:
MemoryError
Here's my code:
import itertools

def anagram():
    f = open('unixdict.txt')
    f2 = open('result_anagram.txt', 'w')
    words = f.read(1000).split('\n')
    for i in words:
        l = []
        l = list(map(''.join, itertools.permutations(i)))
        l.remove(i)
        for anagram in l:
            if l == i:
                f2.write(i + "\n")
    return True

anagram()
Changed the above code to, as per suggestion. But still getting the memory error.
import itertools

def anagram():
    f = open('unixdict.txt')
    f2 = open('result_anagram.txt', 'w')
    words = set(line.rstrip('\n') for line in f)
    for i in words:
        l = map(''.join, itertools.permutations(i))
        l = (x for x in l if x != i)
        for anagram in l:
            if anagram in words:
                f2.write(i + "\n")
    return True

anagram()
MemoryError
[Finished in 22.2s]
This program is going to be horribly inefficient no matter what you do.
But you can fix this MemoryError so it'll just take forever to run instead of failing.
First, note that a 12-letter word has 479,001,600 permutations. Storing all of those in memory is going to take more than 2GB of memory. So, how do you solve that? Just don't store them all in memory. Leave the iterator as an iterator instead of making a list, and then you'll only have to fit one at a time, instead of all of them.
There's one problem here: You're actually using that list in the if l==i: line. But clearly that's a mistake. There's no way that a list of strings can ever equal a single string. You might as well replace that line with raise TypeError, at which point you can just replace the whole loop and fail a whole lot faster. :)
I think what you wanted there is if anagram in words:. In which case you have no need for l, except for in the for loop, which means you can safely leave it as a lazy iterator:
for i in words:
    l = map(''.join, itertools.permutations(i))
    l = (x for x in l if x != i)
    for anagram in l:
        if anagram in words:
            f2.write(i + "\n")
I'm assuming Python 3.x here, since otherwise the list call was completely unnecessary. If you're using 2.x, replace that map with itertools.imap.
As a side note, f.read(1000) is usually going to get part of an extra word at the end, and the leftover part in the next loop. Try readlines. While it's useless with no argument, with an argument it's very useful:
Read and return a list of lines from the stream. hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.
So, f.readlines(1000) will let you read buffers of about 1K at a time, without getting partial lines. Of course now, instead of having to split on newlines, you have to rstrip them:
words = [line.rstrip('\n') for line in f.readlines(1000)]
However, you've got another problem. If you're only reading about 100 words at a time, the chances of finding an anagram are pretty slim. For example, orchestra is not going to be anywhere near carthorse in the dictionary, so there's no way to find it unless you remember the entire file. But that should be fine; a typical Unix dictionary like web2 has around 200K lines; you can easily read that into memory and keep it around as a set without making even a dent in your 2GB. So:
words = set(line.rstrip('\n') for line in f)
Also, note that you're trying to print out every word in the dictionary that has an anagram (multiple times, if it has multiple anagrams). Even with an efficient algorithm, that's going to take a long time—and spew out more data than you could possibly want. A more useful program might be one that takes an input word (e.g., via input or sys.argv[1]) and outputs just the anagrams of that word.
Finally:
Even after using l as a generator, it is taking too much time, though it no longer fails with a memory error. Can you explain the importance of words being a set rather than a list? [Finished in 137.4s] just for 200 bytes. You mentioned it before, but how do I overcome it using words as a set?
As I said at the top, "This program is going to be horribly inefficient no matter what you do."
In order to find the anagrams of a 12-letter word, you're going through 479 million permutations, and checking each one against a dictionary of about 200 thousand words, so that's 479M * 200K = 95 trillion checks, for each word. There are two ways to improve this, the first involving using the right data structures for the job, and the second involving the right algorithms for the job.
Changing the collection of things to iterate over from a list into a generator (a lazy iterable) turns something that took linear space (479M strings) into something that takes constant space (some fixed-size iterator state, plus one string at a time). Similarly, changing the collection of words to check against from a list into a set turns something that takes linear time (comparing a string against every element in the list) into something that takes constant time (hashing a string, then seeing if there's anything in the set with that hash value). So, this gets rid of the * 200K part of your problem.
But you've still got the 479M part of the problem. And you can't make that go away with a better data structure. Instead, you have to rethink the problem. How can you check whether any permutation of a word matches any other words, without trying all the permutations?
Well, some permutation of the word X matches the word Y if and only if X and Y have the same letters. It doesn't matter what order the letters in X were in; if the set is the same, there is at least one matching permutation (or exactly one, depending on how you count duplicate letters), and if not, there are exactly 0. So, instead of iterating through all the permutations of the word to look up, just look up its set of letters. But duplicates do matter, so you can't just use a set here. You could use some kind of multi-set (collections.Counter works)… or, with very little loss in efficiency and a big gain in simplicity, you could just sort the letters. After all, if two words have the same letters in some arbitrary order, they have the same letters in the same order when they're both sorted.
Of course you need to know which words are anagrams, not just that there is an anagram, so you can't just look it up in a set of letter sets, you have to look it up in a dictionary that maps letter sets to words. For example, something like this:
lettersets = collections.defaultdict(set)
for word in words:
    lettersets[''.join(sorted(word))].add(word)
So now, to look up the anagrams for a word, all you have to do is:
anagrams = lettersets[''.join(sorted(word))]
Not only is that simple and readable, it's also constant-time.
And if you really want to print out the massive list of all anagrams of all words… well, that's easy too:
for _, words in lettersets.items():
    for word in words:
        print('{} is an anagram of {}'.format(word, ', '.join(words - {word})))
Now, instead of taking 479M*200K time to find anagrams for one word, or 479M*200K*200K time to find all anagrams for all words, it takes constant time to find anagrams for one word, or 200K time to find all anagrams for all words. (Of course there is 200K setup time added to the start to create the mapping, but spending 200K time up-front to save 200K, much less 479M*200K, time for each lookup is an obvious win.)
Things get a little trickier when you want to, e.g., find partial anagrams, or sentence anagrams, but you want to follow the same basic principles: find data structures that let you do things in constant or logarithmic time instead of linear or worse, and find algorithms that don't require you to brute-force your way through an exponential or factorial number of candidates.
import urllib

def anagram():
    f = urllib.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt')
    words = f.read().split('\n')
    d = {''.join(sorted(x)): [] for x in words}  # create dict with an empty list for each letter set
    for x in words:
        d[''.join(sorted(x))].append(x)
    max_len = max(len(v) for k, v in d.iteritems())
    for k, v in d.iteritems():
        if len(v) >= max_len:
            print v

anagram()
Output:
['abel', 'able', 'bale', 'bela', 'elba']
['alger', 'glare', 'lager', 'large', 'regal']
['angel', 'angle', 'galen', 'glean', 'lange']
['evil', 'levi', 'live', 'veil', 'vile']
['caret', 'carte', 'cater', 'crate', 'trace']
['elan', 'lane', 'lean', 'lena', 'neal']
Finished in 5.7 secs
Here's a hint on solving the problem: two strings are anagrams of each other if they have the same collection of letters. You can sort the letters of each word (turning e.g. "orchestra" into "acehorrst"), then just check whether two words have the same sorted form. If they do, then the original words must have been anagrams of each other, since they have all the same letters (in a different order).

Trying to unravel a recursive Python program

I am writing a simple cryptogram solver and am having trouble 'unrolling' a recursive function. I must unroll it for other reasons, otherwise I would leave it recursive.
Here's the idea: I have a variable number of lists, each with words in them. The function's job is to go through each list and, after checking that the word fits in the current alphabet setup, find its score. So if you have the following lists:
LIST1: [the, and, can,...]
LIST2: [kids, cars, knee,...]
LIST3: [talks, walks, music,...]
...
and the function needs to go through each list (in order) and try to find the best sentence. (I have a scoring algorithm that it calls to compare.) It starts with the first word in the first list, then iterates the second list until it finds a word that works, then starts iterating the third list until it finds a word in that list that works, etc. Once it exhausts the words in the 3rd list, it should then go back to the second and find the next word that works, continuing the process until it's done.
I tried using itertools.product, but that doesn't work the right way: it just gives me all possible combinations, which technically works, but is not very efficient.
def find_sentence():
    cycle through first list:
        cycle through second list:
            if word works:
                start cycling through third word list.
            else:
                keep cycling through 2nd word list.
    ...
Keep going until we have gone through all word lists, finding a score that is above a threshold.
Any help?
From Bakuriu's response:
Thanks for your fast reply! I'm not that great at Python, but I don't think this is working the way I need it to. Your solution is similar to the product method in that its goal is to find all words that will work (or fit a score). The method I need is: 1. Start with the 1st word in the 1st list. 2. Start iterating the next list of words. 3. As soon as one of those words works, start going through the 3rd list, etc. 4. When you've reached the end (the last list of words) and find a candidate, you have a solution, since you have one word in each list that works. 5. If, say, a word in list 3 does not fit, you must go back to list 2 and CONTINUE searching through that list, finding the next word that works, then starting list 3 OVER AGAIN, and continuing until nothing works or you've reached the end. I hope this is clear. Please let me know if I can clarify anything.
You really don't need recursion here at all, actually.
def find_sentence(*variable_number_of_lists):
    out = []
    for eachlist in variable_number_of_lists:
        for word in eachlist:
            if scoreword(out, word) > threshold:
                # presumably, your 'scoreword' function would take in the current
                # list of okayed words in order to find the most recent one for use
                # in your scoring, if I've understood the problem correctly
                out.append(word)
                break
    return out
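If the search really has to backtrack into earlier lists, as the question edit describes, a sketch of an explicit-stack variant (still no recursion) might look like the following; score_word and threshold are stand-ins for your own scoring function and cutoff:
def find_sentence_backtracking(*word_lists, score_word, threshold=0):
    indices = [0] * len(word_lists)  # next position to try in each list
    out = []                         # accepted words, one per level so far
    level = 0
    while 0 <= level < len(word_lists):
        lst, i = word_lists[level], indices[level]
        # advance until a word in this list scores above the threshold
        while i < len(lst) and score_word(out, lst[i]) <= threshold:
            i += 1
        if i < len(lst):
            out.append(lst[i])
            indices[level] = i + 1   # where to resume if we backtrack here
            level += 1
            if level < len(word_lists):
                indices[level] = 0   # start the next list from scratch
        else:
            if level == 0:
                return None          # no combination works at all
            out.pop()                # undo the previous choice and retry it
            level -= 1
    return out
It walks forward while words keep scoring above the threshold, and whenever a list is exhausted it steps back to the previous list and continues from the next untried word there.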

How can I sort an item alphabetically before I add it to the list?

I have an assignment where I am supposed to find each word in a line and add that word to a list. There is also another list corresponding to the list of words; that list holds the number of times each word appears in the text.
I have finished that part. However, I cannot find a way to compare a newly found word to the words already in the list and find the index at which to insert it so the list stays in alphabetical order. I know that I am supposed to write a function that finds that index, so I can insert the item into both lists. I am not allowed to use the built-in sort, so I am having a little trouble. Can anyone help me write that function using only conditional operators?
If I am not clear, please let me know.
Homework hint: look at Python's source code for the bisect module. That shows how to find indexes and make insertions in a sorted list.
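To make that hint concrete, here is a sketch of a hand-rolled binary search for the insertion index, plus keeping the two parallel lists aligned (the sample words are placeholders):
def insertion_index(sorted_words, new_word):
    lo, hi = 0, len(sorted_words)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_words[mid] < new_word:
            lo = mid + 1
        else:
            hi = mid
    return lo

words, counts = [], []
for w in ['banana', 'apple', 'cherry', 'apple']:
    i = insertion_index(words, w)
    if i < len(words) and words[i] == w:
        counts[i] += 1          # word already present: bump its count
    else:
        words.insert(i, w)      # insert into both lists at the same index
        counts.insert(i, 1)
print(words, counts)            # ['apple', 'banana', 'cherry'] [2, 1, 1]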

List all words in a dictionary that start with <user input>

How would I go about making a program where the user enters a string, and the program generates a list of words beginning with that string?
Ex:
User: "abd"
Program: abdicate, abdomen, abduct...
Thanks!
Edit: I'm using python, but I assume that this is a fairly language-independent problem.
Use a trie.
Add your list of words to a trie. Each path from the root to a leaf is a valid word. A path from a root to an intermediate node represents a prefix, and the children of the intermediate node are valid completions for the prefix.
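A small dict-based trie sketch along those lines (the word list is only an example); '$' marks the end of a word:
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def complete(root, prefix):
    node = root
    for ch in prefix:            # walk down to the node for the prefix
        if ch not in node:
            return []
        node = node[ch]
    out = []
    def walk(node, suffix):      # collect every word below that node
        if '$' in node:
            out.append(prefix + suffix)
        for ch, child in node.items():
            if ch != '$':
                walk(child, suffix + ch)
    walk(node, '')
    return out

trie = build_trie(['abdicate', 'abdomen', 'abduct', 'apple'])
print(complete(trie, 'abd'))     # ['abdicate', 'abdomen', 'abduct']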
One of the best ways to do this is to use a directed graph to store your dictionary. It takes a little bit of setting up, but once done it is fairly easy to then do the type of searches you are talking about.
The nodes in the graph correspond to a letter in your word, so each node will have one incoming link and up to 26 (in English) outgoing links.
You could also use a hybrid approach where you maintain a sorted list containing your dictionary and use the directed graph as an index into your dictionary. Then you just look up your prefix in your directed graph and then go to that point in your dictionary and spit out all words matching your search criteria.
If you're on a Debian[-like] machine,
#!/bin/bash
echo -n "Enter a word: "
read input
grep "^$input" /usr/share/dict/words
Takes all of 0.040s on my P200.
egrep `read input && echo ^$input` /usr/share/dict/words
Oh, I didn't see the Python edit; here is the same thing in Python:
my_input = raw_input("Enter beginning of word: ")
my_words = open("/usr/share/dict/words").readlines()
my_found_words = [x for x in my_words if x[0:len(my_input)] == my_input]
If you really want speed, use a trie/automaton. However, something that will be faster than simply scanning the whole list, given that the list of words is sorted:
from itertools import takewhile, islice
import bisect
def prefixes(words, pfx):
    return list(
        takewhile(lambda x: x.startswith(pfx),
                  islice(words,
                         bisect.bisect_right(words, pfx),
                         len(words))))
Note that an automaton is O(1) with regard to the size of your dictionary, while this algorithm is O(log(m)) and then O(n) with regard to the number of strings that actually start with the prefix, while the full scan is O(m), with n << m.
def main(script, name):
    for word in open("/usr/share/dict/words"):
        if word.startswith(name):
            print word,

if __name__ == "__main__":
    import sys
    main(*sys.argv)
If you really want to be efficient, use suffix trees or suffix arrays (see the Wikipedia article).
Your problem is what suffix trees were designed to handle.
There is even an implementation for Python: here.
You can use str.startswith(). Reference from the official docs:
str.startswith(prefix[, start[, end]])
Return True if string starts with the prefix, otherwise return False. prefix can also be a tuple of prefixes to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.
Try the code below:
dictionary = ['apple', 'abdicate', 'orange', 'abdomen', 'abduct', 'banana']
user_input = input('Enter something: ')
for word in dictionary:
    if word.startswith(user_input):
        print(word)
Output:
Enter something: abd
abdicate
abdomen
abduct
var words = from word in dictionary
            where word.key.StartsWith("bla-bla-bla")
            select word;
Try using regex to search through your list of words, e.g. /^word/ and report all matches.
If you need to be really fast, use a tree:
build an array and split the words in 26 sets based on the first letter, then split each item in 26 based on the second letter, then again.
So if your user types "abd" you would look for Array[0][1][3] and get a list of all the words starting like that. At that point your list should be small enough to pass over to the client and use javascript to filter.
Most Pythonic solution
# set your list of words, whatever the source
words_list = ('cat', 'dog', 'banana')

# get the word from the user input
user_word = raw_input("Enter a word:\n")

# create a generator, so your output is flexible and stores almost nothing in memory
word_generator = (word for word in words_list if word.startswith(user_word))

# now you can do anything you want with it
# here we just list it:
for word in word_generator:
    print word
Remember generators can be only used once, so turn it to a list (using list(word_generator)) or use the itertools.tee function if you expect using it more than once.
Best way to do it:
Store it in a database and use SQL to look for the word you need. If there are a lot of words in your dictionary, it will be much faster and more efficient.
Python has plenty of DB APIs to help you do the job ;-)
If your dictionary is really big, I'd suggest indexing with a Python text index (PyLucene; note that I've never used the Python extension for Lucene). The search would be efficient and you could even return a search 'score'.
Also, if your dictionary is relatively static, you won't even have the overhead of re-indexing very often.
Don't use a bazooka to kill a fly. Use something simple just like SQLite. There are all the tools you need for every modern languages and you can just do :
"SELECT word FROM dict WHERE word LIKE "user_entry%"
It's lightning fast and a baby could do it. What's more it's portable, persistent and so easy to maintain.
Python tuto :
http://www.initd.org/pub/software/pysqlite/doc/usage-guide.html
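A minimal sketch with Python's built-in sqlite3 module; the database name, table, and sample words are assumptions for illustration:
import sqlite3

conn = sqlite3.connect('dict.db')
conn.execute("CREATE TABLE IF NOT EXISTS dict (word TEXT)")
conn.executemany("INSERT INTO dict VALUES (?)",
                 [(w,) for w in ('abdicate', 'abdomen', 'abduct', 'apple')])

user_entry = 'abd'
# bind the prefix as a parameter and append % for the LIKE pattern
rows = conn.execute("SELECT word FROM dict WHERE word LIKE ?", (user_entry + '%',))
print([row[0] for row in rows])
conn.close()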
A linear scan is slow, but a prefix tree is probably overkill. Keeping the words sorted and using a binary search is a fast and simple compromise.
import bisect
words = sorted(map(str.strip, open('/usr/share/dict/words')))
def lookup(prefix):
    return words[bisect.bisect_left(words, prefix):bisect.bisect_right(words, prefix + '~')]
>>> lookup('abdicat')
['abdicate', 'abdication', 'abdicative', 'abdicator']
If you store the words in a .csv file, you can use pandas to solve this rather neatly, and after you have read it once you can reuse the already loaded data frame if the user should be able to perform more than one search per session.
import pandas as pd

# assumes a single-column CSV with no header row, so the column label is 0
df = pd.read_csv('dictionary.csv', header=None)
matching_words = df[0].loc[df[0].str.startswith(user_entry)]
