Searching string for different substrings - python

I have a string. I need to know if any of the following substrings appear in the string.
So, if I have:
thing_name = "VISA ASSESSMENTS"
I've been doing my searches with:
any((_ in thing_name for _ in ['ASSESSMENTS','KILOBYTE','INTERNATIONAL']))
I'm going through a long list of thing_name items, and I don't need to filter, exactly, just check for any number of substrings.
Is this the best way to do this? It feels wrong, but I can't think of a more efficient way to pull this off.

You can try re.search to see if that is faster. Something along the lines of
import re
pattern = re.compile('|'.join(['ASSESSMENTS', 'KILOBYTE', 'INTERNATIONAL']))
isMatch = pattern.search(thing_name) is not None
(If the substrings could ever contain regex metacharacters, escape them first with re.escape.)

If your list of substrings is small and the input is small, then using a for loop to do compares is fine.
Otherwise, the fastest way I know to search a string for a (large) list of substrings is to construct a DAWG from the word list and then iterate through the input string, keeping a list of in-progress DAWG traversals and registering a match each time a traversal reaches a terminal node.
Another way is to add all the substrings to a hashtable and then hash every possible substring (up to the length of the longest substring) as you traverse the input string.
It's been a while since I've worked in Python; my memory is that it's slow to implement this kind of thing in. To go the DAWG route, I would probably implement it as a native module and then use it from Python (if possible). Otherwise, I'd do some speed checks to verify first, but probably go the hashtable route, since Python's built-in hashtables (dicts and sets) are already high performance.
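A rough sketch of that hashtable idea in plain Python, using a set as the table (function and variable names are illustrative):
def find_substrings(text, subs):
    # put the substrings in a set, then test every window of the input
    # up to the length of the longest substring
    table = set(subs)
    max_len = max(len(s) for s in subs)
    found = set()
    for start in range(len(text)):
        for end in range(start + 1, min(start + max_len, len(text)) + 1):
            if text[start:end] in table:
                found.add(text[start:end])
    return found

find_substrings("VISA ASSESSMENTS", ['ASSESSMENTS', 'KILOBYTE', 'INTERNATIONAL'])
# -> set(['ASSESSMENTS'])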

Related

Finding the end of a contiguous substring of a string without iteration or RegEx

I'm trying to write an iterative LL(k) parser, and I've gotten strings down pretty well, because they have a start and end token, and so you can just "".join(tokenlist[string_start:string_end]).
Numbers, however, do not, and only consist of .0123456789. They can occur at any given point in a program, have any arbitrary length and are delimited purely by non-numerals.
Some examples, because that definition is pretty vague:
56 123.45/! is 56 and 123.45 followed by two other tokens
565.5345.345 % is 565.5345, 0.345 and two other tokens (incl. whitespace)
The problem I'm trying to solve is how the parser should figure out where a numeric literal ends. (Note that this is a context-free, self-modifying interpretive grammar thus there is no separate lexical analysis to be done.)
I could and have solved this with iteration:
def _next_notinst(self, atindex, subs=DIGITS):
    """return the next index of a char not in subs"""
    for i, e in enumerate(self.toklist[atindex:]):
        if e not in subs:
            return i - len(self.toklist)
    return self.idx.v
(I don't think I need to clarify the variables, since it's an example and extremely straightforward.)
Great! That works, but there are at least two issues:
It's O(n) for a number with digit-length n. Not ideal.*
The parser class of which this method is a member is already using a while True: to cycle over arbitrary parts of the string, and I would prefer not having remotely nested loops when I don't need to.
From the previous bullet: since the parser uses arbitrary k lookahead and skipahead, parsing each individual token is absolutely not what I want.
I don't want to use RegEx mostly because I don't know it, and using it for this right now would make my code incomprehensible to me, its creator.
There must be a simple, < O(n) solution that simply collects the contiguous numerals in a string from a given starting point, up until the first non-numeral.
*Yes, I'm fully aware the parser itself is O(n), but we don't also need the number catenator to be > O(n). If you don't believe me, the string catenator is O(1) because it simply looks for the next unescaped " in the program and then joins all the chars up to that. Can't I do the same thing for numbers?
My other answer was actually erroneous due to lack of testing.
I decided to suck it up and learn a little bit of RegEx just because it's the only other way to solve this.
^([.\d]+[.\d]+|[.\d]) works for what I want, and matches these:
123.43.453""
.234234!/%
but not, for example:
"1233

Most efficient way to check if any substrings in list are in another list of strings

I have two lists, one of words, and another of character combinations. What would be the fastest way to only return the combinations that don't match anything in the list?
I've tried to make it as streamlined as possible, but it's still very slow when it uses 3 characters for the combinations (it goes up to 290 seconds for 4 characters; I'm not even going to try 5).
Here's some example code; currently I'm joining all the words into one string, and then searching that string for each list value.
#Sample of stuff
allCombinations = ["a","aa","ab","ac","ad"]
allWords = ["testing", "accurate" ]
#Do the calculations
allWordsJoined = ",".join( allWords )
invalidCombinations = set( i for i in allCombinations if i not in allWordsJoined )
print invalidCombinations
#Result: set(['aa', 'ab', 'ad'])
I'm just curious if there's a better way to do this with sets? With a combination of 3 letters, there are 18278 list items to search for, and for 4 letters, that goes up to 475254, so currently my method isn't really fast enough, especially when the word list string is about 1 million characters.
Set.intersection seems like a very useful method if you need the whole string, so surely there must be something similar to search for a substring.
The first thing that comes to mind is that you can optimize lookup by checking the current combination against combinations that are already "invalid". I.e. if ab is invalid, then ab.? will be invalid too, and there's no point checking those.
And one more thing: try using
invalidCombinations = set()
for i in allCombinations:
    if i not in allWordsJoined:
        invalidCombinations.add(i)
instead of
invalidCombinations = set(i for i in allCombinations if i not in allWordsJoined)
I'm not sure, but less memory allocations can be a small boost for real data run.
Seeing if a set contains an item is O(1). You would still have to iterate through your list of combinations to compare against your original set of words (with some exceptions: if your words don't contain "a", they won't contain any combination that includes "a" either, and you can use a tree-like data structure to exploit this).
You shouldn't convert your word list to a string, but rather to a set. You should then get O(N), where N is the number of combinations.
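A sketch of that set-based idea, assuming what you actually store is every substring of each word up to the longest combination length (function and variable names are illustrative):
def invalid_combinations(all_combinations, all_words):
    max_len = max(len(c) for c in all_combinations)
    seen = set()
    for word in all_words:
        # collect every substring of the word up to max_len characters
        for start in range(len(word)):
            for end in range(start + 1, min(start + max_len, len(word)) + 1):
                seen.add(word[start:end])
    # now each check is an O(1) set membership test
    return set(c for c in all_combinations if c not in seen)

invalid_combinations(["a", "aa", "ab", "ac", "ad"], ["testing", "accurate"])
# -> set(['aa', 'ab', 'ad'])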
Also, I like Python, but it isn't the fastest of languages. If this is the only task you need to do, and it needs to be very fast, and you can't improve the algorithm, you might want to check out other languages. You should be able to very easily prototype something to get an idea of the difference in speed for different languages.

Conditional Removal of suffix from words in a python list

The task that I have to perform is as follows:
Say I have a list of words (just an example; the list can have any words):
'yappingly', 'yarding', 'yarly', 'yawnfully', 'yawnily', 'yawning','yawningly',
'yawweed', 'yealing', 'yeanling', 'yearling', 'yearly', 'yearnfully','yearning',
'yearnling', 'yeastily', 'yeasting', 'yed',
I have to create a new list to which words having the suffix ing are added after removing the suffix (i.e. yeasting is added to the new list as yeast), and the remaining words are added as they are.
Now, as far as the insertion of strings ending with ing is concerned, I wrote the following code and it works fine:
Data = [w[0:-3] for w in wordlist if re.search('ing$', w)]
But how do I add the remaining words to the list? How do I add an else clause to the above if statement? I was unable to find suitable documentation for this. I did come across several questions on SO regarding the shorthand if-else statement, but simply adding an else statement at the end of the above code doesn't work. How do I go about it?
Secondly, if I have to extend the above regular expression for multiple suffixes say as follows:
re.search('(ing|ed|al)$',w)
How do I perform the "trim" operation to remove the matched suffix and simultaneously add the word to the new list?
Please Help.
First, what makes you think you need a regexp at all? There are easier ways to strip suffixes.
Second, if you want to use regexps, why not just re.sub instead of trying to use regexps and slicing together? For example:
Data = [re.sub('(ing|ed|al)$', '', w) for w in wordlist]
Then you don't need to work out how much to slice off (which would require you to keep track of the result of re.search so you can get the length of the group, instead of just turning it into a bool).
But if you really want to do things your way, just replace your if filter with a conditional expression, as in iCodez's answer.
Finally, if you're stuck on how to fit something into a one-liner, just take it out of the one-liner. It should be easy to write a strip_suffixes function that returns the suffix-stripped string (which is the original string if there was no suffix). Then you can just write:
Data = [strip_suffixes(w) for w in wordlist]
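For instance, a possible strip_suffixes along those lines (a sketch, using the suffix list from the question):
SUFFIXES = ('ing', 'ed', 'al')

def strip_suffixes(word, suffixes=SUFFIXES):
    # return the word with the first matching suffix removed, else unchanged
    for s in suffixes:
        if word.endswith(s):
            return word[:-len(s)]
    return word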
Regarding your first question, you can use a ternary placed just before the for:
Data = [w[0:-3] if re.search('ing$', w) else w for w in wordlist]
Regarding your second, well, the best answer in my opinion is to use re.sub as @abarnert demonstrated. However, you could also make a slight adaptation to your use of re.search; making the suffix group optional (with a non-greedy prefix) lets unsuffixed words match too, since .group() would otherwise raise AttributeError whenever there is no match:
Data = [re.search(r'(.*?)(?:ing|ed|al)?$', w).group(1) for w in wordlist]
Finally, here is a link for more information on comprehensions.

Efficient way to do a large number of search/replaces in Python?

I'm fairly new to Python, and am writing a series of scripts to convert between some proprietary markup formats. I'm iterating line by line over files and then doing a large number (100-200) of substitutions that basically fall into 4 categories:
line = line.replace("-","<EMDASH>") # Replace single character with tag
line = line.replace("<\\#>","#") # tag with single character
line = line.replace("<\\n>","") # remove tag
line = line.replace("\xe1","•") # replace non-ascii character with entity
The str.replace() function seems to be pretty efficient (fairly low in the numbers when I examine profiling output), but is there a better way to do this? I've seen the re.sub() method with a function as an argument, but am unsure if this would be better; I guess it depends on what kind of optimizations Python does internally. Thought I would ask for some advice before creating a large dict that might not be very helpful!
Additionally I do some parsing of tags (that look somewhat like HTML, but are not HTML). I identify tags like this:
m = re.findall('(<[^>]+>)',line)
And then do ~100 search/replaces (mostly removing matches) within the matched tags as well, e.g.:
m = re.findall(r'(<[^>]+>)', line)
for tag in m:
    tag_new = re.sub(r"\*t\([^\)]*\)", "", tag)
    tag_new = re.sub(r"\*p\([^\)]*\)", "", tag_new)
    # do many more searches...
    if tag != tag_new:
        line = line.replace(tag, tag_new, 1)  # potentially problematic
Any thoughts on efficiency here?
Thanks!
str.replace() is more efficient if you're going to do basic search and replaces, and re.sub is (obviously) more efficient if you need complex pattern matching (because otherwise you'd have to use str.replace several times).
I'd recommend you use a combination of both. If you have several patterns that all get replaced by one thing, use re.sub. If you just have some cases where you just need to replace one specific tag with another, use str.replace.
You can also improve efficiency by using larger strings (call re.sub once instead of once for each line). Increases memory use, but shouldn't be a problem unless the file is HUGE, but also improves execution time.
If you don't actually need the regex and are just doing literal replacing, string.replace() will almost certainly be faster. But even so, your bottleneck here will be file input/output, not string manipulation.
The best solution though would probably be to use cStringIO
Depending on the ratio of relevant-to-not-relevant portions of the text you're operating on (and whether or not the parts each substitution operates on overlap), it might be more efficient to try to break down the input into tokens and work on each token individually.
Since each replace() in your current implementation has to examine the entire input string, that can be slow. If you instead broke down that stream into something like...
[<normal text>, <tag>, <tag>, <normal text>, <tag>, <normal text>]
# from an original "<normal text><tag><tag><normal text><tag><normal text>"
...then you could simply look to see if a given token is a tag, and replace it in the list (and then ''.join() at the end).
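A minimal sketch of that token-based approach (the tag pattern is the asker's; the cleanup patterns are stand-ins for the real substitution list):
import re

TAG = re.compile(r'(<[^>]+>)')
# stand-ins for the real substitution rules from the question
CLEANUPS = [re.compile(r'\*t\([^)]*\)'),
            re.compile(r'\*p\([^)]*\)')]

def process_line(line):
    # split() with a capturing group keeps the tags in the token list
    tokens = TAG.split(line)
    for i, tok in enumerate(tokens):
        if tok.startswith('<'):  # only rewrite tag tokens
            for pat in CLEANUPS:
                tok = pat.sub('', tok)
            tokens[i] = tok
    return ''.join(tokens)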
You can pass a function object to re.sub instead of a substitution string, it takes the match object and returns the substitution, so for example
>>> r = re.compile(r'<(\w+)>|(-)')
>>> r.sub(lambda m: '(%s)' % (m.group(1) if m.group(1) else 'emdash'), '<atag>-<anothertag>')
'(atag)(emdash)(anothertag)'
Of course you can use a more complex function object, this lambda is just an example.
Using a single regex that does all the substitutions should be slightly faster than iterating over the string many times, but if a lot of substitutions are performed, the overhead of calling the function object that computes the substitution may be significant.
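For the purely literal replacements from the question, that single-regex idea could look like this (a sketch; the table holds just a few of the 100-200 rules):
import re

# a few of the asker's literal rules, keyed by the exact text to replace
replacements = {
    "-": "<EMDASH>",
    "<\\#>": "#",
    "<\\n>": "",
}
# one alternation of escaped literals means one pass over each line
pattern = re.compile("|".join(re.escape(k) for k in replacements))

def replace_all(line):
    return pattern.sub(lambda m: replacements[m.group(0)], line)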

List all words in a dictionary that start with <user input>

How would I go about making a program where the user enters a string, and the program generates a list of words beginning with that string?
Ex:
User: "abd"
Program: abdicate, abdomen, abduct...
Thanks!
Edit: I'm using python, but I assume that this is a fairly language-independent problem.
Use a trie.
Add your list of words to a trie. Each path from the root to a leaf is a valid word. A path from a root to an intermediate node represents a prefix, and the children of the intermediate node are valid completions for the prefix.
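A minimal sketch of such a trie using nested dicts (an illustrative representation; real implementations often use node classes or arrays, and the '$' end-of-word marker is just a convention here):
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True  # end-of-word marker
    return root

def completions(root, prefix):
    node = root
    for ch in prefix:  # walk down to the node for the prefix
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [(node, prefix)]
    while stack:  # collect every word below that node
        n, sofar = stack.pop()
        if n.get('$'):
            found.append(sofar)
        for ch, child in n.items():
            if ch != '$':
                stack.append((child, sofar + ch))
    return found

completions(build_trie(['abdicate', 'abdomen', 'abduct', 'apple']), 'abd')
# -> ['abduct', 'abdomen', 'abdicate'] (order depends on traversal)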
One of the best ways to do this is to use a directed graph to store your dictionary. It takes a little bit of setting up, but once done it is fairly easy to then do the type of searches you are talking about.
The nodes in the graph correspond to a letter in your word, so each node will have one incoming link and up to 26 (in English) outgoing links.
You could also use a hybrid approach where you maintain a sorted list containing your dictionary and use the directed graph as an index into your dictionary. Then you just look up your prefix in your directed graph and then go to that point in your dictionary and spit out all words matching your search criteria.
If you're on a Debian[-like] machine,
#!/bin/bash
echo -n "Enter a word: "
read input
grep "^$input" /usr/share/dict/words
Takes all of 0.040s on my P200.
egrep `read input && echo ^$input` /usr/share/dict/words
Oh, I didn't see the Python edit; here is the same thing in Python:
my_input = raw_input("Enter beginning of word: ")
my_words = open("/usr/share/dict/words").readlines()
my_found_words = [x for x in my_words if x[0:len(my_input)] == my_input]
If you really want speed, use a trie/automaton. However, something that will be faster than simply scanning the whole list, given that the list of words is sorted:
from itertools import takewhile, islice
import bisect

def prefixes(words, pfx):
    return list(
        takewhile(lambda x: x.startswith(pfx),
                  islice(words,
                         bisect.bisect_left(words, pfx),
                         len(words))))
Note that an automaton is O(1) with regard to the size of your dictionary, while this algorithm is O(log(m)) and then O(n) with regard to the number of strings that actually start with the prefix, while the full scan is O(m), with n << m.
def main(script, name):
    for word in open("/usr/share/dict/words"):
        if word.startswith(name):
            print word,

if __name__ == "__main__":
    import sys
    main(*sys.argv)
If you really want to be efficient, use suffix trees or suffix arrays - see the Wikipedia article.
Your problem is what suffix trees were designed to handle.
There is even an implementation for Python - here
You can use str.startswith(). Reference from the official docs:
str.startswith(prefix[, start[, end]])
Return True if string starts with the prefix, otherwise return False. prefix can also be a tuple of prefixes to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.
Try the code below:
dictionary = ['apple', 'abdicate', 'orange', 'abdomen', 'abduct', 'banana']
user_input = input('Enter something: ')
for word in dictionary:
    if word.startswith(user_input):
        print(word)
Output:
Enter something: abd
abdicate
abdomen
abduct
var words = from word in dictionary
            where word.Key.StartsWith("bla-bla-bla")
            select word;
Try using regex to search through your list of words, e.g. /^word/ and report all matches.
If you need to be really fast, use a tree:
build an array and split the words into 26 sets based on the first letter, then split each set into 26 based on the second letter, and so on.
So if your user types "abd", you would look in Array[0][1][3] and get a list of all the words starting like that. At that point your list should be small enough to pass over to the client and filter with JavaScript.
Most Pythonic solution
# set your list of words, whatever the source
words_list = ('cat', 'dog', 'banana')
# get the word from the user input
user_word = raw_input("Enter a word:\n")
# create a generator, so your output is flexible and stores almost nothing in memory
word_generator = (word for word in words_list if word.startswith(user_word))
# now you can do anything you want with it
# here we just list it:
for word in word_generator:
    print word
Remember generators can only be used once, so turn it into a list (using list(word_generator)) or use the itertools.tee function if you expect to use it more than once.
Best way to do it:
Store it in a database and use SQL to look up the word you need. If there are a lot of words in your dictionary, it will be much faster and more efficient.
Python has plenty of DB APIs to help you do the job ;-)
If your dictionary is really big, I'd suggest indexing with a Python text index (PyLucene - note that I've never used the Python extension for Lucene). The search would be efficient, and you could even return a search 'score'.
Also, if your dictionary is relatively static, you won't even have the overhead of re-indexing very often.
Don't use a bazooka to kill a fly. Use something simple like SQLite. There are bindings for every modern language, and you can just do:
SELECT word FROM dict WHERE word LIKE 'user_entry%'
It's lightning fast and a baby could do it. What's more, it's portable, persistent and easy to maintain.
Python tutorial:
http://www.initd.org/pub/software/pysqlite/doc/usage-guide.html
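A small sketch of that approach with the sqlite3 module from the standard library (the table name, column name, and sample data are made up for illustration):
import sqlite3

# build a throwaway in-memory table just to demonstrate the query
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE dict (word TEXT)')
conn.executemany('INSERT INTO dict VALUES (?)',
                 [('abdicate',), ('abdomen',), ('abduct',), ('apple',)])

user_entry = 'abd'
# bind the prefix as a parameter instead of pasting it into the SQL string
rows = conn.execute("SELECT word FROM dict WHERE word LIKE ? || '%'",
                    (user_entry,))
print([row[0] for row in rows])  # -> ['abdicate', 'abdomen', 'abduct']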
A linear scan is slow, but a prefix tree is probably overkill. Keeping the words sorted and using a binary search is a fast and simple compromise.
import bisect

words = sorted(map(str.strip, open('/usr/share/dict/words')))

def lookup(prefix):
    return words[bisect.bisect_left(words, prefix):
                 bisect.bisect_right(words, prefix + '~')]
>>> lookup('abdicat')
['abdicate', 'abdication', 'abdicative', 'abdicator']
If you store the words in a .csv file, you can use pandas to solve this rather neatly, and after you have read it once, you can reuse the already loaded data frame if the user should be able to perform more than one search per session.
import pandas as pd

df = pd.read_csv('dictionary.csv', header=None)  # header=None gives integer column labels
matching_words = df[0].loc[df[0].str.startswith(user_entry)]