detecting year in list of strings - python

I have a list of strings like this:
words = ['hello', 'world', 'name', '1', '2018']
I'm looking for the fastest way (Python 3.6) to detect a year "word" in the list. For example, "2018" is a year, "1" is not. Let's define the acceptable year range as 2000-2020.
Possible solution
Check if the word is a number ('2018'.isdigit()), then convert it to int and check whether it falls in the valid range.
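A minimal sketch of that approach, just to make the idea concrete:

words = ['hello', 'world', 'name', '1', '2018']

# keep only the words that are all digits and fall inside the year range
years = [w for w in words if w.isdigit() and 2000 <= int(w) <= 2020]
print(years)  # ['2018']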
What is the fastest way to do it in Python?

You can build a set of your valid years (as strings). Then loop through each of the words you want to test to check if it is a valid year:
words = ['hello', 'world', 'name', '1', '2018']
valid_years = {str(x) for x in range(2000, 2021)}

for word in words:
    if word in valid_years:
        print(word)
As Martijn Pieters mentioned in the comments, sets are the fastest solution for accessing items with an O(1) complexity:
Sets let you test for membership in O(1) time, using a list has a linear O(length_of_list) cost
EDIT:
As you can see in the comments, there are a lot of different ways of generating the set of valid_years; as long as your data structure is a set, you will have the fastest way of doing what you want.
You can read more here:
List comprehension
Sets
Complexities for different Python data structures (so you can understand which data structures in Python are quicker for specific operations)

Concatenate the list into one string with a separator character, then use a regex to search.
For example:
import re

word_tmp = " ".join(words)
# use a raw string so \b is a word boundary (not a backspace), and match 2000-2020 exactly
re.search(r"\b20(?:[01]\d|20)\b", word_tmp)
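If you want every matching year rather than only the first hit, a findall variant of the same idea (shown here just as an illustration) would be:

import re

words = ['hello', 'world', 'name', '1', '2018']
years = re.findall(r"\b20(?:[01]\d|20)\b", " ".join(words))
print(years)  # ['2018']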

Related

Fastest way to extract matching strings

I want to search for words that match a given word in a list (example below). However, say there is a list that contains millions of words. What is the most efficient way to perform this search? I was thinking of tokenizing each list and putting the words in a hashtable, then performing the word search / match and retrieving the list of words that contain this word. From what I can see, this operation will take O(n) operations. Is there any other way, maybe without using hash tables?
words_list = ['yek', 'lion', 'opt'];
# e.g. if we were to search or match the word "key" with the words in the list we should get the word "yek" or a list of words if there many that match
Also, is there a python library or third party package that can perform efficient searches?
It's not entirely clear what you mean by "match" here, but if you can reduce that to an identity comparison, the problem reduces to a set lookup, which is O(1) time.
For example, if "match" means "has exactly the same set of characters":
words_set = {frozenset(word) for word in words_list}
Then, to look up a word:
frozenset(word) in words_set
Or, if it means "has exactly the same multiset of characters" (i.e., counting duplicates but ignoring order):
words_set = {tuple(sorted(word)) for word in words_list}  # sorted() returns a list, which isn't hashable
tuple(sorted(word)) in words_set
… or, if you prefer:
words_set = {frozenset(collections.Counter(word).items()) for word in words_list}  # a Counter itself isn't hashable, so freeze its items
frozenset(collections.Counter(word).items()) in words_set
Either way, the key (no pun intended… but maybe it should have been) idea here is to come up with a transformation that turns your values (strings) into values that are identical iff they match (a set of characters, a multiset of characters, an ordered list of sorted characters, etc.). Then, the whole point of a set is that it can look for a value that's equal to your value in constant time.
Of course transforming the list takes O(N) time (unless you just build the transformed set in the first place, instead of building the list and then converting it), but you can use it over and over, and it takes O(1) time each time instead of O(N), which is what it sounds like you care about.
If you need to get back the matching word rather than just know that there is one, you can still do this with a set, but it's easier (if you can afford to waste a bit of space) with a dict:
words_dict = {frozenset(word): word for word in words_list}
words_dict[frozenset(word)] # KeyError if no match
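For example, with the sample list from the question:

words_list = ['yek', 'lion', 'opt']
words_dict = {frozenset(word): word for word in words_list}

print(words_dict[frozenset('key')])  # -> 'yek' (same set of characters)
# words_dict[frozenset('abc')] would raise KeyError: no match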
If there could be multiple matches, just change the dict to a multidict:
words_dict = collections.defaultdict(set)
for word in words_list:
    words_dict[frozenset(word)].add(word)

words_dict[frozenset(word)]  # empty set if no match
Or, if you explicitly want it to be a list rather than a set:
words_dict = collections.defaultdict(list)
for word in words_list:
    words_dict[frozenset(word)].append(word)

words_dict[frozenset(word)]  # empty list if no match
If you want to do it without using hash tables (why?), you can use a search tree or other logarithmic data structure:
import blist  # pip install blist to get it

words_dict = blist.sorteddict()
for word in words_list:
    # the keys must sort consistently, so use e.g. the word's letters in sorted order
    words_dict.setdefault(''.join(sorted(word)), set()).add(word)

words_dict[''.join(sorted(word))]  # KeyError if no match
This looks almost identical, except for the fact that it's not quite trivial to wrap defaultdict around a blist.sorteddict—but that just takes a few lines of code. (And maybe you actually want a KeyError rather than an empty set, so I figured it was worth showing both defaultdict and normal dict with setdefault somewhere, so you can choose.)
But under the covers, it's using a hybrid B-tree variant instead of a hash table. Although this is O(log N) time instead of O(1), in some cases it's actually faster than a dict.

Python 3 - Comparing two lists, finding out if el is in list, based on "starts with"

I have two lists:
items_on_queue = ['The rose is red and blue', 'The sun is yellow and round']
things_to_tweet = ['The rose is red','The sun is yellow','Playmobil is a toy']
I want to find out if an element is present on both lists based on the FEW CHARACTERS AT THE BEGINNING, and delete the element from things_to_tweet if a match is found.
The final output should be things_to_tweet = ['Playmobil is a toy']
Any idea how I can do this?
Thank you
PS/ I tried, but I cannot do an "==" comparison because each el is different in every list, even if they start the same, so they're not seen as equal by Python.
I also tried a loop inside a loop but I don't know how to compare one element with ALL the elements of another list only IF the strings start in the same manner.
I also checked other SO threads but they seem to refer to comparisons between lists when elements are exactly the same, which is not what I need.
Use a condition with str.startswith(...):
[s for s in things_to_tweet if not any(i.startswith(s) for i in items_on_queue)]
#Output:
#['Playmobil is a toy']
To keep things simple and readable, I would make use of a helper function (I named it is_prefix_of_any). Without this function we would have two nested loops, which is needlessly confusing. Checking whether a string is a prefix of another string is done with the str.startswith function.
I also opted to create a new list instead of removing strings from things_to_tweet, because removing things from a list you're iterating over will often cause unexpected results.
# define a helper function that checks if any string in a list
# starts with another string
# we will use this to check if any string in items_on_queue starts
# with a string from things_to_tweet
def is_prefix_of_any(prefix, strings):
    for string in strings:
        if string.startswith(prefix):
            return True
    return False

# build a new list containing only the strings we want
things = []
for thing in things_to_tweet:
    if not is_prefix_of_any(thing, items_on_queue):
        things.append(thing)

print(things)  # output: ['Playmobil is a toy']
A veteran would do this with much less code, but this should be a lot easier to understand.

Python: how to check that if an item is in a list efficiently?

I have a list of strings (word-like), and, while I am parsing a text, I need to check if a word belongs to the group of words in my current list.
However, my input is pretty big (about 600 million lines), and checking if an element belongs to a list is an O(n) operation according to the Python documentation.
My code is something like:
words_in_line = []
for word in line:
    if word in my_list:
        words_in_line.append(word)
As it takes too much time (days, actually), I wanted to improve the part that takes most of the time. I had a look at the Python collections module, and more precisely at deque. However, it only gives O(1) access time to the head and the tail of a list, not the middle.
Does someone have an idea about how to do this in a better way?
You might consider a trie or a DAWG or a database. There are several Python implementations of the same.
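To make the trie suggestion a bit more concrete, here is a minimal dict-based sketch (the helper names are made up for illustration):

# a tiny trie built from nested dicts; _end marks the end of a complete word
_end = object()

def make_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_end] = True
    return root

def in_trie(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return _end in node

trie = make_trie(['hello', 'help', 'world'])
print(in_trie(trie, 'help'))   # True
print(in_trie(trie, 'hel'))    # False (only a prefix)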
Here are some relative timings of a set vs a list for you to consider:
import timeit
import random

with open('/usr/share/dict/words', 'r') as di:  # UNIX 250k unique word list
    all_words_set = {line.strip() for line in di}

all_words_list = list(all_words_set)  # slightly faster if this list is sorted...

test_list = [random.choice(all_words_list) for i in range(10000)]
test_set = set(test_list)

def set_f():
    count = 0
    for word in test_set:
        if word in all_words_set:
            count += 1
    return count

def list_f():
    count = 0
    for word in test_list:
        if word in all_words_list:
            count += 1
    return count

def mix_f():
    # use list for source, set for membership testing
    count = 0
    for word in test_list:
        if word in all_words_set:
            count += 1
    return count

print("list:", timeit.Timer(list_f).timeit(1), "secs")
print("set:", timeit.Timer(set_f).timeit(1), "secs")
print("mixed:", timeit.Timer(mix_f).timeit(1), "secs")
Prints:
list: 47.4126560688 secs
set: 0.00277495384216 secs
mixed: 0.00166988372803 secs
i.e., matching a set of 10,000 words against a set of 250,000 words is 17,085x faster than matching a list of the same 10,000 words against a list of the same 250,000 words. Using a list for the source and a set for membership testing is 28,392x faster than an unsorted list alone.
For membership testing, a list is O(n) and sets and dicts are O(1) for lookups.
Conclusion: Use better data structures for 600 million lines of text!
I'm not clear on why you chose a list in the first place, but here are some alternatives:
Using a set() is likely a good idea. This is very fast, though unordered, but sometimes that's exactly what's needed.
If you need things ordered and to have arbitrary lookups as well, you could use a tree of some sort:
http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/
If set membership testing with a small number of false positives here or there is acceptable, you might check into a bloom filter (a toy sketch follows after these suggestions):
http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/
Depending on what you're doing, a trie might also be very good.
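To make the bloom-filter suggestion above concrete, here is a toy sketch (the bit-array size and the salted-hash scheme are arbitrary choices for illustration, not taken from the linked implementation):

import hashlib

class BloomFilter:
    """Toy Bloom filter: membership tests may give false positives, never false negatives."""

    def __init__(self, size=10_000_000, num_hashes=5):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # derive num_hashes bit positions from salted SHA-256 digests
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add('hello')
print('hello' in bf)    # True
print('goodbye' in bf)  # almost certainly False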
This uses a list comprehension:
words_in_line = [word for word in line if word in my_list]
which would be more efficient than the code you posted, though how much more for your huge data set is hard to know.
There are two improvements you can make here.
Back your word list with a hashtable. This will afford you O(1) performance when you are checking if a word is present in your word list. There are a number of ways to do this; the most fitting in this scenario is to convert your list to a set.
Using a more appropriate structure for your matching-word collection.
If you need to store all of the matches in memory at the same time, use a deque, since its append performance is superior to lists.
If you don't need all the matches in memory at once, consider using a generator. A generator is used to iterate over matched values according to the logic you specify, but it only stores part of the resulting list in memory at a time. It may offer improved performance if you are experiencing I/O bottlenecks.
Below is an example implementation based on my suggestions (opting for a generator, since I can't imagine you need all those words in memory at once).
from itertools import chain

d = set(['a', 'b', 'c'])  # Load our dictionary
f = open('c:\\input.txt', 'r')

# Build a generator to get the words in the file
all_words_generator = chain.from_iterable(line.split() for line in f)

# Build a generator to filter out the non-dictionary words
matching_words_generator = (word for word in all_words_generator if word in d)

for matched_word in matching_words_generator:
    # Do something with matched_word
    print(matched_word)

# We're reading the file during the above loop, so don't close it too early
f.close()
input.txt
a b dog cat
c dog poop
maybe b cat
dog
Output
a
b
c
b

list of lists of anagram in python [duplicate]

This question already has answers here:
Finding and grouping anagrams by Python
(7 answers)
Closed 7 months ago.
If my input is a list like this:
words = ['cat','act','wer','erw']
I want to make a list of lists of anagrams like this -
[['cat','act'],['wer','erw']]
I have tried to do something like this:
[[w1 for w in words if w!=w1 and sorted(w1)==sorted(w)] for w1 in words]
but it doesn't work. The output was :
[['cat'], ['act'], ['wer'], ['erw']]
In addition, I don't want to use any import (except string). What is the mistake?
Be aware that your original method is actually O(#words²) time and thus will not work on large datasets of perhaps more than 10000 words.
groupby one-liner:
One of the most elegant (and weirdest) use cases I've ever seen for itertools.groupby:
>>> from itertools import groupby
>>> [list(v) for k, v in groupby(sorted(words, key=sorted), sorted)]
[['cat', 'act'], ['wer', 'erw']]
defaultdict three-liner:
Using collections.defaultdict, you can do:
from collections import defaultdict

anagrams = defaultdict(list)
for w in words:
    anagrams[tuple(sorted(w))].append(w)
If you want to do it your original way, without any imports, you can emulate collections.defaultdict as follows:
anagrams = {}
for w in words:
    key = tuple(sorted(w))
    anagrams.setdefault(key, []).append(w)
example:
>>> anagrams
{('e', 'r', 'w'): ['wer', 'erw'], ('a', 'c', 't'): ['cat', 'act']}
(Also written up in whi's answer.)
map-reduce:
This problem is also the poster child for map-reduce, where the reduction key you use is the sorted letters (or more efficiently, a hash). This will allow you to massively parallelize the problem.
If we assume the length of the words is bounded, the groupby solution is O(#words log(#words)), while the hash solution is expected O(#words). In the unlikely event that the words are arbitrarily long, sorting (O(length log(length)) per word) is less efficient than using an order-agnostic hash of the letters (O(length) per word). Sadly, collections.Counter is not hashable, so you'd have to write your own.
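One way to get an order-agnostic, hashable key along those lines (this is just an assumption about how you might "write your own", not part of the answer above) is to freeze the Counter's items:

from collections import Counter

def multiset_key(word):
    # frozenset of (letter, count) pairs: hashable and independent of letter order
    return frozenset(Counter(word).items())

words = ['cat', 'act', 'wer', 'erw']
groups = {}
for w in words:
    groups.setdefault(multiset_key(w), []).append(w)
print(list(groups.values()))  # [['cat', 'act'], ['wer', 'erw']]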
words = ['cat', 'act', 'wer', 'erw']
dic = {}
for w in words:
    k = ''.join(sorted(w))
    dic.setdefault(k, [])
    dic[k].append(w)
print(list(dic.values()))
This performs better: O(n).
You can find various solutions for anagrams of a single word at a time by googling. It is likely that there is a more efficient solver around than the obvious "search through all the words I know and see if they have the same letters".
Once you have one, you can put it into a function:
def anagrams(word):
    "return a list of all known anagrams of *word*"
Once you have that, generalising it to a list of words is trivial:
[anagrams(word) for word in words]
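One possible way to fill in that stub, assuming the known words simply live in a list (the word list below is only a placeholder), is to precompute a sorted-letters index once:

known_words = ['cat', 'act', 'tac', 'wer', 'erw']  # placeholder word list

by_letters = {}
for w in known_words:
    by_letters.setdefault(''.join(sorted(w)), []).append(w)

def anagrams(word):
    "return a list of all known anagrams of *word*"
    return by_letters.get(''.join(sorted(word)), [])

print([anagrams(word) for word in ['cat', 'wer']])
# [['cat', 'act', 'tac'], ['wer', 'erw']]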
This one should do the trick in the style you prefer
[[w, w1] for w1 in words for w in words if w!=w1 and sorted(w1)==sorted(w)][::2]

Python text search question [closed]

Suppose you open a text file in Python, and you'd like to search for words containing a number of given letters.
Say you type in 6 different letters (a, b, c, d, e, f) that you want to search for.
You'd like to find words matching at least 3 of the letters.
Each letter can only appear once in a word.
And the letter 'a' always has to be contained.
What should the code look like for this specific kind of search?
Let's see...
def find_matches(document):
    return [x for x in document.split()
            if 'a' in x and sum((1 if y in 'abcdef' else 0 for y in x)) >= 3]
split with no parameters acts as a "words" function, splitting on any whitespace and removing words that contain no characters. Then you check if the letter 'a' is in the word. If 'a' is in the word, you use a generator expression that goes over every letter in the word. If the letter is inside of the string of available letters, then it returns a 1 which contributes to the sum. Otherwise, it returns 0. Then if the sum is 3 or greater, it keeps it. A generator is used instead of a list comprehension because sum will accept anything iterable and it stops a temporary list from having to be created (less memory overhead).
It doesn't have the best access times because of the use of in (which on a string should have an O(n) time), but that generally isn't a very big problem unless the data sets are huge. You can optimize that a bit to pack the string into a set and the constant 'abcdef' can easily be a set. I just didn't want to ruin the nice one liner.
EDIT: Oh, and to improve time on the if portion (which is where the inefficiencies are), you could separate it out into a function that iterates over the string once and returns True if the conditions are met. I would have done this, but it ruined my one liner.
EDIT 2: I didn't see the "must have 3 different characters" part. You can't do that in a one liner. You can just take the if portion out into a function.
def is_valid(word, chars):
    count = 0
    for x in word:
        if x in chars:
            count += 1
            chars.remove(x)
    return count >= 3 and 'a' not in chars

def parse_document(document):
    return [x for x in document.split() if is_valid(x, set('abcdef'))]
This one shouldn't have any performance problems on real world data sets.
Here is what I would do if I had to write this:
I'd have a function that, given a word, would check whether it satisfies the criteria and would return a boolean flag.
Then I'd have some code that would iterate over all words in the file, present each of them to the function, and print out those for which the function has returned True.
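A bare-bones sketch of that plan (the file name is hypothetical, and the criteria are filled in from the question's rules):

def satisfies_criteria(word, letters='abcdef', min_match=3):
    # 'a' must be present, each of the given letters at most once,
    # and at least min_match of them must appear
    if 'a' not in word:
        return False
    hits = 0
    for ch in letters:
        n = word.count(ch)
        if n > 1:
            return False
        hits += n
    return hits >= min_match

with open('input.txt') as f:  # hypothetical input file
    for line in f:
        for word in line.split():
            if satisfies_criteria(word):
                print(word)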
I agree with aix's general plan, but it's perhaps even more general than a 'design pattern,' and I'm not sure how far it gets you, since it boils down to, "figure out a way to check for what you want to find and then check everything you need to check."
Advice about how to find what you want to find: You've entered into one of the most fundamental areas of algorithm research. Though LCS (longest common substring) is better covered, you'll have no problems finding good examples for containment either. The most rigorous discussion of this topic I've seen is on a Google cs wonk's website: http://neil.fraser.name. He has something called diff-match-patch which is released and optimized in many different languages, including python, which can be downloaded here:
http://code.google.com/p/google-diff-match-patch/
If you'd like to understand more about Python and algorithms, Magnus Hetland has written a great book about Python algorithms, and his website features some examples of string matching, fuzzy string matching and so on, including the Levenshtein distance in a very easy-to-grasp format. (Google for Magnus Hetland; I don't remember the address.)
Within the standard library you can look at difflib, which offers many ways to assess the similarity of strings. You are looking for containment, which is not the same thing, but it is quite related, and you could potentially build a set of candidate words to compare, depending on your needs.
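For instance, difflib.get_close_matches gives a quick similarity-based lookup (the word list here is made up for illustration):

import difflib

candidates = ['apple', 'ample', 'maple', 'grape']
# 'apple' should come back as the closest match to the misspelling
print(difflib.get_close_matches('appel', candidates, n=2, cutoff=0.6))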
Alternatively you could use the new addition to python, Counter, and reconstruct the words you're testing as lists of strings, then make a function that requires counts of 1 or more for each of your tested letters.
Finally, on to the second part of the aix's approach, 'then apply it to everything you want to test,' I'd suggest you look at itertools. If you have any kind of efficiency constraint, you will want to use generators and a test like the one aix proposes can be most efficiently carried out in python with itertools.ifilter. You have your function that returns True for the values you want to keep, and the builtin function bool. So you can just do itertools.ifilter(bool,test_iterable), which will return all the values that succeed.
Good luck
words = 'fubar cadre obsequious xray'

def find_words(src, required=[], letters=[], min_match=3):
    required = set(required)
    letters = set(letters)
    words = ((word, set(word)) for word in src.split())
    words = (word for word in words if word[1].issuperset(required))
    words = (word for word in words if len(word[1].intersection(letters)) >= min_match)
    words = (word[0] for word in words)
    return words

w = find_words(words, required=['a'], letters=['a', 'b', 'c', 'd', 'e', 'f'])
print(list(w))
EDIT 1: I too didn't read the requirements closely enough. Here is a version that ensures a word contains no more than one instance of each valid letter.
from collections import Counter

def valid(word, letters, min_match):
    """At least min_match of the letters, and no more than one of any letter"""
    c = 0
    count = Counter(word)
    for letter in letters:
        char_count = count.get(letter, 0)
        if char_count > 1:
            return False
        elif char_count == 1:
            c += 1
    return c >= min_match

def find_words(srcfile, required=[], letters=[], min_match=3):
    required = set(required)
    words = (word for word in srcfile.split())
    words = (word for word in words if set(word).issuperset(required))
    words = (word for word in words if valid(word, letters, min_match))
    return words
