Fast String within List Searching - python

Using Python 3, I have a list containing over 100,000 strings (list1), each at most 300 characters long. I also have a list of over 9 million substrings (list2). For each substring in list2, I want to count how many elements of list1 contain it. For instance,
list1 = ['cat', 'caa', 'doa', 'oat']
list2 = ['at', 'ca', 'do']
I want the function to return (mapped to list2):
[2, 2, 1]
Normally, this is very simple and requires very little. However, due to the massive size of the lists, I have efficiency issues. I want to find the fastest way to return that counter list.
I've tried list comprehensions, generators, maps, and loops of all kinds, and I have yet to find a fast way to do this easy task. What is theoretically the fastest way to complete this objective, preferably taking O(len(list2)) steps very quickly?

Set M = len(list1) and N = len(list2).
For each of the N entries in list2 you have to do M comparisons against the entries in list1. That's a worst-case running time of O(M x N). Taking it further, if each entry in list2 has length 1 and each entry in list1 has length 300, then each comparison scans up to 300 characters and the running time is O(300M x N).
If performance is really an issue, try dynamic programming. Here is a start:
1) sort list2 in ascending order of length like so:
['scorch', 'scorching', 'dump', 'dumpster', 'dumpsters']
2) sort them into sublists such that each preceding entry is a substring of the following entry like so:
[['scorch', 'scorching'] , ['dump', 'dumpster', 'dumpsters']]
3) Now if you compare against list1 and 'scorch' is not in there, then you don't have to search for 'scorching' either. Likewise if 'dump' is not in there, neither is 'dumpster' or 'dumpsters'
Note that the worst-case running time is still the same.
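The pruning idea in steps 1 to 3 could be sketched roughly as follows. This is only an illustration under some assumptions: the function name count_with_prefix_pruning is made up, it only exploits the case where a shorter entry is a prefix of a longer one (as in the 'dump' / 'dumpster' example), and it keeps a set of matching list1 indices per pattern, which costs memory.

```python
def count_with_prefix_pruning(list1, list2):
    """Count, for each pattern in list2, how many strings in list1 contain it.

    Hypothetical helper: a pattern is only searched in the strings that
    already matched its longest proper prefix that is also in list2.
    """
    all_indices = range(len(list1))
    matches = {}  # pattern -> set of indices into list1 that contain it
    # Shorter patterns first, so prefixes are processed before their extensions.
    for pattern in sorted(set(list2), key=len):
        candidates = all_indices
        # Look for the longest proper prefix that has already been processed.
        for cut in range(len(pattern) - 1, 0, -1):
            prefix = pattern[:cut]
            if prefix in matches:
                candidates = matches[prefix]
                break
        matches[pattern] = {i for i in candidates if pattern in list1[i]}
    return [len(matches[p]) for p in list2]


print(count_with_prefix_pruning(['cat', 'caa', 'doa', 'oat'], ['at', 'ca', 'do']))
# [2, 2, 1]
```

When no entry of list2 is a prefix of another, nothing is pruned and this does the same work as the plain nested loop.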

I believe this task can be solved in linear time with an Aho-Corasick string matching machine.
See this answer for additional information (maybe you'll get ideas from the other answers to that question, too; it is almost the same task, and I think Aho-Corasick is theoretically the fastest way to solve this).
You will have to modify the string matching machine so that, instead of returning the match, it increases the counter of every matched substring by one. (This should be only a minor modification.)
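For what it's worth, here is a minimal pure-Python sketch of that modification. It is not taken from the linked answer; the names build_automaton and count_containing are invented for this example, and it assumes every pattern in list2 is non-empty. Each pattern is counted at most once per string in list1, since the question asks how many elements contain it.

```python
from collections import deque


def build_automaton(patterns):
    """Build a trie with failure links (hypothetical helper)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # indices of the patterns that end exactly at each state
    for idx, pattern in enumerate(patterns):
        state = 0
        for ch in pattern:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(idx)

    # link[s]: nearest state on the failure chain of s that has output (0 = none)
    link = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            link[nxt] = fail[nxt] if out[fail[nxt]] else link[fail[nxt]]
    return goto, fail, out, link


def count_containing(list1, list2):
    """For each pattern in list2, count the strings in list1 that contain it."""
    goto, fail, out, link = build_automaton(list2)
    counts = [0] * len(list2)
    for text in list1:
        seen = set()  # pattern indices found in this text
        state = 0
        for ch in text:
            while state and ch not in goto[state]:
                state = fail[state]
            state = goto[state].get(ch, 0)
            s = state
            while s:  # collect every pattern that ends at this position
                seen.update(out[s])
                s = link[s]
        for idx in seen:
            counts[idx] += 1
    return counts


print(count_containing(['cat', 'caa', 'doa', 'oat'], ['at', 'ca', 'do']))
# [2, 2, 1]
```

The scan over list1 is roughly linear in the total text length plus the number of matches reported, but building the trie for 9 million patterns in pure Python will need a lot of memory, so treat this as a starting point rather than a tuned solution.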

Not sure how you could avoid having some sort of O(n**2) algorithm. Here is a simple implementation.
>>> def some_sort_of_count(list1, list2):
...     return [sum(x in y for y in list1) for x in list2]
...
>>> list1 = ['cat', 'caa', 'doa', 'oat']
>>> list2 = ['at', 'ca', 'do']
>>> some_sort_of_count(list1, list2)
[2, 2, 1]

Related

Add two list together where the start of each list will be added to the other

I feel like this concept is pretty easy but I'm just having a hard time grasping what to do.
Example:
list1: [super, cool, huge]
list2: [dog, cat, mouse]
What I'm trying to get is a single list being: [superdog, coolcat, hugemouse]
The best way to do this would be using list comprehension as posted by user d.b
[x + y for x, y in zip(list1, list2)]
I'm guessing from the question that you are new to Python and computer science in general (apologies if you are not and this offends you), so here is a bit of detail on how this works.
List comprehension is used for making lists (obviously). The syntax is:
[ expression for item in iterable if condition == True ]
zip() is a built-in function which combines two lists
zip(list1, list2)
# Returns <zip at 0x18db1d949c0>
list(zip(list1, list2))
# Returns [('super', 'dog'), ('cool', 'cat'), ('huge', 'mouse')]
You need to use zip(list1, list2) because you cannot put two iterables into a list comprehension this way. With zip, however, you pass in a single iterable containing the combination of both lists.
The items x and y refer to the first and second value in each of the tuples in the list. This means the expression x + y is 'super' + 'dog', which is 'superdog'.
You can use a while loop
i = 0
ans = [0, 0, 0]  # placeholder values for all indices
while i < 3:
    ans[i] = list1[i] + list2[i]
    i += 1
There are many ways to solve this problem. The list comprehension from @d.b and @stuupid is a compact way to do it, but might be a little confusing if you're just getting started in Python.
All of the techniques boil down to somehow iterating through a list. Let's start with your lists. I changed all of the items inside your lists to strings, since that's what I think you really wanted:
list1 = ['super', 'cool', 'huge']
list2 = ['dog', 'cat', 'mouse']
The most common way to iterate through a list might be with a for loop. You can do
for item in list1:
    print(item)
which gives
super
cool
huge
Using the range function is a very common construct for a for loop:
for i in range(3):
    print(i)
which gives
0
1
2
You can combine range() with len() to get the indexes of a list:
for i in range(len(list1)):
    print(i)
gives
0
1
2
Getting back to your problem. We can use the range(len()) trick to step through the indices of list1 and fill in a new list on the fly:
list1 = ['super', 'cool', 'huge']
list2 = ['dog', 'cat', 'mouse']
list3 = [] # new empty list
for i in range(len(list1)):
    list3.append(list1[i] + list2[i])
print(list3)
which gives
['superdog', 'coolcat', 'hugemouse']
This is a very basic way to approach this problem and it takes advantage of the fact that list1 and list2 have the same number of items. A safer way to solve the problem would be to do some len() comparisons between the lists to make sure they are the same length.
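A minimal sketch of that len() check, assuming you want an error rather than silently dropped items when the lists differ in length (zip stops at the shorter list). The name concat_pairs is just for illustration:

```python
def concat_pairs(list1, list2):
    """Join items pairwise; hypothetical helper that refuses unequal lengths."""
    if len(list1) != len(list2):
        raise ValueError("lists must have the same length")
    return [x + y for x, y in zip(list1, list2)]


list1 = ['super', 'cool', 'huge']
list2 = ['dog', 'cat', 'mouse']
print(concat_pairs(list1, list2))
# ['superdog', 'coolcat', 'hugemouse']
```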

Index error trying to deduplicate a list only using if and for

My goal is to write a program that removes duplicated elements from a list – E.g., [2,3,4,5,4,6,5] →[2,3,4,5,6] without the function set (only if and for)
At the end of my code I got stuck. I have tried to change everything in the if statement but it got me to nowhere, same error repeating itself here:
n=int(input('enter the number of elements in your list '))
mylist=[]
for i in range (n):
    for j in range (n):
        ele=input(' ')
        if mylist[i]!=mylist[j]:  # here is the error exactly; I don't really understand what the out-of-range problem has to do with this if statement
            mylist.append(ele)
print(mylist)
However, I changed nearly everything and I still got the following error:
if mylist[i]!=mylist[j]:
IndexError: list index out of range
Why does this issue keep coming back? P.S. I can't use the set function because I am required to use if and for only.
You can do this all with a list comprehension that checks if the value has previously appeared in the list slice that you have already looped over.
mylist = [1,2,3,3,4,5,6,4,1,2,4,3]
outlist = [x for index, x in enumerate(mylist) if x not in mylist[:index]]
print(outlist)
Output
[1, 2, 3, 4, 5, 6]
In case you're not familiar with list comprehensions yet, the above comprehension is functionally equivalent to:
outlist = []
for index, x in enumerate(mylist):
    if x not in mylist[:index]:
        outlist.append(x)
print(outlist)
A set would be more efficient, but
>>> def dedupe(L):
...     seen = []
...     return [seen.append(e) or e for e in L if e not in seen]
...
>>> dedupe([2,3,4,5,4,6,5])
[2, 3, 4, 5, 6]
This is not just a toy solution. Because sets can only contain hashable types, you might really have reason to resort to this kind of thing. This approach entails a linear search of the seen items, which is fine for small cases, but is slow compared to hashing.
It still may be possible to do better than this in some cases. Even if you can't hash them, if you can at least sort the seen elements, then you can speed up the search to logarithmic time by using a binary search. (See bisect).
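As a rough sketch of the bisect idea, assuming the elements can be compared with < and == but not necessarily hashed (the function name dedupe_sorted_seen is made up for this example):

```python
from bisect import bisect_left


def dedupe_sorted_seen(items):
    """Order-preserving dedupe for items that are sortable but maybe unhashable."""
    seen = []    # kept sorted so membership can use binary search
    result = []
    for item in items:
        pos = bisect_left(seen, item)
        if pos == len(seen) or seen[pos] != item:
            seen.insert(pos, item)  # keeps seen sorted
            result.append(item)
    return result


print(dedupe_sorted_seen([2, 3, 4, 5, 4, 6, 5]))
# [2, 3, 4, 5, 6]
print(dedupe_sorted_seen([[1, 2], [1], [1, 2]]))
# [[1, 2], [1]]
```

Only the search becomes logarithmic here; inserting into the middle of a Python list still shifts elements, so this helps most when comparisons are the expensive part.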

How can I order a list based on another lists order?

I have a strange problem so I'll just demo it for you to make it easier to understand. I have two lists:
>>> a = [1, 2, 20, 6, 210]
>>> b = [20, 6, 1]
The result I'm looking for is 3 (position of last matching item in list a based on matches in b)
b always has less data as it contains all duplicates of list a. I want to know which item in b matches furthest along in list a. So in this example, it would be 6, as 6 is the furthest along in a.
Is there an easy way to do this - my initial approach is a nested loop but I suspect there's a simpler approach?
The simplest (if not necessarily most efficient) code is just:
max(b, key=a.index)
or if you want the whole list sorted, not just the maximum value as described, one of:
b.sort(key=a.index)
or
sorted(b, key=a.index)
If duplicates in the reference list are a possibility and you need to get the last index of a value, replace index with one of these simulations of rindex.
Update: Addressing your requirement for getting the position, not the value, there is an alternative way to solve this that would involve less work. It's basically a modification of one of the better solutions to emulating rindex:
bset = frozenset(b)
last_a_index = len(a) - next(i for i, x in enumerate(reversed(a), 1) if x in bset)
This gets the work down to O(m + n) (vs. O(m * n) for other solutions) and it short-circuits the loop over a; it scans in reverse until it finds a value in b then immediately produces the index. It can trivially be extended to produce the value with a[last_a_index], since it doesn't really matter where it was found in b. More complex code, but faster, particularly if a/b might be huge.
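Another way to get O(m + n) work, sketched here as an alternative rather than as the approach above, is to build a value-to-index dict from a first. It does not short-circuit, but because later entries overwrite earlier ones it also handles duplicates in a by keeping the last index. The name last_position is hypothetical:

```python
def last_position(a, b):
    """Largest index in a of any value that also appears in b."""
    # Later occurrences overwrite earlier ones, so each value maps to its last index.
    last_index = {value: i for i, value in enumerate(a)}
    # Raises ValueError if no value of b occurs in a (max of an empty sequence).
    return max(last_index[value] for value in b if value in last_index)


a = [1, 2, 20, 6, 210]
b = [20, 6, 1]
print(last_position(a, b))
# 3
```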
Since you asked for the position:
>>> max(map(a.index, b))
3

Is there a parallel way to compare two big lists of integers?

I have a list of 8 million unique integers representing IDs of, let's say, books. The thing is, this list changes every semester (erased IDs, new IDs). Using only list comprehensions to get a new list of "new ids" or "erased ids" takes forever.
I did try two operations to look for the two previously described items (erased IDs, new IDs)
l1 = [1,2,3,4,5]
l2 = [0,2,3,4,6,7]
new_ids = [x for x in l2 if x not in l1]
erased_ids = [x for x in l1 if x not in l2]
Is there a parallel way to process these comparisons to get a better performance?
You could try to do this with multiprocessing, but that's not going to help you much as it will only cut the time to compute the answer in half. You said it takes forever, and forever / 2 is still forever. You need a different algorithm. Try sets
set1 = set(l1)
set2 = set(l2)
new_ids = list(set2 - set1)
erased_ids = list(set1 - set2)
Your algorithm runs in O(n^2). This is because [x for x in l2 if x not in l1] checks all of l1 for an x, for every x in l2. If l1 and l2 have 8 million elements each, that requires 8000000^2 = 64000000000000 checks.
Instead, a set is a data structure (which uses hashing internally) that can check for element membership in one operation, or O(1). In theory, checking if a number x is in a set takes the same amount of time whether the set has one element or 8 million. This is not true for a list.
Sets can also be subtracted. set2 - set1 means "the things in set2 that aren't in set1". This is done in O(n) time, I presume by doing n O(1) checks for membership.
Building the sets in the first place is also O(n) time, as adding to a set is an O(1) operation and you must do it to n elements.
Therefore, this whole algorithm runs in O(n) time.
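One caveat: converting a set difference back to a list does not keep the original order of the IDs. If order matters, a small variation (a sketch; the function name diff_ids is made up) keeps the list comprehensions but does the membership test against a set, which is still O(n) overall:

```python
def diff_ids(old_ids, current_ids):
    """Return (new_ids, erased_ids), each in its original list order."""
    old_set = set(old_ids)
    current_set = set(current_ids)
    new_ids = [x for x in current_ids if x not in old_set]     # O(1) lookup each
    erased_ids = [x for x in old_ids if x not in current_set]
    return new_ids, erased_ids


l1 = [1, 2, 3, 4, 5]
l2 = [0, 2, 3, 4, 6, 7]
print(diff_ids(l1, l2))
# ([0, 6, 7], [1, 5])
```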

How do you nest list comprehensions

I have been assigned homework and have spent hours running in circles attempting to nest comprehensions. Specifically, I am attempting to find vowels in a string (say, S = 'This is an easy assignment') and have it return the vowels from the string in a list (so, [1], [1], [1], [2], [3])
I figured out quickly [len(x) for x in S.lower().split()]
to give the length of the words, but cannot successfully get it to produce the required output. This problem cannot use anything but list comprehensions.
Something like:
[[len([v for v in x if v in 'aeiou'])] for x in S.lower().split()]
Not sure why you'd want the counts to be single element lists themselves, but that would do it with only list comprehensions (plus a len call).
Note: A suggested edit was to use [sum(x.count(v) for v in 'aeiou') for x in S.lower().split()]. This is a reasonable way to do it (though x.count means traversing the string five times; in CPython it might still be faster, but in JIT-ed interpreters a single pass is likely superior). To keep it to list comprehensions only, with no generator expressions, you'd have to pointlessly realize the list before summing: [sum([x.count(v) for v in 'aeiou']) for x in S.lower().split()]
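To see both variants side by side on the string from the question (the variable names single_pass and with_count are only for this illustration):

```python
S = 'This is an easy assignment'

# One pass per word, each count wrapped in its own single-element list.
single_pass = [[len([v for v in x if v in 'aeiou'])] for x in S.lower().split()]

# str.count once per vowel, summed over a realized list; yields plain integers.
with_count = [sum([x.count(v) for v in 'aeiou']) for x in S.lower().split()]

print(single_pass)  # [[1], [1], [1], [2], [3]]
print(with_count)   # [1, 1, 1, 2, 3]
```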
