Finding consecutive numbers from multiple lists in Python

Consider, for example, that I have 3 python lists of numbers, like this:
a = [1,5,7]
b = [2,6,8]
c = [4,9]
I need to be able to check if there are consecutive numbers from these lists, one number from each list, and return true if there are.
In the above example, 7 from list a, 8 from list b and 9 from list c are consecutive, so the returned value should be true. This should be extendable to any number of lists (the number of lists is not known in advance, because they are created on the fly based on prior conditions).
Also, a value in one list is never present in any other list. For example, list a above contains the element '1', so '1' is not present in any other list.
Is there a way to accomplish this? It seems simple, yet too complex. I am a Python newbie, and have been trying all sorts of loops but not even getting close to what I am looking for.
Looking for suggestions. Thanks in advance.
UPDATE: Here is the context for this question.
I am trying to implement a 'phrase search' in a sentence (which is part of a much bigger task).
Here is an example.
The sentence is:
My friend is my colleague.
I have created an index, which is a dictionary having the word as the key and a list of its positions as the value. So for the above sentence, I get:
{
'My': [0,3],
'friend': [1],
'is': [2],
'colleague': [4]
}
I need to search for the phrase 'friend is my' in the above sentence.
So I am trying to do something like this:
First get the positions of words in the phrase from the dictionary, to get:
{
'My': [0,3],
'friend': [1],
'is': [2],
}
Then check if the words in my phrase have consecutive positions, which goes back to my original question of finding consecutive numbers in different lists.
Here 'friend' is in position 1, 'is' is in position 2, and 'my' is in position 3, so I should be able to conclude that the given sentence contains my phrase.

Can you assume
lists are sorted?
O(n) memory usage is acceptable?
As a start, you could merge the lists and then check for consecutive elements. This isn't a complete solution because it would match consecutive elements that all appear in a single list (see comments).
from itertools import chain, pairwise

# from https://docs.python.org/3/library/itertools.html#itertools-recipes
def triplewise(iterable):
    "Return overlapping triplets from an iterable"
    # triplewise('ABCDEFG') --> ABC BCD CDE DEF EFG
    for (a, _), (b, c) in pairwise(pairwise(iterable)):
        yield a, b, c

def consecutive_numbers_in_list(*lists: list[int]) -> bool:
    big_list = sorted(chain(*lists))
    for first, second, third in triplewise(big_list):
        if (first + 1) == second == (third - 1):
            return True
    return False

consecutive_numbers_in_list(a, b, c)
# True
Note that itertools.pairwise requires Python 3.10 or later.
If the lists are sorted but you need constant memory, then you can use an n pointer approach in which you have a pointer to the first element of each list, then advance the lowest pointer on each iteration and keep track of the last three values seen at all times.
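One way to sketch the pointer idea in Python is to let heapq.merge do the pointer bookkeeping: it merges already-sorted inputs lazily and keeps only one pending element per list. The names has_consecutive_run and _tag are hypothetical, and the sketch assumes that each list is sorted ascending, that no value is shared between lists, and that the run must take exactly one number from each list.

```python
from collections import deque
from heapq import merge

def _tag(values, source):
    """Yield (value, source) pairs so we know which list each value came from."""
    for value in values:
        yield value, source

def has_consecutive_run(*lists):
    """Return True if len(lists) consecutive integers exist, one from each list.

    Assumption: every list is sorted ascending and values are unique overall.
    """
    n = len(lists)
    if n == 0:
        return False
    window = deque(maxlen=n)  # the last n values seen in merged order
    streams = [_tag(lst, i) for i, lst in enumerate(lists)]
    for item in merge(*streams):
        window.append(item)
        if len(window) == n:
            sources = {source for _, source in window}
            # Sorted, distinct values are consecutive when the span is n - 1.
            if window[-1][0] - window[0][0] == n - 1 and len(sources) == n:
                return True
    return False

a = [1, 5, 7]
b = [2, 6, 8]
c = [4, 9]
print(has_consecutive_run(a, b, c))
```

Apart from the merge heap and the window, both of size n (the number of lists), nothing else is stored, so memory does not grow with the list lengths.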
Ultimately, your question doesn't make that much sense, in that this doesn't seem like a typical programming task. If you are a newbie to programming, you can ask what you are trying to accomplish, instead of how to implement your candidate solution, and we might be able to suggest a better method overall. See https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
UPDATE
You are implementing phrase search. So an additional requirement, compared to the original question, is that the first list contain the first index of the sequence, the second list contain the second index of the sequence, etc. (As I assume that "friend my is" is not an acceptable search result for the query "my friend is".)
Pseudocode:
for each index i in the first list (j = 1):
    for each list j from the 2nd list to the nth list:
        see whether i + j - 1 appears in list j
    if it does for every j, the phrase starts at position i
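The pseudocode above can be sketched in Python roughly as follows. The function name phrase_positions is hypothetical, offsets are 0-based (so the check is start + offset rather than i + j - 1), and the index is assumed to use lowercased words as keys.

```python
def phrase_positions(position_lists):
    """Return every start position where word j of the phrase sits at start + j.

    position_lists[j] holds the positions of the j-th word of the phrase.
    """
    if not position_lists:
        return []
    # Sets give constant-time membership tests for the later words.
    later = [set(positions) for positions in position_lists[1:]]
    starts = []
    for start in position_lists[0]:
        if all(start + offset in lookup
               for offset, lookup in enumerate(later, start=1)):
            starts.append(start)
    return starts

# Assumed index for "My friend is my colleague", with lowercased keys.
index = {'my': [0, 3], 'friend': [1], 'is': [2], 'colleague': [4]}
phrase = ['friend', 'is', 'my']
print(phrase_positions([index[word] for word in phrase]))  # → [1]
```

A non-empty result means the phrase occurs in the sentence; this version does not require the position lists to be sorted.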
Depending on the characteristics of your data, you may find there are easier or more efficient approaches:
You can find all the documents matching all n of the search terms in the phrase, then do exact substring matching in each document.
If search phrases have a relatively short maximum token length, you can add n-grams to your search index.
This is a very general problem; you can look at implementations in popular search engines like Elasticsearch.

Related

Finding common string in list and displaying them

I am trying to create a function compare(lst1,lst2) which compares each element in the two lists, returns every common element in a new list, and shows a percentage of how similar they are. All the elements in the list are going to be strings. For example the function should return:
lst1 = AAAAABBBBBCCCCCDDDD
lst2 = ABCABCABCABCABCABCA
common strand = AxxAxxxBxxxCxxCxxxx
similarity = 25%
The parts of the list which are not similar will simply be returned as x.
I am having trouble in completing this function without the python set and zip method. I am not allowed to use them for this task and I have to achieve this using while and for loops. Kindly guide me as to how I can achieve this.
This is what I came up with.
lst1 = 'AAAAABBBBBCCCCCDDDD'
lst2 = 'ABCABCABCABCABCABCA'
common_strand = ''
score = 0
for i in range(len(lst1)):
    if lst1[i] == lst2[i]:
        common_strand = common_strand + str(lst1[i])
        score += 1
    else:
        common_strand = common_strand + 'x'
print('Common Strand: ', common_strand)
print('Similarity Score: ', score/len(lst1))
Output:
Common Strand: AxxAxxxBxxxCxxCxxxx
Similarity Score: 0.2631578947368421
You have two strings A and B. Strings are ordered sequences of characters.
Suppose both A and B have equal length (the same number of characters). Choose some position i < len(A), len(B) (remember Python sequences are 0-indexed). Your problem statement requires:
1. If character i in A is identical to character i in B, yield that character
2. Otherwise, yield some placeholder to denote the mismatch
How do you find the ith character in some string A? Take a look at Python's string methods. Remember: strings are sequences of characters, so Python strings also implement several sequence-specific operations.
If len(A) != len(B), you need to decide what to do when position i exists in one string but lies past the end of the other. You might think to represent these with the same placeholder as in (2).
If you know how to iterate the result of zip, you know how to use for loops. All you need is a way to iterate over the sequence of indices. Check out the language built-in functions.
Finally, for your measure of similarity: if you've compared n characters and found that N <= n are mismatched, you can define 1 - (N / n) as your measure of similarity. This works well for equally-long strings (for two strings with different lengths, you're always going to be calculating the proportion relative to the longer string).
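Putting these hints together, a minimal sketch might look like the following. It uses only a for loop over indices (no zip or set); the placeholder parameter and the choice to count positions past the end of the shorter string as mismatches are assumptions made for illustration.

```python
def compare(a, b, placeholder='x'):
    """Return (common_strand, similarity) for two strings, compared by position."""
    length = max(len(a), len(b))
    common = ''
    mismatches = 0
    for i in range(length):
        # A position beyond the end of either string counts as a mismatch.
        if i < len(a) and i < len(b) and a[i] == b[i]:
            common += a[i]
        else:
            common += placeholder
            mismatches += 1
    # 1 - (N / n), measured against the longer string; two empty strings match fully.
    similarity = 1 - mismatches / length if length else 1.0
    return common, similarity

strand, similarity = compare('AAAAABBBBBCCCCCDDDD', 'ABCABCABCABCABCABCA')
print(strand)      # → AxxAxxxBxxxCxxCxxxx
print(similarity)  # 5 matches out of 19 positions, about 0.26
```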

Backtrack in python but stop going deeper if condition is met

So I understand the concept of backtracking and looked through the code here to implement the simple string permutation algorithm.
My problem however is the following:
I have a list of n lists of the form:
l = [[1,2,3],[4,3,2],[7,6,5],[1,2,9],[6,8]]
Say I sort these lists by least number of occurrence of an element and place in a dictionary where the key represents the number of occurrence of an element, like so:
d = {1:[[7,6,5],[6,8],[4,3,2],[1,2,9]], 2:[[1,2,3],[1,2,9],[4,3,2],[7,6,5],[6,8]], 3:[[1,2,3],[4,3,2],[1,2,9]]}
So what that dictionary represents is the following:
The numbers 4, 5, 7, 8 and 9 occur only once so all lists containing them are put into a list with the key 1.
The numbers 1, 3 and 6 occur twice and all lists containing them are put into a list with the corresponding key 2.
The number 2 occurs 3 times so all lists containing it are put in a list of lists with key 3.
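As a rough sketch of how such a dictionary could be built, the snippet below counts each number with collections.Counter and files every list under the occurrence counts of its elements. The function name group_by_occurrence is hypothetical, each number is assumed to appear at most once within a single list, and the order of lists under each key may differ from the hand-written dictionary above.

```python
from collections import Counter, defaultdict

def group_by_occurrence(lists):
    """Map each occurrence count to the lists containing an element with that count."""
    counts = Counter(x for sub in lists for x in sub)
    groups = defaultdict(list)
    for sub in lists:
        # A list is filed once under every distinct count among its elements.
        for occurrences in sorted({counts[x] for x in sub}):
            groups[occurrences].append(sub)
    return dict(groups)

l = [[1, 2, 3], [4, 3, 2], [7, 6, 5], [1, 2, 9], [6, 8]]
d = group_by_occurrence(l)
for key in sorted(d):
    print(key, d[key])
```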
Now, in order to find a set cover for the set of numbers from 1 to 9, I know I will NEED to use all the lists under key 1. If these cover the entire range from 1 to 9 (in this case they do), my job is done. Otherwise, I will need to backtrack with EACH list in the list of lists corresponding to key 2; if I find a combination that covers the range, I stop. If none of them do, I move on to the lists corresponding to key 3.
How would one achieve this in python and is this a good way of implementing it?
I realize the question is hard to explain in plain English so if it's unclear, please let me know.

comparing python dictionary values

I'm creating a very basic search engine in Python and am working on a method for handling phrase queries: if the positions of two words are within 1 of each other, they are next to each other in the document, and the method should output all document numbers where this happens.
I currently have a dictionary which looks like this
{'8':[['1170', '1264', '1307', '1559', '1638'], ['197', '1169']],
'6':[['345', '772'], ['346']]}
This is just a layout example.
Here w = word and p = position:
{doc1: [[w1p1, w1p2, w1p3], [w2p1, w2p2]]}
The key is the document id, followed by the positions of the 1st word in that document, then the positions of the 2nd word. There will be as many groups of positions as there are words in the query.
My question is: is there a way I can compare the 1st, 2nd, 3rd, etc. groups of values for the same document id? I want to check whether one word's position is exactly +1 from the previous word's.
For example, in doc 6 word 2 follows word 1, so that key would be returned.
There are a couple of ways to achieve what you're trying to do here. I'm assuming, based on the example you gave, that there are always only two words and that the lists are always ordered.
No matter what the method, you'll want to iterate over the documents (the dictionary). Iterating over dictionaries is simple in Python; you can see an example here. After that, the steps differ.
First option - less efficient, slightly simpler:
Iterate over each item (location) in list 1 (the locations of the first word).
Iterate over each item (location) in list 2 (the locations of the second word).
Compare the two locations, and if they're within 1 of each other, return the document id.
Example:
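# wrapped in a function so that return is valid
def find_first_match(docdictionary):
    for documentNumber in docdictionary:
        for word1location in docdictionary[documentNumber][0]:
            for word2location in docdictionary[documentNumber][1]:
                # positions are stored as strings in the example, so convert before subtracting
                if abs(int(word1location) - int(word2location)) == 1:
                    return documentNumber
    return None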
Second Option - more efficient, slightly more complicated:
Start at the beginning of each list of word locations, keeping track of where you are
Check the two values at the locations you're at.
If the two values are 1 word apart, return the document number
If the two values are not, check which list item (page position), has a lower value and move to the next item in that list, repeat
If one of the lists (ex. list 1) runs out of numbers, and the other list (list 2) is at a value that is greater than the last value of the first (list 1), return None.
Example:
def find_first_match_sorted(docdictionary):
    for documentNumber in docdictionary:
        list1 = docdictionary[documentNumber][0]
        list2 = docdictionary[documentNumber][1]
        list1pos = 0
        list2pos = 0
        # stop as soon as either list runs out: there will be no more matches
        while list1pos < len(list1) and list2pos < len(list2):
            # positions are stored as strings in the example, so convert before subtracting
            difference = int(list1[list1pos]) - int(list2[list2pos])
            if abs(difference) == 1:
                return documentNumber
            if difference < 0:  # Page location 2 is greater
                list1pos += 1
            else:  # Page location 1 is greater
                list2pos += 1
    return None
As a reminder, option 2 only works if the lists are always sorted. Also, you don't always need to return the document id right away. You could just add the document id to a list if you want all the documents that the pair happens in instead of the first one it finds. You could even use a dictionary to easily keep track of how many times the word pair appears in each document.
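The dictionary idea mentioned above could be sketched like this. The function name count_adjacent_pairs is hypothetical; the sketch keeps the "within 1 of each other" rule from the examples, assumes exactly two words per document, and converts the string positions to integers.

```python
def count_adjacent_pairs(docdictionary):
    """Map each document id to the number of times the two words are adjacent."""
    counts = {}
    for doc_id, (first_positions, second_positions) in docdictionary.items():
        second = {int(p) for p in second_positions}  # set for fast lookups
        hits = 0
        for position in first_positions:
            position = int(position)
            # Count a hit when the second word sits directly before or after.
            hits += (position - 1 in second) + (position + 1 in second)
        if hits:
            counts[doc_id] = hits
    return counts

docs = {
    '8': [['1170', '1264', '1307', '1559', '1638'], ['197', '1169']],
    '6': [['345', '772'], ['346']],
}
print(count_adjacent_pairs(docs))  # → {'8': 1, '6': 1}
```

Because it looks positions up in a set, this version does not depend on the lists being sorted. To require that word 2 strictly follows word 1, only the position + 1 check would be kept.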
Hope this helped! Please let me know if anything isn't clear.

Collapse list of lists to eliminate redundancy

I have a couple of long lists of lists of related objects that I'd like to group to reduce redundancy. Pseudocode:
>>>list_of_lists = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]...]
>>>remove_redundancy(list_of_lists)
[[1,2,3,4,8,9,10],[5,6,7]...]
So lists that contain the same elements would be collapsed into single lists. Collapsing them is easy, once I find lists to combine I can make the lists into sets and take their union, but I'm not sure how to compare the lists. Do I need to do a series of for loops?
My first thought was that I should loop through and check whether each item in a sublist is in any of the other lists, if yes, merge the lists and then start over, but that seems terribly inefficient. I did some searching and found this: Python - dividing a list-of-lists to groups but my data isn't structured. Also, my actual data is a series of strings and thus not sortable in any meaningful sense.
I can write some gnarly looping code to make this work, but I was wondering if there are any built-in functions that would make this sort of comparison easier. Maybe something in list comprehensions?
I think this is a reasonably efficient way of doing it, if I understand your question correctly. The result here will be a list of sets.
Maybe the missing bit of knowledge was d & g (also written d.intersection(g)) for finding the set intersection, along with the fact that an empty set is "falsey" in Python.
data = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]]
result = []
for d in data:
    d = set(d)
    matched = [d]
    unmatched = []
    # first divide into matching and non-matching groups
    for g in result:
        if d & g:
            matched.append(g)
        else:
            unmatched.append(g)
    # then combine all matching groups into one group
    # while leaving unmatched groups intact
    result = unmatched + [set().union(*matched)]
print(result)
# [set([5, 6, 7]), set([1, 2, 3, 4, 8, 9, 10])]
We start with no groups at all (result = []). Then we take the first list from the data. We then check which of the existing groups intersect this list and which don't. Then we merge all of these matching groups along with the list (achieved by starting with matched = [d]). We don't touch the non-matching groups (though maybe some of these will end up being merged in a later iteration). If you add a line print(result) in each loop you should be able to see how it's built up.
The union of all the sets in matched is computed by set().union(*matched). For reference:
Pythonic Way to Create Union of All Values Contained in Multiple Lists
What does the Star operator mean?
I assume that you want to merge lists that contain any common element.
Here is a function that checks (efficiently, to the best of my knowledge) whether any two lists contain at least one common element (according to the == operator):
import functools  # python 2.5+

def seematch(X, Y):
    return functools.reduce(lambda x, y: x | y, functools.reduce(lambda x, y: x + y, [[k == l for k in X] for l in Y]))
It would be even faster if you used a reduce that can be interrupted as soon as it finds "true", as described here:
Stopping a Reduce() operation mid way. Functional way of doing partial running sum
I was trying to find an elegant way to iterate fast after having that in place, but I think a good way would be simply looping once and creating another container that will contain the "merged" lists. You loop once over the lists contained in the original list and, for each one, over every new list created in the proxy list.
Having said that - it seems there might be a much better option - see if you can do away with that redundancy by some sort of book-keeping on the previous steps.
I know this is an incomplete answer - hope that helped anyway!

Python: Listing the duplicates in a list

I am fairly new to Python and I am interested in listing duplicates within a list. I know how to remove the duplicates ( set() ) within a list and how to list the duplicates within a list by using collections.Counter; however, for the project that I am working on this wouldn't be the most efficient method to use since the run time would be n(n-1)/2 --> O(n^2) and n is anywhere from 5k-50k+ string values.
So, my idea is that since python lists are linked data structures and are assigned to the memory when created that I begin counting duplicates from the very beginning of the creation of the lists.
List is created and the first index value is the word 'dog'
Second index value is the word 'cat'
Now, it would check if the second index is equal to the first index, if it is then append to another list called Duplicates.
Third index value is assigned 'dog', and the third index would check if it is equal to 'cat' then 'dog'; since it matches the first index, it is appended to Duplicates.
Fourth index is assigned 'dog', but it would check the third index only, and not the second and first, because now you can assume that since the third and second are not duplicates that the fourth does not need to check before, and since the third/first are equal, the search stops at the third index.
My project gives me these values and appends them to a list, so I would want to implement the above algorithm because I don't care how many duplicates there are, I just want to know if there are duplicates.
I can't think of how to write the code, but I figured the basic structure of it, but I might be completely off (using random numgen for easier use):
for x in xrange(0,10):
    list1.append(x)
    for rev, y in enumerate(reversed(list1)):
        while x is not list1(y):
            cond()
            if ???
I really don't think you'll get better than a collections.Counter for this:
from collections import Counter
c = Counter(mylist)
duplicates = [ x for x,y in c.items() if y > 1 ]
Building the Counter should be O(n) (unless you're using keys which are particularly bad for hashing, but in my experience you need to try pretty hard to make that happen), and then getting the duplicates list is also O(n), giving you a total complexity of O(2n) == O(n) (for typical uses).
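If all that matters is whether any duplicate exists, a set-based scan that stops at the first repeat is another option. This is a sketch rather than part of the answer above: the name has_duplicates is hypothetical, and it assumes the values are hashable (strings are).

```python
def has_duplicates(items):
    """Return True as soon as a value is seen for the second time."""
    seen = set()
    for item in items:
        if item in seen:
            return True  # stop early; later items are never inspected
        seen.add(item)
    return False

print(has_duplicates(['dog', 'cat', 'dog', 'dog']))  # → True
print(has_duplicates(['dog', 'cat', 'bird']))        # → False
```

Each membership test and insertion on a set is O(1) on average, so the whole scan is O(n) in the worst case and can finish much sooner when a duplicate appears early.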
