Python - Find matching elements between and within nested lists

Background: I have a lengthy script which calculates possible chemical formulae for a given mass (based on a number of criteria), and outputs (amongst other things) a code which corresponds to the 'class' of compounds to which that formula belongs. I calculate formulae from batches of masses which should all be members of the same class. However, given the limits of the instrumentation etc., it is possible to calculate several possible formulae for each mass. I need to check whether any of the classes calculated are common to all peaks, and if so, return the position of the match etc.
I'm struggling to work out how to write an iterative if/for loop which checks every combination for matches (in an efficient way).
(The original post included an image summarising the issue and screenshots of the data structure.)
As you can see, I have a list called "formulae" which has a variable number of elements (in this case, 12).
Each element in formulae is a list, again with a variable number of elements.
Each element within those lists is a list, containing 15 7 elements. I wish to compare the 11th element amongst different elements.
I.e.
formulae[0][0][11] == formulae[1][0][11]
formulae[0][0][11] == formulae[1][1][11]
...
formulae[0][1][11] == formulae[11][13][11]
I imagine the answer might involve a couple of nested for and if statements, but I can't get my head around it.
I then will need to export the lists which matched (like formulae[0][0]) to a new array.
Unless I'm doing this wrong?
Thanks for any help!
EDIT:
1- My data structure has changed slightly, and I need to check that elements [?][?][4] and [?][?][5] and [?][?][6] and [?][?][7] all match the corresponding elements in another list.
I've attempted to adapt some of the code suggested, but can't quite get it to work...
check_O = 4
check_N = 5
check_S = 6
check_Na = 7
# start with base (left-hand) formula
nbase_i = len(formulae)
for base_i in range(len(formulae)): # length of first index
    for base_j in range(len(formulae[base_i])): # length of second index
        count = 0
        # check against comparison (right-hand) formula
        for comp_i in range(len(formulae)): # length of first index
            for comp_j in range(len(formulae[comp_i])): # length of second index
                if base_i != comp_i:
                    o_test = formulae[base_i][base_j][check_O] == formulae[comp_i][comp_j][check_O]
                    n_test = formulae[base_i][base_j][check_N] == formulae[comp_i][comp_j][check_N]
                    s_test = formulae[base_i][base_j][check_S] == formulae[comp_i][comp_j][check_S]
                    na_test = formulae[base_i][base_j][check_Na] == formulae[comp_i][comp_j][check_Na]
                    if o_test == n_test == s_test == na_test == True:
                        count = count +1
                    else:
                        count = 0
                    if count < nbase_i:
                        print base_i, base_j, comp_i,comp_j
                        o_test = formulae[base_i][base_j][check_O] == formulae[comp_i][comp_j][check_O]
                        n_test = formulae[base_i][base_j][check_N] == formulae[comp_i][comp_j][check_N]
                        s_test = formulae[base_i][base_j][check_S] == formulae[comp_i][comp_j][check_S]
                        na_test = formulae[base_i][base_j][check_Na] == formulae[comp_i][comp_j][check_Na]
                        if o_test == n_test == s_test == na_test == True:
                            count = count +1
                        else:
                            count = 0
                    elif count == nbase_i:
                        matching = "Got a match! " + "[" +str(base_i) + "][" + str(base_j) + "] matches with " + "[" + str(comp_i) + "][" + str(comp_j) +"]"
                        print matching
                    else:
                        count = 0

I would take a look at using the in operator, such as:
agg = []
for x in arr:
    matched = [y for y in arr2 if x in y]
    agg.append(matched)

Prune's answer is not quite right; it should be like this:
check_index = 11
# start with base (left-hand) formula
for base_i in range(len(formulae)): # length of first index
    for base_j in range(len(formulae[base_i])): # length of second index
        # check against comparison (right-hand) formula
        for comp_i in range(len(formulae)): # length of first index
            for comp_j in range(len(formulae[comp_i])): # length of second index
                if formulae[base_i][base_j][check_index] == formulae[comp_i][comp_j][check_index]:
                    print "Got a match"
                    # Here you add whatever info *you* need to identify the match
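The four nested loops also compare each candidate with itself and report every pair twice. One possible way around that, sketched here in Python 3, is to flatten the structure and let itertools.combinations generate each pair once. The function name matching_pairs, the default indices (4, 5, 6, 7) borrowed from the question's edit, and the toy data are assumptions made for illustration only.

```python
from itertools import combinations
from operator import itemgetter


def matching_pairs(formulae, indices=(4, 5, 6, 7)):
    """Return ((peak, candidate), (peak, candidate)) pairs from different
    peaks whose values at `indices` are all equal."""
    key = itemgetter(*indices)
    # Flatten to (peak index, candidate index, comparison key).
    flat = [(i, j, key(cand))
            for i, peak in enumerate(formulae)
            for j, cand in enumerate(peak)]
    pairs = []
    for (i1, j1, k1), (i2, j2, k2) in combinations(flat, 2):
        if i1 != i2 and k1 == k2:  # skip candidates of the same peak
            pairs.append(((i1, j1), (i2, j2)))
    return pairs


# Toy data (hypothetical): three peaks, each candidate is a list of 8 values.
toy = [
    [[0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 2, 1, 0, 0]],
    [[9, 9, 9, 9, 2, 1, 0, 0]],
    [[5, 5, 5, 5, 2, 1, 0, 0], [5, 5, 5, 5, 1, 0, 0, 0]],
]
print(matching_pairs(toy))
```

This is still quadratic in the total number of candidates, so it mainly tidies the loops rather than speeding them up.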

I'm not sure I fully understand your data structure, so I'm not going to write code here, but I'll propose an idea: how about an inverted index?
That is, you scan the lists once, building a summary of where each value you are looking for occurs.
You could create a dictionary composed as follows:
{
    'ValueOfInterest1': [ (position1), (position2) ],
    'ValueOfInterest2': [ (positionX) ]
}
Then at the end you can look through the dictionary and see whether any of the values (basically lists) have length > 1.
Of course you'd need to find a position format that makes sense to you.
Just an idea.
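A minimal Python 3 sketch of that inverted index, under some assumptions: the key is the tuple of values at indices 4 to 7 (taken from the question's edit), a position is the pair (peak index, candidate index), and the function name common_to_all_peaks is invented here. Instead of testing for length > 1, it keeps only keys that occur in every peak, which is what the question asks for.

```python
from collections import defaultdict


def common_to_all_peaks(formulae, indices=(4, 5, 6, 7)):
    """Map each key that occurs in every peak to its (peak, candidate) positions."""
    index = defaultdict(list)  # key tuple -> list of positions
    for i, peak in enumerate(formulae):
        for j, cand in enumerate(peak):
            key = tuple(cand[k] for k in indices)
            index[key].append((i, j))
    n_peaks = len(formulae)
    # Keep a key only if the set of peaks it appears in covers all peaks.
    return {key: positions for key, positions in index.items()
            if len({i for i, _ in positions}) == n_peaks}


# Toy data (hypothetical): three peaks, each candidate is a list of 8 values.
toy = [
    [[0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 2, 1, 0, 0]],
    [[9, 9, 9, 9, 2, 1, 0, 0]],
    [[5, 5, 5, 5, 2, 1, 0, 0], [5, 5, 5, 5, 1, 0, 0, 0]],
]
print(common_to_all_peaks(toy))
```

Each candidate is visited once, so the scan is linear in the total number of candidates, and the matching lists can be exported afterwards via formulae[i][j] for each stored position.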

Does this get you going?
check_index = 11
# start with base (left-hand) formula
for base_i in range(len(formulae)): # length of first index
    for base_j in range(len(formulae[0])): # length of second index (assumes every sub-list is as long as formulae[0])
        # check against comparison (right-hand) formula
        for comp_i in range(len(formulae)): # length of first index
            for comp_j in range(len(formulae[0])): # length of second index
                if formulae[base_i][base_j][check_index] == formulae[comp_i][comp_j][check_index]:
                    print "Got a match"
                    # Here you add whatever info *you* need to identify the match

Related

Python: optimizing the Van Eck sequence

I am writing code in Python for the platform Coding Games. The code is about Van Eck's sequence, and I pass 66% of the "tests".
Everything works as expected; the problem is that the process runs out of the time allowed.
Yes, the code is slow.
I am not a Python programmer, and I would like to ask whether you could optimize this piece of code. If your method is complex (for example, if it uses vectorized data) rather than just swapping an if (which is easily understandable), please give a good explanation for your choice.
Here is my code for the problem:
import sys
import math

def LastSeen(array):
    startingIndex = 0
    lastIndex = len(array) - 1
    closestNum = 0
    for startingIndex in range(len(array)-1,-1,-1):
        if array[lastIndex] == array[startingIndex] and startingIndex != lastIndex :
            closestNum = abs(startingIndex - lastIndex)
            break
    array.append(closestNum)
    return closestNum

def calculateEck(elementFirst,numSeq):
    number = numSeq
    first = elementFirst
    result = 0
    sequence.append(first)
    sequence.append(0)
    number -= 2
    while number != 0 :
        result = LastSeen(sequence)
        number -= 1
    print(result)

firstElement = int(input())
numSequence = int(input())
sequence = []
calculateEck(firstElement,numSequence)
Here is my code without dictionaries; van_eck contains the sequence at the end. Usually I would use a dict to track the last position of each element to save runtime. Otherwise you would need to iterate over the list to find the last occurrence, which can take very long.
Instead of a dict, I simply initialized an array of sufficient size and used it like a dict. To determine its size, keep in mind that all numbers in the Van Eck sequence are either 0 or tell you how far away the last occurrence is. So none of the first n numbers of the sequence can be greater than n. Hence, you can just give the array a length equal to the size of the sequence you want to have in the end.
-1 means the element has not appeared before.
DIGITS = 100
van_eck = [0]
last_pos = [0] + [-1] * DIGITS
for i in range(DIGITS):
    current_element = van_eck[i]
    if last_pos[current_element] == -1:
        van_eck.append(0)
    else:
        van_eck.append(i - last_pos[current_element])
    last_pos[current_element] = i
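The dict variant mentioned above could look roughly like the sketch below. It assumes the question's input format (a first element and the number of terms, printing only the last term); the function name van_eck_nth is invented for illustration. Only the current term and the last-seen positions are kept, so the whole sequence never has to be rescanned.

```python
def van_eck_nth(first, n):
    """Return the n-th term (1-based) of the Van Eck style sequence that
    starts with `first`."""
    last_seen = {}  # value -> index of its most recent earlier occurrence
    current = first
    for i in range(n - 1):
        # Next term: distance back to the previous occurrence of `current`,
        # or 0 if `current` has not been seen before.
        if current in last_seen:
            nxt = i - last_seen[current]
        else:
            nxt = 0
        last_seen[current] = i
        current = nxt
    return current


print(van_eck_nth(0, 10))  # sequence 0, 0, 1, 0, 2, 0, 2, 2, 1, 6 -> prints 6
```

Each step is a single dictionary lookup and update, so the run time grows linearly with n instead of quadratically as in the backwards scan of LastSeen.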

Always have this error 'IndexError: string index out of range', when i have taken into consideration of index range

I need to write a program that prints the longest substring of a variable s in which the letters occur in alphabetical order.
E.g. for s = 'onsjdfjqiwkvftwfbx', it should return 'dfjq'.
As a beginner, I wrote the code below:
y=()
z=()
for i in range(len(s)-1):
    letter=s[i]
    while s[i]<=s[i+1]:
        letter+=s[i+1]
        i+=1
    y=y+(letter,)
    z=z+(len(letter),)
print(y[z.index(max(z))])
However, the above code always raises
IndexError: string index out of range.
It only produces the desired result once I change it to range(len(s)-3).
I would like to seek advice on:
1. Why does range(len(s)-1) lead to this error message? In order to take care of the index up to i+1, I have already reduced the range value by 1.
My rationale is: if the length of variable s is 14, it has indices 0-13, and range(14) produces the values 0-13. However, as my code involves the index i+1, the range is reduced by 1 to take care of this.
2. How do I amend the above code to produce the correct result?
If s = 'abcdefghijklmnopqrstuvwxyz', the above code with range(len(s)-3) raises IndexError: string index out of range again. Why? What's wrong with this code?
Any help is appreciated~
The reason for the out-of-range index is that in your inner while loop you are advancing i without checking its range. Your code is also very inefficient, as you have nested loops and you are doing a lot of relatively expensive string concatenation. A linear-time algorithm without concatenations would look something like this:
s = 'onsjdfjqiwkvftwfbcdefgxa'
# Start by assuming the longest substring is the first letter
# (longest_end is an exclusive end index, so it starts at 1)
longest_end = 1
longest_length = 1
length = 1
for i in range(1, len(s)):
    if s[i] > s[i - 1]:
        # If current character higher in order than previous increment current length
        length += 1
        if length > longest_length:
            # If current length, longer than previous maximum, remember position
            longest_end = i + 1
            longest_length = length
    else:
        # If not increasing order, reset current length
        length = 1
print(s[longest_end - longest_length:longest_end])
Regarding "1":
Actually, for this particular string, using range(len(s)-2) should also work.
The reason range(len(s)-1) breaks:
For 'onsjdfjqiwkvftwfbx', len() is equal to 18. Still, the maximum index you can refer to is 17 (since indexing starts at 0).
The for loop itself stops at i = 16, but the inner while loop keeps incrementing i. At some point i reaches 17 (which corresponds to len(s)-1), and the while comparison then tries to access s[i+1], which does not exist.
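To make the point concrete, here is a hedged sketch that keeps the while-loop shape of the question's code but checks the bound before reading s[i + 1]. The function name longest_alphabetical is made up for illustration, and the outer for loop is replaced by a while loop so that the increments made by the inner loop are not thrown away.

```python
def longest_alphabetical(s):
    """Return the longest run of letters in non-decreasing order
    (the first such run if several share the maximum length)."""
    best = ''
    i = 0
    while i < len(s):
        run = s[i]
        # Test the bound first: `and` short-circuits, so s[i + 1] is only
        # read when i + 1 is a valid index.
        while i + 1 < len(s) and s[i] <= s[i + 1]:
            run += s[i + 1]
            i += 1
        if len(run) > len(best):
            best = run
        i += 1
    return best


print(longest_alphabetical('onsjdfjqiwkvftwfbx'))  # prints dfjq
```

Because the bound is checked inside the inner loop, no adjustment such as len(s)-1 or len(s)-3 is needed, and a fully sorted string no longer runs off the end.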
Regarding "2":
The following should work:
current_output = ''
biggest_output = ''
for letter in s:
    if current_output == '':
        current_output += letter
    else:
        if current_output[-1]<=letter:
            current_output += letter
        else:
            if len(current_output) > len(biggest_output):
                biggest_output = current_output
            current_output = letter
if len(current_output) > len(biggest_output):
    biggest_output = current_output
print(biggest_output)

noobie Iterating over list to display series

I'm trying to generate the Fibonacci series here. I'm not necessarily looking for an answer specific to that series, but rather for why the loop I've created here won't generate a list with up to 20 values for an input of '0'.
So far I've tried appending within and before the loop. The result I get is [0, 1]. It doesn't seem to add to the list beyond that.
series = []
value = input("Enter an integer: \n")
i = int(value)
series.append(i)
if series[0] == 0:
    series.append(1)
    for i in series[2:20]:
        series[i]=series[i-1]+series[i-2]
        series.append(i)
print(series)
After series.append(1), your series contains only [0, 1], so series[2:20] == []: you iterate over nothing and fill in nothing. You also cannot assign to an index that does not exist yet, so series[i] = ... fails for indices you have not reached. You just need to append the values:
if series[0] == 0:
    series.append(1)
    for i in range(2, 20):
        series.append(series[i - 1] + series[i - 2])
series[2:20] returns the values of series from index 2 (included) to index 20 (excluded).
range(2, 20) generates the values from 2 (included) to 20 (excluded).
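The difference between the slice and the range can be seen directly in a short sketch. The helper name fib_series and its count parameter are assumptions added for illustration; the loop body is the same append shown above.

```python
def fib_series(count=20):
    """Return the first `count` Fibonacci numbers, starting 0, 1."""
    series = [0, 1][:count]
    for i in range(2, count):
        # Append instead of assigning: index i does not exist yet.
        series.append(series[i - 1] + series[i - 2])
    return series


start = [0, 1]
print(start[2:20])          # [] : slicing past the end is empty, not an error
print(list(range(2, 20)))   # the indices 2 .. 19, independent of the list
print(fib_series(20))
```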

Improving the time complexity of a function that returns the index of the first occurrence of an element in a list

UPDATE 1 (Oct.16): The original code had a few logic errors, which have been rectified. The updated code below should now produce the correct output for all lists L such that they meet the criteria for a special list.
I am trying to decrease the running time of the following function:
The "firstrepeat" function takes in a special list L and an index i, and produces the smallest index j such that L[i] == L[j]. In other words, whatever the element at L[i] is, the "firstrepeat" function returns the index of the first occurrence of this element in the list.
What is special about the list L?:
The list may contain repeated elements on the increasing side of the list, or the decreasing side, but not both. i.e [3,2,1,1,1,5,6] is fine but not [4,3,2,2,1,2,3]
The list is decreasing(or staying the same) and then increasing(or staying the same).
Examples:
L = [4,2,0,1,3]
L = [3,3,3,1,0,7,8,9,9]
L = [4,3,3,1,1,1]
L = [1,1,1,1]
Example Output:
Say we have L = [4,3,3,1,1,1]
firstrepeat(L,2) would output 1
firstrepeat(L,5) would output 3
I have the following code. I believe the complexity is O(log n) or better (though I could be missing something). I am looking for ways to improve the time complexity.
def firstrepeat(L, i):
    left = 0
    right = i
    doubling = 1
    #A Doubling Search
    #In doubling search, we start at one index and then we look at one step
    #forward, then two steps forward, then four steps, then 8, then 16, etc.
    #Once we have gone too far, we do a binary search on the subset of the list
    #between where we started and where we went too far.
    while True:
        if (right - doubling) < 0:
            left = 0
            break
        if L[i] != L[right - doubling]:
            left = right - doubling
            break
        if L[i] == L[right - doubling]:
            right = right - doubling
            doubling = doubling * 2
    #A generic Binary search
    while right - left > 1:
        median = (left + right) // 2
        if L[i] != L[median]:
            left = median
        else:
            right = median
    if L[left] == L[right]:
        return left
    else:
        return right
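When tuning a function like this, it can help to keep a plain linear scan around as a baseline to compare results against. The sketch below is only such a reference, not a faster replacement: the name firstrepeat_reference is invented here, and it relies on list.index, which returns the index of the first occurrence and takes O(n) time.

```python
def firstrepeat_reference(L, i):
    """Linear-time baseline: index of the first occurrence of L[i] in L."""
    return L.index(L[i])


# The example from the question.
L = [4, 3, 3, 1, 1, 1]
print(firstrepeat_reference(L, 2))  # prints 1
print(firstrepeat_reference(L, 5))  # prints 3
```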

Merging arrays slices in Python

So I have a string that looks like this:
data="ABCABDABDABBCBABABDBCABBDBACBBCDB"
And I am taking random 10 character slices out of it:
start=int(random.random()*100)
end = start+10
slice = data[start:start+10]
But what I am trying to do now is count the number of 'gaps' or 'holes' that were not sliced out at all.
slices_indices = []
for i in xrange(0,100):
    start=int(random.random()*100)
    end=start+10
    slice = data[start:end]
    ...
    slices_indices.append([start,end])
For instance, after running this a couple of times, I covered this much:
ABCAB DABD ABBCBABABDB C ABBDBACBBCDB
But this left two 'gaps' that no slice covered. Is there a 'Pythonic' way to find the number of these gaps? Basically, I am looking for a function count_gaps that counts them given the slice indices.
For the example above,
count_gaps(slices_indices)
would give me two.
Thanks in advance.
There are several, although all involve a bit of messing about
You could compare the removed strings against the original, and work out which characters you didn't hit.
That's a very roundabout way of doing it, though, and won't work properly if you ever have the same 10 characters in the string twice. eg 1234123 or something.
A better solution would be to store the values of i you use, then step back through the data string comparing the current position to the values of i you used (plus 10). If it doesn't match, job done.
eg (pseudo code)
# Make an array the same length as the string
charsUsed = array(data.length)
# Do whatever
for i in xrange(0,100)
    someStuffYouWereDoingBefore()
    # Store our "used chars" in the array
    for(char = i; char < i+10; char++)
        if(char < data.length) # Don't go out of bounds on the array!
            charsUsed[char] = true
Then to see which chars weren't used, just walk through charsUsed array and count whatever it is you want to count (consecutive gaps etc)
Edit in response to updated question:
I'd still use the above method to make a "which chars were used" array. Your count_gaps() function then just needs to walk through the array to "find" the gaps
eg (pseudo...something. This isn't even vaguely Python. Hopefully you get the idea though)
The idea is essentially to see if the current position is false (ie not used) and the last position is true (used) meaning it's the start of a "new" gap. If both are false, we're in the middle of a gap, and if both are true, we're in the middle of a "used" string
function find_gaps(array charsUsed)
{
    # Count the gaps
    numGaps = 0
    # What did we look at last (to see if it's the start of a gap)
    # Assume it's true if you want to count "gaps" at the start of the string, assume it's false if you don't.
    lastPositionUsed = true
    for(i = 0; i < charsUsed.length; i++)
    {
        if(charsUsed[i] == false && lastPositionUsed == true)
        {
            numGaps++
        }
        lastPositionUsed = charsUsed[i]
    }
    return numGaps
}
The other option would be to step through the charsUsed array again and "group" consecutive values into a smaller array, then count the value you want... essentially the same thing but with a different approach. In the example above I simply ignore the interior of every group, wanted or not, and count only the boundaries where a used group is followed by an unused one.
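The walk described in this answer could be written in Python 3 roughly as follows. It is a sketch under assumptions: chars_used is taken to be a list of booleans with one entry per character, and the names count_gaps_from_flags and count_leading are invented here.

```python
def count_gaps_from_flags(chars_used, count_leading=True):
    """Count runs of unused positions in a list of booleans.

    count_leading=True treats the (imaginary) position before the start as
    used, so an unused run at the very beginning counts as a gap.
    """
    num_gaps = 0
    last_used = count_leading
    for used in chars_used:
        # A gap starts where an unused position follows a used one.
        if not used and last_used:
            num_gaps += 1
        last_used = used
    return num_gaps


flags = [True, False, False, True, False]
print(count_gaps_from_flags(flags))  # prints 2
```

With count_leading=False, an unused run at the very start of the list is not counted, which matches the choice described in the pseudocode comment.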
It is a bit of a messy task, but I think sets are the way to go. I hope my code below is self-explanatory, but if there are parts you don't understand please let me know.
#! /usr/bin/env python
''' Count gaps.
    Find and count the sections in a sequence that weren't touched by random slicing
    From http://stackoverflow.com/questions/26060688/merging-arrays-slices-in-python
    Written by PM 2Ring 2014.09.27
'''
import random
from string import ascii_lowercase

def main():
    def rand_slice():
        start = random.randint(0, len(data) - slice_width)
        return start, start + slice_width

    #The data to slice
    data = 5 * ascii_lowercase
    print 'Data:\n%s\nLength : %d\n' % (data, len(data))
    random.seed(42)
    #A set to capture slice ranges
    slices = set()
    slice_width = 10
    num_slices = 10
    print 'Extracting %d slices from data' % num_slices
    for i in xrange(num_slices):
        start, end = rand_slice()
        slices |= set(xrange(start, end))
        data_slice = data[start:end].upper()
        print '\n%2d, %2d : %s' % (start, end, data_slice)
        data = data[:start] + data_slice + data[end:]
        print data
    #print sorted(slices)
    print '\nSlices:\n%s\n' % sorted(slices)
    print '\nSearching for gaps missed by slicing'
    unsliced = sorted(tuple(set(xrange(len(data))) - slices))
    print 'Unsliced:\n%s\n' % (unsliced,)
    gaps = []
    if unsliced:
        last = start = unsliced[0]
        for i in unsliced[1:]:
            if i > last + 1:
                t = (start, last + 1)
                gaps.append(t)
                print t
                start = i
            last = i
        t = (start, last + 1)
        gaps.append(t)
        print t
    print '\nGaps:\n%s\nCount: %d' % (gaps, len(gaps))

if __name__ == '__main__':
    main()
I'd use some kind of bitmap. For example, extending your code:
import random
data="ABCABDABDABBCBABABDBCABBDBACBBCDB"
slices_indices = [0]*len(data)
for i in xrange(0,100):
    start=int(random.random()*len(data))
    end=start + 10
    slice = data[start:end]
    slices_indices[start:end] = [1] * len(slice)
I've used a list here, but you could use any other appropriate data structure, probably something more compact, if your data is rather big.
So, we've initialized the bitmap with zeros, and marked with ones the selected chunks of data. Now we can use something from itertools, for example:
from itertools import groupby
groups = groupby(slices_indices)
groupby returns an iterator where each element is a tuple (key, group), with group itself an iterator over the run of consecutive equal values. To just count gaps you can do something simple, like:
gaps = len([x for x in groups if x[0] == 0])
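Put together as a function with the signature the question asks for, the bitmap idea might look like the Python 3 sketch below. The extra length parameter is an assumption (the slice indices alone do not say how long the data is), and slices are taken as [start, end) pairs clipped to the data length.

```python
from itertools import groupby


def count_gaps(slices_indices, length):
    """Count maximal runs of positions not covered by any [start, end) slice."""
    covered = [0] * length
    for start, end in slices_indices:
        # Clip the slice to the valid index range before marking it.
        for pos in range(max(start, 0), min(end, length)):
            covered[pos] = 1
    # groupby yields one (key, group) pair per run of equal values,
    # so every key equal to 0 is one gap.
    return sum(1 for key, _ in groupby(covered) if key == 0)


print(count_gaps([[0, 3], [5, 8]], 10))  # gaps at 3-4 and 8-9 -> prints 2
```

Overlapping slices are handled naturally, because marking a position twice has no further effect on the bitmap.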
