Get random unique regions of a list using python - python

I have a list of numbers, say list=[100,102,108,307,365,421,433,487,511,537,584].
I want to get random unique regions from this list, for example region 1 from 102-307, region 2 from 421-487 and region 3 from 511-584. These regions should be non-overlapping and unique.

I'll credit @TimPietzcker for pointing me in the direction of this answer, although I didn't use the function he offered (random.sample).
In this code, I choose six indices from those in list_ (renamed from list to avoid overwriting the built-in) without replacement, using np.random.choice. I then sort these indices and iterate over each pair of adjacent indices, taking as a region the values from the first index (i) to the second (j) in the pair, inclusive (hence the j + 1).
(If I had used j instead of j + 1, the indices would never be able to include all the values in list, due to the lack of replacement during the selection phase. For example, if one pair were (1, 3), the minimum value for the first index of the next pair would be 4, because 3 could not be chosen twice. Thus, the first pair would take the values at indices 1 and 2, and the value at 3 would be skipped.)
Since it's possible for j to be equal to len(list_) - 1, j + 1 can equal len(list_). That is not a problem: Python slices do not raise IndexError for out-of-range bounds, so list_[i:j + 1] simply runs through the end of list_, and no try/except is needed.
import numpy as np
list_ = [100,102,108,307,365,421,433,487,511,537,584]
n_regions = 3
# Pick 2 * n_regions distinct indices and sort them.
indices = sorted(np.random.choice(range(len(list_)), size=n_regions * 2,
                                  replace=False))
list_of_regions = []
# Pair the sorted indices up as (start, end) and slice inclusively.
for i, j in zip(indices[::2], indices[1::2]):
    list_of_regions.append(list_[i:j + 1])
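The same idea can be sketched with the standard library's random.sample (the function mentioned above), so that NumPy is not required. The helper name random_regions and its parameters are illustrative and not part of the original answer.

```python
import random


def random_regions(values, n_regions, rng=random):
    """Return n_regions non-overlapping slices of values, in order.

    Hypothetical helper: picks 2 * n_regions distinct indices, sorts them,
    and pairs them up as inclusive (start, end) bounds.
    """
    if 2 * n_regions > len(values):
        raise ValueError("need at least 2 * n_regions values")
    indices = sorted(rng.sample(range(len(values)), 2 * n_regions))
    # Slices never raise IndexError, so end + 1 == len(values) is safe.
    return [values[start:end + 1]
            for start, end in zip(indices[::2], indices[1::2])]


numbers = [100, 102, 108, 307, 365, 421, 433, 487, 511, 537, 584]
regions = random_regions(numbers, 3)
print(regions)
```

Passing a seeded random.Random instance as rng should make the result reproducible, which can help when testing.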

Related

Python - efficient way to find first occurences of multiple values

I have the following problem: for each of several values, I need to find the first occurrence in an array of a value greater than or equal to it.
Example:
array_1 = [-3,2,8,-1,0,5]
array_2 = [5,1]
The script has to find the position in array_1 of the first value greater than or equal to each value from array_2, so the expected result in this case would be [3,2] for 1-based indices.
A simple loop won't do for my case, as both arrays have close to a million values and it has to execute quickly, preferably in under a minute.
Simple loop solution, which has a run time of about half an hour:
solution = [None] * len(array_2)
for j in range(0, len(array_2)):
    for i in range(0, len(array_1)):
        if array_1[i] >= array_2[j]:
            solution[j] = i  # 0-based index
            break
Edit: indices clarification, as @Sergio Tulentsev correctly pointed out.
First perform some preprocessing on the data: create a new list that only has the values that are greater than all predecessors in the original data, and combine them in a tuple with the 1-based position where they were found.
So for instance, for the example data [-3,2,8,-1,0,5], this would be:
[(-3, 1), (2, 2), (8, 3)]
Note how the answer to any query can only be 1, 2 or 3, as the values at the other positions are all smaller than 8.
Then for each query use a binary search to find the first tuple whose left value is at least the queried value, and return the right value of the found tuple (the position). For the binary search you can rely on the bisect module:
import bisect

def solve(data, queries):
    # preprocessing: keep only values greater than all predecessors,
    # paired with their 1-based position
    maxima = []
    greatest = float("-inf")
    for i, val in enumerate(data):
        if val > greatest:
            greatest = val
            maxima.append((val, i+1))
    # main: (query,) sorts before any (query, position) tuple.
    # Note: raises IndexError if a query exceeds every value in data.
    return [maxima[bisect.bisect_left(maxima, (query,))][1]
            for query in queries]
Example use:
data = [-3,2,8,-1,0,5]
queries = [5,1]
print(solve(data, queries)) # [3, 2]
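For arrays with around a million values, the same running-maximum idea can also be sketched in vectorized NumPy. This is a variant and not the answer's own code: the name first_at_least and the -1 sentinel for queries that no value reaches are assumptions of this sketch.

```python
import numpy as np


def first_at_least(data, queries):
    """1-based position of the first value >= each query, or -1 if none.

    Illustrative helper; the -1 sentinel is an assumption of this sketch.
    """
    data = np.asarray(data)
    queries = np.asarray(queries)
    # running_max[i] is the largest value in data[:i + 1]; it never decreases,
    # so it can be binary-searched.
    running_max = np.maximum.accumulate(data)
    # First index whose running maximum reaches the query.
    idx = np.searchsorted(running_max, queries, side="left")
    return np.where(idx < len(data), idx + 1, -1)


print(first_at_least([-3, 2, 8, -1, 0, 5], [5, 1, 9]))
```

The first position where the running maximum reaches a query is also the first position where the data itself does, which is why no explicit list of maxima is needed here.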
I suggest using a loop over the first array and using max(array_2) for the second one.

Reducing nested-loops of a python question on array

for _ in range(int(input())):
    num=int(input())
    a=list(map(int,input().split()))[:num]
    sum=0
    for i in range(len(a)):
        j=a[i]
        count=0
        key=0
        for k in range(len(a)):
            if j==a[k]:
                key=k
        sum+=abs(key-i)
    print(sum)
Given an integer array, the task is to calculate the absolute difference between the indices of the first and last occurrence of every integer that is present in the array, and to sum these differences over all such integers.
Sample Input:
1 2 3 3 2
Sample Output:
4
Explanation: The elements which occur in the array are 1, 2, 3.
1 has occurred only once, so the answer for 1 is 0.
2 has two occurrences, at positions 2 and 5, so |5-2|=3.
3 has two occurrences, at positions 3 and 4, so |4-3|=1.
So total sum=0+3+1=4.
P.S.: The first loop is for test cases.
Please suggest how I can reduce the time complexity.
Initially you can create a dictionary keyed by each unique number and append every index of that number to its list; then, in a second loop, you can take the difference between the last and first index for each integer.
for _ in range(int(input())):
    num=int(input())
    a=list(map(int,input().split()))[:num]
    sum=0
    nums = {}
    for i in range(len(a)):
        j=a[i]
        if j not in nums:
            nums[j] = []
        nums[j].append(i)
    for key in nums:
        sum += abs(nums[key][-1] - nums[key][0])
    print(sum)
This answer uses the same reasoning as others: that is storing the indices as a list of values in a dictionary, but uses a few built-in functions and methods to reduce code and make it 'cleaner'.
In [11]: array = [1, 2, 3, 3, 2]
In [12]: indices = {}
In [13]: for ix, num in enumerate(array, start=1):
    ...:     indices.setdefault(num, []).append(ix)
    ...:
In [14]: total = 0
In [15]: for num, ixes in indices.items():
    ...:     if len(ixes) == 1:
    ...:         continue
    ...:     else:
    ...:         total += abs(ixes[-1] - ixes[0])
    ...:
In [16]: total
Out[16]: 4
enumerate is a function that creates a sequence of tuple pairs from a given sequence like a list. The first element is an "index" (by default, set to 0, but you can start from any integer) and the second is the actual value from the original sequence.
setdefault is a method on a dictionary that returns the value for a given key, but if that key doesn't exist, inserts the key and sets as its default value the item passed in as the second parameter; in this case, it's an empty list to store the indices.
items is again a method on dictionaries with which one can loop through one key-value pair at a time.
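Since only the first and last index of each value matter, the list of all indices can be replaced by two dictionaries. The following is a minimal sketch of that variant; the function name index_span_sum is made up for illustration.

```python
def index_span_sum(values):
    """Sum of (last index - first index) over every distinct value."""
    first = {}
    last = {}
    for ix, num in enumerate(values):
        first.setdefault(num, ix)  # only stored the first time num is seen
        last[num] = ix             # overwritten on every occurrence
    return sum(last[num] - first[num] for num in first)


print(index_span_sum([1, 2, 3, 3, 2]))  # 4
```

This keeps the single pass over the input while storing two integers per distinct value instead of a list per value.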
Sounds like HackerRank. As usual, most of the provided information of the problem is irrelevant and can be forgotten as soon as seen.
You need:
the index when an element occurs first: you subtract it from the total and put it into the dictionary
if the value is already in the dict, update the position only
at the end you sum all values of the dict and add that to your total
Code:
num = 5
a = list(map(int,"1 2 3 3 2".split()))
s = 0
d = {}
for idx, num in enumerate(a):
    if num not in d:
        s -= idx
    d[num] = idx
print(s+sum(d.values()))
Output:
4
This uses a dictionary and strictly 2 loops: one over the n given numbers and one over the u distinct numbers among them, if you ignore the int-conversion step, which already loops once over all of them.
Space:
the total sum and 1 index for each unique number, which makes it O(n+1) in the worst case (every number is unique)
Time:
normally you get O(n + u), which is less than the worst case (all numbers are unique) of O(2*n). The 2 is only a constant factor, so it is essentially O(n) in time.
If you still have time problems, using a collections.defaultdict(int) might be faster.
Solution 1 (dict):
One way to do it is to use a dictionary that saves all indices for each item and take the difference between the last and first occurrence of each item. See below:
def get_sum(a):
    d={i:[] for i in set(a)}
    for i in range(len(a)):
        d[a[i]].append(i)
    sum=0
    for i in d.values():
        sum+=i[-1]-i[0]
    return sum
Solution 2 (reversed list):
Another way is to use the reversed list and use list.index(item) for both original and reverse list and get the difference. See below:
def get_sum2(a):
    m=a[::-1]
    return sum(len(m)-m.index(i)-1-a.index(i) for i in set(a))
Output:
>>> get_sum([1,2,3,3,2])
4
>>> get_sum2([1,2,3,3,2])
4

Matching the first element of a list with other first elements in the list

I am trying to solve the question given in this video https://www.youtube.com/watch?reload=9&v=XCeDBWI4sa4
My list contains sub-lists, each holding the digits of one number as strings.
Example: I turned my list of strings
['58','12','50','17'] into four sub-lists like so [['5','8'],['1','2'],['5','0'],['1','7']] because I want to compare the first digit of each number, and if the first digits are equal, I increment the variable "pair", which is currently 0. pair=0
Since 58 and 50 have the same first digit, they constitute a pair; the same goes for 12 and 17. Also, a pair can only be made if both numbers are at even positions or both are at odd positions. 58 and 50 are at even indices, hence they satisfy the condition. Also, at most two pairs can be made for the same first digit. So 51, 52, 53 would constitute only 2 pairs instead of three. How do I check this? A simple solution will be appreciated.
list_1=[['5','8'],['1','2'],['5','0'],['1','7']]
and test_list= ['58','12','50','17']
for i in range(0,len(test_list)):
    for j in range(1,len(test_list)):
        if (list_1[i][0] == list_1[j][0] and (i,j%2==0 or i,j%2==1)):
            pair =pair+1
print (pair)
That is what I came up with but I am not getting the desired output.
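One likely reason the condition misbehaves: (i,j%2==0 or i,j%2==1) is parsed as a three-element tuple, and a non-empty tuple is always truthy, so parity is never actually tested. Below is a minimal sketch of a working parity check. It assumes a "pair" is any two entries i < j that share a first digit and an index parity, and it does not apply the "at most two pairs" cap; the function name is illustrative.

```python
from itertools import combinations


def count_same_parity_pairs(numbers):
    """Count index pairs i < j with equal first digit and equal index parity."""
    pairs = 0
    for (i, a), (j, b) in combinations(enumerate(numbers), 2):
        # i % 2 == j % 2 is True when both indices are even or both are odd.
        if a[0] == b[0] and i % 2 == j % 2:
            pairs += 1
    return pairs


print(count_same_parity_pairs(['58', '12', '50', '17']))  # 2
```

Using combinations also avoids comparing an element with itself or counting the same pair twice, which the nested ranges in the question would do.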
pair = 0
val_list = ['58','12','50','17', '57', '65', '51']
first_digit, visited_item_list = list(), list()
for item in val_list:
    curr = int(item[0])
    first_digit.append(curr)
for item in first_digit:
    if item not in visited_item_list:
        occurences = first_digit.count(item)
        if occurences % 2 == 0:
            pair = pair + occurences // 2
        visited_item_list.append(item)
print(pair)
Use collections.Counter to count occurrences of each first digit. Sum up the counts and subtract the number of distinct digits (so that each digit's count effectively starts from 0).
Iterate over even and odd positions separately.
Uncomment #return sum(min(x, 2) for x in c) - len(c) if you never want a digit to count more than twice. E.g. for [51,52,53,54,56,57,58,59,50,...] the result stops growing no matter how many more 5X you add (min(x, 2) guarantees that each count never exceeds 2).
from collections import Counter
a = ['58','12','50','17','50','18']
def dupes(a):
    c = Counter(a).values() # count instances of each element in a, get the counts
    #return sum(min(x, 2) for x in c) - len(c) # maximum value of 2 for counts
    return sum(c) - len(c) # sum up all the counts, subtract unique elements (you want the counts starting from 0)
even = dupes(a[x][0] for x in range(0, len(a), 2))
# a[x][0]: first digit of even a elements
# range(0, len(a), 2): range of numbers from 0 to length of a, skip by 2 (evens)
# call dupes(first digits of the even elements)
odd = dupes(a[x][0] for x in range(1, len(a), 2))
# same for odd
print(even+odd)
Here's a fairly simple solution:
import collections
l= [['5','8'],['1','2'],['5','0'],['1','7']]
c = collections.Counter([i[0] for i in l])
# Counter counts the occurrences of items in a list (or other
# collection). After the previous line, c is
# Counter({'5': 2, '1': 2})
print(sum(n - 1 for n in c.values()))
The output, in this case, is 2.

Removing points from list if distance between 2 points is below a certain threshold

I have a list of points and I want to keep a point only if its distance from the previously kept point is greater than a certain threshold. So, starting from the first point, if the distance between the first point and the second is less than the threshold, I would remove the second point and then compute the distance between the first one and the third one. If this distance is also less than the threshold, compare the first and fourth points. Otherwise, keep the third point and move on to the distance between the third and fourth, and so on.
So for example, if the threshold is 2 and I have
list = [1, 2, 5, 6, 10]
then I would expect
new_list = [1, 5, 10]
Thank you!
Not a fancy one-liner, but you can just iterate over the values in the list and append each one to a new list if its distance from the last value in the new list (accessed with [-1]) is at least the threshold:
lst = range(10)
diff = 3
new = []
for n in lst:
    if not new or abs(n - new[-1]) >= diff:
        new.append(n)
Afterwards, new is [0, 3, 6, 9].
Concerning your comment "What if i had instead a list of coordinates (x,y)?": In this case you do exactly the same thing, except that instead of just comparing the numbers, you have to find the Euclidean distance between two points. So, assuming lst is a list of (x,y) pairs:
if not new or ((n[0]-new[-1][0])**2 + (n[1]-new[-1][1])**2)**.5 >= diff:
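Put together, the (x, y) version might look like the sketch below. It uses math.dist, which requires Python 3.8 or later, in place of the hand-written formula; the function name thin_points and the sample points are illustrative.

```python
import math


def thin_points(points, min_dist):
    """Keep a point only if it is at least min_dist from the last kept point."""
    kept = []
    for p in points:
        if not kept or math.dist(p, kept[-1]) >= min_dist:
            kept.append(p)
    return kept


print(thin_points([(x, x) for x in range(10)], 3))
```

Because math.dist accepts points of any matching dimension, the same function should also work for (x, y, z) tuples.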
Alternatively, you can convert your (x,y) pairs into complex numbers. For those, basic operations such as addition, subtraction and absolute value are already defined, so you can just use the above code again.
lst = [complex(x,y) for x,y in lst]
new = []
for n in lst:
    if not new or abs(n - new[-1]) >= diff: # same as in the first version
        new.append(n)
print(new)
Now, new is a list of complex numbers representing the points: [0j, (3+3j), (6+6j), (9+9j)]
While the solution by tobias_k works, it does not necessarily keep the largest possible number of points (in my opinion, but I may be overlooking something). It is based on list order and does not consider that the element which is close (within the threshold) to the largest number of other elements should be eliminated last. The element that has the fewest such connections (or proximities) should be considered and checked first. The approach I suggest will likely allow retaining the maximum number of points that are outside the specified threshold from the other elements in the given list. This works well for a list of vectors, and therefore for x,y or x,y,z coordinates. If, however, you intend to use this solution with a list of scalars, you can simply include this line in the code: orig_list=np.array(orig_list)[:,np.newaxis].tolist()
Please see the solution below:
import numpy as np
thresh = 2.0
orig_list=[[1,2], [5,6], ...]
nsamp = len(orig_list)
arr_matrix = np.array(orig_list)
# Pairwise Euclidean distances between all points.
distance_matrix = np.zeros([nsamp, nsamp], dtype=float)  # np.float was removed in NumPy 1.24
for ii in range(nsamp):
    distance_matrix[:, ii] = np.apply_along_axis(
        lambda x: np.linalg.norm(np.array(x) - np.array(arr_matrix[ii, :])),
        1,
        arr_matrix)
# Number of points within the threshold of each point (each point counts itself).
n_proxim = np.apply_along_axis(lambda x: np.count_nonzero(x < thresh),
                               0,
                               distance_matrix)
# Visit the points with the fewest close neighbours first.
idx = np.argsort(n_proxim).tolist()
idx_out = list()
for ii in idx:
    for jj in range(ii+1):
        if ii not in idx_out:
            if distance_matrix[ii, jj] < thresh:
                if ii != jj:
                    idx_out.append(jj)
# Pop from the highest index down so earlier pops do not shift later ones.
pop_idx = sorted(np.unique(idx_out).tolist(),
                 reverse=True)
for pop_id in pop_idx:
    orig_list.pop(pop_id)
nsamp = len(orig_list)
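The distance matrix and the proximity counts can also be computed without apply_along_axis, using broadcasting. The sketch below only covers that first step, not the elimination loop; the function name pairwise_distances and the three sample points are made up for illustration.

```python
import numpy as np


def pairwise_distances(points):
    """Return the (n, n) matrix of Euclidean distances between points."""
    points = np.asarray(points, dtype=float)
    # Shape (n, 1, d) minus shape (1, n, d) broadcasts to (n, n, d).
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.linalg.norm(diff, axis=-1)


# Hypothetical sample data, not taken from the answer above.
dist = pairwise_distances([[1, 2], [5, 6], [1, 3]])
# Points within the threshold of each point (each point counts itself).
close_counts = np.count_nonzero(dist < 2.0, axis=0)
print(dist)
print(close_counts)
```

Note that the intermediate array holds n * n * d values, so for very large inputs this trades memory for speed.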

Calculating the similarity of multiple elements with unequal length of a nested list

I have a nested list, in which the second element of each sub-list has a varying length:
lst = [['a','bcbcbcbcbc'],['e','bbccbbccb'],['i','ccbbccbb'],['o','cbbccbb']]
My desired output is a CSV written from a DataFrame that looks like this:
comparison    similarity_score
a:e           *some score
a:i           *some score
a:o           *some score
e:i           *some score
e:o           *some score
i:o           *some score
My code:
similarity = []
for i in lst:
    name = i[0]
    string = i[1]
    score = 0.0
    length =(len(string))
    for i in range(length):
        if string[i]==string[i+1]:
            score += 1.0
    new_score = (100.0*score)/length
    name_seq = name[i] + ':' + name[i+1]
    similarity.append(name_seq,new_score)
similarity.pdDataFrame(similarity, columns = ['comparison' , 'similarity_score'])
similarity.to_csv('similarity_score.csv')
But I am receiving an error:
if codes[i]==codes[i+1]:
IndexError: string index out of range
Any advice? Thanks!
According to Python's documentation, range produces the following sequence (wrapped in list() here because in Python 3 range returns a lazy range object):
>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In your code (assuming variable names have not changed):
...
length =(len(string))            # For an input of 'bcb' length will be 3
for i in range(length):          # For an input of 'bcb' range will be [0, 1, 2]
    if string[i]==string[i+1]:   # When i == 2, i + 1 == 3, which gives you the
                                 # IndexError: string index out of range
...
In other words, given an input bcb, your if statement will look at the following indices:
(0, 1)
(1, 2)
(2, 3) <-- The 3 in this case is your issue.
To fix your issue, iterate i over range(len(string) - 1), i.e. the indices 0 through len(string) - 2, so that i + 1 is always a valid index.
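As a minimal sketch of that fix, the adjacent-character comparison can be isolated in a small function. The name adjacent_matches is illustrative, and this only addresses the IndexError, not what the score is meant to measure.

```python
def adjacent_matches(string):
    """Count positions i where string[i] == string[i + 1]."""
    # The last valid i is len(string) - 2, so i + 1 never runs off the end.
    return sum(string[i] == string[i + 1] for i in range(len(string) - 1))


print(adjacent_matches('bcb'))    # 0
print(adjacent_matches('bbccb'))  # 2
```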
I think your biggest issue is that at the top level you're just iterating on one name,string pair at a time, not a pair of name,string pairs like you want to see in your output (as shown by the paired names a:e).
You're trying to index the name and string values later on, but doing so is not achieving what you want (comparing two strings to each other to compute a score), since you're only accessing adjacent characters in the same string. The exception you're getting is because i+1 may go off the end of the string. There's further confusion since you're using i for both the index in the inner loop and as the items taken from the outer loop (the name, string pairs).
To get pairs of pairs, I suggest using itertools.combinations:
import itertools
for [name1, string1], [name2, string2] in itertools.combinations(lst, 2):
Now you can use the two name and two string variables in the rest of the loop.
I'm not entirely sure I understand how you want to compare the strings to get your score, since they're not the same length as one another. If you want to compare just the initial parts of the strings (and ignore the trailing bit of the longer one), you could use zip to get pairs of corresponding characters between the two strings. You can then compare them in a generator expression and add up the bool results (True is a special version of the integer 1 and False is a version of 0). You can then divide by the smaller of the string's lengths (or maybe the larger if you want to penalize length differences):
common_letters = sum(c1 == c2 for c1, c2 in zip(string1, string2))
new_score = common_letters * 100 / min(len(string1), len(string2))
There's one more obvious issue, where you're calling append with two arguments. If you really want to be appending a 2-tuple, you need an extra set of parentheses:
similarity.append((name_seq, new_score))
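The pieces above can be assembled into one sketch. It assumes the score is the percentage of matching positions over the length of the shorter string, as suggested above, and that an empty string scores 0; the function name pairwise_similarity and the sample data are made up for illustration.

```python
import itertools


def pairwise_similarity(named_strings):
    """Return (name1:name2, score) rows for every pair of [name, string] items."""
    rows = []
    for (name1, string1), (name2, string2) in itertools.combinations(named_strings, 2):
        shorter = min(len(string1), len(string2))
        if shorter == 0:
            score = 0.0  # assumption: an empty string has no similarity
        else:
            # zip stops at the shorter string; True counts as 1 when summed.
            common = sum(c1 == c2 for c1, c2 in zip(string1, string2))
            score = common * 100 / shorter
        rows.append((name1 + ':' + name2, score))
    return rows


# Hypothetical sample data.
sample = [['x', 'abc'], ['y', 'abd'], ['z', 'ab']]
for row in pairwise_similarity(sample):
    print(row)
```

The resulting rows can be passed to pandas.DataFrame(rows, columns=['comparison', 'similarity_score']) and written out with to_csv.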
