How to find the repeating arrays in a list - python

I have a list of around 131,000 arrays, each of length 300, and I am using Python.
I want to check which of the arrays are repeated in this list. I am trying to do this by comparing each array with all the others, like this:
import numpy as np
wordEmbeddings = [[0.8,0.4....upto 300 elements]....upto 131000 arrays]
count = 0
for i in range(0,len(wordEmbeddings)):
    for j in range(0,len(wordEmbeddings)):
        if i != j:
            if np.array_equal(wordEmbeddings[i],wordEmbeddings[j]):
                count += 1
This runs very slowly and might take hours to finish. How can I do this efficiently?

You can use collections.Counter to count the frequency of each sublist:
>>> from collections import Counter
>>> Counter(list(map(tuple, wordEmbeddings)))
We need to convert each sublist to a tuple because a list is unhashable, i.e. it cannot be used as a key in a dict.
This will give you a result like this:
Counter({(...4, 5, 6...): 1, (...1, 2, 3...): 1})
The keys of the Counter object are the tuples and the values are the number of times each one occurs. Next, you can filter the resulting Counter to keep only the elements whose count is > 1:
>>> items = Counter(list(map(tuple, wordEmbeddings)))
>>> list(filter(lambda x: items[x] > 1,items))
Timeit results:
$ python -m timeit -s "a = [range(300) for _ in range(131000)]" -s "from collections import Counter" "Counter(list(map(tuple, a)))"
10 loops, best of 3: 1.18 sec per loop
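If the embeddings are already held in a NumPy array, a similar count can be sketched with np.unique along axis 0. This is an alternative to the Counter approach, not part of the original answer, and the small word_embeddings array below is made-up sample data.

```python
import numpy as np

# Hypothetical sample data standing in for the 131000 x 300 embeddings.
word_embeddings = np.array([
    [0.8, 0.4, 0.3],
    [0.2, 0.3, 0.7],
    [0.8, 0.4, 0.3],
    [1.0, 3.0, 4.0],
])

# axis=0 treats each row as one item; return_counts gives occurrences per unique row.
unique_rows, counts = np.unique(word_embeddings, axis=0, return_counts=True)

# Keep only the rows that occur more than once.
repeated_rows = unique_rows[counts > 1]
print(repeated_rows)
```

Like Counter, this only finds exact matches; rows that differ by floating-point noise count as different.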

You can remove duplicate comparisons by starting the inner loop after i, so that each pair is compared only once:
for i in range(0,len(wordEmbeddings)):
    for j in range(i+1,len(wordEmbeddings)):
You could look into PyPy for general-purpose speedups.
It might also be worth looking into hashing the arrays somehow.
Here's a question on speeding up NumPy array comparison. Does the order of the elements matter to you?

You can use set and tuple to find the duplicated arrays. First create a new list of tuples (tuples are needed because lists are unhashable), then collect the tuples that occur more than once into a set. Note that list.count rescans the whole list for every element, so this is quadratic in the number of arrays.
tuples = list(map(tuple, wordEmbeddings))
duplications = set([t for t in tuples if tuples.count(t) > 1])
print(duplications)

Maybe you can reduce the initial list to hashes, or to (non-unique) sums,
and compare those first, which may be faster than comparing the elements directly.
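One way the hashing idea might look in practice is sketched below. It assumes the raw bytes of each array can serve as the dictionary key; the helper name group_duplicate_indices and the sample data are hypothetical, not from the original post.

```python
from collections import defaultdict

import numpy as np


def group_duplicate_indices(arrays):
    """Return lists of positions that hold exactly equal arrays."""
    positions = defaultdict(list)
    for index, arr in enumerate(arrays):
        # The raw bytes of the array act as a hashable key (exact equality only).
        key = np.asarray(arr, dtype=np.float64).tobytes()
        positions[key].append(index)
    # Keep only the keys that were seen at more than one position.
    return [idx for idx in positions.values() if len(idx) > 1]


sample = [[0.8, 0.4], [0.2, 0.3], [0.8, 0.4], [0.2, 0.3], [1.0, 3.0]]
duplicate_groups = group_duplicate_indices(sample)
print(duplicate_groups)
```

Each array is visited once, so the work grows linearly with the number of arrays instead of quadratically. Byte comparison is stricter than ==: for example, 0.0 and -0.0 have different bytes.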

I suggest you first sort the list (might also be helpful for further processing) and then compare. The advantage is that you only need to compare every array element to the previous one:
import numpy as np
from functools import cmp_to_key
wordEmbeddings = [[0.8, 0.4, 0.3, 0.2], [0.2,0.3,0.7], [0.8, 0.4, 0.3, 0.2], [ 1.0, 3.0, 4.0, 5.0]]
def smaller (x,y):
    for i in range(min(len(x), len(y))):
        if x[i] < y[i]:
            return 1
        elif y[i] < x[i]:
            return -1
    if len(x) > len(y):
        return 1
    else:
        return -1
wordEmbeddings = sorted(wordEmbeddings, key=cmp_to_key(smaller))
print(wordEmbeddings)
# output: [[1.0, 3.0, 4.0, 5.0], [0.8, 0.4, 0.3, 0.2], [0.8, 0.4, 0.3, 0.2], [0.2, 0.3, 0.7]]
count = 0
for i in range(1, len(wordEmbeddings)):
    if (np.array_equal(wordEmbeddings[i], wordEmbeddings[i-1])):
        count += 1
print(count)
# output: 1
If N is the length of wordEmbeddings and n is the length of each inner array, then your approach does O(N*N*n) comparisons. Reducing the comparisons as in con--'s answer still leaves O(N*N*n/2) comparisons.
Sorting takes O(N*log(N)*n) time and the subsequent counting step takes only O(N*n) time, which together is much less than O(N*N*n/2).
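The same sort-then-compare-neighbours idea can also be sketched without a custom comparator, assuming the inner arrays can be converted to tuples (which already compare lexicographically). This is a variation on the answer above, using itertools.groupby to collect adjacent equal items.

```python
from itertools import groupby

word_embeddings = [[0.8, 0.4, 0.3, 0.2], [0.2, 0.3, 0.7],
                   [0.8, 0.4, 0.3, 0.2], [1.0, 3.0, 4.0, 5.0]]

# Tuples compare element by element, so plain sorted() is enough.
sorted_rows = sorted(map(tuple, word_embeddings))

# After sorting, equal rows are adjacent, and groupby merges adjacent equal items.
repeated = [row for row, group in groupby(sorted_rows) if len(list(group)) > 1]
print(repeated)
```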

Related

Extracting items from array using variable sliding window in Python [duplicate]

This question already has answers here:
Rolling or sliding window iterator?
(29 answers)
Closed 6 days ago.
I have an array of digits: array = [1.0, 1.0, 2.0, 4.0, 1.0]
I would like to create a function that extracts sequences of digits from the input array and appends to one of two lists depending on defined conditions being met
The first condition f specifies the number of places to look ahead from index i and check if a valid index exists. If true, append array[i] to list1. If false, append to list2.
I have implemented it as follows:
def somefunc(array, f):
    list1, list2 = [], []
    for i in range(len(array)):
        if i + f < len(array):
            list1.append(array[i])
        else:
            list2.append(array[i])
    return list1, list2
This functions correctly as follows:
somefunc(array,f=1) returns ([1.0, 1.0, 2.0, 4.0], [1.0])
somefunc(array,f=2) returns ([1.0, 1.0, 2.0], [4.0, 1.0])
somefunc(array,f=3) returns ([1.0, 1.0], [2.0, 4.0, 1.0])
However, I would like to add a second condition to this function, b, that specifies the window length for previous digits to be summed and then appended to the lists according to the f condition above.
The logic is this:
iterate through array and at each index i check if i+f is a valid index.
If true, append the sum of the previous b digits to list1
If false, append the sum of the previous b digits to list2
If the length of window b isn't possible (i.e. b=2 when i=0) continue to next index.
With both f and b conditions implemented. I would expect:
somefunc(array,f=1, b=1) returns ([1.0, 1.0, 2.0, 4.0], [1.0])
somefunc(array,f=1, b=2) returns ([2.0, 3.0, 6.0], [5.0])
somefunc(array,f=2, b=2) returns ([2.0, 3.0], [6.0, 5.0])
My first challenge is implementing the b condition; I cannot seem to figure out how (see the edit below).
I also wonder if there is a more efficient approach than the iterative method I have begun?
Given only the f condition, I know that the following functions correctly and would bypass the need for iteration:
def somefunc(array, f):
    return array[:-f], array[-f:]
However, I again don't know how to implement the b condition in this approach.
Edit
I have managed an iterative solution which implements the f and b conditions:
def somefunc(array, f, b):
    list1, list2 = [], []
    for i in range(len(array)):
        if i >= (b-1):
            if i + f < len(array):
                list1.append(sum(array[i+1-b: i+1]))
            else:
                list2.append(sum(array[i+1-b: i+1]))
    return list1, list2
However, the indexing syntax feels horrible, so I am certain there must be a more elegant solution. Also, anything with improved runtime would be preferable.
I can see two minor improvements you could implement in your code:
def somefunc(array, f, b):
    list1, list2 = [], []
    size = len(array) # Will only measure the length of the array once
    for i in range(b-1, size): # By starting from b-1 you can remove an if statement
        if i + f < size: # We use the size here
            list1.append(sum(array[i+1-b: i+1]))
        else:
            list2.append(sum(array[i+1-b: i+1]))
    return list1, list2
Edit:
An even better solution would be to add the new digit and subtract the oldest one at each iteration. This way you don't need to recompute the whole sum each time:
def somefunc(array, f, b):
    list1, list2 = [], []
    value = 0
    size = len(array)
    for i in range(b-1, size):
        if value != 0:
            value = value - array[i-b] + array[i] # Get the last value, add the value at index i and remove the value at index i-b
        else:
            value = sum(array[i+1-b: i+1])
        if i + f < size:
            list1.append(value)
        else:
            list2.append(value)
    return list1, list2
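Another way to avoid re-summing each window is a prefix-sum table, sketched below with itertools.accumulate. The name somefunc_prefix is hypothetical, and the sketch assumes f >= 0 and b >= 1; it is meant to follow the same split rule as the loop versions above.

```python
from itertools import accumulate


def somefunc_prefix(array, f, b):
    """Window sums via prefix sums, split by the same f rule as the loop versions."""
    # prefix[k] holds sum(array[:k]), so each window sum is a single subtraction.
    prefix = [0.0] + list(accumulate(array))
    sums = [prefix[i + 1] - prefix[i + 1 - b] for i in range(b - 1, len(array))]
    # Indices i with i + f < len(array) belong to list1; count those with i >= b - 1.
    n_first = max(0, len(array) - f - (b - 1))
    return sums[:n_first], sums[n_first:]


array = [1.0, 1.0, 2.0, 4.0, 1.0]
print(somefunc_prefix(array, f=1, b=2))
```

With floats that are not exactly representable, subtracting prefix sums can round slightly differently from summing each window directly.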

Making a loop to calculate [n] - [n+1] inside a single python list

I'm trying to calculate, with a loop, the subtraction between element n and element n+1 of a single Python list.
For example:
list = [26.5, 17.3, 5.9, 10.34, 3.87]
# expected calculation
# 26.5 - 17.3 = 9.2
# 17.3 - 5.9 = 11.4
# ...
# expected result
# list_2 = [9.2, 11.4, -4.44, 6.47]
I tried with :
list_2 = [n-(n+1) for n in list]
# output
# [-1.0, -1.0, -1.0, -1.0, -1.0]
-----------------------------------------
list_2 = []
for n in list:
    list_2 += list[n] - list[n+1]
# output
# TypeError: list indices must be integers or slices, not float
# with correction of 'TypeError'
list_2 += list[int(n)] - list[int(n+1)]
# output
# TypeError: 'float' object is not iterable
The problem looks simple but... I can't do it.
Do you have an idea?
If possible, I'm looking for a native Python 3 solution.
Thank you in advance and have a nice day/evening.
You can use zip and slicing. Here is the relevant documentation from the official https://docs.python.org site:
Documentation for the zip function:
zip(*iterables, strict=False)
Iterate over several iterables in parallel, producing tuples with an item from each one.
Code
>>> l = [26.5, 17.3, 5.9, 10.34, 3.87]
>>> result = [current-_next for current,_next in zip(l,l[1:])]
>>> result
[9.2, 11.4, -4.4399999999999995, 6.47]
NOTE: The third result is imprecise because of floating-point arithmetic.
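If the floating-point noise matters, one option is to round for display and to compare with a tolerance. The sketch below is only an illustration; the choice of two decimals is an assumption based on the precision of the input values.

```python
import math

l = [26.5, 17.3, 5.9, 10.34, 3.87]
result = [current - nxt for current, nxt in zip(l, l[1:])]

# Rounding is for presentation; two decimals matches the input precision here.
rounded = [round(value, 2) for value in result]
print(rounded)

# For comparisons, math.isclose avoids relying on exact float equality.
third_is_close = math.isclose(result[2], -4.44)
print(third_is_close)
```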
Alternative?
As an alternative, you can also do this as an iterable using itertools.starmap and operator.sub:
>>> import itertools
>>> from operator import sub
>>> result = itertools.starmap(sub,zip(l,l[1:]))
Try this.
l = [26.5, 17.3, 5.9, 10.34, 3.87]
result = [l[i] - l[i + 1] for i in range(len(l) - 1)]
@Ankit Sharma's answer works perfectly, as does @XxJames07-'s. Personally, I prefer using enumerate() rather than range(len) or zip in such situations. For example:
>>> l = [26.5, 17.3, 5.9, 10.34, 3.87]
>>> result = [n - l[i + 1] for i, n in enumerate(l[:-1])]
>>> print(result)
[9.2, 11.4, -4.4399999999999995, 6.47]
This is simply because I find it more "pythonic": it has cleaner syntax and gives access to both the value and the index. Note that enumerate itself works with any iterable, but this snippet still relies on slicing and indexing, so l must be a sequence. Here's a good explanation.
In my view this solution is also simpler and more readable than using zip, which is a tad unnecessarily complicated for a simple problem.

Sorting lists with indexing for easy item removal

I am facing the following problem. I have multiple lists in Python that I want to keep sorted with some kind of indexing, so that I can remove the corresponding items from the other lists. Let me explain further.
listA_ID = [1,2,3,5,6,7] # integer from 0-250
listA_Var1 = [3.9, 4.7, 2.1, 1.2, 0.15, 0.99]
listB_ID = [2,5,6,7,8,10] # integer from 0-250
listB_Var1 = [0.54, 0.35, 1.19, 2.45, 3.1, 1.75]
>> After Comparison of listA_ID & listB_ID I should end up with the common IDs.
listA_ID = listB_ID = sorted(list(set(listA_ID) & set(listB_ID)))
listA_ID = [2,5,6,7]
listB_ID = [2,5,6,7]
Therefore I want to delete the elements [1, 3] from listA_ID which are in the positions of [0, 2] of that list and the same thing from listA_Var1, delete [3.9, 2.1] which are in the same positions [0, 2].
Similarly, I want to remove the elements [8, 10] from listB_ID which are in the positions of [4, 5] of that list and the same thing from listB_Var1, delete [3.1, 1.75] which are in the same positions [4, 5].
>> and then listA_Var1 & listB_Var1 will become
listA_Var1 = [4.7, 1.2, 0.15, 0.99]
listB_Var1 = [0.54, 0.35, 1.19, 2.45]
Any ideas on an efficient way to implement that? From my experience using Matlab a lot, after comparing the two lists, I have a way to get the indexes that are not needed and then applying these indexes to the lists, what I get are the final lists listA_Var1 & listB_Var1.
Any ideas please? Thanks in advance!
1. Getting the Intersection
There are many ways to do this. For a detailed discussion see here. As suggested there, if duplicates do not matter (i.e. your lists either do not contain duplicates or they do but you do not care about them), you can, for example, use set() to get the shared values:
intersection_A_B = sorted(list(set(listA_ID) & set(listB_ID)))
Alternatively, you can also turn just one of the lists into a set and then use the intersection() method, such as:
intersection_A_B = list(set(listA_ID).intersection(listB_ID))
In contrast, if duplicates matter or could pose an issue (say, both listA_ID and listB_ID feature a value twice and you want your intersection to preserve both listings of the value), instead of using set() or intersection(), you could use a list comprehension:
intersection_A_B = [x for x in listA_ID if x in listB_ID]
2. Removing Values
Edit: After getting the intersection (note that, now that I got what you were really after, the first step of the process refers to intersection_A_B instead of updating listA_ID and listB_ID because their original states are needed for the following operation), this should do the trick:
del_indices_A = [i for i, value in enumerate(listA_ID) if value not in intersection_A_B]
listA_Var1 = [listA_Var1[x] for x in range(len(listA_Var1)) if x not in del_indices_A]
del_indices_B = [i for i, value in enumerate(listB_ID) if value not in intersection_A_B]
listB_Var1 = [listB_Var1[x] for x in range(len(listB_Var1)) if x not in del_indices_B]
This first checks which indices in listA_ID and listB_ID correspond to values not included in intersection_A_B and then excludes the values at those indices from listA_Var1 and listB_Var1.
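The same filtering can be sketched in a single pass per list by pairing each ID with its value using zip. This is an alternative formulation, not the answer's own code; it assumes each ID list lines up index by index with its value list, as in the question.

```python
listA_ID = [1, 2, 3, 5, 6, 7]
listA_Var1 = [3.9, 4.7, 2.1, 1.2, 0.15, 0.99]
listB_ID = [2, 5, 6, 7, 8, 10]
listB_Var1 = [0.54, 0.35, 1.19, 2.45, 3.1, 1.75]

# A set makes each membership test O(1) on average.
common_ids = set(listA_ID) & set(listB_ID)

# Walk each ID list together with its values and keep only the shared IDs.
listA_Var1 = [v for id_, v in zip(listA_ID, listA_Var1) if id_ in common_ids]
listB_Var1 = [v for id_, v in zip(listB_ID, listB_Var1) if id_ in common_ids]
listA_ID = listB_ID = sorted(common_ids)

print(listA_ID, listA_Var1, listB_Var1)
```

The values keep the order of their own ID list, so the two value lists only line up with each other if both ID lists were in the same order to begin with (here both are sorted).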
First, I'm going to explain this approach step by step:
- Step 1: We look for the elements common to both listA_ID and listB_ID.
intersection_AB = set(listA_ID) & set(listB_ID)
- Step 2: Then we take a set difference. It's very important to put set(listA_ID) first, because set difference is not commutative.
# You can use difference() method alternatively:
# A_elements = list(set(listA_ID).difference(intersection_AB)) but personally I like the minus operator.
A_elements = list(set(listA_ID) - intersection_AB)
- Step 3: Then we look up the indexes of the elements found in the previous step.
index_to_remove_list_A = [listA_ID.index(i) for i in A_elements]
Or, alternatively (although less legible):
index_to_remove_list_A = [listA_ID.index(i) for i in list(set(listA_ID) - intersection_AB)]
- Step 4: Delete the corresponding elements from the list, going from the highest index down so the remaining indexes stay valid.
for i in sorted(index_to_remove_list_A, reverse=True):
    del listA_Var1[i]
print(listA_Var1)
print(listA_Var1)
Edit: Full code with both lists ...
intersection_AB = set(listA_ID) & set(listB_ID)
A_elements = list(set(listA_ID) - intersection_AB)
B_elements = list(set(listB_ID) - intersection_AB)
index_to_remove_list_A = [listA_ID.index(i) for i in A_elements]
index_to_remove_list_B = [listB_ID.index(i) for i in B_elements]
for i in sorted(index_to_remove_list_A, reverse=True):
    del listA_Var1[i]
for i in sorted(index_to_remove_list_B, reverse=True):
    del listB_Var1[i]
print(listA_Var1) # [4.7, 1.2, 0.15, 0.99]
print(listB_Var1) # [0.54, 0.35, 1.19, 2.45]
Well, I will post a working solution myself too, using numpy.
import numpy as np
intersection_A_B = sorted(list(set(listA_ID) & set(listB_ID)))
# Convert Lists to Arrays
np_listA_ID = np.asarray( listA_ID )
np_listB_ID = np.asarray( listB_ID )
# Comparison of two arrays
np_list_ID, listA_ind, listB_ind = np.intersect1d(np_listA_ID, np_listB_ID, assume_unique=False, return_indices=True)
# Keep only Items Needed
np_listA_Var1 = np.asarray( listA_Var1 )
np_listB_Var1 = np.asarray( listB_Var1 )
# Convert Array to List again
listA_ID=listB_ID=np_list_ID.tolist()
listA_Var1 = np_listA_Var1[listA_ind].tolist()
listB_Var1 = np_listB_Var1[listB_ind].tolist()

How to pick same values in a list if the list contain floating numbers

In the following code I want to find how many unique values are in the list, which can be done with a for loop. After finding the unique values, I want to see where each unique value appears in a and then count its occurrences. Can someone please guide me on how to do that? The list contains floating-point numbers. What if I convert it to a NumPy array and then find the equal values?
import numpy as np
a= [1.0, 1.0, 1.0, 1.0, 1.5, 1.5, 1.5, 3.0, 3.0]
list = []
for i in a:
    if i not in list:
        list.append(i)
print(list)
for j in range(len(list)):
    g= np.argwhere(a==list[j])
    print(g)
You can use np.unique to get it done
np.unique(np.array(a),return_counts=True)
You can also do it using Counter from collections
from collections import Counter
Var=dict(Counter(a))
print(Var)
The primitive way is to use loops
[[x,a.count(x)] for x in set(a)]
If you are not familiar with list comprehensions, here is the equivalent loop:
ls=[]
for x in set(a):
    ls.append([x,a.count(x)])
print(ls)
If you want to do it with if/else:
counter = dict()
for k in a:
    if k not in counter:
        counter[k] = 1
    else:
        counter[k] += 1
print(counter)
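Because the list holds floats, values that should be equal can differ by rounding error, and every method above counts them as distinct. One hedged workaround is to round before counting. In the sketch below the two extra entries (0.1 + 0.2 and 0.3) are added purely for illustration, and 9 decimals is an arbitrary tolerance.

```python
from collections import Counter

# The question's data plus two illustrative values that differ only by rounding error.
a = [1.0, 1.0, 1.0, 1.0, 1.5, 1.5, 1.5, 3.0, 3.0, 0.1 + 0.2, 0.3]

# Exact counting treats 0.1 + 0.2 and 0.3 as different values.
exact = Counter(a)

# Rounding first merges them into one key.
rounded = Counter(round(x, 9) for x in a)
print(rounded)
```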

Python print nth element from list of lists [duplicate]

This question already has answers here:
How to print column in python array?
(2 answers)
Closed 5 years ago.
I have the following list:
[[50.954818803035948, 55.49664787231189, 8007927.0, 0.0],
[50.630482185654436, 55.133473852776916, 8547795.0, 0.0],
[51.32738085400576, 55.118344981379266, 6600841.0, 0.0],
[49.425931642638567, 55.312890225131163, 7400096.0, 0.0],
[48.593467836476407, 55.073137270550006, 6001334.0, 0.0]]
I want to print the third element from every list. The desired result is:
8007927.0
8547795.0
6600841.0
7400096.0
6001334.0
I tried:
print data[:][2]
but it is not outputting the desired result.
There are many ways to do this. Here's a simple list comprehension, without an explicit for loop.
tt = [[50.954818803035948, 55.49664787231189, 8007927.0, 0.0], [50.630482185654436, 55.133473852776916, 8547795.0, 0.0], [51.32738085400576, 55.118344981379266, 6600841.0, 0.0], [49.425931642638567, 55.312890225131163, 7400096.0, 0.0], [48.593467836476407, 55.073137270550006, 6001334.0, 0.0]]
print([x[2] for x in tt])
> [8007927.0, 8547795.0, 6600841.0, 7400096.0, 6001334.0]
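And making it safe for potentially shorter inner lists (the check has to be on each inner list x, not on tt):
print([x[2] for x in tt if len(x) > 2])
More sophisticated output, printing the values separated by newlines (\n); written this way it works in both Python 2.7 and Python 3:
print('\n'.join([str(x[2]) for x in tt]))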
> 8007927.0
> 8547795.0
> 6600841.0
> 7400096.0
> 6001334.0
Try this:
for item in data:
    if len(item) >= 3: # to prevent list out of bound exception.
        print(int(item[2]))
map and list comprehension have already been given, so I would like to provide two more ways. Say d is your list:
With zip (in Python 3, zip returns an iterator, so wrap it in list() before indexing):
list(zip(*d))[2]
With numpy:
>>> import numpy
>>> nd = numpy.array(d)
>>> print(nd[:,2])
[8007927. 8547795. 6600841. 7400096. 6001334.]
Maybe try the map function, where z is your list:
In python 3:
list(map(lambda l: l[2], z))
In python 2:
map(lambda l: l[2], z)
In order to print the nth element of every list from a list of lists, you need to first access each list, and then access the nth element in that list.
In practice, it would look something like this
def print_nth_element(listset, n):
    for listitem in listset:
        print(int(listitem[n])) # Since you want them to be ints
Which could then be called in the form print_nth_element(data, 2) for your case.
The reason your data[:][2] is not yielding the correct result is that data[:] returns the entire list of lists as it is, so taking element 2 of that copy just gets the third element of the original outer list. So data[:][2] is practically equivalent to data[2].
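A quick check of that equivalence, using a shortened and rounded version of the question's data (the values are abbreviated here only to keep the example small):

```python
data = [[50.9, 55.4, 8007927.0, 0.0],
        [50.6, 55.1, 8547795.0, 0.0],
        [51.3, 55.1, 6600841.0, 0.0]]

# data[:] is a shallow copy of the outer list, so indexing it selects a whole row.
same_row = data[:][2] == data[2]
print(same_row)

# Selecting a column means indexing inside each row instead.
third_column = [row[2] for row in data]
print(third_column)
```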
