Compare 1 column of 2D array and remove duplicates Python - python

Say I have a 2D array like:
array = [['abc',2,3,],
['abc',2,3],
['bb',5,5],
['bb',4,6],
['sa',3,5],
['tt',2,1]]
I want to remove any rows where the first column duplicates
ie compare array[0] and return only:
removeDups = [['sa',3,5],
['tt',2,1]]
I think it should be something like:
(set first col as tmp variable, compare tmp with remaining and #set array as returned from compare)
for x in range(len(array)):
tmpCol = array[x][0]
del array[x]
removed = compare(array, tmpCol)
array = copy.deepcopy(removed)
print repr(len(removed)) #testing
where compare is:
(compare first col of each remaining array items with tmp, if match remove else return original array)
def compare(valid, tmpCol):
for x in range(len(valid)):
if valid[x][0] != tmpCol:
del valid[x]
return valid
else:
return valid
I keep getting 'index out of range' error. I've tried other ways of doing this, but I would really appreciate some help!

Similar to other answers, but using a dictionary instead of importing counter:
counts = {}
for elem in array:
# add 1 to counts for this string, creating new element at this key
# with initial value of 0 if needed
counts[elem[0]] = counts.get(elem[0], 0) + 1
new_array = []
for elem in array:
# check that there's only 1 instance of this element.
if counts[elem[0]] == 1:
new_array.append(elem)

One option you can try is create a counter for the first column of your array before hand and then filter the list based on the count value, i.e, keep the element only if the first element appears only once:
from collections import Counter
count = Counter(a[0] for a in array)
[a for a in array if count[a[0]] == 1]
# [['sa', 3, 5], ['tt', 2, 1]]

You can use a dictionary and count the occurrences of each key.
You can also use Counter from the library collections that actually does this.
Do as follows :
from collection import Counter
removed = []
for k, val1, val2 in array:
if Counter([k for k, _, _ in array])[k]==1:
removed.append([k, val1, val2])

Related

Searching index position in python

cols = [2,4,6,8,10,12,14,16,18] # selected the columns i want to work with
df = pd.read_csv('mywork.csv')
df1 = df.iloc[:, cols]
b= np.array(df1)
b
outcome
b = [['WV5 6NY' 'RE4 9VU' 'BU4 N90' 'TU3 5RE' 'NE5 4F']
['SA8 7TA' 'BA31 0PO' 'DE3 2FP' 'LR98 4TS' 0]
['MN0 4NU' 'RF5 5FG' 'WA3 0MN' 'EA15 8RE' 'BE1 4RE']
['SB7 0ET' 'SA7 0SB' 'BT7 6NS' 'TA9 0LP' 'BA3 1OE']]
a = np.concatenate(b) #concatenated to get a single array, this worked well
a = np.array([x for x in a if x != 'nan'])
a = a[np.where(a != '0')] #removed the nan
print(np.sort(a)) # to sort alphabetically
#Sorted array
['BA3 1OE' 'BA31 0PO' 'BE1 4RE' 'BT7 6NS' 'BU4 N90'
'DE3 2FP' 'EA15 8RE' 'LR98 4TS' 'MN0 4NU', 'NE5 4F' 'RE4 9VU'
'RF5 5FG' 'SA7 0SB' 'SA8 7TA' 'SB7 0ET' 'TA9 0LP' 'TU3 5RE'
'WA3 0MN' 'WV5 6NY']
#Find the index position of all elements of b in a(sorted array)
def findall_index(b, a )
result = []
for i in range(len(a)):
for j in range(len(a[i])):
if b[i][j] == a:
result.append((i, j))
return result
print(findall_index(0,result))
I am still very new with python, I tried finding the index positions of all element of b in a above. The underneath codes blocks doesn't seem to be giving me any result. Please can some one help me.
Thank you in advance.
One way you could approach this is by zipping (creating pairs) the index of elements in b with the actual elements and then sorting this new array based on the elements only. Now you have a mapping from indices of the original array to the new sorted array. You can then just loop over the sorted pairs to map the current index to the original index.
I would highly suggest you to code this yourself, since it will help you learn!

Removing values >0 in data set

I have a data set which is a list of lists, looking like this:
[[-0.519418066, -0.680905835],
[0.895518429, -0.654813183],
[0.092350219, 0.135117023],
[-0.299403315, -0.568458405],....]
its shape is (9760,) and I am trying to remove all entries where the value of the first number in each entry is greater than 0, so in this example the 2nd and 3rd entries would be removed to leave
[[-0.519418066, -0.680905835],
[-0.299403315, -0.568458405],....]
So far I have written:
for x in range(9670):
for j in filterfinal[j][0]:
if filterfinal[j][0] > 0:
np.delete(filterfinal[j])
this returns: TypeError: list indices must be integers or slices, not list
Thanks in advance for any help on this problem!
You can use numpy's boolean indexing:
>>> x = np.random.randn(10).reshape((5,2))
array([[-0.46490993, 0.09064271],
[ 1.01982349, -0.46011639],
[-0.40474591, -1.91849573],
[-0.69098115, 0.19680831],
[ 2.00139248, -1.94348869]])
>>> x[x[:,0] > 0]
array([[ 1.01982349, -0.46011639],
[ 2.00139248, -1.94348869]])
Some explanation:
x[:,0] selects the first column of your array.
x > 0 will return an array of the same shape where each value is replaced by the result of the element-wise comparison (i.e., is the value > 0 or not?)
So, x[:,0] > 0 will give you an array of shape (n,1) with True or False values depending on the first value of your row.
You can then pass this array of booleans as an index to your original array, where it will return you an array of only the indexes that are True. By passing in a boolean array of shape (n,1), you select per row.
You are talking about "shape", so I assume that you are using numpy. Also, you are mentioning np in your example code, so you are able to apply element wise operations together with boolean indexing
array = np.array([[-0.519418066, -0.680905835],
[0.895518429, -0.654813183],
[0.092350219, 0.135117023],
[-0.299403315, -0.568458405]])
filtered = array[array[:, 0] < 0]
Use a list comprehension:
lol = [[-0.519418066, -0.680905835],[0.895518429, -0.654813183],[0.092350219, 0.135117023],[-0.299403315, -0.568458405]]
filtered_lol = [l for l in lol if l[0] <= 0]
You can use a list comprehension that unpacks the first item from each sub-list and retains only those with the first item <= 0 (assuming your list of lists is stored as variable l):
[l for a, _ in l if a <= 0]
You can go through this in a for loop and making a new list without the positives like so:
new_list = []
for item in old_list:
if item[0] < 0:
new_list.append(item)
But I'd prefer to instead use the in built filter function if you are comfortable with it and do something like:
def is_negative(number):
return number < 0
filtered_list = filter(is_negative, old_list)
This is similar to a list comprehension - or just using a for loop. However it returns a generator instead so you never have to hold two lists in memory making the code more efficient.

Identify a single difference in a python list

I would have to get some help concerning a part of my code.
I have some python list, example:
list1 = (1,1,1,1,1,1,5,1,1,1)
list2 = (6,7,4,4,4,1,6,7,6)
list3 = (8,8,8,8,9)
I would like, for each list, know if there is a single value that is different compare to every other values if and only if all of these other values are the same. For example, in the list1, it would identify "5" as a different value, in list2 it would identify nothing as there are more than 2 different values and in list3 it would identify "9"
What i already did is :
for i in list1:
if list1(i)==len(list1)-1
print("One value identified")
The problem is that i get "One value identified" as much time as "1" is present in my list ...
But what i would like to have is an output like that :
The most represented value equal to len(list1)-1 (Here "1")
The value that is present only once (Here "5")
The position in the list where the "5"
You could use something like that:
def odd_one_out(lst):
s = set(lst)
if len(s)!=2: # see comment (1)
return False
else:
return any(lst.count(x)==1 for x in s) # see comment (2)
which for the examples you provided, yields:
print(odd_one_out(list1)) # True
print(odd_one_out(list2)) # False
print(odd_one_out(list3)) # True
To explain the code I would use the first example list you provided [1,1,1,1,1,1,5,1,1,1].
(1) converting to set removes all the duplicate values from your list thus leaving you with {1, 5} (in no specific order). If the length of this set is anything other than 2 your list does not fulfill your requirements so False is returned
(2) Assuming the set does have a length of 2, what we need to check next is that at least one of the values it contains appear only once in the original list. That is what this any does.
You can use the built-in Counter from High-performance container datatypes :
from collections import Counter
def is_single_diff(iterable):
c = Counter(iterable)
non_single_items = list(filter(lambda x: c[x] > 1, c))
return len(non_single_items) == 1
Tests
list1 = (1,1,1,1,1,1,5,1,1,1)
list2 = (6,7,4,4,4,1,6,7,6)
list3 = (8,8,8,8,9)
In: is_single_diff(list1)
Out: True
In: is_single_diff(list2)
Out: False
In: is_single_diff(list3)
Out: True
Use numpy unique, it will give you all the information you need.
myarray = np.array([1,1,1,1,1,1,5,1,1,1])
vals_unique,vals_counts = np.unique(myarray,return_counts=True)
You can first check for the most common value. After that, go through the list to see if there is a different value, and keep track of it.
If you later find another value that isn't the same as the most common one, the list does not have a single difference.
list1 = [1,1,1,1,1,1,5,1,1,1]
def single_difference(lst):
most_common = max(set(lst), key=lst.count)
diff_idx = None
diff_val = None
for idx, i in enumerate(lst):
if i != most_common:
if diff_val is not None:
return "No unique single difference"
diff_idx = idx
diff_val = i
return (most_common, diff_val, diff_idx)
print(single_difference(list1))

How to get each element of numpy array?

I have a numpy array as follows :
Keys which will store some values. for example
Keys [2,3,4,7,8]
How to get index of 4 and store the index in a int variable ?
For example the index value of 4 is 2, so 2 will be stored in a int variable.
I have tried with following code segment
//enter code here
for i in np.nditer(Keys):
print(keys[i]);
//enter code here
I am using python 3.5
Spyder 3.5.2
Anaconda 4.2.0
Is keys a list or numpy array
keys = [[2,3,4,7,8] # or
keys = np.array([2,3,4,7,8])
You don't need to iterate to see the elements of either. But you can do
for i in keys:
print(i)
for i in range(len(keys)):
print(keys[i])
[i for i in keys]
these work for either.
If you want the index of the value 4, the list has a method:
keys.index(4)
for the array
np.where(keys==4)
is a useful bit of code. Also
np.in1d(keys, 4)
np.where(np.in1d(keys, 4))
Forget about np.nditer. That's for advanced programming, not routine iteration.
There are several ways. If the list is not too large, then:
where_is_4 = [e for i,e in enumerate(Keys) if i==4][0]
What this does is it loops over the list with an enumerator and creates a list that contains the value of the enumerator every time the value '4' occurs.
Why not just do:
for i in range( len( Key ) ):
if ( Key[ i ] == 4 ):
print( i )
You can find all indices where the value is 4 using:
>>> keys = np.array([2,3,4,7,8])
>>> np.flatnonzero(keys == 4)
array([2])
There is a native numpy method for this called where.
It will return an array of the indices where some given condition is true. So you can just pick the first entry, if the list isn't empty:
N = 4
indicies = np.where(x==N)[0]
index = None
if indicies:
index = indicies[0]
Use of numpy.where(condition) will be a good choice here. From the below code you can get location of 4.
import numpy as np
keys = np.array([2,3,4,7,8])
result = np.where(keys==4)
result[0][0]

check for duplicates in a python list

I've seen a lot of variations of this question from things as simple as remove duplicates to finding and listing duplicates. Even trying to take bits and pieces of these examples does not get me my result.
My question is how am I able to check if my list has a duplicate entry? Even better, does my list have a non-zero duplicate?
I've had a few ideas -
#empty list
myList = [None] * 9
#all the elements in this list are None
#fill part of the list with some values
myList[0] = 1
myList[3] = 2
myList[4] = 2
myList[5] = 4
myList[7] = 3
#coming from C, I attempt to use a nested for loop
j = 0
k = 0
for j in range(len(myList)):
for k in range(len(myList)):
if myList[j] == myList[k]:
print "found a duplicate!"
return
If this worked, it would find the duplicate (None) in the list. Is there a way to ignore the None or 0 case? I do not care if two elements are 0.
Another solution I thought of was turn the list into a set and compare the lengths of the set and list to determine if there is a duplicate but when running set(myList) it not only removes duplicates, it orders it as well. I could have separate copies, but it seems redundant.
Try changing the actual comparison line to this:
if myList[j] == myList[k] and not myList[j] in [None, 0]:
I'm not certain if you are trying to ascertain whether or a duplicate exists, or identify the items that are duplicated (if any). Here is a Counter-based solution for the latter:
# Python 2.7
from collections import Counter
#
# Rest of your code
#
counter = Counter(myList)
dupes = [key for (key, value) in counter.iteritems() if value > 1 and key]
print dupes
The Counter object will automatically count occurances for each item in your iterable list. The list comprehension that builds dupes essentially filters out all items appearing only once, and also upon items whose boolean evaluation are False (this would filter out both 0 and None).
If your purpose is only to identify that duplication has taken place (without enumerating which items were duplicated), you could use the same method and test dupes:
if dupes: print "Something in the list is duplicated"
If you simply want to check if it contains duplicates. Once the function finds an element that occurs more than once, it returns as a duplicate.
my_list = [1, 2, 2, 3, 4]
def check_list(arg):
for i in arg:
if arg.count(i) > 1:
return 'Duplicate'
print check_list(my_list) == 'Duplicate' # prints True
To remove dups and keep order ignoring 0 and None, if you have other falsey values that you want to keep you will need to specify is not None and not 0:
print [ele for ind, ele in enumerate(lst[:-1]) if ele not in lst[:ind] or not ele]
If you just want the first dup:
for ind, ele in enumerate(lst[:-1]):
if ele in lst[ind+1:] and ele:
print(ele)
break
Or store seen in a set:
seen = set()
for ele in lst:
if ele in seen:
print(ele)
break
if ele:
seen.add(ele)
You can use collections.defaultdict and specify a condition, such as non-zero / Truthy, and specify a threshold. If the count for a particular value exceeds the threshold, the function will return that value. If no such value exists, the function returns False.
from collections import defaultdict
def check_duplicates(it, condition, thresh):
dd = defaultdict(int)
for value in it:
dd[value] += 1
if condition(value) and dd[value] > thresh:
return value
return False
L = [1, None, None, 2, 2, 4, None, 3, None]
res = check_duplicates(L, condition=bool, thresh=1) # 2
Note in the above example the function bool will not consider 0 or None for threshold breaches. You could also use, for example, lambda x: x != 1 to exclude values equal to 1.
In my opinion, this is the simplest solution I could come up with. this should work with any list. The only downside is that it does not count the number of duplicates, but instead just returns True or False
for k, j in mylist:
return k == j
Here's a bit of code that will show you how to remove None and 0 from the sets.
l1 = [0, 1, 1, 2, 4, 7, None, None]
l2 = set(l1)
l2.remove(None)
l2.remove(0)

Categories