Python: get number of occurrences based on condition - python

If I have list of integers and a function getErrorType(int) that returns some enum type, what's a Pythonic way to get a dictionary where the key is the enum type and value is the count of how many values in the array returned that error type?
Example:
arr = [1, 2, 3]
getErrorType(1) returns EXCEPTION
getErrorType(2) returns MALFORMED_DATA
getErrorType(3) returns EXCEPTION
I want to be able to get: {EXCEPTION: 2, MALFORMED_DATA: 1}

I would use dict comprehension
d={getErrorType(a):a for a in arr}
EDIT:
ok to get the counts like OP said, i would do something like this
d={x: [getErrorType(a) for a in arr].count(x) for x in set([getErrorType(a) for a in arr])}
Though this may make it too hard to read to be pythonic/////

data = {}
for a in arr:
error_type = getErrorType(a)
if error_type in data:
data[error_type] += 1
else:
data[error_type] = 1

Don't think there is an efficient method to keep count using dict comprehension.
I would probably just use an iterative approach using a normal dictionary or a defaultdict
from collections import defaultdict
d = defaultdict(int)
for num in arr: d[getErrorType(num)] += 1

Made a generic function just to simulate you could pass the whole arr into your function then you could use .count on the new list that has the results to form a dicitonary
def getErrorType(a):
return ['Ex' if i % 2 else 'Mal' for i in a ]
arr = [1, 2, 3]
lista = getErrorType(arr)
dicta = {i: lista.count(i) for i in lista}
(xenial)vash#localhost:~/python/stack_overflow$ python3.7 helping.py
{'Ex': 2, 'Mal': 1}
I do not agree with looping through += 1 every item to create a dictionary that doesn't seem efficient, I stand by this

Combined some solution above
d = defaultdict(int)
for num in set(arr):
d[getErrorType(num)] += arr.count(num)

Related

Find the missing strings containing int value from a given list

Given a list of string, say mystr= ["State0", "State1", "State2", "State5", "State8"].
I need to find the missing States (here "State3", "State4", "State6", "State7"). Is there a possibility to find it?
Desired output: mylist = ["State3", "State4", "State6", "State7"]
Assuming the highest number in the list might not be known, one way is to extract the numerical part in each string, take the set.difference with a range up to the highest value and create a new list using a list comprehension:
import re
ints = [int(re.search(r'\d+', i).group(0)) for i in mystr]
# [0, 1, 2, 5, 8]
missing = set(range(max(ints))) - set(ints)
# {3, 4, 6, 7}
[f'State{i}' for i in missing]
# ['State3', 'State4', 'State6', 'State7']
Simple enough with list comprehensions and f-strings:
mystr = ["State0", "State1", "State2", "State5", "State8"]
highest = max(int(state.split('State')[-1]) for state in mystr)
mylist = [f"State{i}" for i in range(highest) if f"State{i}" not in mystr]
print(mylist)
Output:
['State3', 'State4', 'State6', 'State7']
Note that this solution is nice and general and will work even if the last element in the original list if for example "State1024", and even if the original list is not sorted.
I am not sure what you are asking. I think this is what you expect.
mystr= ["State0", "State1", "State2", "State5", "State8"]
print(['State'+str(p) for p in range(8) if 'State'+str(p) not in mystr ])
You can use the following solution:
lst = ["State0", "State1", "State2", "State5", "State8"]
states = set(lst)
len_states = len(states)
missing = []
num = 0
while len_states:
state = f'State{num}'
if state in states:
len_states -= 1
else:
missing.append(state)
num += 1
print(missing)
Output:
['State3', 'State4', 'State6', 'State7']

Indexing a list with an unique index

I have a list say l = [10,10,20,15,10,20]. I want to assign each unique value a certain "index" to get [1,1,2,3,1,2].
This is my code:
a = list(set(l))
res = [a.index(x) for x in l]
Which turns out to be very slow.
l has 1M elements, and 100K unique elements. I have also tried map with lambda and sorting, which did not help. What is the ideal way to do this?
You can do this in O(N) time using a defaultdict and a list comprehension:
>>> from itertools import count
>>> from collections import defaultdict
>>> lst = [10, 10, 20, 15, 10, 20]
>>> d = defaultdict(count(1).next)
>>> [d[k] for k in lst]
[1, 1, 2, 3, 1, 2]
In Python 3 use __next__ instead of next.
If you're wondering how it works?
The default_factory(i.e count(1).next in this case) passed to defaultdict is called only when Python encounters a missing key, so for 10 the value is going to be 1, then for the next ten it is not a missing key anymore hence the previously calculated 1 is used, now 20 is again a missing key and Python will call the default_factory again to get its value and so on.
d at the end will look like this:
>>> d
defaultdict(<method-wrapper 'next' of itertools.count object at 0x1057c83b0>,
{10: 1, 20: 2, 15: 3})
The slowness of your code arises because a.index(x) performs a linear search and you perform that linear search for each of the elements in l. So for each of the 1M items you perform (up to) 100K comparisons.
The fastest way to transform one value to another is looking it up in a map. You'll need to create the map and fill in the relationship between the original values and the values you want. Then retrieve the value from the map when you encounter another of the same value in your list.
Here is an example that makes a single pass through l. There may be room for further optimization to eliminate the need to repeatedly reallocate res when appending to it.
res = []
conversion = {}
i = 0
for x in l:
if x not in conversion:
value = conversion[x] = i
i += 1
else:
value = conversion[x]
res.append(value)
Well I guess it depends on if you want it to return the indexes in that specific order or not. If you want the example to return:
[1,1,2,3,1,2]
then you can look at the other answers submitted. However if you only care about getting a unique index for each unique number then I have a fast solution for you
import numpy as np
l = [10,10,20,15,10,20]
a = np.array(l)
x,y = np.unique(a,return_inverse = True)
and for this example the output of y is:
y = [0,0,2,1,0,2]
I tested this for 1,000,000 entries and it was done essentially immediately.
Your solution is slow because its complexity is O(nm) with m being the number of unique elements in l: a.index() is O(m) and you call it for every element in l.
To make it O(n), get rid of index() and store indexes in a dictionary:
>>> idx, indexes = 1, {}
>>> for x in l:
... if x not in indexes:
... indexes[x] = idx
... idx += 1
...
>>> [indexes[x] for x in l]
[1, 1, 2, 3, 1, 2]
If l contains only integers in a known range, you could also store indexes in a list instead of a dictionary for faster lookups.
You can use collections.OrderedDict() in order to preserve the unique items in order and, loop over the enumerate of this ordered unique items in order to get a dict of items and those indices (based on their order) then pass this dictionary with the main list to operator.itemgetter() to get the corresponding index for each item:
>>> from collections import OrderedDict
>>> from operator import itemgetter
>>> itemgetter(*lst)({j:i for i,j in enumerate(OrderedDict.fromkeys(lst),1)})
(1, 1, 2, 3, 1, 2)
For completness, you can also do it eagerly:
from itertools import count
wordid = dict(zip(set(list_), count(1)))
This uses a set to obtain the unique words in list_, pairs
each of those unique words with the next value from count() (which
counts upwards), and constructs a dictionary from the results.
Original answer, written by nneonneo.

How to find second smallest UNIQUE number in a list?

I need to create a function that returns the second smallest unique number, which means if
list1 = [5,4,3,2,2,1], I need to return 3, because 2 is not unique.
I've tried:
def second(list1):
result = sorted(list1)[1]
return result
and
def second(list1):
result = list(set((list1)))
return result
but they all return 2.
EDIT1:
Thanks guys! I got it working using this final code:
def second(list1):
b = [i for i in list1 if list1.count(i) == 1]
b.sort()
result = sorted(b)[1]
return result
EDIT 2:
Okay guys... really confused. My Prof just told me that if list1 = [1,1,2,3,4], it should return 2 because 2 is still the second smallest number, and if list1 = [1,2,2,3,4], it should return 3.
Code in eidt1 wont work if list1 = [1,1,2,3,4].
I think I need to do something like:
if duplicate number in position list1[0], then remove all duplicates and return second number.
Else if duplicate number postion not in list1[0], then just use the code in EDIT1.
Without using anything fancy, why not just get a list of uniques, sort it, and get the second list item?
a = [5,4,3,2,2,1] #second smallest is 3
b = [i for i in a if a.count(i) == 1]
b.sort()
>>>b[1]
3
a = [5,4,4,3,3,2,2,1] #second smallest is 5
b = [i for i in a if a.count(i) == 1]
b.sort()
>>> b[1]
5
Obviously you should test that your list has at least two unique numbers in it. In other words, make sure b has a length of at least 2.
Remove non unique elements - use sort/itertools.groupby or collections.Counter
Use min - O(n) to determine the minimum instead of sort - O(nlongn). (In any case if you are using groupby the data is already sorted) I missed the fact that OP wanted the second minimum, so sorting is still a better option here
Sample Code
Using Counter
>>> sorted(k for k, v in Counter(list1).items() if v == 1)[1]
1
Using Itertools
>>> sorted(k for k, g in groupby(sorted(list1)) if len(list(g)) == 1)[1]
3
Here's a fancier approach that doesn't use count (which means it should have significantly better performance on large datasets).
from collections import defaultdict
def getUnique(data):
dd = defaultdict(lambda: 0)
for value in data:
dd[value] += 1
result = [key for key in dd.keys() if dd[key] == 1]
result.sort()
return result
a = [5,4,3,2,2,1]
b = getUnique(a)
print(b)
# [1, 3, 4, 5]
print(b[1])
# 3
Okay guys! I got the working code thanks to all your help and helping me to think on the right track. This code works:
`def second(list1):
if len(list1)!= len(set(list1)):
result = sorted(list1)[2]
return result
elif len(list1) == len(set(list1)):
result = sorted(list1)[1]
return result`
Okay, here usage of set() on a list is not going to help. It doesn't purge the duplicated elements. What I mean is :
l1=[5,4,3,2,2,1]
print set(l1)
Prints
[0, 1, 2, 3, 4, 5]
Here, you're not removing the duplicated elements, but the list gets unique
In your example you want to remove all duplicated elements.
Try something like this.
l1=[5,4,3,2,2,1]
newlist=[]
for i in l1:
if l1.count(i)==1:
newlist.append(i)
print newlist
This in this example prints
[5, 4, 3, 1]
then you can use heapq to get your second largest number in your list, like this
print heapq.nsmallest(2, newlist)[-1]
Imports : import heapq, The above snippet prints 3 for you.
This should to the trick. Cheers!

Optimize search to find next matching value in a list

I have a program that goes through a list and for each objects finds the next instance that has a matching value. When it does it prints out the location of each objects. The program runs perfectly fine but the trouble I am running into is when I run it with a large volume of data (~6,000,000 objects in the list) it will take much too long. If anyone could provide insight into how I can make the process more efficient, I would greatly appreciate it.
def search(list):
original = list
matchedvalues = []
count = 0
for x in original:
targetValue = x.getValue()
count = count + 1
copy = original[count:]
for y in copy:
if (targetValue == y.getValue):
print (str(x.getLocation) + (,) + str(y.getLocation))
break
Perhaps you can make a dictionary that contains a list of indexes that correspond to each item, something like this:
values = [1,2,3,1,2,3,4]
from collections import defaultdict
def get_matches(x):
my_dict = defaultdict(list)
for ind, ele in enumerate(x):
my_dict[ele].append(ind)
return my_dict
Result:
>>> get_matches(values)
defaultdict(<type 'list'>, {1: [0, 3], 2: [1, 4], 3: [2, 5], 4: [6]})
Edit:
I added this part, in case it helps:
values = [1,1,1,1,2,2,3,4,5,3]
def get_next_item_ind(x, ind):
my_dict = get_matches(x)
indexes = my_dict[x[ind]]
temp_ind = indexes.index(ind)
if len(indexes) > temp_ind + 1:
return(indexes)[temp_ind + 1]
return None
Result:
>>> get_next_item_ind(values, 0)
1
>>> get_next_item_ind(values, 1)
2
>>> get_next_item_ind(values, 2)
3
>>> get_next_item_ind(values, 3)
>>> get_next_item_ind(values, 4)
5
>>> get_next_item_ind(values, 5)
>>> get_next_item_ind(values, 6)
9
>>> get_next_item_ind(values, 7)
>>> get_next_item_ind(values, 8)
There are a few ways you could increase the efficiency of this search by minimising additional memory use (particularly when your data is BIG).
you can operate directly on the list you are passing in, and don't need to make copies of it, in this way you won't need: original = list, or copy = original[count:]
you can use slices of the original list to test against, and enumerate(p) to iterate through these slices. You won't need the extra variable count and, enumerate(p) is efficient in Python
Re-implemented, this would become:
def search(p):
# iterate over p
for i, value in enumerate(p):
# if value occurs more than once, print locations
# do not re-test values that have already been tested (if value not in p[:i])
if value not in p[:i] and value in p[(i + 1):]:
print(e, ':', i, p[(i + 1):].index(e))
v = [1,2,3,1,2,3,4]
search(v)
1 : 0 2
2 : 1 2
3 : 2 2
Implementing it this way will only print out the values / locations where a value is repeated (which I think is what you intended in your original implementation).
Other considerations:
More than 2 occurrences of value: If the value repeats many times in the list, then you might want to implement a function to walk recursively through the list. As it is, the question doesn't address this - and it may be that it doesn't need to in your situation.
using a dictionary: I completely agree with Akavall above, dictionary's are a great way of looking up values in Python - especially if you need to lookup values again later in the program. This will work best if you construct a dictionary instead of a list when you originally create the list. But if you are only doing this once, it is going to cost you more time to construct the dictionary and query over it than simply iterating over the list as described above.
Hope this helps!

iterating quickly through list of tuples

I wonder whether there's a quicker and less time consuming way to iterate over a list of tuples, finding the right match. What I do is:
# this is a very long list.
my_list = [ (old1, new1), (old2, new2), (old3, new3), ... (oldN, newN)]
# go through entire list and look for match
for j in my_list:
if j[0] == VALUE:
PAIR_FOUND = True
MATCHING_VALUE = j[1]
break
this code can take quite some time to execute, depending on the number of items in the list. I'm sure there's a better way of doing this.
I think that you can use
for j,k in my_list:
[ ... stuff ... ]
Assuming a bit more memory usage is not a problem and if the first item of your tuple is hashable, you can create a dict out of your list of tuples and then looking up the value is as simple as looking up a key from the dict. Something like:
dct = dict(tuples)
val = dct.get(key) # None if item not found else the corresponding value
EDIT: To create a reverse mapping, use something like:
revDct = dict((val, key) for (key, val) in tuples)
The question is dead but still knowing one more way doesn't hurt:
my_list = [ (old1, new1), (old2, new2), (old3, new3), ... (oldN, newN)]
for first,*args in my_list:
if first == Value:
PAIR_FOUND = True
MATCHING_VALUE = args
break
The code can be cleaned up, but if you are using a list to store your tuples, any such lookup will be O(N).
If lookup speed is important, you should use a dict to store your tuples. The key should be the 0th element of your tuples, since that's what you're searching on. You can easily create a dict from your list:
my_dict = dict(my_list)
Then, (VALUE, my_dict[VALUE]) will give you your matching tuple (assuming VALUE exists).
I wonder whether the below method is what you want.
You can use defaultdict.
>>> from collections import defaultdict
>>> s = [('red',1), ('blue',2), ('red',3), ('blue',4), ('red',1), ('blue',4)]
>>> d = defaultdict(list)
>>> for k, v in s:
d[k].append(v)
>>> sorted(d.items())
[('blue', [2, 4, 4]), ('red', [1, 3, 1])]

Categories