Python: Counting the occurrence from 2 big arrays

I have the following script that counts the occurrences of values from one array in another:

array_1 = [1,2,0,5,7,0]
array_2 = [1,0,1,1,9,6]

total_count = 0
# In array_2 there are 3 occurrences of 1 and 1 occurrence of zero; because
# zero appears twice in array_1, its count is added twice: 3 + 2 = 5.
for r in array_1:
    total_count = total_count + array_2.count(r)
print("total sum: {0}".format(total_count))

It's OK when dealing with small arrays, but it struggles when the array size increases (1 million elements in array_1 and 1 million in array_2). Is there a better way to approach this?
Sorry for the confusion, I updated the question a little bit.

Note: The answer by @Netwave is five times faster.

You can use collections.Counter. It is faster because it iterates over each list only once (and a Counter returns 0 for missing keys, so the c[x] lookup below is safe even for values that never appear in array_1).
from collections import Counter
array_1 = [1,2,0,5,7]
array_2 = [1,0,1,1,9]
c = Counter(array_1)
total_count = sum(c[x] for x in array_2)
print("total sum: {0}".format(total_count))

Use a set instead of a list:
array1_set = set(array_1)
total_count = sum(1 for x in array_2 if x in array1_set)
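Worth noting (my observation, not part of the answer): the set version counts each element of array_2 at most once, so it can differ from the original loop when array_1 contains duplicates. A quick check with the question's own arrays:

array_1 = [1,2,0,5,7,0]   # 0 appears twice
array_2 = [1,0,1,1,9,6]
array1_set = set(array_1)
print(sum(1 for x in array_2 if x in array1_set))  # 4, not 5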

If there are a lot of repeated numbers in array 1, you'll save time by caching them (building a dict in the form {number: count}). A typical caching function would look like this:

counts = {}

def get_count(number):
    if number in counts:
        return counts[number]
    else:
        count = your_counting_function(number)
        counts[number] = count
        return count
This behavior is packaged into the functools.lru_cache decorator, so that function can be simplified as:

from functools import lru_cache

@lru_cache(maxsize=None)
def get_count(number):
    return array_2.count(number)
This would be a pretty efficient approach if you have a small number of distinct values in array 1—say, a random shuffle of the integers 1 through 10. This is sometimes referred to as array_1 having a low cardinality (a cardinality of 10).
If you have a high cardinality (say 900k distinct values in an array of 1M values), a better optimization would be precomputing all the counts before you even start, so that you only have to make one pass over array_2. Dict lookups are much, much faster than counting through the whole array.
array_2_counts = {}
for number in array_2:
    if number in array_2_counts:
        array_2_counts[number] += 1
    else:
        array_2_counts[number] = 1

total_count = 0
for number in array_1:
    # .get with a default avoids a KeyError for numbers absent from array_2
    total_count += array_2_counts.get(number, 0)
Python has a built-in for this, too! The above code can be simplified using collections.Counter:
from collections import Counter

array_2_counter = Counter(array_2)
total_count = 0
for number in array_1:
    total_count += array_2_counter[number]

array_1 = [1,2,0,5,7]
array_2 = [1,0,1,1,9]
array_2_counts = {}
for number in array_1:
    freq = array_2.count(number)
    array_2_counts.update({number: freq})
print(array_2_counts)
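The same table can be built in one pass (my aside, not part of the answer above): collections.Counter produces the same dict without rescanning array_2 for every number:

from collections import Counter

c = Counter(array_2)
array_2_counts = {number: c[number] for number in array_1}
print(array_2_counts)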

Related

most efficient way to iterate over a large array looking for a missing element in Python

I was trying an online test. The test asked me to write a function that, given a list of up to 100000 integers in the range 1 to 100000, would find the first missing integer.
For example, if the list is [1,4,5,2] the output should be 3.
I iterated over the list as follows:

def find_missing(num):
    for i in range(1, 100001):
        if i not in num:
            return i

The feedback I received is that the code is not efficient at handling big lists.
I am quite new and I could not find an answer; how can I iterate more efficiently?
The first improvement would be to make yours linear by using a set for the repeated membership test:

def find_missing(nums):
    s = set(nums)
    for i in range(1, 100001):
        if i not in s:
            return i
Given how C-optimized Python sorting is, you could also do something like:

def find_missing(nums):
    s = sorted(set(nums))
    return next(i for i, n in enumerate(s, 1) if i != n)
But both of these are fairly space inefficient as they create a new collection. You can avoid that with an in-place sort:

from itertools import groupby

def find_missing(nums):
    nums.sort()  # in-place
    return next(i for i, (k, _) in enumerate(groupby(nums), 1) if i != k)
For any range of numbers, the sum is given by Gauss's formula:

# sum of all numbers up to and including nums[-1], minus
# sum of all numbers up to but not including nums[0]
expected = nums[-1] * (nums[-1] + 1) // 2 - nums[0] * (nums[0] - 1) // 2

If a number is missing, the actual sum will be:

actual = sum(nums)

The difference is the missing number:

result = expected - actual

This computation is O(n), which is as efficient as you can get: expected is an O(1) computation, while actual has to actually add up the elements.
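Putting those pieces together, a minimal sketch (my assembly of the fragments above; it assumes nums is sorted and exactly one number in the range is missing):

def find_missing(nums):
    expected = nums[-1] * (nums[-1] + 1) // 2 - nums[0] * (nums[0] - 1) // 2
    return expected - sum(nums)

print(find_missing([1, 2, 4, 5]))  # 3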
A somewhat slower but similar-complexity approach would be to step along the sequence in lockstep with either a range or itertools.count:

for a, e in zip(nums, range(nums[0], len(nums) + nums[0])):
    if a != e:
        return e  # or break if not in a function
Notice the difference between a single comparison a != e, vs a linear containment check like e in nums, which has to iterate on average through half of nums to get the answer.
You can use Counter to count every occurrence in your list. The minimum number with zero occurrences will be your output. For example:

from collections import Counter

def find_missing(your_list):
    count = Counter(your_list)
    keys = count.keys()  # every distinct element present
    main_list = list(range(1, 100001))  # the values from 1 to 100k
    missing_numbers = list(set(main_list) - set(keys))
    your_output = min(missing_numbers)
    return your_output

Creating data in loop subject to moving condition

I am trying to create a list of data in a for loop, then store this list in another list if it satisfies some condition. My code is:

import numpy as np

R = 10
lam = 1
proc_length = 100
L = 1

# Empty list to store lists
exponential_procs_lists = []

for procs in range(0, R):
    # Draw exponential random variables
    z_exponential = np.random.exponential(lam, proc_length)
    # Sort values into increasing order
    z_exponential.sort()
    # Insert 0 at the start of the array
    z_dat_r = np.insert(z_exponential, 0, 0)
    total = np.sum(np.diff(z_dat_r))
    if total < 5*L:
        exponential_procs_lists.append(z_dat_r)

which will store some of the R lists that satisfy the sum < 5*L condition. My question is: what is the best way to store R lists where the sum of each list is less than 5*L? The lists can be different lengths, but they must satisfy the condition that the sum of the increments is less than 5*L. Any help much appreciated.
Okay, so based on your comment, I take it that you want to generate an exponential_procs_list inside which every sublist has a sum < 5*L.
Well, I modified your code to chop the sublists as soon as the sum exceeds 5*L.
Edit: See the answer history for my previous answer using the approach above.
Looking closer, notice you don't actually need the discrete difference array. You're finding the difference array, summing it up, checking whether the sum is < 5*L and, if it is, appending the original array.
But notice this:
if your array is [0, 0.00760541, 0.22281415, 0.60476231], its difference array is [0.00760541, 0.21520874, 0.38194816].
Because the array starts at 0, adding the first x terms of the difference array gives you the (x+1)th element of the original array. So you really just need to keep the elements that are less than 5*L:
import numpy as np

R = 10
lam = 1
proc_length = 5
L = 1

exponential_procs_lists = []

def chop(nums, target):
    good_list = []
    for num in nums:
        if num >= target:
            break
        good_list.append(num)
    return good_list

for procs in range(0, R):
    z_exponential = np.random.exponential(lam, proc_length)
    z_exponential.sort()
    z_dat_r = np.insert(z_exponential, 0, 0)
    good_list = chop(z_dat_r, 5*L)
    exponential_procs_lists.append(good_list)
You could probably also just do a binary search (for better time complexity) or use a filter lambda; that's up to you.
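Since z_dat_r is already sorted, the binary-search version is a one-liner with np.searchsorted (a sketch of the suggestion above, not code from the answer):

import numpy as np

def chop_bisect(nums, target):
    # index of the first element >= target, found in O(log n)
    cut = np.searchsorted(nums, target, side='left')
    return nums[:cut]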

Efficiently find index of smallest number larger than some value in a large sorted list

If I have a long list of sorted numbers and I want to find the index of the smallest element larger than some value, is there a way to implement it more efficiently than using binary search on the entire list?
For example:

import random

c = 0
x = [0 for x in range(50000)]
for n in range(50000):
    c += random.randint(1, 100)
    x[n] = c

What would be the most efficient way of finding the location of the largest element in x smaller than some number z?
I know that you can already do:

import bisect
idx = bisect.bisect(x, z)

But assuming that this would be performed many times, would there be an even more efficient way than binary search? Since the range of the list is large, creating a dict of all possible integers uses too much memory. Would it be possible to create a smaller list of, say, every 5000th number and use that to speed up the lookup to a specific portion of the large list?
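For what it's worth, here is a minimal sketch of that "smaller list" idea (my illustration, not from the question): binary-search a coarse index first, then binary-search only the matching block using the lo/hi bounds that bisect accepts. Note that plain bisect.bisect(x, z) is already O(log n), so this mostly helps cache locality rather than asymptotic complexity:

import bisect

step = 5000
coarse = x[::step]  # every 5000th element of the sorted list

def lookup(z):
    block = max(bisect.bisect(coarse, z) - 1, 0)
    lo = block * step
    hi = min(lo + step, len(x))
    return bisect.bisect(x, z, lo, hi)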
Can you check whether this could be a solution?
It takes long to generate the list, but it seems fast to report the result.
Given the list:

import random

limit = 50  # sets the number of elements
c = 0
x = [0 for x in range(limit)]
for n in range(limit):
    c += random.randint(1, 100)
    x[n] = c
print(x)

Since it is a sorted list, you can retrieve the value using a for loop:

z = 1600  # reference value for the lookup
res = ()
for i, n in enumerate(x):
    if n > z:
        res = (i, n)
        break
print(res)  # the index and value of the first element that matches the condition
print(x[res[0]-1])  # the element just before
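The same lookup with the bisect module the question already mentions runs in O(log n) instead of O(n) (my addition; it assumes z lies above the first element and below the last):

import bisect

i = bisect.bisect_right(x, z)  # index of the first element > z
print((i, x[i]))               # index and value of that element
print(x[i - 1])                # the element just before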

calculate the sum of repeated numbers in a tuple

I have a tuple of 50 numbers having digits 0...9 (with repetition), and I want to calculate the sum of each repeated digit and create a new tuple for each repeated digit. How can I do that in Python?
(1,2,2,3,4,6,9,1,3,5,6,9,2,2,2,4,6,8,....9). So I want the sum of each repeated number, like sumof2, sumof3, and so on. I don't know how to proceed.
Try using the groupby() function in itertools:

from itertools import groupby

data = (1,2,2,2,3,3,...)
# Note: groupby only groups adjacent equal values; sort the data first
# if equal digits are scattered through the tuple.
for key, group in groupby(data):
    print("The sum of", key, "is", sum(group))
If you wanted to do this without itertools (because reasons), then the best approach would be to use a 'remembering' variable. (This code could probably be cleaned a little.)

sums = []
prev = None
curr_sum = 0
for element in data:
    if element != prev:
        if prev is not None:
            sums.append(curr_sum)
        curr_sum = 0
        prev = element
    curr_sum += element
sums.append(curr_sum)

This will leave you with a list of the sums.
OR, with dictionaries even!

sums = {}
for element in data:
    sums[element] = data.count(element) * element
# sums[4] = sum of the 4s
Maybe collections.Counter might help in this case, if I'm reading the question correctly.
From what I understand, you want the sum of the repeated elements inside a tuple, paired with the corresponding int value?
This is by no means an efficient way of solving this, but hopefully it helps. I found this answer on a different kind of question to help solve yours:
How to count the frequency of the elements in a list? Answered by YOU

from collections import Counter

data = (0,1,2,3,4,5,6,7,8,9,2,2,3,4,5,6,...)
results = ()
counter = Counter(data)
for key, value in sorted(counter.items()):
    results += ((key, value * key),)
print(results)

Create a long list of random values, no duplicates

I want to create a list given two inputs, under the condition that there cannot be any duplicates. The list should contain a random sequence of numbers, and the numbers in the list are positive integers.
Input 1: the length of the list (var samples)
Input 2: the highest number in the list (var end)
I know how to do this, but I want the list to contain a vast quantity of numbers, 1 million or more.
I have created 2 methods to solve this problem myself; both have their issues, one of them is slow and the other produces a MemoryError.
Method 1, MemoryError:

import random

def create_lst_rand_int(end, samples):
    if samples > end:
        print('You cannot create this list')
    else:
        lst = []
        lst_possible_values = range(0, end)
        for item in range(0, samples):
            random_choice = random.choice(lst_possible_values)
            lst_possible_values.remove(random_choice)
            lst.append(random_choice)
        return lst

print create_lst_rand_int(1000000000000, 100000000001)
Method 2, slow:

import random

def lst_rand_int(end, samples):
    lst = []
    # lst cannot exist under these conditions
    if samples > end:
        print('List must be longer or equal to the highest value')
    else:
        while len(lst) < samples:
            random_int = random.randint(0, end)
            if random_int not in lst:
                lst.append(random_int)
    return lst

print lst_rand_int(1000000000000, 100000000001)
Since neither of my methods works well (method 1 does work better than method 2), I would like to know how I can create a list that better meets my requirements.
Try the solution given in the docs:
http://docs.python.org/2/library/random.html#random.sample
To choose a sample from a range of integers, use an xrange() object as an argument. This is especially fast and space efficient for sampling from a large population: sample(xrange(10000000), 60).
Or, in your case, random.sample(xrange(0,1000000000000), 100000000001)
This is still a giant data structure that may or may not fit in your memory. On my system:
>>> sys.getsizeof(1)
24
So 100000000001 samples will require 2400000000024 bytes, or roughly 2.4 terabytes. I suggest you find a way to work with smaller numbers of samples.
Try:
temp = xrange(end+1)
random.sample(temp, samples)
random.sample() does not pick any duplicates.
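For readers on Python 3 (my aside, not part of the answer): xrange is gone, but range() is lazy in the same way, so the equivalent is:

import random

result = random.sample(range(end + 1), samples)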
Since sample always returns a list, you're out of luck with such a large size. Try using a generator instead:

def rrange(min, max):
    seen = set()
    while len(seen) <= max - min:
        n = random.randint(min, max)
        if n not in seen:
            seen.add(n)
            yield n

This still requires memory to store seen elements, but at least not everything at once.
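Usage would look something like this (my example; islice just pulls the first few unique values lazily from the rrange generator above):

from itertools import islice

gen = rrange(0, 10**12)
first_ten = list(islice(gen, 10))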
You could use a set instead of a list, and avoid checking for duplicates.

def lr2(end, samples):
    lst = set()
    # lst cannot exist under these conditions
    if samples > end:
        print('List must be longer or equal to the highest value')
    else:
        for _ in range(samples):
            random_int = random.randint(0, end)
            # duplicates are silently absorbed, so the set can end up
            # holding fewer than `samples` values
            lst.add(random_int)
    return lst
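If the exact size matters, a small tweak (my suggestion, not part of the answer) is to loop until the set is full rather than a fixed number of times:

while len(lst) < samples:
    lst.add(random.randint(0, end))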
Since your sample size is such a large percentage of the items being sampled, a much faster approach is to shuffle the list of items and then just take the first or last n items:

import random

def lst_rand_int(end, samples):
    lst = range(0, end)  # list(range(0, end)) on Python 3
    random.shuffle(lst)
    return lst[0:samples]

If samples > end, it will just return the whole list.
If the list is too large for memory, you can break it into parts and store the parts on disc. In that case, for each sample required, randomly choose a section, then randomly choose an item within that section and remove it.
