calculate the sum of repeated numbers in a tuple - python

I have a tuple of 50 numbers made up of the digits 0...9 (with repetition), and I want to calculate the sum of each repeated digit and create a new tuple of those sums. How can I do that in Python?
(1,2,2,3,4,6,9,1,3,5,6,9,2,2,2,4,6,8,....9) — so I want the sum of each repeated number, like sumof2, sumof3, and so on. I don't know how to proceed.

Try using the groupby() function in itertools. Note that groupby() only groups consecutive equal elements, so sort the data first:
from itertools import groupby

data = (1,2,2,2,3,3,...)
for key, group in groupby(sorted(data)):
    print("The sum of", key, "is", sum(group))
If you wanted to do this without itertools (because reasons), then the best approach would be to use a 'remembering' variable; this likewise assumes the data is sorted so equal digits are adjacent. (This code could probably be cleaned a little.)
sums = []
prev = -1   # sentinel: no digit seen yet
curr_sum = 0
for element in data:
    if element != prev:
        if prev >= 0:  # skip the append on the very first element
            sums.append(curr_sum)
        curr_sum = 0
        prev = element
    curr_sum += element
sums.append(curr_sum)
This will leave you with an array of the sums.
OR, with dictionaries even!
sums = {}
for element in data:
    sums[element] = data.count(element) * element
# sums[4] = sum of 4s
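If you prefer a one-liner, the same dictionary can be built with a dict comprehension; iterating over set(data) also avoids re-counting a digit once per occurrence (a sketch, assuming the same data tuple as above):
sums = {d: data.count(d) * d for d in set(data)}
# sums[4] = sum of 4s, as before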

Maybe collections.Counter might help in this case if I'm reading the question correctly.
From what I understand, you want the sum of the repeated elements inside a tuple, keyed by the element's int value?
This is by no means an efficient way of solving this, but hopefully it helps. I found the approach in an answer to a different kind of question:
How to count the frequency of the elements in a list?
from collections import Counter

data = (0,1,2,3,4,5,6,7,8,9,2,2,3,4,5,6,......)
results = ()
counter = Counter(data)
for key, value in sorted(counter.items()):
    results += ((key, value * key),)
print(results)


most efficient way to iterate over a large array looking for a missing element in Python

I was trying an online test. The test asked to write a function that, given a list of up to 100000 integers in the range 1 to 100000, would find the first missing integer.
For example, if the list is [1,4,5,2] the output should be 3.
I iterated over the list as follows
def find_missing(num):
    for i in range(1, 100001):
        if i not in num:
            return i
The feedback I received is that the code is not efficient in handling big lists.
I am quite new and I could not find an answer; how can I iterate more efficiently?
The first improvement would be to make yours linear by using a set for the repeated membership test:
def find_missing(nums):
    s = set(nums)
    for i in range(1, 100001):
        if i not in s:
            return i
Given how C-optimized Python sorting is, you could also do something like:
def find_missing(nums):
    s = sorted(set(nums))
    return next(i for i, n in enumerate(s, 1) if i != n)
But both of these are fairly space inefficient as they create a new collection. You can avoid that with an in-place sort:
from itertools import groupby

def find_missing(nums):
    nums.sort()  # in-place
    return next(i for i, (k, _) in enumerate(groupby(nums), 1) if i != k)
For any range of numbers, the sum is given by Gauss's formula:
# sum of all numbers up to and including nums[-1] minus
# sum of all numbers up to but not including nums[-1]
expected = nums[-1] * (nums[-1] + 1) // 2 - nums[0] * (nums[0] - 1) // 2
If a number is missing, the actual sum will be
actual = sum(nums)
The difference is the missing number:
result = expected - actual
This computation is O(n), which is as efficient as you can get: expected is an O(1) computation, while actual has to actually add up the elements.
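Put together, a minimal sketch of this approach (assuming nums is sorted and exactly one number in its range is missing):
def find_missing(nums):
    # Gauss: sum of nums[0]..nums[-1] inclusive, minus the sum actually present
    expected = nums[-1] * (nums[-1] + 1) // 2 - nums[0] * (nums[0] - 1) // 2
    return expected - sum(nums)

print(find_missing([1, 2, 4, 5]))  # -> 3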
A somewhat slower but similar complexity approach would be to step along the sequence in lockstep with either a range or itertools.count:
for a, e in zip(nums, range(nums[0], len(nums) + nums[0])):
    if a != e:
        return e  # or break if not in a function
Notice the difference between a single comparison a != e, vs a linear containment check like e in nums, which has to iterate on average through half of nums to get the answer.
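The itertools.count variant mentioned above looks much the same, just without precomputing the range bounds; a minimal sketch, again assuming nums is sorted:
from itertools import count

def find_missing(nums):
    # pair each element with the value it should hold, counting up from nums[0]
    for a, e in zip(nums, count(nums[0])):
        if a != e:
            return e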
You can use Counter to count every occurrence in your list. The minimum number with zero occurrences will be your output. For example:
from collections import Counter

def find_missing(your_list):
    count = Counter(your_list)
    keys = count.keys()  # the distinct elements present in your_list
    main_list = list(range(1, 100001))  # the values from 1 to 100k
    missing_numbers = list(set(main_list) - set(keys))
    your_output = min(missing_numbers)
    return your_output

Creating data in loop subject to moving condition

I am trying to create a list of data in a for loop then store this list in a list if it satisfies some condition. My code is
import numpy as np

R = 10
lam = 1
proc_length = 100
L = 1
# Empty list to store lists
exponential_procs_lists = []
for procs in range(0, R):
    # Draw exponential random variables
    z_exponential = np.random.exponential(lam, proc_length)
    # Sort values into increasing order
    z_exponential.sort()
    # Insert 0 at the start of the array
    z_dat_r = np.insert(z_exponential, 0, 0)
    increment_sum = np.sum(np.diff(z_dat_r))  # renamed to avoid shadowing the built-in sum
    if increment_sum < 5*L:
        exponential_procs_lists.append(z_dat_r)
which will store those of the R lists that satisfy the sum < 5L condition. My question is: what is the best way to store R lists where the sum of each list is less than 5L? The lists can have different lengths, but they must satisfy the condition that the sum of the increments is less than 5*L. Any help much appreciated.
Okay, so based on your comment, I take it that you want to generate an exponential_procs_list inside which every sublist has a sum < 5*L.
Well, I modified your code to chop the sublists as soon as the sum exceeds 5*L.
Edit: See the answer history for my previous answer taking the approach above.
Looking closer, notice you don't actually need the discrete difference array. You're finding the difference array, summing it up, checking whether the sum is < 5L and, if it is, appending the original array.
But notice this:
if your array is like so: [0, 0.00760541, 0.22281415, 0.60476231], its difference array would be [0.00760541, 0.21520874, 0.38194816].
If you add the first x terms of the difference array, you get the (x+1)th element of the original array (the differences telescope, and the array starts at 0). So you really just need to keep the elements which are less than 5L:
import numpy as np

R = 10
lam = 1
proc_length = 5
L = 1
exponential_procs_lists = []

def chop(nums, target):
    good_list = []
    for num in nums:
        if num >= target:
            break
        good_list.append(num)
    return good_list

for procs in range(0, R):
    z_exponential = np.random.exponential(lam, proc_length)
    z_exponential.sort()
    z_dat_r = np.insert(z_exponential, 0, 0)
    good_list = chop(z_dat_r, 5*L)
    exponential_procs_lists.append(good_list)
You could probably also just do a binary search (for better time complexity) or use a filter lambda; that's up to you.
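Since z_dat_r is already sorted, the binary-search variant is a one-liner with NumPy; a small sketch, assuming the same z_dat_r and L as above:
cut = np.searchsorted(z_dat_r, 5*L)  # first index whose value is >= 5*L
good_list = list(z_dat_r[:cut])      # same result as chop(z_dat_r, 5*L)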

How to find the most common string(s) in a Python list?

I am dealing with ancient DNA data. I have an array with n different base pair calls for a given coordinate.
e.g.,
['A','A','C','C','G']
I need to set up a bit in my script whereby the most frequent call(s) are identified. If there is one, it should use that one. If there are two (or three) that are tied (e.g., A and C here), I need it to randomly pick one of them.
I have been looking for a solution but cannot find anything satisfactory. The most frequent suggestion I see is Counter, but Counter is useless for me as c.most_common(1) will not reveal that the top two entries are tied.
You can get the maximum count from the mapping returned by Counter with the max function first, and then use a list comprehension to output only the keys whose counts equal that maximum. Since Counter, max, and the list comprehension all cost linear time, the overall time complexity stays O(n):
from collections import Counter
import random
lst = ['A','A','C','C','G']
counts = Counter(lst)
greatest = max(counts.values())
print(random.choice([item for item, count in counts.items() if count == greatest]))
This outputs either A or C.
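If you would rather stay with most_common, the same tie-aware pick can be made from its output; a short sketch reusing the counts from above:
top = counts.most_common()  # (item, count) pairs, ordered by decreasing count
ties = [item for item, count in top if count == top[0][1]]
print(random.choice(ties))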
Something like this would work:
import random

string = ['A','A','C','C','G']
dct = {}
for x in set(string):
    dct[x] = string.count(x)
max_value = max(dct.values())
lst = []
for key, value in dct.items():
    if value == max_value:
        lst.append(key)
print(random.choice(lst))

Python: Counting the occurrence from 2 big arrays

I have the following script that counts, for each value in one array, its occurrences in another array
array_1 = [1,2,0,5,7,0]
array_2 = [1,0,1,1,9,6]
# array_2 has 3 occurrences of 1 and 1 occurrence of 0; since 0 appears
# twice in array_1, its count is added twice: 3 + 1 + 1 = 5
total_count = 0
for r in array_1:
    total_count = total_count + array_2.count(r)
print("total sum: {0}".format(total_count))
It's OK when dealing with small arrays, but it struggles when the array size increases (1 million elements in each of array_1 and array_2). Is there a better way to approach this?
Sorry for the confusion, I updated the question a little bit.
Note: The answer by @Netwave is five times faster.
You can use collections.Counter. It is faster because it iterates over each list only once.
from collections import Counter
array_1 = [1,2,0,5,7]
array_2 = [1,0,1,1,9]
c = Counter(array_1)
total_count = sum(c[x] for x in array_2)
print("total sum: {0}".format(total_count))
Use a set instead of a list:
array1_set = set(array_1)
total_count = sum(1 for x in array_2 if x in array1_set)
If there are a lot of repeated numbers in array 1, you'll save time by caching them (building a dict in the form {number: count}). A typical caching function would look like this:
counts = {}

def get_count(number):
    if number in counts:
        return counts[number]
    else:
        count = your_counting_function(number)
        counts[number] = count
        return count
This behavior is packaged into the functools.lru_cache decorator, so that function can be simplified to:
from functools import lru_cache

@lru_cache(maxsize=None)
def get_count(number):
    return array_2.count(number)
This would be a pretty efficient approach if you have a small number of distinct values in array 1—say, a random shuffle of the integers 1 through 10. This is sometimes referred to as array_1 having a low cardinality (a cardinality of 10).
If you have a high cardinality (say 900k distinct values in an array of 1M values), a better optimization would be precomputing all the counts before you even start, so that you only have to make one pass over array_2. Dict lookups are much, much faster than counting through the whole array.
array_2_counts = {}
for number in array_2:
    if number in array_2_counts:
        array_2_counts[number] += 1
    else:
        array_2_counts[number] = 1

total_count = 0
for number in array_1:
    total_count += array_2_counts.get(number, 0)  # 0 for values absent from array_2
Python has a built-in for this, too! The above code can be simplified using collections.Counter (which returns 0 for missing keys):
from collections import Counter

array_2_counter = Counter(array_2)
total_count = 0
for number in array_1:
    total_count += array_2_counter[number]
array_1 = [1,2,0,5,7]
array_2 = [1,0,1,1,9]
array_2_counts = {}
for number in array_1:
    freq = array_2.count(number)
    array_2_counts.update({number: freq})
print(array_2_counts)

Pythonic way of checking if indefinite # of consec elements in list sum to given value

Having trouble figuring out a nice way to get this task done.
Say I have a list of triangular numbers up to 1000 -> [0,1,3,6,10,15,...] etc.
Given a number, I want to return the consecutive elements in that list that sum to that number.
i.e.
64 --> [15,21,28]
225 --> [105,120]
371 --> [36, 45, 55, 66, 78, 91]
If there are no consecutive numbers that add up to it, return an empty list.
882 --> [ ]
Note that the length of consecutive elements can be any number - 3,2,6 in the examples above.
The brute force way would be to iteratively check every possible consecutive grouping for each element (start at 0, look at the sum of [0,1], look at the sum of [0,1,3], and so on until the sum is greater than the target number). But that's probably O(n^2) or maybe worse. Any way to do it better?
UPDATE:
Ok, so a friend of mine figured out a solution that works in O(n) (I think) and is intuitively easy to follow. It might be similar (or the same) as Gabriel's answer, but that one was difficult for me to follow, and I like that this solution is understandable even from a basic perspective. This is an interesting question, so I'll share her answer:
from functools import reduce  # Python 3; in Python 2, reduce is a built-in

def findConsec(input1=7735):
    list1 = range(1, 1001)
    newlist = [reduce(lambda x, y: x + y, list1[0:i]) for i in list1]
    curr = 0
    end = 2
    num = sum(newlist[curr:end])
    while num != input1:
        if num < input1:
            num += newlist[end]
            end += 1
        elif num > input1:
            num -= newlist[curr]
            curr += 1
        if curr == end:
            return []
    if num == input1:
        return newlist[curr:end]
A 3-iteration max solution
Another solution would be to start close to where your group should lie and walk forward from one position behind it. For any number in the triangular list vec, its value is defined by its index as:
vec[i] = sum(range(0,i+1))
Dividing the target sum by the length of the group gives the average of the group, which hence lies within it, although it may not itself occur in the list.
Therefore, you can set the starting point for finding a group of n numbers whose sum matches a value val as the integer part of that division. As the result may not be in the list, take the position that minimizes their difference.
# vec as np.ndarray -> the triangular or whatever-type series
# val as int -> sum of n elements you are looking for
# n as int -> number of elements to be summed
import numpy as np

def seq_index(vec, n, val):
    index0 = np.argmin(abs(vec - (val/n))) - n//2 - 1  # covers odd and even n values
    intsum = 0  # running sum to keep track of
    count = 0   # counter
    seq = []    # index groups of vec that sum up to val
    while count <= 2:  # walk forward from just before the initial guess of where the group begins
        intsum = sum(vec[(index0+count):(index0+count+n)])
        if intsum == val:
            seq.append(list(range(index0+count, index0+count+n)))
        count += 1
    return seq
# Example
vec = []
for i in range(0, 100):
    vec.append(sum(range(0, i)))  # build the triangular series from i = 0 (0) up to i = 99 (4851)
vec = np.array(vec)  # convert to numpy to make it easier to query ranges

# looking for three consecutive elements that sum to 4
indices = seq_index(vec, 3, 4)
print(indices[0])
print(vec[indices[0]])
print(sum(vec[indices[0]]))
Returns
print(indices[0]) -> [1, 2, 3]
print(vec[indices[0]]) -> [0 1 3]
print(sum(vec[indices[0]])) -> 4 (which we were looking for)
This seems like an algorithm question rather than a question on how to do it in Python.
Thinking backwards, I would copy the list and use it in a way similar to the Sieve of Eratosthenes. I would not consider the numbers that are greater than x. Then start from the greatest number and sum backwards. If the sum gets greater than x, subtract the greatest number (exclude it from the solution) and continue summing backward.
This seems the most efficient way to me and is actually O(n): you never go back (or forward, in this backward algorithm), except when you subtract or remove the biggest element, which doesn't need access to the list again, just a temp var.
To answer Dunes' question:
Yes, there is a reason: to subtract the next largest element when a partial sum overshoots with no solution. Starting from the first element instead, hitting a dead end would require accessing the list again (or a temporary solution list) to subtract a set of elements whose sum is greater than the next element to add. You risk increasing the complexity by accessing more elements.
To improve efficiency in the cases where an eventual solution is at the beginning of the sequence, you can search for the starting pair using binary search: once a pair of 2 elements smaller than x is found, sum the pair; if it sums larger than x go left, otherwise go right. This search has logarithmic complexity in theory. In practice, complexity is not what it is in theory, and you can do whatever you like :)
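A minimal sketch of the backward sliding window described above, assuming triangles is the ascending list of triangular numbers no greater than x:
def find_consec_backwards(triangles, x):
    # window is triangles[lo:hi+1]; start with just the largest candidate
    hi = len(triangles) - 1
    lo = hi
    total = triangles[hi]
    while True:
        if total == x:
            return triangles[lo:hi + 1]
        if total > x:
            if hi == 0:
                break
            total -= triangles[hi]  # exclude the biggest element
            hi -= 1
            if lo > hi:             # keep the window non-empty
                lo = hi
                total = triangles[hi]
        else:
            if lo == 0:
                break               # window reached the front and is still too small
            lo -= 1                 # extend the window backwards
            total += triangles[lo]
    return []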
You could pick the first three elements, sum them, and then keep subtracting the first of the three and adding the next element in the list, checking each time whether the sum adds up to whatever number you want. That would be O(n), although only for windows of a fixed length of three.
# vec as np.ndarray
import numpy as np

itsum = sum(vec[0:3])  # the running sum to check
sequence = [range(0, 3)] if itsum == whatever else []  # index ranges of vec that add up to `whatever`
for i in range(3, len(vec)):
    itsum -= vec[i-3]
    itsum += vec[i]
    if itsum == whatever:
        sequence.append(range(i-2, i+1))  # list of index ranges that add up to `whatever`
The solution you provide in the question isn't truly O(n) time complexity: the way you compute your triangle numbers makes the computation O(n^2). The list comprehension throws away the previous work that went into calculating the last triangle number. That is: t(i) = t(i-1) + i (where t(i) is the i-th triangle number). Since you also store the triangle numbers in a list, your space complexity is not constant, but related to the size of the number you are looking for. Below is an identical algorithm, but with O(n) time complexity and O(1) space complexity (written for Python 3).
# for python 2, replace things like `highest = next(high)` with `highest = high.next()`
from itertools import count, takewhile, accumulate

def find(to_find):
    # next(low) == lowest number in total
    # next(high) == highest number not in total
    low = accumulate(count(1))   # generator of triangle numbers
    high = accumulate(count(1))
    total = highest = next(high)
    # highest = highest number in the sequence that sums to total
    # definitely can't find a solution if the highest number in the sum is greater than to_find
    while highest <= to_find:
        # found a solution
        if total == to_find:
            # keep taking numbers from the low iterator until we find the highest number in the sum
            return list(takewhile(lambda x: x <= highest, low))
        elif total < to_find:
            # add the next highest triangle number not in the sum
            highest = next(high)
            total += highest
        else:  # total > to_find
            # subtract the lowest triangle number in the sum
            total -= next(low)
    return []
