So I'm trying to solve a challenge and have come across a dead end. My solution works when the list is small or medium, but when it has over 50,000 elements it just times out.
a = int(input().strip())
b = list(map(int,input().split()))
result = []
flag = []
for i in range(len(b)):
    temp = a - b[i]
    if(temp >= 0 and temp in flag):
        if(temp < b[i]):
            result.append((temp, b[i]))
        else:
            result.append((b[i], temp))
        flag.remove(temp)
    else:
        flag.append(b[i])
    result.sort()
for i in result:
    print(i[0], i[1])
where
a = 10
and
b = [2, 4, 6, 8, 5]
The solution should find any two elements of b whose sum equals a.
**Edit:** Updated full code
flag is a list, potentially of the same order of magnitude as b. So, when you do temp in flag, that's a linear search: it has to check every value in flag to see whether that value == temp. That's up to 50,000 comparisons. And you're doing that once per iteration of a linear walk over b. So your total time is quadratic: 50,000 * 50,000 = 2,500,000,000. (And flag.remove is also linear time.)
If you replace flag with a set, you can test it for membership (and remove from it) in constant time. So your total time drops from quadratic to linear, or 50,000 steps, which is a lot faster than 2.5 billion:
flagset = set(flag)
for i in range(len(b)):
    temp = a - b[i]
    if(temp >= 0 and temp in flagset):
        if(temp < b[i]):
            result.append((temp, b[i]))
        else:
            result.append((b[i], temp))
        flagset.remove(temp)
    else:
        flagset.add(b[i])
flag = list(flagset)
If flag needs to retain duplicate values, then it's a multiset, not a set, which means you can implement it with collections.Counter:
import collections

flagset = collections.Counter(flag)
for i in range(len(b)):
    temp = a - b[i]
    if(temp >= 0 and flagset[temp]):
        if(temp < b[i]):
            result.append((temp, b[i]))
        else:
            result.append((b[i], temp))
        flagset[temp] -= 1
    else:
        flagset[b[i]] += 1
flag = list(flagset.elements())
In your edited code, you’ve got another list that’s potentially of the same size, result, and you’re sorting that list every time through the loop.
Sorting takes log-linear time. Since you do it up to 50,000 times, that's around log(50,000) * 50,000 * 50,000, or roughly 40 billion steps.
If you needed to keep result in order throughout the operation, you'd want to use a logarithmic data structure, like a binary search tree or a skip list, so you could insert each new element in the right place in logarithmic time, which would mean only around 800,000 steps.
But you don’t need it in order until the end. So, much more simply, just move the result.sort out of the loop and do it at the end.
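A minimal sketch of that rearrangement, assuming a and b are read in as in the original code: build result inside the loop, then sort and print once at the end.

result = []
flagset = set()
for x in b:
    temp = a - x
    if temp >= 0 and temp in flagset:
        result.append((min(temp, x), max(temp, x)))   # smaller value first
        flagset.remove(temp)
    else:
        flagset.add(x)
result.sort()    # sort once, after the loop, instead of on every iteration
for lo, hi in result:
    print(lo, hi)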
Related
I made a sorting algorithm based on binary search, and I was wondering whether there is a way to improve its memory usage and speed of execution. Specifically, is there a way to quickly insert an item? I know that using the insert method requires shifting all the values to the right of the insertion point by one, which takes time, so is there a faster way to do this, maybe a different data structure that is mutable and indexable? I wrote this with the intention that the values are all real numbers (floats or integers). Also, am I correct to assume I won't be using much more memory at any given point in the execution, because values are popped off the original array and moved into the sorted list? That is, A (length of original array) - B (number of values popped off) + B (number of values popped off the original array that are moved into the sorted array) = A.
def binary_sort(array: list):
    """ Returns an organized version of the array """
    number = 0
    length = len(array)
    sorted_array = []
    while number < length:
        value = array.pop()            # take the next value off the end of the input
        start = 0
        end = number
        while start < end:             # binary search for the insertion point
            avg = (start + end) // 2
            v = sorted_array[avg]
            if v <= value:
                start = 1 + avg
            else:
                end = avg
        sorted_array.insert(start, value)
        number += 1
    return sorted_array
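For reference, the standard library's bisect module does the same binary search for an insertion point, although the insert itself still has to shift elements and so remains linear time. A minimal sketch (binary_sort_bisect is just an illustrative name):

import bisect

def binary_sort_bisect(array: list):
    """Same idea: binary-search the insertion point, then insert."""
    sorted_array = []
    for value in array:
        bisect.insort_right(sorted_array, value)  # O(log n) search + O(n) insert
    return sorted_array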
I've been looking at my code for a while and I'm just stuck as to where I messed up, so maybe one of you can help.
What my for loop should do is: it iterates through a long list of times. It averages the first 100 times and sets that as a value; this part works. Next it should add one to t (the variable I'm using in my for loop), so that the window shifts by one and it averages the next 100 times. If this new average is at least .1 seconds faster, it adds the t value to a list of x values and sets that average as the new value to beat. If it's not .1 lower, we increase t and try again until it is.
Here's my code, and I'll walk you through what it means to make it more readable
for t in times:
    t = int(t)
    sumset = sum(times[t:t + 100])
    avgset = (int(round(float(sumset/100), 3) * 10)) / 10
    if t + 100 > len(times):
        break
    elif (avgset) <= firstavg - .1:
        avglist.append(avgset)
        firstavg -= .1
        xlist.append(t)
        print(t)
        print("avgset is " + str(avgset))
        print("should decrease by .1 " + str(math.ceil(firstavg * 10) / 10))
        tlist.append(t)
        t += 1
    else:
        t += 1
I'll explain it here.
for t in times:
    t = int(t)
    sumset = sum(times[t:t + 100])
    avgset = (int(round(float(sumset/100), 3) * 10)) / 10
For each value in my list called times, we take the value and make sure it's an int; I do this because earlier I was getting an indexing error saying it wasn't an int. sumset gets the sum of the 100 times we need, and avgset turns it into an average, multiplies it by 10, uses int to chop off the decimal, and divides by ten to get a value truncated to tenths.
Ex
12.34 * 10 = 123.4, int(123.4) = 123, 123 / 10 is 12.3.
Then here
if t + 100 > len(times):
    break
We make sure there are 100 values left to iterate through; if not, we end the loop.
On this big chunk
elif (avgset) <= firstavg - .1:
    avglist.append(avgset)
    firstavg -= .1
    xlist.append(t)
    print(t)
    print("avgset is " + str(avgset))
    print("should decrease by .1 " + str(math.ceil(firstavg * 10) / 10))
    tlist.append(t)
    t += 1
We check: if the new average is <= the first average - .1, we append that average to a list of decreasing averages, decrease the first average by .1, and append the value of t to a list that will make up our x values. What it should produce is a list of x values, where each value corresponds to a decrease of .1 from the original average (times[t:t+100] with t = 0), and a y list (avglist) holding each of those decreased averages. I'm not sure where I messed up, so if someone could point me in the right direction I would really appreciate it, thanks!
In my opinion, there are multiple things to address in your code:
1) The main and most important issue is that you are mixing up the elements in your list (floats) with their indices, i.e. their positions in the list. What you want is to iterate over the indices, not over the elements themselves. What I mean is that, given the list:
my_list = [5.67, 4.23, 7.88, 9.5]
the indices of [5.67, 4.23, 7.88, 9.5] are respectively 0, 1, 2, 3. Python wants integers for indexing because it interprets these numbers as the positions of the elements in the list, regardless of their values. And positions, obviously, always need to be integers: you are either the 4th or the 5th, not the 4.23rd. However, this DOES NOT mean that the values of the elements themselves need to be integers. To account for this difference there is the Python builtin function enumerate():
>>> for index, value in enumerate([5.67, 4.23, 7.88, 9.5]):
... print (index, '->', value)
...
0 -> 5.67
1 -> 4.23
2 -> 7.88
3 -> 9.5
>>>
This is why you needed to convert your values (not the indices) to integers, and to do the trick of multiplying and dividing by 10 so as not to lose the 0.1 resolution you use to compare. You can forget about all that.
2) You do not really need to check in each iteration whether there are still 100 elements left in the list or not. It suffices to iterate up to the -100th element:
for index, time in enumerate(times[:-100]):
and it will automatically stop at the -100th element. However, when you do this, remember that you always want to use index, not time, as the loop variable. Moreover, in some other for loop where you need to check whether a condition is fulfilled to process the current element, and if not skip to the next one, you should use continue instead of break:
for index, time in enumerate(times):
    if index + 100 > len(times):
        continue
continue skips the rest of the current iteration and brings you back to the top of the for loop, ready to process the next element. break breaks out of the for loop entirely and stops the iteration.
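A tiny illustration of the difference, with made-up values:

for n in [1, 2, 3, 4]:
    if n == 2:
        continue   # skip 2 and keep looping: prints 1, 3, 4
    print(n)

for n in [1, 2, 3, 4]:
    if n == 2:
        break      # stop the whole loop at 2: prints only 1
    print(n)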
3) At the end of each of your iterations you have a
elif (...):
    ...
    t += 1
else:
    t += 1
this is wrong in many ways:
3.1) First, because you are inside a for loop, and t is the variable you are using to iterate. You do not need to tell the loop to add 1 to the iteration variable at the end of each iteration. Doing that is its one job. It knows.
3.2) Supposing t were some other control variable inside the loop that you really did need to increase by one manually, you are repeating code lines. You get the same effect if you remove the else clause and dedent the last line of the elif clause:
elif (...):
    ...
t += 1
so the code falls through to t += 1 regardless of whether the elif clause is satisfied or not.
3.3) This is related to bullet 1) above: in your particular case, since you are (wrongly) using the list elements themselves as t, doing t += 1 only rebinds the loop variable; it does not change the list, and the value is overwritten at the start of the next iteration anyway, so the statement has no effect.
Taking all this into account, one possible way to roughly implement your code could be:
import numpy as np

times = 100 * np.random.rand(150)
list_of_avgs = [np.sum(times[:100]) / 100.]
for index, element in enumerate(times[:-100]):
    avg = np.sum(times[index:index + 100]) / 100.
    if avg + 0.1 <= list_of_avgs[-1]:
        list_of_avgs.append(avg)
    else:
        continue
print(list_of_avgs)
which results in (the input data is randomly generated):
[49.779866192794358, 49.594673775689778, 49.4409179407875,
49.304759324340424, 49.106580355542434, 48.651419303532919,
48.505888846346672, 47.834645246733295, 47.300679740055074,
46.956253292222293, 46.598627928361239, 46.427709019922297]
Cheers and good luck!
D.
I have a question: what is the complexity of this algorithm?
def search(t):
    i = 0
    find = False
    while (not(find) and i < len(t)):
        j = i + 1
        while (not(find) and j < len(t)):
            if (t[j] == t[i]):
                find = True
            j += 1
        i += 1
    return find
Thanks
Assuming t is a list, it's quadratic (O(n^2), where n is the length of the list).
You know it is because it iterates through t (the first while loop), and in each of these iterations it iterates through t again, which means it iterates through len(t) elements, len(t) times. Therefore, O(len(t)**2).
You can bring the complexity of that algorithm down to O(len(t)) and exactly one line of code by using the appropriate data structure:
def search(t):
    return (len(set(t)) != len(t))
For more info about how sets work, see https://docs.python.org/2/library/stdtypes.html#set-types-set-frozenset
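For instance, with made-up inputs:

print(search([1, 2, 3, 2]))   # True: 2 appears twice
print(search([1, 2, 3, 4]))   # False: all elements are unique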
The best case complexity is O(1), as the search may succeed immediately.
The worst case complexity is O(N²), achieved in case the search fails (there are (N-1)+(N-2)+...+2+1 comparisons made, i.e. N(N-1)/2 in total).
The average case can be estimated as follows: assuming that the array contains K entries that are not unique and are spread uniformly, the first of these is located after N/K elements on average, so the outer loop will run N/K times, with a cost of (N-1)+(N-2)+...+(N-N/K) comparisons. In the last iteration of the outer loop, the inner loop will run about 2N/K times.
Roughly, the expected time is O(N²/K). For instance, with N = 10,000 and K = 100, that is on the order of 10,000² / 100 = 1,000,000 comparisons.
I am creating a fast method of generating a list of primes in the range(0, limit+1). In the function I end up removing all integers in the list named removable from the list named primes. I am looking for a fast and pythonic way of removing the integers, knowing that both lists are always sorted.
I might be wrong, but I believe list.remove(n) iterates over the list comparing each element with n, meaning that the following code runs in O(n^2) time.
# removable and primes are both sorted lists of integers
for composite in removable:
    primes.remove(composite)
Based on my assumption (which could be wrong, so please confirm whether or not it is correct) and the fact that both lists are always sorted, I would think that the following code runs faster, since it only loops over the list once, for O(n) time. However, it is not at all Pythonic or clean.
i = 0
j = 0
while i < len(primes) and j < len(removable):
    if primes[i] == removable[j]:
        primes = primes[:i] + primes[i+1:]
        j += 1
    else:
        i += 1
Is there perhaps a built in function or simpler way of doing this? And what is the fastest way?
Side notes: I have not actually timed the functions or code above. Also, it doesn't matter if the list removable is changed/destroyed in the process.
For anyone interested, the full function is below:
import math

# returns a list of primes in range(0, limit+1)
def fastPrimeList(limit):
    if limit < 2:
        return list()
    sqrtLimit = int(math.ceil(math.sqrt(limit)))
    primes = [2] + list(range(3, limit+1, 2))
    index = 1
    while primes[index] <= sqrtLimit:
        removable = list()
        index2 = index
        while primes[index] * primes[index2] <= limit:
            composite = primes[index] * primes[index2]
            removable.append(composite)
            index2 += 1
        for composite in removable:
            primes.remove(composite)
        index += 1
    return primes
This is quite fast and clean: it does O(n) set membership checks and runs in O(n) amortized time (the first line is O(n) amortized, the second line is O(n * 1) amortized, because a membership check is O(1) amortized):
removable_set = set(removable)
primes = [p for p in primes if p not in removable_set]
Here is the modification of your 2nd solution. It does O(n) basic operations (worst case):
tmp = []
i = j = 0
while i < len(primes) and j < len(removable):
    if primes[i] < removable[j]:
        tmp.append(primes[i])
        i += 1
    elif primes[i] == removable[j]:
        i += 1
    else:
        j += 1
primes[:i] = tmp
del tmp
Please note that constants also matter. The Python interpreter is quite slow (i.e. with a large constant) to execute Python code. The 2nd solution has lots of Python code, and it can indeed be slower for small practical values of n than the solution with sets, because the set operations are implemented in C, thus they are fast (i.e. with a small constant).
If you have multiple working solutions, run them on typical input sizes, and measure the time. You may get surprised about their relative speed, often it is not what you would predict.
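A minimal sketch of such a measurement with the standard timeit module; the input sizes here are arbitrary, purely for illustration:

import random
import timeit

primes = sorted(random.sample(range(10**6), 50000))
removable = sorted(random.sample(primes, 10000))

def with_set():
    removable_set = set(removable)
    return [p for p in primes if p not in removable_set]

print(timeit.timeit(with_set, number=100))   # seconds for 100 runs; repeat for each candidate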
The most important thing here is to remove the quadratic behavior. You have this for two reasons.
First, calling remove searches the entire list for values to remove. Doing this takes linear time, and you're doing it once for each element in removable, so your total time is O(NM) (where N is the length of primes and M is the length of removable).
Second, removing elements from the middle of a list forces you to shift the whole rest of the list up one slot. So, each one takes linear time, and again you're doing it M times, so again it's O(NM).
How can you avoid these?
For the first, you either need to take advantage of the sorting, or just use something that allows you to do constant-time lookups instead of linear-time, like a set.
For the second, you either need to create a list of indices to delete and then do a second pass to move each element up the appropriate number of indices all at once, or just build a new list instead of trying to mutate the original in-place.
So, there are a variety of options here. Which one is best? It almost certainly doesn't matter; changing your O(NM) time to just O(N+M) will probably be more than enough of an optimization that you're happy with the results. But if you need to squeeze out more performance, then you'll have to implement all of them and test them on realistic data.
The only one of these that I think isn't obvious is how to "use the sorting". The idea is to use the same kind of staggered-zip iteration that you'd use in a merge sort, like this:
def sorted_subtract(seq1, seq2):
    # Walk both sorted sequences together (like the merge step of a merge
    # sort) and yield the elements of seq1 that do not appear in seq2.
    i1, i2 = 0, 0
    while i1 < len(seq1):
        if i2 == len(seq2) or seq1[i1] < seq2[i2]:
            yield seq1[i1]      # nothing left in seq2 can match, keep it
            i1 += 1
        elif seq1[i1] == seq2[i2]:
            i1 += 1             # matched an element of seq2, drop it
            i2 += 1
        else:
            i2 += 1             # seq2 entry is smaller, skip ahead in seq2
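Used in the sieve above, that would replace the remove loop with something roughly like this (building a new primes list rather than mutating it in place):

primes = list(sorted_subtract(primes, removable))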
def f2(L):
    sum = 0
    i = 1
    while i < len(L):
        sum = sum + L[i]
        i = i * 2
    return sum
Let n be the size of the list L passed to this function. Which of the following most accurately describes how the runtime of this function grows as n grows?
(a) It grows linearly, like n does.
(b) It grows quadratically, like n^2 does.
(c) It grows less than linearly.
(d) It grows more than quadratically.
I don't understand how you figure out the relationship between the runtime of the function and the growth of n. Can someone please explain this to me?
ok, since this is homework:
this is the code:
def f2(L):
    sum = 0
    i = 1
    while i < len(L):
        sum = sum + L[i]
        i = i * 2
    return sum
It is obviously dependent on len(L).
So let's see, for each line, what it costs:
sum = 0
i = 1
# [...]
return sum
Those are obviously constant time, independent of L.
In the loop we have:
sum = sum + L[i] # time to lookup L[i] (`timelookup(L)`) plus time to add to the sum (obviously constant time)
i = i * 2 # obviously constant time
and how many times is the loop executed?
It's obviously dependent on the size of L.
Let's call that loops(L).
So we get an overall complexity of
loops(L) * (timelookup(L) + const)
Being the nice guy I am, I'll tell you that list lookup is constant time in Python, so it boils down to
O(loops(L)) (constant factors ignored, as big-O convention implies)
And how often do you loop, based on the len() of L?
(a) as often as there are items in the list? (b) quadratically as often as there are items in the list?
(c) less often than there are items in the list? (d) more often than (b)?
I am not a computer science major and I don't claim to have a strong grasp of this kind of theory, but I thought it might be relevant for someone from my perspective to try and contribute an answer.
Your function will always take time to execute, and if it is operating on a list argument of varying length, then the time it takes to run that function will be relative to how many elements are in that list.
Let's assume it takes 1 unit of time to process a list of length == 1. What the question is asking is the relationship between the list getting bigger and the increase in time for this function to execute.
This link breaks down some basics of Big O notation: http://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/
If it were O(1) complexity (which is not actually one of your A-D options) then it would mean the complexity never grows regardless of the size of L. Obviously in your example it is doing a while loop dependent on growing a counter i in relation to the length of L. I would focus on the fact that i is being multiplied, to indicate the relationship between how long it will take to get through that while loop vs the length of L. Basically, try to compare how many loops the while loop will need to perform at various values of len(L), and then that will determine your complexity. 1 unit of time can be 1 iteration through the while loop.
Hopefully I have made some form of contribution here, with my own lack of expertise on the subject.
Update
To clarify, based on the comment from ch3ka: if you were doing more than what you currently have inside your while loop, then you would also have to consider the added complexity of each loop iteration. But because your list lookup L[i] is constant complexity, as is the math that follows it, we can ignore those in terms of the complexity.
Here's a quick-and-dirty way to find out:
import matplotlib.pyplot as plt

def f2(L):
    sum = 0
    i = 1
    times = 0
    while i < len(L):
        sum = sum + L[i]
        i = i * 2
        times += 1  # track how many times the loop gets called
    return times

def main():
    i = range(1200)
    f_i = [f2([1]*n) for n in i]
    plt.plot(i, f_i)
    plt.show()  # display the plot

if __name__ == "__main__":
    main()
... which results in a plot where the horizontal axis is the size of L and the vertical axis is how many times the function loops; the big-O should be pretty obvious from this.
Consider what happens with an input of length n=10. Now consider what happens if the input size is doubled to 20. Will the runtime double as well? Then it's linear. If the runtime grows by factor 4, then it's quadratic. Etc.
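A rough sketch of that doubling test, assuming f2 from the question is defined; the sizes and repeat count are arbitrary:

import time

def measure(n, repeats=100_000):
    L = [1] * n
    start = time.perf_counter()
    for _ in range(repeats):
        f2(L)
    return time.perf_counter() - start

t1 = measure(10_000)
t2 = measure(20_000)
print(t2 / t1)   # ~1 for logarithmic growth, ~2 for linear, ~4 for quadratic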
When you look at the function, you have to determine how the size of the list will affect the number of loops that will occur.
In your specific situation, let's increment n and see how many times the while loop will run.
n = 0, loop = 0 times
n = 1, loop = 0 times
n = 2, loop = 1 time
n = 3, loop = 2 times
n = 4, loop = 2 times
See the pattern? Now answer your question, does it:
(a) It grows linearly, like n does. (b) It grows quadratically, like n^2 does.
(c) It grows less than linearly. (d) It grows more than quadratically.
Check out Hugh's answer for an empirical result :)
It's O(log(len(L))): i doubles on every iteration (1, 2, 4, 8, ...), so the loop runs about log2(len(L)) times, and list lookup is a constant-time operation, independent of the size of the list.