Sample integers except the ones in a specified set - python

Suppose I have a set of integers exceptions = set([3, 2, 6, ...]). What is the most efficient way to draw num samples of integers uniformly at random from the interval [0, n), excluding the ones that appear in exceptions, using Python?
Here are some ideas I've played with but are not satisfactory.
# Create a set of the integers I am interested in
integers = set(range(n)) - exceptions
# Draw samples from the set
samples = np.random.choice(integers, num)
In my case n is rather large (on the order of 10^12) such that creating the set seems like a waste of memory.
However, len(exceptions) is comparatively small (on the order of 10^6) such that rejection sampling might be a reasonable approach.
samples = []
# Iterate until we have enough samples
while len(samples) < num:
    # Draw a sample
    proposal = np.random.randint(n)
    # Accept or reject
    if proposal not in exceptions:
        samples.append(proposal)
Unfortunately, all the looping and set membership testing is rather slow. Obviously, I could write a C-extension or use Cython but I wanted to see whether you guys have a better idea.
Edited to include suggestions in the comments.

I believe that when exceptions is small compared to n, rejection might be your best shot; note that with n=10**12 and e=10**6 the probability of drawing an invalid value is only one in a million, so you will get by without rejecting and re-drawing most of the time. If you need multiple samples anyway, it is probably a good idea to over-generate slightly (say 101% of num) and then filter them all in bulk, discarding exceptions (generating many samples at once in numpy is much faster than drawing them in a loop). If there are still too many, simply truncate the result; if there are too few, generate more.
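A minimal sketch of that bulk approach; the helper name sample_excluding and the 1% over-generation factor are illustrative choices, not part of the original answer:

import numpy as np

def sample_excluding(n, num, exceptions, oversample=1.01, rng=None):
    """Draw num uniform samples from [0, n) that are not in exceptions,
    by over-generating batches and filtering them in bulk."""
    rng = np.random.default_rng() if rng is None else rng
    exceptions = np.asarray(sorted(exceptions))
    out = np.empty(0, dtype=np.int64)
    while out.size < num:
        # Over-generate a batch, then drop anything that hits an exception.
        batch = rng.integers(0, n, size=int((num - out.size) * oversample) + 1)
        out = np.concatenate([out, batch[~np.isin(batch, exceptions)]])
    return out[:num]

samples = sample_excluding(10**12, 10**5, {3, 2, 6})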
As the size of exceptions is getting bigger, you might try a different approach:
Assume you have a range of n integers and e exceptions. In fact, then, you are interested in getting one of n-e uniformly distributed results. So having x = np.random.randint(n-e) should be good enough, provided you can transform such a result into a meaningful one fast. First, an example to illustrate the idea:
Let's assume n=100, set of exceptions = {11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 26, 27}
case 1: your random result was 60. You have to count how many exceptions smaller than or equal to 60 there are: there are 12, so the result is 60+12=72 (a follow-up check shows no further exceptions up to 72, so we are done).
case 2: your random result was 22. There are 10 exceptions smaller than that, so you increase the result by 10, getting 32. There are however two more exceptions that are not greater than 32, but greater than 22, and therefore not yet counted. You have to further increase the value by 2, final result being 34.
The algorithm, then, would be as follows:
n = ...          # full range
exceptions = ... # collection of your exceptions
e = len(exceptions)

def get_sample():
    x = np.random.randint(n - e)
    skipped = 0
    small_exceptions = count_small_exceptions(x, exceptions)
    while skipped < small_exceptions:
        skipped = small_exceptions
        small_exceptions = count_small_exceptions(x + skipped, exceptions)
    return x + skipped
In order to implement that, you need a fast way to count exceptions smaller than or equal to a given value - the bisect module should do the trick, as it provides binary search on a sorted list.
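For example, a sketch of that counting helper; keeping the exceptions as a sorted list is my assumption, so the get_sample above can pass it straight through:

import bisect

sorted_exceptions = sorted(exceptions)  # sort once up front

def count_small_exceptions(value, sorted_exceptions):
    """Number of exceptions <= value, found by binary search in O(log e)."""
    return bisect.bisect_right(sorted_exceptions, value)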
Why does this work? The loop stops when skipped == small_exceptions, so there are exactly skipped exceptions in the range 0..(skipped+x). If the returned value (skipped+x) were an exception, the loop would have stopped earlier, because there would have been fewer exceptions in the previous iteration.
The exact cost analysis is not that easy, but if you assume that you have one exception for every k regular values, then increases in subsequent iterations are k times smaller each time. Worst-case scenario is when exceptions=range(e) and random x comes out as 0. Then there would be e iterations, with small increase in every one of them. If e is small compared to n this shouldn't be a problem; otherwise, you might further improve this solution by storing exceptions as intervals instead, effectively merging all subsequent exceptions into only one 'hole' and one operation.

Related

Find all the times in a (very large) array that have a difference greater than x

My array is time, so it is sorted and increasing.
I have to pull out the beginning/end values wherever the difference between consecutive entries is greater than 30. The problem, which other solutions don't cover, is that the array has thousands of values, so looping through it seems inefficient.
hugeArr = np.array([0, 2.072, 50.0, 90.0, 91.1])
My desired output for the above array would be something like: (2.072,50) (50,90).
Is there a way to accomplish this?
You can use np.diff and np.where to find the correct indices:
>>> idxs = np.where(np.diff(hugeArr) > 30)[0]
>>> list(zip(hugeArr[idxs], hugeArr[idxs + 1]))
[(2.072, 50.0), (50.0, 90.0)]
(Assuming you require only consecutive values)
And as @not_speshal mentioned, you can use np.column_stack instead of list(zip(...)) to stay within NumPy boundaries:
>>> np.column_stack((hugeArr[idxs], hugeArr[idxs+1]))
array([[ 2.072, 50.   ],
       [50.   , 90.   ]])
Try to think about what you're trying to do. For each value in the array, if the next value is larger by more than 30, you'd like to save the pair of them.
The key words here are for each. This is a classic O(n) complexity algorithm, so decreasing its time complexity seems impossible to me.
However, you can make changes specific to your array to make the algorithm faster.
For example, if you're looking for a difference of 30 and you know that the average difference is 1, you might be better off looking, for index i, at
difference = hugeArr[i+15] - hugeArr[i]
and seeing if this is bigger than 30. If it isn't (and it probably won't be), you can skip these 15 indices, since no gap between two consecutive values inside that window can be larger than the window's total gap.
If this works for you, run tests; 15 is completely arbitrary and maybe your magic number is 25. Change it a bit and time how long your function takes to run.
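A minimal sketch of that stride-skipping idea; the stride of 15 and the helper name find_gaps are illustrative, not from the original answer:

def find_gaps(arr, threshold=30, stride=15):
    """Return (left, right) pairs of consecutive values whose gap exceeds threshold.

    Because the array is sorted, if arr[i + stride] - arr[i] <= threshold,
    no consecutive gap inside that window can exceed threshold, so the whole
    window can be skipped without checking each pair."""
    pairs = []
    i = 0
    n = len(arr)
    while i < n - 1:
        end = min(i + stride, n - 1)
        if arr[end] - arr[i] <= threshold:
            i = end  # nothing interesting in this window, jump over it
            continue
        # The window contains at least one big gap: scan it pair by pair.
        for j in range(i, end):
            if arr[j + 1] - arr[j] > threshold:
                pairs.append((arr[j], arr[j + 1]))
        i = end
    return pairs

hugeArr = [0, 2.072, 50.0, 90.0, 91.1]
print(find_gaps(hugeArr))  # [(2.072, 50.0), (50.0, 90.0)]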
A strategy that comes to mind: we don't have to check the numbers between two values whose distance is smaller than 30; we can rely on this because the array is sorted. For example, if abs(hugeArr[0] - hugeArr[-1]) < 30, we don't have to check anything, because no consecutive pair can have a distance over 30.
We would start at the ends and work our way inwards. So check the starting number and ending number first. Then we go halfway, hugeArr[len(hugeArr)//2], and check its distance against hugeArr[0] and hugeArr[-1]. Then we go into the two halves (hugeArr[0:len(hugeArr)//2] and hugeArr[len(hugeArr)//2:]). We keep breaking those ranges in half, and wherever a range's end-to-end distance is smaller than 30 we don't check inside it. We can make this a recursive algorithm.
Worst case you'll have a distance over 30 everywhere and end up with O(n) but it could give you some advantage.
Something like this; however, you might want to refactor it to use numpy.
def check(arr):
    pairs = []

    def check_range(hugeArr):
        difference = abs(hugeArr[0] - hugeArr[-1])
        if difference < 30:
            return
        if len(hugeArr) == 2:
            pairs.append((hugeArr[0], hugeArr[1]))
            return
        halfway = len(hugeArr) // 2
        check_range(hugeArr[:halfway + 1])
        check_range(hugeArr[halfway:])

    check_range(arr)
    return pairs

Circular Primes Project Euler #35

This is the problem I'm trying to solve.
The number, 197, is called a circular prime because all rotations of
the digits: 197, 971, and 719, are themselves prime.
There are thirteen such primes below 100: 2, 3, 5, 7, 11, 13, 17, 31,
37, 71, 73, 79, and 97.
How many circular primes are there below one million?
This is my attempt. I first put all prime numbers below 1000000 in a list called primes then calculated all their possible permutations and stored those permutations in a list. Then I checked whether those permutations are prime or not.
import math

def isPrime(num):
    flag = 1
    root = int(math.sqrt(num) + 1)
    for i in range(2, root + 1):
        if num % i == 0:
            flag = 0
            break
        else:
            flag = 1
    if flag == 1:
        return True
    else:
        return False
primes = [#list of all primes below 1000000]
def permutations(word):
    if len(word) == 1:
        return [word]
    char = word[0]
    perms = permutations(word[1:])
    result = []
    for perm in perms:
        for i in range(len(perm) + 1):
            result.append(perm[i:] + char + perm[:i])
    return result
count = 0
for i in primes:
    to_be_tested = permutations(str(i))
    count_to_be_fulfilled = len(to_be_tested)
    new_count = 0
    for j in to_be_tested:
        if isPrime(int(j)):
            new_count += 1
    if new_count == count_to_be_fulfilled:
        count += 1
print(count)
I get the answer 22 and according to Project Euler it's wrong. I don't know the answer as I want to solve this on my own and don't want to cheat. Please tell me what I'm doing wrong.
Solution
I will not post the full solution, because from Project Euler's about section:

"I learned so much solving problem XXX so is it okay to publish my solution elsewhere?

It appears that you have answered your own question. There is nothing quite like that "Aha!" moment when you finally beat a problem which you have been working on for some time. It is often through the best of intentions in wishing to share our insights so that others can enjoy that moment too. Sadly, however, that will not be the case for your readers. Real learning is an active process and seeing how it is done is a long way from experiencing that epiphany of discovery. Please do not deny others what you have so richly valued yourself."
However, I will give you a few tips.
Data structures
As I have mentioned in the comment, using sets will improve your program's performance a whole lot, because access to the data in them is really quick (you can check why by googling a technique called hashing if you don't know about it). Precisely, a membership test in a list is O(N), whereas in a set it is about O(1).
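A minimal illustration of the change, not the full solution; the variable names are mine:

primes_list = [2, 3, 5, 7, 11, 13]   # imagine all primes below 1000000 here
primes_set = set(primes_list)        # one-time conversion

# Membership tests: O(N) scan for the list, O(1) average for the set.
print(13 in primes_list)  # True, but scans the list
print(13 in primes_set)   # True, hash lookup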
Rotations vs permutations
In the problem statement, you are required to calculate the rotations of a number. As pointed out by @ottomeister, you should really check whether your program is doing what the problem statement expects it to be doing. Currently, calling permutations("197") will return the following list - ['791', '917', '179', '971', '719', '197'] - whereas the problem expects the result to be ['197', '971', '719']. Rather than creating permutations, you should compute rotations, which can be found, for example, by moving each digit to the left (with wrap-around) until the initial number is returned (you could make a recursive algorithm like that if you really like those).
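A minimal sketch of such a rotations helper; the function name rotations is mine, not from the answer:

def rotations(number):
    """Return all digit rotations of a number, e.g. 197 -> [197, 971, 719]."""
    digits = str(number)
    return [int(digits[i:] + digits[:i]) for i in range(len(digits))]

print(rotations(197))  # [197, 971, 719]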
Prime checking
You are currently checking if each number is prime by performing a loop which checks if the number N is divisible by everything up to sqrt(N). As you have noticed, this is quite slow for a lot of numbers, and there are better ways to check if a number is prime. In your scenario, the ideal solution is to simply do N in primes, because you have already generated the primes; if they are stored in a set instead of a list, that check is O(1). Alternatively, you can have a look at primality tests (especially the heuristic and probabilistic ones).
Finally
I generally recommend testing your own solution before anything else; you could have easily spotted that your permutations function does not meet the problem statement's expected results at all. I suggest breaking things into smaller functions: for example, you could have a rotations(number) function similar to your permutations, maybe an is_circular(number) function which checks if a given number meets the problem requirements, and a count_circular(upper_bound) function which computes and counts the circular numbers. If you also comment everything as you go, it will make debugging a lot easier and you will be able to confirm everything works as expected on the go :)

Fastest way to compute e^x?

What is the fastest way to compute e^x, given that x can be a floating-point value?
Right now I use Python's math library to compute this. Below is the complete code, where result = -0.490631 + 0.774275 * math.exp(0.474907 * sum) is the main logic; the rest is file-handling code which the question demands.
import math
import sys

def sum_digits(n):
    r = 0
    while n:
        r, n = r + n % 10, n // 10
    return r

def _print(string):
    fo = open("output.txt", "w+")
    fo.write(string)
    fo.close()

try:
    f = open('input.txt')
except IOError:
    _print("error")
    sys.exit()

data = f.read()
num = data.split('\n', 1)[0]

try:
    val = int(num)
except ValueError:
    _print("error")
    sys.exit()

sum = sum_digits(int(num))
f.close()

if (sum == 2):
    _print("1")
else:
    result = -0.490631 + 0.774275 * math.exp(0.474907 * sum)
    _print(str(math.ceil(result)))
The right-hand side of result is the equation of a curve (the solution to a programming problem) that I derived with Wolfram Mathematica using my own data set.
But this doesn't seem to pass the par criteria of the assessment !
I have also tried the Newton-Raphson way, but convergence for larger x is a problem, and calculating the natural log ln(x) is a challenge there as well!
I don't have any language constraint, so any solution is acceptable. Also, if Python's math library is the fastest, as some of the comments say, can anyone give an insight into the time complexity and execution time of this program, in short, its efficiency?
I don't know if the exponential curve math is accurate in this code, but it certainly isn't the slow point.
First, you read the input data in one read call. It does have to be read, but that loads the entire file. The next step takes the first line only, so it would seem more appropriate to use readline. That split itself is O(n) where n is the file size, at least, which might include data you were ignoring since you only process one line.
Second, you convert that line into an int. This probably requires Python's long integer support, but the operation could be O(n) or O(n^2). A single pass algorithm would multiply the accumulated number by 10 for each digit, allocating one or two new (longer) longs each time.
Third, sum_digits breaks that long int down into digits again. It does so using division, which is expensive, and two operations as well, rather than using divmod. That's O(n^2), because each division has to process every higher digit for each digit. And it's only needed because of the conversion you just did.
Summing the digits found in a string is likely easier done with something like sum(int(c) for c in l if c.isdigit()) where l is the input line. It's not particularly fast, as there's quite a bit of overhead in the digit conversions and the sum might grow large, but it does make a single pass with a fairly tight loop; it's somewhere between O(n) and O(n log n), depending on the length of the data, because the sum might grow large itself.
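A minimal sketch of that single-pass approach; the file name follows the question's code, and the wiring is my assumption:

# Read only the first line and sum its digits without ever building a big int.
with open("input.txt") as f:
    line = f.readline()

digit_sum = sum(int(c) for c in line if c.isdigit())
print(digit_sum)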
As for the unknown exponential curve, the existence of an exception for a low number is concerning. There's likely some other option that's both faster and more accurate if the answer's an integer anyway.
Lastly, you have at least four distinct output data formats: error, 2, 3.0, 3e+20. Do you know which of these is expected? Perhaps you should be using formatted output rather than str to convert your numbers.
One extra note: if the data is really large, processing it in chunks will definitely speed things up (instead of running out of memory, needing to swap, etc.). As you're looking for a digit sum, the space needed can be reduced from O(n) to O(log n).
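A rough sketch of that chunked processing, assuming the number sits on the first line of the file; the chunk size and function name are arbitrary choices of mine:

def digit_sum_in_chunks(path, chunk_size=1 << 20):
    """Sum the digits of the first line of a file, reading it chunk by chunk."""
    total = 0
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Stop at the end of the first line, if it ends in this chunk.
            newline = chunk.find("\n")
            if newline != -1:
                chunk = chunk[:newline]
            total += sum(int(c) for c in chunk if c.isdigit())
            if newline != -1:
                break
    return total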

Code finding the first triangular number with more than 500 divisors will not finish running

Okay, so I'm working on Euler Problem 12 (find the first triangular number with a number of factors over 500) and my code (in Python 3) is as follows:
factors = 0
y = 1

def factornum(n):
    x = 1
    f = []
    while x <= n:
        if n % x == 0:
            f.append(x)
        x += 1
    return len(f)

def triangle(n):
    t = sum(list(range(1, n)))
    return t

while factors <= 500:
    factors = factornum(triangle(y))
    y += 1
print(y - 1)
Basically, one function goes through all the numbers below the input number n, checks if they divide into n evenly, and if so adds them to a list, then returns the length of that list. Another generates a triangular number by summing all the numbers from 1 to the input number and returning the sum. Then a while loop keeps generating a triangular number, using an incrementing variable y as the input for the triangle function, runs the factornum function on the result, and puts the count in the factors variable. The loop continues to run and y continues to increment until the number of factors is over 500. The result is then printed.
However, when I run it, nothing happens - no errors, no output, it just keeps running and running. Now, I know my code isn't the most efficient, but I left it running for quite a bit and it still didn't produce a result, so it seems more likely to me that there's an error somewhere. I've been over it and over it and cannot seem to find an error.
I'd merely request that a full solution or a drastically improved one isn't given outright but pointers towards my error(s) or spots for improvement, as the reason I'm doing the Euler problems is to improve my coding. Thanks!
You have a very inefficient algorithm.
Since you ask for pointers rather than a full solution, the main ones are:
There is a more efficient way to calculate the next triangular number: there is an explicit formula in the wiki, and if you generate the sequence anyway, it is cheaper to just add the next n to the previous number. (Side note: the list in sum(list(range(1,n))) makes no sense to me at all; if you want to keep this approach, sum(range(1,n)) is enough and avoids materializing the range as a list.)
There are much more efficient ways to factorize numbers
There is a more efficient way to calculate the number of divisors: it follows from the prime factorization. If n = p1^a1 * ... * pk^ak, then n has (a1+1)*(a2+1)*...*(ak+1) divisors (this is the divisor function, not to be confused with Euler's totient function, which counts something else). See the sketch after this list.
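A rough sketch of the first and third pointers combined; the helper names and the trial-division factorization are my illustration, not the answer's exact recipe:

def triangular(n):
    """Explicit formula for the n-th triangular number."""
    return n * (n + 1) // 2

def count_divisors(n):
    """Count divisors of n from its prime factorization:
    if n = p1^a1 * ... * pk^ak, then n has (a1+1)*...*(ak+1) divisors."""
    count = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            exponent = 0
            while n % p == 0:
                n //= p
                exponent += 1
            count *= exponent + 1
        p += 1
    if n > 1:  # a prime factor larger than sqrt(original n) remains
        count *= 2
    return count

print(triangular(7), count_divisors(28))  # 28 has divisors 1,2,4,7,14,28 -> 6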
Generally, Project Euler problems (as in many other programming competitions) are not supposed to be solvable by sheer brute force. You should come up with some formula and/or a more efficient algorithm first.
As far as I can tell your code will work, but it will take a very long time to calculate the number of factors. For 150 factors, it takes on the order of 20 seconds to run, and that time will grow dramatically as you look for higher and higher number of factors.
One way to reduce the processing time is to reduce the number of calculations that you're performing. If you analyze your code, you're calculating n%1 every single time, which is an unnecessary calculation because you know every single integer will be divisible by itself and one. Are there any other ways you can reduce the number of calculations? Perhaps by remembering that if a number is divisible by 20, it is also divisible by 2, 4, 5, and 10?
I can be more specific, but you wanted a pointer in the right direction.
From the looks of it the code works fine, it's just not the best approach. A simple optimization is to only check divisors up to half the number, for example. Also, try thinking about how you could do this using prime factors; it might be another solution. Best of luck!
First, define a factors function:
from functools import reduce

def factors(n):
    step = 2 if n % 2 else 1  # note: step is computed but not used below
    return set(reduce(list.__add__,
                      ([i, n // i] for i in range(1, int(pow(n, 0.5) + 1))
                       if n % i == 0)))
This will create a set and put all of the factors of the number n into it.
Second, use a while loop until you get more than 500 factors:
a = 1
x = 1
while len(factors(a)) < 501:
    x += 1
    a += x
This loop stops at the first a for which len(factors(a)) exceeds 500.
Simply print(a) and you will get your answer.

Detect significant changes in a data-set that gradually changes

I have a list of data in python that represents amount of resources used per minute. I want to find the number of times it changes significantly in that data set. What I mean by significant change is a bit different from what I've read so far.
For e.g. if I have a dataset like
[10,15,17,20,30,40,50,70,80,60,40,20]
I say a significant change happens when data increases by double or reduces by half with respect to the previous normal.
For e.g. since the list starts with 10, that is our starting normal point
Then when data doubles to 20, I count that as one significant change and set the normal to 20.
Then when data doubles to 40, it is considered a significant change and the normal is now 40
Then when data doubles to 80, it is considered a significant change and the normal is now 80
After that when data reduces by half to 40, it is considered as another significant change and the normal becomes 40
Finally when data reduces by half to 20, it is the last significant change
Here there are a total of 5 significant changes.
Is it similar to any other change detection algorithm? How can this be done efficiently in python?
This is relatively straightforward. You can do this with a single iteration through the list. We simply update our base when a 'significant' change occurs.
Note that my implementation will work for any iterable or container. This is useful if you want to, for example, read through a file without having to load it all into memory.
def gen_significant_changes(iterable, *, tol=2):
    iterable = iter(iterable)  # this is necessary if it is a container rather than a generator.
    # note that if the iterable is already a generator, iter(iterable) returns itself.
    base = next(iterable)
    for x in iterable:
        if x >= (base * tol) or x <= (base / tol):
            yield x
            base = x

my_list = [10, 15, 17, 20, 30, 40, 50, 70, 80, 60, 40, 20]
print(list(gen_significant_changes(my_list)))
I can't help with the Python part, but in terms of math, the problem you're asking about is fairly simple to solve using log base 2. A significant change occurs when the current value divided by a constant falls into a different integer power-of-2 bracket than the previous value. (The constant is needed since the first value in the array forms the basis of comparison.)
For each element at index t, compute:
current = math.log(Array[t] / Array[0], 2)
previous = math.log(Array[t-1] / Array[0], 2)
If math.floor(current) != math.floor(previous), a significant change has occurred.
Using this method you do not need to keep track of a "normal point" at all, you just need the array. By removing the additional state variable we enable the array to be processed in any order, and we could give portions of the array to different threads if the dataset were very large. You wouldn't be able to do that with your current method.
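A self-contained sketch of that idea; the helper name is mine, and it uses math.log2 rather than math.log(x, 2) to avoid floating-point rounding right at the power-of-two boundaries. For the sample list it reports 5, the same count as in the question:

import math

def count_significant_changes(values):
    """Count indices where floor(log2(value / values[0])) differs from the
    previous element's, i.e. the value crossed a doubling/halving boundary."""
    count = 0
    for t in range(1, len(values)):
        current = math.floor(math.log2(values[t] / values[0]))
        previous = math.floor(math.log2(values[t - 1] / values[0]))
        if current != previous:
            count += 1
    return count

data = [10, 15, 17, 20, 30, 40, 50, 70, 80, 60, 40, 20]
print(count_significant_changes(data))  # 5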
