Generate a large list of points with no duplicates - python

I want to create a large list containing 20,000 points in the form of:
[[x, y], [x, y], [x, y]]
where x and y can be any random integer between 0 and 1000. How would I be able to do this such that there are no duplicate coordinates [x, y]?

You could just use a while loop to pad it out until it's big enough:
>>> from random import randint
>>> n, N = 1000, 20000
>>> points = {(randint(0, n), randint(0, n)) for i in xrange(N)}
>>> while len(points) < N:
...     points |= {(randint(0, n), randint(0, n))}
...
>>> points = [list(x) for x in points]
Your initial idea was probably slow because it iterated over lists for membership checks, which is O(n). This version uses sets, whose membership checks are O(1) on average, and only converts to the list-of-lists structure once at the end.
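To see the difference, here is a rough timing sketch (my addition; the names and sizes are illustrative, not from the original answer):
import timeit

setup = """
points_list = [(i, i) for i in range(20000)]
points_set = set(points_list)
"""
# A list membership test scans elements one by one (O(n));
# a set membership test hashes the key (O(1) on average).
print(timeit.timeit("(19999, 19999) in points_list", setup=setup, number=1000))
print(timeit.timeit("(19999, 19999) in points_set", setup=setup, number=1000))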

Try this:
import itertools
x = range(0, 10)
aList = []
for pair in itertools.combinations(x, 2):
    aList.append(pair)
print aList
This gives you all unique pairs of values between 0 and 10, stored in a list. If you need them in random order, shuffle the result with random.shuffle.

Since n = 1001 is relatively small in your case, random.sample(population, k) will do just fine, taking a random sample of 20000 pairs from the space of possible pairs (no duplicates):
import random
print random.sample([[x, y] for x in xrange(1001) for y in xrange(1001)], 20000)
This is the most concise and readable solution. (But if n is very big, generating the entire space of points will not be computationally efficient.)

An approach that avoids while loops with unknown iteration counts, and avoids storing huge lists in memory, is to draw unique encoded values with random.sample from a single range (in Py3) or xrange (in Py2), so no huge temporary is ever materialized. A simple mathematical operation then splits each "encoded" value back into two coordinates:
import random
xys = random.sample(range(1001 * 1001), 20000)
[divmod(xy, 1001) for xy in xys] # Wrap divmod in list() if you must have list, not tuple
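A quick sanity check of the encoding (my addition, assuming the same n = 1001 as above): each point is encoded as x * 1001 + y, which divmod inverts exactly:
n = 1001
x, y = 123, 456
encoded = x * n + y   # the kind of value random.sample draws from range(n * n)
assert divmod(encoded, n) == (x, y)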

Related

Python: Memory-efficient random sampling of list of permutations

I am seeking to sample n random permutations of a list in Python.
This is my code:
obj = [ 5 8 9 ... 45718 45719 45720]
#type(obj) = numpy.ndarray
pairs = random.sample(list(permutations(obj, 2)), k=150)
Although the code does what I want it to, it causes memory issues. I sometimes receive a MemoryError when running on CPU, and when running on GPU, my virtual machine crashes.
How can I make the code work in a more memory-efficient manner?
This avoids using permutations at all:
import random

count = len(obj)

def index2perm(i, obj):
    i1, i2 = divmod(i, len(obj) - 1)
    if i1 <= i2:
        i2 += 1
    return (obj[i1], obj[i2])

pairs = [index2perm(i, obj) for i in random.sample(range(count*(count-1)), k=3)]
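As a small sanity check (my addition, reusing index2perm from above): enumerating every index for a tiny obj shows that each ordered pair of distinct elements appears exactly once:
obj = ['a', 'b', 'c']
count = len(obj)
all_pairs = {index2perm(i, obj) for i in range(count * (count - 1))}
print(sorted(all_pairs))   # all 6 ordered pairs, no ('a', 'a')-style self-pairs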
Building on Pablo Ruiz's excellent answer, I suggest wrapping his sampling solution into a generator function that yields unique permutations by keeping track of what it has already yielded:
import numpy as np

def unique_permutations(sequence, r, n):
    """Yield n unique permutations of r elements from sequence"""
    seen = set()
    while len(seen) < n:
        # This line of code adapted from Pablo Ruiz's answer:
        candidate_permutation = tuple(np.random.choice(sequence, r, replace=False))
        if candidate_permutation not in seen:
            seen.add(candidate_permutation)
            yield candidate_permutation

obj = list(range(10))
for permutation in unique_permutations(obj, 2, 15):
    ...  # do something with the permutation

# Or, to save the result as a list:
pairs = list(unique_permutations(obj, 2, 15))
My assumption is that you are sampling a small subset of the very large number of possible permutations, in which case collisions will be rare enough that keeping a seen set will not be expensive.
Warnings: this function becomes an infinite loop if you ask for more permutations than are possible given the inputs. It will also get increasingly slow as n gets close to the number of possible permutations, since collisions become increasingly frequent.
If I were to put this function in my code base, I would put a shield at the top that calculated the number of possible permutations and raised a ValueError exception if n exceeded that number, and maybe output a warning if n exceeded one tenth that number, or something like that.
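For instance, the shield might look like this (a sketch only; math.perm requires Python 3.8+, and checked_unique_permutations is a hypothetical wrapper name):
import math
import warnings

def checked_unique_permutations(sequence, r, n):
    # Number of possible r-permutations of the sequence.
    possible = math.perm(len(sequence), r)
    if n > possible:
        raise ValueError("asked for %d permutations, but only %d exist" % (n, possible))
    if n > possible // 10:
        # Arbitrary one-tenth threshold, as suggested above.
        warnings.warn("n is a large fraction of the permutation space; expect slowdowns")
    return unique_permutations(sequence, r, n)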
You can avoid materializing the permutation iterator, which could be massive in memory. Instead, generate random permutations by sampling the list with replace=False.
import numpy as np
obj = np.array([5,8,123,13541,42])
k = 15
permutations = [tuple(np.random.choice(obj, 2, replace=False)) for _ in range(k)]
print(permutations)
This problem becomes much harder if, for example, you impose that there be no repetition among your random permutations.
Edit, no repetitions code
I think this is the best possible approach for the non repetition case.
We index the n**2 - n possible permutations as the off-diagonal cells of an n x n permutation matrix, where the diagonal must be avoided. We sample the indexes without repetition and without listing them, then map the samples to the coordinates of the permutations, and finally read the permutations off the matrix indexes.
import random
import numpy as np

obj = np.array([1,2,3,10,43,19,323,142,334,33,312,31,12])
k = 150
obj_len = len(obj)
indexes = random.sample(range(obj_len**2 - obj_len), k)

def mapm(m):
    # Shift each flat index past the diagonal cells skipped so far.
    return m + m // obj_len + 1

permutations = [(obj[mapm(i) // obj_len], obj[mapm(i) % obj_len]) for i in indexes]
This approach makes no assumptions, never materializes the full permutation list, and its performance does not depend on a while loop retrying after duplicates, as no duplicates are ever generated.
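A quick standalone check of the index mapping (my addition): for a small obj_len, mapm should hit every off-diagonal cell of the matrix and never a diagonal one:
obj_len = 4

def mapm(m):
    return m + m // obj_len + 1

mapped = sorted(mapm(m) for m in range(obj_len**2 - obj_len))
diagonal = {i * (obj_len + 1) for i in range(obj_len)}   # flat indexes of diagonal cells
print(mapped)                        # [1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13, 14]
print(diagonal.isdisjoint(mapped))   # True: every diagonal index is skipped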

List comprehension has excruciatingly slower load time than regular code that does the same thing

I have a list comprehension that prints all the prime numbers from 1 to 1000. For some strange reason, my list comprehension takes 1:46 to load in the terminal. I find this very weird because when I write the same code out normally, it loads instantaneously.
Here is my comprehension: print([x for x in range(2, 1000) if x not in [y * z for y in range(2, 1001) for z in range(2, 1001) if y * z < 1000]])
As you can see, it makes a list of numbers from 2 to 1000 and prints the (prime) ones that are not in the list of composite numbers under 1000. When I run this, it outputs correctly, but takes ages on every computer I try. I thought maybe my code was just erroneous. However, when I isolate the [y * z for y in range(2, 1001) for z in range(2, 1001) if y * z < 1000] part, there is no delay in displaying the composites. And when I generate the regular list of numbers for comparison, there is also no lag. It's just when I use the "not in" operator that the comprehension takes ridiculously long.
I thought that perhaps the not in comparison was being extra slow. But to my frustration, I noticed that when I wrote out the comparison part of the code normally and not in comprehension, there was absolutely no delay. See this:
x = [y * z for y in range(2, 1001) for z in range(2, 1001) if y * z < 1000]
newlist = []
for z in range(2, 1000):
    if z not in x:
        newlist.append(z)
print(newlist)
As you can see, I slapped the composite list into a variable and did the if statement and loop regularly: if a number wasn't in the composite list, it was added to a new list, achieving the same goal as my list comprehension. I was wondering if there is a solution to my list comprehension being so slow. The logic matches my original comprehension, so why does it take longer in comprehension format if it is essentially the same?
Please try not to add any additional features to my code; I'm trying to use list comprehension and only list comprehension.
The inner list is recreated on every iteration of x. Simply separate it out:
composites = [y*z for y in range(2, 1001) for z in range(2, 1001) if y*z < 1000]
[x for x in range(2, 1000) if x not in composites]
By the way, if you make composites a set, lookups (in and not in) are much faster (O(1) instead of O(n), where n=len(composites)).
composites = {y*z for y ...}
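Put together, the set-based version might look like this (my sketch of the suggestion above, keeping everything in comprehensions):
# Build the composite set once; each "not in" is then an O(1) lookup on average.
composites = {y * z for y in range(2, 1001) for z in range(2, 1001) if y * z < 1000}
print([x for x in range(2, 1000) if x not in composites])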

Random Sample of N Distinct Permutations of a List

Suppose I have a Python list of arbitrary length k. Now, suppose I would like a random sample of n (where n <= k!) distinct permutations of that list. I was tempted to try:
import random
import itertools

k = 6
n = 10
mylist = list(range(0, k))
j = random.sample(list(itertools.permutations(mylist)), n)
for i in j:
    print(i)
But, naturally, this code becomes unusably slow when k gets too large. Given that the number of permutations I'm looking for, n, is going to be relatively small compared to the total number of permutations, computing all of them is unnecessary. Yet it's important that none of the permutations in the final list are duplicates.
How would you achieve this more efficiently? Remember, mylist could be a list of anything, I just used list(range(0, k)) for simplicity.
You can generate permutations, and keep track of the ones you have already generated. To make it more versatile, I made a generator function:
import random

k = 6
n = 10
mylist = list(range(0, k))

def perm_generator(seq):
    seen = set()
    length = len(seq)
    while True:
        perm = tuple(random.sample(seq, length))
        if perm not in seen:
            seen.add(perm)
            yield perm

rand_perms = perm_generator(mylist)
j = [next(rand_perms) for _ in range(n)]
for i in j:
    print(i)
Naïve implementation
Below is the naïve implementation I did (well implemented by @Tomothy32, pure Python standard library, using a generator):
import numpy as np

mylist = np.array(mylist)
perms = set()
for i in range(n):                        # (1) Draw n samples from the permutation universe U (#U = k!)
    while True:                           # (2) Loop until a new permutation is found
        perm = np.random.permutation(k)   # (3) Generate a random index permutation from U
        key = tuple(perm)
        if key not in perms:              # (4) Check if this permutation was already drawn (hash table)
            perms.add(key)                # (5) Insert it into the set (add, not update, to store the tuple itself)
            break                         # (6) Break out of the inner loop
    print(i, mylist[perm])
It relies on numpy.random.permutation, which randomly permutes a sequence.
The key idea is:
to generate a new random permutation (index randomly permuted);
to check whether the permutation already exists, storing it as a tuple of ints (it must be hashable) to prevent duplicates;
Then to permute the original list using the index permutation.
This naïve version does not directly suffer from the O(k!) factorial complexity of the itertools.permutations function, which generates all k! permutations before sampling from them.
About Complexity
There is something interesting about the algorithm design and complexity...
If we want to be sure that the loop can end, we must enforce N <= k!, but the code does not guarantee it. Furthermore, assessing the complexity requires knowing how many times the endless loop will actually run before a new random tuple is found to break it.
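As a back-of-envelope estimate (my addition, not part of the original answer): when m distinct permutations have already been drawn, a new draw is fresh with probability (k! - m)/k!, so the expected number of draws for the next one is k!/(k! - m). Summing over m shows the total work stays very close to N while N << k!:
import math

def expected_draws(k, N):
    """Expected number of random draws to collect N unique permutations of length k."""
    total = math.factorial(k)
    return sum(total / (total - m) for m in range(N))

print(expected_draws(6, 10))    # ~10.06: almost no retries while N << k!
print(expected_draws(6, 720))   # ~5153: drawing *all* permutations is coupon-collector territory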
Limitation
Let's encapsulate the function written by @Tomothy32:
import math

def get_perms(seq, N=10):
    rand_perms = perm_generator(seq)
    return [next(rand_perms) for _ in range(N)]
For instance, this call works for very small k < 7:
get_perms(list(range(k)), math.factorial(k))
But it will fail against the O(k!) complexity (time and memory) as k grows, because it eventually boils down to randomly searching for the single missing key after all the other k!-1 keys have been found.
Always look on the bright side...
On the other hand, the method can generate a reasonable number of permuted tuples in a reasonable amount of time when N <<< k!. For example, it is possible to draw more than N = 5000 tuples of length 10 < k < 1000 in less than one second.
When k and N are kept small and N <<< k!, the algorithm seems to have a complexity that is:
constant in k;
linear in N.
This is somewhat valuable.

Keep a 2-tuple of ints within a specific range

I am having trouble with logic checks. There is a coordinate system grid that is 15x15, row x column.
There is a random list of 2-tuples containing coordinates, given as (r, c): [(3,3),(4,5),(14,14),(13,0),(0,13)].
I would like to check if some random choice tuple is within a certain range: 2x2 and 12x12. So for example the tuple: (3,3) and (4,5) would be allowed, but (13,0) and (0, 13) would not.
How would I go about implementing checks to see if a tuple pair lies within the range?
The best approach would be to not create random tuples outside the wanted range at all:
import random

def randCoord(xMin, xMax, yMin, yMax, numCoords):
    """Generates 'numCoords' random coord tuples between ([xMin,xMax],[yMin,yMax])"""
    for _ in range(numCoords):
        yield (random.randint(xMin, xMax), random.randint(yMin, yMax))

test = randCoord(2, 12, 2, 12, 5)  # all will be valid (randint bounds are inclusive)
If the creation of the random tuples is out of your hands, you should go for a logic check (see Louis Sugy's answer). For a small sample area, a lookup table (LUT) is a viable option: your area is very limited (15*15 = 225 elements in total, of which only 121 are OK), so if you want to check repeatedly, create a set of all allowed tuples and look candidates up in it.
As pointed out in the comments to my answer, for bigger problem sizes you should probably not use a lookup, but for this problem a LUT is fine:
allowed = set((x, y) for x in range(2, 13) for y in range(2, 13))
test = [(2, 5), (7, 13)]
for t in test:
    print(t, "Yes" if t in allowed else "No")
Build the set once, then every check is a simple lookup. Done.
Output:
(2, 5) Yes
(7, 13) No
If in doubt, measure (thrice; cut once): I got curious about the timings, whether a generated LUT would outperform the once-created set (which I correctly doubted), and how badly a LUT would compare to plain logic checks:
Edit: fixed several errors thanks to tips from @StefanPochmann in the comments
import timeit

setupTxt = """import random
random.seed(42)

def randCoord(xMin, xMax, yMin, yMax, numCoords):
    for _ in range(numCoords):
        yield (random.randint(xMin, xMax), random.randint(yMin, yMax))

def allowed(xMin, xMax, yMin, yMax):
    from itertools import product
    return product(range(xMin, xMax + 1), range(yMin, yMax + 1))

allowedLUT = set((x, y) for x in range(2, 13) for y in range(2, 13))

# need list(generator) so data persists between iterations
testSample = list(randCoord(0, 15, 0, 15, 1000))
"""
n = 500
Tests:
# LUT-set prefilled and checked
LUTtiming = timeit.timeit(
    "for t in testSample: 1 if t in allowedLUT else -1",
    setup=setupTxt, number=n)

# LUT generated on each call
LUTgenerated = timeit.timeit(
    "for t in testSample: 1 if t in allowed(2,12,2,12) else -1",
    setup=setupTxt, number=n)

# simple condition checking
condCheckTiming = timeit.timeit(
    "for x,y in testSample: 1 if 2 <= x <= 12 and 2 <= y <= 12 else -1",
    setup=setupTxt, number=n)

print("LUT: \t{0:f}\nLUTGen:\t{1:f}\nConds: \t{2:f}".format(LUTtiming,
                                                            LUTgenerated,
                                                            condCheckTiming))
Outputs (Laptop - pyfiddle.io gives odd numbers):
LUT:    0.753209  # faster than conditional checking, but needs memory for the LUT
LUTGen: 8.983708  # generator not suited for _this_ purpose
Conds:  0.828372  # needs less memory than the LUT, ~10% slower
Let's say that your list of coordinates is called coords. Then, to verify that the i-th element is ok, you just use the expression:
coords[i][0] >= 2 and coords[i][1] >= 2 and coords[i][0] <= 12 and coords[i][1] <= 12
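Equivalently, Python's chained comparisons make the same check a bit more compact (my rewrite, with in_bounds as a hypothetical helper name):
def in_bounds(coord):
    # Same logic as the expression above, using chained comparisons.
    r, c = coord
    return 2 <= r <= 12 and 2 <= c <= 12

print(in_bounds((3, 3)))    # True
print(in_bounds((13, 0)))   # False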
Another approach, given the lower and upper bounds, would be to create two ranges for the X and Y axes and check for every coordinate whether it lies within these boundaries:
# Input
coord = [(3,3),(4,5),(14,14),(13,0),(0,13)]
lowerBorder = (2,2)
upperBorder = (12,12)

# Create the ranges for each axis (+1 because range excludes its upper bound).
xRange, yRange = [range(lowerBorder[i], upperBorder[i] + 1) for i in range(2)]

# Filter the coordinates.
valid = list(filter(lambda c: c[0] in xRange and c[1] in yRange, coord))
print(valid)
Output:
[(3, 3), (4, 5)]

Creating two concatenated arrays from a generator

Consider the following example in Python 2.7. We have an arbitrary function f() that returns two 1-dimensional numpy arrays. Note that in general f() may return arrays of different sizes and that the sizes may depend on the input.
Now we would like to call map on f() and concatenate the results into two separate new arrays.
import numpy as np

def f(x):
    return np.arange(x), np.ones(x, dtype=int)

inputs = np.arange(1, 10)
result = map(f, inputs)
x = np.concatenate([i[0] for i in result])
y = np.concatenate([i[1] for i in result])
This gives the intended result. However, since result may take up much memory, it may be preferable to use a generator by calling imap instead of map.
from itertools import imap
result = imap(f,inputs)
x = np.concatenate([i[0] for i in result])
y = np.concatenate([i[1] for i in result])
However, this raises an error, because the generator is already exhausted at the point where we calculate y.
Is there a way to use the generator only once and still create these two concatenated arrays? I'm looking for a solution without a for loop, since it is rather inefficient to repeatedly concatenate/append arrays.
Thanks in advance.
Is there a way to use the generator only once and still create these two concatenated arrays?
Yes, a generator can be cloned with tee:
import itertools
a, b = itertools.tee(result)
x = np.concatenate([i[0] for i in a])
y = np.concatenate([i[1] for i in b])
However, using tee does not help with the memory usage in your case. The above solution would require 5 N memory to run:
N for caching the generator inside tee,
2 N for the list comprehensions inside np.concatenate calls,
2 N for the concatenated arrays.
Clearly, we could do better by dropping the tee:
x_acc = []
y_acc = []
for x_i, y_i in result:
    x_acc.append(x_i)
    y_acc.append(y_i)

x = np.concatenate(x_acc)
y = np.concatenate(y_acc)
This shaves off one more N, leaving 4 N. Going further means dropping the intermediate lists and preallocating x and y. Note that you needn't know the exact sizes of the arrays, only upper bounds:
# 'capacity' is any upper bound on the total number of elements.
x = np.empty(capacity)
y = np.empty(capacity)
right = 0
for x_i, y_i in result:
    left = right
    right += len(x_i)  # == len(y_i)
    x[left:right] = x_i
    y[left:right] = y_i

x = x[:right].copy()
y = y[:right].copy()
In fact, you don't even need an upper bound. Just ensure that x and y are big enough to accommodate the new item:
for x_i, y_i in result:
    # ...
    if right >= len(x):
        # It would be slightly trickier for >1D, but the idea
        # remains the same: alter the 0-th dimension to fit
        # the new item.
        new_capacity = int(max(right, len(x)) * 1.5)
        # Use np.resize here: ndarray.resize modifies in place and returns None.
        x = np.resize(x, new_capacity)
        y = np.resize(y, new_capacity)
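Putting the pieces together, a complete sketch might look like this (my assembly of the approach above, using the toy f from the question; it iterates map's result, so it works whether map is eager as in Py2 or lazy as in Py3):
import numpy as np

def f(x):
    return np.arange(x), np.ones(x, dtype=int)

x = np.empty(16, dtype=int)   # small initial capacity; grown on demand
y = np.empty(16, dtype=int)
right = 0
for x_i, y_i in map(f, range(1, 10)):
    # Grow both buffers whenever the next chunk would not fit.
    while right + len(x_i) > len(x):
        new_capacity = int(len(x) * 1.5) + 1
        x = np.resize(x, new_capacity)
        y = np.resize(y, new_capacity)
    left = right
    right += len(x_i)
    x[left:right] = x_i
    y[left:right] = y_i

x = x[:right].copy()   # trim the unused tail
y = y[:right].copy()
print(x)
print(y)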
