how to generate 100000 random numbers in python [duplicate] - python

I tried using random.randint(0, 100), but some numbers were the same. Is there a method/module to create a list unique random numbers?

This will return a list of 10 numbers selected from the range 0 to 99, without duplicates.
import random
random.sample(range(100), 10)

You can use the shuffle function from the random module like this:
import random
nums = list(range(1, 100)) # list of integers from 1 to 99
# adjust this boundaries to fit your needs
random.shuffle(nums)
print(nums) # <- List of unique random numbers
Note here that the shuffle method doesn't return any list as one may expect, it only shuffle the list passed by reference.

You can first create a list of numbers from a to b, where a and b are respectively the smallest and greatest numbers in your list, then shuffle it with Fisher-Yates algorithm or using the Python's random.shuffle method.

Linear Congruential Pseudo-random Number Generator
O(1) Memory
O(k) Operations
This problem can be solved with a simple Linear Congruential Generator. This requires constant memory overhead (8 integers) and at most 2*(sequence length) computations.
All other solutions use more memory and more compute! If you only need a few random sequences, this method will be significantly cheaper. For ranges of size N, if you want to generate on the order of N unique k-sequences or more, I recommend the accepted solution using the builtin methods random.sample(range(N),k) as this has been optimized in python for speed.
Code
# Return a randomized "range" using a Linear Congruential Generator
# to produce the number sequence. Parameters are the same as for
# python builtin "range".
# Memory -- storage for 8 integers, regardless of parameters.
# Compute -- at most 2*"maximum" steps required to generate sequence.
#
def random_range(start, stop=None, step=None):
import random, math
# Set a default values the same way "range" does.
if (stop == None): start, stop = 0, start
if (step == None): step = 1
# Use a mapping to convert a standard range into the desired range.
mapping = lambda i: (i*step) + start
# Compute the number of numbers in this range.
maximum = (stop - start) // step
# Seed range with a random integer.
value = random.randint(0,maximum)
#
# Construct an offset, multiplier, and modulus for a linear
# congruential generator. These generators are cyclic and
# non-repeating when they maintain the properties:
#
# 1) "modulus" and "offset" are relatively prime.
# 2) ["multiplier" - 1] is divisible by all prime factors of "modulus".
# 3) ["multiplier" - 1] is divisible by 4 if "modulus" is divisible by 4.
#
offset = random.randint(0,maximum) * 2 + 1 # Pick a random odd-valued offset.
multiplier = 4*(maximum//4) + 1 # Pick a multiplier 1 greater than a multiple of 4.
modulus = int(2**math.ceil(math.log2(maximum))) # Pick a modulus just big enough to generate all numbers (power of 2).
# Track how many random numbers have been returned.
found = 0
while found < maximum:
# If this is a valid value, yield it in generator fashion.
if value < maximum:
found += 1
yield mapping(value)
# Calculate the next value in the sequence.
value = (value*multiplier + offset) % modulus
Usage
The usage of this function "random_range" is the same as for any generator (like "range"). An example:
# Show off random range.
print()
for v in range(3,6):
v = 2**v
l = list(random_range(v))
print("Need",v,"found",len(set(l)),"(min,max)",(min(l),max(l)))
print("",l)
print()
Sample Results
Required 8 cycles to generate a sequence of 8 values.
Need 8 found 8 (min,max) (0, 7)
[1, 0, 7, 6, 5, 4, 3, 2]
Required 16 cycles to generate a sequence of 9 values.
Need 9 found 9 (min,max) (0, 8)
[3, 5, 8, 7, 2, 6, 0, 1, 4]
Required 16 cycles to generate a sequence of 16 values.
Need 16 found 16 (min,max) (0, 15)
[5, 14, 11, 8, 3, 2, 13, 1, 0, 6, 9, 4, 7, 12, 10, 15]
Required 32 cycles to generate a sequence of 17 values.
Need 17 found 17 (min,max) (0, 16)
[12, 6, 16, 15, 10, 3, 14, 5, 11, 13, 0, 1, 4, 8, 7, 2, ...]
Required 32 cycles to generate a sequence of 32 values.
Need 32 found 32 (min,max) (0, 31)
[19, 15, 1, 6, 10, 7, 0, 28, 23, 24, 31, 17, 22, 20, 9, ...]
Required 64 cycles to generate a sequence of 33 values.
Need 33 found 33 (min,max) (0, 32)
[11, 13, 0, 8, 2, 9, 27, 6, 29, 16, 15, 10, 3, 14, 5, 24, ...]

The solution presented in this answer works, but it could become problematic with memory if the sample size is small, but the population is huge (e.g. random.sample(insanelyLargeNumber, 10)).
To fix that, I would go with this:
answer = set()
sampleSize = 10
answerSize = 0
while answerSize < sampleSize:
r = random.randint(0,100)
if r not in answer:
answerSize += 1
answer.add(r)
# answer now contains 10 unique, random integers from 0.. 100

If you need to sample extremely large numbers, you cannot use range
random.sample(range(10000000000000000000000000000000), 10)
because it throws:
OverflowError: Python int too large to convert to C ssize_t
Also, if random.sample cannot produce the number of items you want due to the range being too small
random.sample(range(2), 1000)
it throws:
ValueError: Sample larger than population
This function resolves both problems:
import random
def random_sample(count, start, stop, step=1):
def gen_random():
while True:
yield random.randrange(start, stop, step)
def gen_n_unique(source, n):
seen = set()
seenadd = seen.add
for i in (i for i in source() if i not in seen and not seenadd(i)):
yield i
if len(seen) == n:
break
return [i for i in gen_n_unique(gen_random,
min(count, int(abs(stop - start) / abs(step))))]
Usage with extremely large numbers:
print('\n'.join(map(str, random_sample(10, 2, 10000000000000000000000000000000))))
Sample result:
7822019936001013053229712669368
6289033704329783896566642145909
2473484300603494430244265004275
5842266362922067540967510912174
6775107889200427514968714189847
9674137095837778645652621150351
9969632214348349234653730196586
1397846105816635294077965449171
3911263633583030536971422042360
9864578596169364050929858013943
Usage where the range is smaller than the number of requested items:
print(', '.join(map(str, random_sample(100000, 0, 3))))
Sample result:
2, 0, 1
It also works with with negative ranges and steps:
print(', '.join(map(str, random_sample(10, 10, -10, -2))))
print(', '.join(map(str, random_sample(10, 5, -5, -2))))
Sample results:
2, -8, 6, -2, -4, 0, 4, 10, -6, 8
-3, 1, 5, -1, 3

If the list of N numbers from 1 to N is randomly generated, then yes, there is a possibility that some numbers may be repeated.
If you want a list of numbers from 1 to N in a random order, fill an array with integers from 1 to N, and then use a Fisher-Yates shuffle or Python's random.shuffle().

Here is a very small function I made, hope this helps!
import random
numbers = list(range(0, 100))
random.shuffle(numbers)

A very simple function that also solves your problem
from random import randint
data = []
def unique_rand(inicial, limit, total):
data = []
i = 0
while i < total:
number = randint(inicial, limit)
if number not in data:
data.append(number)
i += 1
return data
data = unique_rand(1, 60, 6)
print(data)
"""
prints something like
[34, 45, 2, 36, 25, 32]
"""

One straightforward alternative is to use np.random.choice() as shown below
np.random.choice(range(10), size=3, replace=False)
This results in three integer numbers that are different from each other. e.g., [1, 3, 5], [2, 5, 1]...

The answer provided here works very well with respect to time
as well as memory but a bit more complicated as it uses advanced python
constructs such as yield. The simpler answer works well in practice but, the issue with that
answer is that it may generate many spurious integers before actually constructing
the required set. Try it out with populationSize = 1000, sampleSize = 999.
In theory, there is a chance that it doesn't terminate.
The answer below addresses both issues, as it is deterministic and somewhat efficient
though currently not as efficient as the other two.
def randomSample(populationSize, sampleSize):
populationStr = str(populationSize)
dTree, samples = {}, []
for i in range(sampleSize):
val, dTree = getElem(populationStr, dTree, '')
samples.append(int(val))
return samples, dTree
where the functions getElem, percolateUp are as defined below
import random
def getElem(populationStr, dTree, key):
msd = int(populationStr[0])
if not key in dTree.keys():
dTree[key] = range(msd + 1)
idx = random.randint(0, len(dTree[key]) - 1)
key = key + str(dTree[key][idx])
if len(populationStr) == 1:
dTree[key[:-1]].pop(idx)
return key, (percolateUp(dTree, key[:-1]))
newPopulation = populationStr[1:]
if int(key[-1]) != msd:
newPopulation = str(10**(len(newPopulation)) - 1)
return getElem(newPopulation, dTree, key)
def percolateUp(dTree, key):
while (dTree[key] == []):
dTree[key[:-1]].remove( int(key[-1]) )
key = key[:-1]
return dTree
Finally, the timing on average was about 15ms for a large value of n as shown below,
In [3]: n = 10000000000000000000000000000000
In [4]: %time l,t = randomSample(n, 5)
Wall time: 15 ms
In [5]: l
Out[5]:
[10000000000000000000000000000000L,
5731058186417515132221063394952L,
85813091721736310254927217189L,
6349042316505875821781301073204L,
2356846126709988590164624736328L]

In order to obtain a program that generates a list of random values without duplicates that is deterministic, efficient and built with basic programming constructs consider the function extractSamples defined below,
def extractSamples(populationSize, sampleSize, intervalLst) :
import random
if (sampleSize > populationSize) :
raise ValueError("sampleSize = "+str(sampleSize) +" > populationSize (= " + str(populationSize) + ")")
samples = []
while (len(samples) < sampleSize) :
i = random.randint(0, (len(intervalLst)-1))
(a,b) = intervalLst[i]
sample = random.randint(a,b)
if (a==b) :
intervalLst.pop(i)
elif (a == sample) : # shorten beginning of interval
intervalLst[i] = (sample+1, b)
elif ( sample == b) : # shorten interval end
intervalLst[i] = (a, sample - 1)
else :
intervalLst[i] = (a, sample - 1)
intervalLst.append((sample+1, b))
samples.append(sample)
return samples
The basic idea is to keep track of intervals intervalLst for possible values from which to select our required elements from. This is deterministic in the sense that we are guaranteed to generate a sample within a fixed number of steps (solely dependent on populationSize and sampleSize).
To use the above function to generate our required list,
In [3]: populationSize, sampleSize = 10**17, 10**5
In [4]: %time lst1 = extractSamples(populationSize, sampleSize, [(0, populationSize-1)])
CPU times: user 289 ms, sys: 9.96 ms, total: 299 ms
Wall time: 293 ms
We may also compare with an earlier solution (for a lower value of populationSize)
In [5]: populationSize, sampleSize = 10**8, 10**5
In [6]: %time lst = random.sample(range(populationSize), sampleSize)
CPU times: user 1.89 s, sys: 299 ms, total: 2.19 s
Wall time: 2.18 s
In [7]: %time lst1 = extractSamples(populationSize, sampleSize, [(0, populationSize-1)])
CPU times: user 449 ms, sys: 8.92 ms, total: 458 ms
Wall time: 442 ms
Note that I reduced populationSize value as it produces Memory Error for higher values when using the random.sample solution (also mentioned in previous answers here and here). For above values, we can also observe that extractSamples outperforms the random.sample approach.
P.S. : Though the core approach is similar to my earlier answer, there are substantial modifications in implementation as well as approach alongwith improvement in clarity.

The problem with the set based approaches ("if random value in return values, try again") is that their runtime is undetermined due to collisions (which require another "try again" iteration), especially when a large amount of random values are returned from the range.
An alternative that isn't prone to this non-deterministic runtime is the following:
import bisect
import random
def fast_sample(low, high, num):
""" Samples :param num: integer numbers in range of
[:param low:, :param high:) without replacement
by maintaining a list of ranges of values that
are permitted.
This list of ranges is used to map a random number
of a contiguous a range (`r_n`) to a permissible
number `r` (from `ranges`).
"""
ranges = [high]
high_ = high - 1
while len(ranges) - 1 < num:
# generate a random number from an ever decreasing
# contiguous range (which we'll map to the true
# random number).
# consider an example with low=0, high=10,
# part way through this loop with:
#
# ranges = [0, 2, 3, 7, 9, 10]
#
# r_n :-> r
# 0 :-> 1
# 1 :-> 4
# 2 :-> 5
# 3 :-> 6
# 4 :-> 8
r_n = random.randint(low, high_)
range_index = bisect.bisect_left(ranges, r_n)
r = r_n + range_index
for i in xrange(range_index, len(ranges)):
if ranges[i] <= r:
# as many "gaps" we iterate over, as much
# is the true random value (`r`) shifted.
r = r_n + i + 1
elif ranges[i] > r_n:
break
# mark `r` as another "gap" of the original
# [low, high) range.
ranges.insert(i, r)
# Fewer values possible.
high_ -= 1
# `ranges` happens to contain the result.
return ranges[:-1]

I found a quite faster way than having to use the range function (very slow), and without using random function from python (I don´t like the random built-in library because when you seed it, it repeats the pattern of the random numbers generator)
import numpy as np
nums = set(np.random.randint(low=0, high=100, size=150)) #generate some more for the duplicates
nums = list(nums)[:100]
This is quite fast.

You can use Numpy library for quick answer as shown below -
Given code snippet lists down 6 unique numbers between the range of 0 to 5. You can adjust the parameters for your comfort.
import numpy as np
import random
a = np.linspace( 0, 5, 6 )
random.shuffle(a)
print(a)
Output
[ 2. 1. 5. 3. 4. 0.]
It doesn't put any constraints as we see in random.sample as referred here.

import random
sourcelist=[]
resultlist=[]
for x in range(100):
sourcelist.append(x)
for y in sourcelist:
resultlist.insert(random.randint(0,len(resultlist)),y)
print (resultlist)

Try using...
import random
LENGTH = 100
random_with_possible_duplicates = [random.randrange(-3, 3) for _ in range(LENGTH)]
random_without_duplicates = list(set(random_with_possible_duplicates)) # This removes duplicates
Advatages
Fast, efficient and readable.
Possible Issues
This method can change the length of the list if there are duplicates.

If you wish to ensure that the numbers being added are unique, you could use a Set object
if using 2.7 or greater, or import the sets module if not.
As others have mentioned, this means the numbers are not truly random.

If the amount of numbers you want is random, you can do something like this. In this case, length is the highest number you want to choose from.
If it notices the new random number was already chosen, itll subtract 1 from count (since a count was added before it knew whether it was a duplicate or not). If its not in the list, then do what you want with it and add it to the list so it cant get picked again.
import random
def randomizer():
chosen_number=[]
count=0
user_input = int(input("Enter number for how many rows to randomly select: "))
numlist=[]
#length = whatever the highest number you want to choose from
while 1<=user_input<=length:
count=count+1
if count>user_input:
break
else:
chosen_number = random.randint(0, length)
if line_number in numlist:
count=count-1
continue
if chosen_number not in numlist:
numlist.append(chosen_number)
#do what you want here

Edit: ignore my answer here. use python's random.shuffle or random.sample, as mentioned in other answers.
to sample integers without replacement between `minval` and `maxval`:
import numpy as np
minval, maxval, n_samples = -50, 50, 10
generator = np.random.default_rng(seed=0)
samples = generator.permutation(np.arange(minval, maxval))[:n_samples]
# or, if minval is 0,
samples = generator.permutation(maxval)[:n_samples]
with jax:
import jax
minval, maxval, n_samples = -50, 50, 10
key = jax.random.PRNGKey(seed=0)
samples = jax.random.shuffle(key, jax.numpy.arange(minval, maxval))[:n_samples]

From the CLI in win xp:
python -c "import random; print(sorted(set([random.randint(6,49) for i in range(7)]))[:6])"
In Canada we have the 6/49 Lotto. I just wrap the above code in lotto.bat and run C:\home\lotto.bat or just C:\home\lotto.
Because random.randint often repeats a number, I use set with range(7) and then shorten it to a length of 6.
Occasionally if a number repeats more than 2 times the resulting list length will be less than 6.
EDIT: However, random.sample(range(6,49),6) is the correct way to go.

Related

How can I make all possible list combinations using reversible seeds?

I want to have a list made out of 100 items, each item being a value from 0-31 inclusive. I then want to be able to take one of these lists, and know what seed/input is necessary to randomly generate that exact list.
Using some suitable Linear Congruential Generator:
You could use this research paper by Pierre L'Ecuyer:
Tables_of_linear_congruential_generators_of_different_sizes_and_good_lattice_structure
The lowest power of 2 modulus for which the paper gives examples of (decently pseudo-random) LCGs is 230, which is close to 1 billion. See table 4 of the paper. Just pick one of those LCGs, say:
un+1 = ((116646453 * un) + 5437) mod 230
Each of your items is exactly 5 bits wide. If you decide to group your items 6 by 6, each group is exactly 30 bits wide, so can be considered as one state of this modulo 230 LCG.
From a initial group of 6 items, one step of the LCG will generate the next group, that is the next 6 items. And the paper tells you that the serie will look reasonably random overall.
Hence, you can regard the first 6 items as your “seed”, as you can reconstruct the whole list from its leftmost 6 items.
Even assuming that for the sake of obfuscation you started the visible part of the list after the seed, you would still have only about one billion possible seeds to worry about. Any decent computer would be able to find the left-hidden seed by brute force within a handful of seconds, by simulating the LCG for every possible seed and comparing with the actual list.
Sample Python code:
One can start by creating a class that, given a seed, supplies an unlimited serie of items between 0 and 31:
class LEcuyer30:
def __init__ (self, seed):
self.seed = seed & ((1 << 30) - 1)
self.currGroup = self.seed
self.itemCount = 6
def __toNextGroup(self):
nextGroup = ((116646453 * self.currGroup) + 5437) % (1 << 30)
self.currGroup = nextGroup
self.itemCount = 6
def getItem(self):
if (self.itemCount <= 0):
self.__toNextGroup()
# extract 5 bits:
word = self.currGroup >> (5 * (self.itemCount - 1))
self.itemCount -= 1
return (word & 31)
Test code:
We can create a sequence of 20 items and print it:
# Main program:
randomSeed = 514703103
rng = LEcuyer30(randomSeed)
itemList = []
for k in range(20):
item = rng.getItem()
itemList.append(item)
print("itemList = %s" % itemList)
Program output:
itemList = [15, 10, 27, 15, 23, 31, 1, 10, 5, 15, 16, 8, 4, 16, 24, 31, 7, 5, 8, 19]

How to make sure that a list of generated numbers follow a uniform distribution

I have a list of 150 numbers from 0 to 149. I would like to use a for loop with 150 iterations in order to generate 150 lists of 6 numbers such that,t in each iteration k, the number k is included as well as 5 different random numbers. For example:
S0 = [0, r1, r2, r3, r4, r5] # r1, r2,..., r5 are random numbers between 0 and 150
S1 = [1, r1', r2', r3', r4', r5'] # r1', r2',..., r5' are new random numbers between 0 and 150
...
S149 = [149, r1'', r2'', r3'', r4'', r5'']
In addition, the numbers in each list have to be different and with a minimum distance of 5. This is the code I am using:
import random
import numpy as np
final_list = []
for k in range(150):
S = [k]
for it in range(5):
domain = [ele for ele in range(150) if ele not in S]
d = 0
x = k
while d < 5:
d = np.Infinity
x = random.sample(domain, 1)[0]
for ch in S:
if np.abs(ch - x) < d:
d = np.abs(ch - x)
S.append(x)
final_list.append(S)
Output:
[[0, 149, 32, 52, 39, 126],
[1, 63, 16, 50, 141, 79],
[2, 62, 21, 42, 35, 71],
...
[147, 73, 38, 115, 82, 47],
[148, 5, 78, 115, 140, 43],
[149, 36, 3, 15, 99, 23]]
Now, the code is working but I would like to know if it's possible to force that number of repetitions that each number has through all the iterations is approximately the same. For example, after using the previous code, this plot indicates how many times each number has appeared in the generated lists:
As you can see, there are numbers that have appeared more than 10 times while there are others that have appeared only 2 times. Is it possible to reduce this level of variation so that this plot can be approximated as a uniform distribution? Thanks.
First, I am not sure that your assertion that the current results are not uniformly distributed is necessarily correct. It would seem prudent to me to try and examine the histogram over several repetitions of the process, rather than just one.
I am not a statistician, but when I want to approximate uniform distribution (and assuming that the functions in random provide uniform distribution), what I try to do is to simply accept all results returned by random functions. For that, I need to limit the choices given to these functions ahead of calling them. This is how I would go about your task:
import random
import numpy as np
N = 150
def random_subset(n):
result = []
cands = set(range(N))
for i in range(6):
result.append(n) # Initially, n is the number that must appear in the result
cands -= set(range(n - 4, n + 5)) # Remove candidates less than 5 away
n = random.choice(list(cands)) # Select next number
return result
result = np.array([random_subset(n) for n in range(N)])
print(result)
Simply put, whenever I add a number n to the result set, I take out of the selection candidates, an environment of the proper size, to ensure no number of a distance of less than 5 can be selected in the future.
The code is not optimized (multiple set to list conversions) but it works (as per my uderstanding).
You can force it to be precisely uniform, if you so desire.
Apologies for the mix of globals and locals, this seemed the most readable. You would want to rewrite according to how variable your constants are =)
import random
SIZE = 150
SAMPLES = 5
def get_samples():
pool = list(range(SIZE)) * SAMPLES
random.shuffle(pool)
items = []
for i in range(SIZE):
selection, pool = pool[:SAMPLES], pool[SAMPLES:]
item = [i] + selection
items.append(item)
return items
Then you will have exactly 5 of each (and one more in the leading position, which is a weird data structure).
>>> set(collections.Counter(vv for v in get_samples() for vv in v).values())
{6}
The method above does not guarantee the last 5 numbers are unique, in fact, you would expect ~10/150 to have a duplicate. If that is important, you need to filter your distribution a little more and decide how well you value tight uniformity, duplicates, etc.
If your numbers are approximately what you gave above, you also can patch up the results (fairly) and hope to avoid long search times (not the case for SAMPLES sizes closer to OPTIONS size)
def get_samples():
pool = list(range(SIZE)) * SAMPLES
random.shuffle(pool)
i = 0
while i < len(pool):
if i % SAMPLES == 0:
seen = set()
v = pool[i]
if v in seen: # swap
dst = random.choice(range(SIZE))
pool[dst], pool[i] = pool[i], pool[dst]
i = dst - dst % SAMPLES # Restart from swapped segment
else:
seen.add(v)
i += 1
items = []
for i in range(SIZE):
selection, pool = pool[:SAMPLES], pool[SAMPLES:]
assert len(set(selection)) == SAMPLES, selection
item = [i] + selection
items.append(item)
return items
This will typically take less than 5 passes through to clean up any duplicates, and should leave all arrangements satisfying your conditions equally likely.

Given a string of a million numbers, return all repeating 3 digit numbers

I had an interview with a hedge fund company in New York a few months ago and unfortunately, I did not get the internship offer as a data/software engineer. (They also asked the solution to be in Python.)
I pretty much screwed up on the first interview problem...
Question: Given a string of a million numbers (Pi for example), write
a function/program that returns all repeating 3 digit numbers and number of
repetition greater than 1
For example: if the string was: 123412345123456 then the function/program would return:
123 - 3 times
234 - 3 times
345 - 2 times
They did not give me the solution after I failed the interview, but they did tell me that the time complexity for the solution was constant of 1000 since all the possible outcomes are between:
000 --> 999
Now that I'm thinking about it, I don't think it's possible to come up with a constant time algorithm. Is it?
You got off lightly, you probably don't want to be working for a hedge fund where the quants don't understand basic algorithms :-)
There is no way to process an arbitrarily-sized data structure in O(1) if, as in this case, you need to visit every element at least once. The best you can hope for is O(n) in this case, where n is the length of the string.
Although, as an aside, a nominal O(n) algorithm will be O(1) for a fixed input size so, technically, they may have been correct here. However, that's not usually how people use complexity analysis.
It appears to me you could have impressed them in a number of ways.
First, by informing them that it's not possible to do it in O(1), unless you use the "suspect" reasoning given above.
Second, by showing your elite skills by providing Pythonic code such as:
inpStr = '123412345123456'
# O(1) array creation.
freq = [0] * 1000
# O(n) string processing.
for val in [int(inpStr[pos:pos+3]) for pos in range(len(inpStr) - 2)]:
freq[val] += 1
# O(1) output of relevant array values.
print ([(num, freq[num]) for num in range(1000) if freq[num] > 1])
This outputs:
[(123, 3), (234, 3), (345, 2)]
though you could, of course, modify the output format to anything you desire.
And, finally, by telling them there's almost certainly no problem with an O(n) solution, since the code above delivers results for a one-million-digit string in well under half a second. It seems to scale quite linearly as well, since a 10,000,000-character string takes 3.5 seconds and a 100,000,000-character one takes 36 seconds.
And, if they need better than that, there are ways to parallelise this sort of stuff that can greatly speed it up.
Not within a single Python interpreter of course, due to the GIL, but you could split the string into something like (overlap indicated by vv is required to allow proper processing of the boundary areas):
vv
123412 vv
123451
5123456
You can farm these out to separate workers and combine the results afterwards.
The splitting of input and combining of output are likely to swamp any saving with small strings (and possibly even million-digit strings) but, for much larger data sets, it may well make a difference. My usual mantra of "measure, don't guess" applies here, of course.
This mantra also applies to other possibilities, such as bypassing Python altogether and using a different language which may be faster.
For example, the following C code, running on the same hardware as the earlier Python code, handles a hundred million digits in 0.6 seconds, roughly the same amount of time as the Python code processed one million. In other words, much faster:
#include <stdio.h>
#include <string.h>
int main(void) {
static char inpStr[100000000+1];
static int freq[1000];
// Set up test data.
memset(inpStr, '1', sizeof(inpStr));
inpStr[sizeof(inpStr)-1] = '\0';
// Need at least three digits to do anything useful.
if (strlen(inpStr) <= 2) return 0;
// Get initial feed from first two digits, process others.
int val = (inpStr[0] - '0') * 10 + inpStr[1] - '0';
char *inpPtr = &(inpStr[2]);
while (*inpPtr != '\0') {
// Remove hundreds, add next digit as units, adjust table.
val = (val % 100) * 10 + *inpPtr++ - '0';
freq[val]++;
}
// Output (relevant part of) table.
for (int i = 0; i < 1000; ++i)
if (freq[i] > 1)
printf("%3d -> %d\n", i, freq[i]);
return 0;
}
Constant time isn't possible. All 1 million digits need to be looked at at least once, so that is a time complexity of O(n), where n = 1 million in this case.
For a simple O(n) solution, create an array of size 1000 that represents the number of occurrences of each possible 3 digit number. Advance 1 digit at a time, first index == 0, last index == 999997, and increment array[3 digit number] to create a histogram (count of occurrences for each possible 3 digit number). Then output the content of the array with counts > 1.
A million is small for the answer I give below. Expecting only that you have to be able to run the solution in the interview, without a pause, then The following works in less than two seconds and gives the required result:
from collections import Counter
def triple_counter(s):
c = Counter(s[n-3: n] for n in range(3, len(s)))
for tri, n in c.most_common():
if n > 1:
print('%s - %i times.' % (tri, n))
else:
break
if __name__ == '__main__':
import random
s = ''.join(random.choice('0123456789') for _ in range(1_000_000))
triple_counter(s)
Hopefully the interviewer would be looking for use of the standard libraries collections.Counter class.
Parallel execution version
I wrote a blog post on this with more explanation.
The simple O(n) solution would be to count each 3-digit number:
for nr in range(1000):
cnt = text.count('%03d' % nr)
if cnt > 1:
print '%03d is found %d times' % (nr, cnt)
This would search through all 1 million digits 1000 times.
Traversing the digits only once:
counts = [0] * 1000
for idx in range(len(text)-2):
counts[int(text[idx:idx+3])] += 1
for nr, cnt in enumerate(counts):
if cnt > 1:
print '%03d is found %d times' % (nr, cnt)
Timing shows that iterating only once over the index is twice as fast as using count.
Here is a NumPy implementation of the "consensus" O(n) algorithm: walk through all triplets and bin as you go. The binning is done by upon encountering say "385", adding one to bin[3, 8, 5] which is an O(1) operation. Bins are arranged in a 10x10x10 cube. As the binning is fully vectorized there is no loop in the code.
def setup_data(n):
import random
digits = "0123456789"
return dict(text = ''.join(random.choice(digits) for i in range(n)))
def f_np(text):
# Get the data into NumPy
import numpy as np
a = np.frombuffer(bytes(text, 'utf8'), dtype=np.uint8) - ord('0')
# Rolling triplets
a3 = np.lib.stride_tricks.as_strided(a, (3, a.size-2), 2*a.strides)
bins = np.zeros((10, 10, 10), dtype=int)
# Next line performs O(n) binning
np.add.at(bins, tuple(a3), 1)
# Filtering is left as an exercise
return bins.ravel()
def f_py(text):
counts = [0] * 1000
for idx in range(len(text)-2):
counts[int(text[idx:idx+3])] += 1
return counts
import numpy as np
import types
from timeit import timeit
for n in (10, 1000, 1000000):
data = setup_data(n)
ref = f_np(**data)
print(f'n = {n}')
for name, func in list(globals().items()):
if not name.startswith('f_') or not isinstance(func, types.FunctionType):
continue
try:
assert np.all(ref == func(**data))
print("{:16s}{:16.8f} ms".format(name[2:], timeit(
'f(**data)', globals={'f':func, 'data':data}, number=10)*100))
except:
print("{:16s} apparently crashed".format(name[2:]))
Unsurprisingly, NumPy is a bit faster than #Daniel's pure Python solution on large data sets. Sample output:
# n = 10
# np 0.03481400 ms
# py 0.00669330 ms
# n = 1000
# np 0.11215360 ms
# py 0.34836530 ms
# n = 1000000
# np 82.46765980 ms
# py 360.51235450 ms
I would solve the problem as follows:
def find_numbers(str_num):
final_dict = {}
buffer = {}
for idx in range(len(str_num) - 3):
num = int(str_num[idx:idx + 3])
if num not in buffer:
buffer[num] = 0
buffer[num] += 1
if buffer[num] > 1:
final_dict[num] = buffer[num]
return final_dict
Applied to your example string, this yields:
>>> find_numbers("123412345123456")
{345: 2, 234: 3, 123: 3}
This solution runs in O(n) for n being the length of the provided string, and is, I guess, the best you can get.
As per my understanding, you cannot have the solution in a constant time. It will take at least one pass over the million digit number (assuming its a string). You can have a 3-digit rolling iteration over the digits of the million length number and increase the value of hash key by 1 if it already exists or create a new hash key (initialized by value 1) if it doesn't exists already in the dictionary.
The code will look something like this:
def calc_repeating_digits(number):
hash = {}
for i in range(len(str(number))-2):
current_three_digits = number[i:i+3]
if current_three_digits in hash.keys():
hash[current_three_digits] += 1
else:
hash[current_three_digits] = 1
return hash
You can filter down to the keys which have item value greater than 1.
As mentioned in another answer, you cannot do this algorithm in constant time, because you must look at at least n digits. Linear time is the fastest you can get.
However, the algorithm can be done in O(1) space. You only need to store the counts of each 3 digit number, so you need an array of 1000 entries. You can then stream the number in.
My guess is that either the interviewer misspoke when they gave you the solution, or you misheard "constant time" when they said "constant space."
Here's my answer:
from timeit import timeit
from collections import Counter
import types
import random
def setup_data(n):
digits = "0123456789"
return dict(text = ''.join(random.choice(digits) for i in range(n)))
def f_counter(text):
c = Counter()
for i in range(len(text)-2):
ss = text[i:i+3]
c.update([ss])
return (i for i in c.items() if i[1] > 1)
def f_dict(text):
d = {}
for i in range(len(text)-2):
ss = text[i:i+3]
if ss not in d:
d[ss] = 0
d[ss] += 1
return ((i, d[i]) for i in d if d[i] > 1)
def f_array(text):
a = [[[0 for _ in range(10)] for _ in range(10)] for _ in range(10)]
for n in range(len(text)-2):
i, j, k = (int(ss) for ss in text[n:n+3])
a[i][j][k] += 1
for i, b in enumerate(a):
for j, c in enumerate(b):
for k, d in enumerate(c):
if d > 1: yield (f'{i}{j}{k}', d)
for n in (1E1, 1E3, 1E6):
n = int(n)
data = setup_data(n)
print(f'n = {n}')
results = {}
for name, func in list(globals().items()):
if not name.startswith('f_') or not isinstance(func, types.FunctionType):
continue
print("{:16s}{:16.8f} ms".format(name[2:], timeit(
'results[name] = f(**data)', globals={'f':func, 'data':data, 'results':results, 'name':name}, number=10)*100))
for r in results:
print('{:10}: {}'.format(r, sorted(list(results[r]))[:5]))
The array lookup method is very fast (even faster than #paul-panzer's numpy method!). Of course, it cheats since it isn't technicailly finished after it completes, because it's returning a generator. It also doesn't have to check every iteration if the value already exists, which is likely to help a lot.
n = 10
counter 0.10595780 ms
dict 0.01070654 ms
array 0.00135370 ms
f_counter : []
f_dict : []
f_array : []
n = 1000
counter 2.89462101 ms
dict 0.40434612 ms
array 0.00073838 ms
f_counter : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_dict : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_array : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
n = 1000000
counter 2849.00500992 ms
dict 438.44007806 ms
array 0.00135370 ms
f_counter : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_dict : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_array : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
Image as answer:
Looks like a sliding window.
Here is my solution:
from collections import defaultdict
string = "103264685134845354863"
d = defaultdict(int)
for elt in range(len(string)-2):
d[string[elt:elt+3]] += 1
d = {key: d[key] for key in d.keys() if d[key] > 1}
With a bit of creativity in for loop(and additional lookup list with True/False/None for example) you should be able to get rid of last line, as you only want to create keys in dict that we visited once up to that point.
Hope it helps :)
-Telling from the perspective of C.
-You can have an int 3-d array results[10][10][10];
-Go from 0th location to n-4th location, where n being the size of the string array.
-On each location, check the current, next and next's next.
-Increment the cntr as resutls[current][next][next's next]++;
-Print the values of
results[1][2][3]
results[2][3][4]
results[3][4][5]
results[4][5][6]
results[5][6][7]
results[6][7][8]
results[7][8][9]
-It is O(n) time, there is no comparisons involved.
-You can run some parallel stuff here by partitioning the array and calculating the matches around the partitions.
inputStr = '123456123138276237284287434628736482376487234682734682736487263482736487236482634'
count = {}
for i in range(len(inputStr) - 2):
subNum = int(inputStr[i:i+3])
if subNum not in count:
count[subNum] = 1
else:
count[subNum] += 1
print count

For cycle gets stuck in Python

My code below is getting stuck on a random point:
import functions
from itertools import product
from random import randrange
values = {}
tables = {}
letters = "abcdefghi"
nums = "123456789"
for x in product(letters, nums): #unnecessary
values[x[0] + x[1]] = 0
for x in product(nums, letters): #unnecessary
tables[x[0] + x[1]] = 0
for line_cnt in range(1,10):
for column_cnt in range(1,10):
num = randrange(1,10)
table_cnt = functions.which_table(line_cnt, column_cnt) #Returns a number identifying the table considered
#gets the values already in the line and column and table considered
line = [y for x,y in values.items() if x.startswith(letters[line_cnt-1])]
column = [y for x,y in values.items() if x.endswith(nums[column_cnt-1])]
table = [x for x,y in tables.items() if x.startswith(str(table_cnt))]
#if num is not contained in any of these then it's acceptable, otherwise find another number
while num in line or num in column or num in table:
num = randrange(1,10)
values[letters[line_cnt-1] + nums[column_cnt-1]] = num #Assign the number to the values dictionary
print(line_cnt) #debug
print(sorted(values)) #debug
As you can see it's a program that generates random sudoku schemes using 2 dictionaries : values that contains the complete scheme and tables that contains the values for each table.
Example :
5th square on the first line = 3
|
v
values["a5"] = 3
tables["2b"] = 3
So what is the problem? Am I missing something?
import functions
...
table_cnt = functions.which_table(line_cnt, column_cnt) #Returns a number identifying the table considered
It's nice when we can execute the code right ahead on our own computer to test it. In other words, it would have been nice to replace "table_cnt" with a fixed value for the example (here, a simple string would have sufficed).
for x in product(letters, nums):
values[x[0] + x[1]] = 0
Not that important, but this is more elegant:
values = {x+y: 0 for x, y in product(letters, nums)}
And now, the core of the problem:
while num in line or num in column or num in table:
num = randrange(1,10)
This is where you loop forever. So, you are trying to generate a random sudoku. From your code, this is how you would generate a random list:
nums = []
for _ in range(9):
num = randrange(1, 10)
while num in nums:
num = randrange(1, 10)
nums.append(num)
The problem with this approach is that you have no idea how long the program will take to finish. It could take one second, or one year (although, that is unlikely). This is because there is no guarantee the program will not keep picking a number already taken, over and over.
Still, in practice it should still take a relatively short time to finish (this approach is not efficient but the list is very short). However, in the case of the sudoku, you can end up in an impossible setting. For example:
line = [6, 9, 1, 2, 3, 4, 5, 8, 0]
column = [0, 0, 0, 0, 7, 0, 0, 0, 0]
Where those are the first line (or any line actually) and the last column. When the algorithm will try to find a value for line[8], it will always fail since 7 is blocked by column.
If you want to keep it this way (aka brute force), you should detect such a situation and start over. Again, this is very unefficient and you should look at how to generate sudokus properly (my naive approach would be to start with a solved one and swap lines and columns randomly but I know this is not a good way).

Counting number of values between interval

Is there any efficient way in python to count the times an array of numbers is between certain intervals? the number of intervals i will be using may get quite large
like:
mylist = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
some function(mylist, startpoints):
# startpoints = [0,10,20]
count values in range [0,9]
count values in range [10-19]
output = [9,10]
you will have to iterate the list at least once.
The solution below works with any sequence/interval that implements comparision (<, >, etc) and uses bisect algorithm to find the correct point in the interval, so it is very fast.
It will work with floats, text, or whatever. Just pass a sequence and a list of the intervals.
from collections import defaultdict
from bisect import bisect_left
def count_intervals(sequence, intervals):
count = defaultdict(int)
intervals.sort()
for item in sequence:
pos = bisect_left(intervals, item)
if pos == len(intervals):
count[None] += 1
else:
count[intervals[pos]] += 1
return count
data = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
print count_intervals(data, [10, 20])
Will print
defaultdict(<type 'int'>, {10: 10, 20: 9})
Meaning that you have 10 values <10 and 9 values <20.
I don't know how large your list will get but here's another approach.
import numpy as np
mylist = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
np.histogram(mylist, bins=[0,9,19])
You can also use a combination of value_counts() and pd.cut() to help you get the job done.
import pandas as pd
mylist = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
split_mylist = pd.cut(mylist, [0, 9, 19]).value_counts(sort = False)
print(split_mylist)
This piece of code will return this:
(0, 10] 10
(10, 20] 9
dtype: int64
Then you can utilise the to_list() function to get what you want
split_mylist = split_mylist.tolist()
print(split_mylist)
Output: [10, 9]
If the numbers are integers, as in your example, representing the intervals as frozensets can perhaps be fastest (worth trying). Not sure if the intervals are guaranteed to be mutually exclusive -- if not, then
intervals = [frozenzet(range(10)), frozenset(range(10, 20))]
counts = [0] * len(intervals)
for n in mylist:
for i, inter in enumerate(intervals):
if n in inter:
counts[i] += 1
if the intervals are mutually exclusive, this code could be sped up a bit by breaking out of the inner loop right after the increment. However for mutually exclusive intervals of integers >= 0, there's an even more attractive option: first, prepare an auxiliary index, e.g. given your startpoints data structure that could be
indices = [sum(i > x for x in startpoints) - 1 for i in range(max(startpoints))]
and then
counts = [0] * len(intervals)
for n in mylist:
if 0 <= n < len(indices):
counts[indices[n]] += 1
this can be adjusted if the intervals can be < 0 (everything needs to be offset by -min(startpoints) in that case.
If the "numbers" can be arbitrary floats (or decimal.Decimals, etc), not just integer, the possibilities for optimization are more restricted. Is that the case...?

Categories