Detect significant changes in a data-set that gradually changes - python

I have a list of data in python that represents amount of resources used per minute. I want to find the number of times it changes significantly in that data set. What I mean by significant change is a bit different from what I've read so far.
For e.g. if I have a dataset like
[10,15,17,20,30,40,50,70,80,60,40,20]
I say a significant change happens when data increases by double or reduces by half with respect to the previous normal.
For e.g. since the list starts with 10, that is our starting normal point
Then when data doubles to 20, I count that as one significant change and set the normal to 20.
Then when data doubles to 40, it is considered a significant change and the normal is now 40
Then when data doubles to 80, it is considered a significant change and the normal is now 80
After that when data reduces by half to 40, it is considered as another significant change and the normal becomes 40
Finally when data reduces by half to 20, it is the last significant change
Here there are a total of 5 significant changes.
Is it similar to any other change detection algorithm? How can this be done efficiently in python?

This is relatively straightforward. You can do this with a single iteration through the list. We simply update our base when a 'significant' change occurs.
Note that my implementation will work for any iterable or container. This is useful if you want to, for example, read through a file without having to load it all into memory.
def gen_significant_changes(iterable, *, tol = 2):
iterable = iter(iterable) # this is necessary if it is container rather than generator.
# note that if the iterable is already a generator iter(iterable) returns itself.
base = next(iterable)
for x in iterable:
if x >= (base * tol) or x <= (base/tol):
yield x
base = x
my_list = [10,15,17,20,30,40,50,70,80,60,40,20]
print(list(gen_significant_changes(my_list)))

I can't help with the Python part, but in terms of math, the problem you're asking is fairly simple to solve using log base 2. A significant change occurs when the current value divided by a constant can be reached by raising 2 to a different power (as an integer) than the previous value. (The constant is needed since the first value in the array forms the basis of comparison.)
For each element at t, compute:
current = math.log(Array[t] /Array[0], 2)
previous = math.log(Array[t-1]/Array[0], 2)
if math.floor(current) <> math.floor(previous) a significant change has occurred
Using this method you do not need to keep track of a "normal point" at all, you just need the array. By removing the additional state variable we enable the array to be processed in any order, and we could give portions of the array to different threads if the dataset were very large. You wouldn't be able to do that with your current method.

Related

Efficient Way to Repeatedly Split Large NumPy Array and Record Middle

I have a large NumPy array nodes = np.arange(100_000_000) and I need to rearrange this array by:
Recording and then removing the middle value in the array
Split the array into the left half and right half
Repeat Steps 1-2 for each half
Stop when all values are exhausted
So, for a smaller input example nodes = np.arange(10), the output would be:
[5 2 8 1 4 7 9 0 3 6]
This was accomplished by naively doing:
import numpy as np
def split(node, out):
mid = len(node) // 2
out.append(node[mid])
return node[:mid], node[mid+1:]
def reorder(a):
nodes = [a.tolist()]
out = []
while nodes:
tmp = []
for node in nodes:
for n in split(node, out):
if n:
tmp.append(n)
nodes = tmp
return np.array(out)
if __name__ == "__main__":
nodes = np.arange(10)
print(reorder(nodes))
However, this is way too slow for nodes = np.arange(100_000_000) and so I am looking for a much faster solution.
You can vectorize your function with Numpy by working on groups of slices.
Here is an implementation:
# Similar to [e for tmp in zip(a, b) for e in tmp] ,
# but on Numpy arrays and much faster
def interleave(a, b):
assert len(a) == len(b)
return np.column_stack((a, b)).reshape(len(a) * 2)
# n is the length of the input range (len(a) in your example)
def fast_reorder(n):
if n == 0:
return np.empty(0, dtype=np.int32)
startSlices = np.array([0], dtype=np.int32)
endSlices = np.array([n], dtype=np.int32)
allMidSlices = np.empty(n, dtype=np.int32) # Similar to "out" in your implementation
midInsertCount = 0 # Actual size of allMidSlices
# Generate a bunch of middle values as long as there is valid slices to split
while midInsertCount < n:
# Generate the new mid/left/right slices
midSlices = (endSlices + startSlices) // 2
# Computing the next slices is not needed for the last step
if midInsertCount + len(midSlices) < n:
# Generate the nexts slices (possibly with invalid ones)
newStartSlices = interleave(startSlices, midSlices+1)
newEndSlices = interleave(midSlices, endSlices)
# Discard invalid slices
isValidSlices = newStartSlices < newEndSlices
startSlices = newStartSlices[isValidSlices]
endSlices = newEndSlices[isValidSlices]
# Fast appending
allMidSlices[midInsertCount:midInsertCount+len(midSlices)] = midSlices
midInsertCount += len(midSlices)
return allMidSlices[0:midInsertCount]
On my machine, this is 89 times faster than your scalar implementation with the input np.arange(100_000_000) dropping from 2min35 to 1.75s. It also consume far less memory (rougthly 3~4 times less). Note that if you want a faster code, then you probably need to use a native language like C or C++.
Edit:
The question has been updated to have a much smaller input array so I leave the below for historical reasons. Basically it was likely a typo but we often get accustomed to computers working with insanely large numbers and when memory is involved they can be a real problem.
There is already a numpy based solution submitted by someone else that I think fits the bill.
Your code requires an insane amount of RAM just to hold 100 billion 64 bit integers. Do you have 800GB of RAM? Then you convert the numpy array to a list which will be substantially larger than the array (each packed 64 bit int in the numpy array will become a much less memory efficient python int object and the list will have a pointer to that object). Then you make a lot of slices of the list which will not duplicate the data but will duplicate the pointers to the data and use even more RAM. You also append all the result values to a list a single value at a time. Lists are very fast for adding items generally but with such an extreme size this will not only be slow but the way the list is allocated is likely to be extremely wasteful RAM wise and contribute to major problems (I believe they double in size when they get to a certain level of fullness so you will end up allocating more RAM than you need and doing many allocations and likely copies). What kind of machine are you running this on? There are ways to improve your code but unless you're running it on a super computer I don't know that you're going to ever finish that calculation. I only..only? have 32GB of RAM and I'm not going to even try to create a 100B int_64 numpy array as I don't want to use up ssd write life for a mass of virtual memory.
As for improving your code stick to numpy arrays don't change to a python list it will greatly increase the RAM you need. Preallocate a numpy array to put the answer in. Then you need a new algorithm. Anything recursive or recursive like (ie a loop splitting the input,) will require tracking a lot of state, your nodes list is going to be extraordinarily gigantic and again use a lot of RAM. You could use len(a) to indicate values that are removed from your list and scan through the entire array each time to figure out what to do next but that will save RAM in favour of a tremendous amount of searching a gigantic array. I feel like there is an algorithm to cut numbers from each end and place them in the output and just track the beginning and end but I haven't figured it out at least not yet.
I also think there is a simpler algorithm where you just track the number of splits you've done instead of making a giant list of slices and keeping it all in memory. Take the middle of the left half and then the middle of the right then count up one and when you take the middle of the left half's left half you know you have to jump to the right half then the count is one so you jump over to the original right half's left half and on and on... Based on the depth into the halves and the length of the input you should be able to jump around without scanning or tracking all of those slices though I haven't been able to dedicate much time to thinking this through in my head.
With a problem of this nature if you really need to push the limits you should consider using C/C++ so you can be as efficient as possible with RAM usage and because you're doing an insane number of tiny things which doesn't map well to python performance.

Justification of constants used in random.sample

I'm looking into the source code for the function sample in random.py (python standard library).
The idea is simple:
If a small sample (k) is needed from a large population (n): Just pick k random indices, since it is unlikely you'll pick the same number twice as the population is so large. And if you do, just pick again.
If a relatively large sample (k) is needed, compared to the total population (n): It is better to keep track of what you have picked.
My Question
There are a few constants involved, setsize = 21 and setsize += 4 ** _log(3*k,4). The critical ratio is roughly k : 21+3k. The comment says # size of a small set minus size of an empty list and # table size for big sets.
Where have these specific numbers come from? What is there justification?
The comments shed some light, however I find they bring as many questions as they answer.
I would kind of understand, size of a small set but find the "minus size of an empty list" confusing. Can someone shed any light on this?
what is meant specifically by "table" size, as apposed to say "set size".
Looking on the github repository, it looks like a very old version simply used the ratio k : 6*k, as the critical ratio, but I find that equally mysterious.
The code
def sample(self, population, k):
"""Chooses k unique random elements from a population sequence or set.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
To choose a sample in a range of integers, use range as an argument.
This is especially fast and space efficient for sampling from a
large population: sample(range(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
if isinstance(population, _Set):
population = tuple(population)
if not isinstance(population, _Sequence):
raise TypeError("Population must be a sequence or set. For dicts, use list(d).")
randbelow = self._randbelow
n = len(population)
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
result = [None] * k
setsize = 21 # size of a small set minus size of an empty list
if k > 5:
setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
if n <= setsize:
# An n-length list is smaller than a k-length set
pool = list(population)
for i in range(k): # invariant: non-selected at [0,n-i)
j = randbelow(n-i)
result[i] = pool[j]
pool[j] = pool[n-i-1] # move non-selected item into vacancy
else:
selected = set()
selected_add = selected.add
for i in range(k):
j = randbelow(n)
while j in selected:
j = randbelow(n)
selected_add(j)
result[i] = population[j]
return result
(I apologise is this question would be better placed in math.stackexchange. I couldn't think of any probability/statistics-y reasons for this particular ratio, and the comments sounded as though, it was maybe something to do with the amount of space that sets and lists use - but could't find any details anywhere).
This code is attempting to determine whether using a list or a set would take more space (instead of trying to estimate the time cost, for some reason).
It looks like 21 was the difference between the size of an empty list and a small set on the Python build this constant was determined on, expressed in multiples of the size of a pointer. I don't have a build of that version of Python, but testing on my 64-bit CPython 3.6.3 gives a difference of 20 pointer sizes:
>>> sys.getsizeof(set()) - sys.getsizeof([])
160
and comparing the 3.6.3 list and set struct definitions to the list and set definitions from the change that introduced this code, 21 seems plausible.
I said "the difference between the size of an empty list and a small set" because both now and at the time, small sets used a hash table contained inside the set struct itself instead of externally allocated:
setentry smalltable[PySet_MINSIZE];
The
if k > 5:
setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
check adds the size of the external table allocated for sets larger than 5 elements, with size again expressed in number of pointers. This computation assumes the set never shrinks, since the sampling algorithm never removes elements. I am not currently sure whether this computation is exact.
Finally,
if n <= setsize:
compares the base overhead of a set plus any space used by an external hash table to the n pointers required by a list of the input elements. (It doesn't seem to account for the overallocation performed by list(population), so it may be underestimating the cost of the list.)

Is there any efficient way to increment the corresponding set positions of an integer in an integer array?

Any solution consuming less than O(Bit Length) time is welcome. I need to process around 100 million large integers.
answer = [0 for i in xrange(100)]
def pluginBits(val):
global answer
for j in xrange(len(answer)):
if val <= 0:
break
answer[j] += (val & 1)
val >>= 1
A speedier way to do this would be to use '{:b}'.format(someval) to convert from integer to a string of '1's and '0's. Python still needs to do similar work to perform this conversion, but doing it at the C layer in the interpreter internals involves significantly less overhead for larger values.
For conversion to actual list of integer 1s and 0s, you could do something like:
# Done once at top level to make translation table:
import string
bitstr_to_intval = string.maketrans(b'01', b'\x00\x01')
# Done for each value to convert:
bits = bytearray('{:b}'.format(origint).translate(bitstr_to_intval))
Since bytearray is a mutable sequence of values in range(256) that iterates the actual int values, you don't need to convert to list; it should be usable in 99% of the places the list would be used, using less memory and running faster.
This does generate the values in the reverse of the order your code produces (that is, bits[-1] here is the same as your answer[0], bits[-2] is your answer[1], etc.), and it's unpadded, but since you're summing bits, the padding isn't needed, and reversing the result is a trivial reversing slice (add [::-1] to the end). Summing the bits from each input can be made much faster by making answer a numpy array (that allows a bulk element-wise addition at the C layer), and putting it all together gets:
import string
bitstr_to_intval = string.maketrans(b'01', b'\x00\x01')
answer = numpy.zeros(100, numpy.uint64)
def pluginBits(val):
bits = bytearray('{:b}'.format(val).translate(bitstr_to_intval))[::-1]
answer[:len(bits)] += bits
In local tests, this definition of pluginBits takes a little under one-seventh the time to sum the bits at each position for 10,000 random input integers of 100 bits each, and gets the same results.

Sample integers except the ones in a specified set

Suppose I have a set of integers exceptions = set([3, 2, 6, ...]). What is the most efficient way to draw num samples of integers from the interval [0,n) uniformly at random except the ones that appear in exceptions using python.
Here are some ideas I've played with but are not satisfactory.
# Create a set of the integers I am interested in
integers = set(range(n)) - exceptions
# Draw samples from the set
samples = np.random.choice(integers, num)
In my case n is rather large (on the order of 10^12) such that creating the set seems like a waste of memory.
However, len(exceptions) is comparatively small (on the order of 10^6) such that rejection sampling might be a reasonable approach.
samples = []
# Iterate until we have enough samples
while len(samples) < num:
# Draw a sample
proposal = np.random.randint(n)
# Accept or reject
if proposal not in exceptions:
samples.append(proposal)
Unfortunately, all the looping and set membership testing is rather slow. Obviously, I could write a C-extension or use Cython but I wanted to see whether you guys have a better idea.
Edited to include suggestions in the comments.
I believe that in case of exceptions being small compared to n, rejection might be your best shot; note that with n=10**12 and e=10**6 probability of getting invalid result is only one in a million, so you will be good without rejecting and re-randoming most of the time. If you need multiple samples anyway, it's probably a good idea to get much random numbers (like 101%) and then filter them all in bulk, discarding exceptions (I believe there is a way in numpy to get multiple samples faster that in a loop). If there are still too many, simply truncate the result; if there are too little, generate more.
As the size of exceptions is getting bigger, you might try a different approach:
Assume you have a range of n integers and e exceptions. In fact, then, you are interested in getting one of n-e uniformly distributed results. So having x = np.random.randint(n-e) should be good enough, provided you can transform such a result into a meaningful one fast. First, an example to illustrate the idea:
Let's assume n=100, set of exceptions = {11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 26, 27}
case 1: your random result was 60. You have to count how many exceptions smaller than or equal to 60 are there: there are 13. So the result is 60+13=73.
case 2: your random result was 22. There are 10 exceptions smaller than that, so you increase the result by 10, getting 32. There are however two more exceptions that are not greater than 32, but greater than 22, and therefore not yet counted. You have to further increase the value by 2, final result being 34.
The algorithm, then, would be as follows:
n = ... # full range
exceptions = ... # collection of your exceptions
e = len(exceptions)
def get_sample():
x = np.random.randint(n-e)
skipped = 0
small_exceptions = count_small_exceptions(x, exceptions)
while skipped < small_exceptions:
skipped = small_exceptions
small_exceptions = count_small_exceptions(x+skipped, exceptions)
return x+skipped
In order to implement that, you need a fast way to count exceptions smaller than a given value - module bisect should do the trick, as it provides binary-search mechanisms on a sorted array.
Why does this work? The loop stops when skipped == small_exceptions, so there are exactly skipped exceptions in the range 0..(skipped+x). If the returned value (skipped+x) was an exception, the loop would have stopped earlier, because there would have been less exceptions in the previous iteration.
The exact cost analysis is not that easy, but if you assume that you have one exception for every k regular values, then increases in subsequent iterations are k times smaller each time. Worst-case scenario is when exceptions=range(e) and random x comes out as 0. Then there would be e iterations, with small increase in every one of them. If e is small compared to n this shouldn't be a problem; otherwise, you might further improve this solution by storing exceptions as intervals instead, effectively merging all subsequent exceptions into only one 'hole' and one operation.

python memory usage dict and variables large dataset

So, I'm making a game in Python 3.4. In the game I need to keep track of a map. It is a map of joined rooms, starting at (0,0) and continuing in every direction, generated in a filtered-random way(only correct matches for the next position are used for a random list select).
I have several types of rooms, which have a name, and a list of doors:
RoomType = namedtuple('Room','Type,EntranceLst')
typeA = RoomType("A",["Bottom"])
...
For the map at the moment I keep a dict of positions and the type of room:
currentRoomType = typeA
currentRoomPos = (0,0)
navMap = {currentRoomPos: currentRoomType}
I have loop that generates 9.000.000 rooms, to test the memory usage.
I get around 600 and 800Mb when I run it.
I was wondering if there is a way to optimize that.
I tried with instead of doing
navMap = {currentRoomPos: currentRoomType}
I would do
navMap = {currentRoomPos: "A"}
but this doesn't have a real change in usage.
Now I was wondering if I could - and should - keep a list of all the types, and for every type keep the positions on which it occurs. I do not know however if it will make a difference with the way python manages its variables.
This is pretty much a thought-experiment, but if anything useful comes from it I will probably implement it.
You can use sys.getsizeof(object) to get the size of a Python object. However, you have to be careful when calling sys.getsizeof on containers: it only gives the size of the container, not the content -- see this recipe for an explanation of how to get the total size of a container, including contents. In this case, we don't need to go quite so deep: we can just manually add up the size of the container and the size of its contents.
The sizes of the types in question are:
# room type size
>>> sys.getsizeof(RoomType("A",["Bottom"])) + sys.getsizeof("A") + sys.getsizeof(["Bottom"]) + sys.getsizeof("Bottom")
233
# position size
>>> sys.getsizeof((0,0)) + 2*sys.getsizeof(0)
120
# One character size
>>> sys.getsizeof("A")
38
Let's look at the different options, assuming you have N rooms:
Dictionary from position -> room_type. This involves keeping N*(size(position) + size(room_type)) = 353 N bytes in memory.
Dictionary from position -> 1-character string. This involves keeping N*158 bytes in memory.
Dictionary from type -> set of positions. This involves keeping N*120 bytes plus a tiny overhead with storing dictionary keys.
In terms of memory usage, the third option is clearly better. However, as is often the case, you have a CPU memory tradeoff. It's worth thinking briefly about the computational complexity of the queries you are likely to do. To find the type of a room given its position, with each of the three choices above you have to:
Look up the position in a dictionary. This is an O(1) lookup, so you'll always have the same run time (approximately), independent of the number of rooms (for a large number of rooms).
Same
Look at each type, and for each type, ask if that position is in the set of positions for that type. This is an O(ntypes) lookup, that is, the time it takes is proportional to the number of types that you have. Note that, if you had gone for list instead of a set to store the rooms of a given type, this would grow to O(nrooms * ntypes), which would kill your performance.
As always, when optimising, it is important to consider the effect of an optimisation on both memory usage and CPU time. The two are often at odds.
As an alternative, you could consider keeping the types in a 2-dimensional numpy array of characters, if your map is sufficiently rectangular. I believe this would be far more efficient. Each character in a numpy array is a single byte, so the memory usage would be much less, and the CPU time would still be O(1) lookup from room position to type:
# Generate random 20 x 10 rectangular map
>>> map = np.repeat('a', 100).reshape(20, 10)
>>> map.nbytes
200 # ie. 1 byte per character.
Some additionally small scale optimisations:
Encode the room type as an int rather than a string. Ints have size 24 bytes, while one-character strings have size 38.
Encode the position as a single integer, rather than a tuple. For instance:
# Random position
xpos = 5
ypos = 92
# Encode the position as a single int, using high-order bits for x and low-order bits for y
pos = 5*1000 + ypos
# Recover the x and y values of the position.
xpos = pos / 1000
ypos = pos % 1000
Note that this kills readability, so it's only worth doing if you want to squeeze the last bits of performance. In practice, you might want to use a power of 2, rather than a power of 10, as your delimiter (but a power of 10 helps with debugging and readability). Note that this brings your number of bytes per position from 120 to 24. If you do go down this route, consider defining a Position class using __slots__ to tell Python how to allocate memory, and add xpos and ypos properties to the class. You don't want to litter your code with pos / 1000 and pos % 1000 statements.

Categories