I am trying to generate all 16^16,
but there are a few problems. Mainly memory.
I tried to generate them in python like this:
for y in range (0, 16**16):
print '0x%0*X' % (16,y)
This gives me:
OverflowError: range() result has too many items
If I use sys.maxint I get a MemoryError.
To be more precise, I want to generate all combinations of HEX in length of 16, i.e:
0000000000000000
0000000000000001
0000000000000002
...
FFFFFFFFFFFFFFFF
Also, how do I calculate the approximate time it will take me to generate them?
I am open to the use of any programming language as long as I can save them to an output file.
Well... 16^16 = 1.8446744e+19, so lets say you could calculate 10 values per nanosecond (that's a 10GHz rate btw). Then it would take you 16^16 / 10 nanoseconds to compute them all, or 58.4 years. Also, if you could somehow compress each value into 1-bit (which is impossible), it would require 2 exabytes of memory to contain those values (16^16/8/2^60).
This seems like a very artificial exercise. Is it homework, or is there a reason for generating this list? It will be very long (see other answers)!
Having said that, you should ask yourself: why is this happening? The answer is that in Python 2.x, range produces an actual list. If you want to avoid that, you can:
Use Python 3.x, in which range does not actually make a list, but a special generator-like object.
Use xrange, which also doesn't actually make a list, but again produces an object.
As for timing, all of the time will be in writing to the file or screen. You can get an estimate by making a somewhat smaller list and then doing some math, but you have to be careful that it's big enough that the time is dominated by writing the lines, and not opening and closing the file.
But you should also ask yourself how big the resultant file will be... You may not like what you find. Perhaps you mean 2^16?
Related
Hello friends!
Summarization:
I got a ee.FeatureCollection containing around 8500 ee.Point-objects. I would like to calculate the distance of these points to a given coordinate, lets say (0.0, 0.0).
For this i use the function geopy.distance.distance() (ref: https://geopy.readthedocs.io/en/latest/#module-geopy.distance). As input the the function takes 2 coordinates in the form of 2 tuples containing 2 floats.
Problem: When i am trying to convert the coordinates in form of an ee.List to float, i always use the getinfo() function. I know this is a callback and it is very time intensive but i don't know another way to extract them. Long story short: To extract the data as ee.Number it takes less than a second, if i want them as float it takes more than an hour. Is there any trick to fix this?
Code:
fc_containing_points = ee.FeatureCollection('projects/ee-philadamhiwi/assets/Flensburg_100') #ee.FeatureCollection
list_containing_points = fc_containing_points.toList(fc_containing_points.size()) #ee.List
fc_containing_points_length = fc_containing_points.size() #ee.Number
for index in range(fc_containing_points_length.getInfo()): #i need to convert ee.Number to int
point_tmp = list_containing_points.get(i) #ee.ComputedObject
point = ee.Feature(point_tmp) #transform ee.ComputedObject to ee.Feature
coords = point.geometry().coordinates() #ee.List containing 2 ee.Numbers
#when i run the loop with this function without the next part
#i got all the data i want as ee.Number in under 1 sec
coords_as_tuple_of_ints = (coords.getInfo()[1],coords.getInfo()[0]) #tuple containing 2 floats
#when i add this part to the function it takes hours
PS: This is my first question, pls be patient with me.
I would use .map instead of your looping. This stays server side until you export the table (or possibly do a .getInfo on the whole thing)
fc_containing_points = ee.FeatureCollection('projects/eephiladamhiwi/assets/Flensburg_100')
fc_containing_points.map(lambda feature: feature.set("distance_to_point", feature.distance(ee.Feature(ee.Geometry.Point([0.0,0.0])))
# Then export using ee.batch.Export.Table.toXXX or call getInfo
(An alternative might be to useee.Image.paint to convert the target point to an image then, use ee.Image.distance to calculate the distance to the point (as an image), then use reduceRegions over the feature collection with all points but 1) you can only calculate distance to a certain distance and 2) I don't think it would be any faster.)
To comment on your code, you are probably aware loops (especially client side loops) are frowned upon in GEE (primarily for the performance reasons you've run into) but also note that any time you call .getInfo on a server side object it incurs a performance cost. So this line
coords_as_tuple_of_ints = (coords.getInfo()[1],coords.getInfo()[0])
Would take roughly double the time as this
coords_client = coords.getInfo()
coords_as_tuple_of_ints = (coords_client[1],coords_client[0])
Finally, you could always just export your entire feature collection to a shapefile (using ee.batch.Export.Table.... as above) and do all the operations using geopy locally.
I have a list with around 1500 number (1 to 1500) and I want to get all the possible Permutation out of it to do some calculations and choose the smallest number out of all of it.
Problem is that the number of possibilites as you can figure is wayyy wayyy too big and my computer just freeze whiel running the code so I have to make a forced restart. Also my RAM is 8GB so it should be big enough (?) or so I hope.
To limit it I can specify a start point but that won't reduce it much.
Also it's a super important thing but I feel so lost. what do you think should I do to make it run ?
Use of generators and itertools saves memory a lot. Generators are just like lists with an exception that new elements generated one by one rather than stored in memory. Even if your problem has a better solution, mastering the generator will help you to save memory in future.
Note in Python 3 map is already a generator (while for Python 2 use imap instead of map)
import itertools
array = [ 1, 2, ... , 1500] # or other numbers
f = sum # or whatever function you have
def fmax_on_perms(array, f):
perms = itertools.permutations(array) # generator of all the
permutations rather than static list
fvalues = map(my_function, perms) # generator of values of your function on different permutations
return max(fvalues)
Again I have a question concerning large loops.
Suppose I have a function
limits
def limits(a,b):
*evaluate integral with upper and lower limits a and b*
return float result
A and B are simple np.arrays that store my values a and b. Now I want to calculate the integral 300'000^2/2 times because A and B are of the length of 300'000 each and the integral is symmetrical.
In Python I tried several ways like itertools.combinations_with_replacement to create the combinations of A and B and then put them into the integral but that takes huge amount of time and the memory is totally overloaded.
Is there any way, for example transferring the loop in another language, to speed this up?
I would like to run the loop
for i in range(len(A)):
for j in range(len(B)):
np.histogram(limits(A[i],B[j]))
I think histrogramming the return of limits is desirable in order not to store additional arrays that grow squarely.
From what I read python is not really the best choice for this iterative ansatzes.
So would it be reasonable to evaluate this loop in another language within Python, if yes, How to do it. I know there are ways to transfer code, but I have never done it so far.
Thanks for your help.
If you're worried about memory footprint, all you need to do is bin the results as you go in the for loop.
num_bins = 100
bin_upper_limits = np.linspace(-456, 456, num=num_bins-1)
# (last bin has no upper limit, it goes from 456 to infinity)
bin_count = np.zeros(num_bins)
for a in A:
for b in B:
if b<a:
# you said the integral is symmetric, so we can skip these, right?
continue
new_result = limits(a,b)
which_bin = np.digitize([new_result], bin_upper_limits)
bin_count[which_bin] += 1
So nothing large is saved in memory.
As for speed, I imagine that the overwhelming majority of time is spent evaluating limits(a,b). The looping and binning is plenty fast in this case, even in python. To convince yourself of this, try replacing the line new_result = limits(a,b) with new_result = 234. You'll find that the loop runs very fast. (A few minutes on my computer, much much less than the 4 hour figure you quote.) Python does not loop very fast compared to C, but it doesn't matter in this case.
Whatever you do to speed up the limits() call (including implementing it in another language) will speed up the program.
If you change the algorithm, there is vast room for improvement. Let's take an example of what it seems you're doing. Let's say A and B are 0,1,2,3. You're integrating a function over the ranges 0-->0, 0-->1, 1-->1, 1-->2, 0-->2, etc. etc. You're re-doing the same work over and over. If you have integrated 0-->1 and 1-->2, then you can add up those two results to get the integral 0-->2. You don't have to use a fancy integration algorithm, you just have to add two numbers you already know.
Therefore it seems to me that you can compute integrals in all the smallest ranges (0-->1, 1-->2, 2-->3), store the results in an array, and add subsets of the results to get the integral over whatever range you want. If you want this program to run in a few minutes instead of 4 hours, I suggest thinking through an alternative algorithm along those lines.
(Sorry if I'm misunderstanding the problem you're trying to solve.)
I have a project where I am reading in ASCII values from a microcontroller through a serial port (looks like this : AA FF BA 11 43 CF etc)
The input is coming in quickly (38 two character sets / second).
I'm taking this input and appending it to a running list of all measurements.
After about 5 hours, my list has grown to ~ 855000 entries.
I'm given to understand that the larger a list becomes, the slower list operations become. My intent is to have this test run for 24 hours, which should yield around 3M results.
Is there a more efficient, faster way to append to a list then list.append()?
Thanks Everyone.
I'm given to understand that the larger a list becomes, the slower list operations become.
That's not true in general. Lists in Python are, despite the name, not linked lists but arrays. There are operations that are O(n) on arrays (copying and searching, for instance), but you don't seem to use any of these. As a rule of thumb: If it's widely used and idiomatic, some smart people went and chose a smart way to do it. list.append is a widely-used builtin (and the underlying C function is also used in other places, e.g. list comprehensions). If there was a faster way, it would already be in use.
As you will see when you inspect the source code, lists are overallocating, i.e. when they are resized, they allocate more than needed for one item so the next n items can be appended without need to another resize (which is O(n)). The growth isn't constant, it is proportional with the list size, so resizing becomes rarer as the list grows larger. Here's the snippet from listobject.c:list_resize that determines the overallocation:
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
As Mark Ransom points out, older Python versions (<2.7, 3.0) have a bug that make the GC sabotage this. If you have such a Python version, you may want to disable the gc. If you can't because you generate too much garbage (that slips refcounting), you're out of luck though.
One thing you might want to consider is writing your data to a file as it's collected. I don't know (or really care) if it will affect performance, but it will help ensure that you don't lose all your data if power blips. Once you've got all the data, you can suck it out of the file and jam it in a list or an array or a numpy matrix or whatever for processing.
Appending to a python list has a constant cost. It is not affected by the number of items in the list (in theory). In practice appending to a list will get slower once you run out of memory and the system starts swapping.
http://wiki.python.org/moin/TimeComplexity
It would be helpful to understand why you actually append things into a list. What are you planning to do with the items. If you don't need all of them you could build a ring buffer, if you don't need to do computation you could write the list to a file, etc.
First of all, 38 two-character sets per second, 1 stop bit, 8 data bits, and no parity, is only 760 baud, not fast at all.
But anyway, my suggestion, if you're worried about having overly large lists/don't want to use one huge list, is just to store store a list on disk once it reaches a certain size and start a new list, repeating until you've gotten all the data, then combining all the lists into one once you're done receiving the data.
Though you may skip the sublists completely and just go with nmichaels' suggestion, writing the data to a file as you get it and using a small circular buffer to hold the received data that has not yet been written.
It might be faster to use numpy if you know how long the array is going to be and you can convert your hex codes to ints:
import numpy
a = numpy.zeros(3000000, numpy.int32)
for i in range(3000000):
a[i] = int(scanHexFromSerial(),16)
This will leave you with an array of integers (which you could convert back to hex with hex()), but depending on your application maybe that will work just as well for you.
I have the following code:
import random
rand1 = random.Random()
rand2 = random.Random()
rand1.seed(0)
rand2.seed(0)
rand1.jumpahead(1)
rand2.jumpahead(2)
x = [rand1.random() for _ in range(0,5)]
y = [rand2.random() for _ in range(0,5)]
According to the documentation of jumpahead() function I expected x and y to be (pseudo)independent sequences. But the output that I get is:
x: [0.038378463064751012, 0.79353887395667977, 0.13619161852307016, 0.82978789012683285, 0.44296031215986331]
y: [0.98374801970498793, 0.79353887395667977, 0.13619161852307016, 0.82978789012683285, 0.44296031215986331]
If you notice, the 2nd-5th numbers are same. This happens each time I run the code.
Am I missing something here?
rand1.seed(0)
rand2.seed(0)
You initialize them with the same values so you get the same (non-)randomness. Use some value like the current unix timestamp to seed it and you will get better values. But note that if you initialize two RNGs at the same time with the current time though, you will get the same "random" values from them of course.
Update: Just noticed the jumpahead() stuff: Have a look at How should I use random.jumpahead in Python - it seems to answer your question.
I think there is a bug, python's documentation does not make this as clear as it should.
The difference between your two parameters to jumpahead is 1, this means you are only guaranteed to get 1 unique value (which is what happens). if you want more values, you need larger parameters.
EDIT: Further Explanation
Originally, as the name suggests, jumpahead merely jumped ahead in the sequence. Its clear to see in that case where jumping 1 or 2 places ahead in the sequence would not produce independent results. As it turns out, jumping ahead in most random number generators is inefficient. For that reason, python only approximates jumping ahead. Because its only approximate, python can implement a more effecient algorithm. However, the method is "pretending" to jump ahead, passing two similiar integers will not result in a very different sequence.
To get different sequences you need the integers passed in to be far apart. In particular, if you want to read a million random integers, you need to seperate your jumpaheads by a million.
As a final note, if you have two random number generators, you only need to jumpahead on one of them. You can (and should) leave the other in its original state.