Random number generation with Python, ArcGIS 10.1 - python

I have a shapefile with 1,000+ cases and three fields (DOUBLE) ran1, ran2 and ran3, which I have set up to receive the product of separate random number generation operations.
Unfortunately, the Random Number Generator (Environment setting) documentation and the Python parser do not seem to be appropriate for this sort of thing.
getRandomValue()
import numpy.random as R
def getRandomValue(fieldName1):
    return R.random()
Any ideas are welcome.

I'm not sure why you deem the code you posted inappropriate.
For me the code below works great, and to get random values written into the fields you would just wrap it in an UpdateCursor (a sketch of that follows the snippet).
import numpy.random as R
def getRandomValue(fieldName1):
    # fieldName1 is unused here; it just mirrors the Field Calculator signature
    return R.random()
print getRandomValue("ran1")
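A minimal sketch of that UpdateCursor wrapping, using the ran1/ran2/ran3 fields from the question; the shapefile path is hypothetical and needs to be replaced with your own:
import arcpy
import numpy.random as R

fc = r"C:\data\my_shapefile.shp"  # hypothetical path; point this at your shapefile

# Write an independent random value in [0, 1) into each of the three DOUBLE fields
with arcpy.da.UpdateCursor(fc, ["ran1", "ran2", "ran3"]) as cursor:
    for row in cursor:
        cursor.updateRow([R.random(), R.random(), R.random()])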
If the range of random numbers is not suitable then this StackOverflow Question has a good Answer.
Please note that the GIS Stack Exchange might have been a good alternative location to post this Question because it uses ArcPy from ArcGIS.

Related

Converting MATLAB random function to python

My task is to convert one big MATLAB file into Python.
There is a line in MATLAB
weightsEI_slow = random('binom',1,0.2,[EneuronNum_slow,IneuronNum_slow]);
I am trying to convert this into Python code, but I can't quite find the right documentation. I looked in the numpy library too. Does anyone have any suggestions?
It looks like you are generating random numbers that follow the binomial distribution with probability p=0.2 and sample size n=1. You can do the same with numpy:
import numpy as np
np.random.binomial(n=1, p=0.2)
>0
If you require replicability, add np.random.seed(3408) before the number is sampled. Otherwise, the output might be 0 or 1 depending on the execution. Of course, you can switch in another integer value as the seed instead of 3408.
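If you also need the matrix shape produced by the MATLAB call, numpy's size argument covers it. A short sketch with placeholder dimensions standing in for EneuronNum_slow and IneuronNum_slow:
import numpy as np

np.random.seed(3408)  # optional, for reproducibility

EneuronNum_slow, IneuronNum_slow = 4, 3  # placeholder sizes, just for illustration
# One Bernoulli(0.2) draw per matrix entry, like random('binom',1,0.2,[...]) in MATLAB
weightsEI_slow = np.random.binomial(n=1, p=0.2,
                                    size=(EneuronNum_slow, IneuronNum_slow))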

What simple filter could I use to de-noise my data?

I'm processing some experimental data in Python 3. The data (raw_data in my code) is pretty noisy:
One of my goals is to find the peaks, and for this I'd like to filter out the noise. Based on what I found in the documentation of SciPy's signal module, the theory of filtering seems to be really complicated, and unfortunately I have zero background in it. Of course I'll have to learn it sooner or later - and I intend to - but right now the payoff isn't worth the time (and learning filter theory isn't the purpose of my work), so I shamefully copied the code from Lyken Syu's answer without really understanding the background:
import numpy as np
from scipy import signal as sg
from matplotlib import pyplot as plt
# [...] code, resulting in this:
raw_data = [arr_of_xvalues, arr_of_yvalues] # xvalues are in decreasing order
# <magic beyond my understanding>
n = 20 # the larger n is, the smoother the curve will be
b = [1.0 / n] * n
a = 2
filt = sg.lfilter(b, a, raw_data)
filtered = sg.lfilter(b, a, filt)
# <\magic>
plt.plot(filtered[0], filtered[1], ".")
plt.show()
It kind of works:
What concerns me is the curve the filter adds from 0 to the beginning of my dataset. I guess it's a property of the IIR filter I used, but I don't know how to prevent it. Also, I haven't been able to make other filters work so far. I need to use this code on other, similar experimental results, so I need a somewhat more general solution than e.g. cutting out all points with y < 10.
Is there a better (possibly simpler) way, or choice of filter that is easy to implement without serious theoretical background?
How, if at all, could I prevent my filter from adding that curve to my data?
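For reference, one common way around that one-sided startup ramp is a zero-phase filter such as scipy.signal.filtfilt, which runs the same moving average forward and then backward. A minimal sketch on made-up data, applied only to the y values:
import numpy as np
from scipy import signal as sg

# Made-up noisy signal standing in for the experimental y values
y = np.sin(np.linspace(0, 6, 500)) + 0.2 * np.random.randn(500)

n = 20
b = [1.0 / n] * n

# filtfilt applies the moving average forward and backward, so it has zero
# phase shift and no startup transient at the beginning of the data
smoothed = sg.filtfilt(b, [1.0], y)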

Full-range random number in Python

I'm generating a series of random floats using this line:
random.random()*(maxval-minval) + minval
I'm using it to add variable noise to a given variable, and the amount of noise added depends on a series of factors. In some cases, the noise should be so high that in practice the original value is lost, and I have a completely random value.
In this context, the code works with finite values, but if I use "inf" it returns NaN. Is there a workaround to allow a continuous random range that might include infinity? I don't want to tamper with os.random() as it is machine-specific.
If you define a uniform random distribution over an infinite domain, the probability of any value in the domain being chosen is infinitesimal. What you're asking for doesn't make any mathematical sense.
As was said before, you can't have a uniform distribution over the whole real line, but you can use other distributions whose support is the whole real line. Consider the Cauchy distribution. It has 'heavy tails', which simply means that there is a decent probability of getting very big numbers.
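A minimal sketch of drawing Cauchy samples with numpy:
import numpy as np

# Standard Cauchy draws: most values are moderate, but very large magnitudes
# turn up with non-negligible probability because of the heavy tails
samples = np.random.standard_cauchy(5)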
After the discussion in the comments, I suggest the following:
>>> import sys
>>> import numpy as np
>>> m = sys.maxint
>>> np.random.uniform(-m, m, 5)
array([ -5.32362215e+18,  -2.90131323e+18,   5.14492175e+18,
        -5.64238742e+18,  -3.49640768e+18])
As said, you can get the largest integer with sys.maxint and then use np.random.uniform to draw random numbers between -maxint and maxint.
As @Asad says, what you are trying to do is mathematically not quite sound. But what you could do is the following:
define a very big number (maybe this post helps: What is the range of values a float can have in Python?)
use random.uniform(0, biggestValue) as an approximation for random values according to your needs.
Maybe this is what you are looking for.
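A sketch of that idea, in one symmetric variant: the "very big number" is the largest finite float, halved so that the width of the range itself stays finite and random.uniform does not overflow:
import random
import sys

# Largest finite float, halved so that (biggest - (-biggest)) does not overflow to inf
biggest = sys.float_info.max / 2
value = random.uniform(-biggest, biggest)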

very slow function with two for loops using Arcpy in python

I wrote code that works perfectly with small datasets, but when I run it over a dataset with 52,000 features, it seems to get stuck in the function below:
def extract_neighboring_OSM_nodes(ref_nodes, cor_nodes):
    time_start = time.time()
    print "here we start finding neighbors at ", time_start
    for ref_node in ref_nodes:
        buffered_node = ref_node[2].buffer(10)
        for cor_node in cor_nodes:
            if cor_node[2].within(buffered_node):
                ref_node[4].append(cor_node[0])
                cor_node[4].append(ref_node[0])
        # node[4][:] = [cor_nodes.index(x) for x in cor_nodes if x[2].within(buffered_node)]
    time_end = time.time()
    print "neighbor extraction took ", time_end
    return ref_nodes
ref_nodes and cor_nodes are lists of tuples as follows:
[(FID, point, geometry, links, neighbors)]
neighbors is an empty list which is going to be populated in the above function.
As I said, the last message printed out is from the first print command in this function. It seems that this function is very slow, but even for 52,000 features it should not take 24 hours, should it?
Any idea where the problem could be, or how to make the function faster?
You can try multiprocessing; here is an example: http://pythongisandstuff.wordpress.com/2013/07/31/using-arcpy-with-multiprocessing-%E2%80%93-part-3/.
If you want to get the K nearest neighbors of every (or some, it doesn't matter) sample of a dataset, or the eps-neighborhood of samples, there is no need to implement it yourself. There are libraries out there specifically for this purpose.
Once they have built the data structure (usually some kind of tree), you can query it for the neighborhood of a given sample. For high-dimensional data these structures are usually not as effective as they are in low dimensions, but there are solutions for high-dimensional data as well.
One I can recommend here is the KDTree, which has a SciPy implementation (a sketch follows below).
I hope you find it useful as I did.
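A minimal sketch of that approach with SciPy's cKDTree, using made-up coordinate arrays in place of the node geometries and the same 10-unit radius as the buffer in the question:
import numpy as np
from scipy.spatial import cKDTree

# Made-up 2D coordinates standing in for the ref and cor node geometries
ref_xy = np.random.rand(52000, 2) * 1000.0
cor_xy = np.random.rand(52000, 2) * 1000.0

tree = cKDTree(cor_xy)
# For each reference point, the indices of cor points within 10 units,
# replacing the O(len(ref) * len(cor)) nested loop with an indexed lookup
neighbors = tree.query_ball_point(ref_xy, r=10)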

Python's implementation of Mutual Information

I am having some issues with the mutual information function that Python's machine learning libraries provide, in particular:
sklearn.metrics.mutual_info_score(labels_true, labels_pred, contingency=None)
(http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html)
I am trying to reproduce the example I found on the Stanford NLP tutorial site:
The site is found here : http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html#mifeatsel2
The problem is I keep getting different results, without figuring out the reason yet.
I understand the concept of mutual information and feature selection; I just don't understand how it is implemented in Python. What I do is provide the mutual_info_score method with two arrays based on the NLP site example, but it outputs different results. The other interesting fact is that however you play around and change the numbers in those arrays, you are most likely to get the same result. Am I supposed to use another data structure specific to Python, or what is the issue behind this? If anyone has used this function successfully in the past it would be a great help to me; thank you for your time.
I encountered the same issue today. After a few trials I found the real reason: you take log2 if you strictly follow the NLP tutorial, but sklearn.metrics.mutual_info_score uses the natural logarithm (base e, Euler's number). I couldn't find this detail in the sklearn documentation...
I verified this by:
import numpy as np

def computeMI(x, y):
    sum_mi = 0.0
    x_value_list = np.unique(x)
    y_value_list = np.unique(y)
    Px = np.array([len(x[x == xval]) / float(len(x)) for xval in x_value_list])  # P(x)
    Py = np.array([len(y[y == yval]) / float(len(y)) for yval in y_value_list])  # P(y)
    for i in xrange(len(x_value_list)):
        if Px[i] == 0.:
            continue
        sy = y[x == x_value_list[i]]
        if len(sy) == 0:
            continue
        pxy = np.array([len(sy[sy == yval]) / float(len(y)) for yval in y_value_list])  # P(x,y)
        t = pxy[Py > 0.] / Py[Py > 0.] / Px[i]  # P(x,y) / (P(x)*P(y))
        sum_mi += sum(pxy[t > 0] * np.log2(t[t > 0]))  # sum( P(x,y) * log(P(x,y) / (P(x)*P(y))) )
    return sum_mi
If you change np.log2 to np.log, I think it would give you the same answer as sklearn. The only difference is that when this method returns 0, sklearn will return a number very close to 0. (And of course, use sklearn if you don't care about the log base; my piece of code is just a demo and performs poorly...)
FYI: 1) sklearn.metrics.mutual_info_score accepts lists as well as np.array; 2) sklearn.metrics.cluster.entropy also uses log, not log2.
Edit: as for "same result", I'm not sure what you really mean. In general, the values in the vectors don't really matter; it is the "distribution" of values that matters. You care about P(X=x), P(Y=y) and P(X=x, Y=y), not the values x and y themselves.
The code below should produce this result: 0.00011053558610110256
c=np.concatenate([np.ones(49), np.zeros(27652), np.ones(141), np.zeros(774106) ])
t=np.concatenate([np.ones(49), np.ones(27652), np.zeros(141), np.zeros(774106)])
computeMI(c,t)
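As a quick sanity check of the log-base point, one can compare sklearn's value in nats with the same value converted to bits; a minimal sketch with made-up labels, just for illustration:
import numpy as np
from sklearn.metrics import mutual_info_score

x = [0, 0, 1, 1, 0, 1]
y = [0, 1, 1, 1, 0, 1]

mi_nats = mutual_info_score(x, y)  # sklearn uses the natural logarithm
mi_bits = mi_nats / np.log(2)      # divide by ln(2) to get the NLP book's base-2 value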
