How to efficiently convert a list into probability distribution? - python

I am trying to convert a list into probability distribution.
x = [2, 4]
I want the following array, in that order:
probability_array = [1-(2+4)/10, 2/10, 4/10]
So I did the following...
y = 1 - (2 + 4)/10
new_x = [2/10, 4/10]
probability_array = [y] + new_x
The problem is I'm working with 10,000 data sets like x. Is there a faster way to do this?

I think you can do this easily with numpy. Here is a small example to check correctness:
import numpy as np
x = [[1, 2], [3, 4]]
x = np.array(x)
# divide each row by its own sum
sum1 = np.sum(x, axis=1).reshape(2, 1)
prob = x / sum1
I think it would be pretty fast even if the size of x is more than 10000. Let's take 100 features for 10000 examples:
import time
x = np.random.randint(1, 100, size=1000000)
print(x.shape)
start = time.time()
x = x.reshape(-1, 100)  # 10000 examples with 100 features each
sum1 = np.sum(x, axis=1).reshape((-1, 1))
prob = x / sum1
stop = time.time()
print(stop - start)
This takes around 0.021 sec on my MBP.
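For the exact layout asked for in the question (the remainder first, then each value divided by 10), a vectorized sketch could look like the following. It assumes every data set has the same length so that they stack into one 2-D array, and that the divisor is the fixed total 10 from the example; `to_probability_rows` is a hypothetical helper name.

```python
import numpy as np

def to_probability_rows(data, total=10):
    """Return rows of [1 - sum(x)/total, x[0]/total, x[1]/total, ...].

    Assumes `data` is a 2-D array-like with one data set per row
    (a hypothetical layout, not taken from the original post).
    """
    scaled = np.asarray(data, dtype=float) / total
    remainder = 1.0 - scaled.sum(axis=1, keepdims=True)
    # Put the remainder in the first column, as in the question.
    return np.hstack([remainder, scaled])

x = [[2, 4], [1, 3]]
probability_array = to_probability_rows(x)
print(probability_array)
```

Each row of the result sums to 1 as long as the values in a data set add up to at most `total`.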

Related

Sequential Sampling

To draw a sample of size 100 from N(1,2) and calculate its mean, we can do this:
import numpy as np
s = np.random.normal(1, 2, 100)
mean = np.mean(s)
Now if we want to produce 10000 samples and save the mean of each of them, we can do:
sample_means = []
for x in range(10000):
    sample = np.random.normal(1, 2, 100)
    sample_means.append(sample.mean())
How can I do this when we want to sample sequentially from N(1,2) and estimate the distribution mean sequentially?
IIUC, you mean the cumulative mean:
sample = np.random.normal(1, 2, (10000, 100))
sample_mean = []
for i, _ in enumerate(sample):
    sample_mean.append(sample[:i+1, :].ravel().mean())
Then sample_mean contains the cumulative sample means:
sample_mean[:10]
[1.1185342714036368,
1.3270808654923423,
1.3266440422140355,
1.2542028664103761,
1.179358517854582,
1.1224645540064788,
1.1416887857272255,
1.1156887336750463,
1.0894328800573165,
1.0878896099712452]
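The loop above recomputes the mean of all earlier rows on every iteration. A cumulative sum gives the same running means in one pass. This is only a sketch, with a smaller sample and a seeded generator (both my assumptions) so it runs quickly:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
sample = rng.normal(1, 2, (1000, 100))

# Running total of all values seen after each batch of 100 draws.
running_sum = np.cumsum(sample.sum(axis=1))
# Number of values seen after each batch: 100, 200, 300, ...
counts = sample.shape[1] * np.arange(1, sample.shape[0] + 1)
sample_mean = running_sum / counts

# Same quantity as the loop version, checked for the third batch.
print(np.isclose(sample_mean[2], sample[:3].mean()))
```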
Maybe a list comprehension?
sample_means = [np.random.normal(1, 2, 100).mean() for i in range(10000)]
Tip: use lowercase names for variables in Python.

Rearrange 3D array in python

I have big binary 3D data and I want to rearrange it into a sequence of values, in the order obtained by parsing the original data as sub-arrays of size 4x4x4.
For example, if the data is 2D and I want to rearrange it using 2x2 sub-arrays:
[example image]
I used simple loops for this, but just iterating over the loops took far too long. I am trying to use some numpy functions to do this, but I am new to SciPy.
My code looks like this:
import numpy as np
x, y, z = 1200, 800, 400
# file_name and output are paths defined elsewhere
data = np.fromfile(file_name, dtype=np.float32)
data.shape = (z, y, x)
new_data = np.empty(shape=x*y*z, dtype=np.float32)
index = 0
for zz in range(0, z, 4):
    for yy in range(0, y, 4):
        for xx in range(0, x, 4):
            for zShift in range(4):
                for yShift in range(4):
                    for xShift in range(4):
                        new_data[index] = data[zz+zShift][yy+yShift][xx+xShift]
                        index += 1
new_data.tofile(output)
However, this takes a lot of time. Any ideas for a better implementation?
As I said, the code works as intended, but I need a smarter, more pythonic way to produce my output.
Thank you!
x,y,z = 1200,800,400
data = np.empty([x,y,z])
# numpy calculates the shape of -1
out = data.reshape(-1, 4, 4, 4)
out.shape
>>> (6000000, 4, 4, 4)
Perform the following test, for smaller data and block size:
x, y, z = 4, 4, 4 # Dimensions
stp = 2 # Block size (in each dimension)
# Create the test array
arr = np.arange(x * y * z).reshape((x, y, z))
And to create a list of "blocks", run:
new_data = []
for xx in range(0, x, stp):
    for yy in range(0, y, stp):
        for zz in range(0, z, stp):
            print('Index:', xx, yy, zz)
            obj = arr[xx:xx+stp, yy:yy+stp, zz:zz+stp].copy()
            print(obj)
            new_data.append(obj)
In the target version of your code:
restore original values of x, y and z,
read the array from your source,
change stp back to 4,
drop test printouts.
Note also that your code adds individual elements to new_data, only iterating over blocks of size 4 * 4 * 4, whereas you wrote that you want a sequence of smaller arrays (i.e. slices) of size 4 * 4 * 4, which is what my code does.
So if you need a list of slices (smaller arrays), not a single 4-D array, use my code.
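If the goal is the flat, block-ordered output that the loops in the question produce, one loop-free option is to split each axis into (block index, offset inside the block) and move the block indices to the front. Note that a plain reshape(-1, 4, 4, 4) groups elements that are consecutive in memory, which is not the same as cutting the volume into spatial 4x4x4 blocks. A minimal sketch on a small array, assuming every dimension is divisible by the block size; `to_block_order` is a hypothetical name:

```python
import numpy as np

def to_block_order(data, b=4):
    """Flatten a (z, y, x) array block by block, using b*b*b blocks."""
    z, y, x = data.shape
    # Split each axis into (block index, offset inside the block).
    blocks = data.reshape(z // b, b, y // b, b, x // b, b)
    # Order: block z, block y, block x, then offsets z, y, x.
    return blocks.transpose(0, 2, 4, 1, 3, 5).reshape(-1)

data = np.arange(8 * 8 * 8, dtype=np.float32).reshape(8, 8, 8)
new_data = to_block_order(data)
print(new_data.shape)
```

The final reshape has to copy the data once, so for the full 1200x800x400 array this still needs memory for a second array of the same size.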

numpy vectorized approach to regression -multiple dependent columns (x) on single independent columns (y)

Consider the (3, 13) np.array below:
import numpy as np
import pandas as pd
from scipy.stats import linregress
a = [-0.00845,-0.00568,-0.01286,-0.01302,-0.02212,-0.01501,-0.02132,-0.00783,-0.00942,0.00158,-0.00016,0.01422,0.01241]
b = [0.00115,0.00623,0.00160,0.00660,0.00951,0.01258,0.00787,0.01854,0.01462,0.01479,0.00980,0.00607,-0.00106]
c = [-0.00233,-0.00467,0.00000,0.00000,-0.00952,-0.00949,-0.00958,-0.01696,-0.02212,-0.01006,-0.00270,0.00763,0.01005]
array = np.array([a,b,c])
yvalues = pd.to_datetime(['2019-12-15','2019-12-16','2019-12-17','2019-12-18','2019-12-19','2019-12-22','2019-12-23','2019-12-24',\
'2019-12-25','2019-12-26','2019-12-29','2019-12-30','2019-12-31'], errors='coerce')
I can run the OLS regression successfully on one row at a time, as below:
out = linregress(array[0], y=yvalues.to_julian_date())
print(out)
LinregressResult(slope=329.141087037396, intercept=2458842.411731361, rvalue=0.684426534581417, pvalue=0.009863937200252878, stderr=105.71465449878443)
However, what I want to accomplish is to run the regression on the whole matrix array in one go, with the 'y' variable (yvalues) held constant for every row (a loop is a possible solution, but tiresome). I tried to extend yvalues to match the array shape with np.tile, but it seems not to be the right approach. Thank you all for your help.
IIUC you are looking for something like the following list comprehension in a vectorized way:
out = [linregress(array[i], y=yvalues.to_julian_date()) for i in range(array.shape[0])]
out
[LinregressResult(slope=329.141087037396, intercept=2458842.411731361, rvalue=0.684426534581417, pvalue=0.009863937200252876, stderr=105.71465449878443),
LinregressResult(slope=178.44888292241782, intercept=2458838.7056912296, rvalue=0.1911788042719021, pvalue=0.5315353013148307, stderr=276.24376878908953),
LinregressResult(slope=106.86168938856262, intercept=2458840.7656617565, rvalue=0.17721031419860186, pvalue=0.5624701260912525, stderr=178.940293876864)]
To be honest, I've never seen what you are looking for implemented with scipy or statsmodels functionality.
Therefore we can implement it ourselves, exploiting numpy broadcasting:
x = array
y = np.array(yvalues.to_julian_date())
# mean of our inputs and outputs
x_mean = np.mean(x, axis=1)
y_mean = np.mean(y)
# using the least-squares formula to calculate the slope and intercept
num = np.sum((x - x_mean[:, np.newaxis]) * (y - y_mean)[np.newaxis, :], axis=1)
den = np.sum((x - x_mean[:, np.newaxis])**2, axis=1)
slopes = num / den
intercepts = y_mean - slopes * x_mean
slopes
array([329.14108704, 178.44888292, 106.86168939])
intercepts
array([2458842.41173136, 2458838.70569123, 2458840.76566176])
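As a sanity check, the broadcasting formula can be compared against fitting each row separately. The sketch below uses made-up data and `np.polyfit` as the reference (my choice, not part of the answer above); `batched_linregress` is a hypothetical name:

```python
import numpy as np

def batched_linregress(x, y):
    """Least-squares slope and intercept of y on each row of x.

    x has shape (k, n) and y has shape (n,).
    """
    x_centered = x - x.mean(axis=1, keepdims=True)
    y_centered = y - y.mean()
    slopes = (x_centered * y_centered).sum(axis=1) / (x_centered**2).sum(axis=1)
    intercepts = y.mean() - slopes * x.mean(axis=1)
    return slopes, intercepts

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 13))      # three regressors, 13 observations each
y = np.arange(13, dtype=float)    # shared response (made-up values)

slopes, intercepts = batched_linregress(x, y)
# Reference: one ordinary degree-1 fit per row.
reference = np.array([np.polyfit(row, y, 1) for row in x])
print(np.allclose(slopes, reference[:, 0]), np.allclose(intercepts, reference[:, 1]))
```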

Efficient way of computing the cross products between two sets of vectors numpy

I have two sets of 2000 3D vectors each, and I need to compute the cross product between each possible pair. I currently do it like this
for tx in tangents_x:
    for ty in tangents_y:
        cross = np.cross(tx, ty)
        # ... do something with the cross variable ...
This works, but it's pretty slow. Is there a way to make it faster?
If I were interested in the element-wise product, I could just do the following:
# Define initial vectors (tx and ty play the role of tangents_x and tangents_y)
tx = np.array([np.random.randn(3) for i in range(2000)])
ty = np.array([np.random.randn(3) for i in range(2000)])
# Store them into arrays of shape (2000, 2000, 3)
X = np.array([tx for i in range(2000)]).transpose(1, 0, 2)  # X[i, j] = tx[i]
Y = np.array([ty for i in range(2000)])  # Y[i, j] = ty[j]
# Compute the element-wise product
ew = X * Y
# Use the element-wise product as usual
for i, vx in enumerate(tx):
    for j, vy in enumerate(ty):
        pass  # ... use the element-wise product of vx and vy as ew[i, j] ...
How can I apply this to the cross product instead of the element-wise one? Or do you see another alternative?
Thanks much :)
Like many numpy functions, cross supports broadcasting, so you can simply do:
np.cross(tangents_x[:, None, :], tangents_y)
or, more verbose but maybe easier to read:
np.cross(tangents_x[:, None, :], tangents_y[None, :, :])
This reshapes tangents_x and tangents_y to shapes (2000, 1, 3) and (1, 2000, 3). By the rules of broadcasting, these are treated like two arrays of shape (2000, 2000, 3), where tangents_x is repeated along axis 1 and tangents_y is repeated along axis 0.
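A small runnable check of the broadcasting behaviour, using tiny random arrays in place of the 2000-vector sets (the sizes and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
tangents_x = rng.normal(size=(4, 3))
tangents_y = rng.normal(size=(5, 3))

# Shapes (4, 1, 3) and (1, 5, 3) broadcast to (4, 5, 3).
crosses = np.cross(tangents_x[:, None, :], tangents_y[None, :, :])
print(crosses.shape)  # (4, 5, 3)

# crosses[i, j] is the cross product of tangents_x[i] and tangents_y[j].
print(np.allclose(crosses[1, 2], np.cross(tangents_x[1], tangents_y[2])))
```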
Just write it out and compile it:
import numpy as np
import numba as nb
@nb.njit(fastmath=True, parallel=True)
def calc_cros(vec_1, vec_2):
    res = np.empty((vec_1.shape[0], vec_2.shape[0], 3), dtype=vec_1.dtype)
    for i in nb.prange(vec_1.shape[0]):
        for j in range(vec_2.shape[0]):
            res[i, j, 0] = vec_1[i, 1] * vec_2[j, 2] - vec_1[i, 2] * vec_2[j, 1]
            res[i, j, 1] = vec_1[i, 2] * vec_2[j, 0] - vec_1[i, 0] * vec_2[j, 2]
            res[i, j, 2] = vec_1[i, 0] * vec_2[j, 1] - vec_1[i, 1] * vec_2[j, 0]
    return res
Performance
import time
# create data
tx = np.random.rand(3000, 3)
ty = np.random.rand(3000, 3)
# don't measure compilation overhead
comb = calc_cros(tx, ty)
t1 = time.time()
comb = calc_cros(tx, ty)
print(time.time() - t1)
This gives 0.08s for the two (3000,3) matrices.
np.dot is almost always going to be faster. So you could convert one of the vectors into a matrix.
def skew(x):
    return np.array([[0, -x[2], x[1]],
                     [x[2], 0, -x[0]],
                     [-x[1], x[0], 0]])
On my machine this runs faster:
import time
tx = np.array([np.random.randn(3) for i in range(100)])
ty = np.array([np.random.randn(3) for i in range(100)])
tt = time.perf_counter()  # time.clock() was removed in Python 3.8
for x in tx:
    for y in ty:
        cross = np.cross(x, y)
print(time.perf_counter() - tt)
0.207 sec
tt = time.perf_counter()
for x in tx:
    m = skew(x)
    for y in ty:
        cross = np.dot(m, y)
print(time.perf_counter() - tt)
0.015 sec
This result may vary depending on the computer.
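The inner loop can also be dropped. Since skew(x) @ y equals the cross product of x and y, multiplying the whole ty array by the transposed matrix handles every y at once. A small sketch of that idea (the array sizes and seed are arbitrary, and I have not timed it):

```python
import numpy as np

def skew(x):
    """Matrix form of the cross product: skew(x) @ y == np.cross(x, y)."""
    return np.array([[0, -x[2], x[1]],
                     [x[2], 0, -x[0]],
                     [-x[1], x[0], 0]])

rng = np.random.default_rng(0)
tx = rng.normal(size=(100, 3))
ty = rng.normal(size=(100, 3))

# Row j of ty @ skew(x).T is skew(x) @ ty[j], i.e. the cross product of x and ty[j].
crosses = np.array([ty @ skew(x).T for x in tx])  # shape (100, 100, 3)
print(np.allclose(crosses, np.cross(tx[:, None, :], ty[None, :, :])))
```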
You could use np.meshgrid() to build the combination matrix and then decompose the cross product. The rest is fiddling around with the axes etc:
# build two lists of five 3D vectors as example values:
a_list = np.random.randint(0, 10, (5, 3))
b_list = np.random.randint(0, 10, (5, 3))
# here the original approach using slow list comprehensions:
slow = np.array([[ np.cross(a, b) for a in a_list ] for b in b_list ])
# now the faster proposed version:
g = np.array([ np.meshgrid(a_list[:,i], b_list[:,i]) for i in range(3) ])
fast = np.array([ g[1,0] * g[2,1] - g[2,0] * g[1,1],
g[2,0] * g[0,1] - g[0,0] * g[2,1],
g[0,0] * g[1,1] - g[1,0] * g[0,1] ]).transpose(1, 2, 0)
I tested this with 10000×10000 elements (instead of the 5×5 in the example above) and it took 6.4 seconds with the fast version. The slow version already took 27 seconds for 500 elements.
For your 2000×2000 elements the fast version takes 0.23s on my computer. Fast enough for you?
Use a Cartesian product to get all possible pairs:
import itertools as it
all_pairs = it.product(tx, ty)
And then use map to loop over all pairs and compute the cross product (in Python 3, map is lazy, so wrap it in list to get the results):
crosses = list(map(lambda pair: np.cross(pair[0], pair[1]), all_pairs))

Root mean square of a function in python

I want to calculate the root mean square of a function in Python. My function has a simple form like y = f(x), where x and y are arrays.
I looked through the NumPy and SciPy docs and couldn't find anything.
I'm going to assume that you want to compute the expression given by the following pseudocode:
ms = 0
for i = 1 ... N
    ms = ms + y[i]^2
ms = ms / N
rms = sqrt(ms)
i.e. the square root of the mean of the squared values of elements of y.
In numpy, you can simply square y, take its mean and then its square root as follows:
rms = np.sqrt(np.mean(y**2))
So, for example:
>>> y = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 1]) # Six 1's
>>> y.size
10
>>> np.mean(y**2)
0.59999999999999998
>>> np.sqrt(np.mean(y**2))
0.7745966692414834
Do clarify your question if you mean to ask something else.
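If y comes from sampling a function, the same one-liner applies. For instance, the RMS of a sine wave over a whole period should come out close to 1/sqrt(2). The sampling grid below is an arbitrary choice, and `rms` is just a convenience wrapper:

```python
import numpy as np

def rms(y):
    """Root mean square of the values in y."""
    y = np.asarray(y, dtype=float)
    return np.sqrt(np.mean(y**2))

# Sample y = f(x) = sin(x) over one full period (endpoint excluded).
x = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
y = np.sin(x)
print(rms(y))  # close to 1/sqrt(2), about 0.7071
```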
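You could use the sklearn function (note that newer scikit-learn releases deprecate the squared argument in favour of root_mean_squared_error):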
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_actual,[0 for _ in y_actual], squared=False)
numpy.std(x) tends to rms(x) as mean(x) tends to 0 (thanks to @Seb), as can be the case with sound recordings, vibrations, and other signals that fluctuate around zero.
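The relationship behind this is rms(x)**2 == std(x)**2 + mean(x)**2 (with numpy's default population standard deviation), so the two agree exactly when the mean is zero. A quick numerical check on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=2.0, size=10_000)  # deliberately non-zero mean

rms = np.sqrt(np.mean(x**2))
# Identity: mean of squares = variance + squared mean.
print(np.isclose(rms**2, np.std(x)**2 + np.mean(x)**2))

# After removing the mean, std and rms coincide.
centered = x - x.mean()
print(np.isclose(np.std(centered), np.sqrt(np.mean(centered**2))))
```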
rms = lambda x_seq: (sum(x*x for x in x_seq)/len(x_seq))**(1/2)
In case you'd like to split your array into frames before computing the RMS, this is a numpy solution:
nframes = 1000
rms = np.array([
    np.sqrt(np.mean(frame**2))
    for frame in np.array_split(arr, nframes)
])
If you'd like to specify the frame length instead of the frame count, you'd do this first:
frame_length = 200
arr_length = arr.shape[0]
nframes = -(-arr_length // frame_length)  # ceiling division
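When every frame must have exactly frame_length samples, another option is to reshape into a 2-D array and reduce along one axis. This sketch drops any incomplete trailing frame, which is an assumption about what you want; `framed_rms` is a hypothetical name:

```python
import numpy as np

def framed_rms(arr, frame_length):
    """RMS of consecutive, non-overlapping frames of frame_length samples."""
    arr = np.asarray(arr, dtype=float)
    nframes = arr.shape[0] // frame_length
    # Discard samples that do not fill a whole frame.
    frames = arr[:nframes * frame_length].reshape(nframes, frame_length)
    return np.sqrt(np.mean(frames**2, axis=1))

arr = np.arange(10)
print(framed_rms(arr, 4))  # two frames: [0, 1, 2, 3] and [4, 5, 6, 7]
```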
