Regrid numpy array based on cell area - python

import numpy as np
from skimage.measure import block_reduce
arr = np.random.random((6, 6))
area_cell = np.random.random((6, 6))
block_reduce(arr, block_size=(2, 2), func=np.ma.mean)
I would like to regrid a numpy array arr from 6 x 6 to 3 x 3, using the skimage function block_reduce as above.
However, block_reduce assumes each grid cell has the same size. How can I solve this problem when each grid cell has a different size? Here the size of each grid cell is given by the numpy array area_cell.
-- EDIT:
An example:
arr
0.25 0.58 0.69 0.74
0.49 0.11 0.10 0.41
0.43 0.76 0.65 0.79
0.72 0.97 0.92 0.09
If all elements of area_cell were 1, and we were to convert the 4 x 4 arr into 2 x 2, the result would be:
0.36 0.48
0.72 0.61
However, if area_cell is as follows:
0.00 1.00 1.00 0.00
0.00 1.00 0.00 0.50
0.20 1.00 0.80 0.80
0.00 0.00 1.00 1.00
Then the result becomes:
0.17 0.22
0.21 0.54

It seems you are still reducing by blocks, just after scaling arr with area_cell. So you only need to perform an element-wise multiplication between these two arrays and use the same block_reduce code on that product array, like so -
block_reduce(arr*area_cell, block_size=(2, 2), func=np.ma.mean)
Alternatively, we can simply use np.mean after reshaping to a 4D version of the product array, like so -
m, n = arr.shape
# view the array as (m//2, 2, n//2, 2); axes 1 and 3 run within each 2 x 2 block
out = (arr*area_cell).reshape(m//2, 2, n//2, 2).mean(axis=(1, 3))
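(Note that this reshape trick assumes m and n are exact multiples of the block size; block_reduce, by contrast, pads the array with cval when they are not.)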
Sample run -
In [21]: arr
Out[21]:
array([[ 0.25,  0.58,  0.69,  0.74],
       [ 0.49,  0.11,  0.1 ,  0.41],
       [ 0.43,  0.76,  0.65,  0.79],
       [ 0.72,  0.97,  0.92,  0.09]])

In [22]: area_cell
Out[22]:
array([[ 0. ,  1. ,  1. ,  0. ],
       [ 0. ,  1. ,  0. ,  0.5],
       [ 0.2,  1. ,  0.8,  0.8],
       [ 0. ,  0. ,  1. ,  1. ]])

In [23]: block_reduce(arr*area_cell, block_size=(2, 2), func=np.ma.mean)
Out[23]:
array([[ 0.1725 ,  0.22375],
       [ 0.2115 ,  0.5405 ]])

In [24]: m,n = arr.shape

In [25]: (arr*area_cell).reshape(m//2,2,n//2,2).mean(axis=(1,3))
Out[25]:
array([[ 0.1725 ,  0.22375],
       [ 0.2115 ,  0.5405 ]])
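As an aside - and this is my own reading, not something the question asks for - if you ever want a true area-weighted mean, i.e. dividing each block by its summed cell area rather than by the number of cells, a minimal sketch would reuse block_reduce with np.sum:

from skimage.measure import block_reduce  # as in the question

num = block_reduce(arr*area_cell, block_size=(2, 2), func=np.sum)
den = block_reduce(area_cell, block_size=(2, 2), func=np.sum)
out = num / den   # blocks whose total area is zero will produce NaN here

Your worked example divides by the block size, though, which is exactly what block_reduce with np.ma.mean does.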

Related

Correlation of a 1D array from beginning to all points, a kind of sliding correlation

I have a 1D array and want to find the correlation of the series of the first 2 elements, then the first 3 elements, and so on, until all elements.
I can do it with numpy in a loop; here is my code:
import numpy as np

data = np.array([10,5,8,9,15,22,26,11,15,16,18,7,4,8,-2,-3,-4,-6,-2,0,10,0,5,8])
correl = np.zeros(data.shape)
for i in range(1, data.shape[0]):
    correl[i] = np.corrcoef(data[0:i+1], np.arange(i+1))[0, 1]
print(correl)
and the result is:
[ 0. -1. -0.397 0. 0.607 0.799 0.88 0.64 0.581 0.556
0.574 0.322 0.078 -0.02 -0.237 -0.383 -0.489 -0.572 -0.614 -0.634
-0.568 -0.59 -0.573 -0.533]
I wonder how I can do this in numpy without a loop, i.e. be smarter/more efficient.
Any ideas, please?
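No answer is recorded here, so the following is only a sketch of one possible vectorized approach: every prefix's Pearson correlation against the index can be computed at once from cumulative sums, since r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2)(n*Syy - Sy^2)) and all five sums are prefix sums.

import numpy as np

data = np.array([10,5,8,9,15,22,26,11,15,16,18,7,4,8,-2,-3,-4,-6,-2,0,10,0,5,8], dtype=float)
n = np.arange(1, data.size + 1)            # prefix lengths 1..N
x = np.arange(data.size, dtype=float)      # the index series 0..N-1
Sx, Sy = np.cumsum(x), np.cumsum(data)     # running sums of x and y
Sxx, Syy = np.cumsum(x*x), np.cumsum(data*data)
Sxy = np.cumsum(x*data)

num = n*Sxy - Sx*Sy
den = np.sqrt((n*Sxx - Sx**2) * (n*Syy - Sy**2))
correl = np.zeros_like(data)
np.divide(num, den, out=correl, where=den > 0)   # the length-1 prefix keeps 0, as in the loop version
print(np.round(correl, 3))

This should reproduce the loop's output above.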

Compare current row with next row in a DataFrame with pandas

I have a DataFrame called "DataExample" and an ascending sorted list called "normalsizes".
import pandas as pd

if __name__ == "__main__":

    DataExample = [[0.6, 0.36, 0.00],
                   [0.6, 0.36, 0.00],
                   [0.9, 0.81, 0.85],
                   [0.8, 0.64, 0.91],
                   [1.0, 1.00, 0.92],
                   [1.0, 1.00, 0.95],
                   [0.9, 0.81, 0.97],
                   [1.2, 1.44, 0.97],
                   [1.0, 1.00, 0.97],
                   [1.0, 1.00, 0.99],
                   [1.2, 1.44, 0.99],
                   [1.1, 1.21, 0.99]]

    DataExample = pd.DataFrame(data=DataExample, columns=['Lx', 'A', 'Ratio'])

    normalsizes = [0, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.4, 2.5, 2.75, 3,
                   3.25, 3.5, 3.75, 4, 4.25, 4.5, 4.75, 5, 5.25, 5.5, 5.75, 6]

    # for i in DataExample.index:
    #     numb = DataExample['Lx'][i]
What I am looking for is that each DataExample['Lx'] value is analyzed and located within one of the normalsizes intervals. For example:
DataExample['Lx'][0] = 0.6 lies in the interval [0, 0.75] (0.6 > 0 and 0.6 <= 0.75), so I take the largest value of that interval, i.e. 0.75. This for each row.
With this I should have the following result:
Lx    A     Ratio
1     0.36  0
1     0.36  0
1     0.81  0.85
1     0.64  0.91
1.25  1     0.92
1.25  1     0.95
1     0.81  0.97
1.25  1.44  0.97
1.25  1     0.97
1.25  1     0.99
1.25  1.44  0.99
1.25  1.21  0.99
numpy.searchsorted will get you what you want:
import numpy as np
normalsizes = np.array(normalsizes)  # convert to a numpy array so the result can index into it
DataExample["Lx"] = normalsizes[np.searchsorted(normalsizes, DataExample["Lx"])]

issue when loading a data file with numpy

I want to train a classifier with scikit, but to do this I first need to load the corresponding data. I am using the following data file, available at:
https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/
When I open it in Word it has the following contents:
ADT1_YEAST 0.58 0.61 0.47 0.13 0.50 0.00 0.48 0.22 MIT
ADT2_YEAST 0.43 0.67 0.48 0.27 0.50 0.00 0.53 0.22 MIT
ADT3_YEAST 0.64 0.62 0.49 0.15 0.50 0.00 0.53 0.22 MIT
AAR2_YEAST 0.58 0.44 0.57 0.13 0.50 0.00 0.54 0.22 NUC
Each field is separated by a double space and every line ends with a carriage return.
I want to read it with the following command:
f=open("yeast.data")
data = np.loadtxt(f,delimiter=" ")
and at the end I want to be able to use the following:
X = data[:,:-1] # select all columns except the last
y = data[:, -1] # select the last column
for using:
X_train, X_test, y_train, y_test = train_test_split(X, y)
but when I try to read it, the following error appears:
ValueError: could not convert string to float: ADT1_YEAST
so how can I read this file in Python so that I can later use the MLPClassifier?
Thanks
You can skip the f=open(...), and you can use dtype='O' to make sure numpy reads it as a mix of numerical values and strings. Because of some inconsistencies in the data structure of the file you linked, it's best to use genfromtxt instead of loadtxt:
data = np.genfromtxt('yeast.data',dtype='O')
>>> data
array([[b'ADT1_YEAST', b'0.58', b'0.61', ..., b'0.48', b'0.22', b'MIT'],
       [b'ADT2_YEAST', b'0.43', b'0.67', ..., b'0.53', b'0.22', b'MIT'],
       [b'ADT3_YEAST', b'0.64', b'0.62', ..., b'0.53', b'0.22', b'MIT'],
       ...,
       [b'ZNRP_YEAST', b'0.67', b'0.57', ..., b'0.56', b'0.22', b'ME2'],
       [b'ZUO1_YEAST', b'0.43', b'0.40', ..., b'0.53', b'0.39', b'NUC'],
       [b'G6PD_YEAST', b'0.65', b'0.54', ..., b'0.53', b'0.22', b'CYT']], dtype=object)
>>> data.shape
(1484, 10)
You can change the dtypes when you call genfromtxt (see the documentation), or you can change them manually afterwards, like this:
data[:,0] = [s.decode() for s in data[:,0]]     # decode bytes to str
data[:,1:-1] = data[:,1:-1].astype(float)
data[:,-1] = [s.decode() for s in data[:,-1]]   # (.astype(str) would keep the b'' prefix on Python 3)
>>> data
array([['ADT1_YEAST', 0.58, 0.61, ..., 0.48, 0.22, 'MIT'],
       ['ADT2_YEAST', 0.43, 0.67, ..., 0.53, 0.22, 'MIT'],
       ['ADT3_YEAST', 0.64, 0.62, ..., 0.53, 0.22, 'MIT'],
       ...,
       ['ZNRP_YEAST', 0.67, 0.57, ..., 0.56, 0.22, 'ME2'],
       ['ZUO1_YEAST', 0.43, 0.4, ..., 0.53, 0.39, 'NUC'],
       ['G6PD_YEAST', 0.65, 0.54, ..., 0.53, 0.22, 'CYT']], dtype=object)
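From here, a sketch of how the loaded array could feed the classifier mentioned in the question. Note that the first column is an identifier, not a feature, so the question's data[:,:-1] plan is adjusted to skip it.

from sklearn.model_selection import train_test_split

X = data[:, 1:-1].astype(float)   # the eight numeric feature columns
y = data[:, -1].astype(str)       # class labels such as 'MIT', 'NUC'
X_train, X_test, y_train, y_test = train_test_split(X, y)

scikit-learn classifiers such as MLPClassifier accept string class labels directly, so y needs no further encoding.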

How to plot 3d triangles in matplotlib with triangles vertices's coordinates (9 numbers for each triangle)?

I have many triangles (say N = 10^6) with the (x, y, z) coordinates of each vertex stored in a file, so each triangle is stored as a row of 9 numbers and the file has N rows. Now I just want to plot (in 3d) all the triangles, filled with some colour. The triangles may or may not be adjacent. I am very confused surfing through the matplotlib documentation. Kindly help. Don't scold me please.
Plotting a million triangles on a plot which has at most a million pixels may not make too much sense. In any case, if you do not have information about which vertex is adjacent to which other, you cannot directly use the plot_trisurf method.
I see two options:
Plot a Poly3DCollection.
Filter the unique points from the data and supply those to plot_trisurf. Using this method, you may not be able to color the triangles however you wish, but only according to their z-value (a sketch of this option follows at the end of this answer).
The following is an example of how to plot a Poly3DCollection from your input data. For the purpose of demonstration we first need to provide some sample data (this ought to be the duty of the questioner, not the answerer).
import numpy as np

np.set_printoptions(threshold=np.inf)
phi = np.linspace(0, 2*np.pi, 7)
x = np.cos(phi) + np.sin(phi)
y = -np.sin(phi) + np.cos(phi)
z = np.cos(phi)*0.12 + 0.7
a = np.zeros((len(phi)-1, 9))
a[:, 0] = x[:-1]                            # first vertex of each triangle
a[:, 1] = y[:-1]
a[:, 2] = z[:-1]
a[:, 3:6] = np.roll(a[:, 0:3], -1, axis=0)  # second vertex: the next point around the ring
a[:, 8] = np.ones_like(phi[:-1])            # third vertex is the shared apex (0, 0, 1)
a = np.around(a, 2)
print(a)
which prints
[[ 1.    1.    0.82  1.37 -0.37  0.76  0.    0.    1.  ]
 [ 1.37 -0.37  0.76  0.37 -1.37  0.64  0.    0.    1.  ]
 [ 0.37 -1.37  0.64 -1.   -1.    0.58  0.    0.    1.  ]
 [-1.   -1.    0.58 -1.37  0.37  0.64  0.    0.    1.  ]
 [-1.37  0.37  0.64 -0.37  1.37  0.76  0.    0.    1.  ]
 [-0.37  1.37  0.76  1.    1.    0.82  0.    0.    1.  ]]
(every set of 3 columns belongs to one vertex: the first column is x, the second y, the third z).
Now we can actually build the Poly3DCollection.
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
fc = ["crimson" if i % 2 else "gold" for i in range(a.shape[0])]  # alternating face colors
# one list of three (x, y, z) vertices per triangle
poly3d = [[a[i, j*3:j*3+3] for j in range(3)] for i in range(a.shape[0])]
ax.add_collection3d(Poly3DCollection(poly3d, facecolors=fc, linewidths=1))
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-1.5, 1.5)
plt.show()
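For completeness, here is a sketch of the second option from the list above (my own addition, not part of the original answer): filter the unique vertices with np.unique and hand plot_trisurf explicit triangle indices. It assumes the same (N, 9) array a built above, and that coinciding vertices coincide in all three coordinates.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection on older matplotlib

verts = a.reshape(-1, 3)                 # flatten to a (3N, 3) vertex list
uniq, inv = np.unique(verts, axis=0, return_inverse=True)
triangles = inv.ravel().reshape(-1, 3)   # per-triangle indices into uniq

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_trisurf(uniq[:, 0], uniq[:, 1], uniq[:, 2], triangles=triangles)
plt.show()

As the answer notes, the coloring here follows the z-values; you cannot assign an arbitrary color per triangle this way.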

Should stats.norm.pdf give the same result as stats.gaussian_kde in Python?

I was trying to estimate the PDF of a 1-D signal using gaussian_kde. However, when I plot the pdf using stats.norm.pdf, it gives me a different result. Please correct me if I am wrong; I think they should give quite similar results. Here's my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

npeaks = 9
mean = np.array([0.2, 0.3, 0.38, 0.55, 0.65, 0.7, 0.75, 0.8, 0.82])  # peak locations
support = np.arange(0, 1.01, 0.01)
std = 0.03
pkfun = sum(stats.norm.pdf(support, loc=mean[i], scale=std) for i in range(0, npeaks))
df = pd.DataFrame(support)
X = df.iloc[:,0]
min_x, max_x = X.min(), X.max()
plt.figure(1)
plt.plot(support,pkfun)
kernel = stats.gaussian_kde(X)
grid = 100j
X= np.mgrid[min_x:max_x:grid]
Z = np.reshape(kernel(X), X.shape)
# plot KDE
plt.figure(2)
plt.plot(X, Z)
plt.show()
Also, the first derivative of the stats.gaussian_kde result was far from the original signal, whereas the first derivative of stats.norm.pdf does make sense. So I am assuming I might have an error in my code above.
Value of X= np.mgrid[min_x:max_x:grid]:
[
0. 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
0.06060606 0.07070707 0.08080808 0.09090909 0.1010101 0.11111111
0.12121212 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717
0.18181818 0.19191919 0.2020202 0.21212121 0.22222222 0.23232323
0.24242424 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929
0.3030303 0.31313131 0.32323232 0.33333333 0.34343434 0.35353535
0.36363636 0.37373737 0.38383838 0.39393939 0.4040404 0.41414141
0.42424242 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747
0.48484848 0.49494949 0.50505051 0.51515152 0.52525253 0.53535354
0.54545455 0.55555556 0.56565657 0.57575758 0.58585859 0.5959596
0.60606061 0.61616162 0.62626263 0.63636364 0.64646465 0.65656566
0.66666667 0.67676768 0.68686869 0.6969697 0.70707071 0.71717172
0.72727273 0.73737374 0.74747475 0.75757576 0.76767677 0.77777778
0.78787879 0.7979798 0.80808081 0.81818182 0.82828283 0.83838384
0.84848485 0.85858586 0.86868687 0.87878788 0.88888889 0.8989899
0.90909091 0.91919192 0.92929293 0.93939394 0.94949495 0.95959596
0.96969697 0.97979798 0.98989899 1. ]
Value of X = df.iloc[:,0]:
[ 0. 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11
0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23
0.24 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35
0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47
0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71
0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83
0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95
0.96 0.97 0.98 0.99 1. ]
In the row below you make pdf calculations at every peak location along the 100 support points, with std = 0.03:

pkfun = sum(stats.norm.pdf(support, loc=mean[i], scale=std) for i in range(0, npeaks))

That gives you nine arrays of 100 elements each, which you then sum element-wise, so the result is a curve with nine narrow peaks (narrow because of std = 0.03). Are you sure that this was your purpose with this row? It will never give a graph similar to the kernel estimate, because the kernel estimate is based on the original data: here gaussian_kde(X) is fitted to the evenly spaced support grid itself, whose density is roughly uniform, not to samples drawn from the nine-peak distribution.
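As a sketch of how to make the two curves comparable (my own addition, not part of the original answer): fit the KDE to samples actually drawn from the nine-peak mixture, and divide pkfun by the number of peaks so it integrates to 1.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

mean = np.array([0.2, 0.3, 0.38, 0.55, 0.65, 0.7, 0.75, 0.8, 0.82])
std = 0.03
support = np.arange(0, 1.01, 0.01)
pkfun = sum(stats.norm.pdf(support, loc=m, scale=std) for m in mean)

rng = np.random.default_rng(0)
centers = rng.choice(mean, size=5000)          # pick one of the nine peaks uniformly at random
samples = centers + rng.normal(0, std, 5000)   # add Gaussian jitter around the chosen peak
kernel = stats.gaussian_kde(samples)

plt.plot(support, pkfun / len(mean), label='mixture pdf (normalized)')
plt.plot(support, kernel(support), label='KDE of mixture samples')
plt.legend()
plt.show()

The two curves should now differ only by the KDE's smoothing: gaussian_kde picks its own bandwidth, which is wider than std = 0.03, so its peaks come out a little lower and broader.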
