I have a 2D NumPy array. I want to slice out unequal-length subsets of columns and put them in a single array, with the remaining values filled with nan. That is, given:
data = np.random.normal(size=(100,4))
I want to slice from the starting indices [75, 33, 42, 54] to the end: rows 75 onward in column 0, rows 33 onward in column 1, and so on.
I tried data[[slice(75,100),slice(33,100)],:] but it didn't work.
You can do it by creating a mask that is True at the positions you want to become np.nan and False elsewhere:
import numpy as np
data = np.random.normal(size=(5,4))
b = np.array([0, 1, 2, 3])
mask = np.arange(len(data))[:, None] < b
data[mask] = np.nan
data
Output:
array([[-0.53306108, nan, nan, nan],
[ 1.32282687, 0.83204007, nan, nan],
[-1.07143908, 0.12972517, -0.4783274 , nan],
[ 0.39686727, -1.20532247, -2.17043218, 0.74859079],
[ 1.82548696, 0.98669461, -1.17961517, -0.7813723 ]])
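Applied to the original 100x4 problem, the same idea replaces the masked-out values in place (a minimal sketch, assuming you want the leading entries NaN-ed rather than the columns compacted):

import numpy as np

np.random.seed(0)
data = np.random.normal(size=(100, 4))
ind = np.array([75, 33, 42, 54])             # first row to keep in each column
mask = np.arange(len(data))[:, None] < ind   # True above each starting index
data[mask] = np.nan                          # rows 75+, 33+, 42+, 54+ survive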
To get repeatable results, I seeded the random generator:
np.random.seed(0)
and then created the source array:
data = np.random.normal(size=(100,4))
The list of starting indices I defined as:
ind = [75, 33, 42, 54]
An initial step is to compute the maximum length of an output column:
ml = max([ 100 - i for i in ind ])
Then you can generate the output array as:
result = np.array([ np.pad(data[rInd:, i[0]], (0, ml - 100 + rInd),
constant_values = np.nan) for i, rInd in np.ndenumerate(ind) ]).T
This code:
- takes the required slice of the source array (data[rInd:, i[0]]),
- pads it with the required number of NaN values,
- creates a NumPy array (so far each row contains what the target column should contain),
- so the only remaining step is to transpose this array.
The result, for my source data, is:
array([[-1.30652685, 0.03183056, 0.92085882, -0.04225715],
[ 0.66638308, -0.20829876, -1.03424284, 0.48148147],
[ 0.69377315, 0.4393917 , -0.4555325 , 0.23218104],
[-1.12682581, 0.94447949, -0.6436184 , -0.49331988],
[-0.04217145, -0.4615846 , -1.10438334, 0.7811981 ],
[-0.71960439, -0.82643854, -1.29285691, 0.67690804],
[-1.15735526, -1.07993151, 0.52327666, -0.29779088],
[-0.70470028, 1.92953205, 2.16323595, 1.07961859],
[ 0.77325298, 0.84436298, 1.0996596 , -0.57578797],
[-1.75589058, 0.31694261, -0.02432612, 0.69474914],
[ 1.0685094 , -0.65102559, 0.91017891, 0.61037938],
[-0.44092263, -0.68954978, -0.94444626, -0.0525673 ],
[ 0.5785215 , -1.37495129, 2.25930895, 0.08842209],
[ 1.36453185, -1.60205766, -0.46359597, -2.77259276],
[-1.84306955, 1.5430146 , 0.15650654, -0.39095338],
[ 0.69845715, -1.1680935 , -1.42406091, 2.06449286],
[-0.01568211, 0.82350415, -1.15618243, 1.53637705],
[-0.26773354, -0.23937918, 0.42625873, 1.21114529],
[ 0.84163126, -1.61695604, -0.13288058, -0.48102712],
[ 0.64331447, -0.09815039, 1.15233156, 1.13689136],
[-1.69810582, -0.4664191 , 0.52106488, 0.37005589],
[ 0.03863055, 0.37915174, 0.69153875, -0.6801782 ],
[ 1.64813493, -0.34598178, -1.5829384 , -1.34671751],
[-0.35343175, 0.06326199, -0.59631404, 1.07774381],
[ 0.85792392, -0.23792173, 0.52389102, 0.09435159],
[ nan, 0.41605005, 0.39904635, -0.10730528],
[ nan, -2.06998503, -0.65240858, -0.89091508],
[ nan, -0.39727181, -2.03068447, 2.2567235 ],
[ nan, -1.67600381, -0.69204985, -1.18894496],
[ nan, -1.46642433, -1.04525337, 0.60631952],
[ nan, -0.31932842, -0.62808756, 1.6595508 ],
[ nan, -1.38336396, -0.1359497 , -1.2140774 ],
[ nan, -0.50681635, -0.39944903, 0.15670386],
[ nan, 0.1887786 , -0.11816405, -1.43779147],
[ nan, 0.09740017, -1.33425847, -0.52118931],
[ nan, 0.39009332, -0.13370156, 0.6203583 ],
[ nan, -0.11610394, -0.38487981, 0.33996498],
[ nan, 1.02017271, -0.0616264 , -0.39484951],
[ nan, 0.60884383, 0.27451636, -0.99312361],
[ nan, 1.30184623, -0.15766702, 0.49383678],
[ nan, -1.06001582, 0.74718833, 0.88017891],
[ nan, 0.58295368, -2.65917224, -1.02250684],
[ nan, 1.65813068, -0.6840109 , -1.47183501],
[ nan, -0.46071979, -0.68783761, -0.2226751 ],
[ nan, -0.15957344, -0.36469354, -0.76149221],
[ nan, -0.73067775, -0.76414392, 0.85255194],
[ nan, -0.28688719, -0.6522936 , nan],
[ nan, -0.81299299, -0.47965581, nan],
[ nan, -0.31229225, 0.93184837, nan],
[ nan, 0.94326072, -0.19065349, nan],
[ nan, -1.18388064, 0.28044171, nan],
[ nan, 0.45093446, 0.04949498, nan],
[ nan, -0.4533858 , -0.20690368, nan],
[ nan, -0.2803555 , -2.25556423, nan],
[ nan, 0.34965446, -0.98551074, nan],
[ nan, -0.68944918, 0.56729028, nan],
[ nan, -0.477974 , -0.29183736, nan],
[ nan, 0.00377089, 1.46657872, nan],
[ nan, 0.16092817, nan, nan],
[ nan, -1.12801133, nan, nan],
[ nan, -0.24945858, nan, nan],
[ nan, -1.57062341, nan, nan],
[ nan, 0.38728048, nan, nan],
[ nan, -1.6567151 , nan, nan],
[ nan, 0.16422776, nan, nan],
[ nan, -1.61647419, nan, nan],
[ nan, 1.14110187, nan, nan]])
Note that the above code contains i[0] because np.ndenumerate
yields a tuple of indices as the first element of each pair.
Since ind is a 1-D array, we are interested in the first index only,
so after i I put [0].
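For what it's worth, the same comprehension can be written with plain enumerate, which avoids the tuple indexing entirely (an equivalent sketch, reusing ind, ml, and data from above):

result = np.array([ np.pad(data[r:, c], (0, ml - 100 + r),
    constant_values = np.nan) for c, r in enumerate(ind) ]).T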
This one works for me:
data = np.random.normal(size=(100,4))
slices_values = [75, 33, 42, 54]  # the starting row of each slice
slices = []  # this list will hold the slices
for i in slices_values:
    x = slice(i, 100)
    slices.append(data[x])
Now you can confirm the shape of each slice:
slices[0].shape # (25, 4)
slices[1].shape # (67, 4)
slices[2].shape # (58, 4)
slices[3].shape # (46, 4)
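Note that each slice above spans all four columns. To build the single NaN-padded array the question asks for, one could instead take the per-column slices and pad them; a sketch reusing slices_values from above:

import numpy as np

max_len = max(100 - i for i in slices_values)        # longest kept column
result = np.full((max_len, data.shape[1]), np.nan)   # NaN-filled target
for col, start in enumerate(slices_values):
    col_slice = data[start:, col]                    # rows start..end of one column
    result[:len(col_slice), col] = col_slice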
It is difficult to tell what the expected result should be, but IIUC, one way is to create an array and fill it in a loop:
data = np.random.normal(size=(5, 4))
ids = np.array([2, 1, 2, 3])
def test(data, ids):
    arr = np.empty_like(data)
    for i, j in enumerate(ids):
        arr[:j, i] = data[:j, i]
        arr[j:, i] = np.nan
    return arr
res = test(data, ids)
# [[ 0.1768507210788626 2.3777541249700573 0.998732857053734 -1.3101507969798436 ]
# [ 0.18018992116935298 nan -1.443125868756967 -1.3992855573400653 ]
# [ nan nan nan -0.2319322879433409 ]
# [ nan nan nan nan]
# [ nan nan nan nan]]
or:
def test(data, ids):
    arr = np.empty_like(data)
    for i, j in enumerate(ids):
        arr[:j, i] = np.nan
        arr[j:, i] = data[j:, i]
    return arr
# [[ nan nan nan nan]
# [ nan -1.7647540193678475 nan nan]
# [ 0.8203539992532282 1.2952993197746814 0.9421974218807785 nan]
# [-0.6313979666045816 -0.6421770233773478 -0.3816716009896775 -1.7634440039930654]
# [ 1.611668212682313 -0.878108388861928 -0.4985770669099582 0.9072434022928676]]
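The second variant can also be vectorized with a broadcasted mask, along the lines of the first answer (a sketch, assuming the same data and ids as above):

mask = np.arange(len(data))[:, None] < ids   # True for rows before each start index
res = np.where(mask, np.nan, data)           # NaN above the start, data from it on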
Given a parameter p, which can be any float or integer.
For example, let p = 4:
| time | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Numbers | a1 | a1*((0.5)^(1/p))^(2-1) | a1*((0.5)^(1/p))^(3-1) | a1*((0.5)^(1/p))^(4-1) | a1*((0.5)^(1/p))^(5-1) |
| Numbers | nan | a2 | a2*((0.5)^(1/p))^(3-2) | a2*((0.5)^(1/p))^(4-2) | a2*((0.5)^(1/p))^(5-2) |
| Numbers | nan | nan | a3 | a3*((0.5)^(1/p))^(4-3) | a3*((0.5)^(1/p))^(5-3) |
| Numbers | nan | nan | nan | a4 | a4*((0.5)^(1/p))^(5-4) |
| Numbers | nan | nan | nan | nan | a5 |
| Final Results | a1 | sum of column 2 | sum of column 3 | sum of column 4 | sum of column 5 |
Numbers a1, a2, a3, a4, a5, ..., at are given; the goal is to compute the Final Results. Combining the answer provided by mozway, I wrote the following function, which works well. It solves the problem in a matrix way.
def hl(p, column):
    a = np.arange(len(copy_raw))
    factors = (a[:, None] - a)
    factors = np.where(factors < 0, np.nan, factors)
    inter = ((1/2)**(1/p))**factors
    copy_raw[column] = np.nansum(copy_raw[column].to_numpy()*inter, axis=1)
However, I don't think this method will scale well to a large DataFrame. Is there a better (i.e., faster) way to solve the problem?
Assuming your number of rows is not too large, you can achieve this with numpy broadcasting:
First create a 2D array of factors:
a = np.arange(len(df))
factors = (a[:,None]-a)
factors = np.where(factors<0, np.nan, factors)
# array([[ 0., nan, nan, nan, nan],
# [ 1., 0., nan, nan, nan],
# [ 2., 1., 0., nan, nan],
# [ 3., 2., 1., 0., nan],
# [ 4., 3., 2., 1., 0.]])
Then map to your data and sum:
df['number2'] = np.nansum(df['number'].to_numpy()*(1/2)**factors, axis=1)
example output:
Index Time number number2
0 0 1997-WK01 1 1.0000
1 1 1997-WK02 2 2.5000
2 2 1997-WK03 3 4.2500
3 3 1997-WK04 2 4.1250
4 4 1997-WK05 4 6.0625
intermediate:
df['number'].to_numpy()*(1/2)**factors
# array([[1. , nan, nan, nan, nan],
# [0.5 , 2. , nan, nan, nan],
# [0.25 , 1. , 3. , nan, nan],
# [0.125 , 0.5 , 1.5 , 2. , nan],
# [0.0625, 0.25 , 0.75 , 1. , 4. ]])
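For large inputs the O(n²) factors matrix can be avoided entirely: the column obeys the recursion number2[t] = number[t] + decay * number2[t-1], and scipy.signal.lfilter evaluates exactly such a recursion in linear time. A sketch (decay is (1/2)**(1/p) in the question's notation; for the example above it is simply 1/2):

from scipy.signal import lfilter

decay = 0.5   # or (1/2)**(1/p) for the general half-life version
# y[n] = x[n] + decay * y[n-1], computed in one O(n) pass
df['number2'] = lfilter([1.0], [1.0, -decay], df['number'].to_numpy())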
There are two 2-D ndarrays A and B.
A contains panel values for a feature; rows represent days, columns are different regions. There are ~3000 columns and ~5000 rows in A. For example:
A = array([[ 3.53, 3.56, nan, ..., nan, nan, nan], # day 1 data
[-4.91, -2.54, nan, ..., nan, nan, nan], # day 2 data
[-6.31, -3.39, nan, ..., nan, nan, nan], # day 3 data, etc
...,
[ 0. , -3.41, nan, ..., 12.69, 2.32, nan],
[-2.74, -4.14, nan, ..., -8.63, -1.45, nan],
[-1.74, -7.45, nan, ..., 0.68, -6.52, nan]])
B contains the type of each corresponding value in A. There are around 30 types in total. For example:
B = array([[ 'A', 'B', nan, ..., nan, nan, nan], # day 1 type
[ 'A', 'A', nan, ..., nan, nan, nan], # day 2 type, etc
...,
[ 'D', 'E', nan, ..., 'I', 'D', nan],
[ 'X', 'Y', nan, ..., 'O', 'S', nan]])
The goal is, for each day (row), to split the regions into 10 groups based on their values (group 10 > group 9 > ...). Within each group, the weight of each type should equal the total count of that type in the row divided by 10. For example,
day 1:
# of A: 35 --> weight of A in each group: 3.5
# of B: 33 --> weight of B in each group: 3.3
...
# of Z: 6 --> weight of Z in each group: 0.6
And the result should be something like
weight_group_1 = array([[ 1, 1, nan, ..., 0.5, ..., 1, ..., nan, nan, nan]
# And the sum of each group's weights should be equal, if all steps correct.
weight_group_2 = array([[ 0, 0, nan, ..., 1, ..., 0.3, ..., nan, nan, nan]
and so on
Are there any efficient algorithms to achieve this? Please help, thanks in advance!
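Not a full solution, but the per-day decile split itself can be done in a NaN-aware, vectorized way using ranks. A sketch (assuming pandas is acceptable, ties broken by order of appearance, and higher values landing in higher groups):

import numpy as np
import pandas as pd

ranks = pd.DataFrame(A).rank(axis=1, method='first')   # NaNs keep NaN ranks
valid = np.isfinite(A).sum(axis=1, keepdims=True)      # valid regions per day
groups = np.ceil(ranks.to_numpy() / valid * 10)        # decile labels 1..10, NaN stays NaN

# per-day type counts give the target weight of each type within every group
day = 0
type_weights = pd.Series(B[day]).value_counts() / 10   # NaN entries are skipped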
I would like to create a Python function to linearly interpolate within a partly empty grid and get nearest-neighbour extrapolation out of bounds.
Let's say I have the following data stored in pandas DataFrame:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: x = [0,1,2,3,4]
In [4]: y = [0.5,1.5,2.5,3.5,4.5,5.5]
In [5]: z = np.array([[np.nan,np.nan,1.5,2.0,5.5,3.5],[np.nan,1.0,4.0,2.5,4.5,3.0],[2.0,0.5,6.0,1.5,3.5,np.nan],[np.nan,1.5,4.0,2.0,np.nan,np.nan],[np.nan,np.nan,2.0,np.nan,np.nan,np.nan]])
In [6]: df = pd.DataFrame(z,index=x,columns=y)
In [7]: df
Out[7]:
0.5 1.5 2.5 3.5 4.5 5.5
0 NaN NaN 1.5 2.0 5.5 3.5
1 NaN 1.0 4.0 2.5 4.5 3.0
2 2.0 0.5 6.0 1.5 3.5 NaN
3 NaN 1.5 4.0 2.0 NaN NaN
4 NaN NaN 2.0 NaN NaN NaN
I would like to get a function myInterp that returns a linear interpolation within the data boundaries (i.e. over non-NaN values) and a nearest extrapolation outside the bounds (i.e. where values are NaN or missing), such as:
In [1]: myInterp([1.5,2.5]) #linear interpolation
Out[1]: 5.0
In [2]: myInterp([1.5,4.0]) #bi-linear interpolation
Out[2]: 3.0
In [3]: myInterp([0.0,2.0]) #nearest extrapolation (inside grid)
Out[3]: 1.5
In [4]: myInterp([5.0,2.5]) #nearest extrapolation (outside grid)
Out[4]: 2.0
I tried many combination of scipy.interpolate package with no success, does anyone have a suggestion how to do it ?
Yes, unfortunately scipy doesn't deal with NaNs.
From the docs:
Note that calling interp2d with NaNs present in input values results in undefined behaviour.
Even masking the NaNs in a np.ma.masked_array was not successful.
So my advice is to remove all the NaN entries from z, taking the opportunity to give sp.interp2d the full lists of x- and y-coordinates for only the valid data, with z flattened to 1-D as well:
X = []; Y = []; Z = []               # initialize new 1-D lists for interp2d
for i, xi in enumerate(x):           # iterate through x
    for k, yk in enumerate(y):       # iterate through y
        if not np.isnan(z[i, k]):    # check if the z-value is valid...
            X.append(xi)             # ...and if so, append coordinates and value
            Y.append(yk)
            Z.append(z[i, k])
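The same filtering can be done without explicit loops, using a mesh grid and a validity mask (an equivalent sketch for the x, y, z defined above):

xx, yy = np.meshgrid(x, y, indexing='ij')   # xx[i, k] = x[i], yy[i, k] = y[k]
valid = ~np.isnan(z)                        # True where z has a real value
X, Y, Z = xx[valid], yy[valid], z[valid]    # 1-D arrays of valid points only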
This way at least sp.interp2d works and gives a result:
ip = sp.interpolate.interp2d(X,Y,Z)
However, the values in the result won't please you:
In: ip(x,y)
Out:
array([[ 18.03583061, -0.44933642, 0.83333333, -1. , -1.46105542],
[ 9.76791531, 1.3014037 , 2.83333333, 1.5 , 0.26947229],
[ 1.5 , 3.05214381, 4.83333333, 4. , 2. ],
[ 2. , 3.78378051, 1.5 , 2. , 0.8364618 ],
[ 5.5 , 3.57039277, 3.5 , -0.83019815, -0.7967441 ],
[ 3.5 , 3.29227922, 17.29607177, 0. , 0. ]])
compared to the input data:
In:z
Out:
array([[ nan, nan, 1.5, 2. , 5.5, 3.5],
[ nan, 1. , 4. , 2.5, 4.5, 3. ],
[ 2. , 0.5, 6. , 1.5, 3.5, nan],
[ nan, 1.5, 4. , 2. , nan, nan],
[ nan, nan, 2. , nan, nan, nan]])
But IMHO this is because the gradient changes in your data are far too high, all the more so given the low number of data samples.
I hope this is just a test data set and your real application has smoother gradients and more samples. Then I'd be glad to hear if it works...
However, a trivial test with a zero-gradient array, disrupted only a little by NaNs, suggests that interpolation should work, while extrapolation is only partly correct:
In:ip(x,y)
Out:
array([[ 3. , 3. , 3. , 3. , 0. ],
[ 3. , 3. , 3. , 3. , 1.94701008],
[ 3. , 3. , 3. , 3. , 3. ],
[ 3. , 3. , 3. , 3. , 1.54973345],
[ 3. , 3. , 3. , 3. , 0.37706713],
[ 3. , 3. , 2.32108317, 0.75435203, 0. ]])
resulting from the trivial test input
In:z
Out:
array([[ nan, nan, 3., 3., 3., 3.],
[ nan, 3., 3., nan, 3., 3.],
[ 3., 3., 3., 3., 3., nan],
[ nan, 3., 3., 3., nan, nan],
[ nan, nan, 3., nan, nan, nan]])
PS: Looking closer at the right-hand side: some valid entries are even completely changed, i.e. made wrong, which would introduce errors into any subsequent analysis.
But surprise: the cubic version performs much better here:
In:ip = sp.interpolate.interp2d(X,Y,Z, kind='cubic')
In:ip(x,y)
Out:
array([[ 3. , 3. , 3. , 3.02397028, 3.0958811 ],
[ 3. , 3. , 3. , 3. , 3. ],
[ 3. , 3. , 3. , 3. , 3. ],
[ 3. , 3. , 3. , 3. , 3. ],
[ 3. , 3. , 3. , 2.97602972, 2.9041189 ],
[ 3. , 3. , 3. , 2.9041189 , 2.61647559]])
In:z
Out:
array([[ nan, nan, 3., 3., 3., 3.],
[ nan, 3., 3., nan, 3., 3.],
[ 3., 3., 3., 3., 3., nan],
[ nan, 3., 3., 3., nan, nan],
[ nan, nan, 3., nan, nan, nan]])
Since scipy.interpolate.interp2d doesn't deal with NaNs, the solution is to fill the NaNs in the DataFrame before using interp2d. This can be done with the pandas interpolate method.
For the previous example, the following provides the desired output:
In [1]: from scipy.interpolate import interp2d
In [2]: df.interpolate(limit_direction='both',axis=1,inplace=True)
In [3]: myInterp = interp2d(df.index,df.columns,df.values.T)
In [4]: myInterp(1.5,2.5)
Out[4]: array([5.])
In [5]: myInterp(1.5,4.0)
Out[5]: array([3.])
In [6]: myInterp(0.0,2.0)
Out[6]: array([1.5])
In [7]: myInterp(5.0,2.5)
Out[7]: array([2.])
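Note that interp2d is deprecated in recent SciPy releases and removed in the newest ones. If it is unavailable, the same fill-then-interpolate idea can be written with RegularGridInterpolator; a sketch (assuming df has already been NaN-filled as above, with queries clipped to the grid to emulate the nearest extrapolation):

import numpy as np
from scipy.interpolate import RegularGridInterpolator

xs = df.index.to_numpy(dtype=float)      # grid rows (x)
ys = df.columns.to_numpy(dtype=float)    # grid columns (y)
rgi = RegularGridInterpolator((xs, ys), df.to_numpy(), method='linear')

def myInterp(pt):
    # clip out-of-bounds queries to the grid edge -> nearest extrapolation
    return rgi([[np.clip(pt[0], xs[0], xs[-1]),
                 np.clip(pt[1], ys[0], ys[-1])]])[0]

myInterp([5.0, 2.5])   # ~2.0, as with the interp2d version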
My array is a 2D matrix that contains numpy.nan values as well as negative and positive values:
>>> array
array([[ nan, nan, nan, ..., -0.04891211,
nan, nan],
[ nan, nan, nan, ..., nan,
nan, nan],
[ nan, nan, nan, ..., nan,
nan, nan],
...,
[-0.02510989, -0.02520096, -0.02669156, ..., nan,
nan, nan],
[-0.02725595, -0.02715945, -0.0286231 , ..., nan,
nan, nan],
[ nan, nan, nan, ..., nan,
nan, nan]], dtype=float32)
(There are positive numbers in the array, they just don't show in the preview.)
And I want to replace all the positive numbers with a number and all the negative numbers with another number.
How can I perform that using python/numpy?
(For the record, the matrix is the result of a geoimage, on which I want to perform a classification.)
The fact that you have np.nan in your array should not matter. Just use fancy indexing:
x[x>0] = new_value_for_pos
x[x<0] = new_value_for_neg
If you want to replace your np.nans:
x[np.isnan(x)] = something_not_nan
More info on fancy indexing can be found in the NumPy indexing tutorial and the NumPy reference documentation.
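A minimal demonstration that NaNs are left untouched (comparisons with NaN evaluate to False, so NaN entries never match either mask):

import numpy as np

x = np.array([[np.nan, -0.5], [0.25, np.nan]], dtype=np.float32)
x[x > 0] = 1.0    # positives -> 1.0
x[x < 0] = 2.0    # negatives -> 2.0
print(x)          # [[nan  2.] [ 1. nan]]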
Try:
a[a>0] = 1
a[a<0] = -1
To add to or subtract from the current value instead (np.nan is not affected):
import numpy as np
a = np.arange(-10, 10).reshape((4, 5))
print(a)
a[a<0] = a[a<0] - 2
a[a>0] = a[a>0] + 2
print("after -")
print(a)
output
[[-10 -9 -8 -7 -6]
[ -5 -4 -3 -2 -1]
[ 0 1 2 3 4]
[ 5 6 7 8 9]]
after -
[[-12 -11 -10 -9 -8]
[ -7 -6 -5 -4 -3]
[ 0 3 4 5 6]
[ 7 8 9 10 11]]
Pierre's answer doesn't work if new_value_for_pos is negative. In that case, you could use np.where() in a chain:
# Example values
x = np.array([np.nan, -0.2, 0.3])
new_value_for_pos = -1
new_value_for_neg = 2
x[:] = np.where(x>0, new_value_for_pos, np.where(x<0, new_value_for_neg, x))
Result:
array([nan, 2., -1.])
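np.select can express the same chained conditions more readably (an equivalent sketch; note that an array default requires a reasonably recent NumPy):

x = np.array([np.nan, -0.2, 0.3])
# first matching condition wins; unmatched entries (here NaN) keep their value
x = np.select([x > 0, x < 0], [new_value_for_pos, new_value_for_neg], default=x)
# array([nan,  2., -1.])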