Given a parameter p, which can be any float or integer; for example, let p = 4. Writing the per-step decay factor as f = (0.5)^(1/p), the numbers evolve as in the following table:
time            1     2              3              4              5
Numbers         a1    a1*f^(2-1)     a1*f^(3-1)     a1*f^(4-1)     a1*f^(5-1)
Numbers         nan   a2             a2*f^(3-2)     a2*f^(4-2)     a2*f^(5-2)
Numbers         nan   nan            a3             a3*f^(4-3)     a3*f^(5-3)
Numbers         nan   nan            nan            a4             a4*f^(5-4)
Numbers         nan   nan            nan            nan            a5
Final Results   a1    sum of col 2   sum of col 3   sum of col 4   sum of col 5
Numbers a1, a2, a3, a4, a5, ..., at are given, and our goal is to find the Final Results row. Building on the answer provided by mozway, I wrote the following function, which works well. It solves the problem in a matrix way.
import numpy as np

def hl(p, column):
    # copy_raw is a DataFrame holding the raw numbers in `column`
    a = np.arange(len(copy_raw))
    factors = a[:, None] - a                          # time lag between every pair of rows
    factors = np.where(factors < 0, np.nan, factors)  # future rows contribute nothing
    inter = ((1 / 2) ** (1 / p)) ** factors           # decay f^lag, with f = 0.5 ** (1 / p)
    copy_raw[column] = np.nansum(copy_raw[column].to_numpy() * inter, axis=1)
However, I don't think this method will work well if we are dealing with a large dataframe. Is there a better way to solve the problem? (In this case, faster = better.)
Assuming your number of rows is not too large, you can achieve this with numpy broadcasting:
First create a 2D array of factors:
a = np.arange(len(df))
factors = (a[:,None]-a)
factors = np.where(factors<0, np.nan, factors)
# array([[ 0., nan, nan, nan, nan],
# [ 1., 0., nan, nan, nan],
# [ 2., 1., 0., nan, nan],
# [ 3., 2., 1., 0., nan],
# [ 4., 3., 2., 1., 0.]])
Then map to your data and sum:
df['number2'] = np.nansum(df['number'].to_numpy()*(1/2)**factors, axis=1)
example output:
Index Time number number2
0 0 1997-WK01 1 1.0000
1 1 1997-WK02 2 2.5000
2 2 1997-WK03 3 4.2500
3 3 1997-WK04 2 4.1250
4 4 1997-WK05 4 6.0625
intermediate:
df['number'].to_numpy()*(1/2)**factors
# array([[1. , nan, nan, nan, nan],
# [0.5 , 2. , nan, nan, nan],
# [0.25 , 1. , 3. , nan, nan],
# [0.125 , 0.5 , 1.5 , 2. , nan],
# [0.0625, 0.25 , 0.75 , 1. , 4. ]])
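If the quadratic time and memory of the broadcasting approach becomes a problem on a large dataframe, the same Final Results can be computed with a linear-time recurrence: each row's result is the previous result decayed by one step plus the new value. A minimal sketch, assuming a df with a 'number' column as above; the helper name hl_recursive is just for illustration:

import numpy as np
import pandas as pd

def hl_recursive(values, p):
    # out[t] = values[t] + f * out[t-1], with decay factor f = 0.5 ** (1 / p)
    f = 0.5 ** (1 / p)
    out = np.empty(len(values))
    acc = 0.0
    for t, v in enumerate(values):
        acc = v + f * acc          # previous running total decays one step, new value is added
        out[t] = acc
    return out

df = pd.DataFrame({'number': [1, 2, 3, 2, 4]})
df['number2'] = hl_recursive(df['number'].to_numpy(), p=1)
# [1.0, 2.5, 4.25, 4.125, 6.0625] -- matches the example output above, which effectively uses p = 1

The Python-level loop touches each value only once, so it scales linearly; if it is still too slow it can be compiled with a tool like numba, but that is beyond this sketch.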
Let's say I have data for 3 variables, A, B, and C (in my actual application the number of variables is anywhere from 1000 to 3000, but could be even higher).
Let's also say that there are pieces of the data that come in arrays.
For example:
Array X:
np.array([[ 0., 2., 3.],
[ -2., 0., 4.],
[ -3., -4., 0.]])
Where:
X[0,0] corresponds to data for variables A and A
X[0,1] corresponds to data for variables A and B
X[0,2] corresponds to data for variables A and C
X[1,0] corresponds to data for variables B and A
X[1,1] corresponds to data for variables B and B
X[1,2] corresponds to data for variables B and C
X[2,0] corresponds to data for variables C and A
X[2,1] corresponds to data for variables C and B
X[2,2] corresponds to data for variables C and C
Array Y:
np.array([[2,12],
[-12, 2]])
Y[0,0] corresponds to data for variables A and C
Y[0,1] corresponds to data for variables A and B
Y[1,0] corresponds to data for variables B and A
Y[1,1] corresponds to data for variables C and A
Array Z:
np.array([[ 99, 77],
[-77, -99]])
Z[0,0] corresponds to data for variables A and C
Z[0,1] corresponds to data for variables B and C
Z[1,0] corresponds to data for variables C and B
Z[1,1] corresponds to data for variables C and A
I want to concatenate the above arrays keeping the variable position fixed as follows:
END_RESULT_ARRAY index 0 corresponds to variable A
END_RESULT_ARRAY index 1 corresponds to variable B
END_RESULT_ARRAY index 2 corresponds to variable C
Basically, there are N variables in the universe, but the set can change every month (new ones can be introduced and existing ones can drop out and then return, or never return). Within the N variables in the universe I compute permutation pairs, and the position of each variable is fixed, i.e. index 0 corresponds to variable A and index 1 corresponds to variable B (as described above).
Given the above requirement the end END_RESULT_ARRAY should look like the following:
array([[[ 0., 2., 3.],
[ -2., 0., 4.],
[ -3., -4., 0.]],
[[ nan, 12., 2.],
[-12., nan, nan],
[ 2., nan, nan]],
[[ nan, nan, 99.],
[ nan, nan, 77.],
[-99., -77., nan]]])
Keep in mind that the above is an illustration.
In my actual application, I have about 125 arrays and a new one is generated every month. Each monthly array may have a different size and may only have data for a portion of the variables defined in my universe. Also, as new arrays are created each month, there is no way of knowing in advance what size they will be or which variables will have data (or which ones will be missing).
So up until the most recent monthly array, we can determine the max size from the available historical data. Each month we will have to re-check the max size of all the arrays as a new array becomes available. Once we have the max size we can then re-stitch/concatenate all the arrays together IF THIS IS SOMETHING THAT IS DOABLE in numpy. This will be an ongoing operation done every month.
I want a general mechanism to be able to stitch these arrays together keeping the requirements I describe regarding the index position for the variables fixed.
I actually want to use h5py arrays, as my data set will grow exponentially in the not-too-distant future. However, I would like to get this working with numpy as a first step.
Based on the comment made by @user3483203, the next step is to concatenate the arrays:

import numpy as np

a = np.array([[ 0.,  2.,  3.],
              [-2.,  0.,  4.],
              [-3., -4.,  0.]])
b = np.array([[0, 12], [-12, 0]])

out = np.full_like(a, np.nan)   # NaN canvas with the target shape
i, j = b.shape
out[:i, :j] = b                 # place the smaller array in the top-left corner
res = np.array([a, out])
print(res)
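Padding alone keeps each array anchored at the top-left corner, but arrays like Y and Z above also need their entries moved to the fixed positions of the universe. The raw arrays do not carry the variable labels, so the sketch below assumes each month's data can be expressed as (row variable, column variable, value) records; the helper name stitch and the record lists are hypothetical:

import numpy as np

def stitch(monthly_records, universe):
    # Scatter each month's labelled records into a NaN canvas whose
    # rows and columns follow the fixed ordering of the universe.
    pos = {v: i for i, v in enumerate(universe)}
    n = len(universe)
    out = np.full((len(monthly_records), n, n), np.nan)
    for layer, records in enumerate(monthly_records):
        for row_var, col_var, value in records:
            out[layer, pos[row_var], pos[col_var]] = value
    return out

universe = ['A', 'B', 'C']
y_records = [('A', 'C', 2), ('A', 'B', 12), ('B', 'A', -12), ('C', 'A', 2)]
z_records = [('A', 'C', 99), ('B', 'C', 77), ('C', 'B', -77), ('C', 'A', -99)]
res = stitch([y_records, z_records], universe)
# res[0] -> [[nan, 12., 2.], [-12., nan, nan], [2., nan, nan]]

Because the canvas is rebuilt from the universe each month, adding a new variable only requires appending it to universe; older months simply leave its slots as NaN.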
This answers the original question which has since been changed:
Let's say I have the following arrays:
np.array([[ 0., 2., 3.],
[ -2., 0., 4.],
[ -3., -4., 0.]])
np.array([[0,12],
[-12, 0]])
I want to concatenate the above 2 arrays such that the end result is
as follows:
array([[[0, 2, 3],
[-2, 0, 4],
[-3,-4, 0]],
[[0,12, np.nan],
[-12, 0, np.nan],
[np.nan, np.nan, np.nan]]])
Find out how much each array falls short of the maximum size in each dimension, then use np.pad to pad at the end of each dimension, and finally np.stack to stack them together:
import numpy as np

a = np.arange(12).reshape(4, 3).astype(float)
b = np.arange(4).reshape(1, 4).astype(float)
arrs = (a, b)

dims = len(arrs[0].shape)
maxshape = tuple(max(x.shape[i] for x in arrs) for i in range(dims))
paddedarrs = [np.pad(x, tuple((0, maxshape[i] - x.shape[i]) for i in range(dims)),
                     'constant', constant_values=(np.nan,)) for x in arrs]
c = np.stack(paddedarrs, 0)

print(a)
print(b, "\n======================")
print(c)
[[ 0. 1. 2.]
[ 3. 4. 5.]
[ 6. 7. 8.]
[ 9. 10. 11.]]
[[0. 1. 2. 3.]]
======================
[[[ 0. 1. 2. nan]
[ 3. 4. 5. nan]
[ 6. 7. 8. nan]
[ 9. 10. 11. nan]]
[[ 0. 1. 2. 3.]
[nan nan nan nan]
[nan nan nan nan]
[nan nan nan nan]]]
Say I have a numpy array:
a=np.array([[7,2,4],[1.2,7.4,3],[1.5,3.6,3.4]])
My goal is to replace rows which contain non-integer floats with a row of NaNs, and so far this is my attempt:
a[a.dtype==float]=np.nan
This works, but only for the first row that should be NaN; the second row that should be NaN is left alone.
So my desired output would look like:
[[ 7. 2. 4.]
[ nan nan nan]
[ nan nan nan]]
Try rounding:
a[np.round(a)!=a] = np.nan
#array([[ 7., 2., 4.],
# [nan, nan, 3.],
# [nan, nan, nan]])
a.dtype==float returns True, so indexing with it doesn't really make sense. Also, all of your values are floats (you can check this with type(a[0][0])).
You could use the .is_integer method on floats, but I think np.mod will be faster:
a[np.mod(a, 1) != 0] = np.nan
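Both snippets above blank individual non-integer elements; to reproduce the desired output, where entire rows become NaN, the element mask can be collapsed per row. A minimal sketch using the same example array:

import numpy as np

a = np.array([[7, 2, 4], [1.2, 7.4, 3], [1.5, 3.6, 3.4]])
a[(np.mod(a, 1) != 0).any(axis=1)] = np.nan   # any non-integer entry blanks the whole row
# array([[ 7.,  2.,  4.],
#        [nan, nan, nan],
#        [nan, nan, nan]])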
I would like to create a python function to linearly interpolate within a partly empty grid and get a nearest extrapolation out of bounds.
Let's say I have the following data stored in pandas DataFrame:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: x = [0,1,2,3,4]
In [4]: y = [0.5,1.5,2.5,3.5,4.5,5.5]
In [5]: z = np.array([[np.nan,np.nan,1.5,2.0,5.5,3.5],[np.nan,1.0,4.0,2.5,4.5,3.0],[2.0,0.5,6.0,1.5,3.5,np.nan],[np.nan,1.5,4.0,2.0,np.nan,np.nan],[np.nan,np.nan,2.0,np.nan,np.nan,np.nan]])
In [6]: df = pd.DataFrame(z,index=x,columns=y)
In [7]: df
Out[7]:
0.5 1.5 2.5 3.5 4.5 5.5
0 NaN NaN 1.5 2.0 5.5 3.5
1 NaN 1.0 4.0 2.5 4.5 3.0
2 2.0 0.5 6.0 1.5 3.5 NaN
3 NaN 1.5 4.0 2.0 NaN NaN
4 NaN NaN 2.0 NaN NaN NaN
I would like a function myInterp that returns a linear interpolation within the data boundaries (i.e. not NaN values) and the nearest extrapolation outside the bounds (i.e. NaN or no values), such as:
In [1]: myInterp([1.5,2.5]) #linear interpolation
Out[1]: 5.0
In [2]: myInterp([1.5,4.0]) #bi-linear interpolation
Out[2]: 3.0
In [3]: myInterp([0.0,2.0]) #nearest extrapolation (inside grid)
Out[3]: 1.5
In [4]: myInterp([5.0,2.5]) #nearest extrapolation (outside grid)
Out[4]: 2.0
I tried many combinations of the scipy.interpolate package with no success. Does anyone have a suggestion how to do it?
Yes, unfortunately scipy's interp2d doesn't deal with NaNs.
From the docs:
Note that calling interp2d with NaNs present in input values results in undefined behaviour.
Even masking the NaNs in a np.ma.masked_array was not successful.
So my advice would be to remove all the NaN entries from z, taking the opportunity to give sp.interp2d the full list of x- and y-coordinates for only the valid data, and leave z 1-D as well:
X = []; Y = []; Z = []              # initialize new 1-D lists for interp2d
for i, xi in enumerate(x):          # iterate through x
    for k, yk in enumerate(y):      # iterate through y
        if not np.isnan(z[i, k]):   # check if z-value is valid...
            X.append(xi)            # ...and if so, append coordinates and value to the prepared lists
            Y.append(yk)
            Z.append(z[i, k])
This way at least sp.interp2d works and gives a result:
ip = sp.interpolate.interp2d(X,Y,Z)
However, the values in the result won't please you:
In: ip(x,y)
Out:
array([[ 18.03583061, -0.44933642, 0.83333333, -1. , -1.46105542],
[ 9.76791531, 1.3014037 , 2.83333333, 1.5 , 0.26947229],
[ 1.5 , 3.05214381, 4.83333333, 4. , 2. ],
[ 2. , 3.78378051, 1.5 , 2. , 0.8364618 ],
[ 5.5 , 3.57039277, 3.5 , -0.83019815, -0.7967441 ],
[ 3.5 , 3.29227922, 17.29607177, 0. , 0. ]])
compared to the input data:
In:z
Out:
array([[ nan, nan, 1.5, 2. , 5.5, 3.5],
[ nan, 1. , 4. , 2.5, 4.5, 3. ],
[ 2. , 0.5, 6. , 1.5, 3.5, nan],
[ nan, 1.5, 4. , 2. , nan, nan],
[ nan, nan, 2. , nan, nan, nan]])
But IMHO this is because the gradient changes in your data are far too high. Even more with respect to the low number of data samples.
I hope this is just a test data set and your real application has smoother gradients and some more samples. Then I'd be glad to hear if it works...
However, a trivial test with an array of zero gradient, only disturbed a little by NaNs, gives a hint that interpolation should work, while extrapolation is only partly correct:
In:ip(x,y)
Out:
array([[ 3. , 3. , 3. , 3. , 0. ],
[ 3. , 3. , 3. , 3. , 1.94701008],
[ 3. , 3. , 3. , 3. , 3. ],
[ 3. , 3. , 3. , 3. , 1.54973345],
[ 3. , 3. , 3. , 3. , 0.37706713],
[ 3. , 3. , 2.32108317, 0.75435203, 0. ]])
resulting from the trivial test input
In:z
Out:
array([[ nan, nan, 3., 3., 3., 3.],
[ nan, 3., 3., nan, 3., 3.],
[ 3., 3., 3., 3., 3., nan],
[ nan, 3., 3., 3., nan, nan],
[ nan, nan, 3., nan, nan, nan]])
PS: Looking closer at the right-hand side, some valid entries are even completely changed, i.e. made wrong, which would introduce errors into any subsequent analysis.
But surprise: the cubic version performs much better here:
In:ip = sp.interpolate.interp2d(X,Y,Z, kind='cubic')
In:ip(x,y)
Out:
array([[ 3. , 3. , 3. , 3.02397028, 3.0958811 ],
[ 3. , 3. , 3. , 3. , 3. ],
[ 3. , 3. , 3. , 3. , 3. ],
[ 3. , 3. , 3. , 3. , 3. ],
[ 3. , 3. , 3. , 2.97602972, 2.9041189 ],
[ 3. , 3. , 3. , 2.9041189 , 2.61647559]])
In:z
Out:
array([[ nan, nan, 3., 3., 3., 3.],
[ nan, 3., 3., nan, 3., 3.],
[ 3., 3., 3., 3., 3., nan],
[ nan, 3., 3., 3., nan, nan],
[ nan, nan, 3., nan, nan, nan]])
Since scipy's interp2d doesn't deal with NaNs, the solution is to fill the NaNs in the DataFrame before using interp2d. This can be done with the pandas DataFrame.interpolate method.
In the previous example, the following provides the desired output:
In [1]: from scipy.interpolate import interp2d
In [2]: df.interpolate(limit_direction='both', axis=1, inplace=True)
In [3]: myInterp = interp2d(df.index,df.columns,df.values.T)
In [4]: myInterp(1.5,2.5)
Out[4]: array([5.])
In [5]: myInterp(1.5,4.0)
Out[5]: array([3.])
In [6]: myInterp(0.0,2.0)
Out[6]: array([1.5])
In [7]: myInterp(5.0,2.5)
Out[7]: array([2.])
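If you would rather not fill the grid first (and interp2d is deprecated in newer SciPy releases), the valid points can be treated as scattered data and a linear interpolator combined with a nearest-neighbour fallback. This is a sketch, not a drop-in equivalent: inside the data region the values come from a triangulation rather than a filled grid, so they can differ slightly from the interp2d results above. The wrapper name myInterp is reused for illustration:

import numpy as np
from scipy.interpolate import LinearNDInterpolator, NearestNDInterpolator

x = [0, 1, 2, 3, 4]
y = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
z = np.array([[np.nan, np.nan, 1.5, 2.0, 5.5, 3.5],
              [np.nan, 1.0, 4.0, 2.5, 4.5, 3.0],
              [2.0, 0.5, 6.0, 1.5, 3.5, np.nan],
              [np.nan, 1.5, 4.0, 2.0, np.nan, np.nan],
              [np.nan, np.nan, 2.0, np.nan, np.nan, np.nan]])

# Keep only the valid (non-NaN) grid points as scattered samples
xx, yy = np.meshgrid(x, y, indexing='ij')
valid = ~np.isnan(z)
pts = np.column_stack([xx[valid], yy[valid]])
vals = z[valid]

lin = LinearNDInterpolator(pts, vals)      # linear inside the convex hull, NaN outside
near = NearestNDInterpolator(pts, vals)    # nearest neighbour everywhere

def myInterp(p):
    v = lin(p)
    return np.where(np.isnan(v), near(p), v)   # fall back to nearest outside the hull

myInterp([1.5, 2.5])   # linear interpolation between the valid neighbours
myInterp([5.0, 2.5])   # nearest extrapolation outside the grid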
I have a numpy array as:
myArray
array([[ 1. , nan, nan, nan, nan],
[ 1. , nan, nan, nan, nan],
[ 0.63 , 0.79 , 1. , nan, nan],
[ 0.25 , 0.4 , 0.64 , 0.84 , nan]])
I need to find, for each row, the column number of the max value, but the max has to be less than 1.
In the above array, rows 0 and 1 should return NaN.
Row 2 should return 1.
Row 3 should return 3.
I am not sure how to condition this on argmax.
Here's one approach with np.where -
m = a < 1   # mask of elements < 1 (NaN comparisons are False, so NaNs are excluded too)

# Set NaNs and elements >= 1 to the global minimum minus 1,
# so that argmax ignores them
idx0 = np.where(m, a, np.nanmin(a) - 1).argmax(1)

# Rows with no valid (non-NaN and < 1) element get NaN in the output
idx = np.where(m.any(1), idx0, np.nan)
Sample run -
In [97]: a
Out[97]:
array([[ 1. , nan, nan, nan, nan],
[ 1. , nan, nan, nan, nan],
[ 0.63, 0.79, 1. , nan, nan],
[ 0.25, 0.4 , 0.64, 0.84, nan]])
In [98]: m = a < 1
In [99]: idx0 = np.where(m, a,np.nanmin(a)-1).argmax(1)
In [100]: idx0
Out[100]: array([0, 0, 1, 3])
In [101]: np.where(m.any(1), idx0, np.nan)
Out[101]: array([ nan, nan, 1., 3.])
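An alternative that expresses the same idea with masked arrays, which may read more naturally if masking NaNs and out-of-range values is already part of your workflow; a minimal sketch on the same a:

import numpy as np

a = np.array([[1.  , np.nan, np.nan, np.nan, np.nan],
              [1.  , np.nan, np.nan, np.nan, np.nan],
              [0.63, 0.79  , 1.    , np.nan, np.nan],
              [0.25, 0.4   , 0.64  , 0.84  , np.nan]])

# Mask everything that is NaN or >= 1, then take argmax of what is left
masked = np.ma.masked_invalid(np.where(a < 1, a, np.nan))
idx = np.where(masked.count(axis=1) > 0, masked.argmax(axis=1), np.nan)
# array([nan, nan, 1., 3.])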
My array is a 2D matrix and it has numpy.nan values in addition to negative and positive values:
>>> array
array([[ nan, nan, nan, ..., -0.04891211,
nan, nan],
[ nan, nan, nan, ..., nan,
nan, nan],
[ nan, nan, nan, ..., nan,
nan, nan],
...,
[-0.02510989, -0.02520096, -0.02669156, ..., nan,
nan, nan],
[-0.02725595, -0.02715945, -0.0286231 , ..., nan,
nan, nan],
[ nan, nan, nan, ..., nan,
nan, nan]], dtype=float32)
(There are positive numbers in the array, they just don't show in the preview.)
And I want to replace all the positive numbers with a number and all the negative numbers with another number.
How can I perform that using python/numpy?
(For the record, the matrix is the result of a geoimage, on which I want to perform a classification.)
The fact that you have np.nan in your array should not matter. Just use fancy indexing:
x[x>0] = new_value_for_pos
x[x<0] = new_value_for_neg
If you want to replace your np.nans:
x[np.isnan(x)] = something_not_nan
More info on fancy indexing: see the NumPy indexing tutorial and documentation.
Try:
a[a>0] = 1
a[a<0] = -1
To add to or subtract from the current value instead (np.nan is not affected):
import numpy as np

a = np.arange(-10, 10).reshape((4, 5))
print(a)
print("after -")
a[a < 0] = a[a < 0] - 2
a[a > 0] = a[a > 0] + 2
print(a)
output
[[-10 -9 -8 -7 -6]
[ -5 -4 -3 -2 -1]
[ 0 1 2 3 4]
[ 5 6 7 8 9]]
after -
[[-12 -11 -10 -9 -8]
[ -7 -6 -5 -4 -3]
[ 0 3 4 5 6]
[ 7 8 9 10 11]]
Pierre's answer doesn't work if new_value_for_pos is negative. In that case, you could use np.where() in a chain:
# Example values
x = np.array([np.nan, -0.2, 0.3])
new_value_for_pos = -1
new_value_for_neg = 2
x[:] = np.where(x>0, new_value_for_pos, np.where(x<0, new_value_for_neg, x))
Result:
array([nan, 2., -1.])
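For more than two replacement values, chained np.where calls can get hard to read; np.select takes a list of conditions and choices and likewise evaluates all conditions on the original data before replacing anything. A minimal sketch with the same example values:

import numpy as np

x = np.array([np.nan, -0.2, 0.3])
new_value_for_pos = -1
new_value_for_neg = 2

# Conditions are checked against the original x, so the order of replacements cannot clash
x = np.select([x > 0, x < 0], [new_value_for_pos, new_value_for_neg], default=x)
# array([nan,  2., -1.])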