My data looks like:
list=[44359, 16610, 8364, ..., 1, 1, 1]
For each element i in the list I want to compute i * ([i+1] + [i-1]) / 2, where i is an element of the list and [i+1] and [i-1] are its adjacent elements.
For some reason I cannot seem to do this cleanly in NumPy.
Here's what I've tried:
weights = []
weights.append(1)
for i in range(len(hoff[3]) - 1):
    weights.append((hoff[3][i-1] + hoff[3][i+1]) / 2)
Where I append 1 to the weights list so that the lengths will match at the end. I arbitrarily picked 1; I'm not sure how to deal with the leftmost and rightmost points either.
You can use numpy's array operations to represent your "loop". If you think of your data as below, where pL and pR are the values you choose to "pad" your data with on the left and right:
[pL, 0, 1, 2, ..., N-2, N-1, pR]
What you're trying to do is this:
[0, ..., N-1] * ([pL, 0, ..., N-2] + [1, ..., N-1, pR]) / 2
Written in code it looks something like this:
import numpy as np
data = np.random.random(10)
padded = np.concatenate(([data[0]], data, [data[-1]]))
data * (padded[:-2] + padded[2:]) / 2.
Repeating the first and last value is known as "extending" in image processing, but there are other edge handling methods you could try.
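For what it's worth, numpy.pad can express this edge handling directly; a minimal sketch (mode='edge' repeats the border values, and other modes such as 'reflect' or 'constant' give the alternatives mentioned above):
import numpy as np

data = np.random.random(10)
padded = np.pad(data, 1, mode='edge')  # [data[0], data[0], ..., data[-1], data[-1]]
weights = data * (padded[:-2] + padded[2:]) / 2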
I would use pandas for this, filling in the missing left- and right-most values with 1 (but you can use any value you want):
import numpy
import pandas
numpy.random.seed(0)
data = numpy.random.randint(0, 10, size=15)
df = (
pandas.DataFrame({'hoff': data})
.assign(before=lambda df: df['hoff'].shift(1).fillna(1).astype(int))
.assign(after=lambda df: df['hoff'].shift(-1).fillna(1).astype(int))
.assign(weight=lambda df: df['hoff'] * df[['before', 'after']].mean(axis=1))
)
print(df.to_string(index=False))
And that gives me:
hoff before after weight
5 1 0 2.5
0 5 3 0.0
3 0 3 4.5
3 3 7 15.0
7 3 9 42.0
9 7 3 45.0
3 9 5 21.0
5 3 2 12.5
2 5 4 9.0
4 2 7 18.0
7 4 6 35.0
6 7 8 45.0
8 6 8 56.0
8 8 1 36.0
1 8 1 4.5
A pure numpy-based solution would look like this (again, filling with 1):
before_after = numpy.ones((data.shape[0], 2))  # the 1s double as the fill value at the edges
before_after[1:, 0] = data[:-1]  # column 0: the value before each element
before_after[:-1, 1] = data[1:]  # column 1: the value after each element
weights = data * before_after.mean(axis=1)
print(weights)
[ 2.5   0.    4.5  15.   42.   45.   21.   12.5   9.   18.   35.   45.
 56.   36.    4.5]
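The same neighbour average can also be phrased as a convolution, if you prefer; a sketch that again pads both ends with 1:
padded = numpy.concatenate(([1], data, [1]))
weights = data * numpy.convolve(padded, [0.5, 0, 0.5], mode='valid')
The [0.5, 0, 0.5] kernel takes half of each neighbour and ignores the centre value, so mode='valid' lines the result up with the original data.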
Related
Suppose I have a vector like so:
s = pd.Series(range(50))
The rolling mean over, let's say, a 2-element window is easily calculated:
s.rolling(window=2, min_periods=2).mean()
0 NaN
1 0.5
2 1.5
3 2.5
4 3.5
5 4.5
6 5.5
7 6.5
8 7.5
9 8.5
...
Now I don't want the window to consist of 2 adjacent elements; instead I want it to take, e.g., every third element, still only averaging the last 2 of them. It would result in this vector:
0 NaN
1 NaN
2 NaN
3 1.5 -- (3+0)/2
4 2.5 -- (4+1)/2
5 3.5 -- (5+2)/2
6 4.5 -- ...
7 5.5
8 6.5
9 7.5
...
How can I achieve this efficiently?
Thanks!
You can use np.lib.stride_tricks.as_strided, which builds a view onto the array with custom strides (the number of bytes to step in each dimension when traversing it).
import numpy as np
arr = np.arange(10)
strided = np.lib.stride_tricks.as_strided(arr, shape=(len(arr) - 3, 2), strides=(arr.itemsize, 3 * arr.itemsize))  # each row is (arr[i], arr[i+3])
result = strided.mean(axis=1)
Output:
array([1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5])
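On NumPy 1.20+, sliding_window_view builds the same pairs without raw stride arithmetic; a sketch:
import numpy as np

arr = np.arange(10)
pairs = np.lib.stride_tricks.sliding_window_view(arr, 4)[:, ::3]  # each row is (arr[i], arr[i+3])
result = pairs.mean(axis=1)
# array([1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5])
Unlike as_strided, this cannot silently read out of bounds if the shape arithmetic is off.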
This is not directly possible with rolling.
A workaround would be:
out = s.add(s.shift(3)).div(2)
Otherwise you need to use the underlying numpy array (see John's answer).
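As a quick sanity check, the workaround reproduces the expected vector:
import pandas as pd

s = pd.Series(range(50))
out = s.add(s.shift(3)).div(2)
print(out.head(6))
# 0    NaN
# 1    NaN
# 2    NaN
# 3    1.5
# 4    2.5
# 5    3.5
# dtype: float64
s.shift(3) moves each value down three places, so row i averages s[i] and s[i-3].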
Assume I have a matrix which is N items long by M columns wide (where M <= N). I want to know the average rank for each of the N items across the M columns.
arr = np.array([
[0,1],
[2,0],
[1,2]
])
I could loop through each of the N values and do something like the following, but I'm wondering whether there's a better approach:
for n in range(3):
    np.where(arr == n)[0].mean()
Edit
Sorry, it seems my choice of example has caused some confusion. To better illustrate, let me swap in letters since the values in the matrix are identifiers, not numbers to be calculated on.
arr = np.array([
['A','B'],
['C','A'],
['B','C']
])
I am not trying to do a simple row-wise average. I'm trying to say that:
A's average rank is (0 + 1) / 2 = 0.5
B's average rank is (0 + 2) / 2 = 1.0
C's average rank is (1 + 2) / 2 = 1.5
Hopefully this clarifies my request.
It looks like you want to get the mean of your array along a certain axis. You can do this using the axis= argument of numpy.mean:
import numpy as np
arr = np.array([
[0,1],
[2,0],
[1,2]
])
np.mean(arr, axis=1)
# [ 0.5 1. 1.5]
If you want the row-wise mean:
>>> np.mean(arr, axis=1)
array([ 0.5, 1. , 1.5])
To get the rank (as per OP's description), first generate a 2D array of indices:
import numpy as np
M = 5
N = 7
narray = np.tile(np.arange(N), M).reshape(N, M)
print(narray)
Output:
[[0 1 2 3 4]
[5 6 0 1 2]
[3 4 5 6 0]
[1 2 3 4 5]
[6 0 1 2 3]
[4 5 6 0 1]
[2 3 4 5 6]]
Now take the row-wise mean to get the rank:
mean_value = np.mean(narray, axis=1)
print(mean_value)
Output:
[ 2. 2.8 3.6 3. 2.4 3.2 4. ]
If each of the N items appears exactly once in each column
(i.e. each column is a ranking), you can simply do:
#arr = np.array([['A','B'],['C','A'],['B','C']])
means = arr.argsort(0).mean(1)
#array([ 0.5, 1. , 1.5])
Here is my attempt to "improve" your original solution. It has the benefit of not repeating two (possibly very time-consuming) operations for each value in the array: np.where(arr == n) first finds all values equal to n, then finds the indices of the elements for which that equality is true.
values, inverse, counts = np.unique(arr, return_inverse=True, return_counts=True)
rows = np.argsort(inverse.ravel()) // len(arr[0])  # row index of every occurrence, grouped by value (ravel guards against NumPy versions where inverse comes back with the input's shape)
starts = np.cumsum(counts) - counts  # start offset of each value's group within rows
avranks = np.add.reduceat(rows, starts) / counts
Then, for your original data,
>>> print(avranks)
[0.5 1. 1.5]
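To make the steps concrete, here is what the intermediates look like for the letters example (values worked out by hand):
import numpy as np

arr = np.array([['A', 'B'], ['C', 'A'], ['B', 'C']])
values, inverse, counts = np.unique(arr, return_inverse=True, return_counts=True)
# values -> ['A' 'B' 'C'], inverse (flat) -> [0 1 2 0 1 2], counts -> [2 2 2]
rows = np.argsort(inverse.ravel()) // len(arr[0])
# rows -> [0 1 0 2 1 2]: A sits in rows 0 and 1, B in rows 0 and 2, C in rows 1 and 2
starts = np.cumsum(counts) - counts  # -> [0 2 4]
avranks = np.add.reduceat(rows, starts) / counts
# avranks -> [0.5 1.  1.5]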
I've got a numpy array that looks like this:
1 0 0 0 200 0 0 0 1
6 0 0 0 2 0 0 0 4.3
5 0 0 0 1 0 0 0 7.1
The expected output would be:
1 100 100 100 200 100 100 100 1
6 4 4 4 2 3.15 3.15 3.15 4.3
5 3 3 3 1 4.05 4.05 4.05 7.1
I would like to replace all the 0 values with an average of their neighbours. Any hints welcome! Many thanks!
If the structure in the sample array is preserved throughout your array, then this code will work:
In [159]: def avg_func(r):
     ...:     lavg = (r[0] + r[4]) / 2.0   # average of the values flanking the left run of zeros
     ...:     ravg = (r[4] + r[-1]) / 2.0  # average of the values flanking the right run of zeros
     ...:     r[1:4] = lavg
     ...:     r[5:-1] = ravg
     ...:     return r
In [160]: np.apply_along_axis(avg_func, 1, arr)
Out[160]:
array([[ 1. , 100.5 , 100.5 , 100.5 , 200. , 100.5 , 100.5 ,
100.5 , 1. ],
[ 6. , 4. , 4. , 4. , 2. , 3.15, 3.15,
3.15, 4.3 ],
[ 5. , 3. , 3. , 3. , 1. , 4.05, 4.05,
4.05, 7.1 ]])
But, as you can see, this is kinda messy with the hardcoded indexes; you just have to get creative when you define avg_func. Feel free to improve on this solution. Also, note that this implementation modifies the input array in place.
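If the zero runs really do occupy the same columns in every row (columns 1-3 and 5-7 here), plain slicing avoids apply_along_axis entirely; a sketch under that assumption:
import numpy as np

arr = np.array([[1, 0, 0, 0, 200, 0, 0, 0, 1],
                [6, 0, 0, 0, 2, 0, 0, 0, 4.3],
                [5, 0, 0, 0, 1, 0, 0, 0, 7.1]])

out = arr.copy()
out[:, 1:4] = ((arr[:, 0] + arr[:, 4]) / 2.0)[:, None]  # fill the left block of zeros
out[:, 5:8] = ((arr[:, 4] + arr[:, 8]) / 2.0)[:, None]  # fill the right block of zeros
This also leaves the input untouched, unlike the in-place version above.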
I have a dataframe of 6M+ observations, where 20 of the columns are weights that will be applied to a single score column. I.e., Wgt1 * Wgt2 * Wgt3...* Score. In addition, not each weight is applicable to every observation, so I have created 20 columns that represent a weight mask. I.e., (Wgt1*Msk1) * (Wgt2*Msk2) * (Wgt3*Msk3) ... Score. When the mask is 0, the weight is not applicable; when the mask is 1, it is applicable.
For each row in the dataframe, I want to:
1. Check 2 qualifying conditions that indicate the row should be processed
2. Find the product of the weights, subject to the presence of the corresponding mask (ttl_wgt)
3. Multiply this product by the score (prob) to create a final weighted score
To do this, I have created a user-defined function:
import functools
import operator
import time

import numpy as np

def mymult(a):
    ttl_wgt = float('NaN')  # initialize to NaN
    if ~np.isnan(a['ID']):  # condition 1: only process if an ID is present
        if a['prob'] > -1.0:  # condition 2: only process if our unweighted score is NOT -1.0
            b = np.where(a[msks] == 1)[0]  # indices of the masks that are 1
            ttl_wgt = functools.reduce(operator.mul, a[np.asarray(wgt_nms)[b]], 1)
    return ttl_wgt
I ran out of memory during development, so I decided to chunk it up into 500000 rows at a time. I use a lambda function to apply to the chunk:
msks = ['Msk1','Msk2','Msk3','Msk4',...,'Msk20']
wgt_nms = ['Wgt1','Wgt2','Wgt3','Wgt4',...,'Wgt20']
print('Determining final weights...')
chunksize = 500000 #we'll operate on this many rows at a time
start_time = time.time()
ttl_wgts = [] #initialize list to hold weight products
for i in range(0, len(df), chunksize):
    ttl_wgts.extend(df[i:(i+chunksize)].apply(lambda x: mymult(x), axis=1))
print("--- %s seconds ---" % (time.time() - start_time)) #Expect between 30 and 40 minutes
print('Done!')
Then I assign the ttl_wgts list as a new column in the dataframe, and do the final product of weight * initial score.
# Initialize the fields
# Might not be necessary or even useful
df['ttl_wgt'] = float('NaN')
df['wgt_prob'] = float('NaN')
df['ttl_wgt'] = ttl_wgts
df['wgt_prob'] = df['ttl_wgt'] * df['prob']
I checked out a prior post on multiplying elements in a list. It was great food for thought, but I wasn't able to turn it into anything more efficient for my 6M+ observations. Are there other approaches I should be considering?
Adding an example df, as suggested
A sample of the dataframe might look something like this, with only 3 masks/weights:
df = pd.DataFrame({
    'id': [999999999, 136550, 80010170, 80010177, 90002408, 90002664, 16207501, 62992, np.nan, 80010152],
    'prob': [-1, 0.180274382, 0.448361456, 0.000945058, 0.005060279, 0.009893078, 0.169686288, 0.109541453, 0.117907763, 0.266242921],
    'Msk1': [0, 1, 1, 1, 0, 0, 1, 0, 0, 0],
    'Msk2': [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    'Msk3': [1, 0, 0, 0, 1, 1, 0, 0, 1, 1],
    'Wgt1': [np.nan, 0.919921875, 1.08984375, 1.049804688, np.nan, np.nan, np.nan, 0.91015625, np.nan, 0.810058594],
    'Wgt2': [np.nan, 1.129882813, 1.120117188, 0.970214844, np.nan, np.nan, np.nan, 1.0703125, np.nan, 0.859863281],
    'Wgt3': [np.nan, 1.209960938, 1.23046875, 1, np.nan, np.nan, np.nan, 1.150390625, np.nan, 0.649902344]
})
In the first observation, the prob field is -1, so the row would not be processed. In the second observation, Msk1 is turned on while Msk2 and Msk3 are turned off, so the final weight would be the value of Wgt1 = 0.919922. In the 3rd row, Msk1 and Msk2 are on while Msk3 is off, so the final weight would be Wgt1*Wgt2 = 1.089844*1.120117 = 1.220752.
IIUC, you want to fill in the weights whose mask is off with 1. Then you can multiply them all together with no impact from the masked-out ones. That's the trick. You'll have to apply it as needed.
create msk
msk = df.filter(like='Msk')
print(msk)
Msk1 Msk2 Msk3
0 0 0 1
1 1 0 0
2 1 1 0
3 1 0 0
4 0 0 1
5 0 0 1
6 1 0 0
7 0 1 0
8 0 0 1
9 0 0 1
create wgt
wgt = df.filter(like='Wgt')
print(wgt)
Wgt1 Wgt2 Wgt3
0 NaN NaN NaN
1 0.919922 1.129883 1.209961
2 1.089844 1.120117 1.230469
3 1.049805 0.970215 1.000000
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 0.910156 1.070312 1.150391
8 NaN NaN NaN
9 0.810059 0.859863 0.649902
create new_wgt
new_wgt = np.where(msk, wgt, 1)
print(new_wgt)
[[ 1. 1. nan]
[ 0.91992188 1. 1. ]
[ 1.08984375 1.12011719 1. ]
[ 1.04980469 1. 1. ]
[ 1. 1. nan]
[ 1. 1. nan]
[ nan 1. 1. ]
[ 1. 1.0703125 1. ]
[ 1. 1. nan]
[ 1. 1. 0.64990234]]
final prod_wgt
prod_wgt = pd.Series(new_wgt.prod(1), wgt.index)
print(prod_wgt)
0 NaN
1 0.919922
2 1.220753
3 1.049805
4 NaN
5 NaN
6 NaN
7 1.070312
8 NaN
9 0.649902
dtype: float64
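The two qualifying conditions from the question aren't handled above; a sketch of applying them afterwards, assuming the sample df's column names ('id' and 'prob'):
valid = df['id'].notna() & (df['prob'] > -1.0)  # condition 1: ID present; condition 2: score is not -1.0
df['ttl_wgt'] = prod_wgt.where(valid)  # NaN wherever the row should not be processed
df['wgt_prob'] = df['ttl_wgt'] * df['prob']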
I have a matrix with ints inside it.
I need to replace each 111 with the median of its immediate 4-neighbourhood; if any of the neighbours are 111, they are ignored.
For example:
matrix = 1 2 3 4 5 6
101 111 1 3 44 3
111 3 4 4 5 6
1 2 4 5 7 7
After replacing, the expected output is:
1 2 3 4 5 6
101 2.5 1 3 44 3
3 3 4 4 5 6
1 2 4 5 7 7
My code is pretty bad and probably very slow. Any help appreciated!
def median_fil_mat(matrix):
    rows, columns = np.where(matrix == 111)
    r, c = np.shape(matrix)
    for each_row in rows:
        for each_colmn in columns:
            if each_row == r-1:
                r1 = [each_row-1]
            elif each_row > 0 & each_row != r-1:
                r1 = [each_row-1, each_row+1]
            else:
                r1 = [each_row+1]
            if each_colmn == c-1:
                c1 = [each_colmn-1]
            elif each_colmn > 0 & each_colmn != c-1:
                c1 = [each_colmn-1, each_colmn+1]
            else:
                c1 = [each_colmn+1]
            med_lis = list()
            for rr in r1:
                for cc in c1:
                    med_lis.append(matrix[rr, cc])
            med_lis = [x for x in med_lis if x != 111]
            matrix[each_row, each_colmn] = np.median(med_lis)
    return matrix
import numpy as np
from scipy.ndimage import generic_filter

def func(array):
    # generic_filter passes the footprint-selected values in raster order,
    # so with the plus-shaped footprint below the centre pixel is array[2]
    if array[2] == 111:
        return np.median(array[array != 111])  # median of the neighbourhood, ignoring 111s
    else:
        return array[2]

fp = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])  # the immediate 4-neighbourhood plus the centre
a = np.fromstring("""1 2 3 4 5 6
101 111 1 3 44 3
111 3 4 4 5 6
1 2 4 5 7 7""", sep=" ").reshape((4, 6))
generic_filter(a, func, footprint=fp, mode='nearest')
Returns
array([[ 1. , 2. , 3. , 4. , 5. , 6. ],
[ 101. , 2.5, 1. , 3. , 44. , 3. ],
[ 3. , 3. , 4. , 4. , 5. , 6. ],
[ 1. , 2. , 4. , 5. , 7. , 7. ]])
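If you'd rather avoid scipy, the same 4-neighbour median can be assembled with padding and nanmedian; a sketch, assuming every 111 has at least one non-111 neighbour (otherwise nanmedian warns about an all-NaN slice):
import numpy as np

p = np.pad(a, 1, mode='edge')  # replicate the edges, like mode='nearest' above
neigh = np.stack([p[:-2, 1:-1], p[2:, 1:-1], p[1:-1, :-2], p[1:-1, 2:]])  # up, down, left, right
masked = np.where(neigh == 111, np.nan, neigh)  # drop 111 neighbours from the median
result = np.where(a == 111, np.nanmedian(masked, axis=0), a)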