Fill an array with nan in numpy - python

I have a numpy vector my_indexes of size 1xN which contains boolean values for indexing, and a 2D array my_array of size MxK where K << N. The boolean vector corresponds to the columns that I remove from (or keep in) the array my_array, and I want to add those columns back afterwards, filled with zeros (or NaNs). My code for removing the columns:
cols = np.all(np.isnan(my_array), axis=0)  # boolean mask: True for all-NaN columns
my_array = my_array[:, ~cols]              # drop those columns
my_array = some_process(my_array)
# How can I add the removed columns back?
My array is of size MxN before the removal and MxK afterwards. How can I fill the removed columns again with NaN or zeros?
An example could be:
0.1 nan 0.3 .04 nan 0.12 0.12
0.1 nan 0.3 .04 nan 0.12 0.12
0.1 nan 0.3 .04 nan 0.12 0.12
First, I want to remove the NaN columns using my_array = my_array[:, ~np.all(np.isnan(my_array), axis=0)]:
0.1 0.3 .04 0.12 0.12
0.1 0.3 .04 0.12 0.12
0.1 0.3 .04 0.12 0.12
And the my_indexes vector is:
False True False False True False False
Then I want to process the matrix and afterwards have the NaN columns back (note that the processing cannot happen with the NaN columns present). I guess that I need to use the np.insert function, but how can I do so using my boolean vector?

You can probably use masked arrays for that:
import numpy as np
import numpy.ma as ma
def some_process(x):
    return x**2
x = np.arange(9, dtype=float).reshape(3, 3)
x[:,1] = np.nan
print(x)
# [[ 0. nan 2.]
# [ 3. nan 5.]
# [ 6. nan 8.]]
# mask all np.nan and np.inf
masked_x = ma.masked_invalid(x)
# Compute the process only on the unmasked values and fill back np.nan
x = ma.filled(some_process(masked_x), np.nan)
print(x)
# [[ 0. nan 4.]
# [ 9. nan 25.]
# [ 36. nan 64.]]
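If you do want to physically drop the columns and put them back after processing (as the question asks), a minimal sketch using the boolean column mask could look like this; the names reduced, processed and restored are just for illustration:
import numpy as np

def some_process(x):
    return x**2

my_array = np.array([[0.1, np.nan, 0.3, 0.04, np.nan, 0.12, 0.12],
                     [0.1, np.nan, 0.3, 0.04, np.nan, 0.12, 0.12],
                     [0.1, np.nan, 0.3, 0.04, np.nan, 0.12, 0.12]])

cols = np.all(np.isnan(my_array), axis=0)    # True for the all-NaN columns (the my_indexes vector)
reduced = my_array[:, ~cols]                 # M x K array without the NaN columns
processed = some_process(reduced)

restored = np.full(my_array.shape, np.nan)   # M x N result, pre-filled with NaN (use np.zeros for zeros)
restored[:, ~cols] = processed               # put the processed columns back in their places
print(restored)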


Python: effective way to find the cumulative sum of repeated index (numpy method) [duplicate]

I have a 2D numpy array with repeated values in the first column.
The repeated values can have any corresponding value in the second column.
It is easy to find the cumulative sum using numpy, but here I have to sum the second-column values for every occurrence of each repeated first-column value.
How can we do this efficiently using numpy or pandas?
Below, I have solved the problem using an inefficient for-loop.
I was wondering if there is a more elegant solution.
Question
How can we get the same result in a more efficient fashion?
Help will be appreciated.
#!python
# -*- coding: utf-8 -*-#
#
# Imports
import pandas as pd
import numpy as np
np.random.seed(42) # make results reproducible
aa = np.random.randint(1, 20, size=10).astype(float)
bb = np.arange(10)*0.1
unq = np.unique(aa)
ans = np.zeros(len(unq))
print(aa)
print(bb)
print(unq)
for i, u in enumerate(unq):
    for j, a in enumerate(aa):
        if a == u:
            print(a, u)
            ans[i] += bb[j]
print(ans)
"""
# given data
idx col0 col1
0 7. 0.0
1 15. 0.1
2 11. 0.2
3 8. 0.3
4 7. 0.4
5 19. 0.5
6 11. 0.6
7 11. 0.7
8 4. 0.8
9 8. 0.9
# sorted data
4. 0.8
7. 0.0
7. 0.4
8. 0.9
8. 0.3
11. 0.6
11. 0.7
11. 0.2
15. 0.1
19. 0.5
# cumulative sum for repeated serial
4. 0.8
7. 0.0 + 0.4
8. 0.9 + 0.3
11. 0.6 + 0.7 + 0.2
15. 0.1
19. 0.5
# Required answer
4. 0.8
7. 0.4
8. 1.2
11. 1.5
15. 0.1
19. 0.5
"""
You can groupby col0 and find the .sum() for col1.
df.groupby('col0')['col1'].sum()
Output:
col0
4.0 0.8
7.0 0.4
8.0 1.2
11.0 1.5
15.0 0.1
19.0 0.5
Name: col1, dtype: float64
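For reference, df here is assumed to be built from the aa and bb arrays in the question, e.g.:
import numpy as np
import pandas as pd

np.random.seed(42)  # same seed as in the question, for reproducibility
aa = np.random.randint(1, 20, size=10).astype(float)
bb = np.arange(10) * 0.1

# Assemble the two arrays into the DataFrame that the groupby above operates on
df = pd.DataFrame({'col0': aa, 'col1': bb})
print(df.groupby('col0')['col1'].sum())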
I think a pandas method such as the one offered by @HarvIpan is best for readability and functionality, but since you asked for a numpy method as well, here is a way to do it in numpy using a list comprehension, which is more succinct than your original loop:
np.array([[i,np.sum(bb[np.where(aa==i)])] for i in np.unique(aa)])
which returns:
array([[ 4. , 0.8],
[ 7. , 0.4],
[ 8. , 1.2],
[ 11. , 1.5],
[ 15. , 0.1],
[ 19. , 0.5]])
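If you want to avoid the per-unique-value np.where lookup entirely, a fully vectorized variant (my own sketch, not part of the original answer) can use np.unique with return_inverse plus np.bincount:
import numpy as np

np.random.seed(42)
aa = np.random.randint(1, 20, size=10).astype(float)
bb = np.arange(10) * 0.1

# inv maps each element of aa to the index of its unique value;
# bincount then sums the bb weights per group in a single pass.
unq, inv = np.unique(aa, return_inverse=True)
sums = np.bincount(inv, weights=bb)
print(np.column_stack((unq, sums)))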

Product of masked dataframe

I have a dataframe of 6M+ observations, where 20 of the columns are weights that will be applied to a single score column, i.e. Wgt1 * Wgt2 * Wgt3 * ... * Score. In addition, not every weight is applicable to every observation, so I have created 20 columns that represent a weight mask, i.e. (Wgt1*Msk1) * (Wgt2*Msk2) * (Wgt3*Msk3) * ... * Score. When the mask is 0, the weight is not applicable; when the mask is 1, it is applicable.
For each row in the dataframe, I want to:
1. Check two qualifying conditions that indicate the row should be processed
2. Find the product of the weights, subject to the presence of the corresponding mask (ttl_wgt)
3. Multiply this product by the score (prob) to create a final weighted score
To do this, I have created a user-defined function:
import functools
import operator
import time
import numpy as np

def mymult(a):
    ttl_wgt = float('NaN')  # initialize to NaN
    if ~np.isnan(a['ID']):  # condition 1: only process if an ID is present
        if a['prob'] > -1.0:  # condition 2: only process if the unweighted score is NOT -1.0
            b = np.where(a[msks] == 1)[0]  # indices of the masks that are set to 1
            ttl_wgt = functools.reduce(operator.mul, a[np.asarray(wgt_nms)[b]], 1)
    return ttl_wgt
I ran out of memory during development, so I decided to chunk it up into 500000 rows at a time. I use a lambda function to apply to the chunk:
msks = ['Msk1','Msk2','Msk3','Msk4',...,'Msk20']
wgt_nms = ['Wgt1','Wgt2','Wgt3','Wgt4',...,'Wgt20']

print('Determining final weights...')
chunksize = 500000  # we'll operate on this many rows at a time
start_time = time.time()
ttl_wgts = []  # list to hold the weight products
for i in range(0, len(df), chunksize):
    ttl_wgts.extend(df[i:(i+chunksize)].apply(lambda x: mymult(x), axis=1))
print("--- %s seconds ---" % (time.time() - start_time))  # expect between 30 and 40 minutes
print('Done!')
Then I assign the ttl_wgts list as a new column in the dataframe, and compute the final product of weight * initial score.
#Initialize the fields
#Might not be necessary or even useful
df['ttl_wgt'] = float('NaN')
df['wgt_prob'] = float('NaN')
df['ttl_wgt'] = ttl_wgts
df['wgt_prob'] = df['ttl_wgt'] * df['prob']
I checked out a prior post on multiplying elements in a list. It was great food for thought, but I wasn't able to turn it into anything more efficient for my 6M+ observations. Are there other approaches I should be considering?
Adding an example df, as suggested
A sample of the dataframe might look something like this, with only 3 masks/weights:
df = pd.DataFrame({'id': [999999999,136550,80010170,80010177,90002408,90002664,16207501,62992,np.nan,80010152],
'prob': [-1,0.180274382,0.448361456,0.000945058,0.005060279,0.009893078,0.169686288,0.109541453,0.117907763,0.266242921],
'Msk1': [0,1,1,1,0,0,1,0,0,0],
'Msk2': [0,0,1,0,0,0,0,1,0,0],
'Msk3': [1,0,0,0,1,1,0,0,1,1],
'Wgt1': [np.nan,0.919921875,1.08984375,1.049804688,np.nan,np.nan,np.nan,0.91015625,np.nan,0.810058594],
'Wgt2': [np.nan,1.129882813,1.120117188,0.970214844,np.nan,np.nan,np.nan,1.0703125,np.nan,0.859863281],
'Wgt3': [np.nan,1.209960938,1.23046875,1,np.nan,np.nan,np.nan,1.150390625,np.nan,0.649902344]
})
In the first observation, the prob field is -1, so the row would not be processed. In the second observation, Msk1 is turned on while Msk2 and Msk3 are turned off, so the final weight would be the value of Wgt1 = 0.919922. In the 3rd row, Msk1 and Msk2 are on while Msk3 is off, so the final weight would be Wgt1*Wgt2 = 1.089844*1.120117 = 1.220752.
IIUC:
You want to fill in the masked-out weights with 1. Then you can multiply them all together with no impact from the masked entries. That's the trick. You'll have to apply it as needed.
create msk
msk = df.filter(like='Msk')
print(msk)
Msk1 Msk2 Msk3
0 0 0 1
1 1 0 0
2 1 1 0
3 1 0 0
4 0 0 1
5 0 0 1
6 1 0 0
7 0 1 0
8 0 0 1
9 0 0 1
create wgt
wgt = df.filter(like='Wgt')
print(wgt)
Wgt1 Wgt2 Wgt3
0 NaN NaN NaN
1 0.919922 1.129883 1.209961
2 1.089844 1.120117 1.230469
3 1.049805 0.970215 1.000000
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 0.910156 1.070312 1.150391
8 NaN NaN NaN
9 0.810059 0.859863 0.649902
create new_weight
new_wgt = np.where(msk, wgt, 1)
print(new_wgt)
[[ 1. 1. nan]
[ 0.91992188 1. 1. ]
[ 1.08984375 1.12011719 1. ]
[ 1.04980469 1. 1. ]
[ 1. 1. nan]
[ 1. 1. nan]
[ nan 1. 1. ]
[ 1. 1.0703125 1. ]
[ 1. 1. nan]
[ 1. 1. 0.64990234]]
final prod_wgt
prod_wgt = pd.Series(new_wgt.prod(1), wgt.index)
print(prod_wgt)
0 NaN
1 0.919922
2 1.220753
3 1.049805
4 NaN
5 NaN
6 NaN
7 1.070312
8 NaN
9 0.649902
dtype: float64
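If you only need the final columns, the same trick can be applied in one pass on the example df from the question (a sketch; it skips the ID/prob qualifying conditions that mymult handles):
import numpy as np

# Replace masked-out weights with 1 so they drop out of the row product,
# then multiply the product by the score to get the weighted score.
msk = df.filter(like='Msk')
wgt = df.filter(like='Wgt')
df['ttl_wgt'] = np.where(msk.values, wgt.values, 1).prod(axis=1)
df['wgt_prob'] = df['ttl_wgt'] * df['prob']
print(df[['ttl_wgt', 'wgt_prob']])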

Issue replacing values in numpy array

I am trying to copy an array and replace all values in the copy below a threshold, while keeping the original array intact.
Here is a simplified example of what I need to do.
import numpy as np
A = np.arange(0,1,.1)
B = A
B[B<.3] = np.nan
print ('A =', A)
print ('B =', B)
Which yields
A = [ nan nan nan 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
B = [ nan nan nan 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
I can't understand why the values in A below .3 are also overwritten.
Can someone explain this to me and suggest a workaround?
Change B = A to B = A.copy() and this should work as expected. As written, B and A refer to the same object in memory.
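A minimal corrected version of the snippet, for reference:
import numpy as np

A = np.arange(0, 1, .1)
B = A.copy()            # independent copy; modifying B no longer touches A
B[B < .3] = np.nan

print('A =', A)  # A = [0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
print('B =', B)  # B = [nan nan nan 0.3 0.4 0.5 0.6 0.7 0.8 0.9]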

Python: Element wise division operator error

I would like to know whether there is a better way to perform element-wise division in Python. The code below is supposed to divide A1 by the B1 row and A2 by the B2 row, so my expected output has only two rows. However, the division actually happens as A1 with B1, A1 with B2, A2 with B1 and A2 with B2. Can anyone help me?
The binary file holds A, C, G, T representations using 1000, 0100, 0010, 0001.
The division file has four columns, one each for A, C, G, T, and the values obtained earlier must be divided accordingly.
Code
import numpy as np
from numpy import genfromtxt
import csv

csvfile = open('output.csv', 'wb')
writer = csv.writer(csvfile)

# open the csv files into lists of lines
with open('binary.csv') as actg:
    actg = actg.readlines()
with open('single.csv') as single:
    single = single.readlines()
with open('division.csv') as division:
    division = division.readlines()

# Convert each binary line and single line into 3 rows and 4 columns using reshape
for line in actg:
    myarray = np.fromstring(line, dtype=float, sep=',')
    myarray = myarray.reshape((-1, 3, 4))
    for line2 in single:
        single1 = np.fromstring(line2, dtype=float, sep=',')
        single1 = single1.reshape((-1, 4))
        # The division file has 2 rows and 4 columns: the 1st column corresponds
        # to 1000, the 2nd to 0100, the 3rd to 0010 and the 4th to 0001 in
        # binary.csv, so each value should be divided by the matching column.
        for line1 in division:
            division1 = np.fromstring(line1, dtype=float, sep=',')
            m = np.asmatrix(division1)
            m = np.array(m)
            res2 = (single1[np.newaxis, :, :] / m[:, np.newaxis, :] * myarray).sum(axis=-1)
            print(res2)
            writer.writerow(res2)
csvfile.close()
binary.csv
0,1,0,0,1,0,0,0,0,0,0,1
0,0,1,0,1,0,0,0,1,0,0,0
single.csv:
0.28,0.22,0.23,0.27,0.12,0.29,0.34,0.21,0.44,0.56,0.51,0.65
division.csv
0.4,0.5,0.7,0.1
0.2,0.8,0.9,0.3
Expected output
0.44,0.3,6.5
0.26,0.6,2.2
Actual output
0.44,0.3,6.5
0.275,0.6,2.16666667
0.32857143,0.3,1.1
0.25555556,0.6,2.2
Explanation of the error
Let the division file be as follows:
A,B,C,D
E,F,G,H
Let the result after the single and binary computation be as follows:
1,3,4
2,2,1
Let the numbers 1,2,3,4 be assigned to the locations A,B,C,D, and the next row to E,F,G,H:
1/A,3/C,4/D
2/F,2/F,1/E
where 1 is divided by A, 3 is divided by C, and so on. Basically this is what the code should do. Unfortunately the division happens as described earlier: 2,2,1 also operates with B,B,C and 1,3,4 also operates with E,G,H, therefore the output has 4 rows, which is not what I want.
I don't know if this is what you are looking for, but here is a short way to get what (I think) you want:
import numpy as np
binary = np.genfromtxt('binary.csv', delimiter = ',').reshape((2, 3, 4))
single = np.genfromtxt('single.csv', delimiter = ',').reshape((1, 3, 4))
divisi = np.genfromtxt('division.csv', delimiter = ',').reshape((2, 1, 4))
print(np.sum(single / divisi * binary, axis = -1))
Output:
[[ 0.44 0.3 6.5 ]
[ 0.25555556 0.6 2.2 ]]
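The reason this pairs the rows correctly is broadcasting: single has shape (1, 3, 4), divisi has shape (2, 1, 4) and binary has shape (2, 3, 4), so each division row meets each binary block exactly once. A toy sketch of the shapes (placeholder values only):
import numpy as np

binary = np.ones((2, 3, 4))   # one 3x4 block per line of binary.csv
single = np.ones((1, 3, 4))   # broadcast against both binary blocks
divisi = np.ones((2, 1, 4))   # one division row per binary block

res = np.sum(single / divisi * binary, axis=-1)
print(res.shape)  # (2, 3): one output row per binary/division pair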
The output of your program looks kind of like this:
myarray
[ 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1.]
[[[ 0. 1. 0. 0.]
[ 1. 0. 0. 0.]
[ 0. 0. 0. 1.]]]
single1
[ 0.28 0.22 0.23 0.27 0.12 0.29 0.34 0.21 0.44 0.56 0.51 0.65]
[[ 0.28 0.22 0.23 0.27]
[ 0.12 0.29 0.34 0.21]
[ 0.44 0.56 0.51 0.65]]
division
[ 0.4 0.5 0.7 0.1]
m
[[ 0.4 0.5 0.7 0.1]]
res2
[[ 0.44 0.3 6.5 ]]
division
[ 0.2 0.8 0.9 0.3]
m
[[ 0.2 0.8 0.9 0.3]]
res2
[[ 0.275 0.6 2.16666667]]
myarray
[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0.]
[[[ 0. 0. 1. 0.]
[ 1. 0. 0. 0.]
[ 1. 0. 0. 0.]]]
single1
[ 0.28 0.22 0.23 0.27 0.12 0.29 0.34 0.21 0.44 0.56 0.51 0.65]
[[ 0.28 0.22 0.23 0.27]
[ 0.12 0.29 0.34 0.21]
[ 0.44 0.56 0.51 0.65]]
division
[ 0.4 0.5 0.7 0.1]
m
[[ 0.4 0.5 0.7 0.1]]
res2
[[ 0.32857143 0.3 1.1 ]]
division
[ 0.2 0.8 0.9 0.3]
m
[[ 0.2 0.8 0.9 0.3]]
res2
[[ 0.25555556 0.6 2.2 ]]
So, with that in mind, it looks like the last two lines of your output, the ones you did not expect, are caused by the second line in binary.csv. So don't use that line in your calculations if you don't want 4 lines in your result.

Python dataframe: calculating average/diff/sum/... over all columns

I have a large DataFrame (thousands of rows and 20 columns) and I want to calculate the average (or any other mathematical function, like the total sum) over all columns. Example:
x = [
    [0.5, 0.7, 0.1, 4, 80, 101],
    [0.1, 0.7, 0.8, 5, 4, 58],
    [0.4, 0.1, 0.6, 6, 1, 66],
    ...
    [0.9, 0.4, 0.1, 7, 44, 12]
]
This should result in
avg = [0.475 0.95 ...]
or
sum = [15.1 8.17 ...]
Is there any quick formula or one-liner that can easily do this? It does not have to be a pandas.DataFrame; a NumPy array is also fine.
df.mean(axis=0)
df.sum(axis=0)
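Since a NumPy array is also acceptable per the question, the equivalent calls on a plain array would be (a small sketch using the example rows above, with the elided "..." rows dropped):
import numpy as np

x = np.array([[0.5, 0.7, 0.1, 4, 80, 101],
              [0.1, 0.7, 0.8, 5, 4, 58],
              [0.4, 0.1, 0.6, 6, 1, 66],
              [0.9, 0.4, 0.1, 7, 44, 12]])

print(x.mean(axis=0))  # column-wise average
print(x.sum(axis=0))   # column-wise total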
