Count number of NaN values in a matrix with strings - Python

I want to count the number of 'nan' values per column inside a matrix full of string values. Like this one:
m:
[['CB_2' 'CB_3']
['CB_1-1' 'CB_4-1']
['CB_1-2' 'CB_4-2']
['CB_2-1' 'CB_5-1']
['CB_2-2' 'CB_5-2']
[nan 'CB_6-1']
[nan 'CB_6-2']]
I tried using np.count_nonzero(~np.isnan(m)), but it seems to work only with numerical values. Perhaps I should convert the NaNs into an empty string or zero first?
Also, I created a sample NumPy array with strings to try several options (np.array([['a','b'],['c','d'],['e','f'],['e','g'],['k','ñ'],['w','q'],['y','d']])), but when I use np.nan it doesn't seem to work correctly, since it adds the NaN value as a string ('nan').
Thanks,

You can transform the array into something numerical (I could not reproduce an array with NaNs, but you can make the function return 0 for non-strings):
import numpy as np

def f(x):
    # The string 'nan' (and anything that is not a string) counts as missing
    if isinstance(x, str):
        if x == 'nan':
            return 0
        else:
            return 1
    return 0

vf = np.vectorize(f)
x = np.array([['CB_2', 'CB_3'],
              ['CB_1-1', 'CB_4-1'],
              ['CB_1-2', 'CB_4-2'],
              ['CB_2-1', 'CB_5-1'],
              ['CB_2-2', 'CB_5-2'],
              [np.nan, 'CB_6-1'],
              [np.nan, 'CB_6-2']])
>>> x
array([['CB_2', 'CB_3'],
       ['CB_1-1', 'CB_4-1'],
       ['CB_1-2', 'CB_4-2'],
       ['CB_2-1', 'CB_5-1'],
       ['CB_2-2', 'CB_5-2'],
       ['nan', 'CB_6-1'],
       ['nan', 'CB_6-2']], dtype='<U6')
>>> vf(x)
array([[1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [0, 1],
       [0, 1]])
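Since np.nan stored into a string array is coerced to the literal string 'nan' (as the dtype='<U6' output above shows), a plain elementwise comparison can also count the missing entries per column without vectorizing a Python function; a minimal sketch:

```python
import numpy as np

# np.nan is coerced to the string 'nan' in a string-dtype array
x = np.array([['CB_2', 'CB_3'],
              ['CB_1-1', 'CB_4-1'],
              [np.nan, 'CB_6-1'],
              [np.nan, 'CB_6-2']])

# Elementwise comparison yields a boolean mask; summing down axis 0
# counts the 'nan' entries in each column
nan_per_column = (x == 'nan').sum(axis=0)
print(nan_per_column)  # [2 0]
```

One caveat: this also counts any legitimate value that happens to be the string 'nan', so it is only safe when that string cannot occur in the real data.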

Related

Place numbers from a list to an array where the element is not a np.nan

I have a list of numbers
a = [1, 2, 3, 4, 5]
and an existed array
b = [[np.nan, 10, np.nan],
     [11, 12, 13],
     [np.nan, 14, np.nan]]
How can I place the numbers from list a into the elements of array b that contain a number, so that I get
c = [[np.nan, 1, np.nan],
     [2, 3, 4],
     [np.nan, 5, np.nan]]
Maybe it can be done with loops, but I want to avoid them because the length of the list and the dimensions of the array will change. However, the length of the list will always match the number of elements in the array that are not np.nan.
Here is an approach that solves it without loops.
First, flatten the array b to a 1D array and replace its non-NaN values with the contents of a; then reshape back to the original shape. Note that flatten() returns a copy, so the result of reshape must be kept:
flat_b = b.flatten()
flat_b[~np.isnan(flat_b)] = a
c = flat_b.reshape(b.shape)
You can use np.isnan to create a boolean mask, then use it in indexing1.
m = np.isnan(b)
b[~m] = a
print(b)
[[nan  1. nan]
 [ 2.  3.  4.]
 [nan  5. nan]]
1. NumPy's Boolean Indexing
c = b.copy()
current = 0
for i in range(len(c)):
    for j in range(len(c[i])):
        # NaN never compares equal (or unequal, reliably) to itself,
        # so test with np.isnan instead of != np.nan
        if not np.isnan(c[i][j]) and current < len(a):
            c[i][j] = a[current]
            current += 1
While this may look long and complicated, it only has O(n) complexity: it iterates through the 2D array once and replaces each non-NaN value with the next value from a.

Python numpy : Sum an array selectively given information from a second array

Let's say I have an N-dimensional array, for example:
A = [[1, 2],
     [6, 10]]
and another array B that defines an index associated with each value of A
B = [[0, 1], [1, 0]]
And I want to obtain a 1D list or array that for each index contains the sum of the values of A associated with that index. For our example, we would want
C = [11, 8]
Is there a way to do this efficiently, without looping over the arrays manually ?
Edit: To make it clearer what I want, if we now take A the same and B equal to :
B = [[1, 1], [1,1]]
Then I want all the values of A to sum into the index 1 of C, which yields
C = [0, 19]
Or I can write a code snippet:
C = np.zeros(np.max(B) + 1)
for i in range(...):
    for j in range(...):
        C[B[i, j]] += A[i, j]
return C
I think I found the best answer for now actually.
I can just use:
np.histogram(B, weights = A)
This code provides the solution I want.
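One note of caution: np.histogram only reproduces C = [11, 8] if the bin edges are chosen so that each integer index gets its own bin. np.bincount is arguably a more direct fit for this accumulation pattern, since it sums the weights per integer index directly; a minimal sketch:

```python
import numpy as np

A = np.array([[1, 2], [6, 10]])
B = np.array([[0, 1], [1, 0]])

# bincount adds the weight A[i, j] into output slot B[i, j]
C = np.bincount(B.ravel(), weights=A.ravel())
print(C)  # [11.  8.]
```

For the edge case where some indices never occur (e.g. B = [[1, 1], [1, 1]]), passing minlength=2 keeps the output length fixed and yields [ 0. 19.].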

Find all NaN slice in numpy array

I have a four-dimensional NumPy ndarray (time, pressure level, latitude, longitude), and I want to check for each time and pressure level (dimensions 0 and 1) whether there is an all-NaN slice along the latitude or longitude dimension (2 and 3).
I'd like to do it in a vectorized way, so without looping over the array, but I can't figure out how.
import numpy as np

a = np.ones([2, 3, 5, 5])
# Assigning np.nan broadcasts across the whole slice
a[0, 2, :, 2] = np.nan
a[0, 1, 1, :] = np.nan
a[0, 0, 1, 2] = np.nan
a[1, 1, :, 2] = np.nan
a[1, 1, 1, :] = np.nan
print(a)
The array now holds ones (i.e. numbers) and, in some locations, slices of only NaNs. I'd like to know these locations. So in this case, I need to find that the NaN slices are at [0,2,:,2], [0,1,1,:], [1,1,:,2], and [1,1,1,:].
You should use the np.isnan function, which creates a boolean matrix of the same size as your original matrix, and then reduce it with boolean operations like np.all. The following code stores in idx the indices of the rows (axis=1) whose elements are all np.nan.
arr = np.array([[0, 0, 0], [np.nan, np.nan, np.nan], [1, np.nan, 1]])
arr_isnan = np.isnan(arr)
idx = np.argwhere(arr_isnan.all(axis=1))
Output:
>>>print(idx)
[[1]]
Following your example, this method gives the following output:
arr_isnan = np.isnan(a)
idx = np.argwhere(arr_isnan.all(axis=2))
>>>print(idx) #[0,2,:,2] and [1,1,:,2] because axis=2
array([[0, 2, 2],
[1, 1, 2]], dtype=int64)
>>>print(a[idx[:,0], idx[:,1], :, idx[:,2]])
[[nan nan nan nan nan]
[nan nan nan nan nan]]
So you just have to adjust the position of ":" according to the axis.
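To check both axes in one pass (an all-NaN slice along either latitude or longitude), the same reduction can be applied once per axis and the two results inspected separately; a minimal sketch reusing the array a from the question:

```python
import numpy as np

a = np.ones([2, 3, 5, 5])
a[0, 2, :, 2] = np.nan
a[0, 1, 1, :] = np.nan
a[0, 0, 1, 2] = np.nan
a[1, 1, :, 2] = np.nan
a[1, 1, 1, :] = np.nan

nan_mask = np.isnan(a)
# (time, pressure, longitude) triples with an all-NaN latitude slice
lat_slices = np.argwhere(nan_mask.all(axis=2))
# (time, pressure, latitude) triples with an all-NaN longitude slice
lon_slices = np.argwhere(nan_mask.all(axis=3))
print(lat_slices)  # rows [0 2 2] and [1 1 2]
print(lon_slices)  # rows [0 1 1] and [1 1 1]
```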

Detect Missing Column Labels in Pandas

I'm working with the dataset outlined here:
https://archive.ics.uci.edu/ml/datasets/Balance+Scale
I'm trying to create a general function that can parse any categorical data following these two rules:
Must have a column labeled Class containing the class of the object
Each row must have the same number of columns
Minimal example of the data that I'm working with:
Class,LW,LD,RW,RD
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
This provides 3 unique classes: B, L, R. It also provides 4 features which pertain to each entry: LW, LD, RW and RD.
The following is part of my function for handling generic cases, but my issue with it is that I don't know how to check whether any column labels are simply missing:
import pandas as pd
import sys

dataframe = pd.read_csv('Balance_Data.csv')
columns = list(dataframe.columns.values)
if "Class" not in columns:
    sys.exit("'Class' is not a column in the data")
if "Class.1" in columns:
    sys.exit("Cannot specify more than one 'Class' column")
columns.remove("Class")
# as_matrix() was removed in pandas 1.0; use .to_numpy() on modern versions
inputX = dataframe.loc[:, columns].as_matrix()
inputY = dataframe.loc[:, ['Class']].as_matrix()
At this point, the correct values are:
inputX = array([[1, 1, 1, 1],
                [1, 2, 1, 1],
                [1, 2, 1, 3],
                [2, 2, 4, 5]])
inputY = array([['B'],
                ['L'],
                ['R'],
                ['R']], dtype=object)
But if I remove the last column label (RD) and reprocess,
Class,LW,LD,RW
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
I get:
inputX = array([[1, 1, 1],
                [2, 1, 1],
                [2, 1, 3],
                [2, 4, 5]])
inputY = array([[1],
                [1],
                [1],
                [2]])
This indicates that it reads label values from right to left instead of left to right, which means that if any data input into this function doesn't have the right number of labels, it's not going to work correctly.
How can I check that the dimension of the rows is the same as the number of columns? (It can be assumed that there are no gaps in the data itself, that each row of data beyond the columns always has the same number of elements in it)
I would pull it out as follows:
In [11]: df = pd.read_csv('Balance_Data.csv', index_col=0)
In [12]: df
Out[12]:
LW LD RW RD
Class
B 1 1 1 1
L 1 2 1 1
R 1 2 1 3
R 2 2 4 5
That way the assertion check can be:
if "Class" in df.columns:
    sys.exit("Class must be the first column, and the number of columns must match all rows")
and then check that there are no NaNs in the last column:
In [21]: df.iloc[:, -1].notnull().all()
Out[21]: True
Note: this happens e.g. with the following (bad) csv:
In [31]: !cat bad.csv
A,B,C
1,2
3,4
In [32]: df = pd.read_csv('bad.csv', index_col=0)
In [33]: df
Out[33]:
B C
A
1 2 NaN
3 4 NaN
In [34]: df.iloc[:, -1].notnull().all()
Out[34]: False
I think these are the only two failing cases (but I think the error messages can be made clearer)...
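Since pandas silently pads short rows with NaN (and shifts labels when the header itself is short), a pre-check with the standard-library csv module can also catch any row whose field count differs from the header, before handing the file to pandas. A minimal sketch, using an in-memory copy of the truncated-header example:

```python
import csv
import io

# Header has 4 fields but every data row has 5 (the truncated-header case)
raw = "Class,LW,LD,RW\nB,1,1,1,1\nL,1,2,1,1\n"

rows = list(csv.reader(io.StringIO(raw)))
header_len = len(rows[0])
# 1-based line numbers of rows whose field count mismatches the header
bad_rows = [i for i, row in enumerate(rows[1:], start=2)
            if len(row) != header_len]
print(bad_rows)  # [2, 3]
```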

produce a matrix of strings on the basis of a list

I want to produce a matrix on the basis of this data I have:
[[0, 1], [1, 0], [0, 2], [1, 1], [2, 0], [0, 3], [1, 2], [2, 1], [3, 0]]
What I want to do is: if the sum inside the square brackets is equal to 1, produce a string variable y_n, where n counts the lists meeting that condition;
and yxn if the sum is greater than one, where n counts the number of such strings produced.
So for my data it should produce:
y_1
y_2
yx1
yx2
up to
yx7
So my best attempt is:
if len(gcounter) != 0:
    hg = len(gcounter[0])
else:
    hg = 1
LHS = Matrix(hg, 1, lambda i, j: var('yx%d' % i))
print(LHS)
The data is called gcounter.
It's not giving me an error, but it's not filling LHS with anything.
I'm not entirely sure I understand what you're doing, but I think this generator does what you want:
def gen_y_strings(data):
    counter_1 = counter_other = 0
    for item in data:
        if sum(item) == 1:
            counter_1 += 1
            yield "y_{}".format(counter_1)
        else:
            counter_other += 1
            yield "yx{}".format(counter_other)
You can run it like this:
for result in gen_y_strings(gcounter):
    print(result)
Which, given the example data, outputs what you wanted:
y_1
y_2
yx1
yx2
yx3
yx4
yx5
yx6
yx7
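If all the labels are wanted at once rather than lazily, a NumPy sketch with cumulative counters produces the same sequence (assuming gcounter is the nested list from the question):

```python
import numpy as np

gcounter = [[0, 1], [1, 0], [0, 2], [1, 1], [2, 0],
            [0, 3], [1, 2], [2, 1], [3, 0]]

sums = np.sum(gcounter, axis=1)
is_one = sums == 1
# Per-group running counter: cumsum counts how many members of the
# group have been seen so far, including the current element
counts = np.where(is_one, np.cumsum(is_one), np.cumsum(~is_one))
labels = ["y_{}".format(c) if one else "yx{}".format(c)
          for one, c in zip(is_one, counts)]
print(labels)  # y_1, y_2, then yx1 through yx7
```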
