Recode missing data in NumPy - Python

I am reading in census data using the matplotlib csv2rec function - it works fine and gives me a nice ndarray.
But there are several columns where all the values are None, with dtype |O4. This is causing problems when I load into ATpy: "TypeError: object of NoneType has no len()". Something like '9999' or another missing-value code would work for me. A mask is not going to work in this case because I am passing the real array to ATpy and it will not convert a masked array. The put function in numpy will not work with None values, which I think is otherwise the best way to change values. I think some sort of boolean array is the way to go but I can't get it to work.
So what is a good/fast way to change None values and/or uninitialized values in a numpy array to something like '9999' or another recode? No masking.
Thanks,
Matthew

Here is a solution to this problem, although if your data is a record array you should only apply this operation to your column, rather than the whole array:
import numpy as np
# initialise some data with None in it
a = np.array([1, 2, 3, None])
a = np.where(a == np.array(None), 9999, a)
Note that you need to cast None into a numpy array (np.array(None)) for this comparison to work.
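If your data is a record array, the same np.where recode can be applied to one column in place; a minimal sketch (the 'id' and 'flag' field names are made up for illustration):

```python
import numpy as np

# Hypothetical record array whose 'flag' column is all None (object dtype)
rec = np.array([(1, None), (2, None)],
               dtype=[('id', 'i4'), ('flag', 'O')])

# Recode only the 'flag' column, leaving the other columns untouched
rec['flag'] = np.where(rec['flag'] == np.array(None), 9999, rec['flag'])
```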

You can use a masked array when you do calculations, and when you pass the array to ATpy you can call the filled(9999) method of the masked array to convert it to a normal array with the invalid values replaced by 9999.
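A sketch of that workflow (the -1.0 sentinel marking the missing entries is an assumption for illustration):

```python
import numpy as np
import numpy.ma as ma

a = np.array([1.0, 2.0, -1.0, 4.0])      # -1.0 stands in for "missing" here
masked = ma.masked_where(a == -1.0, a)   # mask the missing entry
mean_without_missing = masked.mean()     # masked values are excluded
plain = masked.filled(9999)              # ordinary ndarray, ready for ATpy
```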

Related

Cannot plot or use .tolist() on pd dataframe column

so I am reading in data from a csv and saving it to a dataframe so I can use the columns. Here is my code:
filename = open(r"C:\Users\avalcarcel\Downloads\Data INSTR 9 8_16_2022 11_02_42.csv")
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(filename,sep='[, ]',encoding='UTF-16 LE',names=columns,header=15,on_bad_lines='skip',engine='python')
length_ = len(df.date)
scan = list(range(1,length_+1))
plt.plot(scan,df.ch104)
plt.show()
When I try to plot scan vs. df.ch104, I get the following exception thrown:
'value' must be an instance of str or bytes, not a None
So what I thought to do was make each column in my df a list:
ch104 = df.ch104.tolist()
But it is turning my data from this:
[screenshot: before .tolist()]
To this:
[screenshot: after .tolist()]
This also happens when I use df.ch104.values.tolist()
Can anyone help me? I haven't used python/pandas in a while and I am just trying to get the data read in first. Thanks!
So, the df.ch104.values.tolist() code basically turns your column into a 2D 1×N array, but what you want is a 1D array of size N.
So transpose it before you call .tolist(), and lastly take [0] to convert the N×1 array to an N array:
df.ch104.values.tolist()[0]
Might I also suggest you include dropna() to avoid 'value' must be an instance of str or bytes, not a None:
df.dropna(subset=['ch104']).ch104.values.tolist()[0]
The error clearly says there are None or NaN values in your dataframe. You need to check for them and deal with them: replace them with a suitable value or delete the rows.
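A minimal sketch of both options, using a toy column standing in for df.ch104:

```python
import pandas as pd

# Toy data with one missing value (pandas stores None as NaN in a float column)
df = pd.DataFrame({"ch104": [1.0, None, 3.0]})

dropped = df.dropna(subset=["ch104"])  # delete the rows with missing ch104
filled = df["ch104"].fillna(0.0)       # or replace missing values with a sentinel
```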

Converting pandas DataFrame to numpy array but with an inconsistency

I am running into a weird inconsistency. I recently had to learn the difference between immutable and mutable data types. For my purposes, I need to convert my pandas DataFrame to NumPy, apply operations, and convert it back, as I do not wish to alter my input.
so I am converting as follows:
mix = pd.DataFrame(array, columns=columns)

def mix_to_pmix(mix, p_tank):
    previous = 0
    columns, mix_in = np.array(mix)  # <---
    mix_in *= p_tank
    previous = 0
    for count, val in enumerate(mix_in):
        mix_in[count] = val + previous
        previous += val
    return pd.DataFrame(mix_in, columns=columns)
This works perfectly fine, but the function:
columns,mix_in=np.array(mix)
seems to not be consistent as in the case:
def to_molfrac(mix):
    columns, mix_in = np.array(mix)
    shape = mix_in.shape
    for i in range(shape[0]):
        mix_in[i, :] *= 1 / max(mix_in[i, :])
    for k in range(shape[1] - 1, 0, -1):
        mix_in[:, k] += -mix_in[:, k - 1]
    mix_in = mix_in / mix_in.sum(axis=1)[:, np.newaxis]
    return pd.DataFrame(mix_in, columns=columns)
I receive the error:
ValueError: too many values to unpack (expected 2)
The input of the latter function is the output of the previous function. So it should be the same case.
It's impossible to understand the input of to_molfrac and mix_to_pmix without an example.
But pandas objects have a .values attribute which gives access to the underlying numpy array, so it's probably better to use mix_in = mix.values instead:
columns, values = df.columns, df.values
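A small sketch of why the tuple unpacking only appeared to work: np.array(mix) unpacks along the first axis, so it succeeds only when the frame happens to have exactly two rows (toy data for illustration):

```python
import numpy as np
import pandas as pd

mix = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], columns=["a", "b"])

try:
    columns, mix_in = np.array(mix)  # tries to unpack 3 rows into 2 names
except ValueError:
    pass  # too many values to unpack (expected 2)

# Explicit and shape-independent:
columns, mix_in = mix.columns, mix.values
```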

Python: How to convert whole array into int

I wish to have an int matrix which has only its first column filled while the rest of the elements are Null. Sorry, but I have a background in R, so I know that if I leave some Null elements it will be easier to manage them later, whereas if I leave 0s there will be lots of problems later.
I have the following code:
import numpy as np
import numpy.random as random
import pandas as pa
def getRowData():
    rowDt = np.full((80, 20), np.nan)
    rowDt[:, 0] = random.choice([1, 2, 3], 80)  # Set the first column
    return rowDt
I wish this function to return ints, but it seems to give me floats.
I have seen this link, and tried the below code:
return pa.to_numeric(rowDt)
But it did not help me. Also, the rowDt object does not have .astype(<type>).
How can I convert an int array?
You create a full (np.full) matrix of np.nan, which has float dtype. This means you start off with a matrix defined to hold float numbers, not integers.
To fix this, define a full matrix with the integer 0 as the initial value. That way, the dtype of your array is integer and there is no need for astype or type casting:
rowDt = np.full((80,20), 0)
If you still wish to hold np.nan in your matrix, then I'm afraid you cannot use numpy arrays for that. You either hold all integers, or all floats.
You can use numpy.ma.masked_array() to create a numpy masked array.
The numpy masked array "remembers" which elements are "masked". It provides methods and functions similar to those of numpy arrays, but excludes the masked values from computations (such as, e.g., mean()).
Once you have the masked array, you can always mask or unmask specific elements or rows or columns of elements whenever you want.
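A sketch of that approach, mirroring the question's getRowData (the fill value 0 is arbitrary since those cells start out masked):

```python
import numpy as np
import numpy.ma as ma

# Integer matrix, fully masked to start with
rowDt = ma.masked_array(np.zeros((80, 20), dtype=int), mask=True)

# Assigning into the first column unmasks just those cells
rowDt[:, 0] = np.random.choice([1, 2, 3], 80)
```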

Unable to insert NA's in numpy array

I was working on this piece of code and was stuck here.
import numpy as np
a = np.arange(10)
a[7:] = np.nan
In theory, it should insert missing values starting from index 7 until the end of the array. However, when I ran the code, some random values were inserted into the array instead of NA's.
Can someone explain what happened here and how should I insert missing values intentionally into numpy arrays?
Not-a-number (NaN) is a special type of floating point number. By default, np.arange() creates an array of type int. Casting this to float allows you to assign NaN:
import numpy as np
a = np.arange(10).astype(float)
a[7:] = np.nan

Normalising rows in numpy matrix

I am trying to normalize rows of a numpy matrix using L2 norm (unity length).
I am seeing a problem when I do that.
Assuming my matrix 'b' is as follows:
Now when I do the normalization of first row as below it works fine.
But when I try to do it by iterating through all the rows and converting the same matrix b as below it gives me all zeros.
Any idea why that is happening and how to get the correct normalization?
Is there a faster way of row-normalizing the matrix without having to iterate over each row? I don't want to use the scikit-learn normalization function though.
Thanks
The problem comes from the fact that b has type int, so when you fill it in row by row, numpy automatically converts the results of your computation (float) to int, hence the zeros. One way to avoid that is to define b with type float by using 0., 1., etc., or just adding .astype(float) at the definition.
This should work to do the computation in one go which also doesn't require converting to float first:
b = b / np.linalg.norm(b, axis=1, keepdims=True)
This works because you are redefining the whole array rather than changing its rows one by one, and numpy is clever enough to make the result float.
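Putting it together with a small integer matrix (the values are chosen so the row norms are exact):

```python
import numpy as np

b = np.array([[3, 4], [6, 8]])  # int matrix, as in the question

# One-shot L2 row normalization; the division promotes the result to float
normalized = b / np.linalg.norm(b, axis=1, keepdims=True)
```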
