Why does select raise a FutureWarning? - python

In my code I have a 2D numpy.ndarray filled with numpy.str_ values. I'm trying to change the values "null" to "nan" using np.select. The problem is that this call raises a FutureWarning.
I have read this. Following a suggestion there, I tried not to compare Python strings with NumPy strings, but to convert the Python strings to NumPy strings at the start. Obviously that doesn't help, and I'm looking for advice.
I would like to avoid suppressing the warning (as is done in the link). That seems like a very dirty approach to me.
My code snippet:
import pandas_datareader as pd
import numpy as np
import datetime as dt
start_date = dt.datetime(year=2013, month=1, day=1)
end_date = dt.datetime(year=2013, month=2, day=1)
df = pd.DataReader("AAA", "yahoo", start_date, end_date + dt.timedelta(days=1))
array = df.to_numpy()
null = np.str_("null")
nan = np.str_("nan")
array = np.select([array == null, not array == null], [nan, array])
print(array[0][0].__class__)
print(null.__class__)
C:\Python\Project.py:13: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
array = np.select([array == null, not array == null], [nan, array])
<class 'numpy.str_'>
<class 'numpy.str_'>
I'm quite new to Python, so any help will be appreciated. Also, if you have a better way to achieve this, please let me know.
Thank you!
Edit: Sorry for that. Now it should work as it is.

I don't have 50 reputation yet, so I can't comment.
As I understand it, you only want to change all 'null' entries to 'nan'?
Your code creates a NumPy array of float values, but for some reason you expect strings of 'null' in the array?
Perhaps you should've written
array = df.to_numpy()
array = array.astype(str)
to make it more clear.
From here, the array consists only of strings, and to make the change from 'null' to 'nan', you only have to write
array[array == 'null'] = 'nan'
and the warning is gone. You don't even have to use np.select.
If you want floating-point values in your array, you could use Numpy's own np.nan instead of a string, and do
array = array.astype(float)
The 'nan' strings are automatically converted to np.nan, which is a float.
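Putting it together, here is a minimal sketch of the whole approach. It uses a small hand-made string array as a stand-in for df.to_numpy(), since the Yahoo data itself isn't needed to show the idea:
import numpy as np
# hypothetical stand-in for df.to_numpy(): a small array that contains "null" entries
array = np.array([["1.5", "null"], ["null", "2.25"]])
array = array.astype(str)          # make sure every element is a string
array[array == "null"] = "nan"     # boolean-mask assignment, no np.select needed
array = array.astype(float)        # the "nan" strings become np.nan, a float
print(array)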

Related

type conversion in pandas on assignment of DataFrame series

I am noticing something a little strange in pandas (1.4.3). Is this the expected behaviour, the result of an optimization, or a bug? Basically, I'd like to guarantee that the type does not change unexpectedly, or at least see an error raised, so any tips are welcome.
If you assign all values of a series in a DataFrame this way, the dtype is altered
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame({"a": np.array([1,2,3], dtype="int32")})
>>> df1.iloc[:, df1.columns.get_loc("a")] = 0
>>> df1["a"].dtype
dtype('int64')
and if you index the rows in a different way pandas does not convert the dtype
>>> df2 = pd.DataFrame({"a": np.array([1,2,3], dtype="int32")})
>>> df2.iloc[0:len(df2.index), df2.columns.get_loc("a")] = 0
>>> df2["a"].dtype
dtype('int32')
Not really an answer, but some thoughts that might help you in your quest. My guess as to what is happening is this: of the options in your question, I am picking A - optimization.
I think when pandas sees df1.iloc[:, df1.columns.get_loc("a")] = 0 it treats it as a full replacement of the column(s) across all rows - no slicing, even though df1.iloc[: ... ] is involved. [:] gets translated into an all-rows, not-a-slice mode. When it sees = 0 it sees that (via broadcast) as full column(s) of int64. And since it is a full replacement, the new column takes the dtype of the broadcast value.
But when it sees df2.iloc[0:len(df2.index), df2.columns.get_loc("a")] = 0 it goes into index-slice mode. Even though it is a full-column index slice, pandas doesn't know that and makes an early decision to treat it as a partial update rather than a replacement. In update mode the column is assumed to be only partially updated and so retains its existing dtype.
I got the above hypothesis from looking around at this: https://github.com/pandas-dev/pandas/blob/main/pandas/core/indexes/base.py
If I didn't have a day job I might have the time to actually find the smoking gun in those 6242 lines of code.
If you look at this code (I wrote your code a little differently to see what is happening in the middle):
from pandas._libs import index
import pandas as pd
import numpy as np
dfx = pd.DataFrame({"x": np.array([4, 5, 6], dtype="int32")})
P=dfx.iloc[:, dfx.columns.get_loc("x")] = 0
P1=dfx.iloc[:, dfx.columns.get_loc("x")]
print(P1)  # the dtype has changed to int64 (while keeping the value 0); int64 is the platform's default integer type
print(P)
print(dfx["x"].dtype)
dfy= pd.DataFrame({"y": np.array([4,5,6], dtype="int32")})
Q=dfy.iloc[0:len(dfy.index), dfy.columns.get_loc("y")] = 0
print(Q)
Q1=dfy.iloc[0:len(dfy.index), dfy.columns.get_loc("y")]
print(Q1)
print(dfy["y"].dtype)
print(len(dfx.index))
print(len(dfy.index))
Don't know why this is happening, but adding square brackets seems to solve the issue:
df1.iloc[:, [df1.columns.get_loc("a")]] = 0
Another solution seems to be:
df1.iloc[range(len(df1.index)), df1.columns.get_loc("a")] = 0
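A quick check of both workarounds on the example from the question - as reported above, the dtype should stay int32 in both cases (this is how it behaved around pandas 1.4; other versions may differ):
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int32")})
df1.iloc[:, [df1.columns.get_loc("a")]] = 0                      # column position wrapped in a list
print(df1["a"].dtype)
df2 = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int32")})
df2.iloc[range(len(df2.index)), df2.columns.get_loc("a")] = 0    # explicit row positions
print(df2["a"].dtype)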

Unable to insert NA's in numpy array

I was working on this piece of code and was stuck here.
import numpy as np
a = np.arange(10)
a[7:] = np.nan
In theory, it should insert missing values from index 7 to the end of the array. However, when I ran the code, some random values were inserted into the array instead of NA's.
Can someone explain what happened here and how should I insert missing values intentionally into numpy arrays?
Not-a-number (NaN) is a special floating-point value. By default, np.arange() creates an array of integer type. Casting this to float allows you to assign NaNs:
import numpy as np
a = np.arange(10).astype(float)
a[7:] = np.nan
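Alternatively, if you know up front that the array will hold missing values, you can create it as float from the start - a small sketch:
import numpy as np
a = np.arange(10, dtype=float)   # float array from the beginning
a[7:] = np.nan                   # NaN assignment now works as expected
print(a)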

DataFrame of objects `astype(float)` behaviour different depending if lists or arrays

I'll preface this with the statement that I wouldn't do this in the first place and that I ran across this helping a friend.
Consider the data frame df
df = pd.DataFrame(pd.Series([[1.2]]))
df
0
0 [1.2]
This is a data frame of objects where the objects are lists. In my friend's code, they had:
df.astype(float)
Which breaks as I had hoped
ValueError: setting an array element with a sequence.
However, if those values were numpy arrays instead:
df = pd.DataFrame(pd.Series([np.array([1.2])]))
df
0
0 [1.2]
And I tried the same thing:
df.astype(float)
0
0 1.2
It's happy enough to do something and convert my 1-length arrays to scalars. This feels very dirty!
If instead they were not 1-length arrays
df = pd.DataFrame(pd.Series([np.array([1.2, 1.3])]))
df
0
0 [1.2, 1.3]
Then it breaks
ValueError: setting an array element with a sequence.
Question
Please tell me this is a bug and we can fix it. Or can someone explain why and in what world this makes sense?
Response to @root
You are right. Is this worth an issue? Do you expect/want this?
a = np.empty((1,), object)
a[0] = np.array([1.2])
a.astype(float)
array([ 1.2])
And
a = np.empty((1,), object)
a[0] = np.array([1.2, 1.3])
a.astype(float)
ValueError: setting an array element with a sequence.
This is due to the default value 'unsafe' for the casting argument of astype. In the docs, the casting argument is described as follows:
"Controls what kind of data casting may occur. Defaults to ‘unsafe’ for backwards compatibility." (my emphasis)
Any of the other possible castings return a TypeError.
a = np.empty((1,), object)
a[0] = np.array([1.2])
a.astype(float, casting='same_kind')
Results in:
TypeError: Cannot cast array from dtype('O') to dtype('float64') according to the rule 'same_kind'
This is true for all castings except unsafe, namely: no, equiv, safe, and same_kind.
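If the goal is just to turn length-1 object arrays into scalars without leaning on the unsafe cast, one possible workaround (a sketch, not the only way) is to pull the scalar out explicitly before converting:
import numpy as np
import pandas as pd
df = pd.DataFrame(pd.Series([np.array([1.2])]))
# .item() extracts the single element and fails loudly if the array is not length 1
extracted = df[0].map(lambda arr: arr.item())
print(extracted.astype(float))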

unwanted type conversion in pandas.DataFrame.update

Is there any reason why pandas changes the type of columns from int to float in update, and can I prevent it from doing it? Here is some example code of the problem
import pandas as pd
import numpy as np
df = pd.DataFrame({'int': [1, 2], 'float': [np.nan, np.nan]})
print('Integer column:')
print(df['int'])
for _, df_sub in df.groupby('int'):
    df_sub['float'] = float(df_sub['int'])
    df.update(df_sub)
print('NO integer column:')
print(df['int'])
Here's the reason for this: since you are effectively masking certain values on a column and replacing them (with your updates), some values could become `nan`.
In an integer array this is impossible, so numeric dtypes are a priori converted to float (for efficiency), as checking first is more expensive than just doing the conversion.
A change of dtype back is possible... just not in the code right now, so this is a bug (a bit non-trivial to fix though): github.com/pydata/pandas/issues/4094
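A small illustration of why the conversion happens - integer arrays simply cannot hold NaN (a sketch with plain numpy):
import numpy as np
a = np.array([1, 2, 3])
try:
    a[0] = np.nan        # integer arrays cannot store NaN
except ValueError as err:
    print(err)           # "cannot convert float NaN to integer"
print(a.astype(float))   # converted to float, the same assignment would succeed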
This causes data precision loss if you have big values in your int64 column, because update converts them to float. So converting back as Jeff suggests, with df['int'].astype(int), is not always possible.
My workaround for cases like this is:
df_sub['int'] = df_sub['int'].astype('Int64') # Int64 with capital I, supports NA values
df.update(df_sub)
df_sub['int'] = df_sub['int'].astype('int')
The above avoids the conversion to float type. The reason I am converting back to int type (instead of leaving it as Int64) is that pandas seems to lack support for that type in several operations (e.g. concat gives an error about missing .view).
Maybe they could incorporate the above fix in issue 4094
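Putting the workaround into the question's example, a sketch (behaviour observed with the nullable Int64 dtype; exact results may differ across pandas versions):
import numpy as np
import pandas as pd
df = pd.DataFrame({'int': [1, 2], 'float': [np.nan, np.nan]})
for _, df_sub in df.groupby('int'):
    df_sub = df_sub.copy()
    df_sub['float'] = float(df_sub['int'].iloc[0])
    df_sub['int'] = df_sub['int'].astype('Int64')   # nullable integer, supports NA
    df.update(df_sub)
df['int'] = df['int'].astype('int')                 # back to a plain int column
print(df.dtypes)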

Recode missing data Numpy

I am reading in census data using the matplotlib csv2rec function - it works fine and gives me a nice ndarray.
But there are several columns where all the values are 'None' with dtype |O4. This is causing problems when I load the data into ATpy: "TypeError: object of NoneType has no len()". Something like '9999' or another missing-value marker would work for me. A mask is not going to work in this case because I am passing the real array to ATpy and it will not convert a masked array. The put function in numpy, which I think is otherwise the best way to change values, will not work with None values. I think some sort of boolean array is the way to go, but I can't get it to work.
So what is a good/fast way to change None values and/or uninitialized entries in a numpy array to something like '9999' or another recode? No masking.
Thanks,
Matthew
Here is a solution to this problem, although if your data is a record array you should only apply this operation to your column, rather than the whole array:
import numpy as np
# initialise some data with None in it
a = np.array([1, 2, 3, None])
a = np.where(a == np.array(None), 9999, a)
Note that you need to cast None into a numpy array for this to work
You can use a masked array when you do calculations, and when you pass the array to ATpy, you can call the filled(9999) method of the masked array to convert it to a normal array with the invalid values replaced by 9999.
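A minimal sketch of that approach (hypothetical column data; the mask targets the None entries):
import numpy as np
# hypothetical column with missing entries, as read from the census CSV
col = np.array([1, 2, None, 4], dtype=object)
masked = np.ma.masked_where(col == np.array(None), col)   # mask the None entries
recoded = masked.filled(9999).astype(int)                 # replace masked values with 9999
print(recoded)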
