When I use
x['test'] = df['a_variable'].str.contains('some string')
I get:
True
NaN
NaN
True
NaN
If I use
x[x['test'] != True]
Should I receive back the rows with value NaN?
Thanks.
Yes this is expected behaviour:
In [3]:
df = pd.DataFrame({'some_string':['asdsa','some',np.NaN, 'string']})
df
Out[3]:
some_string
0 asdsa
1 some
2 NaN
3 string
In [4]:
df['some_string'].str.contains('some')
Out[4]:
0 False
1 True
2 NaN
3 False
Name: some_string, dtype: object
Using the above as a mask:
In [13]:
df[df['some_string'].str.contains('some') != False]
Out[13]:
some_string
1 some
2 NaN
So the above is expected behaviour.
If you specify the value for NaN values using na=value then you can get whatever value you set as the returned value:
In [6]:
df['some_string'].str.contains('some', na=False)
Out[6]:
0 False
1 True
2 False
3 False
Name: some_string, dtype: bool
This matters because indexing with a mask that still contains NaN values raises a ValueError: a boolean mask must not contain missing values.
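A minimal sketch of how the na argument is typically used when filtering, reusing the sample frame from above. The names raw_mask, clean_mask, matches and non_matches are illustrative only. Whether the unfilled mask actually raises depends on the pandas version and the column dtype, so the sketch catches the error instead of relying on it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'some_string': ['asdsa', 'some', np.nan, 'string']})

# Without na=..., the missing entry may stay missing in the mask.
raw_mask = df['some_string'].str.contains('some')
try:
    _ = df[raw_mask]
    raw_mask_failed = False
except ValueError:
    # Masks containing NaN are rejected when used for indexing.
    raw_mask_failed = True

# With na=False the mask is purely boolean, so it can be used and negated.
clean_mask = df['some_string'].str.contains('some', na=False)
matches = df[clean_mask]
non_matches = df[~clean_mask]

print(matches)
print(non_matches)
```

Here the NaN row ends up in non_matches; pass na=True instead if missing values should count as matches.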
Yes, we would expect that to happen. For example:
x = pd.DataFrame([True, np.nan, True, np.nan])
print(x)
0
0 True
1 NaN
2 True
3 NaN
print(x[x[0] != True])
0
1 NaN
3 NaN
x[x[0] != True] returns everything where the value is not True, and NaN is not equal to True.
Likewise:
print(x[x[0] != False])
0
0 True
1 NaN
2 True
3 NaN
This returns every row, since the expression asks for all values that are not False, and NaN is not equal to False either.
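The comparisons above can be reproduced with a short sketch. It also shows one way to select only the True rows or only the NaN rows explicitly. The names not_true, not_false, only_true and only_nan are made up for illustration.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame([True, np.nan, True, np.nan])

# NaN compares as "not equal" to both True and False.
not_true = x[x[0] != True]     # keeps the NaN rows
not_false = x[x[0] != False]   # keeps the True rows and the NaN rows

# Being explicit about what is wanted avoids the surprise.
only_true = x[x[0] == True]    # NaN == True is False, so NaN rows drop out
only_nan = x[x[0].isna()]

print(not_true.index.tolist())
print(not_false.index.tolist())
print(only_true.index.tolist())
print(only_nan.index.tolist())
```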
There is a dataframe with a column named ADDRESS.
I am trying to count how many rows have an address that is null, False, NaN, None, an empty string, etc.
I have tried this:
t = len(new_dfr[new_dfr['ADDRESS'] == ''])
print(t)
How can I do that in pandas?
You can count NA values with isna():
df['ADDRESS'].isna().sum()
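This counts all None and NaN values, but not False or empty strings. To cover those as well, replace them with NaN first (np.nan is used as the replacement value because passing None to replace has been handled differently across pandas versions):
df['ADDRESS'].replace(['', False], np.nan).isna().sum()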
If I understood you correctly, you basically want to count all the falsy values including the NaNs (note that NaNs are considered truthy). In pandas terms this can be translated into
# (ADDRESS is NaN) OR (ADDRESS is not truthy)
(new_dfr['ADDRESS'].isna() | ~new_dfr['ADDRESS'].astype(bool)).sum()
Example:
new_dfr = pd.DataFrame({
'ADDRESS': [np.nan, None, False, '', 0, 1, True, 'not empty']
})
>>> new_dfr
ADDRESS
0 NaN
1 None
2 False
3
4 0
5 1
6 True
7 not empty
>>> new_dfr['ADDRESS'].isna()
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 False
Name: ADDRESS, dtype: bool
>>> ~new_dfr['ADDRESS'].astype(bool)
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 False
Name: ADDRESS, dtype: bool
>>> new_dfr['ADDRESS'].isna() | ~new_dfr['ADDRESS'].astype(bool)
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 False
Name: ADDRESS, dtype: bool
>>> (new_dfr['ADDRESS'].isna() | ~new_dfr['ADDRESS'].astype(bool)).sum()
5
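As a possible variation on the mask above, the check can be written as a small per-element function. This also makes it easy to treat whitespace-only strings as empty, which is an extra assumption not stated in the question. The helper name is_blank and the extra '   ' row are hypothetical.

```python
import numpy as np
import pandas as pd

def is_blank(value):
    """Return True for NaN/None, falsy values, and whitespace-only strings."""
    if isinstance(value, str):
        # Assumption: strings made only of whitespace count as empty.
        return not value.strip()
    if pd.isna(value):
        return True
    return not value

new_dfr = pd.DataFrame({
    'ADDRESS': [np.nan, None, False, '', 0, 1, True, 'not empty', '   ']
})

blank_mask = new_dfr['ADDRESS'].map(is_blank)
blank_count = int(blank_mask.sum())
print(blank_count)
```

The vectorised isna() | ~astype(bool) form is likely faster on large frames; the function form is mainly easier to extend.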
I want to filter a dataframe to the rows where a specific column is either null or not null. I have it working using:
df[df.Survive.notnull()] # contains no missing values
df[df.Survive.isnull()] #---> contains missing values
This works perfectly, but I want to make my code more dynamic and pass the column "Survive" as a variable, and that is not working for me.
I tried:
variableToPredict = ['Survive']
df[df[variableToPredict].notnull()]
I get the error - ValueError: cannot reindex from a duplicate axis
I'm sure I'm making a silly mistake, what can I do to fix this?
The mask used for filtering always has to be a Series, a list, or a 1-D array.
To test only one column, pass the column name as a scalar:
variableToPredict = 'Survive'
df[df[variableToPredict].notnull()]
But if you add [], the output is a one-column DataFrame, so you need to reduce it to a Series with any (tests whether at least one value per row is non-NaN, which only makes a difference with multiple columns) or all (tests whether all values per row are non-NaN, which only makes a difference with multiple columns):
variableToPredict = ['Survive']
df[df[variableToPredict].notnull().any(axis=1)]
variableToPredict = ['Survive', 'another column']
df[df[variableToPredict].notnull().any(axis=1)]
Sample:
df = pd.DataFrame({'Survive':[np.nan, 'A', 'B', 'B', np.nan],
'another column':[np.nan, np.nan, 'a','b','b']})
print (df)
Survive another column
0 NaN NaN
1 A NaN
2 B a
3 B b
4 NaN b
First, if you test only one column:
variableToPredict = 'Survive'
df1 = df[df[variableToPredict].notnull()]
print (df1)
Survive another column
1 A NaN
2 B a
3 B b
print (type(df[variableToPredict]))
<class 'pandas.core.series.Series'>
#Series
print (df[variableToPredict])
0 NaN
1 A
2 B
3 B
4 NaN
Name: Survive, dtype: object
print (df[variableToPredict].isnull())
0 True
1 False
2 False
3 False
4 True
Name: Survive, dtype: bool
If you use a list, here a one-element list:
variableToPredict = ['Survive']
print (type(df[variableToPredict]))
<class 'pandas.core.frame.DataFrame'>
#one element DataFrame
print (df[variableToPredict])
Survive
0 NaN
1 A
2 B
3 B
4 NaN
When testing per row, the output is the same for any and all:
print (df[variableToPredict].notnull().any(axis=1))
0 False
1 True
2 True
3 True
4 False
dtype: bool
print (df[variableToPredict].notnull().all(axis=1))
0 False
1 True
2 True
3 True
4 False
dtype: bool
If you test one or more columns in a list:
variableToPredict = ['Survive', 'another column']
print (type(df[variableToPredict]))
<class 'pandas.core.frame.DataFrame'>
print (df[variableToPredict])
Survive another column
0 NaN NaN
1 A NaN
2 B a
3 B b
4 NaN b
print (df[variableToPredict].notnull())
Survive another column
0 False False
1 True False
2 True True
3 True True
4 False True
#at least one non-NaN per row, at least one True
print (df[variableToPredict].notnull().any(axis=1))
0 False
1 True
2 True
3 True
4 True
dtype: bool
#all values non-NaN per row, all Trues
print (df[variableToPredict].notnull().all(axis=1))
0 False
1 False
2 True
3 True
4 False
dtype: bool
Add all at the end to reduce the one-column DataFrame to a Series:
df[df[variableToPredict].notnull().all(axis=1)]
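The scalar and list cases can be folded into one small helper, sketched below under the assumption that a column name is always a string. The function name rows_with_values and its how parameter are hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd

def rows_with_values(df, columns, how='all'):
    """Keep rows where the given column(s) are non-null.

    columns may be a single name or a list of names;
    how='all' requires every listed column to be non-null,
    how='any' requires at least one.
    """
    cols = [columns] if isinstance(columns, str) else list(columns)
    present = df[cols].notna()
    mask = present.all(axis=1) if how == 'all' else present.any(axis=1)
    return df[mask]

df = pd.DataFrame({'Survive': [np.nan, 'A', 'B', 'B', np.nan],
                   'another column': [np.nan, np.nan, 'a', 'b', 'b']})

print(rows_with_values(df, 'Survive'))
print(rows_with_values(df, ['Survive', 'another column'], how='all'))
print(rows_with_values(df, ['Survive', 'another column'], how='any'))
```

Wrapping a single name in a list first means the mask is always built the same way, so the caller does not need to care whether df[...] returned a Series or a DataFrame.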
I have a data frame
A AA B D C E
True 2 False 33 False False
False 3 False 43 True False
True 5 True 56 False True
False 2 False 7 nan True
I want a column named "result" that holds the names of the columns among A, B and C whose value is True, and nan if none of them is True.
Expected Column
result
A
C
A,B
nan
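First select the columns of interest and compare their values with True, then use matrix multiplication with the column names plus a separator via DataFrame.dot, remove the trailing separator with Series.str.rstrip, and finally replace empty strings with missing values. Restricting the comparison to A, B and C keeps other columns out of the result (a numeric 1, for instance, also compares equal to True):
cols = ['A', 'B', 'C']
df['result'] = df[cols].eq(True).dot(pd.Index(cols) + ',').str.rstrip(',').replace('', np.nan)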
print (df)
A AA B D C result
0 True 2 False 33 False A
1 False 3 False 43 True C
2 True 5 True 56 False A,B
3 False 2 False 7 NaN NaN
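For comparison, here is a row-wise alternative to the dot approach that produces the same result column by joining the names of the True columns. It is likely slower on large frames but arguably easier to read. The variable names cols, flags and joined are illustrative, and the frame is rebuilt from the sample in the question without column E.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [True, False, True, False],
    'AA': [2, 3, 5, 2],
    'B': [False, False, True, False],
    'D': [33, 43, 56, 7],
    'C': [False, True, False, np.nan],
})

cols = ['A', 'B', 'C']

# Boolean frame: True only where the cell equals True (NaN becomes False).
flags = df[cols].eq(True)

# For each row, join the names of the columns flagged True.
joined = flags.apply(lambda row: ','.join(row.index[row.to_numpy()]), axis=1)

# Rows with no True value yield an empty string; turn that into NaN.
df['result'] = joined.replace('', np.nan)
print(df)
```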
I am writing a data quality script using pandas, where the script checks certain conditions on each column.
At the moment I need to find the rows that don't have a decimal or whole number in a particular column. I am able to find the numbers if they are whole numbers, but the methods I have seen so far, i.e. isdigit(), isnumeric(), isdecimal() etc., fail to correctly identify decimal numbers, e.g. 2.5, 0.1245 etc.
Following is some sample code & data:
>>> df = pd.DataFrame([
[np.nan, 'foo', 0],
[1, '', 1],
[-1.387326, np.nan, 2],
[0.814772, ' baz', ' '],
["a", ' ', 4],
[" ", 'foo qux ', ' '],
], columns='A B C'.split(),dtype=str)
>>> df
A B C
0 NaN foo 0
1 1 1
2 -1.387326 NaN 2
3 0.814772 baz
4 a 4
5 foo qux
>>> df['A']
0 NaN
1 1
2 -1.387326
3 0.814772
4 a
5
Name: A, dtype: object
The following methods all fail to identify the decimal numbers:
df['A'].fillna('').str.isdigit()
df['A'].fillna('').str.isnumeric()
df['A'].fillna('').str.isdecimal()
0 False
1 True
2 False
3 False
4 False
5 False
Name: A, dtype: bool
So when I try the following, I only get one row:
>>> df[df['A'].fillna('').str.isdecimal()]
A B C
1 1 1
NB: I am using dtype=str to get the data without pandas interpreting or changing the dtypes of the values. The actual data could have spaces in column A; I will trim those out using replace(). I have kept the code simple here so as not to confuse things.
Use to_numeric with errors='coerce' to convert non-numeric values to NaN, and then test with Series.notna:
print (pd.to_numeric(df['A'], errors='coerce').notna())
0 False
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
If you need True to be returned for missing values as well:
print (pd.to_numeric(df['A'], errors='coerce').notna() | df['A'].isna())
0 True
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
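Building on the to_numeric mask, the sketch below selects the rows that do not hold a number, and additionally separates whole numbers from decimals. It uses a shortened version of column A, and the names as_number, is_number, is_whole and is_decimal are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(['1', '-1.387326', '0.814772', 'a', np.nan], name='A')

# Unparseable strings and missing values both become NaN.
as_number = pd.to_numeric(s, errors='coerce')
is_number = as_number.notna()

# Rows that fail the data quality check.
not_a_number = s[~is_number]

# A parsed value with no fractional part is treated as a whole number.
is_whole = is_number & (as_number % 1 == 0)
is_decimal = is_number & ~is_whole

print(not_a_number)
print(is_decimal)
```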
Another solution with a custom function:
def test_numeric(x):
    try:
        float(x)
        return True
    except Exception:
        return False
print (df['A'].apply(test_numeric))
0 True
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
print (df['A'].fillna('').apply(test_numeric))
0 False
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
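Alternatively, if you want to keep the values as strings, you can test for a literal dot. Pass regex=False, because by default '.' is treated as a regular expression that matches any character:
df['A'].str.contains('.', regex=False, na=False)
0 False
1 False
2 True
3 True
4 False
5 False
Name: A, dtype: bool
The only risk in that case is that you also flag non-numeric strings that happen to contain a dot, which is not what you want.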
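One way to reduce that risk while still working on strings is to match the whole value against a pattern for a plain decimal number. The pattern below is an assumption about what should count as a number (optional sign, digits, optional fractional part) and deliberately rejects forms such as '1.' or scientific notation.

```python
import numpy as np
import pandas as pd

s = pd.Series(['1', '-1.387326', '0.814772', 'a', ' ', np.nan, 'v1.2', '1.'])

# Optional sign, one or more digits, optional ".digits"; anchored at both ends.
number_pattern = r'^[+-]?\d+(?:\.\d+)?$'

looks_numeric = s.str.match(number_pattern, na=False)
print(looks_numeric)
```

Unlike a bare search for a dot, 'v1.2' is rejected here because the entire string has to fit the pattern.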
So, I've got a df like so,
ID,A,B,C,D,E,F,G
1,123,30,3G,1,123,30,3G
2,456,40,4G,NaN,NaN,NaN,4G
3,789,35,5G,NaN,NaN,NaN,NaN
I also have a list that has a subset of the header list of df like so,
header_list = ["D","E","F","G"]
Now I'd like to get those records from df that CONTAINS Null values FOR ALL OF the Column Names in the header_list.
Expected Output:
ID,A,B,C,D,E,F,G
3,789,35,5G,NaN,NaN,NaN,NaN
I tried,
new_df = df[df[header_list].isnull()] but this throws error, ValueError: Boolean array expected for the condition, not float64
I know I can do something like this,
new_df = df[(df['D'].isnull()) & (df['E'].isnull()) & (df['F'].isnull()) & (df['G'].isnull())]
But I don't want to hard code it like this. So is there a better way of doing this?
You can filter this with:
df[df[header_list].isnull().all(axis=1)]
We thus check, for each row, whether all of the values in the header_list columns are null.
For the given sample input, this gives the expected output:
>>> df[df[header_list].isnull().all(axis=1)]
A B C D E F G
3 789 35 5G NaN NaN NaN NaN
The .all(axis=1) call returns True for a row if all columns in that row are True, and False otherwise. So for the given sample input, we get:
>>> df[header_list]
D E F G
1 1.0 123.0 30.0 3G
2 NaN NaN NaN 4G
3 NaN NaN NaN NaN
>>> df[header_list].isnull()
D E F G
1 False False False False
2 True True True False
3 True True True True
>>> df[header_list].isnull().all(axis=1)
1 False
2 False
3 True
dtype: bool
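For completeness, a self-contained sketch that rebuilds the sample from the question and applies the filter, together with its complement via dropna. Reading the data through io.StringIO with ID as the index is an assumption made to match the output shown above.

```python
import io
import pandas as pd

csv_text = """ID,A,B,C,D,E,F,G
1,123,30,3G,1,123,30,3G
2,456,40,4G,NaN,NaN,NaN,4G
3,789,35,5G,NaN,NaN,NaN,NaN
"""
df = pd.read_csv(io.StringIO(csv_text), index_col='ID')
header_list = ['D', 'E', 'F', 'G']

# Rows where every column in header_list is null.
all_null = df[df[header_list].isnull().all(axis=1)]

# The complement: drop rows where every column in header_list is null.
not_all_null = df.dropna(subset=header_list, how='all')

print(all_null)
print(not_all_null)
```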