There is a dataframe with a column named ADDRESS.
I am trying to count how many rows have an ADDRESS that is null, False, NaN, None, an empty string, etc.
I have tried this:
t = len(new_dfr[new_dfr['ADDRESS'] == ''])
print(t)
How to do that in Pandas?
You can count NA values with isna():
df['ADDRESS'].isna().sum()
This will count all None and NaN values, but not False or empty strings. You could map those to None as well to cover them:
df['ADDRESS'].replace({'': None, False: None}).isna().sum()
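For example, a minimal sketch on made-up data (the dict form of replace is used because passing value=None with a scalar to_replace has historically triggered pad-filling in pandas):
import pandas as pd

# hypothetical sample column with several "empty-like" values
df = pd.DataFrame({'ADDRESS': [None, float('nan'), False, '', '221B Baker St']})

# map empty strings and False to None in one pass, then count missing values
n_missing = df['ADDRESS'].replace({'': None, False: None}).isna().sum()
print(n_missing)  # 4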
If I understood you correctly, you basically want to count all the falsy values including the NaNs (note that NaNs are considered truthy). In pandas terms this can be translated into
# (ADDRESS is NaN) OR (ADDRESS is not truthy)
(new_dfr['ADDRESS'].isna() | ~new_dfr['ADDRESS'].astype(bool)).sum()
Example:
import numpy as np
import pandas as pd

new_dfr = pd.DataFrame({
    'ADDRESS': [np.nan, None, False, '', 0, 1, True, 'not empty']
})
>>> new_dfr
ADDRESS
0 NaN
1 None
2 False
3
4 0
5 1
6 True
7 not empty
>>> new_dfr['ADDRESS'].isna()
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 False
Name: ADDRESS, dtype: bool
>>> ~new_dfr['ADDRESS'].astype(bool)
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 False
Name: ADDRESS, dtype: bool
>>> new_dfr['ADDRESS'].isna() | ~new_dfr['ADDRESS'].astype(bool)
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 False
Name: ADDRESS, dtype: bool
>>> (new_dfr['ADDRESS'].isna() | ~new_dfr['ADDRESS'].astype(bool)).sum()
5
id.  datcol1  datacol2  datacol-n  final col (to be created in output)
1    false    true      true       0
2    false    false     false      2
3    true     true      true       0
4    true     false     false      1
There are multiple columns, say 13. The job is to take each row across all the columns and
check: if a row has at least two "true" values, assign 0; if it has exactly one "true", assign 1; and if it has no "true" at all, assign 2.
Considering df to be:
In [1542]: df
Out[1542]:
id. datcol1 datacol2 datacol-n
0 1 False True True
1 2 False False False
2 3 True True True
3 4 True False False
Use numpy.select, df.filter, Series.ge and df.sum:
In [1546]: import numpy as np
In [1547]: x = df.filter(like='dat').sum(axis=1)
In [1548]: conds = [x.ge(2), x.eq(1), x.eq(0)]
In [1549]: choices = [0, 1, 2]
In [1553]: df['flag'] = np.select(conds, choices)
In [1554]: df
Out[1554]:
id. datcol1 datacol2 datacol-n flag
0 1 False True True 0
1 2 False False False 2
2 3 True True True 0
3 4 True False False 1
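Note the frame above uses real booleans. If, as the question suggests, the columns hold the strings "true"/"false", compare them to 'true' first; a sketch under that assumption:
# turn the "true"/"false" strings into booleans before counting per row
x = df.filter(like='dat').eq('true').sum(axis=1)
df['flag'] = np.select([x.ge(2), x.eq(1)], [0, 1], default=2)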
I have a dataframe:
A B C D
0 True 5 True True
1 True 6 False False
2 False 5 True True
3 False 8 True False
4 True 2 True True
When column D is True, it should print how many times columns A and C are True.
Expected Output
A : 2
C : 3
You can filter by column D directly because it is already boolean, use boolean indexing with DataFrame.loc to also select columns A and C by name, and finally count the True values with sum:
s = df.loc[df.D, ['A','C']].sum()
print (s)
A 2
C 3
dtype: int64
Details:
print (df.loc[df.D, ['A','C']])
A C
0 True True
2 False True
4 True True
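If you want the exact "A : 2" layout from the expected output, you can iterate over the resulting Series; a small formatting sketch:
# s is the Series computed above
for col, cnt in s.items():
    print(col, ':', cnt)
# A : 2
# C : 3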
I am writing a data quality script using pandas, where the script checks certain conditions on each column.
At the moment I need to find the rows that don't have a decimal or a whole number in a particular column. I am able to find the numbers if they are whole, but the methods I have seen so far, i.e. isdigit(), isnumeric(), isdecimal(), etc., fail to correctly identify decimal numbers, e.g. 2.5, 0.1245, etc.
Following is some sample code & data:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([
        [np.nan, 'foo', 0],
        [1, '', 1],
        [-1.387326, np.nan, 2],
        [0.814772, ' baz', ' '],
        ["a", ' ', 4],
        [" ", 'foo qux ', ' '],
    ], columns='A B C'.split(), dtype=str)
>>> df
A B C
0 NaN foo 0
1 1 1
2 -1.387326 NaN 2
3 0.814772 baz
4 a 4
5 foo qux
>>> df['A']
0 NaN
1 1
2 -1.387326
3 0.814772
4 a
5
Name: A, dtype: object
The following methods all fail to identify the decimal numbers:
df['A'].fillna('').str.isdigit()
df['A'].fillna('').str.isnumeric()
df['A'].fillna('').str.isdecimal()
0 False
1 True
2 False
3 False
4 False
5 False
Name: A, dtype: bool
So when I try the following, I only get one row:
>>> df[df['A'].fillna('').str.isdecimal()]
A B C
1 1 1
NB: I am using dtype=str to get the data without pandas interpreting/changing the dtypes of the values. The actual data could have spaces in column A; I will trim that out using replace(). I have kept the code simple here so as not to confuse things.
Use to_numeric with errors='coerce' to convert non-numeric values to NaN, and then test with Series.notna:
print (pd.to_numeric(df['A'], errors='coerce').notna())
0 False
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
If you need True for missing values as well:
print (pd.to_numeric(df['A'], errors='coerce').notna() | df['A'].isna())
0 True
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
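The same idea also pulls out the offending rows themselves, i.e. values that are neither numeric nor missing; a sketch building on the frame above:
# rows where column A is non-numeric but not NaN
mask = pd.to_numeric(df['A'], errors='coerce').isna() & df['A'].notna()
print(df[mask])  # rows 4 and 5 ('a' and ' ')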
Another solution with a custom function:
def test_numeric(x):
    try:
        float(x)
        return True
    except Exception:
        return False
print (df['A'].apply(test_numeric))
0 True
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
print (df['A'].fillna('').apply(test_numeric))
0 False
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
Alternatively, if you want to keep the string structure, you can test for a literal decimal point:
df['A'].str.contains('.', regex=False, na=False)
0 False
1 False
2 True
3 True
4 False
5 False
Name: A, dtype: bool
The only risk in that case is that you would also identify strings that merely contain a . without being numbers, which is not what you want.
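If you want a purely string-based test that accepts only whole numbers and decimals, a regular expression is safer. A sketch, assuming pandas >= 1.1 for str.fullmatch; the pattern below ignores scientific notation:
# optional sign, digits, optional fractional part; NaN counts as no match
df['A'].str.fullmatch(r'-?\d+(?:\.\d+)?', na=False)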
Given a Pandas data frame like
ID VALUE
1 false
2 true
3 false
4 false
5 false
6 true
7 true
8 true
9 false
The result should be true for the next row following a group of true values:
ID RESULT
1 false
2 false
3 true
4 false
5 false
6 false
7 false
8 false
9 true
How to achieve this in Pandas?
You can check if the diff() result of the VALUE column is equal to -1:
df.VALUE.astype(int).diff() == -1
#0 False
#1 False
#2 True
#3 False
#4 False
#5 False
#6 False
#7 False
#8 True
#Name: VALUE, dtype: bool
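This assumes VALUE is already boolean (or 0/1). If it holds the strings 'false'/'true' as shown in the question, map it to numbers first; a sketch:
# map the strings to integers, then look for a 1 -> 0 transition
df['VALUE'].map({'false': 0, 'true': 1}).diff() == -1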
You can compare the values against an offset version to find where a new false is after trues:
>>> df['VALUE'] = df['VALUE'].astype('bool')
>>> (~df['VALUE'] & df['VALUE'].shift())
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 True
Name: VALUE, dtype: bool
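shift() leaves a NaN in the first position, which the & here silently treats as False. If you prefer to keep the dtype strictly boolean, shift accepts a fill_value argument (pandas >= 0.24); a sketch:
# same comparison, but the first element is filled with False instead of NaN
~df['VALUE'] & df['VALUE'].shift(fill_value=False)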
import pandas as pd

values = ['false', 'true', 'false', 'false', 'false', 'true', 'true', 'true', 'false']
df = pd.DataFrame(values, columns=['values'])
print("Before changes:")
print(df)

# indices that currently hold 'true' all become 'false'
to_become_false = df[df['values'] == 'true'].index.tolist()
# the row right after each run of 'true' values becomes 'true'
to_become_true = [idx + 1 for idx in to_become_false
                  if idx + 1 not in to_become_false and idx + 1 < len(df)]
df.loc[to_become_false, 'values'] = 'false'
df.loc[to_become_true, 'values'] = 'true'
print("\n\nAfter changes:")
print(df)
Result:
Before changes:
values
0 false
1 true
2 false
3 false
4 false
5 true
6 true
7 true
8 false
After changes:
values
0 false
1 false
2 true
3 false
4 false
5 false
6 false
7 false
8 true
When I use
x['test'] = df['a_variable'].str.contains('some string')
I get:
True
NaN
NaN
True
NaN
If I use
x[x['test'] != True]
should I get back the rows whose value is NaN?
Thanks.
Yes, this is expected behaviour:
In [3]:
df = pd.DataFrame({'some_string':['asdsa','some',np.NaN, 'string']})
df
Out[3]:
some_string
0 asdsa
1 some
2 NaN
3 string
In [4]:
df['some_string'].str.contains('some')
Out[4]:
0 False
1 True
2 NaN
3 False
Name: some_string, dtype: object
Using the above as a mask:
In [13]:
df[df['some_string'].str.contains('some') != False]
Out[13]:
some_string
1 some
2 NaN
So the above is expected behaviour.
If you specify a value for the NaN entries using the na= parameter, you get back whatever value you set:
In [6]:
df['some_string'].str.contains('some', na=False)
Out[6]:
0 False
1 True
2 False
3 False
Name: some_string, dtype: bool
The above becomes important because boolean indexing with a mask that contains NaN values will raise an error.
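If instead you want to select only the rows where the match result is NaN (i.e. the original value was missing), you can test the result with isna(); a sketch:
# rows where str.contains returned NaN because the value itself was NaN
df[df['some_string'].str.contains('some').isna()]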
Yes, we would expect this to happen. For example:
import numpy as np
import pandas as pd

x = pd.DataFrame([True, np.nan, True, np.nan])
print(x)
0
0 True
1 NaN
2 True
3 NaN
print(x[x[0] != True])
0
1 NaN
3 NaN
x[x[0] != True] returns everything where the value is not True.
Likewise:
print(x[x[0] != False])
0
0 True
1 NaN
2 True
3 NaN
Since the expression returns all values that are not False, everything is returned.