How to find rows where a particular column has decimal numbers using pandas? - python

I am writing a data quality script using pandas, where the script checks certain conditions on each column.
At the moment I need to find the rows that don't have a decimal or an actual number in a particular column. I am able to find the numbers when the value is a whole number, but the methods I have seen so far, i.e. isdigit(), isnumeric(), isdecimal() etc., fail to correctly identify a decimal number, e.g. 2.5, 0.1245 etc.
Following is some sample code & data:
>>> df = pd.DataFrame([
...     [np.nan, 'foo', 0],
...     [1, '', 1],
...     [-1.387326, np.nan, 2],
...     [0.814772, ' baz', ' '],
...     ["a", ' ', 4],
...     [" ", 'foo qux ', ' '],
... ], columns='A B C'.split(), dtype=str)
>>> df
           A         B  C
0        NaN       foo  0
1          1            1
2  -1.387326       NaN  2
3   0.814772       baz
4          a            4
5             foo qux
>>> df['A']
0 NaN
1 1
2 -1.387326
3 0.814772
4 a
5
Name: A, dtype: object
The following methods all fail to identify the decimal numbers:
df['A'].fillna('').str.isdigit()
df['A'].fillna('').str.isnumeric()
df['A'].fillna('').str.isdecimal()
0 False
1 True
2 False
3 False
4 False
5 False
Name: A, dtype: bool
So when I try the following I only get 1 row:
>>> df[df['A'].fillna('').str.isdecimal()]
   A B  C
1  1    1
NB: I am using dtype=str to get the data without pandas interpreting/changing the values or dtypes. The actual data could have spaces in column A; I will trim those out using replace(). I have kept the code simple here so as not to confuse things.

Use to_numeric with errors='coerce' to convert non-numeric values to NaN, then test with Series.notna:
print (pd.to_numeric(df['A'], errors='coerce').notna())
0 False
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
If you need True for missing values as well:
print (pd.to_numeric(df['A'], errors='coerce').notna() | df['A'].isna())
0 True
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
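For example, to pull out the offending rows (values that are neither numeric nor missing), you can invert the combined mask; a small sketch building on the lines above:
m = pd.to_numeric(df['A'], errors='coerce').notna() | df['A'].isna()
print (df[~m])   # rows 4 and 5: the values that are neither numeric nor missing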
Another solution is a custom function:
def test_numeric(x):
    try:
        float(x)
        return True
    except Exception:
        return False
print (df['A'].apply(test_numeric))
0 True
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
print (df['A'].fillna('').apply(test_numeric))
0 False
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool
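If the real data can contain stray spaces, as the question's NB mentions, one option (a sketch, not part of the original answer) is to strip them before testing:
print (df['A'].fillna('').str.strip().apply(test_numeric))
0 False
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool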

Alternatively, if you want to keep the string structure you can test for a literal decimal point (the dot must be escaped, otherwise it is a regex wildcard that matches any character):
df['A'].str.contains(r'\.', na=False)
0 False
1 False
2 True
3 True
4 False
5 False
Name: A, dtype: bool
Note this flags only values with a decimal point, so whole numbers like '1' are not matched. The only other risk is that you would also match words that contain a dot, which is not what you want.
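If you want to match whole numbers as well as decimals, a stricter pattern with Series.str.fullmatch (available in pandas 1.1+) is an option; the exact pattern below is my own sketch, not from the original answer:
df['A'].str.strip().str.fullmatch(r'[+-]?\d+(?:\.\d+)?', na=False)
0 False
1 True
2 True
3 True
4 False
5 False
Name: A, dtype: bool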

Related

How to count where column value is falsy in Pandas?

There is a dataframe with a column named ADDRESS.
I am trying to count how many rows have an address that is null, False, NaN, None, an empty string, etc.
I have tried this:
t = len(new_dfr[new_dfr['ADDRESS'] == ''])
print(t)
How to do that in Pandas?
You can count NA values with isna():
df['ADDRESS'].isna().sum()
This will count all None and NaN values, but not False or empty strings. You could replace those with NaN first to cover them:
df['ADDRESS'].replace(['', False], np.nan).isna().sum()
If I understood you correctly, you basically want to count all the falsy values including the NaNs (note that NaNs are considered truthy). In pandas terms this can be translated into
# (ADDRESS is NaN) OR (ADDRESS is not truthy)
(new_dfr['ADDRESS'].isna() | ~new_dfr['ADDRESS'].astype(bool)).sum()
Example:
new_dfr = pd.DataFrame({
    'ADDRESS': [np.nan, None, False, '', 0, 1, True, 'not empty']
})
>>> new_dfr
     ADDRESS
0        NaN
1       None
2      False
3
4          0
5          1
6       True
7  not empty
>>> new_dfr['ADDRESS'].isna()
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 False
Name: ADDRESS, dtype: bool
>>> ~new_dfr['ADDRESS'].astype(bool)
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 False
Name: ADDRESS, dtype: bool
>>> new_dfr['ADDRESS'].isna() | ~new_dfr['ADDRESS'].astype(bool)
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 False
Name: ADDRESS, dtype: bool
>>> (new_dfr['ADDRESS'].isna() | ~new_dfr['ADDRESS'].astype(bool)).sum()
5
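An equivalent, if slower, spelling with a plain Python truthiness test (a sketch; the explicit pd.isna guard is needed because NaN itself is truthy):
new_dfr['ADDRESS'].apply(lambda v: pd.isna(v) or not v).sum()
# 5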

How to count how many times the same boolean occurs in two columns in Python

I have a dataframe:
A B C D
0 True 5 True True
1 True 6 False False
2 False 5 True True
3 False 8 True False
4 True 2 True True
It should print, for the rows where column D is True, how many times column A and column C are True.
Expected Output
A : 2
C : 3
You can filter rows by the boolean column D using boolean indexing with DataFrame.loc, select columns A and C by name, and finally count the True values with sum:
s = df.loc[df.D, ['A','C']].sum()
print (s)
A 2
C 3
dtype: int64
Details:
print (df.loc[df.D, ['A','C']])
A C
0 True True
2 False True
4 True True
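For reference, a minimal self-contained version of this solution, assuming the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'A': [True, True, False, False, True],
                   'B': [5, 6, 5, 8, 2],
                   'C': [True, False, True, True, True],
                   'D': [True, False, True, False, True]})

# keep rows where D is True, then sum the booleans column-wise
print (df.loc[df.D, ['A', 'C']].sum())
# A    2
# C    3
# dtype: int64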

How can I filter dataframe based on null/not null using a column name as a variable?

I want to list a dataframe where a specific column is either null or not null. I have it working using:
df[df.Survive.notnull()] # contains no missing values
df[df.Survive.isnull()] # ---> contains missing values
This works perfectly, but I want to make my code more dynamic and pass the column "Survive" as a variable, and it's not working for me.
I tried:
variableToPredict = ['Survive']
df[df[variableToPredict].notnull()]
I get the error - ValueError: cannot reindex from a duplicate axis
I'm sure I'm making a silly mistake, what can I do to fix this?
The idea: the mask used for filtering must always be a Series, list, or 1d array.
If you want to test only one column, use a scalar:
variableToPredict = 'Survive'
df[df[variableToPredict].notnull()]
But if you add [], the output is a one-column DataFrame, so you need to reduce it row-wise with any (True if at least one value in the row is non-NaN; useful for multiple columns) or all (True only if every value in the row is non-NaN):
variableToPredict = ['Survive']
df[df[variableToPredict].notnull().any(axis=1)]
variableToPredict = ['Survive', 'another column']
df[df[variableToPredict].notnull().any(axis=1)]
Sample:
df = pd.DataFrame({'Survive':[np.nan, 'A', 'B', 'B', np.nan],
                   'another column':[np.nan, np.nan, 'a','b','b']})
print (df)
Survive another column
0 NaN NaN
1 A NaN
2 B a
3 B b
4 NaN b
First, testing only one column:
variableToPredict = 'Survive'
df1 = df[df[variableToPredict].notnull()]
print (df1)
Survive another column
1 A NaN
2 B a
3 B b
print (type(df[variableToPredict]))
<class 'pandas.core.series.Series'>
#Series
print (df[variableToPredict])
0 NaN
1 A
2 B
3 B
4 NaN
Name: Survive, dtype: object
print (df[variableToPredict].isnull())
0 True
1 False
2 False
3 False
4 True
Name: Survive, dtype: bool
If you use a list (here a one-element list):
variableToPredict = ['Survive']
print (type(df[variableToPredict]))
<class 'pandas.core.frame.DataFrame'>
#one element DataFrame
print (df[variableToPredict])
Survive
0 NaN
1 A
2 B
3 B
4 NaN
When testing per row, a single column gives the same output for any and all:
print (df[variableToPredict].notnull().any(axis=1))
0 False
1 True
2 True
3 True
4 False
dtype: bool
print (df[variableToPredict].notnull().all(axis=1))
0 False
1 True
2 True
3 True
4 False
dtype: bool
If you test multiple columns in a list:
variableToPredict = ['Survive', 'another column']
print (type(df[variableToPredict]))
<class 'pandas.core.frame.DataFrame'>
print (df[variableToPredict])
Survive another column
0 NaN NaN
1 A NaN
2 B a
3 B b
4 NaN b
print (df[variableToPredict].notnull())
Survive another column
0 False False
1 True False
2 True True
3 True True
4 False True
# True if at least one value in the row is non-NaN
print (df[variableToPredict].notnull().any(axis=1))
0 False
1 True
2 True
3 True
4 True
dtype: bool
# True only if all values in the row are non-NaN
print (df[variableToPredict].notnull().all(axis=1))
0 False
1 False
2 True
3 True
4 False
dtype: bool
Alternatively, add all at the end:
df[df[variableToPredict].notnull().all(1)]
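If this pattern is needed in several places, it can be wrapped in a small helper; this is my own sketch and the name filter_notnull is hypothetical:
import pandas as pd

def filter_notnull(df, cols, how='any'):
    # Accept a single column name or a list of names.
    cols = [cols] if isinstance(cols, str) else list(cols)
    mask = df[cols].notnull()
    # 'any' keeps rows with at least one non-NaN among cols,
    # 'all' keeps only rows with no NaN among cols.
    mask = mask.any(axis=1) if how == 'any' else mask.all(axis=1)
    return df[mask]

filter_notnull(df, 'Survive')
filter_notnull(df, ['Survive', 'another column'], how='all')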

How to differentiate string and Alphanumeric?

df:
company_name product Id rating
0 matrix mobile Id456 2.5
1 ins-faq alpha1 Id956 3.5
2 metric5 sounds-B Id-356 2.5
3 ingsaf digital Id856 4star
4 matrix win11p Idklm 2.0
5 4567 mobile 596 3.5
df2:
Col_name Datatype
0 company_name String #(pure string)
1 Product String #(pure string)
2 Id Alpha-Numeric #(must contain atleast 1 number and 1 alphabet)
3 rating Float or int
df is the main dataframe and df2 holds the expected datatype information for the main dataframe.
How do I check every column's values and extract the values with the wrong datatype?
output:
row_num col_name current_value expected_dtype
0 2 company_name metric5 string
1 5 company_name 4567 string
2 1 Product alpha1 string
3 4 Product win11p string
4 4 Id Idklm Alpha-Numeric
5 5 Id 596 Alpha-Numeric
6 3 rating 4star Float or int
For columns that cannot contain numbers, you can find the exceptions with:
In [5]: df['product'].str.contains(r'[0-9]')
Out[5]:
0 False
1 True
2 False
3 False
4 True
5 False
Name: product, dtype: bool
For Alpha-Numeric columns identify compliance with:
In [7]: df['Id'].str.contains(r'(?:\d\D)|(?:\D\d)')
Out[7]:
0 True
1 True
2 True
3 True
4 False
5 False
Name: Id, dtype: bool
For int or float columns find exceptions with
In [8]: df['rating'].str.contains(r'[^0-9.+-]')
Out[8]:
0 False
1 False
2 False
3 True
4 False
5 False
That may be problematic: it won't catch things with multiple or misplaced plus, minus, or dot characters, like 9.4.1 or 6+3.-12. But you could use
In [11]: def check(thing):
    ...:     try:
    ...:         float(thing)
    ...:         return True
    ...:     except ValueError:
    ...:         return False
    ...:
In [12]: df['rating'].apply(check)
Out[12]:
0 True
1 True
2 True
3 False
4 True
5 True
Name: rating, dtype: bool
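To assemble the expected output frame, one option (my own sketch building on the masks above; the rules mapping is hypothetical and reuses check() from the previous block) is to collect the failures per column:
import pandas as pd

rules = {
    'company_name': ('String', lambda s: ~s.str.contains(r'[0-9]')),
    'product': ('String', lambda s: ~s.str.contains(r'[0-9]')),
    'Id': ('Alpha-Numeric', lambda s: s.str.contains(r'(?:\d\D)|(?:\D\d)')),
    'rating': ('Float or int', lambda s: s.apply(check)),
}

rows = []
for col, (expected, is_valid) in rules.items():
    bad = df.loc[~is_valid(df[col]), col]      # values failing the check
    for idx, val in bad.items():
        rows.append({'row_num': idx, 'col_name': col,
                     'current_value': val, 'expected_dtype': expected})

report = pd.DataFrame(rows)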

Does pandas != 'a value' return NaNs?

When I use
x['test'] = df['a_variable'].str.contains('some string')
I get:
True
NaN
NaN
True
NaN
If I use
x[x['test'] != True]
Should I receive back the rows with value NaN?
Thanks.
Yes this is expected behaviour:
In [3]:
df = pd.DataFrame({'some_string':['asdsa','some',np.NaN, 'string']})
df
Out[3]:
some_string
0 asdsa
1 some
2 NaN
3 string
In [4]:
df['some_string'].str.contains('some')
Out[4]:
0 False
1 True
2 NaN
3 False
Name: some_string, dtype: object
Using the above as a mask:
In [13]:
df[df['some_string'].str.contains('some') != False]
Out[13]:
some_string
1 some
2 NaN
So the above is expected behaviour.
If you specify the value for NaN values using na=value then you can get whatever value you set as the returned value:
In [6]:
df['some_string'].str.contains('some', na=False)
Out[6]:
0 False
1 True
2 False
3 False
Name: some_string, dtype: bool
The above becomes important because boolean indexing with a mask that contains NaN values raises an error.
Yes, we would expect that to happen.
For example:
x = pd.DataFrame([True, np.nan, True, np.nan])
print(x)
0
0 True
1 NaN
2 True
3 NaN
print(x[x[0] != True])
0
1 NaN
3 NaN
x[x[0] != True] returns everything where the value is not True.
Likewise:
print(x[x[0] != False])
0
0 True
1 NaN
2 True
3 NaN
Since the expression returns all values that are not False.
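If the goal is specifically the NaN rows, selecting them explicitly with isna() avoids relying on the != True trick (a sketch):
import numpy as np
import pandas as pd

x = pd.DataFrame([True, np.nan, True, np.nan])
print(x[x[0].isna()])
#      0
# 1  NaN
# 3  NaN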
