Pandas - check if dataframe has negative value in any column - python

I wonder how to check whether a pandas DataFrame has a negative value in one or more columns, and
return only a boolean value (True or False). Can you please help?
In[1]: df = pd.DataFrame(np.random.randn(10, 3))
In[2]: df
Out[2]:
0 1 2
0 -1.783811 0.736010 0.865427
1 -1.243160 0.255592 1.670268
2 0.820835 0.246249 0.288464
3 -0.923907 -0.199402 0.090250
4 -1.575614 -1.141441 0.689282
5 -1.051722 0.513397 1.471071
6 2.549089 0.977407 0.686614
7 -1.417064 0.181957 0.351824
8 0.643760 0.867286 1.166715
9 -0.316672 -0.647559 1.331545
Expected output:
Out[3]: True

Actually, if speed is important, I did a few tests:
df = pd.DataFrame(np.random.randn(10000, 30000))
Test 1, slowest: pure pandas
(df < 0).any().any()
# 303 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Test 2, faster: switching over to numpy with .values for testing the presence of a True entry
(df < 0).values.any()
# 269 ms ± 8.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Test 3, maybe even faster, though not significant: switching over to numpy for the whole thing
(df.values < 0).any()
# 267 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
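As an aside, recent pandas versions recommend to_numpy() over .values; assuming a reasonably current pandas, the equivalent check (not re-timed here) would be:
(df.to_numpy() < 0).any()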

You can chain two any calls:
df.lt(0).any().any()
Out[96]: True

This does the trick:
(df < 0).any().any()
To break it down, (df < 0) gives a dataframe with boolean entries. Then the first .any() returns a series of booleans, testing within each column for the presence of a True value. And then, the second .any() asks whether this returned series itself contains any True value.
This returns a simple:
True
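To see the intermediate pieces, here is a minimal sketch on a tiny made-up frame (the column names a and b are only for illustration):
import pandas as pd

df = pd.DataFrame({'a': [1.0, -2.0], 'b': [3.0, 4.0]})

mask = df < 0              # boolean DataFrame: True where a value is negative
per_column = mask.any()    # boolean Series: does each column contain any True?
result = per_column.any()  # single bool: does any column contain any True?

print(per_column)  # a: True, b: False
print(result)      # True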

Related

Check if a row in column is unique python Dataframe

I have the following Dataframe:
id1  result
2    0.5
3    1.4
4    1.4
7    3.4
2    1.4
I want to check, for every row in the column 'id1', whether the value is unique.
The output should be:
False
True
True
True
False
The first and the last are False because id 2 exists twice.
I used this method:
bool = df["id1"].is_unique
but that checks whether the whole column is unique. I want to check it for each row.
df['id1'].map(~(df.groupby('id1').size() > 1))
Output
0 False
1 True
2 True
3 True
4 False
Name: id1, dtype: bool
Since you tagged this question with pandas, I'm assuming you're using the pandas package.
You can build an Index from the id1 values and then use the duplicated method, as in the following example (see the pandas docs for Index.duplicated).
import pandas as pd
check_id1_duplicate = pd.Index([2, 3, 4, 7, 2])
check_id1_duplicate.duplicated(keep=False)
# Results would be array([True, False, False, False, True])
To add to @ShiriNmi's answer, the duplicated solution is more intuitive and about 8 times faster, while returning the same result.
%timeit -n 10_000 df['id1'].map(~(df.groupby('id1').size() > 1))
# 697 µs ± 60.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit ~df['id1'].duplicated(keep=False)
# 89.5 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
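Putting it together on the question's data, a minimal sketch (the new column name is_unique is just illustrative):
import pandas as pd

df = pd.DataFrame({'id1': [2, 3, 4, 7, 2],
                   'result': [0.5, 1.4, 1.4, 3.4, 1.4]})

# True where the id1 value occurs exactly once in the column
df['is_unique'] = ~df['id1'].duplicated(keep=False)
print(df['is_unique'].tolist())
# [False, True, True, True, False]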

Pandas Dataframe change column values with given interval

I am trying to convert my column's values according to intervals, like:
if (x < 5)
    x = 2
else if (x >= 5 && x < 10)
    x = 3
and I am trying to do it in Python in a single line of code. I tried the mask and cut methods but could not get them to work.
This is my trial:
dataset['CURRENT_RATIO' ] = dataset['CURRENT_RATIO'].mask((dataset['CURRENT_RATIO'] < 0.02, -7.0) | (dataset['CURRENT_RATIO'] > =0.02 & dataset['CURRENT_RATIO'] < 0.37),-5))
What I actually need: if x < 0.02 then -7, else if x >= 0.02 and x < 0.37 then -5, and so on.
input output
0.015 -7
0.02 -5
0.37 -3
0.75 1
TL;DR
A list comprehension will do:
dataset.CURRENT_RATIO = [-7 if i<.02 else -5 if i<.37 else -3 if i<.75 else 1 for i in dataset.CURRENT_RATIO]
Timing it with a random dataset of 1,000,000 rows gave the following result:
334 ms ± 5.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Multiple Columns
If you were to do this with multiple columns, with different thresholds and update values, I would do the following:
df = pd.DataFrame({'RATIO1': np.random.rand(10000000),
                   'RATIO2': np.random.rand(10000000)})
update_per_col = {'RATIO1': [(.02, -7), (.37, -5), (.75, -3), 1],
                  'RATIO2': [(.12, 5), (.47, 6), (.85, 7), 8]}
cols_to_be_updated = ['RATIO1', 'RATIO2']

for col in cols_to_be_updated:
    df[col] = [update_per_col[col][0][1] if i < update_per_col[col][0][0] else
               update_per_col[col][1][1] if i < update_per_col[col][1][0] else
               update_per_col[col][2][1] if i < update_per_col[col][2][0] else
               update_per_col[col][3]
               for i in df[col]]
When we time the for loop with 10,000,000 rows and two columns we get:
9.37 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
To address Joe Ferndz's comment, let's try to speed it up. I came up with two methods: apply() with a named function and apply() with a lambda. With the named function we run the following code (only the for loop was timed):
def update_ratio(x, updates):
    if x < updates[0][0]:
        return updates[0][1]
    elif x < updates[1][0]:
        return updates[1][1]
    elif x < updates[2][0]:
        return updates[2][1]
    else:
        return updates[3]

for col in cols_to_be_updated:
    df[col] = df[col].apply(update_ratio, updates=update_per_col[col])
This gives us:
11.8 s ± 285 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Finally the lambda gives:
for col in cols_to_be_updated:
    df[col] = df[col].apply(lambda x: update_per_col[col][0][1] if x < update_per_col[col][0][0] else
                                      update_per_col[col][1][1] if x < update_per_col[col][1][0] else
                                      update_per_col[col][2][1] if x < update_per_col[col][2][0] else
                                      update_per_col[col][3])
8.91 s ± 171 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This means that a lambda is the fastest method, but a list comprehension is not far off.
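Since the question also mentions cut, here is a quick sketch of the same mapping with pd.cut, using the thresholds from the question (an alternative approach, not one of the methods timed above):
import numpy as np
import pandas as pd

dataset = pd.DataFrame({'CURRENT_RATIO': [0.015, 0.02, 0.37, 0.75]})

# Left-closed, right-open bins: [-inf, .02) -> -7, [.02, .37) -> -5,
# [.37, .75) -> -3, [.75, inf) -> 1
dataset['CURRENT_RATIO'] = pd.cut(dataset['CURRENT_RATIO'],
                                  bins=[-np.inf, .02, .37, .75, np.inf],
                                  labels=[-7, -5, -3, 1],
                                  right=False).astype(int)

print(dataset['CURRENT_RATIO'].tolist())  # [-7, -5, -3, 1]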
Use:
mask_below_10 = dataset['CURRENT_RATIO'] < 10
mask_below_5 = dataset['CURRENT_RATIO'] < 5
dataset.loc[mask_below_10, 'CURRENT_RATIO'] = 3
dataset.loc[mask_below_5, 'CURRENT_RATIO'] = 2
The first assignment sets every value below 10 to 3 and the second sets every value below 5 to 2. Both masks are computed against the original values (and .loc avoids chained-assignment issues), so values between 5 and 10 end up as 3.
It would be smart to write your own function and use pandas.Series.apply():
def my_lambda(x):
    if x < 5:
        x = 2
    elif x < 10:
        x = 3
    return x

dataset['CURRENT_RATIO'] = dataset['CURRENT_RATIO'].apply(my_lambda)

How to apply changes to subset dataframe to source dataframe

I'm trying to determine and flag duplicate 'Sample' values in a dataframe using groupby with lambda:
rdtRows["DuplicateSample"] = False
rdtRowsSampleGrouped = rdtRows.groupby( ['Sample']).filter(lambda x: len(x) > 1)
rdtRowsSampleGrouped["DuplicateSample"] = True
# How to get flag changes made on rdtRowsSampleGrouped to apply to rdtRows??
How do I make the changes / apply the "DuplicateSample" flag from rdtRowsSampleGrouped back to the source rdtRows data? I'm stumped :(
Use GroupBy.transform with GroupBy.size:
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
Or use Series.duplicated with keep=False if need faster solution:
df['DuplicateSample'] = df['Sample'].duplicated(keep=False)
Performance on some sample data (real data will differ, depending on the number of rows and the number of duplicated values):
np.random.seed(2020)
N = 100000
df = pd.DataFrame({'Sample': np.random.randint(100000, size=N)})
In [51]: %timeit df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
17 ms ± 50 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df['DuplicateSample1'] = df['Sample'].duplicated(keep=False)
3.73 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@Stef's solution is unfortunately about 2734 times slower than the duplicated solution:
In [53]: %timeit df['DuplicateSample2'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
10.2 s ± 517 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use transform:
import pandas as pd
df = pd.DataFrame({'Sample': [1,2,2,3,4,4]})
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
Result:
Sample DuplicateSample
0 1 False
1 2 True
2 2 True
3 3 False
4 4 True
5 4 True
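To answer the literal question (how to push the flag from the filtered subset back to the source frame), a minimal sketch that relies on the subset keeping the original index (variable names taken from the question):
import pandas as pd

rdtRows = pd.DataFrame({'Sample': [1, 2, 2, 3, 4, 4]})
rdtRows['DuplicateSample'] = False

# Rows whose 'Sample' value occurs more than once; the original index is preserved
rdtRowsSampleGrouped = rdtRows.groupby('Sample').filter(lambda x: len(x) > 1)

# Use that index to write the flag back to the source dataframe
rdtRows.loc[rdtRowsSampleGrouped.index, 'DuplicateSample'] = True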

Find indices of where Pandas Series contains element containing character

Example:
import pandas as pd
arr = pd.Series(['a',['a','b'],'c'])
I would like to get the indices of where the series contains elements containing 'a'. So I would like to get back indices 0 and 1.
I've tried writing
arr.str.contains('a')
but this returns
0 True
1 NaN
2 False
dtype: object
while I'd like it to return
0 True
1 True
2 False
dtype: object
Use Series.str.join() to concatenate the lists/arrays in cells into a single string and then use .str.contains('a'):
In [78]: arr.str.join(sep='~').str.contains('a')
Out[78]:
0 True
1 True
2 False
dtype: bool
Use Series.apply and Python's in keyword, which works on both lists and strings:
arr.apply(lambda x: 'a' in x)
This will work fine if you don't have any NaN values in your Series, but if you do, you can use:
arr.apply(lambda x: 'a' in x if x is not np.nan else x)
This is much faster than using Series.str.
Benchmarks:
%%timeit
arr.str.join(sep='~').str.contains('a')
Takes: 249 µs ± 4.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
arr.apply(lambda x: 'a' in x)
Takes: 70.1 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
arr.apply(lambda x: 'a' in x if x is not np.nan else x)
Takes: 69 µs ± 1.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
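For completeness, a minimal runnable version of the NaN-handling variant above (it assumes numpy is imported as np, which the snippet relies on):
import numpy as np
import pandas as pd

arr = pd.Series(['a', ['a', 'b'], 'c', np.nan])

# 'a' in x works for both strings and lists; NaN entries are passed through unchanged
print(arr.apply(lambda x: 'a' in x if x is not np.nan else x))
# 0     True
# 1     True
# 2    False
# 3      NaN
# dtype: object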

How to determine whether a Pandas Column contains a particular value

I am trying to determine whether there is an entry in a Pandas column that has a particular value. I tried to do this with if x in df['id']. I thought this was working, except that when I fed it a value that I knew was not in the column, 43 in df['id'], it still returned True. When I subset to a data frame only containing entries matching the missing id, df[df['id'] == 43], there are, obviously, no entries in it. How do I determine if a column in a Pandas data frame contains a particular value, and why doesn't my current method work? (FYI, I have the same problem when I use the implementation in this answer to a similar question.)
The in operator on a Series checks whether the value is in the index:
In [11]: s = pd.Series(list('abc'))
In [12]: s
Out[12]:
0 a
1 b
2 c
dtype: object
In [13]: 1 in s
Out[13]: True
In [14]: 'a' in s
Out[14]: False
One option is to see if it's in unique values:
In [21]: s.unique()
Out[21]: array(['a', 'b', 'c'], dtype=object)
In [22]: 'a' in s.unique()
Out[22]: True
or a python set:
In [23]: set(s)
Out[23]: {'a', 'b', 'c'}
In [24]: 'a' in set(s)
Out[24]: True
As pointed out by @DSM, it may be more efficient (especially if you're just doing this for one value) to use in directly on the values:
In [31]: s.values
Out[31]: array(['a', 'b', 'c'], dtype=object)
In [32]: 'a' in s.values
Out[32]: True
You can also use pandas.Series.isin although it's a little bit longer than 'a' in s.values:
In [2]: s = pd.Series(list('abc'))
In [3]: s
Out[3]:
0 a
1 b
2 c
dtype: object
In [3]: s.isin(['a'])
Out[3]:
0 True
1 False
2 False
dtype: bool
In [4]: s[s.isin(['a'])].empty
Out[4]: False
In [5]: s[s.isin(['z'])].empty
Out[5]: True
But this approach can be more flexible if you need to match multiple values at once for a DataFrame (see DataFrame.isin)
>>> df = DataFrame({'A': [1, 2, 3], 'B': [1, 4, 7]})
>>> df.isin({'A': [1, 3], 'B': [4, 7, 12]})
A B
0 True False # Note that B didn't match 1 here.
1 False True
2 True True
found = df[df['Column'].str.contains('Text_to_search')]
print(found.count())
found.count() will contain the number of matches (per column), and if it is 0 the string was not found in the column.
You can try this to check for a particular value 'x' in a particular column named 'id':
if x in df['id'].values
I did a few simple tests:
In [10]: x = pd.Series(range(1000000))
In [13]: timeit 999999 in x.values
567 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [24]: timeit 9 in x.values
666 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]: timeit (x == 999999).any()
6.86 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [21]: timeit x.eq(999999).any()
7.03 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [22]: timeit x.eq(9).any()
7.04 ms ± 60 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [15]: timeit x.isin([999999]).any()
9.54 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [17]: timeit 999999 in set(x)
79.8 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Interestingly it doesn't matter if you look up 9 or 999999, it seems like it takes about the same amount of time using the in syntax (must be using some vectorized computation)
In [24]: timeit 9 in x.values
666 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [25]: timeit 9999 in x.values
647 µs ± 5.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [26]: timeit 999999 in x.values
642 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [27]: timeit 99199 in x.values
644 µs ± 5.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [28]: timeit 1 in x.values
667 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Seems like using x.values is the fastest, but maybe there is a more elegant way in pandas?
Or use Series.tolist or Series.any:
>>> s = pd.Series(list('abc'))
>>> s
0 a
1 b
2 c
dtype: object
>>> 'a' in s.tolist()
True
>>> (s=='a').any()
True
Series.tolist makes a list out of the Series; in the other, I'm getting a boolean Series from the regular Series and then checking whether it contains any True values.
Simple condition:
if any(str(elem) in ['a','b'] for elem in df['column'].tolist()):
Use:
df[df['id'] == x].index.tolist()
If x is present in id, this returns the list of indices where it occurs; otherwise it gives an empty list.
I had a CSV file to read:
df = pd.read_csv('50_states.csv')
And after trying:
if value in df.column:
    print(True)
which never printed True, even though the value was in the column, I tried:
for values in df.column:
    if value == values:
        print(True)
        # Or do something
    else:
        print(False)
which worked. I hope this can help!
Use query() to find the rows where the condition holds and get the number of rows with shape[0]. If there exists at least one entry, this statement is True:
df.query('id == 123').shape[0] > 0
Suppose your dataframe has a filename column and you want to check whether the filename "80900026941984" is present in the dataframe or not.
You can simply write:
if sum(df["filename"].astype("str").str.contains("80900026941984")) > 0:
    print("found")
