Pandas - check if dataframe has negative value in any column - python

I wonder how to check whether a pandas DataFrame has a negative value in one or more columns, and
return only a boolean value (True or False). Can you please help?
In[1]: df = pd.DataFrame(np.random.randn(10, 3))
In[2]: df
Out[2]:
0 1 2
0 -1.783811 0.736010 0.865427
1 -1.243160 0.255592 1.670268
2 0.820835 0.246249 0.288464
3 -0.923907 -0.199402 0.090250
4 -1.575614 -1.141441 0.689282
5 -1.051722 0.513397 1.471071
6 2.549089 0.977407 0.686614
7 -1.417064 0.181957 0.351824
8 0.643760 0.867286 1.166715
9 -0.316672 -0.647559 1.331545
Expected output:
Out[3]: True

Actually, if speed is important, I did a few tests:
df = pd.DataFrame(np.random.randn(10000, 30000))
Test 1, slowest: pure pandas
(df < 0).any().any()
# 303 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Test 2, faster: switching over to numpy with .values for testing the presence of a True entry
(df < 0).values.any()
# 269 ms ± 8.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Test 3, maybe even faster, though not significant: switching over to numpy for the whole thing
(df.values < 0).any()
# 267 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
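As an aside, recent pandas versions recommend to_numpy() over .values; assuming a reasonably current pandas, the equivalent check (not re-timed here) would be:
(df.to_numpy() < 0).any()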

You can chain two any calls:
df.lt(0).any().any()
Out[96]: True

This does the trick:
(df < 0).any().any()
To break it down, (df < 0) gives a dataframe with boolean entries. Then the first .any() returns a series of booleans, testing within each column for the presence of a True value. And then, the second .any() asks whether this returned series itself contains any True value.
This returns a simple:
True
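To see the intermediate pieces, here is a minimal sketch on a tiny made-up frame (the column names a and b are only for illustration):
import pandas as pd

df = pd.DataFrame({'a': [1.0, -2.0], 'b': [3.0, 4.0]})

mask = df < 0              # boolean DataFrame: True where a value is negative
per_column = mask.any()    # boolean Series: does each column contain any True?
result = per_column.any()  # single bool: does any column contain any True?

print(per_column)  # a: True, b: False
print(result)      # True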

Related

Check if a row in column is unique python Dataframe

I have the following Dataframe:
id1  result
2    0.5
3    1.4
4    1.4
7    3.4
2    1.4
I want to check, for every row in the column 'id1', whether the value is unique.
The output should be:
False
True
True
True
False
The first and the last are False because id 2 exists twice.
I used this method:
bool = df["id1"].is_unique
but that checks whether the whole column is unique. I want to check it for each row.
df['id1'].map(~(df.groupby('id1').size() > 1))
Output
0 False
1 True
2 True
3 True
4 False
Name: id1, dtype: bool
Since you tagged this question with pandas, I'm assuming you're using the pandas package.
You can build an Index from the id1 values and then use the duplicated method, as in the following example (see the pandas docs for Index.duplicated).
import pandas as pd
check_id1_duplicate = pd.Index([2, 3, 4, 7, 2])
check_id1_duplicate.duplicated(keep=False)
# Results would be array([True, False, False, False, True])
To add to @ShiriNmi's answer, the duplicated solution is more intuitive and about 8 times faster, while returning the same result.
%timeit -n 10_000 df['id1'].map(~(df.groupby('id1').size() > 1))
# 697 µs ± 60.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit ~df['id1'].duplicated(keep=False)
# 89.5 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
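Putting it together on the question's data, a minimal sketch (the new column name is_unique is just illustrative):
import pandas as pd

df = pd.DataFrame({'id1': [2, 3, 4, 7, 2],
                   'result': [0.5, 1.4, 1.4, 3.4, 1.4]})

# True where the id1 value occurs exactly once in the column
df['is_unique'] = ~df['id1'].duplicated(keep=False)
print(df['is_unique'].tolist())
# [False, True, True, True, False]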

Pandas Dataframe change column values with given interval

I am trying to convert my column's values according to intervals, like:
if (x < 5)
    x = 2
else if (x >= 5 && x < 10)
    x = 3
and I am trying to do it in Python in a single line of code. I tried the mask and cut methods but could not get them to work.
This is my trial:
dataset['CURRENT_RATIO' ] = dataset['CURRENT_RATIO'].mask((dataset['CURRENT_RATIO'] < 0.02, -7.0) | (dataset['CURRENT_RATIO'] > =0.02 & dataset['CURRENT_RATIO'] < 0.37),-5))
What I actually need: if x < 0.02 then -7, else if x >= 0.02 and x < 0.37 then -5, and so on.
input output
0.015 -7
0.02 -5
0.37 -3
0.75 1
TL;DR
A list comprehension will do:
dataset.CURRENT_RATIO = [-7 if i<.02 else -5 if i<.37 else -3 if i<.75 else 1 for i in dataset.CURRENT_RATIO]
Timing it with a random dataset of 1,000,000 rows gave the following result:
334 ms ± 5.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Multiple Columns
If you were to do this with multiple columns, with different thresholds and update values, I would do the following:
df = pd.DataFrame({'RATIO1': np.random.rand(10000000),
                   'RATIO2': np.random.rand(10000000)})
update_per_col = {'RATIO1': [(.02, -7), (.37, -5), (.75, -3), 1],
                  'RATIO2': [(.12, 5), (.47, 6), (.85, 7), 8]}
cols_to_be_updated = ['RATIO1', 'RATIO2']

for col in cols_to_be_updated:
    df[col] = [update_per_col[col][0][1] if i < update_per_col[col][0][0] else
               update_per_col[col][1][1] if i < update_per_col[col][1][0] else
               update_per_col[col][2][1] if i < update_per_col[col][2][0] else
               update_per_col[col][3]
               for i in df[col]]
When we time the for loop with 10,000,000 rows and two columns we get:
9.37 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
To address Joe Ferndz's comment, let's try to speed it up. I came up with two methods: apply() with a named function and apply() with a lambda. With the named function we run the following code (only the for loop was timed):
def update_ratio(x, updates):
    if x < updates[0][0]:
        return updates[0][1]
    elif x < updates[1][0]:
        return updates[1][1]
    elif x < updates[2][0]:
        return updates[2][1]
    else:
        return updates[3]

for col in cols_to_be_updated:
    df[col] = df[col].apply(update_ratio, updates=update_per_col[col])
This gives us:
11.8 s ± 285 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Finally the lambda gives:
for col in cols_to_be_updated:
    df[col] = df[col].apply(lambda x: update_per_col[col][0][1] if x < update_per_col[col][0][0] else
                                      update_per_col[col][1][1] if x < update_per_col[col][1][0] else
                                      update_per_col[col][2][1] if x < update_per_col[col][2][0] else
                                      update_per_col[col][3])
8.91 s ± 171 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This means that a lambda is the fastest method, but a list comprehension is not far off.
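Since the question also mentions cut, here is a quick sketch of the same mapping with pd.cut, using the thresholds from the question (an alternative approach, not one of the methods timed above):
import numpy as np
import pandas as pd

dataset = pd.DataFrame({'CURRENT_RATIO': [0.015, 0.02, 0.37, 0.75]})

# Left-closed, right-open bins: [-inf, .02) -> -7, [.02, .37) -> -5,
# [.37, .75) -> -3, [.75, inf) -> 1
dataset['CURRENT_RATIO'] = pd.cut(dataset['CURRENT_RATIO'],
                                  bins=[-np.inf, .02, .37, .75, np.inf],
                                  labels=[-7, -5, -3, 1],
                                  right=False).astype(int)

print(dataset['CURRENT_RATIO'].tolist())  # [-7, -5, -3, 1]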
Use:
mask_below_10 = dataset['CURRENT_RATIO'] < 10
mask_below_5 = dataset['CURRENT_RATIO'] < 5
dataset.loc[mask_below_10, 'CURRENT_RATIO'] = 3
dataset.loc[mask_below_5, 'CURRENT_RATIO'] = 2
The first assignment sets every value below 10 to 3 and the second sets every value below 5 to 2. Both masks are computed against the original values (and .loc avoids chained-assignment issues), so values between 5 and 10 end up as 3.
It would be smart to write your own function and use pandas.Series.apply():
def my_lambda(x):
    if x < 5:
        x = 2
    elif x < 10:
        x = 3
    return x

dataset['CURRENT_RATIO'] = dataset['CURRENT_RATIO'].apply(my_lambda)

How to apply changes to subset dataframe to source dataframe

I'm trying to determine and flag duplicate 'Sample' values in a dataframe using groupby with lambda:
rdtRows["DuplicateSample"] = False
rdtRowsSampleGrouped = rdtRows.groupby( ['Sample']).filter(lambda x: len(x) > 1)
rdtRowsSampleGrouped["DuplicateSample"] = True
# How to get flag changes made on rdtRowsSampleGrouped to apply to rdtRows??
How do I make the changes / apply the "DuplicateSample" flag from rdtRowsSampleGrouped back to the source rdtRows data? I'm stumped :(
Use GroupBy.transform with GroupBy.size:
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
Or use Series.duplicated with keep=False if need faster solution:
df['DuplicateSample'] = df['Sample'].duplicated(keep=False)
Performance on some sample data (real data will differ, depending on the number of rows and the number of duplicated values):
np.random.seed(2020)
N = 100000
df = pd.DataFrame({'Sample': np.random.randint(100000, size=N)})
In [51]: %timeit df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
17 ms ± 50 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df['DuplicateSample1'] = df['Sample'].duplicated(keep=False)
3.73 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@Stef's solution is unfortunately about 2734 times slower than the duplicated solution:
In [53]: %timeit df['DuplicateSample2'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
10.2 s ± 517 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use transform:
import pandas as pd
df = pd.DataFrame({'Sample': [1,2,2,3,4,4]})
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
Result:
Sample DuplicateSample
0 1 False
1 2 True
2 2 True
3 3 False
4 4 True
5 4 True
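To answer the literal question (how to push the flag from the filtered subset back to the source frame), a minimal sketch that relies on the subset keeping the original index (variable names taken from the question):
import pandas as pd

rdtRows = pd.DataFrame({'Sample': [1, 2, 2, 3, 4, 4]})
rdtRows['DuplicateSample'] = False

# Rows whose 'Sample' value occurs more than once; the original index is preserved
rdtRowsSampleGrouped = rdtRows.groupby('Sample').filter(lambda x: len(x) > 1)

# Use that index to write the flag back to the source dataframe
rdtRows.loc[rdtRowsSampleGrouped.index, 'DuplicateSample'] = True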

Find indices of where Pandas Series contains element containing character

Example:
import pandas as pd
arr = pd.Series(['a',['a','b'],'c'])
I would like to get the indices of where the series contains elements containing 'a'. So I would like to get back indices 0 and 1.
I've tried writing
arr.str.contains('a')
but this returns
0 True
1 NaN
2 False
dtype: object
while I'd like it to return
0 True
1 True
2 False
dtype: object
Use Series.str.join() to concatenate the lists/arrays in cells into a single string and then use .str.contains('a'):
In [78]: arr.str.join(sep='~').str.contains('a')
Out[78]:
0 True
1 True
2 False
dtype: bool
Use Series.apply and Python's in keyword, which works on both lists and strings:
arr.apply(lambda x: 'a' in x)
This will work fine if you don't have any NaN values in your Series, but if you do, you can use:
arr.apply(lambda x: 'a' in x if x is not np.nan else x)
This is much faster than using Series.str.
Benchmarks:
%%timeit
arr.str.join(sep='~').str.contains('a')
Takes: 249 µs ± 4.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
arr.apply(lambda x: 'a' in x)
Takes: 70.1 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
arr.apply(lambda x: 'a' in x if x is not np.nan else x)
Takes: 69 µs ± 1.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
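For completeness, a minimal runnable version of the NaN-handling variant above (it assumes numpy is imported as np, which the snippet relies on):
import numpy as np
import pandas as pd

arr = pd.Series(['a', ['a', 'b'], 'c', np.nan])

# 'a' in x works for both strings and lists; NaN entries are passed through unchanged
print(arr.apply(lambda x: 'a' in x if x is not np.nan else x))
# 0     True
# 1     True
# 2    False
# 3      NaN
# dtype: object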

How to determine whether a Pandas Column contains a particular value

I am trying to determine whether there is an entry in a Pandas column that has a particular value. I tried to do this with if x in df['id']. I thought this was working, except that when I fed it a value that I knew was not in the column, 43 in df['id'], it still returned True. When I subset to a data frame only containing entries matching the missing id, df[df['id'] == 43], there are, obviously, no entries in it. How do I determine if a column in a Pandas data frame contains a particular value, and why doesn't my current method work? (FYI, I have the same problem when I use the implementation in this answer to a similar question.)
The in operator on a Series checks whether the value is in the index:
In [11]: s = pd.Series(list('abc'))
In [12]: s
Out[12]:
0 a
1 b
2 c
dtype: object
In [13]: 1 in s
Out[13]: True
In [14]: 'a' in s
Out[14]: False
One option is to see if it's in unique values:
In [21]: s.unique()
Out[21]: array(['a', 'b', 'c'], dtype=object)
In [22]: 'a' in s.unique()
Out[22]: True
or a python set:
In [23]: set(s)
Out[23]: {'a', 'b', 'c'}
In [24]: 'a' in set(s)
Out[24]: True
As pointed out by @DSM, it may be more efficient (especially if you're just doing this for one value) to use in directly on the values:
In [31]: s.values
Out[31]: array(['a', 'b', 'c'], dtype=object)
In [32]: 'a' in s.values
Out[32]: True
You can also use pandas.Series.isin although it's a little bit longer than 'a' in s.values:
In [2]: s = pd.Series(list('abc'))
In [3]: s
Out[3]:
0 a
1 b
2 c
dtype: object
In [3]: s.isin(['a'])
Out[3]:
0 True
1 False
2 False
dtype: bool
In [4]: s[s.isin(['a'])].empty
Out[4]: False
In [5]: s[s.isin(['z'])].empty
Out[5]: True
But this approach can be more flexible if you need to match multiple values at once for a DataFrame (see DataFrame.isin)
>>> df = DataFrame({'A': [1, 2, 3], 'B': [1, 4, 7]})
>>> df.isin({'A': [1, 3], 'B': [4, 7, 12]})
A B
0 True False # Note that B didn't match 1 here.
1 False True
2 True True
found = df[df['Column'].str.contains('Text_to_search')]
print(found.count())
found.count() will contain the number of matches (per column), and if it is 0 the string was not found in the column.
You can try this to check for a particular value 'x' in a particular column named 'id':
if x in df['id'].values
I did a few simple tests:
In [10]: x = pd.Series(range(1000000))
In [13]: timeit 999999 in x.values
567 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [24]: timeit 9 in x.values
666 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]: timeit (x == 999999).any()
6.86 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [21]: timeit x.eq(999999).any()
7.03 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [22]: timeit x.eq(9).any()
7.04 ms ± 60 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [15]: timeit x.isin([999999]).any()
9.54 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [17]: timeit 999999 in set(x)
79.8 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Interestingly it doesn't matter if you look up 9 or 999999, it seems like it takes about the same amount of time using the in syntax (must be using some vectorized computation)
In [24]: timeit 9 in x.values
666 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [25]: timeit 9999 in x.values
647 µs ± 5.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [26]: timeit 999999 in x.values
642 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [27]: timeit 99199 in x.values
644 µs ± 5.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [28]: timeit 1 in x.values
667 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Seems like using x.values is the fastest, but maybe there is a more elegant way in pandas?
Or use Series.tolist or Series.any:
>>> s = pd.Series(list('abc'))
>>> s
0 a
1 b
2 c
dtype: object
>>> 'a' in s.tolist()
True
>>> (s=='a').any()
True
Series.tolist makes a list out of the Series; in the other, I'm getting a boolean Series from the regular Series and then checking whether it contains any True values.
Simple condition:
if any(str(elem) in ['a','b'] for elem in df['column'].tolist()):
Use:
df[df['id'] == x].index.tolist()
If x is present in id, this returns the list of indices where it occurs; otherwise it gives an empty list.
I had a CSV file to read:
df = pd.read_csv('50_states.csv')
And after trying:
if value in df.column:
    print(True)
which never printed True, even though the value was in the column, I tried:
for values in df.column:
    if value == values:
        print(True)
        # Or do something
    else:
        print(False)
which worked. I hope this can help!
Use query() to find the rows where the condition holds and get the number of rows with shape[0]. If there exists at least one entry, this statement is True:
df.query('id == 123').shape[0] > 0
Suppose your dataframe has a filename column and you want to check whether the filename "80900026941984" is present in the dataframe or not.
You can simply write:
if sum(df["filename"].astype("str").str.contains("80900026941984")) > 0:
    print("found")
