Python DataFrame: delete rows after comparing multiple column values with a value

I have a data frame with many columns containing float values. I want to delete a row if any of its columns has a value below 20.
code:
import numpy as np
import pandas as pd

xdf = pd.DataFrame({'A': np.random.uniform(low=-50, high=53.3, size=5),
                    'B': np.random.uniform(low=10, high=130, size=5),
                    'C': np.random.uniform(low=-50, high=130, size=5),
                    'D': np.random.uniform(low=-100, high=200, size=5)})
xdf =
A B C D
0 -9.270533 42.098425 91.125009 148.350655
1 17.771411 55.564825 106.396381 -89.082831
2 -22.602563 99.330643 17.590466 73.985202
3 15.890920 76.011631 52.366311 194.023063
4 35.202379 41.973846 32.576890 100.523902
# my code
cols = ['A', 'B', 'C', 'D']
xdf[xdf[cols].ge(20).all(axis=1)]
Out[17]:
A B C D
4 35.202379 41.973846 32.57689 100.523902
Expected output: drop a row if any column has a value below 20
xdf =
A B C D
4 35.202379 41.973846 32.576890 100.523902
Is this the best way of doing it?

To do it in numpy:
xdf = pd.DataFrame({'A': np.random.uniform(low=-50, high=53.3, size=5),
                    'B': np.random.uniform(low=10, high=130, size=5),
                    'C': np.random.uniform(low=-50, high=130, size=5),
                    'D': np.random.uniform(low=-100, high=200, size=5)})
%timeit xdf[xdf[['A','B','C','D']].ge(20).all(axis=1)]
%timeit xdf[(xdf[['A','B','C','D']].values >= 20).all(axis=1)]
705 µs ± 277 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
460 µs ± 1.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you do not need to keep the result in a DataFrame, this can be even faster:
xdf.values[(xdf[['A','B','C','D']].values >= 20).all(axis=1)]
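If you later need a labelled DataFrame again, the filtered ndarray can be wrapped back (a minimal sketch; mask and filtered are just illustrative names):
mask = (xdf[['A', 'B', 'C', 'D']].values >= 20).all(axis=1)   # boolean row mask computed in numpy
filtered = pd.DataFrame(xdf.values[mask], columns=xdf.columns, index=xdf.index[mask])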

As numpy is lighter and therefore faster for numerical calculations, try this:
# each np.random.uniform call below is one column, so transpose to get
# the same (rows, columns) layout as the DataFrame
a = np.array([np.random.uniform(low=-50, high=53.3, size=5),
              np.random.uniform(low=10, high=130, size=5),
              np.random.uniform(low=-50, high=130, size=5),
              np.random.uniform(low=-100, high=200, size=5)]).T
print(a[np.all(a >= 20, axis=1)])
If you want to stick with pandas, another idea would be:
xdfFiltered = xdf.loc[(xdf["A"] > 20) & (xdf["B"] > 20) & (xdf["C"] > 20) & (xdf["D"] > 20)]

You can use the numpy equivalent of .ge instead:
xdf.loc[np.greater(xdf,20).all(axis=1)]

Related

Python - Remove row if item is above certain value and replace if between other values

I'm working in a pandas dataframe trying to clean up some data, and I want to apply multiple rules to a certain column. If the value is greater than 500, I want to drop the row. If the value is between 101 and 500, I want to replace it with 100. If the value is less than 101, keep it as is.
I'm able to do it in 2 lines of code, but I was curious if there's a cleaner, more efficient way to do this. I tried an if/elif/else and also a lambda function, but I couldn't get either to run.
# This drops all rows that are greater than 500
df.drop(df[df.Percent > 500].index, inplace = True)
# This sets the upper limit on all values at 100
df['Percent'] = df['Percent'].clip(upper = 100)
You can use .loc with a boolean mask instead of .drop() with an index, combined with the fast numpy function numpy.where(), for better performance:
import numpy as np
df2 = df.loc[df['Percent'] <= 500]
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
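Note that df2 above is a slice of df, so assigning to it may raise a SettingWithCopyWarning; taking an explicit copy avoids that (a minimal sketch of the same logic):
df2 = df.loc[df['Percent'] <= 500].copy()                              # explicit copy, safe to modify
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])  # cap everything from 101 up at 100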
Performance Comparison:
Part 1: Original size dataframe
Old Codes:
%%timeit
df.drop(df[df.Percent > 500].index, inplace = True)
# This sets the upper limit on all values at 100
df['Percent'] = df['Percent'].clip(upper = 100)
1.58 ms ± 56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
New Codes:
%%timeit
df2 = df.loc[df['Percent'] <= 500]
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
784 µs ± 8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Benchmarking result:
The new codes take 784 µs while the old codes take 1.58 ms:
Around 2x faster
Part 2: Large size dataframe
Let's use a dataframe 10000 times the original size:
df9 = pd.concat([df] * 10000, ignore_index=True)
Old Codes:
%%timeit
df9.drop(df9[df9.Percent > 500].index, inplace = True)
# This sets the upper limit on all values at 100
df9['Percent'] = df9['Percent'].clip(upper = 100)
3.87 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New Codes:
%%timeit
df2 = df9.loc[df9['Percent'] <= 500]
df2['Percent'] = np.where(df2['Percent'] >= 101, 100, df2['Percent'])
1.96 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Benchmarking result:
The new codes take 1.96 ms while the old codes take 3.87 ms:
Also around 2x faster

How to apply changes to subset dataframe to source dataframe

I'm trying to determine and flag duplicate 'Sample' values in a dataframe using groupby with lambda:
rdtRows["DuplicateSample"] = False
rdtRowsSampleGrouped = rdtRows.groupby( ['Sample']).filter(lambda x: len(x) > 1)
rdtRowsSampleGrouped["DuplicateSample"] = True
# How to get flag changes made on rdtRowsSampleGrouped to apply to rdtRows??
How do I apply the "DuplicateSample" flag changes made on rdtRowsSampleGrouped back to the source rdtRows data? I'm stumped :(
Use GroupBy.transform with GroupBy.size:
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
Or use Series.duplicated with keep=False if you need a faster solution:
df['DuplicateSample'] = df['Sample'].duplicated(keep=False)
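With keep=False every occurrence of a duplicated value is marked True, not only the later ones (a small illustrative example, separate from the asker's data):
s = pd.Series([1, 2, 2, 3, 4, 4])
s.duplicated(keep=False).tolist()
# [False, True, True, False, True, True]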
Performance on some sample data (real data will differ, depending on the number of rows and the number of duplicated values):
np.random.seed(2020)
N = 100000
df = pd.DataFrame({'Sample': np.random.randint(100000, size=N)})
In [51]: %timeit df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
17 ms ± 50 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df['DuplicateSample1'] = df['Sample'].duplicated(keep=False)
3.73 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Stef's solution is unfortunately 2734x slower than the duplicated solution
In [53]: %timeit df['DuplicateSample2'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
10.2 s ± 517 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use transform:
import pandas as pd
df = pd.DataFrame({'Sample': [1,2,2,3,4,4]})
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
Result:
Sample DuplicateSample
0 1 False
1 2 True
2 2 True
3 3 False
4 4 True
5 4 True

How to perform a conditional calculation

I'm trying to figure out how to apply a dictionary to an existing DataFrame column at certain indexes, i.e.:
Primary DataFrame:
country year Corruption Index
0 X 2010 6,5
1 X 2015 78,0
So I have a dictionary (separately dividing 78 by 10):
x = {1: 7.8}
I want to change the value at index 1 in the 'Corruption Index' column.
I know I can use a for loop, but are there any faster options? Or maybe I can directly divide the values in the existing DataFrame that are higher than 10 by ten? (The problem is that the official statistics switch from a 1-10 scale to a 1-100 scale after 2012.)
The fastest solution is the answer from Mykola Zotko
For a vectorized solution, use pandas.DataFrame.where, which will replace values where the condition is False.
.where will be much faster than using .apply, as in df['Corruption_Index'].apply(lambda x: x/10 if x > 10 else x).
numpy.where is slightly different, in that it works on the True condition, and is faster than pandas.DataFrame.where, but requires importing numpy.
Since the issue is that the official statistics switch from 1-10 to 1-100 after 2012, m = df.year <= 2012 should be used as the condition.
import pandas as pd
import numpy as np
# test dataframe
df = pd.DataFrame({'country': ['X', 'X'], 'year': [2010, 2015], 'Corruption_Index': [6.5, 78.0]})
# display(df)
country year Corruption_Index
0 X 2010 6.5
1 X 2015 78.0
# create a Boolean mask for the condition
m = df.year <= 2012
# use pandas.DataFrame.where to calculate on the False condition
df['Corruption_Index'].where(m, df['Corruption_Index'] / 10, inplace=True)
# alternatively, use np.where to calculate on the True condition
df['Corruption_Index'] = np.where(df.year > 2012, df['Corruption_Index'] / 10, df['Corruption_Index'])
# display(df)
country year Corruption_Index
0 X 2010 6.5
1 X 2015 7.8
%%timeit Comparison
# test data with 1M rows
np.random.seed(365)
df = pd.DataFrame({'v': [np.random.randint(20) for _ in range(1000000)]})
%%timeit
df.loc[df['v'] > 10, 'v'] /= 10
[out]:
2.66 ms ± 61.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
np.where(df['v'] > 10, df['v'] / 10, df['v'])
[out]:
8.11 ms ± 403 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df['v'].where(df['v'] <= 10, df['v'] / 10)
[out]:
17.9 ms ± 615 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df['v'].apply(lambda x: x/10 if x > 10 else x)
[out]:
319 ms ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use:
df.loc[df['Corruption_Index'] > 10, 'Corruption_Index'] /= 10
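Applied to the sample frame above, only the post-2012 row changes (a quick illustration using the same test data):
df = pd.DataFrame({'country': ['X', 'X'], 'year': [2010, 2015], 'Corruption_Index': [6.5, 78.0]})
df.loc[df['Corruption_Index'] > 10, 'Corruption_Index'] /= 10
#   country  year  Corruption_Index
# 0       X  2010               6.5
# 1       X  2015               7.8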

Pandas - check if dataframe has negative value in any column

I wonder how to check if a pandas dataframe has a negative value in one or more columns and return only a boolean value (True or False). Can you please help?
In[1]: df = pd.DataFrame(np.random.randn(10, 3))
In[2]: df
Out[2]:
0 1 2
0 -1.783811 0.736010 0.865427
1 -1.243160 0.255592 1.670268
2 0.820835 0.246249 0.288464
3 -0.923907 -0.199402 0.090250
4 -1.575614 -1.141441 0.689282
5 -1.051722 0.513397 1.471071
6 2.549089 0.977407 0.686614
7 -1.417064 0.181957 0.351824
8 0.643760 0.867286 1.166715
9 -0.316672 -0.647559 1.331545
Expected output:
Out[3]: True
Actually, if speed is important, I did a few tests:
df = pd.DataFrame(np.random.randn(10000, 30000))
Test 1, slowest: pure pandas
(df < 0).any().any()
# 303 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Test 2, faster: switching over to numpy with .values for testing the presence of a True entry
(df < 0).values.any()
# 269 ms ± 8.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Test 3, maybe even faster, though not significant: switching over to numpy for the whole thing
(df.values < 0).any()
# 267 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can chain two any calls:
df.lt(0).any().any()
Out[96]: True
This does the trick:
(df < 0).any().any()
To break it down, (df < 0) gives a dataframe with boolean entries. Then the first .any() returns a series of booleans, testing within each column for the presence of a True value. And then, the second .any() asks whether this returned series itself contains any True value.
This returns a simple:
True
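To see the intermediate steps on a tiny frame (an illustrative sketch, not the asker's random data):
small = pd.DataFrame({'a': [1.0, -2.0], 'b': [3.0, 4.0]})
(small < 0)              # element-wise boolean DataFrame
(small < 0).any()        # per column: a -> True, b -> False
(small < 0).any().any()  # collapses to a single bool: True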

Pandas: Replace a string with 'other' if it is not present in a list of strings

I have the following data frame, df, with column 'Class'
Class
0 Individual
1 Group
2 A
3 B
4 C
5 D
6 Group
I would like to replace everything apart from Group and Individual with 'Other', so the final data frame is
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
The dataframe is huge, with over 600K rows. What is the most efficient way to look for values other than 'Group' and 'Individual' and replace them with 'Other'?
I have seen examples for replace, such as:
df['Class'] = df['Class'].replace({'A':'Other', 'B':'Other'})
but since I have far too many unique values, I cannot do this individually. I would rather just exclude the subset of 'Group' and 'Individual' and replace everything else.
I think you need:
df['Class'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
print (df)
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
Another solution (slower):
m = (df['Class'] == 'Individual') | (df['Class'] == 'Group')
df['Class'] = np.where(m, df['Class'], 'Other')
Another solution:
df['Class'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
Performance (on real data it depends on the number of replacements):
#[700000 rows x 1 columns]
df = pd.concat([df] * 100000, ignore_index=True)
#print (df)
In [208]: %timeit df['Class1'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
25.9 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [209]: %timeit df['Class2'] = np.where((df['Class'] == 'Individual') | (df['Class'] == 'Group'), df['Class'], 'Other')
120 ms ± 6.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [210]: %timeit df['Class3'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
95.7 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [211]: %timeit df.loc[~df['Class'].isin(['Individual', 'Group']), 'Class'] = 'Other'
97.8 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Another approach could be:
df.loc[~df['Class'].isin(['Individual', 'Group']), 'Class'] = 'Other'
You can do it this way, for example: get the list of unique items, remove your known classes, then replace the remaining class values with 'Other':
other_classes = df['Class'].unique().tolist()   # get list of unique items
other_classes.remove('Individual')              # remove your known classes
other_classes.remove('Group')
df.loc[df['Class'].isin(other_classes), 'Class'] = 'Other'   # replace class values of all other rows
The principle is the same as in the answers above.
You can use pd.Series.where:
df['Class'].where(df['Class'].isin(['Individual', 'Group']), 'Other', inplace=True)
print(df)
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
This should be more efficient than map + fillna:
df = pd.concat([df] * 100000, ignore_index=True)
%timeit df['Class'].where(df['Class'].isin(['Individual', 'Group']), 'Other')
# 60.3 ms per loop
%timeit df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
# 133 ms per loop
Another way, using apply:
df['Class'] = df['Class'].apply(lambda cl: cl if cl in ["Individual", "Group"] else "Other")
