Suppose we have a file, named any_csv.csv, containing...
A,B,random
1,2,300
3,4,300
5,6,300
1,2,300
3,4,350
8,9,350
4,5,350
5,6,320
7,8,300
3,3,300
I wish to keep all the rows where random varies/changes.
I made this little program to achieve this, but as I wish to learn more about pandas, and as my program is slower than I expect it to be (~130 seconds to process a 1.2 million line log file), I'm asking for your help.
import pandas as pd
import numpy as np
df = pd.read_csv('any_csv.csv')
mask = np.zeros(len(df.index), dtype=bool)
# Initializing my current value for comparison purposes.
mask[0] = 1
previous_val = df.iloc[0]['random']
for index, row in df.iterrows():
    if row['random'] != previous_val:
        # If a variation has been detected, flag the current and the previous index.
        previous_val = row['random']
        mask[index] = 1
        mask[index - 1] = 1
# Keeping the last item.
mask[-1] = 1
df = df.loc[mask]
df.to_csv('any_other_csv.csv', index=False)
I guess that, in short, I wish to know how to apply my if condition, currently inside this homemade for loop, which is overall pretty slow.
Results:
A,B,random
1,2,300
1,2,300
3,4,350
4,5,350
5,6,320
7,8,300
3,3,300
You can utilize pd.Series.shift to create a mask of Boolean values. The Boolean mask indicates when a value is different to a value above or below it within the series.
You can then apply the Boolean mask to your dataframe directly.
mask = (df['random'] != df['random'].shift()) | \
       (df['random'] != df['random'].shift(-1))
df = df[mask]
print(df)
A B random
0 1 2 300
3 1 2 300
4 3 4 350
6 4 5 350
7 5 6 320
8 7 8 300
9 3 3 300
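If you want the whole pipeline from the question in one place, here is a minimal end-to-end sketch (the file names are taken from the question):
import pandas as pd

df = pd.read_csv('any_csv.csv')

# keep rows whose 'random' value differs from the row above or the row below
mask = (df['random'] != df['random'].shift()) | \
       (df['random'] != df['random'].shift(-1))

df[mask].to_csv('any_other_csv.csv', index=False)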
Use boolean indexing with 2 masks to check for different values, using shift and ne for not-equal comparison:
df = df[df['random'].ne(df['random'].shift()) | df['random'].ne(df['random'].shift(-1))]
print (df)
A B random
0 1 2 300
3 1 2 300
4 3 4 350
6 4 5 350
7 5 6 320
8 7 8 300
9 3 3 300
For easier verification:
df['mask1'] = df['random'].ne(df['random'].shift())
df['mask2'] = df['random'].ne(df['random'].shift(-1))
df['mask3'] = df['random'].ne(df['random'].shift()) | df['random'].ne(df['random'].shift(-1))
print (df)
A B random mask1 mask2 mask3
0 1 2 300 True False True
1 3 4 300 False False False
2 5 6 300 False False False
3 1 2 300 False True True
4 3 4 350 True False True
5 8 9 350 False False False
6 4 5 350 False True True
7 5 6 320 True True True
8 7 8 300 True False True
9 3 3 300 False True True
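If you do add these helper columns to verify, you can filter on the combined mask and drop the helpers again afterwards; a small sketch:
df = df[df['mask3']].drop(columns=['mask1', 'mask2', 'mask3'])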
Timings:
N = 1000
In [157]: %timeit orig(df)
56.8 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [158]: %timeit (df[df['random'].ne(df['random'].shift()) |
df['random'].ne(df['random'].shift(-1))])
939 µs ± 7.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#jpp solution - a bit slower
In [159]: %timeit df[(df['random'] != df['random'].shift()) | (df['random'] != df['random'].shift(-1))]
1.11 ms ± 8.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N = 10000
In [160]: %timeit orig(df)
538 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [161]: %timeit (df[df['random'].ne(df['random'].shift()) | df['random'].ne(df['random'].shift(-1))])
1.16 ms ± 75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#jpp solution - a bit slower
In [162]: %timeit df[(df['random'] != df['random'].shift()) | (df['random'] != df['random'].shift(-1))]
1.28 ms ± 8.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.random.seed(123)
N = 1000
df = pd.DataFrame({'random':np.random.randint(2, size=N)})
print (df)
def orig(df):
    mask = np.zeros(len(df.index), dtype=bool)
    # Initializing my current value for comparison purposes.
    mask[0] = 1
    previous_val = df.iloc[0]['random']
    for index, row in df.iterrows():
        if row['random'] != previous_val:
            # If a variation has been detected, flag the current and the previous index.
            previous_val = row['random']
            mask[index] = 1
            mask[index - 1] = 1
    # Keeping the last item.
    mask[-1] = 1
    return df.loc[mask]
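Outside IPython, where %timeit is not available, a similar comparison can be reproduced with the standard timeit module; a minimal standalone sketch timing the two vectorized variants (setup repeated so it runs on its own):
import timeit
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000
df = pd.DataFrame({'random': np.random.randint(2, size=N)})

t_ne = timeit.timeit(
    lambda: df[df['random'].ne(df['random'].shift()) |
               df['random'].ne(df['random'].shift(-1))],
    number=1000)
t_cmp = timeit.timeit(
    lambda: df[(df['random'] != df['random'].shift()) |
               (df['random'] != df['random'].shift(-1))],
    number=1000)
print(t_ne / 1000, t_cmp / 1000)  # average seconds per call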
You could try something like below:
df.groupby(["A", "random"]).filter(lambda df: df.shape[0] == 1)
Related
I have a Dataframe of stock prices...
I wish to have a boolean column that indicates if the price had reached a certain threshold in the previous rows or not.
My output should be something like this (let's say my threshold is 100):
index  price  bool
0      98     False
1      99     False
2      100.5  True
3      101    True
4      99     True
5      98     True
I've managed to do this with the following code but it's not efficient and takes a lot of time:
(df.loc[:, 'price'] > threshold).cumsum().fillna(0).gt(0)
Please, any suggestions?
Use a comparison and cummax:
threshold = 100
df['bool'] = df['price'].ge(threshold).cummax()
Note that it would work the other way around (although maybe less efficiently*):
threshold = 100
df['bool'] = df['price'].cummax().ge(threshold)
Output:
index price bool
0 0 98.0 False
1 1 99.0 False
2 2 100.5 True
3 3 101.0 True
4 4 99.0 True
5 5 98.0 True
* Indeed, on a large array:
%%timeit
df['price'].ge(threshold).cummax()
# 193 µs ± 4.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df['price'].cummax().ge(threshold)
# 309 µs ± 4.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Timing:
# setting up a dummy example with 10M rows
np.random.seed(0)
df = pd.DataFrame({'price': np.random.choice([0,1], p=[0.999,0.001], size=10_000_000)})
threshold = 0.5
## comparison
%%timeit
df['bool'] = (df.loc[:, 'price'] > threshold).cumsum().fillna(0).gt(0)
# 271 ms ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['bool'] = df['price'].ge(threshold).cummax()
# 109 ms ± 5.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df['bool'] = np.maximum.accumulate(df['price'].to_numpy()>threshold)
# 75.8 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
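For completeness, a minimal runnable reproduction of the small example above (the price values are taken from the question's table):
import pandas as pd

df = pd.DataFrame({'price': [98, 99, 100.5, 101, 99, 98]})
threshold = 100

df['bool'] = df['price'].ge(threshold).cummax()
print(df)
#    price   bool
# 0   98.0  False
# 1   99.0  False
# 2  100.5   True
# 3  101.0   True
# 4   99.0   True
# 5   98.0   True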
A similar dataframe can be created:
import pandas as pd
df = pd.DataFrame()
df["nodes"] = list(range(1, 11))
df["x"] = [1,4,9,12,27,87,99,121,156,234]
df["y"] = [3,5,6,1,8,9,2,1,0,-1]
df["z"] = [2,3,4,2,1,5,9,99,78,1]
df.set_index("nodes", inplace=True)
So the dataframe looks like this:
x y z
nodes
1 1 3 2
2 4 5 3
3 9 6 4
4 12 1 2
5 27 8 1
6 87 9 5
7 99 2 9
8 121 1 99
9 156 0 78
10 234 -1 1
My first try, for searching e.g. all nodes containing the number 1, is:
>>> df[(df == 1).any(axis=1)].index.values
[1 4 5 8 10]
As I have to do this for many numbers and my real dataframe is much bigger than this one, I'm searching for a very fast way to do this.
I just tried something that may be enlightening.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10000, 4)), columns=list('ABCD'))
df.set_index("A", inplace=True)
df_no_index = df.reset_index()
So we set up a dataframe with ints all the way through. This is not the same as yours, but it will suffice.
Then I ran four tests:
%timeit df[(df == 1).any(axis=1)].index.values
%timeit df[(df['B'] == 1) | (df['C']==1)| (df['D']==1)].index.values
%timeit df_no_index[(df_no_index == 1).any(axis=1)].A.values
%timeit df_no_index[(df_no_index['B'] == 1) | (df_no_index['C']==1)| (df_no_index['D']==1)].A.values
The results I got were:
940 µs ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.47 ms ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.08 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.55 ms ± 51.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This shows that the initial method you took, with the index, seems to be the fastest of these approaches. Removing the index does not improve the speed on a moderately sized dataframe.
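If you have to run this lookup for many different numbers, one further option (an assumption on my part, not benchmarked above) is to convert the frame to a NumPy array once and reuse it for every lookup; with your original nodes dataframe this would look like:
import numpy as np

values = df.values        # convert once
idx = df.index.values

def nodes_containing(n):
    # index labels of rows where any column equals n
    return idx[(values == n).any(axis=1)]

print(nodes_containing(1))  # [ 1  4  5  8 10] for the nodes example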
I have a pandas dataframe with 1 million rows. I want to replace the values in 900,000 rows of a column with another set of values. Is there a fast way to do this without a for loop (which takes me two days to complete)?
For example, look at this sample dataframe, where I have condensed 1 million rows into 8 rows:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['a'] = [-1,-3,-4,-4,-3, 4,5,6]
df['b'] = [23,45,67,89,0,-1, 2, 3]
L2 = [-1,-3,-4]
L5 = [9,10,11]
I want to replace values where a is -1, -3, -4 in a single shot if possible or as fast as possible without a for loop.
The crucial part is that values in L5 have to be repeated as needed.
I have tried
df.loc[df.a < 0, 'a'] = L5
but this works only when len(df.a.values) == len(L5)
Use map with a dictionary created from both lists by zip, and last replace non-matched values with the original ones by fillna:
d = dict(zip(L2, L5))
print (d)
{-1: 9, -3: 10, -4: 11}
df['a'] = df['a'].map(d).fillna(df['a'])
print (df)
a b
0 9.0 23
1 10.0 45
2 11.0 67
3 11.0 89
4 10.0 0
5 4.0 -1
6 5.0 2
7 6.0 3
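Note that fillna makes the column float here (9.0, 10.0, ...); if you need integers back, you can chain a cast onto the same expression (a small sketch):
df['a'] = df['a'].map(d).fillna(df['a']).astype(int)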
Performance:
It depends on the number of values to replace and on the length of the lists:
Length of the lists is 100:
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'a':np.random.randint(1000, size=N)})
L2 = np.arange(100)
L5 = np.arange(100) + 10
In [336]: %timeit df['d'] = np.select([df['a'] == i for i in L2], L5, df['a'])
180 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [337]: %timeit df['a'].map(dict(zip(L2, L5))).fillna(df['a'])
56.9 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If the length of the lists is small (e.g. 3):
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'a':np.random.randint(100, size=N)})
L2 = np.arange(3)
L5 = np.arange(3) + 10
In [339]: %timeit df['d'] = np.select([df['a'] == i for i in L2], L5, df['a'])
11.9 ms ± 40.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [340]: %timeit df['a'].map(dict(zip(L2, L5))).fillna(df['a'])
54 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can use np.select, such as:
import numpy as np
condition = [df['a'] == i for i in L2]
df['a'] = np.select(condition, L5, df['a'])
and you get:
a b
0 9 23
1 10 45
2 11 67
3 11 89
4 10 0
5 4 -1
6 5 2
7 6 3
Timing: let's create a bigger dataframe from your df:
df_l = pd.concat([df]*10000)
print (df_l.shape)
(80000, 2)
Now some timeit:
# with map, #jezrael
d = dict(zip(L2, L5))
%timeit df_l['a'].map(d).fillna(df_l['a'])
100 loops, best of 3: 7.71 ms per loop
# with np.select
condition = [df_l['a'] == i for i in L2]
%timeit np.select(condition, L5, df_l['a'])
1000 loops, best of 3: 350 µs per loop
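Another option you could consider (just a sketch, not benchmarked here) is Series.replace with the same dictionary; it also leaves non-matched values untouched and keeps the integer dtype:
d = dict(zip(L2, L5))
df['a'] = df['a'].replace(d)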
Here's my data
Id Amount
1 6
2 2
3 0
4 6
What I need is to map: if Amount is more than 3, Map is 1. But if Amount is less than 3, Map is 0.
Id Amount Map
1 6 1
2 2 0
3 0 0
4 6 1
What I did
a = df[['Id','Amount']]
a = a[a['Amount'] >= 3]
a['Map'] = 1
a = a[['Id', 'Map']]
df= df.merge(a, on='Id', how='left')
df['Map'] = df['Map'].fillna(0)
It works, but it is not highly configurable and not efficient.
Convert boolean mask to integer:
#for better performance convert to numpy array
df['Map'] = (df['Amount'].values >= 3).astype(int)
#pure pandas solution
df['Map'] = (df['Amount'] >= 3).astype(int)
print (df)
Id Amount Map
0 1 6 1
1 2 2 0
2 3 0 0
3 4 6 1
Performance:
#[400000 rows x 3 columns]
df = pd.concat([df] * 100000, ignore_index=True)
In [133]: %timeit df['Map'] = (df['Amount'].values >= 3).astype(int)
2.44 ms ± 97.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit df['Map'] = (df['Amount'] >= 3).astype(int)
2.6 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
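np.where is another common way to express the same 0/1 mapping; a quick sketch:
import numpy as np
df['Map'] = np.where(df['Amount'] >= 3, 1, 0)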
I am trying to use the values within the current dataframe as the "Index" and the dataframe's index as the "Labels". For example:
Value1 Value2
0 0 1
1 2 4
2 NaN 3
This would result in
Labels
0 0
1 0
2 1
3 2
4 1
Currently I manage to do this using a loop to check and apply the necessary labels/values, but with millions of labels to mark, this process becomes extremely time consuming. Is there a way to do this in a smarter and quicker way? Thanks in advance.
Use stack with DataFrame constructor:
s = df.stack()
df = pd.DataFrame(s.index.get_level_values(0).values,
                  columns=['Labels'],
                  index=s.values.astype(int)).sort_index()
print (df)
Labels
0 0
1 0
2 1
3 2
4 1
Detail:
print (df.stack())
0 Value1 0.0
Value2 1.0
1 Value1 2.0
Value2 4.0
2 Value2 3.0
dtype: float64
I came up with a really good one (thanks to the collective effort of the pandas community). This one should be fast.
It uses the power and flexibility of repeat and ravel to flatten your data.
s = pd.Series(df.index.repeat(2), index=df.values.ravel())
s[s.index.notnull()].sort_index()
0.0 0
1.0 0
2.0 1
3.0 2
4.0 1
dtype: int64
A subsequent conversion results in an integer index:
s.index = s.index.astype(int)
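To see what repeat and ravel contribute on the sample frame, a quick look at the intermediate pieces (a sketch):
print(df.index.repeat(2).tolist())   # [0, 0, 1, 1, 2, 2] -> becomes the values (labels)
print(df.values.ravel().tolist())    # [0.0, 1.0, 2.0, 4.0, nan, 3.0] -> becomes the index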
A similar (slightly faster, depending on your data) solution, which also results in an integer index, is to perform the filtering before converting to a Series:
v = df.index.repeat(df.shape[1])
i = df.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
s
0 0
1 0
2 1
3 2
4 1
dtype: int64
Performance
df2 = pd.concat([df] * 10000, ignore_index=True)
# jezrael's solution
%%timeit
s = df2.stack()
pd.DataFrame(s.index.get_level_values(0).values,
             columns=['Labels'],
             index=s.values.astype(int)).sort_index()
4.57 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = pd.Series(df2.index.repeat(2), index=df2.values.ravel())
s[s.index.notnull()].sort_index()
3.12 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
v = df2.index.repeat(df2.shape[1])
i = df2.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
3.1 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
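If you need the exact Labels frame from the question rather than a Series, here is a small follow-up sketch (assuming s is the Series produced by either approach above):
out = s.rename('Labels').to_frame()
out.index = out.index.astype(int)
print(out)
#    Labels
# 0       0
# 1       0
# 2       1
# 3       2
# 4       1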