I have this basic dataframe:
dur type src dst
0 0 new 543 1
1 0 new 21 1
2 1 old 4828 2
3 0 new 321 1
...
(total 450000 rows)
My aim is to replace the values in src with either 1, 2 or 3, depending on the value. I created a for loop/if else below:
for i in df['src']:
    if i <= 1000:
        df['src'].replace(to_replace=[i], value=[1], inplace=True)
    elif i <= 2500:
        df['src'].replace(to_replace=[i], value=[2], inplace=True)
    elif i <= 5000:
        df['src'].replace(to_replace=[i], value=[3], inplace=True)
    else:
        print('End!')
The above works as intended, but it is awfully slow trying to replace the entire dataframe with 450000 rows (it is taking more than 30 minutes to do this!).
Is there a more Pythonic way to speed up this algorithm?
Try numpy.select for multiple conditions:
cond1 = df.src.le(1000)
cond2 = df.src.le(2500)
cond3 = df.src.le(5000)
condlist = [cond1, cond2, cond3]
choicelist = [1, 2, 3]
df.assign(src=np.select(condlist, choicelist))
dur type src dst
0 0 new 1 1
1 0 new 1 1
2 1 old 3 2
3 0 new 1 1
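One note, not from the original answer: np.select evaluates the conditions in order and takes the first match, so the overlapping le tests above are safe. To modify the frame rather than returning a copy, the result can also be assigned back directly; a minimal sketch reusing the condlist and choicelist above, with an assumed default of 0 for src values above 5000 (the original loop left them untouched):
import numpy as np

# The first matching condition wins; rows with src > 5000 match nothing
# and fall through to the default (0 here, an assumption)
df['src'] = np.select(condlist, choicelist, default=0)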
I have not tested this, but I think this should work:
pd.cut(df.src, [0, 1000, 2500, 5000], labels=[1, 2, 3])
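To expand on that untested suggestion: pd.cut returns a Categorical, so a cast may be wanted before assigning back. A sketch, assuming every src value falls in (0, 5000]:
import pandas as pd

# Bins are right-inclusive by default: (0, 1000] -> 1, (1000, 2500] -> 2,
# (2500, 5000] -> 3; values outside the bins would become NaN
df['src'] = pd.cut(df['src'], bins=[0, 1000, 2500, 5000], labels=[1, 2, 3]).astype(int)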
Let's label a dataframe with two columns, A and B, and 100M rows. Starting at index i, we want to know whether the data in column B is trending down or trending up compared to the value at [i, 'A'].
Here is a loop:
import pandas as pd
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
for i in range(0, 5):
    j = i
    while j in range(i, i + 5) and df.at[i, 'label'] == 0:  # if classified, no need to continue
        if df.at[j, 'B'] - df.at[i, 'A'] >= 10:
            df.at[i, 'label'] = 1  # Label 1 means trending up
        if df.at[j, 'B'] - df.at[i, 'A'] <= -10:
            df.at[i, 'label'] = 2  # Label 2 means trending down
        j = j + 1
[out]
A B label
0 1 1
1 10 2
2 -10 2
3 2 0
5 3 0
...
The estimated finishing time for this code is 30 days. (A human with a plot and a ruler might finish this task faster.)
What is a fast way to do this? Ideally without a loop.
Looping over a DataFrame is slow compared to using Pandas methods.
The task can be accomplished with Pandas vectorized methods:
- rolling, which does computations in a rolling window
- min & max, which we compute within the rolling window
- np.select, which allows us to set values based upon logic
Code
import numpy as np

def set_trend(df, threshold=10, window_size=2):
    '''
    Use a rolling window to find max/min values in a window from the current point.
    A rolling window normally looks at backward values.
    We use the technique from https://stackoverflow.com/questions/22820292/how-to-use-pandas-rolling-functions-on-a-forward-looking-basis/22820689#22820689
    to look at forward values.
    '''
    # To have a rolling window over lookahead values in column B,
    # we reverse the values in column B
    df['B_rev'] = df["B"].values[::-1]

    # Max & min in B_rev, then reverse the order of these max/min
    # https://stackoverflow.com/questions/50837012/pandas-rolling-min-max
    df['max_'] = df.B_rev.rolling(window_size, min_periods=0).max().values[::-1]
    df['min_'] = df.B_rev.rolling(window_size, min_periods=0).min().values[::-1]

    nrows = df.shape[0] - 1  # adjustment for argmax & argmin indexes since rows are in reverse order
    # i.e. idx = nrows - x.argmax() gives the index of the max in the non-reversed rows
    df['max_idx'] = df.B_rev.rolling(window_size, min_periods=0).apply(lambda x: nrows - x.argmax(), raw=True).values[::-1]
    df['min_idx'] = df.B_rev.rolling(window_size, min_periods=0).apply(lambda x: nrows - x.argmin(), raw=True).values[::-1]

    # Use np.select to implement the label assignment logic
    conditions = [
        (df['max_'] - df["A"] >= threshold) & (df['max_idx'] <= df['min_idx']),   # max above & comes first
        (df['min_'] - df["A"] <= -threshold) & (df['min_idx'] <= df['max_idx']),  # min below & comes first
        df['max_'] - df["A"] >= threshold,    # max above threshold but didn't come first
        df['min_'] - df["A"] <= -threshold,   # min below threshold but didn't come first
    ]
    choices = [
        1,  # max above & came first
        2,  # min below & came first
        1,  # max above threshold
        2,  # min below threshold
    ]
    df['label'] = np.select(conditions, choices, default=0)

    # Drop scratch computation columns
    df.drop(['B_rev', 'max_', 'min_', 'max_idx', 'min_idx'], axis=1, inplace=True)
    return df
Tests
Case 1
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
display(set_trend(df, 10, 4))
Case 2
df = pd.DataFrame({'A': [0,1,2], 'B': [1, -10, 10]})
display(set_trend(df, 10, 4))
Output
Case 1
A B label
0 0 1 1
1 1 10 2
2 2 -10 2
3 3 2 0
4 5 3 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
Case 2
A B label
0 0 1 2
1 1 -10 2
2 2 10 0
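As an aside, not from the answer above: pandas 1.1+ also provides FixedForwardWindowIndexer, which gives a forward-looking rolling window directly and avoids the double reversal. A minimal sketch:
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

df = pd.DataFrame({'B': [1, 10, -10, 2, 3, 0, 0, 0, 0, 0]})
# The window covers the current row plus the next window_size - 1 rows
indexer = FixedForwardWindowIndexer(window_size=4)
df['fwd_max'] = df['B'].rolling(indexer, min_periods=1).max()
df['fwd_min'] = df['B'].rolling(indexer, min_periods=1).min()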
I have a pandas dataframe (100,000 obs) with 11 columns.
I'm trying to assign df['trade_sign'] values based on df['diff'] (which is a pd.Series of integer values):
If diff is positive, then trade_sign = 1
If diff is negative, then trade_sign = -1
If diff is 0, then trade_sign = 0
What I've tried so far:
pos['trade_sign'] = (pos['trade_sign'] > 0)
pos['trade_sign'].replace({False: -1, True: 1}, inplace=True)
But this obviously doesn't take into account 0 values.
I also tried for loops with if conditions but that didn't work.
Essentially, how do I fix my .replace function to take into account diff values of 0?
Ideally, I'd prefer a solution that uses numpy over for loops with if conditions.
There's a sign function in numpy that returns -1, 0, or 1 according to the sign of each value, which covers all three cases:
df["trade_sign"] = np.sign(df["diff"])
If you want integers,
df["trade_sign"] = np.sign(df["diff"]).astype(int)
a = [-1 if v < 0 else (1 if v > 0 else 0) for v in df['diff'].values]
df['trade_sign'] = a
You could do it this way:
pos['trade_sign'] = (pos['diff'] > 0) * 1 + (pos['diff'] < 0) * -1
The boolean results of the element-wise > and < comparisons automatically get converted to int in order to allow multiplication with 1 and -1, respectively.
This sample input and test code:
import pandas as pd
pos = pd.DataFrame({'diff': [-9, 0, 9, -8, 0, 8, -18, 4, 3, 2, 0]})
pos['trade_sign'] = (pos['diff'] > 0) * 1 + (pos['diff'] < 0) * -1
print(pos)
... gives this output:
diff trade_sign
0 -9 -1
1 0 0
2 9 1
3 -8 -1
4 0 0
5 8 1
6 -18 -1
7 4 1
8 3 1
9 2 1
10 0 0
UPDATE: In addition to the solution above, as well as some of the other excellent ideas in other answers, you can use numpy where:
pos['trade_sign'] = np.where(pos['diff'] > 0, 1, np.where(pos['diff'] < 0, -1, 0))
I am trying to create a new pandas column which is normalised data from another column.
I created three separate series and then merged them into one.
While this approach has given me the desired result, I was wondering whether there's a better way to do this.
x = df["Data Col"].copy()
#If the value is between 30 and 70, take the difference from the previous value.
#Positive difference = 1 & negative difference = -1
btw = pd.Series(np.where(x.between(30, 70, inclusive=False), x.diff(), 0))
btw[btw < 0] = -1
btw[btw > 0] = 1
#All values above 70 are -1
up = pd.Series(np.where(x.gt(70), -1, 0))
#All values below 30 are 1
dw = pd.Series(np.where(x.lt(30), 1, 0))
combined = up + dw + btw
df["Normalised Col"] = np.array(combined)
I tried to use functions and loops directly on the Pandas data column, but I couldn't figure out how to incorporate .diff().
Use numpy.select, chaining the masks with & for bitwise AND and | for bitwise OR:
np.random.seed(2019)
df = pd.DataFrame({'Data Col':np.random.randint(10, 100, size=10)})
#print (df)
d = df["Data Col"].diff()
m1 = df["Data Col"].between(30, 70, inclusive=False)
m2 = d < 0
m3 = d > 0
m4 = df["Data Col"].gt(70)
m5 = df["Data Col"].lt(30)
df["Normalised Col1"] = np.select([(m1 & m2) | m4, (m1 & m3) | m5], [-1, 1], default=0)
print (df)
Data Col Normalised Col1
0 82 -1
1 41 -1
2 47 1
3 98 -1
4 72 -1
5 34 -1
6 39 1
7 25 1
8 22 1
9 26 1
I used the code below to map the values of 2 in the S column to 0, but it didn't work. Any suggestion on how to solve this?
N.B.: I want to implement an external function inside the map.
df = pd.DataFrame({
    'Age': [30,40,50,60,70,80],
    'Sex': ['F','M','M','F','M','F'],
    'S': [1,1,2,2,1,2]
})

def app(value):
    for n in df['S']:
        if n == 1:
            return 1
        if n == 2:
            return 0

df["S"] = df.S.map(app)
Use eq to create a boolean series and convert that boolean series to int with astype:
df['S'] = df['S'].eq(1).astype(int)
OR
df['S'] = (df['S'] == 1).astype(int)
Output:
Age Sex S
0 30 F 1
1 40 M 1
2 50 M 0
3 60 F 0
4 70 M 1
5 80 F 0
Don't use apply, simply use loc to assign the values:
df.loc[df.S.eq(2), 'S'] = 0
Age Sex S
0 30 F 1
1 40 M 1
2 50 M 0
3 60 F 0
4 70 M 1
5 80 F 0
If you need a more performant option, use np.select. This is also more scalable, as you can always add more conditions:
df['S'] = np.select([df.S.eq(2)], [0], 1)
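For instance, if S could hypothetically contain a third code that also needed remapping, another condition slots in (illustrative only; the value 3 and its mapping are assumptions):
# Hypothetical extension: also map 3 -> 2, keep everything else as 1
df['S'] = np.select([df.S.eq(2), df.S.eq(3)], [0, 2], 1)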
You're close, but you need a few corrections. Since you want to use a function, remove the for loop and replace n with value. Then use apply (or map; for a Series, both call the function on each element). See this answer for how to properly use apply vs applymap vs map
def app(value):
    if value == 1:
        return 1
    elif value == 2:
        return 0

df['S'] = df.S.apply(app)
Age Sex S
0 30 F 1
1 40 M 1
2 50 M 0
3 60 F 0
4 70 M 1
5 80 F 0
If you only wish to change values equal to 2, you can use pd.DataFrame.loc:
df.loc[df['S'] == 2, 'S'] = 0
pd.Series.apply is not recommended, as it is just a thinly veiled, inefficient loop.
You could use .replace as follows:
df["S"] = df["S"].replace([2], 0)
This replaces all 2 values with 0 in one line.
Go with a vectorized numpy operation:
df['S'] = np.abs(df['S'] - 2)
(1 maps to |1 - 2| = 1 and 2 maps to |2 - 2| = 0.) And stand out from the competition in interviews and SO answers :)
>>> df = pd.DataFrame({'Age': [30,40,50,60,70,80],
...                    'Sex': ['F','M','M','F','M','F'],
...                    'S': [1,1,2,2,1,2]})
>>> def app(value):
...     return 1 if value == 1 else 0
...
>>> # or: app = lambda value: 1 if value == 1 else 0
>>> df["S"] = df["S"].map(app)
>>> df
   Age  S Sex
0   30  1   F
1   40  1   M
2   50  0   M
3   60  0   F
4   70  1   M
5   80  0   F
You can do:
import numpy as np
df['S'] = np.where(df['S'] == 2, 0, df['S'])
I am iterating over a pandas dataframe and finding it to be extremely slow. I understand that in Pandas you try to vectorize everything, but in this case I specifically need to iterate (or, if it is possible to vectorize, I'm unclear how to do it).
The logic is simple: you have two columns "A" and "B" and a result column "signal". If A equals 1, then you set signal to 1. If B equals 1, then you set signal to 0. Otherwise, signal stays whatever it was previously. In other words, column A is an "on" signal, column B is an "off" signal, and "signal" represents the state.
Here is my code:
def signals(indata):
    numrows = len(indata)
    data = pd.DataFrame(index=range(0, numrows))
    data['A'] = indata['A']
    data['B'] = indata['B']
    data['signal'] = 0
    for i in range(1, numrows):
        if data['A'].iloc[i] == 1:
            data['signal'].iloc[i] = 1
        elif data['B'].iloc[i] == 1:
            data['signal'].iloc[i] = 0
        else:
            data['signal'].iloc[i] = data['signal'].iloc[i-1]
    return data
Example input/output:
indata = pd.DataFrame(index = range(0,10))
indata['A'] = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
indata['B'] = [1, 0, 0, 0, 1, 0, 0, 0, 1, 1]
signals(indata)
Output:
A B signal
0 0 1 0
1 1 0 1
2 0 0 1
3 0 0 1
4 0 1 0
5 0 0 0
6 1 0 1
7 0 0 1
8 0 1 0
9 0 1 0
This simple logic takes my computer 46 seconds to run on a dataframe of 2000 rows with randomly generated data.
df['signal'] = df.A.groupby((df.A != df.B).cumsum()).transform('head', 1)
df
A B signal
0 0 1 0
1 1 0 1
2 0 0 1
3 0 0 1
4 0 1 0
5 0 0 0
6 1 0 1
7 0 0 1
8 0 1 0
9 0 1 0
The logic here involves dividing your series into groups based on the inequality between A and B, and every group's value is determined by the first A in that group.
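To make the grouping concrete, here is the intermediate key for the sample input (an illustrative addition, not part of the original answer):
key = (df.A != df.B).cumsum()
# For the sample data this is [1, 2, 2, 2, 3, 3, 4, 4, 5, 6]:
# a new group starts at every row where A != B, and broadcasting the
# first A of each group across the group yields the on/off state above.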
You don't need to iterate at all; you can do some Boolean indexing:
#set condition for A
indata.loc[indata.A == 1,'signal'] = 1
#set condition for B
indata.loc[indata.B == 1,'signal'] = 0
#forward fill NaN values
indata.signal.fillna(method='ffill',inplace=True)
The simplest answer to my problem was to not write to the dataframe while iterating through it. I created an array of zeros in numpy, then did my iterative logic in the array. Then I wrote the array to the column in my dataframe.
def signals3(indata):
    numrows = len(indata)
    data = pd.DataFrame(index=range(0, numrows))
    data['A'] = indata['A']
    data['B'] = indata['B']
    out_signal = np.zeros(numrows)
    for i in range(1, numrows):
        if data['A'].iloc[i] == 1:
            out_signal[i] = 1
        elif data['B'].iloc[i] == 1:
            out_signal[i] = 0
        else:
            out_signal[i] = out_signal[i-1]
    data['signal'] = out_signal
    return data
On a dataframe of 2000 rows of random data, this takes only 43 milliseconds as opposed to 46 seconds (~1,000x faster).
I also tried a variant where I assigned the dataframe columns A and B to series, and then iterated through the series. This was a bit faster (27 milliseconds). But it appears most of the slowness is in writing to a dataframe.
Both coldspeed's and djk's answers were faster than my solution (about 4.5 ms), but in practice I'll probably just iterate through series even though that is not optimal.
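For reference, a minimal sketch of what that series-based variant might have looked like (names and details assumed, not from the original post):
import numpy as np
import pandas as pd

def signals_series(indata):
    # Pull the columns out once so the loop does Series lookups,
    # not repeated DataFrame indexing
    a, b = indata['A'], indata['B']
    out_signal = np.zeros(len(indata))
    for i in range(1, len(indata)):
        if a.iloc[i] == 1:
            out_signal[i] = 1
        elif b.iloc[i] == 1:
            out_signal[i] = 0
        else:
            out_signal[i] = out_signal[i - 1]
    result = indata.copy()
    result['signal'] = out_signal
    return result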