Let's label a dataframe with two columns, A and B, and 100M rows. Starting at index i, we want to know whether the data in column B is trending up or trending down compared to the value at [i, 'A'].
Here is a loop:
import pandas as pd

df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})

for i in range(0, 5):
    j = i
    while j in range(i, i+5) and df.at[i, 'label'] == 0:  # if classified, no need to continue
        if df.at[j, 'B'] - df.at[i, 'A'] >= 10:
            df.at[i, 'label'] = 1  # Label 1 means trending up
        if df.at[j, 'B'] - df.at[i, 'A'] <= -10:
            df.at[i, 'label'] = 2  # Label 2 means trending down
        j = j + 1
[out]
A    B  label
0    1      1
1   10      2
2  -10      2
3    2      0
5    3      0
...
The estimated finishing time for this code is 30 days. (A human with a plot and a ruler might finish this task faster.)
What is a fast way to do this? Ideally without a loop.
Looping over a DataFrame is slow compared to using Pandas' vectorized methods.
The task can be accomplished using these vectorized tools:
rolling, which does computations over a rolling window
min & max, which we compute within the rolling window
np.select, which lets us set values based on conditional logic
Code
import numpy as np

def set_trend(df, threshold=10, window_size=2):
    '''
    Use a rolling window to find max/min values in a window starting at the current point.
    A rolling window normally looks at backward values.
    We use the technique from
    https://stackoverflow.com/questions/22820292/how-to-use-pandas-rolling-functions-on-a-forward-looking-basis/22820689#22820689
    to look at forward values.
    '''
    # To have a rolling window over lookahead values in column B,
    # we reverse the values in column B
    df['B_rev'] = df["B"].values[::-1]

    # Max & min in B_rev, then reverse the order of these max/min
    # https://stackoverflow.com/questions/50837012/pandas-rolling-min-max
    df['max_'] = df.B_rev.rolling(window_size, min_periods=0).max().values[::-1]
    df['min_'] = df.B_rev.rolling(window_size, min_periods=0).min().values[::-1]

    nrows = df.shape[0] - 1  # adjustment for argmax & argmin indexes since rows are in reverse order
    # i.e. idx = nrows - x.argmax() gives the index of the max in the non-reversed rows
    df['max_idx'] = df.B_rev.rolling(window_size, min_periods=0).apply(lambda x: nrows - x.argmax(), raw=True).values[::-1]
    df['min_idx'] = df.B_rev.rolling(window_size, min_periods=0).apply(lambda x: nrows - x.argmin(), raw=True).values[::-1]

    # Use np.select to implement the label assignment logic
    conditions = [
        (df['max_'] - df["A"] >= threshold) & (df['max_idx'] <= df['min_idx']),   # max above & comes first
        (df['min_'] - df["A"] <= -threshold) & (df['min_idx'] <= df['max_idx']),  # min below & comes first
        df['max_'] - df["A"] >= threshold,   # max above threshold but didn't come first
        df['min_'] - df["A"] <= -threshold,  # min below threshold but didn't come first
    ]
    choices = [
        1,  # max above & came first
        2,  # min below & came first
        1,  # max above threshold
        2,  # min below threshold
    ]
    df['label'] = np.select(conditions, choices, default=0)

    # Drop scratch computation columns
    df.drop(['B_rev', 'max_', 'min_', 'max_idx', 'min_idx'], axis=1, inplace=True)
    return df
Tests
Case 1
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
display(set_trend(df, 10, 4))
Case 2
df = pd.DataFrame({'A': [0,1,2], 'B': [1, -10, 10]})
display(set_trend(df, 10, 4))
Output
Case 1
   A   B  label
0  0   1      1
1  1  10      2
2  2 -10      2
3  3   2      0
4  5   3      0
5  0   0      0
6  0   0      0
7  0   0      0
8  0   0      0
9  0   0      0
Case 2
   A   B  label
0  0   1      2
1  1 -10      2
2  2  10      0
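As a usage note, window_size counts the current row, so window_size=5 mirrors the original loop's lookahead of j in range(i, i+5). Below is a hypothetical sketch of applying the set_trend function above to a larger frame; the data is random and purely illustrative:
import numpy as np
import pandas as pd

# Hypothetical larger frame; A/B values are random noise, only meant to exercise the function
rng = np.random.default_rng(0)
big = pd.DataFrame({'A': rng.normal(scale=10, size=100_000),
                    'B': rng.normal(scale=10, size=100_000)})

# window_size=5 matches the original loop's window of the current row plus the next 4
big = set_trend(big, threshold=10, window_size=5)
print(big['label'].value_counts())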
I have a similar question to one I posed here, but subtly different as it includes an extra step to the process involving a probability:
Using a Python pandas dataframe column as input to a loop through another column
I've got two pandas dataframes: one has these variables
Year Count Probability
1 8 25%
2 26 19%
3 17 26%
4 9 10%
Another is a table with these variables:
ID Value
1 100
2 25
3 50
4 15
5 75
Essentially I need to use the Count x from the first dataframe to loop through the 2nd dataframe x times, but only pull a value from the 2nd dataframe y percent of the time (using random number generation), and then create a new column in the first dataframe that holds the sum of the values pulled in the loop.
So, just to demonstrate, for that first row we'd loop through the 2nd table 8 times, but only pull a random value from that table 25% of the time, so we might get output of:
0 100 0 0 25 0 0 0
...which sums to 125, so our added column in the first table looks like
Year Count Probability Sum
1 8 25% 125
....and so on. Thanks in advance.
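To pin down the intent, here is a minimal loop-based sketch of the process as described (probabilities written as fractions here; the vectorized answers below avoid this per-draw loop):
import random
import pandas as pd

df1 = pd.DataFrame({'Year': [1, 2, 3, 4],
                    'Count': [8, 26, 17, 9],
                    'Probability': [0.25, 0.19, 0.26, 0.10]})
df2 = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'Value': [100, 25, 50, 15, 75]})

sums = []
for _, row in df1.iterrows():
    total = 0
    for _ in range(int(row['Count'])):
        # pull a random value from df2 only `Probability` fraction of the time
        if random.random() < row['Probability']:
            total += random.choice(df2['Value'].tolist())
    sums.append(total)

df1['Sum'] = sums
print(df1)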
We'll use NumPy's binomial and pandas' sample to get this done.
import pandas as pd
import numpy as np
# Set up dataframes
vals = pd.DataFrame([[1,8,'25%'], [2,26,'19%'], [3,17,'26%'],[4,9,'10%']])
vals.columns = ['Year', 'Count', 'Probability']
temp = pd.DataFrame([[1,100], [2,25], [3,50], [4,15], [5,75]])
temp.columns = ['ID', 'Value']
# Get probability fraction from string
vals['Numeric_Probability'] = pd.to_numeric(vals['Probability'].str.replace('%', '')) / 100
# Total rows is binomial random variable with n=Count, p=Probability.
vals['Total_Rows'] = np.random.binomial(n=vals['Count'], p=vals['Numeric_Probability'])
# Sample "total rows" from other DataFrame and sum.
vals['Sum'] = vals['Total_Rows'].apply(lambda x: temp['Value'].sample(
n=x, replace=True).sum())
# Drop intermediate columns
vals.drop(columns=['Numeric_Probability', 'Total_Rows'], inplace=True)
print(vals)
   Year  Count Probability  Sum
0     1      8         25%   15
1     2     26         19%  350
2     3     17         26%  190
3     4      9         10%    0
You could pass a probabilities list to np.random.choice:
In [1]: import numpy as np
...: import pandas as pd
In [2]: d_1 = {
   ...:     'Year': [1, 2, 3, 4],
   ...:     'Count': [8, 26, 17, 9],
   ...:     'Probability': ['25%', '19%', '26%', '10%'],
   ...: }
   ...: df_1 = pd.DataFrame(data=d_1)
In [3]: d_2 = {
   ...:     'ID': [1, 2, 3, 4, 5],
   ...:     'Value': [100, 25, 50, 15, 75],
   ...: }
   ...: df_2 = pd.DataFrame(data=d_2)
In [4]: def get_probabilities(values: pd.Series, percentage: float) -> list[float]:
   ...:     percentage /= 100
   ...:     percent_per_val = percentage / values.size
   ...:     return [percent_per_val] * values.size + [1 - percentage]
   ...:
In [5]: df_1['Sum'] = [
   ...:     np.random.choice(a=pd.concat([df_2['Value'], pd.Series([0])]),
   ...:                      size=n,
   ...:                      p=get_probabilities(values=df_2['Value'],
   ...:                                          percentage=float(percent[:-1]))).sum()
   ...:     for n, percent in zip(df_1['Count'], df_1['Probability'])
   ...: ]
   ...: df_1
Out[5]:
   Year  Count Probability  Sum
0     1      8         25%  100
1     2     26         19%  375
2     3     17         26%  275
3     4      9         10%   50
I have a pandas dataframe and I want to create categories in a new column based on the values of another column. I can solve my basic problem by doing this:
range = {
    range(0, 5): 'Below 5',
    range(6, 10): 'between',
    range(11, 1000): 'above'
}

df['range'] = df['value'].map(range)
In the final dictionary key I have chosen a large upper value for the range to ensure it captures all the values I am trying to map. However, this seems an ugly hack, and I am wondering how to generalise this without specifying the upper limit, i.e. if > 10 : 'above'.
Thanks
Assume you have a dataframe like this:
range value
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
Then you can apply the following function to the column 'value':
def get_value(range):
if range < 5:
return 'Below 5'
elif range < 10:
return 'Between 5 and 10'
else:
return 'Above 10'
df['value'] = df.apply(lambda col: get_value(col['range']), axis=1)
To get the result you want.
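As a small side note on the same approach, the row-wise df.apply(..., axis=1) can instead be applied to the single column, which is usually faster; a minimal equivalent sketch using the get_value function above:
# equivalent to the axis=1 apply above, but operates on the 'range' column directly
df['value'] = df['range'].apply(get_value)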
You could set all values first to 'above', and then use map() for the remaining options (thus with your range dict having only two items):
range = {
    range(0, 5): 'Below 5',
    range(6, 10): 'between',
}

df['range'] = 'above'
# update() only overwrites entries where the mapping produced a value, leaving 'above' elsewhere
df['range'].update(df['value'].map(range))
Thanks for the hints. I see I can achieve the same with:
df['range'] = df['value'].map(range).fillna('above')
pandas.Series.map also accepts a function as its first argument, so you could do:
def fun(x):
    if x in range(0, 5):
        return 'Below 5'
    elif x in range(6, 10):
        return 'between'
    elif x >= 11:
        return 'above'
then:
df['range'] = df['value'].map(fun)
Here's another approach using numpy.select, where you specify a list of boolean conditions, and a list of choices:
import numpy as np
import pandas as pd

# Setup
df = pd.DataFrame({'value': [1, 3, 6, 8, 20, 10000000]})

condlist = [
    df['value'].lt(5),
    df['value'].between(5, 10),
    df['value'].gt(10)]
choicelist = ['Below 5', 'between', 'above']
df['out'] = np.select(condlist, choicelist)
print(df)
[out]
value out
0 1 Below 5
1 3 Below 5
2 6 between
3 8 between
4 20 above
5 10000000 above
Another idea would be to use pandas.cut with bins and labels parameters specified:
df['out'] = pd.cut(df['value'], bins=[-np.inf, 5, 10, np.inf],
                   labels=['below', 'between', 'above'])
value out
0 1 below
1 3 below
2 6 between
3 8 between
4 20 above
5 10000000 above
df['range'] = pd.cut(df['value'], bins = [0, 5, 10, 1000], labels = ["below 5", "between", "above"])
I have a dictionary 'wordfreq' like this:
{'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
and I want to put the keys in a list called 'stopword' if the value is at least 5 and the key is not in another dataframe 'df'. Here is the df dataframe:
word freq
1 paradies 1
5 tucuman 1
and here is the code I am using:
stopword = []
for k, v in wordfreq.items():
    if v >= 5:
        if k not in list_c:  # list_c holds the words already present in df
            stopword.append(k)
Does anybody know how I can do the same thing with the isin() method, or at least more efficiently?
I'd load your dict into a df:
In [177]:
wordfreq = {'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
df = pd.DataFrame({'word':list(wordfreq.keys()), 'freq':list(wordfreq.values())})
df
Out[177]:
freq word
0 1 frogfeet
1 1 tucuman
2 57 paradies
3 1 d8848
4 5000 jobvark
5 100 midgley
6 1 jiaoyuwang
7 30 techsmart
8 2 weisman
9 19 walter
10 2 amdahl
And then filter using isin against the other df (df1 in my case) like this:
In [181]:
df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]
Out[181]:
freq word
4 5000 jobvark
5 100 midgley
7 30 techsmart
9 19 walter
So the boolean condition looks for freq values greater than 5 and, using isin with the inverted boolean mask ~, for words that are not in the other df.
You can then now get a list easily:
In [182]:
list(df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]['word'])
Out[182]:
['jobvark', 'midgley', 'techsmart', 'walter']
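As an aside, if only the list is needed, a plain-Python set lookup also works; a minimal sketch under the same assumed frames (df1 holding the words to exclude, and >= 5 matching the original loop's threshold):
# build a set for fast membership tests against the words already in df1
existing = set(df1['word'])
stopword = [k for k, v in wordfreq.items() if v >= 5 and k not in existing]
print(stopword)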
I am writing an algorithm to determine the intervals of the "mountains" on a density plot. The plot is taken from the depth data of a Kinect, if anyone is interested. Here is a quick visual example of what this algorithm finds (with the small mountains removed):
My current algorithm:
import numpy as np

def find_peak_intervals(data):
    previous = 0
    peak = False
    ranges = []
    begin_range = 0
    end_range = 0

    for current in xrange(len(data)):  # xrange: this is Python 2 code
        if (not peak) and ((data[current] - data[previous]) > 0):
            peak = True
            begin_range = current

        if peak and (data[current] == 0):
            peak = False
            end_range = current
            ranges.append((begin_range, end_range))

        previous = current

    return np.array(ranges)
The function works but it takes nearly 3 milliseconds on my laptop, and I need to be able to run my entire program at at least 30 frames per second. This function is rather ugly and I have to run it 3 times per frame for my program, so I would like any hints as to how to simplify and optimize this function (maybe something from numpy or scipy that I missed).
Assuming a pandas dataframe like so:
Value
0 0
1 3
2 2
3 2
4 1
5 2
6 3
7 0
8 1
9 3
10 0
11 0
12 0
13 1
14 0
15 3
16 2
17 3
18 1
19 0
You can get the contiguous non-zero ranges by using df["Value"].shift(x) where x could either be 1 or -1 so you can check if it's bounded by zeroes. Once you get the boundaries, you can just store their index pairs and use them later on when filtering the data.
The following code is based on the excellent answer here by @behzad.nouri.
import pandas as pd
df = pd.read_csv("data.csv")
# Or you can use df = pd.DataFrame.from_dict({'Value': {0: 0, 1: 3, 2: 2, 3: 2, 4: 1, 5: 2, 6: 3, 7: 0, 8: 1, 9: 3, 10: 0, 11: 0, 12: 0, 13: 1, 14: 0, 15: 3, 16: 2, 17: 3, 18: 1, 19: 0}})
# --
# from https://stackoverflow.com/questions/24281936
# credits to #behzad.nouri
df['tag'] = df['Value'] > 0
fst = df.index[df['tag'] & ~ df['tag'].shift(1).fillna(False)]
lst = df.index[df['tag'] & ~ df['tag'].shift(-1).fillna(False)]
pr = [(i, j) for i, j in zip(fst, lst)]
# --
for i, j in pr:
    print df.loc[i:j, "Value"]
This gives the result:
1 3
2 2
3 2
4 1
5 2
6 3
Name: Value, dtype: int64
8 1
9 3
Name: Value, dtype: int64
13 1
Name: Value, dtype: int64
15 3
16 2
17 3
18 1
Name: Value, dtype: int64
Timing it in IPython gives the following:
%timeit find_peak_intervals(df)
1000 loops, best of 3: 1.49 ms per loop
This is not too far from your attempt speed-wise. An alternative is to convert the pandas series to a numpy array and operate from there. Let's take another excellent answer, this one by @Warren Weckesser, and modify it to suit your needs. Let's time it as well.
In [22]: np_arr = np.array(df["Value"])
In [23]: def greater_than_zero(a):
    ...:     isntzero = np.concatenate(([0], np.greater(a, 0).view(np.int8), [0]))
    ...:     absdiff = np.abs(np.diff(isntzero))
    ...:     ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    ...:     return ranges
In [24]: %timeit greater_than_zero(np_arr)
100000 loops, best of 3: 17.1 µs per loop
Not so bad at 17.1 microseconds, and it gives the same ranges as well.
[1 7] # Basically same as indices 1-6 in pandas.
[ 8 10] # 8, 9
[13 14] # 13, 13
[15 19] # 15, 18
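To tie the numpy result back to pandas, here is a small usage sketch (under the same df and np_arr as above) that recovers the pandas slices from the half-open ranges:
ranges = greater_than_zero(np_arr)
for start, stop in ranges:
    # each pair is a half-open [start, stop) range, so the last non-zero index is stop - 1
    print(df.loc[start:stop - 1, "Value"])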