Speed up turn probabilities into binary features - python

I have a dataframe with 3 columns. Each row gives the probabilities that the feature T takes the value 1, 2 or 3 for that row:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({"T1" : [0.8,0.5,0.01],"T2":[0.1,0.2,0.89],"T3":[0.1,0.3,0.1]})
For row 0, T is 1 with 80% chance, 2 with 10% and 3 with 10%
I want to simulate the value of T for each row and change the columns T1,T2, T3 to binary features.
I have a solution, but it loops over the rows of the dataframe and is really slow (my real dataframe has over 1 million rows):
possib = df.columns
for i in range(df.shape[0]):
    probas = df.iloc[i][possib].tolist()
    choix_transp = np.random.choice(possib,1, p=probas)[0]
    for pos in possib:
        if pos==choix_transp:
            df.iloc[i][pos] = 1
        else:
            df.iloc[i][pos] = 0
Is there a way to vectorize this code ?
Thank you !

Here's one based on vectorized random.choice with a given matrix of probabilities -
def matrixprob_to_onehot(ar):
    # Get one-hot encoded boolean array based on matrix of probabilities
    c = ar.cumsum(axis=1)
    idx = (np.random.rand(len(c), 1) < c).argmax(axis=1)
    ar_out = np.zeros(ar.shape, dtype=bool)
    ar_out[np.arange(len(idx)),idx] = 1
    return ar_out
ar_out = matrixprob_to_onehot(df.values)
df_out = pd.DataFrame(ar_out.view('i1'), index=df.index, columns=df.columns)
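To see why the cumulative sum picks the right column, here is a small worked sketch that uses hand-picked draws instead of np.random.rand. The values in u are an assumption chosen for illustration and are not part of the answer above. Each draw selects the first column whose cumulative probability exceeds it.

```python
import numpy as np

probs = np.array([[0.80, 0.10, 0.1],
                  [0.50, 0.20, 0.3],
                  [0.01, 0.89, 0.1]])

# Hand-picked "random" draws, one per row (illustrative values only)
u = np.array([[0.85], [0.40], [0.95]])

# Row-wise cumulative probabilities, e.g. row 0 -> [0.8, 0.9, 1.0]
c = probs.cumsum(axis=1)

# argmax returns the position of the first True, i.e. the first
# column whose cumulative probability is larger than the draw
idx = (u < c).argmax(axis=1)

# Build the one-hot matrix from the selected column indices
onehot = np.zeros(probs.shape, dtype=np.int8)
onehot[np.arange(len(idx)), idx] = 1
print(idx)
print(onehot)
```

One caveat: because the cumulative sums are floating point, the last column of c can land marginally below 1. A draw above it would match no column, and argmax would then fall back to column 0. Setting the last column of c to 1 before comparing is one way to guard against that.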
Verify with a large dataset for the probabilities -
In [139]: df = pd.DataFrame({"T1" : [0.8,0.5,0.01],"T2":[0.1,0.2,0.89],"T3":[0.1,0.3,0.1]})
In [140]: df
Out[140]:
T1 T2 T3
0 0.80 0.10 0.1
1 0.50 0.20 0.3
2 0.01 0.89 0.1
In [141]: p = np.array([matrixprob_to_onehot(df.values) for i in range(100000)]).argmax(2)
In [142]: np.array([np.bincount(p[:,i])/100000.0 for i in range(len(df))])
Out[142]:
array([[0.80064, 0.0995 , 0.09986],
[0.50051, 0.20113, 0.29836],
[0.01015, 0.89045, 0.0994 ]])
In [145]: np.round(_,2)
Out[145]:
array([[0.8 , 0.1 , 0.1 ],
[0.5 , 0.2 , 0.3 ],
[0.01, 0.89, 0.1 ]])
Timings on 1,000,000 rows -
# Setup input
In [169]: N = 1000000
...: a = np.random.rand(N,3)
...: df = pd.DataFrame(a/a.sum(1,keepdims=True),columns=['T1','T2','T3'])
# @gmds's soln
In [171]: %timeit pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
1 loop, best of 3: 4.82 s per loop
# Soln from this post
In [172]: %%timeit
...: ar_out = matrixprob_to_onehot(df.values)
...: df_out = pd.DataFrame(ar_out.view('i1'), index=df.index, columns=df.columns)
10 loops, best of 3: 43.1 ms per loop

We can use numpy for this:
result = pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
This generates a single column of random values and compares it to the cumulative sum taken across the columns of each row. The result is a DataFrame of booleans in which the first False value in each row shows which "bucket" the random value falls in. With idxmin, we get the column label of this bucket, which we can then convert to binary columns with pd.get_dummies.
Example:
import numpy as np
import pandas as pd
np.random.seed(0)
data = np.random.rand(10, 3)
normalised = data / data.sum(axis=1)[:, np.newaxis]
df = pd.DataFrame(normalised)
result = pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
print(result)
Output:
0 1 2
0 1 0 0
1 0 0 1
2 0 1 0
3 0 1 0
4 1 0 0
5 0 0 1
6 0 1 0
7 0 1 0
8 0 0 1
9 0 1 0
A note:
Most of the slowdown comes from pd.get_dummies; if you use Divakar's method of pd.DataFrame(result.view('i1'), index=df.index, columns=df.columns), it gets a lot faster.

Related

Python/Pandas: Calculating RMS in sections

What is the best way to calculate the RMS of a column in sections in python/pandas. Here is a example for a better understanding what I mean:
index  x    x_rms
0      2
1      3    2.55
2      10
3      22   17.09
...    ...  ...
So 2.55 is the RMS of 2 and 3, 17.09 is the RMS of 10 and 22 and so on.
The following will work:
import numpy as np
import pandas as pd
df = pd.DataFrame([2,3,10,22], columns=["x"])
def rms(a, b):
    # return round(np.sqrt((a**2+b**2)/2), 2) # for only two decimals
    return np.sqrt((a**2+b**2)/2)
df["rms"] = [rms(df.loc[idx-1,"x"], val["x"]) if idx%2 != 0 else np.nan
             for idx, val in df.iterrows()]
output
x rms
0 2 NaN
1 3 2.549510
2 10 NaN
3 22 17.088007
EDIT regarding comment
If your index is a date, you should do this to get the same output:
values = [2,3,10,22]
tidx = pd.date_range('2019-01-01', periods=len(values), freq='D')
df = pd.DataFrame(values, columns=["x"], index=tidx)
def rms(a, b):
    # return round(np.sqrt((a**2+b**2)/2), 2) # for only two decimals
    return np.sqrt((a**2+b**2)/2)
# Move the dates into a column so the rows get integer positions 0, 1, 2, ...
df = df.reset_index()
df["rms"] = [rms(df.loc[idx-1,"x"], val["x"]) if idx%2 != 0 else np.nan
             for idx, val in df.iterrows()]
# Restore the date index (set_index returns a new DataFrame)
df = df.set_index("index")
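For long columns, the row-by-row iterrows loop can become slow. A vectorised alternative is sketched below, assuming sections of a fixed size (the name `size` is introduced here for illustration) and that the RMS should appear on the last row of each section. It works on row positions, so it does not depend on the index type.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([2, 3, 10, 22], columns=["x"])

size = 2                         # rows per section (assumed fixed)
pos = np.arange(len(df))         # row positions, independent of the index
section = pos // size            # section id for every row: 0, 0, 1, 1

# Mean of the squares within each section, broadcast back to every row
mean_sq = (df["x"] ** 2).groupby(section).transform("mean")

# Keep the RMS only on the last row of each section, NaN elsewhere
df["rms"] = np.sqrt(mean_sq).where(pos % size == size - 1)
print(df)
```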

Labeling whether the numbers in a dataframe are going up first or down first

Let's label a dataframe with two columns, A and B, and 100M rows. Starting at index i, we want to know whether the data in column B is trending down or trending up compared to the data at [i, 'A'].
Here is a loop:
import pandas as pd
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
for i in range (0,5):
    j = i
    while j in range (i,i+5) and df.at[i,'label'] == 0: #if classified, no need to continue
        if df.at[j,'B']-df.at[i,'A']>= 10:
            df.at[i,'label'] = 1 #Label 1 means trending up
        if df.at[j,'B']-df.at[i,'A']<= -10:
            df.at[i,'label'] = 2 #Label 2 means trending down
        j=j+1
[out]
A B label
0 1 1
1 10 2
2 -10 2
3 2 0
5 3 0
...
The estimated finishing time for this code is 30 days. (A human with a plot and a ruler might finish this task faster.)
What is a fast way to do this? Ideally without a loop.
Looping over a DataFrame is slow compared to using pandas methods.
The task can be accomplished with vectorized pandas and NumPy methods:
rolling: performs computations in a rolling window
min & max: computed within each rolling window
np.select: sets values based on a list of conditions
Code
import numpy as np
import pandas as pd
def set_trend(df, threshold = 10, window_size = 2):
    '''
    Use rolling_window to find max/min values in a window from the current point
    rolling window normally looks at backward values
    We use technique from https://stackoverflow.com/questions/22820292/how-to-use-pandas-rolling-functions-on-a-forward-looking-basis/22820689#22820689
    to look at forward values
    '''
    # To have a rolling window on lookahead values in column B
    # We reverse values in column B
    df['B_rev'] = df["B"].values[::-1]
    # Max & Min in B_rev, then reverse order of these max/min
    # https://stackoverflow.com/questions/50837012/pandas-rolling-min-max
    df['max_'] = df.B_rev.rolling(window_size, min_periods = 0).max().values[::-1]
    df['min_'] = df.B_rev.rolling(window_size, min_periods = 0).min().values[::-1]
    nrows = df.shape[0] - 1 # adjustment for argmax & argmin indexes since rows are in reverse order
    # i.e. idx = nrows - x.argmax() gives index for max in non-reverse row
    df['max_idx'] = df.B_rev.rolling(window_size, min_periods = 0).apply(lambda x: nrows - x.argmax(), raw = True).values[::-1]
    df['min_idx'] = df.B_rev.rolling(window_size, min_periods = 0).apply(lambda x: nrows - x.argmin(), raw = True).values[::-1]
    # Use np.select to implement label assignment logic
    conditions = [
        (df['max_'] - df["A"] >= threshold) & (df['max_idx'] <= df['min_idx']), # max above & comes first
        (df['min_'] - df["A"] <= -threshold) & (df['min_idx'] <= df['max_idx']), # min below & comes first
        df['max_'] - df["A"] >= threshold, # max above threshold but didn't come first
        df['min_'] - df["A"] <= -threshold, # min below threshold but didn't come first
    ]
    choices = [
        1, # max above & came first
        2, # min below & came first
        1, # max above threshold
        2, # min below threshold
    ]
    df['label'] = np.select(conditions, choices, default = 0)
    # Drop scratch computation columns
    df.drop(['B_rev', 'max_', 'min_', 'max_idx', 'min_idx'], axis = 1, inplace = True)
    return df
Tests
Case 1
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
display(set_trend(df, 10, 4))
Case 2
df = pd.DataFrame({'A': [0,1,2], 'B': [1, -10, 10]})
display(set_trend(df, 10, 4))
Output
Case 1
A B label
0 0 1 1
1 1 10 2
2 2 -10 2
3 3 2 0
4 5 3 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
Case 2
A B label
0 0 1 2
1 1 -10 2
2 2 10 0
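The reverse-then-roll trick above is one way to get a forward-looking window. Recent pandas versions (1.1 and later, as far as I know) also provide pandas.api.indexers.FixedForwardWindowIndexer, which may express the same idea more directly. The sketch below only computes the forward max and min for the Case 2 data. It is not a drop-in replacement for set_trend.

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

df = pd.DataFrame({'A': [0, 1, 2], 'B': [1, -10, 10]})

# Each window covers the current row and the rows after it
# (up to 4 rows in total); windows are truncated at the end of the data
indexer = FixedForwardWindowIndexer(window_size=4)

# min_periods=1 keeps the truncated windows near the end from becoming NaN
fwd_max = df['B'].rolling(indexer, min_periods=1).max()
fwd_min = df['B'].rolling(indexer, min_periods=1).min()
print(pd.DataFrame({'fwd_max': fwd_max, 'fwd_min': fwd_min}))
```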

Looping through a second column using a probability input

I have a similar question to one I posed here, but subtly different as it includes an extra step to the process involving a probability:
Using a Python pandas dataframe column as input to a loop through another column
I've got two pandas dataframes: one has these variables
Year Count Probability
1 8 25%
2 26 19%
3 17 26%
4 9 10%
Another is a table with these variables:
ID Value
1 100
2 25
3 50
4 15
5 75
Essentially I need to use the Count x in the first dataframe to loop through the 2nd dataframe x times, but only pull a value from the 2nd dataframe y percent of the times (using random number generation) - and then create a new column in the first dataframe that represents the sum of the values in the loop.
So - just to demonstrate - in that first column, we'd loop through the 2nd table 8 times, but only pull a random value from that table 25% of the time - so we might get output of:
0 100 0 0 25 0 0 0
...which sums to 125 - so our added column to the first table looks like
Year Count Probability Sum
1 8 25% 125
....and so on. Thanks in advance.
We'll use numpy binomial and pandas sample to get this done.
import pandas as pd
import numpy as np
# Set up dataframes
vals = pd.DataFrame([[1,8,'25%'], [2,26,'19%'], [3,17,'26%'],[4,9,'10%']])
vals.columns = ['Year', 'Count', 'Probability']
temp = pd.DataFrame([[1,100], [2,25], [3,50], [4,15], [5,75]])
temp.columns = ['ID', 'Value']
# Get probability fraction from string
vals['Numeric_Probability'] = pd.to_numeric(vals['Probability'].str.replace('%', '')) / 100
# Total rows is binomial random variable with n=Count, p=Probability.
vals['Total_Rows'] = np.random.binomial(n=vals['Count'], p=vals['Numeric_Probability'])
# Sample "total rows" from other DataFrame and sum.
vals['Sum'] = vals['Total_Rows'].apply(lambda x: temp['Value'].sample(
    n=x, replace=True).sum())
# Drop intermediate columns
vals.drop(columns=['Numeric_Probability', 'Total_Rows'], inplace=True)
print(vals)
Year Count Probability Sum
0 1 8 25% 15
1 2 26 19% 350
2 3 17 26% 190
3 4 9 10% 0
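The binomial shortcut works because the number of successful pulls in Count independent trials, each succeeding with the given probability, follows a binomial distribution. For comparison, here is a sketch that simulates every trial explicitly for the first row (Count 8, probability 25%), producing a sequence like the 0 100 0 0 25 0 0 0 in the question. The seed and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily
values = np.array([100, 25, 50, 15, 75])
count, prob = 8, 0.25

# One uniform draw per trial: True where a value is actually pulled
pulled = rng.random(count) < prob

# A candidate value for every trial, sampled with replacement
draws = rng.choice(values, size=count)

# Keep the candidate on successful trials, 0 otherwise
sequence = np.where(pulled, draws, 0)
total = sequence.sum()
print(sequence, total)
```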
You could pass a list of probabilities to np.random.choice:
In [1]: import numpy as np
...: import pandas as pd
In [2]: d_1 = {
...: 'Year': [1, 2, 3, 4],
...: 'Count': [8, 26, 17, 9],
...: 'Probability': ['25%', '19%', '26%', '10%'],
...: }
...: df_1 = pd.DataFrame(data=d_1)
In [3]: d_2 = {
...: 'ID': [1, 2, 3, 4, 5],
...: 'Value': [100, 25, 50, 15, 75],
...: }
...: df_2 = pd.DataFrame(data=d_2)
In [4]: def get_probabilities(values: pd.Series, percentage: float) -> list[float]:
   ...:     # Spread the pull probability evenly over the values;
   ...:     # the remainder is the probability of pulling nothing (0)
   ...:     percentage /= 100
   ...:     percent_per_val = percentage / values.size
   ...:     return [percent_per_val] * values.size + [1 - percentage]
   ...:
In [5]: df_1['Sum'] = [
   ...:     np.random.choice(a=pd.concat([df_2['Value'], pd.Series([0])]),
   ...:                      size=n,
   ...:                      p=get_probabilities(values=df_2['Value'],
   ...:                                          percentage=float(percent[:-1]))).sum()
   ...:     for n, percent in zip(df_1['Count'], df_1['Probability'])
   ...: ]
   ...: df_1
Out[5]:
Year Count Probability Sum
0 1 8 25% 100
1 2 26 19% 375
2 3 17 26% 275
3 4 9 10% 50

How to count no of occurrence for each value in a given column of dataframe for a certain class interval?

this is my first question at stackoverflow.
I have two dataframes of different sizes df1(266808 rows) and df2 (201 rows).
[Images of df1 and df2]
I want to count how many values in df1['WS_140m'] fall into each class interval given in df2['Class_interval'] and store that count in df2['count'].
I have tried
1)
df2['count']=pd.cut(x=df1['WS_140m'], bins=df2['Class_interval'])
2)
df2['count'] = df1['WS_140m'].groupby(df1['Class_interval'])
3)
for anum in df1['WS_140m']:
    if anum in df2['Class_interval']:
        df2['count'] = df2['count'] + 1
Please guide, if someone knows.
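Please try something like:
def in_class_interval(value, interval):
    # TODO: return a boolean (or boolean Series) that is True where value lies in interval
    raise NotImplementedError
def in_class_interval_closure(interval):
    return lambda x: in_class_interval(x, interval)
# For each interval, count the rows of df1 whose WS_140m falls inside it
df2['count'] = df2['Class_interval'].apply(
    lambda x: len(df1[in_class_interval_closure(x)(df1['WS_140m'])]))
Define your function in_class_interval(value, interval) so that it returns a boolean.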
I guess something like this would do it:
In [330]: df1
Out[330]:
WS_140m
0 5.10
1 5.16
2 5.98
3 5.58
4 4.81
In [445]: df2
Out[445]:
count Class_interval
0 0 NaN
1 0 (0.05,0.15]
2 0 (0.15,0.25]
3 0 (0.25,0.35]
4 0 (3.95,5.15]
In [446]: df2.Class_interval = df2.Class_interval.str.replace(']', ')')
In [451]: from ast import literal_eval
In [449]: for i, v in df2.Class_interval.items():
     ...:     if pd.notnull(v):
     ...:         df2.at[i, 'Class_interval'] = literal_eval(df2.Class_interval[i])
In [342]: df2['falls_in_range'] = df1.WS_140m.between(df2.Class_interval.str[0], df2.Class_interval.str[1])
You can increase the count wherever True comes like below :
In [360]: df2['count'] = df2.loc[df2.index[df2['falls_in_range'] == True].tolist()]['count'] +1
In [361]: df2
Out[361]:
count Class_interval falls_in_range
0 NaN NaN False
1 NaN (0.05, 0.15) False
2 NaN (0.15, 0.25) False
3 NaN (0.25, 0.35) False
4 1.0 (3.95, 5.15) True
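Another possible route is to let pandas do the binning: pd.cut assigns each value to a right-closed interval (a, b], and value_counts then gives the count per interval. The sketch below uses the sample WS_140m values from above. The bin edges are hypothetical, and in practice they would have to be derived from df2['Class_interval'].

```python
import pandas as pd

df1 = pd.DataFrame({'WS_140m': [5.10, 5.16, 5.98, 5.58, 4.81]})

# Hypothetical bin edges; each pair of neighbours forms one interval (a, b]
edges = [3.95, 5.15, 6.35]

# Assign every value to its interval (NaN if it falls outside all bins)
binned = pd.cut(df1['WS_140m'], bins=edges)

# Count how many values landed in each interval
counts = binned.value_counts(sort=False)
print(counts)
```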

Pandas: create new column which swaps values of other rows

I'm trying to create a pandas dataframe like this:
x2 x3
0 3.536220 0.681269
1 0.681269 3.536220
2 -0.402380 2.303833
3 2.303833 -0.402380
4 2.032329 3.334412
5 3.334412 2.032329
6 0.371338 5.879732
. . .
So x2 is a column of random numbers, and x3 has the values of row 0 and 1 in x2 swapped, the values of 2 and 3 swapped, and so on. My current code is like this:
import numpy as np
import pandas as pd
x2 = pd.Series(np.random.normal(loc = 2, scale = 2.5, size = 1000))
x3 = pd.Series([x2[i + 1] if i % 2 == 0 else x2[i - 1] for i in range(1000)])
df = pd.DataFrame({'x2': x2, 'x3': x3})
I'm wondering if there is any faster or more elegant way, particularly if I want to have many rows (e.g. 1 million?) or do this over and over again (e.g. Monte Carlo simulation)?
Instead of
[x2[i + 1] if i % 2 == 0 else x2[i - 1] for i in range(1000)]
you could use
def swap(arr):
    result = np.empty_like(arr)
    result[::2] = arr[1::2]
    result[1::2] = arr[::2]
    return result
For a sequence of length 1000, using swap is over 3000x faster:
In [84]: %timeit [x2[i + 1] if i % 2 == 0 else x2[i - 1] for i in range(1000)]
100 loops, best of 3: 12.7 ms per loop
In [98]: %timeit swap(x2.values)
100000 loops, best of 3: 3.82 µs per loop
import numpy as np
import pandas as pd
np.random.seed(2017)
x2 = pd.Series(np.random.normal(loc = 2, scale = 2.5, size = 1000))
x3 = [x2[i + 1] if i % 2 == 0 else x2[i - 1] for i in range(1000)]
def swap(arr):
    result = np.empty_like(arr)
    result[::2] = arr[1::2]
    result[1::2] = arr[::2]
    return result
df = pd.DataFrame({'x2': x2, 'x3': x3, 'x4': swap(x2.values)})
print(df.head())
prints
x2 x3 x4
0 -0.557363 1.649005 1.649005
1 1.649005 -0.557363 -0.557363
2 2.497731 3.433690 3.433690
3 3.433690 2.497731 2.497731
4 1.013555 0.679394 0.679394
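A related sketch, assuming the array has an even number of elements: viewing the data as rows of pairs and reversing each pair gives the same swap without writing into a preallocated result. This is an alternative formulation, not the method timed above, and its speed has not been measured here.

```python
import numpy as np

arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# reshape(-1, 2) views the data as pairs: [[1, 2], [3, 4], [5, 6]]
# [:, ::-1] reverses each pair, and ravel flattens back to 1-D (copying)
swapped = arr.reshape(-1, 2)[:, ::-1].ravel()
print(swapped)
```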
