I am looking for ideas on how to translate one range of values to another in Python. I am working on a hardware project and am reading data from a sensor that can return a range of values; I am then using that data to drive an actuator that requires a different range of values.
For example, let's say the sensor returns values in the range 1 to 512, and the actuator is driven by values in the range 5 to 10. I would like a function that I can pass a value and the two ranges and get back the value mapped to the second range. If such a function were named translate, it could be used like this:
sensor_value = 256
actuator_value = translate(sensor_value, 1, 512, 5, 10)
In this example I would expect the output actuator_value to be 7.5 since the sensor_value is in the middle of the possible input range.
One solution would be:
def translate(value, leftMin, leftMax, rightMin, rightMax):
    # Figure out how 'wide' each range is
    leftSpan = leftMax - leftMin
    rightSpan = rightMax - rightMin

    # Convert the left range into a 0-1 range (float)
    valueScaled = float(value - leftMin) / float(leftSpan)

    # Convert the 0-1 range into a value in the right range.
    return rightMin + (valueScaled * rightSpan)
You could possibly use algebra to make it more efficient, at the expense of readability.
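For instance, the whole computation collapses into a single expression; a minimal sketch (the float() call guards against integer division on Python 2):

def translate(value, leftMin, leftMax, rightMin, rightMax):
    # Same mapping, algebraically collapsed into one expression.
    return rightMin + (value - leftMin) * (rightMax - rightMin) / float(leftMax - leftMin)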
Using scipy.interpolate.interp1d
You can also use the scipy.interpolate package to do such conversions (if you don't mind a dependency on SciPy):
>>> from scipy.interpolate import interp1d
>>> m = interp1d([1,512],[5,10])
>>> m(256)
array(7.4951076320939336)
or, to convert the 0-rank SciPy array back to a normal float:
>>> float(m(256))
7.4951076320939336
You can also do multiple conversions in one call:
>>> m([100,200,300])
array([ 5.96868885, 6.94716243, 7.92563601])
As a bonus, you can do non-uniform mappings from one range to another; for instance, if you want to map [1,128] to [1,10], [128,256] to [10,90] and [256,512] to [90,100], you can do it like this:
>>> m = interp1d([1,128,256,512],[1,10,90,100])
>>> float(m(400))
95.625
interp1d creates piecewise linear interpolation objects (which are callable just like functions).
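Note that interp1d raises a ValueError by default for inputs outside the original range. In sufficiently recent SciPy versions (passing fill_value as a below/above pair is an assumption about your SciPy version; it needs 0.17+), you can disable the bounds check and clamp instead:

>>> m = interp1d([1, 512], [5, 10], bounds_error=False, fill_value=(5, 10))
>>> float(m(600))
10.0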
Using numpy.interp
As noted by unutbu, numpy.interp is also an option (with fewer dependencies):
>>> from numpy import interp
>>> interp(256,[1,512],[5,10])
7.4951076320939336
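Unlike interp1d's default behaviour, numpy.interp clamps out-of-range inputs to the edge output values instead of raising:

>>> interp(600, [1, 512], [5, 10])
10.0
>>> interp(-100, [1, 512], [5, 10])
5.0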
This would actually be a good case for creating a closure, that is, writing a function that returns a function. Since you probably have many of these values, there is little value in recalculating the value spans and scale factors for every value, nor, for that matter, in passing those min/max limits around all the time.
Instead, try this:
def make_interpolater(left_min, left_max, right_min, right_max):
    # Figure out how 'wide' each range is
    leftSpan = left_max - left_min
    rightSpan = right_max - right_min

    # Compute the scale factor between left and right values
    scaleFactor = float(rightSpan) / float(leftSpan)

    # Create interpolation function using pre-calculated scaleFactor
    def interp_fn(value):
        return right_min + (value - left_min) * scaleFactor

    return interp_fn
Now you can write your processor as:
# create function for doing interpolation of the desired
# ranges
scaler = make_interpolater(1, 512, 5, 10)
# receive list of raw values from sensor, assign to data_list
# now convert to scaled values using map
scaled_data = map(scaler, data_list)
# or a list comprehension, if you prefer
scaled_data = [scaler(x) for x in data_list]
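One caveat: in Python 3, map returns a lazy iterator rather than a list, so use scaled_data = list(map(scaler, data_list)), or stick with the list comprehension, if you need an actual list.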
I was looking for the same thing in Python, to map angles of 0-300 degrees to raw Dynamixel values of 0-1023 (or 1023-0, depending on the actuator orientation). I ended up with something very simple.
Variables:

x: input value
a, b: input range
c, d: output range
y: return value
Function:

def mapFromTo(x, a, b, c, d):
    y = (x - a) / (b - a) * (d - c) + c
    return y
Usage:
dyn111.goal_position=mapFromTo(pos111,0,300,0,1024)
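One caveat, not part of the original answer: on Python 2, all-integer arguments make (x - a) / (b - a) truncate to zero. A defensive variant:

def mapFromTo(x, a, b, c, d):
    # float() avoids integer truncation on Python 2; harmless on Python 3.
    return float(x - a) / (b - a) * (d - c) + c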
def translate(sensor_val, in_from, in_to, out_from, out_to):
    out_range = out_to - out_from
    in_range = in_to - in_from
    in_val = sensor_val - in_from
    val = (float(in_val) / in_range) * out_range
    out_val = out_from + val
    return out_val
Simple map range function:
def mapRange(value, inMin, inMax, outMin, outMax):
    return outMin + (((value - inMin) / (inMax - inMin)) * (outMax - outMin))
def maprange(a, b, s):
    (a1, a2), (b1, b2) = a, b
    return b1 + ((s - a1) * (b2 - b1) / (a2 - a1))

where

a = [from_lower, from_upper]
b = [to_lower, to_upper]

Found at https://rosettacode.org/wiki/Map_range#Python_

Notes:
- it does not clamp the transformed values to the ranges a or b (it extrapolates); a clamped variant is sketched below
- it also works when from_lower > from_upper or to_lower > to_upper
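If you do want clamping, a small wrapper along these lines (my own sketch, not from the Rosetta Code page) pins the result to the output range while still handling reversed ranges:

def maprange_clamped(a, b, s):
    (a1, a2), (b1, b2) = a, b
    t = b1 + ((s - a1) * (b2 - b1) / (a2 - a1))
    # Order the output bounds so clamping also works when b1 > b2.
    lo, hi = min(b1, b2), max(b1, b2)
    return max(lo, min(hi, t))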
All of the existing answers are under the CC BY-SA license. Here's one that I wrote; to the extent possible under law, I waive all copyright or related and neighboring rights to this code. (Creative Commons CC0 Public Domain Dedication).
def remap(number, from_min, from_max, to_min, to_max):
    number_s = number - from_min
    from_max_s = from_max - from_min
    to_max_s = to_max - to_min
    return ((number_s / from_max_s) * to_max_s) + to_min
You could use a lambda function:

translate = lambda a, b, c, d, e: (a - b) * (e - d) / (c - b) + d

sensor_value = 256
translate(sensor_value, 1, 512, 5, 10)
# 7.495107632093934
I'm stuck on a function that checks the samples for an A/B test and returns the result. The calculation is correct, but somehow I'm getting the result twice. The code and output are below.
import itertools as it
import math as mth
import scipy.stats as st

def test(sample1, sample2):
    for i in it.chain(range(len(sample1)), range(len(sample2))):
        alpha = .05
        difference = (sample1['step_conversion'][i] - sample2['step_conversion'][i]) / 100
        if i > 0:
            p_combined = (sample1['unq_user'][i] + sample2['unq_user'][i]) / (
                sample1['unq_user'][i - 1] + sample2['unq_user'][i - 1])
            z_value = difference / mth.sqrt(
                p_combined * (1 - p_combined) * (1 / sample1['unq_user'][i - 1] + 1 / sample2['unq_user'][i - 1]))
            distr = st.norm(0, 1)
            p_value = (1 - distr.cdf(abs(z_value))) * 2
            print(sample1['event_name'][i], 'p-value:', p_value)
            if p_value < alpha:
                print('Deny H0')
            else:
                print('Accept H0')
    return
So I need the result in the output just once, but I get it twice, once for each sample.
When using Pandas dataframes, you should avoid most for loops and use the standard vectorised approach, with NumPy where applicable. (Incidentally, the doubled output comes from it.chain(range(len(sample1)), range(len(sample2))), which walks the same set of indices twice when the samples have equal length.)
First, I've reset the indices of the dataframes, to be sure .loc can be used with a standard numerical index.
sample1 = sample1.reset_index()
sample2 = sample2.reset_index()
The code below does what I think your for loop does. I can't test it, and without a clear description, example dataframes and an expected outcome, it is anyone's guess whether it does what you want. But it may get close, and it mostly serves as an example of the vectorised approach.
import numpy as np
import scipy.stats as st

difference = (sample1['step_conversion'] - sample2['step_conversion']) / 100
n = len(sample1)

# Note that `.loc` slicing is inclusive of the end label, so `:n-1` includes row n-1.
p_combined = ((sample1.loc[1:, 'unq_user'] + sample2.loc[1:, 'unq_user']).reset_index(drop=True) /
              (sample1.loc[:n-1, 'unq_user'] + sample2.loc[:n-1, 'unq_user'])).reset_index(drop=True)
z_value = difference / np.sqrt(
    p_combined * (1 - p_combined) * (
        1 / sample1.loc[:n-1, 'unq_user'] + 1 / sample2.loc[:n-1, 'unq_user']))

distr = st.norm(0, 1)  # standard normal distribution
p_value = (1 - distr.cdf(np.abs(z_value))) * 2
sample1['p_value'] = p_value
print(sample1)

# The below prints a Series of True values for the elements where the condition holds.
# You can also use e.g. `print(sample1[p_value < alpha])`.
alpha = 0.05
print('Deny H0:')
print(p_value < alpha)
print('Accept H0:')
print(p_value >= alpha)
No for loop needed, and for a large dataframe, the above will be notably faster.
Note that the .reset_index(drop=True) calls are a bit ugly. But without them, Pandas would align the two Series on their original indices before dividing, which is not what we want here; resetting the index forces positional alignment.
I have the following abbreviated excerpt of a function in my code:
s = 0.5
m = np.nonzero(velo > freq - fthrow - s * maskwidth_f)
velo_mask = np.delete(velo, m)
spec_mask = np.delete(spec, m)

if average(velo_mask < 0.9):
    s = 0.8
    m = np.nonzero(velo > freq - fthrow - s * maskwidth_f)
    velo_mask = np.delete(velo, m)
    spec_mask = np.delete(spec, m)
else:
    s = 0.5
    m = np.nonzero(velo > freq - fthrow - s * maskwidth_f)
    velo_mask = np.delete(velo, m)
    spec_mask = np.delete(spec, m)
This means that I have to compute the two arrays first based on the initial value of s, then test the condition; based on the result, I change the value of s and want to re-run the whole previous code with the new value. (I have a loop, and each time the whole data changes.)
It is actually a huge piece of code, and I don't want to write it three times: once to calculate the average, once in the if branch, and once in the else branch.
Is there maybe a way to tell Python to re-run the whole previous part inside the if-else condition?
Use functions to avoid code duplication. Example:
def create_mask(velo, spec, freq, fthrow, maskwidth_f, s):
    m = np.nonzero(velo > freq - fthrow - s * maskwidth_f)
    velo_mask = np.delete(velo, m)
    spec_mask = np.delete(spec, m)
    return velo_mask, spec_mask

...

s = 0.5
velo_mask, spec_mask = create_mask(velo, spec, freq, fthrow, maskwidth_f, s)

s = 0.8 if average(velo_mask < 0.9) else 0.5
velo_mask, spec_mask = create_mask(velo, spec, freq, fthrow, maskwidth_f, s)
I would like to see if there is a way to calculate a column in a dataframe that uses something similar to a moving average without iterating through each row.
Current working code:
def create_candles(ticks, instrument, time_slice):
    candlesticks = ticks.price.resample(time_slice, base=00).ohlc().bfill()
    volume = ticks.amount.resample(time_slice, base=00).sum()
    candlesticks['volume'] = volume
    candlesticks['instrument'] = instrument
    candlesticks['ttr'] = 0
    # candlesticks['vr_7'] = 0
    candlesticks['vr_10'] = 0
    candlesticks = calculate_indicators(candlesticks, instrument)
    return candlesticks

def calculate_indicators(candlesticks, instrument):
    candlesticks.sort_index(inplace=True)
    # candlesticks['rsi_14'] = talib.RSI(candlesticks.close, timeperiod=14)
    candlesticks['lr_50'] = talib.LINEARREG(candlesticks.close, timeperiod=50)
    # candlesticks['lr_150'] = talib.LINEARREG(candlesticks.close, timeperiod=150)
    # candlesticks['ema_55'] = talib.EMA(candlesticks.close, timeperiod=55)
    # candlesticks['ema_28'] = talib.EMA(candlesticks.close, timeperiod=28)
    # candlesticks['ema_18'] = talib.EMA(candlesticks.close, timeperiod=18)
    # candlesticks['ema_9'] = talib.EMA(candlesticks.close, timeperiod=9)
    # candlesticks['wma_21'] = talib.WMA(candlesticks.close, timeperiod=21)
    # candlesticks['wma_12'] = talib.WMA(candlesticks.close, timeperiod=12)
    # candlesticks['wma_11'] = talib.WMA(candlesticks.close, timeperiod=11)
    # candlesticks['wma_5'] = talib.WMA(candlesticks.close, timeperiod=5)
    candlesticks['cmo_9'] = talib.CMO(candlesticks.close, timeperiod=9)

    for row in candlesticks.itertuples():
        current_index = candlesticks.index.get_loc(row.Index)
        if current_index >= 1:
            previous_close = candlesticks.iloc[current_index - 1, candlesticks.columns.get_loc('close')]
            candlesticks.iloc[current_index, candlesticks.columns.get_loc('ttr')] = max(
                row.high - row.low,
                abs(row.high - previous_close),
                abs(row.low - previous_close))
        if current_index > 10:
            candlesticks.iloc[current_index, candlesticks.columns.get_loc('vr_10')] = candlesticks.iloc[current_index, candlesticks.columns.get_loc('ttr')] / (
                max(candlesticks.high[current_index - 9: current_index].max(), candlesticks.close[current_index - 11]) -
                min(candlesticks.low[current_index - 9: current_index].min(), candlesticks.close[current_index - 11]))

    candlesticks['timestamp'] = pd.to_datetime(candlesticks.index)
    candlesticks['instrument'] = instrument
    candlesticks.fillna(0, inplace=True)
    return candlesticks
In the iteration, I am calculating the true range ('ttr') and then the volatility ratio ('vr_10').
'ttr' is calculated on every row in the dataframe except the first. It uses the previous row's close column, and the current row's high and low columns.
'vr_10' is calculated on every row except the first ten. It uses the high and low columns of the previous nine rows and the close of the 10th row back.
EDIT 2
I have tried many ways to add a text-based dataframe to this question; there just doesn't seem to be a solution given the width of my frame. There is no difference between the input and output dataframes other than that the ttr and vr_10 columns are all 0s in the input and have non-zero values in the output.
Is there a way I can do this without iteration?
With the nudge from Andreas to use rolling, I came to an answer:
First, I had to find out how to use rolling with multiple columns; I found that here. I made a modification because I need to roll up, not down:
# Assuming `stride` is NumPy's as_strided, as in the answer this is based on.
from numpy.lib.stride_tricks import as_strided as stride
import pandas as pd

def roll(df, w, **kwargs):
    df.sort_values(by='timestamp', ascending=False, inplace=True)
    v = df.values
    d0, d1 = v.shape
    s0, s1 = v.strides
    a = stride(v, (d0 - (w - 1), w, d1), (s0, s0, s1))
    rolled_df = pd.concat({
        row: pd.DataFrame(values, columns=df.columns)
        for row, values in zip(df.index, a)
    })
    return rolled_df.groupby(level=0, **kwargs)
after that, I created 2 functions:
def calculate_vr(window):
    return window.iloc[0].ttr / (max(window.high[1:9].max(), window.iloc[10].close) -
                                 min(window.low[1:9].min(), window.iloc[10].close))

def calculate_ttr(window):
    return max(window.iloc[0].high - window.iloc[0].low,
               abs(window.iloc[0].high - window.iloc[1].close),
               abs(window.iloc[0].low - window.iloc[1].close))
and called those functions like this:
candlesticks['ttr'] = roll(candlesticks, 3).apply(calculate_ttr)
candlesticks['vr_10'] = roll(candlesticks, 11).apply(calculate_vr)
I added timers to both approaches, and this way is roughly 3x slower than iteration.
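For what it's worth, these particular definitions can also be written with shift and rolling, avoiding both the row loop and the strides machinery. A hedged sketch that mirrors the loop's indexing (the helper names prev_close, ref_close, hh and ll are mine, and this is untested against the real data, so verify the window offsets):

import numpy as np
import pandas as pd

prev_close = candlesticks['close'].shift(1)
# True range: the greatest of high-low, |high - prev close|, |low - prev close|.
candlesticks['ttr'] = pd.concat([
    candlesticks['high'] - candlesticks['low'],
    (candlesticks['high'] - prev_close).abs(),
    (candlesticks['low'] - prev_close).abs(),
], axis=1).max(axis=1)

# Volatility ratio: ttr over the span of the previous 9 highs/lows
# and the close 11 rows back, as in the original loop.
ref_close = candlesticks['close'].shift(11)
hh = np.maximum(candlesticks['high'].shift(1).rolling(9).max(), ref_close)
ll = np.minimum(candlesticks['low'].shift(1).rolling(9).min(), ref_close)
candlesticks['vr_10'] = candlesticks['ttr'] / (hh - ll)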
What's an efficient way to calculate a trimmed or winsorized standard deviation of a list?
I don't mind using numpy, but if I have to make a separate copy of the list, it's going to be quite slow.
This will make two copies, but you should give it a try because it should be very fast.
import numpy as np

def trimmed_std(data, low, high):
    tmp = np.asarray(data)
    return tmp[(low <= tmp) & (tmp < high)].std()
Do you need to do rank-order trimming (i.e. 5% trimmed)?
Update:
If you need percentile trimming, the best way I can think of is to sort the data first. Something like this should work:
def trimmed_std(data, percentile):
    data = np.array(data)
    data.sort()
    percentile = percentile / 2.
    low = int(percentile * len(data))
    high = int((1. - percentile) * len(data))
    return data[low:high].std(ddof=0)
You can obviously implement this without using numpy, but even including the time of converting the list to an array, using numpy is faster than anything I could think of.
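If SciPy is available anyway, it also ships helpers in this direction, scipy.stats.tstd for a limit-based trimmed standard deviation and scipy.stats.mstats.winsorize for winsorizing; double-check that their trimming semantics match your definition:

import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

data = np.array([1, 2, 3, 4, 5, 100])
print(stats.tstd(data, limits=(0, 50)))            # std of values inside the limits
print(np.std(winsorize(data, limits=(0.1, 0.1))))  # std after clipping 10% at each end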
This is what generator functions are for.
SD requires two passes, plus a count. For this reason, you'll need to "tee" some iterators over the base collection.
So.
import itertools
import math

trimmed = (x for x in the_list if low <= x < high)
sum_iter, len_iter, var_iter = itertools.tee(trimmed, 3)
n = sum(1 for x in len_iter)
mean = sum(sum_iter) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in var_iter) / (n - 1))
Something like that might do what you want without copying anything.
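As an aside, if even two passes are too many, Welford's online algorithm computes the variance in a single pass over the filtered stream. A sketch (my addition, not part of the original answer):

import math

def trimmed_sd_one_pass(iterable, low, high):
    # Welford's online algorithm: running mean and sum of squared deviations.
    n, mean, m2 = 0, 0.0, 0.0
    for x in iterable:
        if low <= x < high:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
    return math.sqrt(m2 / (n - 1))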
In order to get an unbiased trimmed mean you have to account for fractional bits of items in the list as described here and (a little less directly) here. I wrote a function to do it:
from math import modf

def percent_tmean(data, pcent):
    # make sure data is a list
    dc = list(data)
    # find the number of items
    n = len(dc)
    # sort the list
    dc.sort()
    # get the proportion to trim
    p = pcent / 100.0
    k = n * p
    # print "n = %i\np = %.3f\nk = %.3f" % (n, p, k)
    # get the decimal and integer parts of k
    dec_part, int_part = modf(k)
    # get an index we can use
    index = int(int_part)
    # trim down the list; dc[index:n - index] also works when index == 0,
    # unlike dc[index:index * -1], which would return an empty list
    dc = dc[index: n - index]
    # deal with the case of trimming fractional items
    if dec_part != 0.0:
        # deal with the first remaining item
        dc[0] = dc[0] * (1 - dec_part)
        # deal with the last remaining item
        dc[-1] = dc[-1] * (1 - dec_part)
    return sum(dc) / (n - 2.0 * k)
I also made an IPython notebook that demonstrates it.
My function will probably be slower than those already posted but it will give unbiased results.