Given a DataFrame with 4 features and 1 index column:
df = pd.DataFrame(np.random.randint(0,100, size= (100,4)), columns=list('ABCD'))
df['index'] = range(1, len(df) + 1)
I want to calculate the Manhattan distance given input from a user. The user's inputs will be represented by a,b,c,d. The function is defined below.
def Manhattan_d(a, b, c, d):
    return (a - df['A']) + (b - df['B']) + (c - df['C']) + (d - df['D'])
When the answer is returned to me, it comes out as a list. Now, I want to find the minimum value returned AND link it back to the index number it came from.
If I do return(min(formula)), I get an output of one number and I can't trace it back to the index it was originally from. If it's easier, the index represents a category. So I need to find the category with the minimum output after the formula is applied.
Hope that's clear.
Perhaps a better approach is to apply the Manhattan distance to each row of the dataframe. At that point, you can use .idxmin() to find the index of the point in the original dataframe which is most similar (has the lowest Manhattan distance) to the point a, b, c, d that you fed the function.
def Manhattan_d(a, b, c, d, df):
    return df.apply(lambda row: abs(row['A'] - a) + abs(row['B'] - b)
                    + abs(row['C'] - c) + abs(row['D'] - d), axis=1).idxmin()
Note: Manhattan distance requires the absolute value of the difference, which I have included.
Another note: it is generally good practice to pass all variables into a function, which is why I included df as an input to your function.
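For example, with the DataFrame built above, a call could look like this (the query values 10, 20, 30, 40 are just placeholders):

nearest_idx = Manhattan_d(10, 20, 30, 40, df)   # index label of the closest row
print(df.loc[nearest_idx])                      # inspect that row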
Another possibility is to use existing implementations, such as the DistanceMetric class from Scikit-learn.
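As a rough sketch of that route (the import path has moved between sklearn.neighbors and sklearn.metrics across versions, and the function name here is only illustrative):

import numpy as np
from sklearn.metrics import DistanceMetric  # sklearn.neighbors.DistanceMetric in older releases

def manhattan_nearest(a, b, c, d, df):
    # Pairwise Manhattan (cityblock) distances between every row and the query point
    metric = DistanceMetric.get_metric('manhattan')
    distances = metric.pairwise(df[['A', 'B', 'C', 'D']].values, [[a, b, c, d]])
    # Map the position of the smallest distance back to the DataFrame's index
    return df.index[np.argmin(distances)]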
I have a large data file that consists of many cycles. I am trying to find some key points for each of these cycles. I would like a way to narrow down the points I have identified, but have not been successful.
# Determine relevant points A and B
Data3['Diff'] = Data3['Value'].diff(2)
Data3['Diff2'] = Data3.groupby('Cycle#', as_index=False)['Value'].transform(lambda x: x - 2*x.shift(1) + x.shift(2))
Data3['POI'] = Data3.groupby('Cycle#').apply(
    lambda g: np.select([g['Value'] == g['Value'].max(),
                         g['Diff2'] == g['Diff2'].min()],
                        ['A', 'B'], default='-')
).explode().set_axis(Data3.index, axis=0)
I have tried this but would like to refine it more. Right now it goes through my column labeled Value and computes the first and second derivatives. You can see me assigning point A to the max of Value, and B to the min of the second derivative.
I want to refine this more: in addition to A being the max of Value, I want to ensure that the 2 values before it are less than the max. For example, if the max is 1280, the 2 rows before it should be less than this value. This helps get rid of some outliers in the data.
For B I want to exclude any second-derivative values less than -83. This also helps get rid of outliers.
The question is, where would I add this logic? Would it go in the expression here:
Data3['POI'] = Data3.groupby('Cycle#').apply(
    lambda g: np.select([g['Value'] == g['Value'].max(),
                         g['Diff2'] == g['Diff2'].min()],
                        ['A', 'B'], default='-')
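One way the extra conditions could slot in is directly inside that list of conditions passed to np.select. An untested sketch, taking the 2-row lag and the -83 cutoff from the description above:

Data3['POI'] = Data3.groupby('Cycle#').apply(
    lambda g: np.select(
        [  # A: cycle max AND the two preceding rows sit below that max
         (g['Value'] == g['Value'].max())
         & (g['Value'].shift(1) < g['Value'].max())
         & (g['Value'].shift(2) < g['Value'].max()),
         # B: minimum of Diff2 after discarding values below -83
         g['Diff2'] == g['Diff2'].where(g['Diff2'] >= -83).min()],
        ['A', 'B'], default='-')
).explode().set_axis(Data3.index, axis=0)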
I have a spreadsheet with this formula. I am able to understand the condition check, the calculation of ($R7/$O7), and the default value when the condition is not satisfied. What exactly happens inside the (PRODUCT(1+($U7:Z7)))-1 part?
{=IF($T7>=AA$5,($R7/$O7)/(PRODUCT(1+($U7:Z7)))-1,"")}
Also, why do we have {}? If I manually type the formula in some cell, it does not work.
I am trying to convert this formula to python. This is the code I have:
df.loc[(df['T'] >= df['AA']), 'x'] = (df['R']/df['O'])/PRODUCT()-1
My question is how do I compute the PRODUCT part of this calculation?
If you just want to know how to calculate the product of an array where 1 is added to every value and 1 is subtracted from the result, it can easily be done with NumPy:
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
product = np.prod(arr + 1) - 1
print(product)
NumPy calculations are done element-wise, so adding 1 to every value is simply arr + 1.
Based on your updates in the comments, this is how it's done:
df.loc[(df['T'] >= df['AA']), 'x'] = (df['R']/df['O']) / ((df[['a', 'b']]+1).product(axis=1) - 1)
Here a and b are the column names. Note that 'x' ends up NaN wherever df['T'] >= df['AA'] is false.
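If it helps to see the row-wise product in isolation, here is a tiny made-up illustration (the column names and values are hypothetical):

import pandas as pd

# Stand-in for the $U7:Z7 range, two columns wide
tmp = pd.DataFrame({'a': [0.10, 0.20], 'b': [0.30, 0.40]})

# Row-wise: (1 + 0.10) * (1 + 0.30) = 1.43 and (1 + 0.20) * (1 + 0.40) = 1.68
print((tmp[['a', 'b']] + 1).product(axis=1))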
I have a line defined by y = mx + b, where m and b are coefficients obtained from np.linalg.lstsq.
I have also created a function called distance, defined as:
def distance(x0, y0, slope, yintercept):
    """Returns the euclidean distance between a line and a point"""
    return abs(slope*x0 - y0 + yintercept) / (slope**2 + 1)**.5
For convenience I have created a vectorized form as in:
vdistance = np.vectorize(distance,otypes=[np.float])
I have a pandas DataFrame called spiral that contains a bunch of points along an irregular spiral. This DataFrame has three fields (among others): spiral.t, spiral.x, spiral.y, where t is a value increasing with time and x, y are the coordinates of the spiral in the Cartesian plane (rectangular coordinates).
Therefore, for each spiral.x, spiral.y pair I have a corresponding spiral.t.
I can easily calculate the distance from each point on the spiral to the line defined at the start with:
distance(spiral.x, spiral.y, m, b)
Since it is a pandas DataFrame, when I call spiral.x I get the whole column. Therefore I did:
x0 = np.array(spiral.x)
y0 = np.array(spiral.y)
dist=vdistance(x0,y0,m,b)
And I have an np.array dist with all the distances. With that I can get the indexes where the distance is <= K, where K is a distance I consider near enough to the line (in this case 250):
near = np.where(dist <= 250)
And now, for every value in near, I can iterate over the spiral retrieving the correct t values (because t doesn't grow at a constant rate).
ts = []
for i in near:
    ts += [spiral.t[i]]
My question is: how do I do this in a single shot with pandas?
You can use df.apply() to iterate over rows and access multiple columns for a function.
df[df.apply(distance, axis=1)]
axis=1 here tells apply to iterate over rows. df.apply() will iterate over columns if axis=0. The result of this statement is a dataframe, which is a subset of df with fewer rows.
To make this work, your distance function should return a boolean value. The logic of this function could be:
def distance(row):
    dist = compute_dist(row['x'], row['y'])
    if dist < 250:
        return True
    return False
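Alternatively, since the distance function from the question already works element-wise on whole columns, the whole thing can be done in one vectorized step without apply (a sketch assuming the spiral DataFrame and the m, b coefficients defined earlier):

# Boolean mask of points within 250 of the line, then pull the matching t values
near_mask = distance(spiral.x, spiral.y, m, b) <= 250
ts = spiral.loc[near_mask, 't']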
I have a dataframe of values:
df = pd.DataFrame(np.random.uniform(0,1,(500,2)), columns = ['a', 'b'])
>>> print df
a b
1 0.277438 0.042671
.. ... ...
499 0.570952 0.865869
[500 rows x 2 columns]
I want to transform this by replacing the values with their percentile, where the percentile is taken over the distribution of all values in prior rows. i.e., if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex, and I'm asking to take the expanding percentile over the entire cross-sectional history.
So the goal is this guy:
a b
0 99 99
.. .. ..
499 58 84
(Ideally I'd like to take the distribution of a value over the set of all values in all rows before and including that row, so not exactly an expanding percentile; but if we can't get that, that's fine.)
I have one really ugly way of doing this, where I transpose and unstack the dataframe, generate a percentile mask, and overlay that mask on the dataframe using a for loop to get the percentiles:
percentile_boundaries_over_time = pd.DataFrame({
    integer: pd.expanding_quantile(df.T.unstack(), integer / 100.0)
    for integer in range(0, 101)})

percentile_mask = pd.Series(index=df.unstack().unstack().unstack().index)
for integer in range(0, 100):
    percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
                    (df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer + 1])] = integer
I've been trying to get something faster to work, using scipy.stats.percentileofscore() and pd.expanding_apply(), but it's not giving the correct output and I'm driving myself insane trying to figure out why. This is what I've been playing with:
perc = pd.expanding_apply(df, lambda x: stats.percentileofscore(x, x[-1], kind='weak'))
Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any and all help much appreciated!
As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data each time. This will probably be the case for any current pre-packaged solution, including pd.DataFrame.rank or scipy.stats.percentileofscore. Repeatedly sorting is wasteful and computationally intensive, so we want a solution that minimizes that.
Taking a step back, finding the inverse-quantile of a value relative to an existing data set is analogous to finding the position at which we would insert that value into the data set if it were sorted. The issue is that we also have an expanding set of data. Thankfully, some sorting algorithms are extremely fast when dealing with mostly sorted data (and inserting a small number of unsorted elements). Hence our strategy is to maintain our own array of sorted data and, at each row iteration, add the row's values to it and query their positions in the newly expanded sorted set. The latter operation is also fast given that the data is sorted.
I think insertion sort would be the fastest sort for this, but its performance will probably be slower in Python than any native NumPy sort. Merge sort seems to be the best of the available options in NumPy. An ideal solution would involve writing some Cython, but using our above strategy with NumPy gets us most of the way.
This is a hand-rolled solution:
import numpy as np
import pandas as pd

def quantiles_by_row(df):
    """Reconstruct a DataFrame of expanding quantiles by row."""
    # Construct the skeleton of the DataFrame that we'll fill with quantile values
    quantile_df = pd.DataFrame(np.nan, index=df.index, columns=df.columns)
    # Pre-allocate a numpy array; we only keep the non-NaN values from our DataFrame
    num_valid = np.sum(~np.isnan(df.values))
    sorted_array = np.empty(num_valid)
    # Invariant: sorted_array[:length] holds data and is sorted
    length = 0
    # Iterate over the ndarray rows
    for i, row_array in enumerate(df.values):
        # Extract the non-NaN numpy array from the row
        row_is_nan = np.isnan(row_array)
        add_array = row_array[~row_is_nan]
        # Add the new data to our sorted_array and sort
        new_length = length + len(add_array)
        sorted_array[length:new_length] = add_array
        length = new_length
        sorted_array[:length].sort(kind="mergesort")
        # Query the relative positions, divide by length to get quantiles
        quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(float) / length
        # Insert the values into quantile_df
        quantile_df.iloc[i, np.flatnonzero(~row_is_nan)] = quantile_row
    return quantile_df
Based on the data that bhalperin provided (offline), this solution is up to 10x faster.
One final comment: np.searchsorted has options for 'left' and 'right', which determine whether you want your prospective insertion position to be the first or the last suitable position. This matters if you have a lot of duplicates in your data. A more accurate version of the above solution will take the average of 'left' and 'right':
# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(float) / (length * 2)
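A quick usage check against the DataFrame from the question might look like this (output values will of course vary with the random draw):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(0, 1, (500, 2)), columns=['a', 'b'])
quantiles = quantiles_by_row(df)
print(quantiles.head())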
It's not quite clear yet, but do you want a cumulative sum divided by the total?
norm = 100.0/df.a.sum()
df['cum_a'] = df.a.cumsum()
df['cum_a'] = df.cum_a * norm
ditto for b
Here's an attempt to implement your 'percentile over the set of all values in all rows before and including that row' requirement. stats.percentileofscore seems to act up when given 2D data, so squeezing seems to help in getting correct results:
a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)

for current_index in df.index:
    preceding_rows = df.loc[:current_index, :]
    # Combine values from all columns into a single 1D array
    # * 2 should be * N if you have N columns
    combined = preceding_rows.values.reshape((1, len(preceding_rows) * 2)).squeeze()
    a_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'a'],
        kind='weak'
    )
    b_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'b'],
        kind='weak'
    )
I'd like to find the worst record, the one that makes the correlation worse, in a pandas DataFrame, so that I can remove anomalous records.
When I have the following DataFrame:
df = pd.DataFrame({'a':[1,2,3], 'b':[1,2,30]})
The correlation becomes better after removing the third row.
print df.corr() #-> correlation is 0.88
print df.ix[0:1].corr() # -> correlation is 1.00
In this case, my question is how to find that the third row is a candidate anomaly that makes the correlation worse.
My idea is to run a linear regression and calculate the error of each element (row). But I don't know a simple way to try that idea, and I also believe there is a simpler and more straightforward way.
Update
Of course, you can remove almost all of the elements and push the correlation to 1. But I'd like to find just one (or several) anomalous row(s). Intuitively, I hope to get a non-trivial set of records that achieves a better correlation.
First, you could brute force it to get exact solution:
import pandas as pd
import numpy as np
from itertools import combinations, chain, imap
df = pd.DataFrame(zip(np.random.randn(10), np.random.randn(10)))
# set the maximal number of lines you are willing to remove
remove_up_to_n = 3
# all combinations of indices to keep
to_keep = imap(list, chain(*map(lambda i: combinations(df.index, df.shape[0] - i), range(1, remove_up_to_n + 1))))
# find index with highest remaining correlation
highest_correlation_index = max(to_keep, key = lambda ks: df.ix[ks].corr().ix[0,1])
df_remaining = df.ix[highest_correlation_index]
This can be costly. You could get a greedy approximation by adding a column with something like the row's contribution to the correlation.
df['CorComp'] = (df.icol(0).mean() - df.icol(0)) * (df.icol(1).mean() - df.icol(1))
df = df.sort(['CorComp'])
Now you can remove rows starting from the top, which may raise your correlation.
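Spelled out with current pandas method names (icol and sort have since been removed), the greedy pass might look roughly like this sketch; the column names x and y are illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.randn(10), 'y': np.random.randn(10)})

# Per-row contribution to the covariance; large negative values drag a
# positive correlation down the most
df['CorComp'] = (df['x'] - df['x'].mean()) * (df['y'] - df['y'].mean())

# Most harmful rows first, then drop the single worst one and compare
worst_first = df.sort_values('CorComp')
trimmed = worst_first.iloc[1:]
print(df[['x', 'y']].corr().iloc[0, 1], trimmed[['x', 'y']].corr().iloc[0, 1])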
Your question is about outlier detection. There are many ways to perform this detection, but a simple one could be to exclude values whose deviation from the mean exceeds some multiple of the standard deviation of the series.
# Keep only values whose deviation from the mean is within 1.1 times the standard deviation of the series.
df[np.abs(df.b-df.b.mean())<=(1.1*df.b.std())]
# result
a b
0 1 1
1 2 2
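As a quick check, the correlation of the filtered frame is back to 1.0 for this example, matching the df.ix[0:1].corr() result in the question:

filtered = df[np.abs(df.b - df.b.mean()) <= (1.1 * df.b.std())]
print(filtered.corr())  # a and b are now perfectly correlated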