Calculating Manhattan distance in Python without result

I have these two data frames in Python and I'm trying to calculate the Manhattan distance (and later the Euclidean distance), but I'm stuck on the Manhattan distance and can't figure out what is going wrong.
Here is what I have tried so far:
import pandas as pd

ratings = pd.read_csv("toy_ratings.csv", ",")
person1 = ratings[ratings['Person'] == 1]['Rating']
person2 = ratings[ratings['Person'] == 2]['Rating']
ratings.head()

   Person  Movie  Rating
0       1     11     2.5
1       1     12     3.5
2       1     15     2.5
3       3     14     3.5
4       2     12     3.5
Here is the data inside person1 and person2:
print("*****person1*****")
print(person1)
*****person1*****
0 2.5
1 3.5
2 2.5
5 3.0
22 3.5
23 3.0
36 5.0
print("*****person2*****")
print(person2)
*****person2*****
4 3.5
6 3.0
8 1.5
9 5.0
11 3.0
24 3.5
This is the function that I tried to build, without any luck:
def ManhattanDist(person1, person2):
    distance = 0
    for rating in person1:
        if rating in person2:
            distance += abs(person1[rating] - person2[rating])
        return distance
The thing is that the function gives back 0, which is not correct; when I debug I can see that it never enters the if block. How can I check that both rows have a value, and loop over them?

I think the function should give back (= return) the distance in any case: either the distance is zero as initialized, or it is something else. So the function should look like:
def ManhattanDist(person1, person2):
    distance = 0
    for rating in person1:
        if rating in person2:
            distance += abs(person1[rating] - person2[rating])
    return distance
I think the distance should be built from two vectors of the same length (at least I cannot imagine anything else). If this is the case you can do (without your function):
import numpy as np

p1 = np.array(person1)
p2 = np.array(person2)

#--- scalar product as a similarity indicator
dist1 = np.dot(p1, p2)

#--- Euclidean distance
dist2 = np.linalg.norm(p1 - p2)

#--- Manhattan distance
dist3 = np.sum(np.abs(p1 - p2))

Your function is returning 1 value ... It should (I guess) return a list of values.
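A note on why the if never matches: iterating over a pandas Series yields its values, and "rating in person2" tests membership in person2's index. So person1's ratings (2.5, 3.5, ...) are compared against person2's index labels (4, 6, 8, ...), which never coincide; the two Series also have different indexes and lengths, so they can't be compared position by position. One way to fix this (a sketch, assuming the toy_ratings.csv layout shown in the question) is to index each person's ratings by Movie and only compare movies both persons rated:
import pandas as pd

ratings = pd.read_csv("toy_ratings.csv", ",")

def manhattan_dist(ratings, a, b):
    # Ratings indexed by movie ID, one Series per person
    r1 = ratings[ratings['Person'] == a].set_index('Movie')['Rating']
    r2 = ratings[ratings['Person'] == b].set_index('Movie')['Rating']
    common = r1.index.intersection(r2.index)  # movies rated by both persons
    return (r1[common] - r2[common]).abs().sum()

print(manhattan_dist(ratings, 1, 2))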

Related

Using Pandas, count number of cells in a column that are within a given radius

To set up the question: I have a dataframe containing spots and their x, y positions. I want to iterate over each spot and check all other spots to see whether they are within a radius, then count the number of spots within the radius in a new column of the dataframe. I would like to iterate over the index, as I have a decent understanding of how that works. I know that I am missing something simple, but I have not been able to find a solution that works for me yet. Thank you in advance!
import pandas as pd

radius = 3
df = pd.DataFrame({'spot_id':[1,2,3,4,5],'x_pos':[5,4,10,3,8],'y_pos':[4,10,8,6,3]})

   spot_id  x_pos  y_pos
0        1      5      4
1        2      4     10
2        3     10      8
3        4      3      6
4        5      8      3
I then want to get something that looks like this:
   spot_id  x_pos  y_pos  spots_within_radius
0        1      5      4                    1
1        2      4     10                    0
2        3     10      8                    0
3        4      3      6                    1
4        5      8      3                    0
To do it in a vectorized way, you can use scipy.spatial.distance_matrix to compute the distance matrix, D, between all the N row/position vectors ('x_pos', 'y_pos'). D is an N x N matrix (a 2D numpy.ndarray) whose entry (i, j) is the Euclidean distance between the ith and jth rows/positions.
Then, check which positions are within a distance radius of each other (D <= radius), which gives you a boolean matrix. Finally, count all the True values row-wise using sum(axis=0). You have to subtract 1 at the end, since the count includes each position's distance to itself (the diagonal entries, which are always 0).
import pandas as pd
from scipy.spatial import distance_matrix
df = pd.DataFrame({'spot_id':[1,2,3,4,5],'x_pos':[5,4,10,3,8],'y_pos':[4,10,8,6,3]})
radius = 3
pos = df[['x_pos','y_pos']]
df['spots_within_radius'] = (distance_matrix(pos, pos) <= radius).sum(axis=0) - 1
Output
>>> df
   spot_id  x_pos  y_pos  spots_within_radius
0        1      5      4                    1
1        2      4     10                    0
2        3     10      8                    0
3        4      3      6                    1
4        5      8      3                    0
If you don't want to use scipy.spatial.distance_matrix, you can compute D yourself using numpy's broadcasting.
import numpy as np
pos = df[['x_pos','y_pos']].to_numpy()
D = np.sum((pos - pos[:, None])**2, axis=-1) ** 0.5
df['spots_within_radius'] = (D <= radius).sum(axis=0) - 1
I would suggest using a KD Tree to answer this kind of question. It's a data structure designed to efficiently search for nearby points, and it's faster than computing a distance matrix. You can use scikit-learn to implement this.
The code
Here's how:
import sklearn.neighbors
import pandas as pd
df = pd.DataFrame({'spot_id':[1,2,3,4,5],'x_pos':[5,4,10,3,8],'y_pos':[4,10,8,6,3]})
def add_points_in_range_column_kd(df, radius):
    # Get positions as numpy array
    positions = df[['x_pos', 'y_pos']].to_numpy(dtype='float32')
    # Build KD Tree on those positions
    tree = sklearn.neighbors.KDTree(positions)
    # For each position, check how many points are in range.
    # Return a count, and not the actual points.
    return tree.query_radius(positions, r=radius, count_only=True) - 1
df['spots_within_radius'] = add_points_in_range_column_kd(df, 3)
The efficiency argument
Since a distance matrix needs to calculate the distance between every pair of points, it has a time complexity of O(N^2). In contrast, the time required to find all the points in range with a KD Tree is proportional to the depth of the tree times the number of points you need to find; querying all N points is, on average, O(N log N). So this method will be more efficient for a large number of points.
Benchmarking
Theory is nice, but is it actually faster in practice?
I ran both a KD Tree method and a distance matrix method on dataframes of sizes ranging from N=10 to N=3000, using the timeit module and running both methods in random order for 100 iterations at every size.
[Graph: runtime of each method vs. number of points, log scale on both axes]
For small numbers of points, the distance matrix method is faster. Once you get past about 300 points, the KD Tree is faster.
Full testing details can be found here.
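For reference, a minimal sketch of such a benchmark (not the answerer's actual script; the random data and sizes here are made up):
import timeit
import numpy as np
import pandas as pd
import sklearn.neighbors
from scipy.spatial import distance_matrix

def count_dm(df, radius):
    # Distance matrix method from the earlier answer
    pos = df[['x_pos', 'y_pos']]
    return (distance_matrix(pos, pos) <= radius).sum(axis=0) - 1

def count_kd(df, radius):
    # KD Tree method from this answer
    pos = df[['x_pos', 'y_pos']].to_numpy(dtype='float32')
    tree = sklearn.neighbors.KDTree(pos)
    return tree.query_radius(pos, r=radius, count_only=True) - 1

for n in (10, 100, 1000, 3000):
    df = pd.DataFrame(np.random.uniform(0, 100, (n, 2)),
                      columns=['x_pos', 'y_pos'])
    t_dm = timeit.timeit(lambda: count_dm(df, 3), number=100)
    t_kd = timeit.timeit(lambda: count_kd(df, 3), number=100)
    print(f"N={n:5d}  distance_matrix: {t_dm:.3f}s  KDTree: {t_kd:.3f}s")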

How does SelectKBest (chi2) calculate the score?

I am trying to find the most valuable features by applying feature selection methods to my dataset. I'm using the SelectKBest function for now. I can generate the score values and sort them as I want, but I don't understand exactly how this score value is calculated. I know that theoretically a higher score is more valuable, but I need a mathematical formula or an example calculation to learn this deeply.
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(dataValues, dataTargetEncoded)
feat_importances = pd.Series(fit.scores_, index=dataValues.columns)
topFeatures = feat_importances.nlargest(50).copy().index.values
print("TOP 50 Features (Best to worst) :\n")
print(topFeatures)
Thank you in advance
Say you have one feature and a target with 3 possible values:
import numpy as np

X = np.array([3.4, 3.4, 3. , 2.8, 2.7, 2.9, 3.3, 3. , 3.8, 2.5])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
     X  y
0  3.4  0
1  3.4  0
2  3.0  0
3  2.8  1
4  2.7  1
5  2.9  1
6  3.3  2
7  3.0  2
8  3.8  2
9  2.5  2
First we binarize the target:
from sklearn.preprocessing import LabelBinarizer

y = LabelBinarizer().fit_transform(y)
     X  y1  y2  y3
0  3.4   1   0   0
1  3.4   1   0   0
2  3.0   1   0   0
3  2.8   0   1   0
4  2.7   0   1   0
5  2.9   0   1   0
6  3.3   0   0   1
7  3.0   0   0   1
8  3.8   0   0   1
9  2.5   0   0   1
Then perform a dot product between the feature and the target, i.e. sum the feature values per class:
observed = y.T.dot(X)
>>> observed
array([ 9.8, 8.4, 12.6])
Next, take the sum of the feature values and calculate the class frequencies:
feature_count = X.sum(axis=0).reshape(1, -1)
class_prob = y.mean(axis=0).reshape(1, -1)
>>> class_prob, feature_count
(array([[0.3, 0.3, 0.4]]), array([[30.8]]))
Now, as in the first step, we take a dot product to get the expected matrix:
expected = np.dot(class_prob.T, feature_count)
>>> expected
array([[ 9.24],[ 9.24],[12.32]])
Finally we calculate the chi^2 value (named chi2_stat here so that it does not shadow sklearn's chi2 score function used below):
chi2_stat = ((observed.reshape(-1, 1) - expected) ** 2 / expected).sum(axis=0)
>>> chi2_stat
array([0.11666667])
We have a chi^2 value; now we need to judge how extreme it is. For that we use a chi^2 distribution with (number of classes - 1) degrees of freedom and calculate the area from our chi^2 value to infinity, i.e. the probability of getting a chi^2 at least as extreme as the one we got. This is the p-value (using the chi-square survival function from scipy):
import scipy.special

p = scipy.special.chdtrc(3 - 1, chi2_stat)
>>> p
array([0.94333545])
Compare with SelectKBest:
from sklearn.feature_selection import SelectKBest, chi2

s = SelectKBest(chi2, k=1)
s.fit(X.reshape(-1, 1), y)
>>> s.scores_, s.pvalues_
(array([0.11666667]), [0.943335449873492])
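For reference, here is the whole walkthrough stitched together as one runnable script (a sketch; X is kept as an (N, 1) column so the reshapes above become unnecessary):
import numpy as np
import scipy.special
from sklearn.preprocessing import LabelBinarizer
from sklearn.feature_selection import SelectKBest, chi2

# One feature as an (N, 1) column and a 3-class target
X = np.array([3.4, 3.4, 3.0, 2.8, 2.7, 2.9, 3.3, 3.0, 3.8, 2.5]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

Y = LabelBinarizer().fit_transform(y)          # one indicator column per class
observed = Y.T.dot(X)                          # per-class sums of the feature
feature_count = X.sum(axis=0).reshape(1, -1)   # total sum of the feature
class_prob = Y.mean(axis=0).reshape(1, -1)     # class frequencies
expected = np.dot(class_prob.T, feature_count)

chi2_stat = ((observed - expected) ** 2 / expected).sum(axis=0)
p = scipy.special.chdtrc(3 - 1, chi2_stat)     # chi^2 survival function, df = 2
print(chi2_stat, p)                            # [0.11666667] [0.94333545]

s = SelectKBest(chi2, k=1)
s.fit(X, y)
print(s.scores_, s.pvalues_)                   # matches the manual computation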

Using Scipy Signal to carry positive balances from previous calculation

Is there a way to simulate the following output using scipy.signal instead of loops?
import pandas as pd

df_in = pd.DataFrame({'Generated':[13,8,7,6],'Consume':[8,10,20,5]})
print(df_in)

   Generated  Consume
0         13        8
1          8       10
2          7       20
3          6        5
df_in['balance'] = [5,3,0,1]
Where 13 - 8 equals a balance of 5; the 5 is carried to the next line, and 5 + 8 - 10 yields a balance of 3.
The 3 is carried to the next line; 3 + 7 - 20 yields a negative number, but you can't carry a negative balance, so the balance is 0.
So the next line is 0 carry + 6 - 5, which yields a balance of 1.
print(df_in)
Expected output:
   Generated  Consume  balance
0         13        8        5
1          8       10        3
2          7       20        0
3          6        5        1
If it wasn't for the requirement to only add to the carry if the balance is positive, you could use an accumulator on the difference. This accumulator can be implemented using lfilter, obtaining the b and a parameters from the recurrence equation y[n] = y[n-1] + x[n]:
import scipy.signal

x = df_in['Generated'] - df_in['Consume']
df_in['balance'] = scipy.signal.lfilter([1], [1, -1], x)
Unfortunately, adding to the carry only when the balance stays positive makes the process non-linear, which scipy.signal.lfilter is not made to handle. At that point you'd have to resort to a loop to handle the special case.
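For completeness, the loop is short; a minimal sketch that clamps the running balance at zero on each step:
balance = []
carry = 0
for gen, con in zip(df_in['Generated'], df_in['Consume']):
    # Negative balances are not carried forward
    carry = max(carry + gen - con, 0)
    balance.append(carry)
df_in['balance'] = balance  # [5, 3, 0, 1]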

Calculate weighted average with Pandas for decreasing cost

I am implementing a ranking system, and I have a field called site_fees that accounts for 10% of the total score. A site fee of 0 would get all 10 points. What I want to do is calculate how many points the non-zero fees would get, but I am struggling to do so.
My initial approach was to split the dataframe into two dataframes (dfb, where site_fees are 0, and dfa, where they are > 0), calculate the rating for dfa, assign a rating of 10 to dfb, then union the two.
The code is as follows:
dfSitesa = dfSites[dfSites['site_fees'].notnull()]
dfSitesb = dfSites[dfSites['site_fees'].isnull()]
dfSitesa['rating'] = FeeWeight * \
    dfSitesa['site_fees'].min() / dfSitesa['site_fees']
dfSitesb['rating'] = FeeWeight
dfSites = pd.concat([dfSitesa, dfSitesb])
This produces an output, however the results for dfa are not correct, because the minimum of dfa is 5000 instead of 0, so the rating of a site with $5000 in fees is 10 (the maximum), which is not correct. What am I doing wrong?
The minimum non-zero site_fee is 5000 and the maximum is 15000. Based on this, I would expect a general ranking system like:
15000 |  0
10000 |  3.3
 5000 |  6.6
    0 | 10
Here is a way to do it:
dfSites = pd.DataFrame({'site_fees': [0, 1, 2, 3, 5]})
FeeWeight = 10
dfSitesa = dfSites[dfSites['site_fees'].notnull()]
dfSitesb = dfSites[dfSites['site_fees'].isnull()]
dfSitesb['rating'] = FeeWeight
factor = dfSitesa['site_fees'].max() - dfSitesa['site_fees'].min()
dfSitesa['rating'] = FeeWeight * (1 - (dfSitesa['site_fees'] - dfSitesa['site_fees'].min()) / factor)
dfSites = pd.concat([dfSitesa, dfSitesb])
In [1]: print(dfSites)
   site_fees  rating
0          0    10.0
1          1     8.0
2          2     6.0
3          3     4.0
4          5     0.0
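As a quick check, the same formula applied to hypothetical fees matching the scale described in the question (0, 5000, 10000, 15000) reproduces the expected ranking:
import pandas as pd

dfSites = pd.DataFrame({'site_fees': [15000, 10000, 5000, 0]})
FeeWeight = 10
lo, hi = dfSites['site_fees'].min(), dfSites['site_fees'].max()
dfSites['rating'] = FeeWeight * (1 - (dfSites['site_fees'] - lo) / (hi - lo))
print(dfSites)
#    site_fees     rating
# 0      15000   0.000000
# 1      10000   3.333333
# 2       5000   6.666667
# 3          0  10.000000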

Calculating grid values given the distance in Python

I have a cell grid of large dimensions. Each cell has an ID (p1), a cell value (p3) and coordinates in actual measures (X, Y). This is how the first 10 rows/cells look:
   p1   p2    p3    X  Y
0   0  0.0   0.0    0  0
1   1  0.0   0.0  100  0
2   2  0.0  12.0  200  0
3   3  0.0   0.0  300  0
4   4  0.0  70.0  400  0
5   5  0.0  40.0  500  0
6   6  0.0  20.0  600  0
7   7  0.0   0.0  700  0
8   8  0.0   0.0  800  0
9   9  0.0   0.0  900  0
Neighbouring cells of cell i in p1 can be determined as (i-500+1, i-500-1, i-1, i+1, i+500+1, i+500-1).
For example: p1 of 5 has neighbours 4, 6, 504, 505, 506 (these are the IDs of rows in the table above, p1).
What I am trying to do is: for a chosen value/row i in p1, I would like to know all neighbours within the chosen distance from i and sum all their p3 values.
I tried to apply this solution (link), but I don't know how to incorporate the distance parameter. The cell value can be taken with df.iloc, but the steps before that are a bit tricky for me.
Can you give me any advice?
EDIT:
Using the solution from Thomas and having df called CO:
      p3
0     45
1    580
2  12000
3  12531
4  22456
I'd like to add another column that uses the values from the p3 column:
CO['new'] = format(sum_neighbors(data, CO['p3']))
But it doesn't work. If I put a number instead of a reference to the p3 column, it works like a charm. But how can I use the values from the p3 column automatically in the format function?
SOLVED:
It worked with:
CO['new'] = CO.apply(lambda row: sum_neighbors(data, row.p3), axis=1)
Solution:
import numpy as np
import pandas

# Generating toy data
N = 10
data = pandas.DataFrame({'p3': np.random.randn(N)})
print(data)

# Finding neighbours
get_candidates = lambda i: [i-500+1, i-500-1, i-1, i+1, i+500+1, i+500-1]
keep_in_range = lambda neighbors, N: [n for n in neighbors if 0 <= n < N]
get_neighbors = lambda i, N: keep_in_range(get_candidates(i), N)
print("Neighbors of 5: {}".format(get_neighbors(5, len(data))))

# Summing p3 on neighbors
def sum_neighbors(data, i, col='p3'):
    return data.iloc[get_neighbors(i, len(data))][col].sum()

print("p3 sum on neighbors of 5: {}".format(sum_neighbors(data, 5)))
Output:
         p3
0 -1.106541
1 -0.760620
2  1.282252
3  0.204436
4 -1.147042
5  1.363007
6 -0.030772
7 -0.461756
8 -1.110459
9 -0.491368
Neighbors of 5: [4, 6]
p3 sum on neighbors of 5: -1.1778133703169344
Notes:
I assumed p1 was range(N) as seemed to be implied (so we don't need it at all).
I don't think that 505 is a neighbour of 5 given the list of neighbors of i defined by the OP.
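The distance parameter from the question is not handled above. A sketch of one way to incorporate it, assuming the grid is 500 cells wide (as the i-500/i+500 neighbour formula implies) and that "distance" means the number of grid steps in any direction (Chebyshev distance; note this also counts the straight up/down cells i-500 and i+500, unlike the six-cell stencil in the question):
import numpy as np
import pandas as pd

WIDTH = 500  # assumed grid width, implied by the neighbour formula

def neighbors_within(data, i, d, width=WIDTH):
    # IDs of all cells within d grid steps of cell i
    n = len(data)
    row, col = divmod(i, width)
    ids = []
    for r in range(row - d, row + d + 1):
        for c in range(col - d, col + d + 1):
            j = r * width + c
            if (r, c) != (row, col) and r >= 0 and 0 <= c < width and j < n:
                ids.append(j)
    return ids

def sum_neighbors_within(data, i, d, col='p3'):
    return data.iloc[neighbors_within(data, i, d)][col].sum()

# Example on a 10-cell strip of toy data
data = pd.DataFrame({'p3': np.arange(10, dtype=float)})
print(neighbors_within(data, 5, 1))      # [4, 6]
print(sum_neighbors_within(data, 5, 1))  # 4.0 + 6.0 = 10.0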
