I have a df of coordinates representing points at various timestamps. I want to calculate the average position of these points in relation to each other.
To achieve this, I'm aiming to calculate the offset between each point and each of the other points, and then average those offsets across timestamps.
The following calculates the distance between each pair of points.
import pandas as pd
from scipy.spatial import distance
import itertools
df = pd.DataFrame({
    'Time': [1,1,1,2,2,2,3,3,3],
    'id':   ['A','B','C','A','B','C','A','B','C'],
    'X':    [1.0,3.0,2.0,2.0,4.0,3.0,3.0,5.0,4.0],
    'Y':    [1.0,1.0,0.5,2.0,2.0,2.5,3.0,3.0,3.0],
})
ids = list(df['id'])
# get the points
points = df[["X", "Y"]].values
# calculate the distance of each point from every other point;
# row i contains the distances for point i:
# distances[i, j] is the distance of point i from point j.
distances = distance.cdist(points, points, "euclidean")
distances = distances.flatten()
# get the start and end points
cartesian = list(itertools.product(ids, ids))
data = dict(
    start_region=[x[0] for x in cartesian],
    end_region=[x[1] for x in cartesian],
    distance=distances,
)
df1 = pd.DataFrame(data)
All I really need to output is:
Time start_point end_point X Y
0 1 A B 2.0 0.0
1 1 A C 1.0 -0.5
2 1 B C -1.0 -0.5
3 2 A B 2.0 0.0
4 2 A C 1.0 0.5
5 2 B C -1.0 0.5
6 3 A B 2.0 0.0
7 3 A C 1.0 0.0
8 3 B C -1.0 0.0
So the average position of these points in relation to each other would be the green coordinates in my plot.
But if I average the dataset above, what it displays is not those relative positions. I understand how this occurs: the average isn't referencing the other points.
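To make that concrete, here's a minimal sketch of what averaging df1 gives: one unsigned scalar per pair, with no (X, Y) direction left to recover the relative positions from.
# averaging the euclidean distances collapses each (start, end) pair to a
# single unsigned scalar, mixing timestamps and losing direction entirely
avg_dist = df1.groupby(['start_region', 'end_region'], as_index=False)['distance'].mean()
print(avg_dist)  # columns: start_region, end_region, distance -- no X/Y offsets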
Here's my take on it:
import itertools

def relative_dist(gp):
    combs = list(itertools.combinations(gp.index, 2))
    df_gp = pd.concat([gp.loc[tup, :].diff() for tup in combs], keys=combs).dropna()
    return df_gp
df_dist = (df.set_index('id').groupby('Time')[['X','Y']].apply(relative_dist)
.droplevel('id').rename_axis(['Time','start_point','end_point'])
.reset_index())
Out[341]:
Time start_point end_point X Y
0 1 A B 2.0 0.0
1 1 A C 1.0 -0.5
2 1 B C -1.0 -0.5
3 2 A B 2.0 0.0
4 2 A C 1.0 0.5
5 2 B C -1.0 0.5
6 3 A B 2.0 0.0
7 3 A C 1.0 0.0
8 3 B C -1.0 0.0
df_avg = df_dist.groupby(['start_point','end_point'], as_index=False)[['X','Y']].mean()
Out[347]:
start_point end_point X Y
0 A B 2.0 0.0
1 A C 1.0 0.0
2 B C -1.0 0.0
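If you then want averaged coordinates back out of this, one option (a sketch that, by assumption, anchors A at the origin) is to accumulate the averaged offsets pair by pair:
# reconstruct average relative coordinates by pinning A at the origin
pos = {'A': (0.0, 0.0)}
for row in df_avg.itertuples():
    if row.start_point in pos and row.end_point not in pos:
        x0, y0 = pos[row.start_point]
        pos[row.end_point] = (x0 + row.X, y0 + row.Y)
print(pos)  # {'A': (0.0, 0.0), 'B': (2.0, 0.0), 'C': (1.0, 0.0)}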
Here's a suggestion on how to visualise the relative positions of your points. I would want, for each timestamp, to plot an ellipse at position (X_, Y_) where:
X_ is the mean of your points' X coordinates for that timestamp.
Y_ is the mean of your points' Y coordinates for that timestamp.
the width of the ellipse equals the variance of your points' X coordinates for that timestamp.
the height of the ellipse equals the variance of your points' Y coordinates for that timestamp.
That way, at a glance and for each timestamp, you could read some very high-level statistics about your coordinate distribution at that timestamp.
Here's some code to generate such a visualisation:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.patches import Ellipse

# sample data with 4 timestamps
df = pd.DataFrame({
    'Time': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
    'id':   ['A','B','C','D','A','B','C','D','A','B','C','D','A','B','C','D'],
    'X':    [1,2,1,2,1,2,1,2,4,4,3,4,10,8,5,6],
    'Y':    [1,1,3,3,1,1,2,2,5,5,8,5,6,6,7,6],
})

# for each timestamp, compute means and variances across all samples for that timestamp
means = df.groupby("Time")[["X", "Y"]].mean()
variances = df.groupby("Time")[["X", "Y"]].var()
df_ = pd.concat([means, variances], axis=1)
df_.columns = ["X_", "Y_", "var_X", "var_Y"]

# plot
fig, ax = plt.subplots(subplot_kw={'aspect': 'equal'})
for row in df_.itertuples():
    ellipse = Ellipse(xy=(row.X_, row.Y_),  # position of the ellipse is (X_, Y_)
                      width=row.var_X,      # width reflects X variance
                      height=row.var_Y,     # height reflects Y variance
                      angle=0)
    ax.add_artist(ellipse)
    ellipse.set_clip_box(ax.bbox)
    ellipse.set_alpha(.4)
    plt.text(x=row.X_ + 0.2, y=row.Y_ + 0.2, s=f"t={row.Index}")  # timestamp label
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
plt.show()
Which would look like this:
What do you think? Another idea could be to make a GIF (in case the timestamp averages overlap too much).
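If the GIF idea appeals, here's a minimal sketch with matplotlib.animation, reusing the df_ summary frame from above; the filename and frame rate are just placeholder choices:
from matplotlib.animation import FuncAnimation, PillowWriter

fig, ax = plt.subplots(subplot_kw={'aspect': 'equal'})

def draw_frame(i):
    # redraw a single ellipse per frame, one frame per timestamp
    ax.clear()
    row = df_.iloc[i]
    e = Ellipse(xy=(row['X_'], row['Y_']), width=row['var_X'],
                height=row['var_Y'], angle=0, alpha=.4)
    ax.add_artist(e)
    ax.text(row['X_'] + 0.2, row['Y_'] + 0.2, f"t={df_.index[i]}")
    ax.set_xlim(0, 10)
    ax.set_ylim(0, 10)

anim = FuncAnimation(fig, draw_frame, frames=len(df_), interval=800)
anim.save("timestamps.gif", writer=PillowWriter(fps=1))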
I am trying to change the color of each individual bar in my figure here. The code that I used is down below. Instead of each bar changing to the color that I have set in c, there are several colors within each bar. I have included a screenshot of this. How can I fix this? Thank you all in advance!
Clusters is just a categorical variable of 5 groups, ranging from 0 to 4. I have included a second screenshot of the dataframe.
So essentially, what I am trying to do is to plot each cluster for economic ideology and social ideology so I can have a visual comparison of the 5 different clusters over these two dimensions (economic and social ideology). Each cluster should be represented by one color. For example, cluster 0 should be red in color.
c = ['#bf1111', '#1c4975', '#278f36', '#47167a', '#de8314']
plt.subplot(1, 2, 1)
plt.bar(data = ANESdf_LatNEW, height = "EconIdeo",
x = "clusters", color = c)
plt.title('Economic Ideology')
plt.xticks([0, 1, 2, 3, 4])
plt.xlabel('Clusters')
plt.ylabel('')
plt.subplot(1, 2, 2)
plt.bar(data = ANESdf_LatNEW, height = "SocialIdeo",
x = "clusters", color = c)
plt.title('Social Ideology')
plt.xticks([0, 1, 2, 3, 4])
plt.xlabel('Clusters')
plt.ylabel('')
plt.show()
Bar graph here
Top 5 rows of dataframe
I have tried multiple ways of changing the colors. For example, instead of having c, I put the colors in directly at color = ... This did not work either.
Here is a script that does what you seem to be looking for based on your edits and comment.
Note that I do not assume that all clusters have the same size in this context; if that is the case, this approach can be simplified.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# sample dataframe
df = pd.DataFrame({
    'EconIdeo': [1,2,3,4,3,5,7],
    'Clusters': [2,3,0,1,3,0,3]
})
print(df)
# parameters: width for each cluster, colors for each cluster
# (if clusters are not sequential from zero, replace c with dictionary)
width = .75
c = ['#bf1111', '#1c4975', '#278f36', '#47167a', '#de8314']
df['xpos'] = df['Clusters']
df['width'] = width
df['color'] = ''
clusters = df['Clusters'].unique()
for k in clusters:
    where = (df['Clusters'] == k)
    n = where.sum()
    df.loc[where, 'xpos'] += np.linspace(-width/2, width/2, 2*n+1)[1:-1:2]
    df.loc[where, 'width'] /= n
    df.loc[where, 'color'] = c[k]
plt.bar(data = df, height = "EconIdeo", x = 'xpos',
width = 'width', color = 'color')
plt.xticks(clusters,clusters)
plt.show()
Resulting plot:
Input dataframe:
EconIdeo Clusters
0 1 2
1 2 3
2 3 0
3 4 1
4 3 3
5 5 0
6 7 3
Dataframe after script applies changes (to include plotting specifications)
EconIdeo Clusters xpos width color
0 1 2 2.0000 0.750 #278f36
1 2 3 2.7500 0.250 #47167a
2 3 0 -0.1875 0.375 #bf1111
3 4 1 1.0000 0.750 #1c4975
4 3 3 3.0000 0.250 #47167a
5 5 0 0.1875 0.375 #bf1111
6 7 3 3.2500 0.250 #47167a
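As an aside: the scrambled colors in the original code happen because plt.bar spreads the color list across the bars in row order, not per cluster. If overlapping bars at the same x were acceptable, a lighter-weight sketch (using the question's ANESdf_LatNEW frame and c list) would be to map each row's color from its cluster label; the offsetting above is still what keeps the bars from hiding each other.
# derive each bar's color from its row's cluster label,
# so color tracks the cluster regardless of row order
color_map = dict(enumerate(c))  # {0: '#bf1111', 1: '#1c4975', ...}
plt.bar(x=ANESdf_LatNEW['clusters'], height=ANESdf_LatNEW['EconIdeo'],
        color=ANESdf_LatNEW['clusters'].map(color_map))
plt.show()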
I am trying to compare different lines, to know if one is above the other, and if not, at which x this change happens.
If they had the same x values and the same length, that would be very easy: only the ys of the lines would differ.
But the lines have different x values and different lengths, although the x intervals are the same for all curves.
As a very simple example I use the following data:
#curve 1: len = 9
x1 = np.array([5,6,7,8,9,10,11,12,13])
y1 = np.array([100,101,110,130,132,170,190,192,210])
#curve 2: len = 10
x2 = np.array([3,4,5,6,7,8,9,10,11,12])
y2 = np.array([90,210,211,250,260,261,265,180,200,210])
#curve 3: len = 8
x3 = np.array([7.3,8.3,9.3,10.3,11.3,12.3,13.3,14.3])
y3 = np.array([300,250,270,350,380,400,390,380])
They are supposed to be regression lines. In this simple example, the result is supposed to be that curve 2 has higher values than curve 1 over the whole x range.
I was trying to bin x in the range 2.5-12.5 with a bin length of 1, to compare the corresponding ys in each bin.
My actual data are big, and this comparison needs to be done many times, so I need to find a solution that does not take much time.
Plot
Plot of data for given x-axis
plt.figure(figsize=(6, 6))
plt.plot(x1, y1, marker='o', label='y1')
plt.plot(x2, y2, marker='o', label='y2')
plt.plot(x3, y3, marker='o', label='y3')
plt.xticks(range(15))
plt.grid()
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
Functions
get_new_x uses np.digitize to re-bin the x-axis values.
get_comparison adds a column of Booleans for each pair of columns compared.
Currently each new column is added to the main dataframe, df; however, this can be updated to use a separate comparison dataframe.
combs is a list of column combinations:
[Index(['y1', 'y2'], dtype='object'), Index(['y2', 'y3'], dtype='object')]
from typing import List

import numpy as np
import pandas as pd

# function to create the bins
def get_bins(x_arrays: List[np.ndarray]) -> np.ndarray:
    bin_len = np.diff(x_arrays[0][:2])[0]  # bin length from the common x spacing
    all_x = np.concatenate(x_arrays)       # join arrays
    min_x = min(all_x)                     # get min
    max_x = max(all_x)                     # get max
    return np.arange(min_x, max_x + bin_len, bin_len)

# function using np.digitize to bin the old x-axis into new bins
def get_new_x(x_arrays: List[np.ndarray]) -> List[np.ndarray]:
    bins = get_bins(x_arrays)  # get the bins
    x_new = list()
    for x in x_arrays:
        x_new.append(bins[np.digitize(np.round(x), bins, right=True)])  # assign bins
    return x_new

# function to create a dataframe for the arrays, with the new x-axis as index
def get_df(x_arrays: List[np.ndarray], y_arrays: List[np.ndarray]) -> pd.DataFrame:
    x_new = get_new_x(x_arrays)
    return pd.concat([pd.DataFrame(y, columns=[f'y{i+1}'], index=x_new[i])
                      for i, y in enumerate(y_arrays)], axis=1)

# compare each successive pair of columns of the dataframe;
# True where the left column is greater than the right column
def get_comparison(df: pd.DataFrame):
    cols = df.columns
    combs = [cols[i:i+2] for i in range(len(cols) - 1)]
    for comb in combs:
        df[f'{comb[0]} > {comb[1]}'] = df[comb[0]] > df[comb[1]]
Call functions:
# put the arrays into lists
y = [y1, y2, y3]
x = [x1, x2, x3]
# call get_df
df = get_df(x, y)
# call get_comparison
get_comparison(df)
# get only the index of True values with Boolean indexing
for col in df.columns[3:]:
    vals = df.index[df[col]].tolist()
    if vals:
        print(f'{col}: {vals}')
[out]:
y2 > y3: [8.0]
display(df)
y1 y2 y3 y1 > y2 y2 > y3
3.0 NaN 90.0 NaN False False
4.0 NaN 210.0 NaN False False
5.0 100.0 211.0 NaN False False
6.0 101.0 250.0 NaN False False
7.0 110.0 260.0 300.0 False False
8.0 130.0 261.0 250.0 False True
9.0 132.0 265.0 270.0 False False
10.0 170.0 180.0 350.0 False False
11.0 190.0 200.0 380.0 False False
12.0 192.0 210.0 400.0 False False
13.0 210.0 NaN 390.0 False False
14.0 NaN NaN 380.0 False False
Plot
fig, ax = plt.subplots(figsize=(8, 6))

# add markers for problem values
for i, col in enumerate(df.columns[3:], 1):
    vals = df.iloc[:, i][df[col]]
    if not vals.empty:
        ax.scatter(vals.index, vals.values, color='red', s=110, label='bad')

df.iloc[:, :3].plot(marker='o', ax=ax)  # plot the dataframe
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(range(16))
plt.title('y-values plotted against rebinned x-values')
plt.grid()
plt.show()
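A lighter-weight alternative, since the x intervals already match: interpolate every curve onto one shared grid with np.interp and compare the arrays directly. A sketch (np.interp clamps to the edge values outside a curve's own x range, so those grid points are masked to NaN here):
import numpy as np

grid = np.arange(3.0, 15.0, 1.0)  # shared grid; step equals the common x interval

def on_grid(x, y, grid):
    # interpolate y onto the shared grid; NaN outside the curve's own x range
    yi = np.interp(grid, x, y)
    yi[(grid < x[0]) | (grid > x[-1])] = np.nan
    return yi

g1, g2 = on_grid(x1, y1, grid), on_grid(x2, y2, grid)
above = g2 > g1  # NaN comparisons evaluate to False
# x values where the comparison flips (this includes the edges of the overlap)
flips = grid[1:][np.diff(above.astype(int)) != 0]
print(above, flips)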
This is the answer I had in my mind when I first asked the question, but couldn't make it work back then. My idea is based on binning y1 and y2 based on x and comparing the two in each bin. So, as an example, I have 3 curves and I want to compare them. The only thing these curves share is delta x (the bin length), which is 1 here.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# curve 1
x1 = np.array([5,6,7,8,9,10,11,12,13])
y1 = np.array([100,101,110,130,132,170,190,192,210])
# curve 2
x2 = np.array([3,4,5,6,7,8,9,10,11,12])
y2 = np.array([90,210,211,250,260,261,265,180,200,210])
# curve 3
x3 = np.array([7.3,8.3,9.3,10.3,11.3,12.3,13.3,14.3])
y3 = np.array([300,250,270,350,380,400,390,380])

bin_length = 1
# x values have the same intervals in all three curves
x_min = min(x1[0], x2[0], x3[0]) - bin_length/2
x_max = max(x1[-1], x2[-1], x3[-1]) + bin_length/2
bins = np.arange(x_min, x_max + bin_length, bin_length)

# bin mid points to use as index
bin_mid = []
for i in range(len(bins) - 1):
    # compute the mid point of each bin
    bin_mid.append((bins[i] + bins[i+1]) / 2)

# this function bins y based on binning x
def bin_fun(x, y, bins, bin_length):
    c = list(zip(x, y))
    # final output of the function
    final_y_binning = []
    # list holding the members of the current bin
    bined_y_members = []
    for i in range(len(bins) - 1):
        # compute the low and high thresholds of the bin
        low_threshold = bins[i]
        high_threshold = bins[i+1]
        # bin y according to x
        for member in c:
            if member[0] < high_threshold and member[0] >= low_threshold:
                bined_y_members.append(member[1])
        final_y_binning.append(bined_y_members)
        # empty the container for the next bin's members
        bined_y_members = []
    df = pd.DataFrame(final_y_binning)
    return df

Y1 = bin_fun(x1, y1, bins, bin_length)
Y1.columns = [1]
Y2 = bin_fun(x2, y2, bins, bin_length)
Y2.columns = [2]
Y3 = bin_fun(x3, y3, bins, bin_length)
Y3.columns = [3]

# DataFrame.append was removed in pandas 2.0; concat joins the three columns
binned_y = pd.concat([Y1, Y2, Y3], axis=1)
binned_y.index = bin_mid
print(binned_y)

# comparing curve 2 and curve 1
for i in binned_y.index:
    if binned_y.loc[i][2] - binned_y.loc[i][1] < 0:
        print(i)

# comparing curve 3 and curve 2
for i in binned_y.index:
    if binned_y.loc[i][3] - binned_y.loc[i][2] < 0:
        print(i)
This returns 8, which is the index where y3 < y2.
binned_y
1 2 3
3.0 NaN 90.0 NaN
4.0 NaN 210.0 NaN
5.0 100.0 211.0 NaN
6.0 101.0 250.0 NaN
7.0 110.0 260.0 300.0
8.0 130.0 261.0 250.0
9.0 132.0 265.0 270.0
10.0 170.0 180.0 350.0
11.0 190.0 200.0 380.0
12.0 192.0 210.0 400.0
13.0 210.0 NaN 390.0
14.0 NaN NaN 380.0
15.0 NaN NaN NaN
Plot
binned_y.plot(marker='o', figsize=(6, 6)) # plot the dataframe
plt.legend(labels=['y1', 'y2', 'y3'], bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(range(16))
plt.grid()
The dataframe I am working with looks like this:
vid2 COS fsim FWeight
0 -_aaMGK6GGw_57_61 2 0.253792 0.750000
1 -_aaMGK6GGw_57_61 2 0.192565 0.250000
2 -_hbPLsZvvo_5_8 2 0.562707 0.333333
3 -_hbPLsZvvo_5_8 2 0.179969 0.666667
4 -_hbPLsZvvo_18_25 1 0.275962 0.714286
Here,
the features have the following meanings:
FWeight - weight of each fragment (or row)
fsim - similarity score between the two columns cap1 and cap2
The weighted formula is:
vid_score = sum(fsim[i] * FWeight[i]) / sum(FWeight[i])
For example, for vid2 "-_aaMGK6GGw_57_61", COS = 2, so the two rows with that vid2 come under it:
fsim FWeight
0 0.253792 0.750000
1 0.192565 0.250000
The calculated value vid_score needs to be
vid_score(1st video) = (fsim[0] * FWeight[0] + fsim[1] * FWeight[1])/(FWeight[0] + FWeight[1])
The expected output value vid_score for vid2 = -_aaMGK6GGw_57_61 is
(0.750000) * (0.253792) + (0.250000) * (0.192565)
= 0.238485 (Final value)
For some videos COS = 1, 2, 3, 4, 5, ..., so this needs to be dynamic.
I am trying to calculate the weighted similarity score for each video ID (vid2 here). However, the number of captions and weights per video varies: some have 2, some 1, some 3, etc. This number of segments and captions is stored in the feature COS (count of segments).
I want to iterate through the dataframe so that the score for each video is stored as a weighted average of fsim (the fragment similarity score). But the number of iterations is not regular.
I have tried the code below, but I am not able to iterate dynamically with the iteration step being COS instead of a constant value:
vems_score = 0.0
video_scores = []
for i, row in merged.iterrows():
    vid_score = 0.0
    total_weight = 0.0
    for j in range(row['COS']):
        total_weight = total_weight + row['FWeight']
        vid_score = vid_score + (row['FWeight'] * row['fsim'])
    i = i + row['COS']
    vid_score = vid_score / total_weight
    video_scores.append(vid_score)
print(video_scores)
Here is my solution, which you can modify/optimize to your needs.
import pandas as pd, numpy as np

def computeSim():
    vid = [1,1,2,2,3]
    cos = [2,2,2,2,1]
    fsim = [0.25,.19,.56,.17,.27]
    weight = [.75,.25,.33,.66,.71]
    df = pd.DataFrame({'vid': vid, 'cos': cos, 'fsim': fsim, 'fw': weight})
    print(df)
    df2 = df.groupby('vid')
    similarity = []
    for group in df2:
        similarity.append(np.sum(group[1]['fsim'] * group[1]['fw']) / np.sum(group[1]['fw']))
    return similarity
output:
0.235
0.30000000000000004
0.27
Solution
Try this with your data. I assume that you stored the dataframe as df.
df['Prod'] = df['fsim']*df['FWeight']
grp = df.groupby(['vid2', 'COS'])
result = grp['Prod'].sum()/grp['FWeight'].sum()
print(result)
Output with your data (Dummy Data B):
vid2 COS
-_aaMGK6GGw_57_61 2 0.238485
-_hbPLsZvvo_18_25 1 0.275962
-_hbPLsZvvo_5_8 2 0.307548
dtype: float64
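As a side note, the same weighted mean can be written with np.average, which accepts the weights directly; this sketch should reproduce the Series above:
import numpy as np

# weighted mean of fsim per group, weights taken from FWeight
result = df.groupby(['vid2', 'COS']).apply(
    lambda g: np.average(g['fsim'], weights=g['FWeight']))
print(result)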
Dummy Data: A
I made the following dummy data to test a few aspects of the logic.
df = pd.DataFrame({'vid2': [1,1,2,5,2,6,7,4,8,7,6,2],
'COS': [2,2,3,1,3,2,2,1,1,2,2,3],
'fsim': np.random.rand(12),
'FWeight': np.random.rand(12)})
df['Prod'] = df['fsim']*df['FWeight']
print(df)
# Groupby and apply formula
grp = df.groupby(['vid2', 'COS'])
result = grp['Prod'].sum()/grp['FWeight'].sum()
print(result)
Output:
vid2 COS
1 2 0.405734
2 3 0.535873
4 1 0.534456
5 1 0.346937
6 2 0.369810
7 2 0.479250
8 1 0.065854
dtype: float64
Dummy Data: B (OP Provided)
This is your dummy data. I made this script so anyone could easily run it and load the data as a dataframe.
import pandas as pd
from io import StringIO
s = """
vid2 COS fsim FWeight
0 -_aaMGK6GGw_57_61 2 0.253792 0.750000
1 -_aaMGK6GGw_57_61 2 0.192565 0.250000
2 -_hbPLsZvvo_5_8 2 0.562707 0.333333
3 -_hbPLsZvvo_5_8 2 0.179969 0.666667
4 -_hbPLsZvvo_18_25 1 0.275962 0.714286
"""
df = pd.read_csv(StringIO(s), sep=r'\s+')
#print(df)
I'd like to add values calculated in a for loop to a series so that it can be its own column in a dataframe. So far I've got this (the y values are from a dataframe named block):
N = 12250
for i in range(0, N-1):
    y1 = block.iloc[i]['y']
    y2 = block.iloc[i+1]['y']
    diffy[i] = y2 - y1
I'd like to make diffy its own series instead of just replacing the diffy value on each loop.
Some sample data (assume N = 5):
N = 5
np.random.seed(42)
block = pd.DataFrame({
'y': np.random.randint(0, 10, N)
})
y
0 6
1 3
2 7
3 4
4 6
You can calculate diffy as follows:
diffy = block['y'].diff().shift(-1)[:-1]
0 -3.0
1 4.0
2 -3.0
3 2.0
Name: y, dtype: float64
diffy is a pandas.Series. If you want a list, add .to_list(). If you want a numpy array, add .values.
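And since the goal was a dataframe column: diffy has one row fewer than block (the last row has no successor), so drop that row before attaching. A sketch:
# attach diffy as its own column; indexes 0..N-2 align after dropping the last row
out = block.iloc[:-1].assign(diffy=diffy)
print(out)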
I have a cell grid of big dimensions. Each cell has an ID (p1), a cell value (p3), and coordinates in actual measures (X, Y). This is what the first 10 rows/cells look like:
p1 p2 p3 X Y
0 0 0.0 0.0 0 0
1 1 0.0 0.0 100 0
2 2 0.0 12.0 200 0
3 3 0.0 0.0 300 0
4 4 0.0 70.0 400 0
5 5 0.0 40.0 500 0
6 6 0.0 20.0 600 0
7 7 0.0 0.0 700 0
8 8 0.0 0.0 800 0
9 9 0.0 0.0 900 0
Neighbouring cells of cell i in the p1 can be determined as (i-500+1, i-500-1, i-1, i+1, i+500+1, i+500-1).
For example: p1 of 5 has neighbours - 4,6,504,505,506. (these are the ID of rows in the upper table - p1).
What I am trying to do is:
For a chosen value/row i in p1, I would like to find all neighbours within a chosen distance of i and sum their p3 values.
I tried to apply this solution (link), but I don't know how to incorporate the distance parameter. The cell value can be taken with df.iloc, but the steps before that are a bit tricky for me.
Can you give me any advice?
EDIT:
Using the solution from Thomas and having df called CO:
p3
0 45
1 580
2 12000
3 12531
4 22456
I'd like to add another column using the values from the p3 column:
CO['new'] = format(sum_neighbors(data, CO['p3']))
But it doesn't work. If I put in a number instead of the reference to the p3 column, it works like a charm. But how can I use the values from the p3 column automatically in the format function?
SOLVED:
It worked with:
CO['new'] = CO.apply(lambda row: sum_neighbors(data, row.p3), axis=1)
Solution:
import numpy as np
import pandas
# Generating toy data
N = 10
data = pandas.DataFrame({'p3': np.random.randn(N)})
print(data)
# Finding neighbours
get_candidates = lambda i: [i-500+1, i-500-1, i-1, i+1, i+500+1, i+500-1]
filter = lambda neighbors, N: [n for n in neighbors if 0<=n<N]
get_neighbors = lambda i, N: filter(get_candidates(i), N)
print("Neighbors of 5: {}".format(get_neighbors(5, len(data))))
# Summing p3 on neighbors
def sum_neighbors(data, i, col='p3'):
    return data.iloc[get_neighbors(i, len(data))][col].sum()

print("p3 sum on neighbors of 5: {}".format(sum_neighbors(data, 5)))
Output:
p3
0 -1.106541
1 -0.760620
2 1.282252
3 0.204436
4 -1.147042
5 1.363007
6 -0.030772
7 -0.461756
8 -1.110459
9 -0.491368
Neighbors of 5: [4, 6]
p3 sum on neighbors of 5: -1.1778133703169344
Notes:
I assumed p1 was range(N) as seemed to be implied (so we don't need it at all).
I don't think that 505 is a neighbour of 5 given the list of neighbors of i defined by the OP.
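On the distance parameter from the question: if the grid is row-major with a width of 500 (an assumption inferred from the +/-500 offsets), one way to generalize is to convert i to (row, col) and collect every cell within Chebyshev distance d. Note this sketch also counts the direct vertical neighbours i-500 and i+500, which the OP's list omits:
# hypothetical generalization: all neighbors of cell i within Chebyshev
# distance d on a row-major grid; W=500 is assumed from the +/-500 offsets
def neighbors_within(i, d, N, W=500):
    row, col = divmod(i, W)
    out = []
    for dr in range(-d, d + 1):
        for dc in range(-d, d + 1):
            if dr == 0 and dc == 0:
                continue  # skip the cell itself
            r, c = row + dr, col + dc
            if r >= 0 and 0 <= c < W and r * W + c < N:
                out.append(r * W + c)
    return out

def sum_neighbors_within(data, i, d, col='p3'):
    return data.iloc[neighbors_within(i, d, len(data))][col].sum()

print(sum_neighbors_within(data, 5, 1))  # with the toy data: p3[4] + p3[6]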