Extrapolation with 'nearest' method in Python - python

I'm looking to find the Python equivalent of the following Matlab statement:
vq = interp1(x, y, xq, 'nearest', 'extrap')
It looks as if interp(xq, x, y) works perfectly for linear interpolation/extrapolation.
I also looked at
F = scipy.interpolate.interp1d(x, y, kind='nearest')
which works perfectly for the nearest method, but will not perform extrapolation.
Is there anything else I've overlooked? Thanks.

For linear interpolation that will extrapolate using nearest interpolation, use numpy.interp. It does this by default.
For example:
yi = np.interp(xi, x, y)
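A minimal sketch of that default behaviour, using illustrative sample points (the arrays below are assumptions, not data from the question): query points outside the range of x simply receive the first or last y value.

```python
import numpy as np

# Illustrative sample points; x must be increasing for np.interp
x = np.array([0.1, 0.3, 1.9])
y = np.array([4, -9, 1])

# Two queries outside [0.1, 1.9] and two exactly on the end points
xi = np.array([-1.0, 0.1, 1.9, 3.0])
yi = np.interp(xi, x, y)

# Queries left of x[0] get y[0]; queries right of x[-1] get y[-1]
print(yi)
```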
Otherwise, if you just want nearest interpolation everywhere, as you describe, you can do it the short but inefficient way (you can make this a one-liner if you want):
def nearest_interp(xi, x, y):
    # |x - xi| for every (query, sample) pair; shape is (len(xi), len(x))
    idx = np.abs(x - xi[:,None])
    return y[idx.argmin(axis=1)]
Or in a more efficient way using searchsorted:
def fast_nearest_interp(xi, x, y):
    """Assumes that x is monotonically increasing!!."""
    # Shift x points to centers
    spacing = np.diff(x) / 2
    x = x + np.hstack([spacing, spacing[-1]])
    # Append the last point in y twice for ease of use
    y = np.hstack([y, y[-1]])
    return y[np.searchsorted(x, xi)]
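As a quick sanity check, the two approaches can be compared on a few query points. This is only a sketch: the sample arrays are illustrative, the names edges and y_padded are introduced here for readability, and the query points deliberately avoid exact midpoints between samples, where tie-breaking could differ.

```python
import numpy as np

def nearest_interp(xi, x, y):
    # Brute force: distance from every query point to every sample
    idx = np.abs(x - xi[:, None])
    return y[idx.argmin(axis=1)]

def fast_nearest_interp(xi, x, y):
    """Assumes that x is monotonically increasing."""
    # Midpoints between neighbouring samples act as bin edges
    spacing = np.diff(x) / 2
    edges = x + np.hstack([spacing, spacing[-1]])
    # Repeat the last y so that queries beyond the last edge are valid
    y_padded = np.hstack([y, y[-1]])
    return y_padded[np.searchsorted(edges, xi)]

x = np.array([0.1, 0.3, 1.9])
y = np.array([4, -9, 1])
xi = np.array([-1.0, 0.15, 0.25, 1.0, 1.2, 3.0])

slow = nearest_interp(xi, x, y)
fast = fast_nearest_interp(xi, x, y)
print(slow, fast)
```

Both calls also cover the extrapolation case: the first and last query points lie outside the sampled range and pick up the nearest end value.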
To illustrate the difference between numpy.interp and the nearest interpolation examples above:
import numpy as np
import matplotlib.pyplot as plt
def main():
    x = np.array([0.1, 0.3, 1.9])
    y = np.array([4, -9, 1])
    xi = np.linspace(-1, 3, 200)
    fig, axes = plt.subplots(nrows=2, sharex=True, sharey=True)
    for ax in axes:
        ax.margins(0.05)
        ax.plot(x, y, 'ro')
    axes[0].plot(xi, np.interp(xi, x, y), color='blue')
    axes[1].plot(xi, nearest_interp(xi, x, y), color='green')
    kwargs = dict(x=0.95, y=0.9, ha='right', va='top')
    axes[0].set_title("Numpy's $interp$ function", **kwargs)
    axes[1].set_title('Nearest Interpolation', **kwargs)
    plt.show()
def nearest_interp(xi, x, y):
    idx = np.abs(x - xi[:,None])
    return y[idx.argmin(axis=1)]
main()

In later versions of SciPy (at least v0.19.1+), scipy.interpolate.interp1d has the option fill_value='extrapolate'.
For example, pandas forwards method='nearest' and fill_value to interp1d:
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3])
Out[1]:
0 1
1 2
2 3
dtype: int64
>>> t = pd.concat([s, pd.Series(index=s.index + 0.1)]).sort_index()
Out[2]:
0.0 1.0
0.1 NaN
1.0 2.0
1.1 NaN
2.0 3.0
2.1 NaN
dtype: float64
>>> t.interpolate(method='nearest')
Out[3]:
0.0 1.0
0.1 1.0
1.0 2.0
1.1 2.0
2.0 3.0
2.1 NaN
dtype: float64
>>> t.interpolate(method='nearest', fill_value='extrapolate')
Out[4]:
0.0 1.0
0.1 1.0
1.0 2.0
1.1 2.0
2.0 3.0
2.1 3.0
dtype: float64
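The same keyword can also be passed to interp1d directly. A minimal sketch, assuming a SciPy version that supports fill_value='extrapolate'; the sample points and query points are illustrative and not taken from the question:

```python
import numpy as np
from scipy.interpolate import interp1d

# Illustrative sample points
x = np.array([0.1, 0.3, 1.9])
y = np.array([4, -9, 1])

# Nearest-neighbour interpolant that also answers queries outside [x[0], x[-1]]
f = interp1d(x, y, kind='nearest', fill_value='extrapolate')

xq = np.array([-1.0, 0.15, 1.2, 3.0])
vq = f(xq)
print(vq)
```

This mirrors the Matlab call interp1(x, y, xq, 'nearest', 'extrap') from the question.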

Related

matplotlib plot not showing empty vals at ends

I need to show the empty slots at the ends of the plot. Code to show what I mean:
a = pd.DataFrame([ 1,5,3,2,7 ], index=['b','e','g','h','d'])
i = pd.DataFrame(index=['a','b','c','d','e','f','g','h','i','j','k','l'])
c = pd.concat([i, a], axis=1)
plt.plot(c, lw=0, marker='o')
plt.show()
The content of c is
0
a NaN
b 1.0
c NaN
d 7.0
e 5.0
f NaN
g 3.0
h 2.0
i NaN
j NaN
k NaN
l NaN
This shows a chart (can't upload, not enough points, sorry) that has X axis labels b, c, d, e, f, g, h; c and f have no associated points, just as I want.
I have tried plt.xticks, ax.set_xlabels
How can I get the labels for a, i, j, k, l to show?
import pandas as pd
import matplotlib.pyplot as plt
a = pd.DataFrame([ 1,5,3,2,7 ], index=['b','e','g','h','d'])
i = pd.DataFrame(index=['a','b','c','d','e','f','g','h','i','j','k','l'])
c = pd.concat([i, a], axis=1)
fig, ax = plt.subplots()
ax.plot(c.index, c, lw=0, marker='o')
ax.set_xticks(c.index)
plt.show()

generate a 3d plot from data contained in a three columns file

I have a three columns data file structured in this way (example) :
X Y Z
0 0 0.2
0 1 0.3
0 2 0.1
1 0 0.2
1 1 0.3
1 2 0.9
2 0 0.6
2 1 0.8
2 2 0.99
I don't know what this kind of file is called, and I did not find any example that plots it as a 3d wireframe or 3d surface plot.
EDIT: Is there a way to produce a wireframe or surface plot with data structured in this way?
In order to create a surface plot, you first need to transform your x, y and z data into 2d arrays. Then you can plot it easily:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
# read out data from txt file
data = np.genfromtxt("data.txt")[1:]
x_data, y_data, z_data = data[:, 0], data[:, 1], data[:, 2]
# initialize a figure for the 3d plot
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
# create matrix for z values (assumes a complete square grid)
dim = int(np.sqrt(len(x_data)))
z = z_data.reshape((dim, dim))
# create matrices for the x and y points from the file's own coordinates,
# so that every z value stays paired with its (x, y) position
x, y = x_data.reshape((dim, dim)), y_data.reshape((dim, dim))
# plot
ax.plot_surface(x, y, z, alpha=0.75)
plt.show()
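The reshaping step can be checked in isolation. A minimal sketch, with the sample rows from the question typed in as an array rather than read from a file, and assuming the rows are sorted by X and then Y on a complete grid (the names n_x, n_y, x_grid, y_grid and z_grid are illustrative):

```python
import numpy as np

# Sample rows from the question as (X, Y, Z)
data = np.array([
    [0, 0, 0.2], [0, 1, 0.3], [0, 2, 0.1],
    [1, 0, 0.2], [1, 1, 0.3], [1, 2, 0.9],
    [2, 0, 0.6], [2, 1, 0.8], [2, 2, 0.99],
])

# Grid size along each axis, taken from the distinct coordinate values
n_x = len(np.unique(data[:, 0]))
n_y = len(np.unique(data[:, 1]))

# Reshape all three columns the same way so the cells stay aligned
x_grid = data[:, 0].reshape(n_x, n_y)
y_grid = data[:, 1].reshape(n_x, n_y)
z_grid = data[:, 2].reshape(n_x, n_y)

# Cell [1, 2] should describe the row "1 2 0.9"
print(x_grid[1, 2], y_grid[1, 2], z_grid[1, 2])
```

Reshaping the X and Y columns alongside Z, rather than rebuilding them with a fresh meshgrid, avoids accidentally transposing the surface and also works for non-square grids.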

Scatter Pie Plot Python Pandas

I want to create a "Scatter Pie Plot" (a scatter plot using pie charts instead of dots). I require this as I have to represent 3 dimensions of data.
1: x axis (0-6)
2: y axis (0-6)
3: Category, let's say (A, B, C - H)
If two x and y values are the same I want a pie chart to be in that position representing that Category.
Similar to the graph seen in this link:
https://matplotlib.org/gallery/lines_bars_and_markers/scatter_piecharts.html#sphx-glr-gallery-lines-bars-and-markers-scatter-piecharts-py
or this image from Tableau (image not included):
As I am limited to only use python I have been struggling to manipulate the code to work for me.
Could anyone help me with this problem? I would be very grateful!
Example data:
XVAL YVAL GROUP
1.3 4.5 A
1.3 4.5 B
4 2 E
4 6 A
2 4 A
2 4 B
1 1 G
1 1 C
1 2 B
1 2 D
3.99 4.56 G
The final output should have 6 pie charts on the X & Y with 1 containing 3 groups and 2 containing 3 groups.
My attempt:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
def draw_pie(dist,
             xpos,
             ypos,
             size,
             ax=None):
    if ax is None:
        fig, ax = plt.subplots(figsize=(10,8))
    # for incremental pie slices
    # (convert to a plain array so positional indexing works for a Series too)
    cumsum = np.cumsum(np.asarray(dist, dtype=float))
    cumsum = cumsum/ cumsum[-1]
    pie = [0] + cumsum.tolist()
    for r1, r2 in zip(pie[:-1], pie[1:]):
        angles = np.linspace(2 * np.pi * r1, 2 * np.pi * r2)
        x = [0] + np.cos(angles).tolist()
        y = [0] + np.sin(angles).tolist()
        xy = np.column_stack([x, y])
        ax.scatter([xpos], [ypos], marker=xy, s=size)
    return ax
fig, ax = plt.subplots(figsize=(40,40))
draw_pie([Group],'xval','yval',10000,ax=ax)
draw_pie([Group], 'xval', 'yval', 20000, ax=ax)
draw_pie([Group], 'xval', 'yval', 30000, ax=ax)
plt.show()
I'm not sure how to get 6 pie charts. If we group on XVAL and YVAL, there are 7 unique pairs. You can do something along these lines:
fig, ax = plt.subplots(figsize=(40,40))
for (x,y), d in df.groupby(['XVAL','YVAL']):
    dist = d['GROUP'].value_counts()
    draw_pie(dist, x, y, 10000*len(d), ax=ax)
plt.show()
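The grouping step can be inspected on its own, without any plotting. A minimal sketch, assuming the example data from the question is loaded into a DataFrame named df (the dictionary name pies is introduced here for illustration):

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({
    "XVAL": [1.3, 1.3, 4, 4, 2, 2, 1, 1, 1, 1, 3.99],
    "YVAL": [4.5, 4.5, 2, 6, 4, 4, 1, 1, 2, 2, 4.56],
    "GROUP": list("ABEAABGCBDG"),
})

# One entry per distinct (XVAL, YVAL) position: the category counts
# that would become the slices of the pie drawn at that position
pies = {
    pos: d["GROUP"].value_counts().sort_index()
    for pos, d in df.groupby(["XVAL", "YVAL"])
}

for pos, counts in pies.items():
    print(pos, counts.to_dict())
```

Each value in pies is what gets passed as dist to draw_pie, and the key supplies xpos and ypos.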

How to determine if any value in one array, is lower than any value in another array, for a given bin?

I am trying to compare different lines, to know if one is above the other one, and if not, at which x this change happens.
If the lines had the same x values and the same length, that would be very easy: I would only need the difference in the ys of the lines.
But the lines have different x values and the vectors do not have the same length, although the x intervals are the same for all curves.
As a very simple example I use the following data:
#curve 1: len = 9
x1 = np.array([5,6,7,8,9,10,11,12,13])
y1 = np.array([100,101,110,130,132,170,190,192,210])
#curve 2: len = 10
x2 = np.array([3,4,5,6,7,8,9,10,11,12])
y2 = np.array([90,210,211,250,260,261,265,180,200,210])
#curve 3: len = 8
x3 = np.array([7.3,8.3,9.3,10.3,11.3,12.3,13.3,14.3])
y3 = np.array([300,250,270,350,380,400,390,380])
They are supposed to be 2 regression lines. In this simple example, the result is supposed to be that Curve 2 has higher values than curve 1 in all x range.
I was trying to bin x in the range of 2.5-12.5 with the bin length of 1 to compare the corresponding ys in each bin.
My actual data are big, and this comparison needs to be done many times, so I need to find a solution that does not take much time.
Plot
Plot of data for given x-axis
plt.figure(figsize=(6, 6))
plt.plot(x1, y1, marker='o', label='y1')
plt.plot(x2, y2, marker='o', label='y2')
plt.plot(x3, y3, marker='o', label='y3')
plt.xticks(range(15))
plt.legend()
plt.grid()
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
Functions
get_new_x uses np.digitize to re-bin the x-axis values.
get_comparison adds a column of Booleans for each pair of columns compared.
Currently each new column is added to the main dataframe, df; however, this can be changed to build a separate comparison dataframe.
combs is a list of column combinations:
[Index(['y1', 'y2'], dtype='object'), Index(['y2', 'y3'], dtype='object')]
from typing import List
# function to create the bins
def get_bins(x_arrays: List[np.array]) -> np.array:
    bin_len = np.diff(x_arrays[0][:2])[0]  # calculate bin length as a scalar
    all_x = np.concatenate(x_arrays)  # join arrays
    min_x = min(all_x)  # get min
    max_x = max(all_x)  # get max
    return np.arange(min_x, max_x + bin_len, bin_len)
# function using np.digitize to bin the old x-axis into new bins
def get_new_x(x_arrays: List[np.array]) -> List[np.array]:
    bins = get_bins(x_arrays)  # get the bins
    x_new = list()
    for x in x_arrays:
        x_new.append(bins[np.digitize(np.round(x), bins, right=True)])  # determine bins
    return x_new
# function to create dataframe for arrays with new x-axis as index
def get_df(x_arrays: List[np.array], y_arrays: List[np.array]) -> pd.DataFrame:
    x_new = get_new_x(x_arrays)
    return pd.concat([pd.DataFrame(y, columns=[f'y{i+1}'], index=x_new[i]) for i, y in enumerate(y_arrays)], axis=1)
# compare each successive column of the dataframe
# if the left column is greater than the right column, then True
def get_comparison(df: pd.DataFrame):
    cols = df.columns
    combs = [cols[i:i+2] for i in range(0, len(cols), 1) if i < len(cols)-1]
    for comb in combs:
        df[f'{comb[0]} > {comb[1]}'] = df[comb[0]] > df[comb[1]]
Call functions:
import numpy as np
import pandas as pd
# put the arrays into a list
y = [y1, y2, y3]
x = [x1, x2, x3]
# call get_df
df = get_df(x, y)
# call get_comparison
get_comparison(df)
# get only the index of True values with Boolean indexing
for col in df.columns[3:]:
    vals = df.index[df[col]].tolist()
    if vals:
        print(f'{col}: {vals}')
[out]:
y2 > y3: [8.0]
display(df)
y1 y2 y3 y1 > y2 y2 > y3
3.0 NaN 90.0 NaN False False
4.0 NaN 210.0 NaN False False
5.0 100.0 211.0 NaN False False
6.0 101.0 250.0 NaN False False
7.0 110.0 260.0 300.0 False False
8.0 130.0 261.0 250.0 False True
9.0 132.0 265.0 270.0 False False
10.0 170.0 180.0 350.0 False False
11.0 190.0 200.0 380.0 False False
12.0 192.0 210.0 400.0 False False
13.0 210.0 NaN 390.0 False False
14.0 NaN NaN 380.0 False False
Plot
fig, ax = plt.subplots(figsize=(8, 6))
# add markers for problem values
for i, col in enumerate(df.columns[3:], 1):
    vals = df.iloc[:, i][df[col]]
    if not vals.empty:
        ax.scatter(vals.index, vals.values, color='red', s=110, label='bad')
df.iloc[:, :3].plot(marker='o', ax=ax) # plot the dataframe
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(range(16))
plt.title('y-values plotted against rebinned x-values')
plt.grid()
plt.show()
This is the answer I had in mind when I first asked the question, but I couldn't make it work back then. My idea is to bin the y values of each curve according to x and compare them within each bin. So, as an example, I have 3 curves and I want to compare them. The only thing these curves have in common is delta x (the bin length), which is 1 here.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#curve 1
x1 = np.array([5,6,7,8,9,10,11,12,13])
y1 = np.array([100,101,110,130,132,170,190,192,210])
#curve 2
x2 = np.array([3,4,5,6,7,8,9,10,11,12])
y2 = np.array([90,210,211,250,260,261,265,180,200,210])
#curve 3
x3 = np.array([7.3,8.3,9.3,10.3,11.3,12.3,13.3,14.3])
y3 = np.array([300,250,270,350,380,400,390,380])
bin_length = 1
# x values have same intervals both in x1 and x2
x_min = min(x1[0],x2[0],x3[0])-bin_length/2
x_max = max(x1[-1],x2[-1],x3[-1])+bin_length/2
bins = np.arange(x_min,x_max+bin_length,bin_length)
# bin mid points to use as index
bin_mid = []
for i in range(len(bins)-1):
    # compute mid point of the bins
    bin_mid.append((bins[i] + bins[i+1])/2)
# This function bins y based on binning x
def bin_fun(x,y,bins,bin_length):
    c = list(zip(x, y))
    # define final output of the function
    final_y_binning = []
    # define a list for holding members of each bin
    bined_y_members = []
    # loop over the bins
    for i in range(len(bins)-1):
        # compute high and low threshold of the bins
        low_threshold = bins[i]
        high_threshold = bins[i+1]
        # bin y according to x
        for member in c:
            if (member[0] < high_threshold and member[0] >= low_threshold):
                bined_y_members.append(member[1])
        final_y_binning.append(bined_y_members)
        # reset the container of the bin members
        bined_y_members=[]
    df = pd.DataFrame(final_y_binning)
    return(df)
Y1 = bin_fun(x1,y1,bins, bin_length)
Y1.columns =[1]
Y2 = bin_fun(x2,y2,bins, bin_length)
Y2.columns =[2]
Y3 = bin_fun(x3,y3,bins, bin_length)
Y3.columns =[3]
# DataFrame.append was removed in pandas 2.0; join the columns with concat instead
binned_y = pd.concat([Y1, Y2, Y3], axis=1)
binned_y.index = bin_mid
print(binned_y)
# comparing curve 2 and curve 1
for i in binned_y.index:
    if (binned_y.loc[i][2]-binned_y.loc[i][1]<0):
        print(i)
# comparing curve 3 and curve 2
for i in binned_y.index:
    if (binned_y.loc[i][3]-binned_y.loc[i][2]<0):
        print(i)
This prints 8.0, which is the bin index where y3 < y2.
binned_y
1 2 3
3.0 NaN 90.0 NaN
4.0 NaN 210.0 NaN
5.0 100.0 211.0 NaN
6.0 101.0 250.0 NaN
7.0 110.0 260.0 300.0
8.0 130.0 261.0 250.0
9.0 132.0 265.0 270.0
10.0 170.0 180.0 350.0
11.0 190.0 200.0 380.0
12.0 192.0 210.0 400.0
13.0 210.0 NaN 390.0
14.0 NaN NaN 380.0
15.0 NaN NaN NaN
plot
binned_y.plot(marker='o', figsize=(6, 6)) # plot the dataframe
plt.legend(labels=['y1', 'y2', 'y3'], bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(range(16))
plt.grid()
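Since the question asks for something fast, the nested loops could be replaced by a vectorised lookup. The following is only a sketch of that idea, not part of the answer above: it uses np.digitize for the [low, high) bins, assumes at most one point of each curve falls into a bin, and the helper name bin_series is hypothetical.

```python
import numpy as np
import pandas as pd

def bin_series(x, y, bins, name):
    """Index each y by the midpoint of the bin [low, high) that contains its x."""
    idx = np.digitize(x, bins) - 1            # position of the bin for each x
    mids = (bins[:-1] + bins[1:]) / 2         # midpoint of every bin
    return pd.Series(y, index=mids[idx], name=name)

x1 = np.array([5, 6, 7, 8, 9, 10, 11, 12, 13])
y1 = np.array([100, 101, 110, 130, 132, 170, 190, 192, 210])
x2 = np.array([3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
y2 = np.array([90, 210, 211, 250, 260, 261, 265, 180, 200, 210])
x3 = np.array([7.3, 8.3, 9.3, 10.3, 11.3, 12.3, 13.3, 14.3])
y3 = np.array([300, 250, 270, 350, 380, 400, 390, 380])

# Bin edges 2.5, 3.5, ..., 15.5, so the midpoints are 3.0, 4.0, ..., 15.0
bins = np.arange(2.5, 16.0, 1.0)

binned = pd.concat(
    [bin_series(x1, y1, bins, "y1"),
     bin_series(x2, y2, bins, "y2"),
     bin_series(x3, y3, bins, "y3")],
    axis=1,
).sort_index()

# Bins where the upper curve drops below the lower one (NaN compares as False)
y2_below_y1 = binned.index[binned["y2"] < binned["y1"]].tolist()
y3_below_y2 = binned.index[binned["y3"] < binned["y2"]].tolist()
print(y2_below_y1, y3_below_y2)
```

If several points of one curve can land in the same bin, they would need to be aggregated first (for example with a groupby on the bin midpoint) before the columns are joined.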

Quantify relative position of coordinates - python

I have a df of coordinates representing points at various timescales. I want to calculate the average position of these points in relation to each other.
To achieve this, I'm aiming to calculate the space between each point and the rest of the points. I'm then hoping to average these points.
The following calculates the distance between each pair of points.
import pandas as pd
from scipy.spatial import distance
import itertools
df = pd.DataFrame({
'Time' : [1,1,1,2,2,2,3,3,3],
'id' : ['A','B','C','A','B','C','A','B','C'],
'X' : [1.0,3.0,2.0,2.0,4.0,3.0,3.0,5.0,4.0],
'Y' : [1.0,1.0,0.5,2.0,2.0,2.5,3.0,3.0,3.0],
})
ids = list(df['id'])
# get the points
points = df[["X", "Y"]].values
# calculate distance of each point from every other point.
# row i contains distances for point i.
# distances[i, j] contains distance of point i from point j.
distances = distance.cdist(points, points, "euclidean")
distances = distances.flatten()
# get the start and end points
cartesian = list(itertools.product(ids, ids))
data = dict(
start_region = [x[0] for x in cartesian],
end_region = [x[1] for x in cartesian],
distance = distances
)
df1 = pd.DataFrame(data)
All I really need to output is:
Time start_point end_point X Y
0 1 A B 2.0 0.0
1 1 A C 1.0 -0.5
2 1 B C -1.0 -0.5
3 2 A B 2.0 0.0
4 2 A C 1.0 0.5
5 2 B C -1.0 0.5
6 3 A B 2.0 0.0
7 3 A C 1.0 0.0
8 3 B C -1.0 0.0
So the average position of these points in relation to each other would be the green coordinates.
But if I average the dataset above it displays:
I understand how this occurs. It's not referencing the other points.
Here's my take on it:
import itertools
def relative_dist(gp):
    combs = list(itertools.combinations(gp.index, 2))
    df_gp = pd.concat([gp.loc[tup,:].diff() for tup in combs], keys=combs).dropna()
    return df_gp
df_dist = (df.set_index('id').groupby('Time')[['X','Y']].apply(relative_dist)
             .droplevel('id').rename_axis(['Time','start_point','end_point'])
             .reset_index())
Out[341]:
Time start_point end_point X Y
0 1 A B 2.0 0.0
1 1 A C 1.0 -0.5
2 1 B C -1.0 -0.5
3 2 A B 2.0 0.0
4 2 A C 1.0 0.5
5 2 B C -1.0 0.5
6 3 A B 2.0 0.0
7 3 A C 1.0 0.0
8 3 B C -1.0 0.0
df_avg = df_dist.groupby(['start_point','end_point'], as_index=False)[['X','Y']].mean()
Out[347]:
start_point end_point X Y
0 A B 2.0 0.0
1 A C 1.0 0.0
2 B C -1.0 0.0
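For comparison, the same table can be built with a self-merge on Time. This is an alternative sketch rather than the approach above; it assumes each id appears once per Time, and the intermediate names pairs and rel are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Time': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'id': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'X': [1.0, 3.0, 2.0, 2.0, 4.0, 3.0, 3.0, 5.0, 4.0],
    'Y': [1.0, 1.0, 0.5, 2.0, 2.0, 2.5, 3.0, 3.0, 3.0],
})

# Pair every point with every other point that shares its timestamp
pairs = df.merge(df, on='Time', suffixes=('_start', '_end'))
# Keep each unordered pair once (A-B but not B-A or A-A)
pairs = pairs[pairs['id_start'] < pairs['id_end']].copy()

# Offset of the end point relative to the start point
pairs['X'] = pairs['X_end'] - pairs['X_start']
pairs['Y'] = pairs['Y_end'] - pairs['Y_start']

rel = (pairs[['Time', 'id_start', 'id_end', 'X', 'Y']]
       .rename(columns={'id_start': 'start_point', 'id_end': 'end_point'})
       .reset_index(drop=True))

# Average offset of each pair over all timestamps
avg = rel.groupby(['start_point', 'end_point'], as_index=False)[['X', 'Y']].mean()
print(avg)
```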
Here's a suggestion on how to visualise the relative positions of your points. For each timestamp, I would plot an ellipse at position (X_, Y_) where:
X_ is the mean of your points' X coordinates for that timestamp.
Y_ is the mean of your points' Y coordinates for that timestamp.
the width of the ellipse equals the variance of your points' X coordinates for that timestamp.
the height of the ellipse equals the variance of your points' Y coordinates for that timestamp.
That way, at a glance and for each timestamp, you can read some very high-level statistics about the distribution of your coordinates at that timestamp.
Here's some code to generate such a visualisation:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.patches import Ellipse
# sample data with 4 timestamps
df = pd.DataFrame({
'Time' : [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
'id' : ['A','B','C','D','A','B','C','D','A','B','C','D','A','B','C','D'],
'X' : [1,2,1,2,1,2,1,2,4,4,3,4,10,8,5,6],
'Y' : [1,1,3,3,1,1,2,2,5,5,8,5,6,6,7,6],
})
# for each timestamp, compute means and variances within all samples for that timestamp
means = df.groupby("Time")[["X", "Y"]].mean()
variances = df.groupby("Time")[["X", "Y"]].var()
df_ = pd.concat([means, variances], axis=1)
df_.columns = ["X_", "Y_", "var_X", "var_Y"]
# plot
fig, ax = plt.subplots(subplot_kw={'aspect': 'equal'})
for row in df_.itertuples():
    ellipse = Ellipse(xy=(row.X_, row.Y_),  # position of the ellipse is (X,Y)
                      width=row.var_X,      # width helps to get a grasp on X variance
                      height=row.var_Y,     # height helps to get a grasp on Y variance
                      angle=0)
    ax.add_artist(ellipse)
    ellipse.set_clip_box(ax.bbox)
    ellipse.set_alpha(.4)
    plt.text(x=row.X_+0.2, y=row.Y_+0.2, s=f"t={row.Index}")  # just add timestamp legend
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
plt.show()
Which would look like this:
What do you think? Another idea could be to make a GIF (in case the timestamp averages overlap too much).
