Altair: Extract and display regression coefficients - python

This question addresses how to access and display the R² value using mark_text().
I am interested in accessing and displaying the coefficients instead. Replacing rSquared with coef yields a flattened array containing both the intercept and slope, as described in the documentation.
How can I index into this array to display only one of the values, e.g. the slope? I wondered if the mark_text() step should be preceded by a transform (possibly transform_filter()), or if altair.Text() could be used.
I am aware of other approaches which involve determining this information separately then adding it as an additional layer.
Apologies if this is a very straightforward question. Thanks in advance.
import altair as alt
import pandas as pd
import numpy as np

np.random.seed(42)
x = np.linspace(0, 10)
y = x - 5 + np.random.randn(len(x))
df = pd.DataFrame({'x': x, 'y': y})

chart = alt.Chart(df).mark_point().encode(
    x='x',
    y='y'
)

line = chart.transform_regression('x', 'y').mark_line()

params = alt.Chart(df).transform_regression(
    'x', 'y', params=True
).mark_text(align='left').encode(
    x=alt.value(20),  # pixels from left
    y=alt.value(20),  # pixels from top
    text='rSquared:N',
    # text='coef:N'    # flattened array
    # text='coef[0]:N' # fails
)

chart + line + params

You can access this using a calculate transform:
params = alt.Chart(df).transform_regression(
    'x', 'y', params=True
).transform_calculate(
    intercept='datum.coef[0]',
    slope='datum.coef[1]',
).mark_text(align='left').encode(
    x=alt.value(20),  # pixels from left
    y=alt.value(20),  # pixels from top
    text='intercept:N'
)

chart + line + params
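If you also want to control how the number is displayed, the Vega expression function format() (d3-format syntax) can build the label string inside the same calculate transform. A minimal sketch:

# sketch: format the slope to two decimals inside the calculate transform
params = alt.Chart(df).transform_regression(
    'x', 'y', params=True
).transform_calculate(
    label='"slope: " + format(datum.coef[1], ".2f")'
).mark_text(align='left').encode(
    x=alt.value(20),
    y=alt.value(20),
    text='label:N'
)

chart + line + params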

Related

Bar polar with areas proportional to values

Based on this question I have the plot below.
The issue is that plotly misaligns the proportion between plot area and data value: higher values (e.g. going from 0.5 to 0.6) lead to a large increase in area (the big dark green block), whereas going from 0 to 0.1 is barely noticeable, even though the data increment is the same 0.1 in both cases.
import numpy as np
import pandas as pd
import plotly.express as px

df = px.data.wind()
df_test = df[df["strength"] == '0-1']
df_test_sectors = pd.DataFrame(columns=df_test.columns)

## this only works if each group has one row
for direction, df_direction in df_test.groupby('direction'):
    frequency_stop = df_direction['frequency'].tolist()[0]
    frequencies = np.arange(0.1, frequency_stop + 0.1, 0.1)
    df_sector = pd.DataFrame({
        'direction': [direction] * len(frequencies),
        'strength': ['0-1'] * len(frequencies),
        'frequency': frequencies
    })
    df_test_sectors = pd.concat([df_test_sectors, df_sector])

df_test_sectors = df_test_sectors.reset_index(drop=True)
df_test_sectors['direction'] = pd.Categorical(
    df_test_sectors['direction'],
    df_test.direction.tolist()  # sort the directions into the same order as those in df_test
)
df_test_sectors['frequency'] = df_test_sectors['frequency'].astype(float)
df_test_sectors = df_test_sectors.sort_values(['direction', 'frequency'])

fig = px.bar_polar(df_test_sectors, r='frequency', theta='direction', color='frequency', color_continuous_scale='YlGn')
fig.show()
Is there any way to make the block areas proportional to the data, keeping a more "truthful" alignment between the aesthetics and the actual values? That is, the closer a block is to the center, the "longer" it should be, so that all blocks have equal area. Is there any option in Plotly for this?
You can construct a new column called r_outer_diff that stores radius differences (as you go from the innermost to the outermost sector for each direction) to ensure the area of each annular sector is equal. The values for this column can be calculated inside the loop we are using to construct df_test_sectors, using the following steps:
- Start with the inner sector of r = 0.1 and take its area as the reference, since we want all subsequent sectors to have the same area.
- To construct each next sector, find r_outer so that pi*(r_outer**2 - r_inner**2) * (sector angle/360) = reference sector area (the area of an annular sector depends on the difference of the squared radii).
- Solve this formula for r_outer at each iteration of the loop, and use that r_outer as r_inner for the next iteration. Since plotly draws the sum of all radii in a stacked bar, we keep track of r_outer - r_inner at each iteration, and that is the value stored in the r_outer_diff column.
Putting this into code:
import numpy as np
import pandas as pd
import plotly.express as px

df = px.data.wind()
df_test = df[df["strength"] == '0-1']
df_test_sectors = pd.DataFrame(columns=df_test.columns)

## this only works if each group has one row
for direction, df_direction in df_test.groupby('direction'):
    frequency_stop = df_direction['frequency'].tolist()[0]
    frequencies = np.arange(0.1, frequency_stop + 0.1, 0.1)
    r_base = 0.1
    sector_area = np.pi * r_base**2 * (16/360)
    ## we can populate the list with the first radius of 0.1
    ## since that will stay fixed
    ## then we use the formula: sector_area = pi*(r_outer**2 - r_inner**2) * (sector angle/360)
    r_adjusted_for_area = [0.1]
    r_outer_diffs = [0.1]
    for i in range(len(frequencies) - 1):
        r_inner = r_adjusted_for_area[-1]
        inner_sector_area = np.pi * r_inner**2 * (16/360)
        outer_sector_area = inner_sector_area + sector_area
        r_outer = np.sqrt(outer_sector_area * (360/16) / np.pi)
        r_outer_diff = r_outer - r_inner
        r_adjusted_for_area.append(r_outer)
        r_outer_diffs.append(r_outer_diff)
    df_sector = pd.DataFrame({
        'direction': [direction] * len(frequencies),
        'strength': ['0-1'] * len(frequencies),
        'frequency': frequencies,
        'r_outer_diff': r_outer_diffs
    })
    df_test_sectors = pd.concat([df_test_sectors, df_sector])

df_test_sectors = df_test_sectors.reset_index(drop=True)
df_test_sectors['direction'] = pd.Categorical(
    df_test_sectors['direction'],
    df_test.direction.tolist()  # sort the directions into the same order as those in df_test
)
df_test_sectors['frequency'] = df_test_sectors['frequency'].astype(float)
df_test_sectors = df_test_sectors.sort_values(['direction', 'frequency'])

fig = px.bar_polar(df_test_sectors, r='r_outer_diff', theta='direction', color='frequency', color_continuous_scale='YlGn')
fig.show()
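As a quick sanity check (my addition, not part of the original answer), you can verify after the loop that the cumulative radii produce equal annular areas. This sketch reuses r_outer_diffs, which after the loop holds the values for the last direction:

# check: every annular sector area should be ~equal to sector_area
radii = np.cumsum(r_outer_diffs)
inner = np.concatenate(([0.0], radii[:-1]))
areas = np.pi * (radii**2 - inner**2) * (16/360)
print(areas)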

How to reorder or sort categories based on another variable values in python plotnine?

I am trying to use plotnine in Python and couldn't find an equivalent of R's fct_reorder. Basically, I would like the categories of the categorical variable to be arranged on the x axis by increasing value of another variable, but I am unable to do so.
import numpy as np
import pandas as pd
from plotnine import *

test_df = pd.DataFrame({'catg': ['a', 'b', 'c', 'd', 'e'],
                        'val': [3, 1, 7, 2, 5]})
test_df['catg'] = test_df['catg'].astype('category')
When I sort with .sort_values() and then plot, it doesn't rearrange the categories on the x axis:
test_df = test_df.sort_values(by=['val']).reset_index(drop=True)

(ggplot(data=test_df,
        mapping=aes(x=test_df.iloc[:, 0], y=test_df['val']))
 + geom_line(linetype=2)
 + geom_point()
 + labs(title=str('Weight of Evidence by ' + test_df.columns[0]),
        x=test_df.columns[0],
        y='Weight of Evidence')
 + theme(axis_text_x=element_text(angle=0))
)
Desired output: the same plot, but with the categories on the x axis ordered by increasing val, i.e. b, d, a, e, c.
I saw this SO post where they are using reorder, but I couldn't get any reorder in plotnine to work.
Plotnine does have reorder. It is an internal function available when creating an aesthetic mapping, just like factor.
In your example you could use it like this:
ggplot(data=test_df, mapping=aes(x='reorder(catg, val)', y='val'))
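Applied to the full plot from the question, that looks roughly like the sketch below. Note the group=1, which geom_line generally needs in order to connect points across a discrete axis:

# reorder catg by val inside the aesthetic mapping
(ggplot(data=test_df,
        mapping=aes(x='reorder(catg, val)', y='val', group=1))
 + geom_line(linetype=2)
 + geom_point()
 + labs(title='Weight of Evidence by catg',
        x='catg',
        y='Weight of Evidence')
)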

Calculate relative phase between two angles - python

I'm trying to calculate the relative phase between two time series of angles. In the code below, the angles are the rotations derived from the xy points associated with Label A and Label B. The angles move in a similar direction for the first 3 time points and then deviate for the remaining 3 time points.
My understanding was that in a relative phase calculation using a Hilbert transform, values closer to 0° indicate coordination, or in-phase movement, while values closer to 180° indicate asynchronous, anti-phase patterns. Yet the results below don't show this. Why not?
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.signal import hilbert

df = pd.DataFrame({
    'Time': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    'Label': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'x': [-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    'y': [-2.0, -1.0, -2.0, -1.0, -2.0, -1.0, -3.0, 0.0, -4.0, 1.0, -5.0, 2.0],
})

x = df.groupby('Label')['x'].diff().fillna(0).astype(float)
y = df.groupby('Label')['y'].diff().fillna(0).astype(float)
df['Rotation'] = np.arctan2(y, x)
df['Angle'] = np.degrees(df['Rotation'])

df_A = df[df['Label'] == 'A'].reset_index(drop=True)
df_B = df[df['Label'] == 'B'].reset_index(drop=True)

y1 = df_A['Angle'].values
y2 = df_B['Angle'].values

ang1 = np.angle(hilbert(y1), deg=False)
ang2 = np.angle(hilbert(y2), deg=False)

f, ax = plt.subplots(3, 1, figsize=(20, 5), sharex=True)
ax[0].plot(y1, color='r', label='y1')
ax[0].plot(y2, color='b', label='y2')
ax[0].legend(bbox_to_anchor=(0., 1.02, 1., .102), ncol=2)
ax[1].plot(ang1, color='r')
ax[1].plot(ang2, color='b')
ax[1].set(title='Angle at each Timepoint')

phase_synchrony = 1 - np.sin(np.abs(ang1 - ang2) / 2)
ax[2].plot(phase_synchrony)
ax[2].set(ylim=[0, 1.1], title='Instantaneous Phase Synchrony', xlabel='Time', ylabel='Phase Synchrony')
plt.tight_layout()
plt.show()
By your description I would simply use the angle difference directly (converting to radians, since np.sin expects radians and your Angle column is in degrees):
phase_synchrony = 1 - np.sin(np.abs(np.radians(y1 - y2)) / 2)
The analytic representation via Hilbert transform applies when you have only the real part of a signal you know (or assume, based on reasonable principles) to be analytic; under such conditions you can find an imaginary part that makes the resulting function analytic.
But in your case you already have x and y, so you can calculate the angle directly, as you have done already.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.DataFrame({
    'Time': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    'Label': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'x': [-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    'y': [-2.0, -1.0, -2.0, -1.0, -2.0, -1.0, -3.0, 0.0, -4.0, 1.0, -5.0, 2.0],
})

x = df.groupby('Label')['x'].diff().fillna(0).astype(float)
y = df.groupby('Label')['y'].diff().fillna(0).astype(float)
df['Rotation'] = np.arctan2(y, x)
df['Angle'] = np.degrees(df['Rotation'])

df_A = df[df['Label'] == 'A'].reset_index(drop=True)
df_B = df[df['Label'] == 'B'].reset_index(drop=True)

y1 = df_A['Angle'].values
y2 = df_B['Angle'].values

# no need to compute the Hilbert transforms here

f, ax = plt.subplots(3, 1, figsize=(20, 5), sharex=True)
ax[0].plot(y1, color='r', label='y1')
ax[0].plot(y2, color='b', label='y2')
ax[0].legend(bbox_to_anchor=(0., 1.02, 1., .102), ncol=2)
# plot the angles themselves instead of the Hilbert phases
ax[1].plot(y1, color='r')
ax[1].plot(y2, color='b')
ax[1].set(title='Angle at each Timepoint')

# all I changed: use the angle difference directly, in radians
phase_synchrony = 1 - np.sin(np.abs(np.radians(y1 - y2)) / 2)
ax[2].plot(phase_synchrony)
ax[2].set(ylim=[0, 1.1], title='Instantaneous Phase Synchrony', xlabel='Time', ylabel='Phase Synchrony')
plt.tight_layout()
plt.show()
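One further refinement worth considering (my addition, not part of the original answer): angle differences should normally be wrapped to [-180°, 180°] before comparing, otherwise a pair like +179° and -179° looks maximally out of phase when it is actually nearly in phase. A minimal sketch using complex exponentials to do the wrapping:

# wrap the phase difference to [-pi, pi] before scoring synchrony
rel_phase = np.angle(np.exp(1j * np.radians(y1 - y2)))
phase_synchrony = 1 - np.abs(np.sin(rel_phase / 2))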

How to isolate data that deviate 2 and 3 sigma from the mean and mark them in a plot in Python?

I am reading from a dataset which looks like the following when plotted in matplotlib and fitted with a best-fit curve using linear regression.
The sample of data looks like following:
# ID X Y px py pz M R
1.04826492772e-05 1.04828050287e-05 1.048233088e-05 0.000107002791008 0.000106552433081 0.000108704469007 387.02 4.81947797625e+13
1.87380963036e-05 1.87370588085e-05 1.87372620448e-05 0.000121616280029 0.000151924707761 0.00012371156585 428.77 6.54636174067e+13
3.95579877816e-05 3.95603773653e-05 3.95610756809e-05 0.000163470663023 0.000265203868883 0.000228031803626 470.74 8.66961875758e+13
My code looks like the following:
import numpy as np
import scipy as sp
import scipy.stats
import matplotlib.pyplot as plt

# Regression Function
def regress(x, y):
    # Return a tuple of predicted y values and parameters for linear regression.
    p = sp.stats.linregress(x, y)
    b1, b0, r, p_val, stderr = p
    y_pred = np.polyval([b1, b0], x)  # np.polyval: scipy's alias was removed in newer SciPy
    return y_pred, p

# plotting z
xz, yz = M, Y_z  # data, non-transformed
y_pred, _ = regress(xz, np.log(yz))  # change here # transformed input
plt.semilogy(xz, yz, marker='o', color='b', markersize=4, linestyle='None', label="l.o.s within R500")
plt.semilogy(xz, np.exp(y_pred), "b", label='best fit')  # transformed output
However, I can see a lot of upward scatter in the data, and the best fit curve is affected by it. So first I want to isolate the data points which are 2 and 3 sigma away from the mean, and mark them with a circle around them.
Then I want to take the best fit curve considering only the points which fall within 1 sigma of the mean.
Is there a good function in Python which can do that for me?
In addition, can I also isolate those rows from my actual dataset? For example, if the third row in the sample input is a 2 sigma deviation, I would like that row as output too, to save and investigate later.
Your help is most appreciated.
Here's some code that goes through the data in a given number of windows, calculates statistics in each window, and separates the data into well-behaved and misbehaved lists.
Hope this helps.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

num_data = 10000
fake_data_x = np.sort(12.8 + np.random.random(num_data))
fake_data_y = np.exp(fake_data_x) + np.random.normal(0, scale=50000, size=num_data)

# Regression Function
def regress(x, y):
    # Return a tuple of predicted y values and parameters for linear regression.
    p = stats.linregress(x, y)
    b1, b0, r, p_val, stderr = p
    y_pred = np.polyval([b1, b0], x)  # np.polyval: scipy's alias was removed in newer SciPy
    return y_pred, p

# plotting z
xz, yz = fake_data_x, fake_data_y  # data, non-transformed
y_pred, _ = regress(xz, np.log(yz))  # change here # transformed input

plt.figure()
plt.semilogy(xz, yz, marker='o', color='b', markersize=4, linestyle='None', label="l.o.s within R500")
plt.semilogy(xz, np.exp(y_pred), "b", label='best fit')  # transformed output
plt.show()

num_bin_intervals = 10  # approx number of averaging windows
window_boundaries = np.linspace(min(fake_data_x), max(fake_data_x), int(len(fake_data_x) / num_bin_intervals))  # window boundaries

y_good = []  # list to collect the "well-behaved" y-axis data
x_good = []  # list to collect the "well-behaved" x-axis data
y_outlier = []
x_outlier = []

for i in range(len(window_boundaries) - 1):
    # create a boolean mask to select the data within the averaging window
    window_indices = (fake_data_x <= window_boundaries[i+1]) & (fake_data_x > window_boundaries[i])
    # separate the pieces of data in the window
    fake_data_x_slice = fake_data_x[window_indices]
    fake_data_y_slice = fake_data_y[window_indices]
    # calculate the mean and standard deviation of y in the window
    y_mean = np.mean(fake_data_y_slice)
    y_std = np.std(fake_data_y_slice)
    # choose and select the outliers
    y_outliers = fake_data_y_slice[np.abs(fake_data_y_slice - y_mean) >= 2*y_std]
    x_outliers = fake_data_x_slice[np.abs(fake_data_y_slice - y_mean) >= 2*y_std]
    # choose and select the good ones
    y_goodies = fake_data_y_slice[np.abs(fake_data_y_slice - y_mean) < 2*y_std]
    x_goodies = fake_data_x_slice[np.abs(fake_data_y_slice - y_mean) < 2*y_std]
    # extend the lists with all the good and the bad
    y_good.extend(list(y_goodies))
    y_outlier.extend(list(y_outliers))
    x_good.extend(list(x_goodies))
    x_outlier.extend(list(x_outliers))

plt.figure()
plt.semilogy(x_good, y_good, 'o')
plt.semilogy(x_outlier, y_outlier, 'r*')
plt.show()
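The question also asks for the 2 sigma and 3 sigma points separately. A small extension of the window loop above (my sketch, reusing the same per-window slice variables inside the loop body) splits the outliers into bands:

# sketch: split outliers in each window into 2-sigma and 3-sigma bands
dev = np.abs(fake_data_y_slice - y_mean)
band_2sigma = (dev >= 2*y_std) & (dev < 3*y_std)  # between 2 and 3 sigma
band_3sigma = dev >= 3*y_std                      # beyond 3 sigma
y_2sigma = fake_data_y_slice[band_2sigma]
y_3sigma = fake_data_y_slice[band_3sigma]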

How to visualize high volume 3 dimensional data

I have a data set like the following:
import numpy as np
from pandas import DataFrame
mypos = np.random.randint(10, size=(100, 2))
mydata = DataFrame(mypos, columns=['x', 'y'])
myres = np.random.rand(100, 1)
mydata['res'] = myres
The res variable is continuous, the x and y variables are integers representing positions (and therefore largely repetitive), and res represents a kind of correlation between pairs of positions.
I am wondering what are the best ways of visualizing this data set?
Possible approaches already considered:
Scatter plot, with the res variable visualized by a color gradient.
Parallel coordinates plot.
The first approach is problematic when the number of positions gets large, because high values of the res variable (which are the values we care about) would be drowned in a sea of small dots.
The second approach could be promising, but I am having trouble producing it.
I have tried the parallel_coordinates function from the pandas module,
but it's not behaving as I would like it to. (see this question here:
parallel coordinates plot for continous data in pandas
)
Here is an approach in R; I hope it helps you find a solution. Good luck.
# you need this package for the colour palette
library(RColorBrewer)

# create the random data
dd <- data.frame(
  x = round(runif(100, 0, 10), 0),
  y = round(runif(100, 0, 10), 0),
  res = runif(100)
)

# pick the number of colours (granularity of colour scale)
nColors <- 100

# create the colour palette
cols <- colorRampPalette(colors = c("white", "blue"))(nColors)

# get a zScale for the colours
zScale <- seq(min(dd$res), max(dd$res), length.out = nColors)

# function that returns the nearest colour given a value of res
findNearestColour <- function(x) {
  colorIndex <- which(abs(zScale - x) == min(abs(zScale - x)))
  return(cols[colorIndex])
}

# the first plot is the scatterplot
### this has problems because points come out on top of each other
plot(y ~ x, dd, type = "n")
for (i in 1:dim(dd)[1]) {
  with(dd[i, ],
       points(y ~ x, col = findNearestColour(res), pch = 19)
  )
}

# this is your parallel coordinates plot (a little better)
plot(1, 1, xlim = c(0, 1), ylim = c(min(dd$x, dd$y), max(dd$x, dd$y)),
     type = "n", axes = FALSE, ylab = "", xlab = "")
for (i in 1:dim(dd)[1]) {
  with(dd[i, ],
       segments(0, x, 1, y, col = findNearestColour(res))
  )
}
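Since the question is about Python, here is a rough Python translation of the scatterplot idea (my sketch, not from the answer above). Because the integer positions repeat, aggregating res per (x, y) cell and showing the result as a heatmap also sidesteps the overplotting problem; taking the max per cell keeps the high values visible:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

mypos = np.random.randint(10, size=(100, 2))
mydata = pd.DataFrame(mypos, columns=['x', 'y'])
mydata['res'] = np.random.rand(100)

# aggregate repeated positions; max keeps high values from being drowned out
grid = mydata.pivot_table(index='y', columns='x', values='res', aggfunc='max')

plt.imshow(grid, origin='lower', cmap='Blues')
plt.colorbar(label='res')
plt.xlabel('x')
plt.ylabel('y')
plt.show()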
