Based on this question, I have the plot below.
The issue is that Plotly misaligns the proportion between plot area and data value: higher values (e.g. going from 0.5 to 0.6) add a large amount of area (the big dark green block), whereas going from 0 to 0.1 is barely noticeable, even though the actual data increment is the same 0.1.
import numpy as np
import pandas as pd
import plotly.express as px
df = px.data.wind()
df_test = df[df["strength"]=='0-1']
df_test_sectors = pd.DataFrame(columns=df_test.columns)
## this only works if each group has one row
for direction, df_direction in df_test.groupby('direction'):
    frequency_stop = df_direction['frequency'].tolist()[0]
    frequencies = np.arange(0.1, frequency_stop+0.1, 0.1)
    df_sector = pd.DataFrame({
        'direction': [direction]*len(frequencies),
        'strength': ['0-1']*len(frequencies),
        'frequency': frequencies
    })
    df_test_sectors = pd.concat([df_test_sectors, df_sector])
df_test_sectors = df_test_sectors.reset_index(drop=True)
df_test_sectors['direction'] = pd.Categorical(
    df_test_sectors['direction'],
    df_test.direction.tolist()  # sort the directions into the same order as those in df_test
)
df_test_sectors['frequency'] = df_test_sectors['frequency'].astype(float)
df_test_sectors = df_test_sectors.sort_values(['direction', 'frequency'])
fig = px.bar_polar(df_test_sectors, r='frequency', theta='direction', color='frequency', color_continuous_scale='YlGn')
fig.show()
Is there any way to make the areas of the blocks proportional to the data, to keep a more "truthful" alignment between the aesthetics and the actual values? That is, the closer a block is to the center, the "longer" it should be, so that all blocks have equal area. Is there an option in Plotly for this?
You can construct a new column called r_outer_diff that stores radius differences (as you go from the innermost to the outermost sector for each direction) to ensure the area of each sector is equal. The values for this column can be calculated inside the loop we are using to construct df_test_sectors, using the following steps:
we start with the inner sector of r = 0.1 and find the area of that sector as a reference, since we want all subsequent sectors to have the same area
then, to construct the next sector, we need to find r_outer so that pi*(r_outer**2 - r_inner**2) * (sector angle/360) = reference sector area
we solve this formula for r_outer on each iteration of the loop, and use that r_outer as r_inner for the next iteration. Since Plotly draws the sum of all the radii, we actually keep track of r_outer - r_inner for each iteration, and this is the value we store in the r_outer_diff column (a closed form is shown below).
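Solving that equation gives a simple closed form: the angle fraction cancels, so pi*(r_outer**2 - r_inner**2) = pi*r_base**2, i.e. r_outer = sqrt(r_inner**2 + r_base**2). Here is a minimal standalone sketch of just this recurrence (using the same 0.1 base radius as the full code below):
import numpy as np

r_base = 0.1
radii = [r_base]                      # outer radius of each ring
for _ in range(4):
    r_inner = radii[-1]
    # equal ring areas  <=>  r_outer**2 - r_inner**2 == r_base**2
    radii.append(np.sqrt(r_inner**2 + r_base**2))

print(radii)                          # [0.1, 0.141..., 0.173..., 0.2, 0.223...]
print(np.diff(radii, prepend=0))      # the per-ring bar lengths plotly needs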
Putting this into code:
import numpy as np
import pandas as pd
import plotly.express as px
df = px.data.wind()
df_test = df[df["strength"]=='0-1']
df_test_sectors = pd.DataFrame(columns=df_test.columns)
## this only works if each group has one row
for direction, df_direction in df_test.groupby('direction'):
    frequency_stop = df_direction['frequency'].tolist()[0]
    frequencies = np.arange(0.1, frequency_stop+0.1, 0.1)
    r_base = 0.1
    ## the angle fraction (16/360) cancels out below, so its exact value does not
    ## affect the result; with 16 directions the true sector angle is 22.5 degrees
    sector_area = np.pi * r_base**2 * (16/360)
    ## we can populate the list with the first radius of 0.1
    ## since that will stay fixed
    ## then we use the formula: sector_area = pi*(r_outer^2 - r_inner^2) * (sector angle/360)
    r_adjusted_for_area = [0.1]
    r_outer_diffs = [0.1]
    for i in range(len(frequencies)-1):
        r_inner = r_adjusted_for_area[-1]
        inner_sector_area = np.pi * r_inner**2 * (16/360)
        outer_sector_area = inner_sector_area + sector_area
        r_outer = np.sqrt(outer_sector_area * (360/16) / np.pi)
        r_outer_diff = r_outer - r_inner
        r_adjusted_for_area.append(r_outer)
        r_outer_diffs.append(r_outer_diff)
    df_sector = pd.DataFrame({
        'direction': [direction]*len(frequencies),
        'strength': ['0-1']*len(frequencies),
        'frequency': frequencies,
        'r_outer_diff': r_outer_diffs
    })
    df_test_sectors = pd.concat([df_test_sectors, df_sector])
df_test_sectors = df_test_sectors.reset_index(drop=True)
df_test_sectors['direction'] = pd.Categorical(
    df_test_sectors['direction'],
    df_test.direction.tolist()  # sort the directions into the same order as those in df_test
)
df_test_sectors['frequency'] = df_test_sectors['frequency'].astype(float)
df_test_sectors = df_test_sectors.sort_values(['direction', 'frequency'])
fig = px.bar_polar(df_test_sectors, r='r_outer_diff', theta='direction', color='frequency', color_continuous_scale='YlGn')
fig.show()
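As a quick sanity check (a sketch, not part of the original solution), you can verify that consecutive rings within any one direction really do have equal area:
# cumulative outer radii for one direction, then the ring areas (22.5 degree sectors)
rs = df_test_sectors.loc[df_test_sectors['direction'] == 'N', 'r_outer_diff'].astype(float).cumsum().to_numpy()
ring_areas = np.pi * (rs**2 - np.insert(rs[:-1], 0, 0.0)**2) * (22.5/360)
print(np.allclose(ring_areas, ring_areas[0]))  # True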
I'm trying to calculate the relative phase between a time series of two angles. In the code below, the angles are measured from the rotation derived from the xy points associated with Label A and Label B. The angles move in a similar direction for the first 3 time points and then deviate for the remaining 3 time points.
My understanding was that a relative phase calculation using a Hilbert transform meant that values closer to 0° indicate coordination, or in-phase movement. Conversely, values closer to 180° indicate asynchronous, or anti-phase, patterns. Yet when I inspect the results below, I'm not seeing this.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.signal import hilbert
df = pd.DataFrame({
'Time' : [1,1,2,2,3,3,4,4,5,5,6,6],
'Label' : ['A','B','A','B','A','B','A','B','A','B','A','B'],
'x' : [-2.0,-1.0,-1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0],
'y' : [-2.0,-1.0,-2.0,-1.0,-2.0,-1.0,-3.0,0.0,-4.0,1.0,-5.0,2.0],
})
x = df.groupby('Label')['x'].diff().fillna(0).astype(float)
y = df.groupby('Label')['y'].diff().fillna(0).astype(float)
df['Rotation'] = np.arctan2(y, x)
df['Angle'] = np.degrees(df['Rotation'])
df_A = df[df['Label'] == 'A'].reset_index(drop = True)
df_B = df[df['Label'] == 'B'].reset_index(drop = True)
y1 = df_A['Angle'].values
y2 = df_B['Angle'].values
ang1 = np.angle(hilbert(y1),deg=False)
ang2 = np.angle(hilbert(y2),deg=False)
f,ax = plt.subplots(3,1,figsize=(20,5),sharex=True)
ax[0].plot(y1,color='r',label='y1')
ax[0].plot(y2,color='b',label='y2')
ax[0].legend(bbox_to_anchor=(0., 1.02, 1., .102),ncol=2)
ax[1].plot(ang1,color='r')
ax[1].plot(ang2,color='b')
ax[1].set(title='Angle at each Timepoint')
phase_synchrony = 1-np.sin(np.abs(ang1-ang2)/2)
ax[2].plot(phase_synchrony)
ax[2].set(ylim=[0,1.1],title='Instantaneous Phase Synchrony',xlabel='Time',ylabel='Phase Synchrony')
plt.tight_layout()
plt.show()
By your description I would simply use
phase_synchrony = 1-np.sin(np.abs(y1-y2)/2)
The analytic representation via the Hilbert transform applies when you have only the real part of a signal that you know (or assume, based on reasonable principles) to be analytic; under such conditions you can find an imaginary part that makes the resulting function analytic.
But in your case you already have x and y, so you can calculate the angle directly, as you have done already.
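For intuition, here is a minimal standalone sketch (not part of the original code) showing what the Hilbert transform is for: recovering the instantaneous phase of a signal when only its real part is known.
import numpy as np
from scipy.signal import hilbert

# a pure cosine whose true instantaneous phase we know exactly
t = np.linspace(0, 1, 500, endpoint=False)
true_phase = 2 * np.pi * 5 * t                        # 5 full cycles
signal = np.cos(true_phase)

# the analytic signal cos(p) + i*sin(p) recovers the phase from the real part alone
recovered = np.unwrap(np.angle(hilbert(signal)))
print(np.allclose(recovered, true_phase, atol=1e-6))  # True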
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.signal import hilbert
df = pd.DataFrame({
'Time' : [1,1,2,2,3,3,4,4,5,5,6,6],
'Label' : ['A','B','A','B','A','B','A','B','A','B','A','B'],
'x' : [-2.0,-1.0,-1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0],
'y' : [-2.0,-1.0,-2.0,-1.0,-2.0,-1.0,-3.0,0.0,-4.0,1.0,-5.0,2.0],
})
x = df.groupby('Label')['x'].diff().fillna(0).astype(float)
y = df.groupby('Label')['y'].diff().fillna(0).astype(float)
df['Rotation'] = np.arctan2(y, x)
df['Angle'] = np.degrees(df['Rotation'])
df_A = df[df['Label'] == 'A'].reset_index(drop = True)
df_B = df[df['Label'] == 'B'].reset_index(drop = True)
y1 = df_A['Angle'].values
y2 = df_B['Angle'].values
# no need to compute the hilbert transforms here
f,ax = plt.subplots(3,1,figsize=(20,5),sharex=True)
ax[0].plot(y1,color='r',label='y1')
ax[0].plot(y2,color='b',label='y2')
ax[0].legend(bbox_to_anchor=(0., 1.02, 1., .102),ncol=2)
ax[1].plot(y1,color='r')
ax[1].plot(y2,color='b')
ax[1].set(title='Angle at each Timepoint')
# all I changed (note: y1/y2 are angles in degrees; wrap the difference
# with np.radians(...) if you want the formula to operate in radians)
phase_synchrony = 1-np.sin(np.abs(y1-y2)/2)
ax[2].plot(phase_synchrony)
ax[2].set(ylim=[0,1.1],title='Instantaneous Phase Synchrony',xlabel='Time',ylabel='Phase Synchrony')
plt.tight_layout()
plt.show()
This question addresses how to access and display the R2 value using mark_text()
I am interested in accessing and displaying the coefficients. Replacing rSquared with coef yields a flattened array of both the intercept and slope, as described in the documentation.
How can I index into this array to display only one of the values, e.g. the slope? I wondered if the mark_text() step should be preceded by a transform (possibly transform_filter()), or if altair.Text() could be used.
I am aware of other approaches which involve determining this information separately then adding it as an additional layer.
Apologies if this is a very straightforward question. Thanks in advance.
import altair as alt
import pandas as pd
import numpy as np
np.random.seed(42)
x = np.linspace(0, 10)
y = x - 5 + np.random.randn(len(x))
df = pd.DataFrame({'x': x, 'y': y})
chart = alt.Chart(df).mark_point().encode(
    x='x',
    y='y'
)
line = chart.transform_regression('x', 'y').mark_line()
params = alt.Chart(df).transform_regression(
    'x', 'y', params=True
).mark_text(align='left').encode(
    x=alt.value(20),  # pixels from left
    y=alt.value(20),  # pixels from top
    text='rSquared:N',
    # text='coef:N'    # flattened array
    # text='coef[0]:N' # fails
)
chart + line + params
You can access this using a calculate transform:
params = alt.Chart(df).transform_regression(
    'x', 'y', params=True
).transform_calculate(
    intercept='datum.coef[0]',
    slope='datum.coef[1]',
).mark_text(align='left').encode(
    x=alt.value(20),  # pixels from left
    y=alt.value(20),  # pixels from top
    text='intercept:N'
)
chart + line + params
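If you want to show the slope instead, and control its number formatting, you can point the text encoding at the calculated field via alt.Text (a small variation, not part of the original answer):
params = alt.Chart(df).transform_regression(
    'x', 'y', params=True
).transform_calculate(
    slope='datum.coef[1]',
).mark_text(align='left').encode(
    x=alt.value(20),
    y=alt.value(20),
    text=alt.Text('slope:Q', format='.3f'),  # show the slope to three decimals
)

chart + line + params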
I have a set of netcdf datasets that basically look like a CSV file with columns for latitude, longitude, value. These are points along tracks that I want to aggregate to a regular grid of (say) 1 degree from -90 to 90 and -180 to 180 degrees, by for example calculating the mean and/or standard deviation of all points that fall within a given cell.
This is quite easily done with a loop:
D = np.zeros((180, 360))
for ilat in np.arange(-90, 90, 1, dtype=int):
    for ilon in np.arange(-180, 180, 1, dtype=int):
        p1 = np.logical_and(ds.lat >= ilat,
                            ds.lat <= ilat + 1)
        p2 = np.logical_and(ds.lon >= ilon,
                            ds.lon <= ilon + 1)
        if np.sum(p1*p2) == 0:
            D[90 + ilat, 180 + ilon] = np.nan
        else:
            D[90 + ilat, 180 + ilon] = np.mean(ds.var.values[p1*p2])
            # D[90 + ilat, 180 + ilon] = np.std(ds.var.values[p1*p2])
Other than using numba/cython to speed this up, I was wondering whether this is something you can directly do with xarray in a more efficient way?
You should be able to solve this using pandas and xarray.
You will first need to convert your data set to a pandas data frame.
Once this is done, assuming your dataframe is df and the coordinates are named lon/lat, you round the lon/lats to the nearest integer value and calculate the mean for each lon/lat cell. The groupby already puts lat/lon into the index, so you can then use xarray's to_xarray to convert the result to an array:
import xarray as xr
import pandas as pd
import numpy as np
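# hypothetical stand-in data so the snippet runs on its own (an assumption;
# the question's real lat/lon/value columns come from its netCDF track files)
df = pd.DataFrame({
    "lat": np.random.uniform(-90, 90, 1000),
    "lon": np.random.uniform(-180, 180, 1000),
    "value": np.random.rand(1000),
})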
df = df.assign(lon = lambda x: np.round(x.lon))
df = df.assign(lat = lambda x: np.round(x.lat))
# groupby already makes lat/lon the index, so a separate set_index is not needed
df = df.groupby(["lat", "lon"]).mean()
df.to_xarray()
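Note that np.round bins each point to the nearest integer, so cells end up centred on integer lat/lon values. If you want cells spanning [n, n+1), like the loop in the question, np.floor is the closer match (an alternative sketch, not from the original answer):
df = df.assign(lon = lambda x: np.floor(x.lon))
df = df.assign(lat = lambda x: np.floor(x.lat))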
I used @robert-wilson's answer as a starting point, and to_xarray is indeed part of my solution. Other inspiration came from here. The approach I used is shown below. It's probably slower than numba-ing my solution above, but much simpler.
import netCDF4
import numpy as np
import xarray as xr
import pandas as pd
fname = "super_funky_file.nc"
f = netCDF4.Dataset(fname)
lat = f.variables['lat'][:]
lon = f.variables['lon'][:]
vari = f.variables['super_duper_variable'][:]
df = pd.DataFrame({"lat":lat,
"lon":lon,
"vari":vari})
# Simple functions to calculate the grid location in rows/cols
# using lat/lon as inputs. Global 0.5 deg grid
# Remember to cast to integer
to_col = lambda x: np.floor(
(x+90)/0.5).astype(
np.int)
to_row = lambda x: np.floor(
(x+180.)/0.5).astype(
np.int)
# Map the latitudes to columns
# Map the longitudes to rows
df['col'] = df.lat.map(to_col)
df['row'] = df.lon.map(to_row)
# Aggregate by row and col
gg = df.groupby(['col', 'row'])
# Now, create an xarray dataset with
# the mean of vari per grid cell
ds = gg.mean().to_xarray()
dx = gg.std().to_xarray()
ds['stdi'] = dx['vari']
dx = gg.count().to_xarray()
ds['counti'] = dx['vari']
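If you want the result indexed by real coordinates rather than integer bins, you can map the row/col indices back to cell centres (a sketch assuming the 0.5 degree grid defined by to_col/to_row above):
# convert integer bin indices back to cell-centre coordinates
ds = ds.assign_coords(col=ds['col'] * 0.5 - 90 + 0.25,   # latitude centres
                      row=ds['row'] * 0.5 - 180 + 0.25)  # longitude centres
ds = ds.rename({'col': 'lat', 'row': 'lon'})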
Sounds very complicated but a simple plot will make it easy to understand:
I have three curves of cumulative sum of some values over time, which are the blue lines.
I want to average (or somehow combine in a statistically correct way) the three curves into one smooth curve and add confidence interval.
I tried one simple solution - combining all the data into one curve, averaging it with the "rolling" function in pandas, and getting its standard deviation. I plotted those as the purple curve with the confidence interval around it.
The problem with my real data, as illustrated in the plot above, is that the curve isn't smooth at all. There are also sharp jumps in the confidence interval, which isn't a good representation of the 3 separate curves, as there are no jumps in them.
Is there a better way to represent the 3 different curves in one smooth curve with a nice confidence interval?
I supply test code below, tested on Python 3.5.1 with numpy and pandas (don't change the seed, in order to get the same curves).
There are some constraints - increasing the number of points for the "rolling" function isn't a solution for me because some of my data is too short for that.
Test code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
np.random.seed(seed=42)
## data generation - cumulative analysis over time
df1_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df1_values = pd.DataFrame(np.random.randint(0,10000,size=100), columns=['vals'])
df1_combined_sorted = pd.concat([df1_time, df1_values], axis = 1).sort_values(by=['time'])
df1_combined_sorted_cumulative = np.cumsum(df1_combined_sorted['vals'])
df2_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df2_values = pd.DataFrame(np.random.randint(1000,13000,size=100), columns=['vals'])
df2_combined_sorted = pd.concat([df2_time, df2_values], axis = 1).sort_values(by=['time'])
df2_combined_sorted_cumulative = np.cumsum(df2_combined_sorted['vals'])
df3_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df3_values = pd.DataFrame(np.random.randint(0,4000,size=100), columns=['vals'])
df3_combined_sorted = pd.concat([df3_time, df3_values], axis = 1).sort_values(by=['time'])
df3_combined_sorted_cumulative = np.cumsum(df3_combined_sorted['vals'])
## combining the three curves
df_all_vals_cumulative = pd.concat([df1_combined_sorted_cumulative,
    df2_combined_sorted_cumulative, df3_combined_sorted_cumulative]).reset_index(drop=True)
df_all_time = pd.concat([df1_combined_sorted['time'],
df2_combined_sorted['time'], df3_combined_sorted['time']]).reset_index(drop=True)
df_all = pd.concat([df_all_time, df_all_vals_cumulative], axis = 1)
## creating confidence intervals
df_all_sorted = df_all.sort_values(by=['time'])
ma = df_all_sorted.rolling(10).mean()
mstd = df_all_sorted.rolling(10).std()
## plotting
plt.fill_between(df_all_sorted['time'], ma['vals'] - 2 * mstd['vals'],
ma['vals'] + 2 * mstd['vals'],color='b', alpha=0.2)
plt.plot(df_all_sorted['time'],ma['vals'], c='purple')
plt.plot(df1_combined_sorted['time'], df1_combined_sorted_cumulative, c='blue')
plt.plot(df2_combined_sorted['time'], df2_combined_sorted_cumulative, c='blue')
plt.plot(df3_combined_sorted['time'], df3_combined_sorted_cumulative, c='blue')
plt.show()
First of all, your sample code could be rewritten to make better use of pandas. For example:
np.random.seed(seed=42)

## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0, max_time, size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0, max_val, size=100), columns=['vals'])
    df = pd.concat([times, vals], axis=1).sort_values(by=['time']).reset_index(drop=True)
    df['cumulative'] = df.vals.cumsum()
    return df

# generate the dataframes
df1, df2, df3 = map(get_data, [10000, 13000, 4000])
dfs = (df1, df2, df3)

# join
df_all = pd.concat(dfs, ignore_index=True).sort_values(by=['time'])

# render function
def render(window=10):
    # compute rolling means and confidence intervals
    mean_val = df_all.cumulative.rolling(window).mean()
    std_val = df_all.cumulative.rolling(window).std()
    min_val = mean_val - 2*std_val
    max_val = mean_val + 2*std_val

    plt.figure(figsize=(16, 9))
    for df in dfs:
        plt.plot(df.time, df.cumulative, c='blue')
    plt.plot(df_all.time, mean_val, c='r')
    plt.fill_between(df_all.time, min_val, max_val, color='blue', alpha=.2)
    plt.show()
The reason your curves aren't that smooth is probably that your rolling window is not large enough. You can increase the window size to get smoother graphs. For example, render(20) gives:
while render(30) gives:
A better way, though, might be to interpolate each df['cumulative'] over the entire time window and compute the mean/confidence interval on these series. With that in mind, we can modify the code as follows:
np.random.seed(seed=42)

## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0, max_time, size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0, max_val, size=100), columns=['vals'])
    # note that we set time as the index of the returned data
    df = pd.concat([times, vals], axis=1).dropna().set_index('time').sort_index()
    df['cumulative'] = df.vals.cumsum()
    return df

df1, df2, df3 = map(get_data, [10000, 13000, 4000])
dfs = (df1, df2, df3)

# rename columns for later plotting
for i, df in enumerate(dfs):
    df.rename(columns={'cumulative': f'cumulative_{i}'}, inplace=True)

# concatenate the dataframes with common time index
df_all = pd.concat(dfs, sort=False).sort_index()

# interpolate each cumulative column linearly
df_all.interpolate(inplace=True)

# plot graphs
mean_val = df_all.iloc[:, 1:].mean(axis=1)
std_val = df_all.iloc[:, 1:].std(axis=1)
min_val = mean_val - 2*std_val
max_val = mean_val + 2*std_val

fig, ax = plt.subplots(1, 1, figsize=(16, 9))
df_all.iloc[:, 1:4].plot(ax=ax)
plt.plot(df_all.index, mean_val, c='purple')
plt.fill_between(df_all.index, min_val, max_val, color='blue', alpha=.2)
plt.show()
and we get:
I can draw a boxplot from data:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.rand(100)
plt.boxplot(data)
Then, the box will range from the 25th-percentile to 75th-percentile, and the whisker will range from the smallest value to the largest value between (25th-percentile - 1.5*IQR, 75th-percentile + 1.5*IQR), where the IQR denotes the inter-quartile range. (Of course, the value 1.5 is customizable).
Now I want to know the values used in the boxplot, i.e. the median, upper and lower quartile, the upper whisker end point and the lower whisker end point. While the former three are easy to obtain by using np.median() and np.percentile(), the end point of the whiskers will require some verbose coding:
median = np.median(data)
upper_quartile = np.percentile(data, 75)
lower_quartile = np.percentile(data, 25)
iqr = upper_quartile - lower_quartile
upper_whisker = data[data<=upper_quartile+1.5*iqr].max()
lower_whisker = data[data>=lower_quartile-1.5*iqr].min()
I was wondering, while this is acceptable, would there be a neater way to do this? It seems that the values should be ready to pull out from the boxplot, as it's already drawn.
Why do you want to do so? What you are doing is already pretty direct.
Yeah, if you want to fetch them from the plot, when the plot is already made, simply use the get_ydata() method.
B = plt.boxplot(data)
[item.get_ydata() for item in B['whiskers']]
It returns an array of shape (2,) for each whisker; the second element is the value we want:
[item.get_ydata()[1] for item in B['whiskers']]
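A minimal self-contained sketch of that approach (the sample data here is just for illustration):
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
data = np.random.rand(100)

B = plt.boxplot(data)
# each whisker line runs from the quartile out to the whisker end,
# so the second y-value of each line is the whisker endpoint
lower_whisker, upper_whisker = [w.get_ydata()[1] for w in B['whiskers']]
print(lower_whisker, upper_whisker)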
I've had to do this recently and have written a function to extract the boxplot values from the boxplot as a pandas DataFrame.
The function is:
def get_box_plot_data(labels, bp):
    rows_list = []
    for i in range(len(labels)):
        dict1 = {}
        dict1['label'] = labels[i]
        dict1['lower_whisker'] = bp['whiskers'][i*2].get_ydata()[1]
        dict1['lower_quartile'] = bp['boxes'][i].get_ydata()[1]
        dict1['median'] = bp['medians'][i].get_ydata()[1]
        dict1['upper_quartile'] = bp['boxes'][i].get_ydata()[2]
        dict1['upper_whisker'] = bp['whiskers'][(i*2)+1].get_ydata()[1]
        rows_list.append(dict1)
    return pd.DataFrame(rows_list)
It is called by passing an array of labels (the ones that you would pass to the boxplot plotting function) and the data returned by the boxplot function itself.
For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
def get_box_plot_data(labels, bp):
    rows_list = []
    for i in range(len(labels)):
        dict1 = {}
        dict1['label'] = labels[i]
        dict1['lower_whisker'] = bp['whiskers'][i*2].get_ydata()[1]
        dict1['lower_quartile'] = bp['boxes'][i].get_ydata()[1]
        dict1['median'] = bp['medians'][i].get_ydata()[1]
        dict1['upper_quartile'] = bp['boxes'][i].get_ydata()[2]
        dict1['upper_whisker'] = bp['whiskers'][(i*2)+1].get_ydata()[1]
        rows_list.append(dict1)
    return pd.DataFrame(rows_list)
data1 = np.random.normal(loc = 0, scale = 1, size = 1000)
data2 = np.random.normal(loc = 5, scale = 1, size = 1000)
data3 = np.random.normal(loc = 10, scale = 1, size = 1000)
labels = ['data1', 'data2', 'data3']
bp = plt.boxplot([data1, data2, data3], labels=labels)
print(get_box_plot_data(labels, bp))
plt.show()
Outputs the following from get_box_plot_data:
label lower_whisker lower_quartile median upper_quartile upper_whisker
0 data1 -2.491652 -0.587869 0.047543 0.696750 2.559301
1 data2 2.351567 4.310068 4.984103 5.665910 7.489808
2 data3 7.227794 9.278931 9.947674 10.661581 12.733275
And produces the following plot:
upper_whisker = data[data<=upper_quartile+1.5*iqr].max()
lower_whisker = data[data>=lower_quartile-1.5*iqr].min()
are equal to
upper_whisker = data.max()
lower_whisker = data.min()
whenever no data points fall outside the fences, i.e. if you just want the extreme real data points in the dataset. But statistically speaking, the whisker fences are at upper_quartile + 1.5*IQR and lower_quartile - 1.5*IQR.
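As a quick check (a sketch; a uniform sample like the one in the question has no outliers, so the whiskers reach the extremes):
import numpy as np

np.random.seed(0)
data = np.random.rand(100)            # uniform data: nothing beyond the fences
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# with no outliers, the whisker endpoints equal the sample extremes
assert data[data <= q3 + 1.5*iqr].max() == data.max()
assert data[data >= q1 - 1.5*iqr].min() == data.min()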