np.array mean to single column data frame - python

I have a 2 column array that I calculate the mean of (thus creating column A). I would like to be able to refer to and manipulate column A, but cannot seem to save it as a new single column. Here is my specific example; 'filtered' is what I'd like to be able to save and use. The error is usually ValueError: Wrong number of items passed 2, placement implies 1.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df=pd.read_csv('/Users/myfile.csv', delimiter=',', usecols=['Time','Distance'])
x = df['Time']
y = df['Distance']
n = 25 #small n = less smoothed
fwd = pd.Series.ewm(df,span=n, adjust=True).mean()
bwd = pd.Series.ewm(df[::-1],span=n, adjust=True).mean()
filtered = np.stack(( fwd, bwd[::-1] ))
filtered2 = np.mean(filtered, axis=0)
plt.subplot(2,1,1)
plt.title('smoothed and raw data')
plt.plot(x,y, color = 'orange')
plt.plot(x,filtered, color='green')
plt.plot(x,fwd, color='red')
plt.plot(x[::-1],bwd, color='blue')
plt.xlabel('time')
plt.ylabel('distance')
df['filtered2'] = pd.DataFrame(filtered, dtype='str', index=None)
print(filtered2)
smoothed_velocity = ((df.filtered2 - df.filtered2.shift(1)) / df['Time'] - df['Time'].shift(1))
print(smoothed_velocity)
plt.subplot (2,1,2)
plt.title ('smoothed velocity')
plt.plot (smoothed_velocity, color = 'orange')
plt.tight_layout()
plt.show()
Because I define 'filtered' twice I tried changing one to a different variable with no luck. Error presented was ValueError: x and y must have same first dimension, but have shapes (458,) and (2, 458, 2)
Any help would be rad!
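A minimal sketch of one way to end up with a single column, under stated assumptions: the smoothing is applied to the Distance column only (in the code above, ewm runs on the whole two-column frame, which is where the extra dimension of size 2 comes from), and synthetic data stands in for myfile.csv. The velocity line also puts parentheses around the time difference, since without them the division happens before the subtraction.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the CSV (assumption: two numeric columns, Time and Distance)
rng = np.random.default_rng(0)
df = pd.DataFrame({"Time": np.arange(1.0, 101.0)})
df["Distance"] = np.cumsum(rng.normal(1.0, 0.5, size=len(df)))

n = 25  # small n = less smoothed
y = df["Distance"]  # smooth one column, not the whole frame

# Forward pass, and backward pass flipped back into the original order
fwd = y.ewm(span=n, adjust=True).mean()
bwd = y[::-1].ewm(span=n, adjust=True).mean()[::-1]

# Stack the two 1-D passes: shape (2, N); averaging over axis 0 gives shape (N,)
filtered = np.stack((fwd.to_numpy(), bwd.to_numpy()))
df["filtered2"] = filtered.mean(axis=0)

# Velocity = change in smoothed distance / change in time
df["smoothed_velocity"] = df["filtered2"].diff() / df["Time"].diff()
```

Because filtered2 is now one value per row, it can be passed to plt.plot(df['Time'], df['filtered2']) without the shape mismatch.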

Related

How to automatically set the y-axis limits after limiting the x-axis

Let's say that I have a certain number of data sets that I want to plot together.
And then I want to zoom on a certain part (for example, using ax.set_xlim, or plt.xlim or plt.axis). When I do that it still keeps the calculated range prior to the zoom. How can I make it rescale to what is currently being shown?
For example, using
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
data_x = [d for d in range(100)]
data_y = [2*d for d in range(100)]
data_y2 = [(d-50)*(d-50) for d in range(100)]
fig = plt.figure(constrained_layout=True)
gs = gridspec.GridSpec(2, 1, figure=fig)
ax1 = fig.add_subplot(gs[0, 0])
ax1.grid()
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.scatter(data_x, data_y, s=0.5)
ax1.scatter(data_x, data_y2, s=0.5)
ax2 = fig.add_subplot(gs[1, 0])
ax2.grid()
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.scatter(data_x, data_y, s=0.5)
ax2.scatter(data_x, data_y2, s=0.5)
ax2.set_xlim(35,45)
fig.savefig('scaling.png', dpi=300)
plt.close(fig)
This generates a figure in which, as you can see, the lower plot is hard to read because its y-axis kept the same range as the non-limited version.
I have tried using relim, autoscale or autoscale_view but that did not work. For a single data set, I could use ylim with the minimum and maximum values for that data set. But with several data sets, I would have to look through all of them.
Is there a better way to force a recalculation of the y-axis range?
Convert the lists to numpy arrays
create a Boolean mask of data_x based on xlim_min and xlim_max
use the mask to select the relevant data points in the y data
combine the two selected y arrays
select the min and max values from the selected y values and set them as ylim
import numpy as np
import matplotlib.pyplot as plt
# use a variable for the xlim limits
xlim_min = 35
xlim_max = 45
# convert lists to arrays
data_x = np.array(data_x)
data_y = np.array(data_y)
data_y2 = np.array(data_y2)
# create a mask for the values to be plotted based on the xlims
x_mask = (data_x >= xlim_min) & (data_x <= xlim_max)
# use the mask on y arrays
y2_vals = data_y2[x_mask]
y_vals = data_y[x_mask]
# combine y arrays
y_all = np.concatenate((y2_vals, y_vals))
# get min and max y
ylim_min = y_all.min()
ylim_max = y_all.max()
# other code from op
...
# use the values to set xlim and ylim
ax2.set_xlim(xlim_min, xlim_max)
ax2.set_ylim(ylim_min, ylim_max)
Instead of using ylim and xlim, you can do x_vals = data_x[x_mask] and then plot x_vals with y_vals and y2_vals, which removes 5 lines of code.
This is similar to Matplotlib - fixing x axis scale and autoscale y axis
# use a variable for the xlim limits
xlim_min = 35
xlim_max = 45
# convert lists to arrays
data_x = np.array(data_x)
data_y = np.array(data_y)
data_y2 = np.array(data_y2)
# create a mask for the values to be plotted based on the xlims
x_mask = (data_x >= xlim_min) & (data_x <= xlim_max)
# use the mask on x
x_vals = data_x[x_mask]
# use the mask on y
y2_vals = data_y2[x_mask]
y_vals = data_y[x_mask]
# other code from op
...
# plot
ax2.scatter(x_vals, y_vals, s=0.5)
ax2.scatter(x_vals, y2_vals, s=0.5)
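The masking steps above can be wrapped in a small helper so that any number of data sets sharing the same x values is handled in one call. This is only a sketch: visible_ylim and its margin parameter are hypothetical names, not part of matplotlib.

```python
import numpy as np

def visible_ylim(x, ys, xmin, xmax, margin=0.0):
    """Return (ymin, ymax) over all series in ys, restricted to xmin <= x <= xmax.

    Hypothetical helper: ys is a list of arrays sharing the same x values,
    margin is an optional fraction of the visible range added on both sides.
    """
    x = np.asarray(x)
    mask = (x >= xmin) & (x <= xmax)
    visible = np.concatenate([np.asarray(y)[mask] for y in ys])
    lo, hi = visible.min(), visible.max()
    pad = (hi - lo) * margin
    return lo - pad, hi + pad

# Same data as in the question
data_x = np.arange(100)
data_y = 2 * data_x
data_y2 = (data_x - 50) ** 2

ylim = visible_ylim(data_x, [data_y, data_y2], 35, 45)
# The result can then be applied with ax2.set_xlim(35, 45) and ax2.set_ylim(*ylim)
```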

Plot certain range of values with pandas and matplotlib

I have parsed data out of a .json file and plotted it, but I only want a certain range of it,
e.g. year-month = 2014-12 to 2020-03.
The code is:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_json("observed-solar-cycle-indices.json", orient='records')
data = pd.DataFrame(data)
print(data)
x = data['time-tag']
y = data['ssn']
plt.plot(x, y, 'o')
plt.xlabel('Year-day'), plt.ylabel('SSN')
plt.show()
Here is the result; as you can see, there are too many points.
Here is the json file: https://services.swpc.noaa.gov/json/solar-cycle/observed-solar-cycle-indices.json
How can I either parse out certain values from the JSON file or plot a certain range?
The following should work:
Select the data using a start and end date
ndata = data[ (data['time-tag'] > '2014-01') & (data['time-tag'] < '2020-12')]
Plot the data. The x-axis labeling is adapted to display only every 12th label
x = ndata['time-tag']
y = ndata['ssn']
fig, ax = plt.subplots()
plt.plot(x, y, 'o')
every_nth = 12
for n, label in enumerate(ax.xaxis.get_ticklabels()):
    if n % every_nth != 0:
        label.set_visible(False)
plt.xlabel('Year-Month')
plt.xticks(rotation='vertical')
plt.ylabel('SSN')
plt.show()
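Alternatively, you could look up the positions of your start and end dates and use them to slice both the x and y values into a smaller range to plot.
For example, it might be something like
# a pandas Series has no .index() method, so convert to plain lists first
x = data['time-tag'].tolist()
y = data['ssn'].tolist()
# assuming the time-tag values look like 'YYYY-MM', e.g. '2014-12'
start_index = x.index('2014-12')
end_index = x.index('2020-03')
# the slice end is exclusive, so add 1 to include the end month
x_subsection = x[start_index : end_index + 1]
y_subsection = y[start_index : end_index + 1]
plt.plot(x_subsection, y_subsection, 'o')
plt.xlabel('Year-Month'), plt.ylabel('SSN')
plt.show()
The conversion with .tolist() is needed because .index on a Series is an attribute (the row labels), not a search method.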
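Another option, sketched here under assumptions, is to convert the time tags to real datetimes and filter with a boolean mask, which avoids relying on string comparison. The small frame below is synthetic and only stands in for the downloaded JSON; the column names time-tag and ssn come from the question, while the date column is a name introduced for this example.

```python
import pandas as pd

# Synthetic stand-in for the JSON data (assumption: time-tag is a 'YYYY-MM' string)
months = pd.date_range("2014-01-01", "2020-12-01", freq="MS").strftime("%Y-%m")
data = pd.DataFrame({"time-tag": months, "ssn": range(len(months))})

# Parse the strings into datetimes (each becomes the first day of its month)
data["date"] = pd.to_datetime(data["time-tag"], format="%Y-%m")

# Keep 2014-12 through 2020-03; between() includes both end points by default
mask = data["date"].between("2014-12-01", "2020-03-01")
subset = data.loc[mask]
```

Plotting subset['date'] against subset['ssn'] then lets matplotlib choose sensible date ticks, so hiding every nth label may no longer be necessary.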

plot the average of every x value

I plot a function based on the results of a curve fit I did in the query. Now I want to see how well the curve fit matches the average values for every x value. I tried it with a for loop and a groupby.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
plt.style.use('seaborn-colorblind')
x = dataset['mrwSmpVWi']
c = dataset['c']
a = dataset['a']
b = dataset['b']
Snr = dataset['Seriennummer']
dataset["y"] = (c / (1 + (a) * np.exp(-b*(x))))
for number in dataset.groupby('mrwSmpVWi'):
    dataset['m'] = dataset['mrwSmpP'].mean()
fig, ax = plt.subplots(figsize=(30,15))
for name, group in dataset.groupby('Seriennummer'):
    group.plot(x="mrwSmpVWi", y="m", ax=ax, marker='o', linestyle='', ms=12, label =name)
    group.plot(x="mrwSmpVWi", y="y", ax=ax, label =name)
plt.show()
The dataset with the values is huge and not sorted by mrwSmpVWi.
Does anyone have an idea why I only get a straight line for my average values?
Take a look at what these lines actually do:
for number in dataset.groupby('mrwSmpVWi'):
    dataset['m'] = dataset['mrwSmpP'].mean()
The loop body never uses the group it iterates over. Every iteration assigns the overall mean of mrwSmpP to every row, so m is a single constant and plots as a straight line.
You probably want:
dataset['m'] = dataset.groupby('Seriennummer')['mrwSmpVWi'].transform('mean')
(assuming you intended to calculate the mean of each Seriennummer group)
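If the goal is instead the average of mrwSmpP for every x value, as the question title suggests, the same transform pattern can be grouped by mrwSmpVWi. The sketch below uses a tiny made-up frame; only the column names are taken from the question.

```python
import pandas as pd

# Made-up data, unsorted by x on purpose (assumption: one measured value per row)
dataset = pd.DataFrame({
    "mrwSmpVWi": [1, 2, 1, 2, 3, 3],
    "mrwSmpP": [10.0, 20.0, 14.0, 24.0, 30.0, 36.0],
})

# transform('mean') returns one value per original row: the mean of that row's x group
dataset["m"] = dataset.groupby("mrwSmpVWi")["mrwSmpP"].transform("mean")

# For plotting a single average point per x value, an aggregated Series is enough
means = dataset.groupby("mrwSmpVWi")["mrwSmpP"].mean()
```

Here means is indexed by the sorted x values, so means.plot(marker='o', linestyle='') would draw one point per x even though the source rows are unsorted.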

Can I render only part of a scatter_matrix?

On Python, using the Pandas library, I'm trying to generate the scatter plot of a DataFrame using scatter_matrix as follows:
scatter_matrix(df, alpha=0.5, figsize=(14,14), diagonal='kde')
That program takes a very long time to run and eventually crashes, possibly because there are too many (26) columns and the resulting image would be too big. Nevertheless, I noticed I'm able to render 13 variables just fine. So one solution would be to generate 4 plots instead, one for each quadrant of the resulting scatter matrix, i.e., the ranges [[0,0],[13,13]], [[13,0],[26,13]], [[0,13],[13,26]], [[13,13],[26,26]]. Notice those don't refer to ranges of the source DataFrame, but of the target scatter matrix I'm rendering. Is it possible?
I couldn't find any official way to do it, so I modified the scatter_matrix implementation to receive 2 additional params, cols and rows, which are arrays with the labels you want to compare:
"""
This is a modification of the scatter_matrix method;
it allows plotting sub-sections of the scatter plot
matrix. With the official method you can only plot
the entire matrix. This method allows selecting the
`cols` and `rows` you're interested in plotting.
Note that this wouldn't be possible even by calling
scatter_matrix on a subset of the dataframe, because
this wouldn't allow comparing different `cols`/`rows`.
"""
import numpy as np
import matplotlib.pyplot as plt
import pandas
import pandas.tools.plotting
from pandas.compat import range, lrange, lmap, map, zip, string_types
def scatter_matrix(frame, cols, rows, figsize=None, ax=None, grid=False,
                   diagonal='hist', marker='.', density_kwds=None,
                   hist_kwds=None, range_padding=0.05, **kwds):
    """
    Draw a matrix of scatter plots.
    Parameters
    ----------
    frame : DataFrame
    cols : [str]
        labels of the columns to be rendered
    rows: [str]
        labels of the rows to be rendered
    alpha : float, optional
        amount of transparency applied
    figsize : (float,float), optional
        a tuple (width, height) in inches
    ax : Matplotlib axis object, optional
    grid : bool, optional
        setting this to True will show the grid
    diagonal : {'hist', 'kde'}
        pick between 'kde' and 'hist' for
        either Kernel Density Estimation or Histogram
        plot in the diagonal
    marker : str, optional
        Matplotlib marker type, default '.'
    hist_kwds : other plotting keyword arguments
        To be passed to hist function
    density_kwds : other plotting keyword arguments
        To be passed to kernel density estimate plot
    range_padding : float, optional
        relative extension of axis range in x and y
        with respect to (x_max - x_min) or (y_max - y_min),
        default 0.05
    kwds : other plotting keyword arguments
        To be passed to scatter function
    Examples
    --------
    >>> df = DataFrame(np.random.randn(1000, 4), columns=['A','B','C','D'])
    >>> scatter_matrix(df, alpha=0.2)
    """
    import matplotlib.pyplot as plt
    df = frame._get_numeric_data()
    w = len(cols)
    h = len(rows)
    naxes = w * h
    fig, axes = pandas.tools.plotting._subplots(naxes=naxes, figsize=figsize, ax=ax,
                                                squeeze=False)
    # no gaps between subplots
    fig.subplots_adjust(wspace=0, hspace=0)
    #mask = pandas.tools.plotting.notnull(df)
    marker = pandas.tools.plotting._get_marker_compat(marker)
    hist_kwds = hist_kwds or {}
    density_kwds = density_kwds or {}
    # workaround because `c='b'` is hardcoded in matplotlibs scatter method
    #kwds.setdefault('c', plt.rcParams['patch.facecolor'])
    cols_boundaries_list = []
    for a in cols:
        values = df[a]#.values[mask[a].values]
        rmin_, rmax_ = np.min(values), np.max(values)
        rdelta_ext = (rmax_ - rmin_) * range_padding / 2.
        cols_boundaries_list.append((rmin_ - rdelta_ext, rmax_ + rdelta_ext))
    rows_boundaries_list = []
    for a in rows:
        values = df[a]#.values[mask[a].values]
        rmin_, rmax_ = np.min(values), np.max(values)
        rdelta_ext = (rmax_ - rmin_) * range_padding / 2.
        rows_boundaries_list.append((rmin_ - rdelta_ext, rmax_ + rdelta_ext))
    for i, a in zip(lrange(w), cols):
        for j, b in zip(lrange(h), rows):
            ax = axes[i, j]
            if cols[i] == rows[j]:
                values = df[a]#.values[mask[a].values]
                # Deal with the diagonal by drawing a histogram there.
                if diagonal == 'hist':
                    ax.hist(values, **hist_kwds)
                elif diagonal in ('kde', 'density'):
                    from scipy.stats import gaussian_kde
                    y = values
                    gkde = gaussian_kde(y)
                    ind = np.linspace(y.min(), y.max(), 1000)
                    ax.plot(ind, gkde.evaluate(ind), **density_kwds)
                ax.set_xlim(cols_boundaries_list[i])
            else:
                #common = (mask[a] & mask[b]).values
                ax.scatter(df[b], df[a], marker=marker, **kwds)
                ax.set_xlim(rows_boundaries_list[j])
                ax.set_ylim(cols_boundaries_list[i])
            ax.set_xlabel(b)
            ax.set_ylabel(a)
            if j != 0:
                ax.yaxis.set_visible(False)
            if i != w - 1:
                ax.xaxis.set_visible(False)
    # what is that for?
    #if len(df.columns) > 1:
        #lim1 = cols_boundaries_list[0]
        #locs = axes[0][1].yaxis.get_majorticklocs()
        #locs = locs[(lim1[0] <= locs) & (locs <= lim1[1])]
        #adj = (locs - lim1[0]) / (lim1[1] - lim1[0])
        #lim0 = axes[0][0].get_ylim()
        #adj = adj * (lim0[1] - lim0[0]) + lim0[0]
        #axes[0][0].yaxis.set_ticks(adj)
        #if np.all(locs == locs.astype(int)):
            ## if all ticks are int
            #locs = locs.astype(int)
        #axes[0][0].yaxis.set_ticklabels(locs)
    pandas.tools.plotting._set_ticks_props(axes, xlabelsize=8, xrot=90, ylabelsize=8, yrot=0)
    return axes
I was quite careless modifying that code so it is probably broken but served my needs.
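The modified function relies on pandas.tools.plotting and pandas.compat.lrange, which are not available in current pandas releases. As a rough alternative, here is a minimal sketch of the same idea using only matplotlib. The name scatter_block is hypothetical, the diagonal is limited to histograms, and the tick and range-padding handling of the original is left out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def scatter_block(frame, rows, cols, figsize=None, **scatter_kwds):
    """Plot only the rows x cols block of a scatter matrix (hypothetical helper)."""
    fig, axes = plt.subplots(len(rows), len(cols), figsize=figsize, squeeze=False)
    fig.subplots_adjust(wspace=0, hspace=0)  # no gaps between subplots
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            ax = axes[i, j]
            if r == c:
                # same variable on both axes: draw a histogram instead
                ax.hist(frame[c].dropna())
            else:
                ax.scatter(frame[c], frame[r], **scatter_kwds)
            # label only the outer edge of the block
            if i == len(rows) - 1:
                ax.set_xlabel(c)
            else:
                ax.xaxis.set_visible(False)
            if j == 0:
                ax.set_ylabel(r)
            else:
                ax.yaxis.set_visible(False)
    return axes

# One off-diagonal quadrant of a 4-column frame: rows A, B against columns C, D
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("ABCD"))
axes = scatter_block(df, rows=["A", "B"], cols=["C", "D"], alpha=0.5, marker=".")
```

For the 26-column case in the question, the four quadrants would correspond to calling the helper with the first or second 13 column labels as rows and as cols.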
The 4 quadrants will be:
mid_ind = len(df.index)//2
mid_col = len(df.columns)//2
df.iloc[:mid_ind,:mid_col]
df.iloc[mid_ind:,:mid_col]
df.iloc[mid_ind:,mid_col:]
df.iloc[:mid_ind,mid_col:]

Plot 2D array with Pandas, Matplotlib, and Numpy

As a result from simulations, I parsed the output using Pandas groupby(). I am having a bit of difficulty to plot the data the way I want. Here's the Pandas output file (suppressed for simplicity) that I'm trying to plot:
                 Avg-del   Min-del    Max-del   Avg-retx  Min-retx    Max-retx
Prob Producers
0.3  1          8.060291  0.587227  26.709371  42.931779  5.130041  136.216642
     5          8.330889  0.371387  54.468836  43.166326  3.340193  275.932170
     10         1.012147  0.161975   4.320447   6.336965  2.026241   19.177802
0.5  1          8.039639  0.776463  26.053635  43.160880  5.798276  133.090358
     5          4.729875  0.289472  26.717824  25.732373  2.909811  135.289244
     10         1.043738  0.160671   4.353993   6.461914  2.015735   19.595393
My y-axis is delay and my x-axis is the number of producers. I want to have errorbars for probability p=0.3 and another one for p=0.5.
My python script is the following:
import sys
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.expand_frame_repr', False)
outputFile = 'averages.txt'
f_out = open(outputFile, 'w')
data = pd.read_csv(sys.argv[1], delimiter=",")
result = data.groupby(["Prob", "Producers"]).mean()
print("Writing to output file: " + outputFile)
result_s = str(result)
f_out.write(result_s)
f_out.close()
*** Update from James ***
for prob_index in result.index.levels[0]:
    r = result.loc[prob_index]
    labels = [col for col in r]
    lines = plt.plot(r)
    [line.set_label(str(prob_index)+" "+col) for col, line in zip(labels, lines)]
ax = plt.gca()
ax.legend()
ax.set_xticks(r.index)
ax.set_ylabel('Latency (s)')
ax.set_xlabel('Number of producer nodes')
plt.show()
Now I have 4 sliced arrays, one for each probability.
How do I slice them again based on delay(del) and retx, and plot errorbars based on ave, min, max?
Ok, there is a lot going on here. First, it is plotting 6 lines. When your code calls
plt.plot(np.transpose(np.array(result)[0:3, 0:3]), label = 'p=0.3')
plt.plot(np.transpose(np.array(result)[3:6, 0:3]), label = 'p=0.5')
it is calling plt.plot on a 3x3 array of data. plt.plot interprets this input not as an x and y, but rather as 3 separate series of y-values (with 3 points each). For the x values, it substitutes the positions 0, 1, 2. In other words, for the first plot call it is plotting the data:
x = [0,1,2]; y = [8.060291, 8.330889, 1.012147]
x = [0,1,2]; y = [0.587227, 0.371387, 0.161975]
x = [0,1,2]; y = [26.709371, 54.468836, 4.320447]
Based on your x-label, I think you want the values to be x = [1,5,10]. Try this to see if it gets the plot you want.
# iterate over the first dataframe index
for prob_index in result.index.levels[0]:
    r = result.loc[prob_index]
    labels = [col for col in r]
    lines = plt.plot(r)
    [line.set_label(str(prob_index)+" "+col) for col, line in zip(labels, lines)]
ax = plt.gca()
ax.legend()
ax.set_xticks(r.index)
ax.set_ylabel('Latency (s)')
ax.set_xlabel('Number of producer nodes')
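For the error bars asked about in the question, one possible approach is plt.errorbar with asymmetric errors, where the lower error is avg - min and the upper error is max - avg. The sketch below rebuilds only the delay columns of the table shown in the question; the errors dictionary is a name introduced here to keep the computed values for inspection.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Delay columns of the grouped result from the question
index = pd.MultiIndex.from_product([[0.3, 0.5], [1, 5, 10]],
                                   names=["Prob", "Producers"])
result = pd.DataFrame({
    "Avg-del": [8.060291, 8.330889, 1.012147, 8.039639, 4.729875, 1.043738],
    "Min-del": [0.587227, 0.371387, 0.161975, 0.776463, 0.289472, 0.160671],
    "Max-del": [26.709371, 54.468836, 4.320447, 26.053635, 26.717824, 4.353993],
}, index=index)

fig, ax = plt.subplots()
errors = {}
for prob, r in result.groupby(level="Prob"):
    r = r.droplevel("Prob")  # index is now the number of producers
    # errorbar expects distances from the plotted point: row 0 below, row 1 above
    yerr = np.vstack([r["Avg-del"] - r["Min-del"], r["Max-del"] - r["Avg-del"]])
    errors[prob] = yerr
    ax.errorbar(r.index.to_numpy(), r["Avg-del"].to_numpy(), yerr=yerr,
                marker="o", capsize=3, label="p=" + str(prob))

ax.set_xticks([1, 5, 10])
ax.set_xlabel("Number of producer nodes")
ax.set_ylabel("Latency (s)")
ax.legend()
```

The same loop could be repeated with the Avg-retx, Min-retx and Max-retx columns on a second axes to cover the retransmission values.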
