How would I offset a pandas column of data by different amounts?
I am plotting some columns of a pandas dataframe using matplotlib. My plotting strategy is to zero each series to its initial value and then offset each chosen variable by a set amount. For example, this is my current plotting method:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# data is in a dataframe called inputData, with a 'timestamp' column
timeseries_plots = ['var1', 'var3', 'var8']
offsetFactor = 20
for ii, var in enumerate(timeseries_plots):
    # zero reference: the first non-null value of the series
    offsetRef = inputData[var].loc[~inputData[var].isnull()].iloc[0]
    ax.plot(inputData['timestamp'],
            offsetFactor*(len(timeseries_plots)-ii-1) + inputData[var] - offsetRef,
            label=var, markersize=1, marker='None', linestyle='solid')
plt.show()
This produces something like this (with some matplotlib finessing):
As you can see, it subtracts the offsetRef (in this case the initial value of the variable) and then adds a different multiple of offsetFactor (equal to 20 here) to each variable. The result is lines that start vertically offset from each other by 20.
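To make the arithmetic concrete, here is a minimal sketch of that offset calculation for a single toy series (the values are illustrative, not from my data):

s = pd.Series([5.0, 7.0, 6.0])      # a toy variable
offsetFactor = 20
ii, n = 0, 3                        # first of three plotted series
offsetRef = s.dropna().iloc[0]      # initial (first non-null) value
shifted = offsetFactor*(n - ii - 1) + s - offsetRef
print(shifted.tolist())             # [40.0, 42.0, 41.0] -> this line starts at 40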
However, this becomes a problem when the values drift over time and one variable crosses another. What I'd like to do is reset the vertical offset, for example by recalculating the offsetRef beyond a certain date.
I have tried to do this in the following way. I start by initialising an array the same size as the variable, then fill it with the offsetRef recalculated at each of the resetDates. I've included comments marked #PSEUDOCODE where I'm roughly writing what I want to do - sorry in advance for them being pretty rough. Thank you in advance!
fig, ax = plt.subplots()
inputData = pd.DataFrame(np.random.randint(100, size=(100, 5)),
                         columns=['timestamp', 'var2', 'var3', 'var4', 'var5'])
inputData['timestamp'] = pd.date_range('2020-05-01', '2020-08-08')
timeseries_plots = ['var2', 'var3', 'var4']
offsetFactor = 20
resetDates = ['2020-06-23', '2020-07-05']
for ii, var in enumerate(timeseries_plots):
    offsetRef = np.zeros(inputData[var].size)
    for tt, ttdate in enumerate(resetDates):
        if tt == 0:
            # PSEUDOCODE: before the first reset date, reference the first non-null value:
            # offsetRef[inputData['timestamp'] < ttdate] = inputData[var].loc[~inputData[var].isnull()].iloc[0]
            # PSEUDOCODE: from each reset date onwards, reference the value of inputData[var] at ttdate:
            # offsetRef[inputData['timestamp'] >= ttdate] = <value of inputData[var] at ttdate>
            pass
    ax.plot(inputData['timestamp'],
            offsetFactor*(len(timeseries_plots)-ii-1) + inputData[var] - offsetRef,
            label=var, markersize=1, marker='None', linestyle='solid')
plt.show()
This is the current solution that I'll stick here so that it might be useful to others:
# set up df
inputData = pd.DataFrame(np.random.randint(100, size=(100, 5)),
                         columns=['timestamp', 'var2', 'var3', 'var4', 'var5'])
inputData['timestamp'] = pd.date_range('2020-05-01', '2020-08-08')
inputData['var2'] = np.arange(0, 100, 1).astype(float)
inputData['var4'] = np.arange(0, 200, 2)
inputData.loc[0:2, 'var2'] = np.nan  # inject some NaNs to exercise the bfill branch
# set constants and settings
offsetFactor = 20
timeseries_plots = ['var2', 'var4']
resetDates = ['2020-05-05', '2020-05-20', '2020-08-04']
# begin
fig, ax = plt.subplots()
for ii, var in enumerate(timeseries_plots):
    offsetRef = np.zeros(inputData[var].size)
    for tt, ttdate in enumerate(resetDates):
        onDate = inputData['timestamp'] == ttdate
        if tt == 0:
            # before the first reset date, reference the first non-null value
            offsetRef[inputData['timestamp'] < ttdate] = inputData[var].loc[~inputData[var].isnull()].iloc[0]
            if inputData[var].loc[onDate].isna().bool():  # value at the reset date is NaN
                # fall back to the next valid value via bfill
                offsetRef[inputData['timestamp'] >= ttdate] = inputData[var].bfill().loc[onDate]
            else:
                offsetRef[inputData['timestamp'] >= ttdate] = inputData[var].loc[onDate]
        else:
            if inputData[var].loc[onDate].isna().bool():  # value at the reset date is NaN
                offsetRef[inputData['timestamp'] >= ttdate] = inputData[var].bfill().loc[onDate]
            else:
                offsetRef[inputData['timestamp'] >= ttdate] = inputData[var].loc[onDate]
    ax.plot(inputData['timestamp'],
            offsetFactor*(len(timeseries_plots)-ii-1) + inputData[var] - offsetRef)
plt.show()
This 'resets' the offset to 20 at the chosen resetDates to produce the following figure:
I possibly don't need the if-logic to catch NaN data (and could just rely on .bfill() working in either case), but having it makes me feel safer. I will edit as I improve the solution.
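For reference, here is a more compact sketch of the same offsetRef construction that leans on .bfill() unconditionally, so the NaN branches disappear. This is my reading of the logic above (untested against real data), and it assumes every reset date actually appears in the 'timestamp' column:

offsetRef = np.zeros(inputData[var].size)
filled = inputData[var].bfill()  # a reset on a NaN date then picks up the next valid value
# before the first reset date, reference the first non-null value
offsetRef[:] = inputData[var].dropna().iloc[0]
for ttdate in resetDates:
    after = (inputData['timestamp'] >= ttdate).to_numpy()
    offsetRef[after] = filled.loc[inputData['timestamp'] == ttdate].iloc[0]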
Related
Openpyxl minor gridlines
I am working on a Python application where I am collecting data from a device and attempting to plot it in an Excel file using the openpyxl library. I am successfully able to do everything, including plotting the data and formatting the scatter plot that I made, but I am having some trouble adding minor gridlines to the plot. I feel like this is definitely possible, because in the API I can see that the openpyxl.chart.axis module has a "minorGridlines" attribute, but it is not a boolean input (ON/OFF); rather, it takes a ChartLines class. I tried going a bit down the rabbit hole of seeing how I would do this, but I am wondering what the most straightforward way of adding the minor gridlines would be. Do you have to construct chart lines manually, or is there a simple way of doing this? I would really appreciate any help!
I think I answered my own question, but I will post it here in case anybody else needs this (as I don't see any other answers to this question on the forum).

Example Code (see the ChartLines import and the minorGridlines assignment below):

# Imports for script
from openpyxl import Workbook
# For plotting things in excel
from openpyxl.chart import ScatterChart, Reference, Series
from openpyxl.chart.axis import ChartLines
from math import log10

# Variables for script
fileName = 'testFile.xlsx'
dataPoints = 100

# Generating a workbook to test with
wb = Workbook()
ws = wb.active  # Fill data into the first sheet
ws_name = ws.title

# We will just generate a logarithmic plot, and scale the axis logarithmically (will look linear)
x_data = []
y_data = []
for i in range(dataPoints):
    x_data.append(i + 1)
    y_data.append(log10(i + 1))

# Go back through the data, and place the data into the sheet
ws['A1'] = 'x_data'
ws['B1'] = 'y_data'
for i in range(dataPoints):
    ws['A%d' % (i + 2)] = x_data[i]
    ws['B%d' % (i + 2)] = y_data[i]

# Generate a reference to the cells that we can plot
x_axis = Reference(ws, range_string='%s!A2:A%d' % (ws_name, dataPoints + 1))
y_axis = Reference(ws, range_string='%s!B2:B%d' % (ws_name, dataPoints + 1))
function = Series(xvalues=x_axis, values=y_axis)

# Actually create the scatter plot, and append all of the plots to it
ScatterPlot = ScatterChart()
ScatterPlot.x_axis.minorGridlines = ChartLines()
ScatterPlot.x_axis.scaling.logBase = 10
ScatterPlot.series.append(function)
ScatterPlot.x_axis.title = 'X_Data'
ScatterPlot.y_axis.title = 'Y_Data'
ScatterPlot.title = 'Openpyxl Plotting Test'
ws.add_chart(ScatterPlot, 'D2')

# Save the file at the end to output it
wb.save(fileName)

Background on the solution: I looked at how openpyxl generates the major axis gridlines, which seem to follow a similar convention to the minor axis gridlines, and I found that in the NumericAxis class the major gridlines are created with the line labeled '##### THIS Line #####' below (copied from the openpyxl/chart/axis.py file):

class NumericAxis(_BaseAxis):

    tagname = "valAx"

    axId = _BaseAxis.axId
    scaling = _BaseAxis.scaling
    delete = _BaseAxis.delete
    axPos = _BaseAxis.axPos
    majorGridlines = _BaseAxis.majorGridlines
    minorGridlines = _BaseAxis.minorGridlines
    title = _BaseAxis.title
    numFmt = _BaseAxis.numFmt
    majorTickMark = _BaseAxis.majorTickMark
    minorTickMark = _BaseAxis.minorTickMark
    tickLblPos = _BaseAxis.tickLblPos
    spPr = _BaseAxis.spPr
    txPr = _BaseAxis.txPr
    crossAx = _BaseAxis.crossAx
    crosses = _BaseAxis.crosses
    crossesAt = _BaseAxis.crossesAt

    crossBetween = NestedNoneSet(values=(['between', 'midCat']))
    majorUnit = NestedFloat(allow_none=True)
    minorUnit = NestedFloat(allow_none=True)
    dispUnits = Typed(expected_type=DisplayUnitsLabelList, allow_none=True)
    extLst = Typed(expected_type=ExtensionList, allow_none=True)

    __elements__ = _BaseAxis.__elements__ + ('crossBetween', 'majorUnit', 'minorUnit', 'dispUnits',)

    def __init__(self,
                 crossBetween=None,
                 majorUnit=None,
                 minorUnit=None,
                 dispUnits=None,
                 extLst=None,
                 **kw):
        self.crossBetween = crossBetween
        self.majorUnit = majorUnit
        self.minorUnit = minorUnit
        self.dispUnits = dispUnits
        kw.setdefault('majorGridlines', ChartLines())  ######## THIS Line #######
        kw.setdefault('axId', 100)
        kw.setdefault('crossAx', 10)
        super(NumericAxis, self).__init__(**kw)

    @classmethod
    def from_tree(cls, node):
        """
        Special case value axes with no gridlines
        """
        self = super(NumericAxis, cls).from_tree(node)
        gridlines = node.find("{%s}majorGridlines" % CHART_NS)
        if gridlines is None:
            self.majorGridlines = None
        return self

I took a stab, and after importing the ChartLines class like so:

from openpyxl.chart.axis import ChartLines

I was able to add minor gridlines to the x-axis like so:

ScatterPlot.x_axis.minorGridlines = ChartLines()

As far as formatting the minor gridlines goes, I'm at a bit of a loss, and personally have no need, but this at least is a good start.
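If you do later need to style those minor gridlines, ChartLines accepts graphical properties. A minimal, untested sketch, assuming openpyxl's GraphicalProperties and LineProperties API (the colour and dash style here are illustrative):

from openpyxl.chart.axis import ChartLines
from openpyxl.chart.shapes import GraphicalProperties
from openpyxl.drawing.line import LineProperties

# light-grey dashed minor gridlines (colour/dash values are assumptions)
line_props = LineProperties(prstDash='dash', solidFill='D3D3D3')
ScatterPlot.x_axis.minorGridlines = ChartLines(spPr=GraphicalProperties(ln=line_props))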
Averaging several time-series together with confidence interval (with test code)
Sounds very complicated, but a simple plot will make it easy to understand: I have three curves of the cumulative sum of some values over time, which are the blue lines. I want to average (or somehow combine in a statistically correct way) the three curves into one smooth curve and add a confidence interval. I tried one simple solution - combining all the data into one curve, averaging it with the "rolling" function in pandas, and getting the standard deviation for it. I plotted those as the purple curve with the confidence interval around it. The problem with my real data, as illustrated in the plot above, is that the curve isn't smooth at all, and there are sharp jumps in the confidence interval, which also isn't a good representation of the 3 separate curves, as there are no jumps in them. Is there a better way to represent the 3 different curves in one smooth curve with a nice confidence interval? I supply test code, tested on Python 3.5.1 with numpy and pandas (don't change the seed, in order to get the same curves). There is one constraint - increasing the number of points for the "rolling" function isn't a solution for me because some of my data is too short for that. Test code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(seed=42)

## data generation - cumulative analysis over time
df1_time = pd.DataFrame(np.random.uniform(0, 1000, size=50), columns=['time'])
df1_values = pd.DataFrame(np.random.randint(0, 10000, size=100), columns=['vals'])
df1_combined_sorted = pd.concat([df1_time, df1_values], axis=1).sort_values(by=['time'])
df1_combined_sorted_cumulative = np.cumsum(df1_combined_sorted['vals'])

df2_time = pd.DataFrame(np.random.uniform(0, 1000, size=50), columns=['time'])
df2_values = pd.DataFrame(np.random.randint(1000, 13000, size=100), columns=['vals'])
df2_combined_sorted = pd.concat([df2_time, df2_values], axis=1).sort_values(by=['time'])
df2_combined_sorted_cumulative = np.cumsum(df2_combined_sorted['vals'])

df3_time = pd.DataFrame(np.random.uniform(0, 1000, size=50), columns=['time'])
df3_values = pd.DataFrame(np.random.randint(0, 4000, size=100), columns=['vals'])
df3_combined_sorted = pd.concat([df3_time, df3_values], axis=1).sort_values(by=['time'])
df3_combined_sorted_cumulative = np.cumsum(df3_combined_sorted['vals'])

## combining the three curves
df_all_vals_cumulative = pd.concat([df1_combined_sorted_cumulative,
                                    df2_combined_sorted_cumulative,
                                    df3_combined_sorted_cumulative]).reset_index(drop=True)
df_all_time = pd.concat([df1_combined_sorted['time'],
                         df2_combined_sorted['time'],
                         df3_combined_sorted['time']]).reset_index(drop=True)
df_all = pd.concat([df_all_time, df_all_vals_cumulative], axis=1)

## creating confidence intervals
df_all_sorted = df_all.sort_values(by=['time'])
ma = df_all_sorted.rolling(10).mean()
mstd = df_all_sorted.rolling(10).std()

## plotting
plt.fill_between(df_all_sorted['time'], ma['vals'] - 2 * mstd['vals'],
                 ma['vals'] + 2 * mstd['vals'], color='b', alpha=0.2)
plt.plot(df_all_sorted['time'], ma['vals'], c='purple')
plt.plot(df1_combined_sorted['time'], df1_combined_sorted_cumulative, c='blue')
plt.plot(df2_combined_sorted['time'], df2_combined_sorted_cumulative, c='blue')
plt.plot(df3_combined_sorted['time'], df3_combined_sorted_cumulative, c='blue')
plt.show()
First of all, your sample code could be re-written to make better use of pandas. For example:

np.random.seed(seed=42)

## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0, max_time, size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0, max_val, size=100), columns=['vals'])
    df = pd.concat([times, vals], axis=1).sort_values(by=['time']).\
        reset_index().drop('index', axis=1)
    df['cumulative'] = df.vals.cumsum()
    return df

# generate the dataframes
df1, df2, df3 = (df for df in map(get_data, [10000, 13000, 4000]))
dfs = (df1, df2, df3)

# join
df_all = pd.concat(dfs, ignore_index=True).sort_values(by=['time'])

# render function
def render(window=10):
    # compute rolling means and confidence intervals
    mean_val = df_all.cumulative.rolling(window).mean()
    std_val = df_all.cumulative.rolling(window).std()
    min_val = mean_val - 2*std_val
    max_val = mean_val + 2*std_val

    plt.figure(figsize=(16, 9))
    for df in dfs:
        plt.plot(df.time, df.cumulative, c='blue')
    plt.plot(df_all.time, mean_val, c='r')
    plt.fill_between(df_all.time, min_val, max_val, color='blue', alpha=.2)
    plt.show()

The reason your curves aren't that smooth may be that your rolling window is not large enough. You can increase this window size to get smoother graphs - for example, render(20) is noticeably smoother, and render(30) smoother still.

A better approach, though, might be to interpolate each df['cumulative'] over the entire time window and compute the mean/confidence interval on these series. With that in mind, we can modify the code as follows:

np.random.seed(seed=42)

## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0, max_time, size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0, max_val, size=100), columns=['vals'])
    # note that we set time as index of the returned data
    df = pd.concat([times, vals], axis=1).dropna().set_index('time').sort_index()
    df['cumulative'] = df.vals.cumsum()
    return df

df1, df2, df3 = (df for df in map(get_data, [10000, 13000, 4000]))
dfs = (df1, df2, df3)

# rename columns for later plotting
for i, df in zip(range(3), dfs):
    df.rename(columns={'cumulative': f'cumulative_{i}'}, inplace=True)

# concatenate the dataframes with common time index
df_all = pd.concat(dfs, sort=False).sort_index()

# interpolate each cumulative column linearly
df_all.interpolate(inplace=True)

# plot graphs
mean_val = df_all.iloc[:, 1:].mean(axis=1)
std_val = df_all.iloc[:, 1:].std(axis=1)
min_val = mean_val - 2*std_val
max_val = mean_val + 2*std_val

fig, ax = plt.subplots(1, 1, figsize=(16, 9))
df_all.iloc[:, 1:4].plot(ax=ax)
plt.plot(df_all.index, mean_val, c='purple')
plt.fill_between(df_all.index, min_val, max_val, color='blue', alpha=.2)
plt.show()

This gives a much smoother mean curve with a better-behaved confidence band.
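One refinement worth flagging (my addition, untested on the data above): DataFrame.interpolate() defaults to method='linear', which in pandas treats the rows as equally spaced and ignores the actual index values. Since the index here is time, method='index' interpolates in proportion to the real time gaps, which seems more appropriate for unevenly sampled curves:

# interpolate with respect to the time index rather than row position
df_all.interpolate(method='index', inplace=True)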
Get smooth line plot by filling missing values
I have multiple DataFrames (up to 30) which all contain timestamps with associated values. The timestamps in the DataFrames do not necessarily overlap, and the recorded values can only stay the same or increase. A DataFrame may look like this:

   time      coverage
0  0.000000  32.111748
1  0.875050  32.482579
2  1.850576  32.784133
3  3.693440  34.205134
...

I uploaded a couple of csv files with data here 1, 2, 3, 4. What I am trying to do is to plot the increase of the mean and median coverage values over time for all recordings, as follows:

# data is a list of dataframes
keys = ["Run " + str(i) for i in range(len(data))]
glued = pd.concat(data, keys=keys).reset_index(level=0).rename(columns={'level_0': 'Run'})
glued["roundtime"] = glued["time"] / 60
glued["roundtime"] = glued["roundtime"].round(0)  # 1 significant digit

f, (ax1, ax2) = plt.subplots(2)
my_dpi = 96
stepsize = 5
start = 0
end = 60
ax1.set_title("Mean")
ax2.set_title("Median")
f.set_size_inches(1980 / my_dpi, 1080 / my_dpi)
ax1 = sns.lineplot(x="roundtime", y="coverage", ci="sd", estimator="mean", data=glued, ax=ax1)
ax1.set(xlabel="Time", ylabel="Coverage in percent")
ax1.xaxis.set_ticks(np.arange(start, end, stepsize))
ax1.set_xlim(0, 70)
ax2 = sns.lineplot(x="roundtime", y="coverage", ci="sd", estimator='median', data=glued, ax=ax2)
ax2.set(xlabel="Time", ylabel="Coverage in percent")
ax2.xaxis.set_ticks(np.arange(start, end, stepsize))
ax2.set_xlim(0, 70)
plt.show()

However, the curve should never decrease, as the "coverage" values can never decrease either. The reason for this, I suspect, is that at certain points in time I only have recordings from some DataFrames with lower values, and therefore the mean/median is also lower. I tried to fix this by aligning the indices of all the DataFrames and filling missing values with previous recordings, before running any of the previous code. Like this:

# create a common index
index = None
for df in data:
    df.set_index("time", inplace=True, drop=False)
    if index is not None:
        index = index.union(df.index)
    else:
        index = df.index

# reindex all dataframes and fill missing values
new_data = []
for df in data:
    new_df = df.reindex(index, fill_value=np.NaN)
    new_df = new_df.fillna(method="ffill")
    new_data.append(new_df)
data = new_data

The result however doesn't change much and still decreases at certain times. Is this approach wrong, or am I simply missing something?
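For what it's worth, a minimal sketch of the reindex/ffill mechanics on two toy frames (illustrative data, not the uploaded csv files) shows one caveat: timestamps earlier than a run's first recording stay NaN even after ffill, so those runs still drop out of the mean at early times:

import numpy as np
import pandas as pd

a = pd.DataFrame({'time': [0.0, 2.0], 'coverage': [10.0, 30.0]}).set_index('time', drop=False)
b = pd.DataFrame({'time': [1.0, 3.0], 'coverage': [20.0, 40.0]}).set_index('time', drop=False)
index = a.index.union(b.index)
a2 = a.reindex(index).ffill()
b2 = b.reindex(index).ffill()
print(a2['coverage'].tolist())  # [10.0, 10.0, 30.0, 30.0]
print(b2['coverage'].tolist())  # [nan, 20.0, 20.0, 40.0] - leading NaN survives ffill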
Python merge datasets X1(t), X2(t) -> X1(X2)
I have some datasets (let's stay at 2 here) which are dependent on a common variable t, like X1(t) and X2(t). However, X1(t) and X2(t) don't have to share the same t values or even have the same number of datapoints. For example they could look like:

t1 = [2,6,7,8,10,13,14,16,17]
X1 = [10,10,10,20,20,20,30,30,30]
t2 = [3,4,5,6,8,10,11,14,15,16]
X2 = [95,100,100,105,158,150,142,196,200,204]

I am trying to create a new dataset YNew(XNew) (= X2(X1)) such that both datasets are linked without the shared variable t. In this case it should look like:

XNew = [10,20,30]
YNew = [100,150,200]

where every occurring X1-value is assigned a corresponding X2-value (a mean value). Is there an easy, already-known way to achieve this (maybe with pandas)? My first guess would be to find all t-values for a certain X1-value (in the example case the X1-value 10 would lie in the range 2,...,7), then look for all X2-values in that range and take their mean value. Then you should be able to assign YNew(XNew). Thanks for every piece of advice!

Update: I added a graph, so maybe my intentions are a bit clearer. I want to assign the mean X2-value to the corresponding X1-value in the marked regions (where the same X1-values occur). (graph corresponding to example lists)
Alright, I just tried to implement what I mentioned and it works the way I wanted, although I think some things are still a little clumsy...

import pandas as pd
import matplotlib.pyplot as plt

# datasets to treat
t1 = [2,6,7,8,10,13,14,16,17]
X1 = [10,10,10,20,20,20,30,30,30]
t2 = [3,4,5,6,8,10,11,14,15,16]
X2 = [95,100,100,105,158,150,142,196,200,204]

X1Series = pd.Series(X1, index=t1)
X2Series = pd.Series(X2, index=t2)

X1Values = X1Series.drop_duplicates().values  # all occurring values of X1 without duplicates, as an array

# lists for results
XNew = []
YNew = []

# find for every occurring value of X1 the mean value of X2 in the range of X1
for value in X1Values:
    indexpos = X1Series[X1Series == value].index.values
    max_t = indexpos[indexpos.argmax()]  # get max and min index of the range of X1
    min_t = indexpos[indexpos.argmin()]
    print("X1 = "+str(value)+" occurs in range from "+str(min_t)+" to "+str(max_t))

    slicedX2 = X2Series[(X2Series.index >= min_t) & (X2Series.index <= max_t)]  # select range of X2
    print("in this range there are following values of X2:")
    print(slicedX2)

    mean = slicedX2.mean()  # calculate mean value of selection and append extracted values
    print("with the mean value of: " + str(mean))
    XNew.append(value)
    YNew.append(mean)

fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)

ax1.plot(t1, X1, 'ro-', label='X1(t)')
ax1.plot(t2, X2, 'bo', label='X2(t)')
ax1.legend(loc=2)
ax1.set_xlabel('t')
ax1.set_ylabel('X1/X2')

ax2.plot(XNew, YNew, 'ro-', label='YNew(XNew)')
ax2.legend(loc=2)
ax2.set_xlabel('XNew')
ax2.set_ylabel('YNew')

plt.show()
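A sketch of a more compact variant using pandas' merge_asof, which assigns to each X2 sample the most recent X1 level and then takes a groupby mean. On the example lists it reproduces XNew = [10, 20, 30], YNew = [100, 150, 200], though the boundary convention differs slightly from the closed-range version above (each X2 sample follows the last X1 change at or before its time):

import pandas as pd

d1 = pd.DataFrame({'t': t1, 'X1': X1})
d2 = pd.DataFrame({'t': t2, 'X2': X2})
# for each X2 sample, find the X1 level in effect at its time (backward search)
merged = pd.merge_asof(d2.sort_values('t'), d1.sort_values('t'), on='t')
result = merged.groupby('X1')['X2'].mean()
print(result)  # X1: 10 -> 100.0, 20 -> 150.0, 30 -> 200.0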
Adding a single label to the legend for a series of different data points plotted inside a designated bin in Python using matplotlib.pyplot.plot()
I have a script for plotting astronomical data of redmapper clusters using a csv file. I can get the data points and want to plot them using different colors depending on their redshift values: I am binning the dataset into 3 bins (0.1-0.2, 0.2-0.25, 0.25-0.31) based on the redshift. The problem arises with my code after I determine which bin a datapoint belongs to: I want to have 3 labels in the legend corresponding to red, green and blue data points, but this is not happening and I don't know why. I am using plot() instead of scatter() as I also had to do the best fit from the data in the same figure, so everything needs to be in one figure.

import numpy as np
import matplotlib.pyplot as py
import csv

z = open("Sheet4CSV.csv", "rU")
data = csv.reader(z)

x = []
y = []
ylow = []
yupp = []
xlow = []
xupp = []
redshift = []
for r in data:
    x.append(float(r[2]))
    y.append(float(r[5]))
    xlow.append(float(r[3]))
    xupp.append(float(r[4]))
    ylow.append(float(r[6]))
    yupp.append(float(r[7]))
    redshift.append(float(r[1]))

from operator import sub
xerr_l = map(sub, x, xlow)
xerr_u = map(sub, xupp, x)
yerr_l = map(sub, y, ylow)
yerr_u = map(sub, yupp, y)

py.xlabel("$Original\ Tx\ XCS\ pipeline\ Tx\ keV$")
py.ylabel("$Iterative\ Tx\ pipeline\ keV$")
py.xlim(0, 12)
py.ylim(0, 12)
py.title("Redmapper Clusters comparison of Tx pipelines")

ax1 = py.subplot(111)

## Problem starts here after the previous line ##
for p in redshift:
    for i in xrange(84):
        p = redshift[i]
        if 0.1 <= p < 0.2:
            ax1.plot(x[i], y[i], color="b", marker='.', linestyle=" ")  # , label="$z < 0.2$")
            exit
        if 0.2 <= p < 0.25:
            ax1.plot(x[i], y[i], color="g", marker='.', linestyle=" ")  # , label="$0.2 \leq z < 0.25$")
            exit
        if 0.25 <= p <= 0.3:
            ax1.plot(x[i], y[i], color="r", marker='.', linestyle=" ")  # , label="$z \geq 0.25$")
            exit

## There seems nothing wrong after this point ##
py.errorbar(x, y, yerr=[yerr_l, yerr_u], xerr=[xerr_l, xerr_u], fmt=" ", ecolor='magenta', label="Error bars")

cof = np.polyfit(x, y, 1)
p = np.poly1d(cof)
l = np.linspace(0, 12, 100)
py.plot(l, p(l), "black", label="Best fit")
py.plot([0, 15], [0, 15], "black", linestyle="dotted", linewidth=2.0, label="line $y=x$")

py.grid()

box = ax1.get_position()
ax1.set_position([box.x1, box.y1, box.width, box.height])
py.legend(loc='center left', bbox_to_anchor=(1, 0.5))
py.show()

In the first 'for' loop, I have indexed every value 'p' in the list 'redshift' so that bins can be created using 'if' statements. But if I add the labels that are commented out against each py.plot() inside the 'if' statements, each data point 'i' plotted in the figure as an intersection of (x[i], y[i]) takes the label, and my entire legend ends up with 87 labels in total (including the 3 mentioned elsewhere in the code)! I essentially need one label per bin... Please tell me what needs to be done after the bins are created and the py.plot() commands used. Thanks in advance :-) Sorry, I cannot post my image here due to low reputation!
The data appended to the x, y and redshift lists from the csv file is as follows:

x = [5.031,10.599,10.589,8.548,9.089,8.675,3.588,1.244,3.023,8.632,8.953,7.603,7.513,2.917,7.344,7.106,3.889,7.287,3.367,6.839,2.801,2.316,1.328,6.31,6.19,6.329,6.025,5.629,6.123,5.892,5.438,4.398,4.542,4.624,4.501,4.504,5.033,5.068,4.197,2.854,4.784,2.158,4.054,3.124,3.961,4.42,3.853,3.658,1.858,4.537,2.072,3.573,3.041,5.837,3.652,3.209,2.742,2.732,1.312,3.635,2.69,3.32,2.488,2.996,2.269,1.701,3.935,2.015,0.798,2.212,1.672,1.925,3.21,1.979,1.794,2.624,2.027,3.66,1.073,1.007,1.57,0.854,0.619,0.547]
y = [5.255,10.897,11.045,9.125,9.387,17.719,4.025,1.389,4.152,8.703,9.051,8.02,7.774,3.139,7.543,7.224,4.155,7.416,3.905,6.868,2.909,2.658,1.651,6.454,6.252,6.541,6.152,5.647,6.285,6.079,5.489,4.541,4.634,8.851,4.554,4.555,5.559,5.144,5.311,5.839,5.364,3.18,4.352,3.379,4.059,4.575,3.914,5.736,2.304,4.68,3.187,3.756,3.419,9.118,4.595,3.346,3.603,6.313,1.816,4.34,2.732,4.978,2.719,3.761,2.623,2.1,4.956,2.316,4.231,2.831,1.954,2.248,6.573,2.276,2.627,3.85,3.545,25.405,3.996,1.347,1.679,1.435,0.759,0.677]
redshift = [0.12,0.25,0.23,0.23,0.27,0.26,0.12,0.27,0.17,0.18,0.17,0.3,0.23,0.1,0.23,0.29,0.29,0.12,0.13,0.26,0.11,0.24,0.13,0.21,0.17,0.2,0.3,0.29,0.23,0.27,0.25,0.21,0.11,0.15,0.1,0.26,0.23,0.12,0.23,0.26,0.2,0.17,0.22,0.26,0.25,0.12,0.19,0.24,0.18,0.15,0.27,0.14,0.14,0.29,0.29,0.26,0.15,0.29,0.24,0.24,0.23,0.26,0.29,0.22,0.13,0.18,0.24,0.14,0.24,0.24,0.17,0.26,0.29,0.11,0.14,0.26,0.28,0.26,0.28,0.27,0.23,0.26,0.23,0.19]
Working with numerical data like this, you should really consider using a numerical library like numpy. The problem in your code arises from processing each record (a coordinate (x, y) and the corresponding value redshift) one at a time. You are calling plot for each point, thereby creating a legend entry for each of those 84 datapoints. You should consider your "bins" as groups of data that belong to the same dataset and process them as such. You can use "logical masks" to distinguish between your "bins", as shown below. It's also not clear why you call exit after each plotting action.

import numpy as np
import matplotlib.pyplot as plt

x = np.array([5.031,10.599,10.589,8.548,9.089,8.675,3.588,1.244,3.023,8.632,8.953,7.603,7.513,2.917,7.344,7.106,3.889,7.287,3.367,6.839,2.801,2.316,1.328,6.31,6.19,6.329,6.025,5.629,6.123,5.892,5.438,4.398,4.542,4.624,4.501,4.504,5.033,5.068,4.197,2.854,4.784,2.158,4.054,3.124,3.961,4.42,3.853,3.658,1.858,4.537,2.072,3.573,3.041,5.837,3.652,3.209,2.742,2.732,1.312,3.635,2.69,3.32,2.488,2.996,2.269,1.701,3.935,2.015,0.798,2.212,1.672,1.925,3.21,1.979,1.794,2.624,2.027,3.66,1.073,1.007,1.57,0.854,0.619,0.547])
y = np.array([5.255,10.897,11.045,9.125,9.387,17.719,4.025,1.389,4.152,8.703,9.051,8.02,7.774,3.139,7.543,7.224,4.155,7.416,3.905,6.868,2.909,2.658,1.651,6.454,6.252,6.541,6.152,5.647,6.285,6.079,5.489,4.541,4.634,8.851,4.554,4.555,5.559,5.144,5.311,5.839,5.364,3.18,4.352,3.379,4.059,4.575,3.914,5.736,2.304,4.68,3.187,3.756,3.419,9.118,4.595,3.346,3.603,6.313,1.816,4.34,2.732,4.978,2.719,3.761,2.623,2.1,4.956,2.316,4.231,2.831,1.954,2.248,6.573,2.276,2.627,3.85,3.545,25.405,3.996,1.347,1.679,1.435,0.759,0.677])
redshift = np.array([0.12,0.25,0.23,0.23,0.27,0.26,0.12,0.27,0.17,0.18,0.17,0.3,0.23,0.1,0.23,0.29,0.29,0.12,0.13,0.26,0.11,0.24,0.13,0.21,0.17,0.2,0.3,0.29,0.23,0.27,0.25,0.21,0.11,0.15,0.1,0.26,0.23,0.12,0.23,0.26,0.2,0.17,0.22,0.26,0.25,0.12,0.19,0.24,0.18,0.15,0.27,0.14,0.14,0.29,0.29,0.26,0.15,0.29,0.24,0.24,0.23,0.26,0.29,0.22,0.13,0.18,0.24,0.14,0.24,0.24,0.17,0.26,0.29,0.11,0.14,0.26,0.28,0.26,0.28,0.27,0.23,0.26,0.23,0.19])

bin3 = 0.25 <= redshift
bin2 = np.logical_and(0.2 <= redshift, redshift < 0.25)
bin1 = np.logical_and(0.1 <= redshift, redshift < 0.2)

plt.ion()

labels = ("$z < 0.2$", "$0.2 \leq z < 0.25$", "$z \geq 0.25$")
colors = ('r', 'g', 'b')

for bin, label, co in zip((bin1, bin2, bin3), labels, colors):
    plt.plot(x[bin], y[bin], color=co, ls='none', marker='o', label=label)

plt.legend()
plt.show()
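If the number of bins grows, building the masks by hand gets tedious; np.digitize can assign a bin index per point in one call. A small sketch along the same lines (using the same interior bin edges 0.2 and 0.25 as above, and assuming all redshifts lie in roughly 0.1-0.31 as in the question):

import numpy as np

edges = np.array([0.2, 0.25])            # interior bin edges
bin_idx = np.digitize(redshift, edges)   # 0, 1 or 2 for the three bins
for k, (label, co) in enumerate(zip(labels, colors)):
    plt.plot(x[bin_idx == k], y[bin_idx == k], color=co, ls='none', marker='o', label=label)
plt.legend()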