How would I offset a pandas column of data by different amounts?
I am plotting some columns of a pandas dataframe using matplotlib. My plotting strategy is to zero each series to its initial value and then offset each chosen variable by a set amount. For example, this is my current plotting method:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# data is in a dataframe called inputData, with a 'timestamp' column
timeseries_plots = ['var1', 'var3', 'var8']
offsetFactor = 20
for ii, var in enumerate(timeseries_plots):
    # zero reference: the first non-null value of the series
    offsetRef = inputData[var].loc[~inputData[var].isnull()].iloc[0]
    ax.plot(inputData['timestamp'],
            offsetFactor*(len(timeseries_plots)-ii-1) + inputData[var] - offsetRef,
            label=var, markersize=1, marker='None', linestyle='solid')
plt.show()
This produces something like this (with some matplotlib finessing):
As you can see, it subtracts the offsetRef (in this case the initial value of the variable) and then adds a different multiple of offsetFactor (equal to 20 here) to each variable. The result is lines that start vertically offset from each other by 20.
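To make the arithmetic concrete, here is a minimal sketch of that offset calculation for a single toy series (the values are illustrative, not from my data):

s = pd.Series([5.0, 7.0, 6.0])      # a toy variable
offsetFactor = 20
ii, n = 0, 3                        # first of three plotted series
offsetRef = s.dropna().iloc[0]      # initial (first non-null) value
shifted = offsetFactor*(n - ii - 1) + s - offsetRef
print(shifted.tolist())             # [40.0, 42.0, 41.0] -> this line starts at 40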
However, this becomes a problem when the values drift over time and one variable crosses another. What I'd like to do is reset the vertical offset, for example by recalculating the offsetRef beyond a certain date.
I have tried to do this in the following way. I start by initialising an array the same size as the variable, then fill it with the offsetRef recalculated at each of the resetDates. I've included comments marked #PSEUDOCODE where I'm roughly writing what I want to do - sorry in advance for them being pretty rough. Thank you in advance!
fig, ax = plt.subplots()
inputData = pd.DataFrame(np.random.randint(100, size=(100, 5)),
                         columns=['timestamp', 'var2', 'var3', 'var4', 'var5'])
inputData['timestamp'] = pd.date_range('2020-05-01', '2020-08-08')
timeseries_plots = ['var2', 'var3', 'var4']
offsetFactor = 20
resetDates = ['2020-06-23', '2020-07-05']
for ii, var in enumerate(timeseries_plots):
    offsetRef = np.zeros(inputData[var].size)
    for tt, ttdate in enumerate(resetDates):
        if tt == 0:
            # PSEUDOCODE: before the first reset date, reference the first non-null value:
            # offsetRef[inputData['timestamp'] < ttdate] = inputData[var].loc[~inputData[var].isnull()].iloc[0]
            # PSEUDOCODE: from each reset date onwards, reference the value of inputData[var] at ttdate:
            # offsetRef[inputData['timestamp'] >= ttdate] = <value of inputData[var] at ttdate>
            pass
    ax.plot(inputData['timestamp'],
            offsetFactor*(len(timeseries_plots)-ii-1) + inputData[var] - offsetRef,
            label=var, markersize=1, marker='None', linestyle='solid')
plt.show()
This is the current solution that I'll stick here so that it might be useful to others:
# set up df
inputData = pd.DataFrame(np.random.randint(100, size=(100, 5)),
                         columns=['timestamp', 'var2', 'var3', 'var4', 'var5'])
inputData['timestamp'] = pd.date_range('2020-05-01', '2020-08-08')
inputData['var2'] = np.arange(0, 100, 1).astype(float)
inputData['var4'] = np.arange(0, 200, 2)
inputData.loc[0:2, 'var2'] = np.nan  # inject some NaNs to exercise the bfill branch
# set constants and settings
offsetFactor = 20
timeseries_plots = ['var2', 'var4']
resetDates = ['2020-05-05', '2020-05-20', '2020-08-04']
# begin
fig, ax = plt.subplots()
for ii, var in enumerate(timeseries_plots):
    offsetRef = np.zeros(inputData[var].size)
    for tt, ttdate in enumerate(resetDates):
        onDate = inputData['timestamp'] == ttdate
        if tt == 0:
            # before the first reset date, reference the first non-null value
            offsetRef[inputData['timestamp'] < ttdate] = inputData[var].loc[~inputData[var].isnull()].iloc[0]
            if inputData[var].loc[onDate].isna().bool():  # value at the reset date is NaN
                # fall back to the next valid value via bfill
                offsetRef[inputData['timestamp'] >= ttdate] = inputData[var].bfill().loc[onDate]
            else:
                offsetRef[inputData['timestamp'] >= ttdate] = inputData[var].loc[onDate]
        else:
            if inputData[var].loc[onDate].isna().bool():  # value at the reset date is NaN
                offsetRef[inputData['timestamp'] >= ttdate] = inputData[var].bfill().loc[onDate]
            else:
                offsetRef[inputData['timestamp'] >= ttdate] = inputData[var].loc[onDate]
    ax.plot(inputData['timestamp'],
            offsetFactor*(len(timeseries_plots)-ii-1) + inputData[var] - offsetRef)
plt.show()
This 'resets' the offset to 20 at the chosen resetDates to produce the following figure:
I possibly don't need the if-logic to catch NaN data (and could just rely on .bfill() working in either case), but having it makes me feel safer. I will edit as I improve the solution.
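For reference, here is a more compact sketch of the same offsetRef construction that leans on .bfill() unconditionally, so the NaN branches disappear. This is my reading of the logic above (untested against real data), and it assumes every reset date actually appears in the 'timestamp' column:

offsetRef = np.zeros(inputData[var].size)
filled = inputData[var].bfill()  # a reset on a NaN date then picks up the next valid value
# before the first reset date, reference the first non-null value
offsetRef[:] = inputData[var].dropna().iloc[0]
for ttdate in resetDates:
    after = (inputData['timestamp'] >= ttdate).to_numpy()
    offsetRef[after] = filled.loc[inputData['timestamp'] == ttdate].iloc[0]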
Related
Openpyxl minor gridlines
I am working on a Python application where I am collecting data from a device and attempting to plot it in an Excel file using the openpyxl library. I am successfully able to do everything, including plotting the data and formatting the scatter plot that I made, but I am having some trouble adding minor gridlines to the plot. I feel like this is definitely possible, because in the API I can see that the openpyxl.chart.axis module has a "minorGridlines" attribute, but it is not a boolean input (ON/OFF); rather, it takes a ChartLines class. I tried going a bit down the rabbit hole of seeing how I would do this, but I am wondering what the most straightforward way of adding the minor gridlines would be. Do you have to construct chart lines manually, or is there a simple way of doing this? I would really appreciate any help!
I think I answered my own question, but I will post it here in case anybody else needs this (as I don't see any other answers to this question on the forum).

Example Code (see the ChartLines import and the minorGridlines assignment below):

# Imports for script
from openpyxl import Workbook
# For plotting things in excel
from openpyxl.chart import ScatterChart, Reference, Series
from openpyxl.chart.axis import ChartLines
from math import log10

# Variables for script
fileName = 'testFile.xlsx'
dataPoints = 100

# Generating a workbook to test with
wb = Workbook()
ws = wb.active  # Fill data into the first sheet
ws_name = ws.title

# We will just generate a logarithmic plot, and scale the axis logarithmically (will look linear)
x_data = []
y_data = []
for i in range(dataPoints):
    x_data.append(i + 1)
    y_data.append(log10(i + 1))

# Go back through the data, and place the data into the sheet
ws['A1'] = 'x_data'
ws['B1'] = 'y_data'
for i in range(dataPoints):
    ws['A%d' % (i + 2)] = x_data[i]
    ws['B%d' % (i + 2)] = y_data[i]

# Generate a reference to the cells that we can plot
x_axis = Reference(ws, range_string='%s!A2:A%d' % (ws_name, dataPoints + 1))
y_axis = Reference(ws, range_string='%s!B2:B%d' % (ws_name, dataPoints + 1))
function = Series(xvalues=x_axis, values=y_axis)

# Actually create the scatter plot, and append all of the plots to it
ScatterPlot = ScatterChart()
ScatterPlot.x_axis.minorGridlines = ChartLines()
ScatterPlot.x_axis.scaling.logBase = 10
ScatterPlot.series.append(function)
ScatterPlot.x_axis.title = 'X_Data'
ScatterPlot.y_axis.title = 'Y_Data'
ScatterPlot.title = 'Openpyxl Plotting Test'
ws.add_chart(ScatterPlot, 'D2')

# Save the file at the end to output it
wb.save(fileName)

Background on the solution: I looked at how openpyxl generates the major axis gridlines, which seem to follow a similar convention to the minor axis gridlines, and I found that in the NumericAxis class the major gridlines are created with the line labeled '##### THIS Line #####' below (copied from the openpyxl/chart/axis.py file):

class NumericAxis(_BaseAxis):

    tagname = "valAx"

    axId = _BaseAxis.axId
    scaling = _BaseAxis.scaling
    delete = _BaseAxis.delete
    axPos = _BaseAxis.axPos
    majorGridlines = _BaseAxis.majorGridlines
    minorGridlines = _BaseAxis.minorGridlines
    title = _BaseAxis.title
    numFmt = _BaseAxis.numFmt
    majorTickMark = _BaseAxis.majorTickMark
    minorTickMark = _BaseAxis.minorTickMark
    tickLblPos = _BaseAxis.tickLblPos
    spPr = _BaseAxis.spPr
    txPr = _BaseAxis.txPr
    crossAx = _BaseAxis.crossAx
    crosses = _BaseAxis.crosses
    crossesAt = _BaseAxis.crossesAt

    crossBetween = NestedNoneSet(values=(['between', 'midCat']))
    majorUnit = NestedFloat(allow_none=True)
    minorUnit = NestedFloat(allow_none=True)
    dispUnits = Typed(expected_type=DisplayUnitsLabelList, allow_none=True)
    extLst = Typed(expected_type=ExtensionList, allow_none=True)

    __elements__ = _BaseAxis.__elements__ + ('crossBetween', 'majorUnit', 'minorUnit', 'dispUnits',)

    def __init__(self,
                 crossBetween=None,
                 majorUnit=None,
                 minorUnit=None,
                 dispUnits=None,
                 extLst=None,
                 **kw):
        self.crossBetween = crossBetween
        self.majorUnit = majorUnit
        self.minorUnit = minorUnit
        self.dispUnits = dispUnits
        kw.setdefault('majorGridlines', ChartLines())  ######## THIS Line #######
        kw.setdefault('axId', 100)
        kw.setdefault('crossAx', 10)
        super(NumericAxis, self).__init__(**kw)

    @classmethod
    def from_tree(cls, node):
        """
        Special case value axes with no gridlines
        """
        self = super(NumericAxis, cls).from_tree(node)
        gridlines = node.find("{%s}majorGridlines" % CHART_NS)
        if gridlines is None:
            self.majorGridlines = None
        return self

I took a stab, and after importing the ChartLines class like so:

from openpyxl.chart.axis import ChartLines

I was able to add minor gridlines to the x-axis like so:

ScatterPlot.x_axis.minorGridlines = ChartLines()

As far as formatting the minor gridlines goes, I'm at a bit of a loss, and personally have no need, but this at least is a good start.
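If you do later need to style those minor gridlines, ChartLines accepts graphical properties. A minimal, untested sketch, assuming openpyxl's GraphicalProperties and LineProperties API (the colour and dash style here are illustrative):

from openpyxl.chart.axis import ChartLines
from openpyxl.chart.shapes import GraphicalProperties
from openpyxl.drawing.line import LineProperties

# light-grey dashed minor gridlines (colour/dash values are assumptions)
line_props = LineProperties(prstDash='dash', solidFill='D3D3D3')
ScatterPlot.x_axis.minorGridlines = ChartLines(spPr=GraphicalProperties(ln=line_props))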
Averaging several time-series together with confidence interval (with test code)
Sounds very complicated, but a simple plot will make it easy to understand: I have three curves of the cumulative sum of some values over time, which are the blue lines. I want to average (or somehow combine in a statistically correct way) the three curves into one smooth curve and add a confidence interval. I tried one simple solution - combining all the data into one curve, averaging it with the "rolling" function in pandas, and getting the standard deviation for it. I plotted those as the purple curve with the confidence interval around it. The problem with my real data, as illustrated in the plot above, is that the curve isn't smooth at all, and there are sharp jumps in the confidence interval, which also isn't a good representation of the 3 separate curves, as there are no jumps in them. Is there a better way to represent the 3 different curves in one smooth curve with a nice confidence interval? I supply test code, tested on Python 3.5.1 with numpy and pandas (don't change the seed, in order to get the same curves). There is one constraint - increasing the number of points for the "rolling" function isn't a solution for me because some of my data is too short for that. Test code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(seed=42)

## data generation - cumulative analysis over time
df1_time = pd.DataFrame(np.random.uniform(0, 1000, size=50), columns=['time'])
df1_values = pd.DataFrame(np.random.randint(0, 10000, size=100), columns=['vals'])
df1_combined_sorted = pd.concat([df1_time, df1_values], axis=1).sort_values(by=['time'])
df1_combined_sorted_cumulative = np.cumsum(df1_combined_sorted['vals'])

df2_time = pd.DataFrame(np.random.uniform(0, 1000, size=50), columns=['time'])
df2_values = pd.DataFrame(np.random.randint(1000, 13000, size=100), columns=['vals'])
df2_combined_sorted = pd.concat([df2_time, df2_values], axis=1).sort_values(by=['time'])
df2_combined_sorted_cumulative = np.cumsum(df2_combined_sorted['vals'])

df3_time = pd.DataFrame(np.random.uniform(0, 1000, size=50), columns=['time'])
df3_values = pd.DataFrame(np.random.randint(0, 4000, size=100), columns=['vals'])
df3_combined_sorted = pd.concat([df3_time, df3_values], axis=1).sort_values(by=['time'])
df3_combined_sorted_cumulative = np.cumsum(df3_combined_sorted['vals'])

## combining the three curves
df_all_vals_cumulative = pd.concat([df1_combined_sorted_cumulative,
                                    df2_combined_sorted_cumulative,
                                    df3_combined_sorted_cumulative]).reset_index(drop=True)
df_all_time = pd.concat([df1_combined_sorted['time'],
                         df2_combined_sorted['time'],
                         df3_combined_sorted['time']]).reset_index(drop=True)
df_all = pd.concat([df_all_time, df_all_vals_cumulative], axis=1)

## creating confidence intervals
df_all_sorted = df_all.sort_values(by=['time'])
ma = df_all_sorted.rolling(10).mean()
mstd = df_all_sorted.rolling(10).std()

## plotting
plt.fill_between(df_all_sorted['time'], ma['vals'] - 2 * mstd['vals'],
                 ma['vals'] + 2 * mstd['vals'], color='b', alpha=0.2)
plt.plot(df_all_sorted['time'], ma['vals'], c='purple')
plt.plot(df1_combined_sorted['time'], df1_combined_sorted_cumulative, c='blue')
plt.plot(df2_combined_sorted['time'], df2_combined_sorted_cumulative, c='blue')
plt.plot(df3_combined_sorted['time'], df3_combined_sorted_cumulative, c='blue')
plt.show()
First of all, your sample code could be re-written to make better use of pandas. For example:

np.random.seed(seed=42)

## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0, max_time, size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0, max_val, size=100), columns=['vals'])
    df = pd.concat([times, vals], axis=1).sort_values(by=['time']).\
        reset_index().drop('index', axis=1)
    df['cumulative'] = df.vals.cumsum()
    return df

# generate the dataframes
df1, df2, df3 = (df for df in map(get_data, [10000, 13000, 4000]))
dfs = (df1, df2, df3)

# join
df_all = pd.concat(dfs, ignore_index=True).sort_values(by=['time'])

# render function
def render(window=10):
    # compute rolling means and confidence intervals
    mean_val = df_all.cumulative.rolling(window).mean()
    std_val = df_all.cumulative.rolling(window).std()
    min_val = mean_val - 2*std_val
    max_val = mean_val + 2*std_val

    plt.figure(figsize=(16, 9))
    for df in dfs:
        plt.plot(df.time, df.cumulative, c='blue')
    plt.plot(df_all.time, mean_val, c='r')
    plt.fill_between(df_all.time, min_val, max_val, color='blue', alpha=.2)
    plt.show()

The reason your curves aren't that smooth may be that your rolling window is not large enough. You can increase this window size to get smoother graphs - for example, render(20) is noticeably smoother, and render(30) smoother still.

A better approach, though, might be to interpolate each df['cumulative'] over the entire time window and compute the mean/confidence interval on these series. With that in mind, we can modify the code as follows:

np.random.seed(seed=42)

## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0, max_time, size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0, max_val, size=100), columns=['vals'])
    # note that we set time as index of the returned data
    df = pd.concat([times, vals], axis=1).dropna().set_index('time').sort_index()
    df['cumulative'] = df.vals.cumsum()
    return df

df1, df2, df3 = (df for df in map(get_data, [10000, 13000, 4000]))
dfs = (df1, df2, df3)

# rename columns for later plotting
for i, df in zip(range(3), dfs):
    df.rename(columns={'cumulative': f'cumulative_{i}'}, inplace=True)

# concatenate the dataframes with common time index
df_all = pd.concat(dfs, sort=False).sort_index()

# interpolate each cumulative column linearly
df_all.interpolate(inplace=True)

# plot graphs
mean_val = df_all.iloc[:, 1:].mean(axis=1)
std_val = df_all.iloc[:, 1:].std(axis=1)
min_val = mean_val - 2*std_val
max_val = mean_val + 2*std_val

fig, ax = plt.subplots(1, 1, figsize=(16, 9))
df_all.iloc[:, 1:4].plot(ax=ax)
plt.plot(df_all.index, mean_val, c='purple')
plt.fill_between(df_all.index, min_val, max_val, color='blue', alpha=.2)
plt.show()

This gives a much smoother mean curve with a better-behaved confidence band.
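One refinement worth flagging (my addition, untested on the data above): DataFrame.interpolate() defaults to method='linear', which in pandas treats the rows as equally spaced and ignores the actual index values. Since the index here is time, method='index' interpolates in proportion to the real time gaps, which seems more appropriate for unevenly sampled curves:

# interpolate with respect to the time index rather than row position
df_all.interpolate(method='index', inplace=True)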
Get smooth line plot by filling missing values
I have multiple DataFrames (up to 30) which all contain timestamps with associated values. The timestamps in the DataFrames do not necessarily overlap, and the recorded values can only stay the same or increase. A DataFrame may look like this:

   time      coverage
0  0.000000  32.111748
1  0.875050  32.482579
2  1.850576  32.784133
3  3.693440  34.205134
...

I uploaded a couple of csv files with data here 1, 2, 3, 4. What I am trying to do is to plot the increase of the mean and median coverage values over time for all recordings, as follows:

# data is a list of dataframes
keys = ["Run " + str(i) for i in range(len(data))]
glued = pd.concat(data, keys=keys).reset_index(level=0).rename(columns={'level_0': 'Run'})
glued["roundtime"] = glued["time"] / 60
glued["roundtime"] = glued["roundtime"].round(0)  # 1 significant digit

f, (ax1, ax2) = plt.subplots(2)
my_dpi = 96
stepsize = 5
start = 0
end = 60
ax1.set_title("Mean")
ax2.set_title("Median")
f.set_size_inches(1980 / my_dpi, 1080 / my_dpi)
ax1 = sns.lineplot(x="roundtime", y="coverage", ci="sd", estimator="mean", data=glued, ax=ax1)
ax1.set(xlabel="Time", ylabel="Coverage in percent")
ax1.xaxis.set_ticks(np.arange(start, end, stepsize))
ax1.set_xlim(0, 70)
ax2 = sns.lineplot(x="roundtime", y="coverage", ci="sd", estimator='median', data=glued, ax=ax2)
ax2.set(xlabel="Time", ylabel="Coverage in percent")
ax2.xaxis.set_ticks(np.arange(start, end, stepsize))
ax2.set_xlim(0, 70)
plt.show()

However, the curve should never decrease, as the "coverage" values can never decrease either. The reason for this, I suspect, is that at certain points in time I only have recordings from some DataFrames with lower values, and therefore the mean/median is also lower. I tried to fix this by aligning the indices of all the DataFrames and filling missing values with previous recordings, before running any of the previous code. Like this:

# create a common index
index = None
for df in data:
    df.set_index("time", inplace=True, drop=False)
    if index is not None:
        index = index.union(df.index)
    else:
        index = df.index

# reindex all dataframes and fill missing values
new_data = []
for df in data:
    new_df = df.reindex(index, fill_value=np.NaN)
    new_df = new_df.fillna(method="ffill")
    new_data.append(new_df)
data = new_data

The result however doesn't change much and still decreases at certain times. Is this approach wrong, or am I simply missing something?
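For what it's worth, a minimal sketch of the reindex/ffill mechanics on two toy frames (illustrative data, not the uploaded csv files) shows one caveat: timestamps earlier than a run's first recording stay NaN even after ffill, so those runs still drop out of the mean at early times:

import numpy as np
import pandas as pd

a = pd.DataFrame({'time': [0.0, 2.0], 'coverage': [10.0, 30.0]}).set_index('time', drop=False)
b = pd.DataFrame({'time': [1.0, 3.0], 'coverage': [20.0, 40.0]}).set_index('time', drop=False)
index = a.index.union(b.index)
a2 = a.reindex(index).ffill()
b2 = b.reindex(index).ffill()
print(a2['coverage'].tolist())  # [10.0, 10.0, 30.0, 30.0]
print(b2['coverage'].tolist())  # [nan, 20.0, 20.0, 40.0] - leading NaN survives ffill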
Python merge datasets X1(t), X2(t) -> X1(X2)
I have some datasets (let's stay at 2 here) which are dependent on a common variable t, like X1(t) and X2(t). However, X1(t) and X2(t) don't have to share the same t values or even have the same number of datapoints. For example they could look like:

t1 = [2,6,7,8,10,13,14,16,17]
X1 = [10,10,10,20,20,20,30,30,30]
t2 = [3,4,5,6,8,10,11,14,15,16]
X2 = [95,100,100,105,158,150,142,196,200,204]

I am trying to create a new dataset YNew(XNew) (= X2(X1)) such that both datasets are linked without the shared variable t. In this case it should look like:

XNew = [10,20,30]
YNew = [100,150,200]

where every occurring X1-value is assigned a corresponding X2-value (a mean value). Is there an easy, already-known way to achieve this (maybe with pandas)? My first guess would be to find all t-values for a certain X1-value (in the example case the X1-value 10 would lie in the range 2,...,7), then look for all X2-values in that range and take their mean value. Then you should be able to assign YNew(XNew). Thanks for every piece of advice!

Update: I added a graph, so maybe my intentions are a bit clearer. I want to assign the mean X2-value to the corresponding X1-value in the marked regions (where the same X1-values occur). (graph corresponding to example lists)
Alright, I just tried to implement what I mentioned and it works the way I wanted, although I think some things are still a little clumsy...

import pandas as pd
import matplotlib.pyplot as plt

# datasets to treat
t1 = [2,6,7,8,10,13,14,16,17]
X1 = [10,10,10,20,20,20,30,30,30]
t2 = [3,4,5,6,8,10,11,14,15,16]
X2 = [95,100,100,105,158,150,142,196,200,204]

X1Series = pd.Series(X1, index=t1)
X2Series = pd.Series(X2, index=t2)

X1Values = X1Series.drop_duplicates().values  # all occurring values of X1 without duplicates, as an array

# lists for results
XNew = []
YNew = []

# find for every occurring value of X1 the mean value of X2 in the range of X1
for value in X1Values:
    indexpos = X1Series[X1Series == value].index.values
    max_t = indexpos[indexpos.argmax()]  # get max and min index of the range of X1
    min_t = indexpos[indexpos.argmin()]
    print("X1 = "+str(value)+" occurs in range from "+str(min_t)+" to "+str(max_t))

    slicedX2 = X2Series[(X2Series.index >= min_t) & (X2Series.index <= max_t)]  # select range of X2
    print("in this range there are following values of X2:")
    print(slicedX2)

    mean = slicedX2.mean()  # calculate mean value of selection and append extracted values
    print("with the mean value of: " + str(mean))
    XNew.append(value)
    YNew.append(mean)

fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)

ax1.plot(t1, X1, 'ro-', label='X1(t)')
ax1.plot(t2, X2, 'bo', label='X2(t)')
ax1.legend(loc=2)
ax1.set_xlabel('t')
ax1.set_ylabel('X1/X2')

ax2.plot(XNew, YNew, 'ro-', label='YNew(XNew)')
ax2.legend(loc=2)
ax2.set_xlabel('XNew')
ax2.set_ylabel('YNew')

plt.show()
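A sketch of a more compact variant using pandas' merge_asof, which assigns to each X2 sample the most recent X1 level and then takes a groupby mean. On the example lists it reproduces XNew = [10, 20, 30], YNew = [100, 150, 200], though the boundary convention differs slightly from the closed-range version above (each X2 sample follows the last X1 change at or before its time):

import pandas as pd

d1 = pd.DataFrame({'t': t1, 'X1': X1})
d2 = pd.DataFrame({'t': t2, 'X2': X2})
# for each X2 sample, find the X1 level in effect at its time (backward search)
merged = pd.merge_asof(d2.sort_values('t'), d1.sort_values('t'), on='t')
result = merged.groupby('X1')['X2'].mean()
print(result)  # X1: 10 -> 100.0, 20 -> 150.0, 30 -> 200.0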
Adding a single label to the legend for a series of different data points plotted inside a designated bin in Python using matplotlib.pyplot.plot()
I have a script for plotting astronomical data of redmapper clusters using a csv file. I can get the data points and want to plot them using different colors depending on their redshift values: I am binning the dataset into 3 bins (0.1-0.2, 0.2-0.25, 0.25-0.31) based on the redshift. The problem arises with my code after I determine which bin a datapoint belongs to: I want to have 3 labels in the legend corresponding to red, green and blue data points, but this is not happening and I don't know why. I am using plot() instead of scatter() as I also had to do the best fit from the data in the same figure, so everything needs to be in one figure.

import numpy as np
import matplotlib.pyplot as py
import csv

z = open("Sheet4CSV.csv", "rU")
data = csv.reader(z)

x = []
y = []
ylow = []
yupp = []
xlow = []
xupp = []
redshift = []
for r in data:
    x.append(float(r[2]))
    y.append(float(r[5]))
    xlow.append(float(r[3]))
    xupp.append(float(r[4]))
    ylow.append(float(r[6]))
    yupp.append(float(r[7]))
    redshift.append(float(r[1]))

from operator import sub
xerr_l = map(sub, x, xlow)
xerr_u = map(sub, xupp, x)
yerr_l = map(sub, y, ylow)
yerr_u = map(sub, yupp, y)

py.xlabel("$Original\ Tx\ XCS\ pipeline\ Tx\ keV$")
py.ylabel("$Iterative\ Tx\ pipeline\ keV$")
py.xlim(0, 12)
py.ylim(0, 12)
py.title("Redmapper Clusters comparison of Tx pipelines")

ax1 = py.subplot(111)

## Problem starts here after the previous line ##
for p in redshift:
    for i in xrange(84):
        p = redshift[i]
        if 0.1 <= p < 0.2:
            ax1.plot(x[i], y[i], color="b", marker='.', linestyle=" ")  # , label="$z < 0.2$")
            exit
        if 0.2 <= p < 0.25:
            ax1.plot(x[i], y[i], color="g", marker='.', linestyle=" ")  # , label="$0.2 \leq z < 0.25$")
            exit
        if 0.25 <= p <= 0.3:
            ax1.plot(x[i], y[i], color="r", marker='.', linestyle=" ")  # , label="$z \geq 0.25$")
            exit

## There seems nothing wrong after this point ##
py.errorbar(x, y, yerr=[yerr_l, yerr_u], xerr=[xerr_l, xerr_u], fmt=" ", ecolor='magenta', label="Error bars")

cof = np.polyfit(x, y, 1)
p = np.poly1d(cof)
l = np.linspace(0, 12, 100)
py.plot(l, p(l), "black", label="Best fit")
py.plot([0, 15], [0, 15], "black", linestyle="dotted", linewidth=2.0, label="line $y=x$")

py.grid()

box = ax1.get_position()
ax1.set_position([box.x1, box.y1, box.width, box.height])
py.legend(loc='center left', bbox_to_anchor=(1, 0.5))
py.show()

In the first 'for' loop, I have indexed every value 'p' in the list 'redshift' so that bins can be created using 'if' statements. But if I add the labels that are commented out against each py.plot() inside the 'if' statements, each data point 'i' plotted in the figure as an intersection of (x[i], y[i]) takes the label, and my entire legend ends up with 87 labels in total (including the 3 mentioned elsewhere in the code)! I essentially need one label per bin... Please tell me what needs to be done after the bins are created and the py.plot() commands used. Thanks in advance :-) Sorry, I cannot post my image here due to low reputation!
The data appended to the x, y and redshift lists from the csv file is as follows:

x = [5.031,10.599,10.589,8.548,9.089,8.675,3.588,1.244,3.023,8.632,8.953,7.603,7.513,2.917,7.344,7.106,3.889,7.287,3.367,6.839,2.801,2.316,1.328,6.31,6.19,6.329,6.025,5.629,6.123,5.892,5.438,4.398,4.542,4.624,4.501,4.504,5.033,5.068,4.197,2.854,4.784,2.158,4.054,3.124,3.961,4.42,3.853,3.658,1.858,4.537,2.072,3.573,3.041,5.837,3.652,3.209,2.742,2.732,1.312,3.635,2.69,3.32,2.488,2.996,2.269,1.701,3.935,2.015,0.798,2.212,1.672,1.925,3.21,1.979,1.794,2.624,2.027,3.66,1.073,1.007,1.57,0.854,0.619,0.547]
y = [5.255,10.897,11.045,9.125,9.387,17.719,4.025,1.389,4.152,8.703,9.051,8.02,7.774,3.139,7.543,7.224,4.155,7.416,3.905,6.868,2.909,2.658,1.651,6.454,6.252,6.541,6.152,5.647,6.285,6.079,5.489,4.541,4.634,8.851,4.554,4.555,5.559,5.144,5.311,5.839,5.364,3.18,4.352,3.379,4.059,4.575,3.914,5.736,2.304,4.68,3.187,3.756,3.419,9.118,4.595,3.346,3.603,6.313,1.816,4.34,2.732,4.978,2.719,3.761,2.623,2.1,4.956,2.316,4.231,2.831,1.954,2.248,6.573,2.276,2.627,3.85,3.545,25.405,3.996,1.347,1.679,1.435,0.759,0.677]
redshift = [0.12,0.25,0.23,0.23,0.27,0.26,0.12,0.27,0.17,0.18,0.17,0.3,0.23,0.1,0.23,0.29,0.29,0.12,0.13,0.26,0.11,0.24,0.13,0.21,0.17,0.2,0.3,0.29,0.23,0.27,0.25,0.21,0.11,0.15,0.1,0.26,0.23,0.12,0.23,0.26,0.2,0.17,0.22,0.26,0.25,0.12,0.19,0.24,0.18,0.15,0.27,0.14,0.14,0.29,0.29,0.26,0.15,0.29,0.24,0.24,0.23,0.26,0.29,0.22,0.13,0.18,0.24,0.14,0.24,0.24,0.17,0.26,0.29,0.11,0.14,0.26,0.28,0.26,0.28,0.27,0.23,0.26,0.23,0.19]
Working with numerical data like this, you should really consider using a numerical library like numpy. The problem in your code arises from processing each record (a coordinate (x, y) and the corresponding value redshift) one at a time. You are calling plot for each point, thereby creating a legend entry for each of those 84 datapoints. You should consider your "bins" as groups of data that belong to the same dataset and process them as such. You can use "logical masks" to distinguish between your "bins", as shown below. It's also not clear why you call exit after each plotting action.

import numpy as np
import matplotlib.pyplot as plt

x = np.array([5.031,10.599,10.589,8.548,9.089,8.675,3.588,1.244,3.023,8.632,8.953,7.603,7.513,2.917,7.344,7.106,3.889,7.287,3.367,6.839,2.801,2.316,1.328,6.31,6.19,6.329,6.025,5.629,6.123,5.892,5.438,4.398,4.542,4.624,4.501,4.504,5.033,5.068,4.197,2.854,4.784,2.158,4.054,3.124,3.961,4.42,3.853,3.658,1.858,4.537,2.072,3.573,3.041,5.837,3.652,3.209,2.742,2.732,1.312,3.635,2.69,3.32,2.488,2.996,2.269,1.701,3.935,2.015,0.798,2.212,1.672,1.925,3.21,1.979,1.794,2.624,2.027,3.66,1.073,1.007,1.57,0.854,0.619,0.547])
y = np.array([5.255,10.897,11.045,9.125,9.387,17.719,4.025,1.389,4.152,8.703,9.051,8.02,7.774,3.139,7.543,7.224,4.155,7.416,3.905,6.868,2.909,2.658,1.651,6.454,6.252,6.541,6.152,5.647,6.285,6.079,5.489,4.541,4.634,8.851,4.554,4.555,5.559,5.144,5.311,5.839,5.364,3.18,4.352,3.379,4.059,4.575,3.914,5.736,2.304,4.68,3.187,3.756,3.419,9.118,4.595,3.346,3.603,6.313,1.816,4.34,2.732,4.978,2.719,3.761,2.623,2.1,4.956,2.316,4.231,2.831,1.954,2.248,6.573,2.276,2.627,3.85,3.545,25.405,3.996,1.347,1.679,1.435,0.759,0.677])
redshift = np.array([0.12,0.25,0.23,0.23,0.27,0.26,0.12,0.27,0.17,0.18,0.17,0.3,0.23,0.1,0.23,0.29,0.29,0.12,0.13,0.26,0.11,0.24,0.13,0.21,0.17,0.2,0.3,0.29,0.23,0.27,0.25,0.21,0.11,0.15,0.1,0.26,0.23,0.12,0.23,0.26,0.2,0.17,0.22,0.26,0.25,0.12,0.19,0.24,0.18,0.15,0.27,0.14,0.14,0.29,0.29,0.26,0.15,0.29,0.24,0.24,0.23,0.26,0.29,0.22,0.13,0.18,0.24,0.14,0.24,0.24,0.17,0.26,0.29,0.11,0.14,0.26,0.28,0.26,0.28,0.27,0.23,0.26,0.23,0.19])

bin3 = 0.25 <= redshift
bin2 = np.logical_and(0.2 <= redshift, redshift < 0.25)
bin1 = np.logical_and(0.1 <= redshift, redshift < 0.2)

plt.ion()

labels = ("$z < 0.2$", "$0.2 \leq z < 0.25$", "$z \geq 0.25$")
colors = ('r', 'g', 'b')

for bin, label, co in zip((bin1, bin2, bin3), labels, colors):
    plt.plot(x[bin], y[bin], color=co, ls='none', marker='o', label=label)

plt.legend()
plt.show()
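If the number of bins grows, building the masks by hand gets tedious; np.digitize can assign a bin index per point in one call. A small sketch along the same lines (using the same interior bin edges 0.2 and 0.25 as above, and assuming all redshifts lie in roughly 0.1-0.31 as in the question):

import numpy as np

edges = np.array([0.2, 0.25])            # interior bin edges
bin_idx = np.digitize(redshift, edges)   # 0, 1 or 2 for the three bins
for k, (label, co) in enumerate(zip(labels, colors)):
    plt.plot(x[bin_idx == k], y[bin_idx == k], color=co, ls='none', marker='o', label=label)
plt.legend()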