How to drop data above a certain frequency in a histogram/dataset? - python

To make things clearer: I don't want to remove an entire bin from the histogram, I just want to get rid of some of the data so that the bin is brought below a desired frequency. The line in the image shows the max frequency I would like.
For context, I have a dataset containing a number of angles. My question is very similar to the one asked here, Remove data above threshold in histogram, in terms of the data used, but unlike that question I don't wish to get rid of the data, just reduce it.
Can I do this directly from the histogram, or will I need to delete some of the data in the dataset?
Edit (sorry, I am new to coding and formatting here): here is a solution I tried:
bns = 30
hist, bins = np.histogram(dataset['Steering'], bins=bns)
removeddata = []
spb = 700
for j in range(bns):
    rdata = []
    for i in range(len(dataset['Steering'])):
        if dataset['Steering'][i] >= bins[j] and dataset['Steering'][i] <= bins[j+1]:
            rdata.append(i)
    rdata = shuffle(rdata)
    rdata = rdata[spb:]
    removeddata.extend(rdata)
print('removed:', len(removeddata))
dataset.drop(dataset.index[removeddata], inplace=True)
print('remaining:', len(dataset))
center = (bins[:-1] + bins[1:]) * 0.5
plt.bar(center, hist, width=0.05)
plt.show()
This is somebody else's solution and it seemed to work for them, but even copying it directly, it still throws errors. The first error I got was "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()". I tried changing 'and' to &, and then got "TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]". I'm unsure what this refers to, but it points to the line with the if statement. I checked the dtype of everything and they are all float64, so I'm unsure of my next step.
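A likely cause of the TypeError: in Python, & binds more tightly than >= and <=, so each comparison needs its own parentheses when combining boolean Series. A minimal sketch of the vectorized fix, using hypothetical stand-in data for dataset['Steering']:

import numpy as np
import pandas as pd

# hypothetical stand-in for the real dataset
dataset = pd.DataFrame({'Steering': np.random.uniform(-1, 1, 1000)})
hist, bins = np.histogram(dataset['Steering'], bins=30)

j = 0
# without the parentheses, bins[j] & dataset['Steering'] would be evaluated first
mask = (dataset['Steering'] >= bins[j]) & (dataset['Steering'] <= bins[j + 1])
rdata = list(dataset.index[mask])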

This solution takes into account the clarified requirement that the original input data that exceeds the frequency threshold be dropped. I left my other answer because it is simpler and different enough that it may be useful to another user.
To clarify, this answer produces a new 1D array of data with fewer elements and then plots a histogram from that new data. The data are shuffled before the elements are removed (in case the input data were pre-sorted) in order to prevent bias in dropping data from either the low or high side of each bin.
import numpy as np
import matplotlib.pyplot as plt
from random import shuffle

def remove_gated_val_recursive(idx, to_gate_lst, bins_lst, data_lst):
    if to_gate_lst[idx] == 0:
        return data_lst
    else:
        bin_min, bin_max = bins_lst[idx], bins_lst[idx + 1]
        for i in range(len(data_lst)):
            if bin_min <= data_lst[i] < bin_max:
                del data_lst[i]
                to_gate_lst[idx] -= 1
                break
        return remove_gated_val_recursive(idx, to_gate_lst, bins_lst, data_lst)
threshold = 80

fig, ax1 = plt.subplots()
ax1.set_title("Some data")

np.random.seed(30)
data = np.random.randn(1000)
num_bins = 23

raw_hist, raw_bins = np.histogram(data, num_bins)

# count how many points must be removed from each over-threshold bin
to_gate = []
for i in range(len(raw_hist)):
    if raw_hist[i] > threshold:
        to_gate.append(raw_hist[i] - threshold)
    else:
        to_gate.append(0)

data_lst = list(data)
shuffle(data_lst)
for idx in range(len(raw_hist)):
    remove_gated_val_recursive(idx, to_gate, raw_bins, data_lst)

new_data = np.array(data_lst)
hist, bins = np.histogram(new_data, num_bins)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) * 0.5
ax1.bar(center, hist, align='center', width=width)
plt.show()
gives the following histogram, plotted from the new_data array.
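One caveat on the recursive removal above: it recurses once per removed element, so a bin overshooting the threshold by more than Python's recursion limit (about 1000 frames by default) would raise a RecursionError. A minimal iterative sketch of the same gating step, under the same arguments:

def remove_gated_val_iterative(idx, to_gate_lst, bins_lst, data_lst):
    # same gating as the recursive version, but with a plain loop
    bin_min, bin_max = bins_lst[idx], bins_lst[idx + 1]
    while to_gate_lst[idx] > 0:
        for i in range(len(data_lst)):
            if bin_min <= data_lst[i] < bin_max:
                del data_lst[i]
                to_gate_lst[idx] -= 1
                break
        else:  # no point of this bin left to remove
            break
    return data_lst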

This answer doesn't re-bin or re-center the data, but I believe it generally achieves what you're asking. Working from the example in the chosen answer of the post you linked, I edit the hist array so that the original input data is not changed, which you indicated is your preferred solution:
import numpy as np
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.set_title("Some data")
ax2.set_title("Gated data < threshold")

np.random.seed(10)
data = np.random.randn(1000)
num_bins = 23
avg_samples_per_bin = 200  # unused here

hist, bins = np.histogram(data, num_bins)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) * 0.5
ax1.bar(center, hist, align='center', width=width)

threshold = 80
# cap each bin count at the threshold without touching the input data
gated = np.empty([len(hist)], dtype=np.int64)
for i in range(len(hist)):
    if hist[i] > threshold:
        gated[i] = threshold
    else:
        gated[i] = hist[i]
ax2.bar(center, gated, align="center", width=width)
plt.show()
which gives

Related

Draw longest possible vertical line between two curves in seaborn

I currently have a plot like this (consider that data is the dataframe I pasted at the very bottom):
import seaborn as sns

sns.relplot(
    data=data,
    x="Threshold",
    y="Value",
    kind="line",
    hue="Metric"
).set(xlabel="Threshold")
Which produces:
Now I want to know how I can annotate a line in this plot such that it is located between the curves, at the x-axis value where the distance between the curves is maximized. I would also need to annotate text showing the distance value.
It should be something like this:
Here is the pandas dataframe:
Threshold,Metric,Value
0.0,Recall,1.0
0.010101010101010102,Recall,0.9802536231884058
0.020202020202020204,Recall,0.9706521739130435
0.030303030303030304,Recall,0.9621376811594203
0.04040404040404041,Recall,0.9541666666666667
0.05050505050505051,Recall,0.9456521739130435
0.06060606060606061,Recall,0.9322463768115942
0.07070707070707072,Recall,0.9173913043478261
0.08080808080808081,Recall,0.908695652173913
0.09090909090909091,Recall,0.8976449275362319
0.10101010101010102,Recall,0.8813405797101449
0.11111111111111112,Recall,0.8644927536231884
0.12121212121212122,Recall,0.8498188405797101
0.13131313131313133,Recall,0.8358695652173913
0.14141414141414144,Recall,0.818659420289855
0.15151515151515152,Recall,0.7967391304347826
0.16161616161616163,Recall,0.7748188405797102
0.17171717171717174,Recall,0.7521739130434782
0.18181818181818182,Recall,0.7269927536231884
0.19191919191919193,Recall,0.6952898550724638
0.20202020202020204,Recall,0.6704710144927536
0.21212121212121213,Recall,0.648731884057971
0.22222222222222224,Recall,0.6097826086956522
0.23232323232323235,Recall,0.5847826086956521
0.24242424242424243,Recall,0.5521739130434783
0.25252525252525254,Recall,0.5023550724637681
0.26262626262626265,Recall,0.4766304347826087
0.27272727272727276,Recall,0.42047101449275365
0.2828282828282829,Recall,0.3958333333333333
0.29292929292929293,Recall,0.3539855072463768
0.30303030303030304,Recall,0.3327898550724638
0.31313131313131315,Recall,0.3036231884057971
0.32323232323232326,Recall,0.2798913043478261
0.33333333333333337,Recall,0.2371376811594203
0.3434343434343435,Recall,0.22119565217391304
0.3535353535353536,Recall,0.17300724637681159
0.36363636363636365,Recall,0.15996376811594204
0.37373737373737376,Recall,0.13568840579710145
0.38383838383838387,Recall,0.11938405797101449
0.393939393939394,Recall,0.10652173913043478
0.4040404040404041,Recall,0.09891304347826087
0.4141414141414142,Recall,0.08894927536231884
0.42424242424242425,Recall,0.07681159420289856
0.43434343434343436,Recall,0.06557971014492754
0.4444444444444445,Recall,0.05253623188405797
0.4545454545454546,Recall,0.04655797101449275
0.4646464646464647,Recall,0.024456521739130436
0.4747474747474748,Recall,0.019384057971014494
0.48484848484848486,Recall,0.009782608695652175
0.494949494949495,Recall,0.0034420289855072463
0.5050505050505051,Recall,0.002173913043478261
0.5151515151515152,Recall,0.0016304347826086956
0.5252525252525253,Recall,0.0007246376811594203
0.5353535353535354,Recall,0.00018115942028985507
0.5454545454545455,Recall,0.0
0.5555555555555556,Recall,0.0
0.5656565656565657,Recall,0.0
0.5757575757575758,Recall,0.0
0.5858585858585859,Recall,0.0
0.595959595959596,Recall,0.0
0.6060606060606061,Recall,0.0
0.6161616161616162,Recall,0.0
0.6262626262626263,Recall,0.0
0.6363636363636365,Recall,0.0
0.6464646464646465,Recall,0.0
0.6565656565656566,Recall,0.0
0.6666666666666667,Recall,0.0
0.6767676767676768,Recall,0.0
0.686868686868687,Recall,0.0
0.696969696969697,Recall,0.0
0.7070707070707072,Recall,0.0
0.7171717171717172,Recall,0.0
0.7272727272727273,Recall,0.0
0.7373737373737375,Recall,0.0
0.7474747474747475,Recall,0.0
0.7575757575757577,Recall,0.0
0.7676767676767677,Recall,0.0
0.7777777777777778,Recall,0.0
0.787878787878788,Recall,0.0
0.797979797979798,Recall,0.0
0.8080808080808082,Recall,0.0
0.8181818181818182,Recall,0.0
0.8282828282828284,Recall,0.0
0.8383838383838385,Recall,0.0
0.8484848484848485,Recall,0.0
0.8585858585858587,Recall,0.0
0.8686868686868687,Recall,0.0
0.8787878787878789,Recall,0.0
0.888888888888889,Recall,0.0
0.8989898989898991,Recall,0.0
0.9090909090909092,Recall,0.0
0.9191919191919192,Recall,0.0
0.9292929292929294,Recall,0.0
0.9393939393939394,Recall,0.0
0.9494949494949496,Recall,0.0
0.9595959595959597,Recall,0.0
0.9696969696969697,Recall,0.0
0.9797979797979799,Recall,0.0
0.98989898989899,Recall,0.0
1.0,Recall,0.0
0.0,Fall-out,1.0
0.010101010101010102,Fall-out,0.6990465720990212
0.020202020202020204,Fall-out,0.58461408367334
0.030303030303030304,Fall-out,0.516647992727734
0.04040404040404041,Fall-out,0.4643680104855929
0.05050505050505051,Fall-out,0.4172674037587468
0.06060606060606061,Fall-out,0.3796376551170116
0.07070707070707072,Fall-out,0.3507811343889394
0.08080808080808081,Fall-out,0.33186055852694335
0.09090909090909091,Fall-out,0.3152231359533222
0.10101010101010102,Fall-out,0.29964272879098575
0.11111111111111112,Fall-out,0.2855844238208993
0.12121212121212122,Fall-out,0.27161068008371564
0.13131313131313133,Fall-out,0.25719298987379235
0.14141414141414144,Fall-out,0.24338836860241422
0.15151515151515152,Fall-out,0.2312538316808659
0.16161616161616163,Fall-out,0.22026087140350506
0.17171717171717174,Fall-out,0.2083377375642137
0.18181818181818182,Fall-out,0.19694311143056467
0.19191919191919193,Fall-out,0.18402638310466565
0.20202020202020204,Fall-out,0.17440754286197493
0.21212121212121213,Fall-out,0.16548633279073208
0.22222222222222224,Fall-out,0.15278100754709004
0.23232323232323235,Fall-out,0.14292962391391667
0.24242424242424243,Fall-out,0.1317252605542989
0.25252525252525254,Fall-out,0.11555292476164303
0.26262626262626265,Fall-out,0.10612434729298353
0.27272727272727276,Fall-out,0.08902183793839714
0.2828282828282829,Fall-out,0.08331395471745978
0.29292929292929293,Fall-out,0.07232099444009894
0.30303030303030304,Fall-out,0.06735302200706086
0.31313131313131315,Fall-out,0.061454876012092256
0.32323232323232326,Fall-out,0.05665602604485973
0.33333333333333337,Fall-out,0.048982094158932836
0.3434343434343435,Fall-out,0.045641925459273196
0.3535353535353536,Fall-out,0.03748176648415534
0.36363636363636365,Fall-out,0.0341415977844957
0.37373737373737376,Fall-out,0.029321607509037482
0.38383838383838387,Fall-out,0.026996173604211148
0.393939393939394,Fall-out,0.024353635075999407
0.4040404040404041,Fall-out,0.022514428260364035
0.4141414141414142,Fall-out,0.01940680295118703
0.42424242424242425,Fall-out,0.017165930279263473
0.43434343434343436,Fall-out,0.014459970826374648
0.4444444444444445,Fall-out,0.011035240893812233
0.4545454545454546,Fall-out,0.009386296852208105
0.4646464646464647,Fall-out,0.004756569350781135
0.4747474747474748,Fall-out,0.003868676405301989
0.48484848484848486,Fall-out,0.002135171130795087
0.494949494949495,Fall-out,0.0008033317125763693
0.5050505050505051,Fall-out,0.0004228061645138786
0.5151515151515152,Fall-out,0.00031710462338540896
0.5252525252525253,Fall-out,4.228061645138786e-05
0.5353535353535354,Fall-out,0.0
0.5454545454545455,Fall-out,0.0
0.5555555555555556,Fall-out,0.0
0.5656565656565657,Fall-out,0.0
0.5757575757575758,Fall-out,0.0
0.5858585858585859,Fall-out,0.0
0.595959595959596,Fall-out,0.0
0.6060606060606061,Fall-out,0.0
0.6161616161616162,Fall-out,0.0
0.6262626262626263,Fall-out,0.0
0.6363636363636365,Fall-out,0.0
0.6464646464646465,Fall-out,0.0
0.6565656565656566,Fall-out,0.0
0.6666666666666667,Fall-out,0.0
0.6767676767676768,Fall-out,0.0
0.686868686868687,Fall-out,0.0
0.696969696969697,Fall-out,0.0
0.7070707070707072,Fall-out,0.0
0.7171717171717172,Fall-out,0.0
0.7272727272727273,Fall-out,0.0
0.7373737373737375,Fall-out,0.0
0.7474747474747475,Fall-out,0.0
0.7575757575757577,Fall-out,0.0
0.7676767676767677,Fall-out,0.0
0.7777777777777778,Fall-out,0.0
0.787878787878788,Fall-out,0.0
0.797979797979798,Fall-out,0.0
0.8080808080808082,Fall-out,0.0
0.8181818181818182,Fall-out,0.0
0.8282828282828284,Fall-out,0.0
0.8383838383838385,Fall-out,0.0
0.8484848484848485,Fall-out,0.0
0.8585858585858587,Fall-out,0.0
0.8686868686868687,Fall-out,0.0
0.8787878787878789,Fall-out,0.0
0.888888888888889,Fall-out,0.0
0.8989898989898991,Fall-out,0.0
0.9090909090909092,Fall-out,0.0
0.9191919191919192,Fall-out,0.0
0.9292929292929294,Fall-out,0.0
0.9393939393939394,Fall-out,0.0
0.9494949494949496,Fall-out,0.0
0.9595959595959597,Fall-out,0.0
0.9696969696969697,Fall-out,0.0
0.9797979797979799,Fall-out,0.0
0.98989898989899,Fall-out,0.0
1.0,Fall-out,0.0
Use pivot to transform the data from long to wide
Use idxmax to find the x (Threshold) of the max difference between y1 and y2 (Fall-out and Recall)
Use vlines to plot the vertical line at x from y1 to y2
Use annotate to plot the label at the midpoint of y1 and y2
g = sns.relplot(data=data, x='Threshold', y='Value', hue='Metric', kind='line')

# pivot to wide form
p = data.pivot(index='Threshold', columns='Metric', values='Value')

# find x, y1, and y2 corresponding to max difference
diff = p['Fall-out'].sub(p['Recall']).abs()
x = diff.idxmax()
y1, y2 = p.loc[x]

# plot line and label
ax = g.axes.flat[0]
ax.vlines(x, y1, y2, ls='--')
ax.annotate(f'Dist = {diff.loc[x]:.2f}', ha='left', va='center',
            xy=(x, 0.5 * (y1 + y2)), xycoords='data',
            xytext=(5, 0), textcoords='offset pixels')
The easiest way I can think of is to create two separate lists: one of all values where the metric is Recall and another of all values where the metric is Fall-out. This can be done easily using pandas operations as follows (assuming the dataframe is named df):
import math
import matplotlib.pyplot as plt

ls_metric = df['Metric'].to_list()
ls_value = df['Value'].to_list()
ls_threshold = df['Threshold'].to_list()

ls_value_recall = []
ls_value_fallout = []
ls_threshold_recall = []
ls_threshold_fallout = []

# split the long-form rows into separate Recall and Fall-out lists
for i, j, k in zip(ls_metric, ls_value, ls_threshold):
    if i == 'Recall':
        ls_value_recall.append(j)
        ls_threshold_recall.append(k)
    elif i == 'Fall-out':
        ls_value_fallout.append(j)
        ls_threshold_fallout.append(k)

# absolute distance between the curves at each threshold
ls_dist = []
for i, j in zip(ls_value_recall, ls_value_fallout):
    ls_dist.append(math.fabs(i - j))

max_diff = max(ls_dist)
location_of_max_diff = ls_dist.index(max_diff)
value_of_threshold_at_max_diff = ls_threshold_recall[location_of_max_diff]
value_of_recall_at_max_diff = ls_value_recall[location_of_max_diff]
value_of_fallout_at_max_diff = ls_value_fallout[location_of_max_diff]

x_values = [value_of_threshold_at_max_diff, value_of_threshold_at_max_diff]
y_values = [value_of_recall_at_max_diff, value_of_fallout_at_max_diff]
plt.plot(x_values, y_values)
Certain assumptions: the threshold values are the same and the same number of readings are present for both metrics, which I think is true from a brief glance at the data; if not, I believe the code is still pretty easy to modify.
You can add this plot to your own figure (the syntax is readily available). As for the label for the line, one way is to use matplotlib.pyplot.text to add a text box, though you'll need to tweak its location to get the desired placement; another way would be to add it as a legend entry only. A sketch of the text approach follows.
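For instance, continuing from the variables computed in the snippet above, a minimal sketch of the matplotlib.pyplot.text approach (placing the label at the line's midpoint):

# place a label at the midpoint of the vertical line drawn above
mid_y = 0.5 * (value_of_recall_at_max_diff + value_of_fallout_at_max_diff)
plt.text(value_of_threshold_at_max_diff, mid_y,
         'Dist = {:.2f}'.format(max_diff), ha='left', va='center')
plt.show()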

Pandas finding local max and min

I have a pandas data frame with two columns: one is temperature, the other is time.
I would like to make third and fourth columns, called min and max, filled with NaNs except where there is a local min or max; there they would hold the value of that extremum.
Here is a sample of what the data looks like, essentially I am trying to identify all the peaks and low points in the figure.
Are there any built in tools with pandas that can accomplish this?
The solution offered by fuglede is great, but if your data is very noisy (like the one in the picture) you will end up with lots of misleading local extrema. I suggest that you use the scipy.signal.argrelextrema() method instead. It has its own limitations, but it offers a useful feature: you can specify the number of points to be compared, which works a bit like a noise-filtering algorithm. For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import argrelextrema

# Generate a noisy AR(1) sample
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
    xs.append(xs[-1] * 0.9 + r)
df = pd.DataFrame(xs, columns=['data'])

n = 5  # number of points to be checked before and after

# Find local peaks
df['min'] = df.iloc[argrelextrema(df.data.values, np.less_equal,
                                  order=n)[0]]['data']
df['max'] = df.iloc[argrelextrema(df.data.values, np.greater_equal,
                                  order=n)[0]]['data']

# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
plt.plot(df.index, df['data'])
plt.show()
Some points:
you might need to check the points afterwards to ensure there are no pairs of extrema very close to each other (a sketch of one way to thin them follows this list)
you can play with n to filter out noisy points
argrelextrema returns a tuple, and the [0] at the end extracts a numpy array
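A minimal sketch of that first point, with illustrative names that are not from the original answer: drop any candidate peak that sits within min_sep indices of a taller one.

def thin_peaks(indices, values, min_sep=3):
    # keep only the tallest peak within any run of nearby candidates
    # (for minima, flip the comparison)
    kept = []
    for idx, val in zip(indices, values):
        if kept and idx - kept[-1][0] < min_sep:
            if val > kept[-1][1]:
                kept[-1] = (idx, val)
        else:
            kept.append((idx, val))
    return [i for i, _ in kept]

# usage with the df and n from the example above:
# peak_idx = argrelextrema(df.data.values, np.greater_equal, order=n)[0]
# peak_idx = thin_peaks(peak_idx, df.data.values[peak_idx])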
Assuming that the column of interest is labelled data, one solution would be
df['min'] = df.data[(df.data.shift(1) > df.data) & (df.data.shift(-1) > df.data)]
df['max'] = df.data[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)]
For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Generate a noisy AR(1) sample
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
    xs.append(xs[-1] * 0.9 + r)
df = pd.DataFrame(xs, columns=['data'])

# Find local peaks
df['min'] = df.data[(df.data.shift(1) > df.data) & (df.data.shift(-1) > df.data)]
df['max'] = df.data[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)]

# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
df.data.plot()
using Numpy
import numpy as np

ser = np.random.randint(-40, 40, 100)  # 100 points
peak = np.where(np.diff(ser) < 0)[0]  # indices where the next value is lower (every falling step, a superset of the peaks)
or
double_difference = np.diff(np.sign(np.diff(ser)))
peak = np.where(double_difference == -2)[0]  # slope sign flips from +1 to -1: a strict local maximum
using Pandas
import pandas as pd

ser = pd.Series(np.random.randint(2, 5, 100))
peak_df = ser[(ser.shift(1) < ser) & (ser.shift(-1) < ser)]
peak = peak_df.index
You can do something similar to Foad's .argrelextrema() solution, but with the Pandas .rolling() function:
# Find local peaks
n = 5  # rolling window size
local_min_vals = df.loc[df['data'] == df['data'].rolling(n, center=True).min()]
local_max_vals = df.loc[df['data'] == df['data'].rolling(n, center=True).max()]

plt.scatter(local_min_vals.index, local_min_vals, c='r')
plt.scatter(local_max_vals.index, local_max_vals, c='g')

Python merge datasets X1(t), X2(t) -> X1(X2)

I have some datasets (let's stick with 2 here) which depend on a common variable t, like X1(t) and X2(t). However, X1(t) and X2(t) don't have to share the same t values or even have the same number of data points.
For example they could look like:
t1 = [2,6,7,8,10,13,14,16,17]
X1 = [10,10,10,20,20,20,30,30,30]
t2 = [3,4,5,6,8,10,11,14,15,16]
X2 = [95,100,100,105,158,150,142,196,200,204]
I am trying to create a new dataset YNew(XNew) (=X2(X1)) such that both datasets are linked without the shared variable t.
In this case it should look like:
XNew = [10,20,30]
YNew = [100,150,200]
where every occurring X1 value is assigned a corresponding X2 value (a mean value).
Is there an easy, already known way to achieve this (maybe with pandas)?
My first guess would be to find all t-values for a certain X1-value (in the example case the X1-value 10 would lie in the range 2,...,7) and then look for all X2-values in that range and get their mean value. Then you should be able to assign YNew(XNew).
Thanks for any advice!
Update:
I added a graph, so maybe my intentions are a bit more clear. I want to assign the mean X2-value to the corresponding X1-value in the marked regions (where the same X1-values occur).
graph corresponding to example lists
Alright, I just tried to implement what I mentioned and it works the way I wanted.
Although I think that some things are still a little clumsy...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# datasets to treat
t1 = [2,6,7,8,10,13,14,16,17]
X1 = [10,10,10,20,20,20,30,30,30]
t2 = [3,4,5,6,8,10,11,14,15,16]
X2 = [95,100,100,105,158,150,142,196,200,204]

X1Series = pd.Series(X1, index=t1)
X2Series = pd.Series(X2, index=t2)

X1Values = X1Series.drop_duplicates().values  # all occurring values of X1, without duplicates, as an array

# lists for results
XNew = []
YNew = []

# find, for every occurring value of X1, the mean value of X2 in the t-range of that X1 value
for value in X1Values:
    indexpos = X1Series[X1Series == value].index.values
    max_t = indexpos[indexpos.argmax()]  # get max and min index of the range of X1
    min_t = indexpos[indexpos.argmin()]
    print("X1 = " + str(value) + " occurs in range from " + str(min_t) + " to " + str(max_t))
    slicedX2 = X2Series[(X2Series.index >= min_t) & (X2Series.index <= max_t)]  # select range of X2
    print("in this range there are the following values of X2:")
    print(slicedX2)
    mean = slicedX2.mean()  # calculate mean value of the selection and append the extracted values
    print("with the mean value of: " + str(mean))
    XNew.append(value)
    YNew.append(mean)

fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
ax1.plot(t1, X1, 'ro-', label='X1(t)')
ax1.plot(t2, X2, 'bo', label='X2(t)')
ax1.legend(loc=2)
ax1.set_xlabel('t')
ax1.set_ylabel('X1/X2')
ax2.plot(XNew, YNew, 'ro-', label='YNew(XNew)')
ax2.legend(loc=2)
ax2.set_xlabel('XNew')
ax2.set_ylabel('YNew')
plt.show()
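As a possibly less clumsy alternative, a sketch using pandas alone: forward-fill X1 onto X2's time points and group by the filled label. The boundary semantics differ slightly from the loop above (each t2 is assigned to the most recent X1 value rather than to a closed [min_t, max_t] range), though both give the same result on this example:

import pandas as pd

t1 = [2, 6, 7, 8, 10, 13, 14, 16, 17]
X1 = [10, 10, 10, 20, 20, 20, 30, 30, 30]
t2 = [3, 4, 5, 6, 8, 10, 11, 14, 15, 16]
X2 = [95, 100, 100, 105, 158, 150, 142, 196, 200, 204]

s1 = pd.Series(X1, index=t1)
s2 = pd.Series(X2, index=t2)

# label each X2 sample with the last X1 value at or before its t
labels = s1.reindex(s1.index.union(s2.index)).ffill().loc[s2.index]
print(s2.groupby(labels).mean())  # 10 -> 100.0, 20 -> 150.0, 30 -> 200.0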

Adding a single label to the legend for a series of different data points plotted inside a designated bin in Python using matplotlib.pyplot.plot()

I have a script for plotting astronomical data of redmapper clusters using a csv file. I can read the data points from it and want to plot them using different colors depending on their redshift values: I am binning the dataset into 3 bins (0.1-0.2, 0.2-0.25, 0.25-0.31) based on redshift.
The problem arises in my code after I determine which bin a data point belongs to: I want to have 3 labels in the legend, corresponding to the red, green and blue data points, but this is not happening and I don't know why. I am using plot() instead of scatter() because I also had to draw the best fit from the same data, so everything needs to be in one figure.
import numpy as np
import matplotlib.pyplot as py
import csv

z = open("Sheet4CSV.csv", "rU")
data = csv.reader(z)

x = []
y = []
ylow = []
yupp = []
xlow = []
xupp = []
redshift = []

for r in data:
    x.append(float(r[2]))
    y.append(float(r[5]))
    xlow.append(float(r[3]))
    xupp.append(float(r[4]))
    ylow.append(float(r[6]))
    yupp.append(float(r[7]))
    redshift.append(float(r[1]))

from operator import sub
xerr_l = map(sub, x, xlow)
xerr_u = map(sub, xupp, x)
yerr_l = map(sub, y, ylow)
yerr_u = map(sub, yupp, y)

py.xlabel("$Original\ Tx\ XCS\ pipeline\ Tx\ keV$")
py.ylabel("$Iterative\ Tx\ pipeline\ keV$")
py.xlim(0, 12)
py.ylim(0, 12)
py.title("Redmapper Clusters comparison of Tx pipelines")

ax1 = py.subplot(111)

##Problem starts here after the previous line##

for p in redshift:
    for i in xrange(84):
        p = redshift[i]
        if 0.1 <= p < 0.2:
            ax1.plot(x[i], y[i], color="b", marker='.', linestyle=" ")  # , label = "$z < 0.2$")
            exit
        if 0.2 <= p < 0.25:
            ax1.plot(x[i], y[i], color="g", marker='.', linestyle=" ")  # , label="$0.2 \leq z < 0.25$")
            exit
        if 0.25 <= p <= 0.3:
            ax1.plot(x[i], y[i], color="r", marker='.', linestyle=" ")  # , label="$z \geq 0.25$")
            exit

##There seems nothing wrong after this point##

py.errorbar(x, y, yerr=[yerr_l, yerr_u], xerr=[xerr_l, xerr_u], fmt=" ", ecolor='magenta', label="Error bars")

cof = np.polyfit(x, y, 1)
p = np.poly1d(cof)
l = np.linspace(0, 12, 100)
py.plot(l, p(l), "black", label="Best fit")
py.plot([0, 15], [0, 15], "black", linestyle="dotted", linewidth=2.0, label="line $y=x$")
py.grid()

box = ax1.get_position()
ax1.set_position([box.x1, box.y1, box.width, box.height])
py.legend(loc='center left', bbox_to_anchor=(1, 0.5))
py.show()
In the first 'for' loop, I have indexed every value 'p' in the list 'redshift' so that bins can be created using the 'if' statements. But if I add the labels that are commented out against each py.plot() inside the 'if' statements, every data point 'i' plotted at (x[i], y[i]) takes the label, and my entire legend ends up with 87 labels in total (including the 3 set elsewhere in the code)!
I essentially need 1 label for each bin...
Please tell me what needs to be done after the bins are created and the py.plot() commands are used... Thanks in advance :-)
Sorry I cannot post my image here due to low reputation!
The data 'appended' for x, y and redshift lists from the csv file are as follows:
x=[5.031,10.599,10.589,8.548,9.089,8.675,3.588,1.244,3.023,8.632,8.953,7.603,7.513,2.917,7.344,7.106,3.889,7.287,3.367,6.839,2.801,2.316,1.328,6.31,6.19,6.329,6.025,5.629,6.123,5.892,5.438,4.398,4.542,4.624,4.501,4.504,5.033,5.068,4.197,2.854,4.784,2.158,4.054,3.124,3.961,4.42,3.853,3.658,1.858,4.537,2.072,3.573,3.041,5.837,3.652,3.209,2.742,2.732,1.312,3.635,2.69,3.32,2.488,2.996,2.269,1.701,3.935,2.015,0.798,2.212,1.672,1.925,3.21,1.979,1.794,2.624,2.027,3.66,1.073,1.007,1.57,0.854,0.619,0.547]
y=[5.255,10.897,11.045,9.125,9.387,17.719,4.025,1.389,4.152,8.703,9.051,8.02,7.774,3.139,7.543,7.224,4.155,7.416,3.905,6.868,2.909,2.658,1.651,6.454,6.252,6.541,6.152,5.647,6.285,6.079,5.489,4.541,4.634,8.851,4.554,4.555,5.559,5.144,5.311,5.839,5.364,3.18,4.352,3.379,4.059,4.575,3.914,5.736,2.304,4.68,3.187,3.756,3.419,9.118,4.595,3.346,3.603,6.313,1.816,4.34,2.732,4.978,2.719,3.761,2.623,2.1,4.956,2.316,4.231,2.831,1.954,2.248,6.573,2.276,2.627,3.85,3.545,25.405,3.996,1.347,1.679,1.435,0.759,0.677]
redshift = [0.12,0.25,0.23,0.23,0.27,0.26,0.12,0.27,0.17,0.18,0.17,0.3,0.23,0.1,0.23,0.29,0.29,0.12,0.13,0.26,0.11,0.24,0.13,0.21,0.17,0.2,0.3,0.29,0.23,0.27,0.25,0.21,0.11,0.15,0.1,0.26,0.23,0.12,0.23,0.26,0.2,0.17,0.22,0.26,0.25,0.12,0.19,0.24,0.18,0.15,0.27,0.14,0.14,0.29,0.29,0.26,0.15,0.29,0.24,0.24,0.23,0.26,0.29,0.22,0.13,0.18,0.24,0.14,0.24,0.24,0.17,0.26,0.29,0.11,0.14,0.26,0.28,0.26,0.28,0.27,0.23,0.26,0.23,0.19]
Working with numerical data like this, you should really consider using a numerical library, like numpy.
The problem in your code arises from processing each record (a coordinate (x, y) and the corresponding redshift value) one at a time. You are calling plot for each point, thereby creating a legend entry for each of those 84 datapoints. You should consider your "bins" as groups of data that belong to the same dataset and process them as such. You could use "logical masks" to distinguish between your "bins", as shown below.
It's also not clear why you call exit after each plotting action.
import numpy as np
import matplotlib.pyplot as plt
x = np.array([5.031,10.599,10.589,8.548,9.089,8.675,3.588,1.244,3.023,8.632,8.953,7.603,7.513,2.917,7.344,7.106,3.889,7.287,3.367,6.839,2.801,2.316,1.328,6.31,6.19,6.329,6.025,5.629,6.123,5.892,5.438,4.398,4.542,4.624,4.501,4.504,5.033,5.068,4.197,2.854,4.784,2.158,4.054,3.124,3.961,4.42,3.853,3.658,1.858,4.537,2.072,3.573,3.041,5.837,3.652,3.209,2.742,2.732,1.312,3.635,2.69,3.32,2.488,2.996,2.269,1.701,3.935,2.015,0.798,2.212,1.672,1.925,3.21,1.979,1.794,2.624,2.027,3.66,1.073,1.007,1.57,0.854,0.619,0.547])
y = np.array([5.255,10.897,11.045,9.125,9.387,17.719,4.025,1.389,4.152,8.703,9.051,8.02,7.774,3.139,7.543,7.224,4.155,7.416,3.905,6.868,2.909,2.658,1.651,6.454,6.252,6.541,6.152,5.647,6.285,6.079,5.489,4.541,4.634,8.851,4.554,4.555,5.559,5.144,5.311,5.839,5.364,3.18,4.352,3.379,4.059,4.575,3.914,5.736,2.304,4.68,3.187,3.756,3.419,9.118,4.595,3.346,3.603,6.313,1.816,4.34,2.732,4.978,2.719,3.761,2.623,2.1,4.956,2.316,4.231,2.831,1.954,2.248,6.573,2.276,2.627,3.85,3.545,25.405,3.996,1.347,1.679,1.435,0.759,0.677])
redshift = np.array([0.12,0.25,0.23,0.23,0.27,0.26,0.12,0.27,0.17,0.18,0.17,0.3,0.23,0.1,0.23,0.29,0.29,0.12,0.13,0.26,0.11,0.24,0.13,0.21,0.17,0.2,0.3,0.29,0.23,0.27,0.25,0.21,0.11,0.15,0.1,0.26,0.23,0.12,0.23,0.26,0.2,0.17,0.22,0.26,0.25,0.12,0.19,0.24,0.18,0.15,0.27,0.14,0.14,0.29,0.29,0.26,0.15,0.29,0.24,0.24,0.23,0.26,0.29,0.22,0.13,0.18,0.24,0.14,0.24,0.24,0.17,0.26,0.29,0.11,0.14,0.26,0.28,0.26,0.28,0.27,0.23,0.26,0.23,0.19])
bin3 = 0.25 <= redshift
bin2 = np.logical_and(0.2 <= redshift, redshift < 0.25)
bin1 = np.logical_and(0.1 <= redshift, redshift < 0.2)
plt.ion()
labels = ("$z < 0.2$", "$0.2 \leq z < 0.25$", "$z \geq 0.25$")
colors = ('r', 'g', 'b')
for bin, label, co in zip((bin1, bin2, bin3), labels, colors):
    plt.plot(x[bin], y[bin], color=co, ls='none', marker='o', label=label)
plt.legend()
plt.show()
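If you'd rather keep a per-point loop, another possible sketch is to attach each legend label only the first time its bin appears, using matplotlib's '_nolegend_' convention for the remaining points (reusing x, y and redshift from above):

seen = set()
for xi, yi, p in zip(x, y, redshift):
    if 0.1 <= p < 0.2:
        key, co, lab = 1, 'b', '$z < 0.2$'
    elif 0.2 <= p < 0.25:
        key, co, lab = 2, 'g', '$0.2 \leq z < 0.25$'
    elif 0.25 <= p <= 0.3:
        key, co, lab = 3, 'r', '$z \geq 0.25$'
    else:
        continue
    # '_nolegend_' keeps repeat points out of the legend
    plt.plot(xi, yi, color=co, marker='.', ls=' ',
             label=lab if key not in seen else '_nolegend_')
    seen.add(key)
plt.legend()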

Obtaining values used in boxplot, using python and matplotlib

I can draw a boxplot from data:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.rand(100)
plt.boxplot(data)
Then, the box will range from the 25th-percentile to 75th-percentile, and the whisker will range from the smallest value to the largest value between (25th-percentile - 1.5*IQR, 75th-percentile + 1.5*IQR), where the IQR denotes the inter-quartile range. (Of course, the value 1.5 is customizable).
Now I want to know the values used in the boxplot, i.e. the median, upper and lower quartile, the upper whisker end point and the lower whisker end point. While the former three are easy to obtain by using np.median() and np.percentile(), the end point of the whiskers will require some verbose coding:
median = np.median(data)
upper_quartile = np.percentile(data, 75)
lower_quartile = np.percentile(data, 25)
iqr = upper_quartile - lower_quartile
upper_whisker = data[data<=upper_quartile+1.5*iqr].max()
lower_whisker = data[data>=lower_quartile-1.5*iqr].min()
I was wondering, while this is acceptable, would there be a neater way to do this? It seems that the values should be ready to pull-out from the boxplot, as it's already drawn.
Why do you want to do so? What you are doing is already pretty direct.
Yeah, if you want to fetch them from the plot, when the plot is already made, simply use the get_ydata() method.
B = plt.boxplot(data)
[item.get_ydata() for item in B['whiskers']]
It returns an array of shape (2,) for each whisker; the second element is the value we want:
[item.get_ydata()[1] for item in B['whiskers']]
I ran into this recently and wrote a function to extract the boxplot values from the boxplot as a pandas dataframe.
The function is:
def get_box_plot_data(labels, bp):
    rows_list = []
    for i in range(len(labels)):
        dict1 = {}
        dict1['label'] = labels[i]
        dict1['lower_whisker'] = bp['whiskers'][i*2].get_ydata()[1]
        dict1['lower_quartile'] = bp['boxes'][i].get_ydata()[1]
        dict1['median'] = bp['medians'][i].get_ydata()[1]
        dict1['upper_quartile'] = bp['boxes'][i].get_ydata()[2]
        dict1['upper_whisker'] = bp['whiskers'][(i*2)+1].get_ydata()[1]
        rows_list.append(dict1)
    return pd.DataFrame(rows_list)
It is called by passing an array of labels (the ones you would pass to the boxplot plotting function) and the object returned by the boxplot function itself.
For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def get_box_plot_data(labels, bp):
    rows_list = []
    for i in range(len(labels)):
        dict1 = {}
        dict1['label'] = labels[i]
        dict1['lower_whisker'] = bp['whiskers'][i*2].get_ydata()[1]
        dict1['lower_quartile'] = bp['boxes'][i].get_ydata()[1]
        dict1['median'] = bp['medians'][i].get_ydata()[1]
        dict1['upper_quartile'] = bp['boxes'][i].get_ydata()[2]
        dict1['upper_whisker'] = bp['whiskers'][(i*2)+1].get_ydata()[1]
        rows_list.append(dict1)
    return pd.DataFrame(rows_list)

data1 = np.random.normal(loc=0, scale=1, size=1000)
data2 = np.random.normal(loc=5, scale=1, size=1000)
data3 = np.random.normal(loc=10, scale=1, size=1000)

labels = ['data1', 'data2', 'data3']
bp = plt.boxplot([data1, data2, data3], labels=labels)
print(get_box_plot_data(labels, bp))
plt.show()
Outputs the following from get_box_plot_data:
label lower_whisker lower_quartile median upper_quartile upper_whisker
0 data1 -2.491652 -0.587869 0.047543 0.696750 2.559301
1 data2 2.351567 4.310068 4.984103 5.665910 7.489808
2 data3 7.227794 9.278931 9.947674 10.661581 12.733275
And produces the following plot:
upper_whisker = data[data <= upper_quartile + 1.5*iqr].max()
lower_whisker = data[data >= lower_quartile - 1.5*iqr].min()
is equal to
upper_whisker = data.max()
lower_whisker = data.min()
only when no points lie beyond the fences, i.e. if you just want the extreme real data points in the dataset. Statistically speaking, the fences themselves sit at upper_quartile + 1.5*IQR and lower_quartile - 1.5*IQR; the whiskers end at the most extreme data points inside them.
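As a quick sanity check (assuming the default whis=1.5): matplotlib computes its box statistics with np.percentile, so the manual whisker values should match the whisker artists:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
data = np.random.rand(100)

upper_quartile = np.percentile(data, 75)
lower_quartile = np.percentile(data, 25)
iqr = upper_quartile - lower_quartile
upper_whisker = data[data <= upper_quartile + 1.5 * iqr].max()
lower_whisker = data[data >= lower_quartile - 1.5 * iqr].min()

B = plt.boxplot(data)
# whiskers come in (lower, upper) pairs per box; element [1] is the whisker end
assert np.isclose(B['whiskers'][0].get_ydata()[1], lower_whisker)
assert np.isclose(B['whiskers'][1].get_ydata()[1], upper_whisker)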
