binning random data into groups of equal data point by their value

I have a two-column dataframe (volume and price), and I want to create 20 bins based on the volume column, with an equal number of data points in each bin.
For example, if I have volume = [1,6,8,2,6,9,3,6] and 4 bins, I want to cut the data into 1st bin: 1:2, 2nd: 3:6, 3rd: 6:8, 4th: 8:9.
I then want to find the average volume and price within each bin and plot volume (x-axis) against price (y-axis).
The intervals don't need to be equally spaced. I want the same number of data points in each interval, to determine the range of each interval, and then to plot the average value of the data within each interval.
data = df['Volume']
discrete_dat, cutoff = discretize(data, 20) #discretize is my own helper (not shown here)
myList = sorted(set(cutoff))
Cutoff = np.asarray(myList)
df_2 = pd.DataFrame({'X' : df['Volume'], 'Y' : df['dMidP']}) #we build a dataframe from the data
data_cut = pd.cut(data,Cutoff) #we cut the data following the bins
grp = df_2.groupby(by = data_cut) #we group the data by the cut
ret = grp.aggregate(np.mean) #we produce an aggregate representation (mean) of each bin
plt.loglog(df['Volume'],df['dMidP'],'o')
plt.loglog(ret.X,ret.Y,'r-')
plt.title('Price Impact (Sell)')
plt.xlabel('Volume')
plt.ylabel('dMidP')
plt.show()
my raw data and output plot
However, when I count the points per interval with Counter, I get the following, which shows that the number of data points differs from interval to interval.
Counter({Interval(0.41299999999999998, 0.46400000000000002, closed='right'): 2029,
Interval(0.877, 0.92800000000000005, closed='right'): 543,
Interval(0.050999999999999997, 0.069599999999999995, closed='right'): 93,
Interval(0.60299999999999998, 0.71399999999999997, closed='right'): 99,
Interval(0.46400000000000002, 0.496, closed='right'): 93,
Interval(0.092799999999999994, 0.125, closed='right'): 111,
Interval(0.125, 0.14799999999999999, closed='right'): 86,
Interval(0.0092800000000000001, 0.018599999999999998, closed='right'): 101,
Interval(0.53800000000000003, 0.60299999999999998, closed='right'): 99,
Interval(0.14799999999999999, 0.186, closed='right'): 108,
Interval(0.018599999999999998, 0.023199999999999998, closed='right'): 102,
Interval(0.186, 0.23200000000000001, closed='right'): 134,
Interval(3.246, 4.2670000000000003, closed='right'): 85,
Interval(0.496, 0.53800000000000003, closed='right'): 103,
Interval(1.391, 1.716, closed='right'): 86,
Interval(0.26400000000000001, 0.32500000000000001, closed='right'): 104,
nan: 243,
Interval(0.23200000000000001, 0.26400000000000001, closed='right'): 60,
Interval(0.032500000000000001, 0.046399999999999997, closed='right'): 186,
Interval(0.00464, 0.0092800000000000001, closed='right'): 87,
Interval(0.023199999999999998, 0.032500000000000001, closed='right'): 74,
Interval(0.71399999999999997, 0.877, closed='right'): 101,
Interval(0.97399999999999998, 1.1359999999999999, closed='right'): 92,
Interval(4.2670000000000003, 6.3120000000000003, closed='right'): 100,
Interval(0.046399999999999997, 0.050999999999999997, closed='right'): 33,
Interval(1.716, 1.855, closed='right'): 145,
Interval(0.069599999999999995, 0.092799999999999994, closed='right'): 97,
Interval(1.1359999999999999, 1.391, closed='right'): 319,
Interval(2.319, 2.7829999999999999, closed='right'): 114,
Interval(0.32500000000000001, 0.41299999999999998, closed='right'): 98,
Interval(0.92800000000000005, 0.97399999999999998, closed='right'): 72,
Interval(2.7829999999999999, 3.246, closed='right'): 75,
Interval(2.1429999999999998, 2.319, closed='right'): 128,
Interval(1.855, 2.1429999999999998, closed='right'): 56})
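For equal-count bins, one option that the post above does not use is pandas' qcut, which places the bin edges at quantiles of the column instead of at supplied cutoffs. The sketch below is only an illustration: the synthetic Volume and dMidP values are made up, and it assumes the volumes contain few repeated values. With heavy ties (many rows sharing the same volume) no set of edges can give exactly equal counts, which may be part of what the Counter output above shows.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the poster's data (assumption: the real df
# has a 'Volume' column and a 'dMidP' column, as in the question).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Volume": rng.lognormal(mean=0.0, sigma=1.0, size=2000),
    "dMidP": rng.normal(size=2000),
})

# qcut puts the bin edges at quantiles of Volume, so every bin
# receives (nearly) the same number of rows.
df["bin"] = pd.qcut(df["Volume"], q=20)

# Number of rows per bin, and the mean volume/price inside each bin.
counts = df["bin"].value_counts()
binned = df.groupby("bin", observed=True)[["Volume", "dMidP"]].mean()

print(counts.unique())
print(binned.head())
```

The binned means can then be plotted the same way as in the question, for example with plt.loglog(binned["Volume"], binned["dMidP"], 'r-'). If repeated values make some quantile edges coincide, qcut accepts duplicates="drop", at the cost of ending up with fewer bins.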

Related

How can I use fill_between if the points I have are array with single value

I have a matplotlib script.
In it in the end I have
x_val=[x[0] for x in lista]
y_val=[x[1] for x in lista]
z_val=[x[2] for x in lista]
ax.plot(x_val,y_val,'.-')
ax.plot(x_val,z_val,'.-')
This script plots fine even though the values in y_val and z_val are not strictly numbers.
Debugging I have
(Pdb) x_val
[69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153]
(Pdb) y_val
[array(1.74204588), array(1.74162786), array(1.74060187), array(1.73956786), array(-1.89492498), array(-1.89225716), array(-1.89842406), array(-1.89143466), array(-1.89171231), array(-1.88730752), array(-1.89144205), nan, array(1.71829279), array(-1.88108125), array(-1.87515878), array(-1.87912412), array(-1.87015615), array(-1.87152107), array(-1.86639765), array(-1.87383146), array(-1.86896753), array(-1.87339903), array(-1.8748417), array(-1.88515482), array(-1.88263666), array(-1.88571425), nan, nan, array(1.72480822), array(1.73666841), array(-1.88835078), array(-1.88489648), array(-1.89135095), array(-1.88647712), array(-1.88697799), array(-1.88330942), array(-1.88929744), array(-1.88320532), array(-1.88466698), array(-1.87994435), array(-1.88546968), array(-1.88014776), array(-1.87803843), array(-1.87505217), array(-1.8797416), array(-1.87223076), array(-1.87333355), array(-1.86838693), array(-1.87577428), array(-1.86875561), array(-1.86872998), array(-1.86385078), array(-1.87095955), array(-1.86509266), array(-1.86601095), array(-1.86223456), array(-1.87151403), array(-1.86695325), array(-1.86540432), array(-1.86244142), array(-1.87018407), array(-1.86767604), array(-1.8699986), array(-1.87008087), array(-1.88049869), array(1.70057683), array(1.74942263), array(-1.86556665), array(-1.88470081), array(-1.90776552), array(-1.9103818), array(-1.91022515), array(-1.89490587), array(-1.89507617), array(-1.8875979), array(-1.89318633), array(-1.8942595), array(-1.902641), array(-1.89313615), array(-1.87870174), array(-1.86319541), array(-1.85999368), array(-1.85943922), array(-1.88398592), array(1.73030903)]
z_val similarly
This does not represent a problem
However I want to do
ax.fill_between(x_val,0,1,where=(y_val*z_val) >0,
color='green',alpha=0.5 )
This is a first attempt that I will probably modify (in this example, for instance, I don't yet understand what transform=ax.get_xaxis_transform() does), but the problem is that I now get an error:
File "plotgt_vs_time.py", line 160, in plot
ax.fill_between(x_val,0,1,where=(y_val*z_val) >0,
TypeError: can't multiply sequence by non-int of type 'list'
I suppose it is because the values are arrays. How can I modify my code so that I can use fill_between?
I tried modifying it to
x_val=[x[0] for x in lista]
y_val=[x[1][0] for x in lista]
z_val=[x[2][0] for x in lista]
but this throws an error
IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed
Then I modified it to
x_val=[x[0] for x in lista]
y_val=[float(x[1]) for x in lista]
z_val=[float(x[2]) for x in lista]
Now I only get floats, so the 0-D arrays are gone,
but I still get the error
TypeError: can't multiply sequence by non-int of type 'list'
How can I use fill_between?

In the end I solved it by converting the lists into numpy arrays:
x_val=[x[0] for x in lista]
y_val=[float(x[1]) for x in lista]
z_val=[float(x[2]) for x in lista]
ax.fill_between(x_val,y_val,z_val,where=(np.array(y_val)*np.array(z_val)) >0,
color='red',alpha=0.5 )
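To make the cause of the TypeError concrete: multiplying two Python lists is not defined, whereas multiplying two numpy arrays works element by element and gives the boolean mask that where= expects. Here is a small sketch of that; the four-row lista is a made-up stand-in for the real data, and the Agg backend is selected only so that it runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display needed)
import matplotlib.pyplot as plt

# Hypothetical stand-in for `lista`: (x, y, z) rows where y and z are
# 0-D arrays or nan, like the values shown in the debugger output.
lista = [
    (69, np.array(1.74), np.array(1.2)),
    (70, np.array(-1.89), np.array(0.8)),
    (71, np.nan, np.array(-0.5)),
    (72, np.array(-1.88), np.array(-1.1)),
]

# float() unwraps the 0-D arrays; np.array() makes the result a real
# 1-D array so that arithmetic is elementwise.
x_val = np.array([row[0] for row in lista])
y_val = np.array([float(row[1]) for row in lista])
z_val = np.array([float(row[2]) for row in lista])

# True where y and z have the same sign; comparisons with nan are False.
same_sign = (y_val * z_val) > 0

fig, ax = plt.subplots()
ax.plot(x_val, y_val, '.-')
ax.plot(x_val, z_val, '.-')
ax.fill_between(x_val, y_val, z_val, where=same_sign, color='red', alpha=0.5)
```

Converting once, right after building the lists, means every later expression (products, comparisons, masks) behaves elementwise without wrapping each use in np.array().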

Seaborn plots not correct [duplicate]

This question already has answers here:
Seaborn plots not showing up
(8 answers)
Closed 9 months ago.
import seaborn as sns, numpy as np
import matplotlib.pyplot as plt
a = np.random.random((20, 20))
mask = np.zeros_like(a)
mask[np.tril_indices_from(mask)] = True #mask the lower triangle
with sns.axes_style("white"): #make the plot
    ax = sns.heatmap(a, xticklabels=False, yticklabels=False, mask=mask, square=False, cmap="YlOrRd")
plt.show()
I make a Seaborn heatmap from an upper triangle numpy array.
This code using pandas:
import pandas as pd
df = pd.read_csv('datatraining.txt', sep=r',', engine='python', header=None, names = ['id', 'date','Temperature','Humidity','Light','CO2','HumidityRatio','Occupancy'])
df = df.drop([0])
df.index = pd.to_datetime(df.date)
df.drop('date', axis=1, inplace=True)
df = df.apply(pd.to_numeric)
def scale(df):
    return (df - df.mean()) / df.std()
df.Temperature = scale(df.Temperature)
df.Humidity = scale(df.Humidity)
df.Light = scale(df.Light)
df.CO2 = scale(df.CO2)
df.HumidityRatio = scale(df.HumidityRatio)

I come back to this question quite regularly, and it always takes me a while to find what I am looking for:
import seaborn as sns
import matplotlib.pyplot as plt
plt.show() # <--- This is what you are looking for
Please note: In Python 2, you can also use sns.plt.show(), but not in Python 3.
Complete Example
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Visualize C_0.99 for all languages except the 10 with most characters."""
import seaborn as sns
import matplotlib.pyplot as plt
l = [41, 44, 46, 46, 47, 47, 48, 48, 49, 51, 52, 53, 53, 53, 53, 55, 55, 55,
55, 56, 56, 56, 56, 56, 56, 57, 57, 57, 57, 57, 57, 57, 57, 58, 58, 58,
58, 59, 59, 59, 59, 59, 59, 59, 59, 60, 60, 60, 60, 60, 60, 60, 60, 61,
61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 62, 62, 62, 62, 62, 62, 62, 62,
62, 63, 63, 63, 63, 63, 63, 63, 63, 63, 64, 64, 64, 64, 64, 64, 64, 65,
65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 66, 66, 66, 66, 66, 66,
67, 67, 67, 67, 67, 67, 67, 67, 68, 68, 68, 68, 68, 69, 69, 69, 70, 70,
70, 70, 71, 71, 71, 71, 71, 72, 72, 72, 72, 73, 73, 73, 73, 73, 73, 73,
74, 74, 74, 74, 74, 75, 75, 75, 76, 77, 77, 78, 78, 79, 79, 79, 79, 80,
80, 80, 80, 81, 81, 81, 81, 83, 84, 84, 85, 86, 86, 86, 86, 87, 87, 87,
87, 87, 88, 90, 90, 90, 90, 90, 90, 91, 91, 91, 91, 91, 91, 91, 91, 92,
92, 93, 93, 93, 94, 95, 95, 96, 98, 98, 99, 100, 102, 104, 105, 107, 108,
109, 110, 110, 113, 113, 115, 116, 118, 119, 121]
sns.distplot(l, kde=True, rug=False)
plt.show()
Gives
this result

Plots created using seaborn need to be displayed like ordinary matplotlib plots. This can be done using the
plt.show()
function from matplotlib.
Originally I posted a solution that used the matplotlib object already imported by seaborn (sns.plt.show()); however, this is considered bad practice. Therefore, simply import the matplotlib.pyplot module directly and show your plots with
import matplotlib.pyplot as plt
plt.show()
If the IPython notebook is used the inline backend can be invoked to remove the necessity of calling show after each plot. The respective magic is
%matplotlib inline
Details which will affect how you store your data, like:
Give as much detail as you can; and I can help you develop a structure.
Size of data, # of rows, columns, types of columns; are you appending rows, or just columns?
What will typical operations look like. E.g. do a query on columns to select a bunch of rows and specific columns, then do an operation (in-memory), create new columns, save these.
(Giving a toy example could enable us to offer more specific recommendations.)
After that processing, then what do you do? Is step 2 ad hoc, or repeatable?
Input flat files: how many, rough total size in GB. How are these organized, e.g. by records? Does each one contain different fields, or do they have some records per file with all of the fields in each file?
Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5)? and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
Do you 'work on' all of your columns (in groups), or is there a good proportion that you may only use for reports (e.g. you want to keep the data around, but don't need to pull in that column explicitly until final results time)?

Spark UDF: Apply np.sum over a list of values in a data frame and filter values based on threshold

Very new to using Spark for data manipulation and UDFs. I have a sample df with different test scores. There are 50 different columns like these. I am trying to define a custom apply function to filter values (total counts in each row) which are greater than 80.
test_scores
[65, 92, 96, 72, 70, 85, 72, 74, 79, 10, 82]
[59, 81, 91, 69, 66, 75, 65, 61, 71, 85, 69]
Below is what I am trying:
customfunc = udf(lambda val: (np.sum(val > 30)))
df2 = (df.withColumn('scores' ,customfunc('test_scores')))
Getting the below error:
TypeError: '>' not supported between instances of 'tuple' and 'str'

How perform unsupervised clustering on numbers in an Array using PyTorch

I got this array and I want to cluster/group the numbers into similar values.
An example of input array:
array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106])
expected result :
array([57,58,59,60,61]), ([78,79,80,81,82,83]), ([101,102,103,104,105,106])
I tried to use clustering but I don't think it's gonna work if I don't know how many I'm going to split up.
true = np.where(array>=1)
-> (array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102,
103, 104, 105, 106], dtype=int64),)

Dynamic binning requires explicit criteria and is not an easy problem to automate, because each array may need a different set of thresholds to bin it well.
I think Gaussian mixtures with a silhouette score criterion are your best bet. Here is code for what you are trying to achieve. The silhouette scores help you determine the number of clusters/Gaussians to use, and they are quite accurate and interpretable for 1D data.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import scipy.stats
import matplotlib.pyplot as plt
# %matplotlib inline  (uncomment when running in a Jupyter notebook)
#Sample data
x = [57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106]
#Fit a model onto the data
data = np.array(x).reshape(-1,1)
#change value of clusters to check best silhouette score
print('Silhouette scores')
scores = []
for n in range(2,11):
    model = GaussianMixture(n).fit(data)
    preds = model.predict(data)
    score = silhouette_score(data, preds)
    scores.append(score)
    print(n,'->',score)
n_best = np.argmax(scores)+2 #because clusters start from 2
model = GaussianMixture(n_best).fit(data) #best model fit
#Get list of means and standard deviations
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))
#Plotting
extend_window = 50 #this is for zooming into or out of the graph, higher it is , more zoom out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) #For plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') #plot the data on x axis
#plot the fitted distributions (one per component, 3 of them for this data)
for i in range(n_best):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))
#split data by clusters (relies on x being sorted, so each cluster is one contiguous run)
pred = model.predict(data)
output = np.split(x, np.sort(np.unique(pred, return_index=True)[1])[1:])
print(output)
Silhouette scores
2 -> 0.699444729378163
3 -> 0.8962176943475543 #<--- selected as nbest
4 -> 0.7602523591781903
5 -> 0.5835620702692205
6 -> 0.5313888070615105
7 -> 0.4457049486461251
8 -> 0.4355742296918767
9 -> 0.13725490196078433
10 -> 0.2159663865546218
This creates 3 gaussians with the following distributions to split the data into clusters.
Arrays output finally split by similar values
#output -
[array([57, 58, 59, 60, 61]),
array([78, 79, 80, 81, 82, 83]),
array([101, 102, 103, 104, 105, 106])]

You can take a kind of discrete derivative (the difference between neighbouring elements) of this array so that you can track the jumps better. Assume your array is:
A = np.array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106])
You can build the difference vector by simply convolving your vector with [-1, 1]:
A_ = abs(np.convolve(A, np.array([-1, 1])))
Then A_ is (one element longer than A, because np.convolve defaults to mode='full'; the first and last entries are just A[0] and A[-1]):
array([57, 1, 1, 1, 1, 17, 1, 1, 1, 1, 1, 18, 1, 1, 1, 1, 1, 106])
Now you can define a threshold like 5 and find the cluster boundaries.
THRESHOLD = 5
cluster_bounds = np.argwhere(A_ > THRESHOLD)
Now cluster_bounds is:
array([[0], [5], [11], [17]], dtype=int32)
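To get from the boundaries to the actual groups, the same thresholded-difference idea can be written with np.diff and np.split. This is a sketch that assumes the input is already sorted; THRESHOLD = 5 is the example value from the answer, not a tuned setting.

```python
import numpy as np

A = np.array([57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83,
              101, 102, 103, 104, 105, 106])
THRESHOLD = 5  # a gap larger than this starts a new cluster

# gaps[i] is A[i + 1] - A[i], so it has one element fewer than A.
gaps = np.diff(A)

# A gap at position i means the new cluster starts at index i + 1.
split_at = np.flatnonzero(gaps > THRESHOLD) + 1

# Cut A at those indices to obtain one array per cluster.
clusters = np.split(A, split_at)
print(clusters)
```

Unlike the convolution, np.diff does not add the two edge terms, so there is no need to discard the first and last boundary.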

Validating t-test results using Python scipy

I have a simple Python function:
from scipy.stats import ttest_1samp

def tTest(expectedMean, sampleSet, alpha=0.05):
    # T-value and P-value
    tv, pv = ttest_1samp(sampleSet, expectedMean)
    print(tv, pv)
    # True means the null hypothesis (mean == expectedMean) is not rejected
    return pv >= alpha

if __name__ == '__main__':
    # Expected mean is 10
    print(tTest(10.0, [99, 99, 22, 77, 99, 55, 44, 33, 20, 9999, 99, 99, 99]))
My expectation is that the t-test should fail for this sample, as it is nowhere near the expected population mean of 10. However, the program produces this result:
1.0790344826428238 0.3017839504736506
True
I.e. the p-value is ~30% which is too high to reject the hypothesis. I am not very knowledgeable about the maths behind t-test but I don't understand how this result can be correct. Does anyone have any ideas?

I performed the test using R just to check if the results are the same and they are:
t.test(x=c(99, 99, 22, 77, 99, 55, 44, 33, 20, 9999, 99, 99, 99), alternative = "two.sided",
mu = 10, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
data: c(99, 99, 22, 77, 99, 55, 44, 33, 20, 9999, 99, 99, 99)
t = 1.079, df = 12, p-value = 0.3018
alternative hypothesis: true mean is not equal to 10
95 percent confidence interval:
-829.9978 2498.3055
sample estimates:
mean of x
834.1538
You can see that the p-value is 0.3.
This is a really interesting problem; hypothesis testing is full of pitfalls like this. First of all, the sample size matters a lot: with a big sample, say 5000 values, even minor deviations from the expected value you are testing will push the p-value down, so you will reject the null hypothesis most of the time. Small samples do the opposite.
What is happening here is that your data has a very high variance (the single value 9999 dominates it).
If you replace your data
[99, 99, 22, 77, 99, 55, 44, 33, 20, 9999, 99, 99, 99]
with
[99, 99, 99, 99, 100, 99, 99, 99, 99, 100, 99, 100, 100]
which has a really small variance, your p-value will be a lot smaller, even though the mean of this sample is much closer to 10.
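The effect of the variance is easy to see by computing the one-sample t statistic by hand, t = (mean - mu) / (s / sqrt(n)), where s is the sample standard deviation. The sketch below uses only the standard library; the helper name t_statistic is made up for this illustration and it does not compute the p-value, only t.

```python
import math
import statistics

def t_statistic(sample, expected_mean):
    """One-sample t statistic: (mean - mu) / (s / sqrt(n)), s with n - 1 in the denominator."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - expected_mean) / standard_error

noisy = [99, 99, 22, 77, 99, 55, 44, 33, 20, 9999, 99, 99, 99]
tight = [99, 99, 99, 99, 100, 99, 99, 99, 99, 100, 99, 100, 100]

# The outlier 9999 inflates s, so the large mean difference is divided
# by an even larger standard error and t stays close to 1.
t_noisy = t_statistic(noisy, 10.0)

# Almost no spread: the standard error is tiny and t becomes very large.
t_tight = t_statistic(tight, 10.0)

print(t_noisy, t_tight)
```

The first value reproduces the t of about 1.079 reported by both scipy and R, while the second is in the hundreds, which corresponds to a p-value far below 0.05.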
