Pandas docs have this to say about the qcut function:
Discretize variable into equal-sized buckets based on rank or based on sample quantiles.
So I would expect this code to give me 4 bins of 10 values each:
import numpy as np
import pandas as pd
np.random.seed(4242)
y = pd.Series(np.random.randint(low=1, high=10, size=40))
quartiles = pd.qcut(y, 4, labels=['1st', '2nd', '3rd', '4th'])
print('Quartiles:')
print(quartiles.value_counts(sort=False))
y.groupby(quartiles).agg(['count', 'mean']).plot(kind='bar');
But instead I get this:
Quartiles:
1st 14
2nd 6
3rd 11
4th 9
dtype: int64
[bar chart: count and mean per quartile]
What am I doing wrong here?
This happens because qcut cannot split ties: records with identical values all land in the same bin, so 'boundary-line' cases, i.e. values that would fit the first and second quartiles equally well, make the bins uneven. A simple adjustment to your code will produce the desired result:
import numpy as np
import pandas as pd
np.random.seed(4242)
y = pd.Series(np.random.randint(low=1, high=10, size=40))
quartiles = pd.qcut(y.rank(method = 'first'), 4, labels=['1st', '2nd', '3rd', '4th'])
print('Quartiles:')
print(quartiles.value_counts(sort=False))
y.groupby(quartiles).agg(['count', 'mean']).plot(kind='bar');
By ranking the values first, we give qcut an unambiguous ordering for records that would otherwise straddle bin boundaries. Here, method='first' tells rank() to break ties in order of appearance, so every value receives a distinct rank.
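To see what rank(method='first') does with ties, here is a minimal sketch (with made-up values): identical values receive distinct ranks in order of appearance, so qcut can split them across bins:
import pandas as pd
s = pd.Series([2, 2, 2, 5])
print(s.rank(method='first'))  # 1.0, 2.0, 3.0, 4.0 -- ties broken by position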
The output I get is as follows:
Quartiles:
1st 10
2nd 10
3rd 10
4th 10
dtype: int64
Looking at the boundaries of the bins highlights the problem:
boundaries = [1, 2, 3.5, 6, 9]
These boundaries are correct: pandas first computes the quantile values (inside qcut) and only afterwards assigns the samples to bins. The run of 2s overlaps the boundary of the first quartile, which is why that bin is over-full.
The third boundary, 3.5, arises because the value below the threshold is a 3 and the value above it is a 4; pandas' quantile function is called in such a way that the boundary lies halfway between the two neighboring values.
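If you want to inspect these boundaries yourself, qcut can return them via the retbins argument (a small sketch reusing the data from above):
import numpy as np
import pandas as pd
np.random.seed(4242)
y = pd.Series(np.random.randint(low=1, high=10, size=40))
_, bins = pd.qcut(y, 4, retbins=True)
print(bins)  # the quartile boundaries, i.e. [1., 2., 3.5, 6., 9.] as quoted above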
Concluding: a concept like quantiles becomes more and more appropriate as the number of samples grows, because more distinct values are available to fix the boundaries.
Related
Is there a numpy or pandas function that does something like this? For me, the boundary values are the values farthest from the regression line. That means:
the farthest point above the line, and the farthest point below the line.
If I have the data:
l1 = [0,1,4,3,4,3]
df = pd.DataFrame(l1)
It looks like this:
0
0 1
1 4
2 3
3 4
4 3
How can I find the data at index 1 and index 4? I need to detect them from a Python script and remove them; I know how to remove them, but I do not know how to find them.
What I want to do:
First I am going to calculate a linear regression, then remove the outlier values, and then recalculate the linear regression once more without those farthest values.
To remove outliers, you can use Series.quantile:
Suppose the following dataframe:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2022)
df = pd.DataFrame({'A': np.random.normal(5, 2, size=50)})
df.plot.hist(bins=25)
plt.xlim(0, 10)
plt.show()
Now filter your dataframe, keeping only the rows that lie between the 25th and 75th percentiles:
df1 = df.loc[df['A'].between(*df['A'].quantile([0.25, 0.75]).values)]
df1.plot.hist(bins=10)
plt.xlim(0, 10)
plt.show()
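To address the original "farthest from the regression line" idea more directly, here is a minimal sketch (assuming a simple straight-line fit with np.polyfit against the row index): compute signed residuals, then drop the largest positive and largest negative one before refitting:
import numpy as np
import pandas as pd

l1 = [0, 1, 4, 3, 4, 3]
df = pd.DataFrame(l1, columns=['y'])
x = np.arange(len(df))

slope, intercept = np.polyfit(x, df['y'], 1)  # first-degree (linear) fit
resid = df['y'] - (slope * x + intercept)     # signed residual (vertical distance from the line)

idx_above = resid.idxmax()  # farthest point above the line
idx_below = resid.idxmin()  # farthest point below the line
df_clean = df.drop([idx_above, idx_below])    # remove them, then refit on df_clean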
CSV1only is a dataframe loaded from a CSV. As a small example, suppose it contains a single column:
TRADINGITEMID:
1233
2455
3123
1235
5098
How can I plot a scatterplot accordingly, specifically the y-axis?
I tried:
import pandas as pd
import matplotlib.pyplot as plt
CSV1only.plot(kind='scatter',x='TRADINGITEMID', y= [1,2], color='b')
plt.xlabel('TRADINGITEMID Numbers')
plt.ylabel('Range')
plt.title('Distribution of ItemIDNumbers')
and it doesn't work because of the y argument.
So, my main question is just how I can get a 0, 1, 2 y-axis for this scatter plot, as I want to make a distribution graph.
The following code doesn't work because it doesn't match the number of rows in the original TRADINGITEMID column, which has 5000 rows:
newcolumn_values = [1, 2]
CSV1only['un et deux'] = newcolumn_values
#and then I changed the y = [1,2] from before into y = ['un et deux']
Therefore the solution would need to work for any integer 1 to N, N being the number of rows, while the y-axis itself would only span [0, 2] or some [0, m], m being an arbitrary integer.
You don't need to worry about the actual pandas dataframe CSV1only. The 'TRADINGITEMIDNUMBERS' column contains 5000 unique numbers, so I just want to plot those numbers on a line, with the y-axis counting instances (which will never exceed 1, since each value is unique).
I think you are looking for the following: you need to generate y-values from 0 to n-1, where n is the total number of rows:
import numpy as np
import matplotlib.pyplot as plt
y = np.arange(len(CSV1only['TRADINGITEMID']))  # 0, 1, ..., n-1
plt.scatter(CSV1only['TRADINGITEMID'], y, c='DarkBlue')
I'm working with a dataframe in which each row of one column contains a np.array (in this case representing the mean waveform of brain recordings through time). I want to calculate the Pearson correlation of this column (array by array).
This is my code:
import numpy as np
from scipy import stats

length = len(df.Mean)
Mean = []
for i in range(len(df.Mean)):
    Mean.append(df.Mean[i])
Correlation_p = np.zeros((length, length))
P_Value_p = np.zeros((length, length))
for i in range(length):
    for j in range(length):
        Correlation_p[i][j], P_Value_p[i][j] = stats.pearsonr(df.Mean[i], df.Mean[j])
This works, but I want to know if there is a more pythonic way to do it, maybe using df.corr(). I tried but couldn't work out how.
EDIT: the output of df.Mean.head()
0 [-0.2559348091247745, 0.02743063113723536, 0.3...
1 [-0.37025615099744325, -0.11299328141596175, 0...
2 [-1.0543681894876467, -0.8452798699354909, -0....
3 [-0.23527437766943646, -0.28657810260136585, -...
4 [0.45557980303095674, 0.6055674269814991, 0.74...
Name: Mean, dtype: object
The arrays that you would like to correlate seem to sit in single cells of the DataFrame, if I am not mistaken. The following brings them into a format where each array occupies a single column.
I made a data example that resembles the format of df.Mean.head():
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': [np.random.randint(0, 5, 10), np.random.randint(0, 5, 10), np.random.randint(0, 5, 10)]})
You can turn these arrays into columns using this:
df = pd.DataFrame(np.array(df['x'].tolist()).transpose())
Adapt this reshaping step to your own dimensions.
From there, it would be fairly straightforward.
A correlation matrix can be created by:
df.corr()
A visualization of the correlation matrix:
import matplotlib.pyplot as plt
plt.matshow(df.corr())
plt.show()
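As an alternative sketch (assuming every array has the same length): np.corrcoef treats each row of its input as a variable, so it can compute the whole Pearson correlation matrix in one call. Note that, unlike stats.pearsonr, it does not return p-values:
import numpy as np
stacked = np.vstack(df['x'].tolist())  # one waveform per row
corr = np.corrcoef(stacked)            # len(df) x len(df) correlation matrix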
I'm running a regression between two stocks:
(y = bank_matrix['EXO.MI'] and x = bank_matrix['LDO.MI']).
My task is to update the slope coefficient every 20 days (lookback). In short, I want to have a list of slope coefficients starting from day 20 (my lookback). So I run this regression model called reg.
In the meantime, I create:
A) 3 empty lists: Intercetta=[], Hedge=[], Residuals=[]
B) 1 DataFrame called Regressione, into whose columns (['Intercetta','Hedge','Residuals']) I want to copy the results of my regression (intercept, slope, and residuals).
Now the whole code:
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader as pdr
from sklearn.linear_model import LinearRegression
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
tickers=['EXO.MI','LDO.MI']
end=datetime.date.today()
gap=datetime.timedelta(days=650)
start=end- gap
Bank=pdr.get_data_yahoo(tickers,start=start,end=end)
bank_matrix=Bank['Adj Close']
bank_matrix=bank_matrix.dropna()
exor=bank_matrix['EXO.MI']
leonardo=bank_matrix['LDO.MI']
Regressione=pd.DataFrame(data=np.zeros((len(exor),3)),columns=['Intercetta','Hedge','Residuals'],index=bank_matrix['EXO.MI'].index)
lookback=20
Hedge=[]
Intercetta=[]
Residuals=[]
for i in range(lookback, len(exor)):
    reg = LinearRegression().fit(bank_matrix[['LDO.MI']][i-lookback+1:i], bank_matrix[['EXO.MI']][i-lookback+1:i])
    # Regressione.iloc[Regressione[i,'Hedge']] = reg.coef_[0]
    Hedge.append(reg.coef_[0])
    Intercetta.append(reg.intercept_)
    y_pred = reg.predict(bank_matrix[['LDO.MI']][lookback:])
    Residuals.append(bank_matrix[['EXO.MI']][lookback:].to_numpy() - y_pred)
Regressione = pd.DataFrame(list(zip(Intercetta, Hedge, Residuals)), columns=['Intercetta','Hedge','Residuals'])
Regressione.set_index(bank_matrix[['EXO.MI']].index[lookback:], inplace=True)
Now the final question: why, in my final dataframe Regressione, is the third column ('Residuals') a horizontal array?
First, I think these two lines are doing something completely wrong:
y_pred=reg.predict(bank_matrix[['LDO.MI']][lookback:])
Residuals.append(bank_matrix[['EXO.MI']][lookback:].to_numpy()-y_pred)
You basically run a linear regression on points 1 to 20, then 2 to 21, then 3 to 22, and so on. But then you fit each of those regressions to the data from observation 20 onward. So you get the model for, e.g., points 5 to 24, use it to predict observations 20 through the end, and take the difference between that prediction and the actuals (note that bank_matrix[['EXO.MI']][lookback:].to_numpy() doesn't change during your for loop).
I suppose what would make more sense here would be:
y_pred=reg.predict(bank_matrix[['LDO.MI']][i-lookback+1:i])
Residuals.append(bank_matrix[['EXO.MI']][i-lookback+1:i].to_numpy()-y_pred)
So you would take error of the model, or:
y_pred=reg.predict(bank_matrix[['LDO.MI']][i:])
Residuals.append(bank_matrix[['EXO.MI']][i:].to_numpy()-y_pred)
So you would try to fit prediction based on the current time span to the data going forward.
Now, the first option will produce a list of 19 elements per row, while the second will produce a list of 430 elements, shrinking by 1 per row until there is a single element in the last row. That is because these are residuals: you have one line, one slope, and one hedge per time span, but every observation within that span produces its own residual. So, depending on how you want to express it, you can reduce each list to a single number by applying a further transformation, e.g. the sum of squared residuals or the mean residual.
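For instance, under the first option you could collapse each window's residual vector into a single number inside the loop, along these lines (a sketch, using the sum of squared residuals):
y_pred = reg.predict(bank_matrix[['LDO.MI']][i-lookback+1:i])
resid = bank_matrix[['EXO.MI']][i-lookback+1:i].to_numpy() - y_pred
Residuals.append(float((resid ** 2).sum()))  # one number per window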
Hope this helps...
From the doc:
If a list of dict/series is passed and the keys are all contained in the DataFrame’s index, the order of the columns in the resulting DataFrame will be unchanged.
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
To modify the data already in your dataframe, use df.loc, for example...
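As a hedged illustration of that advice (with made-up column names a and b): collect the new rows in a plain list and concatenate once at the end, rather than appending to the DataFrame inside the loop:
import pandas as pd
df = pd.DataFrame({'a': [0], 'b': [0]})  # the original DataFrame
rows = []                                # accumulate plain dicts here
for i in range(1, 5):
    rows.append({'a': i, 'b': i ** 2})
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)  # one concatenate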
I'm trying to cluster data using lat/lon as X/Y axes and DaysUntilDueDate as my Z axis. I also want to retain the index column ('PM') so that I can create a schedule later using this clustering analysis. The tutorial I found here has been wonderful but I don't know if it's taking the Z-axis into account, and my poking around hasn't resulted in anything but errors. I think the essential point in the code is the parameters of the iloc bit of this line:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(A.iloc[:, :])
I tried changing this part to iloc[1:4] (to only work on columns 1-3) but that resulted in the following error:
ValueError: n_samples=3 should be >= n_clusters=4
So my question is: How can I set up my code to run clustering analysis on 3-dimensions while retaining the index ('PM') column?
Here's my python file, thanks for your help:
from sklearn.cluster import KMeans
import csv
import pandas as pd
# Import csv file with data in following columns:
# [PM (index)] [Longitude] [Latitude] [DaysUntilDueDate]
df = pd.read_csv('point_data_test.csv',index_col=['PM'])
numProjects = len(df)
K = numProjects // 3 # Around three projects can be worked per day
print("Number of projects: ", numProjects)
print("K-clusters: ", K)
for k in range(1, K):
    # Create a kmeans model on our data, using k clusters.
    # random_state helps ensure that the algorithm returns the
    # same results each time.
    kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])
    # These are our fitted labels for clusters --
    # the first cluster has label 0, and the second has label 1.
    labels = kmeans_model.labels_
    # Sum of distances of samples to their closest cluster center
    SSE = kmeans_model.inertia_
    print("k:", k, " SSE:", SSE)
# Add labels to df
df['Labels'] = labels
#print(df)
df.to_csv('test_KMeans_out.csv')
It seems the issue is with the syntax of iloc[1:4].
From your question it appears you changed:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])
to:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[1:4])
It seems to me that either you have a typo or you don't understand how iloc works. So I will explain.
You should start by reading Indexing and Selecting Data from the pandas documentation.
But in short, .iloc is an integer-based indexing method for selecting data by position.
Let's say you have the dataframe:
A B C
1 2 3
4 5 6
7 8 9
10 11 12
The use of iloc in the example you provided, iloc[:, :], selects all rows and columns and produces the entire dataframe. In case you aren't familiar with Python's slice notation, take a look at the question Explain slice notation or the docs for An Informal Introduction to Python. The example you said caused your error, iloc[1:4], selects the rows at index 1-3. This would result in:
A B C
4 5 6
7 8 9
10 11 12
Now, if you think about what you are trying to do and the error you received, you will realize that you have selected fewer samples from your data than the number of clusters you are looking for: 3 samples (the rows at index 1, 2, and 3), but you're telling KMeans to find 4 clusters, which just isn't possible.
What you really intended to do (as I understand it) was to select all rows and columns 1-3, which correspond to your lat, lng, and z values. To do this, just add a colon as the first argument to iloc like so:
df.iloc[:, 1:4]
Now you will have selected all of your samples along with the columns at index 1, 2, and 3. Assuming you have enough samples, KMeans should work as you intended.
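Putting that together, a minimal sketch (assuming, as above, that the lat, lng, and z features sit at column positions 1-3 after loading; with index_col=['PM'] they may instead sit at positions 0-2, so adjust the slice to your file):
from sklearn.cluster import KMeans
import pandas as pd

df = pd.read_csv('point_data_test.csv', index_col=['PM'])
features = df.iloc[:, 1:4]  # all rows, the three feature columns

kmeans_model = KMeans(n_clusters=4, random_state=1).fit(features)
df['Labels'] = kmeans_model.labels_  # the 'PM' index is preserved
df.to_csv('test_KMeans_out.csv')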