Calculate percentile of value in column - python

I have a dataframe with a column that has numerical values. This column is not well-approximated by a normal distribution. Given another numerical value, not in this column, how can I calculate its percentile in the column? That is, if the value is greater than 80% of the values in the column but less than the other 20%, it would be in the 20th percentile.

To find the percentile of a value relative to an array (or in your case a dataframe column), use the scipy function stats.percentileofscore().
For example, if we have a value x (the other numerical value not in the dataframe), and a reference array, arr (the column from the dataframe), we can find the percentile of x by:
from scipy import stats
percentile = stats.percentileofscore(arr, x)
Note that there is a third parameter to the stats.percentileofscore() function that has a significant impact on the resulting value of the percentile, viz. kind. You can choose from rank, weak, strict, and mean. See the docs for more information.
For an example of the difference:
>>> df
   a
0  1
1  2
2  3
3  4
4  5
>>> stats.percentileofscore(df['a'], 4, kind='rank')
80.0
>>> stats.percentileofscore(df['a'], 4, kind='weak')
80.0
>>> stats.percentileofscore(df['a'], 4, kind='strict')
60.0
>>> stats.percentileofscore(df['a'], 4, kind='mean')
70.0
As a final note, if you have a value that is greater than 80% of the other values in the column, it would be in the 80th percentile (see the example above for how the kind method affects this final score somewhat) not the 20th percentile. See this Wikipedia article for more information.

Probably very late, but still:
df['column_name'].describe()
will give you the regular 25th, 50th and 75th percentiles along with some additional statistics. If you want percentiles at specific points instead, pass them explicitly:
df['column_name'].describe(percentiles=[0.1, 0.2, 0.3, 0.5])
This will give you the 10th, 20th, 30th and 50th percentiles.
You can pass as many values as you want.
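For instance, on a small illustrative column holding the values 1 through 5 (as in the example further up, and assuming pandas is imported as pd), the output looks roughly like this:
>>> pd.Series([1, 2, 3, 4, 5]).describe(percentiles=[0.1, 0.2, 0.3, 0.5])
count    5.000000
mean     3.000000
std      1.581139
min      1.000000
10%      1.400000
20%      1.800000
30%      2.200000
50%      3.000000
max      5.000000
dtype: float64
Note that the 50th percentile (the median) is always included, even if you don't ask for it.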

Sort the column, and see if the value is in the first 20% or whatever percentile.
for example:
def in_percentile(my_series, val, perc=0.2):
    myList = sorted(my_series.values.tolist())
    l = len(myList)
    return val > myList[int(l * perc)]
Or, if you want the actual percentile, use searchsorted on the sorted values (searchsorted assumes the array is already sorted):
my_series.sort_values().values.searchsorted(val) / len(my_series) * 100
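A quick illustration with the small column from the first answer (assuming df['a'] holds the values 1 through 5):
in_percentile(df['a'], 2.5)                                            # True: 2.5 is above the value at the 20% cut-off
df['a'].sort_values().values.searchsorted(2.5) / len(df['a']) * 100    # 40.0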

Since you're looking for values over/under a specific threshold, you could consider using pandas qcut function. If you wanted values under 20% and over 80%, divide your data into 5 equal sized partitions. Each partition would represent a 20% "chunk" of equal size (five 20% partitions is 100%). So, given a DataFrame with 1 column 'a' which represents the column you have data for:
df['newcol'] = pd.qcut(df['a'], 5, labels=False)
This will give you a new column in your DataFrame with each row having a value in (0, 1, 2, 3, 4), where 0 represents your lowest 20% and 4 represents your highest 20%, i.e. the values at or above the 80th percentile.
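Since the original question is about a value that is not in the column, one possible extension (just a sketch, using qcut's retbins option; x, edges and chunk are illustrative names) is to keep the bin edges and place the new value with pd.cut:

import pandas as pd

# keep the bin edges so that a value not in the column can be placed as well
binned, edges = pd.qcut(df['a'], 5, labels=False, retbins=True)
df['newcol'] = binned

# which 20% chunk does a new value x fall into? (0 = lowest, 4 = highest)
x = 3.7
chunk = pd.cut([x], bins=edges, labels=False, include_lowest=True)[0]
# values outside the original range come back as NaN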

Related

Find the average of last 25% from the input range in pandas

I have successfully imported a temperature CSV file into a pandas DataFrame. I have also found the mean value of a specific range:
df.loc[7623:23235, 'Temperature'].mean()
where 'Temperature' is the column title in the DataFrame.
I would like to know if it is possible to change this function to find the average of last 25% (or 1/4) from the input range (7623:23235).
Yes, if by "last 25%" you mean the largest 25% of the values, you can use the quantile method to find the value that separates off the top 25% and then use the mean method to average the values above it.
Here's how you can do it:
quantile = df.loc[7623:23235, 'Temperature'].quantile(0.75)
mean = df.loc[7623:23235, 'Temperature'][df.loc[7623:23235, 'Temperature'] >= quantile].mean()
To find the average of the last 25% of the values in a specific range of a column in a Pandas DataFrame, you can use the iloc indexer along with slicing and the mean method.
For example, given a DataFrame df with a column 'Temperature', you can find the average of the last 25% of the values in the range 7623:23235 like this:
import math
# Length of the range (loc[7623:23235] includes both endpoints, so add 1)
length = 23235 - 7623 + 1
# Number of values that make up the last 25% of the range
n = math.ceil(length * 0.25)
# Index (within the range) of the first value to include in the average
start_index = length - n
# Slice the range from the 'Temperature' column (iloc excludes the stop
# position, so use 23236 to cover row 23235) and average its tail
mean = df.iloc[7623:23236]['Temperature'].iloc[start_index:].mean()
print(mean)
This code first calculates the length of the range, then calculates the number of values that represent 25% of that range. It then uses the iloc indexer to slice the relevant range of values from the 'Temperature' column and calculates the mean of those values using the mean method.
Note that this code assumes that the indices of the DataFrame are consecutive integers starting from 0. If the indices are not consecutive or do not start at 0, you may need to adjust the code accordingly.

Using Pandas to develop own correlation python similar to pearson

count 716865 716873 716884 716943
0 -0.16029615828413712 -0.07630309240006158 0.11220663712532133 -0.2726775504078691
1 -0.6687265363491811 -0.6135022705188075 -0.49097425130988914 -0.736020384028633
2 0.06735205699309535 0.07948417451634422 0.09240256047258057 0.0617964313591086
3 0.372935701728449 0.44324822316416074 0.5625073287879649 0.3199599294007491
4 0.39439310866886124 0.45960496068147993 0.5591549439131621 0.34928093849248304
5 -0.08007381002566456 -0.021313801077641505 0.11996141286735541 -0.15572679401876433
6 0.20853071107951396 0.26561990841073535 0.3661990387594055 0.15720649076873264
7 -0.0488049712326824 0.02909288268076153 0.18643283476719688 -0.1438092892727158
8 0.017648470149950992 0.10136455179350337 0.2722686729095633 -0.07928001803992157
9 0.4693208827819954 0.6601182040950377 1.0 0.2858790498612906
10 0.07597883305423633 0.0720868097090368 0.06089458880790768 0.08522329510499728
I want to manipulate this normalized dataframe to do something similar to the built-in .corr method in pandas, but modified. I want to create my own method for correlation and build a heatmap, which I know how to do.
My end result is an NxN dataframe of 0 or 1 values that meets the criteria below. For the table I show above it would be 4x4.
The following steps are the criteria for my correlation method:
Loop through each column as the reference and subtract all the other columns from it.
While looping, disregard rows where both the reference and the correlating column have absolute normalized values of less than 0.2.
For the remaining rows, if the differences are less than 10 percent the correlation is good, so I record a 1; if any of the differences is greater than 10% I record a 0.
All the diagonal cells will have a 1 (good correlation with itself) and the other cells will have either 0 or 1.
The following is what I have, but when I drop the deadband values it does not catch all of them for some reason.
subdf = []
deadband = 0.2
for i in range(len(df2_norm.columns)):
    # First, let's drop non-zero above deadband values in each row
    df2_norm_drop = df2_norm.drop(df2_norm[(df2_norm.abs().iloc[:, i] < deadband) & \
                                           (df2_norm.abs().iloc[:, i] > 0)].index)
    # Take difference of first detail element normalized value to chart allowable
    # normalized value
    subdf.append(pd.DataFrame(df2_norm.subtract(df2_norm.iloc[:, i], axis=0)))
I know it looks a lot but would really appreciate any help. Thank you!
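To make the criteria more concrete, here is a rough sketch of what I am aiming for, assuming that "less than 10 percent" means an absolute difference below 0.1 on the normalized values (my attempt above does not get there yet):

import numpy as np
import pandas as pd

def my_corr(df_norm, deadband=0.2, tol=0.10):
    cols = df_norm.columns
    # start with 1s on the diagonal: every column correlates with itself
    out = pd.DataFrame(np.eye(len(cols), dtype=int), index=cols, columns=cols)
    for ref in cols:
        for other in cols:
            if ref == other:
                continue
            # disregard rows where BOTH columns are inside the deadband
            keep = (df_norm[ref].abs() >= deadband) | (df_norm[other].abs() >= deadband)
            diff = (df_norm.loc[keep, ref] - df_norm.loc[keep, other]).abs()
            # good correlation (1) only if every remaining difference is below the tolerance
            out.loc[ref, other] = int((diff < tol).all())
    return out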

Generate missing values on the dataset based on ZIPF distribution

Currently, I want to observe the impact of missing values on my dataset. I replace a percentage of the data points (10, 20, 90%) with missing values and observe the impact. The function below replaces a given percentage of data points with missing values.
import random
import numpy as np

def dropout(df, percent):
    # create df copy
    mat = df.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # indices to mask
    mask = random.sample(range(mat.size), prop)
    # replace with NaN
    np.put(mat, mask, [np.NaN] * len(mask))
    return mat
My question is, I want to insert the missing values based on a Zipf distribution / power law / long tail. For instance, I have a dataset that contains 10 columns (5 columns of categorical data and 5 columns of numerical data). I want to replace some data points in the 5 categorical columns based on Zipf's law, so that columns on the left side have more missing values than columns on the right side.
I am using Python for this task.
I saw the manual for the Zipf distribution at this link: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.zipf.html but it still doesn't help me much.
Zipf distributions are a family of distributions on the positive integers (1 to infinity), whereas you want to delete values from only 5 discrete columns, so you will have to make some arbitrary decisions to do this. Here is one way:
Pick a parameter for your Zipf distribution, say a = 2 as in the example on the documentation page you linked.
Looking at the plot on that same page, you could decide to truncate at 10, i.e. if any sampled value greater than 10 comes up, you just discard it.
Then you can map the remaining domain of 1 to 10 linearly onto your five categorical columns: values 1 and 2 correspond to the first (leftmost) column, 3 and 4 to the second, and so on.
So you iteratively sample single values from your Zipf distribution using numpy.random.zipf. For every sampled value, you delete one data point in the column that value corresponds to (see step 3), until you have reached the overall desired percentage of missing values.
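Putting those steps together, a minimal sketch could look like this (the function name, the truncation at 10 and treating percent as the overall fraction of categorical cells to delete are my assumptions, not a fixed recipe):

import numpy as np

def zipf_dropout(df, cat_cols, percent, a=2.0, truncate=10, seed=None):
    # delete `percent` of the values in cat_cols, choosing the column for each
    # deletion from a truncated Zipf(a) draw so that the leftmost columns lose more
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_to_drop = int(out[cat_cols].size * percent)
    dropped = 0
    while dropped < n_to_drop:
        v = rng.zipf(a)
        if v > truncate:          # discard samples above the cut-off
            continue
        # map the values 1..truncate linearly onto the categorical columns
        col = cat_cols[int((v - 1) * len(cat_cols) / truncate)]
        # pick a random row that still has a value in that column
        candidates = out.index[out[col].notna()]
        if len(candidates) == 0:
            continue
        out.loc[rng.choice(candidates), col] = np.nan
        dropped += 1
    return out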

Median and quantile values in Pyspark

In my dataframe I have an age column. The total number of rows is approximately 77 billion. I want to calculate the quantile values of that column using PySpark. I have some code but the computation time is huge (maybe my process is very bad).
Is there any good way to improve this?
Dataframe example:
id  age
1   18
2   32
3   54
4   63
5   42
6   23
What I have done so far:
#Summary stats
df.describe('age').show()
#For Quantile values
x5 = df.approxQuantile("age", [0.5], 0)
x25 = df.approxQuantile("age", [0.25], 0)
x75 = df.approxQuantile("age", [0.75], 0)
The first improvement would be to do all the quantile calculations at the same time:
quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0)
Also, note that you are using the exact calculation of the quantiles (relativeError=0). From the documentation we can see that (emphasis added by me):
relativeError – The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
Since you have a very large dataframe, I expect that some error is acceptable in these calculations; it is a trade-off between speed and precision, but anything greater than 0 could already give a significant speed improvement.
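For example, allowing a small relative error (0.01 here is just an illustrative value) can already speed things up considerably:
quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.01)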

Measuring covariance on several rows

I'm new to Python and I'm trying to find my way by performing some calculations (I can do them easily in Excel, but now I want to know how to do them in Python).
One calculation is the covariance.
I have a simple example where I have 3 items that are sold, and we have the demand per item over 24 months.
Here is a snapshot of the Excel file (items and their demand over 24 months).
The goal is to measure the covariance between all the three items. Thus the covariance between item 1 and 2, 1 and 3 and 2 and 3. But also, I want to know how to do it for more than 3 items, let's say for a thousand items.
The calculations are as follows:
First I have to calculate the averages per item. This is already something I found by doing the following code:
after importing the following:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
I imported the file:
df = pd.read_excel("Directory\\Covariance.xlsx")
And calculated the average per row:
x=df.iloc[:,1:].values
df['avg'] = x.mean(axis=1)
This gives the file an extra column with the average (avg), as in this snapshot (items, their demand and the average).
The next calculation is the covariance between, let's say, items 1 and 2. Mathematically this is done as follows:
(column "1" of item 1 - column "avg" of item 1) * (column "1" of item 2 - column "avg" of item 2). This has to be done for columns "1" to "24", so 24 times, which should add 24 columns to the file df.
After this, we take the average of these columns, and that gives the covariance between items 1 and 2. Because we have to do this N-1 times per item, in this simple case we should have 2 covariance numbers per item (for the first item, the covariances with items 2 and 3; for the second item, the covariances with items 1 and 3; and for the third item, the covariances with items 1 and 2).
So the first question is: how can I achieve this for these 3 items, so that the file has 2 covariance columns per item (the first item should have a column with the covariance between items 1 and 2 and a second column with the covariance between items 1 and 3, and so on)?
The second question is: what if I have 1000 items? Then I have 999 covariance numbers per item and thus 999 extra columns, but also 999*25 extra columns if I calculate it via the above methodology. How do I perform this calculation for every item as efficiently as possible?
Pandas has a builtin function to calculate the covariance matrix, but first you need to make sure your dataframe is in the correct format. The first column in your data actually contains the row labels, so let's put those in the index:
df = pd.read_excel("Directory\\Covariance.xlsx", index_col=0)
Then you can also calculate the mean more easily, but don't put it back in your dataframe yet!
avg = df.mean(axis=1)
To calculate the covariance matrix, just call .cov(). This, however, calculates pair-wise covariances of columns, so transpose the dataframe first:
cov = df.T.cov()
If you want, you can put everything together in 1 dataframe:
df['avg'] = avg
df = df.join(cov, rsuffix='_cov')
Note: the covariance matrix includes the covariance with itself = the variance per item.
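If you want the built-in result to match the manual calculation from the question exactly (an average over all 24 months, i.e. denominator N rather than N-1), you can pass ddof=0 to cov in recent pandas versions. A small sketch with made-up numbers to illustrate the equivalence:

import numpy as np
import pandas as pd

# toy data: 3 items over 4 periods (stand-in for the 24 months)
df = pd.DataFrame([[10, 12,  9, 14],
                   [20, 18, 22, 19],
                   [ 5,  7,  6,  8]], index=['item1', 'item2', 'item3'])

# manual covariance of items 1 and 2 as described in the question:
# average of the per-period products of deviations from each item's mean
avg = df.mean(axis=1)
manual = ((df.loc['item1'] - avg['item1']) * (df.loc['item2'] - avg['item2'])).mean()

# built-in: pair-wise covariance of the transposed frame, with denominator N
builtin = df.T.cov(ddof=0).loc['item1', 'item2']

assert np.isclose(manual, builtin)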
