Median and quantile values in Pyspark - python

In my dataframe I have an age column. The total number of rows is approximately 77 billion. I want to calculate the quantile values of that column using PySpark. I have some code, but the computation time is huge (maybe my process is very bad).
Is there any good way to improve this?
Dataframe example:
id age
1 18
2 32
3 54
4 63
5 42
6 23
What I have done so far:
#Summary stats
df.describe('age').show()
#For Quantile values
x5 = df.approxQuantile("age", [0.5], 0)
x25 = df.approxQuantile("age", [0.25], 0)
x75 = df.approxQuantile("age", [0.75], 0)

The first improvement would be to do all the quantile calculations at the same time:
quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0)
Also, note that you are using the exact calculation of the quantiles. From the documentation we can see that (emphasis added by me):
relativeError – The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
Since you have a very large dataframe, I expect that some error is acceptable in these calculations. It is a trade-off between speed and precision, but any relativeError greater than 0 could give a significant speed improvement.
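For example, a minimal sketch of the combined call with a small, non-zero relative error (the 0.01 tolerance here is just an illustrative choice, not a value from the original answer):
# Approximate quantiles with a 1% relative error instead of an exact computation
q25, q50, q75 = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.01)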

Related

How do you pick a certain min number in Python?

I'm running a program which corrects responses to tests. There are 23 questions and each correct answer is given +1. My code sums these scores over the 23 questions and creates a separate column (totalCorrect) which holds the final score out of 23. I have attached a screenshot of a portion of this totalCorrect column.
What I want to do right now is assign a monetary incentive based on each performance. The incentive is $0.30 for every right answer. The issue is that every survey has 23 questions, but I only want to consider 20 of them when calculating the incentive. So out of the score (out of 23), we count at most 20 correct responses.
How can I do this?
This is what I have so far:
df['numCorrect'] = min{20, totalNumCorrect}
df['earnedAmount'] = 0.3 * df['numCorrect']
where 'earnedAmount' should hold the final incentive amount and 'numCorrect' should cap the score at 20 out of a possible 23
df['earnedAmount'] = (0.3 * df['totalNumCorrect']).clip(0, 6)
0.3 * df['totalNumCorrect'] simply calculates the full amount, which is a Series (or dataframe column).
.clip then limits the values to be between 0 and 6. 6 is of course 0.3 * 20, the maximum amount someone can earn.
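If you also want the capped score itself as a column (the numCorrect column sketched in the question), an equivalent approach is to clip the count first; a small sketch, reusing the question's column names:
df['numCorrect'] = df['totalNumCorrect'].clip(upper=20)  # count at most 20 correct answers
df['earnedAmount'] = 0.3 * df['numCorrect']              # $0.30 per counted answer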

How to detect outliers in a timeseries dataframe and write the "clean" ones in a new dataframe

I'm really new to Python (and programming in general, hihi) and I'm analyzing 2 years of meteorological data measured every 10 s. In total I have 12 meteorological parameters, and I've created my dataframe df with the time as my row index and the names of the meteorological parameters as the column names. Since I don't need super fine granularity, I've resampled the data to hourly data, so the dataframe looks something like this.
Time G_DIFF G_HOR G_INCL RAIN RH T_a V_a V_a_dir
2016-05-01 02:00:00 0.0 0.011111 0.000000 0.013333 100.0 9.128167 1.038944 175.378056
2016-05-01 03:00:00 0.0 0.200000 0.016667 0.020000 100.0 8.745833 1.636944 218.617500
2016-05-01 04:00:00 0.0 0.105556 0.013889 0.010000 100.0 8.295333 0.931000 232.873333
There are outliers, and I can get rid of them with a rolling standard deviation and mean, which is what I've done "by hand" with the following code for one of the columns (the ambient temperature). The algorithm writes the clean data into another dataframe (tr in the example below).
roll = df["T_a"].rolling(24,center = True) #24h window
mean, std = roll.mean(), roll.std()
cut = std*3
low, up = mean - cut, mean+cut
tr.loc[(df["T_a"] < low) | (df["T_a"] > up) | (df["T_a"].isna()), "outliers"] = df["T_a"]
tr.loc[(df["T_a"] >= low) & (df["T_a"] <= up), "T_a"] = df["T_a"]
tr.loc[tr["T_a"].isna(),"T_a"] = tr["T_a"].bfill() #to input a value when a datum is NaN
Now, as I said, that works okay for one column, BUT I would like to be able to do it for all 12 columns, and I'm almost sure there's a more Pythonic way to do it. I guess a for loop should be feasible, but nothing I've tried so far is working.
Could anyone shed some light on this, please? Thank you so much!!
all_columns = df.columns.tolist()  # list of all column names
all_columns.remove('G_DIFF')       # drop a column you don't want to process (list.remove works in place and returns None, so don't reassign)
for column in all_columns:
    roll = df[column].rolling(24, center=True)  # 24h window
    mean, std = roll.mean(), roll.std()
    cut = std * 3
    low, up = mean - cut, mean + cut
    tr.loc[(df[column] < low) | (df[column] > up) | (df[column].isna()), "outliers"] = df[column]
    tr.loc[(df[column] >= low) & (df[column] <= up), column] = df[column]
    tr.loc[tr[column].isna(), column] = tr[column].bfill()  # fill in a value when a datum is NaN
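If you prefer to avoid the explicit loop entirely, here is a minimal sketch of the same idea applied to the whole dataframe at once (this is a variant of the answer above, not the original code, and it keeps the flagged values per column rather than in a single "outliers" column):
roll = df.rolling(24, center=True)           # 24h window over every column at once
mean, std = roll.mean(), roll.std()
low, up = mean - 3 * std, mean + 3 * std

outliers = df.where((df < low) | (df > up))  # keep only the flagged values, NaN elsewhere
tr = df.where((df >= low) & (df <= up))      # NaN where a value is an outlier or missing
tr = tr.bfill()                              # back-fill the gaps, as in the per-column version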
There are two ways to remove outliers from time series data: one is calculating percentiles / mean and standard deviation, which I think you are already using; the other is looking at graphs, because sometimes the spread of the data gives more information visually.
I have worked on yellow-taxi prediction data for a certain area; basically I had a model which can predict in which region of NYC a taxi can get more customers.
For that I had time series data at 10-second intervals with various features like trip distance, speed, working hours, and one was "Total fare". I also wanted to remove the outliers from each column, so I started using means and percentiles to do so.
The thing with total fares was that the mean and percentiles were not giving an accurate threshold.
These were my percentile values:
0th percentile value is -242.55
10th percentile value is 6.3
20th percentile value is 7.8
30th percentile value is 8.8
40th percentile value is 9.8
50th percentile value is 11.16
60th percentile value is 12.8
70th percentile value is 14.8
80th percentile value is 18.3
90th percentile value is 25.8
100th percentile value is 3950611.6
As you can see, 100 would be an OK fare, but with those cut-offs it would be considered an outlier.
So I basically turned to visualization: I sorted my fare values and plotted them. Towards the end of the sorted values there is a sudden steepness, so I magnified that region, and then magnified it again between the 50th percentile and the second-to-last percentile, and voilà, I got my threshold, i.e. 1000.
In formal terms this is called the "elbow method": what you are doing is the first step, and if you are not happy with the result, this can be the second step to find those thresholds.
I suggest you go column by column and use either of these techniques, because when you go column by column you know how much data you are losing, and losing data is losing information.
Personally, I follow visualization; in the end, it really depends on the data.
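A minimal sketch of that visual/elbow approach with pandas and matplotlib (the column name "total_fare" and the zoom range are placeholders for illustration):
import matplotlib.pyplot as plt
import numpy as np

fares = np.sort(df["total_fare"].to_numpy())  # hypothetical column name

# Full picture: the few extreme outliers dominate the top of the curve
plt.plot(fares)
plt.ylabel("sorted fare")
plt.show()

# Zoom in on the upper percentiles to find where the curve "bends" (the elbow)
lo, hi = np.percentile(fares, [50, 99.9])
plt.plot(fares[(fares >= lo) & (fares <= hi)])
plt.ylabel("sorted fare (50th to 99.9th percentile)")
plt.show()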

Using Pandas to develop own correlation python similar to pearson

count 716865 716873 716884 716943
0 -0.16029615828413712 -0.07630309240006158 0.11220663712532133 -0.2726775504078691
1 -0.6687265363491811 -0.6135022705188075 -0.49097425130988914 -0.736020384028633
2 0.06735205699309535 0.07948417451634422 0.09240256047258057 0.0617964313591086
3 0.372935701728449 0.44324822316416074 0.5625073287879649 0.3199599294007491
4 0.39439310866886124 0.45960496068147993 0.5591549439131621 0.34928093849248304
5 -0.08007381002566456 -0.021313801077641505 0.11996141286735541 -0.15572679401876433
6 0.20853071107951396 0.26561990841073535 0.3661990387594055 0.15720649076873264
7 -0.0488049712326824 0.02909288268076153 0.18643283476719688 -0.1438092892727158
8 0.017648470149950992 0.10136455179350337 0.2722686729095633 -0.07928001803992157
9 0.4693208827819954 0.6601182040950377 1.0 0.2858790498612906
10 0.07597883305423633 0.0720868097090368 0.06089458880790768 0.08522329510499728
I want to manipulate this normalized dataframe to do something similar to the .corr method pandas has built in, but I want to modify it. I want to create my own method for correlation and build a heatmap, which I know how to do.
My end result is a dataframe which will be NxN with 0 or 1 values and meets the criteria below. For the table I show above it will be 4x4.
The following steps are the criteria for my correlation method:
Loop through each column as the reference and subtract all the other columns from it.
As we loop, I also want to disregard rows where both the reference and the correlating column have normalized absolute values of less than 0.2.
For the remaining rows, if the difference values are less than 10 percent, the correlation is good and I mark it with 1 for a positive correlation, or 0 if any of the differences of the count values is greater than 10%.
All the diagonals will have a 1 (good correlation with itself), and the other cells will have either 0 or 1.
The following is what I have, but when I drop the deadband values it does not catch everything for some reason.
subdf = []
deadband = 0.2
for i in range(len(df2_norm.columns)):
    # First, drop rows whose non-zero absolute value in this column is below the deadband
    df2_norm_drop = df2_norm.drop(df2_norm[(df2_norm.abs().iloc[:, i] < deadband) &
                                           (df2_norm.abs().iloc[:, i] > 0)].index)
    # Take the difference of every column from the reference column's normalized values
    subdf.append(pd.DataFrame(df2_norm.subtract(df2_norm.iloc[:, i], axis=0)))
I know it looks a lot but would really appreciate any help. Thank you!
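For reference, a rough sketch of one way the criteria above could be coded; this is my own interpretation of the steps (in particular, "less than 10 percent" is read as an absolute difference of at most 0.10 on the normalized values), not a verified solution:
import numpy as np
import pandas as pd

def my_corr(df_norm, deadband=0.2, tol=0.10):
    cols = df_norm.columns
    # Diagonal starts at 1: every column correlates perfectly with itself
    out = pd.DataFrame(np.eye(len(cols), dtype=int), index=cols, columns=cols)
    for ref in cols:
        for other in cols:
            if ref == other:
                continue
            # Disregard rows where both columns sit inside the deadband
            keep = (df_norm[ref].abs() >= deadband) | (df_norm[other].abs() >= deadband)
            diff = (df_norm.loc[keep, ref] - df_norm.loc[keep, other]).abs()
            # Good correlation (1) only if every remaining difference is within the tolerance
            out.loc[ref, other] = int((diff <= tol).all())
    return out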

Calculate percentile of value in column

I have a dataframe with a column that has numerical values. This column is not well-approximated by a normal distribution. Given another numerical value, not in this column, how can I calculate its percentile in the column? That is, if the value is greater than 80% of the values in the column but less than the other 20%, it would be in the 20th percentile.
To find the percentile of a value relative to an array (or in your case a dataframe column), use the scipy function stats.percentileofscore().
For example, if we have a value x (the other numerical value not in the dataframe), and a reference array, arr (the column from the dataframe), we can find the percentile of x by:
from scipy import stats
percentile = stats.percentileofscore(arr, x)
Note that there is a third parameter to the stats.percentileofscore() function that has a significant impact on the resulting value of the percentile, viz. kind. You can choose from rank, weak, strict, and mean. See the docs for more information.
For an example of the difference:
>>> df
a
0 1
1 2
2 3
3 4
4 5
>>> stats.percentileofscore(df['a'], 4, kind='rank')
80.0
>>> stats.percentileofscore(df['a'], 4, kind='weak')
80.0
>>> stats.percentileofscore(df['a'], 4, kind='strict')
60.0
>>> stats.percentileofscore(df['a'], 4, kind='mean')
70.0
As a final note, if you have a value that is greater than 80% of the other values in the column, it would be in the 80th percentile, not the 20th (see the example above for how the kind argument affects this final score somewhat). See this Wikipedia article for more information.
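If you'd rather not pull in scipy, the kind='weak' definition (the percentage of values less than or equal to x) can be computed directly in pandas:
percentile = (df['a'] <= x).mean() * 100  # fraction of values <= x, expressed as a percentage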
Probably very late but still
df['column_name'].describe()
will give you the regular 25th, 50th and 75th percentiles along with some additional statistics,
but if you specifically want some other percentiles, then
df['column_name'].describe(percentiles=[0.1, 0.2, 0.3, 0.5])
This will give you 10th, 20th, 30th and 50th percentiles.
You can give as many values as you want.
Sort the column, and see if the value is in the first 20% or whatever percentile.
For example:
def in_percentile(my_series, val, perc=0.2):
    myList = sorted(my_series.values.tolist())
    l = len(myList)
    return val > myList[int(l * perc)]
Or, if you want the actual percentile, simply use searchsorted on the sorted values (searchsorted assumes its input is sorted):
my_series.sort_values().values.searchsorted(val) / len(my_series) * 100
Since you're looking for values over/under a specific threshold, you could consider using pandas qcut function. If you wanted values under 20% and over 80%, divide your data into 5 equal sized partitions. Each partition would represent a 20% "chunk" of equal size (five 20% partitions is 100%). So, given a DataFrame with 1 column 'a' which represents the column you have data for:
df['newcol'] = pd.qcut(df['a'], 5, labels=False)
This will give you a new column in your DataFrame where each row has a value in (0, 1, 2, 3, 4), with 0 representing your lowest 20% and 4 representing your highest 20%, i.e. everything at or above the 80th percentile.
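To place a new value x (one that is not in the column) into one of those quintiles, one option is to keep the bin edges that qcut computes and look x up in them; a small sketch, assuming the same column 'a':
import numpy as np
import pandas as pd

df['newcol'], bins = pd.qcut(df['a'], 5, labels=False, retbins=True)
bucket = np.searchsorted(bins, x, side='right') - 1  # which quintile x falls into (0..4)
bucket = min(max(bucket, 0), 4)                      # clamp values outside the observed range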

Pandas.rolling_correlation, threshold?

I am using Pandas.rolling_corr to calculate correlation of two Pandas series.
pd.rolling_corr(x, y, 10)
x and y have very little variation. For instance
x[0] = 1.3342323
x[1] = 1.3342317
Since correlation is covariance divided by the product of the standard deviations, the correlation should only be inf or -inf if a standard deviation is 0. However, in the data set that I have, values are very close but no two values are exactly the same. Yet for some reason, I'm getting inf or -inf values in the correlation.
Here is my question: is there a limit in pandas.rolling_corr at which a number is just automatically rounded in the calculation if it is too small (x < 1e-7)?
The dataset that I'm working with (x and y) is in 'float64' and I have set the chop_threshold to 0.
EDIT:
Here is a sample of the data that I'm working with. I'm trying to compute the correlation between the two columns, but the result is inf.
1144 679.5999998
1144 679.600001
1143.75 679.6000003
1143.75 679.5999993
1143 679.6000009
I think this can more generally be considered a question about precision rather than correlation. Generally, you can expect things behind the scenes to be done in double precision, which means it's around 13 or 14 decimal places that things can get wonky (though certainly at 11 or 12 decimal places, or fewer, in many cases).
So here's a little example program that starts with differences of 1 in the 'x' column (1e0) and then shrinks them in steps all the way down to 1e-19.
import numpy as np
import pandas as pd

pd.options.display.float_format = '{:12,.9f}'.format

df_corr = pd.DataFrame()
diffs = np.arange(4)
for i in range(20):
    df = pd.DataFrame({'x': [1, 1, 1, 1], 'y': [5, 7, 8, 6]})
    df['x'] = df['x'] + diffs * .1**i
    df_corr = df_corr.append(pd.rolling_corr(df.x, df.y, 3)[2:4], ignore_index=True)
A couple of comments before looking at results. First, pandas default display will sometimes hide very small differences in numbers so you need to force more precise formatting in some way. Here I used pd.options.
Second, I won't put a ton of comments in here, but the main thing you need to know is that the 'x' column varies in each loop iteration. You can add a print statement if you like, but the first 4 iterations look like this:
iteration values of column 'x'
0 [ 1. 2. 3. 4. ]
1 [ 1. 1.1 1.2 1.3 ]
2 [ 1. 1.01 1.02 1.03 ]
3 [ 1. 1.001 1.002 1.003 ]
So by the time we get to later iterations, the differences will be so small that pandas just sees it as [1,1,1,1]. The practical question then is at what point the differences are so small as to be indistinguishable to pandas.
To make this a little more concrete, this is what it looks like for the 8th iteration (and the rolling_corr uses a window of 3 since this dataset has only 4 rows):
x y
0 1.000000000 5
1 1.000000100 7
2 1.000000200 8
3 1.000000300 6
So here are the results, where the index value also gives the iteration number (and hence the precision of the differences in column 'x')
df_corr
2 3
0 0.981980506 -0.500000000
1 0.981980506 -0.500000000
2 0.981980506 -0.500000000
3 0.981980506 -0.500000000
4 0.981980506 -0.500000000
5 0.981980506 -0.500000000
6 0.981980505 -0.500000003
7 0.981980501 -0.500000003
8 0.981980492 -0.499999896
9 0.981978898 -0.500001485
10 0.981971784 -0.500013323
11 0.981893289 -0.500133227
12 0.981244570 -0.502373626
13 0.976259968 -0.505124339
14 0.954868546 -0.132598709
15 0.785584405 1.060660172
16 0.000000000 0.000000000
17 nan nan
18 nan nan
19 nan nan
So you can see there that you are getting reasonable results down to about the 13th decimal place (although accuracy is declining before that), and then they fall apart, although you don't actually get NaNs until the 17th.
I'm not sure how to directly answer your question, but hopefully that sheds some light. I would say there is really no automatic rounding that occurs in these calculations, but it's hard to generalize. It's certainly the case that some algorithms deal with precision issues like this better than others, but it's not possible to make any broad generalizations there, even among different functions or methods within pandas. (There is, for example, a current discussion about how numpy gets a different answer than the analogous function in pandas due to precision issues.)
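As a side note, pd.rolling_corr has since been removed from pandas; in current versions the same rolling correlation is written with the .rolling accessor, roughly:
# Modern equivalent of pd.rolling_corr(x, y, 10)
corr = x.rolling(window=10).corr(y)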
