I have a pandas series (as part of a larger data frame) like the below:
0 7416
1 10630
2 7086
3 2091
4 3995
5 1304
6 519
7 1262
8 3676
9 2371
10 5346
11 912
12 3653
13 1093
14 2986
15 2951
16 11859
I would like to group rows based on the following quantiles:
Top 0-5%
Top 6-10%
Top 11-25%
Top 26-50%
Top 51-75%
Top 76-100%
First I started by ranking the data with rank(), and then I planned to use pd.cut() to split it into bins, but pd.cut() does not seem to accept "top N%" groups, only explicit bin edges. Is there an easy way to do this in pandas, or do I need to write a lambda/apply function that works out which bin each ranked item belongs in?
Is this what you had in mind?
pd.qcut(data, [0.05, 0.1, 0.25, 0.5, 0.75, 1])
Slightly modified version:
pd.qcut(data, [0, 0.05, 0.1, 0.25, 0.5, 0.75, 1])
Without the leading 0, any value below the 5th percentile falls outside the bins and comes back as NaN.
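For completeness, a minimal sketch combining those edges with qcut's labels parameter. One assumption to flag: qcut bins run from the smallest values upward, so with these edges "Top 0-5%" labels the lowest 5% of values; if "top" should mean the largest values, use the edges [0, 0.25, 0.5, 0.75, 0.9, 0.95, 1] and reverse the label list.
import pandas as pd

data = pd.Series([7416, 10630, 7086, 2091, 3995, 1304, 519, 1262, 3676,
                  2371, 5346, 912, 3653, 1093, 2986, 2951, 11859])

# one label per bin, in the same order as the quantile edges
labels = ['Top 0-5%', 'Top 6-10%', 'Top 11-25%', 'Top 26-50%',
          'Top 51-75%', 'Top 76-100%']
groups = pd.qcut(data, [0, 0.05, 0.1, 0.25, 0.5, 0.75, 1], labels=labels)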
I have been looking around and I can find examples of annotating a single line chart by using iterrows on the dataframe. What I am struggling with is
a) selecting the single line in the plot rather than all of ax.lines (using ax.lines[#] is clearly not proper), and
b) annotating the values for that line with values from a different column.
The dataframe dfg is in the following format (edited to provide a minimal, reproducible example):
week 2016 2017 2018 2019 2020 2021 min max avg WoW Change
1 8188.0 9052.0 7658.0 7846.0 6730.0 6239.0 6730 9052 7893.7
2 7779.0 8378.0 7950.0 7527.0 6552.0 6045.0 6552 8378 7588.0 -194.0
3 7609.0 7810.0 8041.0 8191.0 6432.0 5064.0 6432 8191 7529.4 -981.0
4 8256.0 8290.0 8430.0 7083.0 6660.0 6507.0 6660 8430 7687.0 1443.0
5 7124.0 9372.0 7892.0 7146.0 6615.0 5857.0 6615 9372 7733.7 -650.0
6 7919.0 8491.0 7888.0 6210.0 6978.0 5898.0 6210 8491 7455.3 41.0
7 7802.0 7286.0 7021.0 7522.0 6547.0 4599.0 6547 7802 7218.1 -1299.0
8 8292.0 7589.0 7282.0 5917.0 6217.0 6292.0 5917 8292 7072.3 1693.0
9 8048.0 8150.0 8003.0 7001.0 6238.0 5655.0 6238 8150 7404.0 -637.0
10 7693.0 7405.0 7585.0 6746.0 6412.0 5323.0 6412 7693 7135.1 -332.0
11 8384.0 8307.0 7077.0 6932.0 6539.0 6539 8384 7451.7
12 7748.0 8224.0 8148.0 6540.0 6117.0 6117 8224 7302.6
13 7254.0 7850.0 7898.0 6763.0 6047.0 6047 7898 7108.1
14 7940.0 7878.0 8650.0 6599.0 5874.0 5874 8650 7352.1
15 8187.0 7810.0 7930.0 5992.0 5680.0 5680 8187 7066.6
16 7550.0 8912.0 8469.0 7149.0 4937.0 4937 8912 7266.6
17 7660.0 8264.0 8549.0 7414.0 5302.0 5302 8549 7291.4
18 7655.0 7620.0 7323.0 6693.0 5712.0 5712 7655 6910.0
19 7677.0 8590.0 7601.0 7612.0 5391.0 5391 8590 7264.6
20 7315.0 8294.0 8159.0 6943.0 5197.0 5197 8294 7057.0
21 7839.0 7985.0 7631.0 6862.0 7200.0 6862 7985 7480.6
22 7705.0 8341.0 8346.0 7927.0 6179.0 6179 8346 7574.7
... ... ... ... ... ... ... ... ...
51 8167.0 7993.0 7656.0 6809.0 5564.0 5564 8167 7131.4
52 7183.0 7966.0 7392.0 6352.0 5326.0 5326 7966 6787.3
53 5369.0 5369 5369 5369.0
with the graph plotted by:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, figsize=[14, 4])
ax.fill_between(dfg.index, dfg["min"], dfg["max"], label="5 Yr. Range", facecolor="oldlace")
ax.plot(dfg.index, dfg[2020], label="2020", c="grey")
ax.plot(dfg.index, dfg[2021], label="2021", c="coral")
ax.plot(dfg.index, dfg.avg, label="5 Yr. Avg.", c="goldenrod", ls=(0, (1, 2)), lw=3)
I would like to label the dfg[2021] line with the values from dfg['WoW Change']. Additionally, if anyone knows how to calculate the first value in the WoW column from the last value of 2020 and the first value of 2021, that would be wonderful! It's currently just dfg['WoW Change'] = dfg[2021].diff()
Thanks!
Figured it out. I zipped the index and the two columns into tuples. I ended up deciding I only wanted the last value to be shown, using the code below:
import math

# zip the index and the two columns into (index, value, wow) tuples
labels = list(zip(dfg.index.values, dfg[2021], dfg['WoW Change']))

# drop any tuple that contains a NaN
labels_light = [i for i in labels if not any(isinstance(n, float) and math.isnan(n) for n in i)]

# label the last point
last = labels_light[-1]
ax.annotate("w/w change: {:,}".format(int(last[2])), xy=(last[0], last[1]))
I'm sure this could have been done much better by someone who knows what they're doing; any feedback is appreciated.
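On the second part of the question, the first WoW value that diff() leaves as NaN can be seeded across the year boundary; a minimal sketch using the columns shown above:
dfg['WoW Change'] = dfg[2021].diff()
# fill the first row: first 2021 value minus the last non-missing 2020 value
dfg.loc[dfg.index[0], 'WoW Change'] = dfg[2021].iloc[0] - dfg[2020].dropna().iloc[-1]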
I'm working on a dataset with the following columns, their N/A counts, and an example record:
Serial No. 0
GRE Score 0
TOEFL Score 0
University Rating 0
SOP 0
LOR 0
CGPA 0
Research 0
Chance of Admit 0
dtype: int64
0: 1 337 118 4 4.5 4.5 9.65 1 0.92
1: 2 324 107 4 4.0 4.5 8.87 1 0.76
The column Chance of Admit is a normalized value ranging from 0 to 1. What I wanted to do was take this column and output a corresponding ordered value, where the chance would be bins such as (low, medium, high) or (unlikely, doable, likely), etc.
What I have come across is that there is a built-in function named to_categorical; however, I don't understand it well enough, and what I have read I still don't exactly get.
This dataset would be used for a decision tree where the labels would be the chance of admit
Thank you for your help
Since they are "normalized" values... why would you need to categorize them? A simple threshold should work, right?
i.e.
0-0.33 low
0.33-0.66 medium
0.66-1.0 high
The only reason you would want to use an automated method would probably be if your number of categories keeps changing?
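A minimal sketch of that threshold idea with numpy, assuming the dataframe and the Chance of Admit column from the question:
import numpy as np

# conditions are evaluated in order, so the first match wins
conditions = [df['Chance of Admit'] <= 0.33,
              df['Chance of Admit'] <= 0.66]
df['group'] = np.select(conditions, ['low', 'medium'], default='high')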
To create the categories, you could use pandas cut, but you will need to determine the range and the number of bins (categories). Adapted from the docs, this should work, I think:
In [6]: df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})
In [7]: labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
In [8]: df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
In [9]: df.head(10)
Out[9]:
value group
0 65 60 - 69
1 49 40 - 49
2 56 50 - 59
3 43 40 - 49
4 43 40 - 49
5 91 90 - 99
6 32 30 - 39
7 87 80 - 89
8 36 30 - 39
9 8 0 - 9
You can then replace df.value with your Chance of Admit column and fill in the ranges for your discrete bins, either by threshold or automatically based on the number of bins.
For your reference:
https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
IIUC, you want to map a continuous variable to a categorical value based on ranges, for example:
0.96 -> high,
0.31 -> low
...
So pandas provides a function for just that: cut. From the documentation:
Use cut when you need to segment and sort data values into bins. This
function is also useful for going from a continuous variable to a
categorical variable.
Setup
Serial No. GRE Score TOEFL Score ... CGPA Research Chance of Admit
0 1 337 118 ... 9.65 1 0.92
1 2 324 107 ... 8.87 1 0.76
2 2 324 107 ... 8.87 1 0.31
3 2 324 107 ... 8.87 1 0.45
[4 rows x 9 columns]
Assuming the above setup, you could use cut like this:
labels = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=['low', 'medium', 'high'])
print(labels)
Output
0 high
1 high
2 low
3 medium
Name: Chance of Admit, dtype: category
Categories (3, object): [low < medium < high]
Notice that we use 3 bins: [(0, 0.33], (0.33, 0.66], (0.66, 1.0]], and that the values of the column Chance of Admit are [0.92, 0.76, 0.31, 0.45]. If you want to change the label names, just change the value of the labels parameter, for example labels=['unlikely', 'doable', 'likely']. If you need an ordinal value instead, do:
labels = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=list(range(3)))
print(labels)
Output
0 2
1 2
2 0
3 1
Name: Chance of Admit, dtype: category
Categories (3, int64): [0 < 1 < 2]
Finally, to put it all in perspective, you could do the following to add it to your DataFrame:
df['group'] = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=['low', 'medium', 'high'])
print(df)
Output
Serial No. GRE Score TOEFL Score ... Research Chance of Admit group
0 1 337 118 ... 1 0.92 high
1 2 324 107 ... 1 0.76 high
2 2 324 107 ... 1 0.31 low
3 2 324 107 ... 1 0.45 medium
[4 rows x 10 columns]
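One follow-up, since the question mentions a decision tree: scikit-learn wants numeric targets, and the ordered categorical column that cut returns exposes integer codes directly. A usage note, assuming the group column created above:
# integer codes in category order: low=0, medium=1, high=2 (NaN becomes -1)
y = df['group'].cat.codes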
I have a dataframe with two columns, Distance(m) and Hauteur(m) (height).
I want to calculate the max, min and average height values over each interval of 0.0439 m of distance.
Distance is a continuous series from 0 to 0.81 m in steps of 0.00222 m, 403 values in total.
The aim is to extract 18 sets of values (max, min, average) of Height, one for each of the 18 intervals of 0.0439 m along the distance series between 0 and 0.81 m.
Then, create a dataframe of each distance interval and its respective max, min and average height.
Distance(m) = [0, 0.0022, 0.0044, .... 0.81 ]
Height(m) = [ 0, 0.1, 0.5, 0.4, 0.9, .... 0.1]
Dataframe
Distance(m) Hauteur(m)
0 0.00000 0.024711
1 0.00222 0.027125
2 0.00444 0.027961
3 0.00592 0.028880
4 0.00814 0.029417
5 0.01036 0.030100
6 0.01184 0.031440
7 0.01406 0.033486
8 0.01628 0.035371
9 0.01702 0.034865
10 0.01850 0.034976
11 0.02072 0.035458
12 0.02220 0.035132
13 0.02442 0.035541
14 0.02516 0.034973
15 0.02738 0.034044
16 0.02886 0.033878
17 0.03108 0.032232
18 0.03256 0.033035
19 0.03478 0.030564
20 0.03700 0.031252
21 0.03848 0.030833
22 0.04070 0.031696
23 0.04144 0.030501
24 0.04366 0.029986
up to 403 values
df3=df1[['Distance(m)', 'Hauteur(m)']]
bins = [0, 0.0439, 0.0878, 0.1317, 0.1756, 0.2195, 0.2634, 0.3073, 0.3512, 0.3951, 0.439, 0.4829, 0.5268, 0.5707, 0.6146, 0.6595, 0.7024, 0.7463, 0.7902]
df3['min'] = pd.cut(df3['Hauteur(m)'].min, bins)
df3['min']
Error shows: Input array must be 1 dimensional
Does anyone have any suggestions that can help me? Thanks!
That's where the error is: .min passes the method object to cut instead of calling it, hence "Input array must be 1 dimensional". What you could do instead:
# bin by distance (the bin edges span the distance range), then
# aggregate the height within each distance interval
df3['categories'] = pd.cut(df3['Distance(m)'], bins)
df3.groupby('categories')['Hauteur(m)'].agg(['max', 'min', 'mean'])
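If you want the interval as a regular column next to its statistics rather than as the index, reset the index on the grouped result:
result = df3.groupby('categories')['Hauteur(m)'].agg(['max', 'min', 'mean']).reset_index()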
I have a data frame with the below structure:
Ranges Relative_17-Aug Relative_17-Sep Relative_17-Oct
0 (0.0, 0.1] 1372 1583 1214
1 (0.1, 0.2] 440 337 648
2 (0.2, 0.3] 111 51 105
3 (0.3, 0.4] 33 10 19
4 (0.4, 0.5] 16 4 9
5 (0.5, 0.6] 7 7 1
6 (0.6, 0.7] 4 3 0
7 (0.7, 0.8] 5 1 0
8 (0.8, 0.9] 2 3 0
9 (0.9, 1.0] 2 0 1
10 (1.0, 2.0] 6 0 2
I am trying to replace the Ranges column using a dictionary with the code below, but it is not working. Any hints if I am doing something wrong?
mydict= {"(0.0, 0.1]":"<=10%","(0.1, 0.2]":">10% and <20%","(0.2, 0.3]":">20% and <30%", "(0.3, 0.4]":">30% and <40%", "(0.4, 0.5]":">40% and <50%", "(0.5, 0.6]":">50% and <60%", "(0.6, 0.7]":">60% and <70%", "(0.7, 0.8]":">70% and <80%", "(0.8, 0.9]":">80% and <90%", "(0.9, 1.0]":">90% and <100%", "(1.0, 2.0]":">100%"}
t_df["Ranges"].replace(mydict,inplace=True)
Thanks!
I think the best approach here is to use the labels parameter when the Ranges column is created with cut:
labels = ['<=10%','>10% and <20%', ...]
#change by your bins
bins = [0,0.1,0.2...]
t_df['Ranges'] = pd.cut(t_df['col'], bins=bins, labels=labels)
If that's not possible, casting to string should help, as @Dark suggested in the comments; for better performance, use map:
t_df["Ranges"] = t_df["Ranges"].astype(str).map(mydict)
Using the map function, this can be achieved easily and in a straightforward manner, as shown below.
mydict= {"(0.0, 0.1]":"<=10%","(0.1, 0.2]":">10% and <20%","(0.2, 0.3]":">20% and <30%", "(0.3, 0.4]":">30% and <40%", "(0.4, 0.5]":">40% and <50%", "(0.5, 0.6]":">50% and <60%", "(0.6, 0.7]":">60% and <70%", "(0.7, 0.8]":">70% and <80%", "(0.8, 0.9]":">80% and <90%", "(0.9, 1.0]":">90% and <100%", "(1.0, 2.0]":">100%"}
t_df["Ranges"] = t_df["Ranges"].map(lambda x: mydict[str(x)])
Hope this helps!
I'm learning how to use Lasso and Ridge with sklearn in Python. I am given the folds in a column. I want to find the best parameter based on a 5-fold cross-validation.
My data looks like the following:
mpg cylinders displacement horsepower weight acceleration origin fold
0 18 8 307 130 3504 12.0 1 3
1 15 8 350 165 3693 11.5 1 0
2 18 8 318 150 3436 11.0 1 2
3 16 8 304 150 3433 12.0 1 2
4 17 8 302 140 3449 10.5 1 3
reg_para = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]
mpg is the y/target variable and the other columns are the predictors; the last column contains the folds. I want to run a Lasso and a Ridge and find the best parameter. The problem I am having is incorporating the specified folds into the cross-validation. Here is what I have so far (for Lasso):
from sklearn.linear_model import Lasso, LassoCV
lasso_model = LassoCV(cv=5, alphas=reg_para)
lasso_fit = lasso_model.fit(X,y)
Is there a simple way to incorporate the fold splits? Any help is greatly appreciated
If your data are in a pandas dataframe, then all you need to do is access that column
fold_labels = df["fold"]
from sklearn.cross_validation import LeaveOneLabelOut
cv = LeaveOneLabelOut(fold_labels)
lasso_model = LassoCV(cv=cv, alphas=reg_para)
So if you obtain the fold labels in an array fold_labels, you can just use LeaveOneLabelOut (sorry for the non-functional code; it should be sufficient to elucidate the idea, though).
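The sklearn.cross_validation module has since been removed; the modern equivalent for pre-assigned folds is PredefinedSplit from sklearn.model_selection. A minimal sketch under the question's setup, assuming the dataframe is named df and the predictors are every column except mpg and fold:
from sklearn.linear_model import LassoCV
from sklearn.model_selection import PredefinedSplit

X = df.drop(columns=['mpg', 'fold'])
y = df['mpg']

# each unique value in the fold column is held out as the test set exactly once
cv = PredefinedSplit(test_fold=df['fold'])
lasso_model = LassoCV(cv=cv, alphas=reg_para)
lasso_model.fit(X, y)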