Python: pandas.cut labels are ignored

I want to cut one column in my pandas.DataFrame using pandas.cut(), but the labels I pass to the labels argument are not applied. Let me show you an example.
I have got the following data frame:
>>> import pandas as pd
>>> df = pd.DataFrame({'x': [-0.009, 0.089, 0.095, 0.096, 0.198]})
>>> print(df)
x
0 -0.009
1 0.089
2 0.095
3 0.096
4 0.198
And I cut x column like this:
>>> bins = pd.IntervalIndex.from_tuples([(-0.2, -0.1), (-0.1, 0.0), (0.0, 0.1), (0.1, 0.2)])
>>> labels = [100, 200, 300, 400]
>>> df['x_cut'] = pd.cut(df['x'], bins, labels=labels)
>>> print(df)
x x_cut
0 -0.009 (-0.1, 0.0]
1 0.089 (0.0, 0.1]
2 0.095 (0.0, 0.1]
3 0.096 (0.0, 0.1]
4 0.198 (0.1, 0.2]
However, I expected the data frame to look like this:
x x_cut
0 -0.009 200
1 0.089 300
2 0.095 300
3 0.096 300
4 0.198 400
What am I missing? How can I get the data frame with the correct labels?

This is a known bug, tracked as pandas issue 21233: when bins is an IntervalIndex, the labels argument is ignored.
What works for me, as #anky_91 commented, is mapping the resulting intervals with a dictionary created by zip:
df['x_cut'] = pd.cut(df['x'], bins).map(dict(zip(bins, labels)))
print(df)
x x_cut
0 -0.009 200
1 0.089 300
2 0.095 300
3 0.096 300
4 0.198 400
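Alternatively, if you do not strictly need an IntervalIndex, you can pass plain bin edges to pd.cut; with a list of edges the labels argument is applied as expected. A minimal sketch on the same data:
import pandas as pd
df = pd.DataFrame({'x': [-0.009, 0.089, 0.095, 0.096, 0.198]})
# With plain edges (instead of an IntervalIndex), the labels are honoured
edges = [-0.2, -0.1, 0.0, 0.1, 0.2]
labels = [100, 200, 300, 400]
df['x_cut'] = pd.cut(df['x'], bins=edges, labels=labels)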

Related

How to find averages for each bin of a column

I have a data frame below:
import pandas as pd
df = {'A':[1.06, 1.01, 0.99, 0.98, 1.05, 0.96], 'B':[2, 7, 22, 7, 15, 16]}
df = pd.DataFrame(df)
I want to find the mean of column B for each bin of column A.
For example, if I want to create bins of 0.02 starting from the minimum value in column A, then the bins will be like this (inclusive):
1) 0.96-0.98
2) 0.99-1.01
3) 1.02-1.04
4) 1.05-1.07
The average of each bin will be
1) (16+7)/2 = 11.5
2) (7+22)/2 = 14.5
3) 0
4) (2+15)/2 = 8.5
Thus, the outcome will look like:
df = {'A':[1.06, 1.01, 0.99, 0.98, 1.05, 0.96], 'B':[2, 7, 22, 7, 15, 16], 'Avg':[8.5, 14.5, 14.5, 11.5, 8.5, 11.5]}
df = pd.DataFrame(df)
LOGIC
You can create the bins, use them to group the data, and then apply transform on the groupby, which returns a result for each row rather than one row per group.
PS: pd.cut is used for the binning, assuming you have a way to derive the desired bin edges.
SOLUTION
import pandas as pd
df = {"A": [1.06, 1.01, 0.99, 0.98, 1.05, 0.96], "B": [2, 7, 22, 7, 15, 16]}
df = pd.DataFrame(df)
df["cat"] = pd.cut(df["A"], [0.95, 0.98, 1.01, 1.04, 1.07], right=True)
print(df)
OUTPUT with Bins
A B cat
0 1.06 2 (1.04, 1.07]
1 1.01 7 (0.98, 1.01]
2 0.99 22 (0.98, 1.01]
3 0.98 7 (0.95, 0.98]
4 1.05 15 (1.04, 1.07]
5 0.96 16 (0.95, 0.98]
df["Avg"] = df.groupby("cat")["B"].transform("mean")
# you can directly groupby pd.cut without making new column
print(df)
FINAL OUTPUT with average
A B cat Avg
0 1.06 2 (1.04, 1.07] 8.5
1 1.01 7 (0.98, 1.01] 14.5
2 0.99 22 (0.98, 1.01] 14.5
3 0.98 7 (0.95, 0.98] 11.5
4 1.05 15 (1.04, 1.07] 8.5
5 0.96 16 (0.95, 0.98] 11.5
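As the comment in the code notes, you can also group directly by pd.cut without creating the helper column; a minimal sketch on the same DataFrame:
df["Avg"] = df.groupby(pd.cut(df["A"], [0.95, 0.98, 1.01, 1.04, 1.07]))["B"].transform("mean")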

Change normalized integer values to categories for classification

I'm working on this dataset with the following columns, N/A counts and example of a record:
Serial No. 0
GRE Score 0
TOEFL Score 0
University Rating 0
SOP 0
LOR 0
CGPA 0
Research 0
Chance of Admit 0
dtype: int64
0: 1 337 118 4 4.5 4.5 9.65 1 0.92
1: 2 324 107 4 4.0 4.5 8.87 1 0.76
The column Chance of Admit is a normalised value ranging from 0 to 1. What I want to do is take this column and output corresponding ordered values, where the chance falls into bins such as (low, medium, high) or (unlikely, doable, likely), etc.
What I have come across is that pandas has a built-in function named to_categorical; however, I don't understand it well enough, and what I have read I still don't quite get.
This dataset would be used for a decision tree where the labels would be the chance of admit
Thank you for your help
Since they are "normalized" values... why would you need to categorize them? A simple threshold should work, right?
i.e.
0-0.33 low
0.33-0.66 medium
0.66-1.0 high
The only reason you would want to use an automated method would probably be if your number of categories keeps changing.
To create the categories you could use pandas cut, but you will need to determine the range and the number of bins (categories). From the docs, something like this should work, I think:
In [6]: df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})
In [7]: labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
In [8]: df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
In [9]: df.head(10)
Out[9]:
value group
0 65 60 - 69
1 49 40 - 49
2 56 50 - 59
3 43 40 - 49
4 43 40 - 49
5 91 90 - 99
6 32 30 - 39
7 87 80 - 89
8 36 30 - 39
9 8 0 - 9
You can then replace the example's value column with your Chance of Admit column and fill in the necessary ranges for your discrete bins, either by threshold or automatically based on the number of bins.
For your reference:
https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
IIUC, you want to map a continuous variable to a categorical value based on ranges, for example:
0.96 -> high,
0.31 -> low
...
So pandas provides a function for just that, cut; from the documentation:
Use cut when you need to segment and sort data values into bins. This
function is also useful for going from a continuous variable to a
categorical variable.
Setup
Serial No. GRE Score TOEFL Score ... CGPA Research Chance of Admit
0 1 337 118 ... 9.65 1 0.92
1 2 324 107 ... 8.87 1 0.76
2 2 324 107 ... 8.87 1 0.31
3 2 324 107 ... 8.87 1 0.45
[4 rows x 9 columns]
Assuming the above setup, you could use cut like this:
labels = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=['low', 'medium', 'high'])
print(labels)
Output
0 high
1 high
2 low
3 medium
Name: Chance of Admit, dtype: category
Categories (3, object): [low < medium < high]
Notice that we use 3 bins: [(0, 0.33], (0.33, 0.66], (0.66, 1.0]] and that the values of the column Chance of Admit are [0.92, 0.76, 0.31, 0.45]. If you want to change the label names, just change the value of the labels parameter, for example: labels=['unlikely', 'doable', 'likely']. If you need an ordinal value, do:
labels = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=list(range(3)))
print(labels)
Output
0 2
1 2
2 0
3 1
Name: Chance of Admit, dtype: category
Categories (3, int64): [0 < 1 < 2]
Finally, to put it all together, you could do the following to add it to your DataFrame:
df['group'] = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=['low', 'medium', 'high'])
print(df)
Output
Serial No. GRE Score TOEFL Score ... Research Chance of Admit group
0 1 337 118 ... 1 0.92 high
1 2 324 107 ... 1 0.76 high
2 2 324 107 ... 1 0.31 low
3 2 324 107 ... 1 0.45 medium
[4 rows x 10 columns]
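Since the question mentions feeding this into a decision tree, note that the group column produced by cut is an ordered categorical, so you can take integer codes from it directly. A small sketch, assuming the df and group column from the setup above:
# Integer codes of the ordered categories: low=0, medium=1, high=2 (missing values become -1)
df['group_code'] = df['group'].cat.codes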

Transform dataframe value to range value in Python 3

I have a dataframe with the values:
3.05
35.97
49.11
48.80
48.02
10.61
25.69
6.02
55.36
0.42
47.87
2.26
54.43
8.85
8.75
14.29
41.29
35.69
44.27
1.08
I want to transform each value into a range and assign a new value to it.
From the df we know the min value is 0.42 and the max value is 55.36.
From the min to the max, I want to divide the range into 4 groups:
0.42 - 14.15 transform to 1
14.16 - 27.88 transform to 2
27.89 - 41.61 transform to 3
41.62 - 55.36 transform to 4
so the result I expected is
1
3
4
4
4
1
2
1
4
1
4
1
4
1
1
2
3
3
4
1
This is normally called binning, but pandas calls it cut. Sample code is below:
import pandas as pd
# Create a list of numbers, with a header called "nums"
data_list = [('nums', [3.05, 35.97, 49.11, 48.80, 48.02, 10.61, 25.69, 6.02, 55.36, 0.42, 47.87, 2.26, 54.43, 8.85, 8.75, 14.29, 41.29, 35.69, 44.27, 1.08])]
# Create the labels for the bin
bin_labels = [1,2,3,4]
# Create the dataframe object from data_list (DataFrame.from_items has been removed from pandas)
df = pd.DataFrame(dict(data_list))
# Define the scope of the bins
bins = [0.41, 14.16, 27.89, 41.62, 55.37]
# Create the "bins" column using the cut function using the bins and labels
df['bins'] = pd.cut(df['nums'], bins=bins, labels=bin_labels)
This creates a dataframe which has the following structure:
print(df)
nums bins
0 3.05 1
1 35.97 3
2 49.11 4
3 48.80 4
4 48.02 4
5 10.61 1
6 25.69 2
7 6.02 1
8 55.36 4
9 0.42 1
10 47.87 4
11 2.26 1
12 54.43 4
13 8.85 1
14 8.75 1
15 14.29 2
16 41.29 3
17 35.69 3
18 44.27 4
19 1.08 1
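Note that the bins column created by cut with labels is categorical; if you need plain integers instead (for example for further arithmetic), you can convert it. A small sketch, assuming no value falls outside the bin edges:
# Convert the categorical bin labels (1-4) to plain integers
df['bins'] = df['bins'].astype(int)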
You could construct a function like the following to have full control over the process:
def transform(l):
    l2 = []
    for i in l:
        if 0.42 <= i <= 14.15:
            l2.append(1)
        elif i <= 27.88:
            l2.append(2)
        elif i <= 41.61:
            l2.append(3)
        elif i <= 55.36:
            l2.append(4)
    return l2

df['nums'] = transform(df['nums'])

The most elegant way to do a calculation on dataframe column

I'm a newbie in Python.
I have a column in a pandas DataFrame called [weight].
What is the most efficient and smartest way to rescale the securities' weights so that they sum to 1 (or 100%)? Something like the sample calculation below:
       weight   new weight
       0.05     14%
       0.10     29%
       0.20     57%
total  0.35     100%
df = pd.DataFrame({'Security' : ['ABC','DEF','GHI'], 'Rating': ['AAA', 'BBB','AA'], 'Weight' : [ 0.05, 0.1, 0.2]})
print(df)
Security Rating Weight
ABC AAA 0.05
DEF BBB 0.10
GHI AA 0.20
I think we can divide the weight by the sum of weights and get the percentage weight (newWeight):
import pandas as pd
df = pd.DataFrame({'Security' : ['ABC','DEF','GHI'], 'Rating': ['AAA', 'BBB','AA'], 'Weight' : [ 0.05, 0.1, 0.2]})
df['newWeight'] = 100 * df['Weight'] / sum(df['Weight'])
print(df)
## Rating Security Weight newWeight
## 0 AAA ABC 0.05 14.285714
## 1 BBB DEF 0.10 28.571429
## 2 AA GHI 0.20 57.142857
Using the apply method is a neat way to solve this problem. You can do something like this:
import pandas as pd
df = pd.DataFrame({'Security' : ['ABC','DEF','GHI'], 'Rating': ['AAA', 'BBB','AA'], 'Weight' : [ 0.05, 0.1, 0.2]})
total = df.Weight.sum()
df['newWeight'] = df.Weight.apply(lambda x: x / total)
The resulting DataFrame looks like this:
Security Rating Weight newWeight
0 ABC AAA 0.05 0.142857
1 DEF BBB 0.10 0.285714
2 GHI AA 0.20 0.571429
If you want to represent these as percentages, you need to convert them to strings; here's an example:
df['percentWeight'] = df.newWeight.apply(lambda x: "{}%".format(round(x * 100)))
And you get the result:
Security Rating Weight newWeight percentWeight
0 ABC AAA 0.05 0.142857 14%
1 DEF BBB 0.10 0.285714 29%
2 GHI AA 0.20 0.571429 57%
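For reference, the same result can also be computed without apply, using vectorized division and string formatting; a short sketch on the same DataFrame:
df['newWeight'] = df['Weight'] / df['Weight'].sum()
df['percentWeight'] = df['newWeight'].map('{:.0%}'.format)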

Pandas Replace Values with Dictionary

I have a data frame with the below structure:
Ranges Relative_17-Aug Relative_17-Sep Relative_17-Oct
0 (0.0, 0.1] 1372 1583 1214
1 (0.1, 0.2] 440 337 648
2 (0.2, 0.3] 111 51 105
3 (0.3, 0.4] 33 10 19
4 (0.4, 0.5] 16 4 9
5 (0.5, 0.6] 7 7 1
6 (0.6, 0.7] 4 3 0
7 (0.7, 0.8] 5 1 0
8 (0.8, 0.9] 2 3 0
9 (0.9, 1.0] 2 0 1
10 (1.0, 2.0] 6 0 2
I am trying to replace the Ranges column with a dictionary using the code below, but it is not working. Any hints on whether I am doing something wrong?
mydict= {"(0.0, 0.1]":"<=10%","(0.1, 0.2]":">10% and <20%","(0.2, 0.3]":">20% and <30%", "(0.3, 0.4]":">30% and <40%", "(0.4, 0.5]":">40% and <50%", "(0.5, 0.6]":">50% and <60%", "(0.6, 0.7]":">60% and <70%", "(0.7, 0.8]":">70% and <80%", "(0.8, 0.9]":">80% and <90%", "(0.9, 1.0]":">90% and <100%", "(1.0, 2.0]":">100%"}
t_df["Ranges"].replace(mydict,inplace=True)
Thanks!
I think the best approach here is to use the labels parameter when the Ranges column is created with cut:
labels = ['<=10%','>10% and <20%', ...]
# change according to your bins
bins = [0,0.1,0.2...]
t_df['Ranges'] = pd.cut(t_df['col'], bins=bins, labels=labels)
If that is not possible, casting to string should help, as #Dark suggested in the comments; for better performance, use map:
t_df["Ranges"] = t_df["Ranges"].astype(str).map(mydict)
Using the map function, this can be achieved easily and in a straightforward manner, as shown below.
mydict= {"(0.0, 0.1]":"<=10%","(0.1, 0.2]":">10% and <20%","(0.2, 0.3]":">20% and <30%", "(0.3, 0.4]":">30% and <40%", "(0.4, 0.5]":">40% and <50%", "(0.5, 0.6]":">50% and <60%", "(0.6, 0.7]":">60% and <70%", "(0.7, 0.8]":">70% and <80%", "(0.8, 0.9]":">80% and <90%", "(0.9, 1.0]":">90% and <100%", "(1.0, 2.0]":">100%"}
t_df["Ranges"] = t_df["Ranges"].map(lambda x : mydict[str(x)])
Hope this helps!
