pandas binning via cut including single value (zero) bin

Using pandas in Python, given
import pandas as pd
s = pd.Series([0.0, 0.1, 0.2, 0.5, 0.0, 0.2, 0.1, 0.5, 0.0])
how can I bin the values so that the first bin contains only zeros?
I tried
bins = pd.IntervalIndex.from_tuples([(0, 0), (0, 0.1), (0.1, 0.2), (0.2, float("inf"))])
pd.cut(s, bins)
which gives
0 NaN
1 (0.0, 0.1]
2 (0.1, 0.2]
3 (0.2, inf]
4 NaN
5 (0.1, 0.2]
6 (0.0, 0.1]
7 (0.2, inf]
8 NaN
dtype: category
Categories (4, interval[float64]): [(0.0, 0.0] < (0.0, 0.1] < (0.1, 0.2] < (0.2, inf]]
but
zero_bin = pd.IntervalIndex.from_tuples([(0, 0)], closed="both")
pd.cut(s, zero_bin)
results in
0 [0.0, 0.0]
1 NaN
2 NaN
3 NaN
4 [0.0, 0.0]
5 NaN
6 NaN
7 NaN
8 [0.0, 0.0]
dtype: category
Categories (1, interval[int64]): [[0, 0]]
But I did not find a way to combine the zero_bin and bins to get the desired result of
0 [0.0, 0.0]
1 (0.0, 0.1]
2 (0.1, 0.2]
3 (0.2, inf]
4 [0.0, 0.0]
5 (0.1, 0.2]
6 (0.0, 0.1]
7 (0.2, inf]
8 [0.0, 0.0]
dtype: category
Categories (4, interval[float64]): [[0.0, 0.0] < (0.0, 0.1] < (0.1, 0.2] < (0.2, inf]]

Maybe a little late, but for anyone with the same question: pass the argument include_lowest=True to pd.cut.
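Note that include_lowest=True only closes the left edge of the first interval, so with edges such as [0, 0.1, 0.2, inf] the zeros would share a bin with values up to 0.1. Below is a minimal sketch of one way to get a zero-only bin instead, assuming the data are never negative: add -inf as an extra left edge so the first interval is (-inf, 0.0]. The names edges, binned and codes are illustrative. The label differs from the [0.0, 0.0] asked for in the question, but for non-negative data that bin can only hold zeros.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 0.1, 0.2, 0.5, 0.0, 0.2, 0.1, 0.5, 0.0])

# Assumption: s has no negative values, so (-inf, 0.0] can only catch zeros.
edges = [-np.inf, 0, 0.1, 0.2, np.inf]
binned = pd.cut(s, bins=edges)

# Integer position of the bin each value fell into (0 is the zero-only bin).
codes = binned.cat.codes.tolist()

print(binned.value_counts(sort=False))
```

If the series can contain negative values, this first bin would catch them too, so they would need to be filtered or given their own edge first.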

Related

How do I add list values at a specific column in pandas?

Let's say I have the following dataframe:
df = pd.DataFrame({'sample': ['sample 1', 'sample 2'],
                   'values': [[0.2, 0.3, 0.5], [0.3, 0.3, 0.4]],
                   'group': [1, 0]})
output:
     sample           values  group
0  sample 1  [0.2, 0.3, 0.5]      1
1  sample 2  [0.3, 0.3, 0.4]      0
and a list of groups:
lst = [0, 1, 2]
My expected output has every sample repeated once for each group in lst.

You can try this:
import numpy as np
new_df = pd.DataFrame(np.repeat(df.values, len(lst), axis=0), columns=df.columns)
new_lst = lst * len(df)
new_df['group'] = new_lst
print(new_df)
Output:
     sample           values  group
0  sample 1  [0.2, 0.3, 0.5]      0
1  sample 1  [0.2, 0.3, 0.5]      1
2  sample 1  [0.2, 0.3, 0.5]      2
3  sample 2  [0.3, 0.3, 0.4]      0
4  sample 2  [0.3, 0.3, 0.4]      1
5  sample 2  [0.3, 0.3, 0.4]      2

Use a cross merge:
out = df.drop(columns='group').merge(pd.Series(lst, name='group'), how='cross')
Output:
     sample           values  group
0  sample 1  [0.2, 0.3, 0.5]      0
1  sample 1  [0.2, 0.3, 0.5]      1
2  sample 1  [0.2, 0.3, 0.5]      2
3  sample 2  [0.3, 0.3, 0.4]      0
4  sample 2  [0.3, 0.3, 0.4]      1
5  sample 2  [0.3, 0.3, 0.4]      2
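A third variant, sketched here as an illustration rather than a recommendation, repeats the rows through the index instead of going through df.values. This avoids the intermediate object array, which may matter when the frame has other numeric columns whose dtypes should be kept. The name out is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sample': ['sample 1', 'sample 2'],
                   'values': [[0.2, 0.3, 0.5], [0.3, 0.3, 0.4]],
                   'group': [1, 0]})
lst = [0, 1, 2]

# Repeat each row len(lst) times by repeating its index label.
out = df.loc[df.index.repeat(len(lst))].reset_index(drop=True)

# Tile lst once per original row: [0, 1, 2, 0, 1, 2].
out['group'] = np.tile(lst, len(df))

print(out)
```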

scikit learn Random Forest Classifier probability threshold

I'm using sklearn RandomForestClassifier for a prediction task.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=300, n_jobs=-1)
model.fit(x_train,y_train)
model.predict_proba(x_test)
There are 171 classes to predict.
I want to predict only those classes where predict_proba(class) is at least 90%. Everything below that should be set to 0.
For example, given the following:
     1    2    3    4    5    6    7
0  0.0  0.0  0.1  0.9  0.0  0.0  0.0
1  0.2  0.1  0.1  0.3  0.1  0.0  0.2
2  0.1  0.1  0.1  0.1  0.1  0.4  0.1
3  1.0  0.0  0.0  0.0  0.0  0.0  0.0
my expected output is:
0 4
1 0
2 0
3 1

You can use numpy.argwhere as follows:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
model = RandomForestClassifier(n_estimators=300, n_jobs=-1)
model.fit(x_train,y_train)
preds = model.predict_proba(x_test)
#preds = np.array([[0.0, 0.0, 0.1, 0.9, 0.0, 0.0, 0.0],
# [ 0.2, 0.1, 0.1, 0.3, 0.1, 0.0, 0.2],
# [ 0.1 ,0.1, 0.1, 0.1, 0.1, 0.4, 0.1],
# [ 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
r = np.zeros(preds.shape[0], dtype=int)
t = np.argwhere(preds>=0.9)
r[t[:,0]] = t[:,1]+1
r
array([4, 0, 0, 1])

You can use list comprehensions:
import numpy as np
# dummy predictions - 3 samples, 3 classes
pred = np.array([[0.1, 0.2, 0.7],
                 [0.95, 0.02, 0.03],
                 [0.08, 0.02, 0.9]])
# first, keep only entries >= 0.9:
out_temp = np.array([[x[i] if x[i] >= 0.9 else 0 for i in range(len(x))] for x in pred])
out_temp
# result:
array([[0. , 0. , 0. ],
[0.95, 0. , 0. ],
[0. , 0. , 0.9 ]])
out = [0 if not x.any() else x.argmax()+1 for x in out_temp]
out
# result:
[0, 1, 3]
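The same thresholding can also be written without argwhere or Python loops. The sketch below is one possible vectorized form, using the example probabilities from the question in place of model.predict_proba(x_test). It assumes, as the question does, that the class labels are 1..n in column order. For a real model, the columns of predict_proba follow model.classes_, so the labels would be looked up there instead of using best + 1. The names threshold, best, confident and labels are illustrative.

```python
import numpy as np

# Example probabilities from the question (4 samples, 7 classes).
preds = np.array([[0.0, 0.0, 0.1, 0.9, 0.0, 0.0, 0.0],
                  [0.2, 0.1, 0.1, 0.3, 0.1, 0.0, 0.2],
                  [0.1, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1],
                  [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
threshold = 0.9

# Column index of the most probable class in each row.
best = preds.argmax(axis=1)
# True where that top probability reaches the threshold.
confident = preds.max(axis=1) >= threshold

# Assumption: labels are 1..n in column order; 0 marks "no confident class".
labels = np.where(confident, best + 1, 0)
print(labels)
```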

How to print categories in pandas.cut?

When you call pandas.cut on a column of a dataframe, the output shows the bin of each element along with Name, Length, dtype, and Categories. I only want the Categories array, so that I get just the ranges of the bins I asked for. For example, with bins=4 applied to the numbers 1, 2, 3, 4, 5, I would want the output to show only the ranges of the four bins, i.e. (1, 2], (2, 3], (3, 4], (4, 5].
Is there any way I can do this? Any approach is fine, even if it doesn't print "Categories".

I assume you just want to get the bins from pd.cut().
If so, you can simply set retbins=True; see the documentation for pd.cut.
For example:
In[01]:
data = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
cats, bins = pd.cut(data.a, 4, retbins=True)
Out[01]:
cats:
0 (0.996, 2.0]
1 (0.996, 2.0]
2 (2.0, 3.0]
3 (3.0, 4.0]
4 (4.0, 5.0]
Name: a, dtype: category
Categories (4, interval[float64]): [(0.996, 2.0] < (2.0, 3.0] < (3.0, 4.0] < (4.0, 5.0]]
bins:
array([0.996, 2. , 3. , 4. , 5. ])
Then you can reuse the bins as you please,
e.g.,
lst = [1, 2, 3]
category = pd.cut(lst,bins)
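If the goal is the interval labels themselves rather than the numeric edges, the categories can be read from the result's .cat accessor. A minimal sketch, reusing the example data from this answer (the names intervals and labels are illustrative):

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
cats = pd.cut(data['a'], 4)

# .cat.categories is an IntervalIndex holding only the bins, not the rows.
intervals = cats.cat.categories

# Plain strings, in case only the printed ranges are wanted.
labels = [str(iv) for iv in intervals]
print(labels)
```

Each element of intervals is a pd.Interval, so its left and right attributes give the edges of that bin.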

For anyone who has come here to see how to select a particular bin from the pd.cut function: we can use the pd.Interval function.
df['bin'] = pd.cut(df['y'], [0.1, .2,.3,.4,.5, .6,.7,.8 ,.9])
print(df["bin"].value_counts())
Output
(0.2, 0.3] 697
(0.4, 0.5] 156
(0.5, 0.6] 122
(0.3, 0.4] 12
(0.6, 0.7] 8
(0.7, 0.8] 4
(0.1, 0.2] 0
(0.8, 0.9] 0
print(df.loc[df['bin'] == pd.Interval(0.7, 0.8)])

numpy arange implementation on pandas dataframe

I have a dataframe, like so,
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0, 0.5, 0.2],
                   'b': [1, 1, 0.3]})
print (df)
     a    b
0  0.0  1.0
1  0.5  1.0
2  0.2  0.3
I want to generate a Series that looks like
pd.Series ([np.arange ( start = 0, stop = 1, step = 0.1),
np.arange ( start = 0.5, stop = 1, step = 0.1),
np.arange ( start = 0.2, stop = 0.3, step = 0.1)])
0 [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, ...
1 [0.5, 0.6, 0.7, 0.8, 0.9]
2 [0.2]
dtype: object
I am trying to do this with a lambda function and getting an error, like so
foo = lambda x: np.arange(start = x.a, stop = x.b, step = 0.1)
print (df.apply(foo, axis =1))
ValueError: Shape of passed values is (3, 10), indices imply (3, 2)
I am not sure what this means. Is there a better/correct way to do this?

I'd use a comprehension
pd.Series([np.arange(a, b, .1) for a, b in zip(df.a, df.b)], df.index)
0 [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, ...
1 [0.5, 0.6, 0.7, 0.8, 0.9]
2 [0.2]
dtype: object

Use itertuples with Series constructor:
s = pd.Series([np.arange(x.a, x.b, .1) for x in df.itertuples()], index=df.index)
print (s)
0 [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, ...
1 [0.5, 0.6, 0.7, 0.8, 0.9]
2 [0.2]
dtype: object
Or use iterrows:
s = pd.Series([np.arange(x.a, x.b, .1) for i, x in df.iterrows()], index=df.index)
print (s)
0 [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, ...
1 [0.5, 0.6, 0.7, 0.8, 0.9]
2 [0.2]
dtype: object
With apply, this works only if you convert the result to a tuple:
foo = lambda x: tuple(np.arange(start = x.a, stop = x.b, step = 0.1))
print (df.apply(foo, axis = 1))
0 (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, ...
1 (0.5, 0.6, 0.7, 0.8, 0.9)
2 (0.2,)
dtype: object

How to generate a list of 4 values that adds up to 1 in Python? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I'm trying to generate the complete set of lists of 4 values that add up to 1, where each value is a multiple of 0.1 (a 10% increment).
For example,
These are valid lists
[0, 0, 0, 1]
[0.1, 0.8, 0.1, 0]
[0.2, 0.2, 0.2, 0.4]
These are invalid lists
[1, 0.1, 0, 0]
[0.5, 0.5, 0.1, 0]
I believe the number of permutations would be 10!/6!, but I could be wrong.

This cuts the interval [0, 10] at three integers, giving you four subintervals whose lengths just need to be divided by 10.
>>> import itertools
>>> for a, b, c in itertools.combinations_with_replacement(range(11), 3):
...     print([a/10, (b-a)/10, (c-b)/10, (10-c)/10])
[0.0, 0.0, 0.0, 1.0]
[0.0, 0.0, 0.1, 0.9]
[0.0, 0.0, 0.2, 0.8]
[0.0, 0.0, 0.3, 0.7]
[0.0, 0.0, 0.4, 0.6]
[0.0, 0.0, 0.5, 0.5]
[0.0, 0.0, 0.6, 0.4]
[0.0, 0.0, 0.7, 0.3]
[0.0, 0.0, 0.8, 0.2]
[0.0, 0.0, 0.9, 0.1]
[0.0, 0.0, 1.0, 0.0]
[0.0, 0.1, 0.0, 0.9]
[0.0, 0.1, 0.1, 0.8]
[0.0, 0.1, 0.2, 0.7]
[0.0, 0.1, 0.3, 0.6]
...
...
...
[0.7, 0.2, 0.0, 0.1]
[0.7, 0.2, 0.1, 0.0]
[0.7, 0.3, 0.0, 0.0]
[0.8, 0.0, 0.0, 0.2]
[0.8, 0.0, 0.1, 0.1]
[0.8, 0.0, 0.2, 0.0]
[0.8, 0.1, 0.0, 0.1]
[0.8, 0.1, 0.1, 0.0]
[0.8, 0.2, 0.0, 0.0]
[0.9, 0.0, 0.0, 0.1]
[0.9, 0.0, 0.1, 0.0]
[0.9, 0.1, 0.0, 0.0]
[1.0, 0.0, 0.0, 0.0]
Or, more generally (just replace 3 by the number of cuts you want):
>>> for cuts in itertools.combinations_with_replacement(range(11), 3):
...     print([(b-a)/10 for a, b in zip((0,) + cuts, cuts + (10,))])

Stefan's solution is nicer, however, you can also do this using list comprehension and the itertools library:
import itertools
perm = [[x /10.0 for x in t] for t in itertools.product(range(11), repeat=4) if sum(t)==10]

What you need is called an integer partition. A web search for that term should turn up a lot of hits. An integer partition of n of length m is just a list of m positive integers which add up to n.
Once you have an integer partition of 10 of length 4, you can just divide the integers by 10 to get increments of 0.1 and a sum of 1.
EDIT: I see that the list could be less than 4 items (some elements could be zero). So you are looking for integer partitions of length <= 4 (not length == 4).
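Since order matters here and zeros are allowed, the objects being listed are usually called weak compositions of 10 into 4 parts rather than partitions, and a stars-and-bars argument counts them as C(10 + 4 - 1, 4 - 1) = C(13, 3) = 286, not 10!/6!. The sketch below checks that count against the two enumeration approaches from the answers above. It works in integer tenths to avoid floating-point comparisons, and the names by_cuts, by_filter and count are illustrative.

```python
import itertools
import math

# Approach 1: three cut points in [0, 10], repeats allowed, give four lengths.
by_cuts = {
    (a, b - a, c - b, 10 - c)
    for a, b, c in itertools.combinations_with_replacement(range(11), 3)
}

# Approach 2: brute force over all 4-tuples of tenths, keeping those summing to 10.
by_filter = {t for t in itertools.product(range(11), repeat=4) if sum(t) == 10}

# Stars and bars: weak compositions of 10 into 4 parts.
count = math.comb(10 + 4 - 1, 4 - 1)

print(len(by_cuts), len(by_filter), count)  # prints: 286 286 286
```

Dividing each entry by 10 at the end gives the lists of fractions asked for in the question.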
