Using pandas in Python, given
import pandas as pd
s = pd.Series([0.0, 0.1, 0.2, 0.5, 0.0, 0.2, 0.1, 0.5, 0.0])
how can I bin the values so that the first bin contains only zeros?
I tried
bins = pd.IntervalIndex.from_tuples([(0, 0), (0, 0.1), (0.1, 0.2), (0.2, float("inf"))])
pd.cut(s, bins)
which gives
0 NaN
1 (0.0, 0.1]
2 (0.1, 0.2]
3 (0.2, inf]
4 NaN
5 (0.1, 0.2]
6 (0.0, 0.1]
7 (0.2, inf]
8 NaN
dtype: category
Categories (4, interval[float64]): [(0.0, 0.0] < (0.0, 0.1] < (0.1, 0.2] < (0.2, inf]]
but
zero_bin = pd.IntervalIndex.from_tuples([(0, 0)], closed="both")
pd.cut(s, zero_bin)
results in
0 [0.0, 0.0]
1 NaN
2 NaN
3 NaN
4 [0.0, 0.0]
5 NaN
6 NaN
7 NaN
8 [0.0, 0.0]
dtype: category
Categories (1, interval[int64]): [[0, 0]]
But I did not find a way to combine the zero_bin and bins to get the desired result of
0 [0.0, 0.0]
1 (0.0, 0.1]
2 (0.1, 0.2]
3 (0.2, inf]
4 [0.0, 0.0]
5 (0.1, 0.2]
6 (0.0, 0.1]
7 (0.2, inf]
8 [0.0, 0.0]
dtype: category
Categories (4, interval[float64]): [[0.0, 0.0] < (0.0, 0.1] < (0.1, 0.2] < (0.2, inf]]
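A late answer, but for anyone with the same question: you can pass include_lowest=True to pd.cut. This applies when bins is given as a sequence of edges such as [0, 0.1, 0.2, float("inf")]. It makes the first interval left-inclusive, so the zeros fall into the first bin together with the values up to 0.1 rather than into a bin of their own.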
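If a bin that holds only zeros is needed, one possible sketch (assuming the data contains no negative values) is to use -inf as the lowest edge, so that the first interval (-inf, 0.0] can only ever catch zeros. The name edges is just illustrative, and the first category is labelled (-inf, 0.0] rather than [0.0, 0.0].

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 0.1, 0.2, 0.5, 0.0, 0.2, 0.1, 0.5, 0.0])

# Assumption: s has no negative values, so (-inf, 0.0] only catches zeros.
edges = [-np.inf, 0, 0.1, 0.2, np.inf]
binned = pd.cut(s, edges)

# Count how many values landed in each bin, in bin order.
print(binned.value_counts(sort=False))
```

If negative values can occur, they would also land in the first bin, so this only works as a zero-only bin under the stated assumption.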
Let's say I have the following DataFrame:
df = pd.DataFrame({'sample': ['sample 1', 'sample 2'],
'values': [[0.2, 0.3, 0.5],[0.3, 0.3, 0.4]],
'group': [1, 0]})
output:
     sample           values  group
0  sample 1  [0.2, 0.3, 0.5]      1
1  sample 2  [0.3, 0.3, 0.4]      0
and a list of groups
lst=[0, 1, 2]
This is my expected output:
You can try this:
import numpy as np
new_df = pd.DataFrame(np.repeat(df.values, len(lst), axis=0), columns=df.columns)
new_lst = lst * len(df)
new_df['group'] = new_lst
print(new_df)
Output:
     sample           values  group
0  sample 1  [0.2, 0.3, 0.5]      0
1  sample 1  [0.2, 0.3, 0.5]      1
2  sample 1  [0.2, 0.3, 0.5]      2
3  sample 2  [0.3, 0.3, 0.4]      0
4  sample 2  [0.3, 0.3, 0.4]      1
5  sample 2  [0.3, 0.3, 0.4]      2
Use a cross merge:
out = df.drop(columns='group').merge(pd.Series(lst, name='group'), how='cross')
Output:
     sample           values  group
0  sample 1  [0.2, 0.3, 0.5]      0
1  sample 1  [0.2, 0.3, 0.5]      1
2  sample 1  [0.2, 0.3, 0.5]      2
3  sample 2  [0.3, 0.3, 0.4]      0
4  sample 2  [0.3, 0.3, 0.4]      1
5  sample 2  [0.3, 0.3, 0.4]      2
I'm using sklearn RandomForestClassifier for a prediction task.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=300, n_jobs=-1)
model.fit(x_train,y_train)
model.predict_proba(x_test)
There are 171 classes to predict.
I want to predict only those classes, where predict_proba(class) is at least 90%. Everything below should be set to 0.
For example, given the following:
     1    2    3    4    5    6    7
0  0.0  0.0  0.1  0.9  0.0  0.0  0.0
1  0.2  0.1  0.1  0.3  0.1  0.0  0.2
2  0.1  0.1  0.1  0.1  0.1  0.4  0.1
3  1.0  0.0  0.0  0.0  0.0  0.0  0.0
my expected output is:
0 4
1 0
2 0
3 1
You can use numpy.argwhere as follows:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
model = RandomForestClassifier(n_estimators=300, n_jobs=-1)
model.fit(x_train,y_train)
preds = model.predict_proba(x_test)
#preds = np.array([[0.0, 0.0, 0.1, 0.9, 0.0, 0.0, 0.0],
# [ 0.2, 0.1, 0.1, 0.3, 0.1, 0.0, 0.2],
# [ 0.1 ,0.1, 0.1, 0.1, 0.1, 0.4, 0.1],
# [ 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
r = np.zeros(preds.shape[0], dtype=int)
t = np.argwhere(preds>=0.9)
r[t[:,0]] = t[:,1]+1
r
array([4, 0, 0, 1])
You can use list comprehensions:
import numpy as np
# dummy predictions - 3 samples, 3 classes
pred = np.array([[0.1, 0.2, 0.7],
[0.95, 0.02, 0.03],
[0.08, 0.02, 0.9]])
# first, keep only entries >= 0.9:
out_temp = np.array([[x[i] if x[i] >= 0.9 else 0 for i in range(len(x))] for x in pred])
out_temp
# result:
array([[0. , 0. , 0. ],
[0.95, 0. , 0. ],
[0. , 0. , 0.9 ]])
out = [0 if not x.any() else x.argmax()+1 for x in out_temp]
out
# result:
[0, 1, 3]
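For comparison, the same thresholding can be sketched without Python-level loops by combining argmax and max along the class axis. This is only a sketch on the example matrix from the question. The names threshold, top, confident and labels are illustrative, and the 1-based numbering assumes the classes are simply numbered 1..n by column position.

```python
import numpy as np

# Example probabilities from the question (4 samples, 7 classes).
preds = np.array([[0.0, 0.0, 0.1, 0.9, 0.0, 0.0, 0.0],
                  [0.2, 0.1, 0.1, 0.3, 0.1, 0.0, 0.2],
                  [0.1, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1],
                  [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])

threshold = 0.9
top = preds.argmax(axis=1)                   # column of the most likely class
confident = preds.max(axis=1) >= threshold   # True where that class is >= 90%

# 1-based class number where confident, 0 otherwise.
labels = np.where(confident, top + 1, 0)
print(labels)
```

With a fitted classifier, the column positions correspond to model.classes_, so model.classes_[top] would give the actual class labels if they are not simply 1..n.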
When you apply pandas.cut to a DataFrame column, the output shows the bin of each element along with Name, Length, dtype, and Categories. I want only the Categories array printed, so that I get just the ranges of the bins. For example, with bins=4 applied to the numbers 1, 2, 3, 4, 5, I would want the output to show only the ranges of the four bins, i.e. (1, 2], (2, 3], (3, 4], (4, 5].
Is there any way I can do this? Any approach is fine, even if it doesn't involve printing "Categories".
I guess you just want to get the bins from pd.cut().
If so, you can simply set retbins=True; see the documentation for pd.cut.
For example:
In[01]:
data = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
cats, bins = pd.cut(data.a, 4, retbins=True)
Out[01]:
cats:
0 (0.996, 2.0]
1 (0.996, 2.0]
2 (2.0, 3.0]
3 (3.0, 4.0]
4 (4.0, 5.0]
Name: a, dtype: category
Categories (4, interval[float64]): [(0.996, 2.0] < (2.0, 3.0] < (3.0, 4.0] < (4.0, 5.0]]
bins:
array([0.996, 2. , 3. , 4. , 5. ])
Then you can reuse the bins as you please.
e.g.,
lst = [1, 2, 3]
category = pd.cut(lst,bins)
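If the intervals themselves are wanted rather than the edge array, one possible sketch is to read them from the categorical result through the .cat.categories accessor, which returns an IntervalIndex. The variable name intervals is illustrative.

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
cats = pd.cut(data['a'], 4)

# The categories of the result are the bins, stored as an IntervalIndex.
intervals = cats.cat.categories
for interval in intervals:
    print(interval)
```

Note that pandas widens the lowest edge slightly (0.996 in the output above) so that the minimum value falls inside the first bin.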
For anyone who has come here to see how to select a particular bin produced by pd.cut: you can compare against a pd.Interval object.
df['bin'] = pd.cut(df['y'], [0.1, .2,.3,.4,.5, .6,.7,.8 ,.9])
print(df["bin"].value_counts())
Output
(0.2, 0.3] 697
(0.4, 0.5] 156
(0.5, 0.6] 122
(0.3, 0.4] 12
(0.6, 0.7] 8
(0.7, 0.8] 4
(0.1, 0.2] 0
(0.8, 0.9] 0
print(df.loc[df['bin'] == pd.Interval(0.7, 0.8)])
I have a dataframe, like so,
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0, 0.5, 0.2],
'b': [1,1,0.3]})
print (df)
     a    b
0  0.0  1.0
1  0.5  1.0
2  0.2  0.3
I want to generate a Series that looks like
pd.Series ([np.arange ( start = 0, stop = 1, step = 0.1),
np.arange ( start = 0.5, stop = 1, step = 0.1),
np.arange ( start = 0.2, stop = 0.3, step = 0.1)])
0 [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, ...
1 [0.5, 0.6, 0.7, 0.8, 0.9]
2 [0.2]
dtype: object
I am trying to do this with a lambda function and getting an error, like so
foo = lambda x: np.arange(start = x.a, stop = x.b, step = 0.1)
print (df.apply(foo, axis =1))
ValueError: Shape of passed values is (3, 10), indices imply (3, 2)
I am not sure what this means. Is there a better/correct way to do this?
I'd use a comprehension
pd.Series([np.arange(a, b, .1) for a, b in zip(df.a, df.b)], df.index)
0 [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, ...
1 [0.5, 0.6, 0.7, 0.8, 0.9]
2 [0.2]
dtype: object
Use itertuples with Series constructor:
s = pd.Series([np.arange(x.a, x.b, .1) for x in df.itertuples()], index=df.index)
print (s)
0 [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, ...
1 [0.5, 0.6, 0.7, 0.8, 0.9]
2 [0.2]
dtype: object
s = pd.Series([np.arange(x.a, x.b, .1) for i, x in df.iterrows()], index=df.index)
print (s)
0 [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, ...
1 [0.5, 0.6, 0.7, 0.8, 0.9]
2 [0.2]
dtype: object
With apply, this works only if the result is converted to a tuple:
foo = lambda x: tuple(np.arange(start = x.a, stop = x.b, step = 0.1))
print (df.apply(foo, axis = 1))
0 (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, ...
1 (0.5, 0.6, 0.7, 0.8, 0.9)
2 (0.2,)
dtype: object
I'm trying to generate the complete set of lists of 4 values that add up to 1, where each value is a multiple of 0.1 (a 10% increment).
For example,
These are valid lists
[0, 0, 0, 1]
[0.1, 0.8, 0.1, 0]
[0.2, 0.2, 0.2, 0.4]
These are invalid lists
[1, 0.1, 0, 0]
[0.5, 0.5, 0.1, 0]
I believe the number of such lists would be 10!/6!, but I could be wrong.
This cuts the interval [0, 10] at three integers, giving you four subintervals whose lengths just need to be divided by 10.
>>> import itertools
>>> for a, b, c in itertools.combinations_with_replacement(range(11), 3):
...     print([a/10, (b-a)/10, (c-b)/10, (10-c)/10])
[0.0, 0.0, 0.0, 1.0]
[0.0, 0.0, 0.1, 0.9]
[0.0, 0.0, 0.2, 0.8]
[0.0, 0.0, 0.3, 0.7]
[0.0, 0.0, 0.4, 0.6]
[0.0, 0.0, 0.5, 0.5]
[0.0, 0.0, 0.6, 0.4]
[0.0, 0.0, 0.7, 0.3]
[0.0, 0.0, 0.8, 0.2]
[0.0, 0.0, 0.9, 0.1]
[0.0, 0.0, 1.0, 0.0]
[0.0, 0.1, 0.0, 0.9]
[0.0, 0.1, 0.1, 0.8]
[0.0, 0.1, 0.2, 0.7]
[0.0, 0.1, 0.3, 0.6]
...
...
...
[0.7, 0.2, 0.0, 0.1]
[0.7, 0.2, 0.1, 0.0]
[0.7, 0.3, 0.0, 0.0]
[0.8, 0.0, 0.0, 0.2]
[0.8, 0.0, 0.1, 0.1]
[0.8, 0.0, 0.2, 0.0]
[0.8, 0.1, 0.0, 0.1]
[0.8, 0.1, 0.1, 0.0]
[0.8, 0.2, 0.0, 0.0]
[0.9, 0.0, 0.0, 0.1]
[0.9, 0.0, 0.1, 0.0]
[0.9, 0.1, 0.0, 0.0]
[1.0, 0.0, 0.0, 0.0]
Or more general (just replace 3 by the number of cuts you want):
>>> for cuts in itertools.combinations_with_replacement(range(11), 3):
...     print([(b-a)/10 for a, b in zip((0,) + cuts, cuts + (10,))])
Stefan's solution is nicer; however, you can also do this with a list comprehension and itertools.product:
import itertools
perm = [[x /10.0 for x in t] for t in itertools.product(range(11), repeat=4) if sum(t)==10]
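As a sanity check, the cut-based and the product-based approaches can be compared on integer tenths, which avoids floating-point comparisons. The sketch below assumes Python 3.8+ for math.comb; the names cuts_based and product_based are illustrative. By the stars-and-bars argument the expected count is C(13, 3) = 286, not 10!/6! = 5040.

```python
import itertools
from math import comb

# Cut [0, 10] at three (possibly equal) integers; the gaps are the four parts.
cuts_based = {
    tuple(b - a for a, b in zip((0,) + cuts, cuts + (10,)))
    for cuts in itertools.combinations_with_replacement(range(11), 3)
}

# Brute force: every 4-tuple of tenths in 0..10 whose sum is exactly 10.
product_based = {
    t for t in itertools.product(range(11), repeat=4) if sum(t) == 10
}

print(cuts_based == product_based)
print(len(cuts_based), comb(13, 3))
```

The brute-force version examines 11**4 = 14641 candidates, while the cut-based version generates only the valid lists, which matters more as the number of parts grows.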
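What you need is closely related to an integer partition. A web search for that term should turn up a lot of hits. An integer partition of n of length m is a list of m positive integers which add up to n, where the order of the parts is ignored.
Once you have an integer partition of 10 of length 4, you can just divide the integers by 10 to get increments of 0.1 and a sum of 1.
EDIT: I see that some elements could be zero, so you are looking for integer partitions of length <= 4 (not length == 4), padded with zeros. Since lists such as [0.1, 0.8, 0.1, 0] and [0.8, 0.1, 0.1, 0] count as different here, you would also need every distinct ordering of each partition; the ordered version with zeros allowed is usually called a weak composition.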