Notice that when you apply pandas.cut to a dataframe column, the output contains the bin of each element plus Name:, Length:, dtype:, and Categories. I just want the Categories array printed so I can obtain only the ranges of the bins I asked for. For example, with bins=4 applied to a dataframe of the numbers "1,2,3,4,5", I would want the output to be solely the ranges of the four bins, i.e. (1, 2], (2, 3], (3, 4], (4, 5].
Is there any way I can do this? Any approach is fine, even if it doesn't involve printing "Categories".
I am guessing that you just want to get the 'bins' from pd.cut().
If so, you can simply set retbins=True; see the documentation for pd.cut.
For example:
In[01]:
data = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
cats, bins = pd.cut(data.a, 4, retbins=True)
Out[01]:
cats:
0 (0.996, 2.0]
1 (0.996, 2.0]
2 (2.0, 3.0]
3 (3.0, 4.0]
4 (4.0, 5.0]
Name: a, dtype: category
Categories (4, interval[float64]): [(0.996, 2.0] < (2.0, 3.0] < (3.0, 4.0] < (4.0, 5.0]]
bins:
array([0.996, 2. , 3. , 4. , 5. ])
Then you can reuse the bins as you please,
e.g.,
lst = [1, 2, 3]
category = pd.cut(lst, bins)
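If the goal is literally to print only the categories, one option is to read them off the result through the `.cat.categories` accessor. This is a minimal sketch that reuses the example data from above; the variable name `categories` is only illustrative.

```python
import pandas as pd

data = pd.DataFrame({"a": [1, 2, 3, 4, 5]})

# pd.cut on a Series returns a categorical Series; its categories
# are stored as an IntervalIndex holding one interval per bin.
cats = pd.cut(data["a"], 4)
categories = cats.cat.categories

# Print only the bin ranges, one per line.
for interval in categories:
    print(interval)
```

Note that the first interval starts slightly below the minimum, because pd.cut lowers the first edge a little so that the smallest value is included.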
For anyone who has come here to see how to select a particular bin produced by pd.cut: we can compare against a pd.Interval object.
df['bin'] = pd.cut(df['y'], [0.1, .2,.3,.4,.5, .6,.7,.8 ,.9])
print(df["bin"].value_counts())
Output
(0.2, 0.3] 697
(0.4, 0.5] 156
(0.5, 0.6] 122
(0.3, 0.4] 12
(0.6, 0.7] 8
(0.7, 0.8] 4
(0.1, 0.2] 0
(0.8, 0.9] 0
print(df.loc[df['bin'] == pd.Interval(0.7, 0.8)])
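A self-contained version of the same idea might look like the sketch below. The values in `y` are made up for illustration and are not the data behind the counts shown above. pd.Interval is right-closed by default, which matches the intervals that pd.cut produces by default.

```python
import pandas as pd

# Hypothetical data, chosen only to illustrate the selection.
df = pd.DataFrame({"y": [0.15, 0.75, 0.78, 0.25, 0.55]})

edges = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
df["bin"] = pd.cut(df["y"], edges)

# Build the interval to look for; closed="right" is the default,
# so this represents (0.7, 0.8].
target = pd.Interval(0.7, 0.8)

# Keep only the rows whose bin equals the target interval.
selected = df.loc[df["bin"] == target]
print(selected)
```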
From scipy reference for scipy.stats.binned_statistic_dd,
Binedges: All but the last (righthand-most) bin is half-open in each
dimension. In other words, if bins is [1, 2, 3, 4], then the first bin
is [1, 2) (including 1, but excluding 2) and the second [2, 3). The
last bin, however, is [3, 4], which includes 4.
I want to use pandas.qcut to generate the bin edges to pass to binned statistic, but the edges are defined exactly the other way around.
a = np.arange(0,10,1)
[0 1 2 3 4 5 6 7 8 9]
where,
d,b = pd.qcut(a, 9, retbins=True)
print(d.value_counts())
print(b)
(-0.001, 1.0] 2
(1.0, 2.0] 1
(2.0, 3.0] 1
(3.0, 4.0] 1
(4.0, 5.0] 1
(5.0, 6.0] 1
(6.0, 7.0] 0
(7.0, 8.0] 2
(8.0, 9.0] 1
dtype: int64
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
If I now run the binned_statistic using this binning,
h,e,binning = sp.stats.binned_statistic_dd(values=a,sample=a,bins=[np.array(b)])
print(binning)
[1 2 3 4 5 6 7 8 9 9]
which is a different binning of course, due to the different definition of the bin edges.
Is there a way to get the edges of qcut reversed? Since they are "real" numbers, I cannot just shift the values.
Otherwise, does scipy have this capability in some way I cannot see? Does binned_statistic allow the bins to be defined automatically based on the data distribution somehow?
So an expected output (which is not uniquely defined) for this particular case could be
be = [0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7, 8.5, 9.5]
such that,
h,e,binning = sp.stats.binned_statistic_dd(values=a,sample=a,bins=[np.array(be)])
print(binning)
[1 1 2 3 4 5 6 8 8 9]
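One possible way to convert right-closed edges into left-closed ones, offered as a sketch rather than an established recipe, is to move every edge except the first up to the next representable float with np.nextafter. For floating-point data, the interval (e_i, e_j] then contains the same values as [next(e_i), next(e_j)). The first edge is left alone because qcut already includes the lowest value. The example checks the result with np.digitize, which uses left-closed bins, instead of scipy; passing `shifted` to binned_statistic_dd is assumed to behave the same way given the bin definition quoted above.

```python
import numpy as np
import pandas as pd

a = np.arange(0, 10, 1)
cats, edges = pd.qcut(a, 9, retbins=True)

# Shift each edge to the next float above it, so that a right-closed
# bin (lo, hi] becomes the left-closed bin [next(lo), next(hi)).
shifted = np.nextafter(edges, np.inf)
# qcut already includes the lowest value, so keep the first edge.
shifted[0] = edges[0]

# np.digitize uses bins[i-1] <= x < bins[i]; subtract 1 for 0-based codes.
left_closed_codes = np.digitize(a, shifted) - 1

print(left_closed_codes)
print(cats.codes)
```

Both printed arrays should agree, meaning the shifted edges reproduce the qcut assignment under left-closed semantics.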
Using pandas in Python, given
import pandas as pd
s = pd.Series([0.0, 0.1, 0.2, 0.5, 0.0, 0.2, 0.1, 0.5, 0.0])
how can I do binning where the first bin contains only zeros?
I tried
bins = pd.IntervalIndex.from_tuples([(0, 0), (0, 0.1), (0.1, 0.2), (0.2, float("inf"))])
pd.cut(s, bins)
which gives
0 NaN
1 (0.0, 0.1]
2 (0.1, 0.2]
3 (0.2, inf]
4 NaN
5 (0.1, 0.2]
6 (0.0, 0.1]
7 (0.2, inf]
8 NaN
dtype: category
Categories (4, interval[float64]): [(0.0, 0.0] < (0.0, 0.1] < (0.1, 0.2] < (0.2, inf]]
but
zero_bin = pd.IntervalIndex.from_tuples([(0, 0)], closed="both")
pd.cut(s, zero_bin)
results in
0 [0.0, 0.0]
1 NaN
2 NaN
3 NaN
4 [0.0, 0.0]
5 NaN
6 NaN
7 NaN
8 [0.0, 0.0]
dtype: category
Categories (1, interval[int64]): [[0, 0]]
But I did not find a way to combine the zero_bin and bins to get the desired result of
0 [0.0, 0.0]
1 (0.0, 0.1]
2 (0.1, 0.2]
3 (0.2, inf]
4 [0.0, 0.0]
5 (0.1, 0.2]
6 (0.0, 0.1]
7 (0.2, inf]
8 [0.0, 0.0]
dtype: category
Categories (4, interval[float64]): [[0.0, 0.0] < (0.0, 0.1] < (0.1, 0.2] < (0.2, inf]]
Maybe a little late, but for anyone with the same question: you can pass the argument include_lowest=True to pd.cut.
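Another option, sketched here under the assumption that the data contains no negative values, is to add -inf as the first edge. The first bin is then (-inf, 0.0], which in practice holds only the zeros. This does not produce the [0.0, 0.0] label from the desired output, but the grouping is the same.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 0.1, 0.2, 0.5, 0.0, 0.2, 0.1, 0.5, 0.0])

# With right-closed bins, (-inf, 0] catches exactly the zeros
# as long as the series has no negative values (an assumption here).
edges = [-np.inf, 0, 0.1, 0.2, np.inf]
binned = pd.cut(s, edges)

# Count per bin, in bin order.
print(binned.value_counts(sort=False))
```

If nicer labels are wanted, the `labels` argument of pd.cut accepts a list with one name per bin.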
In pandas' own documentation on the cut method, it says that it produces equally sized bins. However, in the example they provide, it clearly doesn't:
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] ...
The first interval is larger than all the others, why is that?
Edit: even if the smallest number (1) in the array is made more than 1 (e.g. 1.001), it still produces bins of unequal width:
In [291]: pd.cut(np.array([1.001, 7, 5, 4, 6, 3]), 3)
Out[291]:
[(0.995, 3.001], (5.0, 7.0], (3.001, 5.0], (3.001, 5.0], (5.0, 7.0], (0.995, 3.001]]
Categories (3, interval[float64]): [(0.995, 3.001] < (3.001, 5.0] < (5.0, 7.0]]
For the kind of performance you get, I can live with this amount of fractional inaccuracy. However, if you know your data and want to get as close to evenly spaced bins as possible, use linspace for the bin spec (similar to here):
arr = np.array([1, 7, 5, 4, 6, 3])
pd.cut(arr, np.linspace(arr.min(), arr.max(), 3+1), include_lowest=True)
# [(0.999, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (0.999, 3.0]]
# Categories (3, interval[float64]): [(0.999, 3.0] < (3.0, 5.0] < (5.0, 7.0]]
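To see where the unequal width comes from, it can help to look at the edges themselves via retbins=True. As far as I can tell, pd.cut first computes equal-width edges over [min, max] and then lowers the first edge by 0.1% of the data range so that the minimum falls inside the right-closed first interval. The sketch below just inspects the numbers.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 7, 5, 4, 6, 3])

# retbins=True also returns the numeric edges that were used.
_, edges = pd.cut(arr, 3, retbins=True)
widths = np.diff(edges)

print(edges)   # first edge sits slightly below arr.min()
print(widths)  # first width is slightly larger than the others
```

Here the range is 7 - 1 = 6, so the first edge is lowered by 0.006 to 0.994, which matches the interval shown in the documentation example.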
When reading the documentation for pd.qcut, I simply couldn't understand it, particularly its examples. One of them is below:
>>> pd.qcut(range(5), 4)
... # doctest: +ELLIPSIS
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] ...
Why did it return 5 elements in the list (although the code specifies 4 buckets), and why are the first 2 elements the same, (-0.001, 1.0]?
Thanks.
Because 0 is in (-0.001, 1.0], and so is 1.
list(range(5)) # [0, 1, 2, 3, 4]
The corresponding categories of [0, 1, 2, 3, 4] are [(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]].
Look at the range
list(range(5))
Out[116]: [0, 1, 2, 3, 4]
It returns 5 numbers. When you apply qcut, 0 and 1 fall into the same bin:
pd.qcut(range(5), 4)
Out[115]:
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 4.0]]
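Put differently, qcut returns one label per input value, while the second argument only controls how many distinct bins exist. A short sketch that checks both counts:

```python
import pandas as pd

result = pd.qcut(range(5), 4)

# One label per input element.
print(len(result))
# Number of distinct bins requested.
print(len(result.categories))
# How many inputs landed in each bin.
print(result.value_counts())
```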
I have a dataframe column
df['probability'] = [0.5, 0.6, 0.7, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
I need to group the data so that 0.5, 0.6, 0.7 and 1 form one category with output 1 in a new column, 1.5 and 2 form another category with output 2, and so on. Could anyone help me with this?
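One possible approach, assuming the categories are meant to be the right-closed ranges (0, 1], (1, 2], ..., (4, 5], is pd.cut with explicit edges and integer labels. The column name `category` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"probability": [0.5, 0.6, 0.7, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]}
)

# Right-closed bins (0, 1], (1, 2], ..., (4, 5], labelled 1 to 5.
edges = [0, 1, 2, 3, 4, 5]
labels = [1, 2, 3, 4, 5]
df["category"] = pd.cut(df["probability"], bins=edges, labels=labels).astype(int)

print(df)
```

Values outside (0, 5] would become NaN and make the astype(int) call fail, so the edges need to cover the whole data range.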