making numerical categories of pandas data - python

I have been looking for a way to add an extra column that is a category based on another column. I already tried the pandas categorical documentation, and Stack Overflow does not seem to cover this case, though I think it must; maybe I am using the wrong search tags?
For example:
Size Size_cat
10 0-50
50 0-50
150 50-500
450 50-500
5000 1000-9000
10000 >9000
Notice that the size category 500-1000 is missing (no number falls in that range).
The problem is that I later create a pandas crosstab like this:
summary_table = pd.crosstab(index=[res_sum["Type"], res_sum["Size"]], columns=[res_sum["Found"]], margins=True)
summary_table = summary_table.div(summary_table["All"] / 100, axis=0)
After some editing of this table I get this kind of result:
Found Exact Near No
Type Size
DEL 50 80 20 0
100 60 40 0
500 80 20 0
1000 60 40 0
5000 40 60 0
10000 20 80 0
DEL_Total 56.666667 43.333333 0
DUP 50 0 0 100
100 0 0 100
500 0 100 0
1000 0 100 0
5000 0 100 0
10000 20 80 0
DUP_Total 3.333333 63.333333 33.333333
The problem is that (Size) just lists the raw sizes here, so the table can vary in size. If 5000-DEL is missing from the data, that row also disappears, and then DUP has 6 size categories while DEL has 5. Additionally, if I add more sizes, the table will become very large. So I want to make categories of the sizes, always retaining the same categories even if some of them are empty.
I hope I am clear, because it is kinda hard to explain.
This is what I tried already:
highest_size = res['Size'].max()
categories = int(math.ceil(highest_size / 100.0) * 100.0)
categories = int(categories / 10)
labels = ["{0} - {1}".format(i, i + categories) for i in range(0, highest_size, categories)]
print(highest_size)
print(categories)
print(labels)
10000
1000
['0 - 1000', '1000 - 2000', '2000 - 3000', '3000 - 4000', '4000 - 5000', '5000 - 6000', '6000 - 7000', '7000 - 8000', '8000 - 9000', '9000 - 10000']
I get numeric categories, but of course they now depend on the highest value, and the categories change based on the data. Additionally, I still need to link them to the 'Size' column in pandas. This does not work:
df['group'] = pd.cut(df.value, range(0, highest_size), right=False, labels=labels)
If possible I would like to make my own categories instead of using range to get equal steps, like in the first example above (otherwise it takes far too many bins to reach 10000 in steps of 100, and taking steps of 1000 loses a lot of detail in the smaller regions).

See a mock-up below to help you get the logic. Basically, you bin the Score into custom groups using cut (or even a lambda or map), passing each value to the function GroupMapping. Let me know if it works.
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harry', 'Sally', 'Mary', 'John', 'Francis', 'Devon', 'James', 'Holly', 'Molly', 'Nancy', 'Ben'],
    'Score': [1143, 2040, 2500, 3300, 3143, 2330, 2670, 2140, 2890, 3493, 1723]
})

def GroupMapping(dl):
    if int(dl) <= 1000:
        return '0-1000'
    elif 1000 < dl <= 2000:
        return '1000 - 2000'
    elif 2000 < dl <= 3000:
        return '2000 - 3000'
    elif 3000 < dl <= 4000:
        return '3000 - 4000'
    else:
        return 'None'

# df["Group"] = df['Score'].map(GroupMapping)
# df["Group"] = df['Score'].apply(lambda row: GroupMapping(row))
df['Group'] = pd.cut(df['Score'], [0, 1000, 2000, 3000, 4000],
                     labels=['0-1000', '1000 - 2000', '2000 - 3000', '3000 - 4000'])
df
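Applied back to the original question, the same idea works with hand-picked bin edges. A sketch (assuming the res_sum frame from the question; the edges and labels are taken from the Size_cat example at the top, with np.inf keeping the last bin open-ended):
import numpy as np
import pandas as pd

# hand-picked edges matching the Size_cat example above
bins = [0, 50, 500, 1000, 9000, np.inf]
labels = ['0-50', '50-500', '500-1000', '1000-9000', '>9000']
res_sum['Size_cat'] = pd.cut(res_sum['Size'], bins=bins, labels=labels)

# the crosstab can then be built on the fixed categories instead of the raw sizes
summary_table = pd.crosstab(index=[res_sum['Type'], res_sum['Size_cat']],
                            columns=res_sum['Found'], margins=True)
Because pd.cut fixes the label set up front, Size_cat always carries the same five categories even when some bins are empty (whether empty bins appear in the crosstab may still depend on its dropna setting and the pandas version).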

Related

pandas data frame applying multiple columns

product_code order eachprice
TN45 10 500
BY11 20 360
AJ21 5 800
and I need to create a new column based on order and each price: if order >= 10, then a 5% discount; if order >= 50, then a 10% discount on the price. How can I apply a function to achieve this:
product_code order each_price discounted_price
TN45 10 500 4500
BY11 20 360 6480
AJ21 5 800 4000
I tried to apply a function, e.g.
df['discount'] = df.apply(function, axis=1)
but it prompts the following:
"A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead"
Can anyone help? Thanks.
You could use nested numpy.where calls to achieve this. I've added an extra intermediate column to the results for the percentage discount, then used this column to calculate the final discounted price:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'product_code': ['TN45', 'BY11', 'AJ21'],
    'order': [10, 20, 5],
    'each_price': [500, 360, 800]
})
df['discount'] = np.where(
    df['order'] >= 50,
    0.1,
    np.where(
        df['order'] >= 10,
        0.05,
        0
    )
)
df['discounted_price'] = df['order'] * df['each_price'] * (1 - df['discount'])
Note that my results are slightly different from those in your expected output, but I believe they are correct based on the description of the discount conditions you gave:
product_code order each_price discount discounted_price
0 TN45 10 500 0.05 4750.0
1 BY11 20 360 0.05 6840.0
2 AJ21 5 800 0.00 4000.0
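As a side note, np.select can express the same tiered logic without nesting; a sketch equivalent to the np.where version above:
import numpy as np

# conditions are checked in order, so the stricter threshold comes first
conditions = [df['order'] >= 50, df['order'] >= 10]
choices = [0.1, 0.05]
df['discount'] = np.select(conditions, choices, default=0)
df['discounted_price'] = df['order'] * df['each_price'] * (1 - df['discount'])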
As you mention, you were trying with the apply function. I did the same and it works; I am not sure which part of the function was wrong in your case.
import pandas as pd
df = pd.DataFrame({
    'product_code': ['TN45', 'BY11', 'AJ21'],
    'order': [10, 20, 5],
    'each_price': [500, 360, 800]
})

# This is the apply function
def make_discount(row):
    total = row["order"] * row['each_price']
    if row["order"] >= 50:      # check the larger threshold first, otherwise it is never reached
        total = total - (total * 0.1)
    elif row["order"] >= 10:
        total = total - (total * 0.05)
    return total
df["discount_price"] = df.apply(make_discount, axis=1)
df
Output:
product_code order each_price discount_price
0 TN45 10 500 4750.0
1 BY11 20 360 6840.0
2 AJ21 5 800 4000.0
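Regarding the SettingWithCopyWarning quoted in the question: it typically means the df being modified was itself produced by slicing another DataFrame. A hedged sketch of the usual fix (bigger_df is a hypothetical name for that source frame):
# taking an explicit copy before adding columns avoids writing into a view of bigger_df
df = bigger_df[bigger_df['order'] > 0].copy()   # hypothetical slice; .copy() is the point
df['discount_price'] = df.apply(make_discount, axis=1)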

Create new Pandas Dataframe from observations which meet specific criteria

I have two original dataframes.
One contains limits: df_limits
feat_1 feat_2 feat_3
target 12 9 90
UL 15 10 120
LL 9 8 60
where target is the ideal value, UL the upper limit, and LL the lower limit.
And another one with the original data: df_to_check
ID feat_1 feat_2 feat_3
123 12.5 9.6 100
456 18 3 100
789 9 11 100
I'm writing a function whose desired output is the ID and the features which are below or above the threshold (the limits from the first DataFrame). So far I can recognise which features are out of limits, but I'm getting the full output of the original DataFrame...
def table(df_limits, df_to_check, column):
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index
    if UL_index is not None:
        above_limit = {'ID': df_to_check['ID'],
                       'column': df_to_check[column],
                       'target': df_limits[column].loc['target']}
        return pd.DataFrame(above_limit)
What should I change so that my desired output is as below, showing only the ID and the column where observations are out of limits? Ideally it would also show by what percentage the observed value deviates from the ideal target value (I would be glad for advice on how to add such a column):
ID column target value deviate(%)
456 feat_1 12 18 50
456 feat_2 9 3 ...
789 feat_2 9 11 ...
After running this function it returns the whole dataset, because the statement says "if not None" and the index is never None. I understand why I have this issue, but I don't know how to change it. The issue is with the statement if UL_index is not None:, and I'm looking for a way to replace this part.
Approach:
- reshape
- merge
- calculate
new_df = (df_to_check.set_index("ID").unstack().reset_index()
          .rename(columns={"level_0": "column", 0: "value"})
          .merge(df_limits.T, left_on="column", right_index=True)
          .assign(deviate=lambda dfa: (dfa.value - dfa.target) / dfa.target)
          )
column  ID   value  target  UL   LL  deviate
feat_1  123  12.5   12      15   9   0.0416667
feat_1  456  18     12      15   9   0.5
feat_1  789  9      12      15   9   -0.25
feat_2  123  9.6    9       10   8   0.0666667
feat_2  456  3      9       10   8   -0.666667
feat_2  789  11     9       10   8   0.222222
feat_3  123  100    90      120  60  0.111111
feat_3  456  100    90      120  60  0.111111
feat_3  789  100    90      120  60  0.111111
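If only the out-of-limit rows are wanted, as in the desired output, the merged frame can be filtered afterwards; a sketch using the columns produced above (here the raw UL/LL columns are used as the bounds):
out_of_limits = new_df[(new_df['value'] > new_df['UL']) | (new_df['value'] < new_df['LL'])]
print(out_of_limits[['ID', 'column', 'target', 'value', 'deviate']])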
First of all, you have not provided a reproducible example (https://stackoverflow.com/help/minimal-reproducible-example), because you have not shared the code which produces the two initial dataframes. Please keep that in mind next time you ask a question. Without those, I made a toy example with my own (random) data.
I start by unpivoting what you call dataframe_to_check: that's because, if you want to check each feature independently, then that dataframe is not normalised (you might want to look up what database normalisation means).
The next step is a left outer join between the unpivoted dataframe you want to check and the (transposed) dataframe with the limits.
Once you have that, you can easily calculate whether a row is within the range, the deviation between value and target, etc, and you can of course group this however you want.
My code is below. It should be easy enough to customise it to your case.
import pandas as pd
import numpy as np

df_limits = pd.DataFrame(index=['min val', 'max val', 'target'])
df_limits['a'] = [2, 4, 3]
df_limits['b'] = [3, 5, 4.5]

df = pd.DataFrame(columns=df_limits.columns, data=np.random.rand(100, 2) * 6)

df_unpiv = pd.melt(df.reset_index().rename(columns={'index': 'id'}),
                   id_vars='id', var_name='feature', value_name='value')

# I reset the index because I couldn't get a join on a column and an index,
# but there is probably a better way to do it
df_joined = pd.merge(df_unpiv,
                     df_limits.transpose().reset_index().rename(columns={'index': 'feature'}),
                     how='left', on='feature')

df_joined['abs diff from target'] = abs(df_joined['value'] - df_joined['target'])
df_joined['outside range'] = ((df_joined['value'] < df_joined['min val'])
                              | (df_joined['value'] > df_joined['max val']))

df_outside_range = df_joined.query("`outside range` == True")
df_inside_range = df_joined.query("`outside range` == False")
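From there the grouping is straightforward; for instance, a small sketch using the columns created above to summarise per id:
# number of out-of-range features per id, and which features they are
per_id = (df_outside_range.groupby('id')
          .agg(n_outside=('feature', 'size'),
               features=('feature', list)))
print(per_id)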
I solved my issue, maybe in a bit of a clumsy way, but it works as desired. If someone has a better answer I will still appreciate it.
This example gets only the observations above the limits; to have both, just concatenate the observations from UL_index and LL_index (see the sketch after the function).
def table(df_limits, df_to_check, column):
    above_limit = []
    df_above_limit = pd.DataFrame()
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index
    df_to_check_UL = df_to_check.loc[UL_index]
    df_to_check_LL = df_to_check.loc[LL_index]
    above_limit = {
        'ID': df_to_check_UL['ID'],
        'feature value': df_to_check[column],
        'target': df_limits[column].loc['target']
    }
    df_above_limit = pd.DataFrame(above_limit, index=df_to_check_UL.index)
    return df_above_limit
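For reference, one way to get both directions in a single frame is to replace the two index selections with one combined mask; a sketch keeping the same target + UL / target + LL convention as the function above:
import pandas as pd

def table_both(df_limits, df_to_check, column):
    target = df_limits[column].loc['target']
    UL = target + df_limits[column].loc['UL']
    LL = target + df_limits[column].loc['LL']
    # one mask covering both the above-UL and below-LL cases
    outside = df_to_check.loc[(df_to_check[column] > UL) | (df_to_check[column] < LL)]
    return pd.DataFrame({'ID': outside['ID'],
                         'feature value': outside[column],
                         'target': target}, index=outside.index)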

Optimizing interpolation of values in Pandas

I have been struggling with an optimization problem in Pandas.
I developed a script that applies a computation to every row of a relatively small DataFrame (a few thousand rows, a few dozen columns).
I relied heavily on the apply() function, which was obviously a poor choice in most cases.
After a round of optimization, only one method still takes a long time, and I haven't found an easy solution for it:
Basically my dataframe contains a list of video viewing statistics with the number of people who watched the video for every quartile (how many have watched 0%, 25%, 50%, etc) such as :
video_name  video_length  video_0  video_25  video_50  video_75  video_100
video_1     6             1000     500       300       250       5
video_2     30            1000     500       300       250       5
I am trying to interpolate the statistics to be able to answer "how many people would have watched each quartile of the video if it lasted X seconds"
Right now my function takes the dataframe and a "new_length" parameter, and calls apply() on each line.
The function which handles each line computes the time marks for each quartile (so 0, 7.5, 15, 22.5 and 30 for the 30s video), and time marks for each quartile given the new length (so to reduce the 30s video to 6s, the new time marks would be 0, 1.5, 3, 4.5 and 6).
I build a dataframe containing the time marks as index, and the stats as values in the first column:
index (time marks)  view_stats
0                   1000
7.5                 500
15                  300
22.5                250
30                  5
1.5                 NaN
3                   NaN
4.5                 NaN
I then call DataFrame.interpolate(method="index") to fill the NaN values.
It works and gives me the result I expect, but it is taking a whopping 11s for a 3k lines dataframe and I believe it has to do with the use of the apply() method combined with the creation of a new dataframe to interpolate the data for each line.
Is there an obvious way to achieve the same result "in place", e.g. by avoiding the apply / new-dataframe approach and working directly on the original dataframe?
EDIT: The expected output when calling the function with 6 as the new length parameter would be:
video_name  video_length  video_0  video_25  video_50  video_75  video_100  new_video_0  new_video_25  new_video_50  new_video_75  new_video_100
video_1     6             1000     500       300       250       5          1000         500           300           250           5
video_2     6             1000     500       300       250       5          1000         900           800           700           600
The first line would be untouched because the video is already 6s long.
In the second line, the video would be cut from 30s to 6s so the new quartiles would be at 0, 1.5, 3, 4.5, 6s and the stats would be interpolated between 1000 and 500, which were the values at the old 0% and 25% time marks
EDIT2: I do not care if I need to add temporary columns, time is an issue, memory is not.
As a reference, this is my code:
import math
import pandas
from numpy import nan as NaN

def get_value(marks, asset, mark_index) -> int:
    value = marks["count"][asset["new_length_marks"][mark_index]]
    if isinstance(value, pandas.Series):
        res = value.iloc[0]
    else:
        res = value
    return math.ceil(res)

def length_update_row(row, assets, **kwargs):
    asset_name = row["asset_name"]
    asset = assets[asset_name]
    # assets is a dict containing the list of files and the old and "new" video marks,
    # pre-calculated
    marks = pandas.DataFrame(data=[int(row["video_start"]), int(row["video_25"]), int(row["video_50"]),
                                   int(row["video_75"]), int(row["video_completed"])],
                             columns=["count"],
                             index=asset["old_length_marks"])
    marks = marks.combine_first(pandas.DataFrame(data=NaN, columns=["count"],
                                                 index=asset["new_length_marks"][1:]))
    marks = marks.interpolate(method="index")
    row["video_25"] = get_value(marks, asset, 1)
    row["video_50"] = get_value(marks, asset, 2)
    row["video_75"] = get_value(marks, asset, 3)
    row["video_completed"] = get_value(marks, asset, 4)
    return row

def length_update_stats(report: pandas.DataFrame,
                        assets: dict) -> pandas.DataFrame:
    new_report = report.apply(lambda row: length_update_row(row, assets), axis=1)
    return new_report
IIUC, you could use np.interp:
# get the old x values
xs = df['video_length'].values[:, None] * [0, 0.25, 0.50, 0.75, 1]
# the corresponding y values
ys = df.iloc[:, 2:].values
# note that 6 is the new value
nxs = np.repeat(np.array(6), 2)[:, None] * [0, 0.25, 0.50, 0.75, 1]
res = pd.DataFrame(data=np.array([np.interp(nxi, xi, yi) for nxi, xi, yi in zip(nxs, xs, ys)]), columns="new_" + df.columns[2:] )
print(res)
Output
new_video_0 new_video_25 new_video_50 new_video_75 new_video_100
0 1000.0 500.0 300.0 250.0 5.0
1 1000.0 900.0 800.0 700.0 600.0
And then concat across the second axis:
output = pd.concat((df, res), axis=1)
print(output)
Output (concat)
video_name video_length video_0 ... new_video_50 new_video_75 new_video_100
0 video_1 6 1000 ... 300.0 250.0 5.0
1 video_2 30 1000 ... 800.0 700.0 600.0
[2 rows x 12 columns]
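For reuse, the same idea can be wrapped in a small function; a sketch where new_length and the column layout are assumptions taken from the question:
import numpy as np
import pandas as pd

def rescale_views(df, new_length):
    # interpolate the quartile view counts as if every video lasted new_length seconds
    quartiles = np.array([0, 0.25, 0.50, 0.75, 1.0])
    xs = df['video_length'].values[:, None] * quartiles       # old time marks
    ys = df.iloc[:, 2:].values                                # old view counts
    nxs = np.full(len(df), new_length)[:, None] * quartiles   # new time marks
    interp = np.array([np.interp(nxi, xi, yi) for nxi, xi, yi in zip(nxs, xs, ys)])
    res = pd.DataFrame(interp, columns="new_" + df.columns[2:], index=df.index)
    return pd.concat([df, res], axis=1)

output = rescale_views(df, 6)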

Python - Count row between interval in dataframe

I have a dataset with date, engine, energy and max power columns. Say the dataset covers 2 machines over one month. Each machine has a maximum power (say 100 for simplicity) and 3 operating states: nominal power between Pmax and 80% of Pmax, a drop in load between 80% and 20% of Pmax, and a stop below 20% of Pmax (below 20% we consider that the machine is stopped).
The idea is to know, per period and machine, the number of times the machine has operated in the 2nd interval (between 80% and 20% of Pmax). If a machine drops into a stop it should not be counted, and if it returns from a stop it should not be counted either.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy.ma.extras import _ezclump as ez
data = {'date': ['01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020',
'01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020'],
'engine': ['a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a',
'b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b',],
'energy': [100,100,100,100,100,80,80,60,60,60,60,60,90,100,100,50,50,40,20,0,0,0,20,50,60,100,100,50,50,50,50,
50,50,100,100,100,80,80,60,60,60,60,60,0,0,0,50,50,100,90,50,50,50,50,50,60,100,100,50,50,100,100],
'pmax': [100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,
100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100]
}
df = pd.DataFrame(data, columns = ['date', 'engine', 'energy', 'pmax'])
df['date'] = df['date'].astype('datetime64[ns]')
df = df.set_index('date')
df['inter'] = df['energy'].apply(lambda x: 2 if x >= 80 else (1 if x < 80 and x >= 20 else 0 ))
liste = []
engine_off = ez((df['inter'] == 1).to_numpy())
for i in engine_off:
    if df.iloc[(i.start) - 1, 3] == 0:
        engine_off.remove(i)
    elif df.iloc[(i.stop), 3] == 0:
        engine_off.remove(i)
    else:
        liste.append([df['engine'][i.start], df.index[i.start], df.index[i.stop], i.stop - i.start])
dfend = pd.DataFrame(liste, columns=['engine','begin','end','nb_heure'])
dfend['month'] = dfend['begin'].dt.month_name()
dfgroupe = dfend.set_index('begin').groupby(['engine','month']).agg(['mean','max','min','std','count','sum']).fillna(1)
First I load my data into a DataFrame and classify the energy of each row into an interval (2 for nominal operation, 1 for intermediate and 0 for stop).
Then I check each row of the inter == 1 column, which gives me a list of slices with the start and end of each run.
Then I loop over the slices to check that the element before and after each one is different from 0, to exclude drops into a stop or returns from a stop.
Then I create a dataframe from the list and compute the mean, sum, etc.
The problem is that my list has only 4 drops while there are 5. This comes from the 4th slice (27, 33).
Can someone help me?
Thank you
Here is one way to do it. I tried to use your way with groups but ended up doing it slightly differently.
# another way to create inter, probably faster on big dataframe
df['inter'] = pd.cut(df['energy']/df['pmax'], [-1,0.2, 0.8, 1.01],
labels=[0,1,2], right=False)
# mask if inter is equal to 1 and groupby engine
gr = df['inter'].mask(df['inter'].eq(1)).groupby(df['engine'])
# create a mask to get True for the rows you want
m = (df['inter'].eq(1) # the row are 1s
& ~gr.ffill().eq(0) # the row before 1s is not 0
& ~gr.bfill().eq(0) # the row after 1s is not 0
)
#create dfend with similar shape to yours
dfend = (df.assign(date=df.index) #create a column date for the agg
.where(m) # replace the rows not interesting by nan
.groupby(['engine', #groupby per engine
m.ne(m.shift()).cumsum()]) # and per group of following 1s
.agg(begin=('date','first'), #agg date with both start date
end = ('date','last')) # and end date
)
# create the colum nb_hours (although here it seems to be nb_days)
dfend['nb_hours'] = (dfend['end'] - dfend['begin']).dt.days+1
print (dfend)
begin end nb_hours
engine inter
a 2 2020-01-08 2020-01-12 5
4 2020-01-28 2020-01-31 4
b 4 2020-01-01 2020-01-02 2
6 2020-01-20 2020-01-25 6
8 2020-01-28 2020-01-29 2
and you get the three segments for engine b as required; then you can
#create dfgroupe
dfgroupe = (dfend.groupby(['engine', #groupby engine
dfend['begin'].dt.month_name()]) #and month name
.agg(['mean','max','min','std','count','sum']) #agg
.fillna(1)
)
print (dfgroupe)
nb_hours
mean max min std count sum
engine begin
a January 4.500000 5 4 0.707107 2 9
b January 3.333333 6 2 2.309401 3 10
I am assuming the following terminology:
- 80 <= energy <= 100 ---> df['inter'] == 2, normal mode.
- 20 <= energy < 80 ---> df['inter'] == 1, intermediate mode.
- 20 > energy ---> df['inter'] == 0, stop mode.
I reckon you want to find those periods of time in which:
1) the machine is operating in intermediate mode, and
2) the status is not changing from intermediate to stop mode or from stop to intermediate mode (those transitions should not be counted).
# df['before']: this is to compare each row of df['inter'] with the previous row
# df['after']: this is to compare each row of df['inter'] with the next row
# df['target'] == 1 is when both above mentioned conditions (conditions 1 and 2) are met.
# In the next we mask the original df and keep those times that conditions 1 and 2 are met, then we group by machine and month, and after that obtain the min, max, mean, and so on.
df['before'] = df['inter'].shift(periods=1, fill_value=0)
df['after'] = df['inter'].shift(periods=-1, fill_value=0)
df['target'] = np.where((df['inter'] == 1) & (np.sum(df[['inter', 'before', 'after']], axis=1) > 2), 1, 0)
df['month'] = df.index.month  # 'date' was set as the index in the question's code
mask = df['target'] == 1
df_group = df[mask].groupby(['engine', 'month']).agg(['mean', 'max', 'min', 'std', 'count', 'sum'])
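If the goal is to count contiguous drops rather than individual rows, a run id can be added before the groupby; a hedged sketch on top of the columns created above:
# a run id that increments whenever target or engine changes, so each contiguous
# block of target == 1 gets its own id
df['run_id'] = (df['target'].ne(df['target'].shift())
                | df['engine'].ne(df['engine'].shift())).cumsum()
drops_per_month = df[df['target'] == 1].groupby(['engine', 'month'])['run_id'].nunique()
print(drops_per_month)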

Randomly selecting from Pandas groups with equal probability -- unexpected behavior

I have 12 unique groups that I am trying to randomly sample from, each with a different number of observations. I want to randomly sample from the entire population (dataframe) with each group having the same probability of being selected from. The simplest example of this would be a dataframe with 2 groups.
groups probability
0 a 0.25
1 a 0.25
2 b 0.5
Using np.random.choice(df['groups'], p=df['probability'], size=100), each draw now has a 50% chance of selecting group a and a 50% chance of selecting group b.
To come up with the probabilities I used the formula:
(1. / num_groups) / size_of_groups
or in Python:
num_groups = len(df['groups'].unique()) # 2
size_of_groups = df.groupby('groups').size() # {a: 2, b: 1}
(1. / num_groups) / size_of_groups
Which returns
groups
a 0.25
b 0.50
This works great until I get past 10 unique groups, after which I start getting weird distributions. Here is a small example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(1234)
group_size = 12
groups = np.arange(group_size)
probs = np.random.uniform(size=group_size)
probs = probs / probs.sum()
g = np.random.choice(groups, size=10000, p=probs)
df = pd.DataFrame({'groups': g})
prob_map = ((1. / len(df['groups'].unique())) / df.groupby('groups').size()).to_dict()
df['probability'] = df['groups'].map(prob_map)
plt.hist(np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True))
plt.xticks(np.arange(group_size))
plt.show()
I would expect a fairly uniform distribution with a large enough sample size, but I am getting these wings when the number of groups is 11+. If I change the group_size variable to 10 or lower, I do get the desired uniform distribution.
I can't tell if the problem is with my formula for calculating the probabilities, or possibly a floating point precision problem? Anyone know a better way to accomplish this, or a fix for this example?
Thanks in advance!
You are using hist, which defaults to 10 bins...
plt.rcParams['hist.bins']
10
Pass group_size as the bins parameter:
plt.hist(
np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True),
bins=group_size)
There is no problem with your calculations. Your resulting array is:
arr = np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True)
If you check the value counts:
pd.Series(arr).value_counts().sort_index()
Out:
0 855
1 800
2 856
3 825
4 847
5 835
6 790
7 847
8 834
9 850
10 806
11 855
dtype: int64
It is pretty close to a uniform distribution. The problem is with the default number of bins (10) of the histogram. Instead, try this:
bins = np.linspace(-0.5, 10.5, num=12)
pd.Series(arr).plot.hist(bins=bins)
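As an aside, pandas can also do the weighted draw directly with DataFrame.sample, using the same per-row probabilities as weights (they are normalised internally); a sketch:
sampled = df.sample(n=10000, replace=True, weights=df['probability'])
print(sampled['groups'].value_counts().sort_index())   # roughly uniform across the 12 groups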
