pandas qcut not putting equal number of observations into each bin - python

I have a data frame, from which I can select a column (series) as follows:
df:
value_rank
275488 90
275490 35
275491 60
275492 23
275493 23
275494 34
275495 75
275496 40
275497 69
275498 14
275499 83
... ...
value_rank is a previously created percentile rank from a larger data set. What I am trying to do is create bins from this data set, e.g. quintiles:
pd.qcut(df.value_rank, 5, labels=False)
275488 4
275490 1
275491 3
275492 1
275493 1
275494 1
275495 3
275496 2
... ...
This appears to work as expected, but it doesn't.
In fact, I have 1569 rows. The nearest number divisible by 5 is 1565, which would give 1565 / 5 = 313 observations in each bin. With the 4 extra records, I would expect 4 bins with 314 observations and one with 313. Instead, I get this:
obs = pd.qcut(df.value_rank, 5, labels=False)
obs.value_counts()
0 329
3 314
1 313
4 311
2 302
I have no NaNs in df and cannot think of any reason why this is happening. I'm literally beginning to tear my hair out!
Here is a small example:
df:
value_rank
286742 11
286835 53
286865 40
286930 31
286936 45
286955 27
287031 30
287111 36
287269 30
287310 18
pd.qcut gives this:
pd.qcut(df.value_rank, 5, labels = False).value_counts()
bin count
1 3
4 2
3 2
0 2
2 1
There should be 2 observations in each bin, not 3 in bin 1 and 1 in bin 2!

qcut is trying to compensate for repeating values. This is easier to visualize if you return the bin limits along with your qcut results:
In [42]: test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
In [43]: test_series = pd.Series(test_list, name='value_rank')
In [49]: pd.qcut(test_series, 5, retbins=True, labels=False)
Out[49]:
(array([0, 0, 1, 1, 1, 2, 3, 3, 4, 4]),
array([ 11. , 25.2, 30. , 33. , 41. , 53. ]))
You can see that there was no choice but to set a bin limit at 30, so qcut had to "steal" one of the expected values from the third bin and place it in the second. I'm thinking that this is just happening at a larger scale with your percentiles, since you're basically condensing their ranks into a 1 to 100 scale. Any reason not to just run qcut directly on the data instead of the percentiles, or to return percentiles that have greater precision?

Just try the code below:
pd.qcut(df.rank(method='first'), nbins)
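Here nbins is a placeholder for the number of bins you want. A minimal sketch on the small example from the question (ranking with method='first' breaks ties, so every value gets a unique rank and qcut can then cut them into equal-sized bins):
import pandas as pd

test_series = pd.Series([11, 18, 27, 30, 30, 31, 36, 40, 45, 53], name='value_rank')

# rank first (ties broken by order of appearance), then cut the ranks
bins = pd.qcut(test_series.rank(method='first'), 5, labels=False)
print(bins.value_counts().sort_index())
# each of the 5 bins now holds exactly 2 observations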

If you must get equal (or nearly equal) bins, then here's a trick you can use with qcut. Using the same data as the accepted answer, we can force these into equal bins by adding some random noise to the original test_list and binning according to those values.
test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
np.random.seed(42) #set this for reproducible results
test_list_rnd = np.array(test_list) + np.random.random(len(test_list)) #add noise to data
test_series = pd.Series(test_list_rnd, name='value_rank')
pd.qcut(test_series, 5, retbins=True, labels=False)
Output:
(0 0
1 0
2 1
3 2
4 1
5 2
6 3
7 3
8 4
9 4
Name: value_rank, dtype: int64,
array([ 11.37454012, 25.97573801, 30.42160255, 33.11683016,
41.81316392, 53.70807258]))
So, now we have two 0's, two 1's, two 2's, two 3's and two 4's!
Disclaimer
Obviously, use this at your discretion, because results can vary based on your data; for instance, on how large your data set is and/or how the values are spaced. The above "trick" works well for integers because, even though we are "salting" the test_list, it will still rank-order in the sense that there won't be a value in group 0 greater than a value in group 1 (maybe equal, but not greater). If, however, you have floats, this may be tricky and you may have to reduce the size of the noise accordingly. For instance, if you had floats like 2.1, 5.3, 5.3, 5.4, etc., you should reduce the noise by dividing by 10: np.random.random(len(test_list)) / 10. If you have arbitrarily long floats, however, you probably would not have this problem in the first place, given the noise already present in "real" data.
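As a rough illustration of scaling the noise for float data (the values below are made up; the divisor just keeps the noise below the smallest gap between distinct values, here 0.1):
import numpy as np
import pandas as pd

np.random.seed(42)
float_list = [2.1, 5.3, 5.3, 5.4, 5.6, 6.0, 6.1, 7.2]  # hypothetical float data
float_rnd = np.array(float_list) + np.random.random(len(float_list)) / 10  # smaller noise
print(pd.qcut(pd.Series(float_rnd), 4, labels=False).value_counts().sort_index())
# all 8 salted values are distinct, so each of the 4 bins holds 2 observations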

This problem arises from duplicate values. A possible solution to force equal sized bins is to use the index as the input for pd.qcut after sorting the dataframe:
import random
import pandas as pd

# toy column of duplicate-heavy integers, sorted so the positional index is monotone in 'A'
df = pd.DataFrame({'A': [random.randint(3, 9) for x in range(20)]}).sort_values('A').reset_index(drop=True)
df = df.reset_index()  # exposes the clean positional index as a column named 'index'
df['A'].plot.hist(bins=30);
picture: https://i.stack.imgur.com/ztjzn.png
df.head()
df['qcut_v1'] = pd.qcut(df['A'], q=4)
df['qcut_v2'] = pd.qcut(df['index'], q=4)
df
picture: https://i.stack.imgur.com/RB4TN.png
df.groupby('qcut_v1').count().reset_index()
picture: https://i.stack.imgur.com/IKtsW.png
df.groupby('qcut_v2').count().reset_index()
picture: https://i.stack.imgur.com/4jrkU.png
sorry I cannot post images since I don't have at least 10 reputation on stackoverflow -.-

Related

For all possible combinations of rows and columns, aggregate values of pandas dataframe on the basis of their relative distance

I have a DataFrame of NxM values, and I need to iterate through it in order to identify all possible combinations of length 1, 2, 3... up to NxM-1 of rows and columns.
For each of these combinations, I want to aggregate values. Let's make an example: df_unreduced is my initial dataframe
df_unreduced= pd.DataFrame({"Strike":[50,75,100,125,150],
"31/12/2021":[1,2,3,4,5],
"31/12/2022":[6,7,8,9,10],
"31/12/2023":[11,12,13,14,15],
"31/12/2024":[16,17,18,19,20],
"31/12/2025":[21,22,23,24,25]})
Here I focus on just one possible combination of rows and columns. Let's say I want to aggregate on these 4 "nodes" (a node being the intersection of one row and one column) of my initial df:
row=75 , col= "31/12/2022"
row=75 , col= "31/12/2024"
row=150 , col= "31/12/2022"
row=150 , col= "31/12/2024"
the expected result is:
df_reduce=pd.DataFrame({"Strike":[75,150],
"31/12/2022":[45.8333,135.8333],
"31/12/2024":[41.6667, 101.6667]})
if the combination was:
row=75 , col= "31/12/2022"
row=75 , col= "31/12/2024"
row=125 , col= "31/12/2022"
row=125 , col= "31/12/2024"
the expected result would be:
df_reduced = pd.DataFrame({"Strike": [75, 125],
"31/12/2022": [36.25, 111.25],
"31/12/2024": [51.25, 126.25]})
And so on. The key point here is that, as you can see from the expected results, I must linearly weight values that fall between two nodes before summing them. To achieve the expected result I first summed by rows and then by columns, but doing it the other way around would leave the results unchanged.
Once the summing code is defined, I have to loop over all possible combinations.
Whatever the number/position of nodes I aggregate on, the sum of values within the reduced data frame must match the sum of values within the unreduced one.
EDIT: to elaborate a bit more on the aggregation and the expected result: in my df_unreduced the "Strike" array is [50, 75, 100, 125, 150]. As said, I'm keen to adopt a two-step approach, summing by rows and then by columns.
Let's take my first example: the first node I need to aggregate on (call it A) is (75, "31/12/2022") and the third (D) is (150, "31/12/2022"). In the row dimension, between A and D I have two nodes, B = (100, "31/12/2022") and C = (125, "31/12/2022"). The distance between 75 and 150 is 75: 100 falls 1/3 of the way from A and 2/3 from D, while 125 is 1/3 from D and 2/3 from A. For this reason, when aggregating on A, I can say that A_AGG_row = A + (1-1/3)*B + (1-2/3)*C. Conversely, D_AGG_row = D + (1-1/3)*C + (1-2/3)*B.
On the column dimension, if we consider F = (75, "31/12/2024"), we have that E = (75, "31/12/2023") is halfway between A and F, so A_AGG_col = A_AGG_row + (1-1/2)*E_AGG_row and F_AGG_col = F_AGG_row + (1-1/2)*E_AGG_row (where E_AGG_row is the result of the previous summing by rows on node E).
In more scientific terms, when summing along rows I need to split linearly the value of a generic node "i" falling between two selected nodes A and D, applying these formulas:
1 - (Strike_i - Strike_A)/(Strike_D - Strike_A) -> on node A
1 - (Strike_D - Strike_i)/(Strike_D - Strike_A) -> on node D
When summing along columns I need to split linearly the value of a generic node "j" falling between two selected nodes A and F:
1 - (Date_j - Date_A)/(Date_F - Date_A) -> on node A
1 - (Date_F - Date_j)/(Date_F - Date_A) -> on node F
These are the complete steps:
1. df_unreduced:
Strike   31/12/2021   31/12/2022   31/12/2023   31/12/2024   31/12/2025
50       1            6            11           16           21
75       2            7            12           17           22
100      3            8            13           18           23
125      4            9            14           19           24
150      5            10           15           20           25
2. summing by rows:
Strike   31/12/2021   31/12/2022   31/12/2023   31/12/2024   31/12/2025
50       -            -            -            -            -
75       6.333        21.333       36.333       51.333       66.333
100      -            -            -            -            -
125      -            -            -            -            -
150      8.667        18.667       28.667       38.667       48.667
3. summing by columns:
Strike   31/12/2021   31/12/2022   31/12/2023   31/12/2024   31/12/2025
50       -            -            -            -            -
75       -            45.833       -            135.833      -
100      -            -            -            -            -
125      -            -            -            -            -
150      -            41.667       -            101.667      -
I hope this clarifies the problem
thanks
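Below is a minimal sketch of the two-step linear weighting described above (the helper names reduce_axis and reduce_df are made up for illustration; it assumes the node lists are passed in ascending order, that the row coordinate lives in the "Strike" column, and it weights the date axis by real calendar distance, so the 31/12/2023 column splits roughly 365/731 between the two selected dates rather than exactly 1/2):
import numpy as np
import pandas as pd

def reduce_axis(values, coords, selected):
    # Collapse the rows of `values` (indexed by `coords`) onto the `selected`
    # coordinates: a point between two selected nodes is split linearly between
    # them, a point outside the selected range goes entirely to the nearest node.
    selected = np.asarray(selected, dtype="float64")
    out = np.zeros((len(selected),) + values.shape[1:])
    for val, c in zip(values, np.asarray(coords, dtype="float64")):
        if c <= selected[0]:
            out[0] += val
        elif c >= selected[-1]:
            out[-1] += val
        else:
            j = np.searchsorted(selected, c, side="right")
            a, d = selected[j - 1], selected[j]
            w = (c - a) / (d - a)          # share assigned to the upper node
            out[j - 1] += (1 - w) * val
            out[j] += w * val
    return out

def reduce_df(df, row_nodes, col_nodes):
    date_cols = [c for c in df.columns if c != "Strike"]
    dates = pd.to_datetime(date_cols, format="%d/%m/%Y").astype("int64")
    sel_dates = pd.to_datetime(col_nodes, format="%d/%m/%Y").astype("int64")
    vals = df[date_cols].to_numpy(dtype="float64")
    by_rows = reduce_axis(vals, df["Strike"], row_nodes)   # step 1: collapse rows
    by_cols = reduce_axis(by_rows.T, dates, sel_dates).T   # step 2: collapse columns
    out = pd.DataFrame(by_cols, columns=list(col_nodes))
    out.insert(0, "Strike", list(row_nodes))
    return out

df_unreduced = pd.DataFrame({"Strike": [50, 75, 100, 125, 150],
                             "31/12/2021": [1, 2, 3, 4, 5],
                             "31/12/2022": [6, 7, 8, 9, 10],
                             "31/12/2023": [11, 12, 13, 14, 15],
                             "31/12/2024": [16, 17, 18, 19, 20],
                             "31/12/2025": [21, 22, 23, 24, 25]})
print(reduce_df(df_unreduced, [75, 150], ["31/12/2022", "31/12/2024"]))
The grand total is preserved by construction, since every original value is split into weights that sum to one.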

How can we numerically filter a dataframe based on multiple conditions if there are strings in one of the columns?

I am trying to filter a data frame based on two conditions.
filter items where the field named 'Count' > 30
...and...
filter items where the field named 'LAND SQUARE FEET' < 5000
Here is the code that I tried, and I got errors, or I wouldn't be posting here.
df.loc[(df['Count']>=30) & (df['LAND SQUARE FEET']< 5000)['Count','LAND SQUARE FEET']]
df[df.eval("Count>=30 & (LAND SQUARE FEET <5000).values")]
How can I get this to work?
df[(df["Count"] >= 30) & (df["LAND SQUARE FEET"] < 5000)]
Specifically, the error (see the comments on the other answer) is now saying there is a - in one of the values in your column. You can use my solution below, or you can do df["LAND SQUARE FEET"] = df["LAND SQUARE FEET"].replace('-','').astype(int). However, there may be other strings that you need to replace, which means you may keep seeing errors if there are strings other than -, for example , or ft.. Also, the line with - might be bad data altogether, as I'm not sure why a - would appear in a value that is supposed to be an integer number of square feet.
Also, you can look at that line specifically with df[df["LAND SQUARE FEET"].str.contains('-')] and from there decide what you want to do with it -- either clean its value with replace or make it NaN with pd.to_numeric().
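A rough sketch of that replace-based cleanup, assuming the only non-numeric noise is characters like -, , or a trailing feet (unlike the coercion below, this keeps such rows as numbers instead of turning them into NaN):
import pandas as pd

df = pd.DataFrame({"LAND SQUARE FEET" : [4500, '4,400 feet', '4,600', 4700, 5500, 6000],
                   "Count" : [45, 55, 65, 75, 15, 25]})

# strip everything that is not a digit, then parse; anything left empty becomes NaN
cleaned = df["LAND SQUARE FEET"].astype(str).str.replace(r"[^0-9]", "", regex=True)
df["LAND SQUARE FEET"] = pd.to_numeric(cleaned, errors="coerce")
print(df[(df["Count"] >= 30) & (df["LAND SQUARE FEET"] < 5000)])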
Solution with pd.to_numeric():
You need to use pd.to_numeric() first, since you have strings in your column; passing errors='coerce' turns those string values into NaN. The data type of the column will then be float, which you can verify with df.info():
Step 1:
df = pd.DataFrame({"LAND SQUARE FEET" : [4500, '4,400 feet', '4,600', 4700, 5500, 6000],
"Count" : [45,55,65,75,15,25]})
df
Out[1]:
LAND SQUARE FEET Count
0 4500 45
1 4,400 feet 55
2 4,600 65
3 4700 75
4 5500 15
5 6000 25
Step 2:
df = pd.DataFrame({"LAND SQUARE FEET" : [4500, '4,400 feet', '4,600', 4700, 5500, 6000],
"Count" : [45,55,65,75,15,25]})
df["LAND SQUARE FEET"] = pd.to_numeric(df["LAND SQUARE FEET"], errors='coerce')
df
Out[2]:
LAND SQUARE FEET Count
0 4500.0 45
1 NaN 55
2 NaN 65
3 4700.0 75
4 5500.0 15
5 6000.0 25
Step 3 (and final output):
df = pd.DataFrame({"LAND SQUARE FEET" : [4500, '4,400 feet', '4,600', 4700, 5500, 6000],
"Count" : [45,55,65,75,15,25]})
df["LAND SQUARE FEET"] = pd.to_numeric(df["LAND SQUARE FEET"], errors='coerce')
new_df = df.loc[(df['Count']>=30) & (df['LAND SQUARE FEET']< 5000),['Count','LAND SQUARE FEET']]
new_df
Out[3]:
Count LAND SQUARE FEET
0 45 4500.0
3 75 4700.0

Counting number of spikes in a graph in python

With dataset df I plotted a graph looking like the following:
df
Time Temperature
8:23:04 18.5
8:23:04 19
9:12:57 19
9:12:57 20
9:12:58 20
9:12:58 21
9:12:59 21
9:12:59 23
9:13:00 23
9:13:00 25
9:13:01 25
9:13:01 27
9:13:02 27
9:13:02 28
9:13:03 28
Graph (overall):
When zooming in on the data, we can see more details:
I would like to count the number of activations of this temperature measurement device, which cause the temperature to increase drastically. I have defined an activation as below:
Let T0, T1, T2, T3 be the temperatures at times t0, t1, t2, t3, and let d0 = T1-T0, d1 = T2-T1, d2 = T3-T2, ... be the differences of adjacent values.
If
1) d0 ≥ 0 and d1 ≥ 0 and d2 ≥ 0, and
2) T2- T0 > max(d0, d1, d2), and
3) T2 - T0 < 30 seconds,
it is considered an activation. I want to count how many activations there are in total. What's a good way to do this?
Thanks.
There could be a number of different, valid answers depending on how a spike is defined.
Assuming you just want the indices where the temperature increases significantly, one simple method is to look for very large jumps in value, above some threshold. The threshold can be calculated from the mean difference of the data, which should give a rough approximation of where the significant variations in value occur. Here's a basic implementation:
import numpy as np
# Data
x = np.array([0, 1, 2, 50, 51, 52, 53, 100, 99, 98, 97, 96, 10, 9, 8, 80])
# Data diff
xdiff = x[1:] - x[0:-1]
# Find mean change
xdiff_mean = np.abs(xdiff).mean()
# Identify all indices greater than the mean
spikes = xdiff > abs(xdiff_mean)+1
print(x[1:][spikes]) # prints 50, 100, 80
print(np.where(spikes)[0]+1) # prints 3, 7, 15
You could also use outlier rejection, which would be much more clever than this basic comparison to the mean difference. There are lots of answers on how to do that:
Can scipy.stats identify and mask obvious outliers?
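If you would rather count the exact pattern defined in the question, here is a rough sketch under stated assumptions (count_activations is a made-up helper; condition 3 is read as the elapsed time t2 - t0 being under 30 seconds, and overlapping windows each count once):
import pandas as pd

def count_activations(df):
    t = pd.to_timedelta(df["Time"]).dt.total_seconds().to_numpy()
    temp = df["Temperature"].to_numpy(dtype=float)
    count = 0
    for i in range(len(temp) - 3):
        d = temp[i + 1:i + 4] - temp[i:i + 3]         # d0, d1, d2
        rises = (d >= 0).all()                        # condition 1
        big_jump = (temp[i + 2] - temp[i]) > d.max()  # condition 2
        fast = (t[i + 2] - t[i]) < 30                 # condition 3 (assumed: elapsed time)
        if rises and big_jump and fast:
            count += 1
    return count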

Randomly selecting from Pandas groups with equal probability -- unexpected behavior

I have 12 unique groups that I am trying to randomly sample from, each with a different number of observations. I want to randomly sample from the entire population (dataframe) with each group having the same probability of being selected from. The simplest example of this would be a dataframe with 2 groups.
groups probability
0 a 0.25
1 a 0.25
2 b 0.5
Using np.random.choice(df['groups'], p=df['probability'], size=100), each draw will now have a 50% chance of selecting group a and a 50% chance of selecting group b.
To come up with the probabilities I used the formula:
(1. / num_groups) / size_of_groups
or in Python:
num_groups = len(df['groups'].unique()) # 2
size_of_groups = df.groupby('groups').size() # {a: 2, b: 1}
(1. / num_groups) / size_of_groups
Which returns
groups
a 0.25
b 0.50
This works great until I get past 10 unique groups, after which I start getting weird distributions. Here is a small example:
np.random.seed(1234)
group_size = 12
groups = np.arange(group_size)
probs = np.random.uniform(size=group_size)
probs = probs / probs.sum()
g = np.random.choice(groups, size=10000, p=probs)
df = pd.DataFrame({'groups': g})
prob_map = ((1. / len(df['groups'].unique())) / df.groupby('groups').size()).to_dict()
df['probability'] = df['groups'].map(prob_map)
plt.hist(np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True))
plt.xticks(np.arange(group_size))
plt.show()
I would expect a fairly uniform distribution with a large enough sample size, but I am getting these wings when the number of groups is 11+. If I change the group_size variable to 10 or lower, I do get the desired uniform distribution.
I can't tell if the problem is with my formula for calculating the probabilities, or possibly a floating point precision problem? Anyone know a better way to accomplish this, or a fix for this example?
Thanks in advance!
you are using hist which defaults to 10 bins...
plt.rcParams['hist.bins']
10
pass group_size as the bins parameter.
plt.hist(
np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True),
bins=group_size)
There is no problem with your calculations. Your resulting array is:
arr = np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True)
If you check the value counts:
pd.Series(arr).value_counts().sort_index()
Out:
0 855
1 800
2 856
3 825
4 847
5 835
6 790
7 847
8 834
9 850
10 806
11 855
dtype: int64
It is pretty close to a uniform distribution. The problem is with the default number of bins (10) of the histogram. Instead, try this:
bins = np.linspace(-0.5, 11.5, num=13)
pd.Series(arr).plot.hist(bins=bins)

python pandas: How to drop items in a dataframe

I have a huge number of points in my dataframe, so I want to drop some of them (ideally keeping the mean values).
e.g. currently I have
date calltime
0 1491928756414930 4643
1 1491928756419607 166
2 1491928756419790 120
3 1491928756419927 142
4 1491928756420083 121
5 1491928756420217 109
6 1491928756420409 52
7 1491928756420476 105
8 1491928756420605 35
9 1491928756420654 120
10 1491928756420787 105
11 1491928756420907 93
12 1491928756421013 37
13 1491928756421062 112
14 1491928756421187 41
Is there any way to drop some of the items, based on sampling?
To give more details: my problem is the number of values at very close intervals, e.g. 1491928756421062 and 1491928756421187.
So I have a chart like this:
Instead, I want to somehow have a mean value for those close intervals, maybe grouped by second...
I would use sample(), but as you said, it selects randomly. If you want to sample according to some logic, for instance only keeping rows whose value satisfies mean * 0.9 < value < mean * 1.1, you can try the following code. It all depends on your sampling strategy.
As an example, something like this could be done.
test.csv:
1491928756414930,4643
1491928756419607,166
1491928756419790,120
1491928756419927,142
1491928756420083,121
1491928756420217,109
1491928756420409,52
1491928756420476,105
1491928756420605,35
1491928756420654,120
1491928756420787,105
1491928756420907,93
1491928756421013,37
1491928756421062,112
1491928756421187,41
sampling:
import pandas as pd

df = pd.read_csv("test.csv", sep=",", header=None)
mean = df[1].mean()
# keep only the rows whose calltime is within 10% of the mean
my_sample = df[(mean * .90 < df[1]) & (df[1] < mean * 1.10)]
You're looking for resample
df.set_index(pd.to_datetime(df.date, unit='us')).calltime.resample('s').mean()
This is a more complete example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

tidx = pd.date_range('2000-01-01', periods=10000, freq='10ms')
df = pd.DataFrame(dict(calltime=np.random.randint(200, size=len(tidx))), tidx)
fig, axes = plt.subplots(2, figsize=(25, 10))
df.plot(ax=axes[0])
df.resample('s').mean().plot(ax=axes[1])
fig.tight_layout()
