Groupby bins on multiple items - python

My dataframe looks something like this (but with about 100,000 rows of data):
ID,Total,TotalDate,DaysBtwRead,Type,YearlyAvg
1,1250,6/2/2017,17,AT267,229
2,1670,2/3/2012,320,PQ43,50
I'm trying to groupby yearly average totals using
df.groupby(pd.cut(df['YearlyAvg'], np.arange(0,1250,50))).count()
so that I can set up a unique Monte Carlo distribution, but I need these grouped by each individual Type as well. This currently only counts each range regardless of any other values.
Rather than having an overall aggregate count, I'm trying to set my code up so that the output looks more like the following (with YearlyAvg containing a count of each range)
Index,YearlyAvg
AT267(0, 50], 200
PQ43(0, 50], 123
AT267(50, 100], 49
PQ43(50, 100], 67
Is there an easier way to do this outside of creating a separate dataframe for each Type value?

You can use unstack with stack:
df['bins'] = pd.cut(df['YearlyAvg'], np.arange(0, 1250, 50))
df.groupby(['Type','bins']).size().unstack(fill_value=0).stack()  # unstack/stack fills missing bins with 0 and builds the MultiIndex you need
Out[1783]:
Type bins
AT267 (0, 50] 0
(200, 250] 1
PQ43 (0, 50] 1
(200, 250] 0
dtype: int64
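If you want the output in exactly the Index / YearlyAvg shape shown in the question, one option (a sketch building on the bins column created above) is to flatten the MultiIndex by concatenating the Type and bin labels:
counts = df.groupby(['Type', 'bins']).size().unstack(fill_value=0).stack()
out = counts.reset_index(name='YearlyAvg')          # columns: Type, bins, YearlyAvg
out.index = out['Type'].astype(str) + out['bins'].astype(str)  # e.g. "AT267(0, 50]"
out = out[['YearlyAvg']]
print(out)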

Related

Converting columns from float datatype to categorical datatype using binning

I wish to convert a data frame consisting of two columns.
Here is the sample df:
df:
   cost  numbers
1   360       23
2   120       35
3  2000       49
Both columns are float and I wish to convert them to categorical using binning.
I wish to create the following bins for each column when converting to categorical:
Bins for the numbers column: 18-24, 25-44, 45-65, 66-92
Bins for the cost column: >=1000, <1000
Finally, I want to convert each column in place rather than creating a new one.
Here is my attempted code at this:
def PreprocessDataframe(df):
    # use binning to convert age and budget to categorical columns
    df['numbers'] = pd.cut(df['numbers'], bins=[18, 24, 25, 44, 45, 65, 66, 92])
    df['cost'] = pd.cut(df['cost'], bins=['=>1000', '<1000'])
    return df
I understand how to convert the "numbers" column but I am having trouble with the "cost" one.
Help would be nice on how to solve this.
Thanks in advance!
Cheers!
If you use bins=[18, 24, 25, 44, 45, 65, 66, 92], this will generate bins for 18-24, 24-25, 25-44, 44-45, and so on, and you don't need the ones for 24-25, 44-45, etc.
By default, each bin runs from its left edge (not inclusive) to its right edge (inclusive).
So, for numbers, you could instead use bins=[17, 24, 44, 65, 92] (note the 17 in the first position, so that 18 is included).
The optional labels parameter lets you choose labels for the bins.
df['numbers'] = pd.cut(df['numbers'], bins=[17, 24, 44, 65, 92], labels=['18-24', '25-44', '45-65', '66-92'])
df['cost'] = pd.cut(df['cost'], bins=[0, 999.99, df['cost'].max()], labels=['<1000', '=>1000'])
print(df)
     cost numbers
0   <1000   18-24
1   <1000   25-44
2  =>1000   45-65
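An alternative for the cost column, if you prefer not to hard-code 999.99 or the column maximum (a sketch, assuming numpy is available as np): use np.inf as an open upper edge and right=False so that exactly 1000 falls into the =>1000 bin.
df['cost'] = pd.cut(df['cost'], bins=[0, 1000, np.inf], labels=['<1000', '=>1000'], right=False)  # [0, 1000) and [1000, inf)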

Looping over Pandas' groupby output when grouping by multiple columns and missing data

Grouping by multiple columns with missing data:
data = [['Falcon', 'Captive', 390], ['Falcon', None, 350],
        ['Parrot', 'Captive', 30], ['Parrot', 'Wild', 20]]
df = pd.DataFrame(data, columns = ['Animal', 'Type', 'Max Speed'])
I understand how missing data are dealt with when grouping by individual columns (groupby columns with NaN (missing) values), but do not understand the behaviour when grouping by two columns.
It seems I cannot loop over all groups even though they all seem to be identified:
groupeddf = df.groupby(['Animal', 'Type'])
counter = 0
for group in groupeddf:
    counter = counter + 1
print(counter)
len(groupeddf.groups)
result in 3 and 4 respectively, which is inconsistent.
Pandas version 1.0.3
In the post concerning groupby columns with NaN (missing) values
there is a sentence: NA groups in GroupBy are automatically excluded.
Apparently, when grouping by multiple columns, the same occurs
if any level of the grouping key contains NaN.
To confirm it, run:
for key, grp in groupeddf:
    print(f'\nGroup: {key}\n{grp}')
and the result will be:
Group: ('Falcon', 'Captive')
Animal Type Max Speed
0 Falcon Captive 390
Group: ('Parrot', 'Captive')
Animal Type Max Speed
2 Parrot Captive 30
Group: ('Parrot', 'Wild')
Animal Type Max Speed
3 Parrot Wild 20
But if you execute groupeddf.groups (to print the content), you will get:
{('Falcon', 'Captive'): Int64Index([0], dtype='int64'),
('Falcon', nan): Int64Index([1], dtype='int64'),
('Parrot', 'Captive'): Int64Index([2], dtype='int64'),
('Parrot', 'Wild'): Int64Index([3], dtype='int64')}
So we have group ('Falcon', nan), containing row with index 1.
If you want to process all groups, without any tricks to change
NaN into something else, run something like:
for key in groupeddf.groups:
    print(f'\nGroup: {key}\n{df.loc[groupeddf.groups[key]]}')
This time the printout will also contain the previously missing group.
To loop over all groups in pandas 1.0 you'll need to convert the NoneType objects to strings.
df = df.astype(str) # or just df['Type'] = df['Type'].astype(str)
Then you'll get four iterations of your loop.
According to the docs:
NA and NaT group handling
If there are any NaN or NaT values in the
grouping key, these will be automatically excluded. In other words,
there will never be an “NA group” or “NaT group”. This was not the
case in older versions of pandas, but users were generally discarding
the NA group anyway (and supporting it was an implementation
headache).
Or you could upgrade to pandas 1.1 (currently in development), where this issue appears to be fixed by the option dropna=False.
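As a quick sketch of what that looks like once dropna=False is available (pandas >= 1.1):
groupeddf = df.groupby(['Animal', 'Type'], dropna=False)  # keep NaN keys as their own group
for key, grp in groupeddf:
    print(f'\nGroup: {key}\n{grp}')
# now also prints the ('Falcon', nan) group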

Add new column based on mean slice of another column

Suppose I have a DataFrame
my_df = pd.DataFrame([10, 20, 30, 40, 50], columns=['col_1'])
I would like to add a new column where the value of each row in the new column is the mean of the values in col_1 starting at that row. In this case the new column (let's call it 'col_2') would be [30, 35, 40, 45, 50].
The following is not good code but it at least describes generating the values.
for i in range(len(my_df)):
    my_df.loc[i]['col_2'] = my_df[i:]['col_1'].mean()
How can I do this in a clean, idiomatic way that doesn't raise a SettingWithCopyWarning?
You can reverse the column, take the incremental mean, and then reverse it back again.
my_df.loc[::-1, 'col_1'].expanding().mean()[::-1]
# 0 30.0
# 1 35.0
# 2 40.0
# 3 45.0
# 4 50.0
# Name: col_1, dtype: float64
A similar ndarray-level approach could be to use np.cumsum and divide by the increasing number of elements.
np.true_divide(np.cumsum(my_df.col_1.values[::-1]),
               np.arange(1, len(my_df)+1))[::-1]
# array([30., 35., 40., 45., 50.])
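To actually add the new column, a small sketch building on the first approach (assignment aligns on the index, so the values land in the right rows):
my_df['col_2'] = my_df.loc[::-1, 'col_1'].expanding().mean()[::-1]
print(my_df)
#    col_1  col_2
# 0     10   30.0
# 1     20   35.0
# 2     30   40.0
# 3     40   45.0
# 4     50   50.0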

Detecting outliers in a Pandas dataframe using a rolling standard deviation

I have a DataFrame for a fast Fourier transformed signal.
There is one column for the frequency in Hz and another column for the corresponding amplitude.
I have read a post from a couple of years ago explaining that you can use a simple boolean filter to exclude, or only include, outliers in the final data frame that are above or below a few standard deviations.
df = pd.DataFrame({'Data':np.random.normal(size=200)}) # example dataset of normally distributed data.
df[~(np.abs(df.Data-df.Data.mean())>(3*df.Data.std()))] # or if you prefer the other way around
The problem is that my signal drops by several orders of magnitude (up to 10,000 times smaller) as the frequency increases up to 50,000 Hz. Therefore, I am unable to use a filter that only exports values above 3 standard deviations, because I would only pick up the "peak" outliers from the first 50 Hz.
Is there a way I can export outliers in my dataframe that are above 3 rolling standard deviations of a rolling mean instead?
This is maybe best illustrated with a quick example. Basically you're comparing your existing data to a new column that is the rolling mean plus three standard deviations, also on a rolling basis.
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Data':np.random.normal(size=200)})
# Create a few outliers (3 of them, at index locations 10, 55, 80)
df.iloc[[10, 55, 80]] = 40.
r = df.rolling(window=20) # Create a rolling object (no computation yet)
mps = r.mean() + 3. * r.std() # Combine a mean and stdev on that object
print(df[df.Data > mps.Data]) # Boolean filter
# Data
# 55 40.0
# 80 40.0
To add a new column containing only the outliers, with NaN everywhere else:
df['Peaks'] = df['Data'].where(df.Data > mps.Data, np.nan)
print(df.iloc[50:60])
Data Peaks
50 -1.29409 NaN
51 -1.03879 NaN
52 1.74371 NaN
53 -0.79806 NaN
54 0.02968 NaN
55 40.00000 40.0
56 0.89071 NaN
57 1.75489 NaN
58 1.49564 NaN
59 1.06939 NaN
Here .where returns
An object of same shape as self and whose corresponding entries are
from self where cond is True and otherwise are from other.
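If you also want outliers far below the rolling mean (troughs as well as peaks), the same idea works on both sides; a sketch reusing r and mps from the snippet above:
lows = r.mean() - 3. * r.std()                                # rolling mean minus three rolling stdevs
outliers = df[(df.Data > mps.Data) | (df.Data < lows.Data)]   # two-sided filter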

pandas qcut not putting equal number of observations into each bin

I have a data frame, from which I can select a column (series) as follows:
df:
value_rank
275488 90
275490 35
275491 60
275492 23
275493 23
275494 34
275495 75
275496 40
275497 69
275498 14
275499 83
... ...
value_rank is a previously created percentile rank from a larger dataset. What I am trying to do is to create bins of this data set, e.g. quintiles:
pd.qcut(df.value_rank, 5, labels=False)
275488 4
275490 1
275491 3
275492 1
275493 1
275494 1
275495 3
275496 2
... ...
This appears fine, as expected, but it isn't.
In fact, I have 1569 rows. The nearest number divisible by 5 bins is 1565, which should give 1565 / 5 = 313 observations in each bin. There are 4 extra records, so I would expect to have 4 bins with 314 observations and one with 313 observations. Instead, I get this:
obs = pd.qcut(df.value_rank, 5, labels=False)
obs.value_counts()
0 329
3 314
1 313
4 311
2 302
I have no nans in df, and cannot think of any reason why this is happening. Literally beginning to tear my hair out!
Here is a small example:
df:
value_rank
286742 11
286835 53
286865 40
286930 31
286936 45
286955 27
287031 30
287111 36
287269 30
287310 18
pd.qcut gives this:
pd.qcut(df.value_rank, 5, labels = False).value_counts()
bin count
1 3
4 2
3 2
0 2
2 1
There should be 2 observations in each bin, not 3 in bin 1 and 1 in bin 2!
qcut is trying to compensate for repeating values. This is easier to visualize if you return the bin limits along with your qcut results:
In [42]: test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
In [43]: test_series = pd.Series(test_list, name='value_rank')
In [49]: pd.qcut(test_series, 5, retbins=True, labels=False)
Out[49]:
(array([0, 0, 1, 1, 1, 2, 3, 3, 4, 4]),
array([ 11. , 25.2, 30. , 33. , 41. , 53. ]))
You can see that there was no choice but to set the bin limit at 30, so qcut had to "steal" one of the expected values from the third bin and place it in the second. I'm thinking that this is just happening at a larger scale with your percentiles, since you're basically condensing their ranks into a 1 to 100 scale. Any reason not to just run qcut directly on the data instead of the percentiles, or to return percentiles that have greater precision?
Just try the code below (nbins is your desired number of bins):
pd.qcut(df.rank(method='first'), nbins)
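Applied to the small example from the accepted answer (a quick sketch; test_series is the one built from test_list above, and nbins = 5), each bin now gets exactly two observations:
pd.qcut(test_series.rank(method='first'), 5, labels=False).value_counts().sort_index()
# 0    2
# 1    2
# 2    2
# 3    2
# 4    2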
If you must get equal (or nearly equal) bins, then here's a trick you can use with qcut. Using the same data as the accepted answer, we can force these into equal bins by adding some random noise to the original test_list and binning according to those values.
test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
np.random.seed(42) #set this for reproducible results
test_list_rnd = np.array(test_list) + np.random.random(len(test_list)) #add noise to data
test_series = pd.Series(test_list_rnd, name='value_rank')
pd.qcut(test_series, 5, retbins=True, labels=False)
Output:
(0 0
1 0
2 1
3 2
4 1
5 2
6 3
7 3
8 4
9 4
Name: value_rank, dtype: int64,
array([ 11.37454012, 25.97573801, 30.42160255, 33.11683016,
41.81316392, 53.70807258]))
So, now we have two of each label: two 0's, two 1's, two 2's, two 3's and two 4's!
Disclaimer
Obviously, use this at your discretion, because results can vary based on your data; for instance, how large your data set is and/or how the values are spaced. The above "trick" works well for integers because, even though we are "salting" the test_list, it will still rank-order correctly in the sense that there won't be a value in group 0 greater than a value in group 1 (maybe equal, but not greater). If, however, you have floats, this may be tricky and you may have to reduce the size of your noise accordingly. For instance, if you had floats like 2.1, 5.3, 5.3, 5.4, etc., you should reduce the noise by dividing by 10: np.random.random(len(test_list)) / 10. If you have arbitrarily long floats, however, you probably would not have this problem in the first place, given the noise already present in "real" data.
This problem arises from duplicate values. A possible solution to force equal-sized bins is to use the index as the input to pd.qcut after sorting the dataframe:
import random
import pandas as pd
df = pd.DataFrame({'A': [random.randint(3, 9) for x in range(20)]}).sort_values('A').reset_index()
del df['index']
df = df.reset_index()
df['A'].plot.hist(bins=30);
picture: https://i.stack.imgur.com/ztjzn.png
df.head()
df['qcut_v1'] = pd.qcut(df['A'], q=4)
df['qcut_v2'] = pd.qcut(df['index'], q=4)
df
picture: https://i.stack.imgur.com/RB4TN.png
df.groupby('qcut_v1').count().reset_index()
picture: https://i.stack.imgur.com/IKtsW.png
df.groupby('qcut_v2').count().reset_index()
picture: https://i.stack.imgur.com/4jrkU.png
sorry I cannot post images since I don't have at least 10 reputation on stackoverflow -.-
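Since the result tables are only linked as images, a quick way to see the difference directly (a sketch building on the code above):
print(df.groupby('qcut_v1').size())  # counts vary, depending on the random data and its duplicates
print(df.groupby('qcut_v2').size())  # always 5 per bin, because the index 0-19 has no duplicates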
