I'm successfully using the groupby() function to compute statistics on grouped data; however, I'd now like to do the same for subsets of each group.
I can't work out how to generate a subset of each group (as a groupby object) that I can then pass to a groupby function such as mean(). The following line works as intended:
d.groupby(['X','Y'])['Value'].mean()
How can I subset the values of the individual groups to then supply to the mean function? I suspect transform() or filter() might be useful though I can't figure out how.
EDIT to add reproducible example:
import numpy as np
import pandas as pd

np.random.seed(881)
value = np.random.randn(15)
letter = np.random.choice(['a','b','c'], 15)
date = np.repeat(pd.date_range(start='1/1/2001', periods=3), 5)
data = {'date': date, 'letter': letter, 'value': value}
df = pd.DataFrame(data)
df.groupby(['date','letter'])['value'].mean()
date letter
2001-01-01 a -0.039407
b -0.350787
c 1.221200
2001-01-02 a -0.688744
b 0.346961
c -0.702222
2001-01-03 a 1.320947
b -0.915636
c -0.419655
Name: value, dtype: float64
Here's an example of calculating the mean of the multi-level group. Now I'd like to find the mean of a subset of each group. For example, the mean of each group's data that is below the group's 10th percentile. The key takeaway is that the subsetting must be performed on the groups, not on the entire df first.
I think the function you're looking for is quantile(), which you can add to a groupby().apply() statement. For the tenth percentile, use quantile(.1):
df.groupby(['date','letter'])['value'].apply(lambda g: g[g <= g.quantile(.1)].mean())
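Since you mention transform(), here is an equivalent sketch (assuming the same df as above): it first broadcasts each group's 10th percentile back onto the rows with transform(), then masks and re-groups. It should produce the same result as the apply() version:
# per-group 10th percentile, aligned with the original rows
q10 = df.groupby(['date', 'letter'])['value'].transform(lambda g: g.quantile(.1))
# keep only values at or below that threshold, then take the group means
df[df['value'] <= q10].groupby(['date', 'letter'])['value'].mean()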
Related
I have an application that uses a Pandas dataframe to calculate the min/max value of each column. For example:
col_a col_b col_c
2 8 7
10 4 3
6 5 1
calling df.max() produces
col_a 10
col_b 8
col_c 7
Just for reference, I'm trying to convert the following code:
bin_stats = {'min': df.min(),
             'max': df.max(),
             'binwidth': (df.max() - df.min() + 10**-6) / bincount}
# Transform data into bin positions for fast binning
data = ((df - bin_stats['min']) / bin_stats['binwidth']).apply(np.floor)
I'm converting my functionality to Vaex and I need to print out the max value for every column in my dataframe, like above. I have tried df.max(column_names) but I get the error:
ValueError: Could not find a class (AggMax_object), seems object is not supported. How do I get an array of max values?
In vaex you can use df.max(), but you need to pass it an expression or a list of expressions for which you want the maximum value.
Consider this example:
import vaex
df = vaex.example()
columns = df.get_column_names(dtype='numeric')
df.max(columns)
# returns array([ 3.2000000e+01, 1.3049751e+02, 6.0022778e+01, 5.4506802e+01,
6.3641956e+02, 5.7964453e+02, 5.3974872e+02, 3.5941863e+04,
3.7393040e+03, 1.7840929e+03, -3.0200911e-01], dtype=float32)
Note that vaex has a df.minmax() method that gets you the min and max values in a single pass over the data (i.e. faster if your data is large).
float_columns = df.get_column_names(dtype='float')
df.minmax(float_columns)
Having said all of this, vaex excels at binning stuff, so it might be worth looking into how to achieve what you want in a "vaex-native" way, instead of straight up translating pandas code into vaex. It should work, but you might not get optimal performance.
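For example, here is a minimal sketch of what vaex-native binning looks like, using vaex's aggregation-with-binby API on the same example dataset (the bin count of 64 is just an illustrative choice):
import vaex

df = vaex.example()
# single pass over the data: counts per bin of the 'x' column,
# with bin edges taken from the column's own min/max
counts = df.count(binby=df.x, limits='minmax', shape=64)
# the same pattern works for other aggregations, e.g. a binned mean of 'y'
mean_y = df.mean(df.y, binby=df.x, limits='minmax', shape=64)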
I have a data frame that has the following format:
d = {'id1': ['a', 'a', 'b', 'b'], 'id2': ['a', 'b', 'b', 'c'], 'score': [1, 2, 3, 4]}
df = pd.DataFrame(data=d)
print(df)
  id1 id2  score
0   a   a      1
1   a   b      2
2   b   b      3
3   b   c      4
The data frame has over 1 billion rows; it represents pairwise distance scores between objects in columns id1 and id2. I do not need all object-pair combinations: for each object in id1 (there are about 40k unique ids) I only want to keep the 100 closest (smallest) distance scores.
The code I'm running to do this is the following:
df = df.groupby(['id1'])['score'].nsmallest(100)
The issue with this code is that I run into a memory error each time I try to run it:
MemoryError: Unable to allocate 8.53 GiB for an array with shape (1144468900,) and data type float64
I'm assuming it is because in the background pandas is now creating a new data frame for the result of the group by, but the existing data frame is still held in memory.
The reason I am only taking the top 100 for each id is to reduce the size of the data frame, but it seems that in doing so I am actually taking up more space.
Is there a way I can go about filtering this data down but not taking up more memory?
The desired output would be something like this (assuming top 1 instead of top 100):
  id1 id2  score
0   a   a      1
2   b   b      3
Some additional info about the original df:
df.count()
permid_1 1144468900
permid_2 1144468900
distance 1144468900
dtype: int64
df.dtypes
permid_1      int64
permid_2      int64
distance    float64
dtype: object
df.shape
(1144468900, 3)
id1 & id2 unique value counts: 33,830
I can't test this code, lacking your data, but perhaps try something like this:
indicies = []
for the_id in df['id1'].unique():
    scores = df['score'][df['id1'] == the_id]
    min_subindicies = np.argsort(scores.values)[:100]  # numpy is raw index only
    min_indicies = scores.iloc[min_subindicies].index  # convert to pandas indices
    indicies.extend(min_indicies)
df = df.loc[indicies]
Descriptively, for each unique ID (the_id), extract the matching scores. Then find the raw indices of the smallest 100 scores. Select those, map from the raw index to the Pandas index, and save the Pandas index to your list. At the end, subset on the collected Pandas index.
iloc does take a list input. some_series.iloc should align properly with some_series.values, which should allow this to work. Storing indices indirectly like this should make this substantially more memory-efficient.
df['score'][df['id1'] == the_id] should work more efficiently than df.loc[df['id1'] == the_id, 'score']: instead of taking the whole data frame and masking it, it takes only the score column of the data frame and masks it for matching IDs. You may want to del scores at the end of each loop iteration if you want to immediately free more memory.
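If the full argsort inside the loop becomes a bottleneck, np.argpartition can pick out the 100 smallest positions without sorting the whole group. A sketch using the same variable names as above (the min() guard for groups of 100 rows or fewer is my addition):
k = min(100, len(scores) - 1)                            # guard for small groups
min_subindicies = np.argpartition(scores.values, k)[:100]  # 100 smallest, unordered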
You can try the following:
df.sort_values(["id1", "scores"], inplace=True)
df["dummy_key"] = df["id1"].shift(100).ne(df["id1"])
df = df.loc[df["dummy_key"]]
1. You sort ascending (smallest on top), first by id1, then by score.
2. You add a column indicating whether the current id1 differs from the one 100 rows back (if it doesn't, your row is 101st or later within its group).
3. You filter on the column from step 2.
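To make the shift trick concrete, here is a small sketch on the toy frame from the question, keeping the single smallest score per id1 (shift(1) plays the role of shift(100)):
toy = pd.DataFrame({'id1': ['a', 'a', 'b', 'b'],
                    'id2': ['a', 'b', 'b', 'c'],
                    'score': [1, 2, 3, 4]})
toy = toy.sort_values(['id1', 'score'])
toy['dummy_key'] = toy['id1'].shift(1).ne(toy['id1'])
print(toy.loc[toy['dummy_key']])  # keeps rows (a, a, 1) and (b, b, 3)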
As Aryerez outlined in a comment, you can do something along the lines of:
closest = pd.concat([df.loc[df['id1'] == id1].sort_values(by='score').head(100)
                     for id1 in set(df['id1'])])
You could also do
def get_hundredth(id1):
    sub_df = df.loc[df['id1'] == id1].sort_values(by='score')
    return sub_df.iloc[99]['score']  # the 100th-smallest score (0-indexed)

hundredth_dict = {id1: get_hundredth(id1) for id1 in set(df['id1'])}

def check_distance(row):
    return row['score'] <= hundredth_dict[row['id1']]

closest = df.loc[df.apply(check_distance, axis=1)]
Another strategy would be to look at how filtering out distances past a threshold affects the dataframe. That is, take
low_scores = df.loc[df['score']<threshold]
Does this significantly decrease the size of the dataframe for some reasonable threshold? You'd need a threshold that makes the dataframe small enough to work with, but leaves the lowest 100 scores for each id1.
You also might want to look into what sort of optimization you can do given your distance metric. There are probably algorithms out there specifically for cosine similarity.
For the given shape (1144468900, 3) with only 33,830 unique values, the id1 and id2 columns are good candidates for the categorical data type. Converting them stores each value as a small integer code referencing one of the 33,830 categories, which substantially reduces the memory used by these two columns; then perform any aggregation you want.
df[['id1', 'id2']] = df[['id1', 'id2']].astype('category')
out = df.groupby(['id1'])['score'].nsmallest(100)
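A quick way to sanity-check the memory effect of the conversion (a sketch using the column names from the question; the measurement simply wraps the astype call above):
before = df[['id1', 'id2']].memory_usage(deep=True).sum()
df[['id1', 'id2']] = df[['id1', 'id2']].astype('category')
after = df[['id1', 'id2']].memory_usage(deep=True).sum()
print(before, after)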
I have a Pandas dataframe with two columns I am interested in: a categorical label and a timestamp. Presumably what I'm trying to do would also work with ordered numerical data. The dataframe is already sorted by timestamps in ascending order. I want to find out which label spans the longest time window and select only the values associated with it in the original dataframe.
I have tried grouping the df by label, calculating the difference and selecting the maximum (longest time window) successfully; however, I'm having trouble finding an expression to select the corresponding values in the original df using this information.
Consider this example with numerical values:
d = {'cat': ['A','A','A','A','A','A','B','B','B','B','C','C','C','C','C','C','C'],
     'val': [1,3,5,6,8,9,0,5,10,20,4,5,6,7,8,9,10]}
df = pd.DataFrame(data = d)
Here I would expect something equivalent to df.loc[df.cat == 'B'] since B has the maximum difference of all the categories.
df.groupby('cat').val.apply(lambda x: x.max() - x.min()).max()
gives me the correct difference, but I have no idea how to use this to select the correct category in the original df.
You can go for idxmax to get the category with the maximum peak-to-peak value within groups (np.ptp computes maximum minus minimum). Then you can index with loc as you said, or use query:
>>> max_cat = df.groupby("cat").val.apply(np.ptp).idxmax()
>>> max_cat
'B'
>>> df.query("cat == @max_cat")  # or df.loc[df.cat == max_cat]
cat val
6 B 0
7 B 5
8 B 10
9 B 20
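For reference, the same selection without apply, using two built-in group aggregations (a sketch on the same df):
rng = df.groupby('cat')['val'].max() - df.groupby('cat')['val'].min()
max_cat = rng.idxmax()
df.loc[df['cat'] == max_cat]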
I am trying to calculate the covariance between two columns by group. I am doing the following:
A = pd.DataFrame({'group':['A','A','A','A','B','B','B'],
'value1':[1,2,3,4,5,6,7],
'value2':[8,5,4,3,7,8,8]})
B = A.groupby('group')
B['value1'].cov(B['value2'])
Ideally, I would like to get just the covariance between the two columns and not the whole variance-covariance matrix, since I only have two columns.
You are almost there; you just don't fully understand the groupby object yet. See Pandas GroupBy for more details.
For your problem, if I understand correctly, you would like to calculate cov between two columns within the same group.
The simplest option is the groupby.cov function, which gives the pairwise covariances within each group:
A.groupby('group').cov()
value1 value2
group
A value1 1.666667 -2.666667
value2 -2.666667 4.666667
B value1 1.000000 0.500000
value2 0.500000 0.333333
If you only need cov(grouped_v1, grouped_v2)
grouped = A.groupby('group')
grouped.apply(lambda x: x['value1'].cov(x['value2']))
group
A -2.666667
B 0.500000
Here, grouped is a groupby object. grouped.apply takes a callback function as its argument, and each group is passed to that callback in turn. In this case the callback is a lambda function, and its argument x is one group (a DataFrame).
Hope this helps your understanding of groupby.
The following code gives you the grouped variance-covariance matrix. You can subset it as you wish to just get the covariances.
import pandas as pd
A = pd.DataFrame({'group':['A','A','A','A','B','B','B'],
'value1':[1,2,3,4,5,6,7],
'value2':[8,5,4,3,7,8,8]})
print(A.groupby('group').cov())
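For instance, one way to pull just the cross covariances out of that matrix (a sketch relying on the MultiIndex that .cov() returns) might be:
cov_matrix = A.groupby('group').cov()
# take the 'value1' row of each group's 2x2 block and keep its 'value2' entry
cov_series = cov_matrix.loc[(slice(None), 'value1'), 'value2'].droplevel(1)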
Here is an alternative solution that estimates cov(value1, value2) within each group, but doesn't use .apply():
import pandas as pd
A = pd.DataFrame({'group':['A','A','A','A','B','B','B'],
'value1':[1,2,3,4,5,6,7],
'value2':[8,5,4,3,7,8,8]})
B = A.groupby('group')
cov_a_b = B[['value1', 'value2']].cov(ddof=0)['value1'].unstack()['value2']
As an additional note somewhat related to the question, be careful with the default degrees-of-freedom correction in the NumPy/Pandas implementations of variance and covariance: pandas .var()/.cov() and np.cov default to ddof=1, while (confusingly) np.var defaults to ddof=0. This is why I included ddof=0 above.
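To make the ddof point concrete, a quick numeric check:
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
print(np.var(x))           # 1.25      (ddof=0, population variance)
print(np.var(x, ddof=1))   # 1.666...  (sample variance, the pandas .var() default)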
If you're looking for the cov() of two specific columns, you can use df.Age.cov(df.Salary),
assuming Age and Salary are two of many columns in the DataFrame. This is useful when you only care about two columns.
I'm using pandas groupby on my DataFrame df, which has columns type, subtype, and 11 others. I'm then calling apply with my combine_function (which needs a better name) on the groups like:
grouped = df.groupby('type')
reduced = grouped.apply(combine_function)
where my combine_function checks whether the group contains any row with the given subtype, say 1, and looks like:
def combine_function(group):
    if 1 in group.subtype:
        return aggregate_function(group)
    else:
        return group
The combine_function can then call an aggregate_function that calculates summary statistics, stores them in the first row, and then sets the group to be just that row. It looks like:
def aggregate_function(group):
    first = group.first_valid_index()
    group.value1[group.index == first] = group.value1.mean()
    group.value2[group.index == first] = group.value2.max()
    group.value3[group.index == first] = group.value3.std()
    group = group[(group.index == first)]
    return group
I'm fairly sure this isn't the best way to do this, but it has been giving me the desired results 99.9% of the time on thousands of DataFrames. However, it sometimes throws an error that seems to be related to a group that I don't want to aggregate having exactly 2 rows:
ValueError: Shape of passed values is (13,), indices imply (13, 5)
where an example of the group sizes was:
In [4]: grouped.size()
Out[4]:
type
1 9288
3 7667
5 7604
11 2
dtype: int64
It processed the first three fine and then gave the error when it tried to combine everything. If I comment out the line group = group[(group.index == first)] (so I update but don't aggregate), or if I call my aggregate_function on all groups, it's fine.
Does anyone know the proper way to be doing this kind of aggregation of some groups but not others?
Your aggregate_function looks contorted to me. When you aggregate a group, it automatically reduces to one row; you don't need to do it manually. Maybe I am missing the point. (Are you doing something special with the index that I'm not understanding?) But a more normal usage would look like this:
agg_condition = lambda x: pd.Series([1]).isin(x['subtype']).any()
agg_functions = {'value1': np.mean, 'value2': np.max, 'value3': np.std}
df1 = df.groupby('type').filter(agg_condition).groupby('type').agg(agg_functions)
df2 = df.groupby('type').filter(lambda x: not agg_condition(x))
result = pd.concat([df1, df2])
Note: agg_condition is messy because (1) the built-in Python in operator refers to the index of a Series, not its values, and (2) the result has to be reduced to a scalar by any().
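A quick illustration of the in pitfall from point (1), as a sketch:
import pandas as pd

s = pd.Series([10, 20], index=[0, 1])
print(10 in s)             # False: in checks the index labels
print(10 in s.values)      # True: check the values explicitly
print(s.isin([10]).any())  # True: the Series-native equivalent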