I want to apply a custom reduction function to each group in a Python dataframe. The function reduces the group to a single row by performing operations that combine several of the columns of the group.
I've implemented this like so:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={
"afac": np.random.random(size=1000),
"bfac": np.random.random(size=1000),
"class":np.random.randint(low=0,high=5,size=1000)
})
def f(group):
total_area = group['afac'].sum()
per_area = (group['afac']/total_area).values
per_pop = group['bfac'].values
return pd.DataFrame(data={'per_apop': [np.sum(per_area*per_pop)]})
aggdf = df.groupby('class').apply(f)
My input data frame df looks like:
>>> df
afac bfac class
0 0.689969 0.992403 0
1 0.688756 0.728763 1
2 0.086045 0.499061 1
3 0.078453 0.198435 2
4 0.621589 0.812233 4
But my code gives this multi-indexed data frame:
>>> aggdf
per_apop
class
0 0 0.553292
1 0 0.503112
2 0 0.444281
3 0 0.517646
4 0 0.503290
I've tried various ways of getting back to a "normal" data frame, but none seem to work.
>>> aggdf.reset_index()
class level_1 per_apop
0 0 0 0.553292
1 1 0 0.503112
2 2 0 0.444281
3 3 0 0.517646
4 4 0 0.503290
>>> aggdf.unstack().reset_index()
class per_apop
0
0 0 0.553292
1 1 0.503112
2 2 0.444281
3 3 0.517646
4 4 0.503290
How can I perform this operation and get a normal data frame afterwards?
Update: The output data frame should have columns for class and per_apop. Ideally, the function f can return multiple columns and possibly multiple rows. Perhaps using
return pd.DataFrame(data={'per_apop': [np.sum(per_area*per_pop),2], 'sue':[1,3]})
You can select which level to reset as well as if you want to retain the index using reset_index. In your case, you ended up with a multi-index that has 2 levels: class and one that is unnamed. reset_index allows you to reset the entire index (default) or just the levels you want. In the following example, the last level (-1) is being pulled out of the index. By also using drop=True it is dropped rather than appended as a column in the data frame.
aggdf.reset_index(level=-1, drop=True)
per_apop
class
0 0.476184
1 0.476254
2 0.509735
3 0.502444
4 0.525287
EDIT:
To push the class level of the index back to the data frame, you can simply call .reset_index() again. Ugly, but it work.
aggdf.reset_index(level=-1, drop=True).reset_index()
class per_apop
0 0 0.515733
1 1 0.497349
2 2 0.527063
3 3 0.515476
4 4 0.494530
Alternatively, you could also, reset the index, then just drop the extra column.
aggdf.reset_index().drop('level_1', axis=1)
class per_apop
0 0 0.515733
1 1 0.497349
2 2 0.527063
3 3 0.515476
4 4 0.494530
Make your self-def function return Series
def f(group):
total_area = group['afac'].sum()
per_area = (group['afac']/total_area).values
per_pop = group['bfac'].values
return pd.Series(data={'per_apop': np.sum(per_area*per_pop)})
df.groupby('class').apply(f).reset_index()
class per_apop
0 0 0.508332
1 1 0.505593
2 2 0.488117
3 3 0.481572
4 4 0.500401
Although you have a good answer, a suggestion:
test func for df.groupby(...).apply( func ) on the first group, like this:
agroupby = df.groupby(...)
for key, groupdf in agroupby: # an iterator -> (key, groupdf) ... pairs
break # get the first pair
print( "\n-- first groupdf: len %d type %s \n%s" % (
len(groupdf), type(groupdf), groupdf )) # DataFrame
test = myfunc( groupdf )
# groupdf .col [col] [[col ...]] .set_index .resample ... as usual
Related
Consider the following toy code that performs a simplified version of my actual question:
import pandas
df = pandas.DataFrame(
{
'n_event': [1,2,3,4,5],
'some column': [0,1,2,3,4],
}
)
df = df.set_index(['n_event'])
print(df)
resampled_df = df.sample(frac=1, replace=True)
print(resampled_df)
The resampled_df is, as it name suggests, a resampled version of the original one (with replacement). This is exactly what I want. An example output of the previous code is
some column
n_event
1 0
2 1
3 2
4 3
5 4
some column
n_event
4 3
1 0
4 3
4 3
2 1
Now for my actual question I have the following dataframe:
import pandas
df = pandas.DataFrame(
{
'n_event': [1,1,2,2,3,3,4,4,5,5],
'n_channel': [1,2,1,2,1,2,1,2,1,2],
'some column': [0,1,2,3,4,5,6,7,8,9],
}
)
df = df.set_index(['n_event','n_channel'])
print(df)
which looks like
some column
n_event n_channel
1 1 0
2 1
2 1 2
2 3
3 1 4
2 5
4 1 6
2 7
5 1 8
2 9
I want to do exactly the same as before, resample with replacements, but treating each group of rows with the same n_event as a single entity. A hand-built example of what I want to do can look like this:
some column
n_event n_channel
2 1 2
2 3
2 1 2
2 3
3 1 4
2 5
1 1 0
2 1
5 1 8
2 9
As seen, each n_event was treated as a whole and things within each event were no mixed up.
How can I do this without proceeding by brute force (i.e. without for loops, etc)?
I have tried with df.sample(frac=1, replace=True, ignore_index=False) and a few things using group_by without success.
Would a pivot()/melt() sequence work for you?
Use pivot() to from long to wide (make each group a single row).
Do the sampling.
Then back from wide to long using melt().
Don't have time to work out a full answer but thought I would get this idea to you in case it might help you.
Following the suggestion of jch I was able to find a solution by combining pivot and stack:
import pandas
df = pandas.DataFrame(
{
'n_event': [1,1,2,2,3,3,4,4,5,5],
'n_channel': [1,2,1,2,1,2,1,2,1,2],
'some column': [0,1,2,3,4,5,6,7,8,9],
'other col': [5,6,4,3,2,5,2,6,8,7],
}
)
resampled_df = df.pivot(
index = 'n_event',
columns = 'n_channel',
values = set(df.columns) - {'n_event','n_channel'},
)
resampled_df = resampled_df.sample(frac=1, replace=True)
resampled_df = resampled_df.stack()
print(resampled_df)
I'm new in python.
I have data frame (DF) example:
id
type
1
A
1
B
2
C
2
B
I would like to add a column example A_flag group by id.
In the end I have data frame (DF):
id
type
A_flag
1
A
1
1
B
1
2
C
0
2
B
0
I can do this in two step:
DF['A_flag_tmp'] = [1 if x.type=='A' else 0 for x in DF.itertuples()]
DF['A_flag'] = DF.groupby(['id'])['A_flag_tmp'].transform(np.max)
It's working, but it's very slowy for big data frame.
Is there any way to optimize this case ?
Thank's for help.
Change your codes with slow iterative coding to fast vectorized coding by replacing your first step to generate a boolean series by Pandas built-in functions, e.g.
df['type'].eq('A')
Then, you can attach it to the groupby statement for second step, as follows:
df['A_flag'] = df['type'].eq('A').groupby(df['id']).transform('max').astype(int)
Result
print(df)
id type A_flag
0 1 A 1
1 1 B 1
2 2 C 0
3 2 B 0
In general, if you have more complicated conditions, you can also define it in vectorized way, eg. define the boolean series m by:
m = df['type'].eq('A') & df['type1'].gt(1) | (df['type2'] != 0)
Then, use it in step 2 as follows:
m.groupby(df['id']).transform('max').astype(int)
I am using dask dataframe.groupby().apply()
and get a dask series as a return value.
I am each group to a list triplets such as (a,b,1) and wish then to turn all the triplets into a single dask data frame
I am using this code in the end of the mapping function to return the triplets as a dask df
#assume here that trips is a generator for tripletes such as you would produce from itertools.product([l1,l2,l3])
trip = list(itertools.chain.from_iterable(trip))
df = pd.DataFrame.from_records(trip)
return dd.from_pandas(df,npartitions=1)
then when I try to use something similar to pandas concat with dask concatenate
Assume the result of the apply function is the variable result.
I am trying to use
import dask.dataframe as dd
dd.concat(result, axis=0
and get the error
raise TypeError("dfs must be a list of DataFrames/Series objects")
TypeError: dfs must be a list of DataFrames/Series objects
But when I check for the type of result using
print type(result)
I get
output: class 'dask.dataframe.core.Series'
What is the proper way to apply a function over groups of dask groupby object and get all the results into one dataframe?
Thanks
edit:--------------------------------------------------------------
in order to produce the use case, assume this fake data generation
import random
import pandas as pd
import dask.dataframe as dd
people = [[random.randint(1,3), random.randint(1,3), random.randint(1,3)] for i in range(1000)]
ddf = dd.from_pandas(pd.DataFrame.from_records(people, columns=["first name", "last name", "cars"]), npartitions=1)
Now my mission is to group people by first and last name (e.g all the people with same first name & first last name) and than I need to get a new dask data frame which will contain how many cars each group had.
Assume that the apply function can return either a series of lists of tuples e.g [(name,name,cars count),(name,name,cars count)] or a data frame with the same columns - name, name, car count.
Yes, I know that particular use case can be solved in another way, but please trust me, my use case is more complex. But i can not share the data and can not generate any similar data. so let's use a dummy data :-)
The challenge is to connect all the results of the apply into a single dask data frame (pandas data frame will be a problem here, data will not fit in memory - so transitions via a pandas data frame will be a problem)
For me working if output of apply is pandas DataFrame, so last if necessary convert to dask DataFrame:
def f(x):
trip = ((1,2,x) for x in range(3))
df = pd.DataFrame.from_records(trip)
return df
df1 = ddf.groupby('cars').apply(f, meta={'x': 'i8', 'y': 'i8', 'z': 'i8'}).compute()
#only for remove MultiIndex
df1 = df1.reset_index()
print (df1)
cars level_1 x y z
0 1 0 1 2 0
1 1 1 1 2 1
2 1 2 1 2 2
3 2 0 1 2 0
4 2 1 1 2 1
5 2 2 1 2 2
6 3 0 1 2 0
7 3 1 1 2 1
8 3 2 1 2 2
ddf1 = dd.from_pandas(df1,npartitions=1)
print (ddf1)
cars level_1 x y z
npartitions=1
0 int64 int64 int64 int64 int64
8 ... ... ... ... ...
Dask Name: from_pandas, 1 tasks
EDIT:
L = []
def f(x):
trip = ((1,2,x) for x in range(3))
#append each
L.append(da.from_array(np.array(list(trip)), chunks=(1,3)))
ddf.groupby('cars').apply(f, meta={'x': 'i8', 'y': 'i8', 'z': 'i8'}).compute()
dar = da.concatenate(L, axis=0)
print (dar)
dask.array<concatenate, shape=(12, 3), dtype=int32, chunksize=(1, 3)>
For your edit:
In [8]: ddf.groupby(['first name', 'last name']).cars.count().compute()
Out[8]:
first name last name
1 1 107
2 107
3 110
2 1 117
2 120
3 99
3 1 119
2 103
3 118
Name: cars, dtype: int64
This is my pandas DataFrame with original column names.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
1 3 0 0
2 1 1 5
Firstly I want to extract all unique variations of cm, e.g. in this case cm1 and cm2.
After this I want to create a new column per each unique cm. In this example there should be 2 new columns.
Finally in each new column I should store the total count of non-zero original column values, i.e.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
1 3 0 0 2 0
2 1 1 5 2 1
I implemented the first step as follows:
cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]
df.ix[:, ind].columns
How to proceed with steps 2 and 3, so that the solution is automatic (I don't want to manually define column names cm1 and cm2, because in original data set I might have many cm variations.
You can use:
print df
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
0 1 3 0 0
1 2 1 1 5
First you can filter columns contains string cm, so columns without cm are removed.
df1 = df.filter(regex='cm')
Now you can change columns to new values like cm1, cm2, cm3.
print [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
['cm1', 'cm1', 'cm2']
df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print df1
cm1 cm1 cm2
0 1 3 0
1 2 1 1
Now you can count non - zero values - change df1 to boolean DataFrame and sum - True are converted to 1 and False to 0. You need count by unique column names - so groupby columns and sum values.
df1 = df1.astype(bool)
print df1
cm1 cm1 cm2
0 True True False
1 True True True
print df1.groupby(df1.columns, axis=1).sum()
cm1 cm2
0 2 0
1 2 1
You need unique columns, which are added to original df:
print df1.columns.unique()
['cm1' 'cm2']
Last you can add new columns by df[['cm1','cm2']] from groupby function:
df[df1.columns.unique()] = df1.groupby(df1.columns, axis=1).sum()
print df
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
0 1 3 0 0 2 0
1 2 1 1 5 2 1
Once you know which columns have cm in them you can map them (with a dict) to the desired new column with an adapted version of this answer:
col_map = {c:'cm'+c[c.index('cm') + len('cm')] for c in ind}
# ^ if you are hard coding this in you might as well use 2
so that instead of the string after cm it is cm and the character directly following, in this case it would be:
{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}
Then add the new columns to the DataFrame by iterating over the dict:
for col,new_col in col_map.items():
if new_col not in df:
df[new_col] =[int(a!=0) for a in df[col]]
else:
df[new_col]+=[int(a!=0) for a in df[col]]
note that int(a!=0) will simply give 0 if the value is 0 and 1 otherwise. The only issue with this is because dicts are inherently unordered it may be preferable to add the new columns in order according to the values: (like the answer here)
import operator
for col,new_col in sorted(col_map.items(),key=operator.itemgetter(1)):
if new_col in df:
df[new_col]+=[int(a!=0) for a in df[col]]
else:
df[new_col] =[int(a!=0) for a in df[col]]
to ensure the new columns are inserted in order.
Lets imagine you have a DataFrame df with a large number of columns, say 50, and df does not have any indexes (i.e. index_col=None). You would like to select a subset of the columns as defined by a required_columns_list, but would like to only return those rows meeting a mutiple criteria as defined by various boolean indexes. Is there a way to consicely generate the selection statement using a dict generator?
As an example:
df = pd.DataFrame(np.random.randn(100,50),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# df.columns = Index[u'Col001', u'Col002', ..., u'Col050']
required_columns_list = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
now lets imagine that I define:
boolean_index_dict = {'Col001':"MyAccount", 'Col002':"Summary", 'Col005':"Total"}
I would like to select out using a dict generator to construct the multiple boolean indices:
df.loc[GENERATOR_USING_boolean_index_dict, required_columns_list].values
The above generator boolean method would be the equivalent of:
df.loc[(df['Col001']=="MyAccount") & (df['Col002']=="Summary") & (df['Col005']=="Total"), ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']].values
Hopefully, you can see that this would be really useful 'template' in operating on large DataFrames and the boolean indexing can then be defined in the boolean_index_dict. I would greatly appreciate if you could let me know if this is possible in Pandas and how to construct the GENERATOR_USING_boolean_index_dict?
Many thanks and kind regards,
Bertie
p.s. If you would like to test this out, you will need to populate some of df columns with text. The definition of df using random numbers was simply given as a starter if required for testing...
Suppose this is your df:
df = pd.DataFrame(np.random.randint(0,4,(100,50)),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# the first five cols and rows:
df.iloc[:5,:5]
Col001 Col002 Col003 Col004 Col005
0 2 0 2 3 1
1 0 1 0 1 3
2 0 1 1 0 3
3 3 1 0 2 1
4 1 2 3 1 0
Compared to your example all columns are filled with ints of 0,1,2 or 3.
Lets define the criteria:
req = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
filt = {'Col001': 2, 'Col002': 2, 'Col005': 2}
So we want some columns, where some others columns all contain the value 2.
You can then get the result with:
df.loc[df[filt.keys()].apply(lambda x: x.tolist() == filt.values(), axis=1), req]
In my case this is the result:
Col002 Col012 Col025 Col032 Col033
43 2 2 1 3 3
98 2 1 1 1 2
Lets check the required columns for those rows:
df[filt.keys()].iloc[[43,98]]
Col005 Col001 Col002
43 2 2 2
98 2 2 2
And some other (non-matching) rows:
df[filt.keys()].iloc[[44,99]]
Col005 Col001 Col002
44 3 0 3
99 1 0 0
I'm starting to like Pandas more and more.