Concatenate multiple pandas groupby outputs - python

I would like to make multiple .groupby() operations on different subsets of a given dataset and bind them all together. For example:
import pandas as pd
df = pd.DataFrame({"ID":[1,1,2,2,2,3],"Subset":[1,1,2,2,2,3],"Value":[5,7,4,1,7,8]})
print(df)
ID Subset Value
0 1 1 5
1 1 1 7
2 2 2 4
3 2 2 1
4 2 2 7
5 3 3 8
I would then like to concatenate the following objects and store the result in a pandas data frame:
gr1 = df[df["Subset"] == 1].groupby(["ID","Subset"]).mean()
gr2 = df[df["Subset"] == 2].groupby(["ID","Subset"]).mean()
# Why do gr1 and gr2 have column names in different rows?
I realize that df.groupby(["ID","Subset"]).mean() would give me the concatenated object I'm looking for. Just bear with me, this is a reduced example of what I'm actually dealing with.
I think the solution could be to transform gr1 and gr2 to pandas data frames and then concatenate them like I normally would.
In essence, my questions are the following:
How do I convert a groupby result to a data frame object?
In case this can be done without transforming the series to data frames, how do you bind two groupby results together and then transform that to a pandas data frame?
PS: I come from an R background, so to me it's odd to group a data frame by something and get back a different type of object (a Series or a multi-index data frame). This is part of my question too: why does .groupby return a Series? What kind of Series is this? How can a Series have multiple columns and an index?

The result in your example is not a Series but a DataFrame with a MultiIndex: the two grouping keys become the row index, which is why the column names appear to sit on a different row when printed. To get back a flat DataFrame when applying a single aggregation function, you can use the following. Note the inclusion of as_index=False.
>>> gr1 = df[df["Subset"] == 1].groupby(["ID","Subset"], as_index=False).mean()
>>> gr1
ID Subset Value
0 1 1 6
This, however, won't work if you wish to aggregate with several functions at once (a sketch of that case follows at the end of this answer). If you wish to avoid using df.groupby(["ID","Subset"]).mean(), then you can use the following for your example.
>>> gr1 = df[df["Subset"] == 1].groupby(["ID","Subset"], as_index=False).mean()
>>> gr2 = df[df["Subset"] == 2].groupby(["ID","Subset"], as_index=False).mean()
>>> pd.concat([gr1, gr2]).reset_index(drop=True)
ID Subset Value
0 1 1 6
1 2 2 4
If you're only concerned with a specific subset of rows, the following may be simpler, since it removes the need to concatenate results.
>>> values = [1,2]
>>> df[df['Subset'].isin(values)].groupby(["ID","Subset"], as_index=False).mean()
ID Subset Value
0 1 1 6
1 2 2 4
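Finally, to answer the conversion question directly: a groupby result, whether a Series or a MultiIndex DataFrame, can always be flattened back into a regular data frame with .reset_index(). A minimal sketch on the example data that also covers the multiple-function case mentioned above (the ["mean", "sum"] aggregation list is just for illustration):
>>> gr = df.groupby(["ID","Subset"])["Value"].agg(["mean", "sum"])
>>> gr.reset_index()  # the row MultiIndex becomes ordinary columns
   ID  Subset  mean  sum
0   1       1   6.0   12
1   2       2   4.0   12
2   3       3   8.0    8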

Why do we need to add : when defining a new column using .iloc function

When we make a new column in a dataset in pandas
df["Max"] = df.iloc[:, 5:7].sum(axis=1)
If we are only getting the columns from index 5 to index 7, why do we also need to pass : for the rows?
pandas.DataFrame.iloc is purely integer-location based indexing for selection by position (see the pandas documentation). The : means all rows in the selected columns, here column indices 5 and 6 (iloc is not inclusive of the last index).
You are using .iloc to take a slice out of the dataframe and apply an aggregate function across the columns of that slice.
Consider an example:
df = pd.DataFrame({"a":[0,1,2],"b":[2,3,4],"c":[4,5,6]})
df
would produce the following dataframe
a b c
0 0 2 4
1 1 3 5
2 2 4 6
You are using iloc to avoid dealing with named columns, so that
df.iloc[:,1:3]
would look as follows
b c
0 2 4
1 3 5
2 4 6
Now a slight modification of your code gives you the row-wise sums that would go into the new column:
df.iloc[:,1:3].sum(axis=1)
0 6
1 8
2 10
Alternatively you could use function application:
df.apply(lambda x: x.iloc[1:3].sum(), axis=1)
0 6
1 8
2 10
Thus you explicitly tell pandas to apply the sum across columns. However, your original syntax is more succinct and is preferable to explicit function application. The result is the same either way.
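To see why the : is needed at all, note that when iloc is given a single slice it selects rows, not columns. A quick check on the same example dataframe:
df.iloc[1:3]  # no colon: this slices rows 1 and 2, keeping all columns
   a  b  c
1  1  3  5
2  2  4  6
So to slice columns by position you must first say "all rows" with :, and then give the column slice, as in df.iloc[:, 1:3].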

Numpy, multiply dataframe by number outputting NaN

I am trying to multiply dataframe 1 column a by dataframe 2 column b.
combineQueryandBookFiltered['pnlValue'] = np.multiply(combineQueryandBookFiltered['pnlValue'], df_fxrate['fx_rate'])
pnlValue column has many numbers and fx_rate column is just the one number.
The code executes, but my end result ends up with tons of NaN.
Any help would be appreciated.
It is probably due to the index of your dataframe. You need to use df_fxrate['fx_rate'].values:
combineQueryandBookFiltered['pnlValue'] = np.multiply(combineQueryandBookFiltered['pnlValue'], df_fxrate['fx_rate'].values)
or better:
combineQueryandBookFiltered['pnlValue'] = combineQueryandBookFiltered['pnlValue'] * df_fxrate['fx_rate'].values
I show you an example:
df1=pd.DataFrame(index=[1, 2])
df2=pd.DataFrame(index=[0])
df1['col1']=[1,1]
print(df1)
col1
1 1
2 1
df2['col1']=[1]
print(df2)
col1
0 1
print(np.multiply(df1['col1'],df2['col1']))
0 NaN
1 NaN
2 NaN
as you can see, the multiplication is aligned on the index
So you need something like this:
np.multiply(df1['col1'],df2['col1'].values)
or
df1['col1']*df2['col1'].values
Output:
1 1
2 1
Name: col1, dtype: int64
as you can see, now only the index of df1['col1'] is used
Is there a reason why you can't use the simple column multiplication?
df['C'] = df['A'] * df['B']
As was pointed out, multiplications of two series are based on their indices and it's likely that your fx_rate series does not have the same indices as the pnlValue series.
But since your fx_rate is only one value, I suggest multiplying your dataframe with a scalar instead:
fx_rate = df_fxrate['fx_rate'].iloc[0]
combineQueryandBookFiltered['pnlValue'] = combineQueryandBookFiltered['pnlValue'] * fx_rate
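A minimal runnable sketch of why the scalar route sidesteps index alignment entirely (toy data; the names are illustrative):
pnl = pd.Series([100, 200], index=[5, 9], name='pnlValue')  # arbitrary index
fx = pd.Series([1.1], index=[0], name='fx_rate')            # single value at index 0
print(pnl * fx)          # Series * Series aligns on index: NaN at 0, 5 and 9
print(pnl * fx.iloc[0])  # Series * scalar: 110.0 and 220.0, original index kept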

I want to pick out one column of the DataFrame but the result is automatically ordered by values

I just need one column of my dataframe, in the original order. When I extract it, it appears sorted by the values, and I can't understand why. I tried different ways to pick out one column, but every time it came out sorted by the values.
this is my code:
import pandas
data = pandas.read_csv('/data.csv', sep=';')
longti = data.iloc[:,4]
To return a column by position, your code should work as written; iloc does not reorder rows. For example, to return the first column:
import pandas as pd
df = pd.DataFrame(dict(A=[1,2,3,4,5,6], B=['A','B','C','D','E','F']))
df = df.iloc[:,0]
Out:
0 1
1 2
2 3
3 4
4 5
5 6
If you want to return the second column, you can use the following:
df = df.iloc[:,1]
Out:
0 A
1 B
2 C
3 D
4 E
5 F
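Note that iloc preserves the original row order; pandas never re-sorts a Series by its values when you select a column. A quick check with deliberately unsorted data (the column name is illustrative):
df = pd.DataFrame({'longti': [50.3, 10.1, 30.2]})
df.iloc[:, 0]
Out:
0    50.3
1    10.1
2    30.2
Name: longti, dtype: float64
If the extracted column still appears sorted, the sorting is happening elsewhere (for example in whatever viewer you use to inspect it), not in iloc.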

How to re-index a pandas dataframe as a multi-index when index values repeat

I have a pandas dataframe whose index repeats values, and I want to re-index it as a multi-index in which the repeated values are grouped.
The index looks like this: the same value (112335586, for example) appears on several consecutive rows, and I would like all the rows with that index value to be grouped under the same outer index.
I have looked at this question, Create pandas dataframe by repeating one row with new multiindex, but there the index values are pre-defined, which is not possible here as my dataframe is far too large to hard-code them.
I also looked at the multi-index documentation, but this likewise pre-defines the values for the index.
I believe you need:
s = pd.Series([1,2,3,4], index=[10,10,20,20])
s.index.name = 'EVENT_ID'
print (s)
EVENT_ID
10 1
10 2
20 3
20 4
dtype: int64
s1 = s.index.to_series()
s2 = s1.groupby(s1).cumcount()
s.index = [s.index, s2]
print (s)
EVENT_ID
10 0 1
1 2
20 0 3
1 4
dtype: int64
Try this:
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID','sub_idx'], inplace=True)
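A quick run of this second approach on the same toy data, assuming the index is named EVENT_ID as in the first answer:
df = pd.DataFrame({'val': [1, 2, 3, 4]},
                  index=pd.Index([10, 10, 20, 20], name='EVENT_ID'))
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID','sub_idx'], inplace=True)
print(df)
                  val
EVENT_ID sub_idx
10       0          1
         1          2
20       0          3
         1          4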

Pandas distinct count as a DataFrame

Suppose I have a Pandas DataFrame called df with columns a and b and what I want is the number of distinct values of b per each a. I would do:
distcounts = df.groupby('a')['b'].nunique()
which gives the desired result, but as a Series object rather than another DataFrame. I'd like a DataFrame instead. In regular SQL, I'd write:
SELECT a, COUNT(DISTINCT b) FROM df GROUP BY a
and I haven't been able to emulate this query in pandas exactly. How can I do that?
I think you need reset_index:
distcounts = df.groupby('a')['b'].nunique().reset_index()
Sample:
df = pd.DataFrame({'a':[7,8,8],
                   'b':[4,5,6]})
print (df)
a b
0 7 4
1 8 5
2 8 6
distcounts = df.groupby('a')['b'].nunique().reset_index()
print (distcounts)
a b
0 7 1
1 8 2
An alternative is to use GroupBy.agg instead:
df.groupby('a', as_index=False).agg({'b': 'nunique'})
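In pandas 0.25+ you can also use named aggregation to control the output column name directly (the name distinct_b is just illustrative):
df.groupby('a', as_index=False).agg(distinct_b=('b', 'nunique'))
   a  distinct_b
0  7           1
1  8           2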
