Python Pandas add column to multi-index GroupBy DataFrame

I am trying to add a column to a Pandas GroupBy DataFrame with a multi-index. The column is the difference between the max and mean value of a common key after grouping.
Here is the input DataFrame:
   Main  Reads  Test  Subgroup
0     1      5    54         1
1     2      2    55         1
2     1     10    56         2
3     2     20    57         3
4     1      7    58         3
Here is the code:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Main': [1, 2, 1, 2, 1], 'Reads': [5, 2, 10, 20, 7],
                   'Test': range(54, 59), 'Subgroup': [1, 1, 2, 3, 3]})
result = df.groupby(['Main','Subgroup']).agg({'Reads':[np.max,np.mean]})
Here is the variable result before performing the calculation of diff:
               Reads
                amax mean
Main Subgroup
1    1             5    5
     2            10   10
     3             7    7
2    1             2    2
     3            20   20
Next, I calculate the diff column with:
result['Reads']['diff'] = result['Reads']['amax'] - result['Reads']['mean']
but here is the output:
/home/userd/test.py:9: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  result['Reads']['diff'] = result['Reads']['amax'] - result['Reads']['mean']
I would like the diff column to be at the same level as amax and mean.
Is there a way to add a column to the innermost (bottom) level of the column MultiIndex of a grouped DataFrame in pandas?

You can access the MultiIndex columns using a tuple:
result[('Reads','diff')] = result[('Reads','amax')] - result[('Reads','mean')]
You get

               Reads
                amax mean diff
Main Subgroup
1    1             5    5    0
     2            10   10    0
     3             7    7    0
2    1             2    2    0
     3            20   20    0
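Assigning through the tuple works because result[('Reads','diff')] = ... is a single __setitem__ call on the original frame, whereas the chained result['Reads']['diff'] = ... first extracts result['Reads'] (possibly a copy) and then writes into that intermediate object, which is what triggers the SettingWithCopyWarning. An equivalent single-step assignment with .loc, as a sketch against the result frame from above:
result.loc[:, ('Reads', 'diff')] = result[('Reads', 'amax')] - result[('Reads', 'mean')]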

You can use a lambda to build diff directly:
df.groupby(['Main','Subgroup'])\
  .agg({'Reads': [np.max, np.mean, lambda x: np.max(x) - np.mean(x)]})\
  .rename(columns={'<lambda>': 'diff'})
Out[2360]:
               Reads
                amax mean diff
Main Subgroup
1    1             5    5    0
     2            10   10    0
     3             7    7    0
2    1             2    2    0
     3            20   20    0

Try this:
In [8]: result = df.groupby(['Main','Subgroup']).agg({'Reads':[np.max,np.mean, lambda x: x.max()-x.mean()]})
In [9]: result
Out[9]:
               Reads
                amax mean <lambda>
Main Subgroup
1    1             5    5        0
     2            10   10        0
     3             7    7        0
2    1             2    2        0
     3            20   20        0
In [10]: result = result.rename(columns={'<lambda>':'diff'})
In [11]: result
Out[11]:
               Reads
                amax mean diff
Main Subgroup
1    1             5    5    0
     2            10   10    0
     3             7    7    0
2    1             2    2    0
     3            20   20    0
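On pandas 0.25 or newer, named aggregation avoids the <lambda> rename step entirely; note that it produces flat column names rather than a 'Reads' level. A sketch, assuming the same df:
result = df.groupby(['Main', 'Subgroup']).agg(
    amax=('Reads', 'max'),
    mean=('Reads', 'mean'),
    diff=('Reads', lambda x: x.max() - x.mean()),
)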

Related

(Pandas) How to count how often the same value as before occurred (and extract a new column out of it)

I'm looking for a way to extract a new column out of my Pandas DataFrame which counts how many times in a row (without interruption) the current value has occurred,
e.g. out of a column like:
df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['RiseOrFall'])
the following column should be extracted:
0
1
0
1
0
1
2
0
1
2
3
0
A user posted something similar a few years ago:
df = df.groupby(df['RiseOrFall'].ne(df['RiseOrFall'].shift()).cumsum())['RiseOrFall'].value_counts()
Or:
df = df.groupby([df['RiseOrFall'].ne(df['RiseOrFall'].shift()).cumsum(), 'RiseOrFall']).size()
print (df)
RiseOrFall  RiseOrFall
1           10            2
2           23            2
3           9             3
4           10            4
5           12            1
Name: RiseOrFall, dtype: int64
BUT with the code above I only get the total number of times each value occurred in a row, not the running count up to it.
What I need is a column with the same index and the same number of rows as the "RiseOrFall" column, like this:
0
1
0
1
0
1
2
0
1
2
3
0
You can use df.RiseOrFall.ne(df.RiseOrFall.shift()).cumsum() to assign a group id to every run of consecutive equal values in the RiseOrFall column, and then use cumcount:
df.assign(Count=df.groupby(df.RiseOrFall.ne(df.RiseOrFall.shift()).cumsum()).cumcount())
RiseOrFall Count
0 10 0
1 10 1
2 23 0
3 23 1
4 9 0
5 9 1
6 9 2
7 10 0
8 10 1
9 10 2
10 10 3
11 12 0
Note: remember to assign this back: df = df.assign(...)
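Putting the pieces together as a runnable sketch:
import pandas as pd

df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12],
                  columns=['RiseOrFall'])
runs = df['RiseOrFall'].ne(df['RiseOrFall'].shift()).cumsum()  # id of each run
df = df.assign(Count=df.groupby(runs).cumcount())              # position within run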

Creating a dataframe efficiently without a for loop

I am working with some advertising data, such as email data. I have two data sets:
one at the mail level that, for each person, states on which days they were emailed.
import pandas as pd
df_emailed=pd.DataFrame()
df_emailed['person']=['A','A','A','A','B','B','B']
df_emailed['day']=[2,4,8,9,1,2,5]
print(df_emailed)
person day
0 A 2
1 A 4
2 A 8
3 A 9
4 B 1
5 B 2
6 B 5
I have a summary dataframe that says whether someone converted, and which day they converted.
df_summary=pd.DataFrame()
df_summary['person']=['A','B']
df_summary['days_max']=[10,5]
df_summary['convert']=[1,0]
print(df_summary)
person days_max convert
0 A 10 1
1 B 5 0
I would like to combine these into a final dataframe that contains, for each person:
one row per day, from 1 to the max date,
whether they were emailed that day (0/1),
and whether they converted (0/1), where we assume they convert on the last day in the dataframe.
I know how to do this using a nested for loop, but I think that is just incredibly inefficient and sort of dumb. Does anyone know an efficient way of getting this done?
Desired result
df_final=pd.DataFrame()
df_final['person']=['A','A','A','A','A','A','A','A','A','A','B','B','B','B','B']
df_final['day']=[1,2,3,4,5,6,7,8,9,10,1,2,3,4,5]
df_final['emailed']=[0,1,0,1,0,0,0,1,1,0,1,1,0,0,1]
df_final['convert']=[0,0,0,0,0,0,0,0,0,1,0,0,0,0,0]
print(df_final)
person day emailed convert
0 A 1 0 0
1 A 2 1 0
2 A 3 0 0
3 A 4 1 0
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 1 0
8 A 9 1 0
9 A 10 0 1
10 B 1 1 0
11 B 2 1 0
12 B 3 0 0
13 B 4 0 0
14 B 5 1 0
Thank you and happy holidays!
A high-level approach involves modifying df_summary (alias df2) to get our output. We'll need
a set_index operation on the days_max column of df2; we'll also rename the axis to day (which will help later on)
a groupby to group on person
an apply of a reindex operation on the index (day), so we get rows for each day leading up to the last day
a fillna to fill the NaNs generated in the convert column as a result of the reindex
an assign to create a dummy column for emailed that we'll set later.
Next, index into the result of the previous operation using df_emailed. We'll use those values to set the corresponding emailed cells to 1. This is done by MultiIndexing with loc.
Finally, use reset_index to bring the index out as columns.
import numpy as np

def f(x):
    return x.reindex(np.arange(1, x.index.max() + 1))

df = df2.set_index('days_max')\
        .rename_axis('day')\
        .groupby('person')['convert']\
        .apply(f)\
        .fillna(0)\
        .astype(int)\
        .to_frame()\
        .assign(emailed=0)

df.loc[df1[['person', 'day']].apply(tuple, 1).values, 'emailed'] = 1
df.reset_index()
person day convert emailed
0 A 1 0 0
1 A 2 0 1
2 A 3 0 0
3 A 4 0 1
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 0 1
8 A 9 0 1
9 A 10 1 0
10 B 1 0 1
11 B 2 0 1
12 B 3 0 0
13 B 4 0 0
14 B 5 0 1
Where
df1 = df_emailed
and
df2 = df_summary
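An alternative sketch under the same inputs: build the full (person, day) grid explicitly by repeating each df_summary row days_max times, then mark emailed days with a merge and record the conversion on the last day.
grid = (df_summary.loc[df_summary.index.repeat(df_summary['days_max'])]
                  .assign(day=lambda d: d.groupby('person').cumcount() + 1))
# flag the days on which an email went out
grid = grid.merge(df_emailed.assign(emailed=1), on=['person', 'day'], how='left')
grid['emailed'] = grid['emailed'].fillna(0).astype(int)
# conversion is recorded only on the person's last day
grid['convert'] = ((grid['day'] == grid['days_max']) & (grid['convert'] == 1)).astype(int)
df_final = grid[['person', 'day', 'emailed', 'convert']]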

Filling a column with a condition on another column and shifting the values in pandas

My dataframe looks like this
   №№№  randomNumCol  n_k
0    5             1
1    6             0
2    7             1
3    8             0
4    9             1
5   10             1
6   11             1
7   12             1
...
I need to fill the column n_k as follows: if the value in the column randomNumCol is 1, copy the value from the column №№№; if it is 0, carry over the previous value of n_k.
BUT the first value in the column n_k should be equal to 2 (for now I don't know why that is).
It should look like this
   №№№  randomNumCol  n_k
0    5             1    2
1    6             0    2
2    7             1    7
3    8             0    7
4    9             1    9
5   10             1   10
6   11             1   11
7   12             1   12
...
My code does not give the right result:
dftest['n_k'] = np.where(dftest['randomNumCol'] == 1, dftest['№№№'], dftest['n_k'].shift(1))
I do not quite understand how to use shift(). And what should I do with the first cell in n_k, which should always be 2?
Any advice, please?
You can copy the values from the '№№№' column where randomNumCol is 1, set the remaining values to NaN, and then use ffill to fill the missing values. (A single shift(1) cannot work here: a run of consecutive zeros needs the last copied value propagated more than one row forward, which is exactly what ffill does.)
import numpy as np
import pandas as pd

df['n_k'] = df['№№№'].where(df.randomNumCol == 1, np.nan)
df['n_k'].iat[0] = 2
df['n_k'] = df['n_k'].ffill().astype(df['№№№'].dtype)
df
# №№№ randomNumCol n_k
#0 5 1 2
#1 6 0 2
#2 7 1 7
#3 8 0 7
#4 9 1 9
#5 10 1 10
#6 11 1 11
#7 12 1 12
You can use fillna() instead of shift():
import numpy as np
import pandas as pd

df['n_k'] = np.nan
df.loc[df['randomNumCol'] == 1, 'n_k'] = df['№№№']
df.loc[0, 'n_k'] = 2
df['n_k'] = df['n_k'].fillna(method='ffill')
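Note that in recent pandas versions fillna(method='ffill') is deprecated in favor of calling ffill() directly:
df['n_k'] = df['n_k'].ffill()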

Sum pandas dataframe column values based on condition of column name

I have a DataFrame with column names of the form x.y, and I would like to sum up all columns with the same value of x without having to name them explicitly. That is, the value of column_name.split(".")[0] should determine their group. Here's an example:
import pandas as pd
df = pd.DataFrame({'x.1': [1,2,3,4], 'x.2': [5,4,3,2], 'y.8': [19,2,1,3], 'y.92': [10,9,2,4]})
df
Out[3]:
   x.1  x.2  y.8  y.92
0    1    5   19    10
1    2    4    2     9
2    3    3    1     2
3    4    2    3     4
The result should be the same as this operation, only I shouldn't have to explicitly list the column names and how they should group.
pd.DataFrame({'x': df[['x.1', 'x.2']].sum(axis=1), 'y': df[['y.8', 'y.92']].sum(axis=1)})
   x   y
0  6  29
1  6  11
2  6   3
3  6   7
Another option: you can extract the prefix from the column names and use it as a grouping variable:
df.groupby(by=df.columns.str.split('.').str[0], axis=1).sum()
# x y
#0 6 29
#1 6 11
#2 6 3
#3 6 7
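In recent pandas versions groupby(..., axis=1) is deprecated, so an equivalent sketch groups the transposed frame instead, assuming the same df:
df.T.groupby(df.columns.str.split('.').str[0]).sum().T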
You can first create a MultiIndex by splitting the column names, then groupby the first level and aggregate with sum:
df.columns = df.columns.str.split('.', expand=True)
print (df)
   x      y
   1  2   8  92
0  1  5  19  10
1  2  4   2   9
2  3  3   1   2
3  4  2   3   4
df = df.groupby(axis=1, level=0).sum()
print (df)
x y
0 6 29
1 6 11
2 6 3
3 6 7

Resample pandas dataframe only knowing result measurement count

I have a dataframe which looks like this:
                   Data
Trial Measurement
0     0              12
      1               4
      2              12
1     0              12
      1              12
2     0              12
      1              12
      2             NaN
      3              12
I want to resample my data so that every trial has just two measurements
So I want to turn it into something like this:
                   Data
Trial Measurement
0     0               8
      1               8
1     0              12
      1              12
2     0              12
      1              12
This rather uncommon task stems from the fact that my data has an intentional jitter in the stimulus presentation.
I know pandas has a resample function, but I have no idea how to apply it to my second-level index while keeping the data in discrete categories based on the first-level index :(
Also, I wanted to iterate over my first-level indices, but apparently
for sub_df in np.arange(len(df['Trial'].max()))
won't work, because since 'Trial' is an index, pandas can't find it.
Well, it's not the prettiest I've ever seen, but from a frame looking like
>>> df
   Trial  Measurement  Data
0      0            0    12
1      0            1     4
2      0            2    12
3      1            0    12
4      1            1    12
5      2            0    12
6      2            1    12
7      2            2   NaN
8      2            3    12
then we can manually build the two "average-like" objects and then use pd.melt to reshape the output:
avg = df.groupby("Trial")["Data"].agg(
    {0: lambda x: x.head((len(x) + 1) // 2).mean(),
     1: lambda x: x.tail((len(x) + 1) // 2).mean()})
result = pd.melt(avg.reset_index(), "Trial", var_name="Measurement", value_name="Data")
result = result.sort_values("Trial").set_index(["Trial", "Measurement"])
which produces
>>> result
                   Data
Trial Measurement
0     0               8
      1               8
1     0              12
      1              12
2     0              12
      1              12
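The dict-of-lambdas aggregation on a grouped Series has since been removed from pandas; a minimal sketch of the same idea for current versions, assuming the frame above:
import numpy as np
import pandas as pd

df = pd.DataFrame({"Trial": [0, 0, 0, 1, 1, 2, 2, 2, 2],
                   "Measurement": [0, 1, 2, 0, 1, 0, 1, 2, 3],
                   "Data": [12, 4, 12, 12, 12, 12, 12, np.nan, 12]})

def halves(x):
    # mean of the first and second half of each trial; the halves share
    # the middle row when a trial has an odd number of measurements
    k = (len(x) + 1) // 2
    return pd.Series({0: x.head(k).mean(), 1: x.tail(k).mean()})

result = (df.groupby("Trial")["Data"].apply(halves)
            .rename_axis(["Trial", "Measurement"])
            .to_frame("Data"))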
