Is there a way to apply a function on a MultiIndex column? - python

I have a dataframe that looks like this:
id sex isActive score
0 1 M 1 10
1 2 F 0 20
2 2 F 1 30
3 2 M 0 40
4 3 M 1 50
I want to pivot the dataframe on the index id with columns sex and isActive (the values should be score). I want each id's score to be expressed as a percentage of that id's total score within the corresponding sex group.
In the end, my dataframe should look like this:
sex F M
isActive 0 1 0 1
id
1 NaN NaN NaN 1.0
2 0.4 0.6 1.0 NaN
3 NaN NaN NaN 1.0
I tried pivoting first:
p = df.pivot_table(index='id', columns=['sex', 'isActive'], values='score')
print(p)
sex F M
isActive 0 1 0 1
id
1 NaN NaN NaN 10.0
2 20.0 30.0 40.0 NaN
3 NaN NaN NaN 50.0
Then, I summed up the scores for each group:
row_sum = p.sum(axis=1, level=[0])
print(row_sum)
sex F M
id
1 0.0 10.0
2 50.0 40.0
3 0.0 50.0
This is where I'm getting stuck. I'm trying to use DataFrame.apply to perform a column-wise division using the second dataframe. However, I keep getting errors from attempts of this form:
p.apply(lambda col: col/row_sum)
I may be overthinking this problem. Is there some better approach out there?

I think a simple division of p by row_sum would work:
print (p/row_sum)
sex F M
isActive 0 1 0 1
id
1 NaN NaN NaN 1.0
2 0.4 0.6 1.0 NaN
3 NaN NaN NaN 1.0
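A note and an alternative, offered as a sketch: newer pandas versions deprecate the level= argument of sum, and the same percentages can be computed without any MultiIndex arithmetic at all by normalising within each (id, sex) group before pivoting (the pct column name here is just for illustration):

import pandas as pd

df = pd.DataFrame({
    'id':       [1, 2, 2, 2, 3],
    'sex':      ['M', 'F', 'F', 'M', 'M'],
    'isActive': [1, 0, 1, 0, 1],
    'score':    [10, 20, 30, 40, 50],
})

# share of each score within its (id, sex) group
df['pct'] = df['score'] / df.groupby(['id', 'sex'])['score'].transform('sum')

# pivot the already-normalised values
p = df.pivot_table(index='id', columns=['sex', 'isActive'], values='pct')
print(p)

This gives the same table as p/row_sum above.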

Related

Pandas: calculate the number of columns of a given name that have a value in a row

I have this dataset with some columns (not important to the calculations) and then many columns sharing the same name prefix. For each row, I want to count how many of those columns contain something other than a NaN value. The set looks something like this:
id  something  number1  number2  number3  number4
1   105        200      NaN      NaN      50
2   300        2        1        1        33
3   20         1        NaN      NaN      NaN
So I want to create a new column that contains the count of the number columns that have a value. The final dataset would look like this:
id  something  number1  number2  number3  number4  sum_columns
1   105        200      NaN      NaN      50       2
2   300        2        1        1        33       4
3   20         1        NaN      NaN      NaN      1
I know I can select the columns that start with a specific name, something like this:
df[df.columns[pd.Series(df.columns).str.startswith('number')]]
but I can't figure out how to add the condition that the value has to be something other than NaN, nor how to apply it to every row. I think it could be done with a lambda, but I haven't succeeded yet.
# filter column on 'number' and count
df['sum_columns']=df.filter(like='number').count(axis=1)
df
id something number1 number2 number3 number4 sum_columns
0 1 105 200 NaN NaN 50.0 2
1 2 300 2 1.0 1.0 33.0 4
2 3 20 1 NaN NaN NaN 1
PS: The NaN counts in your first and second DFs don't match; I used the second DF in the solution.
Indeed, df[df.columns[df.columns.str.startswith('number')]] will give you the dataframe restricted to the columns starting with 'number'. Now we only need to count, per row, the values that are not NaN. This can be done like so:
df['sum_columns'] = (df[df.columns[df.columns.str.startswith('number')]].notnull()).sum(axis=1)
Output:
id something number1 number2 number3 number4 sum_columns
0 1 105 200 NaN NaN 50.0 2
1 2 300 2 1.0 1.0 33.0 4
2 3 20 1 NaN NaN NaN 1
import pandas as pd
import numpy as np

df = {'something': [105, 300, 20],
      'number1': [200, 2, 1],
      'number2': [np.nan, 1, np.nan],
      'number3': [np.nan, 1, np.nan],
      'number4': [50, 33, np.nan]}
df = pd.DataFrame(df)

# select the 'number*' columns, then count the non-null values per row
tmp = df[df.columns[pd.Series(df.columns).str.startswith('number')]]
df['sum_columns'] = tmp.notnull().sum(axis=1).tolist()
df
Output:
something number1 number2 number3 number4 sum_columns
0 105 200 NaN NaN 50.0 2
1 300 2 1.0 1.0 33.0 4
2 20 1 NaN NaN NaN 1
One can use pandas.DataFrame.iloc to select the desired columns by position, followed by .count(axis=1), as follows:
df['sum_columns'] = df.iloc[:, 2:].count(axis=1)
[Out]:
id something number1 number2 number3 number4 sum_columns
0 1 105 200 NaN NaN 50.0 2
1 2 300 2 1.0 1.0 33.0 4
2 3 20 1 NaN NaN NaN 1
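Note that the iloc version assumes the number columns occupy positions 2 onward. If that ordering is not guaranteed, a name-based selection (as in the earlier answers) avoids the assumption, for example:

# position-independent: count non-null values in all columns named 'number*'
df['sum_columns'] = df.filter(like='number').notna().sum(axis=1)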

Applying a lambda function to columns on pandas avoiding redundancy

I have this dataset, which contains some NaN values:
df = pd.DataFrame({'Id':[1,2,3,4,5,6], 'Name':['Eve','Diana',np.NaN,'Mia','Mae',np.NaN], "Count":[10,3,np.NaN,8,5,2]})
df
Id Name Count
0 1 Eve 10.0
1 2 Diana 3.0
2 3 NaN NaN
3 4 Mia 8.0
4 5 Mae 5.0
5 6 NaN 2.0
I want to test whether each column has a NaN value (0) or not (1), creating two new columns. I have tried this:
df_clean = df
df_clean[['Name_flag','Count_flag']] = df_clean[['Name','Count']].apply(lambda x: 0 if x == np.NaN else 1, axis = 1)
But it raises 'The truth value of a Series is ambiguous'. I want to do this without redundancy, but I see there is a mistake in my logic. Could you please help me with this?
The expected table is:
Id Name Count Name_flag Count_flag
0 1 Eve 10.0 1 1
1 2 Diana 3.0 1 1
2 3 NaN NaN 0 0
3 4 Mia 8.0 1 1
4 5 Mae 5.0 1 1
5 6 NaN 2.0 0 1
Multiply the boolean mask by 1:
df[['Name_flag','Count_flag']] = df[['Name', 'Count']].isna() * 1
>>> df
Id Name Count Name_flag Count_flag
0 1 Eve 10.0 0 0
1 2 Diana 3.0 0 0
2 3 NaN NaN 1 1
3 4 Mia 8.0 0 0
4 5 Mae 5.0 0 0
5 6 NaN 2.0 1 0
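Note that isna() puts a 1 where the value is missing, which is the inverse of the expected table in the question (1 when a value is present). To match that table exactly, the mask can be flipped with notna():

df[['Name_flag', 'Count_flag']] = df[['Name', 'Count']].notna() * 1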
Regarding the error 'The truth value of a Series is ambiguous': with apply you cannot return a scalar 0 or 1 because the input is a Series. You have to use applymap instead to apply a function elementwise. Comparing to NaN is not straightforward, though.
Try:
df[['Name','Count']].applymap(lambda x: str(x) == 'nan') * 1
We can use isna and convert the boolean to int:
df[["Name_flag", "Count_flag"]] = df[["Name", "Count"]].isna().astype(int)
Id Name Count Name_flag Count_flag
0 1 Eve 10.00 0 0
1 2 Diana 3.00 0 0
2 3 NaN NaN 1 1
3 4 Mia 8.00 0 0
4 5 Mae 5.00 0 0
5 6 NaN 2.00 1 0

Python Pandas - difference between 'loc' and 'where'?

Just curious on the behavior of 'where' and why you would use it over 'loc'.
If I create a dataframe:
df = pd.DataFrame({'ID':[1,2,3,4,5,6,7,8,9,10],
'Run Distance':[234,35,77,787,243,5435,775,123,355,123],
'Goals':[12,23,56,7,8,0,4,2,1,34],
'Gender':['m','m','m','f','f','m','f','m','f','m']})
And then apply the 'where' function:
df2 = df.where(df['Goals']>10)
I get the following, which keeps the rows where Goals > 10 but leaves everything else as NaN:
Gender Goals ID Run Distance
0 m 12.0 1.0 234.0
1 m 23.0 2.0 35.0
2 m 56.0 3.0 77.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 m 34.0 10.0 123.0
If however I use the 'loc' function:
df2 = df.loc[df['Goals']>10]
It returns the dataframe subsetted without the NaN values:
Gender Goals ID Run Distance
0 m 12 1 234
1 m 23 2 35
2 m 56 3 77
9 m 34 10 123
So essentially I am curious why you would use 'where' over 'loc/iloc' and why it returns NaN values?
Think of loc as a filter - give me only the parts of the df that conform to a condition.
where originally comes from numpy. It runs over an array and checks if each element fits a condition. So it gives you back the entire array, with a result or NaN. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other='0'), to replace values that don't meet the condition with 0.
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 10 123 34 m
Also, while where is only for conditional filtering, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column names, while iloc uses their index number. So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']]:
Gender Goals
0 m 12
1 m 23
If you check the docs for DataFrame.where, it replaces values where the condition is False (with NaN by default, but it is possible to specify a value):
df2 = df.where(df['Goals']>10)
print (df2)
ID Run Distance Goals Gender
0 1.0 234.0 12.0 m
1 2.0 35.0 23.0 m
2 3.0 77.0 56.0 m
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 10.0 123.0 34.0 m
df2 = df.where(df['Goals']>10, 100)
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 100 100 100 100
4 100 100 100 100
5 100 100 100 100
6 100 100 100 100
7 100 100 100 100
8 100 100 100 100
9 10 123 34 m
Another syntax is boolean indexing, which filters rows: it keeps only the rows matching the condition and removes the rest.
df2 = df.loc[df['Goals']>10]
#alternative
df2 = df[df['Goals']>10]
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
9 10 123 34 m
With loc it is also possible to filter rows by a condition and columns by name(s):
s = df.loc[df['Goals']>10, 'ID']
print (s)
0 1
1 2
2 3
9 10
Name: ID, dtype: int64
df2 = df.loc[df['Goals']>10, ['ID','Gender']]
print (df2)
ID Gender
0 1 m
1 2 m
2 3 m
9 10 m
loc retrieves only the rows that matches the condition.
where returns the whole dataframe, replacing the rows that don't match the condition (NaN by default).
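As a small addition, where also has a complement, DataFrame.mask, which replaces the values where the condition is True instead of False. A minimal sketch of the three behaviours side by side:

import pandas as pd

df = pd.DataFrame({'Goals': [12, 23, 7, 34]})
cond = df['Goals'] > 10

kept    = df.where(cond)   # non-matching rows become NaN, shape preserved
flipped = df.mask(cond)    # matching rows become NaN, shape preserved
subset  = df.loc[cond]     # non-matching rows are dropped entirely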

Creating a numeric variable issue

I'm trying to create a new variable as the mean of another numeric variable present in my database (mark1, type float).
Unfortunately, the result is a new column with all NaN values.
I still can't understand the reason why.
The code I wrote is the following:
df = pd.read_csv("students2.csv")
df.loc[:, 'mean_m1'] = pd.Series(np.mean(df['mark1']).mean(), index= df)
this the first few rows after the code:
df.head()
ID gender subject mark1 mark2 mark3 fres mean_m1
0 1 mm 1 17.0 20.0 15.0 neg NaN
1 2 f 2 24.0 330.0 23.0 pos NaN
2 3 FEMale 1 17.0 16.0 24.0 0 NaN
3 4 male 3 27.0 23.0 21.0 1 NaN
4 5 m 2 30.0 22.0 24.0 positive NaN
No error messages are printed.
Thanks so much!
You need GroupBy + transform with 'mean'.
For the data you have provided, this is trivially equal to mark1. You should probably map your genders to categories, e.g. M or F, as a preliminary step.
df['mean_m1'] = df.groupby('gender')['mark1'].transform('mean')
print(df)
ID gender subject mark1 mark2 mark3 fres mean_m1
0 1 mm 1 17.000 20.000 15.000 neg 17.000
1 2 f 2 24.000 330.000 23.000 pos 24.000
2 3 FEMale 1 17.000 16.000 24.000 0 17.000
3 4 male 3 27.000 23.000 21.000 1 27.000
4 5 m 2 30.000 22.000 24.000 positive 30.000
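As for the all-NaN column in the original attempt, the likely cause is index alignment: pd.Series(..., index=df) ends up indexed by the DataFrame's column labels rather than its row labels, so nothing lines up when the Series is assigned back. If the intent was simply the overall mean of mark1 on every row, a scalar assignment is enough; a minimal sketch:

# the scalar is broadcast to every row of the new column
df['mean_m1'] = df['mark1'].mean()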

DataFrameGroupBy diff() on condition

Suppose I have a DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'CATEGORY': ['a', 'b', 'c', 'b', 'b', 'a', 'b'],
                   'VALUE': [np.nan, 1, 0, 0, 5, 0, 4]})
which looks like
CATEGORY VALUE
0 a NaN
1 b 1
2 c 0
3 b 0
4 b 5
5 a 0
6 b 4
I group it:
df = df.groupby(by='CATEGORY')
And now, let me show what I want with the help of an example on one group, 'b':
df.get_group('b')
group b:
CATEGORY VALUE
1 b 1
3 b 0
4 b 5
6 b 4
What I need: within each group, compute diff() between VALUE values, skipping all NaNs and 0s. So the result should be:
CATEGORY VALUE DIFF
1 b 1 -
3 b 0 -
4 b 5 4
6 b 4 -1
You can use diff to subtract values after dropping 0 and NaN values:
import pandas as pd
import numpy as np

df = pd.DataFrame({'CATEGORY': ['a', 'b', 'c', 'b', 'b', 'a', 'b'],
                   'VALUE': [np.nan, 1, 0, 0, 5, 0, 4]})
grouped = df.groupby("CATEGORY")

# define diff func: treat 0 as missing, drop NaNs, then diff within the group
diff = lambda x: x["VALUE"].replace(0, np.nan).dropna().diff()
df["DIFF"] = grouped.apply(diff).reset_index(0, drop=True)
print(df)
CATEGORY VALUE DIFF
0 a NaN NaN
1 b 1.0 NaN
2 c 0.0 NaN
3 b 0.0 NaN
4 b 5.0 4.0
5 a 0.0 NaN
6 b 4.0 -1.0
Sounds like a job for a pd.Series.shift() operation along with a notnull mask.
First we remove the unwanted values before we group the data:
nonull_df = df[(df['VALUE'] != 0) & df['VALUE'].notnull()]
groups = nonull_df.groupby(by='CATEGORY')
Now we can shift within the groups and calculate the diff:
nonull_df['next_value'] = groups['VALUE'].shift(1)
nonull_df['diff'] = nonull_df['VALUE'] - nonull_df['next_value']
Lastly, and optionally, you can copy the data back to the original dataframe:
df.loc[nonull_df.index] = nonull_df
df
CATEGORY VALUE next_value diff
0 a NaN NaN NaN
1 b 1.0 NaN NaN
2 c 0.0 NaN NaN
3 b 0.0 1.0 -1.0
4 b 5.0 1.0 4.0
5 a 0.0 NaN NaN
6 b 4.0 5.0 -1.0
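Another way to reach the expected result, offered as a sketch: drop the zeros and NaNs first, diff per group over the surviving rows only, and let index alignment fill the remaining rows with NaN on assignment:

import pandas as pd
import numpy as np

df = pd.DataFrame({'CATEGORY': ['a', 'b', 'c', 'b', 'b', 'a', 'b'],
                   'VALUE': [np.nan, 1, 0, 0, 5, 0, 4]})

# rows whose VALUE is neither 0 nor NaN
valid = df['VALUE'].replace(0, np.nan).notna()

# diff within each CATEGORY over the valid rows; assignment aligns on the
# index, so the excluded rows get NaN automatically
df['DIFF'] = df.loc[valid, 'VALUE'].groupby(df.loc[valid, 'CATEGORY']).diff()
print(df)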
