I'll just start with the example and then break down what is happening.
This is a sample input:
DataFrame:
Name  No.      Test     Grade
Bob   2123320  Math     NaN
Joe   2832883  English  90
John  2139300  Science  85
Bob   2123320  History  93
John  2234903  Math     99
Hopeful output:
Name  2139300                  2234903
      Math  English  Science   Math  English  Science
John  0     0        85        99    0        0
Like the title suggests, I am trying to apply multiple indexes. Basically it starts by looking at each Name and counting how many distinct No.'s that name has. In this case the threshold is at least 2 distinct No.'s (which is why only John is output and Joe/Bob are not).
Now, within each of these distinct No.'s I have a specific subset of Tests I want to search for, in this case only {Math, English, Science}. For each of these tests, if the person in question took it under that No., there should be a grade, and I would like that grade to be output. For the tests the person did not take under that No., I would like some sort of simple marker (e.g. if the person only took Math on that day, output 0 for English and Science).
So in effect, it first indexes people by the number of distinct No.'s and groups them as such. It then indexes them by type of Test (for which I only want a subset). Finally, it assigns each person a value for each test they took and simply outputs a 0 for the ones they didn't.
It's similar to another problem I asked earlier:
Grouped Feature Matrix in Python #2- Follow Up
Except now instead of 1's and 0's I have another column with actual values that I would like to output.
Thank you.
EDIT: More sample input/output
Name  No.      Test     Grade
Bob   2123320  Math     NaN
Joe   2832883  English  90
John  2139300  Science  85
Bob   2123320  History  93
John  2234903  Math     99
Bob   2932848  English  99
Name  2139300          2234903          2123320          2932848
      M    E    S      M    E    S      M    E    S      M    E    S
John  0    0    85     99   0    0      NaN  NaN  NaN    NaN  NaN  NaN
Bob   NaN  NaN  NaN    NaN  NaN  NaN    86   0    0      0    99   0
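For reference, the edited sample could be built like this (assuming the missing Grade is a true NaN; column names follow the tables above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name':  ['Bob', 'Joe', 'John', 'Bob', 'John', 'Bob'],
    'No.':   [2123320, 2832883, 2139300, 2123320, 2234903, 2932848],
    'Test':  ['Math', 'English', 'Science', 'History', 'Math', 'English'],
    'Grade': [np.nan, 90, 85, 93, 99, 99],
})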
Let's start by filtering the dataframe to only those records you are concerned with (names with more than one distinct No.):
df_out = df[df.groupby(['Name'])['No.'].transform(lambda x: x.nunique() > 1)]
Now, reshape the dataframe with set_index, unstack, and reindex:
(df_out.set_index(['Name', 'No.', 'Test'])['Grade']
       .sum(level=[0, 1, 2])                                            # collapse any duplicate (Name, No., Test) rows
       .unstack(-1, fill_value=0)                                       # Test becomes the column axis
       .reindex(['Math', 'English', 'Science'], axis=1, fill_value=0)   # keep only the wanted Tests
       .unstack(-1, fill_value=0)                                       # No. becomes a second column level
       .swaplevel(0, 1, axis=1)
       .sort_index(axis=1))
Output:
No. 2123320 2139300 2234903 2932848
Test English Math Science English Math Science English Math Science English Math Science
Name
Bob 0 0 0 0 0 0 0 0 0 99 0 0
John 0 0 0 0 0 85 0 99 0 0 0 0
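Note that Series.sum(level=...) has since been removed from pandas, so on recent versions a sketch of the equivalent chain swaps in an explicit groupby:
(df_out.set_index(['Name', 'No.', 'Test'])['Grade']
       .groupby(level=[0, 1, 2]).sum()                                  # replaces .sum(level=[0, 1, 2])
       .unstack(-1, fill_value=0)
       .reindex(['Math', 'English', 'Science'], axis=1, fill_value=0)
       .unstack(-1, fill_value=0)
       .swaplevel(0, 1, axis=1)
       .sort_index(axis=1))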
You can use pivot_table:
In [11]: df.pivot_table(values="Grade", index=["Name"], columns=["No.", "Test"])
Out[11]:
No. 2123320 2139300 2234903 2832883
Test History Science Math English
Name
Bob 93.0 NaN NaN NaN
Joe NaN NaN NaN 90.0
John NaN 85.0 99.0 NaN
With the dropna flag to include all the NaN columns:
In [12]: df.pivot_table(values="Grade", index=["Name"], columns=["No.", "Test"], dropna=False)
Out[12]:
No. 2123320 2139300 2234903 2832883
Test English History Math Science English History Math Science English History Math Science English History Math Science
Name
Bob NaN 93.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Joe NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 90.0 NaN NaN NaN
John NaN NaN NaN NaN NaN NaN NaN 85.0 NaN NaN 99.0 NaN NaN NaN NaN NaN
and with fill_value=0
In [13]: df.pivot_table(values="Grade", index=["Name"], columns=["No.", "Test"], dropna=False, fill_value=0)
Out[13]:
No. 2123320 2139300 2234903 2832883
Test English History Math Science English History Math Science English History Math Science English History Math Science
Name
Bob 0 93 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Joe 0 0 0 0 0 0 0 0 0 0 0 0 90 0 0 0
John 0 0 0 0 0 0 0 85 0 0 99 0 0 0 0 0
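Neither pivot_table call applies the question's threshold of at least 2 distinct No.'s per Name; a rough sketch that combines it with the transform filter from the first answer (and restricts to the wanted Tests) might look like:
# Keep only names with 2+ distinct No.'s, pivot, then keep only the Tests of interest.
mask = df.groupby('Name')['No.'].transform('nunique') > 1
(df[mask].pivot_table(values='Grade', index='Name', columns=['No.', 'Test'],
                      dropna=False, fill_value=0)
         .reindex(['Math', 'English', 'Science'], axis=1, level='Test'))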
Related
I have a dataframe like this:
data = {"Name": ["Jason", "Jason", "Jason", "Jason", "Pat", "Amy", "Amy"]}
df = pd.DataFrame(data)
Name
0 Jason
1 Jason
2 Jason
3 Jason
4 Pat
5 Amy
6 Amy
and I need it to look like this:
   Name   Name2  Name3
0  Jason  NaN    NaN
1  Jason  NaN    NaN
2  Jason  NaN    NaN
3  Jason  NaN    NaN
4  NaN    Pat    NaN
5  NaN    NaN    Amy
6  NaN    NaN    Amy
I can manually create something in the direction I want to go, but I am not sure how to automatically create the new columns based on the count of unique values found in the "Name" column. I also need the values in the new columns to stay on the same row index. The list of unique values changes too, so hard-coding unique_names[0] won't always work. Here's what I have tried so far, but I'm stuck. Also, this is just an example for one column; the real data has about 17 similar columns with different values. Thanks
unique_names = list(set([p for p in df["Name"]]))
# ['Pat', 'Jason', 'Amy']
count = len(unique_names) # Trying to fit this somewhere to give it a count to refer to
# 3
for item in df["Name"]:
    if unique_names[0] == item:
        df["new_name"] = pd.Series(item)
This is what I get:
Name New_name
0 Jason Pat
1 Jason NaN
2 Jason NaN
3 Jason NaN
4 Pat NaN
5 Amy NaN
6 Amy NaN
We can do str.get_dummies, then mul:
import numpy as np

s = df.Name.str.get_dummies().mul(df.Name, axis=0).replace('', np.nan)
s
Out[54]:
Amy Jason Pat
0 NaN Jason NaN
1 NaN Jason NaN
2 NaN Jason NaN
3 NaN Jason NaN
4 NaN NaN Pat
5 Amy NaN NaN
6 Amy NaN NaN
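If the columns should carry the Name/Name2/Name3 labels from the desired output rather than the value names, a possible follow-up (assuming first-appearance order) is:
order = df.Name.drop_duplicates().tolist()        # ['Jason', 'Pat', 'Amy'] in first-appearance order
s = s[order]                                      # reorder the dummy columns accordingly
s.columns = ['Name'] + [f'Name{i}' for i in range(2, len(order) + 1)]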
I have the following dataframe:
data = {'Name': ['Peter | Jacker', 'John | Parcker', 'Paul | Cash', 'Tony'],
'Age': [10, 45, 14, 65]}
df = pd.DataFrame(data)
What I want to extract is the nickname (the word after the '|' character), but only for people who are more than 16 years old. For that I am using the following code:
df['nickname'] = df.apply(lambda x: x.str.split('|', 1)[-1] if x['Age'] > 16 else 0, axis=1)
However, when I print the nickname column I am only getting the following results:
Name Age nickname
Peter | Jacker 10 0.0
John | Parcker 45 NaN
Paul | Cash 14 0.0
Tony 65 NaN
And what I want is this:
Name Age nickname
Peter | Jacker 10 NaN
John | Parcker 45 Parcker
Paul | Cash 14 NaN
Tony 65 NaN
What am I doing wrong?
Use numpy.where to select the second element of the split lists where the condition matches, and add missing values (or 0, whatever you need) otherwise:
df['nickname'] = np.where(df['Age'] > 16, df['Name'].str.split('|', 1).str[1] , np.nan)
print (df)
Name Age nickname
0 Peter | Jacker 10 NaN
1 John | Parcker 45 Parcker
2 Paul | Cash 14 NaN
3 Tony 65 NaN
Apply a split function to the Name column. Try the code below:
import numpy as np
df['nickname'] = df.apply(lambda x: x['Name'].split('|', 1)[-1]
                          if x['Age'] > 16 and len(x['Name'].split('|', 1)) > 1
                          else np.nan, axis=1)
Name Age nickname
0 Peter | Jacker 10 NaN
1 John | Parcker 45 Parcker
2 Paul | Cash 14 NaN
3 Tony 65 NaN
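A vectorized alternative, not from either answer above, is a sketch with str.extract plus where, which also yields NaN when there is no '|' at all:
# Pull the text after '|' (NaN when absent), then keep it only where Age > 16.
df['nickname'] = df['Name'].str.extract(r'\|\s*(.*)', expand=False).where(df['Age'] > 16)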
I have a df as below
name 0 1 2 3 4
0 alex NaN NaN aa bb NaN
1 mike NaN rr NaN NaN NaN
2 rachel ss NaN NaN NaN ff
3 john NaN ff NaN NaN NaN
The melt function should return the below:
name code
0 alex 2
1 alex 3
2 mike 1
3 rachel 0
4 rachel 4
5 john 1
Any suggestion is helpful. Thanks.
Just follow these steps: melt, dropna, sort by name, reset the index, and finally drop any unwanted columns:
In [1171]: df.melt(['name'], var_name='code').dropna().sort_values('name').reset_index().drop(['index', 'value'], axis=1)
Out[1171]:
name code
0 alex 2
1 alex 3
2 john 1
3 mike 1
4 rachel 0
5 rachel 4
This should work. The idea is df.unstack().reset_index().dropna(); spelled out in full:
df.set_index('name').unstack().reset_index().rename(columns={'level_0': 'Code'}).dropna().drop(0, axis=1)[['name', 'Code']].sort_values('name')
The output will be:
name Code
alex 2
alex 3
john 1
mike 1
rachel 0
rachel 4
I have 2 dataframes, df1 and df2.
df1 contains information about some interactions between people.
df1
Name1 Name2
0 Jack John
1 Sarah Jack
2 Sarah Eva
3 Eva Tom
4 Eva John
df2 contains the status of people in general, including some of the people in df1:
df2
Name Y
0 Jack 0
1 John 1
2 Sarah 0
3 Tom 1
4 Laura 0
I would like df2 restricted to only the people that appear in df1 (so Laura disappears), and for those in df1 that are not in df2, keep NaN (e.g. Eva), like this:
df2
Name Y
0 Jack 0
1 John 1
2 Sarah 0
3 Tom 1
4 Eva NaN
Create a DataFrame from the unique values of df1 and map it against df2, as follows:
df = pd.DataFrame(np.unique(df1.values),columns=['Name'])
df['Y'] = df.Name.map(df2.set_index('Name')['Y'])
print(df)
Name Y
0 Eva NaN
1 Jack 0.0
2 John 1.0
3 Sarah 0.0
4 Tom 1.0
Note: Order is not preserved.
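If the order of first appearance in df1 matters, one possible variant is pd.unique, which keeps order of appearance, in place of np.unique:
names = pd.unique(df1[['Name1', 'Name2']].values.ravel())   # names in order of first appearance in df1
df = pd.DataFrame({'Name': names})
df['Y'] = df['Name'].map(df2.set_index('Name')['Y'])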
You can create a list of unique names in df1 and use isin
names = np.unique(df1[['Name1', 'Name2']].values.ravel())
df2.loc[~df2['Name'].isin(names), 'Y'] = np.nan
Name Y
0 Jack 0.0
1 John 1.0
2 Sarah 0.0
3 Tom 1.0
4 Laura NaN
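Note that this keeps Laura (with NaN) and never adds Eva; to get exactly the requested frame, one possible follow-up is a left merge from the df1 names:
# Restrict to people appearing in df1 and left-join their status from df2 (Eva stays NaN).
result = pd.DataFrame({'Name': names}).merge(df2, on='Name', how='left')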
For example, I have a DataFrame as follows.
lineNum  id   name  Cname    score
1        001  Jack  Math     99
2        002  Jack  English  110
3        003  Jack  Chinese  90
4        003  Jack  Chinese  90
5        004  Tom   Math     NaN
6        005  Tom   English  75
7        006  Tom   Chinese  85
As you can see, I want to do some data cleaning on this data.
1) Delete the duplicated rows: lines 3 and 4 are duplicates of each other.
2) Deal with the unreasonable value: in line 2, Jack's English score is 110, which is over the maximum value of 100. I want to set his score to the mean of all students' English scores.
3) Deal with the NaN value: Tom's Math score is NaN. I want to change it to the mean of all students' Math scores.
I can do each requirement on its own, but I don't know how to do all three together. Thanks!
Plan
Drop duplicates to start.
Use mask to make scores greater than 100 null.
Filter the new dataframe and group by Cname with mean.
Map the means and use them to fill the nulls.
d = df.drop_duplicates(['id', 'name', 'Cname'])   # 1) remove the duplicated row
s0 = d.score
s1 = s0.mask(s0 > 100)                            # 2) scores over 100 become NaN
m = s1.notnull()                                  # rows whose score is still valid
# 3) fill the NaNs with the per-subject mean of the valid scores
d.assign(score=s1.fillna(d.Cname.map(d[m].groupby('Cname').score.mean())))
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
You can use:
cols = ['id','name','Cname','score']
#remove duplicates by columns
df = df.drop_duplicates(subset=cols)
#replace values > 100 to NaN
df.loc[df['score'] > 100, 'score'] = np.nan
#replace NaN by mean for all students by subject
df['score'] = df.groupby('Cname')['score'].transform(lambda x: x.fillna(x.mean()))
print (df)
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
Alternative solution with mask for NaN:
cols = ['id','name','Cname','score']
df = df.drop_duplicates(subset=cols)
df['score'] = df['score'].mask(df['score'] > 100)
df['score'] = df.groupby('Cname')['score'].apply(lambda x: x.fillna(x.mean()))
print (df)
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
You should consider `.apply(func)` if the data is not too big.
import pandas as pd

df = pd.read_table('sample.txt', delimiter=r'\s+', na_values='Nan')  # Your sample data
df = df.set_index('lineNum').drop_duplicates()

def deal_with(x):
    if (x['score'] > 100.) or (pd.isnull(x['score'])):
        # use the other students' scores for the same subject
        df_ = df[df['id'] != x['id']]
        x['score'] = df_.loc[df_['Cname'] == x['Cname'], 'score'].mean()
    return x

print(df.apply(deal_with, axis=1))
id name Cname score
lineNum
1 1 Jack Math 99.0
2 2 Jack English 75.0
3 3 Jack Chinese 90.0
5 4 Tom Math 99.0
6 5 Tom English 75.0
7 6 Tom Chinese 85.0