Apply a split based on a certain condition - python

I have the following dataframe:
data = {'Name': ['Peter | Jacker', 'John | Parcker', 'Paul | Cash', 'Tony'],
'Age': [10, 45, 14, 65]}
df = pd.DataFrame(data)
What I want to extract is the nickname (the word after the '|' character), but only for people who are older than 16. For that I am using the following code:
df['nickname'] = df.apply(lambda x: x.str.split('|', 1)[-1] if x['Age'] > 16 else 0, axis=1)
However, when I print the result I only get the following:
Name Age nickname
Peter | Jacker 10 0.0
John | Parcker 45 NaN
Paul | Cash 14 0.0
Tony 65 NaN
And what I want is this:
Name Age nickname
Peter | Jacker 10 NaN
John | Parcker 45 Parcker
Paul | Cash 14 NaN
Tony 65 NaN
What am I doing wrong?

Use numpy.where: where the condition matches, take the second element of the split result (stripped of the surrounding space); otherwise insert a missing value (or 0, whichever you need):
df['nickname'] = np.where(df['Age'] > 16, df['Name'].str.split('|', n=1).str[1].str.strip(), np.nan)
print (df)
Name Age nickname
0 Peter | Jacker 10 NaN
1 John | Parcker 45 Parcker
2 Paul | Cash 14 NaN
3 Tony 65 NaN

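Apply the split to the Name column row by row, and check that the name actually contains a '|' before taking the last part. Try the code below:
import numpy as np
df['nickname'] = df.apply(lambda x: x['Name'].split('|', 1)[-1].strip() if x['Age'] > 16 and len(x['Name'].split('|', 1)) > 1 else np.nan, axis=1)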
Name Age nickname
0 Peter | Jacker 10 NaN
1 John | Parcker 45 Parcker
2 Paul | Cash 14 NaN
3 Tony 65 NaN
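A further variant, offered only as a minimal sketch and not taken from either answer: build the nickname for every row first, then blank out the rows that fail the age test with Series.where. The n=1 keyword and the .str.strip() call are assumptions added here for newer pandas versions and for the space around '|'. Because .str[1] yields NaN when a name has no '|', 'Tony' needs no special case.

```python
import pandas as pd

data = {'Name': ['Peter | Jacker', 'John | Parcker', 'Paul | Cash', 'Tony'],
        'Age': [10, 45, 14, 65]}
df = pd.DataFrame(data)

# Second piece of each split; NaN where there is no '|' in the name.
nick = df['Name'].str.split('|', n=1).str[1].str.strip()

# Keep the nickname only where Age > 16; every other row becomes NaN.
df['nickname'] = nick.where(df['Age'] > 16)
print(df)
```

This avoids a row-wise apply, which tends to matter once the frame gets large.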

switch value between column based pandas

I have the following data frame:
Name Age City Gender Country
0 Jane 23 NaN F London
1 Melissa 45 NaN F France
2 John 35 NaN M Toronto
I want to move a value from one column to another based on a condition:
if Country is equal to Toronto or London, the value should go to City instead.
I would like to have this output:
Name Age City Gender Country
0 Jane 23 London F NaN
1 Melissa 45 NaN F France
2 John 35 Toronto M NaN
How can I do this?
I would use .loc to select the rows where Country is London or Toronto and set the City column to those values. A second .loc statement then replaces London and Toronto with NaN in the Country column:
df.loc[df['Country'].isin(['London', 'Toronto']), 'City'] = df['Country']
df.loc[df['Country'].isin(['London', 'Toronto']), 'Country'] = np.nan
output:
Name Age City Gender Country
0 Jane 23 London F NaN
1 Melissa 45 NaN F France
2 John 35 Toronto M NaN
You could use np.where:
cities = ['London', 'Toronto']
df['City'] = np.where(
df['Country'].isin(cities),
df['Country'],
df['City']
)
df['Country'] = np.where(
df['Country'].isin(cities),
np.nan,
df['Country']
)
Results:
Name Age City Gender Country
0 Jane 23 London F NaN
1 Melissa 45 NaN F France
2 John 35 Toronto M NaN
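Another option is Series.mask. Assign the result back to the column, since inplace=True on a selected column is not reliable in recent pandas versions:
cond = df['Country'].isin(['London', 'Toronto'])
df['City'] = df['City'].mask(cond, df['Country'])
df['Country'] = df['Country'].mask(cond, np.nan)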
Name Age City Gender Country
0 Jane 23 London F NaN
1 Melissa 45 NaN F France
2 John 35 Toronto M NaN
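All three answers above update City first and Country second. As a hedged sketch of an alternative that is not from the answers, the two columns can be swapped in a single assignment for the matching rows. The sample frame is rebuilt from the question, and the cast of City to object dtype is an assumption added so that the all-NaN column can hold strings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jane', 'Melissa', 'John'],
    'Age': [23, 45, 35],
    'City': [np.nan, np.nan, np.nan],
    'Gender': ['F', 'F', 'M'],
    'Country': ['London', 'France', 'Toronto'],
})
# Assumption: City starts as an all-NaN float column, so make it object first.
df['City'] = df['City'].astype(object)

cond = df['Country'].isin(['London', 'Toronto'])

# to_numpy() drops the column labels, so the values are assigned by position
# and the two columns trade places on the selected rows.
df.loc[cond, ['City', 'Country']] = df.loc[cond, ['Country', 'City']].to_numpy()
print(df)
```

Without to_numpy(), pandas would align on column names and nothing would be swapped.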

Find a quantile value and then add it back to the dataframe

I have one dataframe that looks like this:
ID | Name | pay
1 John 10
1 John 20
1 John 30
I want to find the 50th percentile of pay and then add that value back to the original df.
df = np.round(users.groupby('ID')['pay'].quantile(0.50), 3).reset_index()
ID | Name | pay | 50th percentile
1 John 10 20
1 John 20 20
1 John 30 20
I've seen where you can use transform and pass it a function, but I haven't seen an example using the quantile function, so I'm not sure how to pass the quantile function to transform.
I thought of something like this: df = df.groupby('ID')['pay'].transform(quantile(0.50))
You can pass positional arguments on to the function through groupby.transform:
import pandas as pd
import io
t = '''
ID Name pay
1 John 10
1 John 20
1 John 30
2 Doris 40
2 Doris 100'''
df = pd.read_csv(io.StringIO(t), sep=r'\s+')
df['median'] = df.groupby('ID').pay.transform('quantile', .5)
df
Output
ID Name pay median
0 1 John 10 20.0
1 1 John 20 20.0
2 1 John 30 20.0
3 2 Doris 40 70.0
4 2 Doris 100 70.0
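If the string form feels opaque, the same idea can be written with a lambda, which also makes it easy to add more than one percentile. This is a minimal sketch under the assumption that the sample data matches the answer above; the column names p50 and p90 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2],
    'Name': ['John', 'John', 'John', 'Doris', 'Doris'],
    'pay': [10, 20, 30, 40, 100],
})

grouped = df.groupby('ID')['pay']

# transform broadcasts each group's scalar result back to every row of the group.
df['p50'] = grouped.transform(lambda s: s.quantile(0.50))
df['p90'] = grouped.transform(lambda s: s.quantile(0.90))
print(df)
```

Since transform returns a result with the same index as the original frame, no merge back onto df is needed.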

Multiple Indexes for Dataframe Grouping

I'll just start with the example and then break down what is happening.
This is a sample input:
DataFrame:
Name  No.      Test     Grade
Bob   2123320  Math     NaN
Joe   2832883  English  90
John  2139300  Science  85
Bob   2123320  History  93
John  2234903  Math     99
Hopeful output:
Name  2139300                 2234903
      Math  English  Science  Math  English  Science
John  0     0        85       99    0        0
Like the title suggests, I am trying to apply multiple indexes. It starts by looking at each name and counting how many distinct No.'s that name has. In this case the threshold is at least 2 distinct No.'s (which is why only John is output and Joe/Bob are not).
Within each of these distinct No.'s, I have a specific subset of Tests I want to search for, in this case only {Math, English, Science}. For each of these tests, if the person in question took it under that No., there should be a grade, and I would like that grade to be output for the test. For the tests that person did not take under that No., I would like to output a simple marker (i.e. if the person only took Math on that day, output 0 for English and Science).
So in effect, it first indexes people by the number of distinct No.'s and groups them accordingly. It then indexes them by type of Test (of which I only want a subset). Finally, it assigns each person the grade for each test they took, and outputs 0 for the ones they didn't take.
It's similar to another problem I asked earlier:
Grouped Feature Matrix in Python #2- Follow Up
Except now instead of 1's and 0's I have another column with actual values that I would like to output.
Thank you.
EDIT: More sample/Output
Name  No.      Test     Grade
Bob   2123320  Math     NaN
Joe   2832883  English  90
John  2139300  Science  85
Bob   2123320  History  93
John  2234903  Math     99
Bob   2932848  English  99
Name  2139300        2234903        2123320        2932848
      M    E    S    M    E    S    M    E    S    M    E    S
John  0    0    85   99   0    0    NaN  NaN  NaN  NaN  NaN  NaN
Bob   NaN  NaN  NaN  NaN  NaN  NaN  86   0    0    0    99   0
Let's use:
First, filter the dataframe down to the names that have more than one distinct No.:
df_out = df[df.groupby('Name')['No.'].transform('nunique') > 1]
Now reshape the dataframe with set_index, unstack, and reindex (sum(level=...) and the positional axis argument of sort_index are no longer available in recent pandas, so groupby(level=...) and axis=1 are used instead):
df_out.set_index(['Name','No.','Test'])['Grade'].groupby(level=[0,1,2]).sum()\
.unstack(-1, fill_value=0)\
.reindex(['Math','English','Science'], axis=1, fill_value=0)\
.unstack(-1, fill_value=0).swaplevel(0, 1, axis=1)\
.sort_index(axis=1)
Output:
No. 2123320 2139300 2234903 2932848
Test English Math Science English Math Science English Math Science English Math Science
Name
Bob 0 0 0 0 0 0 0 0 0 99 0 0
John 0 0 0 0 0 85 0 99 0 0 0 0
You can use pivot_table:
In [11]: df.pivot_table(values="Grade", index=["Name"], columns=["No.", "Test"])
Out[11]:
No. 2123320 2139300 2234903 2832883
Test History Science Math English
Name
Bob 93.0 NaN NaN NaN
Joe NaN NaN NaN 90.0
John NaN 85.0 99.0 NaN
With the dropna flag to include all the NaN columns:
In [12]: df.pivot_table(values="Grade", index=["Name"], columns=["No.", "Test"], dropna=False)
Out[12]:
No. 2123320 2139300 2234903 2832883
Test English History Math Science English History Math Science English History Math Science English History Math Science
Name
Bob NaN 93.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Joe NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 90.0 NaN NaN NaN
John NaN NaN NaN NaN NaN NaN NaN 85.0 NaN NaN 99.0 NaN NaN NaN NaN NaN
and with fill_value=0
In [13]: df.pivot_table(values="Grade", index=["Name"], columns=["No.", "Test"], dropna=False, fill_value=0)
Out[13]:
No. 2123320 2139300 2234903 2832883
Test English History Math Science English History Math Science English History Math Science English History Math Science
Name
Bob 0 93 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Joe 0 0 0 0 0 0 0 0 0 0 0 0 90 0 0 0
John 0 0 0 0 0 0 0 85 0 0 99 0 0 0 0 0
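The pivot_table answer does not apply the "at least 2 distinct No.'s" filter or the test subset from the question. The sketch below combines the two answers under a few assumptions: the frame is rebuilt from the EDIT sample in the question, Grade is numeric with NaN for the missing value, and the final reindex over every (No., Test) pair is an addition made here so that untaken tests show up as 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bob', 'Joe', 'John', 'Bob', 'John', 'Bob'],
    'No.': [2123320, 2832883, 2139300, 2123320, 2234903, 2932848],
    'Test': ['Math', 'English', 'Science', 'History', 'Math', 'English'],
    'Grade': [np.nan, 90, 85, 93, 99, 99],
})
tests = ['Math', 'English', 'Science']

# Keep names with more than one distinct No., and only the tests of interest.
keep = df.groupby('Name')['No.'].transform('nunique') > 1
subset = df[keep & df['Test'].isin(tests)]

wide = subset.pivot_table(values='Grade', index='Name',
                          columns=['No.', 'Test'], fill_value=0)

# Make every (No., Test) combination appear, filling absent ones with 0.
full_cols = pd.MultiIndex.from_product(
    [sorted(subset['No.'].unique()), tests], names=['No.', 'Test'])
wide = wide.reindex(columns=full_cols, fill_value=0)
print(wide)
```

Note that this sketch uses 0 both for "not taken" and for a missing grade; a different marker would be needed to tell them apart.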

Python:merge data frame with different rows

I need to merge two data frames that have different numbers of rows and no common key:
df1:
name | age | loc
Bob | 20 | USA
df2:
food | car | sports
Sushi | Toyota | soccer
meat | Ford | baseball
result I want:
name | age | loc | food | car | sports
Bob | 20 | USA | Sushi | Toyota | soccer
Bob | 20 | USA | Meat | Ford | baseball
my code below:
pd.merge(df1,df2,how='right',left_index=True,right_index=True)
It works well when df2 has more than two rows, but the result is incorrect when df2 has only one row.
Any ideas for this question?
Use reindex with the index of df2 (reindex_axis has been removed from pandas):
df1 = df1.reindex(df2.index, method='ffill')
print (df1)
name age loc
0 Bob 20 USA
1 Bob 20 USA
df = pd.merge(df1,df2,how='right',left_index=True,right_index=True)
print (df)
name age loc food car sports
0 Bob 20 USA Sushi Toyota soccer
1 Bob 20 USA meat Ford baseball
If neither df1 nor df2 contains NaN values, you can instead combine them first and then forward-fill with .ffill():
#default outer join
df = pd.concat([df1,df2], axis=1).ffill()
print (df)
name age loc food car sports
0 Bob 20.0 USA Sushi Toyota soccer
1 Bob 20.0 USA meat Ford baseball
df = pd.merge(df1,df2,how='right',left_index=True,right_index=True).ffill()
print (df)
name age loc food car sports
0 Bob 20.0 USA Sushi Toyota soccer
1 Bob 20.0 USA meat Ford baseball
Another type of solution... based on concat.
x = range(0,5)
y = range(5,10)
z = range(10,15)
a = range(10,5,-1)
b = range(15,10,-1)
v = range(0,1)
w = range(2,3)
A = pd.DataFrame(dict(x=x,y=y,z=z))
B = pd.DataFrame(dict(a=a,b=b))
C = pd.DataFrame(dict(v=v,w=w))
pd.concat([A,B],axis = 1)
x y z a b
0 0 5 10 10 15
1 1 6 11 9 14
2 2 7 12 8 13
3 3 8 13 7 12
4 4 9 14 6 11
Edit: based on the comments, the solution above does not answer the question, because in the question the two frames have different numbers of rows. Here is another solution.
It first builds a dataframe D by repeating C once for every row of B (DataFrame.append has been removed from pandas, so pd.concat is used):
n_mult = B.shape[0]
D = pd.concat([C]*n_mult, ignore_index=True)
pd.concat([D,B],axis = 1)
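For the original df1/df2 from the question, a cross join may be the most direct route, since it pairs every row of df1 with every row of df2 without any key. This is a minimal sketch, assuming pandas 1.2 or later (where how='cross' is available) and sample frames rebuilt from the question.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Bob'], 'age': [20], 'loc': ['USA']})
df2 = pd.DataFrame({'food': ['Sushi', 'meat'],
                    'car': ['Toyota', 'Ford'],
                    'sports': ['soccer', 'baseball']})

# Cartesian product: each df1 row is repeated for each df2 row.
result = df1.merge(df2, how='cross')
print(result)
```

Because nothing is forward-filled, age keeps its integer dtype, and the result is the same whether df2 has one row or many.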

How to make complex data cleaning in pandas

For example, I have a DataFrame as following.
lineNum id name Cname score
1 001 Jack Math 99
2 002 Jack English 110
3 003 Jack Chinese 90
4 003 Jack Chinese 90
5 004 Tom Math Nan
6 005 Tom English 75
7 006 Tom Chinese 85
As you can see, this data needs cleaning. I want to:
1) delete the duplicated row (line 3 and line 4 are identical).
2) deal with the unreasonable value. In line 2, Jack's English score is 110, which is over the maximum of 100. I want to set his score to the mean of all students' English scores.
3) deal with the NaN value. Tom's Math score is NaN. I want to change it to the mean of all students' Math scores.
I can handle each requirement on its own, but I don't know how to do all three together. Thanks!
Plan:
drop duplicates to start
use mask to make scores greater than 100 null
filter the new dataframe to the valid scores and group by Cname with mean
map the means onto Cname and use them to fill the nulls
d = df.drop_duplicates(['id', 'name', 'Cname'])
s0 = d.score
s1 = s0.mask(s0 > 100)
m = s1.notnull()
d.assign(score=s1.fillna(d.Cname.map(d[m].groupby('Cname').score.mean())))
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
You can use:
cols = ['id','name','Cname','score']
#remove duplicates by columns
df = df.drop_duplicates(subset=cols)
#replace values > 100 to NaN
df.loc[df['score'] > 100, 'score'] = np.nan
#replace NaN by mean for all students by subject
df['score'] = df.groupby('Cname')['score'].transform(lambda x: x.fillna(x.mean()))
print (df)
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
Alternative solution with mask for NaN:
cols = ['id','name','Cname','score']
df = df.drop_duplicates(subset=cols)
df['score'] = df['score'].mask(df['score'] > 100)
df['score'] = df.groupby('Cname', group_keys=False)['score'].apply(lambda x: x.fillna(x.mean()))
print (df)
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
You could consider `.apply(func)` if the data is not too big.
import pandas as pd
df = pd.read_table('sample.txt', delimiter=r'\s+', na_values='Nan') # Your sample data
df = df.set_index('lineNum').drop_duplicates()
def deal_with(x):
    # Replace an out-of-range or missing score with the mean of the other rows for the same subject
    if (x['score'] > 100.) or (pd.isnull(x['score'])):
        df_ = df[df['id'] != x['id']]
        x['score'] = df_.loc[df_['Cname'] == x['Cname'], 'score'].mean()
    return x
print(df.apply(deal_with, axis=1))
id name Cname score
lineNum
1 1 Jack Math 99.0
2 2 Jack English 75.0
3 3 Jack Chinese 90.0
5 4 Tom Math 99.0
6 5 Tom English 75.0
7 6 Tom Chinese 85.0
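The three steps can also be wrapped in one small function. The sketch below is only one possible arrangement: the function name clean_scores and the max_score parameter are hypothetical, and the sample frame is rebuilt from the question with NaN for the missing score.

```python
import numpy as np
import pandas as pd

def clean_scores(df, max_score=100):
    """Drop duplicate rows, null out scores above max_score, fill with subject means."""
    out = df.drop_duplicates(subset=['id', 'name', 'Cname', 'score']).copy()
    # Scores above the limit (and existing NaN) become NaN.
    out['score'] = out['score'].where(out['score'] <= max_score)
    # Per-subject mean of the remaining valid scores, aligned row by row.
    subject_mean = out.groupby('Cname')['score'].transform('mean')
    out['score'] = out['score'].fillna(subject_mean)
    return out

df = pd.DataFrame({
    'lineNum': [1, 2, 3, 4, 5, 6, 7],
    'id': ['001', '002', '003', '003', '004', '005', '006'],
    'name': ['Jack'] * 4 + ['Tom'] * 3,
    'Cname': ['Math', 'English', 'Chinese', 'Chinese', 'Math', 'English', 'Chinese'],
    'score': [99, 110, 90, 90, np.nan, 75, 85],
})
cleaned = clean_scores(df)
print(cleaned)
```

Working on a copy keeps the original frame untouched, so the function can be rerun with a different max_score.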
