I work with a dataset that occasionally has values removed before I get my hands on it. When a value is removed, it is generally replaced by NaN or ''. What is the most efficient way to collapse the values to the left?
Specifically, I'm trying to turn this:
        1        2        3        4
bill    sjd               meoip
nick             tredsn   bana
fred    ccrw     aaaa     cretwew   bbbbb
tom              eomwepo
jill    dew               weaedf
Into this:
        1        2        3        4
bill    sjd      meoip
nick    tredsn   bana
fred    ccrw     aaaa     cretwew   bbbbb
tom     eomwepo
jill    dew      weaedf
The column titles don't matter; the only thing that matters is that there are no leading empty cells and no empty cells in between.
I would prefer to do this in a non-iterative fashion, as the df can be quite large.
Try this. If the blanks are '', use mask to turn them into np.nan first; if they are already NaN, you don't need mask or fillna:
df.mask(df == '').apply(lambda x: pd.Series(x.dropna().values), axis=1).fillna('')
Output:
        0        1        2        3
bill    sjd      meoip
nick    tredsn   bana
fred    ccrw     aaaa     cretwew   bbbbb
tom     eomwepo
jill    dew      weaedf
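If the row-wise apply is too slow on a large frame, here is a fully vectorized sketch using NumPy's stable argsort, assuming the blanks are '' or NaN:

import numpy as np
import pandas as pd

def collapse_left(df):
    # Treat '' as missing, then push non-missing values to the left.
    arr = df.mask(df == '').to_numpy()
    # Stable argsort on the "is missing" mask: non-missing (False) sorts
    # first and keeps its original order within each row.
    order = np.argsort(pd.isna(arr), axis=1, kind='stable')
    out = np.take_along_axis(arr, order, axis=1)
    return pd.DataFrame(out, index=df.index, columns=df.columns).fillna('')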
I hope someone can help me resolve this. I want to label user data using python pandas. There are two columns in my dataset, author and retweeted_screen_name. I want to assign a label of 1 to every row whose retweeted_screen_name value is shared by more than one author, and 0 to rows whose value is not shared.
Author   RT_Screen_Name   Label
Alice    John             1
Sandy    John             1
Lisa     Mario            0
Luna     Mark             0
Luna     John             1
Luke     Anthony          0
df['Label']=0
df.loc[df["RT_Screen_Name"]=="John", ["Label"]] = 1
It is unclear what condition you are using to decide the label, but once your condition is clear you can swap out the conditional statement in this code. Also, if you edit your question to clarify the condition, notify me and I will adjust my answer.
IIUC, try with groupby:
df["Label"] = (df.groupby("RT_Screen_Name")["Author"].transform("count")>1).astype(int)
>>> df
Author RT_Screen_Name Label
0 Alice John 1
1 Sandy John 1
2 Lisa Mario 0
3 Luna Mark 0
4 Luna John 1
5 Luke Anthony 0
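If the same author can retweet the same screen name more than once, counting rows may overstate the group size; here is a sketch using nunique instead, assuming the intent is "retweeted by more than one distinct author":

df["Label"] = (
    df.groupby("RT_Screen_Name")["Author"]
      .transform("nunique")   # distinct authors per screen name
      .gt(1)
      .astype(int)
)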
I'm trying to delete duplicates in one column if the value in another column is null. Here's a sample dataframe:
Primary Application   Assigned To
Application 1         Jim Smith
Application 1         nan
Application 2         John Williams
Application 2         nan
Application 3         nan
Application 3         Sarah Smith
I'm trying to write a conditional that deletes the duplicate in Primary Application if the first or second value of a duplicate in Assigned To is null.
The ideal output would be:
Primary Application   Assigned To
Application 1         Jim Smith
Application 2         John Williams
Application 3         Sarah Smith
Here's what I've written so far:
df = df.groupby('Primary Application', as_index=False).apply(
lambda x: x.drop_duplicates(subset=['Primary Application'], keep='first'
if x['Assigned To'].iat[1].isnull()
else x.drop_duplicates(subset=['Primary Application'], keep='last')))
The main issue is with the if statement regarding isnull(). I've also tried using is None, which hasn't worked either.
A key thing I should have added to this question: there are NA values I do want to keep, just not ones that are duplicates with what's already been assigned.
You can pass a custom function to groupby.agg:

import numpy as np

# x is the 'Assigned To' Series for each group; keep the first non-null value
df.groupby('Primary Application', as_index=False).agg(
    lambda x: np.nan if x.isnull().all() else x.dropna().iloc[0])
Another way is via transform(): if the group size is one, keep the row; otherwise keep all the non-NaNs in the group. With a couple of additional cases to represent non-duplicate situations:
import numpy as np
import pandas as pd

d = {'Primary Application': ['Application 1', 'Application 1', 'Application 2',
                             'Application 2', 'Application 3', 'Application 3',
                             'Application 4', 'Application 5'],
     'Assigned To': ['Jim Smith', np.nan, 'John Williams', np.nan, np.nan,
                     'Sarah Smith', np.nan, 'Mark Meed']}
df = pd.DataFrame(d)
Primary Application Assigned To
0 Application 1 Jim Smith
1 Application 1 NaN
2 Application 2 John Williams
3 Application 2 NaN
4 Application 3 NaN
5 Application 3 Sarah Smith
6 Application 4 NaN
7 Application 5 Mark Meed
df[df.groupby('Primary Application') \
.transform(lambda x: (x.size==1) | (~pd.isna(x)))['Assigned To']]
Primary Application Assigned To
0 Application 1 Jim Smith
2 Application 2 John Williams
5 Application 3 Sarah Smith
6 Application 4 NaN
7 Application 5 Mark Meed
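A non-groupby sketch that should give the same result on this data: sort so non-null assignees come first within each application, then keep the first row per application. Singleton NaN rows (Application 4) survive because they are the only row for their key:

out = (df.sort_values('Assigned To', na_position='last')  # non-nulls first
         .drop_duplicates('Primary Application', keep='first')
         .sort_index())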
I have the following dataframe with firstname and lastname. I want to create a column fullname.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'firstname': ['jack', 'john', 'donald'],
                    'lastname': [np.nan, 'obrien', 'trump']})
print(df1)
firstname lastname
0 jack NaN
1 john obrien
2 donald trump
This works if there are no NaN values:
df1['fullname'] = df1['firstname']+df1['lastname']
However, since there are NaNs in my dataframe, I decided to cast to string first, but it causes a problem in the fullname column:
df1['fullname'] = str(df1['firstname'])+str(df1['lastname'])
firstname lastname fullname
0 jack NaN 0 jack\n1 john\n2 donald\nName: f...
1 john obrien 0 jack\n1 john\n2 donald\nName: f...
2 donald trump 0 jack\n1 john\n2 donald\nName: f...
I can write some function that checks for nans and inserts the data into the new frame, but before I do that - is there another fast method to combine these strings into one column?
You need to treat the NaNs using .fillna(). Here, you can fill them with '':
df1['fullname'] = df1['firstname'] + ' ' +df1['lastname'].fillna('')
Output:
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
You may also use .add and specify a fill_value:
df1.firstname.add(" ").add(df1.lastname, fill_value="")
PS: Chaining too many add calls or + operators is not recommended for strings, but for one or two columns you should be fine:
df1['fullname'] = df1['firstname']+df1['lastname'].fillna('')
There is also Series.str.cat which can handle NaN and includes the separator.
df1["fullname"] = df1["firstname"].str.cat(df1["lastname"], sep=" ", na_rep="")
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
What I would do (for the case where more than two columns need to be joined):
df1.stack().groupby(level=0).agg(' '.join)
Out[57]:
0 jack
1 john obrien
2 donald trump
dtype: object
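A related sketch for joining an arbitrary list of columns row-wise, assuming all the parts are strings: fill the NaNs, join with a space, then strip the leftover padding:

cols = ['firstname', 'lastname']
df1['fullname'] = (df1[cols].fillna('')
                            .agg(' '.join, axis=1)  # row-wise join
                            .str.strip())           # drop padding from missing parts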
I have a pandas dataframe and I would like to add a column level to split specific columns (metric_a, metric_b, metric_c) into several subcolumns based on the value of another column (param).
Current data format:
participant param metric_a metric_b metric_c
0 alice a 0,700 0,912 0,341
1 alice b 0,736 0,230 0,370
2 bob a 0,886 0,364 0,995
3 bob b 0,510 0,704 0,990
4 charlie a 0,173 0,462 0,709
5 charlie b 0,085 0,950 0,807
6 david a 0,676 0,653 0,189
7 david b 0,823 0,524 0,430
Wanted data format:
participant metric_a metric_b metric_c
a b a b a b
0 alice 0,700 0,736 0,912 0,230 0,341 0,370
1 bob 0,886 0,510 0,364 0,704 0,995 0,990
2 charlie 0,173 0,085 0,462 0,950 0,709 0,807
3 david 0,676 0,823 0,653 0,524 0,189 0,430
I have tried
df.set_index(['participant', 'param']).unstack(['param'])
which gives me a close result but does not satisfy me, as I want to keep a single-level index with participant as a regular column.
metric_a metric_b metric_c
param a b a b a b
participant
alice 0,700 0,736 0,912 0,230 0,341 0,370
bob 0,886 0,510 0,364 0,704 0,995 0,990
charlie 0,173 0,085 0,462 0,950 0,709 0,807
david 0,676 0,823 0,653 0,524 0,189 0,430
I have the intuition that the groupby() or pivot_table() functions could do the job but cannot figure out how.
IIUC, use DataFrame.set_index and unstack, then reset_index specifying the col_level parameter:
df.set_index(['participant', 'param']).unstack('param').reset_index(col_level=0)
[out]
participant metric_a metric_b metric_c
param a b a b a b
0 alice 0,700 0,736 0,912 0,230 0,341 0,370
1 bob 0,886 0,510 0,364 0,704 0,995 0,990
2 charlie 0,173 0,085 0,462 0,950 0,709 0,807
3 david 0,676 0,823 0,653 0,524 0,189 0,430
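An equivalent sketch using pivot, assuming each (participant, param) pair appears exactly once (otherwise reach for pivot_table with an aggregation function):

out = (df.pivot(index='participant', columns='param',
                values=['metric_a', 'metric_b', 'metric_c'])
         .reset_index(col_level=0))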
From a pandas DataFrame with two string columns:

import pandas as pd

d = {'SCHOOL': ['Yale', 'Yale', 'LBS', 'Harvard', 'UCLA', 'Harvard', 'HEC'],
     'NAME': ['John', 'Marc', 'Alex', 'Will', 'Will', 'Miller', 'Tom']}
df = pd.DataFrame(d)
Notice that the relationship of NAME to SCHOOL is n-to-1. I want to get the last school in case one person has gone to two different schools (see the "Will" case).
So far I got:
df = df.groupby('NAME')['SCHOOL'].unique().reset_index()
Return:
NAME SCHOOL
0 Alex [LBS]
1 John [Yale]
2 Marc [Yale]
3 Miller [Harvard]
4 Tom [HEC]
5 Will [Harvard, UCLA]
PROBLEMS:
unique() returns both schools, not only the last one.
This line returns the SCHOOL column as a np.array instead of a string, which makes the df very difficult to work with further.
Both problems were solved based on @IanS's comments.
Using last() instead of unique():
df = df.groupby('NAME')['SCHOOL'].last().reset_index()
Return:
NAME SCHOOL
0 Alex LBS
1 John Yale
2 Marc Yale
3 Miller Harvard
4 Tom HEC
5 Will UCLA
Use drop_duplicates with keep='last', specifying the column to check for duplicates:
df = df.drop_duplicates('NAME', keep='last')
print (df)
NAME SCHOOL
0 John Yale
1 Marc Yale
2 Alex LBS
4 Will UCLA
5 Miller Harvard
6 Tom HEC
Also, if you need sorting, add sort_values:
df = df.drop_duplicates('NAME', keep='last').sort_values('NAME')
print (df)
NAME SCHOOL
2 Alex LBS
0 John Yale
1 Marc Yale
5 Miller Harvard
6 Tom HEC
4 Will UCLA
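Note that both last() and drop_duplicates(keep='last') assume the row order reflects chronology. If there were an explicit date column (a hypothetical year column here), sorting by it first would make "last" well defined:

# hypothetical 'year' column: sort so each person's most recent school comes last
df_last = (df.sort_values('year')
             .drop_duplicates('NAME', keep='last'))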