Given the following sample df:
Other1 Other2 Name Value
0 0 1 Johnson C
1 0 0 Johnson NaN
2 1 1 Smith R
3 1 1 Smith NaN
4 0 1 Jackson X
5 1 1 Jackson NaN
6 1 1 Jackson NaN
I want to be able to fill the NaN values with the df['Value'] value associated with the given name in that row. My desired outcome is the following, which I know can be achieved like so:
df['Value'] = df['Value'].fillna(method='ffill')
Other1 Other2 Name Value
0 0 1 Johnson C
1 0 0 Johnson C
2 1 1 Smith R
3 1 1 Smith R
4 0 1 Jackson X
5 1 1 Jackson X
6 1 1 Jackson X
However, this solution will not achieve the desired result if the rows for a given name do not follow one another. I also cannot sort by df['Name'], as the order is important. Is there an efficient means of simply filling a given NaN value with the value associated with its name?
It's also important to note that a given Name will always only have a single value associated with it. Thank you in advance.
You should use groupby and transform:
df['Value'] = df.groupby('Name')['Value'].transform('first')
df
Other1 Other2 Name Value
0 0 1 Johnson C
1 0 0 Johnson C
2 1 1 Smith R
3 1 1 Smith R
4 0 1 Jackson X
5 1 1 Jackson X
6 1 1 Jackson X
Peter's answer is not correct because a group may start with NaN rather than a valid value, in which case ffill will pollute that group with the previous group's value.
ALollz's answer is fine, but dropna incurs some degree of overhead.
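A quick sanity check with the names deliberately interleaved (made-up rows, purely to illustrate that transform('first') does not depend on row order):
import pandas as pd
import numpy as np
tmp = pd.DataFrame({'Name': ['Johnson', 'Smith', 'Johnson', 'Smith'],
                    'Value': ['C', np.nan, np.nan, 'R']})
tmp['Value'] = tmp.groupby('Name')['Value'].transform('first')
tmp
      Name Value
0  Johnson     C
1    Smith     R
2  Johnson     C
3    Smith     R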
I have a DF of the following:
value name
A Steven
A Steven
A Ron
B Joe
B Steven
B Ana
I want to perform a value_counts() operation on the name column so that the output is a DF where the columns are the counts of each name:
value Steven Ron Joe Ana
A 2 1 0 0
B 1 0 1 1
I tried a groupby + value_counts and then transposing the results, but didn't reach this output.
This is a job for crosstab:
pd.crosstab(df.value, df.name).reset_index().rename_axis(None, axis=1)
Out[62]:
value Ana Joe Ron Steven
0 A 0 0 1 2
1 B 1 1 0 1
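If you also want the columns in the order shown in the desired output rather than alphabetically, a small follow-up sketch on the same crosstab (the explicit column list simply mirrors the question):
pd.crosstab(df['value'], df['name']).reindex(columns=['Steven', 'Ron', 'Joe', 'Ana'], fill_value=0).reset_index()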
You can do this with groupby and value_counts like this:
df.groupby('value')['name'].value_counts().unstack(fill_value=0).reset_index()
Output:
name value Ana Joe Ron Steven
0 A 0 0 1 2
1 B 1 1 0 1
I have a (101×1766) dataframe; a sample is shown below.
Index Id Brand1 Brand2 Brand3
0 1 NaN Good Bad
1 2 Bad NaN NaN
2 3 NaN NaN VeryBad
3 4 Good NaN NaN
4 5 NaN Good VeryGood
5 6 VeryBad Good NaN
What I want to achieve is a table like that
Index VeryBad Bad Good VeryGood
Brand1 1 1 0 0
Brand2 0 0 3 0
Brand3 1 1 0 1
I could not find a solution, not even a partial one.
So I hope you can help.
Let us do it in two steps: melt + crosstab
s=df.melt(['Id','Index'])
yourdf=pd.crosstab(s.variable,s.value)
yourdf
value Bad Good VeryBad VeryGood
variable
Brand1 1 1 1 0
Brand2 0 3 0 0
Brand3 1 0 1 1
Select all columns except the first with DataFrame.iloc, count the values with value_counts, replace the unmatched missing values with 0, convert to integers, transpose, and finally reorder the columns with reindex:
cols = ['VeryBad','Bad','Good','VeryGood']
df = df.iloc[:, 1:].apply(pd.value_counts).fillna(0).astype(int).T.reindex(cols, axis=1)
print (df)
VeryBad Bad Good VeryGood
Brand1 1 1 1 0
Brand2 0 0 3 0
Brand3 1 1 0 1
Here is an approach using melt and pivot_table:
(df.melt(id_vars='Id')
.pivot_table(index='variable',
columns='value',
aggfunc='count',
fill_value=0))
[out]
Id
value Bad Good VeryBad VeryGood
variable
Brand1 1 1 1 0
Brand2 0 3 0 0
Brand3 1 0 1 1
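The stray 'Id' level on the columns appears because pivot_table counts the leftover 'Id' column; if you prefer flat columns, passing values='Id' explicitly should avoid it (a small variation on the same idea):
(df.melt(id_vars='Id')
 .pivot_table(index='variable',
              columns='value',
              values='Id',
              aggfunc='count',
              fill_value=0))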
Another way is get_dummies on the transposed frame + groupby() + sum():
m=pd.get_dummies(df.set_index('Id').T)
final=m.groupby(m.columns.str.split('_').str[1],axis=1).sum()
Bad Good VeryBad VeryGood
Brand1 1 1 1 0
Brand2 0 3 0 0
Brand3 1 0 1 1
My input data:
df=pd.DataFrame({'A':['adam','monica','joe doe','michael mo'], 'B':['david','valenti',np.nan,np.nan]})
print(df)
A B
0 adam david
1 monica valenti
2 joe doe NaN
3 michael mo NaN
I need to extract the string after the space into the second column, but when I use my code...:
df['B'] = df['A'].str.extract(r'( [a-zA-Z](.*))')
print(df)
A B
0 adam NaN
1 monica NaN
2 joe doe doe
3 michael mo mo
...I receive NaN in each cell where no value was extracted. How can I avoid that?
I tried to extract only from the rows where NaN exists, using this code:
df.loc[df.B.isna(),'B'] = df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')
ValueError: Incompatible indexer with DataFrame
Expected output:
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
I think the solution can be simplified - split by whitespace, take the second element of each list, and pass it to Series.fillna:
df['B'] = df['B'].fillna(df['A'].str.split().str[1])
print (df)
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
Detail:
print (df['A'].str.split().str[1])
0 NaN
1 NaN
2 doe
3 mo
Name: A, dtype: object
Your solution would need to be changed to:
df['B'] = df['A'].str.extract(r'( [a-zA-Z](.*))')[0].fillna(df.B)
print (df)
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
A better solution changes the regex and uses expand=False to get a Series:
df['B'] = df['A'].str.extract(r'( [a-zA-Z].*)', expand=False).fillna(df.B)
print (df)
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
Detail:
print (df['A'].str.extract(r'( [a-zA-Z].*)', expand=False))
0 NaN
1 NaN
2 doe
3 mo
Name: A, dtype: object
EDIT:
To also extract the values for the first column, the simplest approach is:
df1 = df['A'].str.split(expand=True)
df['A'] = df1[0]
df['B'] = df['B'].fillna(df1[1])
print (df)
A B
0 adam david
1 monica valenti
2 joe doe
3 michael mo
Your approach doesn't work because the right and left sides of your statement have different shapes. The left part has shape (2,) and the right part (2, 2):
df.loc[df.B.isna(),'B']
Returns:
2 NaN
3 NaN
And you want to fill this with:
df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')
Returns:
0 1
2 doe oe
3 mo o
You can take a single column so that the right side has the same shape (2,) as the left part; column 0 holds the full match, which is what you want here:
df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')[0]
Returns:
2     doe
3      mo
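Putting it together, the original .loc assignment works once a single column is selected; a sketch assuming the df from the question (str.strip is added only because the regex also captures the leading space):
mask = df['B'].isna()
df.loc[mask, 'B'] = df.loc[mask, 'A'].str.extract(r'( [a-zA-Z](.*))')[0].str.strip()
df
            A        B
0        adam    david
1      monica  valenti
2     joe doe      doe
3  michael mo       mo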
I have the following wide df1:
Area geotype type ...
1 a 2 ...
1 a 1 ...
2 b 4 ...
4 b 8 ...
And the following two-column df2:
Area geotype
1 London
4 Cambridge
And I want the following:
Area geotype type ...
1 London 2 ...
1 London 1 ...
2 b 4 ...
4 Cambridge 8 ...
So I need to match based on the non-unique Area column, and only if there is a match, replace the existing values in the geotype column.
Apologies if this is a duplicate, I did actually search hard for a solution to this.
Use update + map:
df1.geotype.update(df1.Area.map(df2.set_index('Area').geotype))
Area geotype type
0 1 London 2
1 1 London 1
2 2 b 4
3 4 Cambridge 8
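For reference, the intermediate Series being passed to update looks like this (a minimal sketch, assuming df1 and df2 as in the question):
df1.Area.map(df2.set_index('Area').geotype)
0       London
1       London
2          NaN
3    Cambridge
Name: Area, dtype: object
update only overwrites geotype where this mapped Series is non-NaN, so row 2 keeps its original 'b'.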
I think you can use map with a Series created by set_index, and then fill the NaN values with combine_first or fillna:
df1.geotype = df1.ID.map(df2.set_index('ID')['geotype']).combine_first(df1.geotype)
#df1.geotype = df1.ID.map(df2.set_index('ID')['geotype']).fillna(df1.geotype)
print (df1)
ID geotype type
0 1 London 2
1 2 a 1
2 3 b 4
3 4 Cambridge 8e
Another solution with mask and numpy.in1d:
df1.geotype = df1.geotype.mask(np.in1d(df1.ID, df2.ID),
df1.ID.map(df2.set_index('ID')['geotype']))
print (df1)
ID geotype type
0 1 London 2
1 2 a 1
2 3 b 4
3 4 Cambridge 8e
EDIT by comment:
The problem is non-unique ID values in df2, e.g.:
df2 = pd.DataFrame({'ID': [1, 1, 4], 'geotype': ['London', 'Paris', 'Cambridge']})
print (df2)
ID geotype
0 1 London
1 1 Paris
2 4 Cambridge
So map cannot choose the right value and raises an error.
The solution is to remove duplicates with drop_duplicates, which by default keeps the first value:
df2 = df2.drop_duplicates('ID')
print (df2)
ID geotype
0 1 London
2 4 Cambridge
Or, if you need to keep the last value:
df2 = df2.drop_duplicates('ID', keep='last')
print (df2)
ID geotype
1 1 Paris
2 4 Cambridge
If you cannot remove the duplicates, there is another solution with an outer merge, but it produces duplicated rows wherever an ID is duplicated in df2:
df1 = pd.merge(df1, df2, on='ID', how='outer', suffixes=('_',''))
df1.geotype = df1.geotype.combine_first(df1.geotype_)
df1 = df1.drop('geotype_', axis=1)
print (df1)
ID type geotype
0 1 2 London
1 1 2 Paris
2 2 1 a
3 3 4 b
4 4 8e Cambridge
Alternative solution:
In [78]: df1.loc[df1.ID.isin(df2.ID), 'geotype'] = df1.ID.map(df2.set_index('ID').geotype)
In [79]: df1
Out[79]:
ID geotype type
0 1 London 2
1 2 a 1
2 3 b 4
3 4 Cambridge 8
UPDATE: this answers the updated question - if you have duplicates in the Area column of the df2 DF:
In [152]: df1.loc[df1.Area.isin(df2.Area), 'geotype'] = df1.Area.map(df2.set_index('Area').geotype)
...
skipped
...
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
get rid of duplicates:
In [153]: df1.loc[df1.Area.isin(df2.Area), 'geotype'] = df1.Area.map(df2.drop_duplicates(subset='Area').set_index('Area').geotype)
In [154]: df1
Out[154]:
Area geotype type
0 1 London 2
1 1 London 1
2 2 b 4
3 4 Cambridge 8
I have a df that looks like this:
Department ID Sale
1 Jim 1
1 Sue 1
1 John 1
2 Bob 0
2 Janet 0
2 Jim 0
3 John 1
3 John 1
3 Jim 1
What I am trying to do
I want to count, for each name, the number of departments in which that person has made a sale. This is somewhat confusing, so it is better illustrated with my expected output:
ID #ofDepartments
Jim 2
Sue 1
John 2
Bob 0
Janet 0
Notice that John and Jim both have a two next to their names because they both made sales within two different departments (even though John made two sales in dept 3 and one in dept 1, he only appears in two departments overall whereas Jim appeared in three departments but only made sales in two).
I am completely racking my brain how to achieve this as I have tried every possible permutation of groupby without success. Any help?
Edit: the closest I've come was using something like
df.groupby(['ID']).sum()
but that "double counts" the sales John made in department three so it makes it seem as though he has sold in three departments instead of just two
You can use DataFrame.drop_duplicates before grouping to drop duplicates based on Department and ID, then group by ID and take the sum(). Example -
df.drop_duplicates(['Department','ID']).groupby('ID')['Sale'].sum()
Demo -
In [68]: df
Out[68]:
Department ID Sale
0 1 Jim 1
1 1 Sue 1
2 1 John 1
3 2 Bob 0
4 2 Janet 0
5 3 John 1
6 3 John 1
7 3 Jim 1
8 3 Peggy 1
In [69]: df.drop_duplicates(['Department','ID']).groupby('ID')['Sale'].sum()
Out[69]:
ID
Bob 0
Janet 0
Jim 2
John 2
Peggy 1
Sue 1
Name: Sale, dtype: int64
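If you want the result shaped exactly like the expected table in the question, a hedged follow-up on the same idea (sort=False keeps the order of first appearance):
(df.drop_duplicates(['Department', 'ID'])
   .groupby('ID', sort=False)['Sale'].sum()
   .reset_index()
   .rename(columns={'Sale': '#ofDepartments'}))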