Pandas data frame spread function or similar?

Here's a pandas df:
df = pd.DataFrame({'First' : ['John', 'Jane', 'Mary'],
'Last' : ['Smith', 'Doe', 'Johnson'],
'Group' : ['A', 'B', 'A'],
'Measure' : [2, 11, 1]})
df
Out[38]:
First Last Group Measure
0 John Smith A 2
1 Jane Doe B 11
2 Mary Johnson A 1
I would like to "spread" the Group variable with the values in Measure.
df_desired
Out[39]:
First Last A B
0 John Smith 2 0
1 Jane Doe 0 11
2 Mary Johnson 1 0
Each level of the Group variable should become its own column, populated with the values from the Measure column (and 0 where there is no value). How can I achieve this?

Using pivot_table
df.pivot_table(index=['First','Last'],columns='Group',values='Measure',fill_value=0)
Out[247]:
Group A B
First Last
Jane Doe 0 11
John Smith 2 0
Mary Johnson 1 0
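The pivot_table result above is indexed by First and Last and sorted by them, while the desired output is a flat frame in the original row order. A minimal sketch of one way to get there, assuming each (First, Last) pair is unique; the names wide and result are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'First': ['John', 'Jane', 'Mary'],
                   'Last': ['Smith', 'Doe', 'Johnson'],
                   'Group': ['A', 'B', 'A'],
                   'Measure': [2, 11, 1]})

# Spread Group into columns, then turn the index levels back into columns.
wide = df.pivot_table(index=['First', 'Last'], columns='Group',
                      values='Measure', fill_value=0).reset_index()
wide.columns.name = None  # drop the leftover "Group" label on the column axis

# A left merge keeps the row order of the left frame, i.e. the original order.
result = df[['First', 'Last']].merge(wide, on=['First', 'Last'], how='left')
print(result)
```

Depending on the pandas version, the A and B columns may come back as floats; append .astype(int) to those columns if integers are required.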

If your order doesn't matter, you can do something along these lines:
df.set_index(['First','Last', 'Group']).unstack('Group').fillna(0).reset_index()
First Last Measure
Group A B
0 Jane Doe 0.0 11.0
1 John Smith 2.0 0.0
2 Mary Johnson 1.0 0.0

Related

Combine two columns of string data in one with selection rule

I have to combine two columns of string data into one (in the same DataFrame), following a selection rule. Here is an example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'nameA':['martin', 'peter', 'john', 'tom', 'bill'],
'nameB':[np.nan, np.nan, 'jhon', 'tomX', 'billX']})
df
nameA nameB
0 martin NaN
1 peter NaN
2 john jhon
3 tom tomX
4 bill billX
This output is the expected behavior
nameA nameB nameAB
0 martin NaN martin
1 peter NaN peter
2 john jhon jhon
3 tom tomX tomX
4 bill billX billX
The rule should be something like this:
if A and B are different, write B
if B is NaN, write A
if A and B are both NaN, write NaN
I have found tips for numbers but not for strings. I think I have to test row by row, get a True or False value, and then write the appropriate value.
Any guidance or assistance would be greatly appreciated!
You can use Series.combine_first():
In [1972]: df['nameAB'] = df.nameB.combine_first(df.nameA)
In [1973]: df
Out[1973]:
nameA nameB nameAB
0 martin NaN martin
1 peter NaN peter
2 john jhon jhon
3 tom tomX tomX
4 bill billX billX
Use np.where:
import pandas as pd
import numpy as np
df = pd.DataFrame({'nameA': ['martin', 'peter', 'john', 'tom', 'bill'],
'nameB': [np.nan, np.nan, 'jhon', 'tomX', 'billX']})
df['nameAB'] = np.where(pd.isna(df['nameB']), df['nameA'], df['nameB'])
print(df)
Output
nameA nameB nameAB
0 martin NaN martin
1 peter NaN peter
2 john jhon jhon
3 tom tomX tomX
4 bill billX billX
Given your conditions, nameA is returned only when nameB is NaN.
Alternatively, use Series.fillna:
df['nameAB'] = df['nameB'].fillna(df['nameA'])
print(df)
Output
nameA nameB nameAB
0 martin NaN martin
1 peter NaN peter
2 john jhon jhon
3 tom tomX tomX
4 bill billX billX
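The sample data never exercises the third rule (both values NaN). A small sketch, using a hypothetical extra row where both names are missing, suggests that all three approaches above leave NaN in that case:

```python
import numpy as np
import pandas as pd

# The last row is an assumption added for illustration: both names are missing.
df = pd.DataFrame({'nameA': ['martin', 'peter', 'john', np.nan],
                   'nameB': [np.nan, np.nan, 'jhon', np.nan]})

via_fillna = df['nameB'].fillna(df['nameA'])
via_combine = df['nameB'].combine_first(df['nameA'])
via_where = pd.Series(np.where(df['nameB'].isna(), df['nameA'], df['nameB']),
                      index=df.index)

# Each result takes nameB when present, else nameA, else stays NaN.
print(pd.DataFrame({'fillna': via_fillna,
                    'combine_first': via_combine,
                    'np.where': via_where}))
```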

Group by and value_counts - return results as columns

I have a DF of the following:
value name
A Steven
A Steven
A Ron
B Joe
B Steven
B Ana
I want to perform a value_counts() operation on the name column, so that the output is a DF whose columns hold the count of each name:
value Steven Ron Joe Ana
A 2 1 0 0
B 1 0 1 1
I tried a groupby + value_counts and then transposing the result, but didn't get the desired output.
This is what pd.crosstab does:
pd.crosstab(df.value, df.name).reset_index().rename_axis(None, axis=1)
Out[62]:
value Ana Joe Ron Steven
0 A 0 0 1 2
1 B 1 1 0 1
You can do this with groupby and value_counts like this:
df.groupby('value')['name'].value_counts().unstack(fill_value=0).reset_index()
Output:
name value Ana Joe Ron Steven
0 A 0 0 1 2
1 B 1 1 0 1
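Both answers return the name columns in alphabetical order, whereas the desired output lists them in order of first appearance (Steven, Ron, Joe, Ana). A minimal sketch of one way to restore that order, reindexing the crosstab columns with the unique names; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'value': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'name': ['Steven', 'Steven', 'Ron', 'Joe', 'Steven', 'Ana']})

counts = pd.crosstab(df['value'], df['name'])
# unique() keeps first-appearance order, so reindexing reorders the columns.
counts = counts.reindex(columns=df['name'].unique())
counts.columns.name = None  # drop the "name" label on the column axis
result = counts.reset_index()
print(result)
```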

Keep original string values after pandas str.extract() if the regex doesn't match

My input data:
df=pd.DataFrame({'A':['adam','monica','joe doe','michael mo'], 'B':['david','valenti',np.nan,np.nan]})
print(df)
A B
0 adam david
1 monica valenti
2 joe doe NaN
3 michael mo NaN
I need to extract strings after space, to a second column, but when I use my code...:
df['B'] = df['A'].str.extract(r'( [a-zA-Z](.*))')
print(df)
A B
0 adam NaN
1 monica NaN
2 joe doe doe
3 michael mo mo
...I receive NaN in each cell where value has not been extracted. How to avoid it?
I tried to extract only from rows where NaN exist using this code:
df.loc[df.B.isna(),'B'] = df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')
ValueError: Incompatible indexer with DataFrame
Expected output:
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
I think the solution can be simplified: split on whitespace, take the second element of each list, and pass the result to Series.fillna:
df['B'] = df['B'].fillna(df['A'].str.split().str[1])
print (df)
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
Detail:
print (df['A'].str.split().str[1])
0 NaN
1 NaN
2 doe
3 mo
Name: A, dtype: object
Your solution can be fixed by selecting the first capture group and filling the remaining NaN values from the original column B:
df['B'] = df['A'].str.extract(r'( [a-zA-Z](.*))')[0].fillna(df.B)
print (df)
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
A better solution uses a regex with a single capture group and expand=False, so that str.extract returns a Series:
df['B'] = df['A'].str.extract(r'( [a-zA-Z].*)', expand=False).fillna(df.B)
print (df)
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
Detail:
print (df['A'].str.extract(r'( [a-zA-Z].*)', expand=False))
0 NaN
1 NaN
2 doe
3 mo
Name: A, dtype: object
EDIT:
To also keep only the first word in column A, the simplest way is:
df1 = df['A'].str.split(expand=True)
df['A'] = df1[0]
df['B'] = df['B'].fillna(df1[1])
print (df)
A B
0 adam david
1 monica valenti
2 joe doe
3 michael mo
Your approach doesn't work because the two sides of the assignment have different shapes. The left side has shape (2,) and the right side has shape (2, 2), since the regex contains two capture groups:
df.loc[df.B.isna(),'B']
Returns:
2 NaN
3 NaN
And you want to fill this with:
df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')
Returns:
0 1
2 doe oe
3 mo o
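Selecting a single column gives the right side the same shape (2,) as the left side, so the assignment fits. Column 1 is shown here; note that it holds only the inner group, while column 0 holds the whole match with its leading space: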
df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')[1]
Returns:
2 oe
3 o
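Putting the shape explanation together with the single-group regex, the original .loc assignment can be made to work. A minimal sketch, assuming a slightly modified pattern that keeps the space outside the capture group so the extracted word has no leading space; mask is an illustrative name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['adam', 'monica', 'joe doe', 'michael mo'],
                   'B': ['david', 'valenti', np.nan, np.nan]})

mask = df['B'].isna()
# One capture group plus expand=False returns a Series, so both sides
# of the assignment have shape (2,) and align on the index.
df.loc[mask, 'B'] = df.loc[mask, 'A'].str.extract(r' ([a-zA-Z].*)', expand=False)
print(df)
```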

Pandas: how to merge two dataframes on multiple columns?

I have 2 dataframes, df1 and df2.
df1 contains information about interactions between people.
df1
Name1 Name2
0 Jack John
1 Sarah Jack
2 Sarah Eva
3 Eva Tom
4 Eva John
df2 contains the status of people in general, including some of the people in df1.
df2
Name Y
0 Jack 0
1 John 1
2 Sarah 0
3 Tom 1
4 Laura 0
I would like to keep in df2 only the people who appear in df1 (so Laura disappears), and to add the people from df1 who are missing from df2 with NaN (e.g. Eva), like this:
df2
Name Y
0 Jack 0
1 John 1
2 Sarah 0
3 Tom 1
4 Eva NaN
Create a DataFrame from the unique values of df1 and map the Y values from df2 onto it:
df = pd.DataFrame(np.unique(df1.values),columns=['Name'])
df['Y'] = df.Name.map(df2.set_index('Name')['Y'])
print(df)
Name Y
0 Eva NaN
1 Jack 0.0
2 John 1.0
3 Sarah 0.0
4 Tom 1.0
Note: the original order is not preserved, because np.unique returns sorted values.
You can create a list of the unique names in df1 and use isin:
names = np.unique(df1[['Name1', 'Name2']].values.ravel())
df2.loc[~df2['Name'].isin(names), 'Y'] = np.nan
Name Y
0 Jack 0.0
1 John 1.0
2 Sarah 0.0
3 Tom 1.0
4 Laura NaN
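Since the question asks about merging, a left merge is another option. A minimal sketch, assuming names should be listed in order of first appearance in df1 (which differs slightly from the order in the desired output); names and result are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'Name1': ['Jack', 'Sarah', 'Sarah', 'Eva', 'Eva'],
                    'Name2': ['John', 'Jack', 'Eva', 'Tom', 'John']})
df2 = pd.DataFrame({'Name': ['Jack', 'John', 'Sarah', 'Tom', 'Laura'],
                    'Y': [0, 1, 0, 1, 0]})

# pd.unique keeps first-appearance order (np.unique would sort instead).
names = pd.unique(df1[['Name1', 'Name2']].to_numpy().ravel())

# The left merge keeps every name from df1: Eva gets NaN, Laura is dropped.
result = pd.DataFrame({'Name': names}).merge(df2, on='Name', how='left')
print(result)
```

Y becomes a float column because NaN cannot be stored in a plain integer column.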

Pandas Dataframe fillna() using other known column values

Given the following sample df:
Other1 Other2 Name Value
0 0 1 Johnson C
1 0 0 Johnson NaN
2 1 1 Smith R
3 1 1 Smith NaN
4 0 1 Jackson X
5 1 1 Jackson NaN
6 1 1 Jackson NaN
I want to be able to fill the NaN values with the df['Value'] value associated with the given name in that row. My desired outcome is the following, which I know can be achieved like so:
df['Value'] = df['Value'].ffill()
Other1 Other2 Name Value
0 0 1 Johnson C
1 0 0 Johnson C
2 1 1 Smith R
3 1 1 Smith R
4 0 1 Jackson X
5 1 1 Jackson X
6 1 1 Jackson X
However, this solution will not achieve the desired result if rows with the same name are not adjacent. I also cannot sort by df['Name'], as the order is important. Is there an efficient way to fill a given NaN value with the value associated with its name?
It's also important to note that a given Name will always only have a single value associated with it. Thank you in advance.
You should use groupby and transform:
df['Value'] = df.groupby('Name')['Value'].transform('first')
df
Other1 Other2 Name Value
0 0 1 Johnson C
1 0 0 Johnson C
2 1 1 Smith R
3 1 1 Smith R
4 0 1 Jackson X
5 1 1 Jackson X
6 1 1 Jackson X
Peter's answer is not correct because the first valid value may not always be the first in the group, in which case ffill will pollute the next group with the previous group's value.
ALollz's answer is fine, but dropna incurs some degree of overhead.
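To illustrate the point about ordering, here is a small sketch of the groupby and transform approach on made-up data where the names are interleaved and the known value is not always the first row of its group. The data is an assumption for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical data: names are not adjacent, and some groups start with NaN.
df = pd.DataFrame({'Name': ['Johnson', 'Smith', 'Johnson',
                            'Jackson', 'Smith', 'Jackson'],
                   'Value': [np.nan, 'R', 'C', np.nan, np.nan, 'X']})

# 'first' picks the first non-null Value within each Name group and
# broadcasts it back to every row of that group, keeping the row order.
df['Value'] = df.groupby('Name')['Value'].transform('first')
print(df)
```

A plain forward fill on this data would leave row 0 empty and copy values across different names, which is the failure mode described above.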
