Map multiple columns using Series from another DataFrame - python

I have two DataFrames. I need to replace the text in columns B, C, and D of df1 with the corresponding values from df2['SC'], matched on df2['Title'].
df1
      A       B     C       D
   Dave   Green  Blue  Yellow
   Pete     Red   NaN     NaN
   Phil  Purple   NaN     NaN
df2
      A  ID    N  SC   Title
   Dave   1    5   2   Green
   Dave   1   10   2    Blue
   Dave   1   15   3  Yellow
   Pete   2  100   3     Red
   Phil   3  200   4  Purple
Desired output:
      A  B    C    D
   Dave  2    2    3
   Pete  3  NaN  NaN
   Phil  4  NaN  NaN

Using stack + map + unstack
df1.set_index('A').stack().map(df2.set_index('Title')['SC']).unstack()
B C D
A
Dave 2.0 2.0 3.0
Pete 3.0 NaN NaN
Phil 4.0 NaN NaN
If a column in df1 is entirely NaN it is dropped by stack and will be missing from the result. To avoid this you can reindex:
.reindex(df1.columns.drop('A'), axis=1) # append to the previous command
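For reference, a self-contained sketch of the approach above, with df1 and df2 rebuilt from the question (the variable out is just illustrative):
import pandas as pd

df1 = pd.DataFrame({'A': ['Dave', 'Pete', 'Phil'],
                    'B': ['Green', 'Red', 'Purple'],
                    'C': ['Blue', None, None],
                    'D': ['Yellow', None, None]})
df2 = pd.DataFrame({'A': ['Dave', 'Dave', 'Dave', 'Pete', 'Phil'],
                    'ID': [1, 1, 1, 2, 3],
                    'N': [5, 10, 15, 100, 200],
                    'SC': [2, 2, 3, 3, 4],
                    'Title': ['Green', 'Blue', 'Yellow', 'Red', 'Purple']})

# stack drops the NaNs, map translates Title -> SC, unstack restores the shape,
# and reindex puts back any column of df1 that was entirely NaN
out = (df1.set_index('A')
          .stack()
          .map(df2.set_index('Title')['SC'])
          .unstack()
          .reindex(df1.columns.drop('A'), axis=1))
print(out)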

Related

Merge two different dataframes

I want to merge two DataFrames, where the second one has some rows that should complete the corresponding rows in the first one.
df4 = pd.DataFrame({'a':['red','green','yellow','blue'],'b':[1,5,6,7],'c':[1,7,8,9]})
df5 = pd.DataFrame({'a':'red','b':44, 'c':55}, index=[0])
print(pd.merge(df4,df5, how='left', on='a'))
Output
a b_x c_x b_y c_y
0 red 1 1 44.0 55.0
1 green 5 7 NaN NaN
2 yellow 6 8 NaN NaN
3 blue 7 9 NaN NaN
Expected Output
a b c
0 red 44 55
1 green 5 7
2 yellow 6 8
3 blue 7 9
If the values to be completed in df4 are marked with '-', replace them with np.nan and use combine_first so that df5 fills the gaps:
df4.replace('-',np.nan,inplace=True)
df4.combine_first(df5)
prints:
a b c
0 red 44.0 55.0
1 green 5.0 7.0
2 yellow 6.0 8.0
3 blue 7.0 9.0
Concatenate and drop duplicates by column 'a'.
print(pd.concat([df5, df4]).drop_duplicates(['a'], keep='first'))
You can use DataFrame.update:
df4.update(df5)
Output:
>>> df4
a b c
0 red 44.0 55.0
1 green 5 7
2 yellow 6 8
3 blue 7 9
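A minimal, self-contained sketch of the concat and update answers on the question's data (out is just an illustrative name); combine_first additionally assumes the values to be completed in df4 are NaN, e.g. after replace('-', np.nan):
import pandas as pd

df4 = pd.DataFrame({'a': ['red', 'green', 'yellow', 'blue'],
                    'b': [1, 5, 6, 7],
                    'c': [1, 7, 8, 9]})
df5 = pd.DataFrame({'a': 'red', 'b': 44, 'c': 55}, index=[0])

# concat + drop_duplicates: df5 comes first, so its 'red' row wins
out = pd.concat([df5, df4]).drop_duplicates(['a'], keep='first').reset_index(drop=True)
print(out)

# update: overwrites df4 in place with df5's values, aligned on the index
df4.update(df5)
print(df4)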

Attempting to pivot a dataframe with only text columns - "Index contains duplicate entries, cannot reshape"

I'm having issues with pivoting the below data
index column data
0 1 A cat
1 1 B blue
2 1 C seven
3 2 A dog
4 2 B green
5 2 B red
6 2 C eight
7 2 C five
8 3 A fish
9 3 B pink
10 3 C one
I am attempting to pivot it by using
df.pivot(index='index', columns='column', values='data')
But I receive the error "Index contains duplicate entries, cannot reshape"
I have looked through a large number of similar posts, but none of the solutions I tried worked.
My desired output is
index A B C
1 cat blue seven
2 dog green eight
2 dog green five
2 dog red eight
2 dog red five
3 fish pink one
What would be the best solution for this?
In the question Pandas pivot warning about repeated entries on index it is explained that duplicate pairs (i.e. the same combination of values in the 'index' and 'column' columns) cannot be pivoted.
In your dataset, index 2 appears twice with column value B and twice with C.
Can you change the 'index' column?
See my new DataFrame as an example:
df = pd.DataFrame({'index': [1, 1, 1, 2, 2, 3, 2, 4, 3, 4, 3],
                   'column': ['A', 'B', 'C', 'A', 'B', 'B', 'C', 'C', 'A', 'B', 'C'],
                   'data': ['cat', 'blue', 'seven', 'dog', 'green', 'red',
                            'eight', 'five', 'fish', 'pink', 'one']})
df
out:
index column data
0 1 A cat
1 1 B blue
2 1 C seven
3 2 A dog
4 2 B green
5 3 B red
6 2 C eight
7 4 C five
8 3 A fish
9 4 B pink
10 3 C one
df.pivot(index='index', columns='column', values='data')
out:
column A B C
index
1 cat blue seven
2 dog green eight
3 fish red one
4 NaN pink five
Option 2
If you use set_index with append=True and then unstack:
testing = df.set_index(['index', 'column'], append=True).unstack('column')
testing
data
column A B C
index
0 1 cat NaN NaN
1 1 NaN blue NaN
2 1 NaN NaN seven
3 2 dog NaN NaN
4 2 NaN green NaN
5 2 NaN red NaN
6 2 NaN NaN eight
7 3 NaN NaN five
8 3 fish NaN NaN
9 3 NaN pink NaN
10 3 NaN NaN one
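As an illustrative check (not part of the answers above), you can list the rows whose ('index', 'column') pair is duplicated and therefore blocks df.pivot, using the question's original data:
import pandas as pd

df = pd.DataFrame({'index': [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3],
                   'column': ['A', 'B', 'C', 'A', 'B', 'B', 'C', 'C', 'A', 'B', 'C'],
                   'data': ['cat', 'blue', 'seven', 'dog', 'green', 'red',
                            'eight', 'five', 'fish', 'pink', 'one']})

# every ('index', 'column') pair that occurs more than once prevents a plain pivot
dupes = df[df.duplicated(subset=['index', 'column'], keep=False)]
print(dupes)  # the four rows for index 2 with columns B and C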

How to melt a dataframe - get the column name in a field of the melted dataframe

I have a df as below
name 0 1 2 3 4
0 alex NaN NaN aa bb NaN
1 mike NaN rr NaN NaN NaN
2 rachel ss NaN NaN NaN ff
3 john NaN ff NaN NaN NaN
melt should return the following:
name code
0 alex 2
1 alex 3
2 mike 1
3 rachel 0
4 rachel 4
5 john 1
Any suggestion is helpful, thanks.
Just follow these steps: melt, dropna, sort by name, reset the index, and finally drop the unwanted columns:
In [1171]: df.melt(['name'], var_name='code').dropna().sort_values('name').reset_index().drop(['index', 'value'], axis=1)
Out[1171]:
name code
0 alex 2
1 alex 3
2 john 1
3 mike 1
4 rachel 0
5 rachel 4
This should work.
df.unstack().reset_index().dropna()
df.set_index('name').unstack().reset_index().rename(columns={'level_0': 'Code'}).dropna().drop(0, axis=1)[['name', 'Code']].sort_values('name')
output will be
name Code
alex 2
alex 3
john 1
mike 1
rachel 0
rachel 4
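A self-contained sketch of the melt-based answer, with the frame rebuilt from the question (out is just an illustrative name):
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['alex', 'mike', 'rachel', 'john'],
                   0: [np.nan, np.nan, 'ss', np.nan],
                   1: [np.nan, 'rr', np.nan, 'ff'],
                   2: ['aa', np.nan, np.nan, np.nan],
                   3: ['bb', np.nan, np.nan, np.nan],
                   4: [np.nan, np.nan, 'ff', np.nan]})

out = (df.melt('name', var_name='code')   # wide -> long, column labels land in 'code'
         .dropna(subset=['value'])        # keep only cells that actually held a value
         .sort_values('name')
         .reset_index(drop=True)
         .drop(columns='value'))
print(out)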

Pandas: how to merge two dataframes on multiple columns?

I have 2 dataframes, df1 and df2.
df1 contains information about some interactions between people.
df1
Name1 Name2
0 Jack John
1 Sarah Jack
2 Sarah Eva
3 Eva Tom
4 Eva John
df2 contains the status of people in general, including some of the people in df1
df2
Name Y
0 Jack 0
1 John 1
2 Sarah 0
3 Tom 1
4 Laura 0
I would like df2 restricted to the people that appear in df1 (so Laura disappears), keeping NaN for those that are not in df2 (i.e. Eva), such as:
df2
Name Y
0 Jack 0
1 John 1
2 Sarah 0
3 Tom 1
4 Eva NaN
Create a DataFrame from the unique values of df1 and map it against df2:
df = pd.DataFrame(np.unique(df1.values),columns=['Name'])
df['Y'] = df.Name.map(df2.set_index('Name')['Y'])
print(df)
Name Y
0 Eva NaN
1 Jack 0.0
2 John 1.0
3 Sarah 0.0
4 Tom 1.0
Note: the original order is not preserved.
You can create an array of the unique names in df1 and use isin:
names = np.unique(df1[['Name1', 'Name2']].values.ravel())
df2.loc[~df2['Name'].isin(names), 'Y'] = np.nan
Name Y
0 Jack 0.0
1 John 1.0
2 Sarah 0.0
3 Tom 1.0
4 Laura NaN
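A runnable version of the first answer, with both frames rebuilt from the question (names is just an illustrative variable; note the result is sorted by name):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name1': ['Jack', 'Sarah', 'Sarah', 'Eva', 'Eva'],
                    'Name2': ['John', 'Jack', 'Eva', 'Tom', 'John']})
df2 = pd.DataFrame({'Name': ['Jack', 'John', 'Sarah', 'Tom', 'Laura'],
                    'Y': [0, 1, 0, 1, 0]})

# every name appearing anywhere in df1, once each
names = pd.DataFrame(np.unique(df1.values), columns=['Name'])
# look up Y in df2; names missing from df2 (Eva) stay NaN, Laura is never looked up
names['Y'] = names['Name'].map(df2.set_index('Name')['Y'])
print(names)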

Python Average of Multiple Columns and Rows

How do I group by two columns in a dataframe and specify other columns for which I want an overall average?
Data
name team a b c d
Bob blue 2 4 3 5
Bob blue 2 4 3 4
Bob blue 1 5 3 4
Bob green 1 3 2 5
Bob green 1 2 1 1
Bob green 1 2 1 4
Bob green 5 2 2 1
Jane red 1 2 2 3
Jane red 3 3 3 4
Jane red 2 5 1 2
Jane red 4 5 5 3
Desired Output
name team avg
Bob blue 3.333333333
Bob green 2.125
Jane red 3
You can take the mean twice :-)
df.groupby(['name','team']).mean().mean(axis=1)
Out[1263]:
name team
Bob blue 3.333333
green 2.125000
Jane red 3.000000
dtype: float64
You need to set the index as the grouping columns and stack the remaining columns:
df.set_index(['name', 'team']).stack().groupby(level=[0, 1]).mean()
Out:
name team
Bob blue 3.333333
green 2.125000
Jane red 3.000000
dtype: float64
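For completeness, a self-contained version of the first answer, with the table rebuilt from the question (avg is just an illustrative name):
import pandas as pd

df = pd.DataFrame({'name': ['Bob'] * 7 + ['Jane'] * 4,
                   'team': ['blue'] * 3 + ['green'] * 4 + ['red'] * 4,
                   'a': [2, 2, 1, 1, 1, 1, 5, 1, 3, 2, 4],
                   'b': [4, 4, 5, 3, 2, 2, 2, 2, 3, 5, 5],
                   'c': [3, 3, 3, 2, 1, 1, 2, 2, 3, 1, 5],
                   'd': [5, 4, 4, 5, 1, 4, 1, 3, 4, 2, 3]})

# mean of columns a..d per (name, team), then the mean across those four column means
avg = df.groupby(['name', 'team']).mean().mean(axis=1)
print(avg)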
