I have 2 dataframes, df1 and df2.
df1 contains information about some interactions between people.
df1
Name1 Name2
0 Jack John
1 Sarah Jack
2 Sarah Eva
3 Eva Tom
4 Eva John
df2 contains the status of people in general, including some of the people in df1.
df2
Name Y
0 Jack 0
1 John 1
2 Sarah 0
3 Tom 1
4 Laura 0
I would like to keep df2 only for the people that appear in df1 (Laura disappears), and to get NaN for those who are not in df2 (i.e. Eva), such as:
df2
Name Y
0 Jack 0
1 John 1
2 Sarah 0
3 Tom 1
4 Eva NaN
Create a DataFrame from the unique values of df1 and map it against df2:
df = pd.DataFrame(np.unique(df1.values),columns=['Name'])
df['Y'] = df.Name.map(df2.set_index('Name')['Y'])
print(df)
Name Y
0 Eva NaN
1 Jack 0.0
2 John 1.0
3 Sarah 0.0
4 Tom 1.0
Note: the original order is not preserved, because np.unique returns the names sorted.
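If the order of first appearance in df1 matters, one option is pd.unique, which does not sort. A minimal sketch, rebuilding the question's sample frames so it runs on its own; the variable names `names` and `result` are only illustrative, and the resulting order is that of df1, not exactly the layout shown in the question:

```python
import pandas as pd

df1 = pd.DataFrame({"Name1": ["Jack", "Sarah", "Sarah", "Eva", "Eva"],
                    "Name2": ["John", "Jack", "Eva", "Tom", "John"]})
df2 = pd.DataFrame({"Name": ["Jack", "John", "Sarah", "Tom", "Laura"],
                    "Y": [0, 1, 0, 1, 0]})

# ravel() walks the frame row by row; pd.unique keeps first-appearance order
names = pd.unique(df1[["Name1", "Name2"]].to_numpy().ravel())

result = pd.DataFrame({"Name": names})
# Names missing from df2 (Eva) get NaN, which also turns Y into floats
result["Y"] = result["Name"].map(df2.set_index("Name")["Y"])
print(result)
```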
You can create a list of the unique names in df1 and use isin to set Y to NaN for everyone in df2 who is not in df1:
names = np.unique(df1[['Name1', 'Name2']].values.ravel())
df2.loc[~df2['Name'].isin(names), 'Y'] = np.nan
Name Y
0 Jack 0.0
1 John 1.0
2 Sarah 0.0
3 Tom 1.0
4 Laura NaN
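Note that the output above still contains Laura and has no row for Eva. To get closer to the layout asked for in the question, one possible sketch is to filter df2 with isin and then reindex, so that the missing names are appended with NaN. The helper names (`kept`, `missing`, `order`) are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"Name1": ["Jack", "Sarah", "Sarah", "Eva", "Eva"],
                    "Name2": ["John", "Jack", "Eva", "Tom", "John"]})
df2 = pd.DataFrame({"Name": ["Jack", "John", "Sarah", "Tom", "Laura"],
                    "Y": [0, 1, 0, 1, 0]})

names = pd.unique(df1[["Name1", "Name2"]].to_numpy().ravel())

# Keep only the df2 rows whose Name occurs in df1 (this drops Laura)
kept = df2[df2["Name"].isin(names)]

# Names that occur in df1 but have no row in df2 (here: Eva)
present = set(kept["Name"])
missing = [n for n in names if n not in present]

# Reindex on Name: existing rows keep their Y, appended names get NaN
order = list(kept["Name"]) + missing
result = (kept.set_index("Name")
              .reindex(order)
              .rename_axis("Name")
              .reset_index())
print(result)
```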
Related
I have two lists of unequal length:
Name = ['Tom', 'Jack', 'Nick', 'Juli', 'Harry']
bId= list(range(0,3))
I want to build a data frame that would look like below:
'Name' 'bId'
Tom 0
Tom 1
Tom 2
Jack 0
Jack 1
Jack 2
Nick 0
Nick 1
Nick 2
Juli 0
Juli 1
Juli 2
Harry 0
Harry 1
Harry 2
Please suggest.
Use itertools.product with DataFrame constructor:
from itertools import product
df = pd.DataFrame(product(Name, bId), columns=['Name','bId'])
print (df)
Name bId
0 Tom 0
1 Tom 1
2 Tom 2
3 Jack 0
4 Jack 1
5 Jack 2
6 Nick 0
7 Nick 1
8 Nick 2
9 Juli 0
10 Juli 1
11 Juli 2
12 Harry 0
13 Harry 1
14 Harry 2
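A pandas-only alternative, if you would rather not import itertools, is to build the Cartesian product as a MultiIndex and turn it into columns. This is just a sketch of the same idea, using the two lists from the question:

```python
import pandas as pd

Name = ['Tom', 'Jack', 'Nick', 'Juli', 'Harry']
bId = list(range(0, 3))

# from_product enumerates every (Name, bId) combination, Name varying slowest;
# to_frame(index=False) turns the index levels into ordinary columns
df = pd.MultiIndex.from_product([Name, bId], names=['Name', 'bId']).to_frame(index=False)
print(df)
```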
I have a df as below
name 0 1 2 3 4
0 alex NaN NaN aa bb NaN
1 mike NaN rr NaN NaN NaN
2 rachel ss NaN NaN NaN ff
3 john NaN ff NaN NaN NaN
the melt function should return the below
name code
0 alex 2
1 alex 3
2 mike 1
3 rachel 0
4 rachel 4
5 john 1
Any suggestion is helpful. thanks.
Just follow these steps: melt, dropna, sort column name, reset index, and finally drop any unwanted columns
In [1171]: df.melt(['name'],var_name='code').dropna().sort_values('name').reset_index().drop(['index', 'value'], axis=1)
Out[1171]:
name code
0 alex 2
1 alex 3
2 john 1
3 mike 1
4 rachel 0
5 rachel 4
This should work.
df.unstack().reset_index().dropna()
df.set_index('name').unstack().reset_index().rename(columns={'level_0':'Code'}).dropna().drop(0,axis =1)[['name','Code']].sort_values('name')
output will be
name Code
alex 2
alex 3
john 1
mike 1
rachel 0
rachel 4
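Both answers sort by name, so john ends up before mike, unlike the expected output in the question. If the original row order should be kept, one possible sketch is to melt with ignore_index=False (available in pandas 1.1 and later, as far as I know) and then do a stable sort on the original row index. The sample frame is rebuilt from the question; `long` and `out` are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["alex", "mike", "rachel", "john"],
    0: [np.nan, np.nan, "ss", np.nan],
    1: [np.nan, "rr", np.nan, "ff"],
    2: ["aa", np.nan, np.nan, np.nan],
    3: ["bb", np.nan, np.nan, np.nan],
    4: [np.nan, np.nan, "ff", np.nan],
})

# ignore_index=False keeps the original row label on every melted row
long = df.melt(id_vars="name", var_name="code", ignore_index=False)
long = long.dropna(subset=["value"])

# mergesort is stable, so codes stay in ascending order within each row
long = long.sort_index(kind="mergesort")

out = long[["name", "code"]].reset_index(drop=True)
print(out)
```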
I have two dataframes df1 and df2
df1
Name1 Name2
0 John Jack
1 Eva Tom
2 Eva Sara
3 Carl Sam
4 Sam Erin
df2 Name Money
0 John 40
1 Eva 20
2 Jack 10
3 Tom 80
4 Sara 34
5 Carl 77
6 Erin 12
I would like to merge the two dataframes and get:
df1
Name1 Name2 Money1 Money2
0 John Jack 40 10
1 Eva Tom 20 80
2 Eva Sara 20 34
3 Carl Sam 77 NaN
4 Sam Erin NaN 12
This is what I am doing, but I think it is not the best solution:
df1 = pd.merge(df1, df2, left_on='Name1', right_on='Name', how='left').drop(columns='Name')
df1.columns = ['Name1', 'Name2', 'Money1']
df1 = pd.merge(df1, df2, left_on='Name2', right_on='Name', how='left').drop(columns='Name')
df1.columns = ['Name1', 'Name2', 'Money1', 'Money2']
Using map with apply
df1[['Money1','Money2']]=df1.apply(lambda x : x.map(df2.set_index('Name').Money))
df1
Out[293]:
Name1 Name2 Money1 Money2
0 John Jack 40.0 10.0
1 Eva Tom 20.0 80.0
2 Eva Sara 20.0 34.0
3 Carl Sam 77.0 NaN
4 Sam Erin NaN 12.0
You can use index matching without the need to apply
assign
df = df.set_index('Name1').assign(Money_1=df2.set_index('Name').Money).reset_index().set_index('Name2').assign(Money_2=df2.set_index('Name').Money).reset_index()
This is technically a one-liner, but a long one. The other option is to write the steps out explicitly:
loc
df = df.set_index('Name1')
df.loc[:, 'Money_1'] = df2.set_index('Name').Money
df = df.reset_index().set_index('Name2')
df.loc[:, 'Money_2'] = df2.set_index('Name').Money
df = df.reset_index()
Both outputs
Name1 Name2 Money_1 Money_2
0 John Jack 40.0 10.0
1 Eva Tom 20.0 80.0
2 Eva Sara 20.0 34.0
3 Carl Sam 77.0 NaN
4 Sam Erin NaN 12.0
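For comparison, here is a sketch that maps each name column separately against a lookup Series. It leaves the column order of df1 untouched and does not depend on df1 containing only the two name columns. The frames are rebuilt from the question; the `money` name is only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({"Name1": ["John", "Eva", "Eva", "Carl", "Sam"],
                    "Name2": ["Jack", "Tom", "Sara", "Sam", "Erin"]})
df2 = pd.DataFrame({"Name": ["John", "Eva", "Jack", "Tom", "Sara", "Carl", "Erin"],
                    "Money": [40, 20, 10, 80, 34, 77, 12]})

# Lookup Series: Name -> Money
money = df2.set_index("Name")["Money"]

# Map each name column; names without an entry in df2 (Sam) become NaN
for i in (1, 2):
    df1[f"Money{i}"] = df1[f"Name{i}"].map(money)

print(df1)
```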
I have two dataframes df and df1. df contains name and attributes of people.
df Name Age
0 Jack 33
1 Anna 25
2 Emilie 49
3 Frank 19
4 John 42
while df1 contains the info of the number of contacts between two people. In df1 we can have some people that don't appear in df.
df1 Name1 Name2 c
0 Frank Paul 2
1 Julia Anna 5
2 Frank John 1
3 Emilie Jack 3
4 Tom Steven 2
5 Tom Jack 5
I want to drop all the rows from df1 in which Name1 or Name2 doesn't appear in df.
df1 Name1 Name2 c
0 Frank John 1
1 Emilie Jack 3
Use isin -
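# pass a list: DataFrame.isin matches a Series argument on its index, not just its values
df1[df1[['Name1', 'Name2']].isin(df.Name.tolist()).all(axis=1)]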
# Name1 Name2 c
#2 Frank John 1
#3 Emilie Jack 3
Or:
df1[df1.Name1.isin(df.Name) & df1.Name2.isin(df.Name)]
# Name1 Name2 c
#2 Frank John 1
#3 Emilie Jack 3
You can also use np.isin:
df1[np.isin(df1.Name1, df.Name) &
np.isin(df1.Name2, df.Name)]
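A self-contained sketch of the DataFrame-level variant, with the sample frames rebuilt from the question. The names are converted to a plain list first, because DataFrame.isin compares a Series argument label by label on the index rather than by membership; `known`, `mask` and `result` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Jack", "Anna", "Emilie", "Frank", "John"],
                   "Age": [33, 25, 49, 19, 42]})
df1 = pd.DataFrame({"Name1": ["Frank", "Julia", "Frank", "Emilie", "Tom", "Tom"],
                    "Name2": ["Paul", "Anna", "John", "Jack", "Steven", "Jack"],
                    "c": [2, 5, 1, 3, 2, 5]})

# A plain list gives membership semantics for DataFrame.isin
known = df["Name"].tolist()

# True only for rows where both Name1 and Name2 are known
mask = df1[["Name1", "Name2"]].isin(known).all(axis=1)

result = df1[mask].reset_index(drop=True)
print(result)
```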
Here's my dataframe:
user1 user2 cat quantity + other quantities
----------------------------------------------------
Alice Bob 0 ....
Alice Bob 1 ....
Alice Bob 2 ....
Alice Carol 0 ....
Alice Carol 2 ....
I want to make sure that every user1-user2 pair has a row corresponding to each category (there are three: 0,1,2). If not, I want to insert a row, and set the other columns to zero.
user1 user2 cat quantity + other quantities
----------------------------------------------------
Alice Bob 0 ....
Alice Bob 1 ....
Alice Bob 2 ....
Alice Carol 0 ....
Alice Carol 1 <SET ALL TO ZERO>
Alice Carol 2 ....
What I have so far is the list of all user1-user2 pairs that have fewer than 3 values for cat:
df.groupby(['user1','user2']).agg({'cat':'count'}).reset_index()[['user1','user2']]
I could iterate over these users, but that will take a long time (there are >1M such pairs). I've looked at other solutions for inserting rows in pandas based on some condition (like Pandas/Python adding row based on condition and Insert row in Pandas Dataframe based on a condition), but they're not exactly the same.
Also, since this is a huge dataset, the solution has to be vectorized. How should I proceed?
Use set_index with reindex by MultiIndex.from_product:
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2 4
1 Alice Bob 1 3 4
2 Alice Bob 2 4 4
3 Alice Carol 0 6 4
4 Alice Carol 2 3 4
df = df.set_index(['user1','user2', 'cat'])
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2 4
1 Alice Bob 1 3 4
2 Alice Bob 2 4 4
3 Alice Carol 0 6 4
4 Alice Carol 1 0 0
5 Alice Carol 2 3 4
Another solution is to create a new DataFrame from all combinations of the unique column values and merge it with a right join:
from itertools import product
df1 = pd.DataFrame(list(product(df['user1'].unique(),
df['user2'].unique(),
df['cat'].unique())), columns=['user1','user2', 'cat'])
df = df.merge(df1, how='right').fillna(0)
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2.0 4.0
1 Alice Bob 1 3.0 4.0
2 Alice Bob 2 4.0 4.0
3 Alice Carol 0 6.0 4.0
4 Alice Carol 2 3.0 4.0
5 Alice Carol 1 0.0 0.0
EDIT2:
df['user1'] = df['user1'] + '_' + df['user2']
df = df.set_index(['user1', 'cat']).drop('user2', axis=1)
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
df[['user1','user2']] = df['user1'].str.split('_', expand=True)
print (df)
user1 cat quantity a user2
0 Alice 0 2 4 Bob
1 Alice 1 3 4 Bob
2 Alice 2 4 4 Bob
3 Alice 0 6 4 Carol
4 Alice 1 0 0 Carol
5 Alice 2 3 4 Carol
EDIT3:
cols = df.columns.difference(['user1','user2'])
df = (df.groupby(['user1','user2'])[cols]
.apply(lambda x: x.set_index('cat').reindex(df['cat'].unique(), fill_value=0))
.reset_index())
print (df)
user1 user2 cat a quantity
0 Alice Bob 0 4 2
1 Alice Bob 1 4 3
2 Alice Bob 2 4 4
3 Alice Carol 0 4 6
4 Alice Carol 1 0 0
5 Alice Carol 2 4 3
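Another way to fill in only the user1-user2 pairs that actually occur, without concatenating strings or using groupby.apply, might be a cross merge of the distinct pairs with the category list, followed by a left merge back onto the data. This is a sketch under a few assumptions: how="cross" needs pandas 1.2 or later, the category values 0, 1, 2 are taken from the question, and the names `pairs`, `cats`, `full` and `out` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "user1": ["Alice"] * 5,
    "user2": ["Bob", "Bob", "Bob", "Carol", "Carol"],
    "cat": [0, 1, 2, 0, 2],
    "quantity": [2, 3, 4, 6, 3],
    "a": [4, 4, 4, 4, 4],
})
keys = ["user1", "user2", "cat"]
value_cols = ["quantity", "a"]

# Distinct pairs that really exist, crossed with every category
pairs = df[["user1", "user2"]].drop_duplicates()
cats = pd.DataFrame({"cat": [0, 1, 2]})
full = pairs.merge(cats, how="cross")

# Left merge keeps every (pair, cat) row; missing combinations come back as NaN
out = full.merge(df, on=keys, how="left")

# Fill only the value columns with zero and restore the integer dtype
out = (out.fillna({c: 0 for c in value_cols})
          .astype({c: int for c in value_cols}))
print(out)
```

Because the grid is built from existing pairs only, a pair such as (user1, user2) that never occurs in the data is not created.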