Insert rows in pandas where one column is missing some value within a groupby - python

Here's my dataframe:
user1 user2 cat quantity + other quantities
----------------------------------------------------
Alice Bob 0 ....
Alice Bob 1 ....
Alice Bob 2 ....
Alice Carol 0 ....
Alice Carol 2 ....
I want to make sure that every user1-user2 pair has a row corresponding to each category (there are three: 0,1,2). If not, I want to insert a row, and set the other columns to zero.
user1 user2 cat quantity + other quantities
----------------------------------------------------
Alice Bob 0 ....
Alice Bob 1 ....
Alice Bob 2 ....
Alice Carol 0 ....
Alice Carol 1 <SET ALL TO ZERO>
Alice Carol 2 ....
What I have so far is the list of all user1-user2 pairs that have fewer than 3 values for cat:
df.groupby(['user1','user2']).agg({'cat':'count'}).reset_index().query('cat < 3')[['user1','user2']]
I could iterate over these pairs, but that would take a long time (there are >1M such pairs). I've looked at other solutions for inserting rows in pandas based on some condition (like "Pandas/Python adding row based on condition" and "Insert row in Pandas Dataframe based on a condition"), but they're not quite the same.
Also, since this is a huge dataset, the solution has to be vectorized. How should I proceed?

Use set_index with reindex by MultiIndex.from_product:
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2 4
1 Alice Bob 1 3 4
2 Alice Bob 2 4 4
3 Alice Carol 0 6 4
4 Alice Carol 2 3 4
df = df.set_index(['user1','user2', 'cat'])
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2 4
1 Alice Bob 1 3 4
2 Alice Bob 2 4 4
3 Alice Carol 0 6 4
4 Alice Carol 1 0 0
5 Alice Carol 2 3 4
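A rough equivalent of the same idea, as a sketch rather than the answer's own code, is unstack/stack. Unlike from_product over both user levels, it only completes categories within (user1, user2) pairs that already occur:
# Sketch: unstacking 'cat' exposes the missing categories as columns,
# fill_value=0 repairs them, and stacking restores the long shape.
# Only pairs already present in the data get rows here.
out = (df.set_index(['user1', 'user2', 'cat'])
         .unstack('cat', fill_value=0)
         .stack('cat')
         .reset_index())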
Another solution is to create a new DataFrame from all combinations of the unique column values and merge with a right join:
from itertools import product

df1 = pd.DataFrame(list(product(df['user1'].unique(),
                                df['user2'].unique(),
                                df['cat'].unique())),
                   columns=['user1', 'user2', 'cat'])
df = df.merge(df1, how='right').fillna(0)
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2.0 4.0
1 Alice Bob 1 3.0 4.0
2 Alice Bob 2 4.0 4.0
3 Alice Carol 0 6.0 4.0
4 Alice Carol 2 3.0 4.0
5 Alice Carol 1 0.0 0.0
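Note that fillna upcasts the filled columns to float (the .0 values above). A variant sketch, assuming pandas >= 1.2 for how='cross', that only scaffolds pairs that actually occur and restores integer dtypes:
# Sketch: cross-join observed pairs with the unique categories, so no
# never-seen user1-user2 pair is invented by the scaffold.
pairs = df[['user1', 'user2']].drop_duplicates()
cats = pd.DataFrame({'cat': df['cat'].unique()})
scaffold = pairs.merge(cats, how='cross')  # pandas >= 1.2
out = scaffold.merge(df, how='left').fillna(0)
out[['quantity', 'a']] = out[['quantity', 'a']].astype(int)  # undo float upcast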
EDIT2: To avoid creating user1-user2 pairs that never occur together, fuse the pair into a single key so from_product only expands cat:
df['user1'] = df['user1'] + '_' + df['user2']
df = df.set_index(['user1', 'cat']).drop(columns='user2')
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
df[['user1','user2']] = df['user1'].str.split('_', expand=True)
print (df)
user1 cat quantity a user2
0 Alice 0 2 4 Bob
1 Alice 1 3 4 Bob
2 Alice 2 4 4 Bob
3 Alice 0 6 4 Carol
4 Alice 1 0 0 Carol
5 Alice 2 3 4 Carol
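One caveat: the '_' join breaks if names themselves contain underscores. A hypothetical variant that keeps the pair as a tuple key instead:
# Sketch: tuple keys avoid any separator-collision problem.
df['pair'] = list(zip(df['user1'], df['user2']))
df = df.set_index(['pair', 'cat']).drop(columns=['user1', 'user2'])
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
df[['user1', 'user2']] = pd.DataFrame(df['pair'].tolist(), index=df.index)
df = df.drop(columns='pair')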
EDIT3: A groupby-based alternative that likewise only fills the missing categories within observed pairs:
cols = df.columns.difference(['user1','user2'])
df = (df.groupby(['user1','user2'])[cols]
        .apply(lambda x: x.set_index('cat').reindex(df['cat'].unique(), fill_value=0))
        .reset_index())
print (df)
user1 user2 cat a quantity
0 Alice Bob 0 4 2
1 Alice Bob 1 4 3
2 Alice Bob 2 4 4
3 Alice Carol 0 4 6
4 Alice Carol 1 0 0
5 Alice Carol 2 4 3

Related

Merging two dataframes while changing the order of the second dataframe each time

df is like so:
Week Name
1 TOM
1 BEN
1 CARL
2 TOM
2 BEN
2 CARL
3 TOM
3 BEN
3 CARL
and df1 is like so:
ID Letter
1 A
2 B
3 C
I want to merge the two dataframes so that each name is assigned a different letter each time. So the result should be like this:
Week Name Letter
1 TOM A
1 BEN B
1 CARL C
2 TOM B
2 BEN C
2 CARL A
3 TOM C
3 BEN A
3 CARL B
Any help would be greatly appreciated. Thanks in advance.
The letter position is the name's rank within the week plus the week offset, wrapped around by the group size, then mapped back to df1's letters (the original answer used the names df1/df2; renamed here to match the question's df/df1):
df['Letter'] = (df.groupby('Week').cumcount()
                  .add(df['Week'].sub(1))
                  .mod(df.groupby('Week')['Name'].transform('count'))
                  .map(df1['Letter']))
Output:
>>> df
Week Name Letter
0 1 TOM A
1 1 BEN B
2 1 CARL C
3 2 TOM B
4 2 BEN C
5 2 CARL A
6 3 TOM C
7 3 BEN A
8 3 CARL B
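A more explicit alternative, as a sketch assuming the rows are sorted by Week and every week lists the names in the same order:
import numpy as np

# Rotate the letters one extra position per week, then lay them out in order.
letters = df1['Letter'].to_numpy()
df['Letter'] = np.concatenate(
    [np.roll(letters, -(week - 1)) for week in sorted(df['Week'].unique())]
)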

Adding rows to a column for every element in the list for every unique value in another column in python pandas

I have two lists of unequal length:
Name = ['Tom', 'Jack', 'Nick', 'Juli', 'Harry']
bId= list(range(0,3))
I want to build a data frame that would look like below:
'Name' 'bId'
Tom 0
Tom 1
Tom 2
Jack 0
Jack 1
Jack 2
Nick 0
Nick 1
Nick 2
Juli 0
Juli 1
Juli 2
Harry 0
Harry 1
Harry 2
Please suggest.
Use itertools.product with DataFrame constructor:
from itertools import product
df = pd.DataFrame(product(Name, bId), columns=['Name','bId'])
print (df)
Name bId
0 Tom 0
1 Tom 1
2 Tom 2
3 Jack 0
4 Jack 1
5 Jack 2
6 Nick 0
7 Nick 1
8 Nick 2
9 Juli 0
10 Juli 1
11 Juli 2
12 Harry 0
13 Harry 1
14 Harry 2
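The same frame can also be built without itertools via a pandas cross join (pandas >= 1.2):
# Sketch: cross-merge the two columns directly.
df = pd.DataFrame({'Name': Name}).merge(pd.DataFrame({'bId': bId}), how='cross')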

Pandas: how to merge two dataframes on multiple columns?

I have 2 dataframes, df1 and df2.
df1 Contains the information of some interactions between people.
df1
Name1 Name2
0 Jack John
1 Sarah Jack
2 Sarah Eva
3 Eva Tom
4 Eva John
df2 contains the status of people in general, including some of the people in df1
df2
Name Y
0 Jack 0
1 John 1
2 Sarah 0
3 Tom 1
4 Laura 0
I would like df2 restricted to the people that appear in df1 (so Laura disappears), keeping NaN for those in df1 but missing from df2 (i.e. Eva), like this:
df2
Name Y
0 Jack 0
1 John 1
2 Sarah 0
3 Tom 1
4 Eva NaN
Create a DataFrame from the unique values of df1 and map it against df2:
import numpy as np

df = pd.DataFrame(np.unique(df1.values), columns=['Name'])
df['Y'] = df.Name.map(df2.set_index('Name')['Y'])
print(df)
Name Y
0 Eva NaN
1 Jack 0.0
2 John 1.0
3 Sarah 0.0
4 Tom 1.0
Note: Order is not preserved.
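If first-appearance order matters, a sketch using pd.unique, which keeps order where np.unique sorts:
# Sketch: pd.unique preserves first-appearance order.
names = pd.unique(df1[['Name1', 'Name2']].to_numpy().ravel())
out = pd.DataFrame({'Name': names})
out['Y'] = out['Name'].map(df2.set_index('Name')['Y'])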
You can create an array of unique names in df1 and use isin (note this keeps Laura with a masked NaN rather than adding Eva, unlike the desired output above):
names = np.unique(df1[['Name1', 'Name2']].values.ravel())
df2.loc[~df2['Name'].isin(names), 'Y'] = np.nan
Name Y
0 Jack 0.0
1 John 1.0
2 Sarah 0.0
3 Tom 1.0
4 Laura NaN

How to assign a unique ID to detect repeated rows in a pandas dataframe?

I am working with a large pandas dataframe, with several columns pretty much like this:
A B C D
John Tom 0 1
Homer Bart 2 3
Tom Maggie 1 4
Lisa John 5 0
Homer Bart 2 3
Lisa John 5 0
Homer Bart 2 3
Homer Bart 2 3
Tom Maggie 1 4
How can I assign a unique id to each repeated row? For example:
A B C D new_id
John Tom 0 1 1
Homer Bart 2 3 2
Tom Maggie 1 4 3
Lisa John 5 0 4
Homer Bart 2 3 2
Lisa John 5 0 4
Homer Bart 2 3 2
Homer Bart 2 3 2
Tom Maggie 1 4 3
I know that I can use duplicated to detect the duplicated rows, but I can't see where those rows repeat. I tried:
df.assign(id=(df.columns).astype('category').cat.codes)
df
However, it is not working. How can I get a unique id for detecting groups of duplicated rows?
For small dataframes, you can convert your rows to tuples, which can be hashed, and then use pd.factorize.
df['new_id'] = pd.factorize(df.apply(tuple, axis=1))[0] + 1
groupby is more efficient for larger dataframes:
df['new_id'] = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1
Group by the columns you are trying to find duplicates over and use ngroup:
df['new_id'] = df.groupby(['A','B','C','D']).ngroup()
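To then inspect only the rows that actually repeat, a small usage sketch filtering on the id counts:
# Sketch: keep only rows whose new_id occurs more than once.
counts = df['new_id'].value_counts()
repeated = df[df['new_id'].isin(counts[counts > 1].index)]
print(repeated.sort_values('new_id'))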

Python - How to fill string value with the modal value for the group

I have a dataset like the one below. I want to populate the missing text with what is normal for the group. I have tried using ffill, but this doesn't help the ones that are blank at the start, and bfill similarly fails for the end. How can I do this?
Group Name
1 Annie
2 NaN
3 NaN
4 David
1 NaN
2 Bertha
3 Chris
4 NaN
Desired Output:
Group Name
1 Annie
2 Bertha
3 Chris
4 David
1 Annie
2 Bertha
3 Chris
4 David
Using collections.Counter to create a modal mapping by group:
from collections import Counter

s = (df.dropna(subset=['Name'])
       .groupby('Group')['Name']
       .apply(lambda x: Counter(x).most_common()[0][0]))
df['Name'] = df['Name'].fillna(df['Group'].map(s))
print(df)
Group Name
0 1 Annie
1 2 Bertha
2 3 Chris
3 4 David
4 1 Annie
5 2 Bertha
6 3 Chris
7 4 David
You can use value_counts and take the top index per group (the original relied on a reset_index(-1)['level_1'] trick whose column name changes across pandas versions; indexing directly is more robust):
s = df.groupby('Group')['Name'].apply(lambda x: x.value_counts().index[0])
df['Name'] = df['Name'].fillna(df['Group'].map(s))
print(df)
Output:
Group Name
0 1 Annie
1 2 Bertha
2 3 Chris
3 4 David
4 1 Annie
5 2 Bertha
6 3 Chris
7 4 David
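A hypothetical alternative to both, using Series.mode, which ignores NaN by default (assumes every group has at least one non-missing name):
# Sketch: mode() drops NaN, so the most common real name per group wins.
s = df.groupby('Group')['Name'].agg(lambda x: x.mode().iat[0])
df['Name'] = df['Name'].fillna(df['Group'].map(s))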
