I have a pandas dataframe like:

  user_id music_id  has_rating
0       A        a           1
1       B        b           1
and I would like to automatically add new rows for each user_id & music_id pair that hasn't been rated, like:

  user_id music_id  has_rating
0       A        a           1
1       A        b           0
2       B        a           0
3       B        b           1
covering every user_id and music_id combination pair that does not exist in my pandas dataframe yet.
Is there any way to append such rows automatically?
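For reference, the example frame can be reproduced like this (a minimal sketch; the values are taken from the tables above):

import pandas as pd

df = pd.DataFrame({'user_id': ['A', 'B'],
                   'music_id': ['a', 'b'],
                   'has_rating': [1, 1]})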
You can use a temporary reshape with pivot_table and fill_value=0 to fill the missing values with 0:
(df.pivot_table(index='user_id', columns='music_id',
                values='has_rating', fill_value=0)
   .stack()
   .reset_index(name='has_rating')
)
Output:
user_id music_id has_rating
0 A a 1
1 A b 0
2 B a 0
3 B b 1
Try using pd.MultiIndex.from_product()
l = ['user_id', 'music_id']
(df.set_index(l)
   .reindex(pd.MultiIndex.from_product([df[l[0]].unique(), df[l[1]].unique()],
                                       names=l),
            fill_value=0)
   .reset_index())
Output:
user_id music_id has_rating
0 A a 1
1 A b 0
2 B a 0
3 B b 1
I have a MultiIndex dataframe; 'partner', 'employer', and 'date' form the MultiIndex.
                           ecom  sales
partner employer date
A       a        10/01/21     1      0
                 10/02/21     1      0
                 10/03/21     0      1
        b        10/01/21     0      1
                 10/02/21     1      0
                 10/03/21     1      0
B       c        10/03/21     1      0
                 10/04/21     1      0
                 10/04/21     0      1
I'm trying to find which unique (partner, employer) pairs have 'ecom' BEFORE 'sales'. For example, I want the output to be the frame below. How do I filter each (partner, employer) pair with these conditions in Python?
                           ecom  sales
partner employer date
A       a        10/01/21     1      0
                 10/02/21     1      0
                 10/03/21     0      1
B       c        10/03/21     1      0
                 10/04/21     1      0
                 10/04/21     0      1
Try this:
import numpy as np

# Find the first date where ecom or sales is not 0: replacing 0 with NaN
# makes first_valid_index skip the zero rows, and [-1] takes the 'date'
# level from the row's MultiIndex label
first_date = lambda col: col.first_valid_index()[-1]
tmp = df.replace(0, np.nan).sort_index().groupby(level=[0, 1]).agg(
    first_ecom=('ecom', first_date),
    first_sales=('sales', first_date)
)
# The (partner, employer) pairs where ecom happens before sales
idx = tmp[tmp['first_ecom'] < tmp['first_sales']].index
# Condition to filter the original frame
cond = df.index.droplevel(-1).isin(idx)
# Result
df[cond]
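Here is a minimal, self-contained sketch of the same approach, with the sample frame rebuilt by hand from the table above (the construction is an assumption, since the question only showed an image):

import numpy as np
import pandas as pd

rows = [('A', 'a', '10/01/21', 1, 0), ('A', 'a', '10/02/21', 1, 0),
        ('A', 'a', '10/03/21', 0, 1), ('A', 'b', '10/01/21', 0, 1),
        ('A', 'b', '10/02/21', 1, 0), ('A', 'b', '10/03/21', 1, 0),
        ('B', 'c', '10/03/21', 1, 0), ('B', 'c', '10/04/21', 1, 0),
        ('B', 'c', '10/04/21', 0, 1)]
df = (pd.DataFrame(rows, columns=['partner', 'employer', 'date', 'ecom', 'sales'])
        .set_index(['partner', 'employer', 'date']))

first_date = lambda col: col.first_valid_index()[-1]
tmp = df.replace(0, np.nan).sort_index().groupby(level=[0, 1]).agg(
    first_ecom=('ecom', first_date),
    first_sales=('sales', first_date))
idx = tmp[tmp['first_ecom'] < tmp['first_sales']].index
print(df[df.index.droplevel(-1).isin(idx)])

Note that comparing the date strings directly only works here because they share the MM/DD/YY format within a single year; parsing them with pd.to_datetime first would be more robust.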
I have a pandas dataframe that contains a user id and the ad clicks (if any) by that user across several days:
df = pd.DataFrame([['A', 0], ['A', 1], ['A', 0],
                   ['B', 0], ['B', 0], ['B', 0],
                   ['B', 1], ['B', 1], ['B', 1]],
                  columns=['user_id', 'click_count'])
Out[8]:
user_id click_count
0 A 0
1 A 1
2 A 0
3 B 0
4 B 0
5 B 0
6 B 1
7 B 1
8 B 1
I would like to convert this dataframe into a dataframe with one row per user, where 'click_cnt' is the sum of 'click_count' across all of that user's rows in the original dataframe, i.e.
Out[18]:
user_id click_cnt
0 A 1
1 B 3
What you're after is the function groupby:
df = df.groupby('user_id', as_index=False).sum()
Adding the flag as_index=False will add the keys as a separate column instead of using them for the new index.
groupby is super useful - have a read through the documentation for more info.
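Note that the summed column keeps the original name 'click_count'. If you want it named 'click_cnt' as in the desired output, named aggregation is one way to rename while summing (a small variation on the same groupby):

df = df.groupby('user_id', as_index=False).agg(click_cnt=('click_count', 'sum'))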
I have a dataframe that looks like the one below
ID | Value
1 | A
1 | B
1 | C
2 | B
2 | C
I want to create a symmetric matrix based on Value:
A B C
A 1 1 1
B 1 2 2
C 1 2 2
This basically indicates how many people have both values (v1, v2). I am currently using for loops to scan the dataframe for every combination, but I was wondering if there is an easier way to do it using pandas.
Use merge for a self-join (a cross join within each ID) on the ID column, then crosstab, and DataFrame.rename_axis to remove the index and column names:
df = pd.merge(df, df, on='ID')
df = pd.crosstab(df['Value_x'], df['Value_y']).rename_axis(None).rename_axis(None, axis=1)
print (df)
A B C
A 1 1 1
B 1 2 2
C 1 2 2
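A minimal sketch reproducing the example end-to-end (the frame construction below is assumed from the table in the question):

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                   'Value': ['A', 'B', 'C', 'B', 'C']})

# the self-join pairs every value with every other value for the same ID
pairs = pd.merge(df, df, on='ID')
matrix = (pd.crosstab(pairs['Value_x'], pairs['Value_y'])
            .rename_axis(None)
            .rename_axis(None, axis=1))
print(matrix)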
I have 2 columns, User_ID and Item_ID. Now I want to make a new column 'Reordered' which will contain values as either 0 or 1. 0 is when a particular user has ordered an item only once, and 1 is when a particular user orders an item more than once.
I think this can be done by grouping on User_ID and then using an apply function to map duplicated items to 1 and non-duplicated ones to 0, but I'm not able to figure out the correct Python code for that.
Could someone please help me with this?
You can use Series.duplicated with the parameter keep=False to mark all duplicates - the output is True/False values. Then convert them to ints with astype:
df['Reordered'] = df['User_ID'].duplicated(keep=False).astype(int)
Sample:
df = pd.DataFrame({'User_ID':list('aaabaccd'),
'Item_ID':list('eetyutyu')})
df['Reordered'] = df['User_ID'].duplicated(keep=False).astype(int)
print (df)
Item_ID User_ID Reordered
0 e a 1
1 e a 1
2 t a 1
3 y b 0
4 u a 1
5 t c 1
6 y c 1
7 u d 0
Or, if you need to check duplicates per (user, item) pair, use DataFrame.duplicated:
df['Reordered'] = df.duplicated(['User_ID','Item_ID'], keep=False).astype(int)
print (df)
Item_ID User_ID Reordered
0 e a 1
1 e a 1
2 t a 0
3 y b 0
4 u a 0
5 t c 0
6 y c 0
7 u d 0
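An equivalent way to express the per-(user, item) check is a groupby size transform (a sketch of an alternative, not the method used above):

df['Reordered'] = (df.groupby(['User_ID', 'Item_ID'])['Item_ID']
                     .transform('size')
                     .gt(1)
                     .astype(int))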
I have two categories, A and B, that can take on 5 different states (values, names or categories) defined by the list abcde. Counting the occurrence of each state and storing it in a data frame is fairly easy. However, I would also like the resulting data frame to include zeros for the possible values that have not occurred in Category A or B.
First, here's a dataframe that matches the description:
In[1]:
import pandas as pd
possibleValues = list('abcde')
df = pd.DataFrame({'Category A':list('abbc'), 'Category B':list('abcc')})
print(df)
Out[1]:
Category A Category B
0 a a
1 b b
2 b c
3 c c
I've tried different approaches with df.groupby(...).size() and .count(), combined with the list of possible values and the names of the categories in a list, but with no success.
Here's the desired output:
Category A Category B
a 1 1
b 2 1
c 1 2
d 0 0
e 0 0
To go one step further, I'd also like to include a column with the totals for each possible state across all categories:
Category A Category B Total
a 1 1 2
b 2 1 3
c 1 2 3
d 0 0 0
e 0 0 0
SO has many related questions and answers, but to my knowledge none that suggests a solution to this particular problem. Thank you for any suggestions!
P.S.
I'd like the solution to be adjustable to the number of categories, possible values, and number of rows.
You need apply + value_counts + reindex + sum:
cols = ['Category A', 'Category B']
# calling value_counts column by column; the top-level pd.value_counts
# is deprecated in recent pandas
df1 = df[cols].apply(lambda s: s.value_counts()).reindex(possibleValues, fill_value=0)
df1['total'] = df1.sum(axis=1)
print (df1)
Category A Category B total
a 1 1 2
b 2 1 3
c 1 2 3
d 0 0 0
e 0 0 0
Another solution is to convert the columns to categorical; the 0 counts are then added without reindex:
cols = ['Category A', 'Category B']
# astype('category', categories=...) was removed in newer pandas,
# so build a CategoricalDtype instead
cat = pd.CategoricalDtype(categories=possibleValues)
df1 = df[cols].apply(lambda x: x.astype(cat).value_counts())
df1['total'] = df1.sum(axis=1)
print (df1)
Category A Category B total
a 1 1 2
b 2 1 3
c 1 2 3
d 0 0 0
e 0 0 0
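On recent pandas versions the same table can also be built by melting to long form and counting per category (a sketch; the melt column names and the reindex step are my own choices):

import pandas as pd

possibleValues = list('abcde')
df = pd.DataFrame({'Category A': list('abbc'), 'Category B': list('abcc')})

# one row per (category, state) observation
long = df.melt(var_name='category', value_name='state')
out = (long.groupby('category')['state']
           .value_counts()
           .unstack(0, fill_value=0)          # categories become columns
           .reindex(possibleValues, fill_value=0))  # add unobserved states as 0
out['total'] = out.sum(axis=1)
print(out)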