Finding and mapping duplicates in a pandas groupby object - python

I have two columns, User_ID and Item_ID. I want to make a new column 'Reordered' that contains either 0 or 1: 0 when a particular user has ordered an item only once, and 1 when a particular user has ordered an item more than once.
I think this can be done by grouping on User_ID and then using an apply function to map duplicated items to 1 and non-duplicated ones to 0, but I'm not able to figure out the correct Python code for that.
Could someone please help me with this?

You can use Series.duplicated with the parameter keep=False to mark all duplicates; the output is a boolean Series. Then convert it to integers with astype:
df['Reordered'] = df['User_ID'].duplicated(keep=False).astype(int)
Sample:
df = pd.DataFrame({'User_ID': list('aaabaccd'),
                   'Item_ID': list('eetyutyu')})
df['Reordered'] = df['User_ID'].duplicated(keep=False).astype(int)
print (df)
Item_ID User_ID Reordered
0 e a 1
1 e a 1
2 t a 1
3 y b 0
4 u a 1
5 t c 1
6 y c 1
7 u d 0
Or, if you need to check duplicates per user-item pair, use DataFrame.duplicated:
df['Reordered'] = df.duplicated(['User_ID','Item_ID'], keep=False).astype(int)
print (df)
Item_ID User_ID Reordered
0 e a 1
1 e a 1
2 t a 0
3 y b 0
4 u a 0
5 t c 0
6 y c 0
7 u d 0
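The same per-pair check can also be sketched with groupby plus transform('size'), counting each (User_ID, Item_ID) pair and flagging counts above 1; this is equivalent to the DataFrame.duplicated version, shown here only as an alternative:

```python
import pandas as pd

df = pd.DataFrame({'User_ID': list('aaabaccd'),
                   'Item_ID': list('eetyutyu')})

# Count how often each (User_ID, Item_ID) pair occurs, then flag
# pairs seen more than once as reordered.
counts = df.groupby(['User_ID', 'Item_ID'])['Item_ID'].transform('size')
df['Reordered'] = (counts > 1).astype(int)
print(df)
```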

Related

pandas - creating new rows of combination with value of 0

I have a pandas dataframe like
user_id  music_id  has_rating
A        a         1
B        b         1
and I would like to automatically add a new row for each user_id & music_id combination where the user hasn't rated, like
user_id  music_id  has_rating
A        a         1
A        b         0
B        a         0
B        b         1
for each user_id and music_id combination pair that does not yet exist in my pandas dataframe.
Is there a way to append such rows automatically?
You can use a temporary reshape with pivot_table and fill_value=0 to fill the missing values with 0:
(df.pivot_table(index='user_id', columns='music_id',
                values='has_rating', fill_value=0)
   .stack().reset_index(name='has_rating')
)
Output:
user_id music_id has_rating
0 A a 1
1 A b 0
2 B a 0
3 B b 1
Try using pd.MultiIndex.from_product():
l = ['user_id', 'music_id']
(df.set_index(l)
   .reindex(pd.MultiIndex.from_product([df[l[0]].unique(), df[l[1]].unique()],
                                       names=l), fill_value=0)
   .reset_index())
Output:
user_id music_id has_rating
0 A a 1
1 A b 0
2 B a 0
3 B b 1
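The same fill-in can also be sketched with an explicit cross product and a left merge, which some find easier to read than the reshape; this is an alternative sketch, not from the original answers:

```python
import pandas as pd
from itertools import product

df = pd.DataFrame({'user_id': ['A', 'B'],
                   'music_id': ['a', 'b'],
                   'has_rating': [1, 1]})

# Build every (user_id, music_id) pair, then left-merge the existing
# ratings onto it; pairs with no rating get has_rating filled with 0.
pairs = pd.DataFrame(list(product(df['user_id'].unique(),
                                  df['music_id'].unique())),
                     columns=['user_id', 'music_id'])
out = (pairs.merge(df, on=['user_id', 'music_id'], how='left')
            .fillna({'has_rating': 0})
            .astype({'has_rating': int}))
print(out)
```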

How to drop conflicted rows in Dataframe?

I have a classification task, and conflicts harm the performance, i.e. the same feature appearing with different labels.
idx feature label
0 a 0
1 a 1
2 b 0
3 c 1
4 a 0
5 b 0
How could I get the dataframe formatted as below?
idx feature label
2 b 0
3 c 1
5 b 0
DataFrame.duplicated() only outputs the duplicated rows; it seems logical operations between df["feature"].duplicated() and df.duplicated() do not return the result I want.
I think you need only the rows whose group has a single unique label, so use GroupBy.transform with DataFrameGroupBy.nunique, compare with 1, and filter with boolean indexing:
df = df[df.groupby('feature')['label'].transform('nunique').eq(1)]
print (df)
idx feature label
2 2 b 0
3 3 c 1
5 5 b 0
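An equivalent two-step sketch makes the logic explicit: first find the conflicted features, then drop their rows. It gives the same result as the transform one-liner above:

```python
import pandas as pd

df = pd.DataFrame({'idx': [0, 1, 2, 3, 4, 5],
                   'feature': ['a', 'a', 'b', 'c', 'a', 'b'],
                   'label': [0, 1, 0, 1, 0, 0]})

# A feature is conflicted if it appears with more than one distinct label.
n_labels = df.groupby('feature')['label'].nunique()
conflicted = n_labels[n_labels > 1].index      # here: just 'a'
clean = df[~df['feature'].isin(conflicted)]
print(clean)
```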

Python Selecting and Adding row values of columns in dataframe to create an aggregated dataframe

I need to process my dataframe in Python such that I add the numeric values of numeric columns that lie between 2 rows of the dataframe.
The dataframe can be created using
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['a',0,1,0,0,0,0,'i'],
                            ['b',1,0,0,0,0,0,'j'],
                            ['c',0,0,1,0,0,0,'k'],
                            ['None',0,0,0,1,0,0,'l'],
                            ['e',0,0,0,0,1,0,'m'],
                            ['f',0,1,0,0,0,0,'n'],
                            ['None',0,0,0,1,0,0,'o'],
                            ['h',0,0,0,0,1,0,'p']]),
                  columns=[0,1,2,3,4,5,6,7],
                  index=[0,1,2,3,4,5,6,7])
I need to add all rows that occur before the 'None' entries and move the aggregated row to a new dataframe that should look like:
Your dataframe's dtypes are messed up because you built it from a single numpy array: one array can hold only one type, so all the ints were pushed to strings. We need to convert them back first.
df = df.apply(pd.to_numeric, errors='ignore')  # convert numeric-looking strings back to numbers
df['newkey'] = df[0].eq('None').cumsum()  # use cumsum to create the group key
df.loc[df[0].ne('None'), :].groupby('newkey').agg(lambda x: x.sum() if x.dtype == 'int64' else x.head(1))  # then aggregate
Out[742]:
0 1 2 3 4 5 6 7
newkey
0 a 1 1 1 0 0 0 i
1 e 0 1 0 0 1 0 m
2 h 0 0 0 0 1 0 p
You can also specify the agg funcs
s = lambda s: sum(int(k) for k in s)
d = {i: s for i in range(8)}
d.update({0: 'first', 7: 'first'})
df.groupby((df[0] == 'None').cumsum().shift().fillna(0)).agg(d)
0 1 2 3 4 5 6 7
0
0.0 a 1 1 1 1 0 0 i
1.0 e 0 1 0 1 1 0 m
2.0 h 0 0 0 0 1 0 p
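Both answers hinge on the same trick: a cumulative sum over a boolean marker turns the 'None' sentinel rows into group boundaries. A minimal illustration of just that key-building step, on a made-up series:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'None', 'c', 'None', 'd'])

# Each 'None' sentinel bumps the running sum, so the rows between
# sentinels share a group number.
key = s.eq('None').cumsum()
print(key.tolist())
```

The first answer drops the sentinel rows afterwards; the second shifts the key so each sentinel is counted with the preceding group instead.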

Count occurences of specific values in a data frame, where all possible values are defined by a list

I have two categories A and B that can take on 5 different states (values, names or categories) defined by the list abcde. Counting the occurrence of each state and storing it in a data frame is fairly easy. However, I would also like the resulting data frame to include zeros for the possible values that have not occurred in Category A or B.
First, here's a dataframe that matches the description:
In[1]:
import pandas as pd
possibleValues = list('abcde')
df = pd.DataFrame({'Category A':list('abbc'), 'Category B':list('abcc')})
print(df)
Out[1]:
Category A Category B
0 a a
1 b b
2 b c
3 c c
I've tried different approaches with df.groupby(...).size() and .count(), combined with the list of possible values and the names of the categories in a list, but with no success.
Here's the desired output:
Category A Category B
a 1 1
b 2 1
c 1 2
d 0 0
e 0 0
To go one step further, I'd also like to include a column with the totals for each possible state across all categories:
Category A Category B Total
a 1 1 2
b 2 1 3
c 1 2 3
d 0 0 0
e 0 0 0
SO has got many related questions and answers, but to my knowledge none that suggest a solution to this particular problem. Thank you for any suggestions!
P.S. I'd like the solution to be adjustable to the number of categories, possible values and number of rows.
You need apply + value_counts + reindex + sum:
cols = ['Category A','Category B']
df1 = df[cols].apply(pd.value_counts).reindex(possibleValues, fill_value=0)
df1['total'] = df1.sum(axis=1)
print (df1)
Category A Category B total
a 1 1 2
b 2 1 3
c 1 2 3
d 0 0 0
e 0 0 0
Another solution is to convert the columns to a categorical dtype with the full set of categories; value_counts then includes the zero counts without reindex:
cols = ['Category A', 'Category B']
cat = pd.CategoricalDtype(categories=possibleValues)
df1 = df[cols].apply(lambda x: x.astype(cat).value_counts(sort=False))
df1['total'] = df1.sum(axis=1)
print (df1)
Category A Category B total
a 1 1 2
b 2 1 3
c 1 2 3
d 0 0 0
e 0 0 0
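The same result can also be sketched by melting to long form and cross-tabulating, then reindexing against possibleValues; this is an alternative to the answers above, not from the original:

```python
import pandas as pd

possibleValues = list('abcde')
df = pd.DataFrame({'Category A': list('abbc'), 'Category B': list('abcc')})

# Melt to long form, cross-tabulate state vs. category, then reindex
# so unseen states (d, e) appear as zero rows.
long = df.melt(var_name='category', value_name='state')
out = (pd.crosstab(long['state'], long['category'])
         .reindex(possibleValues, fill_value=0))
out['Total'] = out.sum(axis=1)
print(out)
```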

pandas: Grouping or filtering based on values in list, instead of dataframe

I want to get a row count of the frequency of each value, even if that value doesn't exist in the dataframe.
d = {'light': pd.Series(['b', 'b', 'c', 'a', 'a', 'a', 'a'], index=[1, 2, 3, 4, 5, 6, 9]),
     'injury': pd.Series([1, 5, 5, 5, 2, 2, 4], index=[1, 2, 3, 4, 5, 6, 9])}
testdf = pd.DataFrame(d)
injury light
1 1 b
2 5 b
3 5 c
4 5 a
5 2 a
6 2 a
9 4 a
I want to get a count of the number of occurrences of each unique value of 'injury' for each unique value in 'light'.
Normally I would just use groupby(), or (in this case, since I want it to be in a specific format), pivot_table:
testdf.reset_index().pivot_table(index='light',columns='injury',fill_value=0,aggfunc='count')
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
But in this case I actually want to compare the records in the dataframe to an external list of values-- in this case, ['a','b','c','d']. So if 'd' doesn't exist in this dataframe, then I want it to return a count of zero:
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
d 0 0 0 0
The closest I've come is filtering the dataframe based on each value, and then getting the size of that dataframe:
for v in sorted(['a', 'b', 'c', 'd']):
    idx2 = testdf['light'].isin([v])
    df2 = testdf[idx2]
    print(df2.shape[0])
4
2
1
0
But that only returns counts from the 'light' column-- instead of a cross-tabulation of both columns.
Is there a way to make a pivot table, or a groupby() object, that groups things based on values in a list, rather than in a column in a dataframe? Or is there a better way to do this?
Try this:
df = pd.crosstab(testdf.light, testdf.injury, margins=True)
df
injury 1 2 4 5 All
light
a 0 2 1 1 4
b 1 0 0 1 2
c 0 0 0 1 1
All 1 2 1 3 7
df["All"]
light
a 4
b 2
c 1
All 7
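Note that crosstab alone only shows values that are present in the frame, so the zero row for 'd' is still missing. Combining it with a reindex against the external list covers that case; this is a sketch extending the answer, not part of the original:

```python
import pandas as pd

d = {'light': pd.Series(['b', 'b', 'c', 'a', 'a', 'a', 'a'],
                        index=[1, 2, 3, 4, 5, 6, 9]),
     'injury': pd.Series([1, 5, 5, 5, 2, 2, 4],
                         index=[1, 2, 3, 4, 5, 6, 9])}
testdf = pd.DataFrame(d)

# Cross-tabulate, then reindex the rows against the external list so
# values missing from the frame (here 'd') appear as all-zero rows.
values = ['a', 'b', 'c', 'd']
ct = pd.crosstab(testdf.light, testdf.injury).reindex(values, fill_value=0)
print(ct)
```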
