I'm trying to use pandas to identify sub-sections of a dataframe which are identical. So, for example, if I have a dataframe like:
id A B
0 1 1 2
1 1 2 3
2 1 5 6
3 2 1 2
4 2 2 3
5 2 5 6
6 3 8 9
7 3 4 0
8 3 9 7
I want to group by ID, so Rows 0 - 2 would form Group 1, Rows 3 - 5 would form Group 2, and Rows 6 - 8 would form Group 3. I know I can use pd.groupby() to group rows by ID. In the case here, Group 2 is a repetition of Group 1 (Columns A and B are identical in both)
What I then want to do is to remove repeated groups, so in this case I would want to remove the second group. My final dataframe would then look like:
id A B
0 1 1 2
1 1 2 3
2 1 5 6
6 3 8 9
7 3 4 0
8 3 9 7
Every column in the duplicate groups is the same, except for the ID which is different for each group. I only want to remove a group if it is identical for every row in the group. Any help would be much appreciated!
This is one way using a helper column and pd.Series.drop_duplicates.
The idea is to first create a mapping from id to a tuple of values representing all rows for that id. Then drop duplicates and extract the index of the remainder.
df['C'] = list(zip(df['A'], df['B']))
s = df.groupby('id')['C'].apply(tuple)\
.drop_duplicates().index
res = df.loc[df['id'].isin(s), ['id', 'A', 'B']]
print(res)
id A B
0 1 1 2
1 1 2 3
2 1 5 6
6 3 8 9
7 3 4 0
8 3 9 7
Check pd.crosstab
s=pd.crosstab(df.id,[df.A,df.B]).drop_duplicates().unstack()
s[s!=0].reset_index().drop(0,1)
Out[128]:
A B id
0 1 2 1
1 2 3 1
2 4 0 3
3 5 6 1
4 8 9 3
5 9 7 3
Related
Given a pandas data frame, how can I get the first row for each unique value in a column?
for example, given:
a b key
0 1 2 1
1 2 3 1
2 3 3 1
3 4 5 2
4 5 6 2
5 6 6 2
6 7 2 1
7 8 2 1
8 9 2 3
the result when analyzing by column key should be
a b key
0 1 2 1
3 4 5 2
8 9 2 3
p.s. df src:
pd.DataFrame([{'a':1,'b':2,'key':1},
{'a':2,'b':3,'key':1},
{'a':3,'b':3,'key':1},
{'a':4,'b':5,'key':2},
{'a':5,'b':6,'key':2},
{'a':6,'b':6,'key':2},
{'a':7,'b':2,'key':1},
{'a':8,'b':2,'key':1},
{'a':9,'b':2,'key':3}])
drop_duplicates does this. By default it keeps the first of the set, although that can be changed by other parameters.
df = df.drop_duplicates('key')
I have a dataframe that looks like
ID feature
1 2
1 3
1 4
2 3
2 2
3 5
3 8
3 4
3 2
4 4
4 6
and I want to add a new column n_ID that counts the number of times that an element occur in the column ID, so the desire output should look like
ID feature n_ID
1 2 3
1 3 3
1 4 3
2 3 2
2 2 2
3 5 4
3 8 4
3 4 4
3 2 4
4 4 2
4 6 2
I know the .value_counts() function but I don't know how to make use of this method to make the new column. Thanks in advance
Using value counts... I was thinking of this... #sophocles Thanks for transform... :)
df = pd.DataFrame({"ID":[1,1,1,2,2,3,3,3,3,4,4],
"feature":[1,2,3,4,5,6,7,8,9,10,11]})
df1 = pd.DataFrame(df["ID"].value_counts().reset_index())
df1.columns = ["ID","n_ID"]
df = df.merge(df1,how = "left",on="ID")
Just create new column and count the occurance using lambda func:
Code:
df['n_id'] = df.apply(lambda x: df['ID'].tolist().count(x.ID), axis=1)
Output:
ID feature n_id
0 1 1 3
1 1 2 3
2 1 3 3
3 2 4 2
4 2 5 2
5 3 6 4
6 3 7 4
7 3 8 4
8 3 9 4
9 4 10 2
10 4 11 2
I'm currently working on a dataframe that has processes (based on ID) that may or not reach the end of the process. The end of the process is defined as the activity which has index=6. What i need to do is to filter those processes (ID) based on the fact they are completed, which means all 6 the activities are done (so in the process we'll have activities which have index equal to 1,2,3,4,5 and 6 in this specific order).
the dataframe is structured as follows:
ID A index
1 activity1 1
1 activity2 2
1 activity3 3
1 activity4 4
1 activity5 5
1 activity6 6
2 activity7 1
2 activity8 2
2 activity9 3
3 activity10 1
3 activity11 2
3 activity12 3
3 activity13 4
3 activity14 5
3 activity15 6
And the resulting dataframe should be:
ID A index
1 activity1 1
1 activity2 2
1 activity3 3
1 activity4 4
1 activity5 5
1 activity6 6
3 activity10 1
3 activity11 2
3 activity12 3
3 activity13 4
3 activity14 5
3 activity15 6
I've tried to do so working with sum(), creating a new column 'a' and checking if the sum of every group was greater than 20 (which means taking groups in which the sum() is at least 21, which is the sum of 1,2,3,4,5,6) with the function gt().
df['a'] = df['index'].groupby(df['index']).sum()
df2 = df[df['a'].gt(20)]
Probably this isn't the best approach, so also other approaches are more than welcome.
Any idea on how to select rows based on this condition?
Another possible solution:
out = (df.groupby('ID')
.filter(lambda g: (len(g['index']) == 6) and
(g['index'].eq([*range(1,7)]).all())))
print(out)
ID A index
0 1 activity1 1
1 1 activity2 2
2 1 activity3 3
3 1 activity4 4
4 1 activity5 5
5 1 activity6 6
9 3 activity10 1
10 3 activity11 2
11 3 activity12 3
12 3 activity13 4
13 3 activity14 5
14 3 activity15 6
this may not be the fastest method, especially on a large dataframe, but it does the job
df = df.loc[df.groupby(['ID'])['index'].transform(lambda x: list(x)==list(range(1,7)))]
Or this other variation:
df = df.loc[df.groupby('ID')['index'].filter(lambda x: list(x)==list(range(1,7))).index]
Output:
ID A index
0 1 activity1 1
1 1 activity2 2
2 1 activity3 3
3 1 activity4 4
4 1 activity5 5
5 1 activity6 6
9 3 activity10 1
10 3 activity11 2
11 3 activity12 3
12 3 activity13 4
13 3 activity14 5
14 3 activity15 6
Here an example:
import pandas as pd
df = pd.DataFrame({
'product':['1','1','1','2','2','2','3','3','3','4','4','4','5','5','5'],
'value':['a','a','a','a','a','b','a','b','a','b','b','b','a','a','a']
})
product value
0 1 a
1 1 a
2 1 a
3 2 a
4 2 a
5 2 b
6 3 a
7 3 b
8 3 a
9 4 b
10 4 b
11 4 b
12 5 a
13 5 a
14 5 a
I need to output:
1 a
4 b
5 a
Because 'value' values for distinct 'product' values all are same
I'm sorry for bad English
I think you need this
m=df.groupby('product')['value'].transform('nunique')
df.loc[m==1].drop_duplicates(). reset_index(drop=True)
Output
product value
0 1 a
1 4 b
2 5 a
Details
df.groupby('product')['value'].transform('nunique') returns a series as below
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 2
9 1
10 1
11 1
12 1
13 1
14 1
where the numbers of the number of unique values in each group. Then we use df.loc to get only the rows in which this value is 1, so, the groups with unique values.
The we drop duplicates since you need only the group & its unique value.
If I undestand correctly your question, this simple code is for your:
distinct_prod_df = df.drop_duplicates(['product'])
and gives:
product value
0 1 a
3 2 a
6 3 a
9 4 b
12 5 a
You can try this:
mask = df.groupby('product').apply(lambda x: x.nunique() == 1)
df = df[mask].drop_duplicates()
I'm trying to create a historical time-series of a number of identifiers for a number of different metrics, as part of that i'm trying to create multi index dataframe and then "fill it" with the individual dataframes.
Multi Index:
ID1 ID2
ITEM1 ITEM2 ITEM1 ITEM2
index
Dataframe to insert
ITEM1 ITEM2
Date
a
b
c
looking through the official docs and this website i found the following relevant:
Add single index data frame to multi index data frame, Pandas, Python and the associated pandas official docs pages:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
https://pandas.pydata.org/pandas-docs/stable/advanced.html
i've managed with something like :
for i in df1.index:
for j in df2.columns:
df1.loc[i,(ID,j)]=df2.loc[i,j]
but it seems highly inefficient when i need to do this across circa 100 dataframes.
for some reason a simply
df1.loc[i,(ID)]=df2.loc[i] doesn't seem to work
neither does :
df1[ID1]=df1.append(df2)
which returns a Cannot set a frame with no defined index and a value that cannot be converted to a Series
my understanding from looking around is that this is because im effectively leaving half the dataframe empty ( ragged list? )
any help appreciated on how to iteratively populate my multi index DF would be greatly appreciated.
let me know if i've missed relevant information,
cheers.
Setup
df1 = pd.DataFrame(
[[1, 2, 3, 4, 5, 6] * 2] * 3,
columns=pd.MultiIndex.from_product(['ID1 ID2 ID3'.split(), range(4)])
)
df2 = df1.ID1 * 2
df1
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 1 2 3 4 5 6 1 2 3 4 5 6
1 1 2 3 4 5 6 1 2 3 4 5 6
2 1 2 3 4 5 6 1 2 3 4 5 6
df2
0 1 2 3
0 2 4 6 8
1 2 4 6 8
2 2 4 6 8
The problem is that Pandas is trying to line up indices (or columns in this case). We can do some transpose/join trickery but I'd rather avoid that.
Option 1
Take advantage of the fact that we can assign via loc an array so long as the shape matches up. Well, we better make sure it does and that the order of columns and index are correct. I use align with the right parameter to do this. Then assign the values of the aligned df2
df1.loc[:, 'ID1'] = df2.align(df1.ID1, 'right')[0].values
df1
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 2 4 6 8 5 6 1 2 3 4 5 6
1 2 4 6 8 5 6 1 2 3 4 5 6
2 2 4 6 8 5 6 1 2 3 4 5 6
Option 2
Or, we can give df2 the additional level of column indexing that we need to lined it up. The use update to replace the relevant cells in place.
df1.update(pd.concat({'ID1': df2}, axis=1))
df1
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 2 4 6 8 5 6 1 2 3 4 5 6
1 2 4 6 8 5 6 1 2 3 4 5 6
2 2 4 6 8 5 6 1 2 3 4 5 6
Option 3
A creative way using stack and assign with unstack
df1.stack().assign(ID1=df2.stack()).unstack()
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 2 4 6 8 5 6 1 2 3 4 5 6
1 2 4 6 8 5 6 1 2 3 4 5 6
2 2 4 6 8 5 6 1 2 3 4 5 6