I have a dataframe with many duplicate rows. The dataset has hundreds of rows and columns.
Each row has a unique identifier. I want to create a dataframe with only the unique rows, and then a mapping from each identifier in the unique-row dataframe to the identifiers of the matching rows in the original dataframe.
For example:
import pandas as pd
# Dummy data
df = pd.DataFrame({'col_1': [1, 2, 2, 1, 2, 3],
                   'col_2': [2, 4, 4, 2, 4, 2],
                   'col_3': [3, 2, 2, 3, 2, 7]},
                  index=['A', 'B', 'C', 'D', 'E', 'F'])
df
Out[11]:
col_1 col_2 col_3
A 1 2 3
B 2 4 2
C 2 4 2
D 1 2 3
E 2 4 2
F 3 2 7
# Unique row dataframe
df_unique = df.drop_duplicates()
df_unique
Out[12]:
col_1 col_2 col_3
A 1 2 3
B 2 4 2
F 3 2 7
# Mapping from df_unique to df
# Creating this mapping is the problem
mapping = {'A': ('A', 'D'),
           'B': ('B', 'C', 'E'),
           'F': ('F',)}
In this case rows 'A' and 'D' are equal, so the unique row 'A' maps to the identifiers 'A' and 'D' from before drop_duplicates().
How can I create this mapping?
Here I used drop_duplicates() to create the unique-row dataframe, but that is not a requirement, and the mapping does not have to be a dictionary if somebody has a better idea.
Use GroupBy.agg with first and tuple, grouping by all columns of the DataFrame, and then build a dictionary of tuples:
mapping = (df.reset_index()
             .groupby(df.columns.tolist())['index']
             .agg(['first', tuple])
             .set_index('first')['tuple']
             .to_dict())
print(mapping)
{'A': ('A', 'D'), 'B': ('B', 'C', 'E'), 'F': ('F',)}
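The same mapping can also be read off GroupBy.groups, which maps each group key to the Index of row labels in that group. A small sketch, not from the original answer but equivalent for this data:

# Row labels within each group keep their original order,
# so the first label is the one drop_duplicates() retains.
groups = df.groupby(df.columns.tolist()).groups
mapping = {labels[0]: tuple(labels) for labels in groups.values()}
# {'A': ('A', 'D'), 'B': ('B', 'C', 'E'), 'F': ('F',)}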
Say I have the following data frame:
import pandas as pd
d = {'id': [1, 2, 3, 3, 3, 2, 2, 1, 2, 3, 2, 3],
     'date': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
     'product': ['a', 'a', 'b', 'a', 'b', 'a', 'b', 'c', 'b', 'c', 'c', 'c']}
df = pd.DataFrame(d)
I want to keep all data for each ID from the day they bought product 'b' onward, and get rid of all data from before that purchase. ID 1 would have no data because they never bought the product, ID 2 would have data for days 3 and 4, and ID 3 would have data for days 1-4.
I know that I could groupby id and then filter rows from individual groups but I can't figure out how to make the filter dynamic based on the group. I've tried looping through the groups but it's slow (right now I have 19,000 IDs but it'll only grow as I continue the project).
Any help would be greatly appreciated. Thank you!
You can select the product "b" rows with eq and propagate True to all subsequent rows per group using groupby + cummax, then slice the dataframe with the resulting mask:
df[df['product'].eq('b').groupby(df['id']).cummax()]
output:
id date product
2 3 1 b
3 3 2 a
4 3 2 b
6 2 3 b
8 2 3 b
9 3 3 c
10 2 4 c
11 3 4 c
NB: this assumes the dataframe is ordered by date. If not, use sort_values(by='date') first (or by=['id', 'date']).
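For completeness, a sketch of the unsorted case, combining the sort step above with the same mask:

df_sorted = df.sort_values(['id', 'date'], kind='stable')
mask = df_sorted['product'].eq('b').groupby(df_sorted['id']).cummax()
out = df_sorted[mask]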
I have the following Pandas dataframe in Python:
import pandas as pd
d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d)
df.index = ['A', 'B', 'C', 'D', 'E']
df
which gives the following output:
col1 col2
A 1 6
B 2 7
C 3 8
D 4 9
E 5 10
I need to write a function (say getNrRows(fromIndex)) that takes an index label as input and returns the number of rows between that label and the last index of the dataframe.
For instance:
nrRows = getNrRows("C")
print(nrRows)
> 2
Because it takes 2 steps (rows) from the index C to the index E.
How can I write such a function in the most elegant way?
The simplest way might be
len(df[row_index:]) - 1
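Wrapped in the function the question asks for (a minimal sketch, assuming df is in scope):

def getNrRows(fromIndex):
    # Label-based slicing includes the start label itself, hence the -1
    return len(df[fromIndex:]) - 1

print(getNrRows("C"))  # 2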
For reference, pandas has the built-in method Index.get_indexer_for:
len(df) - df.index.get_indexer_for(['C']) - 1
Out[179]: array([2], dtype=int64)
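If the index labels are unique, Index.get_loc returns a plain scalar position instead of an array:

len(df) - df.index.get_loc('C') - 1  # 2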
I have two pandas dataframes
df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})
The index of both dataframes is 'A'.
How can I replace the values of df1's column 'B' with the values of df2's column 'B'?
Desired result for df1:
A B
1 8
3 10
5 12
Maybe DataFrame.isin() is what you're searching for:
df1['B'] = df2[df2['A'].isin(df1['A'])]['B'].values
print(df1)
Prints:
A B
0 1 8
1 3 10
2 5 12
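Note that this relies on df2['A'] covering every value in df1['A'] and the filtered rows coming back in the same order as df1; the index-aligned solutions below do not have that restriction.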
One possible solution:
wrk = df1.set_index('A').B
wrk.update(df2.set_index('A').B)
df1 = wrk.reset_index()
The result is:
A B
0 1 8
1 3 10
2 5 12
Another solution, based on merge:
df1 = (df1.merge(df2[['A', 'B']], how='left', on='A', suffixes=['_x', ''])
          .drop(columns=['B_x']))
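A third option is Series.map, a sketch assuming the values in df2['A'] are unique so they can serve as a lookup key:

# Build a lookup Series keyed on 'A' and map it onto df1['A']
df1['B'] = df1['A'].map(df2.set_index('A')['B'])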
I want to add the data of reference to data, so I use
data[reference.columns] = reference
but it only creates the columns with no values. How can I add the values?
Your two DataFrames are indexed differently, so data[reference.columns] = reference tries to align the new columns on the index. Since the indices of reference are not in data (or only partially overlap), pandas adds the columns but fills the values with NaN.
It looks like you want to add multiple static columns to data with the values from reference. You can just assign these:
# Broadcast the first (here: only) value of each reference column
for col in reference.columns:
    data[col] = reference[col].values[0]
Here's an illustration of the issue.
import pandas as pd
data = pd.DataFrame({'id': [1, 2, 3, 4],
                     'val1': ['A', 'B', 'C', 'D']})
reference = pd.DataFrame({'id2': [1, 2, 3, 4],
                          'val2': ['A', 'B', 'C', 'D']})
Both have the same default index, ranging from 0 to 3.
data[reference.columns] = reference
Outputs:
id val1 id2 val2
0 1 A 1 A
1 2 B 2 B
2 3 C 3 C
3 4 D 4 D
But, if these DataFrames have different indices (that only partially overlap):
data = pd.DataFrame({'id': [1, 2, 3, 4],
                     'val1': ['A', 'B', 'C', 'D']})
reference = pd.DataFrame({'id2': [1, 2, 3, 4],
                          'val2': ['A', 'B', 'C', 'D']})
reference.index = [3, 4, 5, 6]
data[reference.columns] = reference
Outputs:
id val1 id2 val2
0 1 A NaN NaN
1 2 B NaN NaN
2 3 C NaN NaN
3 4 D 1.0 A
Only the index value 3 is shared, so only that row's values are filled.
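If you instead want positional, row-by-row assignment regardless of the index, one sketch (assuming both frames have the same number of rows):

# Assign the raw values, ignoring reference's index ...
data[reference.columns] = reference.to_numpy()
# ... or re-key reference onto data's index before assigning
data[reference.columns] = reference.set_index(data.index)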
My question is similar to one asked here. I have a dataframe and I want to repeat each row of the dataframe k times. Along with that, I also want to create a column with values 0 to k-1. So:
import pandas as pd
df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n' : [ 1, 2, 3],
    'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'n' : [ 1, 2, 2, 3, 3, 3],
    'v' : [ 10, 13, 13, 8, 8, 8],
    'repeat_id': [0, 0, 1, 0, 1, 2]
})
The command below does half of the job. I am looking for a pandas way of adding the repeat_id column.
df.loc[df.index.repeat(df.n)]
Use GroupBy.cumcount, and copy to avoid SettingWithCopyWarning:
If you modify values in df1 later, you will find that the modifications do not propagate back to the original data (df), and that pandas emits a warning.
df1 = df.loc[df.index.repeat(df.n)].copy()
df1['repeat_id'] = df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print(df1)
id n v repeat_id
0 A 1 10 0
1 B 2 13 0
2 B 2 13 1
3 C 3 8 0
4 C 3 8 1
5 C 3 8 2
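An equivalent sketch that sidesteps the copy question by building the counter with NumPy on an already-reset frame:

import numpy as np

df1 = df.loc[df.index.repeat(df.n)].reset_index(drop=True)
# One 0..n-1 ramp per original row, concatenated in order
df1['repeat_id'] = np.concatenate([np.arange(n) for n in df['n']])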