I am trying to efficiently remove duplicates in Pandas where the duplicates are inverted across two columns. For example, in this data frame:
import pandas as pd
key = {'p1': ['a','b','a','a','b','d','c'], 'p2': ['b','a','c','d','c','a','b'], 'value': [1,1,2,3,5,3,5]}
df = pd.DataFrame(key, columns=['p1','p2','value'])
print(df)
p1 p2 value
0 a b 1
1 b a 1
2 a c 2
3 a d 3
4 b c 5
5 d a 3
6 c b 5
I would want to remove rows 1, 5 and 6, leaving me with just:
p1 p2 value
0 a b 1
2 a c 2
3 a d 3
4 b c 5
Thanks in advance for ideas on how to do this.
Reorder the p1 and p2 values so they appear in a canonical order:
mask = df['p1'] < df['p2']
df['first'] = df['p1'].where(mask, df['p2'])
df['second'] = df['p2'].where(mask, df['p1'])
yields
In [149]: df
Out[149]:
p1 p2 value first second
0 a b 1 a b
1 b a 1 a b
2 a c 2 a c
3 a d 3 a d
4 b c 5 b c
5 d a 3 a d
6 c b 5 b c
Then you can drop_duplicates:
df = df.drop_duplicates(subset=['value', 'first', 'second'])
import pandas as pd
key = {'p1': ['a','b','a','a','b','d','c'], 'p2': ['b','a','c','d','c','a','b'], 'value': [1,1,2,3,5,3,5]}
df = pd.DataFrame(key, columns=['p1','p2','value'])
mask = df['p1'] < df['p2']
df['first'] = df['p1'].where(mask, df['p2'])
df['second'] = df['p2'].where(mask, df['p1'])
df = df.drop_duplicates(subset=['value', 'first', 'second'])
df = df[['p1', 'p2', 'value']]
yields
In [151]: df
Out[151]:
p1 p2 value
0 a b 1
2 a c 2
3 a d 3
4 b c 5
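As an alternative sketch (not from the answer above), NumPy can sort the two columns row-wise, which produces the same canonical ordering without the two where calls:
import numpy as np
import pandas as pd

df = pd.DataFrame({'p1': ['a','b','a','a','b','d','c'],
                   'p2': ['b','a','c','d','c','a','b'],
                   'value': [1,1,2,3,5,3,5]})

# Sort each (p1, p2) pair so order within the pair no longer matters.
pair = pd.DataFrame(np.sort(df[['p1', 'p2']].to_numpy(), axis=1),
                    index=df.index, columns=['first', 'second'])

# Keep only the first occurrence of each (first, second, value) triple.
df = df[~pd.concat([pair, df['value']], axis=1).duplicated()]
print(df)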
I have two tables. One contains unique names, for example 'A', 'B', 'C', and the other contains a time series of months, for example 10/2021, 11/2021, 12/2021. I want to join the tables so that I have all timestamps for each name. The final data should look like this:
Month     Name
10/2021   A
11/2021   A
12/2021   A
10/2021   B
11/2021   B
12/2021   B
10/2021   C
11/2021   C
12/2021   C
From cartesian product in pandas:
df1 = pd.DataFrame([1, 2, 3], columns=['A'])
df2 = pd.DataFrame(["a", "b", "c"], columns=['B'])
df = (df1.assign(key=1)
.merge(df2.assign(key=1), on="key")
.drop("key", axis=1)
)
A B
0 1 a
1 1 b
2 1 c
3 2 a
4 2 b
5 2 c
6 3 a
7 3 b
8 3 c
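If your pandas version is 1.2 or newer, merge supports how='cross' directly, so the temporary key column is unnecessary; the same cartesian product becomes:
import pandas as pd

df1 = pd.DataFrame([1, 2, 3], columns=['A'])
df2 = pd.DataFrame(["a", "b", "c"], columns=['B'])

# Cross join: every row of df1 paired with every row of df2.
df = df1.merge(df2, how='cross')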
If you are only trying to get the cartesian product of the values, you can do it using itertools.product:
import pandas as pd
from itertools import product
df1 = pd.DataFrame(list('abcd'), columns=['letters'])
df2 = pd.DataFrame(list('1234'), columns=['numbers'])
df_combined = pd.DataFrame(product(df1['letters'], df2['numbers']), columns=['letters', 'numbers'])
Output:
letters numbers
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
10 c 3
11 c 4
12 d 1
13 d 2
14 d 3
15 d 4
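Applied to the original names-and-months question, the same call would look like this (a sketch; the values are taken from the question):
import pandas as pd
from itertools import product

names = pd.DataFrame(['A', 'B', 'C'], columns=['Name'])
months = pd.DataFrame(['10/2021', '11/2021', '12/2021'], columns=['Month'])

# One row per (Name, Month) combination, so every name gets every month.
result = pd.DataFrame(product(names['Name'], months['Month']),
                      columns=['Name', 'Month'])
result = result[['Month', 'Name']]  # match the column order shown in the question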
I have a dataframe
A 2
B 4
C 3
and I would like to make a data frame with the following
A 0
A 1
B 0
B 0
B 0
B 1
C 0
C 0
C 1
So for B, I want to make 4 rows and each one is 0 except for the last one which is 1. Similarly, for A, I'll have 2 rows and the first one has a 0 and the second one has a 1.
In general, if I have a row in the original table with X n, I want to return n rows in the new table with n-1 of them being X 0 and the final one as X 1.
Is there a way to do this in R? Or Python or SQL?
In R, we may use uncount to replicate the rows according to the second column, then replace the second column with a binary flag derived from duplicated on the first column:
library(tidyr)
library(dplyr)
df1 %>%
uncount(v2) %>%
mutate(v2 = +(!duplicated(v1, fromLast = TRUE)))
Output:
v1 v2
1 A 0
2 A 1
3 B 0
4 B 0
5 B 0
6 B 1
7 C 0
8 C 0
9 C 1
Or in Python
import pandas as pd
df1 = pd.DataFrame({"v1":["A", "B", "C"], "v2": [2, 4, 3]})
df2 = df1.reindex(df1.index.repeat(df1.v2))
df2['v2'] = (~df2.duplicated(subset=['v1'], keep="last")).astype(int)
df2
v1 v2
0 A 0
0 A 1
1 B 0
1 B 0
1 B 0
1 B 1
2 C 0
2 C 0
2 C 1
Data:
df1 <- structure(list(v1 = c("A", "B", "C"), v2 = c(2L, 4L, 3L)),
                 class = "data.frame", row.names = c(NA, -3L))
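A further pandas sketch (not from the answer above) avoids the duplicated trick by numbering the rows within each group from the end:
import pandas as pd

df1 = pd.DataFrame({"v1": ["A", "B", "C"], "v2": [2, 4, 3]})

# Repeat each row v2 times, then flag the last row of each v1 group.
out = df1.loc[df1.index.repeat(df1["v2"]), ["v1"]].reset_index(drop=True)
out["v2"] = (out.groupby("v1").cumcount(ascending=False) == 0).astype(int)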
It's not that hard with base R...
d <- data.frame(x = LETTERS[1:3], n = c(2L, 4L, 3L))
d
## x n
## 1 A 2
## 2 B 4
## 3 C 3
data.frame(x = rep.int(d$x, d$n), i = replace(integer(sum(d$n)), cumsum(d$n), 1L))
## x i
## 1 A 0
## 2 A 1
## 3 B 0
## 4 B 0
## 5 B 0
## 6 B 1
## 7 C 0
## 8 C 0
## 9 C 1
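The same base R trick ports to Python almost verbatim; a NumPy sketch, using the x/n names from the example above:
import numpy as np
import pandas as pd

d = pd.DataFrame({'x': ['A', 'B', 'C'], 'n': [2, 4, 3]})

# Zeros everywhere, with a 1 at the end of each block of repeats.
i = np.zeros(d['n'].sum(), dtype=int)
i[d['n'].cumsum() - 1] = 1
out = pd.DataFrame({'x': np.repeat(d['x'].to_numpy(), d['n'].to_numpy()), 'i': i})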
# load package
library(data.table)
# set as data table
setDT(df1)  # using akrun's df1 (columns v1 and v2) from above
# work: repeat each row v2 times, number the rows within each v1 group
out <- df1[rep(seq(.N), v2), ][, c := seq_len(.N), v1]
# flag the last row of each group as 1, overwrite v2, drop the helpers
out[, d := 0][v2 == c, d := 1][, v2 := d][, c('c', 'd') := NULL]
There is another way in Python to replicate what @akrun did in R:
>>> from datar.all import f, tibble, uncount, mutate, duplicated, as_integer
>>> df1 = tibble(v1=list("ABC"), v2=[2, 4, 3])
>>> df1 >> uncount(f.v2) >> mutate(
... v2=as_integer(~duplicated(f.v1, from_last=True))
... )
v1 v2
<object> <object>
0 A 0
1 A 1
2 B 0
3 B 0
4 B 0
5 B 1
6 C 0
7 C 0
8 C 1
I am the author of datar, which is backed by pandas.
I have two dataframes df1 and df2. First column in both is a customer ID which is an int, but other columns contains various string values. I want to produce a new dataframe df3 that contains, for each customer ID, a set of values found in df2 but not in df1.
Example:
df1:
v1 v2 v3 v4
cust
1 A B B A
2 A A A A
3 B B A A
4 B C A A
df2:
v1 v2 v3 v4
cust
1 A A C B
2 A A C B
3 C B B A
4 C B B A
Expected output:
cust
1 {C}
2 {B, C}
3 {C}
4 {}
In [2]: df_2 = pd.DataFrame({"KundelID" : list(range(1,11)),
...: 'V1' : list('AACCBBBCCC'),
...: 'V2' : list('AABBBCCCAA'),
...: 'V3' : list('CCBBBBBAAB'),
...: 'V4' : list('BBAACAAAAB')})
...: df_1 = pd.DataFrame({"KundelID" : list(range(1,11)),
...: 'V1' : list('AABBCCCCCC'),
...: 'V2' : list('BABCCCCAAA'),
...: 'V3' : list('BAAAAABBBB'),
...: 'V4' : list('AAAACCCCBB')})
In [3]: df_1
Out[3]:
KundelID V1 V2 V3 V4
0 1 A B B A
1 2 A A A A
2 3 B B A A
3 4 B C A A
4 5 C C A C
5 6 C C A C
6 7 C C B C
7 8 C A B C
8 9 C A B B
9 10 C A B B
In [4]: df_2
Out[4]:
KundelID V1 V2 V3 V4
0 1 A A C B
1 2 A A C B
2 3 C B B A
3 4 C B B A
4 5 B B B C
5 6 B C B A
6 7 B C B A
7 8 C C A A
8 9 C A A A
9 10 C A B B
In [7]: import numpy as np
   ...: # unique values per customer (transpose, then drop the KundelID row)
   ...: uniq_1 = df_1.T[1:].apply(np.unique)
   ...: uniq_2 = df_2.T[1:].apply(np.unique)
   ...: diff = [[v for v in v2 if v not in v1] for v1, v2 in zip(uniq_1, uniq_2)]
   ...: pd.DataFrame({"KundeID": df_2.KundelID,
   ...:               'Not-in-df_1': [','.join(d) if d else None for d in diff]})
Out[7]:
KundeID Not-in-df_1
0 1 C
1 2 B,C
2 3 C
3 4 None
4 5 B
5 6 B
6 7 A
7 8 None
8 9 None
9 10 None
The idea is to transform all values in each row into a set. Then, we can take the set difference for each customer ID. This avoids loops and list comprehensions:
df3 = (
pd
.concat([
df1.reindex(index=df2.index).apply(set, axis=1),
df2.apply(set, axis=1),
], axis=1)
.apply(lambda r: r[1].difference(r[0]), axis=1)
)
print(df3)
# Out:
cust
1 {C}
2 {B, C}
3 {C}
4 {}
Notes:
The bit df1.reindex(index=df2.index) is in case some IDs are absent from df1 or df2.
It is trivial to transform the output into something other than a set. For example, using ','.join(r[1].difference(r[0])) as the lambda body will produce strings.
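For instance, a sketch of the string variant mentioned in the note above (sorted is added here only to make the output deterministic):
df3_str = (
    pd
    .concat([
        df1.reindex(index=df2.index).apply(set, axis=1),
        df2.apply(set, axis=1),
    ], axis=1)
    .apply(lambda r: ','.join(sorted(r[1].difference(r[0]))), axis=1)
)
print(df3_str)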
Setup:
For future reference: in order to facilitate a reproducible example, it is a good idea to provide code that SO-ers can copy/paste directly for a quick start into your problem.
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""
1 A B B A
2 A A A A
3 B B A A
4 B C A A
"""), sep=' ', names='cust v1 v2 v3 v4'.split()).set_index('cust')
df2 = pd.read_csv(io.StringIO("""
1 A A C B
2 A A C B
3 C B B A
4 C B B A
"""), sep=' ', names='cust v1 v2 v3 v4'.split()).set_index('cust')
You can transform each dataframe into a Series of sets, then perform a set operation across the two Series, leveraging the intrinsic data alignment of pandas Series:
df2.apply(set, axis=1) - df1.apply(set, axis=1)
Output:
cust
1 {C}
2 {C, B}
3 {C}
4 {}
dtype: object
If you want the symmetric difference across datasets (i.e. elements in either set but not in both), then it's better to use pd.concat:
dfs = [df1, df2]
pd.concat([df.apply(set, 1) for df in dfs], 1).apply(lambda x: x[0]^x[1], 1)
where 1 here stands for axis=1. Also, replacing x[0]^x[1] by set.symmetric_difference(*x) should work as well.
Interestingly, Series_A ^ Series_B doesn't work as expected; instead, it apparently returns a boolean Series telling us whether the values returned by the set operations are non-empty.
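For reference, the same concat-based symmetric difference written with explicit keyword arguments and positional iloc access (a sketch):
import pandas as pd

dfs = [df1, df2]  # as defined in the setup above

sym = (
    pd.concat([df.apply(set, axis=1) for df in dfs], axis=1)
      .apply(lambda x: x.iloc[0] ^ x.iloc[1], axis=1)
)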
I have to copy columns from one DataFrame A to another DataFrame B. The column names in A and B do not match.
What is the best way to do it? There are several columns like this. Do I need to write B["SO"] = A["Sales Order"] etc. for each column?
I would use pd.concat:
combined_df = pd.concat([df1, df2[['column_a', 'column_b']]], axis=1)
It also gives you the power to concat DataFrames of different sizes, do outer joins, etc.
Use:
df1 = pd.DataFrame({
'SO':list('abcdef'),
'RI':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
})
print (df1)
SO RI C
0 a 4 7
1 b 5 8
2 c 4 9
3 d 5 4
4 e 5 2
5 f 4 3
df2 = pd.DataFrame({
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
print (df2)
D E F
0 1 5 a
1 3 3 a
2 5 6 a
3 7 9 b
4 1 2 b
5 0 4 b
Create a dictionary for renaming, select the matching columns, rename them by the dict, and use DataFrame.join to attach them to the original (the DataFrames are matched by index values):
d = {'SO':'Sales Order',
'RI':'Retail Invoices'}
df11 = df1[d.keys()].rename(columns=d)
print (df11)
Sales Order Retail Invoices
0 a 4
1 b 5
2 c 4
3 d 5
4 e 5
5 f 4
df = df2.join(df11)
print (df)
D E F Sales Order Retail Invoices
0 1 5 a a 4
1 3 3 a b 5
2 5 6 a c 4
3 7 9 b d 5
4 1 2 b e 5
5 0 4 b f 4
Make a dictionary of abbreviations and try this code.
Ex:
full_form_dict = {'SO':'Sales Order',
'RI':'Retail Invoices',}
A_col = list(A.columns)
B_col = [v for k,v in full_form_dict.items() if k in A_col]
# to loop over A_col
# B_col = [v for col in A_col for k,v in full_form_dict.items() if k == col]
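A sketch of how that dictionary could actually copy the data across; A and B here are hypothetical stand-ins for the question's frames:
import pandas as pd

# Hypothetical source and target frames.
A = pd.DataFrame({'Sales Order': list('abc'), 'Retail Invoices': [4, 5, 4]})
B = pd.DataFrame(index=A.index)

full_form_dict = {'SO': 'Sales Order',
                  'RI': 'Retail Invoices'}

# Copy each mapped column from A into B under its abbreviated name.
for abbr, full in full_form_dict.items():
    if full in A.columns:
        B[abbr] = A[full]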
I have data in this format
ID Val
1 A
1 B
1 C
2 A
2 C
2 D
I want to group the data by ID, see which value combinations exist within each ID, and sum the combinations up across IDs. The resulting output should look like:
v1 v2 count
A B 1
A C 2
A D 1
B C 1
C D 1
Is there a smart way to get this instead of looping through each possible combinations?
This should work:
>>> ts = df.groupby('Val')['ID'].aggregate(lambda ts: set(ts))
>>> ts
Val
A    {1, 2}
B       {1}
C    {1, 2}
D       {2}
Name: ID, dtype: object
>>> from itertools import product
>>> pd.DataFrame([[i, j, len(ts[i] & ts[j])] for i, j in product(ts.index, ts.index) if i < j],
... columns=['v1', 'v2', 'count'])
v1 v2 count
0 A B 1
1 A C 2
2 A D 1
3 B C 1
4 B D 0
5 C D 1
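Another sketch, not from the answer above: enumerate the sorted pairs per ID with itertools.combinations and count them, which skips zero-count pairs automatically:
import pandas as pd
from itertools import combinations

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'Val': ['A', 'B', 'C', 'A', 'C', 'D']})

# For each ID, list the sorted value pairs, then count identical pairs.
pairs = df.groupby('ID')['Val'].apply(lambda s: list(combinations(sorted(s), 2)))
counts = pairs.explode().value_counts().sort_index()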
What I came up with:
Use pd.merge to create the cartesian product
Filter the cartesian product to include only combinations of the form that you desire
Count the number of combinations
Convert to the desired dataframe format
Unsure if it is faster than looping through all possible combinations.
#!/usr/bin/env python3
import pandas as pd
# Create the dataframe
df = pd.DataFrame([
[1, 'A'],
[1, 'B'],
[1, 'C'],
[2, 'A'],
[2, 'C'],
[2, 'D'],
], columns=['ID', 'Val'])
'''
ID Val
0 1 A
1 1 B
2 1 C
3 2 A
4 2 C
5 2 D
[6 rows x 2 columns]
'''
# Create the cartesian product
df2 = pd.merge(df, df, on='ID')
'''
ID Val_x Val_y
0 1 A A
1 1 A B
2 1 A C
3 1 B A
4 1 B B
5 1 B C
6 1 C A
7 1 C B
8 1 C C
9 2 A A
10 2 A C
11 2 A D
12 2 C A
13 2 C C
14 2 C D
15 2 D A
16 2 D C
17 2 D D
[18 rows x 3 columns]
'''
# Count the pairs, dropping identical (A, A) pairs and reversed (B, A) duplicates.
counts = pd.Series([
    v for v in zip(df2.Val_x, df2.Val_y)
    if v[0] < v[1]  # excludes both equal and reversed pairs
]).value_counts(sort=False).sort_index()
'''
(A, B) 1
(A, C) 2
(A, D) 1
(B, C) 1
(C, D) 1
dtype: int64
'''
# Combine the counts
df3 = pd.DataFrame(dict(
v1=[v1 for v1, _ in counts.index],
v2=[v2 for _, v2 in counts.index],
count=counts.values
))
'''
  v1 v2  count
0  A  B      1
1  A  C      2
2  A  D      1
3  B  C      1
4  C  D      1
'''