I have a dataframe which has some unique IDs in two of the columns, e.g.:
S.no.  Column1  Column2
1      00001x   00002x
2      00003j   00005k
3      00002x   00001x
4      00004d   00008e
The values can be any strings.
I want to compare the two columns so that only one of the rows for S.no. 1 and 3 remains, since these IDs contain the same information, only in a different order. Basically, if one row has value X in Column1 and Y in Column2, and another row has Y in Column1 and X in Column2, then only one of the two rows should remain.
Is that possible in Python?
You can convert your columns to a frozenset per row.
This gives an order-independent representation on which duplicated can operate.
Finally, slice the rows using the resulting output as a mask:
mask = df.filter(like='Column').apply(frozenset, axis=1).duplicated()
df[~mask]
previous answer using set:
mask = df.filter(like='Column').apply(lambda x: tuple(set(x)), axis=1).duplicated()
df[~mask]
NB: using a set or sorted requires converting to a tuple (lambda x: tuple(sorted(x))), since the duplicated function hashes the values, which is not possible with mutable objects.
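For completeness, the sorted variant from the note would look like this (same result on this data):
# sorted() returns a list, so wrap it in tuple() to make it hashable
mask = df.filter(like='Column').apply(lambda x: tuple(sorted(x)), axis=1).duplicated()
df[~mask]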
Output:
   S.no. Column1 Column2
0      1  00001x  00002x
1      2  00003j  00005k
3      4  00004d  00008e
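For reference, a self-contained sketch of the frozenset approach, reconstructing the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'S.no.': [1, 2, 3, 4],
                   'Column1': ['00001x', '00003j', '00002x', '00004d'],
                   'Column2': ['00002x', '00005k', '00001x', '00008e']})

# frozenset ignores order, so (X, Y) and (Y, X) compare equal
mask = df.filter(like='Column').apply(frozenset, axis=1).duplicated()
print(df[~mask])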
I'm trying to find the duplicates between 2 columns, where order is independent, but I need to keep the count of duplicates after dropping them:
df = pd.DataFrame([['A', 'B'], ['D', 'B'], ['B', 'A'], ['B', 'C'], ['C', 'B']],
                  columns=['source', 'target'])
This is my expected result:
source target count
0 A B 2
1 D B 1
3 B C 2
I've already tried several approaches, but I can't come close to a solution.
It does not matter which combination is maintained. In the result example I kept the first.
The following approach creates a new column containing a set of the values in the columns specified. The advantage is that all other columns are preserved in the final result. Furthermore, the indices are preserved the same way as in the expected output you posted:
df = pd.DataFrame([['A', 'B'], ['D', 'B'], ['B', 'A'], ['B', 'C'], ['C', 'B']],
                  columns=['source', 'target'])
# Create column with set of both columns
df['tmp'] = df.apply(lambda x: frozenset([x['source'], x['target']]), axis=1)
# Create count column based on new tmp column
df['count'] = df.groupby(['tmp'])['target'].transform('size')
# Drop duplicate rows based on new tmp column
df = df[~df.duplicated(subset='tmp', keep='first')]
# Remove tmp column
df = df.drop(columns='tmp')
df
Output:
source target count
0 A B 2
1 D B 1
3 B C 2
You can use df.duplicated() to see which rows are duplicated; the output is True if a row is a duplicate and False if it isn't. For more info and practical examples, check out the documentation.
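For instance, combining df.duplicated() with the order-normalizing frozenset idea from the other answers, a quick sketch of what the mask looks like on this data:
import pandas as pd

df = pd.DataFrame([['A', 'B'], ['D', 'B'], ['B', 'A'], ['B', 'C'], ['C', 'B']],
                  columns=['source', 'target'])

# True marks a row whose unordered pair has already been seen
print(df.apply(frozenset, axis=1).duplicated())
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# dtype: bool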
Create a summary based on applying a frozenset to your desired columns. Here we're using all columns.
summary = df.apply(frozenset, axis=1).value_counts()
This'll give you a Series of:
(A, B) 2
(C, B) 2
(B, D) 1
dtype: int64
You can then reconstruct a DataFrame by iterating over that series, eg:
df2 = pd.DataFrame(((*idx, val) for idx, val in summary.items()), columns=[*df.columns, 'count'])
Which results in:
source target count
0 A B 2
1 C B 2
2 B D 1
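Note that iterating a frozenset yields its elements in an arbitrary order, which is why the last row comes back as B D rather than the original D B. A sketch of a variant that keeps the first-seen orientation instead, on the same sample frame:
import pandas as pd

df = pd.DataFrame([['A', 'B'], ['D', 'B'], ['B', 'A'], ['B', 'C'], ['C', 'B']],
                  columns=['source', 'target'])

keys = df.apply(frozenset, axis=1)

# Keep the first occurrence of each unordered pair, then attach its count
out = df[~keys.duplicated()].copy()
out['count'] = keys.map(keys.value_counts())
print(out)
#   source target  count
# 0      A      B      2
# 1      D      B      1
# 3      B      C      2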
I'm looking to aggregate a dataframe based on one column, summing the others. I have attempted to do this via df.loc combined with sum, and have searched Stack extensively already. If any of you have suggestions, they are very welcome :)
Original:
Date       Value x  Value y
13-3-1920  1        0
13-3-1920  0        1
30-4-1920  0        1
30-4-1920  1        1
Desired Output:
Date       Value x  Value y
13-3-1920  1        1
30-4-1920  1        2
Thank you in advance!
"df" is the dataframe. I've considered date column to be of string data type.
The group by function gives the expected output.
df.groupby(['date']).sum()
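A self-contained sketch, reconstructing the sample data from the tables above (column names assumed from the question):
import pandas as pd

df = pd.DataFrame({'Date': ['13-3-1920', '13-3-1920', '30-4-1920', '30-4-1920'],
                   'Value x': [1, 0, 0, 1],
                   'Value y': [0, 1, 1, 1]})

# Sum every numeric column per date; reset_index turns Date back into a column
print(df.groupby(['Date']).sum().reset_index())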
I have a pandas series of this format, with multiple non-unique index labels (example):
index value
num 1
0 2
num 3
0 4
and would like to split it into 2 series:
index  value        index  value
num    1             0      2
num    3             0      4
The order of the values has to be maintained as in the example (the order in which they appear). The first series can be obtained with
series.num
or
series['num']
Unfortunately, that doesn't work for the second one, as those index labels are integers. Does anybody have a solution to this?
You can use .iloc[] with a boolean mask built from the index to locate the rows:
df1 = df.iloc[df.index == 'num']
df2 = df.iloc[df.index == 0]
This returns two objects, split by index label.
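A self-contained sketch on the sample series from the question:
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['num', 0, 'num', 0])

# A boolean mask on the index preserves the original order of the values
s1 = s.iloc[s.index == 'num']   # num -> 1, 3
s2 = s.iloc[s.index == 0]       # 0   -> 2, 4
print(s1)
print(s2)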
For example, I have a dataframe called dat, and I want to apply a function to each column of the dataframe: if the return value is True, keep the column and move on to the next one; if the return value is False, drop the column and move on to the next one.
I know I can write a for loop to do this, but is there an efficient way?
You could do it like this, using a boolean index on df.columns. Say I want to drop all columns where the sum (for simplicity) is greater than 50:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [101, 102, 102, 102]})
r = df.apply(np.sum)  # apply the sum function to every column
c = r <= 50           # boolean test per column
df[c[c].index]        # use boolean indexing to select the passing columns
Output:
A
0 2
1 4
2 6
3 8
Updating an old answer:
df.loc[:, df.sum() <= 50]
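More generally, for the arbitrary per-column test described in the question, a sketch with a hypothetical predicate keep (any function from a column Series to a bool works):
import pandas as pd

df = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [101, 102, 102, 102]})

def keep(col):
    # Hypothetical test: keep columns whose sum is at most 50
    return col.sum() <= 50

# df.apply(keep) evaluates the predicate once per column, yielding a
# boolean Series indexed by column name that .loc can consume directly
print(df.loc[:, df.apply(keep)])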
I have a dataframe with millions of rows with unique indexes and a column('b') that has several repeated values.
I would like to generate a dataframe without the duplicated data but I do not want to lose the index information. I want the new dataframe to have an index that is a concatenation of the indexes ("old_index1,old_index2") where 'b' had duplicated values but remains unchanged for rows where 'b' had unique values. The values of the 'b' column should remain unchanged like in a keep=first strategy. Example below.
Input dataframe:
df = pd.DataFrame(data=[[1, "non_duplicated_1"],
                        [2, "duplicated"],
                        [2, "duplicated"],
                        [3, "non_duplicated_2"],
                        [4, "non_duplicated_3"]],
                  index=['one', 'two', 'three', 'four', 'five'],
                  columns=['a', 'b'])
Desired output:
a b
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
The actual dataframe is quite large, so I would like to avoid non-vectorized operations.
I am finding this surprisingly difficult... Any ideas?
You can use transform on the index column (after you use reset_index). Then, drop duplicates in column b:
df.index = df.reset_index().groupby('b')['index'].transform(','.join)
df.drop_duplicates('b', inplace=True)
>>> df
a b
index
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
Setup
dct = {'index': ','.join, 'a': 'first'}
You can reset_index before using groupby, although it's unclear to me why you want this:
df.reset_index().groupby('b', as_index=False, sort=False).agg(dct).set_index('index')
b a
index
one non_duplicated_1 1
two,three duplicated 2
four non_duplicated_2 3
five non_duplicated_3 4
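Putting the setup and the groupby together, a self-contained sketch using the sample frame from the question:
import pandas as pd

df = pd.DataFrame(data=[[1, "non_duplicated_1"],
                        [2, "duplicated"],
                        [2, "duplicated"],
                        [3, "non_duplicated_2"],
                        [4, "non_duplicated_3"]],
                  index=['one', 'two', 'three', 'four', 'five'],
                  columns=['a', 'b'])

# reset_index exposes the labels as an 'index' column so ','.join can see them
dct = {'index': ','.join, 'a': 'first'}
out = df.reset_index().groupby('b', as_index=False, sort=False).agg(dct).set_index('index')
print(out)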