Pandas find duplicate concatenated values across selected columns

I want to find duplicates in a selection of columns of a df:
from collections import defaultdict

# convert the sub-DataFrame into a matrix
mat = df[['idx', 'a', 'b']].values
str_dict = defaultdict(set)
for i in range(mat.shape[0]):
    # concatenate a and b into the key
    concat = ''.join(str(v) for v in mat[i][1:])
    # collect the idx values under each key a + b
    str_dict[concat].add(mat[i][0])
dups = {}
for key, idxs in str_dict.items():
    if len(idxs) < 2:
        continue
    dups[key] = idxs
The code finds duplicates of the concatenation of a and b. It uses the concatenation as the key of a defaultdict of sets (str_dict), adds the idx values to each key, and finally stores in a dict (dups) every concatenation whose set holds 2 or more idx values.
I am wondering if there is a more efficient way to do this.

You can just concatenate and convert to set:
res = set(df['a'].astype(str) + df['b'].astype(str))
Example:
df = pd.DataFrame({'idx': [1, 2, 3],
                   'a': [4, 4, 5],
                   'b': [5, 5, 6]})
res = set(df['a'].astype(str) + df['b'].astype(str))
print(res)
# {'56', '45'}
If you need to map indices too:
df = pd.DataFrame({'idx': [1, 2, 3],
                   'a': [41, 4, 5],
                   'b': [3, 13, 6]})
df['conc'] = (df['a'].astype(str) + df['b'].astype(str))
df = df.reset_index()
res = df.groupby('conc')['index'].apply(set).to_dict()
print(res)
# {'413': {0, 1}, '56': {2}}

You can select the columns you need before calling drop_duplicates:
df[['a','b']].drop_duplicates().astype(str).apply(np.sum, axis=1).tolist()
Out[1027]: ['45', '56']
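Concatenating the values as strings can make different pairs collide (for example 41 and 3 versus 4 and 13 both give '413'). A minimal sketch that avoids this, assuming the goal is to map each repeated (a, b) pair to the idx values that share it, groups on the columns themselves. The sample data is made up for illustration:

```python
import pandas as pd

# Illustrative data (an assumption, not from the question)
df = pd.DataFrame({'idx': [1, 2, 3, 4],
                   'a': [4, 4, 5, 41],
                   'b': [5, 5, 6, 3]})

# Rows whose (a, b) pair occurs more than once;
# keep=False marks every member of a duplicated group
dup_rows = df[df.duplicated(subset=['a', 'b'], keep=False)]

# Map each duplicated (a, b) pair to the set of idx values sharing it
dups = dup_rows.groupby(['a', 'b'])['idx'].apply(set).to_dict()
print(dups)
```

The keys are (a, b) tuples rather than concatenated strings, so no two distinct pairs can end up under the same key.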

Related

Create a dictionary from a list

I'm trying to create a dictionary in Python from this output:
["'a'=df2['a']", "'b'=df2['b']", "'c'=df2['c']", "'d'=df2['d']"]
I tried with this code:
list_columns = list(df2.columns)
list_dictionary = []
for row in list_columns:
    resultado = "'"+str(row)+"'" + "=" + "df2[" + "'" + row + "'" + "]"
    list_dictionary.append(resultado)
clean_list_dictionary = ','.join(list_dictionary).replace('"','')
dictionary = dict(clean_list_dictionary)
print(dictionary)
But I get an error:
ValueError: dictionary update sequence element #0 has length 1; 2 is required
Do you have any idea how I can make this work?
Thank you in advance!
Output dictionary should look like this:
{
'a' : df2['a'],
'b' : df2['b'],
'c' : df2['c'],
'd' : df2['d']
}

Method 1: Transforming your list of strings for a later eval
As you mentioned in your comment:
"I would like to create a dictionary with this format: {'a' : df2['a'], 'b' : df2['b'], 'c' : df2['c'], 'd' : df2['d']}. I will use it as global variables in an eval() function."
You can use the following to convert your input strings:
import pandas as pd

# dummy dataframe
df2 = pd.DataFrame([[1,2,3,4]], columns=['a','b','c','d'])
# your list of strings
l = ["'a'=df2['a']", "'b'=df2['b']", "'c'=df2['c']", "'d'=df2['d']"]
# solution
def dict_string(l):
    s0 = [i.split('=') for i in l]
    s1 = '{' + ', '.join([': '.join([k,v]) for k,v in s0]) + '}'
    return s1
output = dict_string(l)
print(output)
eval(output)
# string before eval
{'a': df2['a'], 'b': df2['b'], 'c': df2['c'], 'd': df2['d']} #<----
# result after eval
{'a': 0    1
 Name: a, dtype: int64,
 'b': 0    2
 Name: b, dtype: int64,
 'c': 0    3
 Name: c, dtype: int64,
 'd': 0    4
 Name: d, dtype: int64}
Method 2: Using eval as part of your iteration of the list of strings
Here is a way to do this using list comprehensions and eval, as part of the iteration on the list of strings itself. This will give you the final output that you would get if you were to use eval on the dictionary string you are expecting.
# dummy dataframe
df2 = pd.DataFrame([[1,2,3,4]], columns=['a','b','c','d'])
# your list of strings
l = ["'a'=df2['a']", "'b'=df2['b']", "'c'=df2['c']", "'d'=df2['d']"]
# solution
def eval_dict(l):
    s0 = [(eval(j) for j in i.split('=')) for i in l]
    s1 = {k:v for k,v in s0}
    return s1
output = eval_dict(l)
print(output)
{'a': 0    1
 Name: a, dtype: int64,
 'b': 0    2
 Name: b, dtype: int64,
 'c': 0    3
 Name: c, dtype: int64,
 'd': 0    4
 Name: d, dtype: int64}
The output is a dict that has 4 keys, (a,b,c,d) and 4 corresponding values for columns a, b, c, d from df2 respectively.

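You can loop over the list, split each string on the '=' character, and build a dict from the resulting pairs (ls is the list of strings). Note that the values stay strings such as "df2['a']"; they are not evaluated.
Code:
dic = {}
for l in ls:
    dic.update(dict([l.split('=')]))
dic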

I think this is exactly what you want.
data = ["'a'=df2['a']", "'b'=df2['b']", "'c'=df2['c']", "'d'=df2['d']"]
dic = {}
for d in data:
    # strip the quotes so the key is a rather than 'a'
    k = d.split("=")[0].strip("'")
    v = df2[d.split("=")[1].split("'")[1]]
    dic.update({k: v})
print(dic)

It's not clear what exactly you want to achieve.
If you have a pd.DataFrame and you want to convert it to a dictionary where the column names are the keys and the columns are the values, you should use df.to_dict('series').
import pandas as pd
# Generate the dataframe
data = {'a': [1, 2, 1, 0], 'b': [2, 3, 4, 5], 'c': [10, 11, 12, 13], 'd': [21, 22, 23, 24]}
df = pd.DataFrame.from_dict(data)
# Convert to dictionary
result = df.to_dict('series')
print(result)
If you have a list of strings that you need to convert to the desired output, then you should do it differently. What you have are strings containing the text 'df2', while df in your dict is a variable. So you only need to extract the column names and use the variable df, not the string:
import pandas as pd
# Generate the dataframe
data = {'a': [1, 2, 1, 0], 'b': [2, 3, 4, 5], 'c': [10, 11, 12, 13], 'd': [21, 22, 23, 24]}
df = pd.DataFrame.from_dict(data)
# create string list
lst = ["'a'=df2['a']", "'b'=df2['b']", "'c'=df2['c']", "'d'=df2['d']"]
# Convert to dictionary
result = {}
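for item in lst:
    # item[1] is the character after the opening quote,
    # so this only works for single-character column names
    key = item[1]
    result[key] = df[key]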
print(result)
The results are the same, but in the second case the list of strings is created for no reason, because the first example achieves the same result without it.
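If the list of strings is only an intermediate step, the dictionary can also be built directly from the DataFrame. A minimal sketch, assuming df2 is the one-row dummy frame used in the answers above and that the dictionary is meant to supply names to eval(); the expression "a + b" is a made-up example:

```python
import pandas as pd

df2 = pd.DataFrame([[1, 2, 3, 4]], columns=['a', 'b', 'c', 'd'])

# Build {column name: column Series} directly, with no string parsing
columns_dict = {col: df2[col] for col in df2.columns}

# Passed as the locals mapping, so eval() does not add __builtins__ to it
total = eval("a + b", {}, columns_dict)
print(total.tolist())
```

This gives the same mapping as df2.to_dict('series') while keeping eval() away from strings that were assembled by hand.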

How to print only specific columns in pandas

cols = list(ds.columns.values)
ds = ds[cols[1:3] + cols[5:6] + [cols[9]]]
print(ds)
Why do we convert to a list in the line cols = list(ds.columns.values)?

If ds is a pandas DataFrame:
type(ds.columns.values)
>>> <class 'numpy.ndarray'>
If you add two arrays of strings or characters in NumPy, they are combined element by element:
a1 = np.char.array(['a', 'b'])
a2 = np.char.array(['c', 'd'])
a1 + a2
>>> chararray(['ac', 'bd'], dtype='<U2')
and not:
np.char.array(['a', 'b', 'c', 'd'])
That is why you should convert to a list, because:
list1 = ['a','b']
list2 = ['c','d']
list1 + list2
>>> ['a','b','c','d']
Remember that selecting several columns from a pandas.DataFrame requires a list of column names, so that is what you should pass:
ds[[column1, column2, column5, column9]]

If you do slicing for a single numpy.ndarray or a single list, you would be able to get the dataframe:
cols = ds.columns.values #numpy.ndarray
ds = ds[cols[1:3]] #ok
cols = ds.columns.tolist() #list
ds = ds[cols[1:3]] #ok
However, if you use the + operator, the behavior is different between numpy.ndarray and list
cols = ds.columns.values #numpy.ndarray
ds = ds[cols[1:3] + cols[5:6]] #ERROR
cols = ds.columns.tolist() #list
ds = ds[cols[1:3] + cols[5:6]] #ok
That is because the + operator is "concatenation" for list,
whereas for numpy.ndarray, the + operator is numpy.add.
In other words, cols[1:3] + cols[5:6] is actually doing np.add(cols[1:3], cols[5:6])
Refer to documentation for more details.
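If you prefer to stay with the ndarray, np.concatenate joins the slices end to end, which is what + does for lists. A minimal sketch, assuming a made-up ten-column frame named ds:

```python
import numpy as np
import pandas as pd

# Illustrative frame with columns c0 .. c9 (an assumption)
ds = pd.DataFrame([list(range(10))], columns=[f"c{i}" for i in range(10)])
cols = ds.columns.values  # numpy.ndarray of column names

# Join the pieces end to end instead of adding them element-wise
wanted = np.concatenate([cols[1:3], cols[5:6], [cols[9]]])
subset = ds[wanted]
print(list(subset.columns))
```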

ds.columns.values returns an ndarray, so slicing it also produces ndarrays. The + operator between ndarrays behaves differently than between lists:
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [1, 2, 3, 4], 'col3': [1, 2, 3, 4], 'col4': [1, 2, 3, 4],
                   'col5': [1, 2, 3, 4], 'col6': [1, 2, 3, 4], 'col7': [1, 2, 3, 4], 'col8': [1, 2, 3, 4]})
cols_arr = df.columns.values
cols_list = list(df.columns.values)
print(cols_arr[0:2] + cols_arr[3:4] + [cols_arr[7]])
print(cols_list[0:2] + cols_list[3:4] + [cols_list[7]])
Output
['col1col4col8' 'col2col4col8']
['col1', 'col2', 'col4', 'col8']
When you try to access the dataframe using the first result, df[cols_arr[0:2] + cols_arr[3:4] + [cols_arr[7]]], you will get
KeyError: "None of [Index(['col1col4col8', 'col2col4col8'], dtype='object')] are in the [columns]"
With the lists, df[cols_list[0:2] + cols_list[3:4] + [cols_list[7]]], you will get the new dataframe
   col1  col2  col4  col8
0     1     1     1     1
1     2     2     2     2
2     3     3     3     3
3     4     4     4     4

A simpler way to convert columns into a list:
ds.columns.tolist()
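But this also seems unnecessary. ds.columns returns an Index. You can slice an Index just like a normal list, and then combine the pieces using .union. Note that .union expects a list-like argument, so the single label cols[9] is wrapped in a list, and that by default it may sort the result; pass sort=False to keep the original order:
cols = ds.columns
ds = ds[cols[1:3].union(cols[5:6], sort=False).union([cols[9]], sort=False)]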
Note that you can use .iloc to reach your goal in a more idiomatic way:
ds = ds.iloc[:, [1, 2, 5, 9]]
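When the positions are easier to write as slices than as an explicit list, NumPy's np.r_ can expand them for .iloc. A minimal sketch, again assuming a made-up ten-column frame:

```python
import numpy as np
import pandas as pd

# Illustrative frame with columns c0 .. c9 (an assumption)
ds = pd.DataFrame([list(range(10))], columns=[f"c{i}" for i in range(10)])

# np.r_ expands slices and single positions into one integer array
positions = np.r_[1:3, 5, 9]
subset = ds.iloc[:, positions]
print(list(subset.columns))
```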

Selecting a row in pandas based on all its column values

I would like to locate a specific row (given all its columns values) within a pandas frame.
My attempts so far:
df = pd.DataFrame(
    columns = ["A", "B", "C"],
    data = [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
        [10, 11, 12],
    ])
# row to find (last one)
row = {"A" : 10, "B" : 11, "C" : 12}
# chain
idx = df[(df["A"] == 10) & (df["B"] == 11) & (df["C"] == 12)].index[0]
print(idx)
# iterative
mask = pd.Series([True] * len(df), index=df.index)
for k, v in row.items():
    mask &= (df[k] == v)
idx = df[mask].index[0]
print(idx)
# pandas series
for idx in df.index:
    print(idx, (df.loc[idx, :] == pd.Series(row)).all())
Is there a simpler way to do that? Something like idx = df.find(row)?
This functionality is often needed for example to locate one specific sample in a time series. I cannot believe that there is no straightforward way to do that.

Do you simply want?
df[df.eq(row).all(axis=1)] #.index # if the indices are needed
output:
A B C
3 10 11 12
Or, if you have more columns and want to ignore them for the comparison:
df[df[list(row)].eq(row).all(axis=1)]
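The comparison above can be wrapped in a small helper that plays the role of the df.find(row) the question asks for. This is a minimal sketch; find_row is a hypothetical name, not a pandas method:

```python
import pandas as pd

def find_row(df, row):
    """Return the index labels of all rows matching every key/value in row."""
    cols = list(row)
    # Compare only the columns named in row, then require all of them to match
    mask = df[cols].eq(pd.Series(row)).all(axis=1)
    return df.index[mask.to_numpy()].tolist()

df = pd.DataFrame(
    columns=["A", "B", "C"],
    data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
)
print(find_row(df, {"A": 10, "B": 11, "C": 12}))
print(find_row(df, {"A": 1, "B": 11}))
```

Returning a list rather than a single label avoids the IndexError that .index[0] raises when no row matches.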

How to merge multiple lists into 1 but only those elements that were in all of the initial lists?

I need to merge 5 lists, any of which can be empty, so that only items that were in all 5 initial lists are included in the newly formed list.
for f in filters:
    if f == 'M':
        filtered1 = [] # imagine that this is filled
    if f == 'V':
        filtered2 = [] # imagine that this is filled
    if f == 'S':
        filtered3 = [] # imagine that this is filled
    if f == 'O':
        filtered4 = [] # imagine that this is filled
    if f == 'C':
        filtered5 = [] # imagine that this is filled
filtered = [] # merge all 5 lists from above
So now I need to make a list filtered with the merged data from the lists filtered1 to filtered5. How should I do that?

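This is the most classical solution.
filtered = filtered1 + filtered2 + filtered3 + filtered4 + filtered5
What happens is that you append one list to another, and so on. Note that this concatenates the lists and keeps every element; it does not restrict the result to items present in all of them.
So if filtered1 was ['a', 'b'], filtered3 was ['c', 'd'], filtered4 was ['e'], and the others were empty,
then you would get:
filtered = ['a', 'b', 'c', 'd', 'e']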

Given some lists xs1, ..., xs5:
xss = [xs1, xs2, xs3, xs4, xs5]
sets = [set(xs) for xs in xss]
merged = set.intersection(*sets)
Note that merged is a set, so its elements are in no particular order. If any of the lists is empty, merged is empty as well.
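If the order of the first list matters, the same intersection idea can be written so that it preserves it. A minimal sketch; common_items and its skip_empty flag are hypothetical, and ignoring empty lists is only one possible reading of "any list can be empty":

```python
def common_items(lists, skip_empty=False):
    """Items present in every list, in the order they appear in the first one."""
    if skip_empty:
        # Assumption: an empty list means "no filter applied", so ignore it
        lists = [xs for xs in lists if xs]
    if not lists:
        return []
    first, rest = lists[0], [set(xs) for xs in lists[1:]]
    seen, result = set(), []
    for item in first:
        if item not in seen and all(item in s for s in rest):
            seen.add(item)
            result.append(item)
    return result

print(common_items([[3, 1, 2, 3], [2, 3, 4], [3, 2]]))
print(common_items([[1, 2], [], [2]]))
print(common_items([[1, 2], [], [2]], skip_empty=True))
```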

f1, f2, f3, f4, f5 = [1], [], [2, 5], [4, 1], [3]
only_merge = [*f1, *f2, *f3, *f4, *f5]
print("Only merge: ", only_merge)
merge_and_sort = sorted([*f1, *f2, *f3, *f4, *f5])
print("Merge and sort: ", merge_and_sort)
merge_and_unique_and_sort = sorted({*f1, *f2, *f3, *f4, *f5})
print("Merge, unique and sort: ", merge_and_unique_and_sort)
Output:
Only merge: [1, 2, 5, 4, 1, 3]
Merge and sort: [1, 1, 2, 3, 4, 5]
Merge, unique and sort: [1, 2, 3, 4, 5]

Split a table (or list) based on a column in Python

I have an array
arr = [['a', 'b', 'a'], [1, 2, 3]]
I need this to be split based on the values of the first array, i.e. based on 'a' or 'b'. So the expected output is
arr_out_a = [1, 3]
arr_out_b = [2]
How do I do it?
Please help me correct the question if the way I'm using words like list and array might create confusion.

Use collections.defaultdict():
In [82]: arr = [['a', 'b', 'a'], [1, 2, 3]]
In [83]: from collections import defaultdict
In [84]: d = defaultdict(list)
In [85]: for i, j in zip(*arr):
   ....:     d[i].append(j)
   ....:
In [86]: d
Out[86]: defaultdict(<class 'list'>, {'b': [2], 'a': [1, 3]})

Basically just append them conditionally to predefined empty lists:
arr_out_a = []
arr_out_b = []
for char, num in zip(*arr):
    if char == 'a':
        arr_out_a.append(num)
    else:
        arr_out_b.append(num)
or if you don't like the zip:
arr_out_a = []
arr_out_b = []
for idx in range(len(arr[0])):
    if arr[0][idx] == 'a':
        arr_out_a.append(arr[1][idx])
    else:
        arr_out_b.append(arr[1][idx])
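The if/else version sends every label other than 'a' to arr_out_b. A minimal sketch of a variant that handles any number of labels with a plain dict; split_by_label is a hypothetical helper name:

```python
def split_by_label(labels, values):
    """Group values by the label found at the same position."""
    groups = {}
    for label, value in zip(labels, values):
        groups.setdefault(label, []).append(value)
    return groups

arr = [['a', 'b', 'a'], [1, 2, 3]]
groups = split_by_label(*arr)
arr_out_a = groups.get('a', [])
arr_out_b = groups.get('b', [])
print(arr_out_a, arr_out_b)
```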
