Get unique values from multiple lists in Pandas column - python

How can I join the multiple lists in Pandas column 'B' and get only the unique values:
    A                          B
0  10          [x50, y-1, sss00]
1  20   [x20, MN100, x50, sss00]
2  ...
Expected output:
[x50, y-1, sss00, x20, MN100]

You can do this simply with a list comprehension and the sum() method:
result = [x for x in set(df['B'].sum())]
If you now print result, you get the desired output:
['y-1', 'x20', 'sss00', 'x50', 'MN100']
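For reference, a minimal self-contained sketch of this approach, assuming column B already holds Python lists (the sample frame below is rebuilt from the question):
import pandas as pd

# rebuild the sample data; B already contains Python lists
df = pd.DataFrame({
    'A': [10, 20],
    'B': [['x50', 'y-1', 'sss00'], ['x20', 'MN100', 'x50', 'sss00']],
})

# sum() concatenates the lists, set() drops duplicates (order is not preserved)
result = list(set(df['B'].sum()))
print(result)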

If the input data are not lists but strings, first create the lists:
df.B = df.B.str.strip('[]').str.split(',')
Or:
import ast
df.B = df.B.apply(ast.literal_eval)
Use Series.explode to get one Series from the lists, together with Series.unique to remove duplicates, if order is important:
L = df.B.explode().unique().tolist()
#alternative
#L = df.B.explode().drop_duplicates().tolist()
print (L)
['x50', 'y-1', 'sss00', 'x20', 'MN100']
Another idea, if order is not important, is to use set() with a flattened list comprehension:
L = list(set([y for x in df.B for y in x]))
print (L)
['x50', 'MN100', 'x20', 'sss00', 'y-1']
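A minimal end-to-end sketch of the explode path, assuming B is stored as strings such as '[x50, y-1, sss00]' (the sample values are rebuilt from the question):
import pandas as pd

# B holds bracketed strings rather than lists
df = pd.DataFrame({
    'A': [10, 20],
    'B': ['[x50, y-1, sss00]', '[x20, MN100, x50, sss00]'],
})

# strip the brackets and split on ', ' to get real lists
df.B = df.B.str.strip('[]').str.split(', ')

# one value per row, duplicates removed, first-seen order kept
L = df.B.explode().unique().tolist()
print(L)  # ['x50', 'y-1', 'sss00', 'x20', 'MN100']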

Related

Convert comma-separated values into integer list in pandas dataframe

How to convert a comma-separated value into a list of integers in a pandas dataframe?
Input: a qty column of comma-separated number strings (for example "5,2,3").
Desired output: the same column as lists of integers (for example [5, 2, 3]).
There are two steps, split and convert to integers, because after the split the values are lists of strings. The solutions also work well if the lists have different lengths (no Nones are added):
df['qty'] = df['qty'].apply(lambda x: [int(y) for y in x.split(',')])
Or:
df['qty'] = df['qty'].apply(lambda x: list(map(int, x.split(','))))
Alternative solutions:
df['qty'] = [[int(y) for y in x.split(',')] for x in df['qty']]
df['qty'] = [list(map(int, x.split(','))) for x in df['qty']]
Or try expand=True:
df['qty'] = df['qty'].str.split(',', expand=True).astype(int).agg(list, axis=1)
Another, partly vectorised solution with ast.literal_eval:
import ast
df["qty"] = ("[" + df["qty"].astype(str) + "]").apply(ast.literal_eval)

Count number of rows that a value stored in a list occurs in

There is a DataFrame df that holds data as lists of strings:
>> df
words
0 [a,b,c]
1 [a]
2 [x,c,c]
3 [a]
...
I want to count the number of rows that each value in words occurs in. For example:
a: 3
b: 1
c: 2
x: 1
I get a list of all unique words in the DataFrame using:
>> from collections import OrderedDict #using OrderedDict to keep word order
>> l = []
>> df.words.apply(lambda x: l.append(x)) #add list of words to a list
>> l = list(OrderedDict.fromkeys([j for i in l for j in i])) #merge list of lists and remove duplicates
>> print(l)
[a,b,c,x]
From here I go through the list l, check each row of df for whether the word exists, and then sum the Bool values for each word.
data = []
for w in l:
    tmp = []
    df.words.apply(lambda x: tmp.append(w in x))
    data.append(sum(tmp))
I can then create a dictionary of words and their count. This is, however, very inefficient, as it takes a long time (70,000+ words and 50,000+ rows). Is there a faster way of doing this?
You can use Series.explode with Series.value_counts:
df['words'].explode().value_counts(sort=False)
One more alternative is using itertools.chain.from_iterable with collections.Counter:
from itertools import chain
from collections import Counter

counts = Counter(chain.from_iterable(df['words']))
pd.Series(counts)
a 3
b 1
c 3
x 1
dtype: int64
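Note that both snippets above count total occurrences, so c comes out as 3 rather than the 2 rows asked for. A minimal sketch for counting rows instead is to deduplicate each list before exploding (same df as in the question):
# turn each row's list into a set so repeated words within a row count once
row_counts = df['words'].apply(lambda x: list(set(x))).explode().value_counts(sort=False)
print(row_counts)  # c is now 2: it appears in two rows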
Convert each list to a set, then back to a list. Combine them using itertools and then run collections.Counter on the result to get the dictionary.
#if data is your list of lists (for example df['words'].tolist())
import itertools
import collections
data = [list(set(i)) for i in data]
newData = list(itertools.chain.from_iterable(data))
#chain makes an iterator that returns elements from the first iterable until it is
#exhausted, then proceeds to the next iterable, until all of the iterables are
#exhausted.
dictVal = collections.Counter(newData)

Intersection in single pandas Series

0
0 [g,k]
1 [e,g]
2 [e]
3 [k,e]
4 [s]
5 [g]
I am trying to get the value which appears only once in the data column; in this example the solution should be 's'.
But I can only find methods that solve this problem for two Series or two DataFrame columns.
I can't do it with a single column, because if the value is part of a combination, unique won't work as far as I know.
If you need to test which values occur only once, use Series.explode with Series.value_counts and then filter the index for counts equal to 1 with boolean indexing:
s = df[0].explode().value_counts()
L = s.index[s == 1].tolist()
print (L)
['s']
Or use a pure Python solution with Counter, flattening the nested lists in the Series with a list comprehension:
from collections import Counter
L = [k for k, v in Counter([y for x in df[0] for y in x]).items() if v == 1]
print (L)
['s']
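A self-contained sketch of the first approach, rebuilding the sample column from the question:
import pandas as pd

# the sample data, stored as lists in column 0
df = pd.DataFrame({0: [['g', 'k'], ['e', 'g'], ['e'], ['k', 'e'], ['s'], ['g']]})

# one value per row, count, keep values seen exactly once
s = df[0].explode().value_counts()
print(s.index[s == 1].tolist())  # ['s']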

Search values from a list in dataframe cell list and add another column with results

I am trying to create a column with the result of a comparison between a DataFrame cell list and a list.
I have this dataframe with list values:
df = pd.DataFrame({'A': [['KB4525236', 'KB4485447', 'KB4520724', 'KB3192137', 'KB4509091']], 'B': [['a', 'b']]})
and a list with this value:
findKBs = ['KB4525236','KB4525202']
The expected result:
A B C
0 [KB4525236, KB4485447, KB4520724, KB3192137, K... [a, b] [KB4525202]
I don't know how to compare my list with the cell list and find the non-matches; can you help me?
You should simply compare the two lists: loop through the values of findKBs and add them to a new list if they are not in df['A'][0]:
df['C'] = [[x for x in findKBs if x not in df['A'][0]]]
Result:
A B C
0 [KB4525236, KB4485447, KB4520724, KB3192137, K... [a, b] [KB4525202]
There's probably a pandas-centric way you could do it, but this appears to work:
df['C'] = [list(filter(lambda el: el not in df['A'][0], findKBs))]
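If the DataFrame had more than one row (an assumption beyond the single-row example above), a sketch of a row-wise version is to apply the same comparison per cell:
# compute the non-matches for every row, not just row 0
df['C'] = df['A'].apply(lambda lst: [x for x in findKBs if x not in lst])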

Python: Combination with criteria

I have the following list of combinations:
a = [(1,10),(2,8),(300,28),(413,212)]
b = [(8,28), (8,15),(10,21),(28,34),(413,12)]
I want to create a new combination list from these two lists that satisfies the following criteria:
A. List a and list b have common elements: the second element of a tuple in list a equals the first element of a tuple in list b.
Matching tuples from list a and list b should form a new combination:
d = [(1,10,21),(2,8,28),(2,8,15),(300,28,34)]
All other tuples in both lists which do not satisfy the criteria are ignored.
QUESTIONS
Can I do this criteria-based combination using itertools?
What is the most elegant way to solve this problem with/without using modules?
How can one write the output to an Excel sheet so that each element of a tuple in list d goes to a separate column, such that:
d = [(1,10,21),(2,8,28),(2,8,15),(300,28,34)] is displayed in Excel as:
Col A = [1, 2, 2, 300]
Col B = [10,8,8,28]
Col C = [21,28,15,34]
pandas works like a charm for Excel.
Here is the code:
a = [(1,10),(2,8),(300,28),(413,212)]
b = [(8,28), (8,15),(10,21),(28,34),(413,12)]
c = [(x, y, t) for x, y in a for z, t in b if y == z]
import pandas as pd
df = pd.DataFrame(c)
df.to_excel('MyFile.xlsx', header=False, index=False)
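To answer the itertools part of the question: the same join can be written with itertools.product, which makes the pairwise comparison explicit (a sketch equivalent to the comprehension above, not a separate method from the answer):
from itertools import product

a = [(1, 10), (2, 8), (300, 28), (413, 212)]
b = [(8, 28), (8, 15), (10, 21), (28, 34), (413, 12)]

# pair every tuple in a with every tuple in b and keep the matching ones
d = [(x, y, t) for (x, y), (z, t) in product(a, b) if y == z]
print(d)  # [(1, 10, 21), (2, 8, 28), (2, 8, 15), (300, 28, 34)]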
