Intersection in single pandas Series - python

0
0 [g,k]
1 [e,g]
2 [e]
3 [k,e]
4 [s]
5 [g]
I am trying to get the value which appears only once in the data column; in this example the solution should be 's'.
But I can only find methods that solve this problem for two Series or two DataFrame columns.
I can't do it within one column because, if the value is part of a combination, unique won't work, as far as I know.

If you need to test whether a value appears exactly once, use Series.explode with Series.value_counts and then filter the index for counts equal to 1 with boolean indexing:
s = df[0].explode().value_counts()
L = s.index[s == 1].tolist()
print (L)
['s']
Or use a pure Python solution with Counter, flattening the nested lists in the Series with a list comprehension:
from collections import Counter
L = [k for k, v in Counter([y for x in df[0] for y in x]).items() if v == 1]
print (L)
['s']

Related

Get unique values from multiple lists in Pandas column

How can I join the multiple lists in a Pandas column 'B' and get the unique values only:
A B
0 10 [x50, y-1, sss00]
1 20 [x20, MN100, x50, sss00]
2 ...
Expected output:
[x50, y-1, sss00, x20, MN100]
You can do this simply with a list comprehension and the sum() method:
result = [x for x in set(df['B'].sum())]
Now if you print result you will get your desired output:
['y-1', 'x20', 'sss00', 'x50', 'MN100']
If the input data are strings rather than lists, first create the lists:
df.B = df.B.str.strip('[]').str.split(',')
Or:
import ast
df.B = df.B.apply(ast.literal_eval)
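One caveat (my note, not part of the original answer): splitting on ',' keeps the leading space of every element after the first, and ast.literal_eval only works if the elements are quoted inside the string (e.g. "['x50', 'y-1']"). A minimal sketch that strips the whitespace:
import pandas as pd

df = pd.DataFrame({'B': ['[x50, y-1, sss00]', '[x20, MN100, x50, sss00]']})

#split on ',' and strip the leftover whitespace from each element
df.B = df.B.str.strip('[]').str.split(',').apply(lambda xs: [x.strip() for x in xs])
print(df.B.tolist())
#[['x50', 'y-1', 'sss00'], ['x20', 'MN100', 'x50', 'sss00']]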
If order is important, use Series.explode to get one Series from the lists, with Series.unique to remove duplicates:
L = df.B.explode().unique().tolist()
#alternative
#L = df.B.explode().drop_duplicates().tolist()
print (L)
['x50', 'y-1', 'sss00', 'x20', 'MN100']
Another idea, if order is not important: use a set comprehension with flattened lists:
L = list({y for x in df.B for y in x})
print (L)
['x50', 'MN100', 'x20', 'sss00', 'y-1']

Count number of rows that a value stored in a list occurs in

There is a DataFrame df that holds data in list of strings:
>> df
words
0 [a,b,c]
1 [a]
2 [x,c,c]
3 [a]
...
I want to count the number of rows that each value in words occurs in. For example:
a: 3
b: 1
c: 2
x: 1
I get a list of all unique words in the DataFrame using:
>> from collections import OrderedDict #using OrderedDict to keep word order
>> l = []
>> df.words.apply(lambda x: l.append(x)) #add list of words to a list
>> l = list(OrderedDict.fromkeys([j for i in l for j in i])) #merge list of lists and remove duplicates
>> print(l)
[a,b,c,x]
From here I go through the list l, checking each row of df for whether the word exists, and then sum the Bool values for each word.
data = []
for w in l:
    tmp = []
    df.words.apply(lambda x: tmp.append(w in x))
    data.append(sum(tmp))
I can then create a dictionary of words and their count. This is, however, very inefficient, as it takes a long time (70,000+ words and 50,000+ rows). Is there a faster way of doing this?
You can use Series.explode with Series.value_counts:
df['words'].explode().value_counts(sort=False)
One more alternative is using itertools.chain.from_iterable with collections.Counter:
from collections import Counter
from itertools import chain

counts = Counter(chain.from_iterable(df['words']))
pd.Series(counts)
a 3
b 1
c 3
x 1
dtype: int64
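Note that both snippets above count total occurrences, which is why c comes out as 3 even though it appears in only 2 rows. To count rows instead, deduplicate each list before counting, as the set-based answer below does; a pandas sketch of the same idea:
#deduplicate within each row before exploding, so a word repeated in one
#row is counted once: a=3, b=1, c=2, x=1
df['words'].apply(lambda x: list(set(x))).explode().value_counts(sort=False)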
Convert each list to a set, then back to a list. Combine them using itertools, and then run collections.Counter on the result to get the dictionary.
#if data is your list of lists or dataframe
import itertools
import collections
data = [list(set(i)) for i in data]
newData = list(itertools.chain.from_iterable(data))
#chain makes an iterator that returns elements from the first iterable until it is
#exhausted, then proceeds to the next iterable, until all of the iterables are
#exhausted.
dictVal = collections.Counter(newData)
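For the sample data above this gives the per-row counts from the question:
print(dictVal)
#Counter({'a': 3, 'c': 2, 'b': 1, 'x': 1}) - order of ties may vary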

Search values from a list in dataframe cell list and add another column with results

I am trying to create a column with the result of a comparison between a DataFrame cell list and a list.
I have this dataframe with list values:
df = pd.DataFrame({'A': [['KB4525236', 'KB4485447', 'KB4520724', 'KB3192137', 'KB4509091']], 'B': [['a', 'b']]})
and a list with this value:
findKBs = ['KB4525236','KB4525202']
The expected result :
A B C
0 [KB4525236, KB4485447, KB4520724, KB3192137, K... [a, b] [KB4525202]
I don't know how to iterate my list against the cell list and find the non-matches. Can you help me?
You should simply compare the two lists like this: loop through the values of findKBs and assign them to a new list if they are not in df['A'][0]:
df['C'] = [[x for x in findKBs if x not in df['A'][0]]]
Result:
A B C
0 [KB4525236, KB4485447, KB4520724, KB3192137, K... [a, b] [KB4525202]
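The comprehension above hard-codes row 0 via df['A'][0]; if the frame had several rows, a row-wise apply would generalize it (my sketch, not part of the original answer):
#hypothetical multi-row version: compute the non-matching KBs per row
df['C'] = df['A'].apply(lambda kbs: [x for x in findKBs if x not in kbs])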
There's probably a pandas-centric way you could do it, but this appears to work:
df['C'] = [list(filter(lambda el: el not in df['A'][0], findKBs))]

Counting the number of pairs of items in a grouped dataframe's column. (Pandas)

I would like to count the number of pairs of items in a column. I made my own solution, but I would like to know if there are more concise solutions.
Here is an example and my approach.
I have a DataFrame like this.
df = pd.DataFrame({'id':[1,1,2,2], 'item':['apple','orange','orange','apple']})
Finally, I would like to know which items are bought together most often; in this case, the result should be that orange and apple are bought together most often.
Then, I did a groupby based on the values in the id column.
id_group = df.groupby('id')
Then, to count the number of pairs of items in the item column, I made a function like the one below and applied it to the item column of id_group. After this, I combined the lists of tuples using sum(). Finally, I used Counter() to count the number of pairs containing the same items.
In combos(), I used sorted() to avoid counting ('apple','orange') and ('orange','apple') separately.
Are there better approaches to get the result showing there are 2 pairs of ('apple','orange') or 2 pairs of ('orange','apple')?
import itertools
from collections import Counter

def combos(x):
    combinations = []
    num = x.size
    while num != 1:
        combinations += list(itertools.combinations(x, num))
        num -= 1
    element_sorted = map(sorted, combinations)
    return list(map(tuple, element_sorted))

k = id_group['item'].apply(lambda x: combos(x)).sum()
Counter(k)
Use the all_subsets function, with the 0 changed to 2 so it yields pairs, triples, ... like your solution:
#https://stackoverflow.com/a/5898031
from itertools import chain, combinations
def all_subsets(ss):
    return chain(*map(lambda x: combinations(ss, x), range(2, len(ss)+1)))
Then flatten the values. I think it is better not to use sum to concatenate lists: it looks fancy, but it is quadratic and should be considered bad practice.
So here the flattening is done with sorted tuples in a list comprehension:
k = [tuple(sorted(z)) for y in id_group['item'].apply(all_subsets) for z in y]
print (Counter(k))
Counter({('apple', 'orange'): 2})
How about this?
from collections import Counter
k = df.groupby('id')['item'].apply(lambda x: tuple(x.sort_values()))
Counter(k)
Counter({('apple', 'orange'): 2})
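Note the groupby answer above turns each group's whole basket into one sorted tuple, which only counts pairs because every id here has exactly two items. To count strictly pairs for baskets of any size, one sketch is to generate the size-2 combinations per group:
from collections import Counter
from itertools import combinations

pair_counts = Counter(
    pair
    for _, items in df.groupby('id')['item']
    for pair in combinations(sorted(items), 2)
)
print(pair_counts)
#Counter({('apple', 'orange'): 2})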

Is there a Don't Care value for lists in Python

Is there a way to use count() where you are looking for a specific value in the nested list, without caring about the rest?
lst = [[1,6],[1,4],[3,4],[1,2]]
X = 1
lst.count([X, _ ])
This would return a count of 3, since there are three nested lists that have a 1 in the first index.
Is there a way to do this?
Use some sneaky sum() hacks:
sum(k[0] == X for k in your_list)
I.e.
>>> X = 1
>>> your_list = [[1,6],[1,4],[3,4],[1,2]]
>>> sum(k[0] == X for k in your_list)
3
Why?
The expression k[0] == X for k in your_list is a generator expression that yields True for each element of your_list whose first element equals your X. The sum() function consumes those values, treating each True as a 1.
Look at the length of a filtered list:
my_list = [[1,6],[1,4],[3,4],[1,2]]
X = 1
len([q for q in my_list if q[0] == X])
Or, if you prefer to use count, then make a list of the items you do care about:
[q[0] for q in my_list].count(X)
You can do len(list(filter(lambda x: x[0] == 1, lst))) (in Python 3, filter returns an iterator, so wrap it in list() before taking len).
But be careful: if your list contains an element that is not a list (or an empty list), it will throw an exception! This can be handled by adding two additional conditions:
len(list(filter(lambda x: type(x) == list and len(x) > 0 and x[0] == 1, lst)))
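For example, with deliberately messy input (hypothetical data, not from the question), the guarded version skips the non-list and empty elements:
messy = [[1, 6], [], 'oops', [1, 4], [3, 4], [1, 2]]
print(len(list(filter(lambda x: type(x) == list and len(x) > 0 and x[0] == 1, messy))))
#3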
Counting how often one value occurs in the first position requires a full pass over the list, so if you plan to use a hypothetical countfunction(inputlist, target) more than once on the same list, it is more efficient to build a dictionary holding all the counts (also requiring one pass), which you can subsequently query in O(1).
>>> from collections import Counter
>>> from operator import itemgetter
>>>
>>> lst = [[1,6],[1,4],[3,4],[1,2]]
>>> c = Counter(map(itemgetter(0), lst))
>>> c[1]
3
>>> c[3]
1
>>> c[512]
0
Others have shown good ways to approach this problem using python built-ins, but you can use numpy if what you're actually after is fancy indexing.
For example:
import numpy as np
lst = np.array([[1,6],[1,4],[3,4],[1,2]])
print(lst)
#array([[1, 6],
# [1, 4],
# [3, 4],
# [1, 2]])
In this case lst is a numpy.ndarray with shape (4,2) (4 rows and 2 columns). If you want to count the number of rows where the first column (index 0) is equal to X, you can write:
X = 1
print((lst[:,0] == X).sum())
#3
The first part, lst[:,0], means: take all rows, and only the first column.
print(lst[:,0])
#[1 1 3 1]
Then you check which of these is equal to X:
print(lst[:,0]==X)
#[ True True False True]
Finally sum the resultant array to get the count. (There is an implicit conversion from bool to int for the sum.)
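Equivalently, numpy's np.count_nonzero performs the same boolean reduction in a single call:
print(np.count_nonzero(lst[:, 0] == X))
#3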
