Check to see if column values exist in dictionary [pandas] - python

Can a data frame column (Series) of lists be used as a conditional check within a dictionary?
I have a column of lists of words (split-up tweets) that I'd like to check against a vocab dictionary: if a word does not exist in the vocab, I'd like to skip it and continue, then run a function over the words that do exist.
This code produces the intended result for one row of the column; however, I get an "unhashable type: 'list'" error if I try to apply it to more than one row.
w2v_sum = w2v[[x for x in train['words'].values[1] if x in w2v.vocab]].sum()
Edit with reproducible example:
df = pd.DataFrame(data={'words':[['cow','bird','cat'],['red','blue','green'],['low','high','med']]})
d = {'cow':1,'bird':4,'red':1,'blue':1,'green':1,'high':6,'med':3}
Desired output is total (sum of the words within dictionary):
total words
0 5 [cow, bird, cat]
1 3 [red, blue, green]
2 9 [low, high, med]

This should do what you want:
import pandas as pd
df = pd.DataFrame(data={'words':[['cow','bird','cat'],['red','blue','green'],['low','high','med']]})
d = {'cow':1,'bird':4,'red':1,'blue':1,'green':1,'high':6,'med':3}
EDIT: to handle the lists inside the column, use this nested comprehension:
list_totals = [[d[x] for x in y if x in d] for y in df['words'].values]
list_totals = [sum(x) for x in list_totals]
list_totals
[5, 3, 9]
You can then add list_totals as a column to your DataFrame.
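Putting it together, a minimal sketch of the full flow using the example df and d from above:

```python
import pandas as pd

df = pd.DataFrame(data={'words': [['cow', 'bird', 'cat'],
                                  ['red', 'blue', 'green'],
                                  ['low', 'high', 'med']]})
d = {'cow': 1, 'bird': 4, 'red': 1, 'blue': 1, 'green': 1, 'high': 6, 'med': 3}

# Sum the dictionary values for each row's words, skipping words not in d
list_totals = [sum(d[x] for x in y if x in d) for y in df['words']]
df['total'] = list_totals
print(df)
```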

One solution is to use collections.Counter and a list comprehension:
from collections import Counter
d = Counter({'cow':1,'bird':4,'red':1,'blue':1,'green':1,'high':6,'med':3})
df['total'] = [sum(map(d.__getitem__, L)) for L in df['words']]
print(df)
words total
0 [cow, bird, cat] 5
1 [red, blue, green] 3
2 [low, high, med] 9
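The reason a Counter is used here instead of a plain dict: a Counter returns 0 for missing keys rather than raising KeyError, so no membership check is needed inside the comprehension. A quick illustration:

```python
from collections import Counter

d = Counter({'cow': 1, 'bird': 4})
# Missing keys yield 0 instead of raising KeyError
print(d['cat'])  # -> 0
print(d['cow'])  # -> 1
```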
Alternatively, if you always have a fixed number of words, you can split into multiple series and use pd.DataFrame.applymap:
df['total'] = pd.DataFrame(df['words'].tolist()).applymap(d.get).sum(1).astype(int)
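On newer pandas (0.25+), Series.explode with Series.map is another route; a sketch under the same example data:

```python
import pandas as pd

df = pd.DataFrame(data={'words': [['cow', 'bird', 'cat'],
                                  ['red', 'blue', 'green'],
                                  ['low', 'high', 'med']]})
d = {'cow': 1, 'bird': 4, 'red': 1, 'blue': 1, 'green': 1, 'high': 6, 'med': 3}

# Explode to one word per row, map through the dictionary (NaN for unknown
# words), then sum back per original row; sum() skips NaN by default
df['total'] = df['words'].explode().map(d).groupby(level=0).sum().astype(int)
```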

Related

Count number of rows that a value stored in a list occurs in

There is a DataFrame df that holds lists of strings:
>> df
words
0 [a,b,c]
1 [a]
2 [x,c,c]
3 [a]
...
I want to count the number of rows that each value in words occurs in. For example:
a: 3
b: 1
c: 2
x: 1
I get a list of all unique words in the DataFrame using:
>> from collections import OrderedDict #using OrderedDict to keep word order
>> l = []
>> df.words.apply(lambda x: l.append(x)) #add list of words to a list
>> l = list(OrderedDict.fromkeys([j for i in l for j in i])) #merge list of lists and remove duplicates
>> print(l)
[a,b,c,x]
From here I go through the list l, checking each row of df for whether the word exists, and then sum the bool values for each word.
data = []
for w in l:
    tmp = []
    df.words.apply(lambda x: tmp.append(w in x))
    data.append(sum(tmp))
I can then create a dictionary of words and their count. This is, however, very inefficient, as it takes a long time (70,000+ words and 50,000+ rows). Is there a faster way of doing this?
You can use Series.explode with Series.value_counts. To count the number of rows a word appears in (rather than its total number of occurrences), convert each list to a set first, so the repeated c in [x,c,c] is only counted once:
df['words'].apply(set).explode().value_counts(sort=False)
One more alternative is using itertools.chain.from_iterable with collections.Counter:
from itertools import chain
from collections import Counter
counts = Counter(chain.from_iterable(map(set, df['words'])))
pd.Series(counts)
a    3
b    1
c    2
x    1
dtype: int64
Convert each list to a set, then back to list. Combine them using itertools and then run a collections.Counter on it to get the dictionary.
#if data is your list of lists or dataframe
import itertools
import collections
data = [list(set(i)) for i in data]
newData = list(itertools.chain.from_iterable(data))
#chain makes an iterator that returns elements from the first iterable until it is
#exhausted, then proceeds to the next iterable, until all of the iterables are
#exhausted.
dictVal = collections.Counter(newData)
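A runnable sketch of that approach against the example frame (the per-row set() is what makes c count as 2 rows rather than 3 occurrences):

```python
import itertools
import collections
import pandas as pd

df = pd.DataFrame({'words': [['a', 'b', 'c'], ['a'], ['x', 'c', 'c'], ['a']]})

# Deduplicate within each row so a word is counted at most once per row
data = [list(set(i)) for i in df['words']]
newData = list(itertools.chain.from_iterable(data))
dictVal = collections.Counter(newData)
print(dictVal)
```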

Frequency of elements in pairwise comparison of lists within a list in Python

I have a list of lists like this:
my_list_of_lists =
[['sparrow','sparrow','sparrow','junco','jay','robin'],
['sparrow','sparrow','junco', 'sparrow','robin','robin'],
['sparrow','sparrow','sparrow','sparrow','jay','robin']]
I would like to do a pairwise comparison at each position for all lists in the list, like this:
#1 with 2
['sparrow','sparrow','sparrow','junco','jay','robin']
['sparrow','sparrow','junco', 'sparrow','robin','robin']
#1 with 3
['sparrow','sparrow','sparrow','junco','jay','robin']
['sparrow','sparrow','sparrow','sparrow','jay','robin']
#2 with 3
['sparrow','sparrow','junco', 'sparrow','robin','robin']
['sparrow','sparrow','sparrow','sparrow','jay','robin']
So the pairs for the 1 with 2:
pairs = [('sparrow','sparrow'), ('sparrow','sparrow'), ('sparrow','junco'), ('junco','sparrow'), ('jay','robin'), ('robin','robin')]
I would like to get the counts and frequency of the pairs in each pairwise comparison:
sparrowsparrow_counts = 2
juncosparrow_counts = 2
jayrobin_counts = 1
robinrobin_counts = 1
frequency_of_combos = {('sparrow', 'sparrow'): .333, ('sparrow', 'junco'): .333, ('jay', 'robin'): .167, ('robin', 'robin'): .167}
I've tried zipping but I end up zipping all of the lists (not the pairs) into tuples and I'm stumped on the rest.
I think it's somewhat related to How to calculate counts and frequencies for pairs in list of lists? but I can't figure out how to apply this to my data.
Zip the two lists, then filter out the pairs that don't match, and use collections.Counter to count them:
from collections import Counter
a = ['sparrow','sparrow','sparrow','junco','jay','robin']
b = ['sparrow','sparrow','junco', 'sparrow','robin','robin']
c = Counter([ i for i in zip(a,b) if i[0] == i[1]])
print(c)
Counter({('sparrow', 'sparrow'): 2, ('robin', 'robin'): 1})
You seem to have the frequency part figured out, but that should clear up the use of zip and Counter.
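To run every pairwise comparison (1 with 2, 1 with 3, 2 with 3) rather than a single pair, itertools.combinations over the list of lists works; here is a sketch that also computes per-comparison frequencies (the names results and match_freq are illustrative, not from the original):

```python
from collections import Counter
from itertools import combinations

my_list_of_lists = [
    ['sparrow', 'sparrow', 'sparrow', 'junco', 'jay', 'robin'],
    ['sparrow', 'sparrow', 'junco', 'sparrow', 'robin', 'robin'],
    ['sparrow', 'sparrow', 'sparrow', 'sparrow', 'jay', 'robin'],
]

results = {}
for (i, a), (j, b) in combinations(enumerate(my_list_of_lists, 1), 2):
    counts = Counter(zip(a, b))          # all positional pairs for this comparison
    n = len(a)
    match_freq = {pair: cnt / n for pair, cnt in counts.items()}
    results[(i, j)] = (counts, match_freq)

print(results[(1, 2)][0])
```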

Intersection in single pandas Series

0
0 [g,k]
1 [e,g]
2 [e]
3 [k,e]
4 [s]
5 [g]
I am trying to get the value which appears once in the data column, in this example the solution should be 's'.
But I can only find methods that solve this problem for two Series or two DataFrame columns.
I can't do it within a single column because, as far as I know, unique won't work when the value is part of a combination.
If you need to test which values appear only once, use Series.explode with Series.value_counts, then filter the index for counts of 1 with boolean indexing:
s = df[0].explode().value_counts()
L = s.index[s == 1].tolist()
print (L)
['s']
Or use a pure Python solution with Counter, flattening the nested lists of the Series in a list comprehension:
from collections import Counter
L = [k for k, v in Counter([y for x in df[0] for y in x]).items() if v == 1]
print (L)
['s']
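One caveat: if a value can repeat within a single list (say [e, e]), both solutions above would count it twice and miss it. Applying set per row first guards against that; a sketch on the example data:

```python
import pandas as pd

df = pd.DataFrame({0: [['g', 'k'], ['e', 'g'], ['e'], ['k', 'e'], ['s'], ['g']]})

# Deduplicate within each row before exploding, so repeats inside a
# single list cannot inflate the count
s = df[0].apply(set).explode().value_counts()
L = s.index[s == 1].tolist()
print(L)  # -> ['s']
```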

Counting the number of pairs of items in a grouped dataframe's column. (Pandas)

I would like to count the number of pairs of items in a column. I wrote my own solution, but I would like to know if there are more concise ones.
Here is an example and my approach.
I have a DataFrame like this.
df = pd.DataFrame({'id':[1,1,2,2], 'item':['apple','orange','orange','apple']})
Finally, I would like to know which items are bought together most often. In this case, I'd like to get a result showing that orange and apple are bought together most often.
Then, I did groupby based on values in id column.
id_group = df.groupby('id')
Then, to count the number of pairs of items in the item column, I made a function like the one below and applied it to the item column of id_group. After this, I combined the lists of tuples using sum(). Finally, I used Counter() to count the number of pairs containing the same items.
In combos(), I used sorted() to avoid counting ('apple','orange') and ('orange','apple') separately.
Are there better approaches to get a result showing there are 2 pairs of ('apple','orange') or 2 pairs of ('orange','apple')?
import itertools
from collections import Counter

def combos(x):
    combinations = []
    num = x.size
    while num != 1:
        combinations += list(itertools.combinations(x, num))
        num -= 1
    element_sorted = map(sorted, combinations)
    return list(map(tuple, element_sorted))

k = id_group['item'].apply(lambda x: combos(x)).sum()
Counter(k)
Use the all_subsets function, with the range start changed from 0 to 2 so it yields pairs, triples, ... like your solution:
#https://stackoverflow.com/a/5898031
from itertools import chain, combinations
def all_subsets(ss):
    return chain(*map(lambda x: combinations(ss, x), range(2, len(ss)+1)))
Then flatten the values. It is better not to use sum to concatenate lists: it looks fancy, but it is quadratic and should be considered bad practice.
So here is used flattening with sorted tuples in list comprehension:
k = [tuple(sorted(z)) for y in id_group['item'].apply(all_subsets) for z in y]
print (Counter(k))
Counter({('apple', 'orange'): 2})
How about this?
from collections import Counter
k = df.groupby('id')['item'].apply(lambda x: tuple(x.sort_values()))
Counter(k)
Counter({('apple', 'orange'): 2})
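For the specific case of pairs only, itertools.combinations per group is a compact sketch that avoids both the while loop and the quadratic sum():

```python
from collections import Counter
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2], 'item': ['apple', 'orange', 'orange', 'apple']})

# One sorted tuple per 2-item combination within each id group
k = Counter(tuple(sorted(pair))
            for _, group in df.groupby('id')['item']
            for pair in combinations(group, 2))
print(k)  # -> Counter({('apple', 'orange'): 2})
```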

Python: Combination with criteria

I have the following list of combinations:
a = [(1,10),(2,8),(300,28),(413,212)]
b = [(8,28), (8,15),(10,21),(28,34),(413,12)]
I want to create a new combination list from these two lists, following these criteria:
List a and list b have common elements: the second element of a tuple in list a equals the first element of a tuple in list b.
Each such pair of tuples combines into a new triple:
d = [(1,10,21),(2,8,28),(2,8,15),(300,28,34)]
All tuples in either list that do not satisfy the criteria are ignored.
QUESTIONS
Can I do this criteria-based combination using itertools?
What is the most elegant way to solve this problem, with or without modules?
How can one write the output to an Excel sheet so that each element of a tuple in list d appears in a separate column, so that d = [(1,10,21),(2,8,28),(2,8,15),(300,28,34)] is displayed in Excel as:
Col A = [1, 2, 2, 300]
Col B = [10,8,8,28]
Col C = [21,28,15,34]
pandas works like a charm for excel.
Here is the code:
a = [(1,10),(2,8),(300,28),(413,212)]
b = [(8,28), (8,15),(10,21),(28,34),(413,12)]
c = [(x, y, t) for x, y in a for z, t in b if y == z]
import pandas as pd
df = pd.DataFrame(c)
df.to_excel('MyFile.xlsx', header=False, index=False)
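For larger lists, the double loop in the comprehension is O(len(a) * len(b)); indexing list b by the first element of each tuple first makes the join roughly linear (a sketch; lookup is an illustrative name):

```python
from collections import defaultdict

a = [(1, 10), (2, 8), (300, 28), (413, 212)]
b = [(8, 28), (8, 15), (10, 21), (28, 34), (413, 12)]

# Index list b by the first element of each tuple
lookup = defaultdict(list)
for z, t in b:
    lookup[z].append(t)

# Join: for each (x, y) in a, emit (x, y, t) for every (y, t) in b
c = [(x, y, t) for x, y in a for t in lookup[y]]
print(c)  # -> [(1, 10, 21), (2, 8, 28), (2, 8, 15), (300, 28, 34)]
```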
