Add bi-grams to a pandas dataframe - python

I have a list of bi-grams like this:
[['a','b'],['e', 'f']]
Now I want to add these bigrams to a DataFrame with their frequencies like this:
   b  f
a  1  0
e  0  1
I tried doing this with the following code, but it raises an error because the index doesn't exist yet. Is there a fast way to do this for really big data (like 200,000 bigrams)?
matrixA = pd.DataFrame()
# Put the counts in a matrix
for elem in grams:
    tag1, tag2 = elem[0], elem[1]
    matrixA.loc[tag1, tag2] += 1

from collections import Counter
bigrams = [[['a','b'],['e', 'f']], [['a','b'],['e', 'g']]]
pairs = []
for bg in bigrams:
    pairs.append((bg[0][0], bg[0][1]))
    pairs.append((bg[1][0], bg[1][1]))
c = Counter(pairs)
>>> pd.Series(c).unstack() # optional: .fillna(0)
b f g
a 2 NaN NaN
e NaN 1 1
The above is for intuition. This can be wrapped up in a one-line generator expression as follows:
pd.Series(Counter((bg[i][0], bg[i][1]) for bg in bigrams for i in range(2))).unstack()
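If you want integer counts with zeros instead of NaN, a small follow-up (my addition, not part of the answer above) is to chain fillna and astype:
import pandas as pd
from collections import Counter
bigrams = [[['a', 'b'], ['e', 'f']], [['a', 'b'], ['e', 'g']]]
freq = pd.Series(Counter((bg[i][0], bg[i][1])
                         for bg in bigrams for i in range(2))).unstack()
freq = freq.fillna(0).astype(int)  # zeros instead of NaN, integer dtype
print(freq)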

You can use Counter from the collections package. Note that I changed the contents of the list to be tuples rather than lists. This is because Counter keys (like dict keys) must be hashable.
import pandas as pd
from collections import Counter
l = [('a','b'), ('e', 'f')]
index, cols = zip(*l)
df = pd.DataFrame(0, index=index, columns=cols)
counts = Counter(l)
for (row, col), count in counts.items():
    df.loc[row, col] = count
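For the data size mentioned in the question (around 200,000 bigrams), a vectorized alternative is worth sketching. This is my own suggestion, not part of either answer: pd.crosstab counts the pairs without any Python-level loop.
import pandas as pd
# Hypothetical flat list of (first, second) bigram pairs.
grams = [('a', 'b'), ('e', 'f'), ('a', 'b')]
firsts = pd.Series([g[0] for g in grams])
seconds = pd.Series([g[1] for g in grams])
matrix = pd.crosstab(firsts, seconds)  # rows: first token, columns: second token
print(matrix)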

Related

How to check if any of elements in a dictionary value is in string?

I have a dataframe with strings and a dictionary whose values are lists of strings.
I need to check whether each string in the dataframe contains any element of any value in the dictionary, and if it does, label it with the appropriate key from the dictionary. In short, I need to categorize all the strings in the dataframe with keys from the dictionary.
For example:
df = pd.DataFrame({'a':['x1','x2','x3','x4']})
d = {'one':['1','aa'],'two':['2','bb']}
I would like to get something like this:
df = pd.DataFrame({
    'a': ['x1','x2','x3','x4'],
    'Category': ['one','two','x3','x4']})
I tried this, but it did not work:
df['Category'] = np.nan
for k, v in d.items():
    for l in v:
        df['Category'] = [k if l in str(x).lower() else x for x in df['a']]
Any ideas appreciated!
First, create a function that does this for you:
def func(val):
    for x in range(0, len(d.values())):
        if val in list(d.values())[x]:
            return list(d.keys())[x]
Now make use of the split() and apply() methods:
df['Category'] = df['a'].str.split('', expand=True)[2].apply(func)
Finally, use the fillna() method:
df['Category'] = df['Category'].fillna(df['a'])
Now if you print df you will get the expected output:
a Category
0 x1 one
1 x2 two
2 x3 x3
3 x4 x4
Edit:
You can also do it like this:
def func(val):
    for x in range(0, len(d.values())):
        if any(l in val for l in list(d.values())[x]):
            return list(d.keys())[x]
then:
df['Category'] = df['a'].apply(func)
Finally:
df['Category'] = df['Category'].fillna(df['a'])
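For comparison, a minimal sketch of the same idea using an inverted dictionary lookup; this is my addition, not part of either answer, and it assumes plain substring matching as above.
import pandas as pd
df = pd.DataFrame({'a': ['x1', 'x2', 'x3', 'x4']})
d = {'one': ['1', 'aa'], 'two': ['2', 'bb']}
# Invert d: each candidate substring points at its category key.
lookup = {sub: key for key, subs in d.items() for sub in subs}
def categorize(s):
    for sub, key in lookup.items():
        if sub in s:
            return key
    return s  # no substring matched: keep the original string
df['Category'] = df['a'].apply(categorize)
print(df)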
I've come up with the following heuristic, which looks really dirty.
It outputs what you desire, albeit with some warnings, since I've used chained indexing to assign values to the dataframe.
import pandas as pd
import numpy as np
def main():
    df = pd.DataFrame({'a': ['x1', 'x2', 'x3', 'x4']})
    d = {'one': ['1', 'aa'], 'two': ['2', 'bb']}
    found = False
    i = 0
    df['Category'] = np.nan
    for x in df['a']:
        for k, v in d.items():
            for item in v:
                if item in x:
                    df['Category'][i] = k
                    found = True
                    break
                else:
                    df['Category'][i] = x
            if found:
                found = False
                break
        i += 1
    print(df)

main()

How to use a dictionary to speed up the task of look up and counting?

Consider the following snippet:
data = {"col1":["aaa","bbb","ccc","aaa","ddd","bbb"],
"col2":["fff","aaa","ggg","eee","ccc","ttt"]}
df = pd.DataFrame(data,columns=["col1","col2"]) # my actual dataframe has
# 20,00,000 such rows
list_a = ["ccc","aaa","mmm","nnn","ccc"]
list_b = ["ggg","fff","eee","ooo","ddd"]
# After doing a combination of 2 elements between the 2 lists in both orders,
# we get a list that resembles something like this:
new_list = ["ccc-ggg", "ggg-ccc", "aaa-fff", "fff-aaa", ..."ccc-fff", "fff-ccc", ...]
Given a huge dataframe and the 2 lists, I want to count the number of elements in new_list whose two parts occur together in the same row of the dataframe. In the above pseudo-example, the result would be 3, as "aaa-fff", "ccc-ggg", and "ddd-ccc" appear as rows of the dataframe.
Right now, I am using a linear search algorithm but it is very slow as I have to scan through the entire dataframe.
df['col3'] = df['col1'] + "-" + df['col2']
c1 = 0
for a in list_a:
    for b in list_b:
        str1 = a + "-" + b
        str2 = b + "-" + a
        c2 = (df['col3'].str.contains(str1).sum()) + (df['col3'].str.contains(str2).sum())
        c1 += c2
return c1
Can someone kindly help me implement a faster algorithm preferably with a dictionary data structure?
Note: I have to iterate through the 7,000 rows of another dataframe and create the 2 lists dynamically, and get an aggregate count for each row.
Here is another way. First, I used your definition of df (with 2 columns), list_a and list_b.
# combine two columns in the data frame
df['col3'] = df['col1'] + '-' + df['col2']
# create set with list_a and list_b pairs
s = ({f'{a}-{b}' for a, b in zip(list_a, list_b)} |
     {f'{b}-{a}' for a, b in zip(list_a, list_b)})
# find intersection
result = set(df['col3']) & s
print(len(result), '\n', result)
3
{'ddd-ccc', 'ccc-ggg', 'aaa-fff'}
UPDATE to handle duplicate values.
# build list (not set) from list_a and list_b
idx = ([f'{a}-{b}' for a, b in zip(list_a, list_b)] +
       [f'{b}-{a}' for a, b in zip(list_a, list_b)])
# create `col3`, and do `value_counts()` to preserve info about duplicates
df['col3'] = df['col1'] + '-' + df['col2']
tmp = df['col3'].value_counts()
# use idx to sub-select from the value counts:
tmp[ tmp.index.isin(idx) ]
# results:
ddd-ccc 1
aaa-fff 1
ccc-ggg 1
Name: col3, dtype: int64
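If a single aggregate count is what you are after (my reading of the question, not stated in the answer), summing the surviving value counts gives it, duplicates in the dataframe included; continuing with tmp and idx from the snippet above:
total = tmp[tmp.index.isin(idx)].sum()
print(total)  # 3 for the sample data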
Try this:
from itertools import product
# all combinations of the two lists as tuples
all_list_combinations = list(product(list_a, list_b))
# tuples of the two columns
dftuples = [x for x in df.itertuples(index=False, name=None)]
# take the length of the intersection of the two sets and print it
print(len(set(dftuples).intersection(set(all_list_combinations))))
yields
3
First join the columns before looping; then, instead of looping, pass contains a single regex built from all possible strings.
joined = df.col1+ '-' + df.col2
pat = '|'.join([f'({a}-{b})' for a in list_a for b in list_b] +
               [f'({b}-{a})' for a in list_a for b in list_b])  # substitute for itertools.product
ct = joined.str.contains(pat).sum()
To work with dicts instead of dataframes, you can use filter(re, joined) as in this question
import re
import itertools
data = {"col1": ["aaa","bbb","ccc","aaa","ddd","bbb"],
        "col2": ["fff","aaa","ggg","eee","ccc","ttt"]}
list_a = ["ccc","aaa","mmm","nnn","ccc"]
list_b = ["ggg","fff","eee","ooo","ddd"]
### build the regex pattern
pat_set = set('-'.join(combo) for combo in set(
    list(itertools.product(list_a, list_b)) +
    list(itertools.product(list_b, list_a))))
pat = '|'.join(pat_set)
# use itertools to generalize to many columns, remove duplicates with set()
### join the columns row-wise
joined = ['-'.join(row) for row in zip(*[vals for key, vals in data.items()])]
### filter joined
match_list = list(filter(re.compile(pat).match, joined))
ct = len(match_list)
A third option with Series.isin(), inspired by jsmart's answer:
joined = df.col1 + '-' + df.col2
ct = joined.isin(pat_set).sum()
Speed testing
I repeated the data 100,000 times for scalability testing. Series.isin() carries the day, while jsmart's answer is fast but does not find all occurrences because it removes duplicates from joined:
with dicts: 400000 matches, 1.00 s
with pandas: 400000 matches, 1.77 s
with series.isin(): 400000 matches, 0.39 s
with jsmart answer: 4 matches, 0.50 s
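Since the question explicitly asks for a dictionary data structure, here is a minimal Counter-based sketch of my own (not from any answer above), pairing the lists element-wise as in the first answer:
from collections import Counter
import pandas as pd
data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
df = pd.DataFrame(data)
list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]
# Count each (col1, col2) row pair once, up front.
row_counts = Counter(zip(df["col1"], df["col2"]))
# Candidate pairs in both orders, built element-wise from the two lists.
pairs = ({(a, b) for a, b in zip(list_a, list_b)} |
         {(b, a) for a, b in zip(list_a, list_b)})
total = sum(row_counts[p] for p in pairs)
print(total)  # 3 for the sample data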

Is there a way to vectorize counting items' co-occurrences in pandas/numpy?

I frequently need to generate network graphs based on the co-occurrences of items in a column. I start off with something like this:
letters
0 [b, a, e, f, c]
1 [a, c, d]
2 [c, b, j]
In the following example, I want to make a table of all pairs of letters, and then have a "weight" column, which describes how many times each two-letter pair appeared in the same row together (see bottom for an example).
I am currently doing large parts of it using a for loop, and I was wondering if there is a way for me to vectorize it, as I am often dealing with enormous datasets that take an extremely long time to process in this way. I am also concerned about keeping things within memory limits. This is my code right now:
import pandas as pd
# Make some data
df = pd.DataFrame({'letters': [['b','a','e','f','c'],['a','c','d'],['c','b','j']]})
# I make a list of sets, which contain pairs of all the elements
# that co-occur in the data in the same list
sets = []
for lst in df['letters']:
    for i, a in enumerate(lst):
        for b in lst[i:]:
            if not a == b:
                sets.append({a, b})
# Sets now looks like:
# [{'a', 'b'},
# {'b', 'e'},
# {'b', 'f'},...
# Dataframe with one column containing the sets
df = pd.DataFrame({'weight': sets})
# We count how many times each pair occurs together
df = df['weight'].value_counts().reset_index()
# Split the sets into two separate columns
split = pd.DataFrame(df['index'].values.tolist()) \
            .rename(columns=lambda x: f'Node{x+1}') \
            .fillna('-')
# Merge the 'weight' column back onto the dataframe
df = pd.concat([df['weight'], split], axis = 1)
print(df.head())
# Output:
weight Node1 Node2
0 2 c b
1 2 a c
2 1 f e
3 1 d c
4 1 j b
A numpy/scipy solution using sparse incidence matrices:
from itertools import chain
import numpy as np
from scipy import sparse
from simple_benchmark import BenchmarkBuilder, MultiArgument

B = BenchmarkBuilder()

@B.add_function()
def pp(L):
    SZS = np.fromiter(chain((0,), map(len, L)), int, len(L) + 1).cumsum()
    unq, idx = np.unique(np.concatenate(L), return_inverse=True)
    S = sparse.csr_matrix((np.ones(idx.size, int), idx, SZS), (len(L), len(unq)))
    SS = (S.T @ S).tocoo()
    idx = (SS.col > SS.row).nonzero()
    return unq[SS.row[idx]], unq[SS.col[idx]], SS.data[idx]  # left, right, count

from collections import Counter
from itertools import combinations

@B.add_function()
def yatu(L):
    return Counter(chain.from_iterable(combinations(sorted(i), r=2) for i in L))

@B.add_function()
def feature_engineer(L):
    return Counter((min(nodes), max(nodes))
                   for row in L for nodes in combinations(row, 2))

from string import ascii_lowercase as ltrs
ltrs = np.array([*ltrs])

@B.add_arguments('array size')
def argument_provider():
    for exp in range(4, 30):
        n = int(1.4**exp)
        L = [ltrs[np.maximum(0, np.random.randint(-2, 2, 26)).astype(bool).tolist()] for _ in range(n)]
        yield n, L

r = B.run()
r.plot()
We see that the method presented here (pp) comes with the typical numpy constant overhead, but from ~100 sublists it starts winning.
OP's example:
import pandas as pd
df = pd.DataFrame({'letters': [['b','a','e','f','c'],['a','c','d'],['c','b','j']]})
pd.DataFrame(dict(zip(["left", "right", "count"],pp(df['letters']))))
Prints:
left right count
0 a b 1
1 a c 2
2 b c 2
3 c d 1
4 a d 1
5 c e 1
6 a e 1
7 b e 1
8 c f 1
9 e f 1
10 a f 1
11 b f 1
12 b j 1
13 c j 1
For a performance improvement you could use itertools.combinations in order to get all length-2 combinations from the inner lists, and Counter to count the pairs in a flattened list.
Note that in addition to obtaining all combinations from each sublist, sorting is a necessary step since it will ensure that all pairs of tuples will appear in the same order:
from itertools import combinations, chain
from collections import Counter
l = df.letters.tolist()
t = chain.from_iterable(combinations(sorted(i), r=2) for i in l)
print(Counter(t))
Counter({('a', 'b'): 1,
         ('a', 'c'): 2,
         ('a', 'e'): 1,
         ('a', 'f'): 1,
         ('b', 'c'): 2,
         ('b', 'e'): 1,
         ('b', 'f'): 1,
         ('c', 'e'): 1,
         ('c', 'f'): 1,
         ('e', 'f'): 1,
         ('a', 'd'): 1,
         ('c', 'd'): 1,
         ('b', 'j'): 1,
         ('c', 'j'): 1})
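To get back to the Node1/Node2/weight frame the question builds at the end, a short follow-up sketch of my own, continuing with l, chain, combinations and Counter from the snippet above:
import pandas as pd
counts = Counter(chain.from_iterable(combinations(sorted(i), r=2) for i in l))
out = pd.DataFrame([(a, b, w) for (a, b), w in counts.items()],
                   columns=['Node1', 'Node2', 'weight'])
out = out.sort_values('weight', ascending=False).reset_index(drop=True)
print(out)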
Notes:
As suggested in the other answers, make use of collections.Counter for the counting. Since it behaves like a dict though, it needs hashable types. {a,b} is not hashable, because it's a set. Replacing it with a tuple fixes the hashability problem, but introduces possible duplicates (e.g., ('a', 'b') and ('b', 'a')). To fix this issue, just sort the tuple.
Since sorted returns a list, we need to turn that back into a tuple: tuple(sorted((a,b))). A bit cumbersome, but convenient in combination with Counter.
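A two-line illustration of that point (my addition):
pair = {'b', 'a'}          # a set: unhashable, so it cannot be a Counter key
key = tuple(sorted(pair))  # ('a', 'b'): hashable and order-normalized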
Quick and easy speedup: Comprehensions instead of loops
When rearranged, your nested loops can be replaced with the following comprehension:
sets = [ sorted((a,b)) for lst in df['letters'] for i,a in enumerate(lst) for b in lst[i:] if not a == b ]
Python has optimizations in place for comprehension execution, so this will already bring some speedup.
Bonus: If you combine it with Counter, you don't even need the result as a list, but can instead use a generator expression (almost no extra memory is used instead of storing all pairs):
Counter( tuple(sorted((a, b))) for lst in lists for i,a in enumerate(lst) for b in lst[i:] if not a == b ) # note the lack of [ ] around the comprehension
Evaluation: What is the faster approach?
As usual, when dealing with performance, the final answer must come from testing different approaches and choosing the best one.
Here I compare the (IMO very elegant and readable) itertools-based approach by @yatu, the original nested for-loop, and the comprehension.
All tests run on the same sample data, randomly generated to look like the given example.
from timeit import timeit
setup = '''
import numpy as np
import random
from collections import Counter
from itertools import combinations, chain
random.seed(42)
np.random.seed(42)
DF_SIZE = 50000 # make it big
MAX_LEN = 6
list_lengths = np.random.randint(1, 7, DF_SIZE)
letters = 'abcdefghijklmnopqrstuvwxyz'
lists = [ random.sample(letters, ln) for ln in list_lengths ] # roughly equivalent to df.letters.tolist()
'''
#################
comprehension = '''Counter( tuple(sorted((a, b))) for lst in lists for i,a in enumerate(lst) for b in lst[i:] if not a == b )'''
itertools = '''Counter(chain.from_iterable(combinations(sorted(i), r=2) for i in lists))'''
original_for_loop = '''
sets = []
for lst in lists:
    for i, a in enumerate(lst):
        for b in lst[i:]:
            if not a == b:
                sets.append(tuple(sorted((a, b))))
Counter(sets)
'''
print(f'Comprehension: {timeit(setup=setup, stmt=comprehension, number=10)}')
print(f'itertools: {timeit(setup=setup, stmt=itertools, number=10)}')
print(f'nested for: {timeit(setup=setup, stmt=original_for_loop, number=10)}')
Running the code above on my machine (python 3.7) prints:
Comprehension: 1.6664735930098686
itertools: 0.5829475829959847
nested for: 1.751666523006861
So, both suggested approaches improve over the nested for loops, but itertools is indeed faster in this case.

2D list to csv - by column

I'd like to export the content of a 2D-list into a csv file.
The size of the sublists can be different. For example, the 2D-list can be something like :
a = [ ['a','b','c','d'], ['e','f'], ['g'], [], ['h','i'] ]
I want my csv to store the data like this - "by column" :
a,e,g, ,h
b,f, , ,i
c
d
Do I have to add some blank spaces to get the same size for each sublist ? Or is there another way to do so ?
Thank you for your help
You can use itertools.zip_longest:
import itertools, csv
a = [ ['a','b','c','d'], ['e','f'], ['g'], [], ['h','i'] ]
with open('filename.csv', 'w') as f:
    write = csv.writer(f)
    write.writerows(list(itertools.zip_longest(*a, fillvalue='')))
Output:
a,e,g,,h
b,f,,,i
c,,,,
d,,,,
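One hedged aside (my addition): the csv module's documentation recommends opening the file with newline='' so the writer controls line endings, which avoids blank rows on Windows.
import csv
import itertools
a = [['a', 'b', 'c', 'd'], ['e', 'f'], ['g'], [], ['h', 'i']]
with open('filename.csv', 'w', newline='') as f:
    csv.writer(f).writerows(itertools.zip_longest(*a, fillvalue=''))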
It can be done using pandas and the transpose function (T):
import pandas as pd
pd.DataFrame(a).T.to_csv('test.csv')
Result:
(test.csv)
,0,1,2,3,4
0,a,e,g,,h
1,b,f,,,i
2,c,,,,
3,d,,,,
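If the extra index row and column in test.csv are unwanted, to_csv can drop them (a small assumption on my part about the desired output):
import pandas as pd
a = [['a', 'b', 'c', 'd'], ['e', 'f'], ['g'], [], ['h', 'i']]
pd.DataFrame(a).T.to_csv('test.csv', index=False, header=False)
# test.csv then starts with: a,e,g,,h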
import itertools
import pandas as pd
First create a dataframe using a nested array:
a = ['a','b','c','d']
b = ['e','f']
c = ['g']
d = []
e = ['h','i']
nest = [a,b,c,d,e]
df = pd.DataFrame((_ for _ in itertools.zip_longest(*nest)), columns=['a', 'b', 'c', 'd', 'e'])
like this:
a b c d e
0 a e g None h
1 b f None None i
2 c None None None None
3 d None None None None
and then store it using pandas:
df.to_csv('filename.csv', index=False)
We have three tasks to do here: pad the sublists so they all have the same length, transpose, and write to csv.
Sadly, Python has no built-in function for padding; however, it can be done relatively easily. I would do it the following way
(the following code is intended to give the result as requested in the OP):
a = [['a','b','c','d'],['e','f'],['g'],[],['h','i']]
ml = max([len(i) for i in a]) #number of elements of longest sublist
a = [(i+[' ']*ml)[:ml] for i in a] #adding ' ' for sublist shorter than longest
a = list(zip(*a)) #transpose
a = [','.join(i) for i in a] #create list of lines to be written
a = [i.rstrip(', ') for i in a] #jettison spaces if not followed by value
a = '\n'.join(a) #create string to be written to file
with open('myfile.csv','w') as f: f.write(a)
Content of myfile.csv:
a,e,g, ,h
b,f, , ,i
c
d

How to efficiently convert the entries of a dictionary into a dataframe

I have a dictionary like this:
mydict = {'A': 'some thing',
          'B': 'couple of words'}
All the values are strings that are separated by white spaces. My goal is to convert this into a dataframe which looks like this:
key_val splitted_words
0 A some
1 A thing
2 B couple
3 B of
4 B words
So I want to split the strings and then add the associated key and these words into one row of the dataframe.
A quick implementation could look like this:
import pandas as pd
mydict = {'A': 'some thing',
          'B': 'couple of words'}
all_words = " ".join(mydict.values()).split()
df = pd.DataFrame(columns=['key_val', 'splitted_words'], index=range(len(all_words)))
indi = 0
for item in mydict.items():
    words = item[1].split()
    for word in words:
        df.iloc[indi]['key_val'] = item[0]
        df.iloc[indi]['splitted_words'] = word
        indi += 1
which gives me the desired output.
However, I am wondering whether there is a more efficient solution to this!?
Here is my one-line approach:
df = pd.DataFrame([(k, s) for k, v in mydict.items() for s in v.split()], columns=['key_val','splitted_words'])
If I split it up, it looks like this:
d=[(k, s) for k, v in mydict.items() for s in v.split()]
df = pd.DataFrame(d, columns=['key_val','splitted_words'])
Output:
Out[41]:
key_val splitted_words
0 A some
1 A thing
2 B couple
3 B of
4 B words
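For completeness, a pandas-only sketch of my own (assuming pandas >= 0.25 for Series.explode):
import pandas as pd
mydict = {'A': 'some thing', 'B': 'couple of words'}
s = pd.Series(mydict).str.split()        # index: keys, values: word lists
df = (s.explode()
        .rename('splitted_words')
        .rename_axis('key_val')
        .reset_index())
print(df)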
Based on @qu-dong's idea and using a generator function for readability, a working example:
#! /usr/bin/env python
from __future__ import print_function
import pandas as pd
mydict = {'A': 'some thing',
          'B': 'couple of words'}
def splitting_gen(in_dict):
    """Generator function to split in_dict items on space."""
    for k, v in in_dict.items():
        for s in v.split():
            yield k, s
df = pd.DataFrame(splitting_gen(mydict), columns=['key_val', 'splitted_words'])
print (df)
# key_val splitted_words
# 0 A some
# 1 A thing
# 2 B couple
# 3 B of
# 4 B words
# real 0m0.463s
# user 0m0.387s
# sys 0m0.057s
but this only addresses efficiency in the sense of elegance/readability of the solution requested.
If you look at the timings, they are all roughly alike, a tad under 500 milliseconds, so one might continue to profile further to avoid surprises when feeding in larger texts ;-)
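As a starting point for that profiling, a small sketch of my own, reusing splitting_gen and mydict from above:
from timeit import timeit
gen_time = timeit(lambda: pd.DataFrame(splitting_gen(mydict),
                                       columns=['key_val', 'splitted_words']),
                  number=1000)
lst_time = timeit(lambda: pd.DataFrame([(k, s) for k, v in mydict.items()
                                        for s in v.split()],
                                       columns=['key_val', 'splitted_words']),
                  number=1000)
print(gen_time, lst_time)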
