Find elements that appear in more than k sets in python

I am implementing a basic spell-correction system and I have built an inverted index for my domain's language, where every character bigram is mapped to a list of words that contain that bigram.
Now I want to find all words that share at least 3 character bigrams with a given word w. So the main problem is: given a collection of lists, how can one efficiently find the elements that occur in 3 or more of them?
For example, given the sets:
('a', 'b', 'c', 'd') , ('a', 'e', 'f', 'g'), ('e', 'f', 'g', 'h'), ('b', 'c', 'z', 'y'), ('e', 'k', 'a', 'j')
I would like to get the output:
('a', 'e')
since a and e each appear in 3 of the sets.
I would appreciate your ideas.
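
For context, here is a minimal sketch of the setup described in the question, assuming the inverted index maps each bigram to a set of words. The names bigrams, build_index and candidates, and the small vocabulary, are hypothetical and only for illustration.

```python
from collections import Counter, defaultdict

def bigrams(word):
    """Return the set of distinct character bigrams in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def build_index(words):
    """Map each bigram to the set of words that contain it."""
    index = defaultdict(set)
    for word in words:
        for bg in bigrams(word):
            index[bg].add(word)
    return index

def candidates(word, index, k=3):
    """Return the words that share at least k distinct bigrams with `word`."""
    counts = Counter()
    for bg in bigrams(word):
        # Each posting set contributes one count per word it contains.
        counts.update(index.get(bg, ()))
    return {w for w, n in counts.items() if n >= k and w != word}

# Hypothetical vocabulary, for illustration only.
index = build_index(["spelling", "spilling", "selling", "apple"])
print(sorted(candidates("speling", index)))
```

Only the posting lists of the query word's own bigrams are visited, so the cost depends on the size of those lists rather than on the whole vocabulary.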

In addition to @Ralf's answer, you can use a dict to build a histogram:
someCollection = [('a', 'b', 'c', 'd') , ('a', 'e', 'f', 'g'), ('e', 'f', 'g', 'h'), ('b', 'c', 'z', 'y'), ('e', 'k', 'a', 'j')]
hist = {}
for collection in someCollection:
    for member in collection:
        hist[member] = hist.get(member, 0) + 1
hist is now:
{'a': 3,
'b': 2,
'c': 2,
'd': 1,
'e': 3,
'f': 2,
'g': 2,
'h': 1,
'z': 1,
'y': 1,
'k': 1,
'j': 1}
which can be sorted with sorted(hist.items(), key=lambda x: x[1])  # sort by count

You could try using collections.Counter:
from collections import Counter
data = [
('a', 'b', 'c', 'd'),
('a', 'e', 'f', 'g'),
('e', 'f', 'g', 'h'),
('b', 'c', 'z', 'y'),
('e', 'k', 'a', 'j'),
]
c = Counter()
for e in data:
    c.update(e)
# print(c)
# for k, v in c.items():
#     if v >= 3:
#         print(k, v)
You get the output by using this (or something similar):
>>> [k for k, v in c.items() if v >= 3]
['a', 'e']
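
The same idea can be wrapped in a function with the threshold as a parameter. This is a sketch rather than a definitive implementation: the name in_at_least_k is made up here, and converting each collection to a set first is an assumption that an element repeated inside one list should count only once.

```python
from collections import Counter
from itertools import chain

def in_at_least_k(collections, k):
    """Return the elements that occur in at least k of the given collections."""
    # set() makes an element repeated inside one collection count only once.
    counts = Counter(chain.from_iterable(set(c) for c in collections))
    return sorted(elem for elem, n in counts.items() if n >= k)

data = [
    ('a', 'b', 'c', 'd'),
    ('a', 'e', 'f', 'g'),
    ('e', 'f', 'g', 'h'),
    ('b', 'c', 'z', 'y'),
    ('e', 'k', 'a', 'j'),
]
print(in_at_least_k(data, 3))  # ['a', 'e']
```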

Related

How can I rearrange a set of values into new pattern on python and print results

I will do my best to explain what I'm looking for.
At the moment I have a 100-item list that I want to shuffle repeatedly using a fixed pattern, first to check whether the pattern eventually brings me back to where I began,
and second to print the result of each loop to a text file.
Using a 3-item list as my example:
[a,b,c]
and the shuffle pattern [3 1 2],
where the 3rd item becomes the first,
the first item becomes the second,
and the second item becomes the 3rd,
repeating the shuffle would generate the following patterns:
[a,b,c]
[3,1,2]
[c,a,b]
[b,c,a]
[a,b,c]
But my actual list has 100 items, and I need to find every arrangement produced by a few different patterns I would like to test.
Does anyone know of a way to do this in Python?

You can define a function and call it multiple times, like below:
>>> def func(lst, ptr):
...     return [lst[idx-1] for idx in ptr]
...
>>> lst = ['a','b','c']
>>> ptr = [3,1,2]
>>> for _ in range(5):
...     lst = func(lst, ptr)
...     print(lst)
...
['c', 'a', 'b']
['b', 'c', 'a']
['a', 'b', 'c']
['c', 'a', 'b']
['b', 'c', 'a']
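
To also cover the two goals from the question (stop once the original order comes back, and write each step to a text file), a sketch along these lines might work. The helper names and the file name are hypothetical, and the pattern is assumed to be 1-based, as in the question.

```python
def apply_pattern(items, pattern):
    """Reorder items using a 1-based pattern such as [3, 1, 2]."""
    return [items[idx - 1] for idx in pattern]

def arrangements_until_repeat(items, pattern):
    """Collect each arrangement until the original order comes back."""
    # A pattern that is not a permutation might never return to the start.
    if sorted(pattern) != list(range(1, len(items) + 1)):
        raise ValueError("pattern must use each position 1..n exactly once")
    start = list(items)
    current = apply_pattern(start, pattern)
    seen = [current]
    while current != start:
        current = apply_pattern(current, pattern)
        seen.append(current)
    return seen

def save_arrangements(steps, path):
    """Write one arrangement per line, items separated by spaces."""
    with open(path, "w") as fh:
        for step in steps:
            fh.write(" ".join(str(item) for item in step) + "\n")

steps = arrangements_until_repeat(['a', 'b', 'c'], [3, 1, 2])
print(steps)
# To write the steps to a file (placeholder name):
# save_arrangements(steps, "arrangements.txt")
```

For a 100-item list the number of steps before the order repeats can be very large for some patterns, so it may be worth checking the step count before collecting or writing everything.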

You could use numpy advanced integer indexing if your list contains a numeric type:
import numpy as np
original_list = [1, 2, 3]
numpy_array = np.array(original_list)
pattern = [2, 1, 0]  # note: numpy indices are 0-based
print(numpy_array[pattern])
# prints: [3 2 1]

def rearrange(pattern: list, L: list):
    new_list = []
    for i in pattern:
        new_list.append(L[i-1])
    return new_list
print(rearrange([3,1,2],['a','b','c']))
Output:
['c', 'a', 'b']
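
To check whether (and after how many steps) a pattern brings the list back to its starting order without running the shuffle, one option is to look at the cycles of the pattern: the order repeats after the least common multiple of the cycle lengths. The sketch below assumes a valid 1-based permutation and distinct list items. The function names are made up for illustration.

```python
from math import gcd

def cycle_lengths(pattern):
    """Return the cycle lengths of a 1-based permutation pattern."""
    seen = set()
    lengths = []
    for start in range(1, len(pattern) + 1):
        if start in seen:
            continue
        length = 0
        pos = start
        # Follow the pattern from this position until the cycle closes.
        while pos not in seen:
            seen.add(pos)
            pos = pattern[pos - 1]
            length += 1
        lengths.append(length)
    return lengths

def steps_to_return(pattern):
    """Number of applications after which the original order reappears."""
    result = 1
    for length in cycle_lengths(pattern):
        result = result * length // gcd(result, length)  # running LCM
    return result

print(steps_to_return([3, 1, 2]))        # 3
print(steps_to_return([2, 1, 4, 5, 3]))  # 6
```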

Itertools could be what you need.
import itertools
p = itertools.permutations(['a','b','c', 'd'])
list(p)
Output:
[('a', 'b', 'c', 'd'),
('a', 'b', 'd', 'c'),
('a', 'c', 'b', 'd'),
('a', 'c', 'd', 'b'),
('a', 'd', 'b', 'c'),
('a', 'd', 'c', 'b'),
('b', 'a', 'c', 'd'),
('b', 'a', 'd', 'c'),
('b', 'c', 'a', 'd'),
('b', 'c', 'd', 'a'),
('b', 'd', 'a', 'c'),
('b', 'd', 'c', 'a'),
('c', 'a', 'b', 'd'),
('c', 'a', 'd', 'b'),
('c', 'b', 'a', 'd'),
('c', 'b', 'd', 'a'),
('c', 'd', 'a', 'b'),
('c', 'd', 'b', 'a'),
('d', 'a', 'b', 'c'),
('d', 'a', 'c', 'b'),
('d', 'b', 'a', 'c'),
('d', 'b', 'c', 'a'),
('d', 'c', 'a', 'b'),
('d', 'c', 'b', 'a')]

Python - Calculate combinations of different values as a sum

Given a list of tuples as following:
values = [
('a', 'b', 'c'),
('d', 'e'),
('f', 'g', 'h')
]
I'd like to calculate different combinations of those values, but not as a cartesian product, rather as a sum on some custom rules. To clarify, if we calculate the cartesian product between those tuples, we will get 3*2*3 = 18 different combinations. But my desire is to get something like this:
combinations = [
('a', 'd', 'f'),
('a', 'e', 'g'),
('a', 'e', 'h'),
('b', 'd', 'f'),
('b', 'e', 'g'),
('b', 'e', 'h'),
('c', 'd', 'f'),
('c', 'e', 'g'),
('c', 'e', 'h')
]
So the resulting list contains 9 different combinations instead of 18.
Example with 4 tuples:
values = [
('a', 'b', 'c'),
('d', 'e'),
('f', 'g', 'h'),
('i', 'j', 'k', 'l')
]
The result would be
combinations = [
('a', 'd', 'f', 'i'),
('a', 'e', 'g', 'j'),
('a', 'e', 'h', 'k'),
('a', 'e', 'h', 'l'),
('b', 'd', 'f', 'i'),
('b', 'e', 'g', 'j'),
('b', 'e', 'h', 'k'),
('b', 'e', 'h', 'l'),
('c', 'd', 'f', 'i'),
('c', 'e', 'g', 'j'),
('c', 'e', 'h', 'k'),
('c', 'e', 'h', 'l'),
]
To explain the logic behind the outputs further:
In both inputs, the first tuple is behaving as it would in a cartesian product.
However, all the other tuples except the first are being iterated (or zipped) together. Additionally, if one of the tuples being iterated together "runs out of values" so to speak, we use the last value in the tuple instead.
What would be the efficient way to achieve this?

With the extra example provided, we can figure out what the logic looks like. Essentially, the first row is treated specially and used in the normal "cartesian product" sense.
The remaining rows, however, are effectively extended to the length of the longest one and then zipped together. Coded up, it can look something like the following:
from itertools import product
def extend_to_max_len(tup, length):
    '''extends a tuple to a specified length by
    filling the empty spaces with last element of given tuple
    '''
    fill_count = length - len(tup)
    return (*tup, *[tup[-1]]*fill_count)
def non_cartesian_sum(values):
    '''Expects a list of tuples.
    gives the output according to the custom rules:
    top: first row: to be used for cartesian product with zip of remaining rows
    bottom: remaining rows: extended to longest length before zipping
    '''
    if len(values) < 2:
        print("Check length of input provided")
        return None
    top = values[0]
    bottom = values[1:]
    max_len = max(len(row) for row in bottom)
    bottom = [extend_to_max_len(row, max_len) for row in bottom]
    out = [(first, *rest) for first, rest in product(top, zip(*bottom))]
    return out
values = [
('a', 'b', 'c'),
('d', 'e'),
('f', 'g', 'h'),
('i', 'j', 'k', 'l')
]
out = non_cartesian_sum(values)
print(out)
Output:
[('a', 'd', 'f', 'i'),
('a', 'e', 'g', 'j'),
('a', 'e', 'h', 'k'),
('a', 'e', 'h', 'l'),
('b', 'd', 'f', 'i'),
('b', 'e', 'g', 'j'),
('b', 'e', 'h', 'k'),
('b', 'e', 'h', 'l'),
('c', 'd', 'f', 'i'),
('c', 'e', 'g', 'j'),
('c', 'e', 'h', 'k'),
('c', 'e', 'h', 'l')]
Note that you may want to add more input validation as required, before using this function for your use case.
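
As a possible variation on the same logic, the padding step can be replaced by clamping the index, so that no extended copies of the rows are built. This is only a sketch: the name clamped_combinations is hypothetical, every tuple is assumed to be non-empty, and the behaviour for a single input tuple is an assumption, not something the question specifies.

```python
def clamped_combinations(values):
    """Cross the first tuple with the others, which advance in lockstep."""
    first, *rest = values
    if not rest:
        # Assumption: with a single tuple, return its items as 1-tuples.
        return [(item,) for item in first]
    longest = max(len(row) for row in rest)
    # Position i reads row[i], or the row's last value once i runs past its end.
    zipped = [tuple(row[min(i, len(row) - 1)] for row in rest)
              for i in range(longest)]
    return [(item, *tail) for item in first for tail in zipped]

values = [
    ('a', 'b', 'c'),
    ('d', 'e'),
    ('f', 'g', 'h'),
]
for combo in clamped_combinations(values):
    print(combo)
```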

This works for the data provided.
values = [
('a', 'b', 'c'),
('d', 'e'),
('f', 'g', 'h')
]
length_of_1 = len(values[1])
length_of_2 = len(values[2])
output = []
for item0 in values[0]:
    for i in range(max(length_of_1, length_of_2)):
        if i >= length_of_1:
            item1 = values[1][-1]
        else:
            item1 = values[1][i]
        if i >= length_of_2:
            item2 = values[2][-1]
        else:
            item2 = values[2][i]
        triple = (item0, item1, item2)
        output.append(triple)
for tup in output:
    print(tup)
Output:
('a', 'd', 'f')
('a', 'e', 'g')
('a', 'e', 'h')
('b', 'd', 'f')
('b', 'e', 'g')
('b', 'e', 'h')
('c', 'd', 'f')
('c', 'e', 'g')
('c', 'e', 'h')

Try this
values = [
('a', 'b', 'c'),
('d', 'e'),
('f', 'g', 'h')
]
combination = [(a,b,c) for a in values[0] for b in values[1] for c in values[2]]
print(combination)
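
Note that this comprehension appears to build the full cartesian product, so it gives the 18 tuples mentioned in the question, not the 9 that were asked for. A quick check, comparing it with itertools.product:

```python
from itertools import product

values = [
    ('a', 'b', 'c'),
    ('d', 'e'),
    ('f', 'g', 'h'),
]
nested = [(a, b, c) for a in values[0] for b in values[1] for c in values[2]]
via_product = list(product(*values))

print(len(nested))            # 18
print(nested == via_product)  # True
```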

How to make generator work in spark mapPartitions()?

I am trying to use mapPartitions in Spark to process a large text corpus:
Let's say we have some half-processed data that looks like this:
text_1 = [['A', 'B', 'C', 'D', 'E'],
['F', 'E', 'G', 'A', 'B'],
['D', 'E', 'H', 'A', 'B'],
['A', 'B', 'C', 'F', 'E'],
['A', 'B', 'C', 'J', 'E'],
['E', 'H', 'A', 'B', 'C'],
['E', 'G', 'A', 'B', 'C'],
['C', 'F', 'E', 'G', 'A'],
['C', 'D', 'E', 'H', 'A'],
['C', 'J', 'E', 'H', 'A'],
['H', 'A', 'B', 'C', 'F'],
['H', 'A', 'B', 'C', 'J'],
['B', 'C', 'F', 'E', 'G'],
['B', 'C', 'D', 'E', 'H'],
['B', 'C', 'F', 'E', 'K'],
['B', 'C', 'J', 'E', 'H'],
['G', 'A', 'B', 'C', 'F'],
['J', 'E', 'H', 'A', 'B']]
Each letter is a word. I also have vocabulary :
V = ['D','F','G','C','J','K']
text_1RDD = sc.parallelize(text_1)
and I want to run the following in spark:
filtered_lists = text_1RDD.mapPartitions(partitions)
filtered_lists.collect()
I have this function:
def partitions(list_of_lists,vc):
    for w in vc:
        iterator = []
        for sub_list in list_of_lists:
            if w in sub_list:
                iterator.append(sub_list)
        yield (w,len(iterator))
If I run it like this:
c = partitions(text_1,V)
for item in c:
    print(item)
it returns correct count
('D', 4)
('F', 7)
('G', 5)
('C', 15)
('J', 5)
('K', 1)
However, I have no idea how to run it in spark:
filtered_lists = text_1RDD.mapPartitions(partitions)
filtered_lists.collect()
It has just one argument and generates a lot of errors when running in Spark...
But even if I code vocabulary inside partitions function:
def partitionsV(list_of_lists):
    vc = ['D','F','G','C','J','K']
    for w in vc:
        iterator = []
        for sub_list in list_of_lists:
            if w in sub_list:
                iterator.append(sub_list)
        yield (w,len(iterator))
..I got this:
filtered_lists = text_1RDD.mapPartitions(partitionsV)
filtered_lists.collect()
output:
[('D', 2),
('F', 0),
('G', 0),
('C', 0),
('J', 0),
('K', 0),
('D', 0),
('F', 0),
('G', 0),
('C', 0),
('J', 0),
('K', 0),
('D', 1),
('F', 0),
('G', 0),
('C', 0),
('J', 0),
('K', 0),
('D', 1),
('F', 0),
('G', 0),
('C', 0),
('J', 0),
('K', 0)]
Obviously, generator didn't work as expected. I am totally stuck.
I am very new to spark. I would be so grateful if someone can explain to me what is going on here...

That's yet another word-count problem, and mapPartitions is not the tool for the job:
from operator import add
v = set(['D','F','G','C','J','K'])
result = text_1RDD.flatMap(v.intersection).map(lambda x: (x, 1)).reduceByKey(add)
And the result is
for x in result.sortByKey().collect():
    print(x)
('C', 15)
('D', 4)
('F', 7)
('G', 5)
('J', 5)
('K', 1)
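
As for why the mapPartitions attempt printed mostly zeros, a likely explanation is that the function receives each partition as a one-shot iterator, not a list. The first vocabulary word consumes it, and every later word loops over an already exhausted iterator. The sketch below reproduces this in plain Python, so it runs without Spark. The sample rows and the name count_words_single_pass are made up for illustration.

```python
# A one-shot iterator standing in for one partition's data.
rows = iter([['A', 'D'], ['D', 'F'], ['F', 'G']])
print(sum('D' in r for r in rows))  # 2
print(sum('F' in r for r in rows))  # 0, the iterator is already exhausted

def count_words_single_pass(rows, vocabulary):
    """Count, in one pass over rows, how many rows contain each word."""
    counts = {w: 0 for w in vocabulary}
    for row in rows:  # rows is only iterated once
        for w in counts:
            if w in row:
                counts[w] += 1
    yield from counts.items()

fresh = iter([['A', 'D'], ['D', 'F'], ['F', 'G']])
print(list(count_words_single_pass(fresh, ['D', 'F', 'G'])))
# [('D', 2), ('F', 2), ('G', 1)]
```

Even with a single-pass function, mapPartitions would yield one set of counts per partition, so the per-partition results would still need to be combined, for example with reduceByKey as in the answer above.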

Printing values in row order from dictionary/ list

Could someone show me how to get the values from a dictionary in row order (STARTING FROM SECOND ROW) e.g. get the first value from all rows, when rows finish, move onto getting the second value from all rows until all the values have been collected (when no more columns).
E.g. here is a table:
('E', 'K', 'Y') # <- don't get the values from the first row
('B', 'C', 'B') # start getting values from this (second row)
('C', 'B', 'F')
('F', 'C', 'A')
('C', 'C', 'C')
('B', 'C', 'B')
('E', 'B', 'F')
('B', 'B', 'F')
('D', 'A', 'A')
('A', 'D', 'F')
The table above should print the values:
BCFCBEBDACBCCCBBADBFACBFFAF
I am creating an encoder and decoder program and am stuck on printing the values in the correct order.
Thanks.

If you have all the tuples in a list, you can use zip and join:
>>> l=[('E', 'K', 'Y') ,('B', 'C', 'B') ,('C', 'B', 'F'),('F', 'C', 'A'),('C', 'C', 'C'),('B', 'C', 'B'),('E', 'B', 'F'),('B', 'B', 'F'),('D', 'A', 'A'),('A', 'D', 'F')]
>>> ''.join([''.join(i) for i in zip(*l[1:])])
'BCFCBEBDACBCCCBBADBFACBFFAF'

An alternative way, which gets a list of characters instead of a str:
print([v for r in (row for row in zip(*l[1:])) for v in r])
#['B', 'C', 'F', 'C', 'B', 'E', 'B', 'D', 'A', 'C', 'B', 'C', 'C', 'C', 'B', 'B', 'A', 'D', 'B', 'F', 'A', 'C', 'B', 'F', 'F', 'A', 'F']

If the data structure is a list of tuples, meaning:
l = [('E', 'K', 'Y'),
     ('B', 'C', 'B'),
     ('C', 'B', 'F'),
     ('F', 'C', 'A'),
     ('C', 'C', 'C'),
     ('B', 'C', 'B'),
     ('E', 'B', 'F'),
     ('B', 'B', 'F'),
     ('D', 'A', 'A'),
     ('A', 'D', 'F')]
something like this could solve the problem:
string = ""
for i in range(3):  # one pass per column
    for element in range(1, len(l)):  # start at 1 to skip the first row
        string += l[element][i]
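
Since the question mentions an encoder and a decoder, here is a sketch of both directions, assuming all rows have the same number of columns. The names read_columns and rebuild_rows are hypothetical, and the decoder assumes the number of columns is known.

```python
from itertools import chain

def read_columns(rows, skip=1):
    """Join the values column by column, skipping the first `skip` rows."""
    body = rows[skip:]
    return ''.join(chain.from_iterable(zip(*body)))

def rebuild_rows(text, n_columns):
    """Inverse of read_columns: split text into columns, then zip into rows."""
    n_rows = len(text) // n_columns
    columns = [text[i * n_rows:(i + 1) * n_rows] for i in range(n_columns)]
    return list(zip(*columns))

table = [('E', 'K', 'Y'), ('B', 'C', 'B'), ('C', 'B', 'F'), ('F', 'C', 'A'),
         ('C', 'C', 'C'), ('B', 'C', 'B'), ('E', 'B', 'F'), ('B', 'B', 'F'),
         ('D', 'A', 'A'), ('A', 'D', 'F')]

encoded = read_columns(table)
print(encoded)                                   # BCFCBEBDACBCCCBBADBFACBFFAF
print(rebuild_rows(encoded, 3) == table[1:])     # True
```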

how to remove a specific element in the tuple?

How can I remove a specific element from the tuples used as dictionary keys?
For example:
L={('a','b','c','d'):1,('a','b','c','e'):2}
remove='b'
I want to get a result of :
{('a','c','d'):1,('a','c','e'):2}

In [20]: L={('a','b','c','d'):1,('a','b','c','e'):2}
In [21]: {tuple(y for y in x if y != "b"):L[x] for x in L}
Out[21]: {('a', 'c', 'd'): 1, ('a', 'c', 'e'): 2}
or using filter():
In [24]: { tuple(filter(lambda y:y!="b",x)) : L[x] for x in L}
Out[24]: {('a', 'c', 'd'): 1, ('a', 'c', 'e'): 2}

You could create an updated version of the dictionary using a dictionary comprehension:
L = {('a', 'b', 'c', 'd'): 1, ('a', 'b', 'c', 'e'): 2, ('f', 'g', 'h'): 3}
remove = 'b'
L = {tuple(i for i in k if i != remove) if remove in k else k: v for (k, v) in L.items()}
print(L)
Output:
{('a', 'c', 'd'): 1, ('a', 'c', 'e'): 2, ('f', 'g', 'h'): 3}
As you can see, it leaves keys that do not contain the element unchanged.
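
One thing to watch for with any of these comprehensions: if two keys become identical after the removal, the later one silently overwrites the earlier one. A sketch of a variant that raises an error in that case (the function name remove_from_keys is made up here, and raising is just one possible policy):

```python
def remove_from_keys(mapping, element):
    """Return a new dict whose tuple keys no longer contain `element`."""
    result = {}
    for key, value in mapping.items():
        new_key = tuple(item for item in key if item != element)
        if new_key in result:
            # Two different original keys collapsed to the same tuple.
            raise ValueError(
                f"keys collide after removing {element!r}: {new_key}")
        result[new_key] = value
    return result

L = {('a', 'b', 'c', 'd'): 1, ('a', 'b', 'c', 'e'): 2}
print(remove_from_keys(L, 'b'))
# {('a', 'c', 'd'): 1, ('a', 'c', 'e'): 2}
```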
