creating itemsets in apriori algorithm - python

I am reading about association analysis in the book Machine Learning in Action. The following code is given in the book:
The k-2 thing may be a little confusing. Let's look at that a little
further. When you were creating {0,1}, {0,2}, {1,2} from {0}, {1}, {2},
you just combined items. Now, what if you want to use {0,1}, {0,2},
{1,2} to create a three-item set? If you did the union of every set,
you'd get {0, 1, 2}, {0, 1, 2}, {0, 1, 2}. That's right. It's the same
set three times. Now you have to scan through the list of three-item
sets to get only unique values. You're trying to keep the number of
times you go through the lists to a minimum. Now, if you compared the
first element of {0,1}, {0,2}, {1,2} and only took the union of those that
had the same first item, what would you have? {0, 1, 2} just one time.
Now you don't have to go through the list looking for unique values.
def aprioriGen(Lk, k):  # creates Ck
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i+1, lenLk):
            # join sets if their first k-2 items are equal
            L1 = list(Lk[i])[:k-2]; L2 = list(Lk[j])[:k-2]
            L1.sort(); L2.sort()
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])
    return retList
Suppose I am calling the above function:
Lk = [frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5}), frozenset({1, 3})]
k = 3
aprioriGen(Lk,3)
I am getting the following output:
[frozenset({2, 3, 5})]
I think there is a bug in the above logic, since we are missing other combinations like {1,2,3} and {1,3,5}. Is my understanding right?

I think you are following the link below. The output set depends on the minSupport value we pass.
http://adataanalyst.com/machine-learning/apriori-algorithm-python-3-0/
If we reduce the minSupport value to 0.2, we get all sets.
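To make that concrete: for k = 3 the join step compares only the first k-2 = 1 sorted items, so with the Lk in the question only {2,3} and {2,5} share a prefix, and {1,2,3} can only ever be generated once {1,2} is itself frequent, which is exactly what the lower minSupport allows. A quick check, reusing the aprioriGen defined above (this trace is illustrative, not from the book):

Lk = [frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5}), frozenset({1, 3})]
# For k = 3, only {2,3} and {2,5} share their first sorted item.
print(aprioriGen(Lk, 3))                        # [frozenset({2, 3, 5})]

# Once {1, 2} is frequent too (as happens at minSupport = 0.2),
# {1, 2} and {1, 3} share the prefix [1] and produce {1, 2, 3}:
print(aprioriGen(Lk + [frozenset({1, 2})], 3))  # adds frozenset({1, 2, 3})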
Below is the complete code
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 31 16:57:26 2018
#author: rponnurx
"""
from numpy import *

def loadDataSet():
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

def createC1(dataSet):
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if not [item] in C1:
                C1.append([item])
    C1.sort()
    # use frozenset so we can use it as a key in a dict
    return list(map(frozenset, C1))
def scanD(D, Ck, minSupport):
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                if not can in ssCnt: ssCnt[can] = 1
                else: ssCnt[can] += 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData
dataSet = loadDataSet()
print(dataSet)
C1 = createC1(dataSet)
print(C1)
# D is the dataset in set form.
D = list(map(set,dataSet))
print(D)
L1,suppDat0 = scanD(D,C1,0.5)
print(L1)
def aprioriGen(Lk, k):  # creates Ck
    retList = []
    print("Lk")
    print(Lk)
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i+1, lenLk):
            L1 = list(Lk[i])[:k-2]; L2 = list(Lk[j])[:k-2]
            L1.sort(); L2.sort()
            if L1 == L2:  # if first k-2 elements are equal
                retList.append(Lk[i] | Lk[j])  # set union
    return retList
def apriori(dataSet, minSupport=0.5):
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while (len(L[k-2]) > 0):
        Ck = aprioriGen(L[k-2], k)
        Lk, supK = scanD(D, Ck, minSupport)  # scan DB to get Lk
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData
L,suppData = apriori(dataSet,0.2)
print(L)
Output:
[[frozenset({5}), frozenset({2}), frozenset({4}), frozenset({3}), frozenset({1})], [frozenset({1, 2}), frozenset({1, 5}), frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5}), frozenset({1, 3}), frozenset({1, 4}), frozenset({3, 4})], [frozenset({1, 3, 5}), frozenset({1, 2, 3}), frozenset({1, 2, 5}), frozenset({2, 3, 5}), frozenset({1, 3, 4})], [frozenset({1, 2, 3, 5})], []]
Thanks,
Rajeswari Ponnuru

Related

Function to make a list as unsorted as possible

I am looking for a function to make the list as unsorted as possible. Preferably in Python.
Backstory:
I want to check URL statuses and see if URLs give a 404 or not. I just use the asyncio and requests modules. Nothing fancy.
Now I don't want to overload servers, so I want to minimize checking URLs which are on the same domain at the same time. I have this idea to sort the URLs in a way that items which are close to one another (having the same sort key = domain name) are placed as far apart from each other in the list as possible.
An example with numbers:
a=[1,1,2,3,3] # <== sorted list, sortness score = 2
0,1,2,3,4 # <== positions
could be unsorted as:
b=[1,3,2,1,3] # <== unsorted list, sortness score = 6
0,1,2,3,4 # <== positions
I would say that we can compute a sortness score by summing up the distances between equal items (which have the same key = domain name). Higher sortness means better unsorted. Maybe there is a better way to measure unsortness.
The sortness score for list a is 2. The sum of distances for 1 is (1-0)=1, for 2 is 0, for 3 is (4-3)=1.
The sortness score for list b is 6. The sum of distances for 1 is (3-0)=3, for 2 is 0, for 3 is (4-1)=3.
URLs list would look something like a list of (domain, URL) tuples:
[
('example.com', 'http://example.com/404'),
('test.com', 'http://test.com/404'),
('test.com', 'http://test.com/405'),
('example.com', 'http://example.com/405'),
...
]
I am working on a prototype which works Ok-ish, but not optimal as I can find some variants which are better unsorted by hand.
Anyone wants to give it a go?
This is my code, but it's not great :):
from collections import Counter
from collections import defaultdict
import math

def test_unsortness(lst: list) -> float:
    pos = defaultdict(list)
    score = 0
    # Store positions for each key
    # input = [1,3,2,3,1] => {1: [0, 4], 3: [1, 3], 2: [2]}
    for c, l in enumerate(lst):
        pos[l].append(c)
    for k, poslst in pos.items():
        for i in range(len(poslst) - 1):
            score += math.sqrt(poslst[i+1] - poslst[i])
    return score

def unsort(lst: list) -> list:
    free_positions = list(range(0, len(lst)))
    output_list = [None] * len(free_positions)
    for val, count in Counter(lst).most_common():
        pos = 0
        step = len(free_positions) / count
        for i in range(count):
            output_list[free_positions[int(pos)]] = val
            free_positions[int(pos)] = None  # Remove position later
            pos = pos + step
        free_positions = [p for p in free_positions if p is not None]
    return output_list

lsts = list()
lsts.append([1, 1, 2, 3, 3])
lsts.append([1, 3, 2, 3, 1])        # This has the worst score after unsort()
lsts.append([1, 2, 3, 0, 1, 2, 3])  # This has the worst score after unsort()
lsts.append([3, 2, 1, 0, 1, 2, 3])  # This has the worst score after unsort()
lsts.append([3, 2, 1, 3, 1, 2, 3])  # This has the worst score after unsort()
lsts.append([1, 2, 3, 4, 5])

for lst in lsts:
    ulst = unsort(lst)
    print((lst, '%.2f' % test_unsortness(lst), '====>', ulst, '%.2f' % test_unsortness(ulst)))
# Original score Unsorted score
# ------- ----- -------- -----
# ([1, 1, 2, 3, 3], '2.00', '====>', [1, 3, 1, 3, 2], '2.83')
# ([1, 3, 2, 3, 1], '3.41', '====>', [1, 3, 1, 3, 2], '2.83')
# ([1, 2, 3, 0, 1, 2, 3], '6.00', '====>', [1, 2, 3, 1, 2, 3, 0], '5.20')
# ([3, 2, 1, 0, 1, 2, 3], '5.86', '====>', [3, 2, 1, 3, 2, 1, 0], '5.20')
# ([3, 2, 1, 3, 1, 2, 3], '6.88', '====>', [3, 2, 3, 1, 3, 2, 1], '6.56')
# ([1, 2, 3, 4, 5], '0.00', '====>', [1, 2, 3, 4, 5], '0.00')
PS. I am not looking just for a randomize function and I know there are crawlers which can manage domain loads, but this is for the sake of exercise.
Instead of unsorting your list of URLs, why not group them by domain, each in a queue, then process them asynchronously with a (randomised?) delay in between?
It looks less complex to me than what you're trying to do to achieve the same thing, and if you have a lot of domains, you can always throttle how many run concurrently at that point.
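A minimal sketch of that per-domain-queue idea, assuming asyncio and a stubbed-out request (the helper names here are made up for illustration):

import asyncio
import random
from collections import defaultdict

async def check_domain(urls):
    # URLs of one domain are checked sequentially, with a polite delay
    for url in urls:
        print("checking", url)  # stand-in for the real status check
        await asyncio.sleep(random.uniform(0.5, 1.5))

async def main(url_tuples):
    queues = defaultdict(list)
    for domain, url in url_tuples:
        queues[domain].append(url)
    # one task per domain: domains run concurrently, URLs within a domain do not
    await asyncio.gather(*(check_domain(urls) for urls in queues.values()))

asyncio.run(main([('example.com', 'http://example.com/404'),
                  ('test.com', 'http://test.com/404'),
                  ('test.com', 'http://test.com/405')]))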
I used Google OR Tools to solve this problem. I framed it as a constraint optimization problem and modeled it that way.
from collections import defaultdict
from itertools import chain, combinations
from ortools.sat.python import cp_model

model = cp_model.CpModel()

data = [
    ('example.com', 'http://example.com/404'),
    ('test.com', 'http://test.com/404'),
    ('test.com', 'http://test.com/405'),
    ('example.com', 'http://example.com/405'),
    ('google.com', 'http://google.com/404'),
    ('example.com', 'http://example.com/406'),
    ('stackoverflow.com', 'http://stackoverflow.com/404'),
    ('test.com', 'http://test.com/406'),
    ('example.com', 'http://example.com/407')
]

tmp = defaultdict(list)
for (domain, url) in sorted(data):
    var = model.NewIntVar(0, len(data) - 1, url)
    tmp[domain].append(var)  # store URLs as model variables where the key is the domain

vals = list(chain.from_iterable(tmp.values()))  # create a single list of all variables
model.AddAllDifferent(vals)  # all variables must occupy a unique spot in the output

constraint = []
for urls in tmp.values():
    if len(urls) == 1:  # a single domain does not need a specific constraint
        constraint.append(urls[0])
        continue
    combos = combinations(urls, 2)
    for (x, y) in combos:  # create combinations between each URL of a specific domain
        constraint.append((x - y))

model.Maximize(sum(constraint))  # maximize the distance between similar URLs from our constraint list

solver = cp_model.CpSolver()
status = solver.Solve(model)

output = [None for _ in range(len(data))]
if status == cp_model.OPTIMAL or status == cp_model.FEASIBLE:
    for val in vals:
        idx = solver.Value(val)
        output[idx] = val.Name()

print(output)
['http://example.com/407',
'http://test.com/406',
'http://example.com/406',
'http://test.com/405',
'http://example.com/405',
'http://stackoverflow.com/404',
'http://google.com/404',
'http://test.com/404',
'http://example.com/404']
There is no obvious definition of unsortedness that would work best for you, but here's something that at least works well:
Sort the list
If the length of the list is not a power of two, then spread the items out evenly in a list with the next power of two size
Find a new index for each item by reversing the bits in its old index.
Remove the gaps to bring the list back to its original size.
In sorted order, the indexes of items that are close together usually differ only in the smallest bits. By reversing the bit order, you make the new indexes for items that are close together differ in the largest bits, so they will end up far apart.
def bitreverse(x, bits):
    # reverse the lower 32 bits
    x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1)
    x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2)
    x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4)
    x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)
    x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16)
    # take only the appropriate length
    return (x >> (32 - bits)) & ((1 << bits) - 1)

def antisort(inlist):
    if len(inlist) < 3:
        return inlist
    inlist = sorted(inlist)
    # get the next power of 2 list length
    p2len = 2
    bits = 1
    while p2len < len(inlist):
        p2len *= 2
        bits += 1
    templist = [None] * p2len
    for i in range(len(inlist)):
        newi = i * p2len // len(inlist)
        newi = bitreverse(newi, bits)
        templist[newi] = inlist[i]
    return [item for item in templist if item != None]

print(antisort(["a","b","c","d","e","f","g",
                "h","i","j","k","l","m","n","o","p","q","r",
                "s","t","u","v","w","x","y","z"]))
Output:
['a', 'n', 'h', 'u', 'e', 'r', 'k', 'x', 'c', 'p', 'f', 's',
'm', 'z', 'b', 'o', 'i', 'v', 'l', 'y', 'd', 'q', 'j', 'w', 'g', 't']
You could implement an inverted binary search.
from typing import Union, List

sorted_int_list = [1, 1, 2, 3, 3]
unsorted_int_list = [1, 3, 2, 1, 3]

sorted_str_list = [
    "example.com",
    "example.com",
    "test.com",
    "stackoverflow.com",
    "stackoverflow.com",
]
unsorted_str_list = [
    "example.com",
    "stackoverflow.com",
    "test.com",
    "example.com",
    "stackoverflow.com",
]

def inverted_binary_search(
    input_list: List[Union[str, int]],
    search_elem: Union[int, str],
    list_selector_start: int,
    list_selector_end: int,
) -> int:
    if list_selector_end - list_selector_start <= 1:
        if search_elem < input_list[list_selector_start]:
            return list_selector_start - 1
        else:
            return list_selector_start

    list_selector_mid = (list_selector_start + list_selector_end) // 2
    if input_list[list_selector_mid] > search_elem:
        return inverted_binary_search(
            input_list=input_list,
            search_elem=search_elem,
            list_selector_start=list_selector_mid,
            list_selector_end=list_selector_end,
        )
    elif input_list[list_selector_mid] < search_elem:
        return inverted_binary_search(
            input_list=input_list,
            search_elem=search_elem,
            list_selector_start=list_selector_start,
            list_selector_end=list_selector_mid,
        )
    else:
        return list_selector_mid

def inverted_binary_insertion_sort(your_list: List[Union[str, int]]):
    for idx in range(1, len(your_list)):
        selected_elem = your_list[idx]
        inverted_binary_search_position = (
            inverted_binary_search(
                input_list=your_list,
                search_elem=selected_elem,
                list_selector_start=0,
                list_selector_end=idx,
            )
            + 1
        )
        for idk in range(idx, inverted_binary_search_position, -1):
            your_list[idk] = your_list[idk - 1]
        your_list[inverted_binary_search_position] = selected_elem
    return your_list
Output
inverted_sorted_int_list = inverted_binary_insertion_sort(sorted_int_list)
print(inverted_sorted_int_list)
>> [1, 3, 3, 2, 1]
inverted_sorted_str_list = inverted_binary_insertion_sort(sorted_str_list)
print(inverted_sorted_str_list)
>> ['example.com', 'stackoverflow.com', 'stackoverflow.com', 'test.com', 'example.com']
Update:
Given the comments, you could also run the function twice. This will untangle duplicates.
inverted_sorted_int_list = inverted_binary_insertion_sort(
    inverted_binary_insertion_sort(sorted_int_list)
)
>> [1, 3, 2, 1, 3]
Here's a stab at it, but I am not sure it wouldn't degenerate a bit given particular input sets.
We pick the most frequently found item and append its first occurrence to a list. Then the same with the 2nd most frequent, and so on.
Repeat this for half the count of the most frequent item. That builds the left half of the list.
Then, moving from least frequent to most frequent, pick the first remaining occurrence of each item and add it. When an item occurs fewer than half the maximum number of times, pick which side you want to put it on.
Essentially, we layer key by key and end up with the more frequent items at the left-most and right-most positions in the unsorted list, leaving less frequent ones in the middle.
import random
from collections import defaultdict

def unsort(lst: list) -> list:
    """
    build a dictionary by frequency first
    then loop thru the keys and append
    key by key with the other keys in between
    """
    result = []
    # dictionary by keys (this would be domains to urls)
    di = defaultdict(list)
    for v in lst:
        di[v].append(v)
    # sort by decreasing dupes length
    li_len = [(len(val), key) for key, val in di.items()]
    li_len.sort(reverse=True)
    # most found count
    max_c = li_len[0][0]
    # halfway point
    odd = max_c % 2
    num = max_c // 2
    if odd:
        num += 1
    # frequency, high to low
    by_freq = [tu[1] for tu in li_len]
    di_side = {}
    # for the first half, pick from frequent to infrequent
    # alternating by keys
    for c in range(num):
        # frequent to less
        for key in by_freq:
            entries = di[key]
            # first pass: less than half the number of values
            # we don't want to put all the infrequents first
            # and have a more packed second side
            if not c:
                # pick on way in or out?
                if len(entries) <= num:
                    # might be better to alternate L,R,L
                    di_side[key] = random.choice(["L", "R"])
                else:
                    # pick on both
                    di_side[key] = "B"
            # put in the left half
            do_it = di_side[key] in ("L", "B")
            if do_it and entries:
                result.append(entries.pop(0))
    # once you have the midpoint, pick from infrequent to frequent
    for c in range(num):
        # less frequent to more
        for key in reversed(by_freq):
            entries = di[key]
            # put in the right half
            do_it = di_side[key] in ("R", "B")
            if entries:
                result.append(entries.pop(0))
    return result
Running this I got:
([1, 1, 2, 3, 3], '2.00', '====>', [3, 1, 2, 1, 3], '3.41')
([1, 3, 2, 3, 1], '3.41', '====>', [3, 1, 2, 1, 3], '3.41')
([1, 2, 3, 0, 1, 2, 3], '6.00', '====>', [3, 2, 1, 0, 1, 2, 3], '5.86')
([3, 2, 1, 0, 1, 2, 3], '5.86', '====>', [3, 2, 1, 0, 1, 2, 3], '5.86')
([3, 2, 1, 3, 1, 2, 3], '6.88', '====>', [3, 2, 3, 2, 1, 3, 1], '5.97')
([1, 2, 3, 4, 5], '0.00', '====>', [5, 1, 2, 3, 4], '0.00')
Oh, and I also added an assert to check nothing had been dropped or altered by the unsorting:
assert(sorted(lst) == sorted(ulst))
Alternate approach?
I'll leave it as a footnote for now, but the general idea of not clustering (not the OP's specific application of not overloading domains) seems like it would be a candidate for a force-repulsive approach, where identical domains would try to keep as far from each other as possible. i.e. 1, 1, 2 => 1, 2, 1 because the 1s would repel each other. That's a wholly different algorithmic approach however.
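For what it's worth, a toy sketch of that repulsive idea, assuming items live on a [0, 1] line and only identical items push each other apart (all names and constants here are invented for illustration):

import random

def repel(items, iterations=200, step=0.05):
    # every item gets a position in [0, 1]; identical items repel
    pos = [random.random() for _ in items]
    for _ in range(iterations):
        for i in range(len(items)):
            for j in range(len(items)):
                if i != j and items[i] == items[j]:
                    d = pos[i] - pos[j]
                    # closer pairs push harder; the sign keeps i moving away from j
                    pos[i] += step * (1 if d >= 0 else -1) / (abs(d) + 0.1)
        pos = [min(max(p, 0.0), 1.0) for p in pos]
    # read the arrangement off by sorting items by final position
    return [item for _, item in sorted(zip(pos, items))]

print(repel([1, 1, 2]))  # e.g. [1, 2, 1]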
When I faced a similar problem, here's how I solved it:
Define the "distance" between two strings (URLs in this case) as their Levenshtein distance (code to compute this value is readily available)
Adopt your favorite travelling-salesman algorithm to find the (approximate) shortest path through your set of strings (finding the exact shortest path isn't computationally feasible but the approximate algorithms are fairly efficient)
Now modify your "distance" metric to be inverted -- i.e. compute the distance between two strings (s1,s2) as MAX_INT - LevenshteinDistance(s1,s2)
With this modification, the "shortest path" through your set is now really the longest path, i.e. the most un-sorted one.
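A minimal sketch of this recipe, with a pure-Python edit distance and a greedy farthest-neighbour heuristic standing in for a proper travelling-salesman solver (helper names are invented for illustration):

def levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def antisort_path(strings):
    # inverted metric: always jump to the most dissimilar remaining string,
    # so the "shortest" path becomes the longest one
    remaining = list(strings)
    path = [remaining.pop(0)]
    while remaining:
        nxt = max(remaining, key=lambda s: levenshtein(path[-1], s))
        remaining.remove(nxt)
        path.append(nxt)
    return path

urls = ['http://example.com/404', 'http://example.com/405',
        'http://test.com/404', 'http://test.com/405']
print(antisort_path(urls))  # alternates between the two domains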
An easy way to scramble a list is to maximize its "sortness" score using a genetic algorithm with a permutation chromosome. I was able to hack quickly a version in R using the GA package. I'm not a Python user, but I am sure there are GA libraries for Python that include permutation chromosomes. If not, a general GA library with real-valued vector chromosomes can be adapted. You just use a vector with values in [0, 1] as a chromosome and convert each vector to its sort index.
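For illustration, here is that real-valued-chromosome decoding trick in plain Python, with simple random search standing in for a full GA (the names and the sample budget are invented for this sketch):

import random

def sortness(lst):
    # the question's score: sum of distances between equal items
    last_pos, score = {}, 0
    for i, item in enumerate(lst):
        if item in last_pos:
            score += i - last_pos[item]
        last_pos[item] = i
    return score

def decode(keys):
    # a real-valued chromosome in [0,1]^n, decoded as its sort index
    return sorted(range(len(keys)), key=keys.__getitem__)

items = [1, 1, 2, 3, 3]
best_keys = max(([random.random() for _ in items] for _ in range(2000)),
                key=lambda ks: sortness([items[i] for i in decode(ks)]))
print([items[i] for i in decode(best_keys)])  # e.g. [1, 3, 2, 1, 3], score 6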
I hope this algorithm works correctly:
unsorted_list = ['c', 'a', 'a', 'a', 'a', 'b', 'b']

d = {i: unsorted_list.count(i) for i in unsorted_list}
print(d)  # {'c': 1, 'a': 4, 'b': 2}

d = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
print(d)  # {'a': 4, 'b': 2, 'c': 1}

result = [None] * len(unsorted_list)
border_index_left = 0
border_index_right = len(unsorted_list) - 1

it = iter(d)

def set_recursively(k, nk):
    set_borders(k)
    set_borders(nk)
    if d[k]:
        set_recursively(k, nk)

def set_borders(key):
    global border_index_left, border_index_right
    if key is not None and d[key]:
        result[border_index_left] = key
        d[key] = d[key] - 1
        border_index_left = border_index_left + 1
    if key is not None and d[key]:
        result[border_index_right] = key
        d[key] = d[key] - 1
        border_index_right = border_index_right - 1

next_element = next(it, None)
for k, v in d.items():
    next_element = next(it, None)
    set_recursively(k, next_element)

print(result)  # ['a', 'b', 'a', 'c', 'a', 'b', 'a']
Visually, it looks as walking from the edge to the middle:
[2, 3, 3, 3, 1, 1, 0]
[None, None, None, None, None, None, None]
[3, None, None, None, None, None, None]
[3, None, None, None, None, None, 3]
[3, 1, None, None, None, None, 3]
[3, 1, None, None, None, 1, 3]
[3, 1, 3, None, None, 1, 3]
[3, 1, 3, 2, None, 1, 3]
[3, 1, 3, 2, 0, 1, 3]
Just saying, putting a short time delay would work just fine. I think someone mentioned this already. It is very simple and very reliable. You could do something like:
from random import uniform
from time import sleep
import requests

error404 = []
connectionError = []

for url in your_URL_list:
    statusCode = requests.get(str(url)).status_code
    if statusCode == 404:
        error404.append(url)
    sleep(uniform(0.1, 0.5))  # random delay between 0.1 and 0.5 seconds
Cheers

Finding the best supersets in a list based on intersections

I have a file (finalInjectionList) with lines as follows:
[0, 2, 3]
[0, 2, 3, 4]
[0, 3]
[1, 2, 4]
[2, 3]
[2, 3, 4]
Here [0, 2, 3, 4] and [1, 2, 4] are the best supersets for my problem, and I want to write them to an output file, because they are supersets of some other lines and NOT subsets of any line.
my code:
import ast
import itertools

def get_data(filename):
    with open(filename, 'r') as fi:
        data = fi.readlines()
    return data

def get_ast_set(line):
    return set(ast.literal_eval(line))

def check_infile(datafile, savefile):
    list1 = [get_ast_set(row) for row in get_data(datafile)]
    print(list1)
    outlist = []
    #for i in range(len(list1)):
    for a, b in itertools.combinations(list1, 2):
        if a.issuperset(b):
            with open(savefile, 'a') as fo:
                fo.writelines(str(a))

if __name__ == "__main__":
    datafile = str("./finalInjectionList")
    savefile = str("./filteredSets")
    check_infile(datafile, savefile)
My code writes all supersets, e.g. {2, 3, 4} as well. But {0, 2, 3, 4} already covers {2, 3, 4}, so I do not want to write {2, 3, 4} to the output file.
Is there any suggestion?
Your logic in the for loop with itertools.combinations is a bit flawed, as it would also create the combination ({2,3,4}, {2,3}), where {2,3,4} is the superset.
I would approach the problem by removing items from the list if they are a subset of another item.
import itertools
import ast

with open(r"C:\Users\%USERNAME%\Desktop\test.txt", 'r') as f:
    data = f.readlines()

data = [d.replace('\n', '') for d in data]
data = [set(ast.literal_eval(d)) for d in data]
data.sort(key=len)

data1 = data
for d in data:
    flag = 0
    for d1 in data1:
        print(d, d1)
        if d == d1:
            print('both sets are same')
            continue
        if d.issubset(d1):
            print(str(d) + ' is a subset of ' + str(d1))
            flag = 1
            break
        else:
            print(str(d) + ' is not a subset of ' + str(d1))
    if flag == 1:
        # if the set is a subset of another set, remove it
        data1 = [d1 for d1 in data1 if d1 != d]

print('set: ', data1)  # data1 will contain your result at the end of the loop
With input:
0, 2, 3
0, 2, 3, 4
0, 3
1, 2, 4
2, 3
2, 3, 4
The output will be
[{1, 2, 4}, {0, 2, 3, 4}]
which can be written to the file
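For reference, once the lines are parsed into sets, the same filtering can be expressed as a single comprehension (a sketch; s < t is Python's proper-subset test):

import ast

with open("./finalInjectionList") as f:
    sets_ = [set(ast.literal_eval(line)) for line in f if line.strip()]

# keep only the sets that are not a proper subset of any other set
best = [s for s in sets_ if not any(s < t for t in sets_)]
print(best)  # [{0, 2, 3, 4}, {1, 2, 4}]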
Solved by modifying the routine check_infile:
import ast
import itertools
# A union by rank and path compression based
# program to detect cycle in a graph
from collections import defaultdict

def findparent(d, node):
    """Goes through the chain of parents until we reach a node which is its own parent,
    meaning no node has it as a subset"""
    if d[node] == node:
        return node
    else:
        return findparent(d, d[node])

def get_data(filename):
    with open(filename, 'r') as fi:
        data = fi.readlines()
    return data

def get_ast_set(line):
    return set(ast.literal_eval(line))

def check_infile(datafile, savefile):
    """Find the minimum number of supersets as follows:
    1) identify the superset of each set
    2) go through the superset chains (findparent) to find the set of nodes which are supersets (roots)"""
    list1 = [get_ast_set(row) for row in get_data(datafile)]
    print(list1)
    outlist = []
    n = len(list1)
    # Initially each node is its own parent (i.e. include self as superset)
    # Here parent means superset
    parents = {u: u for u in range(n)}
    for u in range(n):
        a = list1[u]
        for v in range(u + 1, n):
            b = list1[v]
            if a.issuperset(b):
                parents[v] = u  # index u is superset of v
            elif b.issuperset(a):
                parents[u] = v  # index v is superset of u
    # Print root nodes
    roots = set()
    for u in range(n):
        roots.add(findparent(parents, u))
    with open(savefile, 'w') as fo:
        for i in roots:
            fo.write(str(list1[i]))
            fo.write('\n')

if __name__ == "__main__":
    datafile = str("./finalInjectionList.txt")
    savefile = str("./filteredSets.txt")
    check_infile(datafile, savefile)
Test File (finalInjectionList.txt)
[0, 2, 3]
[0, 2, 3, 4]
[0, 3]
[1, 2, 4]
[2, 3]
[2, 3, 4]
Output File (filteredSets.txt)
{0, 2, 3, 4}
{1, 2, 4}

Take the mean of values in a list if a duplicate is found

I have 2 lists which are associated with each other. E.g., here, 'John' is associated with '1', 'Bob' is associated with 4, and so on:
l1 = ['John', 'Bob', 'Stew', 'John']
l2 = [1, 4, 7, 3]
My problem is with the duplicate John. Instead of adding the duplicate John, I want to take the mean of the values associated with the Johns, i.e., 1 and 3, which is (3 + 1)/2 = 2. Therefore, I would like the lists to actually be:
l1 = ['John', 'Bob', 'Stew']
l2 = [2, 4, 7]
I have experimented with some solutions including for-loops and the "contains" function, but can't seem to piece it together. I'm not very experienced with Python, but linked lists sound like they could be used for this.
Thank you
I believe you should use a dict. :)
# Note: this answer uses Python 2 syntax (print statement, dict.iteritems).
import numpy as np

def mean_duplicate(l1, l2):
    ret = {}
    # Iterating through both lists...
    for name, value in zip(l1, l2):
        if not name in ret:
            # If the key doesn't exist, create it.
            ret[name] = value
        else:
            # If it already does exist, update it.
            ret[name] += value
    # Then for the average you're looking for...
    for key, value in ret.iteritems():
        ret[key] = value / l1.count(key)
    return ret

def median_between_listsElements(l1, l2):
    ret = {}
    for name, value in zip(l1, l2):
        # Creating key + list if doesn't exist.
        if not name in ret:
            ret[name] = []
        ret[name].append(value)
    for key, value in ret.iteritems():
        ret[key] = np.median(value)
    return ret

l1 = ['John', 'Bob', 'Stew', 'John']
l2 = [1, 4, 7, 3]

print mean_duplicate(l1, l2)
print median_between_listsElements(l1, l2)
# {'Bob': 4, 'John': 2, 'Stew': 7}
# {'Bob': 4.0, 'John': 2.0, 'Stew': 7.0}
The following might give you an idea. It uses an OrderedDict assuming that you want the items in the order of appearance from the original list:
from collections import OrderedDict

d = OrderedDict()
for x, y in zip(l1, l2):
    d.setdefault(x, []).append(y)
# OrderedDict([('John', [1, 3]), ('Bob', [4]), ('Stew', [7])])

names, values = zip(*((k, sum(v)/len(v)) for k, v in d.items()))
# ('John', 'Bob', 'Stew')
# (2.0, 4.0, 7.0)
Here is a shorter version using a dict:
final_dict = {}
l1 = ['John', 'Bob', 'Stew', 'John']
l2 = [1, 4, 7, 3]

for i in range(len(l1)):
    if final_dict.get(l1[i]) is None:
        final_dict[l1[i]] = l2[i]
    else:
        final_dict[l1[i]] = int((final_dict[l1[i]] + l2[i]) / 2)

print(final_dict)
Something like this:
#!/usr/bin/python
l1 = ['John', 'Bob', 'Stew', 'John']
l2 = [1, 4, 7, 3]

d = {}
for i in range(0, len(l1)):
    key = l1[i]
    if key in d:
        d[key].append(l2[i])
    else:
        d[key] = [l2[i]]

r = []
for key, values in d.items():  # iterate items so each name pairs with its own values
    r.append((key, sum(values) / len(values)))

print(r)
Hope the following code helps:
l1 = ['John', 'Bob', 'Stew', 'John']
l2 = [1, 4, 7, 3]

def remove_repeating_names(names_list, numbers_list):
    new_names_list = []
    new_numbers_list = []
    for first_index, first_name in enumerate(names_list):
        amount_of_occurencies = 1
        number = numbers_list[first_index]
        for second_index, second_name in enumerate(names_list):
            # Check if names match and
            # if this name wasn't read in earlier cycles or is not the same element.
            if (second_name == first_name):
                if (first_index < second_index):
                    number += numbers_list[second_index]
                    amount_of_occurencies += 1
                # Break the loop if this name was read earlier.
                elif (first_index > second_index):
                    amount_of_occurencies = -1
                    break
        if amount_of_occurencies != -1:
            new_names_list.append(first_name)
            new_numbers_list.append(number / amount_of_occurencies)
    return [new_names_list, new_numbers_list]

# Unmodified arrays
print(l1)
print(l2)

l1, l2 = remove_repeating_names(l1, l2)

# If you want the numbers list to be integer, not float, uncomment the following line:
# l2 = [int(number) for number in l2]

# Modified arrays
print(l1)
print(l2)

Finding item frequency in list of lists

Let's say I have a list of lists and I want to find the frequency in which pairs (or more) of elements appears in total.
For example, if I have [[a,b,c],[b,c,d],[c,d,e]]
I want: (a,b) = 1, (b,c) = 2, (c,d) = 2, etc.
I tried finding a usable apriori algorithm that would allow me to do this, but I couldn't find an easy-to-implement one in Python.
How would I approach this problem in a better way?
This is a way to do it:
from itertools import combinations

l = [['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e']]
d = {}
for i in l:
    # for every item in l take all the possible combinations of 2
    comb = combinations(i, 2)
    for c in comb:
        k = ''.join(c)
        if d.get(k):
            d[k] += 1
        else:
            d[k] = 1
Result:
>>> d
{'bd': 1, 'ac': 1, 'ab': 1, 'bc': 2, 'de': 1, 'ce': 1, 'cd': 2}
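A more compact variant of the same idea, using collections.Counter so the counting branch disappears (a sketch; note the keys become tuples rather than joined strings):

from collections import Counter
from itertools import combinations

l = [['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e']]
d = Counter(c for sub in l for c in combinations(sub, 2))
print(d)  # Counter({('b', 'c'): 2, ('c', 'd'): 2, ('a', 'b'): 1, ...})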

combining sets within a list

Hi so I'm trying to do the following but have gotten a bit stuck. Say I have a list of sets:
A = [set([1,2]), set([3,4]), set([1,6]), set([1,5])]
I want to create a new list which looks like the following:
B = [ set([1,2,5,6]), set([3,4]) ]
i.e. create a list of sets, with sets joined if they overlap. This is probably simple, but I can't quite get it right this morning.
This also works and is quite short:
import itertools

groups = [{'1', '2'}, {'3', '2'}, {'2', '4'}, {'5', '6'}, {'7', '8'}, {'7', '9'}]

while True:
    for s1, s2 in itertools.combinations(groups, 2):
        if s1.intersection(s2):
            break
    else:
        break
    groups.remove(s1)
    groups.remove(s2)
    groups.append(s1.union(s2))

groups
This gives the following output:
[{'5', '6'}, {'1', '2', '3', '4'}, {'7', '8', '9'}]
The while True does seem a bit dangerous to me, any thoughts anyone?
How about:
from collections import defaultdict

def sortOverlap(listOfTuples):
    # The locations of the values
    locations = defaultdict(lambda: [])
    # 'Sorted' list to return
    sortedList = []
    # For each tuple in the original list
    for i, a in enumerate(listOfTuples):
        for k, element in enumerate(a):
            locations[element].append(i)
    # Now construct the sorted list
    coveredElements = set()
    for element, tupleIndices in locations.items():
        # If we've seen this element already then skip it
        if element in coveredElements:
            continue
        # Combine the lists
        temp = []
        for index in tupleIndices:
            temp += listOfTuples[index]
        # Add to the list of sorted tuples
        sortedList.append(list(set(temp)))
        # Record that we've covered this element
        for element in sortedList[-1]:
            coveredElements.add(element)
    return sortedList

# Run the example (with tuples)
print(sortOverlap([(1, 2), (3, 4), (1, 5), (1, 6)]))
# Run the example (with sets)
print(sortOverlap([set([1, 2]), set([3, 4]), set([1, 5]), set([1, 6])]))
You could use intersection() and union() in for loops:
A = [set([1, 2]), set([3, 4]), set([1, 6]), set([1, 5])]
intersecting = []
for someSet in A:
    for anotherSet in A:
        if someSet.intersection(anotherSet) and someSet != anotherSet:
            intersecting.append(someSet.union(anotherSet))
            A.pop(A.index(anotherSet))
            A.pop(A.index(someSet))

finalSet = set([])
for someSet in intersecting:
    finalSet = finalSet.union(someSet)
A.append(finalSet)
print A
Output: [set([3, 4]), set([1, 2, 5, 6])]
A slightly more straightforward solution,
def overlaps(sets):
    overlapping = []
    for a in sets:
        match = False
        for b in overlapping:
            if a.intersection(b):
                b.update(a)
                match = True
                break
        if not match:
            overlapping.append(a)
    return overlapping
examples
>>> overlaps([set([1,2]), set([1,3]), set([1,6]), set([3,5])])
[{1, 2, 3, 5, 6}]
>>> overlaps([set([1,2]), set([3,4]), set([1,6]), set([1,5])])
[{1, 2, 5, 6}, {3, 4}]
B = []  # collects the merged sets
for set_ in A:
    new_set = set(set_)
    for other_set in A:
        if other_set == new_set:
            continue
        for item in other_set:
            if item in set_:
                new_set = new_set.union(other_set)
                break
    if new_set not in B:
        B.append(new_set)
Input/Output:
A = [set([1,2]), set([3,4]), set([2,3]) ]
B = [set([1, 2, 3]), set([2, 3, 4]), set([1, 2, 3, 4])]
A = [set([1,2]), set([3,4]), set([1,6]), set([1,5])]
B = [set([1, 2, 5, 6]), set([3, 4])]
A = [set([1,2]), set([1,3]), set([1,6]), set([3,5])]
B = [set([1, 2, 3, 6]), set([1, 2, 3, 5, 6]), set([1, 3, 5])]
This function will do the job, without touching the input:
from copy import deepcopy

def remove_overlapped(input_list):
    input = deepcopy(input_list)
    output = []
    index = 1
    while input:
        head = input[0]
        try:
            next_item = input[index]
        except IndexError:
            output.append(head)
            input.remove(head)
            index = 1
            continue
        if head & next_item:
            head.update(next_item)
            input.remove(next_item)
            index = 1
        else:
            index += 1
    return output
Here is a function that does what you want. It is probably not the most pythonic one, but it does the job; most likely it can be improved a lot.
from sets import Set

A = [set([1, 2]), set([3, 4]), set([2, 3])]

merges = any(a & b for a in A for b in A if a != b)
while merges:
    B = [A[0]]
    for a in A[1:]:
        merged = False
        for i, b in enumerate(B):
            if a & b:
                B[i] = b | a
                merged = True
                break
        if not merged:
            B.append(a)
    A = B
    merges = any(a & b for a in A for b in A if a != b)

print B
What is happening there is the following: we loop over all the sets in A (except the first, since we already added that to B). We check the intersection with all the sets in B; if the intersection results in anything but False (aka the empty set), we perform a union on the sets and start the next iteration. For the set operations, check this page:
https://docs.python.org/2/library/sets.html
& is the intersection operator
| is the union operator
You can probably go more pythonic using any() etc., but that would have required more processing so I avoided it.
