Cluster analysis within a set of integers

Sorry for the broad title, I just do not know how to name this.
I have a list of integers, let's say:
X = [20, 30, 40, 50, 60, 70, 80, 100]
And a second list of tuples of size 2 to 6 made from these integers:
Y = [(20, 30), (40, 50, 80, 100), (100, 100, 100), ...]
Some of the numbers come up quite often in Y, and I'd like to identify the groups of integers that occur together often.
Right now, I'm counting the number of occurrences of each integer. It gives me some information, but nothing about the groups.
Example:
Y = [(20, 40, 80), (30, 60, 80), (60, 80, 100), (60, 80, 100, 20), (40, 60, 80, 20, 100), ...]
In that example, (60, 80) and (60, 80, 100) are combinations that come up often.
I could use itertools.combinations_with_replacement() to generate every combination and then count the number of occurrences, but is there a better way to do this?
Thanks.

I don't know whether this is strictly better or just similar, but you could check the appearance fraction of subsets. Below is a brute-force way of doing so, storing the results in a dictionary. It would quite possibly be better to build a tree in which you don't search a branch if the appearance rate of its elements already failed to make the cut (i.e. if (20, 80) does not appear together often enough, why search for (20, 80, 100)?).
import itertools

N = len(Y)
dicter = {}
for i in range(2, 7):
    for comb in itertools.combinations(X, i):
        c3 = set(comb)
        # fraction of the tuples in Y that contain every element of comb
        d3 = sum(c3.issubset(set(val)) for val in Y) / N
        dicter['{}'.format(c3)] = d3
As an edit: you are probably not interested in all the non-appearances, so I'll throw in a piece of code to cut down the final dictionary size. First we define a function that returns a shallow copy of our dictionary with one key removed. This is required to avoid a RuntimeError when looping over the dict.
def removekey(d, key):
    r = dict(d)
    del r[key]
    return r
Then we remove insignificant "clusters":
for d, v in dicter.items():
    if v < 0.1:
        dicter = removekey(dicter, d)
It will still be unsorted, as itertools and sets do not sort by themselves. Hope this will help you further along.
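As a possible shortcut for the pruning step above, a dict comprehension builds the filtered dictionary in one pass, and sorted() can rank what is left. This is only a sketch: the `support` dictionary is made-up sample data standing in for dicter, and 0.1 is the threshold used above.

```python
# Hypothetical sample data standing in for `dicter`
# (subset label -> appearance fraction).
support = {"{60, 80}": 0.8, "{20, 30}": 0.0, "{80, 100}": 0.6, "{40, 50}": 0.05}

# Keep only the subsets whose appearance fraction reaches the threshold.
significant = {key: frac for key, frac in support.items() if frac >= 0.1}

# Rank the survivors from most to least frequent.
ranked = sorted(significant.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Building a new dictionary avoids copying the whole dict once per removed key, which is what the removekey() loop does.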

The approach that you are looking for is called Frequent Itemset Mining. It finds frequent subsets, given a list of sets.
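To make the idea concrete, here is a naive counting sketch using only the standard library. It is not Apriori or FP-growth (the usual algorithms for this task), just a direct count of every sub-combination that actually occurs in each tuple. The function name `frequent_itemsets` and the `min_support` parameter are hypothetical names chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support=0.5, max_size=6):
    """Return {combination: support} for combinations of 2..max_size items."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # ignore repeats such as (100, 100, 100)
        for size in range(2, min(max_size, len(items)) + 1):
            # only combinations present in this tuple are ever counted
            counts.update(combinations(items, size))
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}


Y = [(20, 40, 80), (30, 60, 80), (60, 80, 100),
     (60, 80, 100, 20), (40, 60, 80, 20, 100)]
result = frequent_itemsets(Y, min_support=0.6)
for combo, support in sorted(result.items(), key=lambda kv: -kv[1]):
    print(combo, support)
```

Because it only enumerates subsets of tuples that exist in Y, it never spends time on combinations with zero occurrences. With tuples of at most 6 items, each tuple contributes at most 57 combinations.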

Related

Extract values from a list using a numpy range of indices between every two numbers

I want to extract a range of indices contained in the 'ranges' list from the 'nums' array.
For example:
ranges=np.arange(10, 100, 10).tolist()
#[10, 20, 30, 40, 50, 60, 70, 80, 90]
nums=np.arange(10, 1000, 5.5)
Here, I want to extract indices 10 to 20, and then 20 to 30 and so on until indices 80 to 90 specified in the 'ranges' list from the 'nums' array. I'm not sure how to cycle between every two numbers through the 'ranges' list.
If I just had to extract 2-3 ranges of indices, I would just hard-code the indices and slice:
idx1 = nums[10:21]
idx2 = nums[20:31]
idx3 = nums[30:41]
But this gets tedious to do for various combinations of ranges, especially in my original dataset with almost 100 ranges of indices to extract.
Appreciate any help on this.

Something like this?
>>> ranges = [10, 20, 35, 42]
>>> for start, end in zip(ranges[:-1], ranges[1:]):
...     print(start, end)
...
10 20
20 35
35 42
(Of course, do your extraction in place of print)
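For instance, the pairing above can drive the slicing directly. This is a sketch under two assumptions: a plain Python list stands in for the numpy array so the snippet has no dependencies (the same slicing works on an array), and the end index is included, matching the hard-coded nums[10:21] in the question.

```python
# Stand-in for np.arange(10, 1000, 5.5): 180 evenly spaced values.
nums = [10 + 5.5 * k for k in range(180)]
ranges = list(range(10, 100, 10))  # [10, 20, ..., 90]

# One slice per consecutive pair (10, 20), (20, 30), ..., (80, 90).
chunks = [nums[start:end + 1] for start, end in zip(ranges[:-1], ranges[1:])]

print(len(chunks))   # number of extracted ranges
print(chunks[0])     # values at indices 10..20
```

Drop the `+ 1` if the end index should be excluded.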

You can use a for loop with a chosen step:
idxs = []
for n in range(10, 1000, 10):
    # a slice that starts past the end of nums is simply empty
    idxs.append(nums[n:n+11])

This should do it:
dic={}
for i in np.transpose([ranges[:-1],ranges[1:]]):
    dic[str(i[0])+"-"+str(i[1])]=nums[i[0]:i[1]]
print(dic)

Knapsack I/O classic problem to get least valuable items

The classic knapsack addresses the solution to get the most valuable items inside the knapsack which has a limited weight it can carry.
I am trying to get instead the least valuable items.
The following code is a very good one, using recursive dynamic programming, from Rosetta Code: http://rosettacode.org/wiki/Knapsack_problem/0-1#Recursive_dynamic_programming_algorithm
def total_value(items, max_weight):
    return sum([x[2] for x in items]) if sum([x[1] for x in items]) <= max_weight else 0
cache = {}
def solve(items, max_weight):
    if not items:
        return ()
    if (items,max_weight) not in cache:
        head = items[0]
        tail = items[1:]
        include = (head,) + solve(tail, max_weight - head[1])
        dont_include = solve(tail, max_weight)
        if total_value(include, max_weight) > total_value(dont_include, max_weight):
            answer = include
        else:
            answer = dont_include
        cache[(items,max_weight)] = answer
    return cache[(items,max_weight)]
items = (
("map", 9, 150), ("compass", 13, 35), ("water", 153, 200), ("sandwich", 50, 160),
("glucose", 15, 60), ("tin", 68, 45), ("banana", 27, 60), ("apple", 39, 40),
("cheese", 23, 30), ("beer", 52, 10), ("suntan cream", 11, 70), ("camera", 32, 30),
("t-shirt", 24, 15), ("trousers", 48, 10), ("umbrella", 73, 40),
("waterproof trousers", 42, 70), ("waterproof overclothes", 43, 75),
("note-case", 22, 80), ("sunglasses", 7, 20), ("towel", 18, 12),
("socks", 4, 50), ("book", 30, 10),
)
max_weight = 400
solution = solve(items, max_weight)
print("items:")
for x in solution:
    print(x[0])
print("value:", total_value(solution, max_weight))
print("weight:", sum([x[1] for x in solution]))
I have been trying to figure out how I can get the least valuable items, searching the internet with no luck, so maybe somebody can help me with that.
I really appreciate your help in advance.

I'll try my best to guide you through what should be done to achieve this.
To change this code so that it finds the least valuable items with which you can fill the bag, write a function which:
1. Takes the most valuable items (solution in your code) as its input.
2. Finds the items that you will be leaving behind (I'll call them least_items).
3. Checks whether the total weight of the items in least_items is greater than max_weight.
   - If yes, find the most valuable items in least_items and remove them from least_items. This is a place where you will have to use some sort of recursion to keep separating the least valuable from the most valuable.
   - If no, that means you could fill your knapsack with more items. So you have to go back to the most valuable items you had and keep looking for the least valuable ones until you fill the knapsack. Again, some sort of recursion will be needed.
Note that you will also have to include a terminating step so that the program stops when it has found the best solution.
This is not the best solution you could make, though. I tried finding something better myself, but unfortunately it demands more time than I thought. Feel free to leave any problems in the comments; I'll be happy to help.
Hope this helps.
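For small inputs, one way to pin down what "least valuable" could mean is a brute-force check. The sketch below assumes one particular reading of the question: a packing counts as "full" when it fits and no left-out item could still be added, and among full packings we want the lowest total value. That reading, the function name `least_valuable_full_pack`, and the four-item sample are all assumptions for illustration; the search is exponential, so it is not a replacement for the dynamic programming code.

```python
from itertools import combinations


def least_valuable_full_pack(items, max_weight):
    """Return (value, items) of the cheapest packing nothing more fits into."""
    best = None
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            weight = sum(w for _, w, _ in combo)
            if weight > max_weight:
                continue  # does not fit at all
            left_out = [it for it in items if it not in combo]
            if any(weight + it[1] <= max_weight for it in left_out):
                continue  # another item would still fit, so not "full"
            value = sum(v for _, _, v in combo)
            if best is None or value < best[0]:
                best = (value, combo)
    return best


# Hypothetical small sample: (name, weight, value), item names assumed unique.
items = (("map", 9, 150), ("compass", 13, 35), ("beer", 52, 10), ("book", 30, 10))
result = least_valuable_full_pack(items, 60)
print(result)
```

Without the "full" requirement, the least valuable packing would trivially be the empty one, so some such constraint has to be stated before the problem is well defined.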

How to solve Students marks dashboard kinds of problems - Can't we use simple code in python to solve this problem..?

Consider the marks list of class students given in two lists
Students = ['student1','student2','student3','student4','student5','student6','student7','student8','student9','student10']
Marks = [45, 78, 12, 14, 48, 43, 45, 98, 35, 80]
From the above two lists, Students[0] got Marks[0], Students[1] got Marks[1], and so on.
Who got marks between the 25th and 75th percentile, in increasing order of marks?
My question:
Can't we use simple code in Python to solve this problem?
I have written the code below, which finds the marks >25 and <75, but I am unable to put them in ascending order. sort() is not working and sorted() is not working either. Please help me extract the particular array values and assign them to another array to solve this problem.
for i in range(0,10):
    if Marks[i]>25 and Marks[i]<75:
        print(Students[i],Marks[i])
        print(i)

A small addition to your code can solve this issue; the solution is below:
Students = ['student1','student2','student3','student4','student5','student6','student7','student8','student9','student10']
Marks = [45, 78, 12, 14, 48, 43, 45, 98, 35, 80]
# addition to your code: pair each student with a mark and sort the pairs by mark
Students,Marks=zip(*sorted(zip(Students, Marks), key=lambda pair: pair[1]))
for i in range(0,10):
    if Marks[i]>25 and Marks[i]<75:
        print(Students[i],Marks[i])

25th percentile is "bottom fourth out of those who took the thing", and 75th percentile is "top fourth", regardless of the actual score. So what you need to do is sort the list, then take a slice out of the middle, based on the index.
Here's what I think you're trying to do:
import math
students = ['student1','student2','student3','student4','student5','student6','student7','student8','student9','student10']
marks = [45, 78, 12, 14, 48, 43, 45, 98, 35, 80]
# zip() will bind together corresponding elements of students and marks
# e.g. [('student1', 45), ('student2', 78), ...]
grades = list(zip(students, marks))
# once that's all in one list of 2-tuples, sort it by calling .sort() or using sorted()
# give it a "key", which specifies what criteria it should sort on
# in this case, it should sort on the mark, so the second element (index 1) of the tuple
grades.sort(key=lambda e:e[1])
# [('student3', 12), ('student4', 14), ('student9', 35), ('student6', 43), ('student1', 45), ('student7', 45), ('student5', 48), ('student2', 78), ('student10', 80), ('student8', 98)]
# now, just slice out the 25th and 75th percentile based on the length of that list
twentyfifth = math.ceil(len(grades) / 4)
seventyfifth = math.floor(3 * len(grades) / 4)
middle = grades[twentyfifth : seventyfifth]
print(middle)
# [('student6', 43), ('student1', 45), ('student7', 45), ('student5', 48)]
You have 10 students here, so how you round twentyfifth and seventyfifth is up to you. I chose to include only those strictly within the 25th-75th percentile by rounding 'inwards'; you could do the opposite by switching ceil and floor, which gives two more elements in the final list in this case, or you could round them both the same way.
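If you would rather compare against percentile values than slice by position, the standard library can compute the cut points. This is a sketch of that alternative, assuming Python 3.8+ for statistics.quantiles; it uses that function's default interpolation, so the students it keeps can differ from the index-based slice above.

```python
from statistics import quantiles

students = ['student1', 'student2', 'student3', 'student4', 'student5',
            'student6', 'student7', 'student8', 'student9', 'student10']
marks = [45, 78, 12, 14, 48, 43, 45, 98, 35, 80]

# With n=4, quantiles() returns the three quartile cut points (Q1, median, Q3).
q1, _, q3 = quantiles(marks, n=4)

# Keep the students strictly between Q1 and Q3, ordered by mark.
middle = sorted(
    ((s, m) for s, m in zip(students, marks) if q1 < m < q3),
    key=lambda pair: pair[1],
)
print(q1, q3)
print(middle)
```

Which variant is right depends on how "between the 25th and 75th percentile" is meant to treat marks near the boundaries.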

It looks like @Green Cloak Guy's answer is the correct one. But anyway, if what you want is the data of students with marks between two bounds, I'd do it like this:
# Get a dict of students with their marks, filtered to those with a mark between 25 and 75
students_mark = {s: m for s, m in zip(Students, Marks) if m > 25 and m < 75}
# Sort results
res = dict(sorted(students_mark.items(), key=lambda i: i[1]))
# res: {'student9': 35, 'student6': 43, 'student1': 45, 'student7': 45, 'student5': 48}
# In one line
res = {s: m for s, m in sorted(zip(Students, Marks), key=lambda i: i[1]) if m > 25 and m < 75}
In summary: first link each student with their score, then filter and sort. I stored the result as a dictionary because it seems more convenient.

All possible combinations of value-pairs (2-item tuples) in a sequence - PYTHON 2.7

I'm having a math brain fart moment, and google has failed to answer my quandary.
Given a sequence or list of 2 item tuples (from a Counter object), how do I quickly and elegantly get python to spit out a linear sequence or array of all the possible combinations of those tuples? My goal is trying to find the combinations of results from a Counter object.....
For example clarity, if I have this sequence:
[(500, 2), (250, 1)]
Doing this example out manually by hand, it should yield these results:
250, 500, 750, 1000, 1250.
Basically, I THINK it's a*b for the range of b and then add the resulting lists together...
I've tried this (where c=Counter object):
res = [[k*(j+1) for j in range(c[k])] for k in c]
And it will give me back:
res = [[250], [500, 1000]]
So far so good, it's going through each tuple and multiplying x * y for each y... But the resulting list isn't full of all the combinations yet, the first list [250] needs to be added to each element of the second list. This would be the case for any number of results I believe.
Now I think I need to take each list in this result list and add it to the other elements in the other lists in turn. Am I going about this wrong? I swear there should be a simpler way. I feel there should be a way to do this in a one line list comp.
Is the solution recursive? Is there a magic import or builtin method I don't know about? My head hurts......

I'm not entirely sure I follow you, but maybe you're looking for something like
from itertools import product
def lincombs(s):
    terms, ffs = zip(*s)
    factors = product(*(range(f+1) for f in ffs))
    outs = (sum(v*f for v,f in zip(terms, ff)) for ff in factors if any(ff))
    return outs
which gives
>>> list(lincombs([(500, 2), (250, 1)]))
[250, 500, 750, 1000, 1250]
>>> list(lincombs([(100, 3), (10, 3)]))
[10, 20, 30, 100, 110, 120, 130, 200, 210, 220, 230, 300, 310, 320, 330]

The v*f multiplication from @DSM's answer can be avoided (Python 2):
>>> from itertools import product
>>> terms = [(500, 2), (250, 1)]
>>> map(sum, product(*[xrange(0, v*a+1, v) for v, a in terms]))
[0, 250, 500, 750, 1000, 1250]
To get a sorted output without duplicates (here it is the iterable of sums produced above):
from itertools import groupby, imap
from operator import itemgetter
it = imap(itemgetter(0), groupby(sorted(it)))
though sorted(set(it)) that you use is ok in this case.
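For readers on Python 3, where xrange and imap no longer exist and map returns an iterator, the same idea might look like the sketch below. The function name `reachable_sums` is hypothetical, and the code assumes every value in the (value, count) pairs is a positive integer.

```python
from itertools import product


def reachable_sums(terms):
    """All distinct non-zero sums of 0..count copies of each value, sorted."""
    # For (500, 2) this range yields 0, 500, 1000: no multiplication needed.
    ranges = [range(0, value * count + 1, value) for value, count in terms]
    return sorted({sum(combo) for combo in product(*ranges)} - {0})


print(reachable_sums([(500, 2), (250, 1)]))
```

The set comprehension removes duplicate sums, and subtracting {0} drops the case where nothing is picked.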

Is there a Python library for handling complicated mathematical sets (constructed using mathematical set-builder notation)?

I often work with multidimensional arrays whose array indices are generated from a complicated user-specified set.
I'm looking for a library with classes for representing complicated sets with an arbitrary number of indices, and arbitrarily complicated predicates. Given a set description, the desired output would be a generator. This generator would in turn produce either dicts or tuples which correspond to the multidimensional array indices.
Does such a library exist?
Example
Suppose we had the following user-specified set (in set-builder notation), which represents the indices of some array variable x[i][j]:
{i in 1..100, j in 1..50: i >= 20, j >= 21, 2*(i + j) <= 100}
I'd like to put this into some sort of a lazy class (a generator expression perhaps) that will allow me to lazily evaluate the elements of the set to generate the indices for my array. Suppose this class were called lazyset; this would be the desired behavior:
>>> S = lazyset("{i in 1..100, j in 1..50: i >= 20, j >= 21, 2*(i+j) <= 100}")
>>> S
<generator object <genexpr> at 0x1f3e7d0>
>>> next(S)
{'i': 20, 'j': 21}
>>> next(S)
{'i': 20, 'j': 22}
I'm thinking I could roll my own using generator expressions, but this almost seems like a solved problem. So I thought I'd ask if anyone's come across an established library that handles this (to some extent, at least). Does such a library exist?

This looks more like a constraint-solver problem to me:
import constraint as c
p = c.Problem()
p.addVariable(0, range(1,101))
p.addVariable(1, range(1,51))
p.addConstraint(lambda i: i >= 20, [0])
p.addConstraint(lambda j: j >= 21, [1])
p.addConstraint(c.MaxSumConstraint(50))
indices = ((s[0], s[1]) for s in p.getSolutionIter()) # convert to tuple generator
then if you do
for ij in indices:
    print(ij)
you get
(29, 21)
(28, 22)
(28, 21)
(27, 23)
(27, 22)
(27, 21)
...
(20, 25)
(20, 24)
(20, 23)
(20, 22)
(20, 21)
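For comparison, the roll-your-own route mentioned in the question needs no library for this particular set. The sketch below hard-codes the example set as a generator expression; it does not parse set-builder strings, and the name `index_set` is hypothetical.

```python
def index_set():
    """Lazily yield {i in 1..100, j in 1..50: i >= 20, j >= 21, 2*(i+j) <= 100}."""
    return (
        {"i": i, "j": j}
        for i in range(1, 101)
        for j in range(1, 51)
        if i >= 20 and j >= 21 and 2 * (i + j) <= 100
    )


S = index_set()
print(next(S))  # first element, produced on demand
print(next(S))  # second element
```

Unlike the constraint solver, this scans the full 100 x 50 grid and filters it, so it only suits index ranges small enough to enumerate.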

I am not certain whether this specifically (the set-builder notation) is supported by scipy, but I think scipy is your best bet regardless.
There is support for sparse arrays in scipy, so you can let it handle those without actually allocating the full space :)
