I am looking for a way to visualize, for lack of a better word, the "density" or "heatmap" of some synthetic time series I have created.
I have a loop that creates a list holding the values of one time series. I don't think it matters, but just in case, here is the code. This is a Markov process: for each i, which represents the hour, I create a new value that depends on the previous hour and state:
for x in range(10000):
    start_h = 0
    start_s = 1
    generated_values_list = []
    for i in range(start_h,120):
        if i>=24:
            i=i%24
        print(str(start_s)+" | " +str(i))
        pot_value_list = GMM_vals_container_workingdays_spring["State: "+ str(start_s)+", hour: "+str(i)]
        if len(pot_value_list)>50:
            actual_value = random.choice(pot_value_list)#
            #cdf, gmm_x, gmm = GMM_erstellen(pot_value_list,50)
            #actual_value = gmm.sample()[0][0][0]
            #print("made by GMM")
        else:
            actual_value = random.choice(pot_value_list)
            #print("made not by GMM")
        generated_values_list.append(actual_value)
        probabilities_next_state = TPMs_WD[i][start_s-1]
        next_state = random.choices(states,weights=probabilities_next_state)
        start_s = next_state[0]
    plt.plot(generated_values_list)
But - I think - the only part that matters is this:
for x in range(10000):
    #some code that creates the generated_values_list
    plt.plot(generated_values_list)
This creates, as expected, a picture like this:
It is not clear from the plot which paths are the most common, so I would like frequently hit values to be more colorful, while less frequent values stay rather grey.
I think the seaborn library has something for that, but I can't make sense of the docs.
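One possible approach, sketched here under assumptions: instead of plotting every list inside the loop, collect all lists (for example by appending each generated_values_list to one outer list) and count how many series pass through each (hour, value bin) cell. The names path_density, plot_density and all_series are made up for this sketch, and the random walks only stand in for the real generated data.

```python
import numpy as np

def path_density(series, n_value_bins=50):
    """Count, for every time step, how many series fall into each value bin."""
    arr = np.asarray(series, dtype=float)          # shape (n_series, n_steps)
    edges = np.linspace(arr.min(), arr.max(), n_value_bins + 1)
    counts = np.zeros((n_value_bins, arr.shape[1]), dtype=int)
    for t in range(arr.shape[1]):
        counts[:, t], _ = np.histogram(arr[:, t], bins=edges)
    return counts, edges

def plot_density(counts, edges):
    """Draw the counts as a heatmap (needs matplotlib)."""
    import matplotlib.pyplot as plt
    steps = np.arange(counts.shape[1] + 1)         # cell borders on the time axis
    plt.pcolormesh(steps, edges, counts, cmap="viridis")
    plt.colorbar(label="number of series")
    plt.xlabel("hour")
    plt.ylabel("value")
    plt.show()

# Hypothetical stand-in for the generated lists: 1000 random walks of length 120.
rng = np.random.default_rng(0)
all_series = rng.normal(size=(1000, 120)).cumsum(axis=1)
counts, edges = path_density(all_series)
```

Calling plot_density(counts, edges) then shows frequently visited cells in strong colors. A colormap that starts at grey could be swapped in if rare values should fade into the background.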
I have been stuck trying to do this with numpy with no luck. I am trying to move from MATLAB to Python, however, the transition hasn't been so easy. Anyway, that doesn't matter.
I am trying to code the Python analog of this simple MATLAB line of code:
A(:,:,condtype==1 & Mat(:,9)==contra(ii)) = A(:,:, condtype ==1 & Mat(:,9)==contra(ii))-mean(A(:,:, condtype ==1 & Mat(:,9)==contra(ii)),3);
Right, so the above convoluted line of code does the following: it selects the trials matching a condition (about half of the 3rd dimension of A), subtracts the mean of those trials along the 3rd dimension, and writes the mean-removed values back into A.
How would one go about doing this in Python?
I actually figured it out. I was trying to use Python's "and" when I needed an element-wise operation on the boolean arrays (np.logical_and, the counterpart of MATLAB's &). Also, I needed to use keepdims=True for the mean. Here it is for anyone that wants to see:
import numpy as np

def RmContrastMean(targettype,trialsMat,Contrastlvls,dX):
    present = targettype==1
    absent = targettype==0
    for i in range(0,Contrastlvls.size):
        CurrentContrast = trialsMat[:,8]==Contrastlvls[i]
        # element-wise AND, like MATLAB's &; np.equal would also select
        # trials where both conditions are False
        preIdx = np.logical_and(present, CurrentContrast)
        absIdx = np.logical_and(absent, CurrentContrast)
        #mean
        dX[:,:,preIdx] = dX[:,:,preIdx]-np.mean(dX[:,:,preIdx],axis=2,keepdims=True)
        dX[:,:,absIdx] = dX[:,:,absIdx]-np.mean(dX[:,:,absIdx],axis=2,keepdims=True)
        #std
        dX[:,:,preIdx] = dX[:,:,preIdx]/np.std(dX[:,:,preIdx],axis=2,keepdims=True)
        dX[:,:,absIdx] = dX[:,:,absIdx]/np.std(dX[:,:,absIdx],axis=2,keepdims=True)
    return dX
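For readers who want to try the masking pattern in isolation, here is a minimal sketch on toy data. The array shapes and the names condtype and contrast are assumptions chosen for illustration, not taken from the original data.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3, 8))                    # toy data: 8 trials on the 3rd axis
condtype = np.array([1, 1, 0, 0, 1, 1, 0, 0])     # hypothetical condition labels
contrast = np.array([5, 5, 5, 5, 7, 7, 7, 7])     # hypothetical contrast levels

# Element-wise AND of two boolean masks, the counterpart of MATLAB's &
mask = (condtype == 1) & (contrast == 5)

# keepdims=True keeps the mean at shape (2, 3, 1) so it broadcasts over the trials
A[:, :, mask] -= A[:, :, mask].mean(axis=2, keepdims=True)
```

After this, the selected trials have zero mean along the third axis, while the other trials are untouched.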
I give a lot of information on the methods that I used to write my code. If you just want to read my question, skip to the quotes at the end.
I'm working on a project whose goal is to detect subpopulations in a group of patients. I thought this sounded like the perfect opportunity to use association rule mining, as I'm currently taking a class on the subject.
There are 42 variables in total. Of those, 20 are continuous and had to be discretized. For each variable, I used the Freedman-Diaconis rule to determine how many categories to divide it into.
def Freedman_Diaconis(column_values):
    #sort the list first
    column_values[1].sort()
    first_quartile = int(len(column_values[1]) * .25)
    third_quartile = int(len(column_values[1]) * .75)
    fq_value = column_values[1][first_quartile]
    tq_value = column_values[1][third_quartile]
    iqr = tq_value - fq_value
    # -1/3 is integer division in Python 2 and evaluates to -1, so use floats
    n_to_pow = len(column_values[1])**(-1.0/3)
    h = 2 * iqr * n_to_pow
    # the range is max minus min: after sorting, index 0 (not 1) is the minimum
    retval = (column_values[1][-1] - column_values[1][0])/h
    test = int(retval+1)
    return test
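As a cross-check, the Freedman-Diaconis rule can also be written with numpy. This is only a sketch: the function name is made up, and np.percentile interpolates quartiles, so the result can differ slightly from the index-based quartiles used above.

```python
import math
import numpy as np

def freedman_diaconis_bins(values):
    """Number of bins from bin width h = 2 * IQR * n**(-1/3)."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    h = 2.0 * (q3 - q1) * x.size ** (-1.0 / 3.0)
    if h == 0:
        return 1                      # degenerate column: everything in one bin
    return int(math.ceil((x.max() - x.min()) / h))

n_bins = freedman_diaconis_bins(range(1, 101))
```

For the integers 1 to 100 this gives 5 bins.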
From there I used min-max normalization
def min_max_transform(column_of_data, num_bins):
    min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(1, num_bins))
    data_min_max = min_max_normalizer.fit_transform(column_of_data[1])
    data_min_max_ints = take_int(data_min_max)
    return data_min_max_ints
to transform my data, and then I simply took the integer portion to get the final categorization.
def take_int(list_of_float):
    ints = []
    for flt in list_of_float:
        asint = int(flt)
        ints.append(asint)
    return ints
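One thing worth noting about scaling to (1, num_bins) and truncating: only the column maximum ends up in the top category, so the categories are not equally wide. A possible alternative, sketched with a made-up function name, is equal-width binning with np.digitize:

```python
import numpy as np

def equal_width_bins(values, num_bins):
    """Assign each value a category 1..num_bins using equal-width bins."""
    x = np.asarray(values, dtype=float)
    edges = np.linspace(x.min(), x.max(), num_bins + 1)
    # Digitizing against the interior edges yields 0..num_bins-1; shift to 1..num_bins
    return np.digitize(x, edges[1:-1]) + 1

labels = equal_width_bins([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 3)
# labels -> [1 1 1 2 2 2 3 3 3 3]
```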
I then also wrote a function that I used to combine this value with the variable name.
def string_transform(prefix, column, index):
    transformed_list = []
    transformed = ""
    if index < 4:
        for entry in column[1]:
            transformed = prefix+str(entry)
            transformed_list.append(transformed)
    else:
        prefix_num = prefix.split('x')
        for entry in column[1]:
            transformed = str(prefix_num[1])+'x'+str(entry)
            transformed_list.append(transformed)
    return transformed_list
This was done to differentiate variables that have the same value, but appear in different columns. For example, having a value of 1 for variable x14 means something different from getting a value of 1 in variable x20. The string transform function would create 14x1 and 20x1 for the previously mentioned examples.
After this, I wrote everything to a file in basket format
def create_basket(list_of_lists, headers):
    #for filename in os.listdir("."):
    #    if filename.e
    if not os.path.exists('baskets'):
        os.makedirs('baskets')
    down_length = len(list_of_lists[0])
    with open('baskets/dataset.basket', 'w') as basketfile:
        basket_writer = csv.DictWriter(basketfile, fieldnames=headers)
        for i in range(0, down_length):
            # columns 0 and 1 are trt and y; columns 2..41 are x1..x40
            row = {"trt": list_of_lists[0][i], "y": list_of_lists[1][i]}
            for k in range(1, 41):
                row["x" + str(k)] = list_of_lists[k + 1][i]
            basket_writer.writerow(row)
and I used the apriori package in Orange to see if there were any association rules.
rules = Orange.associate.AssociationRulesSparseInducer(patient_basket, support=0.3, confidence=0.3)
print "%4s %4s %s" % ("Supp", "Conf", "Rule")
for r in rules:
    my_rule = str(r)
    split_rule = my_rule.split("->")
    if 'trt' in split_rule[1]:
        print 'treatment rule'
        print "%4.1f %4.1f %s" % (r.support, r.confidence, r)
Using this technique, I found quite a few association rules with my testing data.
THIS IS WHERE I HAVE A PROBLEM
When I read the notes for the training data, there is this note
...That is, the only
reason for the differences among observed responses to the same treatment across patients is
random noise. Hence, there is NO meaningful subgroup for this dataset...
My question is,
why do I get multiple association rules that would imply that there are subgroups, when according to the notes I shouldn't see anything?
I'm getting lift numbers that are above 2 as opposed to the 1 that you should expect if everything was random like the notes state.
Supp Conf Rule
0.3 0.7 6x0 -> trt1
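To make the three numbers concrete, here is a small sketch that computes support, confidence and lift for one rule by hand. The function name and the four toy baskets are invented for illustration. A lift of 1 means the antecedent and consequent look independent.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = float(len(transactions))
    n_a = sum(1 for t in transactions if antecedent <= set(t))
    n_c = sum(1 for t in transactions if consequent <= set(t))
    n_ac = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_ac / n
    confidence = n_ac / float(n_a) if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

# Toy baskets, not real patient data
baskets = [["6x0", "trt1"], ["6x0", "trt1"], ["6x0", "trt0"], ["6x1", "trt0"]]
support, confidence, lift = rule_metrics(baskets, ["6x0"], ["trt1"])
```

Here support is 0.5, confidence is 2/3 and lift is 4/3. With only four baskets a lift above 1 means very little, which is the small-sample effect in miniature.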
Even though my code runs, I'm not getting results anywhere close to what should be expected. This leads me to believe that I messed something up, but I'm not sure what it is.
After some research, I realized that my sample size is too small for the number of variables that I have. I would need a way larger sample size in order to really use the method that I was using. In fact, the method that I tried to use was developed with the assumption that it would be run on databases with hundreds of thousands or millions of rows.
Hopefully this can be done with Python! I used two clustering programs on the same data and now have a cluster file from both. I reformatted the files so that they look like this:
Cluster 0:
 Brucellaceae(10)
  Brucella(10)
   abortus(1)
   canis(1)
   ceti(1)
   inopinata(1)
   melitensis(1)
   microti(1)
   neotomae(1)
   ovis(1)
   pinnipedialis(1)
   suis(1)
Cluster 1:
 Streptomycetaceae(28)
  Streptomyces(28)
   achromogenes(1)
   albaduncus(1)
   anthocyanicus(1)
etc.
These files contain bacterial species info. So I have the cluster number (Cluster 0), then right below it the family (Brucellaceae) and the number of bacteria in that family (10). Under that are the genera found in that family (name followed by number, Brucella(10)) and finally the species in each genus (abortus(1), etc.).
My question: I have 2 files formatted in this way and want to write a program that will look for differences between the two. The only problem is that the two programs cluster in different ways, so two clusters may be the same even if the actual "Cluster Number" is different (the contents of Cluster 1 in one file might match Cluster 43 in the other file, the only difference being the cluster number). So I need something that ignores the cluster number and focuses on the cluster contents.
Is there any way I could compare these 2 files to examine the differences? Is it even possible? Any ideas would be greatly appreciated!
Given:
file1 = '''Cluster 0:
 giant(2)
  red(2)
   brick(1)
   apple(1)
Cluster 1:
 tiny(3)
  green(1)
   dot(1)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')
file2 = '''Cluster 18:
 giant(2)
  red(2)
   brick(1)
   tomato(1)
Cluster 19:
 tiny(2)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')
Is this what you need?
def parse_file(open_file):
    result = []
    for line in open_file:
        indent_level = len(line) - len(line.lstrip())
        if indent_level == 0:
            levels = ['','','']
        item = line.lstrip().split('(', 1)[0]
        levels[indent_level - 1] = item
        if indent_level == 3:
            result.append('.'.join(levels))
    return result
data1 = set(parse_file(file1))
data2 = set(parse_file(file2))
differences = [
('common elements', data1 & data2),
('missing from file2', data1 - data2),
('missing from file1', data2 - data1) ]
To see the differences:
for desc, items in differences:
    print desc
    print
    for item in items:
        print '\t' + item
    print
prints
common elements
    giant.red.brick
    tiny.blue.candy
    tiny.blue.flower
missing from file2
    tiny.green.dot
    giant.red.apple
missing from file1
    giant.red.tomato
Since I see lots of different answers in the comments, I'll give you a very, very simple implementation of a script that you can start from.
Note that this does not answer your full question, but it points you in one of the directions mentioned in the comments.
Normally, if you have no experience, I'd argue you should go ahead and read up on Python (which I'll do anyway, and I'll throw in a few links at the bottom of the answer).
On to the fun stuff! :)
class Cluster(object):
    '''
    This is a class that will contain your information about the Clusters.
    '''
    def __init__(self, number):
        '''
        This is what some languages call a constructor, but it's not.
        This method initializes the properties with values from the method call.
        '''
        self.cluster_number = number
        self.family_name = None
        self.bacteria_name = None
        self.bacteria = []

#This part below isn't a part of the class, this is the actual script.
with open('bacteria.txt', 'r') as infile:
    cluster = None
    clusters = []
    for line in infile:
        line = line.strip()  # drop the indentation and the trailing newline
        if not line:
            continue
        if line.startswith('Cluster'):
            # "Cluster 12:" -> 12 (the number in the file, not the line index)
            cluster = Cluster(int(line.split()[1].rstrip(':')))
            clusters.append(cluster)
        else:
            if not cluster.family_name:
                cluster.family_name = line
            elif not cluster.bacteria_name:
                cluster.bacteria_name = line
            else:
                cluster.bacteria.append(line)
I wrote this to be as dumb and simple as I could, without any fancy stuff, and for Python 2.7.2.
You can copy this into a .py file and run it directly from the command line, for example with python bacteria.py.
Hope this helps a bit, and don't hesitate to come by our Python chat room if you have any questions! :)
http://learnpythonthehardway.org/
http://www.diveintopython.net/
http://docs.python.org/2/tutorial/inputoutput.html
check if all elements in a list are identical
Retaining order while using Python's set difference
You have to write some code to parse the file. If you ignore the cluster, you should be able to distinguish between family, genera and species based on indentation.
The easiest way is to define a named tuple:
import collections
Bacterium = collections.namedtuple('Bacterium', ['family', 'genera', 'species'])
You can make an instance of this object like this:
b = Bacterium('Brucellaceae', 'Brucella', 'canis')
Your parser should read a file line by line, and set the family and genera. If it then finds a species, it should add a Bacterium to a list:
with open('cluster0.txt', 'r') as infile:
    lines = infile.readlines()
family = None
genera = None
bacteria = []
for line in lines:
    # set family and genera.
    # if you detect a bacterium:
    bacteria.append(Bacterium(family, genera, species))
Once you have a list of all bacteria in each file or cluster, you can select from all the bacteria like this:
s = [b for b in bacteria if b.family == 'Streptomycetaceae']
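A fuller sketch of that idea, under one loud assumption: each level of the hierarchy is indented by exactly one more space than its parent (family 1, genus 2, species 3). If the real files use tabs or wider indents, the indent checks need adjusting. The function name parse_clusters and the two toy inputs are made up. Each cluster becomes a frozenset of species records, so the cluster number drops out of the comparison.

```python
import collections

Bacterium = collections.namedtuple('Bacterium', ['family', 'genus', 'species'])

def parse_clusters(lines):
    """Return a set of clusters; each cluster is a frozenset of Bacterium."""
    clusters = []
    current = None
    family = genus = None
    for raw in lines:
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip())
        name = raw.strip().split('(', 1)[0]
        if raw.startswith('Cluster'):      # header: start a new, unnamed cluster
            current = set()
            clusters.append(current)
        elif indent == 1:                  # assumed indent for a family line
            family = name
        elif indent == 2:                  # assumed indent for a genus line
            genus = name
        elif indent == 3 and current is not None:
            current.add(Bacterium(family, genus, name))
    return set(frozenset(c) for c in clusters)

file_a = ["Cluster 0:", " Brucellaceae(2)", "  Brucella(2)", "   abortus(1)", "   canis(1)"]
file_b = ["Cluster 43:", " Brucellaceae(2)", "  Brucella(2)", "   canis(1)", "   abortus(1)"]

clusters_a = parse_clusters(file_a)
clusters_b = parse_clusters(file_b)
only_in_a = clusters_a - clusters_b    # clusters with no identical counterpart in b
only_in_b = clusters_b - clusters_a
```

Here both differences are empty, because the two clusters have the same contents even though they are numbered 0 and 43.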
Comparing two clusterings is not a trivial task, and reinventing the wheel is unlikely to be successful. Check out this package, which has lots of different cluster similarity metrics and can compare dendrograms (the data structure you have).
The library is called CluSim and can be found here:
https://github.com/Hoosier-Clusters/clusim/
After learning so much from Stack Overflow, finally I have an opportunity to give back! A different approach from those offered so far is to relabel clusters to maximize alignment, after which comparison becomes easy. For example, if one algorithm assigns labels to a set of six items as L1=[0,0,1,1,2,2] and another assigns L2=[2,2,0,0,1,1], you want these two labelings to be equivalent, since L1 and L2 segment the items into clusters identically. This approach relabels L2 to maximize alignment, and in the example above will result in L2==L1.
I found a solution to this problem in "Menéndez, Héctor D. A genetic approach to the graph and spectral clustering problem. MS thesis. 2012." and below is an implementation in Python using numpy. I'm relatively new to Python, so there may be better implementations, but I think this gets the job done:
import numpy as np

def alignClusters(clstr1,clstr2):
    """Given 2 cluster assignments, this function will rename the second to
    maximize alignment of elements within each cluster. This method is
    described in Menéndez, Héctor D. A genetic approach to the graph and
    spectral clustering problem. MS thesis. 2012. (Assumes cluster labels
    are consecutive integers starting with zero)
    INPUTS:
    clstr1 - The first clustering assignment (numpy array)
    clstr2 - The second clustering assignment (numpy array)
    OUTPUTS:
    clstr2_temp - The second clustering assignment with clusters renumbered to
    maximize alignment with the first clustering assignment """
    K = np.max(clstr1)+1
    simdist = np.zeros((K,K))
    # similarity of cluster i in clstr1 and cluster j in clstr2
    for i in range(K):
        for j in range(K):
            dcix = clstr1==i
            dcjx = clstr2==j
            dd = np.dot(dcix.astype(int),dcjx.astype(int))
            simdist[i,j] = (dd/np.sum(dcix!=0) + dd/np.sum(dcjx!=0))/2
    # greedily pick the best remaining (i, j) pair, then block its row and column
    mask = np.zeros((K,K))
    for i in range(K):
        simdist_vec = np.reshape(simdist.T,(K**2,1))
        I = np.argmax(simdist_vec)
        xy = np.unravel_index(I,simdist.shape,order='F')
        x = xy[0]
        y = xy[1]
        mask[x,y] = 1
        simdist[x,:] = 0
        simdist[:,y] = 0
    swapIJ = np.unravel_index(np.where(mask.T),simdist.shape,order='F')
    swapI = swapIJ[0][1,:]
    swapJ = swapIJ[0][0,:]
    # rename label swapJ[k] in clstr2 to swapI[k]
    clstr2_temp = np.copy(clstr2)
    for k in range(swapI.shape[0]):
        swapj = [swapJ[k]==i for i in clstr2]
        clstr2_temp[swapj] = swapI[k]
    return clstr2_temp
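For a handful of clusters, a brute-force search over all label permutations can serve as a reference to check a relabeling against. This is not the method from the thesis, just an exhaustive sketch with a made-up function name. It tries every renaming of the second labeling and keeps the one that agrees with the first on the most items, so it is only practical for small numbers of clusters.

```python
import itertools
import numpy as np

def align_by_brute_force(labels1, labels2):
    """Rename labels2 so that it agrees with labels1 on as many items as possible."""
    labels1 = np.asarray(labels1)
    labels2 = np.asarray(labels2)
    k = int(max(labels1.max(), labels2.max())) + 1
    best, best_score = labels2.copy(), -1
    for perm in itertools.permutations(range(k)):
        candidate = np.asarray(perm)[labels2]   # perm[j] is the new name of old label j
        score = int(np.sum(candidate == labels1))
        if score > best_score:
            best, best_score = candidate, score
    return best

L1 = [0, 0, 1, 1, 2, 2]
L2 = [2, 2, 0, 0, 1, 1]
aligned = align_by_brute_force(L1, L2)
# aligned -> [0 0 1 1 2 2]
```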
I'm having a rare honest-to-goodness computer science problem (as opposed to the usual how-do-I-make-this-language-I-don't-write-often-enough-do-what-I-want problem), and really feeling my lack of a CS degree for a change.
This is a bit messy, because I'm using several dicts of lists, but the basic concept is this: a Twitter-scraping function that adds retweets of a given tweet to a graph, node-by-node, building outwards from the original author (with follower relationships as edges).
for t in RTs_list:
    g = nx.DiGraph()
    followers_list=collections.defaultdict(list)
    level=collections.defaultdict(list)
    hoppers=collections.defaultdict(list)
    retweets = []
    retweeters = []
    try:
        u = api.get_status(t)
        original_tweet = u.retweeted_status.id_str
        print original_tweet
        ot = api.get_status(original_tweet)
        node_adder(ot.user.id, 1)
        # Can't paginate -- can only get about ~20 RTs max. Need to work on small data here.
        retweets = api.retweets(original_tweet)
        for r in retweets:
            retweeters.append(r.user.id)
        followers_list["0"] = api.followers_ids(ot.user.id)[0]
        print len(retweets),"total retweets"
        level["1"] = ot.user.id
        g.node[ot.user.id]['crossover'] = 1
        if g.node[ot.user.id]["followers_count"]<4000:
            bum_node_adder(followers_list["0"],level["1"], 2)
        for r in retweets:
            rt_iterator(r,retweets,0,followers_list,hoppers,level)
    except:
        print ""
def rt_iterator(r,retweets,q,followers_list,hoppers,level):
    q = q+1
    if r.user.id in followers_list[str(q-1)]:
        hoppers[str(q)].append(r.user.id)
        node_adder(r.user.id,q+1)
        g.add_edge(level[str(q)], r.user.id)
        try:
            followers_list[str(q)] = api.followers_ids(r.user.id)[0]
            level[str(q+1)] = r.user.id
            if g.node[r.user.id]["followers_count"]<4000:
                bum_node_adder(followers_list[str(q)],level[str(q+1)],q+2)
                crossover = pull_crossover(followers_list[str(q)],followers_list[str(q-1)])
            if q<10:
                for r in retweets:
                    rt_iterator(r,retweets,q,followers_list,hoppers,level)
        except:
            print ""
There are some other function calls in there, but they're not related to the problem. The main issue is how Q counts when going from, e.g., a 2-hop node to a 3-hop node. I need it to build out to the maximum depth (10) for every branch from the center, whereas right now I believe it's just building out to the maximum depth for the first branch it tries. Hope that makes sense. If not, typing it up here has helped me; I think I'm just missing a loop in there somewhere, but it's tough for me to see.
Also, ignore that various dicts refer to Q+1 or Q-1; that's an artifact of how I implemented this before I refactored to make it recursive.
Thanks!
I'm not totally sure what you mean by "the center" but I think you want something like this:
def rt_iterator(depth, other-args):
    # store whatever info you need from this point in the tree
    if depth>= MAX_DEPTH:
        return
    # look at the nodes you want to expand from here
    for each node, in the order you want them expanded:
        rt_iterator(depth+1, other-args)
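The pseudocode above can be turned into a small runnable sketch on a toy follower graph. Everything here (the function expand, the dict followers, the set retweeters) is hypothetical and stands in for the Twitter API calls. The point is that depth is passed down as an argument, so every branch counts its own hops instead of sharing one counter.

```python
MAX_DEPTH = 10

def expand(node, followers, retweeters, depth=1, edges=None, seen=None):
    """Depth-limited recursion: each branch gets its own depth value."""
    if edges is None:
        edges = []
    if seen is None:
        seen = {node}
    if depth >= MAX_DEPTH:
        return edges
    for child in followers.get(node, []):
        # follow only retweeters that have not been placed in the graph yet
        if child in retweeters and child not in seen:
            seen.add(child)
            edges.append((node, child, depth + 1))
            expand(child, followers, retweeters, depth + 1, edges, seen)
    return edges

# Toy data: who follows whom, and who retweeted
followers = {"author": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
retweeters = {"a", "b", "c", "d"}
edges = expand("author", followers, retweeters)
```

Both branches ("author" to "a" to "c" and "author" to "b" to "d") are built out to the same depth, because returning from the first branch does not change the depth seen by the second.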
I think I've fixed it... this way Q isn't incremented when it shouldn't be.
def rt_iterator(r,retweets,q,depth,followers_list,hoppers,level):
    def node_iterator (r,retweets,q,depth,followers_list,hoppers,level):
        for r in retweets:
            if r.user.id in followers_list[str(q-1)]:
                hoppers[str(q)].append(r.user.id)
                node_adder(r.user.id,q+1)
                g.add_edge(level[str(q)], r.user.id)
                try:
                    level[str(q+1)] = r.user.id
                    if g.node[r.user.id]["followers_count"]<4000:
                        followers_list[str(q)] = api.followers_ids(r.user.id)[0]
                        bum_node_adder(followers_list[str(q)],level[str(q+1)],q+2)
                        crossover = pull_crossover(followers_list[str(q)],followers_list[str(q-1)])
                    if q<10:
                        node_iterator(r,retweets,q+1,depth,followers_list,hoppers,level)
                except:
                    print ""
    depth = depth+1
    q = depth
    if q<10:
        rt_iterator(r,retweets,q,depth,followers_list,hoppers,level)