How to fix Seaborn clustermap matrix? - python

I have a three column csv file that I am trying to convert to a clustered heatmap. My code looks like this:
import pandas as pd
import seaborn as sns

sum_mets = pd.read_csv('sum159_localization_met_magma.csv')
df5 = sum_mets[['Phenotype','Gene','P']]
clustermap5 = sns.clustermap(df5, cmap='inferno', figsize=(40, 40),
                             pivot_kws={'index': 'Phenotype',
                                        'columns': 'Gene',
                                        'values': 'P'})
I then receive this ValueError:
ValueError: The condensed distance matrix must contain only finite values.
For context, all of my values are non-zero. I am not sure which values it is unable to process.
Thank you in advance to anyone who can help.

Even though you have no NaN in the raw data, you still need to check whether your observations are complete, because there is a pivot happening underneath. For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Phenotype': np.repeat(['very not cool','not cool','very cool','super cool'], 4),
                   'Gene': ["Gene" + str(i) for i in range(4)] * 4,
                   'P': np.random.uniform(0, 1, 16)})
pd.pivot(df,columns="Gene",values="P",index="Phenotype")
Gene              Gene0     Gene1     Gene2     Gene3
Phenotype
not cool       0.567653  0.984555  0.634450  0.406642
super cool     0.820595  0.072393  0.774895  0.185072
very cool      0.231772  0.448938  0.951706  0.893692
very not cool  0.227209  0.684660  0.013394  0.711890
The above pivots without NaN, and plots well:
sns.clustermap(df,figsize=(5, 5),pivot_kws={'index': 'Phenotype','columns' : 'Gene','values' : 'P'})
but let's say if we have 1 less observation:
df1 = df[:15]
pd.pivot(df1,columns="Gene",values="P",index="Phenotype")
Gene              Gene0     Gene1     Gene2     Gene3
Phenotype
not cool       0.106681  0.415873  0.480102  0.721195
super cool     0.961991  0.261710  0.329859       NaN
very cool      0.069925  0.718771  0.200431  0.196573
very not cool  0.631423  0.403604  0.043415  0.373299
And it fails if you try to call clustermap on it:
sns.clustermap(df1, pivot_kws={'index': 'Phenotype','columns' : 'Gene','values' : 'P'})
The condensed distance matrix must contain only finite values.
I suggest checking whether the missing values are intended or a mistake. If you do indeed have missing values, you can get around the clustering step by pre-computing the linkage yourself and passing it to the function, for example using correlation-based distances as below:
import scipy.spatial as sp, scipy.cluster.hierarchy as hc

piv1 = pd.pivot(df1, columns="Gene", values="P", index="Phenotype")  # wide matrix, still contains the NaN
row_dism = 1 - piv1.T.corr()   # pairwise correlation simply ignores the missing entry
row_linkage = hc.linkage(sp.distance.squareform(row_dism), method='complete')
col_dism = 1 - piv1.corr()
col_linkage = hc.linkage(sp.distance.squareform(col_dism), method='complete')
sns.clustermap(piv1, figsize=(5, 5), row_linkage=row_linkage, col_linkage=col_linkage)
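To see whether those gaps are intended or a mistake, here is a quick check of which Phenotype/Gene combinations never received a P value (a sketch using the df5 frame from the question, not part of the original answer; note that DataFrame.pivot will also raise if any Phenotype/Gene pair occurs more than once):
pivoted = df5.pivot(index='Phenotype', columns='Gene', values='P')
missing = pivoted.isna().stack()
print(missing[missing].index.tolist())  # (Phenotype, Gene) pairs with no P value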


How do I calculate a large distance matrix with the haversine library in python?

I have a small set and a large set of locations and I need to know the geographic distance between the locations in these sets. An example of my datasets (they have the same structure, but one is larger):
     location        lat      long
0      Gieten  53.003312  6.763908
1    Godlinze  53.372605  6.814674
2  Grijpskerk  53.263894  6.306134
3   Groningen  53.219065  6.568008
In order to calculate the distances, I am using the haversine library.
The haversine function wants the input to look like this:
lyon = (45.7597, 4.8422) # (lat, lon)
london = (51.509865, -0.118092)
paris = (48.8567, 2.3508)
new_york = (40.7033962, -74.2351462)
haversine_vector([lyon, london], [paris, new_york], Unit.KILOMETERS, comb=True)
after which the output looks like this:
array([[ 392.21725956,  343.37455271],
       [6163.43638211, 5586.48447423]])
How do I get the function to calculate a distance matrix with my two datasets without adding all the locations separately? I have tried using dictionaries and I have tried looping over the locations in both datasets, but I can't seem to figure it out. I am pretty new to python, so if someone has a solution that is easy to understand but not very elegant I would prefer that over lambda functions and such. Thanks!
You are on the right track using haversine.haversine_vector.
Since I'm not sure how you got your dataset, this is a self-contained example using CSV datasets, but so long as you get lists of cities and coordinates somehow, you should be able to work it out.
Note that this does not compute distances between cities in the same array (e.g. not Helsinki <-> Turku) – if you want that too, you could concatenate your two datasets into one and pass it to haversine_vector twice.
import csv
import haversine


def read_csv_data(csv_data):
    cities = []
    locations = []
    for (city, lat, lng) in csv.reader(csv_data.strip().splitlines(), delimiter=";"):
        cities.append(city)
        locations.append((float(lat), float(lng)))
    return cities, locations


cities1, locations1 = read_csv_data(
    """
Gieten;53.003312;6.763908
Godlinze;53.372605;6.814674
Grijpskerk;53.263894;6.306134
Groningen;53.219065;6.568008
"""
)

cities2, locations2 = read_csv_data(
    """
Turku;60.45;22.266667
Helsinki;60.170833;24.9375
"""
)

distance_matrix = haversine.haversine_vector(locations1, locations2, comb=True)

distances = {}
for y, city2 in enumerate(cities2):
    for x, city1 in enumerate(cities1):
        distances[city1, city2] = distance_matrix[y, x]

print(distances)
This prints out e.g.
{
("Gieten", "Turku"): 1251.501257597515,
("Godlinze", "Turku"): 1219.2012174066822,
("Grijpskerk", "Turku"): 1251.3232414412073,
("Groningen", "Turku"): 1242.8700308545722,
("Gieten", "Helsinki"): 1361.4575055586013,
("Godlinze", "Helsinki"): 1331.2811273683897,
("Grijpskerk", "Helsinki"): 1364.5464743878606,
("Groningen", "Helsinki"): 1354.8847270142198,
}
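If a labelled matrix is easier to work with than a dict keyed by city pairs, the same result can be wrapped in a pandas DataFrame (a sketch added on top of the answer; pandas is otherwise not needed here):
import pandas as pd

dist_df = pd.DataFrame(distance_matrix, index=cities2, columns=cities1)
print(dist_df.loc["Helsinki", "Groningen"])  # distance in kilometres between the two cities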

How to use one dataframe's index to reindex another one in pandas

I am so sorry that I truly don't know what title I should use, but here is my question.
Stocks_Open
                   d-1          d-2          d-3          d-4
000001.HR  1817.670960  1808.937405  1796.928768  1804.570628
000002.ZH  4867.910878  4652.713598  4652.713598  4634.904168
000004.HD    92.046474    92.209029    89.526880    96.435445
000005.SS    28.822245    28.636893    28.358865    28.729569
000006.SH   192.362963   189.174626   185.986290   187.403328
000007.SH    79.190528    80.515892    81.509916    78.693516

Stocks_Volume
                 d-1       d-2       d-3       d-4
000001.HR     324234    345345    657546    234234
000002.ZH    4867343    465234   4652598   4634168
000004.HD    9246474    929029    826880    965445
000005.SS    2822245   2836893   2858865   2829569
000006.SH   19262963   1897466   1886290    183328
000007.SH    7190528    803892    809916   7693516
Above are the data I parsed from a database. What I want to do is obtain the correlation of open price and volume over the 4 days for each stock (the first column consists of the codes of the different stocks). In other words, I am trying to calculate the correlation of the corresponding rows of the two DataFrames. (This is only a simplified example; the real data extends to more than 1000 different stocks.)
My attempt is to create a DataFrame and run a loop, assigning the results to that DataFrame. But there is a problem: the index of the created DataFrame is not what I want. When I try to append the correlation column, the error occurs. (Please ignore the correlation values, which I concocted here just to give an example.)
r = pd.DataFrame(index=range(6), columns=['c'])
for i in range(6):
    r.iloc[i-1, :] = Stocks_Open.iloc[i-1].corr(Stocks_Volume.iloc[i-1])

Correlation_in_4days = pd.concat([Stocks_Open, Stocks_Volume], axis=1)
Correlation_in_4days['corr'] = r['c']
for i in range(6):
    Correlation_in_4days.iloc[i-1, 8] = r.iloc[i-1, :]
r
        c
1   0.654
2  -0.454
3  0.3321
4  0.2166
5 -0.8772
6  0.3256
The error occurred:
"ValueError: Incompatible indexer with Series"
I realize that my correlation DataFrame's index is an integer range and not the stock codes, but I don't know how to fix it. Is there any help?
My ideal result is:
corr
000001.HR 0.654
000002.ZH -0.454
000004.HD 0.3321
000005.SS 0.2166
000006.SH -0.8772
000007.SH 0.3256
Try assigning the index back:
r.index = Stocks_Open.index
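As a side note, the row-wise correlations can also be computed directly with pandas, which avoids building r in a loop altogether (a sketch, assuming Stocks_Open and Stocks_Volume share the same index and column layout; DataFrame.corrwith is a standard pandas method):
corr_by_stock = Stocks_Open.corrwith(Stocks_Volume, axis=1)  # Series indexed by stock code
print(corr_by_stock)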

Python Scikit-Learn PCA: Get Component Score

I am trying to perform a Principal Component Analysis for work. While I have been successful in getting the Principal Components laid out, I don't really know how to assign the resulting component score to each line item. I am looking for an output sort of like this.
Town        PrinComponent 1  PrinComponent 2  PrinComponent 3
Columbia            0.31989         -0.44216         -0.44369
Middletown         -0.37101         -0.24531         -0.47020
Harrisburg         -0.00974         -0.06105          0.32792
Newport            -0.38678          0.40935         -0.62996
The scikit-learn docs are not being helpful in this circumstance. Can anybody explain to me how I can reach this output?
The code I have so far is below.
def perform_PCA(df):
    threshold = 0.1
    pca = decomposition.PCA(n_components=3)
    numpyMatrix = df.as_matrix().astype(float)
    scaled_data = preprocessing.scale(numpyMatrix)
    pca.fit(scaled_data)
    pca.transform(scaled_data)
    pca_components_df = pd.DataFrame(data=pca.components_, columns=df.columns.values)
    #print pca_components_df
    #pca_components_df.to_csv('pca_components_df.csv')
    filtered = pca_components_df[abs(pca_components_df) > threshold]
    trans_filtered = filtered.T
    #print filtered.T #Transformed Dataframe
    trans_filtered.to_csv('trans_filtered.csv')
    print pca.explained_variance_ratio_
I passed the transformed array to the data argument of the DataFrame constructor, and then defined the columns and index by passing them to columns= and index= respectively.
pd.DataFrame(data=transformed, columns=["PC1", "PC2"], index=df.index)
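For a fuller picture, here is a minimal sketch of how the question's perform_PCA could be adapted to return per-town component scores (the helper name, column labels, and towns_df are assumptions for illustration, not code from the answer):
import pandas as pd
from sklearn import decomposition, preprocessing

def pca_scores(df, n_components=3):
    scaled = preprocessing.scale(df.values.astype(float))
    pca = decomposition.PCA(n_components=n_components)
    transformed = pca.fit_transform(scaled)  # rows = towns, columns = component scores
    cols = ["PrinComponent {}".format(i + 1) for i in range(n_components)]
    return pd.DataFrame(transformed, columns=cols, index=df.index)

# scores = pca_scores(towns_df)  # towns_df is assumed to be indexed by town name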

How Can I Compute Line-by-Line Statistics Across Multiple Files in Python

I have a series of space-delimited data files in x y format as below for a dummy data set, where y represents independent sample population means for value x.
File1.dat
1 15.99
2 17.34
3 16.50
4 18.12
File2.dat
1 10.11
2 12.76
3 14.10
4 19.46
File3.dat
1 13.13
2 12.14
3 14.99
4 17.42
I am trying to compute the standard error of the mean (SEM) line-by-line to get an idea of the spread of the data for each value of x. As an example using the first line of each file (x = 1), a solution would first compute the SEM of sample population means 15.99, 10.11, and 13.13 and print the solution in format:
x1 SEMx1
...and so on, iterating for every line across the three files.
At the moment, I envisage a solution to be something along the lines of:
1. Read in the data using something like numpy, perhaps specifying only the line of interest for the current iteration, e.g.
import numpy as np
data1 = np.loadtxt('File1.dat')
data2 = np.loadtxt('File2.dat')
data3 = np.loadtxt('File3.dat')
2. Use a tool such as SciPy stats to calculate the SEM from the three sample population means extracted in step 1
3. Print the result to stdout
4. Repeat for the remaining lines
While I imagine other stats packages such as R are well-suited to this task, I'd like to try and keep the solution solely contained within Python. I'm fairly new to the language, and I'm trying to get some practical knowledge in using it.
I see this as being a problem ideally suited for Scipy from what I've seen here in the forums, but haven't the vaguest idea where to start based upon the documentation.
NB: These files contain an equal number of lines.
First let's try to get just the columns of data that we need:
import numpy as np

filenames = ['File{}.dat'.format(i) for i in range(1, 4)]  # ['File1.dat', 'File2.dat', 'File3.dat']
data = [np.loadtxt(f) for f in filenames]                  # 3 arrays, each 4x2
stacked = np.vstack([arr[:, 1] for arr in data])           # keep only the y column from each file
Now we have just the data we need in a single array:
array([[ 15.99,  17.34,  16.5 ,  18.12],
       [ 10.11,  12.76,  14.1 ,  19.46],
       [ 13.13,  12.14,  14.99,  17.42]])
Next:
import scipy.stats as ss
result = ss.sem(stacked)
This gives you:
array([ 1.69761925, 1.63979674, 0.70048396, 0.59847956])
You can now print it, write it to a file (np.savetxt), etc.
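If you want the exact "x SEMx" layout from the question, one option is to pair the x column (which the question says is identical across files) with the SEM values; a small sketch on top of the arrays above, not part of the original answer:
x = data[0][:, 0]  # x column, taken from the first file
for xi, sem in zip(x, result):
    print("{:g} {:.6f}".format(xi, sem))
# or write it to a file instead:
np.savetxt('sem.dat', np.column_stack((x, result)), fmt=['%g', '%.6f'])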
Since you mentioned R, let's try that too!
filenames = c('File1.dat', 'File2.dat', 'File3.dat')
data = lapply(filenames, read.table)
stacked = cbind(data[[1]][2], data[[2]][2], data[[3]][2])
Now you have:
V2 V2 V2
1 15.99 10.11 13.13
2 17.34 12.76 12.14
3 16.50 14.10 14.99
4 18.12 19.46 17.42
Finally:
apply(stacked, 1, sd) / sqrt(length(stacked))
Gives you:
1.6976192 1.6397967 0.7004840 0.5984796
This R solution is actually quite a bit worse in terms of performance, because it uses apply on all the rows to get the standard deviation (and apply is slow, because it invokes the target function once per row). This is because base R does not offer row-wise (nor column-wise, etc.) standard deviation. And I needed sd because base R does not offer SEM. At least you can see it gives the same results.

Inexpensive way to add time series intensity in python pandas dataframe

I am trying to sum (and plot) a total from functions which change states at different times using Python's Pandas.DataFrame. For example:
Suppose we have 3 people whose states can be a) holding nothing, b) holding a 5 pound weight, and c) holding a 10 pound weight. Over time, these people pick weights up and put them down. I want to plot the total amount of weight being held. So, given:
My brute force attempt:
import pandas as ps
import math
import numpy as np

person1=[3,0,10,10,10,10,10]
person2=[4,0,20,20,25,25,40]
person3=[5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['count','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDfNoCount=allPeopleDf[['start1', 'end1', 'start2', 'end2', 'start3','end3']]
uniqueTimes=sorted(ps.unique(allPeopleDfNoCount.values.ravel()))
possibleStates=[-1,0,1,2] #extra state 0 for initialization
stateData={}
comboStates={}

#initialize dict to add up all of the stateData
for time in uniqueTimes:
    comboStates[time]=0.0

allPeopleDf['track']=-1
allPeopleDf['status']=-1
numberState=len(possibleStates)
starti=-1
endi=0
startState=0

for i in range(3):
    starti=starti+2
    print starti
    endi=endi+2
    for time in uniqueTimes:
        def helper(row):
            start=row[starti]
            end=row[endi]
            track=row[7]
            if start <= time and time < end:
                return possibleStates[i+1]
            else:
                return possibleStates[0]
        def trackHelp(row):
            status=row[8]
            track=row[7]
            if track<=status:
                return status
            else:
                return track
        def Multiplier(row):
            x=row[8]
            if x==0:
                return 0.0*row[0]
            if x==1:
                return 5.0*row[0]
            if x==2:
                return 10.0*row[0]
            if x==-1:#numeric place holder for non-contributing
                return 0.0*row[0]
        allPeopleDf['status']=allPeopleDf.apply(helper,axis=1)
        allPeopleDf['track']=allPeopleDf.apply(trackHelp,axis=1)
        stateData[time]=allPeopleDf.apply(Multiplier,axis=1).sum()
    for k,v in stateData.iteritems():
        comboStates[k]=comboStates.get(k,0)+v

print allPeopleDf
print stateData
print comboStates
Plots of weight being held over time might look like the following:
And the sum of the intensities over time might look like the black line in the following:
with the black line defined with the Cartesian points: (0,0 lbs),(5,0 lbs),(5,5 lbs),(15,5 lbs),(15,10 lbs),(20,10 lbs),(20,15 lbs),(25,15 lbs),(25,20 lbs),(40,20 lbs). However, I'm flexible and don't necessarily need to define the combined intensity line as a set of Cartesian points. The unique times can be found with:
print list(set(uniqueTimes).intersection(allNoCountT[1].values.ravel())).sort()
,but I can't come up with a slick way of getting the corresponding intensity values.
I started out with a very ugly function to break apart each "person's" graph so that all people had start and stop times (albeit many stop and start times without a state change) at the same time, and then I could add up all the "chunks" of time. This was cumbersome; there has to be a slick pandas way of handling this. If anyone can offer a suggestion or point me to another SO question that I might have missed, I'd appreciate the help!
In case my simplified example isn't clear, another might be plotting the intensity of sound coming from a piano: there are many notes being played for different durations with different intensities. I would like the sum of intensity coming from the piano over time. While my example is simplistic, I need a solution that is more on the scale of a piano song: thousands of discrete intensity levels per key, and many keys contributing over the course of a song.
Edit--Implementation of mgab's provided solution:
import pandas as ps
import math
import numpy as np
person1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
person2=['person2',4,0,20,20,25,25,40]
person3=['person3',5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['id','intensity','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDf=ps.melt(allPeopleDf,id_vars=['intensity','id'])
allPeopleDf.columns=['intensity','id','timeid','time']
df=ps.DataFrame(allPeopleDf).drop('timeid',1)
df[df.id=='person1'].drop('id',1) #easier to visualize one id for check
df['increment']=df.groupby('id')['intensity'].transform( lambda x: x.sub(x.shift(), fill_value= 0 ))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
End Edit
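A likely cause of that TypeError (a guess, since the full traceback isn't shown): building the frame via np.array(zip(...)).T coerces every column to strings, so the intensity and time columns need to be cast back to numbers before the shift/subtract in the transform can work, e.g.:
df['intensity'] = df['intensity'].astype(float)
df['time'] = df['time'].astype(float)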
Going for the piano keys example, let's assume you have three keys, with 30 levels of intensity.
I would try to keep the data in this format:
import pandas as pd
df = pd.DataFrame([[10, 'A', 5],
                   [10, 'B', 7],
                   [13, 'C', 10],
                   [15, 'A', 15],
                   [20, 'A', 7],
                   [23, 'C', 0]], columns=["time", "key", "intensity"])
   time key  intensity
0    10   A          5
1    10   B          7
2    13   C         10
3    15   A         15
4    20   A          7
5    23   C          0
where you record every change in intensity of any of the keys. From here you can already get the Cartesian coordinates for each individual key as (time,intensity) pairs
df[df.key=="A"].drop('key',1)
   time  intensity
0    10          5
3    15         15
4    20          7
Then, you can easily create a new column increment that will indicate the change in intensity that occurred for that key at that time point (intensity indicates just the new value of intensity)
df["increment"]=df.groupby("key")["intensity"].transform(
lambda x: x.sub(x.shift(), fill_value= 0 ))
df
   time key  intensity  increment
0    10   A          5          5
1    10   B          7          7
2    13   C         10         10
3    15   A         15         10
4    20   A          7         -8
5    23   C          0        -10
And then, using this new column, you can generate the (time, total_intensity) pairs to use as Cartesian coordinates
df.groupby("time").sum()["increment"].cumsum()
time
10    12
13    22
15    32
20    24
23    14
dtype: int64
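If you also want to draw the combined "black line" from the question, the same cumulative series can be plotted as a step function (a sketch; the matplotlib part is an assumption added here, not part of the original answer):
import matplotlib.pyplot as plt

total = df.groupby("time")["increment"].sum().cumsum()  # same series as above
total.plot(drawstyle="steps-post")  # intensity stays constant until the next change point
plt.xlabel("time")
plt.ylabel("total intensity")
plt.show()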
EDIT: applying the specific data presented in the question
Assuming the data comes as a list of values, starting with the element id (person/piano key), then a factor multiplying the measured weight/intensities for this element, and then pairs of time values indicating the start and end of a series of known states (weight being carried/intensity being emitted). Not sure if I got the data format right. From your question:
data1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
data2=['person2',4,0,20,20,25,25,40]
data3=['person3',5,0,5,5,15,15,40]
And if we know the weight/intensity of each one of the states, we can define:
known_states = [5, 10, 15]
DF_columns = ["time", "id", "intensity"]
Then, the easiest way I came up with to load the data includes this function:
import pandas as pd

def read_data(data, states, columns):
    id = data[0]
    factor = data[1]
    reshaped_data = []
    for i in range(len(states)):
        j = 2 + 2*i  # start/end pair for state i sits at positions j and j+1
        if not data[j] == data[j+1]:
            reshaped_data.append([data[j], id, factor*states[i]])
            reshaped_data.append([data[j+1], id, -1*factor*states[i]])
    return pd.DataFrame(reshaped_data, columns=columns)
Notice that the if not data[j] == data[j+1]: avoids loading data to the dataframe when start and end times for a given state are equal (seems uninformative, and wouldn't appear in your plots anyway). But take it out if you still want these entries.
Then, you load the data:
df = read_data(data1, known_states, DF_columns)
df = df.append(read_data(data2, known_states, DF_columns), ignore_index=True)
df = df.append(read_data(data3, known_states, DF_columns), ignore_index=True)
# and so on...
And then you're right at the beginning of this answer (substituting 'key' by 'id' and the ids, of course)
Appears to be what .sum() is for:
In [10]: allPeopleDf.sum()
Out[10]:
aStart     0
aEnd      35
bStart    35
bEnd      50
cStart    50
cEnd      90
dtype: int32
