Using scipy pdist on pandas data frames (and) lists - python

I've run into an odd problem yet again.
Suppose I have the following dummy data frame (by way of demonstrating my problem):
import numpy as np
import pandas as pd
import string
# Test data frame
N = 3
col_ids = string.ascii_uppercase[:N]
df = pd.DataFrame(
    np.random.randn(5, 3*N),
    columns=['{}_{}'.format(letter, coord) for letter in col_ids for coord in list('xyz')])
df
This produces:
A_x A_y A_z B_x B_y B_z C_x C_y C_z
0 -1.339040 0.185817 0.083120 0.498545 -0.569518 0.580264 0.453234 1.336992 -0.346724
1 -0.938575 0.367866 1.084475 1.497117 0.349927 -0.726140 -0.870142 -0.371153 -0.881763
2 -0.346819 -1.689058 -0.475032 -0.625383 -0.890025 0.929955 0.683413 0.819212 0.102625
3 0.359540 -0.125700 -0.900680 -0.403000 2.655242 -0.607996 1.117012 -0.905600 0.671239
4 1.624630 -1.036742 0.538341 -0.682000 0.542178 -0.001380 -1.126426 0.756532 -0.701805
Now I would like to use scipy.spatial.distance.pdist on this pandas data frame. This turns out to be a rather non-trivial process. What pdist does is to compute the distance between m points using Euclidean distance (2-norm) as the distance metric between the points. The points are arranged as m n-dimensional row vectors in the matrix X (source).
So, there are a couple of things one has to do to create a function that operates on a pandas data frame so that pdist can be used. Note that pdist is convenient when the number of points gets very large. I've tried making my own, which works for a one-row data frame, but I cannot get it to work on the whole data frame at once.
Here's my attempt:
from scipy.spatial.distance import pdist, squareform
import numpy as np
import pandas as pd
import string
def Euclidean_distance(df):
    EcDist = pd.DataFrame(index=df.index)  # results container
    arr = df.values  # store the data frame values in a numpy array
    tag_list = [num for elem in arr for num in elem]  # flatten the array into a single list
    tag_list_3D = list(zip(*[iter(tag_list)] * 3))  # split into length-3 sub-lists that pdist() can work with
    EcDist = pdist(tag_list_3D)  # the distance between m points using Euclidean distance (2-norm)
    return EcDist
First I begin by creating a results container in pandas form to store the result in. Secondly, I save the pandas data frame as a numpy array in order to get it into list form in the next step; it has to be in list form because pdist only operates on lists. When the data frame is saved into an array, it is stored as a list within a list, so it has to be flattened, and the flattened result is saved in the 'tag_list' variable. Thirdly, tag_list is further reduced into sub-lists of length three, so that the x, y and z coordinates can be obtained for each point, which can then be used to find the Euclidean distance between all of these points (in this example there are three points, A, B and C, each being three-dimensional).
As said, the function works if the data frame is a single row, but when it is used on the given example it calculates the Euclidean distance over all 5x3 = 15 points, which yields a total of 105 distances. What I want it to do is calculate the distances per row (so pdist should only work on one row's three points at a time), such that my final results, for this example, would look something like this:
dist_1 dist_2 dist_3
0 0.807271 0.142495 1.759969
1 0.180112 0.641855 0.257957
2 0.196950 1.334812 0.638719
3 0.145780 0.384268 0.577387
4 0.044030 0.735428 0.549897
(these are just dummy numbers to show the desired shape)
Hence how do I get my function to apply to the data frame in a row-wise fashion?
Or better yet, how can I get it to perform the function on the entire data frame at once, and then store the result in a new data frame?
Any help would be very appreciated. Thanks.

If I understand correctly, you have "groups" of points. In your example each group has three points, which you call A, B and C. A is represented by three columns A_x, A_y, A_z, and likewise for B and C.
What I suggest is that you restructure your "wide-form" data into a "long" form in which each row contains only one point. Each row then will have only three columns for the coordinates, and then you will add an additional column to represent which group a point is in. Here's an example:
>>> d = pandas.DataFrame(np.random.randn(12, 3), columns=["X", "Y", "Z"])
>>> d["Group"] = np.repeat([1, 2, 3, 4], 3)
>>> d
X Y Z Group
0 -0.280505 0.888417 -0.936790 1
1 0.823741 -0.428267 1.483763 1
2 -0.465326 0.005103 -1.107431 1
3 -1.009077 -1.618600 -0.443975 2
4 0.535634 0.562617 1.165269 2
5 1.544621 -0.858873 -0.349492 2
6 0.839795 0.720828 -0.973234 3
7 -2.273654 0.125304 0.469443 3
8 -0.179703 0.962098 -0.179542 3
9 -0.390777 -0.715896 -0.897837 4
10 -0.030338 0.746647 0.250173 4
11 -1.886581 0.643817 -2.658379 4
The three points with Group==1 correspond to A, B and C in your first row; the three points with Group==2 correspond to A, B, and C in your second row; etc.
With this structure, computing the pairwise distances by group using pdist becomes straightforward:
>>> d.groupby('Group')[["X", "Y", "Z"]].apply(lambda g: pandas.Series(distance.pdist(g), index=["D1", "D2", "D3"]))
D1 D2 D3
Group
1 2.968517 0.918435 2.926395
2 3.119856 2.665986 2.309370
3 3.482747 1.314357 2.346495
4 1.893904 2.680627 3.451939
It is possible to do a similar thing with your existing setup, but it will be more awkward. The problem with the way you set it up is that critical information is encoded in a difficult-to-extract way: which columns are X, Y or Z coordinates, and which columns refer to point A versus B or C, is recorded only in the textual names of the columns. You as a human can see which columns are X values just by looking at them, but specifying that programmatically requires parsing the string names of the columns.
You can see this in how you made the column names with your '{}_{}'.format(letter, coord) business. It means that in order to use pdist on your data, you have to do the reverse operation and parse the column names as strings to decide which columns to compare. Needless to say, this will be awkward. With the data in "long" form, by contrast, there is no such difficulty: the X coordinates of all points line up in one column, likewise for Y and Z, and the information about which points are to be compared is also contained in one column (the "Group" column).
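For completeness, here is a hedged sketch (not part of the original answer) of one way to melt the question's wide frame into this long form, assuming the column names follow the letter_coordinate pattern ('A_x', 'A_y', ...) used in the question:
import pandas as pd

# The original row number becomes the 'Group' column; 'A_x' is split into
# a point label ('A') and a coordinate ('x').
long_df = df.rename_axis('Group').reset_index().melt(id_vars='Group')
long_df[['Point', 'Coord']] = long_df['variable'].str.split('_', expand=True)
long_df = (long_df
           .pivot_table(index=['Group', 'Point'], columns='Coord', values='value')
           .reset_index()
           .rename(columns={'x': 'X', 'y': 'Y', 'z': 'Z'}))
# long_df now has one point per row (Group, Point, X, Y, Z),
# ready for the groupby('Group') + pdist approach shown above.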
When you want to do large-scale operations on subsets of data, it's usually better to split out things into separate rows. This allows you to leverage the power of groupby, and is also usually what is expected by scipy tools.

Related

What is the most efficient way to run a cosine similarity test comparing all rows to all rows of two sparse pandas dataframes?

I'm working to speed up this chunk of code. It's currently taking multiple days to compare two dataframes ([192184 rows x 256 columns] by [7739 rows x 256 columns]). This is a relatively small use case and it's way too slow already. I'd like to compare all rows of dataframe 1 to all rows of dataframe 2, in order to rank the rows in DF2 by similarity to each row of DF1.
There is no need to keep this in pandas format, my end goal is just to write the output to a csv.
For simplicity, I'll write short example inputs, and the expected example output.
Input:
DF1:
A 1 10 7 8 3
B 1 0 1 0 1
C 6 1 0 0 0
DF2:
D 4 0 0 0 1
E 1 0 0 0 2
F 2 8 8 5 2
Output:
D E F
A 0.113690 0.209633 0.971074
B 0.700140 0.774597 0.546019
C 0.956943 0.441129 0.259129
My current code:
import pandas as pd
from scipy import spatial
#set up test case
a = [1,10,7,8,3]
b = [1,0,1,0,1]
c = [6,1,0,0,0]
d = [4,0,0,0,1]
e = [1,0,0,0,2]
f = [2,8,8,5,2]
DF1 = pd.DataFrame([a,b,c])
DF1.index = ["A","B","C"]
DF2 = pd.DataFrame([d,e,f])
DF2.index = ["D","E","F"]
#declare variables
a_dict = {}
rows_list = list(DF2.index.values)
#something is wrong with this meat
for line in DF1.iterrows():
    temp = []
    for row in DF2.iterrows():
        temp.append(1 - spatial.distance.cosine(line[1], row[1]))
    a_dict[line[0]] = temp
#write final output
final_output = pd.DataFrame.from_dict(a_dict, orient="index", columns = rows_list)
For speeding this up, I've read that itertuples should be faster, but I don't think that addresses the main issue which is either the nested for loop or the spatial.distance.cosine test itself. I've read about different functions to run this test, but honestly don't know how to apply them in this scenario.
Also, this is my first question on here - feel free to give me tips for next time :)
Let's start by seeing how the same code could have been accelerated, just as an exercise in avoiding some for loops.
Spoiler alert, though: we won't use that algorithm at the end anyway, so it is just an exercise.
Vectorization of your code
What you are doing here is calling spatial.distance.cosine for every combination of rows, hence the two nested for loops. Let's remove the inner for loop by using np.vectorize, whose purpose is to apply the same function to all items of a vector and return the vector of results. A bit like map, but for numpy.
Example:
import numpy as np

def f(a, b):
    return a * b

fv = np.vectorize(f)
fv(np.array([1,2,3]), np.array([4,5,6]))
# array([ 4, 10, 18])
fv(np.array([1,2,3]), 4)
# array([ 4,  8, 12])
fv(np.array([[1,2],[3,4]]), np.array([5,6]))
# array([[ 5, 12],
#        [15, 24]])
So, fv is the vectorized version of f. You recognize the behavior of * on numpy arrays. Not a coincidence.
np.vectorize goes down any dimension needed to find scalars, in each parameter, and applies the original function to them. But in your case, we don't want it to do that, because we don't want it to iterate through all 256 columns and then apply spatial.distance.cosine on individual scalars.
But there is a remedy: we can specify, during vectorization, what the "atoms" are, i.e. what the signature of f is.
sdcv=np.vectorize(spatial.distance.cosine, signature="(n),(n)->()")
creates a vectorized version of cosine expecting to find, in both arguments, arrays of the same size, or arrays of such arrays, or arrays of arrays of such arrays, etc.
So, now, if you call
sdcv(line[1], DF2.values)
Since line[1] is an array of scalars (5 in your example, 256 in your real use case), and DF2.values is an array of arrays of scalars, this calls cosine(line[1], x) for x being each line of DF2.values, and returns the len(DF2) results as an array.
Hence an optimization
sdcv = np.vectorize(spatial.distance.cosine, signature="(n),(n)->()")

def mvec():
    #declare variables
    a_dict = {}
    rows_list = list(DF2.index.values)
    for line in DF1.iterrows():
        temp = 1 - sdcv(line[1], DF2.values)
        a_dict[line[0]] = temp
    #write final output
    return pd.DataFrame.from_dict(a_dict, orient="index", columns=rows_list)
We still have a loop. But the inner loop is now implicit, and done by vectorization.
Second layer
We can do the same with the outer loop
sdcv2 = np.vectorize(sdcv, signature="(n),(m,n)->(m)")

def mvec2():
    return pd.DataFrame(1 - sdcv2(DF1.values, DF2.values))
This time, we tell vectorize that we want the 1st argument to be a row, the 2nd argument to be a whole 2-D array, and the result to be a list of cosines, i.e. what the previous vectorization usage was producing.
So, if we call sdcv2 with DF1 and DF2, since DF1 is an array of rows, it will call sdcv for each row of DF1, with the second argument being DF2 (it doesn't iterate on DF2, because DF2 is already the expected 2-D array).
And since a call to sdcv with DF2 as second argument returns an array for each row, we get an array of arrays as a result.
Linalg solution
Now, that was just an exercise on vectorization, because there is a detail in your code, and therefore in both of mine: the 1- in front of spatial.distance.cosine.
spatial.distance.cosine(u, v) is defined as 1 - u.v/(||u||.||v||),
so what you really want is u.v/(||u||.||v||).
For the u.v part, it is very easy: there is a well-known operation in linear algebra that computes the scalar products of all possible combinations of a set of rows. That is simply matrix multiplication. Or almost: you need to transpose the 2nd matrix (since matrix multiplication yields the array of all combinations of a scalar product between a row of the 1st matrix and a column of the 2nd matrix).
So, in your case,
DF1.values @ DF2.values.T
is the matrix of scalar products between each row of DF1 and each row of DF2, i.e. u.v for u a row of DF1 and v a row of DF2.
We need to divide that by the norms of u and v in each cell.
To be more accurate: in all cells of the first line, u is the first row of DF1; in all cells of the 2nd line, u is the 2nd row of DF1; etc.
So we need to divide the kth line of DF1.values @ DF2.values.T by the norm of the kth line of DF1.
We can compute the norms of all lines of DF1 easily
np.linalg.norm(DF1.values, axis=1)
is a 1-D array of the norms of all lines of DF1.
And
np.linalg.norm(DF1.values, axis=1).reshape(-1,1)
is the same result, but in the form of a column
So
(DF1.values @ DF2.values.T) / np.linalg.norm(DF1.values, axis=1).reshape(-1,1)
is a matrix of u.v/||u|| for u each row of DF1, and v each row of DF2.
Likewise, the row of norms of the rows of DF2 is np.linalg.norm(DF2.values, axis=1).
So
(DF1.values @ DF2.values.T) / np.linalg.norm(DF1.values, axis=1).reshape(-1,1) / np.linalg.norm(DF2.values, axis=1)
is a matrix made of u.v/||u||/||v|| for u each row of DF1 and v each row of DF2.
Exactly what you wanted.
Hence my linalg method
def mnp():
    d = (DF1.values @ DF2.values.T) / np.linalg.norm(DF1.values, axis=1).reshape(-1, 1) / np.linalg.norm(DF2.values, axis=1)
    return pd.DataFrame(d, columns=DF2.index, index=DF1.index)
Timings
With dataframes of 500 columns and 3 rows (at first I only increased the number of columns, so that vectorization has a chance to be useful while still producing a 3x3 result matrix that I can check visually):
Method    Timing (ms)
Yours     2.9259588550194167
Vec       2.3712358399643563
Vec2      1.5134614360285923
NP        0.19741979095852003
So, 20% gain for vectorization. ×2 for double vectorization. And ×15 gain for linalg.
But 3 rows is not realistic. Now that we know it works, let's increase the number of rows.
With 100 rows (so about 30x more, and this is an O(N²) algorithm, even with a constant column size):
Method    Timing (ms)
Yours     1968.870209006127
Vec       904.9228489748202
Vec2      807.5719091983046
NP        5.865840002661571
This time, the advantage of vectorization, and even more of using global numpy operations, is more visible: ×2 for vectorization (the 2nd layer is less impressive this time, because the outer loop is never the one doing most of the work), and ×335 for NP.
On my machine, for 256 columns, 1000 rows in DF1 and 7739 rows in DF2, it takes 4 seconds (and that is a very slow machine; I am pretty sure that, with vectorization, my Mac would do it well under 1 second, but I don't have it at hand).
You have 192184 rows in DF1, you say, not 1000. But since you have "only" 7739 in DF2, at this point the cost scales linearly: the full 192184 rows would take only about 192 times longer than 1000 rows, so a few minutes.
What might be more of a problem with that much data is memory, because of the storage of the result matrix itself.
But that is not a problem: you can slice DF1 into batches of 1000 or 10000 rows, and you'll get a 1000x7739 matrix each time. The real result is just the concatenation of those matrices.
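A hedged sketch of that batching idea (not from the original answer), reusing the linalg method on successive slices of DF1 and stacking the resulting blocks:
import numpy as np
import pandas as pd

def cosine_sim_batched(DF1, DF2, batch=1000):
    # Row norms of DF2, computed once and reused for every batch.
    b_norm = np.linalg.norm(DF2.values, axis=1)
    blocks = []
    for start in range(0, len(DF1), batch):
        A = DF1.values[start:start + batch]
        a_norm = np.linalg.norm(A, axis=1).reshape(-1, 1)
        # Cosine similarity block for this slice of DF1 against all of DF2.
        blocks.append((A @ DF2.values.T) / a_norm / b_norm)
    return pd.DataFrame(np.vstack(blocks), index=DF1.index, columns=DF2.index)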
Edit
A variant I've tried, expecting the improvement to be negligible, is using einsum. Not a game changer, but not as negligible as I expected.
einsum is a way to specify many kinds of iterative sum operations over arrays. For example, matrix multiplication A @ B is the equivalent of np.einsum('ij,jk', A, B).
So, the scalar products of every pair of rows would be np.einsum('ij,kj', A, B).
Which is, again, the same as matrix multiplication, with a transposition of the second argument.
And just avoiding this .T is apparently not that negligible.
So, the code for this method:
def mein():
    d = np.einsum('ij,kj', DF1.values, DF2.values) / np.linalg.norm(DF1.values, axis=1).reshape(-1, 1) / np.linalg.norm(DF2.values, axis=1)
    return pd.DataFrame(d, columns=DF2.index, index=DF1.index)
It is the exact same as before, but DF1.values @ DF2.values.T has been changed into np.einsum('ij,kj', DF1.values, DF2.values).
And the updated timings with this:
Method    Timing (ms)
Yours     1968.87
Vec       904.92
Vec2      807.57
NP        5.87
einsum    5.41
So, not a revolution, but still an optimization.

Pandas Correlation Between List of Columns X Whole Dataframe

I'm looking for help with the Pandas .corr() method.
As is, I can use the .corr() method to calculate a heatmap of every possible combination of columns:
corr = data.corr()
sns.heatmap(corr)
Which, on my dataframe of 23,000 columns, may terminate near the heat death of the universe.
I can also do the more reasonable correlation between a subset of values
data2 = data[list_of_column_names]
corr = data2.corr(method="pearson")
sns.heatmap(corr)
That gives me something that I can use.
What I would like to do is compare a list of 20 columns with the whole dataset. The normal .corr() function can give me a 20x20 or 23,000x23,000 heatmap, but essentially I would like a 20x23,000 heatmap.
How can I add more specificity to my correlations?
Thanks for the help!
Make a list of the subset that you want (in this example it is A, B, and C), create an empty dataframe, then fill it with the desired values using a nested loop.
df = pd.DataFrame(np.random.randn(50, 7), columns=list('ABCDEFG'))
# initiate empty dataframe
corr = pd.DataFrame()
# compute the full correlation matrix once, outside the loop
full_corr = df.corr()
for a in list('ABC'):
    for b in list(df.columns.values):
        corr.loc[a, b] = full_corr.loc[a, b]
corr
Out[137]:
A B C D E F G
A 1.000000 0.183584 -0.175979 -0.087252 -0.060680 -0.209692 -0.294573
B 0.183584 1.000000 0.119418 0.254775 -0.131564 -0.226491 -0.202978
C -0.175979 0.119418 1.000000 0.146807 -0.045952 -0.037082 -0.204993
sns.heatmap(corr)
After working through this last night, I came to the following answer:
import scipy.stats
import pandas as pd
import seaborn as sns

# data table imported earlier as 'data'
# Create a new dictionary
plotDict = {}
# Loop across each of the two lists that contain the items you want to compare
for gene1 in list_1:
    for gene2 in list_2:
        # Do a pearsonr comparison between the two items you want to compare
        tempDict = {(gene1, gene2): scipy.stats.pearsonr(data[gene1], data[gene2])}
        # Update the dictionary each time you do a comparison
        plotDict.update(tempDict)
# Unstack the dictionary into a DataFrame
dfOutput = pd.Series(plotDict).unstack()
# Optional: take just the pearson r value out of the output tuple
dfOutputPearson = dfOutput.apply(lambda col: col.apply(lambda val: val[0]))
# Optional: generate a heatmap
sns.heatmap(dfOutputPearson)
Much like the other answers, this generates a heatmap, but it can be scaled to allow for a 20,000x30 matrix without computing the correlation across the entire 20,000x20,000 set of combinations (and therefore it finishes much more quickly).
Usually, calculating correlation coefficients pairwise for all variables makes the most sense; pd.DataFrame.corr() is a convenience function that does exactly that (for all pairs).
You can also do it with scipy, only for specified pairs, within a loop.
Example:
d=pd.DataFrame([[1,5,8],[2,5,4],[7,3,1]], columns=['A','B','C'])
One pair in pandas could be:
d.corr().loc['A','B']
-0.98782916114726194
Equivalent in scipy:
import scipy.stats
scipy.stats.pearsonr(d['A'].values,d['B'].values)[0]
-0.98782916114726194
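For the original 20-columns-vs-all use case, here is a hedged sketch (not part of the original answers) using pandas' DataFrame.corrwith, which correlates every column of a frame with a given Series, so only the needed 20 x 23,000 coefficients are computed:
import pandas as pd
import seaborn as sns

# 'data' and 'list_of_column_names' are the names used in the question.
subset_corr = pd.DataFrame(
    {col: data.corrwith(data[col]) for col in list_of_column_names}
)  # shape: (all columns, subset columns)
sns.heatmap(subset_corr.T)  # the 20 x 23,000 heatmap orientation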

How to segment time series data into 3 column and 3 channels?

I have time series data (1000 data points) with the following column names:
X, Y, Z, A, B.
I want to generate 10 segments, each of 100 data points, with 3 channels, where the first channel contains the columns X,A,B, the second channel Y,A,B and the third channel Z,A,B.
How can I accomplish this in python?
Numpy
To rearrange the time series into the 10 segments, you can simply use np.reshape.
Example data of shape (XYZAB, timepoints):
a = np.random.randint(0,10,(5,1000))
print a.shape
>> (5L, 1000L)
Reshaping into the ten segments, resulting in (XYZAB, segments, timepoints):
b = np.reshape(a,(5,10,100))
print b.shape
>> (5L, 10L, 100L)
At this point, it may not be desirable to create what you call 'channels', as you would triplicate parts of your data (A and B) without really making it easier to access that data. You could access e.g. XAB simply like this:
xab = b[(0,3,4),:,:]
If you absolutely need the channels as individual copies, you can simply get them like this:
c = np.array([b[(0,3,4),:,:],
b[(1,3,4),:,:],
b[(2,3,4),:,:]])
print c.shape
>> (3L, 3L, 10L, 100L)
Which results in an array of shape (channel,column,segment,timepoints), where column refers to the original column names (e.g. (X,A,B) for channel 0).
Pandas
Just saw the pandas tag on your question, so...
df = pd.DataFrame(a.T, columns=list('XYZAB'))
Split into segments of 100 time points as a list of dfs:
segments = []
for group, segment in df.groupby(np.arange(len(df)) // 100):
    segments.append(segment)
Or, even better, just create a new column that indicates which segment each row belongs to:
df['segment'] = df.apply(lambda x : x.name // 100, axis=1)
At this point it's probably again best not to triplicate your data and instead use the df as it is. You can easily apply operations per time segment using df.groupby(['segment']), while selecting columns of interest by standard column selection, e.g.
df.groupby(['segment'])['X','A','B'].mean()
to get the per-segment mean of columns X, A and B.
Of course you can create e.g. a list or dict of 'channels' in this way, if you really need it.
channels = {'XAB':df[['segment','X','A','B']],
'YAB':df[['segment','Y','A','B']],
'ZAB':df[['segment','Z','A','B']]}
And you can make this into a pandas Panel:
pnl = pd.Panel(channels)
The best data structure to use depends on your particular use-case, but in general I would avoid using Panels and stick with either the 2D df or the 3D array (i.e. b).
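As a side note, pandas has since removed Panel; if a single object holding all three channels is still wanted, a hedged sketch (reusing the channels dict defined above) is to concatenate them into one frame with a MultiIndex on the columns:
import pandas as pd

# Concatenate the per-channel frames side by side; the dict keys ('XAB', 'YAB', 'ZAB')
# become the outer level of a column MultiIndex.
channels_df = pd.concat(channels, axis=1)
channels_df['XAB'].head()  # select one channel again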

How to sum over columns with some weight in a csr matrix in python

If I have a large csr_matrix A, I want to sum over its columns, simply
A.sum(axis=0)
does this for me, right? Are the corresponding axis values: 1->rows, 0->columns?
I'm stuck when I want to sum over columns with some weights, which are specified in a list, e.g. [1 2 3 4 5 4 3 ... 4 2 5], with the same length as the number of rows in the csr_matrix A. To be more clear, I want the inner product of each column vector with this weight vector. How can I achieve this with Python?
This is a part of my code:
uniFeature = csr_matrix(uniFeature)
[I,J] = uniFeature.shape
sumfreq = uniFeature.sum(axis=0)
sumratings = []
for j in range(J):
    column = uniFeature.getcol(j)
    column = column.toarray()
    sumtemp = np.dot(ratings, column)
    sumratings.append(sumtemp)
sumfreq = sumfreq.toarray()
average = np.true_divide(sumratings,sumfreq)
(Numpy is imported as np) There is a weight vector "ratings", the program is supposed to output the average rating for each column of the matrix "uniFeature".
I experimented with dotting column = uniFeature.getcol(j) directly with ratings (which is a list), but I get an error saying the formats do not agree. It works after column.toarray() and then dotting with ratings. But doesn't converting each column back to dense form defeat the point of having a sparse matrix, and wouldn't it be very slow? I ran the above code and it is too slow to show results. I guess there should be a way to dot the vector "ratings" with each column of the sparse matrix efficiently.
Thanks in advance!
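A minimal hedged sketch of the kind of vectorized operation being asked about (assuming ratings is a 1-D array with one weight per row of the sparse matrix):
import numpy as np
from scipy.sparse import csr_matrix

uniFeature = csr_matrix(uniFeature)
ratings = np.asarray(ratings)

sumratings = uniFeature.T.dot(ratings)                # inner product of ratings with every column
sumfreq = np.asarray(uniFeature.sum(axis=0)).ravel()  # plain column sums
average = np.true_divide(sumratings, sumfreq)         # average rating per column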

Plotting results of Pandas GroupBy

I'm starting to learn Pandas and am trying to find the most Pythonic (or panda-thonic?) ways to do certain tasks.
Suppose we have a DataFrame with columns A, B, and C.
Column A contains boolean values: each row's A value is either true or false.
Column B has some important values we want to plot.
What we want to discover is the subtle distinctions between B values for rows that have A set to false vs. B values for rows where A is true.
In other words, how can I group by the value of column A (either true or false), then plot the values of column B for both groups on the same graph? The two datasets should be colored differently to be able to distinguish the points.
Next, let's add another feature to this program: before graphing, we want to compute another value for each row and store it in column D. This value is the mean of all data stored in B for the entire five minutes before a record - but we only include rows that have the same boolean value stored in A.
In other words, if I have a row where A=True and time=t, I want to compute a value for column D that is the mean of B for all records from time t-5 to t that have the same A=True.
In this case, how can we execute the groupby on values of A, then apply this computation to each individual group, and finally plot the D values for the two groups?
I think #herrfz hit all the high points. I'll just flesh out the details:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
sin = np.sin
cos = np.cos
pi = np.pi
N = 100
x = np.linspace(0, pi, N)
a = sin(x)
b = cos(x)
df = pd.DataFrame({
'A': [True]*N + [False]*N,
'B': np.hstack((a,b))
})
for key, grp in df.groupby('A'):
    plt.plot(grp['B'], label=key)
    grp['D'] = grp['B'].rolling(window=5).mean()  # was pd.rolling_mean in older pandas
    plt.plot(grp['D'], label='rolling ({k})'.format(k=key))
plt.legend(loc='best')
plt.show()
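For the time-based part of the question (the mean of B over the previous five minutes, within each group), here is a hedged sketch that continues the example above and assumes df is re-indexed by timestamps rather than integers:
# Per-group, time-based rolling mean; requires a sorted DatetimeIndex on df.
df['D'] = df.groupby('A')['B'].transform(lambda s: s.rolling('5min').mean())

for key, grp in df.groupby('A'):
    plt.plot(grp['D'], label='5-min mean ({k})'.format(k=key))
plt.legend(loc='best')
plt.show()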
