Here is my code:
import pandas as pd
import numpy as np
import scipy.sparse as sp

data = pd.get_dummies(data['movie_id']).groupby(data['user_id']).apply(max)
df = pd.DataFrame(data)
replaced = df.replace(0, np.nan)
t = replaced.fillna(-1)
sparse = sp.csr_matrix(t.values)
My data consists of two columns, movie_id and user_id:
user_id movie_id
5 1000
6 1007
I want to convert the data to a sparse matrix. I first created an interaction matrix where rows indicate user_id and columns indicate movie_id with positive interaction as +1 and negative interaction as -1. Then I converted it to a sparse matrix using scipy. My result looks like this:
(0,0) -1
(0,1) -1
(0,2) 1
But what I actually want is this:
(1000,0) -1
(1000,1) 1
(1007,0) -1
Any help would be appreciated.
If you have both the row and column index (in your case movie_id and user_id, respectively), it is advisable to use the COO format for creation.
You can convert it into a sparse format like so (here interactions stands for a one-dimensional array with one +1/-1 value per (movie_id, user_id) pair; the two-dimensional t.values from your question cannot be passed as the data argument):
import scipy.sparse

sparse_mat = scipy.sparse.coo_matrix((interactions, (df.movie_id, df.user_id)))
Importantly, note that the constructor infers the shape of the sparse matrix from the maximum movie ID and user ID passed as indices; you can also pass an explicit shape=(M, N) argument.
Furthermore, you can convert this matrix to any other sparse format you desire, such as CSR.
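For instance, the conversion is a one-liner (a minimal illustration):
csr_mat = sparse_mat.tocsr()  # CSR supports efficient row slicing and arithmetic
coo_again = csr_mat.tocoo()   # and you can convert back if needed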
I import binary data from SQL into a pandas DataFrame consisting of the columns UserId and ItemId. I am using implicit/binary data, as you can see in the pivot_table below.
Dummy data
frame = pd.DataFrame()
frame['Id'] = [2134, 23454, 5654, 68768]
frame['ItemId'] = [123, 456, 789, 101]
I know how to create a pivot_table in Pandas using:
print(frame.groupby(['Id', 'ItemId'], sort=False).size().unstack(fill_value=0))
ItemId 123 456 789 101
Id
2134 1 0 0 0
23454 0 1 0 0
5654 0 0 1 0
68768 0 0 0 1
and convert that to a SciPy csr_matrix, but I want to create the sparse matrix right from the get-go, without having to convert from a pandas df. The reason is that I get an error: Unstacked DataFrame is too big, causing int32 overflow, because my original data consists of 378,777 rows.
Any help is much appreciated!
I am trying to do the same as the answers to Efficiently create sparse pivot tables in pandas?, but I do not have the frame['count'] data yet.
Using the 4th option from the csr_matrix documentation to instantiate the matrix:
from scipy.sparse import csr_matrix

Id = [2134, 23454, 5654, 68768]
ItemId = [123, 456, 789, 101]
csrm = csr_matrix(([1]*len(Id), (Id, ItemId)))
Result:
<68769x790 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements in Compressed Sparse Row format>
I am assuming that you can somehow read the data values into separate lists in memory, as you did in your example (one list for Id and one for ItemId). According to the comments on your post, we also do not expect duplicates. Note that the following will not work as intended if you have duplicates: the constructor sums the values of repeated (Id, ItemId) pairs, so you would get entries greater than 1.
The presented solution also yields a (sparse) matrix that is larger than the pivot table shown in the example, because the raw Id values are used directly as row indices.
To see how to pass them to the constructor, have a look at the SciPy documentation:
csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k].
Meaning we can directly pass the lists as indices to our sparse matrix as follows:
from scipy.sparse import csr_matrix

Id_values = load_values()           # gets the list of entries as in the post example
ItemId_values = load_more_values()

sparse_mat = csr_matrix(([1]*len(Id_values),                 # entries will be filled with ones
                         (Id_values, ItemId_values)),        # at those positions
                        shape=(max(Id_values)+1, max(ItemId_values)+1))  # shape is the respective maximum entry of each dimension
Note that this will not give you any sorting, but will instead put each value at its respective Id position, i.e. the first pair would be held at position (2134, 123) instead of (0, 0).
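If you do want the compact layout of the pivot table (rows and columns numbered 0..n-1), here is a minimal sketch that factorizes the raw values with np.unique first (variable names are illustrative):
import numpy as np
from scipy.sparse import csr_matrix

Id = [2134, 23454, 5654, 68768]
ItemId = [123, 456, 789, 101]

# return_inverse=True yields, for every original value, its position
# in the sorted array of unique values
unique_ids, row_idx = np.unique(Id, return_inverse=True)
unique_items, col_idx = np.unique(ItemId, return_inverse=True)

compact = csr_matrix(([1]*len(Id), (row_idx, col_idx)),
                     shape=(len(unique_ids), len(unique_items)))
# compact is 4x4 here; unique_ids and unique_items map the rows and
# columns back to the original labels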
I am trying to set up a sparse matrix (dok_matrix) of journal co-occurrences. Unfortunately, my solution is (too) inefficient to be of any use and I couldn't find any solution online.
EDIT: I would also like to create the sparse matrix directly, not by first creating a dense matrix and then turning it into a sparse matrix.
I start with a dataframe of how often certain journals are cited together. In this example, Nature and Science are cited together 3 times. I would like to end up with a sparse, symmetric matrix whose rows and columns are journals and whose non-empty entries are how often those journals are cited together. I.e., here the full matrix would have four rows (Lancet, Nature, NEJM, Science), four columns (Lancet, Nature, NEJM, Science), and three non-zero entries. Since my real data is much larger, I would like to use a sparse matrix representation.
What I currently do in my code is update the non-zero entries with the values from my DataFrame. Unfortunately, the comparison of journal names is quite time-consuming, and my question is whether there is a quicker way of setting up a sparse matrix here.
My understanding is that my dataframe is close to a dok_matrix anyway, with the journal combination being equivalent to the tuple used as a key in the dok_matrix. However, I do not know how to make this transformation.
Any help is appreciated!
# Import packages
import pandas as pd
from scipy.sparse import dok_matrix

# Set up dataframe
d = {'journal_comb': ['Nature//// Science', 'NEJM//// Nature', 'Lancet//// NEJM'],
     'no_combs': [3, 5, 6],
     'journal_1': ['Nature', 'NEJM', 'Lancet'],
     'journal_2': ['Science', 'Nature', 'NEJM']}
df = pd.DataFrame(d)

# Create sorted list of all journal titles
journal_list = sorted(set(df['journal_1']) | set(df['journal_2']))

# Set up empty sparse matrix with final size
S = dok_matrix((len(journal_list), len(journal_list)))

# Loop over all journal titles and get value from DataFrame for co-occurring journals
# Update sparse matrix value with value from DataFrame
for i in range(len(journal_list)):
    print(i)
    # Check whether journal name is actually in column 'journal_1'
    if len(df[df['journal_1'] == journal_list[i]]) > 0:
        for j in range(len(journal_list)):
            # If clause to circumvent error due to empty series if journals are not co-cited
            if len(df[(df['journal_1'] == journal_list[i]) & (df['journal_2'] == journal_list[j])]['no_combs']) == 1:
                # Update value in sparse matrix
                S[i, j] = df[(df['journal_1'] == journal_list[i]) & (df['journal_2'] == journal_list[j])]['no_combs'].iloc[0]
Use pandas first to shape your matrix:
symmetric = pd.concat([df,
                       df.rename(columns={'journal_1': 'journal_2',
                                          'journal_2': 'journal_1'})],
                      axis=0)
pivoted = symmetric.pivot(index='journal_1', columns='journal_2', values='no_combs')
S = dok_matrix(pivoted.fillna(0).values)
I have first appended the reversed pairs (journal_2 as journal_1 and vice versa), then pivoted to get the correct square shape, filled the missing combinations with 0 (otherwise dok_matrix would store every NaN), converted to a NumPy array with .values, and finally passed it to dok_matrix.
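Note that the pivot still materializes a dense table before the dok_matrix is built. If you want to stay sparse end to end (per the EDIT in the question), here is a minimal sketch using pd.Categorical codes and a COO matrix, assuming the same df as above:
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Shared, sorted category set so rows and columns use the same integer codes
journals = sorted(set(df['journal_1']) | set(df['journal_2']))
rows = pd.Categorical(df['journal_1'], categories=journals).codes
cols = pd.Categorical(df['journal_2'], categories=journals).codes
vals = df['no_combs'].values

n = len(journals)
# Insert each pair in both orientations to make the matrix symmetric
S = coo_matrix((np.concatenate([vals, vals]),
                (np.concatenate([rows, cols]), np.concatenate([cols, rows]))),
               shape=(n, n)).todok()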
I have read data from a file and stored it in a matrix (frag_coords):
frag_coords =
[[ 916.0907976 -91.01391344 120.83596334]
[ 916.01117655 -88.73389753 146.912555 ]
[ 924.22832597 -90.51682575 120.81734705]
...
[ 972.55384732 708.71316138 52.24644577]
[ 972.49089559 710.51583744 72.86369124]]
type(frag_coords) =
class 'numpy.matrixlib.defmatrix.matrix'
I do not have any issues when reordering the matrix by a specified column. For example, the code below works just fine:
order = np.argsort(frag_coords[:,2], axis=0)
My issue is that:
len(frag_coords[0]) = 1
I need to access the individual numbers of the first row. I've tried splitting it and transforming it into a list, but everything seems to return the 3 numbers not as separate columns but as a single element with len = 1. I need help, please!
Your problem is that you're using a matrix instead of an ndarray. Are you sure you want that?
For a matrix, indexing the first row alone leads to another matrix, a row matrix. Check frag_coords[0].shape: it will be (1,3). For an ndarray, it would be (3,).
If you only need to index the first row, use two indices:
frag_coords[0,j]
Or if you store the row temporarily, just index into it as a row matrix:
tmpvar = frag_coords[0] # shape (1,3)
print(tmpvar[0,2]) # for column 2 of row 0
If you don't need too many matrix operations, I'd advise using np.ndarray instead. You can always read your data into an array directly, or at any given point transform an existing matrix with np.array(frag_coords) if you wish.
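A quick illustration of the difference, as a minimal sketch with made-up numbers:
import numpy as np

m = np.matrix([[1., 2., 3.], [4., 5., 6.]])
a = np.array(m)

print(m[0].shape)  # (1, 3): indexing a matrix row yields another (row) matrix
print(a[0].shape)  # (3,):   indexing an ndarray row yields a 1-D array
print(a[0][2])     # 3.0:    plain chained indexing works on the ndarray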
I have a SQL table, which I can read in as a pandas DataFrame, that has the following structure:
user_id value
1 100
1 200
2 100
4 200
It's a representation of a matrix, for which all the values are 1 or 0. The dense representation of this matrix would look like this:
100 200
1 1 1
2 1 0
4 0 1
Normally, to do this conversion you would use pivot, but in my case, with tens or hundreds of millions of rows in the first table, you get a big dense matrix full of zeros which is expensive to drag around. You can convert it to sparse, but getting that far requires a lot of resources.
Right now I'm working on a solution to assign row numbers to each user_id, sorting, and then splitting the 'value' column into SparseSeries before recombining into a SparseDataFrame. Is there a better way?
I arrived at a solution, albeit a slightly imperfect one.
What one can do is manually create a number of pandas SparseSeries from the columns, combine them into a dict, and then cast that dict to a DataFrame (not a SparseDataFrame). Casting to SparseDataFrame currently hits an immature constructor, which deconstructs the whole object into dense form and then back into sparse form regardless of the input. Building the SparseSeries into a conventional DataFrame, however, maintains sparsity while still producing a viable and otherwise complete DataFrame object.
Here's a demonstration of how to do it, written more for clarity than for performance. One difference from my own implementation is that I created the dict of sparse vectors as a dict comprehension instead of a loop (see the sketch after the results below).
import pandas
import numpy

df = pandas.DataFrame({'user_id': [1, 2, 1, 4], 'value': [100, 100, 200, 200]})

# Get unique users and unique features
num_rows = len(df['user_id'].unique())
num_features = len(df['value'].unique())
unique_users = df['user_id'].unique().copy()
unique_features = df['value'].unique().copy()
unique_users.sort()
unique_features.sort()

# Assign each user_id to a row number
user_lookup = pandas.DataFrame({'uid': range(num_rows), 'user_id': unique_users})

vec_dict = {}

# Create a sparse vector for each feature
for i in range(num_features):
    users_with_feature = df[df['value'] == unique_features[i]]['user_id']
    uid_rows = user_lookup[user_lookup['user_id'].isin(users_with_feature)]['uid']
    vec = numpy.zeros(num_rows)
    vec[uid_rows] = 1
    sparse_vec = pandas.Series(vec).to_sparse(fill_value=0)
    vec_dict[unique_features[i]] = sparse_vec

my_pandas_frame = pandas.DataFrame(vec_dict)
my_pandas_frame = my_pandas_frame.set_index(user_lookup['user_id'])
The results:
>>> my_pandas_frame
100 200
user_id
1 1 1
2 1 0
4 0 1
>>> type(my_pandas_frame)
<class 'pandas.core.frame.DataFrame'>
>>> type(my_pandas_frame[100])
<class 'pandas.sparse.series.SparseSeries'>
Complete, but still sparse. There are a few caveats: if you do a simple copy or a not-in-place subset, it will forget itself and try to recast to dense, but for my purposes I'm pretty happy with it.
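For reference, here is the dict-comprehension variant mentioned above, a minimal sketch that assumes the same variables (and the same old to_sparse API) as the example:
vec_dict = {
    feature: pandas.Series(
        numpy.isin(unique_users,
                   df.loc[df['value'] == feature, 'user_id']).astype(float)
    ).to_sparse(fill_value=0)
    for feature in unique_features
}
my_pandas_frame = pandas.DataFrame(vec_dict).set_index(user_lookup['user_id'])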
If I have a large csr_matrix A and want to sum over its columns, simply
A.sum(axis=0)
does this for me, right? Do the axis values correspond to 1 -> rows and 0 -> columns?
I'm stuck when I want to sum over the columns with weights specified in a list, e.g. [1 2 3 4 5 4 3 ... 4 2 5], with the same length as the number of rows in the csr_matrix A. To be more clear, I want the inner product of each column vector with this weight vector. How can I achieve this with Python?
This is a part of my code:
uniFeature = csr_matrix(uniFeature)
[I, J] = uniFeature.shape
sumfreq = uniFeature.sum(axis=0)
sumratings = []
for j in range(J):
    column = uniFeature.getcol(j)
    column = column.toarray()
    sumtemp = np.dot(ratings, column)
    sumratings.append(sumtemp)
sumfreq = sumfreq.toarray()
average = np.true_divide(sumratings, sumfreq)
(NumPy is imported as np.) There is a weight vector ratings; the program is supposed to output the average rating for each column of the matrix uniFeature.
I experimented with dotting column = uniFeature.getcol(j) directly with ratings (which is a list), but I got an error saying the formats do not agree. It works after column.toarray() and then dotting with ratings. But doesn't converting each column back to dense form defeat the point of having a sparse matrix, and wouldn't it be very slow? I ran the above code and it's too slow to show results. I guess there should be a way to dot the vector ratings with each column of the sparse matrix efficiently.
Thanks in advance!
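For what it's worth, the inner product of every column with the weight vector is just a vector-matrix product, which SciPy evaluates without densifying any column. A minimal sketch, assuming ratings is (convertible to) a 1-D NumPy array of length I:
import numpy as np

ratings = np.asarray(ratings)                         # 1-D weight vector of length I
sumratings = uniFeature.T.dot(ratings)                # inner product of every column with ratings
sumfreq = np.asarray(uniFeature.sum(axis=0)).ravel()  # plain column sums as a 1-D array
average = np.true_divide(sumratings, sumfreq)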