I have a SQL table which I can read in as a Pandas data frame, that has the following structure:
user_id  value
1        100
1        200
2        100
4        200
It's a representation of a matrix, for which all the values are 1 or 0. The dense representation of this matrix would look like this:
   100  200
1    1    1
2    1    0
4    0    1
Normally you would use pivot to do this conversion, but with tens or hundreds of millions of rows in the first table you get a big dense matrix full of zeros, which is expensive to drag around. You can convert it to sparse afterwards, but getting that far requires a lot of resources.
Right now I'm working on a solution to assign row numbers to each user_id, sorting, and then splitting the 'value' column into SparseSeries before recombining into a SparseDataFrame. Is there a better way?
I arrived at a solution, albeit a slightly imperfect one.
What one can do is manually create a Pandas SparseSeries for each column, combine them into a dict, and then cast that dict to a DataFrame (not a SparseDataFrame). Casting to SparseDataFrame currently hits an immature constructor, which deconstructs the whole object into dense form and then back into sparse form regardless of the input. Building the SparseSeries into a conventional DataFrame, however, maintains sparsity and yields a viable, otherwise complete DataFrame object.
Here's a demonstration of how to do it, written more for clarity than for performance. One difference from my own implementation is that I built the dict of sparse vectors with a dict comprehension instead of a loop.
import pandas
import numpy

df = pandas.DataFrame({'user_id': [1, 2, 1, 4], 'value': [100, 100, 200, 200]})

# Get unique users and unique features
num_rows = len(df['user_id'].unique())
num_features = len(df['value'].unique())
unique_users = df['user_id'].unique().copy()
unique_features = df['value'].unique().copy()
unique_users.sort()
unique_features.sort()

# Assign each user_id to a row number
user_lookup = pandas.DataFrame({'uid': range(num_rows), 'user_id': unique_users})

vec_dict = {}
# Create a sparse vector for each feature
for i in range(num_features):
    users_with_feature = df[df['value'] == unique_features[i]]['user_id']
    uid_rows = user_lookup[user_lookup['user_id'].isin(users_with_feature)]['uid']
    vec = numpy.zeros(num_rows)
    vec[uid_rows] = 1
    sparse_vec = pandas.Series(vec).to_sparse(fill_value=0)
    vec_dict[unique_features[i]] = sparse_vec

my_pandas_frame = pandas.DataFrame(vec_dict)
my_pandas_frame = my_pandas_frame.set_index(user_lookup['user_id'])
The results:
>>> my_pandas_frame
         100  200
user_id
1          1    1
2          1    0
4          0    1
>>> type(my_pandas_frame)
<class 'pandas.core.frame.DataFrame'>
>>> type(my_pandas_frame[100])
<class 'pandas.sparse.series.SparseSeries'>
Complete, but still sparse. There are a few caveats: if you do a simple copy, or a subset that isn't in-place, it will forget itself and try to recast to dense. But for my purposes I'm pretty happy with it.
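For reference, the dict-comprehension form mentioned above might look roughly like this. It's only a sketch against the same (old) SparseSeries API used in the loop; numpy.isin here is a substitute for the explicit user_lookup bookkeeping, not exactly what I ran.

vec_dict = {
    feature: pandas.Series(
        numpy.isin(unique_users, df.loc[df['value'] == feature, 'user_id']).astype(float)
    ).to_sparse(fill_value=0)
    for feature in unique_features
}
my_pandas_frame = pandas.DataFrame(vec_dict).set_index(pandas.Index(unique_users, name='user_id'))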
Related
I am trying to construct an affiliation matrix for a social network. I have a pd dataframe where column i is the i index of an element and column j is the j index of an element. Column v is the value of weight between two nodes.
I made up the following table for demonstration. I'll just call it df
i    j    v
1    3    0
2    4    2
5    3    0
2    1    2
1    2    0.5
3    1    1
My idea was to first construct a matrix
A_matrix = np.zeros((i_num, j_num))
Then I use the apply function
df.apply(set_to_matrix)
where
def set_to_matrix(row):
    A_matrix[row.i, row.j] = row.v
My question is: is it possible to get better performance?
I have i_num = 100000 and j_num = 1000; with the code above it took me 1 minute 53 sec.
I tried using the swifter package to speed up the apply function, but it turns out to be 2 minutes 23 sec, which is longer.
If possible, also let me know why mine is slower and how other approaches can potentially speed up the process.
There is no need to use apply: you can use the i and j columns to index into A_matrix and assign the values from the v column to the corresponding positions:
A_matrix = np.zeros((i_num, j_num))
A_matrix[df.i, df.j] = df.v
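For reference, a runnable version of that idea using the demo values from the question (i_num and j_num are just picked large enough to hold those indices):

import numpy as np
import pandas as pd

df = pd.DataFrame({'i': [1, 2, 5, 2, 1, 3],
                   'j': [3, 4, 3, 1, 2, 1],
                   'v': [0, 2, 0, 2, 0.5, 1]})
i_num, j_num = 6, 5            # large enough for the demo indices
A_matrix = np.zeros((i_num, j_num))
A_matrix[df.i, df.j] = df.v    # one vectorized fancy-indexing assignment, no Python-level loop

This is faster because NumPy performs the whole assignment in C instead of calling a Python function once per row. One caveat: if the same (i, j) pair appeared more than once, only the last value would be kept.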
Your code is not working for me and I didn't spend time debugging it. The following code will give you the matrix you require pretty quickly. The only issue is that duplicate rows (1 & 2) and columns (1 & 3) will be combined together (and to me that makes sense!).
import numpy as np
import pandas as pd

df = pd.DataFrame({'i': [1, 2, 5, 2, 1, 3],
                   'j': [3, 4, 3, 1, 2, 1],
                   'v': [0, 2, 0, 2, 0.5, 1]})
df1 = pd.pivot_table(df, values='v', index='i', columns='j', aggfunc=np.mean).reset_index().fillna(0)
Final network matrix:
print(df1.to_numpy())
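One caveat with the line above: because of the reset_index(), the i values become an ordinary column of df1 and end up in the matrix as well. If you only want the weights, you could drop that column first (a small addition on my part):

network = df1.drop(columns='i').to_numpy()   # keep only the weight columns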
I had a DataFrame called "segments" that looks like the below:
   ORIGIN_AIRPORT_ID  DEST_AIRPORT_ID  FL_COUNT  ORIGIN_INDEX  DEST_INDEX  OUTDEGREE    WEIGHT
0              10135            10397        77           119         373          3  0.333333
1              10135            11433        85           119        1375          3  0.333333
Using this, I created two Boolean Series objects: One in which I'm storing all the IDs for which the WEIGHT column is not 0 and one in which they are:
Zeroes = (segments['WEIGHT'] == 0).groupby(segments['ORIGIN_INDEX']).all()
Non_zeroes = (segments['WEIGHT'] != 0).groupby(segments['ORIGIN_INDEX']).all()
I want to do two things (because I'm not sure which this task needs):
Create a NumPy vector where all "True" values in the Non_zeroes Series are set to 1/4191 (≈0.024%) and all "True" values in the Zeroes Series are set to 0 (or the same logic using the True and False values of one Series), keeping the IDs (e.g. ORIGIN_INDEX 119 0.024%, etc.)
And I'd also like to create a NumPy vector that is JUST a list of the percentages and zeroes WITHOUT the IDs
EDIT to add extra detail requested!
I tried using a condition as a variable, then using .loc to apply it:
cond_array = copied.WEIGHT is not 0
df.loc[cond_array, ID] = 1/4191
I tried using from_coo(), toarray(), and DataFrame to convert:
pd.Series.sparse.from_coo(P, dense_index=True)
P.toarray()
pd.DataFrame(P)
Finally, I tried applying logic to the DF instead of the COO Matrix. I THINK this gets close, but it is still failing. I believe it fails because it is not including the 0s (copied is just a DF that's a copy of segments):
copied['WEIGHT'] = copied.loc[copied['WEIGHT'] != 0, 'WEIGHT'] = float((1/len(copied))) #0.00023860653
The last code passes the first two tests (testing if it's an array and that it sums to 1.0), but fails the last
assert np.isclose(x0.max(), 1.0/n_actual, atol=10*n*np.finfo(float).eps), "x0` values seem off..."
EDIT 2:
Had the wrong count. It was supposed to be 1/300, not 1/4191. All fixed now, thanks all who took a look :)
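For reference, a minimal sketch of the kind of assignment described above, using np.where on the grouped Boolean Series (the miniature segments frame here is made up, and 1/300 is the corrected count from this edit):

import numpy as np
import pandas as pd

# Made-up miniature version of `segments`, with only the two columns used here
segments = pd.DataFrame({'ORIGIN_INDEX': [119, 119, 373],
                         'WEIGHT': [0.333333, 0.333333, 0.0]})

non_zeroes = (segments['WEIGHT'] != 0).groupby(segments['ORIGIN_INDEX']).all()

# Vector keyed by ORIGIN_INDEX: 1/300 where every weight is non-zero, 0 otherwise
x0_series = pd.Series(np.where(non_zeroes, 1/300, 0.0), index=non_zeroes.index)
x0 = x0_series.to_numpy()   # plain NumPy vector of just the values, without the IDs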
I import binary data from a SQL database into a pandas DataFrame consisting of the columns UserId and ItemId. I am using implicit/binary data, as you can see in the pivot_table below.
Dummy data
frame=pd.DataFrame()
frame['Id']=[2134, 23454, 5654, 68768]
frame['ItemId']=[123, 456, 789, 101]
I know how to create a pivot_table in Pandas using:
print(frame.groupby(['Id', 'ItemId'], sort=False).size().unstack(fill_value=0))
ItemId  123  456  789  101
Id
2134      1    0    0    0
23454     0    1    0    0
5654      0    0    1    0
68768     0    0    0    1
and convert that to a SciPy csr_matrix, but I want to create a sparse matrix right from the get-go without having to convert from a Pandas df. The reason for this is that I get an error: Unstacked DataFrame is too big, causing int32 overflow, because my original data consists of 378,777 rows.
Any help is much appreciated!
I am trying to do the same as the answers to Efficiently create sparse pivot tables in pandas?
But I do not have the frame['count'] data yet.
Using the 4th option to instantiate the matrix:
Id = [2134, 23454, 5654, 68768]
ItemId = [123, 456, 789, 101]
csrm = csr_matrix(([1]*len(Id), (Id,ItemId)))
Result:
<68769x790 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements in Compressed Sparse Row format>
I am assuming that you can somehow read the data values into separate lists in memory, as you did in your example (having lists for Id and ItemId). According to the comments on your post, we also do not expect duplicates. Note that the following will not work if you have duplicates!
The presented solution also produces a (sparse) matrix that is not as compact as the pivot table shown in the example, as we will directly use the Id values as row indices.
To see how to pass them to the constructor, have a look at the SciPy documentation:
csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k].
Meaning we can directly pass the lists as indices to our sparse matrix as follows:
from scipy.sparse import csr_matrix
Id_values = load_values() # gets the list of entries as in the post example
ItemId_values = load_more_values()
sparse_mat = csr_matrix(([1] * len(Id_values),           # entries will be filled with ones
                         (Id_values, ItemId_values)),    # at those positions
                        shape=(max(Id_values) + 1, max(ItemId_values) + 1))  # shape covers the maximum entry of each dimension
Note that this will not give you any sorting or compacting; instead each value is put at its respective Id position, i.e. the first pair is stored at position (2134, 123) instead of (0, 0).
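If you do want the compact layout of the pivot table (one row per distinct Id rather than one row per possible Id value), one option is to map the ids to 0..n-1 positions first, e.g. with np.unique. A minimal sketch of that, reusing the demo lists:

import numpy as np
from scipy.sparse import csr_matrix

Id_values = [2134, 23454, 5654, 68768]
ItemId_values = [123, 456, 789, 101]

# return_inverse gives, for each original value, its position among the sorted unique values
row_labels, row_idx = np.unique(Id_values, return_inverse=True)
col_labels, col_idx = np.unique(ItemId_values, return_inverse=True)

compact = csr_matrix(([1] * len(Id_values), (row_idx, col_idx)),
                     shape=(len(row_labels), len(col_labels)))
# compact is 4x4 here; row_labels / col_labels record which Id / ItemId each row / column stands for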
What is the fastest (and most efficient) way to create a new column in a DataFrame that is a function of other rows in pandas?
Consider the following example:
import pandas as pd
d = {
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
}
pandas_df = pd.DataFrame(d)
Which yields:
id word
0 1 cat
1 2 hat
2 3 hag
3 4 hog
4 5 dog
5 6 elephant
Suppose I want to create a new column bar containing a value that is based on the output of using a function foo to compare the word in the current row to the other rows in the dataframe.
def foo(word1, word2):
    # do some calculation
    return foobar  # in this example, the return type is numeric

threshold = some_threshold

for index, _id, word in pandas_df.itertuples():
    value = sum(
        pandas_df[pandas_df['word'] != word].apply(
            lambda x: foo(x['word'], word),
            axis=1
        ) < threshold
    )
    pandas_df.loc[index, 'bar'] = value
This does produce the correct output, but it uses itertuples() and apply(), which is not performant for large DataFrames.
Is there a way to vectorize (is that the correct term?) this approach? Or is there another better (faster) way to do this?
Notes / Updates:
In the original post, I used edit distance/Levenshtein distance as the foo function. I have changed the question in an attempt to be more generic. The idea is that the function to be applied compares the current row's value against all other rows and returns some aggregate value.
If foo was nltk.metrics.distance.edit_distance and the threshold was set to 2 (as in the original post), this produces the output below:
id word bar
0 1 cat 1.0
1 2 hat 2.0
2 3 hag 2.0
3 4 hog 2.0
4 5 dog 1.0
5 6 elephant 0.0
I have the same question for spark dataframes as well. I thought it made sense to split these into two posts so they are not too broad. However, I have generally found that solutions to similar pandas problems can sometimes be modified to work for spark.
Inspired by this answer to my spark version of this question, I tried to use a cartesian product in pandas. My speed tests indicate that this is slightly faster (though I suspect that may vary with the size of the data). Unfortunately, I still can't get around calling apply().
Example code:
from nltk.metrics.distance import edit_distance as edit_dist

pandas_df2 = pd.DataFrame(d)

i, j = np.where(np.ones((len(pandas_df2), len(pandas_df2))))
cart = pandas_df2.iloc[i].reset_index(drop=True).join(
    pandas_df2.iloc[j].reset_index(drop=True), rsuffix='_r'
)
cart['dist'] = cart.apply(lambda x: edit_dist(x['word'], x['word_r']), axis=1)
pandas_df2 = (
    cart[cart['dist'] < 2].groupby(['id', 'word']).count()['dist'] - 1
).reset_index()
Let's try to analyze the problem for a second:
If you have N rows, then you have N*N "pairs" to consider in your similarity function. In the general case, there is no escape from evaluating all of them (sounds very rational, but I can't prove it). Hence, you have at least O(n^2) time complexity.
What you can try, however, is to play with the constant factors of that time complexity.
The possible options I found are:
1. Parallelization:
Since you have a large DataFrame, parallelizing the processing is the most obvious choice. It will give you an (almost) linear speedup, so with 16 workers you will gain (almost) a 16x improvement.
For example, we can partition the rows of the df into disjoint parts, and process each part individually, then combine the results.
A very basic parallel code might look like this:
from multiprocessing import cpu_count, Pool

def work(part):
    """
    Args:
        part (DataFrame): a part (collection of rows) of the whole DataFrame.
    Returns:
        DataFrame: the same part, with the desired property calculated and added as a new column.
    """
    # Note that we are using the original df (pandas_df) as a global variable,
    # but changes made in this function will not be global (a side effect of using multiprocessing).
    for index, _id, word in part.itertuples():  # iterate over the tuples of this part
        value = sum(
            pandas_df[pandas_df['word'] != word].apply(  # calculate the desired function against the whole original df
                lambda x: foo(x['word'], word),
                axis=1
            ) < threshold
        )
        part.loc[index, 'bar'] = value
    return part

# New code starts here ...
cores = cpu_count()                            # number of CPU cores on your system
data_split = np.array_split(pandas_df, cores)  # split the DataFrame into parts
pool = Pool(cores)                             # create a new process pool
new_parts = pool.map(work, data_split)         # apply `work` to each part; returns a list of the new parts
pool.close()                                   # close the pool
pool.join()
new_df = pd.concat(new_parts)                  # concatenate the new parts
Note: I've tried to keep the code as close to OP's code as possible. This is just a basic demonstration code and a lot of better alternatives exist.
2. "Low level" optimizations:
Another solution is to try to optimize the similarity function computation and iterating/mapping. I don't think this will gain you much speedup compared to the previous option or the next one.
3. Function-dependent pruning:
The last thing you can try are similarity-function-dependent improvements. This doesn't work in the general case, but will work very well if you can analyze the similarity function. For example:
Assuming you are using Levenshtein distance (LD), you can observe that the distance between any two strings is >= the difference between their lengths. i.e. LD(s1,s2) >= abs(len(s1)-len(s2)) .
You can use this observation to prune the possible similar pairs to consider for evaluation. So for each string with length l1, compare it only with strings of length l2 such that abs(l1-l2) <= limit (limit being the maximum accepted dissimilarity, 2 in your provided example).
Another observation is that LD(s1,s2) = LD(s2,s1). That cuts the number of pairs by a factor of 2.
This solution may actually get you down to O(n) time complexity (depends highly on the data).
Why? you may ask.
That's because if we had 10^9 rows, but on average only 10^3 rows have a "close" length to each row, then we need to evaluate the function for about 10^9 * 10^3 / 2 pairs instead of 10^9 * 10^9 pairs. But that (again) depends on the data. This approach will be useless if (in this example) all your strings have length 3.
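As a rough sketch of how the two observations in this section could be combined, here nltk's edit_distance stands in for foo and the toy DataFrame from the question is reused (an illustration, not a tuned implementation):

from itertools import combinations

import pandas as pd
from nltk.metrics.distance import edit_distance   # stands in for `foo`

pandas_df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                          'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']})
threshold = 2

words = pandas_df['word']
lengths = words.str.len()
counts = {idx: 0 for idx in words.index}

for a, b in combinations(words.index, 2):            # each unordered pair once, since LD(s1,s2) == LD(s2,s1)
    if abs(lengths[a] - lengths[b]) >= threshold:    # prune: the edit distance is at least the length difference
        continue
    if edit_distance(words[a], words[b]) < threshold:
        counts[a] += 1
        counts[b] += 1

pandas_df['bar'] = pd.Series(counts)   # reproduces the expected bar column from the question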
Thoughts about preprocessing (groupby)
Because you are looking for an edit distance less than 2, you can first group by the length of the strings. If the difference in length between two groups is greater than or equal to 2, you do not need to compare them. (This part is quite similar to Qusai Alothman's answer in section 3.)
Thus, first thing is to group by the length of the string.
df["length"] = df.word.str.len()
df.groupby("length")["id", "word"]
Then, you compute the edit distance between every two consecutive groups whose difference in length is less than or equal to 2. This does not directly relate to your question, but I hope it is helpful.
Potential vectorization (after groupby)
After that, you may also try to vectorize the computation by splitting each string into characters. Note that if the cost of splitting is greater than the vectorized benefit it brings, you should not do this. Or, when you are creating the data frame, just create one with characters rather than words.
We will use the answer in Pandas split dataframe column for every character to split a string into a list of characters.
# assuming we had grouped the df
df_len_3 = pd.DataFrame({"word": ['cat', 'hat', 'hag', 'hog', 'dog']})
# turn it into chars
splitted = df_len_3.word.apply(lambda x: pd.Series(list(x)))
0 1 2
0 c a t
1 h a t
2 h a g
3 h o g
4 d o g
splitted.loc[0] == splitted # compare one word to all words
0 1 2
0 True True True -> comparing to itself is always all true.
1 False True True
2 False True False
3 False False False
4 False False False
splitted.apply(lambda x: (x == splitted).sum(axis=1).ge(len(x)-1), axis=1).sum(axis=1) - 1
0 1
1 2
2 2
3 2
4 1
dtype: int64
Explanation of splitted.apply(lambda x: (x == splitted).sum(axis=1).ge(len(x)-1), axis=1).sum(axis=1) - 1
For each row, lambda x: (x == splitted) compares that row to the whole df, just like splitted.loc[0] == splitted above. It generates a True/False table.
Then, we sum the table horizontally with the .sum(axis=1) following (x == splitted).
Then, we want to find out which words are similar, so we apply a ge check that tests whether the number of True values reaches a threshold. Here, we only allow a difference of 1, so it is set to len(x)-1.
Finally, we subtract 1 from the whole result because the operation compares each word with itself, and we want to exclude that self-comparison.
Note that this vectorization part only works for within-group similarity checking. You still need to check groups of different lengths with the edit distance approach, I suppose.
I'm trying to rework much of my analysis code for signal processing using Dataframes instead of numpy arrays. However, I'm having a hard time figuring out how to pass the entire matrix of a dataframe to a function as an entire unit.
E.g., if I'm computing the common average reference of a signal, I have something like:
avg = signal.mean(axis=1)
CAR = signal - avg
What I'd like to do is pass a pandas DataFrame to this function and have it return a DataFrame with CAR as the values, without just returning an array and then re-converting it back into a DataFrame.
It sounds like when you use df.apply(), it goes row-wise or column-wise and doesn't pass in the whole matrix. I could alter the code of CAR to make this work, but it seems like that would slow it down quite a bit compared with just letting numpy do it all at once. It probably wouldn't make a big difference for computing the mean, but I foresee this being a problem with other functions in the future that might take longer.
Can anyone point me in the right direction?
EDIT: To clarify, I'm not just doing this to subtract the mean; that was just a simple example. A more realistic example would be linearly filtering the array along axis 0. I'd like to use scipy.signal's filtfilt function to filter my array. This is quite easy if I can just pass it a tpts x feats matrix, but right now it seems that the only way to do it is column-wise using apply.
You can get the raw numpy array version of a DataFrame with df.values. However, in many cases you can just pass the DataFrame itself, since it still allows use of the normal numpy API (i.e., it has all the right methods).
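For the filtfilt case from the edit, a minimal sketch of that pattern (the data and filter coefficients here are made up for illustration):

import numpy as np
import pandas as pd
from scipy import signal as sig

df = pd.DataFrame(np.random.randn(1000, 4), columns=['ch1', 'ch2', 'ch3', 'ch4'])

b, a = sig.butter(4, 0.2)                          # arbitrary low-pass filter for the example
filtered = sig.filtfilt(b, a, df.values, axis=0)   # filter the whole tpts x feats matrix at once

# Wrap the result back into a DataFrame, keeping the original index and column labels
df_filtered = pd.DataFrame(filtered, index=df.index, columns=df.columns)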
See DataFrame.apply (http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html): this will allow you to perform operations on a row (or column, or the entire dataframe).
import random
import pandas as pd

signal = pd.DataFrame([[10 * random.random() for _ in range(3)] for _ in range(5)])

def testm(frame, average=0):
    return frame - average

signal.apply(testm, average=signal.mean(), axis=1)
results:
signal
Out[57]:
0 1 2
0 5.566445 7.612070 8.554966
1 0.869158 2.382429 6.197272
2 5.933192 3.564527 9.805669
3 9.676292 1.707944 2.731479
4 5.319629 3.348337 6.476631
signal.mean()
Out[59]:
0 5.472943
1 3.723062
2 6.753203
dtype: float64
signal.apply(testm,average=signal.mean(),axis=1)
Out[58]:
0 1 2
0 0.093502 3.889008 1.801763
1 -4.603785 -1.340632 -0.555932
2 0.460249 -0.158534 3.052466
3 4.203349 -2.015117 -4.021724
4 -0.153314 -0.374724 -0.276572
This will take the mean of each column, and subtract it from each value in that column.
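Note that for this particular operation you don't need apply at all; pandas broadcasts a Series of column means across the frame:

# Equivalent result without apply: the Series returned by signal.mean() is aligned
# with the columns and subtracted from each of them.
CAR = signal - signal.mean()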