Export Cosine Similarity Array out as a Matrix with Labels - python

Short version: I have an array and need to turn it into a matrix with name labels along the top and side, then export it to CSV like the example below. (Sorry if my wording is incorrect.)
Long version:
I built a recommendation system, self-taught, and have a website ready after a year in quarantine spent learning and troubleshooting here on SO. Usually a few days of searching is enough for me to figure things out, but this one has had me stuck for about 3 weeks.
The recommendation system works in Python: I can put in a name and it returns the recommended names. I tweaked it until the results were acceptable. But none of the books, websites, tutorials, or Udemy classes I used covered how to take the Python code and make it work in a Django site.
This is what the output currently looks like:
# instantiating and generating the count matrix
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])
# creating a Series for the name of the character so they are associated to an ordered numerical
# list I will use later to match the indexes
indices = pd.Series(df.index)
indices[:5]
0 ZZ Top
1 Zyan Malik
2 Zooey Deschanel
3 Ziggy Marley
4 ZHU
Name: name, dtype: object
# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim
array([[1. , 0.11708208, 0.10192614, ..., 0. , 0. ,
0. ],
[0.11708208, 1. , 0.1682581 , ..., 0. , 0. ,
0. ],
[0.10192614, 0.1682581 , 1. , ..., 0. , 0. ,
0. ],
...,
[0. , 0. , 0. , ..., 1. , 1. ,
1. ],
[0. , 0. , 0. , ..., 1. , 1. ,
1. ],
[0. , 0. , 0. , ..., 1. , 1. ,
1. ]])
# I need to then export to csv which I understand
.to_csv('artist_similarities.csv')
Desired Exports
I am trying to get the array labeled with the index names, in what I think is called a matrix, like this example:
   scores           ZZ Top       Zyan Malik   Zooey Deschanel  ZHU
0  ZZ Top           0            65.61249881  24.04163056      24.06241883
1  Zyan Malik       65.61249881  0            89.35882721      69.6634768
2  Zooey Deschanel  24.04163056  89.40917179  0                20.09975124
3  ZHU              7.874007874  69.6634768   20.09975124      0
# function that takes in the character name as input and returns the top 10 recommended characters
def recommendations(name, cosine_sim = cosine_sim):
    recommended_names = []
    # getting the index of the movie that matches the title
    idx = indices[indices == name].index[0]
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    # getting the indexes of the 10 most similar characters
    # (position 0 is skipped because it is the input name itself)
    top_10_indexes = list(score_series.iloc[1:11].index)
    # populating the list with the names of the best 10 matching characters
    for i in top_10_indexes:
        recommended_names.append(list(df.index)[i])
    return recommended_names
# working results which for dataset are pretty good
recommendations('Blues Traveler')
['G-Love & The Special Sauce',
'Phish',
'Spin Doctors',
'Grace Potter and the Nocturnals',
'Jason Mraz',
'Pearl Jam',
'Dave Matthews Band',
'Lukas Nelson & Promise of the Real ',
'Vonda Shepard',
'Goo Goo Dolls']

I'm not sure I understand what you're asking, and I can't comment, so I'm writing an answer instead. I assume you want to attach row and column labels to the cosine_sim array. You could do something like this:
cos_sim_df = pd.DataFrame(cosine_sim, index=indices, columns=indices)
cos_sim_df.to_csv("artist_similarities.csv")
And then read the CSV back like this:
cos_sim_df = pd.read_csv("artist_similarities.csv", header=0, index_col=0)
The header=0 and index_col=0 arguments tell pandas that the first row and first column hold labels rather than data. I assumed your row and column labels are the same; you can change them if you need to. Note that this won't match the desired export exactly, because that CSV has a "scores" field containing the artist names, although it seems the artists should really be the index labels. If you want the exported CSV to look exactly like the desired export, you can add the artists in a "scores" field like this:
cos_sim_df = pd.DataFrame(cosine_sim, columns=indices)
cos_sim_df["scores"] = indices
# make the scores field the first field
cos_sim_df = cos_sim_df[["scores", *indices]]
Lastly, note that cos_sim_df["Zyan Malik"] selects a column, whereas cos_sim_df.loc["Zyan Malik"] selects a row. In this specific case the array is symmetric across the diagonal, so both return the same values, but keep the distinction in mind if your array isn't symmetric.
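To make the round trip concrete, here is a minimal sketch of the labeled export and re-import. The names series and the 4x4 cosine_sim values are made-up stand-ins for the question's indices and similarity matrix, and the CSV is written to an in-memory buffer rather than to artist_similarities.csv so the example has no side effects.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's `indices` and `cosine_sim`.
names = pd.Series(["ZZ Top", "Zyan Malik", "Zooey Deschanel", "ZHU"], name="name")
cosine_sim = np.array([
    [1.0, 0.1, 0.2, 0.0],
    [0.1, 1.0, 0.3, 0.0],
    [0.2, 0.3, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])

# Label both axes with the artist names.
cos_sim_df = pd.DataFrame(cosine_sim, index=names.to_numpy(), columns=names.to_numpy())

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cos_sim_df.to_csv(buffer)
buffer.seek(0)

# Read it back, telling pandas that the first row and first column are labels.
restored = pd.read_csv(buffer, header=0, index_col=0)

# Label-based lookups: a single score, a whole column, and a whole row.
score = restored.loc["ZZ Top", "Zooey Deschanel"]
column = restored["Zyan Malik"]
row = restored.loc["Zyan Malik"]
```

Once the labels are in place, a lookup such as restored.loc[a, b] no longer needs the separate indices series to translate names into positions.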

Related

Using indexing and iteration to reformat values with Numpy Pandas Python

I want to write code that goes through the lists within the vals array one by one, one list for each unique value in digit_vals. Each digit_vals value gives a position in the expected output, so since the first value in digit_vals is 24, all the positions before it are filled with zeroes and the 24th position holds a value from vals. Since there are two 24s in digit_vals, the 2nd element of the first list of vals ([-3.3, -4.3, 23.05, 23.08, 23.88, 3.72]), which is -4.3, goes in the 24th position of the Expected Output. Likewise, there are four 27s, so the 4th element of the 2nd list in vals goes in the 27th position, and so on. The gaps between the digit_vals values are filled with zeroes as well, so between 24 and 27 there are 2 zeroes, for the 25th and 26th positions respectively. How can I write a function that produces the Expected Output below?
import pandas as pd
import numpy as np
digit_vals = np.array([24, 24, 27, 27, 27, 27,
28, 28, 28, 31])
vals = np.array([list([-3.3, -4.3, 23.05, 23.08, 23.88, 3.72]),
list([2.3, 2.05, 3.08, -4.88, 4.72]),
list([5.3, 2.05, 6.08, -13.88, -17.2]),
list([9.05, 6.08, 3.88, -13.72])], dtype=object)
Expected Output:
array([ 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
-4.3, , 0. , 0. , -4.88 ,
6.08, , 0 , 9.05])
First off, if I understand your question correctly, your output array should be one element longer, with one more zero between the 6.08 value and the 9.05 value, because the 9.05 should be at index position 31 (the other values match the index positions specified in digit_vals).
The hardest part of this question is transforming the information in the digit_vals array into two index lists: one that picks the right element from each list in vals, and one that gives the right position in the output array.
Because you're already using numpy, I think this is a reasonable approach:
val_ind = []
out_ind = []
for ind, cnt in enumerate(np.bincount(digit_vals)):
    if cnt > 0:
        val_ind.append(cnt-1)
        out_ind.append(ind)
Calculate the number of occurrences of each value in digit_vals with np.bincount and use that count (minus one for zero indexing) as the index into each list within the vals array. The unique numbers in digit_vals are the indices that have a nonzero count; this assumes the lists in vals appear in the same ascending order as the unique values in digit_vals, as in the question's example.
Once you have the index lists built, it is straightforward to build the output array:
out_arr = np.zeros(np.max(digit_vals)+1)
for r_ind, (v_ind, o_ind) in enumerate(zip(val_ind, out_ind)):
    out_arr[o_ind] = vals[r_ind][v_ind]
Again, the enumeration provides the row index for extracting the correct row's data from the vals array. I've confirmed this reproduces the output array you provided, including the fix noted above. Hopefully I understood your question correctly, and made reasonable assumptions. If so, please update your question with a little more detail describing assumptions, etc.
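For comparison, the two steps above can be collapsed into one loop with np.unique(..., return_counts=True), which returns the sorted unique positions and their counts directly. This is only a sketch of the same idea, not a different algorithm. It keeps the same assumption that the lists in vals are ordered like the sorted unique values of digit_vals, and it stores vals as a plain list of lists for simplicity.

```python
import numpy as np

digit_vals = np.array([24, 24, 27, 27, 27, 27, 28, 28, 28, 31])
vals = [
    [-3.3, -4.3, 23.05, 23.08, 23.88, 3.72],
    [2.3, 2.05, 3.08, -4.88, 4.72],
    [5.3, 2.05, 6.08, -13.88, -17.2],
    [9.05, 6.08, 3.88, -13.72],
]

# Sorted unique output positions and how many times each one occurs.
positions, counts = np.unique(digit_vals, return_counts=True)

# One slot per index from 0 up to the largest position, all zero to start.
out_arr = np.zeros(digit_vals.max() + 1)
for row, pos, cnt in zip(vals, positions, counts):
    # The count (minus one for zero-based indexing) picks the element in the row.
    out_arr[pos] = row[cnt - 1]
```

The result has 32 elements, with the four nonzero values at indices 24, 27, 28 and 31.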

Build a 1x10 dataframe and fill it with a row vector

I have a 1x10 dataframe where the name of each column is a string in a list.
I have a 1x10 row vector with values.
I would like to integrate this vector in the dataframe, so that I have the names in the list as column names, and the values of the vector in a single row.
How could I do that? The only way I found was appending everything into one column and renaming the index with the names in my list, but I want a 1x10 dataframe instead of a 10x1.
names = ['agmi', 'aglo', 'mwmi', 'mcha', 'mcmi', 'mclo', 'bkdr', 'mchi', 'melo', 'mwlo']
alt_GPS = np.array([[ 0. , 0. , 0.85253906, 6.12797546, 5.49960327,
11.00892639, 0. , 2.08251953, 0. , 2.4508667 ]])
alt = pd.DataFrame(columns=names)
Your alt_GPS array is already 2D, so isn't this simply what you want?
alt = pd.DataFrame(alt_GPS, columns=names)
Output:
agmi aglo mwmi mcha mcmi mclo bkdr mchi melo mwlo
0 0.0 0.0 0.852539 6.127975 5.499603 11.008926 0.0 2.08252 0.0 2.450867
If you want to create the dataframe first and add afterwards:
alt = pd.DataFrame(columns=names)
alt.loc[0] = alt_GPS[0]
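If the values arrive as a flat 1-D vector rather than a 2-D array, the constructor would treat them as a single column, so the vector has to be reshaped into one row first. A short sketch, using a shortened, made-up column list and vector:

```python
import numpy as np
import pandas as pd

names = ["agmi", "aglo", "mwmi"]     # shortened, hypothetical column list
flat = np.array([0.0, 0.85, 6.13])   # 1-D vector, shape (3,)

# reshape(1, -1) turns shape (3,) into (1, 3): one row, three columns.
alt = pd.DataFrame(flat.reshape(1, -1), columns=names)
```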

Not able to add two arrays using np.insert()

Hey guys, I want to write a function that applies a z-score transformation to a single column of a 2D array and returns an array where the specified column is transformed and the other columns remain the same. The way I went about this: first I deleted the column I want to transform using np.delete(), then performed the transformation, and finally combined the array without that column and the transformed column using np.insert(). However, all the elements in the transformed column are 0. What can I do?
I have attached an image so you can view the incorrect output as well.
x1 = np.array([[4,3,12],[1,5,20],[1,2,3],[10,20,40],[7,2,44]])
def myfunc(array, scalar):
    total_result = np.delete(array, scalar, axis =1)
    z_score = ((array - array.mean())/array.std())[:,1]
    answer = np.insert(total_result, scalar, z_score, axis=1)
    return answer
myfunc(x1, 1)
Your array is of integer type, and your z-scores are floats. When you insert floats into an integer array, they are truncated to integers, hence all 0. You need to convert your array to float first. Also, deleting and inserting is unnecessary; simply assign the new values to the desired column. Here is how to do it:
def myfunc(array, scalar):
    z_score = ((array - array.mean())/array.std())[:,scalar]
    array[:,scalar] = z_score
    return array
x1 = x1.astype(np.float64, copy=False)
myfunc(x1, 1)
output:
[[ 4. -0.64344154 12. ]
[ 1. -0.49380397 20. ]
[ 1. -0.71826033 3. ]
[10. 0.62847778 40. ]
[ 7. -0.71826033 44. ]]
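Note that the code above, like the code in the question, uses the mean and standard deviation of the whole array, and it modifies the array in place. If a per-column z-score was intended (an assumption, since the question does not say which), a sketch could look like the following; zscore_column is a hypothetical helper name.

```python
import numpy as np


def zscore_column(array, col):
    """Return a float copy of `array` with column `col` standardized."""
    # astype copies by default, so the caller's array is left untouched.
    result = array.astype(np.float64)
    column = result[:, col]
    # Use the mean and standard deviation of this column only.
    result[:, col] = (column - column.mean()) / column.std()
    return result


x1 = np.array([[4, 3, 12], [1, 5, 20], [1, 2, 3], [10, 20, 40], [7, 2, 44]])
z = zscore_column(x1, 1)
```

With this version the transformed column has mean 0 and standard deviation 1, while x1 keeps its integer dtype and original values.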

Python iterate over an array with empty (nan) values

Thank you in advance for your time.
The problem is as follows: I have a matrix where both "0" and "empty fields" are needed in the further calculation.
When the data is converted into a numpy array, the empty fields are automatically replaced with "nan". How can I loop over each row of the array while ignoring the "nan" values in the further calculation?
>>>data
[[ 2. 4. 7.]
[ 7. 0. nan]
[-3. 7. 0.]
[nan nan 6.]]
The idea was to run a set of conditions while iterating over the rows and possibly append to a new numpy array, but for simplicity let's say I just want a new array without the "nan" values, so the final result would look something like:
>>>final_data
[[2,4,7], [7,0], [-3,7,0], [6]]
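One possible way to get that result, offered as a sketch rather than a definitive answer: filter each row with a boolean mask built from np.isnan. Because the rows end up with different lengths, the result is a plain Python list of lists rather than a rectangular numpy array, and the values stay floats.

```python
import numpy as np

data = np.array([
    [2.0, 4.0, 7.0],
    [7.0, 0.0, np.nan],
    [-3.0, 7.0, 0.0],
    [np.nan, np.nan, 6.0],
])

# For each row, keep only the entries that are not NaN; zeros are kept.
final_data = [row[~np.isnan(row)].tolist() for row in data]
```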

numpy, fill sparse matrix with rows from other matrix

I have trouble figuring out what would be the most efficient way to do the following:
import numpy as np
M = 10
K = 10
ind = np.array([0,1,0,1,0,0,0,1,0,0])
full = np.random.rand(sum(ind),K)
output = np.zeros((M,K))
output[1,:] = full[0,:]
output[3,:] = full[1,:]
output[7,:] = full[2,:]
I want to build output, which is a sparse matrix, whose rows are given in a dense matrix (full) and the row indices are specified through a binary vector.
Ideally, I want to avoid a for-loop. Is that possible? If not, I'm looking for the most efficient way to for-loop this.
I need to perform this operation quite a few times. ind and full will keep changing, hence I've just provided some exemplar values for illustration.
I expect ind to be pretty sparse (at most 10% ones), and both M and K to be large numbers (10e2 - 10e3). Ultimately, I might need to perform this operation in pytorch, but some decent procedure for numpy, would already get me quite far.
Please also help me find a more appropriate title or more appropriate categories for this question, if you have any.
Many thanks,
Max
output[ind.astype(bool)] = full
By converting the integer values in ind to boolean values, you can do boolean indexing to select the rows in output that you want to populate with values in full.
Example with a 4x4 array:
M = 4
K = 4
ind = np.array([0,1,0,1])
full = np.random.rand(sum(ind),K)
output = np.zeros((M,K))
output[ind.astype(bool)] = full
print(output)
[[ 0. 0. 0. 0. ]
[ 0.32434109 0.11970721 0.57156261 0.35839647]
[ 0. 0. 0. 0. ]
[ 0.66038644 0.00725318 0.68902177 0.77145089]]
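The boolean mask can also be written as explicit row numbers with np.flatnonzero, which may be handy if the positions are needed elsewhere. The sketch below uses made-up sizes and a seeded random generator, and builds the output both ways so the two can be compared.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 6, 3
ind = np.array([0, 1, 0, 1, 0, 1])
full = rng.random((ind.sum(), K))

# Boolean-mask assignment, as in the answer above.
by_mask = np.zeros((M, K))
by_mask[ind.astype(bool)] = full

# Equivalent integer-index assignment: flatnonzero gives the row numbers [1, 3, 5].
rows = np.flatnonzero(ind)
by_index = np.zeros((M, K))
by_index[rows] = full
```

Both forms fill the selected rows in order, so row i of full lands in the i-th row where ind is 1.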
