#Python iterate over an array with empty (nan) values - python

In advance, thank you for the time.
The problem is as follows: I have a matrix where both "0" and "empty fields" are needed in the further calculation.
When the data is converted into a numpy array, the empty fields are automatically replaced with "nan". How can I loop over each row of the array while ignoring the "nan" values in the further calculation?
>>>data
[[ 2. 4. 7.]
[ 7. 0. nan]
[-3. 7. 0.]
[nan nan 6.]]
The idea was to run a set of conditions while iterating over the rows and possibly append to a new numpy array, but for simplicity let's say I just want a new array without the "nan" values, so the final result would look something like this:
>>>final_data
[[2,4,7], [7,0], [-3,7,0], [6]]
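One possible way to get there, sketched under the assumption that the data is a 2-D float array: build a boolean mask with np.isnan for each row and keep only the entries that are not "nan". Because the rows end up with different lengths, the result is a plain Python list of lists rather than a rectangular numpy array.

```python
import numpy as np

# Same matrix as in the question; np.nan marks the empty fields.
data = np.array([[2, 4, 7],
                 [7, 0, np.nan],
                 [-3, 7, 0],
                 [np.nan, np.nan, 6]], dtype=float)

# For each row, keep only the entries that are not NaN.
# Zeros are ordinary values and are kept.
final_data = [row[~np.isnan(row)].tolist() for row in data]

print(final_data)
```

Any per-row conditions can be applied to row[~np.isnan(row)] inside the same loop before the result is stored.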

Related

Using indexing and iteration to reformat values with Numpy Pandas Python

I want to write code that goes through the lists within the vals array one by one, one list per unique value in digit_vals. Each digit_vals value gives the position in the expected output that receives a value from vals; every other position is filled with zeroes. Since the first value in digit_vals is 24, all positions before it are zero and the 24th position holds a value from vals. Because there are two 24s in digit_vals, the 2nd element of the first list in vals ([-3.3, -4.3, 23.05, 23.08, 23.88, 3.72]), which is -4.3, becomes the 24th value in the Expected Output. Likewise, there are four 27s, so the 4th element of the 2nd list in vals becomes the 27th value, and so on. The gaps between the digit_vals values are filled with zeroes as well, so between 24 and 27 there are 2 zeroes, for the 25th and 26th places respectively. How can I code a function that produces the Expected Output below?
import pandas as pd
import numpy as np
digit_vals = np.array([24, 24, 27, 27, 27, 27,
28, 28, 28, 31])
vals = np.array([list([-3.3, -4.3, 23.05, 23.08, 23.88, 3.72]),
list([2.3, 2.05, 3.08, -4.88, 4.72]),
list([5.3, 2.05, 6.08, -13.88, -17.2]),
list([9.05, 6.08, 3.88, -13.72])], dtype=object)
Expected Output:
array([ 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
-4.3, , 0. , 0. , -4.88 ,
6.08, , 0 , 9.05])
First off, if I understand your question correctly, then your output array should be one element longer, with one more zero between the 6.08 value and the 9.05 value, because the 9.05 should be at index position 31 (the other values match the index positions specified in digit_vals).
The hardest part of this question is transforming the information in the digit_vals array into two index lists: one that indexes into each list within vals, and one that gives the matching position in the output array.
Because you're already using numpy, I think this is a reasonable approach:
val_ind = []
out_ind = []
for ind, cnt in enumerate(np.bincount(digit_vals)):
    if cnt > 0:
        val_ind.append(cnt-1)
        out_ind.append(ind)
This calculates the number of occurrences of each value in digit_vals and uses that count (minus one for zero indexing) as the index into each list within the vals array. Each unique number in digit_vals is identified by capturing the index of each value with a nonzero count, assuming digit_vals is ordered, as in the question's example.
Once you have the index lists built, it is straightforward to build the output array:
out_arr = np.zeros(np.max(digit_vals)+1)
for r_ind, (v_ind, o_ind) in enumerate(zip(val_ind, out_ind)):
    out_arr[o_ind] = vals[r_ind][v_ind]
Again, the enumeration provides the row index for extracting the correct row's data from the vals array. I've confirmed this reproduces the output array you provided, including the fix noted above. Hopefully I understood your question correctly, and made reasonable assumptions. If so, please update your question with a little more detail describing assumptions, etc.
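As a more compact variant of the same idea, here is a sketch that uses np.unique with return_counts=True instead of np.bincount. It assumes, as above, that the lists in vals are given in the same order as the sorted unique values of digit_vals; the names positions, counts and out_arr are just illustrative.

```python
import numpy as np

digit_vals = np.array([24, 24, 27, 27, 27, 27, 28, 28, 28, 31])
# A plain list of lists is enough here; the rows have different lengths.
vals = [[-3.3, -4.3, 23.05, 23.08, 23.88, 3.72],
        [2.3, 2.05, 3.08, -4.88, 4.72],
        [5.3, 2.05, 6.08, -13.88, -17.2],
        [9.05, 6.08, 3.88, -13.72]]

# Sorted unique positions and how often each one occurs.
positions, counts = np.unique(digit_vals, return_counts=True)

# One slot per index up to the largest position, zero everywhere else.
out_arr = np.zeros(digit_vals.max() + 1)

# The count (minus one for zero indexing) picks the element from each list.
out_arr[positions] = [row[cnt - 1] for row, cnt in zip(vals, counts)]

print(out_arr)
```

The result has 32 elements, with the four selected values at indices 24, 27, 28 and 31.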

Export Cosine Similarity Array out as a Matrix with Labels

Short version: I have an array and need to create a matrix from it with name labels along the top and side, then export it like the example csv below. (Sorry if my wording is incorrect.)
Long version:
I built a self-taught recommendation system and have a website ready after a year in quarantine spent learning and troubleshooting here on SO. Usually I figure things out after a few days of searching, but this one has had me stuck for about 3 weeks now.
The recommendation system works in Python: I can put in a name and it spits out the recommended names. I tweaked it and got it to acceptable results. But the books, websites, tutorials, Udemy classes, etc. never covered how to take the Python and make a Django site that works with it.
This is what the output currently looks like:
# instantiating and generating the count matrix
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])
# creating a Series for the name of the character so they are associated to an ordered numerical
# list I will use later to match the indexes
indices = pd.Series(df.index)
indices[:5]
0 ZZ Top
1 Zyan Malik
2 Zooey Deschanel
3 Ziggy Marley
4 ZHU
Name: name, dtype: object
# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim
array([[1. , 0.11708208, 0.10192614, ..., 0. , 0. ,
0. ],
[0.11708208, 1. , 0.1682581 , ..., 0. , 0. ,
0. ],
[0.10192614, 0.1682581 , 1. , ..., 0. , 0. ,
0. ],
...,
[0. , 0. , 0. , ..., 1. , 1. ,
1. ],
[0. , 0. , 0. , ..., 1. , 1. ,
1. ],
[0. , 0. , 0. , ..., 1. , 1. ,
1. ]])
# I need to then export to csv which I understand
.to_csv('artist_similarities.csv')
Desired Exports
I am trying to get the array labelled with the index names, in what I think is called a matrix, like this example:
   scores           ZZ Top       Zyan Malik   Zooey Deschanel  ZHU
0  ZZ Top           0            65.61249881  24.04163056      24.06241883
1  Zyan Malik       65.61249881  0            89.35882721      69.6634768
2  Zooey Deschanel  24.04163056  89.40917179  0                20.09975124
3  ZHU              7.874007874  69.6634768   20.09975124      0
# function that takes in the character name as input and returns the top 10 recommended characters
def recommendations(name, cosine_sim = cosine_sim):
    recommended_names = []
    # getting the index of the movie that matches the title
    idx = indices[indices == name].index[0]
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    # getting the indexes of the 10 most characters
    top_10_indexes = list(score_series.iloc[1:11].index)
    # populating the list with the names of the best 10 matching characters
    for i in top_10_indexes:
        recommended_names.append(list(df.index)[i])
    return recommended_names
# working results which for dataset are pretty good
recommendations('Blues Traveler')
['G-Love & The Special Sauce',
'Phish',
'Spin Doctors',
'Grace Potter and the Nocturnals',
'Jason Mraz',
'Pearl Jam',
'Dave Matthews Band',
'Lukas Nelson & Promise of the Real ',
'Vonda Shepard',
'Goo Goo Dolls']
I'm not sure I understand what you're asking, and I can't comment, so I'm forced to write here. I assume you want to add column and index labels to the cosine_sim array. You could do something like this:
cos_sim_df = pd.DataFrame(cosine_sim, index=indices, columns=indices)
cos_sim_df.to_csv("artist_similarities.csv")
And then read the csv like
cos_sim_df = pd.read_csv("artist_similarities.csv", header=0, index_col=0)
This makes sure pandas knows the first row and first column are field names. I also assumed your column and row indices are the same; you can change them if you need to. One more thing: this won't look exactly like the desired export, because that csv has a "score" field containing the artist names, though it seems like the artists should be field names. If you want the exported csv to look exactly like the desired export, you can add the artists in a "score" field like this:
cos_sim_df = pd.DataFrame(cosine_sim, columns=indices)
cos_sim_df["score"] = indices
# make the score field the first field
cos_sim_df = cos_sim_df[["score", *indices]]
Lastly, I want to note that cos_sim_df["Zyan Malik"] selects a column, not a row (use cos_sim_df.loc["Zyan Malik"] for a row when the names are set as the index). In this specific case your array is symmetric across the diagonal, so it doesn't matter which axis is indexed because both return the same values, but keep this in mind if your array isn't symmetrical.
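To illustrate the round trip on something small, here is a sketch with a made-up 3x3 similarity matrix. The labels in names and the numbers in sim are hypothetical placeholders, and an in-memory io.StringIO buffer stands in for the csv file so the example does not touch the disk.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical labels and a small symmetric similarity matrix.
names = ["Artist A", "Artist B", "Artist C"]
sim = np.array([[1.0, 0.2, 0.0],
                [0.2, 1.0, 0.5],
                [0.0, 0.5, 1.0]])

# Label both axes with the names.
labeled = pd.DataFrame(sim, index=names, columns=names)

# Write the csv into memory instead of a file on disk.
buffer = io.StringIO()
labeled.to_csv(buffer)

# Read it back, telling pandas that the first row and first column are labels.
buffer.seek(0)
restored = pd.read_csv(buffer, header=0, index_col=0)

print(restored)
print(restored.loc["Artist B", "Artist C"])
```

With a real file, replace the buffer with a path such as "artist_similarities.csv" in both calls.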

Not able to add two arrays using np.insert()

Hey guys, I want to write a function that performs a z-score transformation on a single column of a 2d array and returns an array where the specified column is "transformed" and the other columns remain the same. The way I went about this: first I deleted the column that I want to transform using np.delete(), then performed the transformation, and finally combined the array without that column and the transformed column using np.insert(). However, all the elements in the transformed column are 0. What can I do?
I have attached an image so you can view the incorrect output as well.
x1 = np.array([[4,3,12],[1,5,20],[1,2,3],[10,20,40],[7,2,44]])
def myfunc(array, scalar):
    total_result = np.delete(array, scalar, axis =1)
    z_score = ((array - array.mean())/array.std())[:,1]
    answer = np.insert(total_result, scalar, z_score, axis=1)
    return answer
myfunc(x1, 1)
Your array has an integer dtype, and your z-scores are floats. When you insert floats into an integer array, they are cast to integers (the fractional part is dropped), hence all 0. You need to convert your array to float first. Also, deleting and inserting is not the right way to do it; simply assign the new values to the desired column. Here is how to do it:
def myfunc(array, scalar):
    z_score = ((array - array.mean())/array.std())[:,scalar]
    array[:,scalar] = z_score
    return array
x1 = x1.astype(np.float64, copy=False)
myfunc(x1, 1)
output:
[[ 4. -0.64344154 12. ]
[ 1. -0.49380397 20. ]
[ 1. -0.71826033 3. ]
[10. 0.62847778 40. ]
[ 7. -0.71826033 44. ]]
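Note that the code above standardizes with the mean and standard deviation of the whole array, which is what the printed output reflects. If the intent is a z-score based on the selected column's own statistics, a variant could look like the sketch below. The function name zscore_column is hypothetical, and the function works on a float copy so the caller's array is left untouched.

```python
import numpy as np

def zscore_column(array, col):
    """Return a float copy of array with column col standardized
    using that column's own mean and standard deviation."""
    result = array.astype(np.float64)  # astype returns a new array by default
    column = result[:, col]
    result[:, col] = (column - column.mean()) / column.std()
    return result

x1 = np.array([[4, 3, 12], [1, 5, 20], [1, 2, 3], [10, 20, 40], [7, 2, 44]])
transformed = zscore_column(x1, 1)

print(transformed)
```

After this transformation the selected column has mean 0 and standard deviation 1, while the other columns keep their original values.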

How to split a python array into different arrays based on a condition?

I have a python array like this:
array([[0.34201428, 0.46875536, 0.37900415, 0.4906195 ],
[0.58203477, 0.35279346, 0.61418074, 0.37601328],
[0.3388086 , 0.21167754, 0.37330517, 0.2436498 ],
[0.57343255, 0.34535545, 0.62878576, 0.38982747]],dtype=float32)
I want to split the array into different clusters based on the zeroth column. The final output should be like this:
The first array as,
array([[0.34201428, 0.46875536, 0.37900415, 0.4906195 ],
[0.3388086 , 0.21167754, 0.37330517, 0.2436498 ]])
And the second array as,
array([[0.58203477, 0.35279346, 0.61418074, 0.37601328],
[0.57343255, 0.34535545, 0.62878576, 0.38982747]])
Can anyone help me out? Thanks!
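One possible approach is boolean masking on the zeroth column. The question does not state the clustering criterion, so the sketch below assumes a simple cut-off; the threshold value 0.5 is a hypothetical choice that happens to separate the two groups shown above.

```python
import numpy as np

arr = np.array([[0.34201428, 0.46875536, 0.37900415, 0.4906195],
                [0.58203477, 0.35279346, 0.61418074, 0.37601328],
                [0.3388086, 0.21167754, 0.37330517, 0.2436498],
                [0.57343255, 0.34535545, 0.62878576, 0.38982747]],
               dtype=np.float32)

threshold = 0.5  # assumed cut-off, not given in the question

# True for rows whose zeroth column falls below the cut-off.
mask = arr[:, 0] < threshold

low_cluster = arr[mask]    # rows with a small zeroth column
high_cluster = arr[~mask]  # all remaining rows

print(low_cluster)
print(high_cluster)
```

For more than two groups, or when no obvious cut-off exists, a proper clustering step on the zeroth column would be needed to produce the labels that drive the masks.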

Converting some columns of a matrix from float to int

I have a matrix tempsyntheticGroup2 with 6 columns. I want to change the values in columns (0,1,2,3,5) from float to int. This is my code:
tempsyntheticGroup2=tempsyntheticGroup2[:,[0,1,2,3,5]].astype(int)
but it doesn't work properly and I lose the other columns.
I don't think you can have a numpy array with some elements that are ints and some that are floats (there is only one dtype per array). But if you just want to round down to the nearest integer (while keeping all elements as floats), you can do this:
# define dummy example matrix
t = np.random.rand(3,4) + np.arange(12).reshape((3,4))
array([[ 0.68266426, 1.4115732 , 2.3014562 , 3.5173022 ],
[ 4.52399807, 5.35321628, 6.95888015, 7.17438118],
[ 8.97272076, 9.51710983, 10.94962065, 11.00586511]])
# round some columns to lower int
t[:,[0,2]] = np.floor(t[:,[0,2]])
# or
t[:,[0,2]] = t[:,[0,2]].astype(int)
array([[ 0. , 1.4115732 , 2. , 3.5173022 ],
[ 4. , 5.35321628, 6. , 7.17438118],
[ 8. , 9.51710983, 10. , 11.00586511]])
Otherwise you probably need to split your original array into 2 different arrays, one containing the columns that stay floats and the other containing the columns that become ints.
t_int = t[:,[0,2]].astype(int)
array([[ 0, 2],
[ 4, 6],
[ 8, 10]])
t_float = t[:,[1,3]]
array([[ 1.4115732 , 3.5173022 ],
[ 5.35321628, 7.17438118],
[ 9.51710983, 11.00586511]])
Note that you'll have to change your indexing accordingly to access your elements...
I think you are using the wrong syntax to get the column data.
See this question:
How do you extract a column from a multi-dimensional array?
