Using indexing and iteration to reformat values with Numpy Pandas Python


I want to write code that goes through the lists in the vals array one by one, one list for each unique value in digit_vals. Each digit_vals value gives the position (n) of a number in the expected output. Since the first value in digit_vals is 24, every position before the 24th is filled with zeroes and the 24th position holds a value from vals. Because 24 appears twice in digit_vals, the 2nd element of the first list in vals ([-3.3, -4.3, 23.05, 23.08, 23.88, 3.72]) supplies the 24th value of the Expected Output, which is -4.3. Likewise, 27 appears four times, so the 4th element of the 2nd list in vals supplies the 27th value, and so on. The gaps between digit_vals values are also filled with zeroes, so between 24 and 27 there are 2 zeroes, for the 25th and 26th places. How can I write a function that produces the Expected Output below?
import pandas as pd
import numpy as np
digit_vals = np.array([24, 24, 27, 27, 27, 27,
28, 28, 28, 31])
vals = np.array([list([-3.3, -4.3, 23.05, 23.08, 23.88, 3.72]),
list([2.3, 2.05, 3.08, -4.88, 4.72]),
list([5.3, 2.05, 6.08, -13.88, -17.2]),
list([9.05, 6.08, 3.88, -13.72])], dtype=object)
Expected Output:
array([ 0.  ,  0.  ,  0.  ,  0.  ,
        0.  ,  0.  ,  0.  ,  0.  ,
        0.  ,  0.  ,  0.  ,  0.  ,
        0.  ,  0.  ,  0.  ,  0.  ,
        0.  ,  0.  ,  0.  ,  0.  ,
        0.  ,  0.  ,  0.  ,  0.  ,
       -4.3 ,  0.  ,  0.  , -4.88,
        6.08,  0.  ,  9.05])

First off, if I understand your question correctly, your output array should be one element longer, with one more zero between the 6.08 value and the 9.05 value, because the 9.05 should be at index position 31 (the other values match the index positions specified in digit_vals).
The hardest part of this question is transforming the information in the digit_vals array into two index lists: one that picks the correct element from each list in vals, and one that gives the correct position in the output array.
Because you're already using numpy, I think this is a reasonable approach:
val_ind = []
out_ind = []
for ind, cnt in enumerate(np.bincount(digit_vals)):
    if cnt > 0:
        val_ind.append(cnt-1)
        out_ind.append(ind)
Calculate the number of occurrences of each value in digit_vals and use that count (minus one for zero indexing) as the index into each list within the vals array. Each unique number in digit_vals is identified by capturing the index of each value with a nonzero count, assuming digit_vals is ordered, as in the question's example.
Once you have the index lists built, it is straightforward to build the output array:
out_arr = np.zeros(np.max(digit_vals)+1)
for r_ind, (v_ind, o_ind) in enumerate(zip(val_ind, out_ind)):
    out_arr[o_ind] = vals[r_ind][v_ind]
Again, the enumeration provides the row index for extracting the correct row's data from the vals array. I've confirmed this reproduces the output array you provided, including the fix noted above. Hopefully I understood your question correctly, and made reasonable assumptions. If so, please update your question with a little more detail describing assumptions, etc.
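For comparison, here is a minimal sketch of the same idea using np.unique with return_counts=True in place of the bincount loop. It assumes, as the answer does, that digit_vals is sorted and that the n-th unique value corresponds to the n-th list in vals; vals is written as a plain list of lists here only to keep the example short.

```python
import numpy as np

digit_vals = np.array([24, 24, 27, 27, 27, 27, 28, 28, 28, 31])
vals = [
    [-3.3, -4.3, 23.05, 23.08, 23.88, 3.72],
    [2.3, 2.05, 3.08, -4.88, 4.72],
    [5.3, 2.05, 6.08, -13.88, -17.2],
    [9.05, 6.08, 3.88, -13.72],
]

# Unique output positions and how often each one occurs in digit_vals.
positions, counts = np.unique(digit_vals, return_counts=True)

# One slot per index up to the largest position, zero-filled.
out_arr = np.zeros(digit_vals.max() + 1)

# From the n-th list, take the element at (count - 1) and place it
# at the n-th unique position.
out_arr[positions] = [row[c - 1] for row, c in zip(vals, counts)]
```

The result has 32 elements, with -4.3, -4.88, 6.08 and 9.05 at indices 24, 27, 28 and 31 and zeroes everywhere else.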

Related

Export Cosine Similarity Array out as a Matrix with Labels

Short version: I have an array and need to create a matrix with name labels on the top and side, then export it like the example csv. (Sorry if my wording is incorrect.)
Long version:
I built a recommendation system, self-taught, and have a website ready after a year in quarantine spent learning and troubleshooting here on SO. Usually I figure things out after a few days of searching, but this has had me stuck for about 3 weeks now.
The recommendation system works in Python: I can put in a name and it returns the recommended names. I tweaked it and got it to acceptable results. But the books, websites, tutorials, Udemy classes, etc. never covered how to take the Python code and make it work in a Django site.
This is what the output currently looks like:
# instantiating and generating the count matrix
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])

# creating a Series for the name of the character so they are associated to an ordered numerical
# list I will use later to match the indexes
indices = pd.Series(df.index)
indices[:5]
0 ZZ Top
1 Zyan Malik
2 Zooey Deschanel
3 Ziggy Marley
4 ZHU
Name: name, dtype: object
# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim
array([[1. , 0.11708208, 0.10192614, ..., 0. , 0. ,
0. ],
[0.11708208, 1. , 0.1682581 , ..., 0. , 0. ,
0. ],
[0.10192614, 0.1682581 , 1. , ..., 0. , 0. ,
0. ],
...,
[0. , 0. , 0. , ..., 1. , 1. ,
1. ],
[0. , 0. , 0. , ..., 1. , 1. ,
1. ],
[0. , 0. , 0. , ..., 1. , 1. ,
1. ]])
# I need to then export to csv which I understand
.to_csv('artist_similarities.csv')
Desired Exports
I am trying to have the array with the index name in what i think is called a matrix like this example.
   scores           ZZ Top       Zyan Malik   Zooey Deschanel  ZHU
0  ZZ Top           0            65.61249881  24.04163056      24.06241883
1  Zyan Malik       65.61249881  0            89.35882721      69.6634768
2  Zooey Deschanel  24.04163056  89.40917179  0                20.09975124
3  ZHU              7.874007874  69.6634768   20.09975124      0
# function that takes in the character name as input and returns the top 10 recommended characters
def recommendations(name, cosine_sim = cosine_sim):
    recommended_names = []
    # getting the index of the movie that matches the title
    idx = indices[indices == name].index[0]
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    # getting the indexes of the 10 most characters
    top_10_indexes = list(score_series.iloc[1:11].index)
    # populating the list with the names of the best 10 matching characters
    for i in top_10_indexes:
        recommended_names.append(list(df.index)[i])
    return recommended_names
# working results which for dataset are pretty good
recommendations('Blues Traveler')
['G-Love & The Special Sauce',
'Phish',
'Spin Doctors',
'Grace Potter and the Nocturnals',
'Jason Mraz',
'Pearl Jam',
'Dave Matthews Band',
'Lukas Nelson & Promise of the Real ',
'Vonda Shepard',
'Goo Goo Dolls']

I'm not sure I understand what you're asking and I can't comment so I'm forced to write here. I assume you want to add column and index fields to the cosine_sim array. You could do something like this:
cos_sim_df = pd.DataFrame(cosine_sim, index=indices, columns=indices)
cos_sim_df.to_csv("artist_similarities.csv")
And then read the csv like
cos_sim_df = pd.read_csv("artist_similarities.csv", header=0, index_col=0)
The header=0 and index_col=0 arguments make sure pandas knows the first row and first column are field names. I also assumed your column and row labels are the same; you can change them if you need to. Another thing: this won't be exactly like the desired exports, because in that csv there is a "score" field which contains the names of the artists, though it seems like the artists should be field names. If you want the exported csv to look exactly like the desired exports, you can add the artists in a "score" field like this:
cos_sim_df = pd.DataFrame(cosine_sim, columns=indices)
cos_sim_df["score"] = indices
# make the score field the first field
cos_sim_df = cos_sim_df[["score", *indices]]
Lastly, I want to note that cos_sim_df["Zyan Malik"] selects a column, while cos_sim_df.loc["Zyan Malik"] selects a row, and it seems you pictured the names as column labels. In this specific case your array is symmetric across the diagonal, so it doesn't matter which axis is indexed because both return the same values, but keep this in mind if your array isn't symmetric.
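As a small self-contained check of the labelling and the CSV round trip, here is a sketch that uses a made-up 3x3 similarity matrix and an in-memory buffer in place of the real cosine_sim array and the artist_similarities.csv file. The names list and the numbers are placeholders, not values from the question's dataset.

```python
import io

import numpy as np
import pandas as pd

# Placeholder labels and a small symmetric matrix standing in for cosine_sim.
names = ["ZZ Top", "Zyan Malik", "Zooey Deschanel"]
sim = np.array([
    [1.0, 0.117, 0.102],
    [0.117, 1.0, 0.168],
    [0.102, 0.168, 1.0],
])

# Label both axes with the same names.
cos_sim_df = pd.DataFrame(sim, index=names, columns=names)

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
cos_sim_df.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, header=0, index_col=0)

# Column lookup versus row lookup: identical only because sim is symmetric.
by_column = cos_sim_df["Zyan Malik"].to_numpy()
by_row = cos_sim_df.loc["Zyan Malik"].to_numpy()
```

After the round trip, restored carries the names on both axes again, which is what header=0 and index_col=0 are for.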

Not able to add two arrays using np.insert()

Hey guys, I want to write a function that performs a z-score transformation on a single column of a 2d array and returns an array where the specified column is "transformed" and the other columns remain the same. The way I went about this: first I deleted the column that I want to transform using np.delete(), then performed the transformation, and finally combined the array without the deleted column and the transformed column using np.insert(). However, all the elements in the transformed column are 0. What can I do?
I have attached an image so you can view the incorrect output as well.
x1 = np.array([[4,3,12],[1,5,20],[1,2,3],[10,20,40],[7,2,44]])
def myfunc(array, scalar):
    total_result = np.delete(array, scalar, axis =1)
    z_score = ((array - array.mean())/array.std())[:,1]
    answer = np.insert(total_result, scalar, z_score, axis=1)
    return answer
myfunc(x1, 1)

Your array is of type integer, and your z-score is float. When you insert float into an integer array, it converts it to integer, hence all 0. You need to convert your array into float first. Also, deleting/inserting is not the right way to do it, simply assign your new values to your desired column. No need for delete/insert. Here is how to do it:
def myfunc(array, scalar):
    z_score = ((array - array.mean())/array.std())[:,scalar]
    array[:,scalar] = z_score
    return array
x1 = x1.astype(np.float64, copy=False)
myfunc(x1, 1)
output:
[[ 4. -0.64344154 12. ]
[ 1. -0.49380397 20. ]
[ 1. -0.71826033 3. ]
[10. 0.62847778 40. ]
[ 7. -0.71826033 44. ]]
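Note that array.mean() and array.std() in the function above are taken over the whole array, which is what the question's formula does. If the intent is instead to standardize the column against its own mean and standard deviation, a sketch might look like the following. The name zscore_column is hypothetical, and returning a float copy rather than modifying the input is a design choice of this example, not something from the answer.

```python
import numpy as np


def zscore_column(array, col):
    """Return a float copy of `array` with column `col` z-scored.

    Assumption: the z-score uses the mean and std of that column only.
    """
    # Convert to float first so the fractional z-scores are not truncated.
    result = np.asarray(array, dtype=np.float64).copy()
    column = result[:, col]
    result[:, col] = (column - column.mean()) / column.std()
    return result


x1 = np.array([[4, 3, 12], [1, 5, 20], [1, 2, 3], [10, 20, 40], [7, 2, 44]])
transformed = zscore_column(x1, 1)
```

Because the function works on a copy, x1 keeps its integer dtype and original values, and only column 1 of transformed changes.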

Python iterate over an array with empty (nan) values

In advance, thank you for the time.
The problem is as follows: I have a matrix where both "0" and "empty fields" are needed in a later calculation.
When the data is converted into a numpy array, the empty fields are automatically replaced with "nan". How can I loop over each row of the array while ignoring the "nan" values in the further calculation?
>>>data
[[ 2. 4. 7.]
[ 7. 0. nan]
[-3. 7. 0.]
[nan nan 6.]]
The idea was to run a set of conditions while iterating over the rows and possibly append to a new numpy array, but for simplicity let's say I just want to get a new array without the "nan" values, so the final result would look something like:
>>>final_data
[[2,4,7], [7,0], [-3,7,0], [6]]
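One possible sketch, assuming a plain Python list of per-row lists is an acceptable result (the rows end up with different lengths, so they cannot form a regular 2d array): mask each row with np.isnan and keep the remaining values.

```python
import numpy as np

data = np.array([
    [2.0, 4.0, 7.0],
    [7.0, 0.0, np.nan],
    [-3.0, 7.0, 0.0],
    [np.nan, np.nan, 6.0],
])

# For each row, keep only the entries that are not NaN.
# Zeros are real values and are kept.
final_data = [row[~np.isnan(row)].tolist() for row in data]
# final_data == [[2.0, 4.0, 7.0], [7.0, 0.0], [-3.0, 7.0, 0.0], [6.0]]
```

Any per-row conditions can go inside the same loop, working on row[~np.isnan(row)] instead of the full row.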

Sorting zipped list of scalars and numpy arrays: not handling duplicates

I've been using this structure to sort vectors (the arrays) by some property of the vector. This structure (zipping the vectors with scalars and sorting by the scalars) has been working in other parts of my code, but in this case it fails with the error:
The truth value of an array with more than one element is ambiguous. This seems to depend on there being duplicate values in the scalars (see below).
from numpy import array
pnts =[array([ 0. , 0.45402743, -0.64209154]),
array([-0.27803373, 0.45402743, -0.64209154]),
array([-0.64874546, 0.45402743, 0. ]),
array([-0.27803373, 0.45402743, 0.64209154]),
array([ 0. , 0.45402743, 0.64209154]),
array([ 0. , -0.45402743, 0.64209154]),
array([-0.27803373, -0.45402743, 0.64209154]),
array([-0.64874546, -0.45402743, 0. ]),
array([-0.27803373, -0.45402743, -0.64209154]),
array([ 0. , -0.45402743, -0.64209154]),
array([-0.46338972, 0. , 0.64209154]),
array([-0.46338972, 0. , -0.64209154]),
array([-0.83410135, 0. , 0. ])]
ds = [0.64209154071986396, 0.69970301064027385, 0.64874545642786008,
0.69970301064027385, 0.64209154071986396, 0.64209154071986396,
0.69970301064027385, 0.64874545642785986, 0.69970301064027385,
0.64209154071986396, 0.79184062463701899, 0.79184062463701899,
0.83410134835400274]
pnts = [pnt for (d,pnt) in sorted(zip(ds,pnts))] #sort by distances ds
print(pnts)
However if I shorten it to the first 3 points, it does work:
from numpy import array
pnts =[array([ 0. , 0.45402743, -0.64209154]),
array([-0.27803373, 0.45402743, -0.64209154]),
array([-0.64874546, 0.45402743, 0. ])]
ds = [0.64209154071986396, 0.69970301064027385, 0.64874545642786008]
pnts = [pnt for (d,pnt) in sorted(zip(ds,pnts))]
print(pnts)
>[array([ 0. , 0.45402743, -0.64209154]), array([-0.64874546, 0.45402743, 0. ]), array([-0.27803373, 0.45402743, -0.64209154])]
I'm sure the issue is because there are duplicates among the ds. When I go from 3 to 4 points where the first duplicate appears, it fails again. But other sorting routines in python work fine when there are duplicates. Why not this one?

You're not sorting pnts by the ds values. You're sorting the elements of zip(ds, pnts). Those are tuples, which are ordered lexicographically; if you compare (x, y) to (x, z), the comparison will find the first elements equal and move on to comparing y and z. Since the second elements of your tuples are NumPy arrays, which don't have an ordering relation*, the sort fails.
If you want to sort by the ds values, specify a sort key that picks out those values:
sorted(zip(ds, pnts), key=lambda x: x[0])
or
import operator
sorted(zip(ds, pnts), key=operator.itemgetter(0))
*specifically, if you compare two NumPy arrays with an operator like <, instead of telling you if the first array is somehow "less than" the other, it gives you an array of broadcasted elementwise comparison results.
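To see both behaviours side by side, here is a small sketch with four shortened points, two of which share the same distance. The numbers are rounded stand-ins for the question's data.

```python
from operator import itemgetter

import numpy as np

ds = [0.64, 0.70, 0.65, 0.70]  # two equal distances
pnts = [
    np.array([0.0, 0.45, -0.64]),
    np.array([-0.28, 0.45, -0.64]),
    np.array([-0.65, 0.45, 0.0]),
    np.array([-0.28, 0.45, 0.64]),
]

# Without a key, the tie between the two 0.70 entries falls through to
# comparing the arrays themselves, which raises ValueError.
try:
    sorted(zip(ds, pnts))
    tie_break_failed = False
except ValueError:
    tie_break_failed = True

# With a key, only the scalar is compared. sorted() is stable, so tied
# entries keep their original relative order.
pnts_sorted = [pnt for _, pnt in sorted(zip(ds, pnts), key=itemgetter(0))]
```

Here pnts_sorted holds the original points in the order 0, 2, 1, 3.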

Tuples in Python are compared lexicographically. This comparison short circuits, i.e. if the first two elements are different the others are skipped because they can't reverse the order.
This is why you do not see this error when there are no duplicates.
One solution would be using np.argsort:
order = np.argsort(ds)
pnts_sorted = np.array(pnts)[order, :]
This avoids the zipping and returns your sorted points as a 2d array, which for many uses is the more convenient structure. If you still want a list of arrays: list(pnts_sorted) will give you one.
np.argsort performs an indirect sort: instead of moving the elements of its argument, it writes down how they would have to be moved to get them sorted. This "shuffle recipe" (just an array of integers, each indicating which element of the array to be sorted would have to go in that position) can be applied to other arrays if they have the same number of elements along the sort axis. In the code snippet we convert pnts to a 2d array (because order cannot be used to index into lists) and then use order to sort the rows according to ds. (The colon in the index tells numpy to apply the shuffle to entire rows.)
Finally, if I may, a piece of general advice. Unless there are compelling reasons not to, it is typically advisable to keep this sort of data (both ds and pnts) in arrays, not lists. For example, sorting an array will typically be much faster than sorting a list (unless you sort the list using np.sort, but that is only because np.sort returns an array even if you feed it a list).
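A compact sketch of the argsort approach with both ds and pnts kept as arrays, using rounded stand-in values. Passing kind="stable" is an addition of this example so that tied distances keep their original relative order; the answer's snippet uses the default sort kind.

```python
import numpy as np

ds = np.array([0.64, 0.70, 0.65, 0.70])
pnts = np.array([
    [0.0, 0.45, -0.64],
    [-0.28, 0.45, -0.64],
    [-0.65, 0.45, 0.0],
    [-0.28, 0.45, 0.64],
])

# order[i] is the index of the element that belongs at position i.
order = np.argsort(ds, kind="stable")

# The same recipe reorders both arrays consistently.
ds_sorted = ds[order]
pnts_sorted = pnts[order, :]
```

With these values, order is [0, 2, 1, 3].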

Converting some columns of a matrix from float to int

I have a matrix tempsyntheticGroup2 with 6 columns. I want to change the values of columns (0,1,2,3,5) from float to int. This is my code:
tempsyntheticGroup2=tempsyntheticGroup2[:,[0,1,2,3,5]].astype(int)
but it doesn't work properly and I lose the other column.

I don't think you can have a numpy array in which some elements are ints and some are floats (there is only one dtype per array). But if you just want to round down to the nearest integer (while keeping all elements as floats), you can do this:
# define dummy example matrix
t = np.random.rand(3,4) + np.arange(12).reshape((3,4))
array([[ 0.68266426, 1.4115732 , 2.3014562 , 3.5173022 ],
[ 4.52399807, 5.35321628, 6.95888015, 7.17438118],
[ 8.97272076, 9.51710983, 10.94962065, 11.00586511]])
# round some columns to lower int
t[:,[0,2]] = np.floor(t[:,[0,2]])
# or
t[:,[0,2]] = t[:,[0,2]].astype(int)
array([[ 0. , 1.4115732 , 2. , 3.5173022 ],
[ 4. , 5.35321628, 6. , 7.17438118],
[ 8. , 9.51710983, 10. , 11.00586511]])
Otherwise you probably need to split your original array into 2 different arrays, one containing the columns that stay floats, the other containing the columns that become ints.
t_int = t[:,[0,2]].astype(int)
array([[ 0, 2],
[ 4, 6],
[ 8, 10]])
t_float = t[:,[1,3]]
array([[ 1.4115732 , 3.5173022 ],
[ 5.35321628, 7.17438118],
[ 9.51710983, 11.00586511]])
Note that you'll have to change your indexing accordingly to access your elements...
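A short sketch illustrating the single-dtype point, plus one possible alternative if pandas is an option: a DataFrame stores each column with its own dtype, so only some columns can be converted. The use of pandas here is an assumption of this example, not part of the answer, and the matrix is a fixed stand-in for the random one above.

```python
import numpy as np
import pandas as pd

# Fixed example values: 0.5, 1.5, ..., 11.5 in a 3x4 matrix.
t = np.arange(12, dtype=float).reshape(3, 4) + 0.5

# Assigning integers back into a float array truncates the values
# but leaves the array's dtype unchanged.
t_assigned = t.copy()
t_assigned[:, [0, 2]] = t_assigned[:, [0, 2]].astype(int)

# A DataFrame keeps one dtype per column, so columns 0 and 2 can become
# integers while columns 1 and 3 stay floats.
df = pd.DataFrame(t).astype({0: int, 2: int})
```

In t_assigned the converted columns hold whole numbers but are still stored as floats, while in df they are true integer columns.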

I think you are using the wrong syntax to get the column data.
See this question:
How do you extract a column from a multi-dimensional array?
