Matrix is printing wrong dimensions - python

I'm reading in a column named 'OneHot' from a dataframe. Each row of this column has a value of either [1,0] or [0,1]. I am trying to store these values in a variable so I can use them in a neural network.
Problem:
When I read the values into a variable, its shape comes out as (792824, 1) instead of (792824, 2). 792824 is the number of rows in the dataframe. I have tried reshape and that did not work.
Here is the code I have:
input_matrix = np.matrix(df['VectorTweet'].values.tolist())

In [157]:
input_matrix = np.transpose(input_matrix)
x_inputs = input_matrix.shape
print x_inputs
(792824, 1)

In [160]:
output_matrix = np.matrix(df['OneHot'].values.tolist())
y_outputs = np.transpose(output_matrix)
print y_outputs.shape
(792824, 1)
print y_outputs[1]
[['[1, 0]']]
Attached is a snippet of my dataframe.

Looks like each entry in OneHot is a string representation of a list. That's why you're only getting one column in your transpose - you've made a single-element list of a string of a list of integers. You can convert strings of lists to actual lists with ast.literal_eval():
# OneHot as string of list of ints
strOneHot = pd.Series(['[0,1]','[1,0]'])
print(strOneHot.values)
# ['[0,1]' '[1,0]']
import ast
print(strOneHot.apply(ast.literal_eval).values)
# [[0, 1] [1, 0]]
FWIW, you can take the transpose of a Pandas series with .T, if that's useful here:
strOneHot.apply(ast.literal_eval).T
Output:
0 [0, 1]
1 [1, 0]
dtype: object
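Putting it together for the question's setup, a minimal sketch (the dataframe here is a made-up stand-in for the question's df; with the real data the shape should come out as (792824, 2)):

```python
import ast

import numpy as np
import pandas as pd

# stand-in for the question's dataframe: 'OneHot' holds *strings* of lists
df = pd.DataFrame({'OneHot': ['[1, 0]', '[0, 1]', '[1, 0]']})

# parse each string into a real list, then stack into an (N, 2) array
output_matrix = np.array(df['OneHot'].apply(ast.literal_eval).tolist())
print(output_matrix.shape)  # (3, 2)
```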

Related

Iterating over a pandas sparse series, without the missing values

I have a pandas DataFrame with very sparse columns. I would like to iterate over the DataFrame's values but without the missing ones, to save time.
I can't find how to access the indexes of the non-empty cells.
For example:
a = pd.Series([2, 3, 0, 0, 4], dtype='Sparse[int]')
print(a.sparse.sp_values) # --> [2,3,4]
print(a.sparse.sp_index) # --> AttributeError
print(a.sparse.to_coo()) # --> ValueError
I got the non-empty values, but where is the index? In the above example I am looking for [0,1,4].
I looked at the documentation which doesn't seem to mention it. I found information only for SparseArray but not for a Series/DataFrame of sparse type.
Printing dir(a.sparse) (without those starting with '_'):
['density', 'fill_value', 'from_coo', 'npoints', 'sp_values', 'to_coo', 'to_dense']
IIUC, use flatnonzero from numpy:
idx = np.flatnonzero(a).tolist()
print(idx)
#[0, 1, 4]
Or loc from pandas with boolean indexing:
idx = a.ne(0).loc[lambda s: s].index.tolist()  # or list(a[a.ne(0)].index)
print(idx)
#[0, 1, 4]
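Combining that with the sparse accessor from the question, a sketch of iterating only over the stored entries (assuming the fill value is 0, as in the example):

```python
import numpy as np
import pandas as pd

a = pd.Series([2, 3, 0, 0, 4], dtype='Sparse[int]')

# positions of the non-fill entries
idx = np.flatnonzero(a).tolist()
print(idx)  # [0, 1, 4]

# iterate over the stored values together with their index labels,
# skipping the missing ones entirely
for label, value in zip(a.index[idx], a.sparse.sp_values):
    print(label, value)
```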

How to make a sparse matrix in python from a data frame having column names as string

I need to convert a data frame to a sparse matrix. The data frame looks similar to this (the actual data is far bigger: approx. 500,000 rows and 1,000 columns).
I need to convert it into a matrix such that the rows of the matrix are 'id' and the columns are 'names', and it should store only the finite values. No NaNs should be stored (to reduce memory usage). When I tried pd.pivot_table, it took a long time to build the matrix for my big data.
In R there is a method called 'dMcast' for this purpose. I explored but could not find an alternative in Python. I'm new to Python.
First I would convert the categorical names column to indices. Maybe pandas has this functionality already?
names = list('PQRSPSS')
name_ids_map = {n: i for i, n in enumerate(set(names))}
name_ids = [name_ids_map[n] for n in names]
Then I would use scipy.sparse.coo_matrix, and then maybe convert that to another sparse format:
import scipy.sparse
ids = [1, 1, 1, 1, 2, 2, 3]
rating = [2, 4, 1, 4, 2, 2, 1]
sp = scipy.sparse.coo_matrix((rating, (ids, name_ids)))
print(sp)
sp.tocsc()
I am not aware of a sparse matrix library that can index a dimension with categorical data like 'R', 'S' etc.
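pandas does in fact have this functionality: pd.factorize maps a column of labels to dense integer codes. A sketch, assuming the frame has 'id', 'name' and 'rating' columns shaped like the example above:

```python
import pandas as pd
import scipy.sparse

# toy frame shaped like the question's example
df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 3],
                   'name': list('PQRSPSS'),
                   'rating': [2, 4, 1, 4, 2, 2, 1]})

# factorize turns each label column into integer codes plus the label lookup
row, row_labels = pd.factorize(df['id'])
col, col_labels = pd.factorize(df['name'])

# build the sparse matrix directly from the codes; only the given
# entries are stored, everything else is implicit zero
sp = scipy.sparse.coo_matrix((df['rating'], (row, col)))
print(sp.shape)  # (3, 4)
```

Note that coo_matrix sums duplicate (row, col) entries, which may or may not be what you want for repeated (id, name) pairs.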

Get index of elements in first Series within the second series

I want to get the index of each value of the smaller series within the larger series. The expected answer is stored in the ans variable in the code snippet below.
import pandas as pd
smaller = pd.Series(["a","g","b","k"])
larger = pd.Series(["a","b","c","d","e","f","g","h","i","j","k","l","m"])
# ans to be generated by some unknown combination of functions
ans = [0,6,1,10]
print(larger.iloc[ans,])
print(smaller)
assert(smaller.tolist() == larger.iloc[ans,].tolist())
Context: Series larger serves as an index for the columns in a numpy matrix, and series smaller serves as an index for the columns in a numpy vector. I need indexes for the matrix and vector to match.
You can reverse your larger series, then index this with smaller:
larger_rev = pd.Series(larger.index, larger.values)
res = larger_rev[smaller].values
print(res)
[ 0  6  1 10]
for i in list(smaller):
    if i in list(larger):
        print(list(larger).index(i))
This will get you the desired output.
Using Series get
pd.Series(larger.index, larger.values).get(smaller)
Out[8]:
a 0
g 6
b 1
k 10
dtype: int64
try this :)
import pandas as pd
larger = pd.Series(["a","b","c","d","e","f","g","h","i","j","k","l","m"])
smaller = pd.Series(["a","g","b","k"])
res = pd.Series(larger.index, larger.values).reindex(smaller.values, copy=True)
print(res)
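For completeness, pandas also has Index.get_indexer, which does this lookup in one call (it requires the larger values to be unique and returns -1 for values not found); a sketch:

```python
import pandas as pd

smaller = pd.Series(["a", "g", "b", "k"])
larger = pd.Series(["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m"])

# position of each value of smaller within larger; -1 would mark "not found"
ans = pd.Index(larger).get_indexer(smaller)
print(ans.tolist())  # [0, 6, 1, 10]
```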

finding the max of a column in an array

def maxvalues():
    for n in range(1, 15):
        dummy = []
        for k in range(len(MotionsAndMoorings)):
            dummy.append(MotionsAndMoorings[k][n])
        max(dummy)
        L = [x + [max(dummy)]]  ## to be corrected (adding columns with value max(dummy))
        ## suggest code to add new row to L and for next function call, it should save values here.
I have an array of size (k x n) and I need to pick the max values of the first column in that array. Please suggest if there is a simpler way than what I tried. My main aim is to append it to L in columns rather than rows; if I just append, it adds values at the end. I would like this to be done in columns for row 0 in L, because I'll call this function again, add a new row to L and do the same. Please suggest.
General suggestions for your code
First of all, it's not very handy to access globals in a function. It works, but it's not considered good style. So instead of using:
def maxvalues():
    do_something_with(MotionsAndMoorings)
you should do it with an argument:
def maxvalues(array):
    do_something_with(array)

MotionsAndMoorings = something
maxvalues(MotionsAndMoorings)  # pass it to the function.
The next strange thing is that you seem to exclude the first column of your array:
for n in range(1,15):
I think that's unintended. The first element of a list has the index 0, not 1. So I guess you wanted to write:
for n in range(0,15):
or even better, for arbitrary lengths:
for n in range(len(array[0])):  # the length of the first row, i.e. the number of columns
Alternatives to your iterations
But this would not be very intuitive, because the max function already accepts a very useful keyword argument (key), so you don't need to iterate over the whole array yourself:
import operator
column = 2
max(array, key=operator.itemgetter(column))[column]
This returns the row whose i-th element is maximal (you define your wanted column as this element). But max returns the whole row, so you need to extract just the i-th element.
So to get a list of all your maximums for each column you could do:
[max(array, key=operator.itemgetter(column))[column] for column in range(len(array[0]))]
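A self-contained version of the two snippets above (the sample array is made up for illustration):

```python
import operator

array = [[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]

# row whose element in column 2 is maximal, then that element itself
column = 2
print(max(array, key=operator.itemgetter(column))[column])  # 10

# maximum of every column
maxes = [max(array, key=operator.itemgetter(c))[c] for c in range(len(array[0]))]
print(maxes)  # [4, 5, 10]
```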
For your L: I'm not sure what this is, but you should probably also pass it as an argument to the function:
def maxvalues(array, L):  # another argument here
but since I don't know what x and L are supposed to be, I'll not go further into that. It looks like you want to turn the columns of MotionsAndMoorings into rows and the rows into columns. If so, you can do it with:
dummy = [[MotionsAndMoorings[j][i] for j in range(len(MotionsAndMoorings))] for i in range(len(MotionsAndMoorings[0]))]
that's a list comprehension that converts a list like:
[[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]
to an "inverted" column/row list:
[[1, 4, 0, 0], [2, 5, 2, 2], [3, 6, 10, 10]]
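As a side note, the same inversion can be written with zip(*...), the usual Python idiom for transposing a list of rows (sample data taken from the lists above):

```python
array = [[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]

# zip(*array) pairs up the i-th element of every row, i.e. yields the columns
transposed = [list(col) for col in zip(*array)]
print(transposed)  # [[1, 4, 0, 0], [2, 5, 2, 2], [3, 6, 10, 10]]

# per-column maxima then fall out directly
print([max(col) for col in transposed])  # [4, 5, 10]
```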
Alternative packages
But, like roadrunner66 already said, sometimes it's easiest to use a library like numpy or pandas that already has very advanced and fast functions doing exactly what you want, and that is very easy to use.
For example, you can convert a python list to a numpy array simply with:
import numpy as np
Motions_numpy = np.array(MotionsAndMoorings)
you get the maximum of the columns by using:
maximums_columns = np.max(Motions_numpy, axis=0)
you don't even need to convert it to a np.array to use np.max, or to transpose it (turn rows into columns and columns into rows):
transposed = np.transpose(MotionsAndMoorings)
I hope this answer is not too unstructured. Some parts are suggestions for your function and some are alternatives. Pick the parts you need, and if you have any trouble with it, just leave a comment or ask another question. :-)
An example with a random input array, showing that you can take the max along either axis easily with one command.
import numpy as np
aa = np.random.random([4, 3])
print(aa)
print()
print(np.max(aa, axis=0))
print()
print(np.max(aa, axis=1))
Output:
[[ 0.51972266 0.35930957 0.60381998]
[ 0.34577217 0.27908173 0.52146593]
[ 0.12101346 0.52268843 0.41704152]
[ 0.24181773 0.40747905 0.14980534]]
[ 0.51972266 0.52268843 0.60381998]
[ 0.60381998 0.52146593 0.52268843 0.40747905]

Python - Create an array from columns in file

I have a text file with two columns and n rows. Usually I work with two separate vector using x,y=np.loadtxt('data',usecols=(0,1),unpack=True) but I would like to have them as an array of the form array=[[a,1],[b,2],[c,3]...] where all the letters correspond to the x-vector and the numbers to the y-vector so I can ask something like array[0,2]=b. I tried defining
array[0,:]=x but I didn't succeed. Any simple way to do this?
In addition, I want to get the corresponding x-value for a certain y-value. I tried:
x_value = np.argwhere(array[:,1]==3)
and I'm expecting x_value to be c, because it corresponds to 3 in column 1, but it doesn't work either.
I think you simply need to not unpack the array you get back from loadtxt. Do:
arr = np.loadtxt('data', usecols=(0,1))
If your file contained:
0 1
2 3
4 5
arr will be like:
[[0, 1],
[2, 3],
[4, 5]]
Note that to index into this array, you need to specify the row first (and indexes start at 0):
arr[1,0] == 2 # True!
You can find the x values that correspond to a given y value with:
x_vals = arr[:,0][arr[:,1]==y_val]
The indexing will return an array, though x_vals will have only a single value if the y_val was unique. If you know in advance there will be only one match for the y_val, you could tack on [0] to the end of the indexing, so you get the first result.
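A short sketch tying the pieces together (the array mirrors the three-row example file above):

```python
import numpy as np

# array as loadtxt would return it for the example file
arr = np.array([[0, 1],
                [2, 3],
                [4, 5]])

# all x values whose y equals 3
y_val = 3
x_vals = arr[:, 0][arr[:, 1] == y_val]
print(x_vals)     # [2]
print(x_vals[0])  # 2
```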
