I have a question regarding DataFrames and NumPy arrays in Python. When we read a CSV file using pandas, it is stored in a DataFrame. A DataFrame is useful for data manipulation, viewing data by column, etc. However, some preprocessing functions such as Imputer do not work on DataFrames, so for those we have to pull the data out into NumPy arrays, which makes the data manipulation more difficult.
In the following code, y is stored as an int64 array, while X is an ndarray object of the numpy module. I cannot use the append function on X. Can anyone suggest how to correct this?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('titanic.csv')
y = dataset.iloc[:, 1].values
X = dataset.iloc[:, 2:12].values
You really should give us more information about what you want/expect, but here's my guess:
In [6]: Y = np.arange(3)                 # 1d
In [7]: X = np.arange(12).reshape(3, 4)  # 2d
In [8]: np.column_stack([Y, X])
Out[8]:
array([[ 0,  0,  1,  2,  3],
       [ 1,  4,  5,  6,  7],
       [ 2,  8,  9, 10, 11]])
This should be the same as
dataset.iloc[:, [1, 2, 3, ..., 11]].values
or, equivalently,
dataset.iloc[:, 1:12].values
though I wonder why you didn't just do dataset.iloc[:, 1:12] in the first place.
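As for the Imputer issue: a common pattern is to run the preprocessing on the .values array and then wrap the result back into a DataFrame so you keep the column labels. Here is a minimal sketch of that idea (not from your data; it assumes scikit-learn's SimpleImputer and a small made-up frame in place of titanic.csv):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up columns standing in for the titanic data
dataset = pd.DataFrame({'age': [22.0, np.nan, 35.0],
                        'fare': [7.25, 71.3, np.nan]})

imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(dataset.values)        # plain ndarray
X_df = pd.DataFrame(X_imputed, columns=dataset.columns)  # back to a labelled DataFrame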
There are certainly many ways of creating interaction terms in Python, whether by using numpy or pandas directly, or some library like patsy. However, I was looking for a way of creating interaction terms scikit-learn style, i.e. in a form that plays nicely with its fit-transform-predict paradigm. How might I do this?
Let's consider the case of making an interaction term between two variables.
You might make use of the FunctionTransformer class, like so:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
# 5 rows, 2 columns
X = np.arange(10).reshape(5, 2)
# Appends interaction of columns at 0 and 1 indices to original matrix
interaction_append_function = lambda x: np.append(x, (x[:, 0] * x[:, 1])[:, None], 1)
interaction_transformer = FunctionTransformer(func=interaction_append_function)
Let's try it out:
>>> interaction_transformer.fit_transform(X)
array([[ 0,  1,  0],
       [ 2,  3,  6],
       [ 4,  5, 20],
       [ 6,  7, 42],
       [ 8,  9, 72]])
You now have a transformer that will play well with other workflows like sklearn.pipeline or sklearn.compose.
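For example, here is a quick sketch of plugging it into a Pipeline, reusing X and interaction_transformer from above, with a made-up target y and LinearRegression as a stand-in estimator:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

y = np.array([0.0, 1.5, 3.0, 4.5, 6.0])            # made-up target values
model = Pipeline([
    ('interaction', interaction_transformer),       # appends the x0*x1 column
    ('regression', LinearRegression()),
])
model.fit(X, y)
model.predict(X)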
Certainly there are more extensible ways of handling this, but hopefully you get the idea.
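One such alternative, as a sketch (not what the answer above uses): scikit-learn's PolynomialFeatures with interaction_only=True generates every pairwise interaction column for you instead of hand-picking indices:
from sklearn.preprocessing import PolynomialFeatures

interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interactions.fit_transform(X)   # columns: x0, x1, x0*x1 for the 2-column X above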
I'm looking to implement a hardware-efficient multiplication of a list of large matrices (on the order of 200,000 x 200,000). The matrices are very nearly the identity matrix, but with some elements changed to irrational numbers.
In an effort to reduce the memory footprint and make the computation go faster, I want to store the 0s and 1s of the identity as single bytes, like so:
import numpy as np
size = 200000
large_matrix = np.identity(size, dtype=np.uint8)
and then change just a few elements to a different data type:
import sympy as sp
# sympy object
irr1 = sp.sqrt(2)
# float
irr2 = np.e
large_matrix[123456, 100456] = irr1
large_matrix[100456, 123456] = irr2
Is it possible to hold only these elements of the matrix with a different data type, while all the other elements are still bytes? I don't want to have to change everything to a float just because I need one element to be a float.
-----Edit-----
If it's not possible in numpy, then how can I find a solution without numpy?
Maybe you can have a look at SciPy's coordinate-based sparse matrix (coo_matrix). SciPy creates a sparse matrix, which is optimized for such large, mostly empty matrices, and with its coordinate format you can access and modify the data as you intend.
From its documentation:
>>> import numpy as np
>>> from scipy.sparse import coo_matrix
>>> # Constructing a matrix using ijv format
>>> row = np.array([0, 3, 1, 0])
>>> col = np.array([0, 3, 1, 2])
>>> data = np.array([4, 5, 7, 9])
>>> m = coo_matrix((data, (row, col)), shape=(4, 4))
>>> m.toarray()
array([[4, 0, 9, 0],
       [0, 7, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 5]])
It does not store a full matrix but only a set of coordinates with their values, which takes much less space than filling a matrix with zeros.
>>> from sys import getsizeof
>>> getsizeof(m)
56
>>> getsizeof(m.toarray())
176
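Applied to the original problem, a sketch might look like the following. Note that it stores the special entries as plain floats rather than SymPy objects, since sparse matrices are typed as well, and it uses the sizes and indices from the question:
import numpy as np
from scipy.sparse import identity

size = 200000
large_matrix = identity(size, dtype=np.float64, format='lil')  # LIL allows item assignment
large_matrix[123456, 100456] = np.sqrt(2)   # float approximation of sqrt(2)
large_matrix[100456, 123456] = np.e
large_matrix = large_matrix.tocsr()         # CSR is efficient for multiplication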
By definition, NumPy arrays only have one dtype. You can see this in the NumPy documentation:
A numpy array is homogeneous, and contains elements described by a dtype object. A dtype object can be constructed from different combinations of fundamental numeric types.
Further reading: https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
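If you really do need SymPy objects inside a NumPy array, the only way is dtype=object, which stores references to arbitrary Python objects. A small sketch, with the caveat that this gives up exactly the memory and speed benefits you are after:
import numpy as np
import sympy as sp

m = np.identity(4, dtype=object)   # every element is a reference to a Python object
m[1, 2] = sp.sqrt(2)               # a SymPy expression next to plain integers
print(m.dtype)                     # object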
I have a NumPy 2D array and want to convolve it with a 2D kernel. I tried numpy.convolve and the output was:
ValueError: object too deep for desired array
When I try signal.convolve it works well.
So is there any way to fix np.convolve, and will the result of signal.convolve be the same as np.convolve?
Here is my simple code:
import numpy as np
from scipy import signal
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 1], [1, 1]])
A = np.convolve(a, b, 'same')
out: ValueError: object too deep for desired array
B = signal.convolve(a, b, 'same')
Out[53]:
array([[ 1,  3,  5],
       [ 5, 12, 16]])
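For what it's worth (an added note, not part of the original post): np.convolve only handles one-dimensional sequences, which is why it raises on 2D input, while scipy.signal.convolve accepts N-dimensional arrays. A quick sketch, reusing a and b from the snippet above:
print(signal.convolve(a, b, 'same'))              # 2D convolution works
print(np.convolve(a.ravel(), b.ravel(), 'same'))  # np.convolve needs 1D input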
There are two versions of the scipy.stats.ks_2samp function. scipy.stats.ks_2samp is the standard version, and scipy.stats.mstats.ks_2samp is the version in which "Missing values are discarded". Given distributions in which no entries are missing, the results are different. Why? Code:
import numpy as np
from scipy.stats import ks_2samp
from scipy.stats.mstats import ks_2samp as ks_2sampm
a = np.array([1, 3, 6, 8, 8])
b = np.array([2, 3, 4, 6])
ks_2samp(a, b)# statistic=0.40000000000000002, pvalue=0.75428850089034016
ks_2sampm(a, b) #(statistic=0.39999999999999997, pvalue=0.86916357240334474)
Why the different p-values? I'm using scipy v1.0.0.
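One way to see where the gap comes from (an added check, not part of the original post): both functions agree on the statistic D = 0.4, and the reported p-values are consistent with both using the asymptotic Kolmogorov distribution, with the plain stats version applying a small-sample correction to its argument that the mstats version appears to skip:
import numpy as np
from scipy.special import kolmogorov   # survival function of the Kolmogorov distribution

n1, n2, D = 5, 4, 0.4
en = np.sqrt(n1 * n2 / (n1 + n2))
print(kolmogorov((en + 0.12 + 0.11 / en) * D))  # ~0.7543, matches ks_2samp
print(kolmogorov(en * D))                        # ~0.8692, matches mstats.ks_2samp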
Is there a simpler and more memory-efficient way to do the following in NumPy alone?
import numpy as np
ar = np.array(a[l:r])
ar += c
a = a[0:l] + ar.tolist() + a[r:]
It may look primitive, but it involves making a copy of the given subarray, then two more copies of the left and right pieces to stitch back around it, in addition to the scalar add. I was hoping to find a more optimized way of doing this. I would like a solution that stays entirely in Python lists or entirely in NumPy arrays, because converting from one form to the other as shown above causes serious overhead when the data is huge.
You can just do the addition in place on a slice, as follows:
import numpy as np
a = np.array([1, 1, 1, 1, 1])
a[2:4] += 5
>>> a
array([1, 1, 6, 6, 1])
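If the data actually starts out as a plain Python list, the same in-place idea works with a slice assignment and no NumPy round-trip at all. A sketch, with l, r, c filled in to match the example above:
a = [1, 1, 1, 1, 1]
l, r, c = 2, 4, 5
a[l:r] = [x + c for x in a[l:r]]   # add c to the slice in place
print(a)                           # [1, 1, 6, 6, 1]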