Reading data from text file with missing values - python

I want to read data from a file that has many missing values, as in this example:
1,2,3,4,5
6,,,7,8
,,9,10,11
I am using the numpy.loadtxt function:
data = numpy.loadtxt('test.data', delimiter=',')
The problem is that the missing values break loadtxt (I get a "ValueError: could not convert string to float:", no doubt because of the two or more consecutive delimiters).
Is there a way to do this automatically, with loadtxt or another function, or do I have to bite the bullet and parse each line manually?

I'd probably use genfromtxt:
>>> from numpy import genfromtxt
>>> genfromtxt("missing1.dat", delimiter=",")
array([[ 1., 2., 3., 4., 5.],
[ 6., nan, nan, 7., 8.],
[ nan, nan, 9., 10., 11.]])
and then do whatever with the nans (change them to something, use a mask instead, etc.) Some of this could be done inline:
>>> genfromtxt("missing1.dat", delimiter=",", filling_values=99)
array([[ 1., 2., 3., 4., 5.],
[ 6., 99., 99., 7., 8.],
[ 99., 99., 9., 10., 11.]])

This is because the function expects to return a numpy array with all cells of the same type.
If you want a table with mixed strings and number, you should read it into a structured array instead, also you probably want to add skip_header=1 to skip the first line, ie in your case something like:
np.genfromtxt('upeak_names.txt', delimiter="\t", dtype="S10,S10,f4,S10,f4,S10,f4",
names=["id", "name", "Distance", "name2", "Distance2", "name3", "Distance3], skip_header=1)
See also:
Documentation for genfromtxt:
https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.genfromtxt.html
Documentation for structured arrays in numpy:
https://docs.scipy.org/doc/numpy-1.15.0/user/basics.rec.html

Related

Python Median Filter for 1D numpy array

I have a numpy.array with a dimension dim_array. I'm looking forward to obtain a median filter like scipy.signal.medfilt(data, window_len).
This in fact doesn't work with numpy.array may be because the dimension is (dim_array, 1) and not (dim_array, ).
How to obtain such filter?
Next, another question, how can I obtain other filter, i.e., min, max, mean?
Based on this post, we could create sliding windows to get a 2D array of such windows being set as rows in it. These windows would merely be views into the data array, so no memory consumption and thus would be pretty efficient. Then, we would simply use those ufuncs along each row axis=1.
Thus, for example sliding-median` could be computed like so -
np.median(strided_app(data, window_len,1),axis=1)
For the other ufuncs, just use the respective ufunc names there : np.min, np.max & np.mean. Please note this is meant to give a generic solution to use ufunc supported functionality.
For the best performance, one must still look into specific functions that are built for those purposes. For the four requested functions, we have the builtins, like so -
Median : scipy.signal.medfilt.
Max : scipy.ndimage.filters.maximum_filter1d.
Min : scipy.ndimage.filters.minimum_filter1d.
Mean : scipy.ndimage.filters.uniform_filter1d
The fact that applying of a median filter with the window size 1 will not change the array gives us a freedom to apply the median filter row-wise or column-wise.
For example, this code
from scipy.ndimage import median_filter
import numpy as np
arr = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
median_filter(arr, size=3, cval=0, mode='constant')
#with cval=0, mode='constant' we set that input array is extended with zeros
#when window overlaps edges, just for visibility and ease of calculation
outputs an expected filtered with window (3, 3) array
array([[0., 2., 0.],
[2., 5., 3.],
[0., 5., 0.]])
because median_filter automatically extends the size to all dimensions, so the same effect we can get with:
median_filter(arr, size=(3, 3), cval=0, mode='constant')
Now, we can also apply median_filter row-wise with setting 1 to the first element of size
median_filter(arr, size=(1, 3), cval=0, mode='constant')
Output:
array([[1., 2., 2.],
[4., 5., 5.],
[7., 8., 8.]])
And column-wise with the same logic
median_filter(arr, size=(3, 1), cval=0, mode='constant')
Output:
array([[1., 2., 3.],
[4., 5., 6.],
[4., 5., 6.]])

Float Array is not converted to int: Python

I have a float numpy array x, which contains values like, 0, .5, 1, 1.5,etc. I want to convert the float values into integers based on some equation and store them in a new array, newx. I did this,
newx=np.zeros(x.shape[0])
for i in range (x.shape [0]):
newx[i]= ((2*x[i]) +1)
print(newx, v)
However, when printing xnew, I get values like
(array([ 1., 2., 3., 4., 5., 6., 7., 8., 9.,
10., 11., 12., 13., 14., 15., 16., 17., 18.])
newx must be used in some process, and it must be integer, when I want to use it in that process, I get an error stating that it must be of integer or Boolean type. Can anyone please tell me what mistake I've done?
Thank You.
Numpy is specifically designed for array manipulation. Try not to iterate over a numpy array like you did. You can read about how numpy datatypes are a little different than inbuilt datatypes. This leads to much higher run times.
Anyways Here is a working code for your problem
newx=x*2+1
newx=numpy.int16(newx) # as easy as this. ;)

Use numpy to read a two-column csv file

I was curious whether there is a nicer way to do this. I have a csv file with two columns (which is fairly common), for example the 1st column is the time stamp and the second column is the data.
# temp.csv
0., 0.
1., 5.
2., 10.
3., 15.
4., 10.
5., 0.
And then I want to read this temp.csv file:
import numpy as np
my_csv = np.genfromtxt('./temp.csv', delimiter=',')
time = my_csv[:, 0]
data = my_csv[:, 1]
This is entirely fine, but I am just curious whether there is a more elegant way to do this, since this is a fairly common practice.
Thank you!
-Shawn
You can do:
my_csv = np.genfromtxt('./temp.csv', delimiter=',')
time, data = my_csv.transpose()
or the one liner:
time, data = np.genfromtxt('./temp.csv', delimiter=',').transpose()
or another one liner where genfromtxt does the transposition:
time, data = np.genfromtxt('./temp.csv', delimiter=',', unpack=True)
Are they more elegant? That's up to the reader.
You can use the unpack argument to have genfromtxt return the transpose of the array. Then you can do your assignments like this:
In [3]: time, data = genfromtxt('temp.csv', unpack=True, delimiter=',')
In [4]: time
Out[4]: array([ 0., 1., 2., 3., 4., 5.])
In [5]: data
Out[5]: array([ 0., 5., 10., 15., 10., 0.])
Use pandas dataframes if you want nice ways for working with csv type tables.'
http://pandas.pydata.org/pandas-docs/dev/dsintro.html

Recomended way to create a matrix containing strings in python

I need to write a programm that collects different datasets and unites them. For this I have to read in a comma seperated matrix: In this case each row represents an instance (in this case proteins), each column represents an attribute of the instances. If an instance has an attribute, it is represented by a 1, otherwise 0. The matrix looks like the example given below, but much larger, with 35000 instances and hundreds of attributes.
Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1
I need a way to store the matrix before writing into a new file with other information about the instances. I thought of using numpy arrays, since i would like to be able to select and check single columns. I tried to use numpy.empty to create the array of the given size, but it seems that you have to preselect the lengh of the strings and cannot change them afterwards.
Is there a better way to deal with such data? I also thought of dictionarys of lists but then iI cannot select single columns.
You can use numpy.loadtxt, for example:
import numpy as np
a = np.loadtxt(filename, delimiter=',',usecols=(1,2,3,4),
skiprows=1, dtype=float)
Which will result in something like:
#array([[ 1., 1., 1., 0.],
# [ 0., 1., 0., 1.],
# [ 1., 0., 0., 0.],
# [ 1., 1., 1., 0.],
# [ 0., 0., 0., 0.],
# [ 1., 1., 1., 1.]])
Or, using structured arrays (`np.recarray'):
a = np.loadtxt('stack.txt', delimiter=',',usecols=(1,2,3,4),
skiprows=1, dtype=[('Attribute 1', float),
('Attribute 2', float),
('Attribute 3', float),
('Attribute 4', float)])
from where you can get each field like:
a['Attribute 1']
#array([ 1., 0., 1., 1., 0., 1.])
Take a look at pandas.
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
You could use genfromtxt instead:
data = np.genfromtxt('file.txt', dtype=None)
This will create a structured array (aka record array) of your table.

How to operate elementwise on a matrix of type scipy.sparse.csr_matrix?

In numpy if you want to calculate the sinus of each entry of a matrix (elementise) then
a = numpy.arange(0,27,3).reshape(3,3)
numpy.sin(a)
will get the job done! If you want the power let's say to 2 of each entry
a**2
will do it.
But if you have a sparse matrix things seem more difficult. At least I haven't figured a way to do that besides iterating over each entry of a lil_matrix format and operate on it.
I've found this question on SO and tried to adapt this answer but I was not succesful.
The Goal is to calculate elementwise the squareroot (or the power to 1/2) of a scipy.sparse matrix of CSR format.
What would you suggest?
The following trick works for any operation which maps zero to zero, and only for those operations, because it only touches the non-zero elements. I.e., it will work for sin and sqrt but not for cos.
Let X be some CSR matrix...
>>> from scipy.sparse import csr_matrix
>>> X = csr_matrix(np.arange(10).reshape(2, 5), dtype=np.float)
>>> X.A
array([[ 0., 1., 2., 3., 4.],
[ 5., 6., 7., 8., 9.]])
The non-zero elements' values are X.data:
>>> X.data
array([ 1., 2., 3., 4., 5., 6., 7., 8., 9.])
which you can update in-place:
>>> X.data[:] = np.sqrt(X.data)
>>> X.A
array([[ 0. , 1. , 1.41421356, 1.73205081, 2. ],
[ 2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ]])
Update In recent versions of SciPy, you can do things like X.sqrt() where X is a sparse matrix to get a new copy with the square roots of elements in X.

Categories