Use numpy to read a two-column csv file - python

I was curious whether there is a nicer way to do this. I have a csv file with two columns (which is fairly common), for example the 1st column is the time stamp and the second column is the data.
# temp.csv
0., 0.
1., 5.
2., 10.
3., 15.
4., 10.
5., 0.
And then I want to read this temp.csv file:
import numpy as np
my_csv = np.genfromtxt('./temp.csv', delimiter=',')
time = my_csv[:, 0]
data = my_csv[:, 1]
This is entirely fine, but I am just curious whether there is a more elegant way to do this, since this is a fairly common practice.
Thank you!
-Shawn

You can do:
my_csv = np.genfromtxt('./temp.csv', delimiter=',')
time, data = my_csv.transpose()
or the one liner:
time, data = np.genfromtxt('./temp.csv', delimiter=',').transpose()
or another one liner where genfromtxt does the transposition:
time, data = np.genfromtxt('./temp.csv', delimiter=',', unpack=True)
Are they more elegant? That's up to the reader.

You can use the unpack argument to have genfromtxt return the transpose of the array. Then you can do your assignments like this:
In [3]: time, data = genfromtxt('temp.csv', unpack=True, delimiter=',')
In [4]: time
Out[4]: array([ 0., 1., 2., 3., 4., 5.])
In [5]: data
Out[5]: array([ 0., 5., 10., 15., 10., 0.])
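For reference, the unpack approach from the answers above can be tried without a file on disk. This is a minimal sketch, assuming the sample data from the question is held in a string (csv_text is a stand-in for temp.csv, not part of the original post). It also shows that np.loadtxt accepts the same delimiter and unpack arguments.

```python
import io

import numpy as np

# Stand-in for temp.csv so the sketch runs without touching the disk.
csv_text = "0., 0.\n1., 5.\n2., 10.\n3., 15.\n4., 10.\n5., 0.\n"

# unpack=True returns the transposed array, so each column
# can be bound to its own name in a single assignment.
time, data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', unpack=True)

# np.loadtxt takes the same delimiter and unpack arguments.
time2, data2 = np.loadtxt(io.StringIO(csv_text), delimiter=',', unpack=True)

print(time)
print(data)
```

For a clean file with no missing values, either function should do; genfromtxt becomes the better fit once the file has gaps.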

Use pandas DataFrames if you want convenient ways of working with CSV-style tables.
http://pandas.pydata.org/pandas-docs/dev/dsintro.html

Related

How to use numpy.save in append mode

I use numpy.save and numpy.load to read and write large datasets in my project. I realized that numpy.save does not support an append mode. For instance (Python 3):
import numpy as np
n = 5
dim = 5
for _ in range(3):
    Matrix = np.random.choice(np.arange(10, 40, dim), size=(n, dim))
    np.save('myfile', Matrix)
M1 = np.load('myfile.npy', mmap_mode='r')[1:7].copy()
print(M1)
Loading a specific portion of the data with the slice [1:7] does not give the expected result because np.save does not append. I found this answer, but it looks strange (file(filename, 'a'): what is file?). Is there a clever workaround to achieve this without using additional lists?
The npy file format doesn't work that way. An npy file encodes a single array, with a header specifying shape, dtype, and other metadata. You can see the npy file format spec in the NumPy docs.
Support for appending data was not a design goal of the npy format. Even if you managed to get numpy.save to append to an existing file instead of overwriting the contents, the result wouldn't be a valid npy file. Producing a valid npy file with additional data would require rewriting the header, and since this could require resizing the header, it could shift the data and require the whole file to be rewritten.
NumPy comes with no tools to append data to existing npy files, beyond reading the data into memory, building a new array, and writing the new array to a file. If you want to save more data, consider writing a new file, or pick a different file format.
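The read, rebuild, and rewrite route described above might look like the following. This is only a sketch: append_rows is a hypothetical helper name, the file lives in a temporary directory, and the whole array has to fit in memory on every call, so it is not suitable for very large datasets.

```python
import os
import tempfile

import numpy as np


def append_rows(path, new_rows):
    """Hypothetical helper: rewrite the npy file at `path` with `new_rows` added."""
    if os.path.exists(path):
        # Load the existing array fully, then build a new, larger array.
        combined = np.concatenate([np.load(path), new_rows], axis=0)
    else:
        combined = new_rows
    # np.save overwrites the file, producing one valid npy array.
    np.save(path, combined)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'myfile.npy')
    for _ in range(3):
        append_rows(path, np.random.randint(0, 10, size=(5, 5)))
    stored = np.load(path)

print(stored.shape)
```

Because the file always holds a single array, slicing after np.load behaves as the question expected.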
In Python3 repeated save and load to the same open file works:
In [113]: f = open('test.npy', 'wb')
In [114]: np.save(f, np.arange(10))
In [115]: np.save(f, np.zeros(10))
In [116]: np.save(f, np.ones(10))
In [117]: f.close()
In [118]: f = open('test.npy', 'rb')
In [119]: for _ in range(3):
     ...:     print(np.load(f))
     ...:
[0 1 2 3 4 5 6 7 8 9]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
In [120]: np.load(f)
OSError: Failed to interpret file <_io.BufferedReader name='test.npy'> as a pickle
Each save writes a self-contained block of data to the file. That consists of a header block and an image of the data buffer. The header block has information about the length of the data buffer.
Each load reads the defined header block, and the known number of data bytes.
As far as I know this is not documented, but has been demonstrated in previous SO questions. It is also evident from the save and load code.
Note these are separate arrays, both on saving and loading. But we could combine the loaded arrays into one array if the dimensions are compatible.
In [122]: f = open('test.npy', 'rb')
In [123]: np.stack([np.load(f) for _ in range(3)])
Out[123]:
array([[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
In [124]: f.close()
Append multiple numpy files to one big numpy file in python
loading arrays saved using numpy.save in append mode
The built-in file type was removed in Python 3; open replaces it. Though I won't guarantee that it works, the Python 3 equivalent of the code linked in your question would be
with open('myfile.npy', 'ab') as f_handle:
    np.save(f_handle, Matrix)
This should then append Matrix to 'myfile.npy' as a separate block. Note that a plain np.load('myfile.npy') still returns only the first array in the file.
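Putting the two answers together, blocks written in append mode can be read back by calling np.load repeatedly on one open handle. The sketch below is an assumption-laden illustration rather than a documented NumPy feature: load_all is a hypothetical helper, and it stops by comparing the read position with the file size so that it does not depend on which exception np.load raises at end of file.

```python
import os
import tempfile

import numpy as np


def load_all(path):
    """Hypothetical helper: read every array stored back to back in `path`."""
    arrays = []
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        # Each np.load consumes one header plus its data buffer;
        # stop once the read position reaches the end of the file.
        while f.tell() < size:
            arrays.append(np.load(f))
    return arrays


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'myfile.npy')
    for k in range(3):
        # 'ab' adds a new self-contained block after the existing ones.
        with open(path, 'ab') as f_handle:
            np.save(f_handle, np.full((2, 4), k))
    blocks = load_all(path)

# The blocks are separate arrays; join them only if their shapes are compatible.
combined = np.concatenate(blocks, axis=0)
print(combined.shape)
```

A file written this way is not a standard single-array npy file, so other tools may only see the first block.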

logically adding in numpy

I want to perform the following operation, which resembles a histogram:
from numpy import array, zeros
maxIndex = 6
dst = zeros(maxIndex)
a = array([1,2,3,4,7,0,3,4,5,7])
index = array([1,1,1,3,3,4,4,5,5,5])
# a and index have the same length
for i in range(a.size):
    dst[index[i]] = dst[index[i]] + a[i]
How can I do this in a more Pythonic and more efficient way?
If I understand correctly, I think you are looking for numpy.bincount:
dst = numpy.bincount(index, weights=a, minlength=maxIndex)
This gives me array([ 0., 6., 0., 11., 3., 16.]) as the output. If you don't want to calculate maxIndex by hand, you can omit the minlength parameter from the function call and numpy will return an appropriately sized array for you.
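To check that bincount matches the loop from the question, the two can be run side by side. This is a small sketch using the question's data; the variable names (dst_loop, dst_bincount, dst_add_at) are illustrative, and np.add.at is included as an alternative that was not mentioned in the answer.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 7, 0, 3, 4, 5, 7])
index = np.array([1, 1, 1, 3, 3, 4, 4, 5, 5, 5])
max_index = 6

# Reference result: the explicit loop from the question.
dst_loop = np.zeros(max_index)
for i in range(a.size):
    dst_loop[index[i]] += a[i]

# Vectorized: sum the weights that fall into each index bin.
dst_bincount = np.bincount(index, weights=a, minlength=max_index)

# Alternative: unbuffered in-place addition, which handles repeated indices.
dst_add_at = np.zeros(max_index)
np.add.at(dst_add_at, index, a)

print(dst_bincount)
```

Note that plain fancy-indexed addition (dst[index] += a) would not accumulate repeated indices, which is why bincount or np.add.at is needed here.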

Recommended way to create a matrix containing strings in python

I need to write a program that collects different datasets and unites them. For this I have to read in a comma-separated matrix: each row represents an instance (in this case proteins), and each column represents an attribute of the instances. If an instance has an attribute, it is represented by a 1, otherwise by a 0. The matrix looks like the example given below, but much larger, with 35000 instances and hundreds of attributes.
Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1
I need a way to store the matrix before writing it into a new file with other information about the instances. I thought of using numpy arrays, since I would like to be able to select and check single columns. I tried to use numpy.empty to create an array of the given size, but it seems that you have to preselect the length of the strings and cannot change it afterwards.
Is there a better way to deal with such data? I also thought of dictionaries of lists, but then I cannot select single columns.
You can use numpy.loadtxt, for example:
import numpy as np
a = np.loadtxt(filename, delimiter=',', usecols=(1,2,3,4),
               skiprows=1, dtype=float)
Which will result in something like:
#array([[ 1., 1., 1., 0.],
# [ 0., 1., 0., 1.],
# [ 1., 0., 0., 0.],
# [ 1., 1., 1., 0.],
# [ 0., 0., 0., 0.],
# [ 1., 1., 1., 1.]])
Or, using a structured array:
a = np.loadtxt('stack.txt', delimiter=',', usecols=(1,2,3,4),
               skiprows=1, dtype=[('Attribute 1', float),
                                  ('Attribute 2', float),
                                  ('Attribute 3', float),
                                  ('Attribute 4', float)])
from where you can get each field like:
a['Attribute 1']
#array([ 1., 0., 1., 1., 0., 1.])
Take a look at pandas.
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
You could use genfromtxt instead:
data = np.genfromtxt('file.txt', delimiter=',', names=True, dtype=None)
This will create a structured array of your table, with one named field per column.
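As a rough illustration of the genfromtxt suggestion, the sketch below reads the example table from the question out of a string (table is a stand-in for the real file). It assumes a NumPy version that supports the encoding argument, and it relies on genfromtxt's default behaviour of replacing spaces in header names with underscores, so "Attribute 1" becomes the field Attribute_1.

```python
import io

import numpy as np

# Stand-in for the real file: the example table from the question.
table = """Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1
"""

# names=True takes field names from the header row; dtype=None lets
# genfromtxt pick a type per column (strings for names, integers for flags).
data = np.genfromtxt(io.StringIO(table), delimiter=',', names=True,
                     dtype=None, encoding='utf-8')

print(data.dtype.names)
print(data['Attribute_1'])

# Select a single column and use it to filter the instances.
proteins_with_attr_4 = data['Proteins'][data['Attribute_4'] == 1]
print(proteins_with_attr_4)
```

The string length of the name column is chosen automatically from the data, which sidesteps the fixed-length problem with numpy.empty described in the question.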

How do I add a column to a python (matrix) multi-dimensional array? [duplicate]

I've been frustrated as a Matlab user switching over to python because I don't know all the tricks and get stuck hacking together code until it works. Below is an example where I have a matrix that I want to add a dummy column to. Surely, there is a simpler way than the zip vstack zip method below. It works, but it is totally a noob attempt. Please enlighten me. Thank you in advance for taking the time for this tutorial.
# BEGIN CODE
from pylab import *
# Find that unlike most things in python i must build a dummy matrix to
# add stuff in a for loop.
H = ones((4,10-1))
print "shape(H):"
print shape(H)
print H
### enter for loop to populate dummy matrix with interesting data...
# stuff happens in the for loop, which is awesome and off topic.
### exit for loop
# more awesome stuff happens...
# Now I need a new column on H
H = zip(*vstack((zip(*H),ones(4)))) # THIS SEEMS LIKE THE DUMB WAY TO DO THIS...
print "shape(H):"
print shape(H)
print H
# in conclusion. I found a hack job solution to adding a column
# to a numpy matrix, but I'm not happy with it.
# Could someone educate me on the better way to do this?
# END CODE
Use np.column_stack:
In [12]: import numpy as np
In [13]: H = np.ones((4,10-1))
In [14]: x = np.ones(4)
In [15]: np.column_stack((H,x))
Out[15]:
array([[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
In [16]: np.column_stack((H,x)).shape
Out[16]: (4, 10)
There are several functions that let you concatenate arrays in different dimensions:
np.vstack along axis=0
np.hstack along axis=1
np.dstack along axis=2
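In your case, np.hstack looks like what you want, provided the new column is passed as a 2D array of shape (4, 1), since np.hstack will not join a 2D array with a 1D one. np.column_stack treats each 1D array as a column and accepts 2D arrays as they are, which is why it works directly here.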
Of course, nothing prevents you from doing it the hard way:
>>> new = np.empty((a.shape[0], a.shape[1]+1), dtype=a.dtype)
>>> new.T[:a.shape[1]] = a.T
Here, we created an empty array with an extra column, then used the transpose operator T to copy a into the first columns (new.T has one more row than a.T). The extra column is left uninitialized and still has to be filled in.
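The stacking options discussed above can be compared side by side. This is a minimal sketch with the shapes from the question; the variable names are illustrative, and the new column is filled with zeros only so that it is easy to tell apart from H.

```python
import numpy as np

H = np.ones((4, 9))
x = np.zeros(4)  # the new column, as a 1D array

# column_stack turns the 1D array into a column before joining.
via_column_stack = np.column_stack((H, x))

# hstack needs matching dimensions, so reshape x to (4, 1) first.
via_hstack = np.hstack((H, x[:, np.newaxis]))

# concatenate along axis=1 is the general form of the same operation.
via_concatenate = np.concatenate((H, x.reshape(-1, 1)), axis=1)

print(via_column_stack.shape)
```

All three produce a new array; none of them modifies H in place.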

Reading data from text file with missing values

I want to read data from a file that has many missing values, as in this example:
1,2,3,4,5
6,,,7,8
,,9,10,11
I am using the numpy.loadtxt function:
data = numpy.loadtxt('test.data', delimiter=',')
The problem is that the missing values break loadtxt (I get a "ValueError: could not convert string to float:", no doubt because of the two or more consecutive delimiters).
Is there a way to do this automatically, with loadtxt or another function, or do I have to bite the bullet and parse each line manually?
I'd probably use genfromtxt:
>>> from numpy import genfromtxt
>>> genfromtxt("missing1.dat", delimiter=",")
array([[ 1., 2., 3., 4., 5.],
[ 6., nan, nan, 7., 8.],
[ nan, nan, 9., 10., 11.]])
and then do whatever with the nans (change them to something, use a mask instead, etc.) Some of this could be done inline:
>>> genfromtxt("missing1.dat", delimiter=",", filling_values=99)
array([[ 1., 2., 3., 4., 5.],
[ 6., 99., 99., 7., 8.],
[ 99., 99., 9., 10., 11.]])
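As one way of "doing whatever with the nans", the sketch below reads the sample data from a string (text stands in for the data file) and then works around the gaps with a boolean mask, NaN-aware statistics, and a masked array. The choice of follow-up operations is an illustration, not something prescribed by the answer.

```python
import io

import numpy as np

# Stand-in for the data file from the question.
text = "1,2,3,4,5\n6,,,7,8\n,,9,10,11\n"

# Empty fields become nan by default.
data = np.genfromtxt(io.StringIO(text), delimiter=',')

# Boolean mask marking where values were missing.
missing = np.isnan(data)

# NaN-aware statistics skip the missing entries.
column_means = np.nanmean(data, axis=0)

# A masked array hides the missing entries from ordinary operations.
masked = np.ma.masked_invalid(data)

print(missing.sum())
print(column_means)
```

Passing usemask=True to genfromtxt is another route to a masked array without the intermediate nans.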
This is because the function expects to return a numpy array with all cells of the same type.
If you want a table with mixed strings and numbers, read it into a structured array instead. You probably also want to add skip_header=1 to skip the first line, i.e. in your case something like:
np.genfromtxt('upeak_names.txt', delimiter="\t", dtype="S10,S10,f4,S10,f4,S10,f4",
              names=["id", "name", "Distance", "name2", "Distance2", "name3", "Distance3"], skip_header=1)
See also:
Documentation for genfromtxt:
https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.genfromtxt.html
Documentation for structured arrays in numpy:
https://docs.scipy.org/doc/numpy-1.15.0/user/basics.rec.html
