numpy recfromcsv and genfromtxt skips first row of data file - python

numpy's recfromcsv skips the first line of my data. (Same thing for genfromtxt)
import numpy as np

filename = 'data.csv'
with open(filename, mode='w') as writer:
    writer.write('0,1.1,1.2\n1,2.1,2.2\n2,3.1,3.2')

data = np.recfromcsv(filename)
print(data)
Is this a bug, or how can I load the data without losing the first line?

By default, the first line of a CSV file is expected to contain the field names.
recfromcsv invokes genfromtxt with names=True by default, which means the first line of the data is read as the header.
Documentation:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
So you should write a header line before the data:
import numpy as np

filename = 'data.csv'
with open(filename, mode='w') as writer:
    writer.write('first column,second column,third column\n')
    writer.write('0,1.1,1.2\n1,2.1,2.2\n2,3.1,3.2')

data = np.recfromcsv(filename)
print(data)
Or use recfromtxt instead of recfromcsv.
Or override the default names (they must be unique):
recfromcsv(filename, names=['a', 'b', 'c'])

Note that adding skip_header=0 will not keep recfromcsv from skipping the first row: that row is consumed because names=True is the default, not because any rows are skipped. Pass an explicit names list (as above), or use genfromtxt, where names defaults to None.
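A quick check against the data.csv written above (a minimal sketch):
import numpy as np

# without names=True, genfromtxt keeps the first line as data
data = np.genfromtxt('data.csv', delimiter=',')
print(data)  # all three rows come back, shape (3, 3)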

The default behavior of recfromcsv is to read a header row, which is why it's skipping the first row. It works for me with genfromtxt (if I pass delimiter=','). Can you provide output showing how genfromtxt fails?
Unfortunately it seems there is a bug in Numpy that won't let you specify the dtype in recfromcsv (see https://github.com/numpy/numpy/issues/311), so I can't see how to read it in with specified column names, which I think is what you need to do to avoid reading the header line. But you can read the data in with genfromtxt.
Edit: It looks like you can read it in just by passing in a list of names:
np.recfromcsv(filename, delimiter=',', names=['a', 'b', 'c'])
(The reason it wasn't working for me is I had done from __future__ import unicode_literals and it apparently doesn't like unicode in dtypes.)


Trying to size down HDF5 File by changing index field types using h5py

I have a very large CSV file (~12 GB) that looks something like this:
posX,posY,posZ,eventID,parentID,clockTime
-117.9853515625,60.2998046875,0.29499998688697815,0,0,0
-117.9853515625,60.32909393310547,0.29499998688697815,0,0,0
-117.9560546875,60.2998046875,0.29499998688697815,0,0,0
-117.9560546875,60.32909393310547,0.29499998688697815,0,0,0
-117.92676544189453,60.2998046875,0.29499998688697815,0,0,0
-117.92676544189453,60.32909393310547,0.29499998688697815,0,0,0
-118.04051208496094,60.34012985229492,4.474999904632568,0,0,0
-118.04051208496094,60.36941909790039,4.474999904632568,0,0,0
-118.04051208496094,60.39870834350586,4.474999904632568,0,0,0
I want to convert this CSV file into the HDF5 format using the library h5py, while also lowering the total file size by setting the field / column types, e.g. saying:
Save posX, posY and posZ as float32. Save eventID, parentID and clockTime as int32 or something along those lines.
Note: I need to chunk the data in some form when I read it in to avoid Memory Errors.
However, I am unable to get the desired result. What I have tried so far:
Using pandas' own methods, following this guide: How to write a large csv file to hdf5 in python?
This creates the file, but I'm somehow unable to change the types, and the file remains too big (~10.7 GB). The field types are float64 and int64.
I also tried to split the CSV up into parts (using split -n x myfile.csv) before working with the increments. I ran into some data errors at the beginning and end of each file, which I was able to fix by removing said lines using sed. Then I tried out the following code:
import pandas as pd
import h5py

PATH_csv = "/home/MYNAME/Documents/Workfolder/xaa"  # xaa is my csv increment
DATA_csv = pd.read_csv(PATH_csv)

with h5py.File("pct_data-hdf5.h5", "a") as DATA_hdf:
    dset = DATA_hdf.create_dataset("posX", data=DATA_csv["posX"], dtype="float32")
Sadly this created the file and the table but didn't write any data into it.
Expectation
Creating an HDF5 file containing the data of a large CSV file while also changing the variable type of each column.
If something is unclear please ask me for clarification. I'm still a beginner!
Have you considered the numpy module?
It has a handy function (genfromtxt) to read CSV data with headers into a NumPy array. You define the dtype. The array is suitable for loading into HDF5 with h5py's create_dataset() function.
See code below. I included 2 print statements. The first shows the dtype names created from the CSV headers. The second shows how you can access the data in the numpy array by field (column) name.
import h5py
import numpy as np

PATH_csv = 'SO_55576601.csv'
# one format per column: three floats, three 32-bit ints
csv_dtype = ('f8, f8, f8, i4, i4, i4')
csv_data = np.genfromtxt(PATH_csv, dtype=csv_dtype, delimiter=',', names=True)

print(csv_data.dtype.names)
print(csv_data['posX'])

# the with block closes the file automatically
with h5py.File('SO_55576601.h5', 'w') as h5f:
    dset = h5f.create_dataset('CSV_data', data=csv_data)
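If the 12 GB file is too big to read with a single genfromtxt call, the same dataset can be grown chunk by chunk. Here is a minimal sketch of one way to do that, combining pandas chunked reading with a resizable h5py dataset (the input file name and chunk size are placeholders; the per-column copy is also where the float64/int64 values get downcast to the float32/int32 the question asks for):
import h5py
import numpy as np
import pandas as pd

csv_dtype = [('posX', 'f4'), ('posY', 'f4'), ('posZ', 'f4'),
             ('eventID', 'i4'), ('parentID', 'i4'), ('clockTime', 'i4')]

with h5py.File('pct_data-hdf5.h5', 'w') as h5f:
    # start empty but resizable, so rows can be appended chunk by chunk
    dset = h5f.create_dataset('CSV_data', shape=(0,),
                              maxshape=(None,), dtype=csv_dtype)
    for chunk in pd.read_csv('myfile.csv', chunksize=1000000):
        # copying column by column into a structured array performs
        # the float64 -> float32 and int64 -> int32 downcasts
        arr = np.zeros(len(chunk), dtype=csv_dtype)
        for name in chunk.columns:
            arr[name] = chunk[name].to_numpy()
        # grow the dataset and write the new rows at the end
        dset.resize(dset.shape[0] + len(arr), axis=0)
        dset[-len(arr):] = arr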

How to fetch input from the csv file in python

I have a csv (input.csv) file as shown below:
VM IP Naa_Dev Datastore
vm1 xx.xx.xx.x1 naa.ab1234 ds1
vm2 xx.xx.xx.x2 naa.ac1234 ds1
vm3 xx.xx.xx.x3 naa.ad1234 ds2
I want to use this csv file as an input file for my python script. In this file the first line, i.e. (VM IP Naa_Dev Datastore), is the column heading, and each value is separated by a space.
So my question is: how can I use this csv file for input values in Python, so that if I look up the IP of vm1 the script picks up xx.xx.xx.x1, and likewise, if I look for the VM whose Naa_Dev is naa.ac1234, it returns vm2?
I am using Python version 2.7.8
Any help is much appreciated.
Thanks
For tabular data like this, the best way is to use pandas.
Something like:
import pandas

# the file is space-delimited, so set the separator explicitly
dataframe = pandas.read_csv('csv_file.csv', sep=' ')
# finding IP by VM
print(dataframe[dataframe.VM == 'vm1'].IP)
# OUTPUT: xx.xx.xx.x1
# or find VM by Naa_Dev
print(dataframe[dataframe.Naa_Dev == 'naa.ac1234'].VM)
# OUTPUT: vm2
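Note that these lookups return a pandas Series rather than a plain string; to get the value itself, take the first match:
# pull the scalar out of the filtered Series
print(dataframe[dataframe.VM == 'vm1'].IP.iloc[0])  # xx.xx.xx.x1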
For importing csv into python you can use pandas, in your case the code would look like:
import pandas as pd
df = pd.read_csv('input.csv', sep=' ')
and for locating certain rows in the created dataframe you have multiple options (which you can easily find in the pandas docs or just by googling 'filter data python'), for example:
df['VM'].where(df['Naa_Dev'] == 'naa.ac1234')
Use the pandas module to read the file into a DataFrame. There are a lot of parameters for reading csv files with pandas.read_csv. The dataframe.to_string() function is extremely useful.
Solution:
# import module with alias 'pd'
import pandas as pd

# Open the CSV file; the delimiter is a space, and we specify
# the column names ourselves (header=0 tells pandas to skip the
# file's own header row instead of reading it as data).
dframe = pd.read_csv("file.csv",
                     delimiter=" ",
                     header=0,
                     names=["VM", "IP", "Naa_Dev", "Datastore"])

# print will output the table
print(dframe)

# to_string will allow you to align and adjust content,
# e.g. justify="left" to align columns to the left.
print(dframe.to_string(justify="left"))
Pandas is probably the best answer, but you can also:
import csv

your_list = []
with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile, delimiter=' ')
    for row in reader:
        your_list.append(row)

print(your_list)
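To answer the lookup part of the question with this approach, a small sketch built on the your_list from above:
# find the IP of vm1
print(next(row['IP'] for row in your_list if row['VM'] == 'vm1'))  # xx.xx.xx.x1
# find the VM whose Naa_Dev is naa.ac1234
print(next(row['VM'] for row in your_list if row['Naa_Dev'] == 'naa.ac1234'))  # vm2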

Pyfits or astropy.io.fits add row to binary table in fits file

How can I add a single row to a binary table inside a large FITS file using pyfits, astropy.io.fits, or maybe some other python library?
This file is used as a log, so every second a single row will be added; eventually the size of the file will reach gigabytes, so reading the whole file and writing it back, or keeping a copy of the data in memory and writing it to the file every second, is practically impossible. With pyfits or astropy.io.fits, so far I could only read everything into memory, add a new row, and then write it back.
Example: I create a FITS file like this:
import numpy, pyfits
data = numpy.array([1.0])
col = pyfits.Column(name='index', format='E', array=data)
cols = pyfits.ColDefs([col])
tbhdu = pyfits.BinTableHDU.from_columns(cols)
tbhdu.writeto('test.fits')
And I want to add some new value to the column 'index', i.e. add one more row to the binary table.
Solution: This is a trivial task for the cfitsio library (method fits_insert_row(...)), so I use a Python module based on it: https://github.com/esheldon/fitsio
And here is the solution using fitsio. To create a new FITS file one can do:
import numpy
from fitsio import FITS

fits = FITS('test.fits', 'rw')
data = numpy.zeros(1, dtype=[('index', 'i4')])
data[0]['index'] = 1
fits.write(data)
fits.close()
To append a row:
fits = FITS('test.fits', 'rw')
# you can actually keep using the same already-opened fits file;
# to flush the changes you just need fits.reopen()
data = numpy.zeros(1, dtype=[('index', 'i4')])
data[0]['index'] = 2
fits[1].append(data)
fits.close()
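For the logging use case (one row per second), the file can stay open between appends; a minimal sketch, assuming the table created above sits in HDU 1:
import time
import numpy
from fitsio import FITS

fits = FITS('test.fits', 'rw')
row = numpy.zeros(1, dtype=[('index', 'i4')])
for i in range(3, 6):        # stand-in for the real once-per-second logger
    row[0]['index'] = i
    fits[1].append(row)      # appends in place, no full rewrite of the file
    time.sleep(1.0)
fits.close()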
Thank you for your help.

how to read rows of a column from a csv file in python?

import csv
import numpy as np

activity = csv.reader(open('activity(delimited).csv'))
data = np.array([activity])
url_data = data[:, 70]
I am trying to extract a column from the CSV file. This column has a list of URLs that I would like to read. However, every time I run these few lines, I get:
IndexError: too many indices for array
There are a bunch of very similar issues on StackOverflow [post 1], [post 2].
For your reference, there is a much cleaner way of achieving this.
genfromtxt will suit your requirements pretty well. From the documentation,
Load data from a text file, with missing values handled as specified.
Each line past the first skip_header lines is split at the delimiter
character, and characters following the comments character are
discarded.
You would need to do something like this.
from numpy import genfromtxt
my_data = genfromtxt('activity(delimited).csv', delimiter=',')
Given that yours is a csv, the delimiter is a ,.
my_data should now hold a numpy ndarray.
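Since the column you are after holds URLs (strings), the default float dtype would turn them into nan; here is a sketch of pulling out just that column (index 70, as in your snippet):
# usecols selects the single column, dtype=str keeps the URLs as text
url_data = genfromtxt('activity(delimited).csv', delimiter=',',
                      usecols=70, dtype=str)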

Parsing CSV File into Python into a contiguous block

I am trying to load time series / Apple stock price data (3000x5) into Python.
So date, open, high, low, close. I am running the following code in python spyder.
import matplotlib.pyplot as plt
import csv
datafile = open('C:\Users\Riemmman\Desktop\SAMPLE_AAPL_DATA_FOR_Python.csv')
datareader = csv.reader(datafile)
data = []
for row in datareader:
    data.append(row)
But 'data' still remains a plain list of rows. I want it separated into a contiguous block with the headers on top and the data in its respective columns, with the date at the far left-hand side, as one would see the data in R/Matlab. What am I missing? Thank you for your help.
You want to transpose the data; rows to columns. The zip() function, when applied to all rows, does this for you. Use *datareader to have Python pull all rows in and apply them as separate arguments to the zip() function:
filename = 'C:\Users\Riemmman\Desktop\SAMPLE_AAPL_DATA_FOR_Python.csv'
with open(filename, 'rb') as datafile:
    datareader = csv.reader(datafile)
    columns = zip(*datareader)
This also uses some more best practices:
Using the file as a context manager with the with statement ensures it is closed automatically
Opening the file in binary mode lets the csv module manage line endings correctly (this applies to Python 2; in Python 3 you would open in text mode with newline='')
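If you also want the headers attached to their columns, a small follow-on sketch (column names assumed from the question):
# the first entry of each column is its header, the rest is the data
table = {col[0]: col[1:] for col in columns}
print(table['close'])  # every closing price, still as strings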
