I am importing a decent amount of data from a CSV file. The original CSV file has around 40 columns; I am trimming it down and want to use my own names (as shown below) for the columns.
import numpy as np
import math
import sys
import matplotlib.pyplot as plt
print(sys.argv)
data1 = np.genfromtxt(
    sys.argv[1],
    skip_header=10,
    skip_footer=1,
    delimiter=',',
    usecols=(0,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,20,21,22,23,24,25,26,27),
    names=['line','TPS','mph','rpm','load','engtemp','trq','month','day','T1','T2','T3','T4','T5','T6','T7','T8','T9','T10','T11','T12','T13','T14','T15','T16'])
Unfortunately this changes the shape of the array, and I don't know how to compensate for it, nor do I know how to refer to the names. Originally I was hoping that it would save each column as its own list under the name specified. If you know of a function that does that, that would be neat!
So, the question.
Why does the above not work when I try to reference a column for plotting
plt.plot(data1[:,9])
but when I remove the names argument from genfromtxt it works fine?
and
is there a way to do something like this?
plt.plot(data1[T1])
Could it be because the 'names' argument is trying to name the rows, not the columns?
When names is used I get the error
Traceback (most recent call last):
  File "HeatReport.py", line 48, in <module>
    print(data1[3,9])
IndexError: too many indices for array
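For reference, this is documented behavior rather than a bug: with names set, genfromtxt returns a one-dimensional structured array (one record per row), so two-dimensional indexing no longer applies and columns are selected by field name. A minimal, self-contained sketch of that access pattern (the tiny inline CSV stands in for the real file):

import numpy as np
import matplotlib.pyplot as plt
from io import StringIO

# small stand-in for the real file
csv = StringIO("0,1.0,2.0\n1,1.5,2.5\n2,2.0,3.0")
data1 = np.genfromtxt(csv, delimiter=',', names=['line', 'T1', 'T2'])

print(data1.shape)        # (3,) -- one dimension: each row is a record
print(data1.dtype.names)  # ('line', 'T1', 'T2')

plt.plot(data1['T1'])     # select a column by field name, not data1[:, 1]
print(data1['T1'][0])     # single value: column first, then row (or data1[0]['T1'])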
I have a problem loading a specific Matlab file into a pandas DataFrame. Basically, it is a multidimensional array consisting of 2x4 arrays which then contain either 3 or 4 columns. I searched around and found Convert mat file to pandas dataframe, matlab data file to pandas DataFrame, but neither works for me. Here is what I am using; it creates a DataFrame for one of the nests. I would like to have columns that tell me where I am, i.e. the first column tells whether it is all_params[0] or all_params[1], the second distinguishes the next level for each of those, etc.
import numpy as np
import scipy.io as sio
import pandas as pd

all_params = sio.loadmat(path+'all_params')  # path is the directory holding the .mat file
all_params = all_params['all_params']
pd.DataFrame(all_params[0][0][:], columns=['par1', 'par2', 'par3'])
It must be simple; I'm just not able to figure it out. Or is there a way to do it directly using scipy or another loading tool?
Thanks
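A possible approach, sketched under the assumption that all_params is the usual nested object array loadmat returns (the exact nesting depth is guessed from the description, and 'level1'/'level2' and the par names are placeholder labels): build one small DataFrame per nest, tag it with columns recording where it came from, and concatenate.

import pandas as pd
import scipy.io as sio

# path as in the question; the nesting depth below is a guess from the description
all_params = sio.loadmat(path + 'all_params')['all_params']

frames = []
for i in range(len(all_params)):            # first level: all_params[0], all_params[1], ...
    for j in range(len(all_params[i])):     # second level within each
        arr = all_params[i][j]
        cols = ['par%d' % (k + 1) for k in range(arr.shape[1])]  # handles 3 or 4 columns
        df = pd.DataFrame(arr, columns=cols)
        df.insert(0, 'level1', i)           # which all_params[i] this nest came from
        df.insert(1, 'level2', j)           # which nest within it
        frames.append(df)

data = pd.concat(frames, ignore_index=True)

Where a nest has only 3 columns, concat leaves the fourth as NaN, so the level columns still line up across all nests.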
I have a very large CSV File (~12Gb) that looks something like this:
posX,posY,posZ,eventID,parentID,clockTime
-117.9853515625,60.2998046875,0.29499998688697815,0,0,0
-117.9853515625,60.32909393310547,0.29499998688697815,0,0,0
-117.9560546875,60.2998046875,0.29499998688697815,0,0,0
-117.9560546875,60.32909393310547,0.29499998688697815,0,0,0
-117.92676544189453,60.2998046875,0.29499998688697815,0,0,0
-117.92676544189453,60.32909393310547,0.29499998688697815,0,0,0
-118.04051208496094,60.34012985229492,4.474999904632568,0,0,0
-118.04051208496094,60.36941909790039,4.474999904632568,0,0,0
-118.04051208496094,60.39870834350586,4.474999904632568,0,0,0
I want to convert this CSV file into the HDF5 format using the library h5py while also lowering the total file size by setting the field / index types, e.g. saying:
Save posX, posY and posZ as float32. Save eventID, parentID and clockTime as int32 or something along those lines.
Note: I need to chunk the data in some form when I read it in to avoid Memory Errors.
However, I am unable to get the desired result. What I have tried so far:
Using Pandas own methods following this guide: How to write a large csv file to hdf5 in python?
This creates the file, but I'm somehow unable to change the types and the file remains too big (~10.7Gb). The field types are float64 and int64.
I also tried to split the CSV up into parts (using split -n x myfile.csv) before working with the increments. I ran into some data errors at the beginning and end of each file, which I was able to fix by removing those lines using sed. Then I tried the following code:
import pandas as pd
import h5py
PATH_csv = "/home/MYNAME/Documents/Workfolder/xaa" #xaa is my csv increment
DATA_csv = pd.read_csv(PATH_csv)
with h5py.File("pct_data-hdf5.h5", "a") as DATA_hdf:
dset = DATA_hdf.create_dataset("posX", data=DATA_csv["posX"], dtype="float32")
Sadly this created the file and the table but didn't write any data into it.
Expectation
Creating a HDF5 File containing the data of a large CSV file while also changing the variable type of each index.
If something is unclear please ask me for clarification. I'm still a beginner!
Have you considered the numpy module?
It has a handy function (genfromtxt) to read CSV data with headers into a Numpy array. You define the dtype. The resulting array is suitable for loading into HDF5 with h5py's create_dataset() function.
See code below. I included 2 print statements. The first shows the dtype names created from the CSV headers. The second shows how you can access the data in the numpy array by field (column) name.
import h5py
import numpy as np

PATH_csv = 'SO_55576601.csv'
csv_dtype = ('f8', 'f8', 'f8', 'i4', 'i4', 'i4')

csv_data = np.genfromtxt(PATH_csv, dtype=csv_dtype, delimiter=',', names=True)

print(csv_data.dtype.names)
print(csv_data['posX'])

with h5py.File('SO_55576601.h5', 'w') as h5f:
    dset = h5f.create_dataset('CSV_data', data=csv_data)
# the with block closes the file, so no explicit h5f.close() is needed
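Since the source file is ~12 Gb, reading it all at once with genfromtxt may not fit in memory. A chunked variant is sketched below, using pandas' read_csv(chunksize=...) to feed a resizable h5py dataset (the file names, dataset name, and chunk size are placeholders, and this is untested against the real file):

import h5py
import numpy as np
import pandas as pd

# compound dtype matching the requested float32 / int32 layout
csv_dtype = np.dtype([('posX', 'f4'), ('posY', 'f4'), ('posZ', 'f4'),
                      ('eventID', 'i4'), ('parentID', 'i4'), ('clockTime', 'i4')])

with h5py.File('pct_data-hdf5.h5', 'w') as h5f:
    # resizable 1-D dataset of the compound type; grows as chunks arrive
    dset = h5f.create_dataset('CSV_data', shape=(0,), maxshape=(None,), dtype=csv_dtype)
    for chunk in pd.read_csv('myfile.csv', chunksize=1000000):
        recs = np.empty(len(chunk), dtype=csv_dtype)
        for name in csv_dtype.names:
            recs[name] = chunk[name].to_numpy()   # casts to f4/i4 on assignment
        n = dset.shape[0]
        dset.resize((n + len(recs),))
        dset[n:] = recs

Passing dtype= to read_csv (e.g. {'posX': 'float32', ...}) would also downcast at parse time, which addresses the float64/int64 sizes seen with the HDFStore attempt.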
I am attempting to read a .csv file and assign a particular range of values to its own list for indexing purposes:
import numpy as np
import scipy as sp
import matplotlib as plot
import pandas as pd
# read in the data held in the csv file using "read_csv", a function built into the "pandas" module
Ne = pd.read_table('Ionosphere_data.csv', sep='\s+', dtype=np.float64);
print(Ne.shape)
print(np.dtype)
# store each dimension from the csv file into its own array for plotting use
Altitude = Ne[:,1];
Time = Ne[0,:];
# loop through each electron density value with respect to time and altitude
Ne_max = []
for i in range(0,len(Time)):
for j in range(0,len(Altitude)):
Ne_max[j] = np.max(Ne[:,i]);
print Ne_max
#plot(Time,Altitude,Ne_max);
#xaxislabel('Time (hours)')
#yaxislabel('Altitude (km)')
however, when I run this code, IPython displays the error message
"TypeError: data type not understood" for line 10. (As a side note, when 'print(np.dtype)' is removed, a separate error message is given for line 13: "TypeError: unhashable type".)
Does anyone know if I am reading in the file incorrectly or if there is another problem?
In a comment, you say that the line
print(np.dtype)
should be
print(np.dtype(Ne))
That gives the error TypeError: data type not understood. numpy.dtype tries to convert its argument into a numpy data type object. It is not used to inspect the data type of the argument.
For a Pandas DataFrame, use the dtypes attribute:
print(Ne.dtypes)
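A small illustration of both points, using a throwaway DataFrame (the column labels here are made up): dtypes reports per-column types, and positional access on a DataFrame goes through .iloc, which is likely the fix for the Ne[:,1] line that raised "unhashable type".

import numpy as np
import pandas as pd

Ne = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=['alt', 'time'])

print(Ne.dtypes)      # per-column dtypes; what print(np.dtype(Ne)) was after
print(Ne['alt'])      # select a column by label
print(Ne.iloc[:, 1])  # positional selection; Ne[:, 1] raises the "unhashable type" error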
I am trying to create a pandas DataFrame from an R Dataframe. I am encountering the following error, which I cannot figure out.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 291, in __init__
    raise PandasError('DataFrame constructor not properly called!')
PandasError: DataFrame constructor not properly called!
The code I am using is:
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects import r

robjects.r['load']("file.RData")
my_data = pd.DataFrame(r['ops.data'])
and the error comes after the last line.
You need to read the data in sequentially using a for loop. DataFrames don't easily read in data in the way you are representing it; they are much better suited to dictionaries. Write some headers and then write the data underneath them.
Furthermore, ['ops.data'] specifies "ops.data" as a data header, and you obviously can't read in an entire file as a column header.
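As an alternative, and hedged because the rpy2 API has shifted across versions: in rpy2 2.x the pandas2ri module can convert R data frames directly, so the loop may be avoidable (untested against the file in question; newer releases expose the same idea as robjects.conversion.rpy2py):

import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri

pandas2ri.activate()              # register the R <-> pandas converters
robjects.r['load']("file.RData")
my_data = robjects.r['ops.data']  # returned as a pandas DataFrame by the active converter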
numpy's recfromcsv skips the first line of my data. (Same thing for genfromtxt)
import numpy as np
filename = 'data.csv'
writer = open(filename,mode='w')
writer.write('0,1.1,1.2\n1,2.1,2.2\n2,3.1,3.2')
writer.close()
data = np.recfromcsv(filename)
print data
Is this a bug, or how can I load the data without losing the first line?
By convention, the first line of a CSV file contains the field names.
The function recfromcsv invokes genfromtxt with names=True by default, which means it reads the first line of the data as the header.
Definition:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
You should write a header line before the data:
import numpy as np
filename = 'data.csv'
writer = open(filename,mode='w')
writer.write('first column,second column,third column\n')
writer.write('0,1.1,1.2\n1,2.1,2.2\n2,3.1,3.2')
writer.close()
data = np.recfromcsv(filename)
print data
Or use recfromtxt instead of recfromcsv.
Or override the default names, as in
recfromcsv(filename, names=['a','b','c'])
Note that there is no skiprow argument that helps here: with names=True the first row is always consumed as the header, so passing an explicit names list (as above) is what keeps recfromcsv from dropping the first data row.
The default behavior of recfromcsv is to read a header row, which is why it's skipping the first row. It works for me with genfromtxt (if I pass delimiter=','). Can you provide output showing how genfromtxt fails?
Unfortunately it seems there is a bug in Numpy that won't let you specify the dtype in recfromcsv (see https://github.com/numpy/numpy/issues/311), so I can't see how to read it in with specified column names, which I think is what you need to do to avoid reading the header line. But you can read the data in with genfromtxt.
Edit: It looks like you can read it in just by passing in a list of names:
np.recfromcsv(filename, delimiter=',', names=['a', 'b', 'c'])
(The reason it wasn't working for me is that I had done from __future__ import unicode_literals and it apparently doesn't like unicode in dtypes.)