Load txt file into numpy array - python

I want to load a txt file into a numpy array. The file has this format:
1,10,1,11,1,13,1,12,1,1,9
2,11,2,13,2,10,2,12,2,1,9
3,12,3,11,3,13,3,10,3,1,9
4,10,4,11,4,1,4,13,4,12,9
4,1,4,13,4,12,4,11,4,10,9
1,2,1,4,1,5,1,3,1,6,8
1,9,1,12,1,10,1,11,1,13,8
2,1,2,2,2,3,2,4,2,5,8
3,5,3,6,3,9,3,7,3,8,8
4,1,4,4,4,2,4,3,4,5,8
.
.
.
Each row has 11 values and there are 25010 rows (so I don't want to put them into the array manually).
I already tried load, loadtxt, loadfile and genfromtxt. All of them failed.
I'm pretty sure the commas are the problem.
Do you have a solution?

You need to specify the delimiter as ,:
import numpy as np
data = np.loadtxt("file.txt", delimiter=",")
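np.genfromtxt (which the question also mentions trying) works the same way once the delimiter is given; a small sketch, assuming the file described above:
import numpy as np
# genfromtxt also parses the comma-separated rows once the delimiter is set
data = np.genfromtxt("file.txt", delimiter=",")
print(data.shape)  # expected (25010, 11) for the file described in the question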

Related

Load mat file into pandas data frame

I have a problem loading a specific Matlab file into a pandas data frame (data). Basically, it is a multidimensional array consisting of 2x4 arrays and then either 3 or 4 columns. I searched around and found Convert mat file to pandas dataframe and matlab data file to pandas DataFrame, but neither works for me. Here is what I am using; it creates a DataFrame for one of the nests. I would like to have columns that tell me where I am, i.e. the first column tells whether it is all_params[0] or all_params[1], the second distinguishes the next level for each of those, etc.
import numpy as np
import pandas as pd
import scipy.io as sio
all_params = sio.loadmat(path+'all_params')
all_params = all_params['all_params']
pd.DataFrame(all_params[0][0][:], columns=['par1', 'par2', 'par3'])
It must be simple; I'm just not able to figure it out. Or is there a way to do it directly using scipy or another loading tool?
Thanks
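A minimal sketch of one way this could be done, not verified against the actual .mat file; the two nesting levels and the column names are assumptions taken from the snippet above:
import pandas as pd
import scipy.io as sio

all_params = sio.loadmat(path + 'all_params')['all_params']

frames = []
for i in range(len(all_params)):
    for j in range(len(all_params[i])):
        df = pd.DataFrame(all_params[i][j], columns=['par1', 'par2', 'par3'])
        df.insert(0, 'level1', i)  # which outer element this block came from
        df.insert(1, 'level2', j)  # index within that outer element
        frames.append(df)

data = pd.concat(frames, ignore_index=True)
The level1 / level2 columns then identify which nest each row came from.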

Trying to size down HDF5 File by changing index field types using h5py

I have a very large CSV File (~12Gb) that looks something like this:
posX,posY,posZ,eventID,parentID,clockTime
-117.9853515625,60.2998046875,0.29499998688697815,0,0,0
-117.9853515625,60.32909393310547,0.29499998688697815,0,0,0
-117.9560546875,60.2998046875,0.29499998688697815,0,0,0
-117.9560546875,60.32909393310547,0.29499998688697815,0,0,0
-117.92676544189453,60.2998046875,0.29499998688697815,0,0,0
-117.92676544189453,60.32909393310547,0.29499998688697815,0,0,0
-118.04051208496094,60.34012985229492,4.474999904632568,0,0,0
-118.04051208496094,60.36941909790039,4.474999904632568,0,0,0
-118.04051208496094,60.39870834350586,4.474999904632568,0,0,0
I want to convert this CSV file into HDF5 format using the h5py library, while also lowering the total file size by setting the field / index types, e.g.:
Save posX, posY and posZ as float32. Save eventID, parentID and clockTime as int32 or something along those lines.
Note: I need to chunk the data in some form when I read it in to avoid Memory Errors.
However, I am unable to get the desired result. What I have tried so far:
Using Pandas' own methods following this guide: How to write a large csv file to hdf5 in python?
This creates the file, but I'm somehow unable to change the types and the file remains too big (~10.7 GB). The field types are float64 and int64.
I also tried splitting the CSV into parts (using split -n x myfile.csv) before working with the increments. I ran into some data errors at the beginning and end of each file, which I was able to fix by removing said lines using sed. Then I tried out the following code:
import pandas as pd
import h5py
PATH_csv = "/home/MYNAME/Documents/Workfolder/xaa"  # xaa is my csv increment
DATA_csv = pd.read_csv(PATH_csv)
with h5py.File("pct_data-hdf5.h5", "a") as DATA_hdf:
    dset = DATA_hdf.create_dataset("posX", data=DATA_csv["posX"], dtype="float32")
Sadly this created the file and the table but didn't write any data into it.
Expectation
Creating an HDF5 file containing the data of a large CSV file while also changing the variable type of each index.
If something is unclear, please ask me for clarification. I'm still a beginner!
Have you considered the numpy module?
It has a handy function (genfromtxt) to read CSV data with headers into a NumPy array. You define the dtype. The array is suitable for loading into HDF5 with h5py's create_dataset() function.
See code below. I included 2 print statements. The first shows the dtype names created from the CSV headers. The second shows how you can access the data in the numpy array by field (column) name.
import h5py
import numpy as np
PATH_csv = 'SO_55576601.csv'
csv_dtype = ('f4', 'f4', 'f4', 'i4', 'i4', 'i4')  # float32 / int32, per the question's goal
csv_data = np.genfromtxt(PATH_csv, dtype=csv_dtype, delimiter=',', names=True)
print(csv_data.dtype.names)
print(csv_data['posX'])
with h5py.File('SO_55576601.h5', 'w') as h5f:
    dset = h5f.create_dataset('CSV_data', data=csv_data)
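If a single genfromtxt pass over the ~12 GB file runs out of memory, the same idea can be done in chunks. Below is a rough sketch only (chunk sizes and file names are placeholders), using pandas for the chunked reading and a resizable h5py dataset:
import h5py
import numpy as np
import pandas as pd

csv_dtype = np.dtype([('posX', 'f4'), ('posY', 'f4'), ('posZ', 'f4'),
                      ('eventID', 'i4'), ('parentID', 'i4'), ('clockTime', 'i4')])

with h5py.File('pct_data-hdf5.h5', 'w') as h5f:
    dset = h5f.create_dataset('CSV_data', shape=(0,), maxshape=(None,),
                              dtype=csv_dtype, chunks=(100000,))
    for chunk in pd.read_csv('myfile.csv', chunksize=1000000):
        rec = np.empty(len(chunk), dtype=csv_dtype)
        for name in csv_dtype.names:          # cast each column to the target type
            rec[name] = chunk[name].to_numpy()
        dset.resize(dset.shape[0] + len(rec), axis=0)
        dset[-len(rec):] = rec                # append this chunk at the end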

How to concat many numpy arrays?

I am trying to concatenate many numpy arrays; each array is stored in its own file. The problem is that I have a lot of files, and memory can't support creating one big array Data_Array = np.zeros((1000000,7000)) to put all my files into. So I found in this question Combining NumPy arrays that I can use np.concatenate:
import numpy as np
import matplotlib.pyplot as plt
file1= np.load('file1_Path.npy')
file2= np.load('file2_Path.npy')
file3= np.load('file3_Path.npy')
file4= np.load('file4_Path.npy')
dataArray=np.concatenate((file1, file2, file3, file4), axis=0)
test= dataArray.shape
print(test)
print (dataArray)
print (dataArray.shape)
plt.plot(dataArray.T)
plt.show()
This gives me a very good result, but now I need to replace file1, file2, file3, file4 with the path to the folder containing my files:
import matplotlib.pyplot as plt
import numpy as np
import glob
import os, sys
fpath ="Path_To_Big_File"
npyfilespath =r'Path_To_Many_Numpy_Files'
os.chdir(npyfilespath)
npfiles= glob.glob("*.npy")
npfiles.sort()
for i,npfile in enumerate(npfiles):
    dataArray=np.concatenate(npfile, axis=0)
    np.save(fpath, all_arrays)
It gives me this error:
np.concatenate(npfile, axis=0)
ValueError: zero-dimensional arrays cannot be concatenated
Could you please help me make this np.concatenate approach work?
If you wish to use large arrays, just use np.memmap instead of loading the data into memory. The advantage of memmap is that data is always saved to disk when necessary. For example, you can create a memory mapped array in the following way:
import numpy as np
a = np.memmap('myFile', dtype=np.int64, mode='w+', shape=(1000000, 8000))  # creates the backing file on disk
You can then use a as a normal numpy array.
The limit is then your hard disk! This creates a file on your hard disk that you can read back later: just change mode to 'r' and read data from the array.
More info about memmap here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
In order to fill that array from npy files of shape (1,8000), just write:
for i, npfile in enumerate(npfiles):
    a[i, :] = np.load(npfile)
a.flush()
The flush method ensures everything has been written to disk.
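To read the array back later, reopen the same file read-only with the same dtype and shape (minimal sketch):
import numpy as np
b = np.memmap('myFile', dtype=np.int64, mode='r', shape=(1000000, 8000))
print(b[0, :10])   # only the pages actually touched are read from disk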

Efficient way to process CSV file into a numpy array

The CSV file may not be clean (lines with an inconsistent number of elements); unclean lines would need to be disregarded.
String manipulation is required during processing.
Example input:
20150701 20:00:15.173,0.5019,0.91665
Desired output: float32 (pseudo-date, seconds in the day, f3, f4)
0.150701 72015.173 0.5019 0.91665 (+ the trailing trash floats usually get)
The CSV file is also very big: the numpy array in memory would be expected to take 5-10 GB, and the CSV file is over 30 GB.
Looking for an efficient way to process the CSV file and end up with a numpy array.
Current solution: use the csv module, process line by line, and use a list() as a buffer that later gets turned into a numpy array with asarray(). The problem is that during the conversion memory consumption is doubled, and the copying adds execution overhead.
Numpy's genfromtxt and loadtxt don't appear to be able to process the data as desired.
If you know in advance how many rows are in the data, you could dispense with the intermediate list and write directly to the array.
import numpy as np
no_rows = 5
no_columns = 4
a = np.zeros((no_rows, no_columns), dtype=np.float32)  # float32 matches the desired output
with open('myfile') as f:
    for i, line in enumerate(f):
        a[i, :] = cool_function_that_returns_formatted_data(line)
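As an illustration only, a possible cool_function_that_returns_formatted_data for the example row above might look like this; it assumes the date becomes YYMMDD/1e6 and the time becomes seconds since midnight, as in the desired output:
import numpy as np

def cool_function_that_returns_formatted_data(line):
    # '20150701 20:00:15.173,0.5019,0.91665' -> (0.150701, 72015.173, 0.5019, 0.91665)
    stamp, f3, f4 = line.rstrip().split(',')
    date_str, time_str = stamp.split(' ')
    pseudo_date = float(date_str[2:]) / 1e6              # 20150701 -> 0.150701
    h, m, s = time_str.split(':')
    seconds = int(h) * 3600 + int(m) * 60 + float(s)     # 20:00:15.173 -> 72015.173
    return np.array([pseudo_date, seconds, float(f3), float(f4)], dtype=np.float32)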
Did you think of using pandas read_csv (with engine='c')?
I find it one of the best and easiest ways to handle CSV. I worked with a 4 GB file and it worked for me.
import pandas as pd
df = pd.read_csv('abc.csv', engine='c')
print(df.head(10))
I think the i/o capability of pandas is the best way to get data into a numpy array. Specifically, the read_csv method will read into a pandas DataFrame. You can then access the underlying numpy array using the to_numpy method (formerly as_matrix) of the returned DataFrame.
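A minimal sketch of that route (the column names here are assumptions, since the sample rows have no header):
import pandas as pd

df = pd.read_csv('abc.csv', engine='c', header=None, names=['stamp', 'f3', 'f4'])
arr = df[['f3', 'f4']].to_numpy(dtype='float32')   # the underlying numpy array
# the 'stamp' column would still need the string manipulation described above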

Dump a sparse matrix into a file

I have a scipy.sparse.csr matrix and would like to dump it to a CSV file. Is there a way to preserve the sparsity of the matrix and write it to a CSV?
SciPy includes functions that read/write sparse matrices in the MatrixMarket format via the scipy.io module, including mmwrite: http://docs.scipy.org/doc/scipy/reference/generated/scipy.io.mmwrite.html
MatrixMarket is not CSV, but close. After a short header it has a size line with #rows, #cols, and # of nonzeros, followed by one line per nonzero; each of these lines is row index, column index, value. You could write a simple script that turns the whitespace into commas and you'd have a CSV.
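A small sketch of the MatrixMarket round trip (the file name is arbitrary):
import scipy.sparse as sp
from scipy.io import mmwrite, mmread

m = sp.csr_matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
mmwrite('matrix.mtx', m)             # text file: one line per nonzero entry
m2 = mmread('matrix.mtx').tocsr()    # mmread returns COO, convert back to CSR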
Now with Scipy 0.19, it is super easy:
import scipy.sparse as sp
m = sp.csr_matrix([[1,0,0],[0,1,0],[0,0,1]])
sp.save_npz("file_name.npz", m)
Loading the file back into memory:
new_m = sp.load_npz("file_name.npz")
