How to concat many numpy arrays? - python

I am trying to concatenate many numpy arrays, each stored in its own file. The problem is that I have a lot of files and not enough memory to create one big array Data_Array = np.zeros((1000000,7000)) to hold them all. From the question Combining NumPy arrays I found that I can use np.concatenate:
file1= np.load('file1_Path.npy')
file2= np.load('file2_Path.npy')
file3= np.load('file3_Path.npy')
file4= np.load('file4_Path.npy')
dataArray=np.concatenate((file1, file2, file3, file4), axis=0)
test= dataArray.shape
print(test)
print (dataArray)
print (dataArray.shape)
plt.plot(dataArray.T)
plt.show()
This gives me a very good result, but now I need to replace file1, file2, file3, file4 with the path to the folder that contains my files:
import matplotlib.pyplot as plt
import numpy as np
import glob
import os, sys
fpath ="Path_To_Big_File"
npyfilespath =r'Path_To_Many_Numpy_Files'
os.chdir(npyfilespath)
npfiles= glob.glob("*.npy")
npfiles.sort()
for i, npfile in enumerate(npfiles):
    dataArray = np.concatenate(npfile, axis=0)
np.save(fpath, all_arrays)
It gives me this error:
np.concatenate(npfile, axis=0)
ValueError: zero-dimensional arrays cannot be concatenated
Could you please help me make np.concatenate work here?

If you wish to use large arrays, just use np.memmap instead of loading the data into memory. The advantage of memmap is that the data stays on disk and only the parts you access are brought into memory. For example, you can create a memory-mapped array in the following way:
import numpy as np
a = np.memmap('myFile', dtype=np.int64, mode='w+', shape=(1000000, 8000))
You can then use 'a' as a normal numpy array.
The limit is then your hard disk! This creates a file on your hard disk that you can read back later by opening it again with mode='r'.
More info about memmap here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
In order to fill that array from npy files of shape (1,8000), just write:
for i, npFile in enumerate(npFiles):
    a[i, :] = np.load(npFile)
a.flush()
The flush method ensures that everything has been written to disk.
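Putting the two pieces together, here is a minimal sketch of how the memmap could be filled from a whole folder of .npy files (as in the original question's glob loop) and read back later; the float64 dtype and the 8000-column width are assumptions you would adapt to your data:
import glob
import numpy as np

npFiles = sorted(glob.glob('Path_To_Many_Numpy_Files/*.npy'))
a = np.memmap('myFile', dtype=np.float64, mode='w+', shape=(len(npFiles), 8000))
for i, npFile in enumerate(npFiles):
    # ravel() flattens a (1, 8000) array into the (8000,) row expected here
    a[i, :] = np.load(npFile).ravel()
a.flush()

# Later, reopen the same file read-only; the dtype and shape must be given again
b = np.memmap('myFile', dtype=np.float64, mode='r', shape=(len(npFiles), 8000))
print(b[0, :10])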

Related

Text transform with sklearn TF-IDF vectorizer generates too big csv file

I have 1000 texts, each with 200-1000 words. The size of the text CSV file is about 10 MB. When I vectorize them with the code below, the size of the output CSV is exceptionally big (2.5 GB). I am not sure what I did wrong. Your help is highly appreciated. Code:
import numpy as np
import pandas as pd
from copy import deepcopy
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from numpy import savetxt
df = pd.read_csv('data.csv')
#data has two columns: teks and groups
filtered_df = deepcopy(df)
vectorizer = TfidfVectorizer()
vectorizer.fit(filtered_df["teks"])
vector = vectorizer.transform(filtered_df["teks"])
print(vector.shape) # shape (1000, 83000)
savetxt('dataVectorized1.csv', vector.toarray(), delimiter=',')
Sparse matrices (like your vector here) are not supposed to be converted to dense ones (as you do with .toarray()) and saved as CSV files; doing that makes no sense, and invalidates the whole concept of sparse matrices itself. Given that, the big size is not a surprise.
You should seriously consider saving your sparse vector to an appropriate format, e.g. using scipy.sparse:
import scipy.sparse
scipy.sparse.save_npz('dataVectorized1.npz', vector)
See also Save / load scipy sparse csr_matrix in portable data format for possible other options.
If, for any reason, you must stick to a CSV file for storage, you could try compressing the output file by simply using the .gz extension in the file name; from the np.savetxt() documentation:
If the filename ends in .gz, the file is automatically saved in compressed gzip format. loadtxt understands gzipped files transparently.
So, this should do the job:
np.savetxt('dataVectorized1.csv.gz', vector.toarray(), delimiter=',')
However, I would not really recommend this; keep in mind that:
Beyond their convenience for tutorials and introductory exhibitions, CSV files do not really hold any "special" status as inputs to ML tasks, as you might seem to believe.
There is absolutely no reason why the much more efficient .npz file cannot be used as input for further downstream tasks, like classification, visualization, and clustering; on the contrary, its use is very much justified and recommended in similar situations.
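For instance, here is a minimal sketch of such downstream use (LogisticRegression is just a stand-in classifier here, and the 'groups' column comes from the question's data.csv):
import pandas as pd
import scipy.sparse
from sklearn.linear_model import LogisticRegression

# Reload the sparse TF-IDF matrix saved with save_npz above
vector = scipy.sparse.load_npz('dataVectorized1.npz')
labels = pd.read_csv('data.csv')["groups"]

# Most scikit-learn estimators accept scipy sparse input directly,
# so the matrix never needs to be densified
clf = LogisticRegression(max_iter=1000)
clf.fit(vector, labels)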

Load txt file into numpy array

I want to load a txt file into a numpy array. The file has this format:
1,10,1,11,1,13,1,12,1,1,9
2,11,2,13,2,10,2,12,2,1,9
3,12,3,11,3,13,3,10,3,1,9
4,10,4,11,4,1,4,13,4,12,9
4,1,4,13,4,12,4,11,4,10,9
1,2,1,4,1,5,1,3,1,6,8
1,9,1,12,1,10,1,11,1,13,8
2,1,2,2,2,3,2,4,2,5,8
3,5,3,6,3,9,3,7,3,8,8
4,1,4,4,4,2,4,3,4,5,8
.
.
.
The file is 11 values wide and 25010 rows high (so I don't want to enter them into the array manually).
I already tried load, loadtxt, loadfile and genfromtxt. All of them failed.
I'm pretty sure the commas are the problem.
Do you have a solution?
You need to specify the delimiter as ,:
import numpy as np
data = np.loadtxt("file.txt", delimiter=",")

Efficient way to process CSV file into a numpy array

The CSV file may not be clean (lines with inconsistent numbers of elements); unclean lines would need to be disregarded.
String manipulation is required during processing.
Example input:
20150701 20:00:15.173,0.5019,0.91665
Desired output: float32 (pseudo-date, seconds in the day, f3, f4)
0.150701 72015.173 0.5019 0.91665 (+ the trailing trash floats usually get)
The CSV file is also very big: the numpy array in memory would be expected to take 5-10 GB, and the CSV file itself is over 30 GB.
Looking for an efficient way to process the CSV file and end up with a numpy array.
Current solution: use the csv module, process line by line, and use a list as a buffer that later gets turned into a numpy array with asarray(). The problem is that during the conversion memory consumption is doubled, and the copying adds execution overhead.
Numpy's genfromtxt and loadtxt don't appear to be able to process the data as desired.
If you know in advance how many rows are in the data, you could dispense with the intermediate list and write directly to the array.
import numpy as np

no_rows = 5
no_columns = 4
a = np.zeros((no_rows, no_columns), dtype=np.float32)
with open('myfile') as f:
    for i, line in enumerate(f):
        a[i, :] = cool_function_that_returns_formatted_data(line)
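The formatting helper is left abstract in the snippet above; here is a minimal sketch of what it might look like for the example line in the question (the exact pseudo-date encoding is an assumption based on the desired output shown there):
def cool_function_that_returns_formatted_data(line):
    # Example input: "20150701 20:00:15.173,0.5019,0.91665"
    datepart, f3, f4 = line.strip().split(',')
    date_str, time_str = datepart.split(' ')
    # Pseudo-date: 20150701 -> 0.150701 (drop the leading "20", scale down)
    pseudo_date = float(date_str[2:]) / 1e6
    # Seconds in the day: 20:00:15.173 -> 72015.173
    h, m, s = time_str.split(':')
    seconds = int(h) * 3600 + int(m) * 60 + float(s)
    return [pseudo_date, seconds, float(f3), float(f4)]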
Did you think of using pandas read_csv (with engine='c')?
I find it one of the best and easiest solutions for handling CSV. I worked with a 4 GB file and it worked for me.
import pandas as pd
df = pd.read_csv('abc.csv', engine='c')
print(df.head(10))
I think the I/O capability of pandas is the best way to get data into a numpy array. Specifically, the read_csv method will read into a pandas DataFrame. You can then access the underlying numpy array using the to_numpy() method (as_matrix in older pandas versions) of the returned DataFrame.
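A minimal sketch of that pipeline (header=None is an assumption, since the sample data has no header row; the timestamp column would still need the string handling discussed above):
import pandas as pd

# Read the CSV with the C engine, then hand the values over to numpy
df = pd.read_csv('abc.csv', engine='c', header=None)
arr = df.to_numpy()
print(arr.shape, arr.dtype)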

Efficient way of inputting large raster data into PyTables

I am looking for an efficient way to feed a 20 GB raster data file (GeoTiff) into PyTables for further out-of-core computation.
Currently I am reading it as a numpy array using GDAL and writing the numpy array into
PyTables using the code below:
import gdal, numpy as np, tables as tb
inraster = gdal.Open('infile.tif').ReadAsArray().astype(np.float32)
f = tb.openFile('myhdf.h5','w')
dataset = f.createCArray(f.root, 'mydata', atom=tb.Float32Atom(), shape=np.shape(inraster))
dataset[:] = inraster
dataset.flush()
dataset.close()
f.close()
inraster = None
Unfortunately, since my input file is extremely large, my PC runs out of memory while reading it as a numpy array. Is there any alternative way to feed the data into PyTables, or any suggestions to improve my code?
I do not have a GeoTiff file, so I fiddled around with a normal tif file. You may have to omit the 3 in the shape, and the band slice, when writing the data to the PyTables file. Essentially, I loop over the array without reading everything into memory in one go. You have to adjust n_chunks so that the chunk read in one go does not exceed your system memory.
import gdal
import numpy as np
import tables as tb

ds = gdal.Open('infile.tif')
x_total, y_total = ds.RasterXSize, ds.RasterYSize
n_chunks = 100

f = tb.openFile('myhdf.h5', 'w')
dataset = f.createCArray(f.root, 'mydata', atom=tb.Float32Atom(), shape=(3, y_total, x_total))

# prepare the chunk indices
x_offsets = np.linspace(0, x_total, n_chunks).astype(int)
x_offsets = list(zip(x_offsets[:-1], x_offsets[1:]))
y_offsets = np.linspace(0, y_total, n_chunks).astype(int)
y_offsets = list(zip(y_offsets[:-1], y_offsets[1:]))

for x1, x2 in x_offsets:
    for y1, y2 in y_offsets:
        dataset[:, y1:y2, x1:x2] = ds.ReadAsArray(xoff=x1, yoff=y1, xsize=x2-x1, ysize=y2-y1)
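As in the question's own code, it is worth flushing and closing the file once the loop finishes (a small addition, not part of the original answer):
dataset.flush()
f.close()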

Construct 1 dimensional array from several data files

I'm trying to write a small Python script to do some data analysis. I have several data files, each one being a single column of data. I know how to import each one in Python using numpy.loadtxt, which gives me back an ndarray. But I can't figure out how to concatenate these ndarrays: numpy.concatenate and numpy.append always give me error messages, even if I try to flatten them first.
Are you aware of a solution?
OK, as you were asking for code and data details, here is what my data files look like:
1.4533423
1.3709900
1.7832323
...
Just a column of float numbers. I have no problem importing a single file using:
data = numpy.loadtxt("data_filename")
My code trying to concatenate the arrays looks like this now (after trying numpy.concatenate and numpy.append I'm now trying numpy.insert):
data = numpy.zeros(0)  # creating an empty first array that will be incremented by each file after
for filename in sys.argv[1:]:
    temp = numpy.loadtxt(filename)
    numpy.insert(data, numpy.arange(len(temp), temp))
I'm passing the filenames when running my script with:
./my_script.py ALL_THE_DATAFILES
And the error message I get is:
TypeError: only length-1 arrays can be converted to Python scalars
numpy.concatenate will definitely be a valid choice - without sample data and sample code and corresponding error messages we cannot help further.
Alternatives would be numpy.r_ or numpy.hstack.
EDIT
This code snippet should do the job:
import sys
import numpy as np
filenames = sys.argv[1:]
arrays = [np.loadtxt(filename) for filename in filenames]
final_array = np.concatenate(arrays, axis=0)
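For the numpy.r_ alternative mentioned above, an equivalent minimal sketch (same assumptions about the single-column input files) would be:
import sys
import numpy as np

filenames = sys.argv[1:]
arrays = [np.loadtxt(filename) for filename in filenames]
# np.r_ concatenates 1-D arrays along their first axis, like np.concatenate;
# indexing it with a tuple of arrays is equivalent to listing them directly
final_array = np.r_[tuple(arrays)]
print(final_array.shape)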
