How to read more efficiently in Python?

I am trying to access a file that has 27,000+ lines, and reading it takes too long (30 minutes or more). To clarify, I am running this in a Coursera-hosted external Jupyter notebook, so I don't think it is a limitation of my own system.
with open(filename) as training_file:
    # Your code starts here
    file = training_file.read()
    lines = file.split('\n')
    images = []
    labels = []
    images = np.array(images)
    labels = np.array(labels)
    c = 0
    for line in lines[1:]:
        row = line.split(',')
        labels = np.append(labels, row[0])
        images = np.append(images, np.array_split(row[1:], 28))
        c += 1
        print(c)
    images = images.astype(np.float64)
    labels = labels.astype(np.float64)
    # Your code ends here
return images, labels

Use the built-in NumPy functions for reading a CSV (np.fromfile, np.genfromtxt, np.loadtxt, etc.) rather than rolling your own parser; they do the parsing in optimized code and are much faster than doing the same thing line by line in Python.
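For example, a minimal sketch of that idea, assuming the same layout as the code above (a header row, then a label followed by 784 pixel values per line); the function name is illustrative:
import numpy as np

def load_training_data(filename):
    # Let NumPy's compiled CSV parser do the splitting instead of Python-level loops.
    data = np.loadtxt(filename, delimiter=',', skiprows=1, dtype=np.float64)
    labels = data[:, 0]
    images = data[:, 1:].reshape(-1, 28, 28)   # 784 pixel values -> 28x28 per row
    return images, labels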

Are you sure the file reading is what takes so much time? Comment out the NumPy code and run only the file-reading part. In my opinion, numpy.append is the slowest part, since it copies the whole array on every call. Have a look at this: NumPy append vs Python append
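As a sketch of that point, assuming the layout from the question: accumulate rows in plain Python lists and convert to arrays once at the end, instead of calling np.append per row (which copies the whole array each time).
import numpy as np

images = []
labels = []
with open(filename) as training_file:    # `filename` as in the question
    next(training_file)                  # skip the header row
    for line in training_file:
        if not line.strip():             # skip any blank trailing line
            continue
        row = line.rstrip('\n').split(',')
        labels.append(row[0])
        images.append(np.array(row[1:], dtype=np.float64).reshape(28, 28))
images = np.array(images, dtype=np.float64)
labels = np.array(labels, dtype=np.float64)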

You can save on memory by reading the file line by line with a for loop:
with open("filename") as f:
    for line in f:
        <your code>
But as mentioned in other comments, there are CSV tools you can use that will be way faster: see csv or numpy.
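A sketch of that suggestion using the csv module, streaming the file row by row (same assumed layout as in the question: a header row, then a label followed by 784 pixel values):
import csv
import numpy as np

labels = []
images = []
with open(filename, newline='') as f:    # `filename` as in the question
    reader = csv.reader(f)
    next(reader)                         # skip the header row
    for row in reader:
        if not row:                      # skip any blank trailing line
            continue
        labels.append(float(row[0]))
        images.append(np.asarray(row[1:], dtype=np.float64).reshape(28, 28))
images = np.array(images)
labels = np.array(labels)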

Related

Saving deque to file minding performance and portability

I have a while loop that collects data from a microphone (replaced here by np.random.random() to make it more reproducible). I do some operations; let's say I take the abs().mean() here, because my output will be a one-dimensional array.
This loop is going to run for a LONG time (e.g., once a second for a week) and I am wondering what my options are for saving this. My main concerns are saving the data with acceptable performance and having the result be portable (e.g., .csv beats .npy).
The simple way: just append things to a .txt file. Could it be replaced by csv.gz, maybe using np.savetxt()? Would it be worth it?
The hdf5 way: this should be a nicer way, but reading the whole dataset just to append to it doesn't seem like good practice, nor better-performing than dumping into a text file. Is there another way to append to hdf5 files?
The npy way (code not shown): I could save this into a .npy file, but I would rather make it portable by using a format that could be read from any program.
from collections import deque

import numpy as np
import h5py

save_interval = save_interval_sec = 60  # assumed value; not defined in the original snippet

amplitudes = deque(maxlen=save_interval_sec)

# Read from the microphone in a continuous stream
while True:
    data = np.random.random(100)
    amplitude = np.abs(data).mean()
    print(amplitude, end="\r")
    amplitudes.append(amplitude)

    # Save the amplitudes to a text file every n iterations
    if len(amplitudes) == save_interval:
        with open("amplitudes.txt", "a") as f:
            for amp in amplitudes:
                f.write(str(amp) + "\n")
        amplitudes.clear()

    # Save the amplitudes to an HDF5 file every n iterations
    if len(amplitudes) == save_interval:
        # Convert the deque to a NumPy array
        amplitudes_array = np.array(amplitudes)
        # Open an HDF5 file
        with h5py.File("amplitudes.h5", "a") as f:
            # Get the existing dataset or create a new one if it doesn't exist
            dset = f.get("amplitudes")
            if dset is None:
                dset = f.create_dataset("amplitudes", data=amplitudes_array, dtype=np.float32,
                                        maxshape=(None,), chunks=True, compression="gzip")
            else:
                # Get the current size of the dataset
                current_size = dset.shape[0]
                # Resize the dataset to make room for the new data
                dset.resize((current_size + save_interval,))
                # Write the new data to the dataset
                dset[current_size:] = amplitudes_array
        # Clear the deque
        amplitudes.clear()

    # For debug only
    if len(amplitudes) > 3:
        break
Update
I get that the answer might depend a bit on the sampling frequency (once a second might be too slow) and on the data dimensions (a single column might be too little). I guess I asked because anything can work, but I always just dump to text. I am not sure where the breaking points are that tip the decision toward one method or the other.
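Not an answer from the thread, but as a sketch of the "csv.gz" idea from the question: gzip members can be concatenated, so opening the file in append mode on every flush still produces a file that reads back as one stream. The timestamp column and the function name are assumptions for illustration.
import csv
import gzip
import time

def flush_to_csv_gz(amplitudes, path="amplitudes.csv.gz"):
    # Each call appends a new gzip member; Python's gzip module reads
    # concatenated members transparently when the file is opened for reading.
    with gzip.open(path, "at", newline="") as f:
        writer = csv.writer(f)
        for amp in amplitudes:
            writer.writerow([time.time(), float(amp)])  # timestamp column is an assumption
    amplitudes.clear()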

How to read/print the header (first 100 lines) of a netCDF file in Python?

I have been trying to read the header (first 100 lines) of a netCDF file in Python, but have been facing some issues. I am familiar with the read_nc function available in the synoptReg package for R and with the ncread function that comes with MATLAB, as well as the read_csv function available in the pandas library. To my knowledge, however, there isn't anything similar for netCDF (.nc) files.
Noting this, and using answers from this question, I've tried the following (with no success):
with open(filepath, 'r') as f:
    for i in range(100):
        line = next(f).strip()
        print(line)
However, I receive the error below, even though I've ensured that tabs have not been mixed with spaces and that the for statement is inside the with block (the explanations given by the top answers to that question):
'utf-8' codec can't decode byte 0xbb in position 411: invalid start byte
I've also tried the following:
with open(filepath, 'r') as f:
    for i in range(100):
        line = [next(f) for i in range(100)]
        print(line)
and
from itertools import islice

with open('/Users/toshiro/Desktop/Projects/CCAR/Data/EDGAR/v6.0_CO2_excl_short-cycle_org_C_2010_TOTALS.0.1x0.1.nc', 'r') as f:
    for i in range(100):
        line = list(islice(f, 100))
        print(line)
But receive the same error as above. Are there any workarounds for this?
You can't. netCDFs are binary files and can't be interpreted as text.
If the files are netCDF3 encoded, you can read them in with scipy.io.netcdf_file. But it's much more likely they are netCDF4, in which case you'll need the netCDF4 package.
On top of this, I'd highly recommend the xarray package for reading and working with netCDF data. It supports a labeled N-dimensional array interface - think pandas indexes on each dimension of a numpy array.
Whether you go with netCDF4 or xarray, netCDF files are self-describing and support arbitrary reads, so you don't need to load the whole file to view the metadata. Similar to viewing the head of a text file, you can simply do:
import xarray as xr
ds = xr.open_dataset("path/to/myfile.nc")
print(ds) # this will give you a preview of your data
Additionally, xarray has an xr.Dataset.head function, which will display the first 5 (or N, if you provide an int) elements along each dimension:
ds.head() # display a 5x5x...x5 preview of your data
See the getting started guide and the User guide section on reading and writing netCDF files for more info.
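For completeness, a minimal sketch of the same header-style preview using the netCDF4 package mentioned above (the path is illustrative):
from netCDF4 import Dataset

ds = Dataset("path/to/myfile.nc")            # illustrative path
print(ds)                                    # dimensions, variables, global attributes (like `ncdump -h`)
for name, var in ds.variables.items():
    print(name, var.dimensions, var.shape)   # per-variable metadata, no data loaded
ds.close()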

Can I use h5py to write strings to an HDF5 file in one line, rather than looping over entries?

I need to store a list/array of strings in an HDF5 file using h5py. These strings are variable length. Following the examples I find online, I have a script that works.
import h5py

h5File = h5py.File('outfile.h5', 'w')
data = ['this', 'is', 'a', 'sentence']
dt = h5py.special_dtype(vlen=str)
dset = h5File.create_dataset('words', (len(data), 1), dtype=dt)
for i, word in enumerate(data):
    dset[i] = word
h5File.flush()
h5File.close()
However, when data gets very large, the write takes a long time as it's looping over each entry and inserting it into the file.
I thought I could do it all in one line, just as I would with ints or floats. But the following script fails. Note that I added some code to test that int works.
import h5py
h5File=h5py.File('outfile.h5','w')
data_numbers = [0, 1, 2, 3, 4]
data = ['this','is','a','sentence']
dt = h5py.special_dtype(vlen=str)
dset_num = h5File.create_dataset('numbers',(len(data_numbers),1),dtype=int,data=data_numbers)
print("Created the dataset with numbers!\n")
dset_str = h5File.create_dataset('words',(len(data),1),dtype=dt,data=data)
print("Created the dataset with strings!\n")
h5File.flush()
h5File.close()
That script gives the following output.
Created the dataset with numbers!
Traceback (most recent call last):
  File "write_strings_to_HDF5_file.py", line 32, in <module>
    dset_str = h5File.create_dataset('words',(len(data),1),dtype=dt,data=data)
  File "/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 170, in make_new_dset
    dset_id.write(h5s.ALL, h5s.ALL, data)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 211, in h5py.h5d.DatasetID.write
  File "h5py/h5t.pyx", line 1652, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1713, in h5py.h5t.py_create
TypeError: No conversion path for dtype: dtype('<U8')
I've read the documentation about UTF-8 encoding and tried a number of variations on the above syntax but I seem to be missing some key point. Maybe it can't be done?
Thanks to anyone who has a suggestion!
If anyone wants to see the slowdown on the example that works, here's a test case.
import h5py

h5File = h5py.File('outfile.h5', 'w')
sentence = ['this', 'is', 'a', 'sentence']
data = []
for i in range(10000):
    data += sentence
print(len(data))

dt = h5py.special_dtype(vlen=str)
dset = h5File.create_dataset('words', (len(data), 1), dtype=dt)
for i, word in enumerate(data):
    dset[i] = word
h5File.flush()
h5File.close()
Writing data 1 row at a time is the slowest way to write to an HDF5 file. You won't notice the performance issue when you write 100 rows, but you will see it as the number of rows increases. There is another answer that discusses that issue. See this: pytables writes much faster than h5py. Why? (Note: I am NOT suggesting you use PyTables. The linked answer shows performance for both h5py and PyTables.) As you can see, it takes a lot longer to write the same amount of data when it is written in many small chunks.
To improve performance, you need to write more data each time. Since you have all the data loaded in the list data, you can do it in one shot. It will be nearly instantaneous for 10,000 rows. The answer referenced in the comments touches on this technique (creating a np.array() from the list data). However, it works from small lists (1 per row), so it is not exactly the same. You have to take care when you create the array: you can't use NumPy's default Unicode dtype, because it isn't supported by h5py. Instead, you need dtype='S#'.
The code below shows how to convert your list of strings to a np.array() of strings. I also highly recommend you use Python's with/as context manager to open the file. This avoids situations where the file is accidentally left open due to an unexpected exit (from a crash or logic error).
Code below:
import h5py
import numpy as np

sentence = ['this', 'is', 'a', 'sentence']
data = []
for i in range(10_000):
    data += sentence
print(len(data))

longest_word = len(max(data, key=len))
print('longest_word=', longest_word)

dt = h5py.special_dtype(vlen=str)
arr = np.array(data, dtype='S' + str(longest_word))

with h5py.File('outfile.h5', 'w') as h5File:
    dset = h5File.create_dataset('words', data=arr, dtype=dt)
    print(dset.shape, dset.dtype)
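A possible follow-up, not from the original answer: with recent h5py versions (3.x), variable-length strings are read back as bytes, so decode them if you want Python str objects again.
import h5py

# Sketch: read the dataset written above and decode bytes back to str
# (h5py 3.x returns bytes for variable-length strings).
with h5py.File('outfile.h5', 'r') as h5File:
    words = [w.decode('utf-8') for w in h5File['words'][:10]]
    print(words)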

Saving multiple Numpy arrays to a Numpy binary file (Python)

I want to save multiple large numpy arrays to a numpy binary file to prevent my code from crashing, but the file seems to keep getting overwritten each time I add an array. The last array saved is what ends up in allarrays when save.npy is opened and read. Here is my code:
with open('save.npy', 'wb') as f:
    for num in range(500):
        array = np.random.rand(100, 400)
        np.save(f, array)

with open('save.npy', 'rb') as f:
    allarrays = np.load(f)
If the file existed before, I want it to be overwritten if the code is rerun. That's why I chose 'wb' instead of 'ab'.
alist = []
with open('save.npy', 'rb') as f:
    alist.append(np.load(f))
When you load, you have to collect all the loads in a list or something similar. np.load only loads one array, starting at the current file position.
You can try memory mapping to disk.
import numpy as np

# merge arrays using a memory-mapped file
mm = np.memmap("mmap.bin", dtype='float32', mode='w+', shape=(500, 100, 400))
for num in range(500):
    mm[num, :, :] = np.random.rand(100, 400)

# save the final array to an npy file
with open('save.npy', 'wb') as f:
    np.save(f, mm[:])
I ran into this problem as well and solved it in a not very neat way, but perhaps it's useful for others. It's inspired by hpaulj's approach, which is incomplete (i.e., it doesn't load the data back). Perhaps this is not how one is supposed to solve this problem to begin with... but anyhow, read on.
I had saved my data using a procedure similar to the OP's:
# Saving the data in a for-loop
with open(savefilename, 'wb') as f:
    for datafilename in list_of_datafiles:
        # Do the processing
        data_to_save = ...
        np.save(f, data_to_save)
And I ran into the problem that calling np.load() only loaded the last saved array, none of the rest. However, I knew that the data was in principle contained in the *.npy file, given that the file size was growing during the saving loop. What was required was simply to loop over the content of the file, calling the load command repeatedly. As I didn't quite know how many arrays were contained in the file, I simply ran the loading loop until it failed. It's hacky, but it works.
# Loading the data in a for-loop
data_to_read = []
with open(savefilename, 'rb') as f:   # note: 'rb', np.load needs a binary-mode handle
    while True:
        try:
            data_to_read.append(np.load(f))
        except Exception:
            print("all data has been read!")
            break
Then you can call, e.g., len(data_to_read) to see how many of the arrays are contained in it. Calling, e.g., data_to_read[0] gives you the first saved array, etc.
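If all the saved arrays share a shape, as the (100, 400) arrays in the question do, you can then stack the list back into a single array (a small sketch):
import numpy as np

allarrays = np.stack(data_to_read)  # shape (n_saves, 100, 400), assuming all arrays have equal shapes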

How to loop through multiple csv files and output their contents into one array?

I am working in Python and trying to take x, y, z coordinates from multiple LAZ files and put them into one array that can be used for another analysis. I am trying to automate this task, as I have about 2000 files to turn into one or even 10 arrays. The example involves two files, but I can't get the loop to work properly. I think I am not correctly naming my variables. Below is an example of the code I have been trying to write (note that I am extremely new to programming, so apologies if this is horrible code).
Create a list of .las files, then turn them into an array (attempt at better automation):
import numpy as np
from laspy.file import File
import glob

# create list of vegetation files to be opened
VegList = sorted(glob.glob('/Users/sophiathompson/Desktop/copys/Clips/*.las'))
for f in VegList:
    print(f)
    Veg = File(filename=f, mode="r")  # Open the file
    points = Veg.get_points()         # Grab all of the points from the file.
    print(points)                     # this is a check that the number of rows changes at the end
    print("array shape:")
    print(points.shape)
    VegListCoords = np.vstack((Veg.x, Veg.y, Veg.z)).transpose()
    print(VegListCoords)
This block reads both files but fills VegListCoords with the results of only the second file in the list. I need it to hold the records from both. If this is a horrible way to go about it, I am very open to a new approach.
You keep overwriting VegListCoords by assigning it the values from the last file you opened.
Instead, initialize it once at the beginning:
VegListCoords = []
and inside the loop do:
VegListCoords.append(np.vstack((Veg.x, Veg.y, Veg.z)).transpose())
If you want them in one numpy array at the end, use np.concatenate, as in the sketch below.
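Putting the pieces together, a sketch of the whole loop with those changes (using the laspy 1.x File API from the question):
import glob

import numpy as np
from laspy.file import File

VegList = sorted(glob.glob('/Users/sophiathompson/Desktop/copys/Clips/*.las'))
VegListCoords = []
for f in VegList:
    Veg = File(filename=f, mode="r")            # open each .las file
    VegListCoords.append(np.vstack((Veg.x, Veg.y, Veg.z)).transpose())
    Veg.close()
all_coords = np.concatenate(VegListCoords)      # one (N, 3) array of x, y, z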
