"Reading in" large text file into hdf5 via PyTables or PyHDF? - python

I'm attempting some statistics using SciPy, but my input dataset is quite large (~1.9GB) and in dbf format.
The file is large enough that Numpy returns an error message when I try to create an array with genfromtxt. (I've got 3GB ram, but running win32).
i.e.:
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
ind_sum = numpy.genfromtxt(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf", dtype = (int, int, int, float, float, int), names = True, usecols = (5))
File "C:\Python26\ArcGIS10.0\lib\site-packages\numpy\lib\npyio.py", line 1335, in genfromtxt
for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):
MemoryError
From other posts, I see that the chunked arrays provided by PyTables could be useful, but my problem is reading this data in the first place. In other words, PyTables or PyHDF can easily create the HDF5 output I want, but what should I do to get my data into an array first?
For instance:
import numpy, scipy, tables
h5file = tables.openFile(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\HET_IND_SUM2.h5", mode = "w", title = "Diversity Index Results")
group = h5file.createGroup("/", "IND_SUM", "Aggregated Index Values")
and then I could either create a table or array, but how do I refer back to the original dbf data? In the description?
Thanks for any thoughts you might have!

If the data is too big to fit in memory, you can work with a memory-mapped file (it's like a numpy array but stored on disk - see docs here), though you may be able to get similar results using HDF5 depending on what operations you need to perform on the array. Obviously this will make many operations slower but this is better than not being able to do them at all.
Because you are hitting a memory limit, I think you cannot use genfromtxt. Instead, you should iterate through your text file one line at a time, and write the data to the relevant position in the memmap/hdf5 object.
It is not clear what you mean by "referring back to the original dbf data"? Obviously you can just store the filename it came from somewhere. HDF5 objects have "attributes" which are designed to store this kind of meta-data.
Also, I have found that using h5py is a much simpler and cleaner way to access hdf5 files than pytables, though this is largely a matter of preference.
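For what it's worth, a minimal sketch of that line-by-line approach with a memmap, assuming the dbf has been exported to a comma-delimited text file and that the row count, column index, and file names (all placeholders below) are known in advance:
import numpy as np

n_rows = 1000000      # placeholder: count the lines in the file first
col = 5               # placeholder: index of the one column you need

# preallocate a disk-backed array, then fill it one line at a time
out = np.memmap('ind_sum_col.dat', dtype=np.float64, mode='w+', shape=(n_rows,))
with open('IND_SUM.csv') as fh:   # placeholder path to a text export of the dbf
    next(fh)                      # skip the header line
    for i, line in enumerate(fh):
        out[i] = float(line.split(',')[col])
out.flush()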

If the data is in a dbf file, you might try my dbf package -- it only keeps the records in memory that are being accessed, so you should be able to cycle through the records pulling out the data that you need:
import dbf
table = dbf.Table(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf")
sums = [0, 0, 0, 0.0, 0.0, 0]
for record in table:
    for index in range(len(sums)):
        sums[index] += record[index]
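If only the single column from the genfromtxt call above is needed (column 5 in that example), the same loop can accumulate just that field while still keeping only one record in memory at a time:
total = 0.0
count = 0
for record in table:
    total += record[5]   # column 5, as in the usecols=(5) call above
    count += 1
print('mean of column 5:', total / count)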

Related

Can I use h5py to write strings to an HDF5 file in one line, rather than looping over entries?

I need to store a list/array of strings in an HDF5 file using h5py. These strings are variable length. Following the examples I find online, I have a script that works.
import h5py
h5File=h5py.File('outfile.h5','w')
data=['this','is','a','sentence']
dt = h5py.special_dtype(vlen=str)
dset = h5File.create_dataset('words',(len(data),1),dtype=dt)
for i,word in enumerate(data):
    dset[i] = word
h5File.flush()
h5File.close()
However, when data gets very large, the write takes a long time as it's looping over each entry and inserting it into the file.
I thought I could do it all in one line, just as I would with ints or floats. But the following script fails. Note that I added some code to test that int works.
import h5py
h5File=h5py.File('outfile.h5','w')
data_numbers = [0, 1, 2, 3, 4]
data = ['this','is','a','sentence']
dt = h5py.special_dtype(vlen=str)
dset_num = h5File.create_dataset('numbers',(len(data_numbers),1),dtype=int,data=data_numbers)
print("Created the dataset with numbers!\n")
dset_str = h5File.create_dataset('words',(len(data),1),dtype=dt,data=data)
print("Created the dataset with strings!\n")
h5File.flush()
h5File.close()
That script gives the following output.
Created the dataset with numbers!
Traceback (most recent call last):
File "write_strings_to_HDF5_file.py", line 32, in <module>
dset_str = h5File.create_dataset('words',(len(data),1),dtype=dt,data=data)
File "/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/group.py", line 136, in create_dataset
dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
File "/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 170, in make_new_dset
dset_id.write(h5s.ALL, h5s.ALL, data)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5d.pyx", line 211, in h5py.h5d.DatasetID.write
File "h5py/h5t.pyx", line 1652, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1713, in h5py.h5t.py_create
TypeError: No conversion path for dtype: dtype('<U8')
I've read the documentation about UTF-8 encoding and tried a number of variations on the above syntax but I seem to be missing some key point. Maybe it can't be done?
Thanks to anyone who has a suggestion!
If anyone wants to see the slowdown on the example that works, here's a test case.
import h5py
h5File=h5py.File('outfile.h5','w')
sentence=['this','is','a','sentence']
data = []
for i in range(10000):
    data += sentence
print(len(data))
dt = h5py.special_dtype(vlen=str)
dset = h5File.create_dataset('words',(len(data),1),dtype=dt)
for i,word in enumerate(data):
    dset[i] = word
h5File.flush()
h5File.close()
Writing data 1 row at a time is the slowest way to write to an HDF5 file. You won't notice the performance issue when you write 100 rows, but you will see it as the number of rows increases. There is another answer that discusses that issue. See this: pytables writes much faster than h5py. Why? (Note: I am NOT suggesting you use PyTables. The linked answer shows performance for both h5py and PyTables.) As you can see, it takes a lot longer to write the same amount of data when writing a lot of small chunks.
To improve performance, you need to write more data each time. Since you have all the data loaded in the list data, you can do it in one shot. It will be nearly instantaneous for 10,000 rows. The answer referenced in the comments touches on this technique (creating a np.array() from the list data), although it works from small lists (one per row), so it's not exactly the same. You have to take care when you create the array: you can't use NumPy's default Unicode dtype, because it isn't supported by h5py. Instead, you need dtype='S#'.
The code below shows how to convert your list of strings to a np.array() of strings. Also, I highly recommend you use Python's with/as context manager to open the file. This avoids situations where the file is accidentally left open due to an unexpected exit (a crash or logic error).
Code below:
import h5py
import numpy as np
sentence=['this','is','a','sentence']
data = []
for i in range(10_000):
    data += sentence
print(len(data))
longest_word=len(max(data, key=len))
print('longest_word=',longest_word)
dt = h5py.special_dtype(vlen=str)
arr = np.array(data,dtype='S'+str(longest_word))
with h5py.File('outfile.h5','w') as h5File:
    dset = h5File.create_dataset('words',data=arr,dtype=dt)
    print(dset.shape, dset.dtype)
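One follow-up worth noting: with recent h5py versions, variable-length strings come back as bytes when you read the dataset, so you may need to decode them. A quick check of the file written above might look like this:
import h5py

with h5py.File('outfile.h5', 'r') as h5File:
    words = h5File['words'][:5]
    print(words)                               # bytes objects, e.g. b'this', b'is', ...
    print([w.decode('utf-8') for w in words])  # back to Python str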

How can I create a Numpy Array that is much bigger than my RAM from 1000s of CSV files?

I have 1000s of CSV files that I would like to append and create one big numpy array. The problem is that the numpy array would be much bigger than my RAM. Is there a way of writing a bit at a time to disk without having the entire array in RAM?
Also is there a way of reading only a specific part of the array from disk at a time?
When working with numpy and large arrays, there are several approaches depending on what you need to do with that data.
The simplest answer is to use less data. If your data has lots of repeating elements, it is often possible to use a sparse array from scipy because the two libraries are heavily integrated.
Another answer (IMO: the correct solution to your problem) is to use a memory mapped array. This will let numpy only load the necessary parts to ram when needed, and leave the rest on disk. The files containing the data can be simple binary files created using any number of methods, but the built-in python module that would handle this is struct. Appending more data would be as simple as opening the file in append mode, and writing more bytes of data. Make sure that any references to the memory mapped array are re-created any time more data is written to the file so the information is fresh.
Finally, there is compression. Numpy can compress arrays with savez_compressed, which can then be opened with numpy.load. Importantly, compressed numpy files cannot be memory-mapped and must be loaded into memory entirely. Loading one column at a time may be able to get you under the threshold, but this could similarly be applied to other methods to reduce memory usage. Numpy's built-in compression techniques will only save disk space, not memory. There may exist other libraries that perform some sort of streaming compression, but that is beyond the scope of my answer.
Here is an example of putting binary data into a file then opening it as a memory-mapped array:
import numpy as np
#open a file for data of a single column
with open('column_data.dat', 'wb') as f:
    #for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64) #represents one column of data
        f.write(csv_data.tobytes())
#open the array as a memory-mapped file
column_mmap = np.memmap('column_data.dat', dtype=np.float64)
#read some data
print(np.mean(column_mmap[0:1024]))
#write some data
column_mmap[0:512] = .5
#deletion closes the memory-mapped file and flushes changes to disk.
# del isn't specifically needed as python will garbage collect objects no
# longer accessible. If for example you intend to read the entire array,
# you will need to periodically make sure the array gets deleted and re-created
# or the entire thing will end up in memory again. This could be done with a
# function that loads and operates on part of the array, then when the function
# returns and the memory-mapped array local to the function goes out of scope,
# it will be garbage collected. Calling such a function would not cause a
# build-up of memory usage.
del column_mmap
#write some more data to the array (not while the mmap is open)
with open('column_data.dat', 'ab') as f:
    #for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64) #represents one column of data
        f.write(csv_data.tobytes())
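To illustrate the compression route mentioned above, here is a minimal sketch using savez_compressed and np.load (the array names are placeholders; note that each array you access is still decompressed into memory in full):
import numpy as np

col_a = np.random.rand(1024)
col_b = np.random.rand(1024)

# store several named arrays in one compressed .npz archive
np.savez_compressed('columns.npz', col_a=col_a, col_b=col_b)

# the archive is lazy per array: only the arrays you index get decompressed
with np.load('columns.npz') as archive:
    print(archive['col_a'].mean())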

Extraction of file using numpy memory map python

Trying to extract a sample of a large file using numpy's memmap:
# indices - boolean vector indicating which lines we want to extract
file_big = np.memmap('path_big_file',dtype='int16',shape=(indices.shape[0],L))
file_small = np.memmap('new_path_for_small_file',dtype='int16',shape=(indices.sum(),L))
The expected result would be that a new file will be created with only part of the data, as identified by the indices.
# place data in files:
file_small[:] = file_big[indices]
The above is the procedure described in the manual. It does not work: it fails with a not-enough-memory error, even though memory should not be an issue, since I'm only using memmap and not loading the data.
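One likely culprit is that the boolean fancy index file_big[indices] builds the entire selection as an in-memory array before it is written to the new memmap. A minimal sketch of a workaround, copying the selected rows in fixed-size chunks (the chunk size is an arbitrary placeholder):
import numpy as np

chunk = 10_000                       # placeholder: rows handled per step
dst = 0
for start in range(0, indices.shape[0], chunk):
    sel = indices[start:start + chunk]
    n = int(sel.sum())
    if n:
        # only this chunk's selected rows are read into memory at once
        file_small[dst:dst + n] = file_big[start:start + chunk][sel]
        dst += n
file_small.flush()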

Trying to size down HDF5 File by changing index field types using h5py

I have a very large CSV File (~12Gb) that looks something like this:
posX,posY,posZ,eventID,parentID,clockTime
-117.9853515625,60.2998046875,0.29499998688697815,0,0,0
-117.9853515625,60.32909393310547,0.29499998688697815,0,0,0
-117.9560546875,60.2998046875,0.29499998688697815,0,0,0
-117.9560546875,60.32909393310547,0.29499998688697815,0,0,0
-117.92676544189453,60.2998046875,0.29499998688697815,0,0,0
-117.92676544189453,60.32909393310547,0.29499998688697815,0,0,0
-118.04051208496094,60.34012985229492,4.474999904632568,0,0,0
-118.04051208496094,60.36941909790039,4.474999904632568,0,0,0
-118.04051208496094,60.39870834350586,4.474999904632568,0,0,0
I want to convert this CSV file into the HDF5 format using the library h5py while also lowering the total file size by setting the field / index types, e.g. saying:
Save posX, posY and posZ as float32. Save eventID, parentID and clockTime as int32 or something along those lines.
Note: I need to chunk the data in some form when I read it in to avoid Memory Errors.
However, I am unable to get the desired result. What I have tried so far:
Using Pandas own methods following this guide: How to write a large csv file to hdf5 in python?
This creates the file, but I'm somehow unable to change the types, and the file remains too big (~10.7Gb). The field types are float64 and int64.
I also tried to split the CSV up into parts (using split -n x myfile.csv) before working with the increments. I ran into some data errors at the beginning and end of each file, which I was able to fix by removing said lines using sed. Then I tried out the following code:
import pandas as pd
import h5py
PATH_csv = "/home/MYNAME/Documents/Workfolder/xaa" #xaa is my csv increment
DATA_csv = pd.read_csv(PATH_csv)
with h5py.File("pct_data-hdf5.h5", "a") as DATA_hdf:
dset = DATA_hdf.create_dataset("posX", data=DATA_csv["posX"], dtype="float32")
Sadly this created the file and the table but didn't write any data into it.
Expectation
Creating a HDF5 File containing the data of a large CSV file while also changing the variable type of each index.
If something is unclear, please ask me for clarification. I'm still a beginner!
Have you considered the numpy module?
It has a handy function (genfromtxt) to read CSV data with headers into a Numpy array. You define the dtype. The array is suitable for loading into HDF5 with h5py's create_dataset() function.
See code below. I included 2 print statements. The first shows the dtype names created from the CSV headers. The second shows how you can access the data in the numpy array by field (column) name.
import h5py
import numpy as np
PATH_csv = 'SO_55576601.csv'
csv_dtype= ('f8', 'f8', 'f8', 'i4', 'i4', 'i4' )
csv_data = np.genfromtxt(PATH_csv, dtype=csv_dtype, delimiter=',', names=True)
print (csv_data.dtype.names)
print (csv_data['posX'])
with h5py.File('SO_55576601.h5', 'w') as h5f:
    dset = h5f.create_dataset('CSV_data', data=csv_data)
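Since the question notes that the CSV has to be read in chunks to avoid memory errors, here is a minimal sketch combining the same idea with pandas' chunked reader and a resizable h5py dataset; the chunk size is a placeholder, and the float32/int32 field types follow the question's request rather than the code above:
import h5py
import numpy as np
import pandas as pd

csv_dtype = np.dtype([('posX', 'f4'), ('posY', 'f4'), ('posZ', 'f4'),
                      ('eventID', 'i4'), ('parentID', 'i4'), ('clockTime', 'i4')])

with h5py.File('SO_55576601.h5', 'w') as h5f:
    dset = h5f.create_dataset('CSV_data', shape=(0,), maxshape=(None,), dtype=csv_dtype)
    for chunk in pd.read_csv('SO_55576601.csv', chunksize=1_000_000):
        # cast each pandas column into the narrower field of a structured array
        recs = np.empty(len(chunk), dtype=csv_dtype)
        for name in csv_dtype.names:
            recs[name] = chunk[name].to_numpy()
        # grow the dataset and append this chunk
        dset.resize(dset.shape[0] + len(recs), axis=0)
        dset[-len(recs):] = recs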

Elephas: Spark+Keras - Creating an RDD from numpy arrays for large data [duplicate]

My spark application is using RDD's of numpy arrays.
At the moment, I'm reading my data from AWS S3, and it's represented as a simple text file where each line is a vector and each element is separated by a space, for example:
1 2 3
5.1 3.6 2.1
3 0.24 1.333
I'm using numpy's function loadtxt() in order to create a numpy array from it.
However, this method seems to be very slow, and my app is spending too much time (I think) converting my dataset to a numpy array.
Can you suggest a better way of doing it? For example, should I keep my dataset as a binary file, or should I create the RDD in another way?
Some code for how I create my RDD:
data = sc.textFile("s3_url", initial_num_of_partitions).mapPartitions(readData)
readData function:
def readPointBatch(iterator):
    return [np.loadtxt(iterator, dtype=np.float64)]
It would be a little bit more idiomatic and slightly faster to simply map with numpy.fromstring as follows:
import numpy as np
path = ...
initial_num_of_partitions = ...
data = (sc.textFile(path, initial_num_of_partitions)
        .map(lambda s: np.fromstring(s, dtype=np.float64, sep=" ")))
but ignoring that, there is nothing particularly wrong with your approach. As far as I can tell, with a basic configuration, it is roughly twice as slow as simply reading the data and slightly slower than creating dummy numpy arrays.
So it looks like the problem is somewhere else. It could be cluster misconfiguration, cost of fetching data from S3 or even unrealistic expectations.
You shouldn't use numpy while working with Spark. Spark has its own methodology for processing data, which ensures that your sometimes really big files aren't loaded into memory all at once, exceeding the memory limit. You should load your file like this with Spark:
data = sc.textFile("s3_url", initial_num_of_partitions) \
.map(lambda row: map(lambda x: float(x), row.split(' ')))
Now this will output an RDD like this, based on your example:
>>> print(data.collect())
[[1.0, 2.0, 3.0], [5.1, 3.6, 2.1], [3.0, 0.24, 1.333]]
Edit: some suggestions on file formats and numpy usage:
Text files are just as good as CSV, TSV, Parquet or anything you feel comfortable with. Binary files are not preferred, according to the Spark docs on loading binary files:
binaryFiles(path, minPartitions=None)
Note: Experimental
Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Note: Small files are preferred, large file is also allowable, but may cause bad performance.
As for numpy usage, if I were you I'd definitely try to replace any external package with native Spark, for example pyspark.mllib.random for randomization: http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.random
The best thing to do under these circumstances is to use the pandas library for IO.
Please refer to this question: pandas read_csv() and python iterator as input.
There you will see how to replace the np.loadtxt() function so that it is much faster to create an RDD of numpy arrays.
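A minimal sketch of what that replacement could look like, assuming space-separated values with no header as in the example above (the function and variable names mirror the question and are otherwise placeholders):
import numpy as np
import pandas as pd
from io import StringIO

def readPointBatch(iterator):
    # parse the whole partition in one pandas call instead of np.loadtxt
    buf = StringIO("\n".join(iterator))
    df = pd.read_csv(buf, sep=" ", header=None, dtype=np.float64)
    return [df.values]

data = sc.textFile("s3_url", initial_num_of_partitions).mapPartitions(readPointBatch)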
