Reading pandas from disk during concurrent process pool - python

I've written a CLI tool to generate simulations, and I'm hoping to generate about 10k (~10 minutes) for each cut of data; I have ~200 cuts. I have functions that do this fine in a for loop, but when I converted it to concurrent.futures.ProcessPoolExecutor() I realized that multiple processes can't read in the same pandas DataFrame.
Here's the smallest example I could think of:
import concurrent.futures
import pandas as pd

def example():
    # This is a static table with basic information like distributions
    df = pd.read_parquet("batch/data/mappings.pq")
    # Then there's a bunch of etl, even reading in a few other static tables
    return sum(df.shape)

def main():
    results = []
    with concurrent.futures.ProcessPoolExecutor() as pool:
        futr_results = [pool.submit(example) for _ in range(100)]
        done_results = concurrent.futures.as_completed(futr_results)
        for _ in futr_results:
            results.append(next(done_results).result())
    return results

if __name__ == "__main__":
    print(main())
Errors:
<jemalloc>: background thread creation failed (11)
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
Traceback (most recent call last):
  File "batch/testing.py", line 19, in <module>
    main()
  File "batch/testing.py", line 14, in main
    results.append(next(done_results).result())
  File "/home/a114383/miniconda3/envs/hailsims/lib/python3.7/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/home/a114383/miniconda3/envs/hailsims/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I'm hoping there's a quick and dirty way to read these (I'm guessing without reference?); otherwise it's looking like I'll need to create all the parameters up front instead of getting them on the fly.

Three things I would try:
Pandas has an option to use either PyArrow or fastparquet as the engine when reading parquet files. Try switching to the other one - what you're hitting looks like a bug.
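For example, the engine is just a keyword argument to read_parquet (which engine your environment currently defaults to is an assumption on my part; fastparquet has to be installed separately):
import pandas as pd

# try the non-default engine; swap the names if fastparquet is already your default
df = pd.read_parquet("batch/data/mappings.pq", engine="fastparquet")
# or force PyArrow explicitly:
# df = pd.read_parquet("batch/data/mappings.pq", engine="pyarrow")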
Try forcing pandas to open the file in read-only mode to prevent conflicts due to the file being locked:
pd.read_parquet(open("batch/data/mappings.pq", "rb"))
# Parquet is a binary format, so pandas needs binary data here - use "rb" rather than "r"
Try loading the file into an in-memory buffer and then handing that to pandas - this avoids any interaction with the file from pandas itself (use BytesIO rather than StringIO, since parquet is binary):
import io

data = io.BytesIO(open("batch/data/mappings.pq", "rb").read())
pd.read_parquet(data)

Related

How would I go about converting a .csv to an .arrow file without loading it all into memory?

I found a similar question here: Read CSV with PyArrow
In this answer it references sys.stdin.buffer and sys.stdout.buffer, but I am not exactly sure how that would be used to write the .arrow file, or name it.
I can't seem to find the exact information I am looking for in the docs for pyarrow. My file will not have any NaNs, but it will have a timestamped index. The file is ~100 GB, so loading it into memory simply isn't an option. I tried changing the code, but as I expected, the code ended up overwriting the previous file on every loop iteration.
This is my first post. I would like to thank all the contributors who answered 99.9% of my other questions before I had even asked them.
import sys
import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1  ### used one line chunks for a small test

def main():
    writer = None
    for split in pd.read_csv(sys.stdin.buffer, chunksize=SPLIT_ROWS):
        table = pa.Table.from_pandas(split)
        # Write out to file
        with pa.OSFile('test.arrow', 'wb') as sink:  ### no append mode yet
            with pa.RecordBatchFileWriter(sink, table.schema) as writer:
                writer.write_table(table)
                writer.close()

if __name__ == "__main__":
    main()
Below is the command I ran on the command line:
>cat data.csv | python test.py
As suggested by @Pace, you should consider moving the output file creation outside of the reading loop. Something like this:
import sys
import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1  ### used one line chunks for a small test

def main():
    # Write out to file
    # (the schema must be known before the writer is created -
    #  the accepted solution below derives it from the first rows of the CSV)
    with pa.OSFile('test.arrow', 'wb') as sink:  ### no append mode yet
        with pa.RecordBatchFileWriter(sink, table.schema) as writer:
            for split in pd.read_csv('data.csv', chunksize=SPLIT_ROWS):
                table = pa.Table.from_pandas(split)
                writer.write_table(table)

if __name__ == "__main__":
    main()
You also don't have to use sys.stdin.buffer if you would prefer to specify the input and output files directly. You could then just run the script as:
python test.py
By using with statements, both writer and sink will be automatically closed afterwards (in this case when main() returns). This means it should not be necessary to include an explicit close() call.
Solution adapted from @Martin-Evans' code:
Closed the file after the for loop, as suggested by @Pace.
import sys
import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1000000

def main():
    ### reads first two lines to define schema
    schema = pa.Table.from_pandas(pd.read_csv('Data.csv', nrows=2)).schema
    with pa.OSFile('test.arrow', 'wb') as sink:
        with pa.RecordBatchFileWriter(sink, schema) as writer:
            for split in pd.read_csv('Data.csv', chunksize=SPLIT_ROWS):
                table = pa.Table.from_pandas(split)
                writer.write_table(table)
            writer.close()  # redundant inside the with block, but harmless

if __name__ == "__main__":
    main()
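As a follow-up (my own sketch, not part of the original answers): one way to sanity-check test.arrow without pulling the whole file back into memory is to open it with the matching file reader and look at a single record batch:
import pyarrow as pa

# Open the Arrow IPC file written above and inspect it batch by batch
with pa.OSFile('test.arrow', 'rb') as source:
    reader = pa.RecordBatchFileReader(source)
    print('record batches in file:', reader.num_record_batches)
    first = reader.get_batch(0)  # only this batch is materialized in memory
    print(first.schema)
    print('rows in first batch:', first.num_rows)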

Can I use h5py to write strings to an HDF5 file in one line, rather than looping over entries?

I need to store a list/array of strings in an HDF5 file using h5py. These strings are variable length. Following the examples I find online, I have a script that works.
import h5py

h5File = h5py.File('outfile.h5', 'w')
data = ['this', 'is', 'a', 'sentence']

dt = h5py.special_dtype(vlen=str)
dset = h5File.create_dataset('words', (len(data), 1), dtype=dt)
for i, word in enumerate(data):
    dset[i] = word

h5File.flush()
h5File.close()
However, when data gets very large, the write takes a long time as it's looping over each entry and inserting it into the file.
I thought I could do it all in one line, just as I would with ints or floats. But the following script fails. Note that I added some code to test that int works.
import h5py
h5File=h5py.File('outfile.h5','w')
data_numbers = [0, 1, 2, 3, 4]
data = ['this','is','a','sentence']
dt = h5py.special_dtype(vlen=str)
dset_num = h5File.create_dataset('numbers',(len(data_numbers),1),dtype=int,data=data_numbers)
print("Created the dataset with numbers!\n")
dset_str = h5File.create_dataset('words',(len(data),1),dtype=dt,data=data)
print("Created the dataset with strings!\n")
h5File.flush()
h5File.close()
That script gives the following output.
Created the dataset with numbers!
Traceback (most recent call last):
  File "write_strings_to_HDF5_file.py", line 32, in <module>
    dset_str = h5File.create_dataset('words',(len(data),1),dtype=dt,data=data)
  File "/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 170, in make_new_dset
    dset_id.write(h5s.ALL, h5s.ALL, data)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 211, in h5py.h5d.DatasetID.write
  File "h5py/h5t.pyx", line 1652, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1713, in h5py.h5t.py_create
TypeError: No conversion path for dtype: dtype('<U8')
I've read the documentation about UTF-8 encoding and tried a number of variations on the above syntax but I seem to be missing some key point. Maybe it can't be done?
Thanks to anyone who has a suggestion!
If anyone wants to see the slowdown on the example that works, here's a test case.
import h5py

h5File = h5py.File('outfile.h5', 'w')

sentence = ['this', 'is', 'a', 'sentence']
data = []
for i in range(10000):
    data += sentence
print(len(data))

dt = h5py.special_dtype(vlen=str)
dset = h5File.create_dataset('words', (len(data), 1), dtype=dt)
for i, word in enumerate(data):
    dset[i] = word

h5File.flush()
h5File.close()
Writing data 1 row at a time is the slowest way to write to an HDF5 file. You won't notice the performance issue when you write 100 rows, but you will see it as the number of rows increases. There is another answer that discusses this issue. See this: pytables writes much faster than h5py. Why? (Note: I am NOT suggesting you use PyTables; the linked answer shows performance for both h5py and PyTables.) As you can see there, it takes a lot longer to write the same amount of data when it is written in many small chunks.
To improve performance, you need to write more data each time. Since you have all the data loaded in the list data, you can do it in one shot. It will be nearly instantaneous for 10,000 rows. The answer referenced in the comments touches on this technique (creating a np.array() from the list data), although it works with small lists (1 per row), so it's not exactly the same. You have to take care when you create the array: you can't use NumPy's default Unicode dtype, because it isn't supported by h5py. Instead, you need dtype='S#'.
The code below shows how to convert your list of strings to a np.array() of strings. Also, I highly recommend you use Python's with/as context manager to open the file. This avoids situations where the file is accidentally left open due to an unexpected exit (from a crash or a logic error).
Code below:
import h5py
import numpy as np

sentence = ['this', 'is', 'a', 'sentence']
data = []
for i in range(10_000):
    data += sentence
print(len(data))

longest_word = len(max(data, key=len))
print('longest_word=', longest_word)

dt = h5py.special_dtype(vlen=str)
arr = np.array(data, dtype='S' + str(longest_word))

with h5py.File('outfile.h5', 'w') as h5File:
    dset = h5File.create_dataset('words', data=arr, dtype=dt)
    print(dset.shape, dset.dtype)
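A quick read-back check of the one-shot write (my own sketch; depending on the h5py version, variable-length strings come back as bytes or str, so the decode is guarded):
import h5py

with h5py.File('outfile.h5', 'r') as h5File:
    words = h5File['words']
    print(words.shape)  # (40000, 1)
    sample = [w[0] for w in words[:4]]
    # decode if this h5py version returns bytes
    print([w.decode('utf-8') if isinstance(w, bytes) else w for w in sample])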

Reading custom file format to Dask dataframe

I have a huge custom text file (I can't load the entire data into one pandas dataframe) which I want to read into a Dask dataframe. I wrote a generator to read and parse the data in chunks and create pandas dataframes. I want to load these pandas dataframes into a Dask dataframe and perform operations on the resulting dataframe (things like creating calculated columns, extracting parts of the dataframe, plotting, etc.).
I tried using Dask bag but couldn't succeed.
So I decided to write the resulting dataframes into an HDFStore and then use Dask to read from the HDFStore file. This worked well when I was doing it on my own computer. Code below.
cc = read_custom("demo.xyz", chunks=1000)  # Generator of pandas dataframes

from pandas import HDFStore
s = HDFStore("demo.h5")
for c in cc:
    s.append("data", c, format='t', append=True)
s.close()

import dask.dataframe as dd
ddf = dd.read_hdf("demo.h5", "data", chunksize=100000)
seqv = (
    (
        (ddf.sxx - ddf.syy) ** 2
        + (ddf.syy - ddf.szz) ** 2
        + (ddf.szz - ddf.sxx) ** 2
        + 6 * (ddf.sxy ** 2 + ddf.syz ** 2 + ddf.sxz ** 2)
    )
    / 2
) ** 0.5
seqv.compute()
Since the last compute was slow, I decided to distribute it over a few systems on my LAN: I started a scheduler on my machine and a couple of workers on other systems, then fired up a Client as below.
from dask.distributed import Client
client = Client('mysystemip:8786') #Establishing connection with the scheduler all fine.
I then read in the Dask dataframe as before. However, I got the error below when I executed seqv.compute():
HDF5ExtError: HDF5 error back trace

  File "H5F.c", line 509, in H5Fopen
    unable to open file
  File "H5Fint.c", line 1400, in H5F__open
    unable to open file
  File "H5Fint.c", line 1615, in H5F_open
    unable to lock the file
  File "H5FD.c", line 1640, in H5FD_lock
    driver lock request failed
  File "H5FDsec2.c", line 941, in H5FD_sec2_lock
    unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'

End of HDF5 error back trace

Unable to open/create file 'demo.h5'
I have made sure that all workers have access to the demo.h5 file. I tried passing lock=False to read_hdf and got the same error.
Isn't this possible to do? Maybe try another file format? I guess writing each pandas dataframe to a separate file might work, but I'm trying to avoid it (I don't even want the intermediate HDF5 file). Before I go down that route, I'd like to know if there is a better approach to the problem.
Thanks for any suggestions!
If you want to read data from a custom format in a text file I recommend using the dask.bytes.read_bytes function, which returns a list of delayed objects, each of which points to a block of bytes from your file. Those blocks will be cleanly separated by a line delimiter by default.
Something like this might work:
import dask
import dask.bytes
import dask.dataframe
import pandas

def parse_bytes(b: bytes) -> pandas.DataFrame:
    ...

# read_bytes returns a sample plus, for each input file, a list of delayed
# objects, each of which points to one newline-delimited block of bytes
sample, blocks = dask.bytes.read_bytes("my-file.txt", delimiter=b"\n")
dataframes = [dask.delayed(parse_bytes)(block) for block in blocks[0]]
df = dask.dataframe.from_delayed(dataframes)
https://docs.dask.org/en/latest/remote-data-services.html#dask.bytes.read_bytes
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.from_delayed
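To make that concrete, here is a hedged sketch of what parse_bytes might look like if the custom format were whitespace-delimited rows containing the stress columns from the question (the column layout and names are assumptions, not your real format):
import io
import pandas as pd

# Hypothetical parser: each block is a chunk of complete lines from the file
def parse_bytes(b: bytes) -> pd.DataFrame:
    cols = ["sxx", "syy", "szz", "sxy", "syz", "sxz"]  # assumed column order
    return pd.read_csv(
        io.BytesIO(b),
        sep=r"\s+",   # whitespace-delimited (assumption)
        names=cols,
        header=None,
    )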

PySpark serialization EOFError

I am reading a CSV in as a Spark DataFrame and performing machine learning operations on it. I keep getting a Python serialization EOFError - any idea why? I thought it might be a memory issue - i.e. the file exceeding available RAM - but drastically reducing the size of the DataFrame didn't prevent the EOF error.
Toy code and error below.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import RandomForestClassifier

# set spark context
conf = SparkConf().setMaster("local").setAppName("MyApp")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# read in 500mb csv as DataFrame
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
    inferschema='true').load('myfile.csv')

# get dataframe into machine learning format
r_formula = RFormula(formula="outcome ~ .")
mldf = r_formula.fit(df).transform(df)

# fit random forest model
rf = RandomForestClassifier(numTrees=3, maxDepth=2)
model = rf.fit(mldf)
result = model.transform(mldf).head()
Running the above code with spark-submit on a single node repeatedly throws the following error, even if the size of the DataFrame is reduced prior to fitting the model (e.g. tinydf = df.sample(False, 0.00001)):
Traceback (most recent call last):
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/serializers.py", line 545, in read_int
    raise EOFError
EOFError
The error appears to happen in the PySpark read_int function, whose code (from the Spark source) is as follows:
def read_int(stream):
    length = stream.read(4)
    if not length:
        raise EOFError
    return struct.unpack("!i", length)[0]
This means that when reading 4 bytes from the stream, if 0 bytes are read, an EOFError is raised. The Python docs are here.
I have faced the same issue and don't know how to debug it. It seems to cause the executor thread to get stuck and never return anything.
Have you checked to see where in your code the EOFError arises?
My guess would be that it's coming as you attempt to define df, since that's the only place in your code where the file is actually being read.
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
    inferschema='true').load('myfile.csv')
At every point after this line, your code is working with the variable df, not the file itself, so it would seem likely that this line is generating the error.
A simple way to test if this is the case would be to comment out the rest of your code and/or place a line like this right after the line above (Spark DataFrames don't support len(), so count() is used to force a read):
print(df.count())
Another way would be to use a try block, like:
try:
    df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
        inferschema='true').load('myfile.csv')
except:
    print("Failed to load file into df!")
If it turns out that that line is the one generating the EOFError, then you're never getting the dataframes in the first place, so attempting to reduce them won't make a difference.
If that is the line generating the error, two possibilities come to mind:
Your code is calling one or both of the .csv files earlier on and isn't closing it prior to this line. If so, simply close it before this code.
There's something wrong with the .csv files themselves. Try loading them outside of this code and see if you can get them into memory properly in the first place, using something like csv.reader (a quick sketch follows below), and manipulate them in ways you'd expect.
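For instance, a minimal sanity check with the standard library (the file name is taken from the snippet above; the consistent-width check is just an example):
import csv

# Stream through the file without loading it all into memory,
# just to confirm every row parses and has a consistent number of fields
with open('myfile.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    for i, row in enumerate(reader, start=2):
        if len(row) != len(header):
            print("row %d has %d fields, expected %d" % (i, len(row), len(header)))
            break
    else:
        print("CSV parsed cleanly")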

Can't save pandas dataframe to HDF file

I have been trying for a while to save a pandas DataFrame to an HDF5 file. I tried various phrasings, e.g. df.to_hdf etc., but to no avail. I am running this in a Python virtual environment, see here. Even without the virtual environment I get the same error. The following script comes up with the error below:
''' This script reads in a pickled dictionary, converts it to a pandas
dataframe and then saves it to an hdf file. The arguments are the
file names of the pickle files.
'''
import numpy as np
import pandas as pd
import pickle
import sys

# read in filename arguments
for fn in sys.argv[1:]:
    print 'converting file %s to hdf format...' % fn
    fl = open(fn, 'r')
    data = pickle.load(fl)
    fl.close()
    frame = pd.DataFrame(data)
    fnn = fn.split('.')[0] + '.h5'
    store = pd.HDFStore(fnn)
    store.put(fn.split('.')[0], frame)  # key should be a string, not a list
    store.close()
    frame = 0
    data = 0
Error is:
$ ./p_to_hdf.py LUT_*.p
converting file LUT_0.p to hdf format...
Traceback (most recent call last):
  File "./p_to_hdf.py", line 22, in <module>
    store = pd.HDFStore(fnn)
  File "/usr/lib/python2.7/site-packages/pandas/io/pytables.py", line 270, in __init__
    raise Exception('HDFStore requires PyTables')
Exception: HDFStore requires PyTables
pip list shows both pandas and tables are installed and the latest versions.
pandas (0.16.2)
tables (3.2.0)
The solution had nothing to do with the code but with how the virtual environment was sourced. The correct way is to use . venv/bin/activate instead of source ~/venv/bin/activate. Now which python shows the Python installed under ~/venv/bin/python, and the code runs correctly.
