I am reading in a CSV as a Spark DataFrame and performing machine learning operations upon it. I keep getting a Python serialization EOFError - any idea why? I thought it might be a memory issue - i.e. file exceeding available RAM - but drastically reducing the size of the DataFrame didn't prevent the EOF error.
Toy code and error below.
# set Spark context
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import RandomForestClassifier

conf = SparkConf().setMaster("local").setAppName("MyApp")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# read in 500 MB CSV as a DataFrame
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
    inferschema='true').load('myfile.csv')

# get the DataFrame into machine learning format
r_formula = RFormula(formula="outcome ~ .")
mldf = r_formula.fit(df).transform(df)

# fit a random forest model
rf = RandomForestClassifier(numTrees=3, maxDepth=2)
model = rf.fit(mldf)
result = model.transform(mldf).head()
Running the above code with spark-submit on a single node repeatedly throws the following error, even if the size of the DataFrame is reduced prior to fitting the model (e.g. tinydf = df.sample(False, 0.00001)):
Traceback (most recent call last):
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/serializers.py", line 545, in read_int
    raise EOFError
EOFError
The error appears to happen in the PySpark read_int function, the code for which is as follows (from the Spark source):
def read_int(stream):
    length = stream.read(4)
    if not length:
        raise EOFError
    return struct.unpack("!i", length)[0]
This means that if 0 bytes are read when trying to read 4 bytes from the stream, an EOFError is raised. The Python docs are here.
I have faced the same issue and don't know how to debug it. It seems to cause the executor thread to get stuck and never return anything.
Have you checked to see where in your code the EOFError is arising?
My guess would be that it's coming as you attempt to define df, since that's the only place in your code where the file is actually being read.
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
inferschema='true').load('myfile.csv')
At every point after this line, your code is working with the variable df, not the file itself, so it would seem likely that this line is generating the error.
A simple way to test whether this is the case would be to comment out the rest of your code and/or place a line like this right after the line above (note that len() doesn't work on a Spark DataFrame; count() also forces the lazy read to actually happen):
print(df.count())
Another way would be to use a try/except block, like:
try:
    df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
        inferschema='true').load('myfile.csv')
except Exception:
    print("Failed to load file into df!")
If it turns out that that line is the one generating the EOFError, then you're never getting the dataframes in the first place, so attempting to reduce them won't make a difference.
If that is the line generating the error, two possibilities come to mind:
Your code is opening the .csv file somewhere earlier on and isn't closing it prior to this line. If so, simply close it before this code runs.
There's something wrong with the .csv file itself. Try loading it outside of this code to see whether you can get it into memory properly in the first place, using something like csv.reader, and manipulate it in the ways you'd expect.
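For that second check, a minimal sketch (reusing the myfile.csv name from the question) could look like this:
import csv

# Read the CSV directly with the standard library to verify the file itself
# is intact and parseable, independently of Spark.
with open("myfile.csv") as fh:
    reader = csv.reader(fh)
    header = next(reader)              # column names, if the file has a header row
    row_count = sum(1 for _ in reader)

print("columns:", header)
print("data rows:", row_count)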
Related
I am doing a simulation where I compute some quantities for several time steps. For each time step I want to save a parquet file where each line corresponds to a simulation. It looks like this:
def simulation():
    nsim = 3
    timesteps = [1, 2]
    data = {}  # initialization not shown here
    for i in range(nsim):
        compute_stuff()
        for j in timesteps:
            data[str(j)] = compute_some_other_stuff()
    return data
Once I have my dict data containing the results of my simulation (as numpy arrays), I transform it into dask.DataFrame objects and then save them to file using the .to_parquet() method, as follows:
from dask.dataframe import from_pandas

def save(data):
    for i in data.keys():
        data[i] = pd.DataFrame(data[i], bins=...)
        df = from_pandas(data[i], npartitions=2)
        df.to_parquet(datafolder + i + "/", engine="pyarrow", append=True, ignore_divisions=True)
When I use this code only once it works perfectly; the struggle arises when I try to run it in parallel. Using dask I do:
from dask import delayed, compute
from dask.distributed import Client

client = Client(n_workers=10, processes=True)

def f(n):
    data = simulation()
    save(data)

to_compute = [delayed(f)(n) for n in range(20)]
compute(to_compute)
The behaviour of this last portion of code is quite random. At some point this happens:
distributed.worker - WARNING - Compute Failed
Function: f
args: (4)
kwargs: {}
Exception: "ArrowInvalid('Parquet file size is 0 bytes')"
....
distributed.worker - WARNING - Compute Failed
Function: f
args: (12)
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
I think these errors are due to the fact that two processes try to write to the same parquet file at the same time, and this is not handled well (as it can be with a text file). I have already tried switching to pySpark / Koalas without much success. Is there a better way to save the results along the way during a simulation (in case of a crash / wall-time limit on a cluster)?
You are making a classic dask mistake of invoking the dask API from within functions that are themselves delayed. The error indicates that things are happening in parallel (which is what dask does!) which are not expected to change during processing. Specifically, a file is clearly being edited by one task while another one is reading it (not sure which).
What you probably want to do is use concat on the dataframe pieces and then make a single call to to_parquet.
Note that it seems all of your data is actually held in the client, and you are using from_pandas. This seems like a bad idea, since you are missing out on one of dask's biggest features: only loading data when it is needed. You should, instead, create your data inside delayed functions or dask dataframe API calls, as in the sketch below.
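A rough sketch of that pattern, with a stub simulation() standing in for the real one and placeholder column and output names, might look like this:
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client

def simulation():
    # Stand-in for the real simulation(): returns {timestep: ndarray}.
    return {str(j): np.random.rand(5, 3) for j in (1, 2)}

@dask.delayed
def run_one(n):
    # Build the pandas piece inside the delayed function, so the data is
    # created on the workers instead of being shipped from the client.
    data = simulation()
    frames = [pd.DataFrame(v, columns=["x", "y", "z"]).assign(timestep=k, sim=n)
              for k, v in data.items()]
    return pd.concat(frames)

if __name__ == "__main__":
    client = Client(n_workers=4, processes=True)
    parts = [run_one(n) for n in range(20)]
    # One dask dataframe from the delayed pieces, one to_parquet call:
    # no two tasks ever append to the same file. Passing meta= to
    # from_delayed would avoid computing one piece up front.
    ddf = dd.from_delayed(parts)
    ddf.to_parquet("results/", engine="pyarrow")
This writes one parquet file per partition inside a single output directory, so nothing is ever appended to concurrently, and a crash partway through still leaves the already-written partitions on disk.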
I've written a CLI tool to generate simulations, and I'm hoping to generate about 10k (~10 minutes) for each cut of data; I have ~200. I have functions that do this fine in a for loop, but when I converted it to concurrent.futures.ProcessPoolExecutor() I realized that multiple processes can't read in the same pandas dataframe.
Here's the smallest example I could think of:
import concurrent.futures
import pandas as pd

def example():
    # This is a static table with basic information like distributions
    df = pd.read_parquet("batch/data/mappings.pq")
    # Then there's a bunch of etl, even reading in a few other static tables
    return sum(df.shape)

def main():
    results = []
    with concurrent.futures.ProcessPoolExecutor() as pool:
        futr_results = [pool.submit(example) for _ in range(100)]
        done_results = concurrent.futures.as_completed(futr_results)
        for _ in futr_results:
            results.append(next(done_results).result())
    return results

if __name__ == "__main__":
    print(main())
Errors:
<jemalloc>: background thread creation failed (11)
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
Traceback (most recent call last):
  File "batch/testing.py", line 19, in <module>
    main()
  File "batch/testing.py", line 14, in main
    results.append(next(done_results).result())
  File "/home/a114383/miniconda3/envs/hailsims/lib/python3.7/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/home/a114383/miniconda3/envs/hailsims/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I'm hoping there's a quick and dirty way to read these (I'm guessing without reference?); otherwise it looks like I'll need to create all the parameters first rather than getting them on the fly.
Three things I would try:
Pandas has an option to use either PyArrow or FastParquet when reading parquet files. Try switching to the other one, since this seems to be a bug.
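A minimal sketch of switching engines (assuming fastparquet is installed alongside pyarrow):
import pandas as pd

# pandas accepts engine="pyarrow" or engine="fastparquet"; the default
# "auto" simply picks whichever library is available.
df = pd.read_parquet("batch/data/mappings.pq", engine="fastparquet")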
Try forcing pandas to open the file in read only mode to prevent conflicts due to the file being locked:
pd.read_parquet(open("batch/data/mappings.pq", "rb"))
# Also try "r" instead of "rb", not sure if pandas expects string or binary data
Try loading the file into a StringIO/BytesIO buffer, and then handing that to pandas - this avoids any interaction with the file from pandas itself:
import io
# either this (binary)
data = io.BytesIO(open("batch/data/mappings.pq", "rb").read())
# or this (string)
data = io.StringIO(open("batch/data/mappings.pq", "r").read())
pd.read_parquet(data)
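Building on the third suggestion, one variant (a sketch reusing the file path from the question) is to read the bytes once in the parent process and hand them to each worker, so no child process ever touches the file on disk:
import concurrent.futures
import io
import pandas as pd

def example(raw_bytes):
    # Each worker rebuilds its DataFrame from an in-memory buffer.
    df = pd.read_parquet(io.BytesIO(raw_bytes))
    return sum(df.shape)

def main():
    # Read the file exactly once, in the parent process.
    with open("batch/data/mappings.pq", "rb") as fh:
        raw = fh.read()
    with concurrent.futures.ProcessPoolExecutor() as pool:
        futures = [pool.submit(example, raw) for _ in range(100)]
        return [f.result() for f in concurrent.futures.as_completed(futures)]

if __name__ == "__main__":
    print(main())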
I'm trying to create a new fits file from an initial template.fits
This template.fits has a table in extension 1 with 3915 rows; my new file, instead, must have more than 50000 rows.
The part of the code is the following:
hdulist = fits.open('/Users/Martina/Desktop/Ubuntu_Condivisa/Post_Doc_IAPS/ASTRI/ASTRI_scienceTools/Astrisim_MC/template.fits')
hdu0 = hdulist[0]
hdu0.writeto(out_pile + '.fits', clobber=True)

hdu1 = hdulist[1]
hdu1.header['NAXIS2'] = na
hdu1.header['ONTIME'] = tsec
hdu1.header['LIVETIME'] = tsec
hdu1.writeto(out_pile + '.fits', clobber=True)

hdu1_data = hdu1.data
for j in range(na - 1):
    hdu1_data[j + 1][1] = j + 1
    hdu1_data[j + 1][3] = t[j] + 0.
    hdu1_data[j + 1][7] = ra[j]
    hdu1_data[j + 1][8] = dec[j]
    hdu1_data[j + 1][21] = enetot[j]

hdu1.writeto(out_pile + '.fits', clobber=True)
When I try to fill the new table (the last part of the code), the error is the following:
Traceback (most recent call last):
  File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\utils\decorators.py", line 734, in __get__
    return obj.__dict__[self._key]
KeyError: 'data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Astrisim_MC_4.py", line 340, in <module>
    hdu1_data=hdu1.data
  File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\utils\decorators.py", line 736, in __get__
    val = self.fget(obj)
  File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\io\fits\hdu\table.py", line 404, in data
    data = self._get_tbdata()
  File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\io\fits\hdu\table.py", line 171, in _get_tbdata
    self._data_offset)
  File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\io\fits\hdu\base.py", line 478, in _get_raw_data
    return self._file.readarray(offset=offset, dtype=code, shape=shape)
  File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\io\fits\file.py", line 279, in readarray
    buffer=self._mmap)
TypeError: buffer is too small for requested array
I tried to vary the number of rows and the code works correctly up to 3969 rows.
How can I solve the problem?
Thank you very much in advance,
cheers!
Martina
Your initial problem is here, where you did this:
hdu1.header['NAXIS2'] = na
A natural thing to think you might be able to do, but you actually should not. In general, when working with astropy.io.fits, one should almost never manually mess with keywords in the FITS header that describe the structure of the data itself. This stems partly from the design of FITS itself, which mixes these structural keywords in with metadata keywords, and partly from a design issue with astropy.io.fits: that it lets you manipulate these keywords at all, and that it doesn't tie the data more tightly to them. I wrote about this issue at more length here: https://github.com/astropy/astropy/issues/3836 but never got around to adding more explanation of this to the documentation.
Basically the way you can think about it is that when a FITS file is opened, its header is first read and parsed into a Header object containing all the header keywords. Some book-keeping is also done to keep track of how much data is in the file after the header. Then when you access the data of the HDU the header keywords are used to determine what the type and shape of the data is. So by doing something like
hdu1.header['NAXIS2'] = na
hdu1_data = hdu1.data
this isn't somehow growing the data in the file. Instead it's just confusing it into thinking there are more rows of data in the file than there actually are, hence the error "buffer is too small for requested array". The "buffer" it's referring to in this case is the rest of the data in the file, and you're requesting that it read an array that's longer than the data the file actually contains.
The fact that it allows you to break this at all is a bug in Astropy, IMO. When the file is first opened it should save away all the correct structural keywords in the background, so that the data can still be loaded properly even if the user accidentally modifies these keywords (or perhaps the user should be completely prevented from modifying these keywords directly).
That's a long way to explain where you went wrong, but maybe it will help better understand how the library works.
As to your actual question, I think @Evert's advice is good: use the higher-level and easier-to-work-with astropy.table to create a new table that's the size you need, and then copy the existing table into the new one. You can also open the FITS table directly as a Table object with Table.read. I think you can copy the FITS metadata over as well, but I'm not sure exactly what the best way to do that is.
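A minimal sketch of that approach (the output filename, row count, and the commented column name are placeholders, not taken from your file):
import numpy as np
from astropy.table import Table

# Read the template's binary table (extension 1) directly as a Table.
template = Table.read("template.fits", hdu=1)

na = 50000  # desired number of rows

# Allocate a structured array of the target length with the same dtype,
# copy the template rows into the front of it, and wrap it in a new Table.
buf = np.zeros(na, dtype=template.as_array().dtype)
buf[:len(template)] = template.as_array()
new_table = Table(buf, meta=template.meta)

# Fill the extra rows column by column (the column name below is hypothetical),
# then write the result; the structural FITS keywords are generated to match.
# new_table["RA"][len(template):] = ra
new_table.write("new_file.fits", overwrite=True)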
One other minor comment, unrelated to your main question: when working with arrays you don't have to (and in fact shouldn't) use for loops to perform vectorizable operations.
For example since this is just looping over array indices:
for j in range(na-1):
    hdu1_data[j+1][1] = j+1
    hdu1_data[j+1][3] = t[j] + 0.
    hdu1_data[j+1][7] = ra[j]
    hdu1_data[j+1][8] = dec[j]
    hdu1_data[j+1][21] = enetot[j]
you can write whole-column operations like this (using .field() to select a column of the FITS record array):
hdu1_data.field(1)[:] = np.arange(na)
hdu1_data.field(3)[:] = t + 0.
hdu1_data.field(7)[:] = ra
and so on (I'm not sure why you were doing j+1 because this is skipping the first row, but the point still stands). This assumes of course that the array being updated (hdu1_data, in this case) already has na rows. But that's why you need to grow or concatenate to your array first if it's not already that size.
The algorithm I use is quite heavy and has three parts. Thus, I used pickle to dump everything between the various stages in order to test each stage separately.
Although the first dump always works fine, the second one behaves as if it were size-dependent: it works for a smaller dataset but not for a somewhat larger one. (The same actually also happens with a heatmap I try to create, but that's a different question.) The dumped file is about 10 MB, so it's nothing really large.
The dump which creates the problem contains a whole class which in turn contains methods, dictionaries, lists and variables.
I actually tried dumping both from inside and outside the class but both failed.
The code I'm using looks like this:
data = pickle.load(open("./data/tmp/data.pck", 'rb')) #Reads from the previous stage dump and works fine.
dataEvol = data.evol_detect(prevTimeslots, xLablNum) #Export the class to dataEvol
dataEvolPck = open("./data/tmp/dataEvol.pck", "wb") #open works fine
pickle.dump(dataEvol, dataEvolPck, protocol = 2) #dump works fine
dataEvolPck.close()
and even tried this:
dataPck = open("./data/tmp/dataFull.pck", "wb")
pickle.dump(self, dataPck, protocol=2) #self here is the dataEvol in the previous part of code
dataPck.close()
The problem appears when I try to load the class using this part:
dataEvol = pickle.load(open("./data/tmp/dataEvol.pck", 'rb'))
The error in hand is:
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    dataEvol = pickle.load(open("./data/tmp/dataEvol.pck", 'rb'))
ValueError: itemsize cannot be zero
Any ideas?
I'm using Python 3.3 on a 64-bit Win-7 computer. Please forgive me if I'm missing anything essential as this is my first question.
Answer:
The problem was an empty numpy string in one of the dictionaries. Thanks Janne!!!
It is a NumPy bug that has been fixed recently in this pull request. To reproduce it, try:
import cPickle
import numpy as np
cPickle.loads(cPickle.dumps(np.string_('')))
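For reference, since the asker is on Python 3.3, the equivalent repro there (assuming an affected NumPy version) uses the standard pickle module:
import pickle
import numpy as np

# On affected NumPy versions, unpickling a zero-length numpy string raises
# "ValueError: itemsize cannot be zero"; on fixed versions this round-trips fine.
pickle.loads(pickle.dumps(np.string_('')))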
I'm exporting the results of my script into an Excel spreadsheet. Everything works fine and I put big sets of data into the spreadsheet, but sometimes an error occurs:
File "C:\Python26\lib\site-packages\win32com\client\dynamic.py", line 550, in __setattr__
self._oleobj_.Invoke(entry.dispid, 0, invoke_type, 0, value)
pywintypes.com_error: (-2147352567, 'Exception.', (0, None, None, None, 0, -2146777998), None)***
I suppose it's not a problem with the input data format. I put in several different types of data (strings, ints, floats, lists) and it works fine. When I run the script a second time it works fine, with no error. What's going on?
PS. This is the code that generates the error; what's strange is that the error doesn't always occur. Say 30% of runs result in an error:
import win32com.client

def Generate_Excel_Report():
    Excel = win32com.client.Dispatch("Excel.Application")
    Excel.Workbooks.Add(1)
    Cells = Excel.ActiveWorkBook.ActiveSheet.Cells
    for i in range(100):
        Row = int(35 + i)
        for j in range(10):
            Cells(int(Row), int(5 + j)).Value = "string"
    for i in range(100):
        Row = int(135 + i)
        for j in range(10):
            Cells(int(Row), int(5 + j)).Value = 32.32  # float

Generate_Excel_Report()
The strangest thing for me is that when I run the script with the same code and the same input many times, sometimes an error occurs and sometimes it doesn't.
This is most likely a synchronous COM access error. See my answer to Error while working with excel using python for details about why and a workaround.
I can't see why the file format/extension would make a difference. You'd be calling the same COM object either way. My experience with this error is that it's more or less random, but you can increase the chances of it happening by interacting with Excel while your script is running.
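If the failure is a transient rejection of the COM call, one common mitigation (a sketch, not a guaranteed fix for this particular error code) is to retry the offending call a few times before giving up:
import time
import pywintypes

def set_cell_with_retry(cells, row, col, value, attempts=5, delay=0.5):
    # Retry the COM assignment; transient rejections often succeed once
    # Excel has finished whatever it was doing.
    for attempt in range(attempts):
        try:
            cells(row, col).Value = value
            return
        except pywintypes.com_error:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)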
edit: It doesn't change a thing. The error still occurs, but less often: once in 10 simulations, whereas with the .xlsx file it was once in 3 simulations. Please help.
The problem was with the file I was opening. It was .xlsx; once I saved it as .xls the problem disappeared. So beware, do not ever use the COM interface with .xlsx or you'll get in trouble!
You should disable Excel interactivity while doing this.
import win32com.client

def Generate_Excel_Report():
    Excel = win32com.client.Dispatch("Excel.Application")

    # you won't see what happens (faster)
    Excel.ScreenUpdating = False
    # clicks on the Excel window have no effect
    # (set back to True before closing Excel)
    Excel.Interactive = False

    Excel.Workbooks.Add(1)
    Cells = Excel.ActiveWorkBook.ActiveSheet.Cells
    for i in range(100):
        Row = int(35 + i)
        for j in range(10):
            Cells(int(Row), int(5 + j)).Value = "string"
    for i in range(100):
        Row = int(135 + i)
        for j in range(10):
            Cells(int(Row), int(5 + j)).Value = 32.32  # float

    Excel.ScreenUpdating = True
    Excel.Interactive = True

Generate_Excel_Report()
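A small variant of the same idea (same objects, just reorganized) is to restore those settings in a finally block, so Excel is not left non-interactive if a write fails partway through:
import win32com.client

def generate_report_safely():
    excel = win32com.client.Dispatch("Excel.Application")
    excel.ScreenUpdating = False
    excel.Interactive = False
    try:
        excel.Workbooks.Add(1)
        cells = excel.ActiveWorkbook.ActiveSheet.Cells
        for i in range(100):
            for j in range(10):
                cells(35 + i, 5 + j).Value = "string"
    finally:
        # Hand control back to the user even if an exception was raised.
        excel.ScreenUpdating = True
        excel.Interactive = True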
You could also do the following to improve your code's performance:
# Construct the data block
string_line = []
for i in range(10):
    string_line.append("string")
string_block = []
for i in range(100):
    string_block.append(string_line)

# Write the data block in one call
# (100 rows x 10 columns: rows 35-134, columns 5-14)
ws = Excel.ActiveWorkbook.Sheets(1)
ws.Range(
    ws.Cells(35, 5),
    ws.Cells(134, 14)
).Value = string_block
I had the same error while using xlwings to interact with Excel. xlwings also uses win32com clients in the backend.
After some debugging, I realized that this error pops up whenever the code is executed while the Excel file (containing the data) is not in focus. To resolve the issue, I simply select the file being processed before running the code, and it always works for me.