I have a script that constantly updates a DataFrame and saves it to disk (overwriting the old CSV file). I found out that if I interrupt the program right at the saving call, df.to_csv("df.csv"), all data is lost and df.csv is empty, containing only the column index.
I could work around this by temporarily saving the data to df.temp.csv and then replacing df.csv. But is there a Pythonic, short way to make the saving atomic and prevent data loss? This is the stack trace I get when interrupting right at the saving call.
Traceback (most recent call last):
File "/opt/homebrew-cask/Caskroom/pycharm/2016.1.3/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1531, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "/opt/homebrew-cask/Caskroom/pycharm/2016.1.3/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 938, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Users/user/test.py", line 49, in <module>
d.to_csv("out.csv", index=False)
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 1344, in to_csv
formatter.save()
File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 1551, in save
self._save()
File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 1652, in _save
self._save_chunk(start_i, end_i)
File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 1666, in _save_chunk
quoting=self.quoting)
File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 1443, in to_native_types
return formatter.get_result_as_array()
File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 2171, in get_result_as_array
formatted_values = format_values_with(float_format)
File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 2157, in format_values_with
for val in values.ravel()[imask]])
File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 2108, in base_formatter
return str(v) if notnull(v) else self.na_rep
File "/usr/local/lib/python2.7/site-packages/pandas/core/common.py", line 250, in notnull
res = isnull(obj)
File "/usr/local/lib/python2.7/site-packages/pandas/core/common.py", line 73, in isnull
def isnull(obj):
File "_pydevd_bundle/pydevd_cython.pyx", line 937, in _pydevd_bundle.pydevd_cython.ThreadTracer.__call__ (_pydevd_bundle/pydevd_cython.c:15522)
File "/opt/homebrew-cask/Caskroom/pycharm/2016.1.3/PyCharm.app/Contents/helpers/pydev/_pydev_bundle/pydev_is_thread_alive.py", line 14, in is_thread_alive
def is_thread_alive(t):
KeyboardInterrupt
You can create a context manager to handle your atomic overwriting:
import os
import contextlib

@contextlib.contextmanager
def atomic_overwrite(filename):
    temp = filename + '~'
    with open(temp, "w") as f:
        yield f
    os.rename(temp, filename)  # this will only happen if no exception was raised
The to_csv method on a Pandas DataFrame will accept a file object instead of a path, so you can use:
with atomic_overwrite("df.csv") as f:
    df.to_csv(f)
The temporary filename I chose is the requested filename with a tilde at the end. You can of course change the code to use something else if you want. I'm also not exactly sure what mode the file should be opened with; you may need "wb" instead of just "w".
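If you are on Python 3, a variation of the same idea (my sketch, not tested against your script) lets tempfile pick the temporary name and uses os.replace, which also overwrites an existing target on Windows where os.rename would fail:

import os
import tempfile
import contextlib

@contextlib.contextmanager
def atomic_overwrite(filename):
    # create the temp file in the target directory so the final replace
    # stays on one filesystem and remains atomic
    directory = os.path.dirname(os.path.abspath(filename))
    fd, temp = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            yield f
        os.replace(temp, filename)
    except BaseException:
        os.remove(temp)  # clean up the partial file on any failure
        raise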
The best you can do is to implement a signal handler (signal module) that delays terminating the program until the last write operation has finished.
Something along these lines (pseudo-code):
import signal
import sys
import time
import pandas as pd

stop_requested = False

def handler(signum, frame):
    # don't exit right away; just remember that a shutdown was requested
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, handler)
signal.signal(signal.SIGINT, handler)

while True:
    df.to_csv(...)  # the handler defers termination, so this write can finish
    if stop_requested:
        sys.exit(1)
    time.sleep(1)
I'm trying to read raw data from a zipfile. The structure of that file is:
zipfile
data
Spectral0.data
Spectral1.data
Spectral[...].data
Spectral300.data
Header
The goal is to read all Spectral[...].data files into a 2D numpy array (where Spectral0.data would be the first column). The single-threaded approach takes a lot of time, since reading one .data file takes a few seconds.
import zipfile
import numpy as np

spectralData = np.zeros(shape=(dimY, dimX), dtype=np.int16)
archive = zipfile.ZipFile(path, 'r')
for file in range(fileCount):
    spectralDataRaw = archive.read('data/Spectral' + str(file) + '.data')
    spectralData[:, file] = np.frombuffer(spectralDataRaw, np.short)
And I thought using multiprocessing could speed up the process, so I read some tutorials on how to set up a multiprocessing procedure. This is what I came up with:
import zipfile
import numpy as np
import multiprocessing
from joblib import Parallel, delayed

archive = zipfile.ZipFile(path, 'r')
numCores = multiprocessing.cpu_count()

def testMult(file):
    spectralDataRaw = archive.read('data/Spectral' + str(file) + '.data')
    return np.frombuffer(spectralDataRaw, np.short)

output = Parallel(n_jobs=numCores)(delayed(testMult)(file) for file in range(fileCount))
output = np.flipud(np.rot90(np.array(output), 1, axes=(0, 2)))
Using this approach I get the following error:
_RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\externals\loky\backend\queues.py", line 153, in _feed
obj_ = dumps(obj, reducers=reducers)
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 271, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 264, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\externals\cloudpickle\cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle '_io.BufferedReader' object
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<ipython-input-94-c4b007eea8e2>", line 8, in <module>
output = Parallel(n_jobs=numCores)(delayed(testMult)(file)for file in range(fileCount))
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\parallel.py", line 1061, in __call__
self.retrieve()
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\parallel.py", line 940, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\concurrent\futures\_base.py", line 432, in result
return self.__get_result()
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\concurrent\futures\_base.py", line 388, in __get_result
raise self._exception
PicklingError: Could not pickle the task to send it to the workers.
My question is: how do I set up this parallelization correctly? I've read that zipfile is not thread-safe, so I might need a different approach to read the zip content into memory (RAM). I would rather not read the whole zipfile into memory, since the file can be quite large.
I also thought about using from numba import njit, prange, but the problem there is that zipfile is not supported by numba.
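One direction I'm considering (just a sketch, untested) is to not share the ZipFile handle at all and let every worker open the archive itself, so nothing unpicklable has to be sent to the workers; path and fileCount are the same variables as above:

import zipfile
import multiprocessing
import numpy as np
from joblib import Parallel, delayed

def read_column(zip_path, file_index):
    # every call opens its own handle, so there is no shared, unpicklable state
    with zipfile.ZipFile(zip_path, 'r') as archive:
        raw = archive.read('data/Spectral' + str(file_index) + '.data')
    return np.frombuffer(raw, np.short)

numCores = multiprocessing.cpu_count()
columns = Parallel(n_jobs=numCores)(
    delayed(read_column)(path, i) for i in range(fileCount)
)
spectralData = np.column_stack(columns)  # one .data file per column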
What else could I do to make this work?
I want to make a program that consists of 2 parts: one receives data and the other writes it to a file. I thought it would be better if I could use 2 threads (and possibly 2 CPU cores) to do the jobs separately. I found this: https://numba.pydata.org/numba-doc/dev/user/jit.html#compilation-options and it allows you to release the GIL. I wonder if it suits my purpose and whether I can adopt it for this kind of job. This is what I tried:
import threading
import time
import os
import queue
import numba
import numpy as np

condition = threading.Condition()
q_text = queue.Queue()

#@numba.jit(nopython=True, nogil=True)  # applying this decorator triggers the error below
def consumer():
    t = threading.currentThread()
    with condition:
        while True:
            str_test = q_text.get()
            with open('hello.txt', 'a') as f:
                f.write(str_test)
            condition.wait()

def sender():
    with condition:
        condition.notifyAll()

def add_q(arr="hi\n"):
    q_text.put(arr)
    sender()

c1 = threading.Thread(name='c1', target=consumer)
c1.start()
add_q()
It works fine without numba, but when I apply it to consumer, it gives me an error:
Exception in thread c1:
Traceback (most recent call last):
File "d:\python36-32\lib\threading.py", line 916, in _bootstrap_inner
self.run()
File "d:\python36-32\lib\threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "d:\python36-32\lib\site-packages\numba\dispatcher.py", line 368, in _compile_for_args
raise e
File "d:\python36-32\lib\site-packages\numba\dispatcher.py", line 325, in _compile_for_args
return self.compile(tuple(argtypes))
File "d:\python36-32\lib\site-packages\numba\dispatcher.py", line 653, in compile
cres = self._compiler.compile(args, return_type)
File "d:\python36-32\lib\site-packages\numba\dispatcher.py", line 83, in compile
pipeline_class=self.pipeline_class)
File "d:\python36-32\lib\site-packages\numba\compiler.py", line 873, in compile_extra
return pipeline.compile_extra(func)
File "d:\python36-32\lib\site-packages\numba\compiler.py", line 367, in compile_extra
return self._compile_bytecode()
File "d:\python36-32\lib\site-packages\numba\compiler.py", line 804, in _compile_bytecode
return self._compile_core()
File "d:\python36-32\lib\site-packages\numba\compiler.py", line 791, in _compile_core
res = pm.run(self.status)
File "d:\python36-32\lib\site-packages\numba\compiler.py", line 253, in run
raise patched_exception
File "d:\python36-32\lib\site-packages\numba\compiler.py", line 245, in run
stage()
File "d:\python36-32\lib\site-packages\numba\compiler.py", line 381, in stage_analyze_bytecode
func_ir = translate_stage(self.func_id, self.bc)
File "d:\python36-32\lib\site-packages\numba\compiler.py", line 937, in translate_stage
return interp.interpret(bytecode)
File "d:\python36-32\lib\site-packages\numba\interpreter.py", line 92, in interpret
self.cfa.run()
File "d:\python36-32\lib\site-packages\numba\controlflow.py", line 515, in run
assert not inst.is_jump, inst
AssertionError: Failed at nopython (analyzing bytecode)
SETUP_WITH(arg=60, lineno=17)
There was no error if I exclude the condition (threading.Condition) from consumer, so maybe it's because the JIT doesn't understand it? I'd like to know whether I can adopt numba for this kind of purpose and how to fix this problem (if that's possible).
You can't use the threading module within a Numba function, and opening/writing a file isn't supported either. Numba is great when you need computational performance; your example is purely I/O, and that's not a use case for Numba.
The only way Numba would add something is if you apply a function to your str_test data. Compiling that function with nogil=True would allow multi-threading. But again, that's only worth it if that function is computationally expensive compared to the I/O.
You could look into an async solution, that's more appropriate for I/O bound performance.
See this example from the Numba documentation for a case where threading improves performance:
https://numba.pydata.org/numba-doc/dev/user/examples.html#multi-threading
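For the file-writing part itself, a plain Python thread is usually enough, because blocking I/O already releases the GIL. A minimal sketch (my own, not from the Numba docs) with a None sentinel to stop the consumer:

import threading
import queue

q_text = queue.Queue()

def consumer():
    while True:
        str_test = q_text.get()   # blocks until the producer puts something
        if str_test is None:      # sentinel: shut the consumer down
            break
        with open('hello.txt', 'a') as f:
            f.write(str_test)

c1 = threading.Thread(name='c1', target=consumer)
c1.start()

q_text.put("hi\n")   # the receiving part only has to put data on the queue
q_text.put(None)
c1.join()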
I am using some legacy code from Python 2 that has to work with Python 3. So far so good, most things work as they should. However, I get a very vague error from a library called lxml.
As I understand it, this is a library that binds to C code.
The problem comes from this piece of code:
with etree.xmlfile(self.temp_file, encoding='utf-8') as xf:
    with xf.element('{http://www.opengis.net/citygml/2.0}CityModel', nsmap=nsmap):
        with open(input_gml, mode='rb') as f:
            context = etree.iterparse(f)
            for action, elem in context:
                if action == 'end' and elem.tag == '{http://www.opengis.net/citygml/2.0}cityObjectMember':
                    # Duplicate feature and subfeatures
                    self.duplicateFeature(xf, elem)
                    # Clean up the original element and the node of its previous sibling
                    # (https://www.ibm.com/developerworks/xml/library/x-hiperfparse/)
                    elem.clear()
                    while elem.getprevious() is not None:
                        del elem.getparent()[0]
            del context
    xf.flush()
It processes this XML file and gets the following error:
Traceback (most recent call last):
File "/usr/local/bin/stetl", line 4, in <module>
__import__('pkg_resources').run_script('Stetl==2.0', 'stetl')
File "/usr/local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 666, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1446, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/EGG-INFO/scripts/stetl", line 43, in <module>
main()
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/EGG-INFO/scripts/stetl", line 36, in main
etl.run()
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/stetl/etl.py", line 157, in run
chain.run()
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/stetl/chain.py", line 172, in run
packet = self.first_comp.process(packet)
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/stetl/component.py", line 213, in process
packet = self.next.process(packet)
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/stetl/component.py", line 213, in process
packet = self.next.process(packet)
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/stetl/component.py", line 213, in process
packet = self.next.process(packet)
File "/usr/local/lib/python3.6/site-packages/Stetl-2.0-py3.6.egg/stetl/component.py", line 199, in process
packet = self.invoke(packet)
File "/app/bgt/etl/stetlbgt/subfeaturehandler.py", line 144, in invoke
del context
File "src/lxml/serializer.pxi", line 925, in lxml.etree.xmlfile.__exit__
File "src/lxml/serializer.pxi", line 1263, in lxml.etree._IncrementalFileWriter._close
File "src/lxml/serializer.pxi", line 1269, in lxml.etree._IncrementalFileWriter._handle_error
File "src/lxml/serializer.pxi", line 199, in lxml.etree._raiseSerialisationError
lxml.etree.SerialisationError: unknown error -2029930774
I'm not sure what's going wrong here. It seems that something is wrong with some oddly encoded character.
How can I debug this?
I'm trying to access an Excel workbook every minute to save data that it is currently displaying from a different program. When the scheduler gets to accessing the workbook, I get "OSError: [WinError -2147467259] Unspecified error". Is there any fix or workaround? Any help would be appreciated, thanks!
from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.triggers.combining import OrTrigger
from apscheduler.triggers.cron import CronTrigger
import xlwings as xw

def tick():
    wb = xw.Book('currently_open_workbook.xlsx')

sched = BlockingScheduler()
trigger = OrTrigger([
    CronTrigger(day_of_week='mon-fri', hour='0-16', second=0),
    CronTrigger(day_of_week='sun', hour='17-23', second=0)
])
sched.add_job(tick, trigger)
sched.start()
The full error is here
Traceback (most recent call last):
File "C:\Users\eric\anaconda3\envs\untitled\lib\site-packages\apscheduler\executors\base.py", line 125, in run_job
retval = job.func(*job.args, **job.kwargs)
File "C:/Users/eric/PycharmProjects/untitled/blank.py", line 8, in tick
wb= xw.Book('currently_open_workbook.xlsx')
File "C:\Users\eric\anaconda3\envs\untitled\lib\site-packages\xlwings\main.py", line 472, in __init__
for wb in app.books:
File "C:\Users\eric\anaconda3\envs\untitled\lib\site-packages\xlwings\main.py", line 358, in books
return Books(impl=self.impl.books)
File "C:\Users\eric\anaconda3\envs\untitled\lib\site-packages\xlwings\_xlwindows.py", line 374, in books
return Books(xl=self.xl.Workbooks)
File "C:\Users\eric\anaconda3\envs\untitled\lib\site-packages\xlwings\_xlwindows.py", line 302, in xl
self._xl = get_xl_app_from_hwnd(self._hwnd)
File "C:\Users\eric\anaconda3\envs\untitled\lib\site-packages\xlwings\_xlwindows.py", line 218, in get_xl_app_from_hwnd
ptr = accessible_object_from_window(child_hwnd)
File "C:\Users\eric\anaconda3\envs\untitled\lib\site-packages\xlwings\_xlwindows.py", line 189, in accessible_object_from_window
byref(IDispatch._iid_), byref(ptr))
File "_ctypes/callproc.c", line 918, in GetResult
OSError: [WinError -2147467259] Unspecified error
Not really a solution but rather an explanation: I think the issue is that APScheduler uses threading, and xlwings objects can't be passed around directly between threads, see: http://docs.xlwings.org/en/stable/threading.html
It might be solvable with something like this: https://stackoverflow.com/a/27966218/918626, but currently there is nothing available out of the box in xlwings.
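If you want to experiment anyway, one thing worth trying (a rough sketch, assuming pywin32's pythoncom module is installed; untested with APScheduler) is to initialize COM inside the job, since APScheduler runs it in a worker thread:

import pythoncom
import xlwings as xw

def tick():
    pythoncom.CoInitialize()   # COM must be initialized in every thread that touches Excel
    try:
        wb = xw.Book('currently_open_workbook.xlsx')
        # ... read whatever values you need here ...
    finally:
        pythoncom.CoUninitialize()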
I'm currently trying to mock out the open() built-in in Python for a test. However, I always end up getting a crash and this resulting message:
File "/opt/home/venv/lib/python2.7/site-packages/nose-1.3.0-py2.7.egg/nose/result.py", line 187, in _exc_info_to_string
return _TextTestResult._exc_info_to_string(self, err, test)
File "/opt/python-2.7.3/lib/python2.7/unittest/result.py", line 164, in _exc_info_to_string
msgLines = traceback.format_exception(exctype, value, tb)
File "/opt/python-2.7.3/lib/python2.7/traceback.py", line 141, in format_exception
list = list + format_tb(tb, limit)
File "/opt/python-2.7.3/lib/python2.7/traceback.py", line 76, in format_tb
return format_list(extract_tb(tb, limit))
File "/opt/python-2.7.3/lib/python2.7/traceback.py", line 101, in extract_tb
line = linecache.getline(filename, lineno, f.f_globals)
File "/opt/home/venv/lib/python2.7/linecache.py", line 14, in getline
lines = getlines(filename, module_globals)
File "/opt/home/venv/lib/python2.7/linecache.py", line 40, in getlines
return updatecache(filename, module_globals)
File "/opt/home/venv/lib/python2.7/linecache.py", line 127, in updatecache
with open(fullname, 'rU') as fp:
AttributeError: __exit__
Here is my test code:
m = mox.Mox()
m.StubOutWithMock(__builtin__, 'open')
mock_file = m.CreateMock(__builtin__.file)
open(mox.IgnoreArg(), mox.IgnoreArg()).AndReturn(mock_file)
mock_file.write(mox.IgnoreArg()).MultipleTimes()
mock_file.close()
write_file_method()
__exit__ is the method that gets called on a file when a with block closes it. Your mock file records open(), write() and close(), but not the context-manager protocol, so any code that does "with open(...)" (like linecache in your traceback) fails. You'll need to mock __exit__ (and __enter__) too.
Edit:
On second thought, why do you want to mock open at all? AFAIK you shouldn't be mocking that. The method under test should take an open stream (instead of a filename, for instance). In production code, clients are responsible for opening the file (e.g. pickle.dump). In your tests, you pass in a StringIO, or a mock object that supports writing.
Edit 2:
I would split your method in two and test each bit separately.
creating a file: check that the file does not exist prior to calling this method, and that it does afterwards. One might argue such a one-line method isn't worth testing.
writing to a file: see above. Create a StringIO and write to that, so your tests can verify that the correct thing has been written (see the sketch below).
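As an illustration of that last point, here is a minimal sketch; write_report is a hypothetical stand-in for your write_file_method, rewritten to take a stream instead of opening the file itself:

import io

def write_report(stream, records):
    # hypothetical method under test: it only writes, it never calls open()
    for record in records:
        stream.write(record + u"\n")

def test_write_report():
    fake_file = io.StringIO()
    write_report(fake_file, [u"first", u"second"])
    assert fake_file.getvalue() == u"first\nsecond\n"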