Unable to export in parallel from Exasol using pyexasol - python

I'm attempting to fetch data from Exasol using PyExasol, in parallel. I'm following the example here - https://github.com/badoo/pyexasol/blob/master/examples/14_parallel_export.py
My code looks like this:
import multiprocessing
import pyexasol
import pyexasol.callback as cb

class ExportProc(multiprocessing.Process):
    def __init__(self, node):
        self.node = node
        self.read_pipe, self.write_pipe = multiprocessing.Pipe(False)
        super().__init__()

    def start(self):
        super().start()
        self.write_pipe.close()

    def get_proxy(self):
        return self.read_pipe.recv()

    def run(self):
        self.read_pipe.close()
        http = pyexasol.http_transport(self.node['host'], self.node['port'], pyexasol.HTTP_EXPORT)
        self.write_pipe.send(http.get_proxy())
        self.write_pipe.close()

        pd1 = http.export_to_callback(cb.export_to_pandas, None)
        print(f"{self.node['idx']}:{len(pd1)}")
EXASOL_HOST = "<IP-ADDRESS>:8563"
EXASOL_USERID = "username"
EXASOL_PASSWORD = "password"
c = pyexasol.connect(dsn=EXASOL_HOST, user=EXASOL_USERID, password=EXASOL_PASSWORD, compression=True)
nodes = c.get_nodes(10)
pool = list()
proxy_list = list()
for n in nodes:
    proc = ExportProc(n)
    proc.start()
    proxy_list.append(proc.get_proxy())
    pool.append(proc)
c.export_parallel(proxy_list, "SELECT * FROM SOME_SCHEMA.SOME_TABLE", export_params={'with_column_names': True})
stmt = c.last_statement()
r = stmt.fetchall()
At the last statement, I get the following error and am unable to fetch any results.
---------------------------------------------------------------------------
ExaRuntimeError Traceback (most recent call last)
<command-911615> in <module>
----> 1 r = stmt.fetchall()
/local_disk0/pythonVirtualEnvDirs/virtualEnv-01515a25-967f-4b98-aa10-6ac03c978ce2/lib/python3.7/site-packages/pyexasol/statement.py in fetchall(self)
85
86 def fetchall(self):
---> 87 return [row for row in self]
88
89 def fetchcol(self):
/local_disk0/pythonVirtualEnvDirs/virtualEnv-01515a25-967f-4b98-aa10-6ac03c978ce2/lib/python3.7/site-packages/pyexasol/statement.py in <listcomp>(.0)
85
86 def fetchall(self):
---> 87 return [row for row in self]
88
89 def fetchcol(self):
/local_disk0/pythonVirtualEnvDirs/virtualEnv-01515a25-967f-4b98-aa10-6ac03c978ce2/lib/python3.7/site-packages/pyexasol/statement.py in __next__(self)
53 if self.pos_total >= self.num_rows_total:
54 if self.result_type != 'resultSet':
---> 55 raise ExaRuntimeError(self.connection, 'Attempt to fetch from statement without result set')
56
57 raise StopIteration
ExaRuntimeError:
(
message => Attempt to fetch from statement without result set
dsn => <IP-ADDRESS>:8563
user => username
schema =>
)
It seems that the type of the returned statement is not 'resultSet' but 'rowCount'. Any help on what I'm doing wrong, or why the statement type is 'rowCount'?

PyEXASOL creator here. Please note that with parallel HTTP transport you have to process the data chunks inside the child processes. Your data set is available in the pd1 DataFrame inside each child.
You should not call .fetchall() in the main process when using parallel processing.
I suggest checking the complete examples, especially example 14 (parallel export).
Hope it helps!
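For illustration, here is a minimal sketch of one way to follow that advice, reusing the ExportProc class from the question: process the chunk inside run() and send only a small per-node summary (here, a row count) back to the parent over a second pipe. The second pipe and the get_result() helper are my own additions for illustration; they are not part of pyexasol or the official example.
class ExportProc(multiprocessing.Process):
    def __init__(self, node):
        self.node = node
        self.read_pipe, self.write_pipe = multiprocessing.Pipe(False)
        # hypothetical extra pipe used only to return a per-node summary
        self.result_read, self.result_write = multiprocessing.Pipe(False)
        super().__init__()

    def start(self):
        super().start()
        self.write_pipe.close()
        self.result_write.close()

    def get_proxy(self):
        return self.read_pipe.recv()

    def get_result(self):
        # called in the parent after c.export_parallel() has returned
        return self.result_read.recv()

    def run(self):
        self.read_pipe.close()
        http = pyexasol.http_transport(self.node['host'], self.node['port'], pyexasol.HTTP_EXPORT)
        self.write_pipe.send(http.get_proxy())
        self.write_pipe.close()

        # all per-chunk processing happens here, in the child process
        pd1 = http.export_to_callback(cb.export_to_pandas, None)
        self.result_write.send(len(pd1))
        self.result_write.close()
In the parent, once c.export_parallel(...) has returned, you would collect the summaries with something like total_rows = sum(p.get_result() for p in pool) and then join the workers, instead of calling stmt.fetchall().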

Related

Update shared dictionary using mpire package

I am working to update a shared dictionary synchronously using the mpire package in Python on a multi-core machine (i.e., parallel processing to update a dict). The environment I am using is a Linux machine with 8 vCPUs and 16 GB memory in Amazon SageMaker. Below is a sample/dummy code snippet I am using for this, but I am unable to make it work. I know I could perhaps use the Process or map methods from the multiprocessing package to accomplish this task; I am just checking whether there is any way I can do it using the mpire package. Any help would be greatly appreciated. Thanks much!
def myFunc(shared_objects, id_val):
    indata, output = shared_objects

    # Temporary store for model output for an input ID
    temp: Dict[str, int] = dict()

    # Filter data for input ID and store output in temp variable
    indata2 = indata.loc[indata['ID']==id_val]
    temp = indata2.groupby(['M_CODE'])['VALUE'].sum().to_dict()

    # store the result .. I want this to happen synchronously
    output[id_val] = temp
#*******************************************************************
if __name__ == '__main__':
    import pandas as pd
    from typing import Dict
    from datetime import datetime
    from mpire import WorkerPool
    from multiprocessing import Manager

    # This is just a sample data
    inputData = pd.DataFrame(dict({'ID':['A', 'B', 'A', 'C', 'A'],
                                   'M_CODE':['AKQ1', 'ALM3', 'BLC4', 'ALM4', 'BLC4'],
                                   'VALUE':[0.75, 1, 1.75, 0.67, 3], }))

    start_time = datetime.now()
    print(start_time, '>> Process started.')

    # Use a shared dict to store results from various workers
    manager = Manager()
    # dict on Manager has no lock at all!
    # https://stackoverflow.com/questions/2936626/how-to-share-a-dictionary-between-multiple-processes-in-python-without-locking
    output: Dict[str, Dict[str, int]] = manager.dict()
    shared_objects = inputData, output

    with WorkerPool(n_jobs=7, shared_objects=shared_objects) as pool:
        results = pool.map_unordered(myFunc, inputData['ID'].unique(), progress_bar=True)

    print(datetime.now(), '>> Process completed -> total time taken:', datetime.now()-start_time)
Below is the error I'm stuck with:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-10-df7d847398a1> in <module>
37
38 with WorkerPool(n_jobs=7, shared_objects=shared_objects) as pool:
---> 39 results = pool.map_unordered(myFunc, inputData['ID'].unique(), progress_bar=True)
40
41 print(datetime.now(), '>> Process completed -> total time taken:', datetime.now()-start_time)
/opt/conda/lib/python3.7/site-packages/mpire/pool.py in map_unordered(self, func, iterable_of_args, iterable_len, max_tasks_active, chunk_size, n_splits, worker_lifespan, progress_bar, progress_bar_position, enable_insights, worker_init, worker_exit, task_timeout, worker_init_timeout, worker_exit_timeout)
418 n_splits, worker_lifespan, progress_bar, progress_bar_position,
419 enable_insights, worker_init, worker_exit, task_timeout, worker_init_timeout,
--> 420 worker_exit_timeout))
421
422 def imap(self, func: Callable, iterable_of_args: Union[Sized, Iterable], iterable_len: Optional[int] = None,
/opt/conda/lib/python3.7/site-packages/mpire/pool.py in imap_unordered(self, func, iterable_of_args, iterable_len, max_tasks_active, chunk_size, n_splits, worker_lifespan, progress_bar, progress_bar_position, enable_insights, worker_init, worker_exit, task_timeout, worker_init_timeout, worker_exit_timeout)
664 # Terminate if exception has been thrown at this point
665 if self._worker_comms.exception_thrown():
--> 666 self._handle_exception(progress_bar_handler)
667
668 # All results are in: it's clean up time
/opt/conda/lib/python3.7/site-packages/mpire/pool.py in _handle_exception(self, progress_bar_handler)
729 # Raise
730 logger.debug("Re-raising obtained exception")
--> 731 raise err(traceback_str)
732
733 def stop_and_join(self, progress_bar_handler: Optional[ProgressBarHandler] = None,
ValueError:
Exception occurred in Worker-0 with the following arguments:
Arg 0: 'A'
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/mpire/worker.py", line 352, in _run_safely
results = func()
File "/opt/conda/lib/python3.7/site-packages/mpire/worker.py", line 288, in _func
_results = func(args)
File "/opt/conda/lib/python3.7/site-packages/mpire/worker.py", line 455, in _helper_func
return self._call_func(func, args)
File "/opt/conda/lib/python3.7/site-packages/mpire/worker.py", line 472, in _call_func
return func(args)
File "<ipython-input-10-df7d847398a1>", line 9, in myFunc
indata2 = indata.loc[indata['ID']==id_val]
File "/opt/conda/lib/python3.7/site-packages/pandas/core/ops/common.py", line 69, in new_method
return method(self, other)
File "/opt/conda/lib/python3.7/site-packages/pandas/core/arraylike.py", line 32, in __eq__
return self._cmp_method(other, operator.eq)
File "/opt/conda/lib/python3.7/site-packages/pandas/core/series.py", line 5502, in _cmp_method
res_values = ops.comparison_op(lvalues, rvalues, op)
File "/opt/conda/lib/python3.7/site-packages/pandas/core/ops/array_ops.py", line 262, in comparison_op
"Lengths must match to compare", lvalues.shape, rvalues.shape
ValueError: ('Lengths must match to compare', (5,), (1,))
[Update]: Here is the code I found to work fine using only the multiprocessing package.
def myFunc(id_val, output, indata):
    # Temporary store for model output for an input ID
    temp: Dict[str, int] = dict()

    # Filter data for input ID and store output in temp variable
    indata2 = indata.loc[indata['ID']==id_val]
    temp = indata2.groupby(['M_CODE'])['VALUE'].sum().to_dict()

    # store the result .. I want this to happen synchronously
    output[id_val] = temp

#*******************************************************************
if __name__ == '__main__':
    import pandas as pd
    from typing import Dict
    from itertools import repeat
    from multiprocessing import Manager
    from datetime import datetime

    # This is just a sample data
    inputData = pd.DataFrame(dict({'ID':['A', 'B', 'A', 'C', 'A'],
                                   'M_CODE':['AKQ1', 'ALM3', 'BLC4', 'ALM4', 'BLC4'],
                                   'VALUE':[0.75, 1, 1.75, 0.67, 3], }))

    start_time = datetime.now()
    print(start_time, '>> Process started.')

    # Use a shared dict to store results from various workers
    with Manager() as manager:
        # dict on Manager has no lock at all!
        # https://stackoverflow.com/questions/2936626/how-to-share-a-dictionary-between-multiple-processes-in-python-without-locking
        output: Dict[str, Dict[str, int]] = manager.dict()

        # Start processes involving n workers
        # Set chunksize to efficiently distribute the tasks across workers so none remains idle as much as possible
        with manager.Pool(processes=7, ) as pool:
            pool.starmap(myFunc,
                         zip(inputData['ID'].unique(), repeat(output), repeat(inputData)),
                         chunksize = max(inputData['ID'].nunique() // (7*4), 1))
        output = dict(output)

    print(datetime.now(), '>> Process completed -> total time taken:', datetime.now()-start_time)
UPDATE:
Now that I better understand the specific issue, I can say it lies in the relationship between mpire.WorkerPool.map_unordered's chunking procedure and the inputs that the pandas .loc comparison expects. Specifically, myFunc receives id_val as a one-element NumPy array such as array(['A'], dtype=object), as detailed in the chunking explanation and the source code. On the other side, indata['ID'] inside the .loc call is a pandas Series. One of the two has to be changed for the comparison to work, and based on what your code is trying to do, id_val can be reduced to just its scalar value, like:
id_val = id_val.item()
indata2 = indata.loc[indata['ID']==id_val]
Making the new MyFunc (which on my machine gets your script to run):
def myFunc(shared_objects, id_val):
    indata, output = shared_objects

    # Keep just the scalar value of id_val
    id_val = id_val.item()

    # Temporary store for model output for an input ID
    temp: Dict[str, int] = dict()

    # Filter data for input ID and store output in temp variable
    indata2 = indata.loc[indata['ID']==id_val]
    temp = indata2.groupby(['M_CODE'])['VALUE'].sum().to_dict()

    # store the result .. I want this to happen synchronously
    output[id_val] = temp
The reason this isn't an issue in your multiprocessing-only solution is that zip iterates over inputData['ID'].unique() the way you expect: it gives only the value, not the value wrapped in an array object. Nice job finding an alternative solution, though!
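As a quick standalone illustration of that difference (my own minimal example, not from the original post): comparing a Series against a one-element array triggers pandas' length check, while comparing against a scalar broadcasts as intended.
import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'A', 'C', 'A'])

# Comparing against a plain scalar broadcasts element-wise and works
print(s == 'A')                              # boolean Series of length 5

# Comparing against a one-element array hits the length check
try:
    s == np.array(['A'], dtype=object)
except ValueError as err:
    print(err)                               # Lengths must match to compare

# .item() extracts the scalar and restores the intended comparison
print(s == np.array(['A'], dtype=object).item())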
The error is occurring in the function line:
indata2 = indata.loc[indata['ID']==id_val]
Per the main error:
File "/opt/conda/lib/python3.7/site-packages/pandas/core/ops/array_ops.py", line 262, in comparison_op "Lengths must match to compare", lvalues.shape, rvalues.shape ValueError: ('Lengths must match to compare', (5,), (1,))
This is an element-wise equality comparison between Series(['A', 'B', 'A', 'C', 'A']).unique() and Series(['A', 'B', 'A', 'C', 'A']), which will never work unless there are no repeated values in 'ID'. I'm not sure what exactly you are trying to do with this statement, but that is certainly the cause of your error.

pexpect sometimes misses a line but logs it

I'm using python3-pexpect 4.8.0-2 from Debian Bullseye to control functions on an embedded device via RS232. This works quite well; however, for two of those functions pexpect sometimes misses an answer. Strangely, the log file written by pexpect itself (child.logfile) always does contain the missing line, so I suspect an error in my own code.
I use fdspawn() for the serial device:
pexpect.run("stty sane raw cstopb -echo -echoe -echok -echoctl -echoke -iexten 115200 -F /dev/ttyS0")
fd = os.open("/dev/ttyS0", os.O_RDWR|os.O_NONBLOCK|os.O_NOCTTY)
self.child = pexpect.fdpexpect.fdspawn(fd, encoding='utf-8')
self.child.logfile = open("serial.log", "w")
There are 2 methods which fail with a timeout error in 5 to 10% of all runs. I stripped one down to the bare minimum necessary to understand it.
def command(self, cmd, timeout, what, errmsg):
    resp = ["ERROR CRC", "monitor: ERROR", what]
    res = ""
    try:
        self.sendline(cmd)
        while True:
            idx = self.child.expect_exact(resp, timeout)
            if idx == 0:
                raise Rs232Error("RS232 error ({})".format(self.child.after))
            elif idx != 1:
                res = self.child.after  # return matched string
                break
            log.debug("Ignoring ERROR")
            pass
    except pexpect.exceptions.TIMEOUT as e:
        raise TimeoutError(errmsg)
    return res

def answer(self, regexp, timeout=2):
    l = ["\[[0-9]+\] {}\r\n".format(regexp), "\[[0-9]+\] ERROR.+\r\n"]
    try:
        idx = self.child.expect(l, timeout)
    except pexpect.exceptions.TIMEOUT as e:
        raise TimeoutError("No answer (last: {})".format(self.child.before))
    s = self.child.after
    m = re.match("\[([0-9]+)\] (.+)\r\n", s)
    t = m.group(2)
    return s

def rs485Test(self, oline, iline):
    cmd = "%s %u" %("T 190", oline)
    data = ""
    self.command("%s %s" %(cmd, data), 5, "Test 190 execution", "Command %s not accepted" %(cmd))
    v = self.answer("RS485-%u Tx: .+" %(oline), 5)  # Check echo
    try:
        v = self.answer("RS485-%u Rx: .+" %(iline), 10)
    except TimeoutError as e:
        return 1, "Receive timed out (%s)" %(str(e))  # testcase failed!
    return 0, v
As written above: the missing line is always logged, but pexpect does not return it, and the TimeoutError message doesn't contain it either (as in "No answer (last: )"). The echo ("RS485-%u Tx") is sent out within 1 ms and is never missed. The received data follows 6-8 ms later and is what goes missing.
How could I debug the issue?
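One way to narrow this down (my own suggestion, using only standard pexpect attributes, not something from the original post): child.logfile records both directions, while logfile_read records only what expect() actually reads from the device and logfile_send only what you write. Splitting the log tells you whether the Rx line ever reaches pexpect's read path, and dumping child.before and child.buffer on timeout shows what was received but not matched. A rough sketch, assuming the same /dev/ttyS0 setup as above:
import os
import pexpect
import pexpect.fdpexpect

fd = os.open("/dev/ttyS0", os.O_RDWR | os.O_NONBLOCK | os.O_NOCTTY)
child = pexpect.fdpexpect.fdspawn(fd, encoding='utf-8')

# Separate the two directions instead of one combined logfile
child.logfile_read = open("serial_read.log", "w")
child.logfile_send = open("serial_send.log", "w")

try:
    child.expect(r"\[[0-9]+\] RS485-0 Rx: .+\r\n", timeout=10)
except pexpect.exceptions.TIMEOUT:
    # On timeout, dump what was read but not matched
    print("before:", repr(child.before))
    print("buffer:", repr(child.buffer))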
Sorry, I forgot an example. The first call (sending via RS485-0 and receiving via RS485-1) was missed by the code above. The second call worked fine. I see no difference.
T 230 0 AB 70 9A C8 3F 44#FC
[8638628] Test 190/230 execution^M
[8638629] RS485-0 Tx: AB 70 9A C8 3F 44 ^M
[8638635] RS485-1 Rx: AB 70 9A C8 3F 44 ^M
T 190 1 60 AC 24 DF 46 AB#09
[8648659] Test 190/230 execution^M
[8648660] RS485-1 Tx: 60 AC 24 DF 46 AB ^M
[8648666] RS485-0 Rx: 60 AC 24 DF 46 AB ^M

Fixing "HEREError: Error occured on __get"

I'm getting started with the herepy EVChargingStationsApi and have been running into some issues that I can't troubleshoot on my own. I have obtained an API key and have tried this so far:
evAPI = EVChargingStationsApi(API_Key)
response = evAPI.get_stations_circular_search(latitude = 37.87166, longitude = -122.2727, radius = 10000)
At this point, I run into the aforementioned HEREError: Error occured on __get (I have included the full error message below).
Please let me know how I can get around this (seemingly trivial) error.
As per the request, here is the error message as text as well:
HEREError Traceback (most recent call last)
<ipython-input-72-e0860df7cdc5> in <module>
----> 1 response = evAPI.get_stations_circular_search(latitude = 37.87166, longitude =-122.2727, radius = 10000)
2 response
/Applications/anaconda3/lib/python3.7/site-packages/herepy/ev_charging_stations_api.py in get_stations_circular_search(self, latitude, longitude, radius, connectortypes)
100 }
101 response = self.__get(
--> 102 self._base_url + "stations.json", data, EVChargingStationsResponse
103 )
104 return response
/Applications/anaconda3/lib/python3.7/site-packages/herepy/ev_charging_stations_api.py in __get(self, base_url, data, response_cls)
41 return response_cls.new_from_jsondict(json_data)
42 else:
---> 43 raise error_from_ev_charging_service_error(json_data)
44
45 def __connector_types_str(self, connector_types: List[EVStationConnectorTypes]):
HEREError: Error occured on __get

Grakn Python client API - How to receive certain attributes for a thing

I am currently diving into the Grakn phone_calls examples for the Python client and am playing around a bit. What I am currently trying to do is get only certain attributes of a Grakn thing. The Python API documentation tells me to use thing.attributes(attribute_types) and states that attribute_types is supposed to be a list of AttributeTypes.
I tried the following, passing a Python list of AttributeTypes:
for attr in answer.attributes([transaction.get_schema_concept('first-name'), transaction.get_schema_concept('phone-number')]):
    print("  {}: {}".format(attr.type().label(), attr.value()))
Resulting in the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-24-277eb53a924b> in <module>()
31 print("{}. {}: {} - Type: {} - Thing: {}".format(i, answer.type().label(), answer.id, answer.is_type(), answer.is_thing()))
32
---> 33 for attr in answer.attributes([transaction.get_schema_concept('first-name'), transaction.get_schema_concept('phone-number'), transaction.get_schema_concept('is-customer')]):
34 print(" {}: {}".format(attr.type().label(), attr.value()))
35 #for role in answer.roles():
~\Anaconda3\envs\grakn\lib\site-packages\grakn\service\Session\Concept\Concept.py in attributes(self, *attribute_types)
481 def attributes(self, *attribute_types):
482 """ Retrieve iterator of this Thing's attributes, filtered by optionally provided attribute types """
--> 483 attrs_req = RequestBuilder.ConceptMethod.Thing.attributes(attribute_types)
484 method_response = self._tx_service.run_concept_method(self.id, attrs_req)
485 from grakn.service.Session.util import ResponseReader
~\Anaconda3\envs\grakn\lib\site-packages\grakn\service\Session\util\RequestBuilder.py in attributes(attribute_types)
549 attributes_req = concept_messages.Thing.Attributes.Req()
550 for attribute_type_concept in attribute_types:
--> 551 grpc_attr_type_concept = RequestBuilder.ConceptMethod._concept_to_grpc_concept(attribute_type_concept)
552 attributes_req.attributeTypes.extend([grpc_attr_type_concept])
553 concept_method_req = concept_messages.Method.Req()
~\Anaconda3\envs\grakn\lib\site-packages\grakn\service\Session\util\RequestBuilder.py in _concept_to_grpc_concept(concept)
205 """ Takes a concept from ConceptHierarcy and converts to GRPC message """
206 grpc_concept = concept_messages.Concept()
--> 207 grpc_concept.id = concept.id
208 base_type_name = concept.base_type
209 grpc_base_type = BaseTypeMapping.name_to_grpc_base_type[base_type_name]
AttributeError: 'list' object has no attribute 'id'
The problem was that I misinterpreted "list of AttributeTypes" as a Python list, when instead one or more AttributeTypes should be passed as separate parameters. For example:
for attr in answer.attributes(transaction.get_schema_concept('first-name'), transaction.get_schema_concept('phone-number'), transaction.get_schema_concept('is-customer')):
    print("  {}: {}".format(attr.type().label(), attr.value()))
I hope that this will be of some help for other Grakn newbies.
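If you do already have the AttributeTypes collected in a Python list, you can still pass that list by unpacking it with *, since the signature shown in the traceback is attributes(self, *attribute_types). A small sketch (the attr_types variable is my own, for illustration):
# hypothetical list of schema concepts built elsewhere
attr_types = [transaction.get_schema_concept('first-name'),
              transaction.get_schema_concept('phone-number'),
              transaction.get_schema_concept('is-customer')]

# unpack the list into separate positional arguments
for attr in answer.attributes(*attr_types):
    print("  {}: {}".format(attr.type().label(), attr.value()))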

Parsing a large space separated file into sqlite

I am trying to parse a large space-separated file (3 GB and larger) into an SQLite database for other processing. The file currently has around 20+ million lines of data. I have tried multithreading this, but for some reason it stops at around 1500 lines and does not proceed. I don't know if I am doing anything wrong. Can someone please point me in the right direction?
The insertion works fine with one process, but it is way too slow (of course!!!). It has been running for over seven hours and it is not even past the first set of strings. The DB file is still 25 MB in size and not even close to the number of records it has to contain.
Please guide me towards speeding this up. I have one more huge file to go (more than 5 GB) and this could take days.
Here’s my code:
import time
import queue
import threading
import sys
import sqlite3 as sql

record_count = 0
DB_INSERT_LOCK = threading.Lock()

def process_data(in_queue):
    global record_count
    try:
        mp_db_connection = sql.connect("sequences_test.sqlite")
        sql_handler = mp_db_connection.cursor()
    except sql.Error as error:
        print("Error while creating database connection: ", error.args[0])
    while True:
        line = in_queue.get()
        # print(line)
        if (line[0] == '#'):
            pass
        else:
            (sequence_id, field1, sequence_type, sequence_count, field2, field3,
             field4, field5, field6, sequence_info, kmer_length, field7, field8,
             field9, field10, field11, field12, field13, field14, field15) = \
                line.expandtabs(1).split(" ")

            info = (field7 + " " + field8 + " " + field9 + " " + field10 + " " +
                    field11 + " " + field12 + " " + field13 + " " + field14 + " "
                    + field15)

            insert_tuple = (None, sequence_id, field1, sequence_type, sequence_count,
                            field2, field3, field4, field5, field6, sequence_info,
                            kmer_length, info)
            try:
                with DB_INSERT_LOCK:
                    sql_string = 'insert into sequence_info \
                                  values (?,?,?,?,?,?,?,?,?,?,?,?,?)'
                    sql_handler.execute(sql_string, insert_tuple)
                    record_count = record_count + 1
                    mp_db_connection.commit()
            except sql.Error as error:
                print("Error while inserting service into database: ", error.args[0])
        in_queue.task_done()

if __name__ == "__main__":
    try:
        print("Trying to open database connection")
        mp_db_connection = sql.connect("sequences_test.sqlite")
        sql_handler = mp_db_connection.cursor()
        sql_string = '''SELECT name FROM sqlite_master \
                        WHERE type='table' AND name='sequence_info' '''
        sql_handler.execute(sql_string)
        result = sql_handler.fetchone()
        if(not result):
            print("Creating table")
            sql_handler.execute('''create table sequence_info
                                   (row_id integer primary key, sequence_id real, field1
                                   integer, sequence_type text, sequence_count real,
                                   field2 integer, field3 text,
                                   field4 text, field5 integer, field6 integer,
                                   sequence_info text, kmer_length text, info text)''')
            mp_db_connection.commit()
        else:
            pass
        mp_db_connection.close()
    except sql.Error as error:
        print("An error has occured.: ", error.args[0])

    thread_count = 4
    work = queue.Queue()

    for i in range(thread_count):
        thread = threading.Thread(target=process_data, args=(work,))
        thread.daemon = True
        thread.start()

    with open("out.txt", mode='r') as inFile:
        for line in inFile:
            work.put(line)

    work.join()

    print("Final Record Count: ", record_count)
The reason I have a lock is that with sqlite I don't currently have a way to batch-commit my inserts into the DB, and hence I have to make sure that the state of the DB is committed every time a thread inserts a record.
I know I am losing some processing time with the expandtabs call in the thick of things, but it is a little difficult to post-process the file I am receiving so that a simple split works on it. I will continue trying to do that so that the workload is reduced, but at the very least I need the multithreading to work.
EDIT:
I moved the expandtabs and split part outside the worker processing. So I process the line and insert it into the queue as a tuple, so that the threads can pick it up and directly insert it into the DB. I was hoping to save quite a bit of time with this, but now I am running into problems with sqlite: it says it could not insert into the DB because it is locked. I am thinking it is more of a thread-sync issue with the locking part, since I have an exclusive lock on the critical section below. Could someone please elaborate on how to resolve this?
I wouldn't expect multithreading to be of much use here. You could instead write a generator function that processes the file into tuples, which you then insert with executemany.
In addition to the previous responses, try the following:
Use executemany on the Connection object.
Use PyPy.
Multithreading will not help you.
The first thing to do is stop committing every single record; according to http://sqlite.org/speed.html that is a factor of roughly 250 in speed.
To avoid losing all your work if you interrupt the run, commit every 10,000 or 100,000 records instead.
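To make those two suggestions concrete, here is a minimal single-threaded sketch (my own illustration, reusing the file, table, and column layout from the question and simplifying the field handling): a generator yields one tuple per data line, executemany inserts a batch at a time, and the connection commits once per batch of 10,000 rows instead of once per record.
import itertools
import sqlite3 as sql

BATCH_SIZE = 10000

def parse_lines(path):
    # Yield one 13-element insert tuple per data line:
    # row_id placeholder, the first 11 fields, and the remaining 9 joined as 'info'.
    with open(path, mode='r') as in_file:
        for line in in_file:
            if line[0] == '#':
                continue
            fields = line.rstrip("\n").expandtabs(1).split(" ")
            yield (None, *fields[:11], " ".join(fields[11:]))

def load(path, db_path="sequences_test.sqlite"):
    connection = sql.connect(db_path)
    rows = parse_lines(path)
    while True:
        # Pull the next batch of tuples from the generator
        batch = list(itertools.islice(rows, BATCH_SIZE))
        if not batch:
            break
        connection.executemany(
            "insert into sequence_info values (?,?,?,?,?,?,?,?,?,?,?,?,?)", batch)
        # One commit per batch instead of one per record
        connection.commit()
    connection.close()

if __name__ == "__main__":
    load("out.txt")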
