I am trying to parse a bunch of links and append the parsed data to a sqlite3 database. I am getting errors that the sqlite3 database is locked, so maybe I am using too high a pool value? I tried lowering it to 5, but I still get the errors shown below.
My code basically looks like this:
from multiprocessing import Pool

with Pool(5) as p:
    p.map(parse_link, links)
My real code looks like this:
with Pool(5) as p:
    p.map(Get_FT_OU, file_to_set('links.txt'))

# Where Get_FT_OU(link) appends links to a sqlite3 database.
When the code runs I often get the errors below. Can someone help me fix this?
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/Users/christian/Documents/GitHub/odds/CP_Parser.py", line 166, in Get_FT_OU
cursor.execute(sql_str)
sqlite3.OperationalError: database is locked
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/christian/Documents/GitHub/odds/CP_Parser.py", line 206, in <module>
p.map(Get_FT_OU, file_to_set('links.txt'))
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
sqlite3.OperationalError: database is locked
>>>
I can run the code fine without multiprocessing, and with Pool(2) I also get no errors, but if I go higher I get these errors. I'm using the newest MacBook Air.
Adding timeout=10 to the connection somehow fixed it:
conn = sqlite3.connect(DB_FILENAME, timeout=10)
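For reference, a minimal sketch of what the worker could look like with that change, assuming each Get_FT_OU call opens its own connection; the table name, column names and the parse_link helper are placeholders, not my actual code:

import sqlite3

DB_FILENAME = 'odds.db'   # placeholder path

def Get_FT_OU(link):
    # Each pool worker opens its own connection. timeout=10 tells sqlite3 to
    # wait up to 10 seconds for a competing writer to release the lock
    # instead of raising "database is locked" immediately.
    conn = sqlite3.connect(DB_FILENAME, timeout=10)
    try:
        parsed = parse_link(link)          # placeholder for the real parsing logic
        with conn:                         # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO results (link, data) VALUES (?, ?)",
                (link, parsed),
            )
    finally:
        conn.close()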
I'm getting this exception when using read_rows on a table. The table has rows for features of documents; each document has 300 to 800 features and there are about 2 million documents. The row_key is the feature and the columns are the document ids that have that feature. There are billions of rows.
I'm running this on a 16-CPU VM on GCP and the load averages are between 6 and 10. I'm using the Python Bigtable SDK, Python 3.6.8 and google-cloud-bigtable 2.3.3.
I'm getting this kind of exception when reading the rows using table.read_rows(start_key=foo#xy, end_key=foo#xz). foo#xy and foo#xz come from table.sample_row_keys(). I get 200 prefixes from sample_row_keys and I successfully process the first 5 or so before I get this error. I'm running the table.read_rows() call in a ThreadPool.
If you've encountered an exception like this and investigated it, what was the cause of it and what did you do to prevent it?
Traceback (most recent call last):
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/google/api_core/grpc_helpers.py", line 106, in __next__
return next(self._wrapped)
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/grpc/_channel.py", line 426, in __next__
return self._next()
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/grpc/_channel.py", line 809, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.ABORTED
details = "Error while reading table 'projects/arxiv-production/instances/docsim/tables/docsim' : Response was not consumed in time; terminating connection.(Possible causes: slow client data read or network problems)"
debug_error_string = "{"created":"#1635477504.521060666","description":"Error received from peer ipv4:172.217.0.42:443","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Error while reading table 'projects/arxiv-production/instances/docsim/tables/docsim' : Response was not consumed in time; terminating connection.(Possible causes: slow client data read or network problems)","grpc_status":10}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/bdc34/docsim/docsim/loading/all_common_hashes.py", line 53, in <module>
for hash, n, c, dt in pool.imap_unordered( do_prefix, jobs ):
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 735, in next
raise value
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/bdc34/docsim/docsim/loading/all_common_hashes.py", line 33, in do_prefix
for hash, common, papers in by_prefix(db, start, end):
File "/home/bdc34/docsim/docsim/loading/all_common_hashes.py", line 15, in by_prefix
for row in db.table.read_rows(start_key=start, end_key=end):
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/google/cloud/bigtable/row_data.py", line 485, in __iter__
response = self._read_next_response()
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/google/cloud/bigtable/row_data.py", line 474, in _read_next_response
return self.retry(self._read_next, on_error=self._on_error)()
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/google/api_core/retry.py", line 288, in retry_wrapped_func
on_error=on_error,
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/google/api_core/retry.py", line 190, in retry_target
return target()
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/google/cloud/bigtable/row_data.py", line 470, in _read_next
return six.next(self.response_iterator)
File "/home/bdc34/docsim/venv/lib64/python3.6/site-packages/google/api_core/grpc_helpers.py", line 109, in __next__
raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.Aborted: 409 Error while reading table 'projects/testproject/instances/testinstance/tables/testtable' :
Response was not consumed in time; terminating connection.(Possible causes: slow client data read or network problems)
There could be different causes for this error. First, make sure you are not facing a hotspotting scenario here.
Also, check whether you're reading many different rows in your table and that you are creating as few clients as possible. Performance can also suffer if you are reading a large range of row keys that contains only a small number of rows. You'll find more general advice on troubleshooting performance issues here.
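On the "as few clients as possible" point, here is a minimal sketch (the project, instance and table names and the ranges variable are placeholders) of building one client and one Table object up front and sharing them across the ThreadPool workers instead of constructing a new client per task:

from multiprocessing.pool import ThreadPool
from google.cloud import bigtable

# Build a single client and table once and share them across all threads.
client = bigtable.Client(project='my-project')
table = client.instance('my-instance').table('my-table')

def do_range(key_range):
    start, end = key_range
    keys = []
    for row in table.read_rows(start_key=start, end_key=end):
        keys.append(row.row_key)
    return keys

with ThreadPool(8) as pool:
    for keys in pool.imap_unordered(do_range, ranges):   # ranges: list of (start, end) pairs
        print(len(keys))                                  # process each chunk of results here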
I worked around this by calling read_rows with much smaller ranges. The prefixes from table.sample_row_keys() were spanning around 1.5B rows; bisecting each range 5 times to produce smaller ranges worked.
I bisected by padding the start and end row_keys out to the same length, converting those to ints and finding the midpoint.
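In case it helps, a rough sketch of that bisection, assuming the row keys are non-empty byte strings (the helper name is mine, not something from the Bigtable SDK):

def bisect_key_range(start_key, end_key, depth=5):
    # Recursively split a row-key range into 2**depth smaller ranges.
    if depth == 0:
        return [(start_key, end_key)]
    # Pad both keys to the same length so they compare as equal-width integers.
    width = max(len(start_key), len(end_key))
    lo = int.from_bytes(start_key.ljust(width, b'\x00'), 'big')
    hi = int.from_bytes(end_key.ljust(width, b'\x00'), 'big')
    mid_key = ((lo + hi) // 2).to_bytes(width, 'big')
    return (bisect_key_range(start_key, mid_key, depth - 1)
            + bisect_key_range(mid_key, end_key, depth - 1))

# Each of the smaller ranges is then passed to
# table.read_rows(start_key=..., end_key=...) as before.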
I am using a Python module called phoenixdb to access Apache Phoenix, which is a SQL wrapper for querying HBase.
Here is my code snippet:
import phoenixdb
database_url = 'http://localhost:8765/'
conn = phoenixdb.connect(database_url, autocommit=True)
cursor = conn.cursor()
cursor.execute("!table")
print cursor.fetchall()
cursor.close()
The Phoenix query to list all the schemas and tables is !table or !tables.
But when I pass the same thing to the execute function as shown above, I get the following error:
Traceback (most recent call last):
File "phoenix_hbase.py", line 7, in <module>
cursor.execute("!table")
File "build/bdist.linux-x86_64/egg/phoenixdb/cursor.py", line 242, in execute
File "build/bdist.linux-x86_64/egg/phoenixdb/avatica.py", line 345, in prepareAndExecute
File "build/bdist.linux-x86_64/egg/phoenixdb/avatica.py", line 184, in _apply
File "build/bdist.linux-x86_64/egg/phoenixdb/avatica.py", line 90, in parse_error_page
phoenixdb.errors.ProgrammingError: ("Syntax error. Unexpected char: '!'", 601, '42P00', None)
The funny part is that when I pass a different query, for example a SELECT query, the script executes and produces the result just fine.
Query: cursor.execute("select * from CARETESTING.EDR_BIN_SOURCE_3600_EDR_FLOW_SUBS_TOTAL limit 1")
Result:
[[2045,1023,4567]]
Is there any other way to pass !table (the equivalent of SHOW TABLES) to the phoenixdb library's execute function that I am missing?
I tried looking this up on the internet but unfortunately haven't come across anything helpful so far.
Thanks
!tables is sqlline grammar that cannot be parsed by the JDBC interface.
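If a plain SQL alternative is acceptable, one option (my suggestion, not something phoenixdb documents as an equivalent of !tables) is to query Phoenix's SYSTEM.CATALOG metadata table instead:

import phoenixdb

conn = phoenixdb.connect('http://localhost:8765/', autocommit=True)
cursor = conn.cursor()
# SYSTEM.CATALOG holds Phoenix metadata; the distinct schema/table pairs
# give roughly the same listing that sqlline's !tables prints.
cursor.execute("SELECT DISTINCT TABLE_SCHEM, TABLE_NAME FROM SYSTEM.CATALOG")
print(cursor.fetchall())
cursor.close()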
I have the following working code (Python 3.5) which uses concurrent.futures to parse files in a threaded manner and then does some post-processing on the results when they come back (in any order).
from concurrent import futures

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    # A dictionary keyed by the futures, with the filename as the value
    jobs = {}

    # Loop through the files and submit the parse function for each one,
    # passing it the filename along with the kwargs in parser_variables.
    # The results can come back in any order.
    for this_file in files_list:
        job = executor.submit(parse_log_file.parse, this_file, **parser_variables)
        jobs[job] = this_file

    # Handle the completed jobs whenever they are done
    for job in futures.as_completed(jobs):
        debug.checkpointer("Multi-threaded Parsing File finishing")
        # Fetch the job's result (job.result()) and the file it was based on (jobs[job])
        result_content = job.result()
        this_file = jobs[job]
I want to convert this to use processes instead of threads because threads don't offer any speedup. In theory I just need to change ThreadPoolExecutor into ProcessPoolExecutor.
The problem is, if I do that I get this exception:
Process Process-2:
Traceback (most recent call last):
File "C:\Python35\lib\multiprocessing\process.py", line 254, in _bootstrap
self.run()
File "C:\Python35\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "C:\Python35\lib\concurrent\futures\process.py", line 169, in _process_worker
call_item = call_queue.get(block=True)
File "C:\Python35\lib\multiprocessing\queues.py", line 113, in get
return ForkingPickler.loads(res)
TypeError: Required argument 'fileno' (pos 1) not found
Traceback (most recent call last):
File "c:/myscript/main.py", line 89, in <module>
main()
File "c:/myscript/main.py", line 59, in main
system_counters = process_system(system, filename)
File "c:\myscript\per_system.py", line 208, in process_system
system_counters = process_filelist(**file_handling_variables)
File "c:\myscript\per_logfile.py", line 31, in process_filelist
results_list = job.result()
File "C:\Python35\lib\concurrent\futures\_base.py", line 398, in result
return self.__get_result()
File "C:\Python35\lib\concurrent\futures\_base.py", line 357, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I think that this might have something to do with pickling, but googling for the error hasn't found anything.
How do I convert the above to use multiple processes?
It turns out this is because one of the things I'm passing inside parser_variables is a class (a reader from a third-party module). If I remove the class, the above works fine.
For whatever reason, pickle doesn't seem to be able to handle this particular object.
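One common workaround (my suggestion given that constraint, not part of the original answer) is to stop passing the reader object through parser_variables and instead pass whatever is needed to build it, constructing it inside the worker process. make_reader() and the reader= keyword are hypothetical stand-ins for however your parser actually obtains its reader:

from concurrent import futures

def parse_wrapper(this_file, reader_config, **parser_variables):
    reader = make_reader(reader_config)   # built fresh inside the worker process
    return parse_log_file.parse(this_file, reader=reader, **parser_variables)

if __name__ == '__main__':               # required for ProcessPoolExecutor on Windows
    with futures.ProcessPoolExecutor(max_workers=4) as executor:
        jobs = {executor.submit(parse_wrapper, f, reader_config, **parser_variables): f
                for f in files_list}
        for job in futures.as_completed(jobs):
            result_content = job.result()
            this_file = jobs[job]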
Today I updated Elasticsearch from 1.6 to 2.1 because 1.6 is a vulnerable version. After this update my website is not working and gives this error:
Traceback (most recent call last):
File "manage.py", line 8, in <module>
from app import app, db
File "/opt/project/app/__init__.py", line 30, in <module>
es.create_index(app.config['ELASTICSEARCH_INDEX'])
File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 93, in decorate
return func(*args, query_params=query_params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 1033, in create_index
query_params=query_params)
File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 285, in send_request
self._raise_exception(status, error_message)
File "/usr/local/lib/python2.7/dist-packages/pyelasticsearch/client.py", line 299, in _raise_exception
raise error_class(status, error_message)
pyelasticsearch.exceptions.ElasticHttpError: (400, u'index_already_exists_exception')
make: *** [run] Error 1
The code is this:
redis = Redis()
es = ElasticSearch(app.config['ELASTICSEARCH_URI'])

try:
    es.create_index(app.config['ELASTICSEARCH_INDEX'])
except IndexAlreadyExistsError, e:
    pass
Where is this wrong? What is new in this version?
You're getting the following error: index_already_exists_exception
This means that you're trying to create an index that already exists. The second time you run your program, you either need to delete your index first or create it only if it doesn't already exist.
You have handled the exception using IndexAlreadyExistsError. Try using TransportError to handle the exception instead.
You can also add a check like:
exist = es.indices.exists(index_name)
if not exist:
    es.create_index(app.config['ELASTICSEARCH_INDEX'])
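If you stay on pyelasticsearch, here is a sketch that matches the exception type actually shown in the traceback; the string check on the error message is an assumption about how pyelasticsearch wraps the 2.x response, so adjust as needed:

from pyelasticsearch import ElasticSearch
from pyelasticsearch.exceptions import ElasticHttpError

es = ElasticSearch(app.config['ELASTICSEARCH_URI'])
try:
    es.create_index(app.config['ELASTICSEARCH_INDEX'])
except ElasticHttpError as e:
    # Elasticsearch 2.x reports the conflict as index_already_exists_exception,
    # which pyelasticsearch raises as a generic ElasticHttpError (400) rather
    # than IndexAlreadyExistsError.
    if 'index_already_exists_exception' not in str(e):
        raise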
I tried to reproduce it with some simpler functions but didn't succeed, so the following code shows the relevant methods for a KeyError which gets thrown by our production servers a lot.
class PokerGame:
    ...
    def serialsNotFold(self):
        return filter(lambda x: not self.serial2player[x].isFold(), self.player_list)

    def playersNotFold(self):
        return [self.serial2player[serial] for serial in self.serialsNotFold()]
    ...
And here is the Traceback.
Traceback (most recent call last):
File "/usr/lib/python2.6/dist-packages/pokernetwork/pokertable.py", line 945, in update
try: self.game.historyReduce()
File "/usr/lib/python2.6/dist-packages/pokerengine/pokergame.py", line 3949, in historyReduce
self.turn_history = PokerGame._historyReduce(self.turn_history,self.moneyMap())
File "/usr/lib/python2.6/dist-packages/pokerengine/pokergame.py", line 1323, in moneyMap
money = dict((player.serial,player.money) for player in self.playersNotFold())
File "/usr/lib/python2.6/dist-packages/pokerengine/pokergame.py", line 3753, in playersNotFold
return [self.serial2player[serial] for serial in self.serialsNotFold()]
KeyError: 21485L
self.player_list is a list of serials.
self.serial2player is a dict which maps serials to Player objects.
Now it shouldn't be possible for the KeyError to be raised in playersNotFold, because then the same error would have to be raised in serialsNotFold first, which doesn't happen.
I asked two of my peers and the guys on #python, but no one was able to even guess how this can happen.
If you need the full source: https://github.com/pokermania/poker-network/
EDIT:
The problem was that we printed traceback.format_exc(limit=4), which limits from the top of the stack instead of the bottom. The last two calls were hidden, so it looked like playersNotFold raised the exception.
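A tiny demo of that behaviour (my own illustration, not project code): with nested calls, limit keeps the outermost stack entries and drops the innermost ones, so the frame that actually raised disappears from the printout.

import traceback

def a(): b()
def b(): c()
def c(): raise KeyError(21485)

try:
    a()
except KeyError:
    # limit=2 keeps only the two outermost frames (<module> and a), so the
    # frames for b() and c() -- where the error really happened -- are hidden.
    print(traceback.format_exc(limit=2))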
Here is a full trace.
Traceback (most recent call last):
File "/usr/lib/python2.7/pokernetwork/pokertable.py", line 704, in update
try: self.game.historyReduce()
File "/usr/lib/python2.7/pokerengine/pokergame.py", line 3953, in historyReduce
self.turn_history = PokerGame._historyReduce(self.turn_history,self.moneyMap())
File "/usr/lib/python2.7/pokerengine/pokergame.py", line 1327, in moneyMap
money = dict((player.serial,player.money) for player in self.playersNotFold())
File "/usr/lib/python2.7/pokerengine/pokergame.py", line 3757, in playersNotFold
return [self.serial2player[serial] for serial in self.serialsNotFold()]
File "/usr/lib/python2.7/pokerengine/pokergame.py", line 3754, in serialsNotFold
return filter(lambda x: not self.serial2player[x].isFold(), self.player_list)
File "/usr/lib/python2.7/pokerengine/pokergame.py", line 3754, in <lambda>
return filter(lambda x: not self.serial2player[x].isFold(), self.player_list)
KeyError: 1521
I'm sorry for wasting your time :/
I guess you use threads and self.serial2player gets modified by a different thread.
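If it is indeed a cross-thread modification, here is a minimal sketch of one way to guard the shared state (my own illustration, not code from the project; every writer that mutates player_list or serial2player would need to take the same lock):

import threading

class PokerGame:
    def __init__(self):
        self.player_list = []
        self.serial2player = {}
        # RLock so playersNotFold can call serialsNotFold while holding it
        self._lock = threading.RLock()

    def serialsNotFold(self):
        with self._lock:
            return [s for s in self.player_list
                    if not self.serial2player[s].isFold()]

    def playersNotFold(self):
        with self._lock:
            return [self.serial2player[s] for s in self.serialsNotFold()]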