Does appengine-mapreduce have a limit on operations? - python

I am working on a project that requires a big knowledge base to be constructed based on word co-occurrences in text. As far as I have researched, a similar approach has not been tried on App Engine. I would like to use App Engine's flexibility and scalability to serve the knowledge base and do reasoning on it for a wide range of users.
So far I have come up with a mapreduce implementation based on the demo app for the pipeline. The source texts are stored in the blobstore as zipped files, each containing a single XML document with a variable number of articles (as many as 30,000).
The first step was to adapt the current BlobstoreZipLineInputReader so that it parses the XML file and retrieves the relevant information from it. The XMLParser class uses the lxml iterparse approach from http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ to retrieve the XML elements to process, and returns an iterator.
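XMLParser itself is not shown here; as a rough sketch of that iterparse pattern (the 'article' tag name and the Article container below are placeholders, not the actual implementation):

import collections
import lxml.etree as etree

# lightweight container matching what the map function expects (an assumption)
Article = collections.namedtuple('Article', ['id', 'body'])

class XMLParser(object):
    def parseXML(self, fileobj, tag='article'):
        """Yield one Article per element without loading the whole file."""
        for _, elem in etree.iterparse(fileobj, tag=tag):
            yield Article(id=elem.get('id'), body=elem.findtext('body') or '')
            # clear processed elements to keep memory flat, as in the IBM article
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]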
The modified class BlobstoreXMLZipLineInputReader has a slightly different next function:
def next(self):
    if not self._filestream:
        if not self._zip:
            self._zip = zipfile.ZipFile(self._reader(self._blob_key))
            self._entries = self._zip.infolist()[self._start_file_index:
                                                 self._end_file_index]
            self._entries.reverse()
        if not self._entries:
            raise StopIteration()
        entry = self._entries.pop()
        parser = XMLParser()
        # the result here is an iterator with the individual articles
        self._filestream = parser.parseXML(self._zip.open(entry.filename))
    try:
        article = self._filestream.next()
        self._article_index += 1
    except StopIteration:
        article = None
    if not article:
        self._filestream.close()
        self._filestream = None
        self._start_file_index += 1
        self._initial_offset = 0
        return self.next()
    return ((self._blob_key, self._start_file_index, self._article_index),
            article)
The map function then receives each of these articles, splits it into sentences, and then splits the sentences into words:
def map_function(data):
    """Word count map function."""
    (entry, article) = data
    for s in split_into_sentences(article.body):
        for w in split_into_words(s.lower()):
            if w not in STOPWORDS:
                yield (w, article.id)
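For illustration, split_into_sentences and split_into_words could be as simple as the following regex-based helpers (a sketch, not necessarily the exact implementation):

import re

def split_into_sentences(text):
    # naive split on sentence-ending punctuation followed by whitespace
    return re.split(r'(?<=[.!?])\s+', text)

def split_into_words(sentence):
    # keep only runs of letters and apostrophes; input is already lower-cased
    return re.findall(r"[a-z']+", sentence)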
And the reducer aggregates the words and joins the ids of the articles in which they appear:
def reduce_function(key, values):
    """Word count reduce function."""
    yield "%s: %s\n" % (key, list(set(values)))
This works beautifully on both the dev server and the live setup, up to around 10,000 texts (there are not that many words in them); it generally takes no more than 10 seconds. The problem appears when it goes a bit over that: mapreduce seems to hang, processing the job continuously. The number of processed items per shard just keeps incrementing, and my write op limits are soon reached.
Q1. Is there somehow a limit in how many map operations the mapreduce pipeline can do before it starts "behaving badly"?
Q2. Would there be a better approach to my problem?
Q3. I know this has been asked before, but can I circumvent the temporary mapreduce datastore writes? They're killing me...
P.S.: here's my main mapreduce call:
class XMLArticlePipeline(base_handler.PipelineBase):
    def run(self, filekey, blobkey):
        output = yield mapreduce_pipeline.MapreducePipeline(
            "process_xml",
            "backend.build_knowledgebase.map_function",
            "backend.build_knowledgebase.reduce_function",
            "backend.build_knowledgebase.BlobstoreXMLZipLineInputReader",
            "mapreduce.output_writers.BlobstoreOutputWriter",
            mapper_params={
                "blob_keys": [blobkey],
            },
            reducer_params={
                "mime_type": "text/plain",
            },
            shards=12)
        yield StoreOutput(filekey, output)
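StoreOutput follows the demo app's pattern of recording where the reducer output ended up; roughly (a simplified sketch, with a made-up model name):

from google.appengine.ext import db
from mapreduce import base_handler

class KnowledgebaseResult(db.Model):
    # hypothetical model that remembers where the output for a source file went
    output_paths = db.StringListProperty()

class StoreOutput(base_handler.PipelineBase):
    def run(self, filekey, output):
        # 'output' is the list of blobstore file paths produced by the
        # BlobstoreOutputWriter in the pipeline above
        KnowledgebaseResult(key_name=filekey, output_paths=list(output)).put()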
EDIT: I get some weird errors on the dev server when running a never-ending job:
[App Instance] [0] [dev_appserver_multiprocess.py:821] INFO Exception in HandleRequestThread
Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver_multiprocess.py", line 819, in run
    HandleRequestDirectly(request, client_address)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver_multiprocess.py", line 957, in HandleRequestDirectly
    HttpServer(), request, client_address)
  File "/usr/local/Cellar/python/2.7.2/lib/python2.7/SocketServer.py", line 310, in process_request
    self.finish_request(request, client_address)
  File "/usr/local/Cellar/python/2.7.2/lib/python2.7/SocketServer.py", line 323, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 2579, in __init__
    BaseHTTPServer.BaseHTTPRequestHandler.__init__(self, *args, **kwargs)
  File "/usr/local/Cellar/python/2.7.2/lib/python2.7/SocketServer.py", line 641, in __init__
    self.finish()
  File "/usr/local/Cellar/python/2.7.2/lib/python2.7/SocketServer.py", line 694, in finish
    self.wfile.flush()
  File "/usr/local/Cellar/python/2.7.2/lib/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 32] Broken pipe

Related

Python: yahoo_fin.stock_info.get_quote_table() not returning table

Goal:
The goal is to build a bot on Replit that iteratively scrapes Yahoo pages like this Amazon page and tracks the dynamic 'Volume' data point for abnormally large changes. I'm currently trying to reliably pull down this exact data point, and I have been using the yahoo_fin API to do so. I have also considered using bs4, but I'm not sure whether it is possible to use BS4 to extract dynamic data. (I'd greatly appreciate it if you happen to know the answer to this: can bs4 extract dynamic data?)
Problem:
The script seems to work, but it does not stay online due to what appears to be an error in yahoo_fin. Usually within around 5 minutes of turning the bot on, it throws the following error:
File "/home/runner/goofy/scrape.py", line 13, in fetchCurrentVolume
table = si.get_quote_table(ticker)
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/yahoo_fin/stock_info.py", line 293, in get_quote_table
tables = pd.read_html(requests.get(site, headers=headers).text)
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/io/html.py", line 1098, in read_html
return _parse(
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/io/html.py", line 926, in _parse
raise retained
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/io/html.py", line 906, in _parse
tables = p.parse_tables()
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/io/html.py", line 222, in parse_tables
tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/io/html.py", line 552, in _parse_tables
raise ValueError("No tables found")
ValueError: No tables found
However, this usually happens after a number of tables have already been found.
Here is the fetchCurrentVolume function:
import yahoo_fin.stock_info as si

def fetchCurrentVolume(ticker):
    table = si.get_quote_table(ticker)
    currentVolume = table['Volume']
    return currentVolume
and the API documentation is found above under Goal. Whenever this error message is displayed, the bot exits a @tasks.loop and goes offline. If you know of a way to fix the current use of yahoo_fin, OR any other way to obtain the dynamic data found at this xpath: '//div[@id="quote-summary"]/div/table/tbody/tr', then you will have pulled me out of a 3-week-long debacle with this issue! Thank you.
If you are able to retrieve some data before it cuts out, it is probably due to a rate limit. Try adding a sleep of a few seconds between each request.
See here for how to use sleep.
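Roughly, something like this (a sketch; the ticker list and the 10-second pause are just illustrative values):

import time
import yahoo_fin.stock_info as si

tickers = ["AMZN"]  # whatever you are tracking

while True:
    for ticker in tickers:
        table = si.get_quote_table(ticker)
        print(ticker, table["Volume"])
        time.sleep(10)  # pause between requests to stay under any rate limit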
Maybe the web server bonks out when the tables are being re-written every so often, or something like that.
If you use a try/except, wait a few seconds, and then try again before bailing out to a failure, maybe that would work if it is just a hiccup once in a while:
import yahoo_fin.stock_info as si
import time

def fetchCurrentVolume(ticker):
    try:
        table = si.get_quote_table(ticker)
        currentVolume = table['Volume']
    except:
        # hopefully this was just a hiccup and it will be back up in 5 seconds
        time.sleep(5)
        table = si.get_quote_table(ticker)
        currentVolume = table['Volume']
    return currentVolume

Python multiprocessing apply_async "assert left > 0" AssertionError

I am trying to load numpy files asynchronously in a Pool:
self.pool = Pool(2, maxtasksperchild=1)
...
nextPackage = self.pool.apply_async(loadPackages, (...))
for fi in np.arange(len(files)):
    packages = nextPackage.get(timeout=30)
    # preload the next package asynchronously. It will be available
    # by the time it is required.
    nextPackage = self.pool.apply_async(loadPackages, (...))
The method "loadPackages":
def loadPackages(... (2 strings & 2 ints) ...):
    print("This isn't printed!")
    packages = {
        "TRUE": np.load(gzip.GzipFile(path1, "r")),
        "FALSE": np.load(gzip.GzipFile(path2, "r"))
    }
    return packages
Before even the first "package" is loaded, the following error occurs:
Exception in thread Thread-8:
Traceback (most recent call last):
  File "C:\Users\roman\Anaconda3\envs\tsc1\lib\threading.py", line 914, in _bootstrap_inner
    self.run()
  File "C:\Users\roman\Anaconda3\envs\tsc1\lib\threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\roman\Anaconda3\envs\tsc1\lib\multiprocessing\pool.py", line 463, in _handle_results
    task = get()
  File "C:\Users\roman\Anaconda3\envs\tsc1\lib\multiprocessing\connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "C:\Users\roman\Anaconda3\envs\tsc1\lib\multiprocessing\connection.py", line 318, in _recv_bytes
    return self._get_more_data(ov, maxsize)
  File "C:\Users\roman\Anaconda3\envs\tsc1\lib\multiprocessing\connection.py", line 337, in _get_more_data
    assert left > 0
AssertionError
I monitor the resources closely: Memory is not an issue, I still have plenty left when the error occurs.
The unzipped files are just plain multidimensional numpy arrays.
Individually, using a Pool with a simpler method works, and loading the files like that works. Only in combination does it fail.
(All this happens in a custom Keras generator. I doubt this helps but who knows.) Python 3.5.
What could the cause of this issue be? How can this error be interpreted?
Thank you for your help!
There is a bug in the Python core C code that prevents data responses bigger than 2 GB from returning correctly to the main thread.
You need to either split the data into smaller chunks, as suggested in the other answer, or not use multiprocessing for this function.
I reported this bug on the Python bug tracker (https://bugs.python.org/issue34563) and created a PR (https://github.com/python/cpython/pull/9027) to fix it, but it will probably take a while to get released (UPDATE: the fix is present in Python 3.8.0+).
If you are interested, you can find more details on what causes the bug in the bug description at the link I posted.
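One way to keep the pool but avoid the oversized response (a sketch with made-up names, not tested against your setup) is to have the worker write the arrays to disk and return only the file path, so only a short string travels back through the pipe:

import gzip
import os
import tempfile

import numpy as np

def loadPackagesToDisk(path1, path2):
    true_arr = np.load(gzip.GzipFile(path1, "r"))
    false_arr = np.load(gzip.GzipFile(path2, "r"))
    fd, out_path = tempfile.mkstemp(suffix=".npz")
    os.close(fd)
    np.savez(out_path, TRUE=true_arr, FALSE=false_arr)
    return out_path  # a small string instead of gigabytes of pickled arrays

# in the parent process:
# out_path = nextPackage.get(timeout=30)
# data = np.load(out_path)
# packages = {"TRUE": data["TRUE"], "FALSE": data["FALSE"]}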
I think I've found a workaround by retrieving data in small chunks. In my case it was a list of lists.
I had:
for i in range(0, NUMBER_OF_THREADS):
    print('MAIN: Getting data from process ' + str(i) + ' proxy...')
    X_train.extend(ListasX[i]._getvalue())
    Y_train.extend(ListasY[i]._getvalue())
    ListasX[i] = None
    ListasY[i] = None
    gc.collect()
Changed to:
CHUNK_SIZE = 1024
for i in range(0, NUMBER_OF_THREADS):
    print('MAIN: Getting data from process ' + str(i) + ' proxy...')
    for k in range(0, len(ListasX[i]), CHUNK_SIZE):
        X_train.extend(ListasX[i][k:k+CHUNK_SIZE])
        Y_train.extend(ListasY[i][k:k+CHUNK_SIZE])
    ListasX[i] = None
    ListasY[i] = None
    gc.collect()
And now it seems to work, possibly by serializing less data at a time.
So maybe if you can segment your data into smaller portions you can overcome the issue. Good luck!

Elasticsearch Python client Reindex Timed Out

I'm trying to reindex using the Elasticsearch python client, using https://elasticsearch-py.readthedocs.org/en/master/helpers.html#elasticsearch.helpers.reindex. But I keep getting the following exception: elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeout
The stacktrace of the error is
Traceback (most recent call last):
  File "~/es_test.py", line 33, in <module>
    main()
  File "~/es_test.py", line 30, in main
    target_index='users-2')
  File "~/ENV/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 306, in reindex
    chunk_size=chunk_size, **kwargs)
  File "~/ENV/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 182, in bulk
    for ok, item in streaming_bulk(client, actions, **kwargs):
  File "~/ENV/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 124, in streaming_bulk
    raise e
elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeout(HTTPSConnectionPool(host='myhost', port=9243): Read timed out. (read timeout=10))
Is there anyway to prevent this exception besides increasing the timeout?
EDIT: python code
from elasticsearch import Elasticsearch, RequestsHttpConnection, helpers

es = Elasticsearch(connection_class=RequestsHttpConnection,
                   host='myhost',
                   port=9243,
                   http_auth=HTTPBasicAuth(username, password),
                   use_ssl=True,
                   verify_certs=True,
                   timeout=600)

helpers.reindex(es, source_index=old_index, target_index=new_index)
I had been suffering from this issue for a couple of days; changing the request_timeout parameter to 30 (i.e. 30 seconds) didn't work.
Finally I had to edit the streaming_bulk and reindex APIs inside the elasticsearch-py helpers module.
Change the chunk_size parameter from the default 500 (which processes 500 documents per batch) to a smaller number of documents per batch. I changed mine to 50, which worked fine for me. No more read timeout errors.
def streaming_bulk(client, actions, chunk_size=50, raise_on_error=True,
                   expand_action_callback=expand_action, raise_on_exception=True,
                   **kwargs):

def reindex(client, source_index, target_index, query=None, target_client=None,
            chunk_size=50, scroll='5m', scan_kwargs={}, bulk_kwargs={}):
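Note that chunk_size is an ordinary keyword argument of helpers.reindex (see the signature above), so passing it at the call site should have the same effect as editing the library defaults; a sketch (index names are examples, auth omitted):

from elasticsearch import Elasticsearch, RequestsHttpConnection, helpers

es = Elasticsearch(connection_class=RequestsHttpConnection,
                   host='myhost', port=9243, timeout=600)

old_index, new_index = 'users-1', 'users-2'

# same helper call as in the question, just with smaller batches per bulk request
helpers.reindex(es, source_index=old_index, target_index=new_index,
                chunk_size=50)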
It may be happening because of an OutOfMemoryError for Java heap space, which means you are not giving Elasticsearch enough memory for what you want to do.
Look in /var/log/elasticsearch to see whether there is any exception like that.
https://github.com/elastic/elasticsearch/issues/2636

py2neo not enforcing uniqueness constraints in Neo4j database

I have a neo4j database with nodes that have labels "Program" and "Session". In the Neo4j database I've enforced a uniqueness constraint on the properties: "name" and "href". From the :schema
Constraints
ON (program:Program) ASSERT program.href IS UNIQUE
ON (program:Program) ASSERT program.name IS UNIQUE
ON (session:Session) ASSERT session.name IS UNIQUE
ON (session:Session) ASSERT session.href IS UNIQUE
I want to periodically query another API (thus storing the name and API endpoint href as properties), and only add new nodes when they're not already in the database.
This is how I'm creating the nodes:
newprogram, = graph_db.create(node(name = programname, href = programhref))
newprogram.add_labels('Program')
newsession, = graph_db.create(node(name = sessionname, href = sessionhref))
newsession.add_labels('Session')
I'm running into the following error:
Traceback (most recent call last):
  File "/Users/jedc/google-cloud-sdk/platform/google_appengine/lib/webapp2-2.5.2/webapp2.py", line 1535, in __call__
    rv = self.handle_exception(request, response, e)
  File "/Users/jedc/google-cloud-sdk/platform/google_appengine/lib/webapp2-2.5.2/webapp2.py", line 1529, in __call__
    rv = self.router.dispatch(request, response)
  File "/Users/jedc/google-cloud-sdk/platform/google_appengine/lib/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
    return route.handler_adapter(request, response)
  File "/Users/jedc/google-cloud-sdk/platform/google_appengine/lib/webapp2-2.5.2/webapp2.py", line 1102, in __call__
    return handler.dispatch()
  File "/Users/jedc/google-cloud-sdk/platform/google_appengine/lib/webapp2-2.5.2/webapp2.py", line 572, in dispatch
    return self.handle_exception(e, self.app.debug)
  File "/Users/jedc/google-cloud-sdk/platform/google_appengine/lib/webapp2-2.5.2/webapp2.py", line 570, in dispatch
    return method(*args, **kwargs)
  File "/Users/jedc/appfolder/applicationapis.py", line 42, in post
    newprogram.add_labels('Program')
  File "/Users/jedc/appfolder/py2neo/util.py", line 99, in f_
    return f(*args, **kwargs)
  File "/Users/jedc/appfolder/py2neo/core.py", line 1638, in add_labels
    if err.response.status_code == BAD_REQUEST and err.cause.exception == 'ConstraintViolationException':
AttributeError: 'ConstraintViolationException' object has no attribute 'exception'
My thought was that if I try to add the nodes and they're already in the database they just won't be added.
I've done a try/except AttributeError block around the creation/add_labels lines, but when I did that I managed to duplicate everything that was already in the database, even though I had the constraints shown. (?!?) (How can py2neo manage to violate those constraints??)
I'm really confused, and would appreciate any help in figuring out how to add nodes only when they don't already exist.
The problem seems to be that you are first creating the nodes without a label and then adding the label afterwards.
That is
graph_db.create(node(name = programname, href = programhref))
and
graph_db.create(node(name = sessionname, href = sessionhref))
This first creates nodes without any labels, which means the nodes satisfy the constraint conditions, since those only apply to nodes with the labels Program and Session.
Once you call newprogram.add_labels('Program') and newsession.add_labels('Session'), Neo4j attempts to add the labels to the nodes and raises an exception because the constraint assertions cannot be met.
That is why py2neo appears to be creating duplicate nodes; if you inspect them, you'll find that one set of nodes has the labels and the other set does not.
Can you use py2neo in a way that it adds the label at the same time as creation?
Otherwise you could use a Cypher query
CREATE (program:Program{name: {programname}, href: {programhref}})
CREATE (session:Session{name: {sessionname}, href: {sessionhref}})
Using Py2neo you should be able to do this as suggested in the docs
graph_db = neo4j.GraphDatabaseService()
qs = '''CREATE (program:Program {name: {programname}, href: {programhref}})
CREATE (session:Session {name: {sessionname}, href: {sessionhref}})'''
query = neo4j.CypherQuery(graph_db, qs)
query.execute(programname=programname, programhref=programhref,
              sessionname=sessionname, sessionhref=sessionhref)
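As an aside (not something the question asked directly, but relevant to "only add new nodes when they're not already in the database"): if your server version supports MERGE, the statement becomes idempotent and simply matches existing nodes instead of violating the constraints. A sketch, continuing from the snippet above:

qs = '''MERGE (program:Program {name: {programname}, href: {programhref}})
MERGE (session:Session {name: {sessionname}, href: {sessionhref}})'''
query = neo4j.CypherQuery(graph_db, qs)
query.execute(programname=programname, programhref=programhref,
              sessionname=sessionname, sessionhref=sessionhref)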
Firstly, the stack trace that you've shown highlights a bug that should be fixed in the latest version of py2neo (1.6.4 at the time of writing). There was an issue whereby the error detail dropped an expected "exception" key; this has now been fixed, so upgrading should give you a better error message.
However, that only addresses the error-reporting bug. In terms of the constraint question itself, it is correct that the node creation and the application of labels are necessarily carried out in two steps. This is due to a limitation in the REST API, which does not provide a direct method for creating a node with label detail.
The next version of py2neo will make this easier/possible in a single step via batching. But for now, you probably want to use a Cypher statement to carry out the creation and labelling, as mentioned in the other answer here.

Threading in a django app

I'm trying to create what I call a scrobbler. The task is to read a Delicious user from a queue, fetch all of their bookmarks and put them in the bookmarks queue. Then something should go through that queue, do some parsing and then store the data in a database.
This obviously calls for threading, because most of the time is spent waiting for Delicious to respond, and then for the bookmarked websites to respond and be passed through some APIs; it would be silly for everything to wait on that.
However I am having trouble with threading and keep getting strange errors like database tables not being defined. Any help is appreciated :)
Here's the relevant code:
# relevant model #
class Bookmark(models.Model):
    account = models.ForeignKey(Delicious)
    url = models.CharField(max_length=4096)
    tags = models.TextField()
    hash = models.CharField(max_length=32)
    meta = models.CharField(max_length=32)

# bookmark queue reading #
def scrobble_bookmark(account):
    try:
        bookmark = Bookmark.objects.all()[0]
    except Bookmark.DoesNotExist:
        return False
    bookmark.delete()
    tags = bookmark.tags.split(' ')
    user = bookmark.account.user
    for concept in Concepts.extract(bookmark.url):
        for tag in tags:
            Concepts.relate(user, concept['name'], tag)
    return True

def scrobble_bookmarks(account):
    semaphore = Semaphore(10)
    for i in xrange(Bookmark.objects.count()):
        thread = Bookmark_scrobble(account, semaphore)
        thread.start()

class Bookmark_scrobble(Thread):
    def __init__(self, account, semaphore):
        Thread.__init__(self)
        self.account = account
        self.semaphore = semaphore

    def run(self):
        self.semaphore.acquire()
        try:
            scrobble_bookmark(self.account)
        finally:
            self.semaphore.release()
This is the error I get:
Exception in thread Thread-65:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 525, in __bootstrap_inner
    self.run()
  File "/home/swizec/Documents/trees/bookmarklet_server/../bookmarklet_server/Scrobbler/Scrobbler.py", line 60, in run
    scrobble_bookmark(self.account)
  File "/home/swizec/Documents/trees/bookmarklet_server/../bookmarklet_server/Scrobbler/Scrobbler.py", line 28, in scrobble_bookmark
    bookmark = Bookmark.objects.all()[0]
  File "/usr/local/lib/python2.6/dist-packages/django/db/models/query.py", line 152, in __getitem__
    return list(qs)[0]
  File "/usr/local/lib/python2.6/dist-packages/django/db/models/query.py", line 76, in __len__
    self._result_cache.extend(list(self._iter))
  File "/usr/local/lib/python2.6/dist-packages/django/db/models/query.py", line 231, in iterator
    for row in self.query.results_iter():
  File "/usr/local/lib/python2.6/dist-packages/django/db/models/sql/query.py", line 281, in results_iter
    for rows in self.execute_sql(MULTI):
  File "/usr/local/lib/python2.6/dist-packages/django/db/models/sql/query.py", line 2373, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python2.6/dist-packages/django/db/backends/sqlite3/base.py", line 193, in execute
    return Database.Cursor.execute(self, query, params)
OperationalError: no such table: Scrobbler_bookmark
PS: all other tests depending on the same table pass with flying colours.
You cannot use threading with in-memory databases (SQLite in this case) in Django; see this bug. It might work with PostgreSQL or MySQL.
I'd recommend something like celeryd instead of threads; message queues are MUCH easier to work with than threading.
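For example, a minimal sketch of that approach with present-day Celery syntax (the app name, broker URL, and the way the account is passed are all illustrative, not taken from your code):

from celery import Celery

app = Celery('scrobbler', broker='amqp://localhost')

@app.task
def scrobble_bookmark_task(account_id):
    # each task runs in a worker process, not in a thread inside the web process,
    # which sidesteps the SQLite/threading problem entirely
    from Scrobbler.Scrobbler import scrobble_bookmark  # the function from the question
    scrobble_bookmark(account_id)

# enqueue one task per bookmark instead of starting threads:
# scrobble_bookmark_task.delay(account.id)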
This calls for a task queue, though not necessarily threads. You'll have your server process, one or several scrobbler processes, and a queue that lets them communicate. The queue can be in the database, or something separate like a beanstalkd. All this has nothing to do with your error, which sounds like your database is just misconfigured.
1) Does the error persist if you use a real database, not SQLite?
2) If you're using threads, you might need to create separate SQL cursors for use in threads.
I think the table really doesn't exist; you have to create it first with an SQL command or some other way. As I have a small database for testing different modules, I just delete the database and recreate it using the syncdb command.
