Elasticsearch for Python - Calls not blocking correctly

I'm trying to write unittests for my own Elasticsearch client. It uses the client from elasticsearch-py.
Most of my tests are fine, but when running a test on my own search() function (which uses the search() function from the Elasticsearch client) I get very random behaviour. This is how my test is implemented:
def setUp(self) -> None:
    self.es = ESClient(host="localhost")
    self.es_acc = ESClient()
    self.connection_res = (False, {})
    self.t = self.es_acc.get_connection_status(self._callback)
    self.t.join()

    # Create test index and index some documents
    self.es.create_index(self.TEST_INDEX)
    names = ["Gregor", "Alice", "Per Svensson", "Mats Hermelin", "Mamma Mia",
             "Eva Dahlgren", "Per Morberg", "Maja Larsson", "Ola Salo", "Magrecievic Holagrostokovic"]
    self.num_docs = len(names)
    self.payload = []
    random.seed(123)
    for i, name in enumerate(names):
        n = name.split(" ")
        fname = n[0]
        lname = n[1] if len(n) > 1 else n[0]
        self.payload.append({"name": {"first": fname, "last": lname}, "age": random.randint(-100, 100),
                             "timestamp": datetime.utcnow() - timedelta(days=1 * i)})
    self.es.upload(self.TEST_INDEX, self.payload, ids=list(range(len(names))))
def test_search(self):
    # Test getting docs based on ids
    ids = ["1", "4", "9"]
    status, hits = self.es.search(self.TEST_INDEX, ids=ids)  # Breakpoint
    docs = hits["hits"]["hits"]
    self.assertTrue(status, "Status not correct for search!")
    returned_ids = [d["_id"] for d in docs]
    names = [d["_source"]["name"] for d in docs]
    self.assertListEqual(sorted(returned_ids), ids, "Returned ids from search not correct!")
    self.assertListEqual(names, [self.payload[i]["name"] for i in [1, 4, 9]],
                         "Returned source from search not correct!")
In setUp() I'm just uploading a few documents to test on, so there should always be 10 documents to test on. Below is an excerpt from my search() function.
if ids:
    try:
        q = Query().ids(ids).compile_and_get()
        res = self.es.search(index=index, body=q)
        print(res)
        return True, res
    except exceptions.ElasticsearchException as e:
        self._handle_elastic_exceptions("search", e, index=index)
        return False, {}
Query is a class I've implemented myself. Anyway, when I just run the test, I ALMOST always get 0 hits. But if I debug the application, with a breakpoint in test_search() on the row where I make the call to search(), and step over it, everything works fine. If I put the breakpoint just one line below, I get 0 hits again. What is going on? Why is it not blocking correctly?

It seems like I found my solution!
I did not understand that setUp is called before every test method. That was actually not the problem, however.
The problem is that for some tests, uploading the documents (which is done in setUp) simply took too much time, so when the test started the documents did not exist yet. Solution: add sleep(1) to the end of setUp.
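As a minimal sketch, the fix amounts to nothing more than this (the one-second pause is arbitrary; a longer delay, or explicitly refreshing the index if your client exposes a way to do so, may be needed depending on your setup):

from time import sleep

def setUp(self) -> None:
    ...  # everything shown above stays the same
    self.es.upload(self.TEST_INDEX, self.payload, ids=list(range(len(names))))
    sleep(1)  # give Elasticsearch a moment before the tests start querying the new documents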


Python3.6 function not called

I'm developing a REST API in Python 3.6 using Flask-Rebar and PostgreSQL and am having trouble trying to execute some queries simultaneously using psycopg2.
More specifically, I execute a query and require the id value from this query for use in the next query. The first query successfully returns the expected value, however the function that calls the subsequent query doesn't even execute.
Here is the function responsible for calling the query function:
psql = PostgresHandler()
user_ids = [1, 5, 9]
horse = {"name": "Adam", "age": 400}

def createHorseQuery(user_ids, horse):
    time_created = strftime("%Y-%m-%dT%H:%M:%SZ")
    fields, values = list(), list()
    for key, val in horse.items():
        fields.append(key)
        values.append(val)
    fields.append('time_created')
    values.append(time_created)
    fields = str(fields).replace('[', '(').replace(']', ')').replace("'", "")
    values = str(values).replace('[', '(').replace(']', ')')
    create_horse_query = f"INSERT INTO horse {fields} VALUES {values} RETURNING horse_id;"
    horse_id = None
    for h_id in psql.queryDatabase(create_horse_query, returnInsert=True):
        horse_id = h_id
    link_user_query = ''
    for u_id in user_ids:
        link_user_query += f"INSERT INTO user_to_horse (user_id, horse_id) VALUES ({u_id}, {horse_id['horse_id']});"
    psql.queryDatabase(link_user_query)
    return horse_id, 201
Here is the PostgresHandler() class that contains the function queryDatabase:
class PostgresHandler(object):
    def __init__(self):
        self.connectToDatabase()

    def connectToDatabase(self):
        self.connection = psycopg2.connect(
            host='...',
            user='...',
            password='...',
            database='...'
        )

    def queryDatabase(self, query, returnInsert=False):
        cursor = self.connection.cursor(cursor_factory=RealDictCursor)
        cursor.execute(query)
        if "SELECT" in query.upper():
            for result in cursor.fetchall():
                yield result
        elif "INSERT" in query.upper():
            if returnInsert:
                for result in cursor.fetchall():
                    yield result
            self.connection.commit()
        cursor.close()
I can verify that the psql.queryDatabase(create_horse_query, returnInsert=True) operation is successful by querying the database manually and comparing against the return value, h_id.
I can verify that link_user_query is created and contains the user_ids and horse_id as expected by printing. I know the query that's generated is okay as I have tested this manually in the database.
It appears that the function called on the line psql.queryDatabase(link_user_query) is never actually called as a print statement at the very top of the queryDatabase function does not get executed.
I've tried adding delays between the two query function calls, initialising a new connection with each function call, and many other things to no avail, and I am absolutely stumped. Any insight is greatly appreciated.
EDIT: FYI, The createHorseQuery function returns successfully and displays the two returned values as expected.
queryDatabase in your code is a generator because it contains a yield statement. The generator only actually does things when you iterate over it (i.e. cause __next__() to be called). Consider the following:
def gen():
    print("Gen is running!")
    yield "Gen yielded: hello"
    print("Gen did: commit")

print("***Doing stuff with b***")
b = gen()
for a in b:
    print(a)

print("***Doing stuff with c***")
c = gen()

print("***Done***")
Output is:
***Doing stuff with b***
Gen is running!
Gen yielded: hello
Gen did: commit
***Doing stuff with c***
***Done***
When we called gen() to create c we didn't actually run it, we just instantiated it as a generator.
We could force it to run by calling __next__() on it a bunch of times:
c.__next__()
try:
    c.__next__()
except StopIteration:
    print("Iteration is over!")
outputs:
Gen is running!
Gen did: commit
Iteration is over!
But really, you should probably not use a generator like this where you are never intending to yield from it. You could consider adding a new function which is not a generator called insertSilently (or similar).
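A minimal sketch of what such a method could look like, assuming the same PostgresHandler class as above (the name insertSilently and its behaviour are illustrative assumptions, not part of the original code):

    def insertSilently(self, query):
        # A plain method, not a generator: calling it runs the INSERT immediately
        cursor = self.connection.cursor(cursor_factory=RealDictCursor)
        cursor.execute(query)
        self.connection.commit()
        cursor.close()

The call psql.queryDatabase(link_user_query) in createHorseQuery would then become psql.insertSilently(link_user_query) and would actually execute.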

Getting wrong result from JSON - Python 3

I'm working on a small project that retrieves information about books from the Google Books API using Python 3. For this I make a call to the API, read out the variables and store them in a list. For a search like "linkedin" this works perfectly. However, when I enter "Google", it reads the second title from the JSON response instead of the first. How can this happen?
Please find my code below (Google_Results is the class I use to initialize the variables):
import requests

def Book_Search(search_term):
    parms = {"q": search_term, "maxResults": 3}
    r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
    print(r.url)
    results = r.json()
    i = 0
    for result in results["items"]:
        try:
            isbn13 = str(result["volumeInfo"]["industryIdentifiers"][0]["identifier"])
            isbn10 = str(result["volumeInfo"]["industryIdentifiers"][1]["identifier"])
            title = str(result["volumeInfo"]["title"])
            author = str(result["volumeInfo"]["authors"])[2:-2]
            publisher = str(result["volumeInfo"]["publisher"])
            published_date = str(result["volumeInfo"]["publishedDate"])
            description = str(result["volumeInfo"]["description"])
            pages = str(result["volumeInfo"]["pageCount"])
            genre = str(result["volumeInfo"]["categories"])[2:-2]
            language = str(result["volumeInfo"]["language"])
            image_link = str(result["volumeInfo"]["imageLinks"]["thumbnail"])
            dict = Google_Results(isbn13, isbn10, title, author, publisher, published_date,
                                  description, pages, genre, language, image_link)
            gr.append(dict)
            print(gr[i].title)
            i += 1
        except:
            pass
    return

gr = []
Book_Search("Linkedin")
I am a beginner to Python, so any help would be appreciated!
It does so because there is no publisher entry in volumeInfo of the first entry, thus it raises a KeyError and your except captures it. If you're going to work with fuzzy data you have to account for the fact that it will not always have the expected structure. For simple cases you can rely on dict.get() and its default argument to return a 'valid' default entry if an entry is missing.
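For instance, a one-line hedged example using the field that triggers the error here:

publisher = result["volumeInfo"].get("publisher", "Unknown publisher")  # no KeyError if the key is missing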
Also, there are a few conceptual problems with your function - it relies on a global gr, which is bad design; it shadows the built-in dict type; and it captures all exceptions, guaranteeing that you cannot exit your code even with a SIGINT... I'd suggest converting it to something a bit more sane:
def book_search(search_term, max_results=3):
    results = []  # a list to store the results
    parms = {"q": search_term, "maxResults": max_results}
    r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
    try:  # just in case the server doesn't return valid JSON
        for result in r.json().get("items", []):
            if "volumeInfo" not in result:  # invalid entry - missing volumeInfo
                continue
            result_dict = {}  # a dictionary to store our discovered fields
            result = result["volumeInfo"]  # all the data we're interested in is in volumeInfo
            isbns = result.get("industryIdentifiers", None)  # capture ISBNs
            if isinstance(isbns, list) and isbns:
                for i, t in enumerate(("isbn10", "isbn13")):
                    if len(isbns) > i and isinstance(isbns[i], dict):
                        result_dict[t] = isbns[i].get("identifier", None)
            result_dict["title"] = result.get("title", None)
            authors = result.get("authors", None)  # capture authors
            if isinstance(authors, list) and len(authors) > 2:  # you're slicing from 2
                result_dict["author"] = str(authors[2:-2])
            result_dict["publisher"] = result.get("publisher", None)
            result_dict["published_date"] = result.get("publishedDate", None)
            result_dict["description"] = result.get("description", None)
            result_dict["pages"] = result.get("pageCount", None)
            genres = result.get("categories", None)  # capture genres
            if isinstance(genres, list) and len(genres) > 2:  # since you're slicing from 2
                result_dict["genre"] = str(genres[2:-2])
            result_dict["language"] = result.get("language", None)
            result_dict["image_link"] = result.get("imageLinks", {}).get("thumbnail", None)
            # make sure Google_Results accepts keyword arguments like title, author...
            # and make them optional as they might not be in the returned result
            gr = Google_Results(**result_dict)
            results.append(gr)  # add it to the results list
    except ValueError:
        return None  # invalid response returned, you may raise an error instead
    return results  # return the results
Then you can easily retrieve as much info as possible for a term:
gr = book_search("Google")
And it will be far more tolerant of data omissions, provided that your Google_Results type makes most of the entries optional.
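As a purely illustrative sketch, such a Google_Results class might look like this (the real class is your own; every field name here is an assumption taken from the result_dict keys above):

class Google_Results:
    def __init__(self, title=None, author=None, publisher=None, published_date=None,
                 description=None, pages=None, genre=None, language=None,
                 image_link=None, isbn10=None, isbn13=None):
        # every field is optional, so entries missing from the JSON simply stay None
        self.title = title
        self.author = author
        self.publisher = publisher
        self.published_date = published_date
        self.description = description
        self.pages = pages
        self.genre = genre
        self.language = language
        self.image_link = image_link
        self.isbn10 = isbn10
        self.isbn13 = isbn13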
Following #Coldspeed's recommendation it became clear that missing information in the JSON response caused the exception to be raised. Since I only had a "pass" statement there, the entire result was skipped. I will therefore have to adapt the try/except statements so errors get handled properly.
Thanks for the help guys!

BigQuery async query job - the fetch_results() method returns wrong number of values

I am writing Python code with the BigQuery Client API, and attempting to use the async query code (shown everywhere as a code sample), but it is failing at the fetch_data() method call. Python errors out with:
ValueError: too many values to unpack
So, the 3 return values (rows, total_count, page_token) seem to be the incorrect number of return values. But, I cannot find any documentation about what this method is supposed to return -- besides the numerous code examples that only show these 3 return results.
Here is a snippet of code that shows what I'm doing (not including the initialization of the 'client' variable or the imported libraries, which happen earlier in my code).
#---> Set up and start the async query job
job_id = str(uuid.uuid4())
job = client.run_async_query(job_id, query)
job.destination = temp_tbl
job.write_disposition = 'WRITE_TRUNCATE'
job.begin()
print 'job started...'

#---> Monitor the job for completion
retry_count = 360
while retry_count > 0 and job.state != 'DONE':
    print 'waiting for job to complete...'
    retry_count -= 1
    time.sleep(1)
    job.reload()

if job.state == 'DONE':
    print 'job DONE.'
    page_token = None
    total_count = None
    rownum = 0
    job_results = job.results()
    while True:
        # ---- Next line of code errors out...
        rows, total_count, page_token = job_results.fetch_data(max_results=10, page_token=page_token)
        for row in rows:
            rownum += 1
            print "Row number %d" % rownum
        if page_token is None:
            print 'end of batch.'
            break
What are the specific return results I should expect from the job_results.fetch_data(...) method call on an async query job?
Looks like you are right! The code no longer returns these 3 values.
As you can see in this commit from the public repository, fetch_data now returns an instance of the HTTPIterator class (I guess I didn't realize this before, as I have a Docker image with an older version of the BigQuery client installed, where it does return the 3 values).
The only way that I found to return the results was doing something like this:
iterator = job_results.fetch_data()
data = []
for page in iterator._page_iter(False):
    data.extend([page.next() for i in range(page.num_items)])
Notice that now we don't have to manage pageTokens anymore, it's been automated for the most part.
[EDIT]:
I just realized you can get results by doing:
results = list(job_results.fetch_data())
Got to admit it's way easier now than it was before!
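So the tail end of the polling loop from the question could be reduced to something like this sketch (the row handling is just illustrative, and assumes the newer client where fetch_data() returns an iterator):

job_results = job.results()
rownum = 0
for row in job_results.fetch_data():  # the HTTPIterator pages through results transparently
    rownum += 1
    print "Row number %d" % rownum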

not able to understand cursors in appengine

I'm trying to fetch results in a python2.7 appengine app using cursors, but each time I use with_cursor() it fetches the same result set.
query = Model.all().filter("profile =", p_key).order('-created')
if r.get('cursor'):
    query = query.with_cursor(start_cursor=r.get('cursor'))
cursor = query.cursor()
objs = query.fetch(limit=10)
count = len(objs)
for obj in objs:
    ...
Each time through I'm getting the same 10 results. I'm thinking it has to do with using end_cursor, but how do I get that value if query.cursor() is returning the start_cursor? I've looked through the docs but this is poorly documented.
Your formatting is a bit screwy, by the way. Looking at your code (which is incomplete and therefore potentially leaving something out), I have to assume you have forgotten to store the cursor after fetching results (or return it to the user - I am assuming r is a request?).
So after you have fetched some data you need to call cursor() on the query. E.g. this function counts all entities using a cursor.
def count_entities(kind):
    c = None
    count = 0
    q = kind.all(keys_only=True)
    while True:
        if c:
            q.with_cursor(c)
        i = q.fetch(1000)
        count = count + len(i)
        if not i:
            break
        c = q.cursor()
    return count
See how, after fetch() has been called, c = q.cursor() grabs the new cursor, and it is used as the start cursor the next time through the loop.
Here's what finally worked:
query = Model.all().filter("profile =", p_key).order('-created')
if request.get('cursor'):
    query = query.with_cursor(request.get('cursor'))
objs = query.fetch(limit=10)
cursor = query.cursor()
for obj in objs:
    ...
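To complete the picture, the new cursor then has to make it back to the caller so the next request can include it; something along these lines (the serialization is an assumption, not part of the original code):

result = {
    "items": [str(obj.key()) for obj in objs],  # or however you serialize your entities
    "cursor": cursor,  # the client sends this back as the 'cursor' parameter on the next request
}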

Update DynamoDB Atomic Counter with Python / Boto

I am trying to update an atomic counter with Python Boto 2.3.0, but can find no documentation for the operation.
It seems there is no direct interface, so I tried to go to "raw" updates using the layer1 interface, but I was unable to complete even a simple update.
I tried the following variations, but all with no luck:
dynoConn.update_item(INFLUENCER_DATA_TABLE,
                     {'HashKeyElement': "9f08b4f5-d25a-4950-a948-0381c34aed1c"},
                     {'new': {'Value': {'N': "1"}, 'Action': "ADD"}})

dynoConn.update_item('influencer_data',
                     {'HashKeyElement': "9f08b4f5-d25a-4950-a948-0381c34aed1c"},
                     {'new': {'S': 'hello'}})

dynoConn.update_item("influencer_data",
                     {"HashKeyElement": "9f08b4f5-d25a-4950-a948-0381c34aed1c"},
                     {"AttributesToPut": {"new": {"S": "hello"}}})
They all produce the same error:
File "/usr/local/lib/python2.6/dist-packages/boto-2.3.0-py2.6.egg/boto/dynamodb/layer1.py", line 164, in _retry_handler
data)
boto.exception.DynamoDBResponseError: DynamoDBResponseError: 400 Bad Request
{u'Message': u'Expected null', u'__type': u'com.amazon.coral.service#SerializationException'}
I also investigated the API docs here but they were pretty spartan.
I have done a lot of searching and fiddling, and the only thing I have left is to use the PHP API and dive into the code to find where it "formats" the JSON body, but that is a bit of a pain. Please save me from that pain!
Sorry, I misunderstood what you were looking for. You can accomplish this via layer2 although there is a small bug that needs to be addressed. Here's some Layer2 code:
>>> import boto
>>> c = boto.connect_dynamodb()
>>> t = c.get_table('counter')
>>> item = t.get_item('counter')
>>> item
{u'id': 'counter', u'n': 1}
>>> item.add_attribute('n', 20)
>>> item.save()
{u'ConsumedCapacityUnits': 1.0}
>>> item # Here's the bug, local Item is not updated
{u'id': 'counter', u'n': 1}
>>> item = t.get_item('counter') # Refetch item just to verify change occurred
>>> item
{u'id': 'counter', u'n': 21}
This results in the same over-the-wire request as you are performing in your Layer1 code, as shown by the following debug output.
2012-04-27 04:17:59,170 foo [DEBUG]:StringToSign:
POST
/
host:dynamodb.us-east-1.amazonaws.com
x-amz-date:Fri, 27 Apr 2012 11:17:59 GMT
x-amz-security- token:<removed> ==
x-amz-target:DynamoDB_20111205.UpdateItem
{"AttributeUpdates": {"n": {"Action": "ADD", "Value": {"N": "20"}}}, "TableName": "counter", "Key": {"HashKeyElement": {"S": "counter"}}}
If you want to avoid the initial GetItem call, you could do this instead:
>>> import boto
>>> c = boto.connect_dynamodb()
>>> t = c.get_table('counter')
>>> item = t.new_item('counter')
>>> item.add_attribute('n', 20)
>>> item.save()
{u'ConsumedCapacityUnits': 1.0}
Which will update the item if it already exists or create it if it doesn't yet exist.
For those looking for the answer I have found it.
First, an IMPORTANT NOTE: I am currently unaware of what is going on, BUT for the moment, to get a layer1 instance I have had to do the following:
import boto
AWS_ACCESS_KEY=XXXXX
AWS_SECRET_KEY=YYYYY
dynoConn = boto.connect_dynamodb(AWS_ACCESS_KEY, AWS_SECRET_KEY)
dynoConnLayer1 = boto.dynamodb.layer1.Layer1(AWS_ACCESS_KEY, AWS_SECRET_KEY)
Essentially instantiating a layer2 FIRST and THEN a layer 1.
Maybe I'm doing something stupid, but at this point I'm just happy to have it working....
I'll sort the details later. THEN...to actually do the atomic update call:
dynoConnLayer1.update_item("influencer_data",
                           {"HashKeyElement": {"S": "9f08b4f5-d25a-4950-a948-0381c34aed1c"}},
                           {"direct_influence":
                               {"Action": "ADD", "Value": {"N": "20"}}
                           })
Note that in the example above Dynamo will ADD 20 to whatever the current value is, and this operation will be atomic, meaning other operations happening at the "same time" will be correctly "scheduled" to happen either after the new value has been established as +20 OR before this operation is executed. Either way the desired effect will be accomplished.
Be certain to do this on the instance of the layer1 connection as the layer2 will throw errors given it expects a different set of parameter types.
That's all there is to it!!!! Just so folks know, I figured this out using the PHP SDK. It takes a very short time to install and set up, AND THEN when you do a call, the debug data will actually show you the format of the HTTP request body, so you will be able to copy/model your layer1 parameters after the example. Here is the code I used to do the atomic update in PHP:
<?php
// Instantiate the class
$dynamodb = new AmazonDynamoDB();

$update_response = $dynamodb->update_item(array(
    'TableName' => 'influencer_data',
    'Key' => array(
        'HashKeyElement' => array(
            AmazonDynamoDB::TYPE_STRING => '9f08b4f5-d25a-4950-a948-0381c34aed1c'
        )
    ),
    'AttributeUpdates' => array(
        'direct_influence' => array(
            'Action' => AmazonDynamoDB::ACTION_ADD,
            'Value' => array(
                AmazonDynamoDB::TYPE_NUMBER => '20'
            )
        )
    )
));

// status code 200 indicates success
print_r($update_response);
?>
Hopefully this will help others up until the Boto layer2 interface catches up... or someone simply figures out how to do it in layer2 :-)
I'm not sure this is truly an atomic counter, since when you increment the value by 1, another call could also increment the number by 1, so that when you "get" the value, it is not the value that you would expect.
For instance, taking the code by garnaat, which is marked as the accepted answer, I see that when you put it in a thread, it does not work:
class ThreadClass(threading.Thread):
    def run(self):
        conn = boto.dynamodb.connect_to_region(aws_access_key_id=os.environ['AWS_ACCESS_KEY'],
                                               aws_secret_access_key=os.environ['AWS_SECRET_KEY'],
                                               region_name='us-east-1')
        t = conn.get_table('zoo_keeper_ids')
        item = t.new_item('counter')
        item.add_attribute('n', 1)
        r = item.save()  #- Item has been atomically updated!
        # Uh-Oh! The value may have changed by the time "get_item" is called!
        item = t.get_item('counter')
        self.counter = item['n']
        logging.critical('Thread has counter: ' + str(self.counter))

tcount = 3
threads = []
for i in range(tcount):
    threads.append(ThreadClass())

# Start running the threads:
for t in threads:
    t.start()

# Wait for all threads to complete:
for t in threads:
    t.join()

#- Now verify all threads have unique numbers:
results = set()
for t in threads:
    results.add(t.counter)
print len(results)
print tcount
if len(results) != tcount:
    print '***Error: All threads do not have unique values!'
else:
    print 'Success! All threads have unique values!'
Note: If you want this to truly work, change the code to this:
def run(self):
    conn = boto.dynamodb.connect_to_region(aws_access_key_id=os.environ['AWS_ACCESS_KEY'],
                                           aws_secret_access_key=os.environ['AWS_SECRET_KEY'],
                                           region_name='us-east-1')
    t = conn.get_table('zoo_keeper_ids')
    item = t.new_item('counter')
    item.add_attribute('n', 1)
    r = item.save(return_values='ALL_NEW')  #- Item has been atomically updated, and you have the correct value without having to do a "get"!
    self.counter = str(r['Attributes']['n'])
    logging.critical('Thread has counter: ' + str(self.counter))
Hope this helps!
There is no high-level function in DynamoDB for atomic counters. However, you can implement an atomic counter using the conditional write feature. For example, let's say you have a table with a string hash key created like this.
>>> import boto
>>> c = boto.connect_dynamodb()
>>> schema = c.create_schema('id', 's')
>>> counter_table = c.create_table('counter', schema, 5, 5)
You now write an item to that table that includes an attribute called 'n' whose value is zero.
>>> n = 0
>>> item = counter_table.new_item('counter', {'n': n})
>>> item.put()
Now, if I want to update the value of my counter, I would perform a conditional write operation that will bump the value of 'n' to 1 iff its current value agrees with my idea of its current value.
>>> n += 1
>>> item['n'] = n
>>> item.put(expected_value={'n': n-1})
This will set the value of 'n' in the item to 1, but only if the current value in DynamoDB is zero. If the value was already incremented by someone else, the write would fail and I would then need to increment my local counter and try again.
This is kind of complicated but all of this could be wrapped up in some code to make it much simpler to use. I did a similar thing for SimpleDB that you can find here:
http://www.elastician.com/2010/02/stupid-boto-tricks-2-reliable-counters.html
I should probably try to update that example to use DynamoDB
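In the meantime, here is a rough sketch of that read-increment-retry loop, using the same old boto Layer2 calls shown above (the helper name is an assumption, and the exact exception raised on a failed conditional write depends on the boto release; DynamoDBResponseError is the broad error class seen earlier in this thread):

from boto.exception import DynamoDBResponseError

def increment_counter(counter_table, retries=10):
    for _ in range(retries):
        item = counter_table.get_item('counter')
        current = item['n']
        item['n'] = current + 1
        try:
            # succeeds only if nobody else has changed 'n' since we read it
            item.put(expected_value={'n': current})
            return item['n']
        except DynamoDBResponseError:
            continue  # lost the race: re-read the current value and try again
    raise RuntimeError('could not increment counter after %d attempts' % retries)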
If you want to increment a value in DynamoDB, you can achieve this by using:
import boto3
import json
import decimal

class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            if o % 1 > 0:
                return float(o)
            else:
                return int(o)
        return super(DecimalEncoder, self).default(o)

ddb = boto3.resource('dynamodb')

def get_counter():
    table = ddb.Table(TableName)
    try:
        response = table.update_item(
            Key={
                'haskey': 'counterName'
            },
            UpdateExpression="set currentValue = currentValue + :val",
            ExpressionAttributeValues={
                ':val': decimal.Decimal(1)
            },
            ReturnValues="UPDATED_NEW"
        )
        print("UpdateItem succeeded:")
    except Exception as e:
        raise e
    print(response["Attributes"]["currentValue"])
This implementation needs an extra counter table that will just keep the last used value for you.
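For completeness, a hedged sketch of seeding that counter item once before the first increment (the table name, key name, and starting value are assumptions that mirror the snippet above):

table = ddb.Table('counter')  # hypothetical table name
table.put_item(Item={
    'haskey': 'counterName',             # same key the update expression targets
    'currentValue': decimal.Decimal(0)   # start at zero so the first increment returns 1
})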
