I am doing research on AWS OpenSearch, and one of the things I'm trying to measure is the run time (execution time) of different queries and index commands. For example, how long does it take to perform an action such as a search query, create index, or delete index?
Right now I am using the awswrangler Python library for interacting with OpenSearch (API reference for that library here).
Read Index Code I currently have:
awswrangler.opensearch.search(client=self.client, index="index_name", search_body=any_dsl_query, size=100)
awswrangler.opensearch.search_by_sql(client=self.client, sql_query="SELECT * from index_name limit 100")
Delete Index Code:
awswrangler.opensearch.delete_index(client=self.client, index="index_name")
Create Index Code (this one actually returns Elapsed time as desired):
awswrangler.opensearch.index_csv(client=self.client, path=csv_file_path, index="index_name")
Unfortunately, none of these except Create Index returns the runtime out of the box.
I know that I can write my own timer script to get the runtime, but I don't want to do this client side, because that would include my network latency in the execution time. Is there any way to do this with OpenSearch?
I couldn't find a way to do it with the awswrangler Python library I was using, or with any other method so far.
I was able to resolve this by using the Python requests library and looking at the "took" value in the response, which is the time it took to run the query on the server, in milliseconds. Here is the code I used to get this working:
import json
import requests

headers = {"Content-Type": "application/json"}  # required when posting a JSON body
sample_sql_query = "SELECT * FROM <index_name> LIMIT 5"
sql_result = requests.post("<opensearch_domain>/_plugins/_sql?format=json", auth=(username, password), data=json.dumps({"query": sample_sql_query}), headers=headers).json()
print(sql_result)
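For DSL queries, the standard _search endpoint also reports a server-side "took" value in milliseconds, so the same approach works without going through the SQL plugin. A minimal sketch, reusing the domain, credentials, and headers from above; the index name and the match_all body are placeholders:

dsl_query = {"query": {"match_all": {}}, "size": 100}  # placeholder DSL query
search_result = requests.post("<opensearch_domain>/<index_name>/_search", auth=(username, password), data=json.dumps(dsl_query), headers=headers).json()
print(search_result["took"])  # server-side execution time in ms, not including network latency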
I am working on collecting a data set that cross-references a track's audio features with the Billboard chart data set available on Kaggle. I am trying to get each song's URI in order to then get its audio features, and I defined the following function:
def get_track_uri(track_title, sp):
    result = sp.search(track_title, type="track", limit=1)
    if result['tracks']['total'] > 0:
        track_uri = result['tracks']['items'][0]['uri']
        return track_uri
    else:
        return None
and then it goes through the Billboard 'song' column in order to create a new column with the URIs.
cleandf['uri'] = cleandf['song'].apply(lambda x: get_track_uri(x, sp))
So, I left it running for about 40 minutes and noticed that it got stuck in a sleep method inside Spotipy, which I gathered was because I was making a lot of requests to the Spotify API. How can I get around this if I need to go through 50,000 rows? I could make it wait between search queries, but then it would easily take, what, 15 hours? Also, there is probably a way to get the audio features directly without fetching the URIs first, but it would still need to go through all of the rows.
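One thing that can cut the number of API calls substantially is batching the audio-features lookups: Spotipy's audio_features call accepts up to 100 track URIs per request, so only the search step has to be done one track at a time. A rough sketch under that assumption, reusing the column names from the question and omitting error handling:

import pandas as pd

# assumes cleandf['uri'] has already been filled in by get_track_uri above
uris = cleandf['uri'].dropna().tolist()

features = []
for i in range(0, len(uris), 100):
    # one request returns audio features for up to 100 tracks
    features.extend(sp.audio_features(uris[i:i + 100]))

features_df = pd.DataFrame([f for f in features if f is not None])
cleandf = cleandf.merge(features_df, on='uri', how='left')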
I am trying to implement the pull model to query the change feed using the Azure Cosmos Python SDK. I found that, to parallelise the querying process, the official documentation mentions using FeedRange values and creating a FeedIterator to iterate through each range of partition key values obtained from the FeedRange.
Currently my code snippet to query the change feed looks like this, and it is pretty straightforward:
# function to get items from the change feed based on a condition
def get_response(container_client, condition, last_continuation_token=None):
    # historical data read
    if condition:
        response = container_client.query_items_change_feed(
            is_start_from_beginning=True,
            # partition_key_range_id=0
        )
    # reading from a checkpoint
    else:
        response = container_client.query_items_change_feed(
            is_start_from_beginning=False,
            continuation=last_continuation_token
        )
    return response
The problem with this approach is efficiency when reading all the items from the beginning (the historical data read). I tried this method with a pretty small dataset of 500 items and the response took around 60 seconds. When dealing with millions or even billions of items, the response might take far too long to return.
Would querying the change feed in parallel for each partition key range save time? (A rough sketch of this pattern follows below.)
If yes, how do I get the PartitionKeyRangeId (or the equivalent FeedRange values) in the Python SDK?
Are there any problems I need to consider when implementing this?
I hope I make sense!
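For what it's worth, here is a minimal sketch of the per-range fan-out pattern the questions above describe. It assumes a recent azure-cosmos release in which ContainerProxy exposes read_feed_ranges() and query_items_change_feed() accepts feed_range and start_time arguments; the exact keyword arguments depend on the SDK version, and older releases only expose the deprecated partition_key_range_id parameter, so check your installed version. The drain_range helper and the thread pool sizing are illustrative, not part of the SDK:

from concurrent.futures import ThreadPoolExecutor

def drain_range(container_client, feed_range):
    # read the change feed for a single feed range from the beginning
    return list(container_client.query_items_change_feed(
        feed_range=feed_range,
        start_time="Beginning"
    ))

# container_client is a ContainerProxy, as in the snippet above;
# one feed range roughly corresponds to one physical partition
feed_ranges = list(container_client.read_feed_ranges())

with ThreadPoolExecutor(max_workers=len(feed_ranges)) as pool:
    results = pool.map(lambda fr: drain_range(container_client, fr), feed_ranges)
    for feed_range, items in zip(feed_ranges, results):
        print("processed one range with", len(items), "items")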
I have been working with the couchdb module in Python to meet some project needs. I was happily using the view method from couchdb to retrieve result sets from my database until recently.
for row in db.view(mapping_function):
    print row.key
However, lately I have been needing to work with databases a lot bigger than before (~15-20 GB). This is when I ran into an unfortunate issue.
The db.view() method loads all rows into memory before you can do anything with them. This is not an issue with small databases, but it is a big problem with large databases.
That is when I came across the iterview function. This looks promising, but I couldn't find an example usage of it. Can someone share or point me to an example usage of the iterview function in python-couchdb?
Thanks - A
Doing this is almost working for me:
import couchdb.client
server = couchdb.client.Server()
db = server['db_name']
for row in db.iterview('my_view', 10, group=True):
    print row.key + ': ' + row.value
I say it almost works because it does return all of the data and all the rows are printed. However, at the end of the batch, it throws a KeyError exception inside couchdb/client.py (line 884) in iterview
This worked for me. You need to add include_docs=True to the iterview call, and then you will get a doc attribute on each row which can be passed to the database delete method:
import couchdb

server = couchdb.Server("http://127.0.0.1:5984")
db = server['your_db']
for row in db.iterview('your_view/your_view', 10, include_docs=True):
    # print(type(row))
    # print(type(row.doc))
    # print(dir(row))
    # print(row.id)
    # print(row.keys())
    db.delete(row.doc)
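If you are deleting many documents this way, calling delete once per row means one HTTP request per document. A possible alternative, sketched below, is to mark each document with _deleted and push batches through python-couchdb's bulk update call; deleting documents while paging through the same view can interact with iterview's pagination, so treat this as an untested outline rather than a drop-in replacement:

batch = []
for row in db.iterview('your_view/your_view', 1000, include_docs=True):
    doc = row.doc
    doc['_deleted'] = True            # CouchDB treats this as a deletion in _bulk_docs
    batch.append(doc)
    if len(batch) >= 1000:
        db.update(batch)              # one _bulk_docs request for the whole batch
        batch = []
if batch:
    db.update(batch)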
If I do something like this:
from py2neo import Graph
graph = Graph()
stuff = graph.cypher.execute("""
match (a:Article)-[p]-n return a, n, p.weight
""")
on a database with lots of articles and links, the query takes a long time and uses all my system's memory, presumably because it's copying the entire result set into memory in one go. Is there some kind of cursor-based version where I could iterate through the results one at a time without having to have them all in memory at once?
EDIT
I found the stream function:
stuff = graph.cypher.stream("""
match (a:Article)-[p]-n return a, n, p.weight
""")
which seems to be what I want according to the documentation but now I get a timeout error (py2neo.packages.httpstream.http.SocketError: timed out), followed by the server becoming unresponsive until I kill it using kill -9.
Have you tried implementing a paging mechanism? Perhaps with the SKIP keyword: http://neo4j.com/docs/stable/query-skip.html
This is similar to using LIMIT / OFFSET in a Postgres / MySQL query.
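As a rough illustration of that paging idea (not tested against the asker's data; the page size is arbitrary, and the {skip}/{limit} parameter syntax assumes the Neo4j 2.x era that py2neo's cypher.execute targets):

from py2neo import Graph

graph = Graph()
page_size = 1000
skip = 0
while True:
    # fetch one page of results at a time so only that page is held in memory
    page = graph.cypher.execute(
        "match (a:Article)-[p]-n return a, n, p.weight skip {skip} limit {limit}",
        {"skip": skip, "limit": page_size})
    if len(page) == 0:
        break
    for record in page:
        pass  # process one record at a time here
    skip += page_size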
EDIT: I previously said that the entire result set was stored in memory, but it appears this is not the case when using API streaming, per Nigel's (a Neo4j engineer) comment below.
For the purpose of making a sentiment summariser, I need to read a large number of tweets. I use the following code to fetch tweets from Twitter. The number of tweets returned is just 10 to 20. What changes can be made to this code to increase the number of tweets to 100 or more?
t.statuses.home_timeline()
query = raw_input("Enter search query: ")
data = t.search.tweets(q=query)
for i in range(len(data['statuses'])):
    test = data['statuses'][i]['text']
    print test
By default, it returns only 20 tweets. Use the count parameter in your query. Here's the statuses/home_timeline doc page.
So, below is the code to get 100 tweets. Note that count must be less than or equal to 200.
t.statuses.home_timeline(count=100)
Updated at 4:48, after getting the output.
I tried it and got a large number of tweets with counts of both 50 and 100. Here's the code:
Save the code below as test.py. Create a new directory, paste test.py and this latest Twitter 1.14.1 library into it, open a terminal and cd into the directory you created, then run python test.py.
from twitter import *

t = Twitter(
    auth=OAuth('OAUTH_TOKEN', 'OAUTH_SECRET',
               'CONSUMER_KEY', 'CONSUMER_SECRET')
)
query = int(raw_input("Type how many tweets do you need:\n"))
x = t.statuses.home_timeline(count=query)
for i in range(query):
    print x[i]['text']
There is a limit to the number of tweets an application can fetch in a single request. You need to iterate through the results to get more than what is returned in a single request. Take a look at this article on the Twitter developer site that explains how to page through results.
Note that the number of results also depends on the query you are searching for.
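For the Search API specifically, that iteration is usually done with the max_id parameter: each request asks for tweets older than the oldest ID already seen. A rough sketch using the same twitter library as above; the page size of 100 and the 500-tweet stopping condition are illustrative:

all_tweets = []
max_id = None
while len(all_tweets) < 500:                 # stop after ~500 tweets, for example
    kwargs = {'q': query, 'count': 100}
    if max_id is not None:
        kwargs['max_id'] = max_id
    statuses = t.search.tweets(**kwargs)['statuses']
    if not statuses:
        break
    all_tweets.extend(s['text'] for s in statuses)
    max_id = min(s['id'] for s in statuses) - 1   # page backwards past the oldest tweet seen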