I am working on collecting a data set that cross-references a track's audio features with the Billboard chart data set available on Kaggle. I am trying to get each song's URI in order to then get its audio features, and I defined the following function:
def get_track_uri(track_title, sp):
    result = sp.search(track_title, type="track", limit=1)
    if result['tracks']['total'] > 0:
        track_uri = result['tracks']['items'][0]['uri']
        return track_uri
    else:
        return None
It then goes through the Billboard 'song' column in order to create a new column with the URIs:
cleandf['uri'] = cleandf['song'].apply(lambda x: get_track_uri(x, sp))
I left it running for about 40 minutes and noticed that it got stuck in a sleep call inside Spotipy, which I gather happened because I was making too many requests to the Spotify API. How can I get around this when I'm trying to go through 50,000 rows? I could add a wait between search queries, but that would easily take, what, 15 hours? There is probably also a way to get the audio features directly without fetching the URIs first, but it would still need to go through all of the rows.
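One way to cut the number of requests substantially, reusing the sp, cleandf and get_track_uri from above: search only the unique song titles (Billboard repeats the same songs across many chart weeks) and fetch the audio features in batches, since Spotipy's audio_features call accepts up to 100 track URIs per request. A rough, untested sketch:

# Billboard charts repeat songs across many weeks, so search unique titles only
# and map the results back onto the full DataFrame.
unique_titles = cleandf['song'].drop_duplicates()
uri_map = {title: get_track_uri(title, sp) for title in unique_titles}
cleandf['uri'] = cleandf['song'].map(uri_map)

# Audio features can be requested in batches of up to 100 URIs per call,
# which turns tens of thousands of requests into a few hundred.
uris = cleandf['uri'].dropna().unique().tolist()
features = []
for i in range(0, len(uris), 100):
    features.extend(sp.audio_features(uris[i:i + 100]))

The search step still has to run once per unique title, but the audio-features half collapses to a few hundred calls.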
I am doing research on AWS OpenSearch, and one of the things I'm trying to measure is the run time (execution time) of different queries and index commands. For example, how long does it take to perform an action such as a search query, create index, or delete index?
Right now I am using the awswrangler Python library for interacting with OpenSearch. The API for that library is here.
Read Index Code I currently have:
awswrangler.opensearch.search(client=self.client, index="index_name", search_body=any_dsl_query, size=100)
awswrangler.opensearch.search_by_sql(client=self.client, sql_query="SELECT * from index_name limit 100")
Delete Index Code:
awswrangler.opensearch.delete_index(client=self.client, index="index_name")
Create Index Code (this one actually returns Elapsed time as desired):
awswrangler.opensearch.index_csv(client=self.client, path=csv_file_path, index="index_name")
Unfortunately none of these except Create Index return the runtime out of the box.
I know that I can create my own timer script to get the runtime, but I don't want to do this client side, because then the execution time would include my network latency. Is there any way to do this with OpenSearch?
I couldn't find a way to do it in the awswrangler Python library I was using, or with any other method so far.
I was able to resolve this by using the Python requests library and looking at the "took" value in the response, which is the time it took to run the query in milliseconds. Here is the code I used to get this working:
import json, requests

headers = {"Content-Type": "application/json"}
sample_sql_query = "SELECT * FROM <index_name> LIMIT 5"
sql_result = requests.post("<opensearch_domain>/_plugins/_sql?format=json", auth=(username, password), data=json.dumps({"query": sample_sql_query}), headers=headers).json()
print(sql_result)
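The same idea should work for plain DSL queries as well: the standard _search endpoint also reports a server-side "took" value in milliseconds, so network latency is excluded. A sketch along the same lines, keeping the placeholder domain, index name and credentials from above:

import json, requests

headers = {"Content-Type": "application/json"}
dsl_query = {"query": {"match_all": {}}, "size": 100}
search_result = requests.post("<opensearch_domain>/<index_name>/_search", auth=(username, password), data=json.dumps(dsl_query), headers=headers).json()
print(search_result["took"])  # server-side query time in ms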
I am trying to implement the pull model to query the change feed using the Azure Cosmos Python SDK. I found that, to parallelise the querying process, the official documentation mentions obtaining FeedRange values and creating a FeedIterator to iterate through each range of partition key values obtained from the FeedRange.
Currently my code snippet to query the change feed looks like this, and it is pretty straightforward:
# function to get items from change feed based on a condition
def get_response(container_client, condition, last_continuation_token=None):
    if condition:
        # Historical data read
        response = container_client.query_items_change_feed(
            is_start_from_beginning=True,
            # partition_key_range_id=0
        )
    else:
        # Reading from a checkpoint
        response = container_client.query_items_change_feed(
            is_start_from_beginning=False,
            continuation=last_continuation_token
        )
    return response
The problem with this approach is the efficiency of getting all the items from the beginning (the historical data read). I tried this method with a pretty small dataset of 500 items, and the response took around 60 seconds. When dealing with millions or even billions of items, the response might take too long to return.
Would querying the change feed in parallel for each partition key range save time?
If yes, how do I get the PartitionKeyRangeId in the Python SDK?
Are there any problems I need to consider when implementing this?
I hope I make sense!
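In case it helps frame the question, a per-range parallel read could look roughly like the sketch below. The method names used here (read_feed_ranges and the feed_range keyword of query_items_change_feed) are assumptions based on the FeedRange support described in the docs, so check that your installed azure-cosmos version actually exposes them, and checkpoint a continuation token per range if you need to resume later:

from concurrent.futures import ThreadPoolExecutor

# Assumed API: read_feed_ranges() and the feed_range keyword may differ by SDK version.
def read_range_from_beginning(container_client, feed_range):
    response = container_client.query_items_change_feed(
        feed_range=feed_range,
        is_start_from_beginning=True,
    )
    # Drain this range; checkpoint its continuation token separately if needed.
    return list(response)

def read_all_ranges_in_parallel(container_client):
    feed_ranges = list(container_client.read_feed_ranges())
    with ThreadPoolExecutor(max_workers=max(1, len(feed_ranges))) as pool:
        chunks = pool.map(lambda fr: read_range_from_beginning(container_client, fr), feed_ranges)
    return [item for chunk in chunks for item in chunk]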
I would like to pick up a specific keyword from a Discord message as quickly as possible in Python. I'm currently polling with a request, but the issue is that it takes too much time to grab the new messages (it takes 200-500 ms to receive them). I am sure there is a better way of doing it.
import re
import requests

def retrieve_messages(channelid):
    ## DISCORD
    headers = {'authorization': ""}
    url = f'https://discord.com/api/v9/channels/{channelid}/messages'
    while True:
        r = requests.get(url, headers=headers)
        for value in r.json():
            s1 = str(value['content'])
            s2 = re.findall('code:(0x......................)', s1)
            if s2:
                print(s2)

retrieve_messages('xxxxxxxxxxxxxxx')
According to the reference, the default number of messages returned from the endpoint is 50 (https://discord.com/developers/docs/resources/channel#get-channel-messages). Using the limit parameter it should be possible to get only 1 or 5, which should reduce both the time it takes to retrieve the messages and the time it takes to loop through them.
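For example, the same request from the question with the limit parameter applied (untested sketch; the placeholder channel id and empty authorization header are kept from above):

import requests

headers = {'authorization': ""}
# Ask Discord for only the newest message instead of the default 50
r = requests.get('https://discord.com/api/v9/channels/xxxxxxxxxxxxxxx/messages', headers=headers, params={'limit': 1})
messages = r.json()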
For the purpose of making a sentiment summariser I need to read a large number of tweets. I use the following code to fetch tweets from Twitter, but the number of tweets returned is just 10 to 20. What changes can be made in this code to increase the number of tweets to 100 or more?
t.statuses.home_timeline()
query = raw_input("Enter a search query:\n")
data = t.search.tweets(q=query)
for i in range(len(data['statuses'])):
    test = data['statuses'][i]['text']
    print test
By default, it returns only 20 tweets. Use the count parameter in your query. Here's the statuses/home_timeline doc page.
So, below is the code to get 100 tweets. Note that count must be less than or equal to 200.
t.statuses.home_timeline(count=100)
Updated at 4:48 after getting the output.
I tried it and got plenty of tweets with both 50 and 100. Here's the code:
Save the code below as test.py. Create a new directory, put test.py and the latest Twitter 1.14.1 library in it, open a terminal and cd to the directory you just created, then run python test.py.
from twitter import *

t = Twitter(
    auth=OAuth('OAUTH_TOKEN', 'OAUTH_SECRET',
               'CONSUMER_KEY', 'CONSUMER_SECRET')
)

query = int(raw_input("Type how many tweets do you need:\n"))
x = t.statuses.home_timeline(count=query)
for i in range(query):
    print x[i]['text']
There is a limit to the number of tweets an application can fetch in a single request, so you need to iterate through the results to get more than what a single request returns. Take a look at this article on the Twitter developer site that explains how to iterate through the results.
Note that the number of results also depends on the query you are searching for.
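With the search API specifically, the usual pattern from that article is to page backwards with max_id. A rough, untested sketch using the same twitter library and client t as above:

def search_many(t, query, wanted=300, per_page=100):
    tweets = []
    max_id = None
    while len(tweets) < wanted:
        kwargs = {'q': query, 'count': per_page}
        if max_id is not None:
            kwargs['max_id'] = max_id
        statuses = t.search.tweets(**kwargs)['statuses']
        if not statuses:
            break  # no more results for this query
        tweets.extend(statuses)
        # next page: only tweets strictly older than the oldest one seen so far
        max_id = min(s['id'] for s in statuses) - 1
    return tweets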
I think we all know this page, but the benchmarks provided there date from more than two years ago. So, I would like to know if you can point out the best XML parser around. As I need just an XML parser, the most important thing to me is speed over everything else.
My objective is to process some XML feeds (about 25k of them), each around 4 KB in size (this will be a daily task). As you probably know, I'm restricted by the 30-second request timeout. So, what's the best parser (Python only) that I can use?
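For reference, the per-feed parsing work I have in mind is roughly the sketch below, using the standard library's cElementTree (lxml has the same API where you can install it). The element names here are hypothetical, just to illustrate the shape of the task:

import xml.etree.cElementTree as ET

def parse_feed(xml_string):
    # 'item' and 'title' are placeholder element names for illustration
    root = ET.fromstring(xml_string)
    return [item.findtext('title') for item in root.findall('.//item')]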
Thanks for your answers.
Edit 01:
@Peter Recore
I will. I'm writing some code now and plan to run some profiling in the near future. Regarding your question, the answer is no: processing takes very little time compared with downloading the actual XML feed. But I can't increase Google's bandwidth, so I can only focus on this for now.
My only problem is that I need to do this as fast as possible, because my objective is to get a snapshot of a website's status. And, as the internet is live and people keep adding and changing its data, I need the fastest method, because any data inserted during the "downloading and processing" time span will mess with my statistical analysis.
I used to do it from my own computer and the process took 24 minutes back then, but now the website has 12 times more information.
I know this doesn't answer my question directly, but it does what I needed.
I remembered that XML is not the only file type I could use, so instead of using an XML parser I chose to use JSON, which is about 2.5 times smaller in size, and that means a decrease in download time. I used simplejson as my JSON library.
I used from google.appengine.api import urlfetch to get the json feeds in parallel:
from google.appengine.api import urlfetch
from google.appengine.ext import webapp
import simplejson

class GetEntityJSON(webapp.RequestHandler):
    def post(self):
        url = 'http://url.that.generates.the.feeds/'
        if self.request.get('idList'):
            idList = self.request.get('idList').split(',')
            asyncRequests = []
            try:
                asyncRequests = self._asyncFetch([url + id + '.json' for id in idList])
            except urlfetch.DownloadError:
                # Dealt with time out errors (#5) here, as these were very frequent
                pass
            for result in asyncRequests:
                if result.status_code == 200:
                    entityJSON = simplejson.loads(result.content)
                    # Filled a database entity with some json info. It goes like this:
                    # entity = Entity(
                    #     name = entityJSON['name'],
                    #     dateOfBirth = entityJSON['date_of_birth']
                    # ).put()
        self.redirect('/')

    def _asyncFetch(self, urlList):
        rpcs = []
        for url in urlList:
            rpc = urlfetch.create_rpc(deadline=10)
            urlfetch.make_fetch_call(rpc, url)
            rpcs.append(rpc)
        return [rpc.get_result() for rpc in rpcs]
I tried getting 10 feeds at a time, but most of the time an individual feed raised DownloadError #5 (time out). Then I increased the deadline to 10 seconds and started getting 5 feeds at a time.
But still, 25k feeds fetched 5 at a time means 5k calls. With a queue that can spawn 5 tasks a second, the total task time should be about 17 minutes in the end.