I am currently trying to get live stock prices from the Yahoo Finance website using the BeautifulSoup and requests libraries. I'm finding that the bottleneck for speed is the request for the webpage, which takes around 0.5 seconds on average. Below is my code and output:
from bs4 import BeautifulSoup, SoupStrainer
import requests
import time

# transform percent text into numerical (float) value
def process_percent_change(percent_text):
    processed_text = []
    for c in percent_text:
        if c == '.' or c == '+' or c == '-' or c.isdigit():
            processed_text.append(c)
    if processed_text[0] == '-':
        return -float("".join(processed_text[1:]))
    else:
        return float("".join(processed_text[1:]))

# parse most current ticker data from beautiful soup object
def get_current_data():
    # define the page to be parsed for data
    spy_link = "https://finance.yahoo.com/quote/SPY?p=SPY&.tsrc=fin-srch"  # yahoo finance link to SPY etf info

    webpage_start = time.time()
    page = requests.get(spy_link).text
    webpage_end = time.time()
    webpage_time = webpage_end - webpage_start
    print("webpage get time: {}".format(webpage_time))

    parse_start = time.time()
    strainer = SoupStrainer('div', attrs={'data-reactid': '30'})  # only get relevant div info
    data_element = BeautifulSoup(page, 'lxml', parse_only=strainer)

    current_data = {}
    # store current data inside a dictionary
    current_data["current_price"] = float(data_element.find('span', attrs={"data-reactid": "32"}).text)
    change = data_element.find('span', attrs={"data-reactid": "33"}).text.split()
    current_data["value_change"] = float(change[0])
    current_data["percent_change"] = process_percent_change(change[1])
    parse_end = time.time()
    parse_time = parse_end - parse_start
    print("parse time: {}".format(parse_time))

    # return data to be used
    return current_data

if __name__ == "__main__":
    current_data = get_current_data()
webpage get time: 0.42667412757873535
parse time: 0.05974388122558594
What I want to do is either speed up the request or find a more efficient way of monitoring the website for changes. I've noticed that after loading the website in my browser, the price will change without having to refresh the page. Here is a gif example of what I mean:
https://media.giphy.com/media/nZmbqDeu7rWT2vmrks/giphy.gif
Here's a link to the specific yahoo finance page: https://finance.yahoo.com/quote/SPY?p=SPY&.tsrc=fin-srch
Is there a way to monitor the stock price such as in the gif/web browser example? If there isn't, is there a way to speed up requests (without buying a faster connection)?
Any help is appreciated.
This seems like a reasonable time for what you are doing (downloading and parsing a webpage).
In reality I think that 0.4 s should be enough for monitoring, but if you really need a higher update frequency (i.e. you have some new crazy trading algorithm) you can:
Try to parse a more lightweight page. (But you need to find one!)
Create a pool of threads performing many requests at the same time (see the sketch after this list). Note that this behaviour could violate Yahoo's ToS, and restrictions on your account/IP could be applied.
Use some sort of market API (free or paid) to avoid downloading an entire webpage and all its dependencies when you could just receive something like a 20-byte JSON (or similar) message. Also in this case you should check the maximum polling frequency you are allowed to use.
Use Selenium and perform just one connection to the webpage. When you detect a change in the div you are interested in, you read the updated value.
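Here is a minimal sketch of the thread-pool idea, assuming you keep using requests and the same parsing code; the ticker URLs and pool size below are just illustrative, and polling this aggressively may still run into Yahoo's limits:

import concurrent.futures
import requests

# hypothetical list of quote pages to poll concurrently
links = [
    "https://finance.yahoo.com/quote/SPY?p=SPY&.tsrc=fin-srch",
    "https://finance.yahoo.com/quote/QQQ?p=QQQ&.tsrc=fin-srch",
]

def fetch(link):
    # one GET per worker; the parsing from get_current_data() would run on each page
    return requests.get(link, timeout=5).text

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, links))

print([len(page) for page in pages])  # just to confirm the pages were fetched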
On Auction websites, there is a clock counting down the time remaining. I am trying to extract that piece of information (among others) to print to a csv file.
For example, I am trying to take the value after 'Time Left:' on this site: https://auctionofchampions.com/Michael_Jordan___Magic_Johnson_Signed_Lmt__Ed__Pho-LOT271177.aspx
I have tried 3 different options, without any success
1)
time = ''
try:
    time = soup.find(id='tzcd').text.replace('Time Left:','')
    #print("Time: ",time)
except Exception as e:
    print(e)
2)
time = ''
try:
    time = soup.find(id='tzcd').text
    #print("Time: ",time)
except:
    pass
3)
time = ''
try:
    time = soup.find('div', id="BiddingTimeSection").find_next_sibling("div").text
    #print("Time: ",time)
except:
    pass
I am a new user of Python and don't know if it's because of the date/time structure of the pull or because of something else inherently flawed with my code.
Any help would be greatly appreciated!
That information is being pulled into the page via a JavaScript XHR call. You can see that by inspecting the Network tab in the browser's Dev tools. The following code will get you the time left in seconds:
import requests

s = requests.Session()
header = {'X-AjaxPro-Method': 'GetTimerText'}
payload = '{"inventoryId":271177}'

# the first GET establishes the session cookies, then the AjaxPro endpoint is called with the method header set
r = s.get('https://auctionofchampions.com/Michael_Jordan___Magic_Johnson_Signed_Lmt__Ed__Pho-LOT271177.aspx')
s.headers.update(header)
r = s.post('https://auctionofchampions.com/ajaxpro/LotDetail,App_Web_lotdetail.aspx.cdcab7d2.1voto_yr.ashx', data=payload)
print(r.json()['value']['timeLeft'])
Response:
792309
792309 seconds are a bit over 9 days. There are easy ways to return them in days/hours/minutes, if you want.
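For example, a quick breakdown with divmod (the variable names are just illustrative):

seconds_left = 792309
days, rem = divmod(seconds_left, 86400)
hours, rem = divmod(rem, 3600)
minutes, seconds = divmod(rem, 60)
print('{}d {}h {}m {}s'.format(days, hours, minutes, seconds))  # 9d 4h 5m 9s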
I have developed a web scraper with Beautiful Soup that scrapes news from a website and then sends it to a Telegram bot. Every time the program runs it picks up all the news currently on the news web page, and I want it to pick up only the new entries and send just those.
How can I do this? Should I use a sorting algorithm of some sort?
Here is the code:
#Lib requests
import requests
import bs4

fonte = requests.get('https://www.noticiasaominuto.com/')
soup = bs4.BeautifulSoup(fonte.text, 'lxml')
body = soup.body

for paragrafo in body.find_all('p', class_='article-thumb-text'):
    print(paragrafo.text)
    conteudo = paragrafo.text
    id = requests.get('https://api.telegram.org/bot<TOKEN>/getUpdates')
    chat_id = id.json()['result'][0]['message']['from']['id']
    print(chat_id)
    msg = requests.post('https://api.telegram.org/bot<TOKEN>/sendMessage', data = {'chat_id': chat_id ,'text' : conteudo})
You need to keep track of articles that you have seen before, either by using a full database solution or by simply saving the information in a file. The file needs to be read before starting. The website is then scraped and compared against the existing list. Any articles not in the list are added to the list. At the end, the updated list is saved back to the file.
Rather than storing the whole text in the file, a hash of the text can be saved instead, i.e. the text is converted into a practically unique number; here a hex digest is used to make it easier to save to a text file. As each hash will be unique, the hashes can be stored in a Python set to speed up the checking:
import hashlib
import requests
import bs4
import os

# Read in hashes of past articles
db = 'past.txt'

if os.path.exists(db):
    with open(db) as f_past:
        past_articles = set(f_past.read().splitlines())
else:
    past_articles = set()

fonte = requests.get('https://www.noticiasaominuto.com/')
soup = bs4.BeautifulSoup(fonte.text, 'lxml')

for paragrafo in soup.body.find_all('p', class_='article-thumb-text'):
    m = hashlib.md5(paragrafo.text.encode('utf-8'))

    if m.hexdigest() not in past_articles:
        print('New {} - {}'.format(m.hexdigest(), paragrafo.text))
        past_articles.add(m.hexdigest())
        # ...Update telegram here...

# Write updated hashes back to the file
with open(db, 'w') as f_past:
    f_past.write('\n'.join(past_articles))
The first time this is run, all articles will be displayed. The next time, no articles will be displayed until the website is updated.
I am looking to find various statistics about players in games such as CS:GO from the Steam Web API, but cannot work out how to search through the JSON returned from the query (e.g. here) in Python.
I just need to be able to get a specific part of the list that is provided, e.g. finding total_kills from the link above. If I had a way to sort through all of the information provided and filter it down to just that specific thing (in this case total_kills), that would help a load!
The code I have at the moment to turn it into something Python can read is:
import requests
import json

url = "http://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/?key=FE3C600EB76959F47F80C707467108F2&steamid=76561198185148697&include_appinfo=1"
data = requests.get(url).text
data = json.loads(data)
If you are looking for a way to search through the stats list then try this:
import requests
import json
def findstat(data, stat_name):
    for stat in data['playerstats']['stats']:
        if stat['name'] == stat_name:
            return stat['value']
url = "http://api.steampowered.com/ISteamUserStats/GetUserStatsForGame/v0002/?appid=730&key=FE3C600EB76959F47F80C707467108F2&steamid=76561198185148697"
data = requests.get(url).text
data = json.loads(data)
total_kills = findstat(data, 'total_kills') # change 'total_kills' to your desired stat name
print(total_kills)
I'm trying to build an API tool for creating 100+ campaigns at a time, but so far I keep running into timeout errors. I have a feeling it's because I'm not doing this as a batch/async request, but I can't seem to find straightforward instructions specifically for batch creating campaigns in Python. Any help would be GREATLY appreciated!
I have all the campaign details prepped and ready to go in a Google sheet, which my script then reads (using pygsheets) and attempts to create the campaigns. Here's what it looks like so far:
from facebookads.adobjects.campaign import Campaign
from facebookads.adobjects.adaccount import AdAccount
from facebookads.api import FacebookAdsApi
from facebookads.exceptions import FacebookRequestError
import time
import pygsheets

FacebookAdsApi.init(access_token=xxx)

gc = pygsheets.authorize(service_file='xxx/client_secret.json')
sheet = gc.open('Campaign Prep')
tab1 = sheet.worksheet_by_title('Input')
tab2 = sheet.worksheet_by_title('Output')

# gets range size, offsetting it by 1 to account for the range starting on row 2
row_range = len(tab1.get_values('A1', 'A', returnas='matrix', majdim='ROWS', include_empty=False))+1

# finds first empty row in the output sheet
start_row = len(tab2.get_values('A1', 'A', returnas='matrix', majdim='ROWS', include_empty=False))

def create_campaigns(row):
    campaign = Campaign(parent_id=row[6])
    campaign.update({
        Campaign.Field.name: row[7],
        Campaign.Field.objective: row[9],
        Campaign.Field.buying_type: row[10],
    })
    c = campaign.remote_create(params={'status': Campaign.Status.active})
    camp_name = c['name']
    camp_id = 'cg:'+c['id']
    return camp_name, camp_id

r = start_row
# there's a header so I have the range starting at 2
for x in range(2, int(row_range)):
    r += 1
    row = tab1.get_row(x)
    camp_name, camp_id = create_campaigns(row)
    # pastes the generated campaign ID, campaign name and account id back into the sheet
    tab2.update_cells('A'+str(r)+':C'+str(r).format(r),[[camp_id, camp_name, row[6].rsplit('_',1)[1]]])
I've tried wrapping this in a try block and, if it runs into a FacebookRequestError, having it do time.sleep(5) and then keep trying, but I'm still running into timeout errors every 5-10 rows it loops through. When it doesn't time out it does work; I guess I just need to figure out a way to make this handle big batches of campaigns more efficiently.
Any thoughts? I'm new to the Facebook API and I'm still a relative newb at Python, but I find this stuff so much fun! If anyone has any advice for how this script could be better (as well as general Python advice), I'd love to hear it! :)
Can you post the actual error message?
It sounds like what you are describing is that you hit the rate limits after making a certain number of calls. If that is so, time.sleep(5) won't be enough. The rate score decays over time and is reset after 5 minutes (https://developers.facebook.com/docs/marketing-api/api-rate-limiting). In that case I would suggest sleeping between each call instead, as sketched below. However, a better option would be to upgrade your API status. If you hit the rate limits this fast, I assume you are on the Developer level. Try upgrading first to Basic and then Standard and you should not have these problems: https://developers.facebook.com/docs/marketing-api/access
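A minimal sketch of that idea, assuming the create_campaigns() function from the question; the retry count and delays are arbitrary, and FacebookRequestError is the exception the question already catches:

import time
from facebookads.exceptions import FacebookRequestError

def create_with_backoff(row, retries=5, base_delay=10):
    # retry with an increasing pause so the rate score can decay between attempts
    for attempt in range(retries):
        try:
            return create_campaigns(row)
        except FacebookRequestError:
            time.sleep(base_delay * (attempt + 1))
    raise RuntimeError('still rate limited after {} retries'.format(retries))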
Also, as you mention, utilizing Facebook's batch request API could be a good idea. https://developers.facebook.com/docs/marketing-api/asyncrequests/v2.11
Here is a thread with examples of the Batch API working with the Python SDK: https://github.com/facebook/facebook-python-ads-sdk/issues/116
I am pasting the code snippet below (copied from the last link that @reaktard posted), credit to GitHub user @williardx. It helped me a lot in my development.
# ----------------------------------------------------------------------------
# Helper functions
def generate_batches(iterable, batch_size_limit):
    # This function can be found in examples/batch_utils.py
    batch = []
    for item in iterable:
        if len(batch) == batch_size_limit:
            yield batch
            batch = []
        batch.append(item)
    if len(batch):
        yield batch

def success_callback(response):
    batch_body_responses.append(response.body())

def error_callback(response):
    # Error handling here
    pass
# ----------------------------------------------------------------------------

batches = []
batch_body_responses = []
api = FacebookAdsApi.init(your_app_id, your_app_secret, your_access_token)

for ad_set_list in generate_batches(ad_sets, batch_limit):
    next_batch = api.new_batch()
    requests = [ad_set.get_insights(pending=True) for ad_set in ad_set_list]
    for req in requests:
        next_batch.add_request(req, success_callback, error_callback)
    batches.append(next_batch)

for batch_request in batches:
    batch_request.execute()
    time.sleep(5)

print(batch_body_responses)
I have been trying to use xgoogle to search for PDFs on the internet. The problem I am having is that if I search for "Medicine:pdf", the first page returned to me is not the first page Google returns, i.e. the page I get if I actually use Google. I don't know what is wrong; here is my code:
try:
    page = 0
    gs = GoogleSearch(searchfor)
    gs.results_per_page = 100
    results = []
    while page < 2:
        gs.page = page
        results += gs.get_results()
        page += 1
except SearchError, e:
    print "Search failed: %s" % e

for res in results:
    print res.desc
If I actually use the Google website to search for the query, the first result Google displays for me is:
Title : Medicine - British Council
Desc :United Kingdom medical training has a long history of excellence and of ... Leaders in medicine throughout the world have received their medical education.
Url : http://www.britishcouncil.org/learning-infosheets-medicine.pdf
But if I use my Python xgoogle search I get:
Python output
Descrip:UCM175757.pdf
Title:Medicines in My Home: presentation for students - Food and Drug ...
Url:http://www.fda.gov/downloads/Drugs/ResourcesForYou/Consumers/BuyingUsingMedicineSafely/UnderstandingOver-the-CounterMedicines/UCM175757.pdf
I noticed there is a difference between using xgoogle and using Google in the browser. I have no idea why, but you could try the Google Custom Search API. It may give you results closer to the browser's, with no risk of being banned from Google (if you use xgoogle too many times in one short period, you get an error returned instead of search results).
First you have to register and enable your custom search in Google to get your key and cx:
https://www.google.com/cse/all
the api format is:
'https://www.googleapis.com/customsearch/v1?key=yourkey&cx=yourcx&alt=json&q=yourquery'
customsearch is the Google function you want to use; in your case I think it is customsearch
v1 is the version of the API
yourkey and yourcx are provided by Google; you can find them on your dashboard
yourquery is the term you want to search for; in your case it is "Medicine:pdf"
json is the return format
Example returning the first 3 pages of Google Custom Search results:
import urllib2
import urllib
import simplejson

def googleAPICall():
    userInput = urllib.quote("global warming")
    KEY = "##################"  # get yours
    CX = "###################"  # get yours

    for i in range(0, 3):
        index = i*10 + 1
        url = ('https://www.googleapis.com/customsearch/v1?'
               'key=%s'
               '&cx=%s'
               '&alt=json'
               '&q=%s'
               '&num=10'
               '&start=%d') % (KEY, CX, userInput, index)
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
        results = simplejson.load(response)
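As a follow-up, the loaded results still need to be read out; assuming the standard Custom Search JSON response shape, the hits sit under the items key, for example:

# inside the for loop above, after results = simplejson.load(response)
for item in results.get('items', []):
    print(item['title'])
    print(item['link'])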