Fetch text from a website and display it back - Python

Currently, there's a game that has different groups, and you can play for a prize, 'gold', every hour. Sometimes there is gold, sometimes there isn't. It is posted on Facebook every hour as "gold in group2" or "gold in group6"; other times there is no post because no gold is being offered that hour. I want to write a small script that will check the site hourly, grab the result (whether there is gold, and in which group) and display it back to me. I wanted to write it in Python as I'm learning it. Would this be the best language to use? And how would I go about doing this? All I can really find is information on extracting links. I don't want to extract links, just the text. Thanks for any and all help. I appreciate it.

Check out urllib2 for getting html from a url and BeautifulSoup/HTMLParser/etc to parse the html. Then, you could use something like this as a starting point for the script:
import sys
import time
import urllib2
import BeautifulSoup
import HTMLParser

def getSource(url, postdata):
    """Fetch url with a POST body and return the response text ("" on failure)."""
    source = ""
    sock = None
    req = urllib2.Request(url, postdata)
    try:
        sock = urllib2.urlopen(req)
    except urllib2.URLError as exc:
        # handle the error..
        pass
    else:
        source = sock.read()
    finally:
        # sock stays None when urlopen() fails, so only close a real connection
        if sock is not None:
            sock.close()
    return source

def parseSource(source):
    # parse source with BeautifulSoup/HTMLParser, or here...
    pass

def main():
    last_run = 0
    while True:
        t1 = time.time()
        # check if 1 hour has passed since last_run
        if t1 - last_run >= 3600:
            # urllib2 needs the scheme (http://) in the url
            source = getSource("http://someurl.com", "user=me&blah=foo")
            last_run = time.time()
            parseSource(source)
        else:
            # sleep for 60 seconds and check time again.
            time.sleep(60)
    return 0

if __name__ == "__main__":
    sys.exit(main())
Here is a good article about parsing HTML with Python.

I have something similar to what you have, but you left out what my main question revolves around. I looked at HTMLParser and BeautifulSoup, but I am unsure how to do something like if($posttext == gold) echo "gold in so and so". It seems like BeautifulSoup deals a lot with tags. Since Facebook posts can use a variety of tags, how would I go about searching just the text and returning the 'post'?
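One way to approach the text-only search, as a minimal sketch rather than a definitive answer: strip the tags first, then run a regular expression over what is left. This is written for Python 3 with only the standard library. The names TextExtractor and find_gold_group are made up for illustration, and the "gold in groupN" wording is assumed from the question. Note that text inside script or style tags is collected too, which should be harmless for this pattern.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of a page and ignores the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def find_gold_group(html):
    """Return the group number from a 'gold in groupN' post, or None."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Assumed wording of the post; adjust the pattern if the real posts differ.
    match = re.search(r"gold\s+in\s+group\s*(\d+)", text, re.IGNORECASE)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = "<div class='post'><span>Gold in group6</span></div>"
    group = find_gold_group(sample)
    print("gold in group %d" % group if group is not None else "no gold this hour")
```

Because the search runs on the extracted text, it does not matter which tags the post happens to be wrapped in.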

Related

Catching website changes with python using urlopen read function

Hi, I am a high school student who has not used Python much, and I am having trouble writing code to check when a website has been updated. I have looked at different resources and used them to create what I have, but when I run the code it doesn't do what I expect. I expect it to tell me whether a site has been updated or has stayed the same since I last checked it. I put some print statements in the code to try to catch the issue, but they have only shown me that the website has changed even though it doesn't look like it has.
import time
import hashlib
from urllib.request import urlopen, Request

url = Request('https://www.canada.ca/en/immigration-refugees-citizenship/services/immigrate-canada/express-entry/submit-profile/rounds-invitations.html')
res = urlopen(url).read()
current = hashlib.sha224(res).hexdigest()
print("running")
time.sleep(10)
while True:
    try:
        res = urlopen(url).read()
        current = hashlib.sha224(res).hexdigest()
        print(current)
        print(res)
        time.sleep(30)
        res = urlopen(url).read()
        newHash = hashlib.sha224(res).hexdigest()
        print(newHash)
        print(res)
        if newHash == current:
            print("nothing changed")
            continue
        else:
            print("there was a change")
    except AttributeError as e:
        print("error")

Having trouble using Beautiful Soup's 'Next Sibling' to extract some information

On auction websites, there is a clock counting down the time remaining. I am trying to extract that piece of information (among others) to write to a CSV file.
For example, I am trying to take the value after 'Time Left:' on this site: https://auctionofchampions.com/Michael_Jordan___Magic_Johnson_Signed_Lmt__Ed__Pho-LOT271177.aspx
I have tried 3 different options, without any success:
1)
time = ''
try:
    time = soup.find(id='tzcd').text.replace('Time Left:','')
    #print("Time: ",time)
except Exception as e:
    print(e)
2)
time = ''
try:
    time = soup.find(id='tzcd').text
    #print("Time: ",time)
except:
    pass
3)
time = ''
try:
    time = soup.find('div', id="BiddingTimeSection").find_next_sibling("div").text
    #print("Time: ",time)
except:
    pass
I am a new user of Python and don't know if it's because of the date/time structure of the pull or because of something else inherently flawed with my code.
Any help would be greatly appreciated!
That information is being pulled into the page via a JavaScript XHR call. You can see that by inspecting the Network tab in the browser's dev tools. The following code will get you the time left in seconds:
import requests
s = requests.Session()
header = {'X-AjaxPro-Method': 'GetTimerText'}
payload = '{"inventoryId":271177}'
r = s.get('https://auctionofchampions.com/Michael_Jordan___Magic_Johnson_Signed_Lmt__Ed__Pho-LOT271177.aspx')
s.headers.update(header)
r = s.post('https://auctionofchampions.com/ajaxpro/LotDetail,App_Web_lotdetail.aspx.cdcab7d2.1voto_yr.ashx', data=payload)
print(r.json()['value']['timeLeft'])
Response:
792309
792309 seconds are a bit over 9 days. There are easy ways to return them in days/hours/minutes, if you want.
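For example, here is a small sketch of two such ways using only the standard library: repeated divmod, or datetime.timedelta. The helper name split_seconds is made up for illustration.

```python
from datetime import timedelta


def split_seconds(total_seconds):
    """Split a number of seconds into (days, hours, minutes, seconds)."""
    minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return days, hours, minutes, seconds


days, hours, minutes, seconds = split_seconds(792309)
print("%d days, %d hours, %d minutes, %d seconds" % (days, hours, minutes, seconds))
# prints: 9 days, 4 hours, 5 minutes, 9 seconds

# timedelta does the same arithmetic and formats itself when printed
print(timedelta(seconds=792309))
# prints: 9 days, 4:05:09
```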

Python: Download only the HEAD tag of web page

I need to scrape multiple webpages (1000s per hour) as fast as possible, I only need to get the metadata from the head section. I don't think using range headers is going to be reliable as the head sections can vary in size greatly.
I came across a Java implementation that opens a URLConnection and reads from the input stream, stopping once it finds the closing </head> tag.
See: Is it possible to download only the HEAD tag of a page?
Is this possible in Python?
I've been testing with pycurl and the WRITEFUNCTION callback
from time import time
import pycurl
import sys

c = pycurl.Curl()

class Body:
    body = ""

    def read(self, buf):
        self.body += str(buf)

b = Body()
c.setopt(c.URL, "https://www.oracle.com/")
c.setopt(c.WRITEFUNCTION, b.read)
t1 = time()
try:
    c.perform()
except pycurl.error:
    pass
print(time() - t1)
print(b.body)
But using Wireshark, and with the debugger set to stop in the read function, I'm still seeing all the transactions happen on the first pass.
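The "read the stream and stop at the closing </head> tag" idea from the Java approach can be sketched in Python roughly as follows. This is an illustration under assumptions, not a tested solution to the pycurl problem: read_until_head_close and fetch_head are made-up names, the 1024-byte chunk size is arbitrary, and how much the server has already sent by the time reading stops depends on network buffering.

```python
from urllib.request import urlopen


def read_until_head_close(chunks, marker=b"</head>"):
    """Consume byte chunks until marker is seen (case-insensitive).

    Returns everything up to and including the marker, or all the data
    if the marker never appears.
    """
    buffer = bytearray()
    for chunk in chunks:
        # A match may straddle two chunks, so start the search a little
        # before the end of the data already buffered.
        start = max(0, len(buffer) - len(marker) + 1)
        buffer.extend(chunk)
        end = buffer.lower().find(marker, start)
        if end != -1:
            return bytes(buffer[:end + len(marker)])
    return bytes(buffer)


def fetch_head(url, chunk_size=1024):
    """Read a page in small pieces and stop once </head> has arrived."""
    with urlopen(url) as response:
        # iter(callable, sentinel) keeps calling read() until it returns b""
        return read_until_head_close(iter(lambda: response.read(chunk_size), b""))
```

Keeping the chunk handling separate from the network call makes it possible to check the stopping logic with plain lists of bytes.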

Incomplete HAR list using Python: Browsermobproxy, selenium, phantomJS

Fairly new to Python, I learn by doing, so I thought I'd give this project a shot. I'm trying to create a script which finds the Google Analytics request for a certain website, parses the request payload and does something with it.
Here are the requirements:
Ask user for 2 urls ( for comparing the payloads from 2 diff. HAR payloads)
Use selenium to open the two urls, use browsermobproxy/phantomJS to get all HAR
Store the HAR as a list
From the list of all HAR files, find the google analytics request, including the payload
If Google Analytics tag found, then do things....like parse the payload, etc. compare the payload, etc.
Issue: Sometimes for a website that I know has Google Analytics, e.g. nytimes.com, the HAR that I get is incomplete. My program will say "GA Not found", but that's only because the complete HAR was not captured, so when the regex ran to find the matching request it wasn't there. This issue is intermittent and does not happen all the time. Any ideas?
I'm thinking that due to some dependency or latency, the script moved on before the complete HAR was captured. I tried the "wait for traffic to stop" option, but maybe I didn't do something right.
Also, as a bonus, I would appreciate any help you can provide on how to make this script run faster; it's fairly slow. As I mentioned, I'm new to Python so go easy :)
This is what I've got thus far.
import browsermobproxy as mob
from selenium import webdriver
import re
import sys
import urlparse
import time
from datetime import datetime

def cleanup():
    s.stop()
    driver.quit()

proxy_path = '/Users/bob/Downloads/browsermob-proxy-2.1.4-bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'
s = mob.Server(proxy_path)
s.start()
proxy = s.create_proxy()
proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
service_args = [proxy_address, '--ignore-ssl-errors=yes', '--ssl-protocol=any'] # so that i can do https connections
driver = webdriver.PhantomJS(executable_path='/Users/bob/Downloads/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs', service_args=service_args)
driver.set_window_size(1400, 1050)
urlLists = []
collectTags = []
gaCollect = 0
varList = []
for x in range(0,2): # I want to ask the user for 2 inputs
    url = raw_input("Enter a website to find GA on: ")
    time.sleep(2.0)
    urlLists.append(url)
    if not url:
        print "You need to type something in...here"
        sys.exit()
#gets the two user url and stores in list
for urlList in urlLists:
    print urlList, 'start 2nd loop' #printing for debug purpose, no need for this
    if not urlList:
        print 'Your Url list is empty'
        sys.exit()
    proxy.new_har()
    driver.get(urlList)
    #proxy.wait_for_traffic_to_stop(15, 30) #<-- tried this but did not do anything
    for ent in proxy.har['log']['entries']:
        gaCollect = (ent['request']['url'])
        print gaCollect
        if re.search(r'google-analytics.com/r\b', gaCollect):
            print 'Found GA'
            collectTags.append(gaCollect)
            time.sleep(2.0)
            break
        else:
            print 'No GA Found - Ending Prog.'
            cleanup()
            sys.exit()
cleanup()
This might be a stale question, but I found an answer that worked for me.
You need to change two things:
1 - Remove sys.exit() -- this causes your programme to stop after the first iteration through the ent list, so if what you want is not the first thing, it won't be found
2 - call new_har with the captureContent option enabled to get the payload of requests:
proxy.new_har(options={'captureHeaders':True, 'captureContent': True})
See if that helps.
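To illustrate the first point, here is a rough sketch (Python 3) that looks at every entry before deciding whether GA was found, instead of exiting on the first entry that does not match. The function name find_ga_requests is made up for illustration, and har is assumed to be a dict shaped like proxy.har in the question's code.

```python
import re

# Same pattern as in the question, with the dot escaped
GA_PATTERN = re.compile(r"google-analytics\.com/r\b")


def find_ga_requests(har):
    """Return the URLs of all HAR entries that look like GA requests."""
    urls = (entry["request"]["url"] for entry in har["log"]["entries"])
    return [url for url in urls if GA_PATTERN.search(url)]


if __name__ == "__main__":
    # Hand-written stand-in for proxy.har, just to show the shape
    sample_har = {"log": {"entries": [
        {"request": {"url": "https://example.com/app.js"}},
        {"request": {"url": "https://www.google-analytics.com/r/collect?v=1"}},
    ]}}
    matches = find_ga_requests(sample_har)
    print(matches if matches else "No GA found")
```

Only after the whole list has been scanned does an empty result mean that no GA request was captured.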

Dictionary / JSON issue using Python 2.7

I'm looking at scraping some data from Facebook using Python 2.7. My code basically increments the Facebook profile ID by 1 and then captures the details returned by the page.
An example of the page I'm looking to capture the data from is graph.facebook.com/4.
Here's my code below:
import scraperwiki
import urlparse
import simplejson

source_url = "http://graph.facebook.com/"
profile_id = 1
while True:
    try:
        profile_id +=1
        profile_url = urlparse.urljoin(source_url, str(profile_id))
        results_json = simplejson.loads(scraperwiki.scrape(profile_url))
        for result in results_json['results']:
            print result
            data = {}
            data['id'] = result['id']
            data['name'] = result['name']
            data['first_name'] = result['first_name']
            data['last_name'] = result['last_name']
            data['link'] = result['link']
            data['username'] = result['username']
            data['gender'] = result['gender']
            data['locale'] = result['locale']
            print data['id'], data['name']
            scraperwiki.sqlite.save(unique_keys=['id'], data=data)
            #time.sleep(3)
    except:
        continue
    profile_id +=1
I am using the ScraperWiki site to run this check, but no data is printed back to the console, despite the line print data['id'], data['name'], which is there just to check that the code is working.
Any suggestions on what is wrong with this code? As said, for each returned profile, the unique data should be captured and printed to screen as well as populated into the sqlite database.
Thanks
"Any suggestions on what is wrong with this code?"
Yes. You are swallowing all of your errors. There could be a huge number of things going wrong in the block under try. If anything goes wrong in that block, you move on without printing anything.
You should only ever use a try / except block when you are looking to handle a specific error.
Modify your code so that it looks like this:
while True:
    profile_id +=1
    profile_url = urlparse.urljoin(source_url, str(profile_id))
    results_json = simplejson.loads(scraperwiki.scrape(profile_url))
    for result in results_json['results']:
        print result
        data = {}
        # ... more ...
and then you will get detailed error messages when specific things go wrong.
As for your concern in the comments:
"The reason I have the error handling is because, if you look for example at graph.facebook.com/3, this page contains no user data and so I don't want to collate this info and skip to the next user, ie. no 4 etc"
If you want to handle the case where there is no data, then find a way to handle that case specifically. It is bad practice to swallow all errors.
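As a rough sketch of what handling that case specifically could look like (Python 3, standard library only): catch only the error you expect and check explicitly for the fields you need. The function name parse_profile and the assumption that an empty profile comes back as invalid JSON, a non-object, or an object without an "id" key are mine for illustration; check what the page actually returns.

```python
import json


def parse_profile(raw):
    """Return a small dict for a usable profile, or None if there is no data."""
    try:
        payload = json.loads(raw)
    except ValueError:
        # Only malformed JSON is handled here; any other error still surfaces.
        return None
    # Assumed "no user data" shape: not an object, or an object without an id
    if not isinstance(payload, dict) or "id" not in payload:
        return None
    return {"id": payload["id"], "name": payload.get("name")}


if __name__ == "__main__":
    for raw in ['{"id": "4", "name": "Example User"}', "false", "<html>"]:
        profile = parse_profile(raw)
        print(profile if profile is not None else "skipping: no profile data")
```

Anything unexpected, such as a network failure or a renamed field, still raises, so it shows up instead of being silently skipped.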
