Python script on refreshing a web page, count value varies

I'm new to developing Python scripts, but I'm trying to write one that will inform me when a web page has been updated. On each check I use a counter to see how many times the program has run before the site updates in some way.
My question is this: when I feed it the URL "stackoverflow.com", the program can run up to 6 times, but when I feed it "stackoverflow.com/questions" it runs at most once. Both pages seem to update their questions often when refreshed, so could someone explain why there is such a big difference in the number of times the program runs?
import urllib2
import time

refreshcnt = 0

url = raw_input("Enter the site to check")
x = raw_input("Enter the time duration to refresh")
x = int(x)
url = "http://" + url

response = urllib2.urlopen(url)
html = response.read()
htmlnew = html

while html == htmlnew:
    time.sleep(x)
    try:
        htmlnew = urllib2.urlopen(url).read()
    except IOError:
        print "Can't open site"
        break
    refreshcnt += 1
    print "Refresh Count", refreshcnt

print("The site has updated!")

Just add this little loop to the end of your code and see what's changing:
for i in xrange(min(len(htmlnew), len(html))):
    if htmlnew[i] != html[i]:
        print(htmlnew[i-20:i+20])
        print(html[i-20:i+20])
        break
I tried it quickly and it appears there is a ServerTime key that is updated every second. For one reason or another, this key seems to be updated every second on the "/questions" page, but only every half a minute or so on the homepage.
However, from a couple of other quick checks, this is certainly not the only part of the HTML being updated on the "stackoverflow.com/questions" page. Comparing the entire HTML against the old one probably won't work in many situations. You'll likely want to search for a specific part of the HTML and check whether that piece has changed. For example, look for the HTML marking the title of the newest question on SO and see if that title is different than before.
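For example, a minimal sketch of that idea (note: the "question-hyperlink" class name is an assumption about Stack Overflow's markup and may change; it is not guaranteed by the site):

import re
import time
import urllib2

def newest_title(url):
    # Pull out the first question title on the page. The "question-hyperlink"
    # class is an assumption about the markup, not a stable API.
    html = urllib2.urlopen(url).read()
    match = re.search(r'class="question-hyperlink"[^>]*>([^<]+)<', html)
    return match.group(1) if match else None

url = "http://stackoverflow.com/questions"
old = newest_title(url)
while newest_title(url) == old:
    time.sleep(30)
print "A new question has been posted!"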

Related

looping in beautiful soup / no errors

I have been writing a program that would hypothetically find items on a website as soon as they are loaded onto it. As of now the script takes as input two values (keywords) used to describe an item and a color used to pick the item's color. The parsing is spot on for items that are already on the website, but say I run my program before the website loads the items: instead of having to re-run the entire script, I'd like it to just refresh the page and re-parse the data until it finds them. I also included "no errors" in my question because in my example run of the script I entered keywords and a color that don't match any item on the website, and instead of getting an error I just got "Process finished with exit code 0". Thank you in advance to anyone who takes the time to help!
Here is my code:
As another user suggested, you're probably better off using Selenium for the entire process rather than using it for only parts of your code and swapping between BSoup and Selenium.
As for reloading the page if certain items are not present: if you already know which items are supposed to be on the page, you can simply search for each item by id with Selenium, and if you can't find one or more of them, refresh the page with the following line of code:
driver.refresh()
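For instance, a rough sketch of that refresh-until-found loop (the URL and element ids below are placeholders, not taken from your site):

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("http://example.com/shop")      # placeholder URL
expected_ids = ["item-1", "item-2"]        # placeholder element ids

while True:
    try:
        # Succeeds only if every expected item is already on the page.
        for element_id in expected_ids:
            driver.find_element_by_id(element_id)
        break
    except NoSuchElementException:
        # At least one item is missing: wait, reload, and try again.
        time.sleep(5)
        driver.refresh()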

Webpage change monitoring

I want to monitor an announcements webpage so that when a new announcement comes in, I can execute tasks as quickly as possible. Currently I'm using Python with the requests package:
allText = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
and then find the first occurrence of the text with a particular header corresponding to an article item:
ind = allText.find('<li class="article-list-item">')
allText = allText[ind:]
ind = allText.find('</a>')
allText = allText[0:ind]
I'm repeating the command (i.e. refreshing the page) every ~1.5 seconds.
The problems are:
it's not fast enough. It typically takes my programme more than 3 seconds to detect a new announcement after it appears. I guess the text search is taking up too much time. Is there a faster way?
on some websites, the articles are concealed and the requests call does not return anything even though the browser can still see them. An example of such a page's source code is:
<div data-app="recent-activity" data-url="/hc/api/internal/recent_activities"></div>
How should I scrape this kind of page please?
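One thing worth trying, purely as a sketch: when the visible list is filled in by JavaScript from an endpoint like the data-url above, you can sometimes request that endpoint directly. The host below is a placeholder and the response format is not guaranteed, so inspect what actually comes back:

import requests

base = "https://example.com"                        # placeholder host
api_path = "/hc/api/internal/recent_activities"     # from the data-url attribute
resp = requests.get(base + api_path, headers={'User-Agent': 'Mozilla/5.0'})
print resp.status_code
print resp.text[:500]   # inspect what the endpoint returns (often JSON)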

Possible bottle-neck issue in web-scraping with Python

First of all I apologize for the vague title, but the problem is that I'm not sure what is causing the error.
I'm using Python to extract some data from a website.
The code I created works perfectly when passing one link at a time, but somehow breaks when trying to collect the data from the 8000 pages I have (it actually breaks way before that). The process I need to do is this:
1. Collect all the links from one single page (8000 links)
2. From each link, extract another link contained in an iframe
3. Scrape the date from the link in point 2
Point 1 is easy and works fine.
Points 2 and 3 work for a while and then I get some errors, every time at a different point and never the same one. After some tests, I decided to try a different approach and run my code only up to point 2 on all the links from point 1, trying to collect all the links first. And at this point I found out that I probably get the error during this stage.
The code works like this: in a for loop I pass each item of a list of URLs to the function below. It's supposed to search for a link to the Disqus website; there should be only one link and there is always one link. Because with a library like lxml it's not possible to look inside the iframe, I use Selenium and ChromeDriver.
def get_url(webpage_url):
    chrome_driver_path = '/Applications/chromedriver'
    driver = webdriver.Chrome(chrome_driver_path)
    driver.get(webpage_url)
    iframes = driver.find_elements_by_tag_name("iframe")
    list_urls = []
    urls = []
    # collects all the urls of all the iframe tags
    for iframe in iframes:
        driver.switch_to_frame(iframe)
        time.sleep(3)
        list_urls.append(driver.current_url)
        driver.switch_to_default_content()
    driver.quit()
    for item in list_urls:
        if item.startswith('http://disqus'):
            urls.append(item)
    if len(urls) > 1:
        print "too many urls collected in iframes"
    else:
        url = urls[0]
    return url
At the beginning there was no time.sleep and it worked for roughly 30 links. Then I put in a time.sleep(2) and it got to about 60. Now with time.sleep(3) it works for around 130 links. Of course, this cannot be the solution. The error I get now is always the same (index out of range on url=urls[0]), but each time on a different link. If I run my code with only the single link where it broke, it works, so it can actually find the URLs there. And of course, sometimes it passes a link where it stopped before with no issue at all.
I suspect I get this because of a timeout, maybe, but of course I'm not sure.
So, how can I work out what the issue is here?
If the problem is that it makes too many requests (even despite the sleep), how can I deal with it?
Thank you.
From your description of the problem, it might be that the host throttles your client when you issue too many requests in a given time. This is a common protection against DoS attacks and ill-behaved robots - like yours.
The clean solution here is to check whether the site has a robots.txt file, and if so, parse it and respect the rules - otherwise, set a large enough wait time between two requests so you don't get kicked.
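For example, a minimal sketch using the standard-library parser (robotparser in Python 2, urllib.robotparser in Python 3; the domain is a placeholder):

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")   # placeholder domain
rp.read()

if rp.can_fetch("*", "http://example.com/some/page"):
    pass   # allowed: go ahead and fetch the page
else:
    print "robots.txt disallows fetching this page"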
Also, you can get quite a few other issues - 404s, lost network connections, etc. - and even page-load timing issues with selenium.webdriver, as documented here:
Dependent on several factors, including the OS/Browser combination, WebDriver may or may not wait for the page to load. In some circumstances, WebDriver may return control before the page has finished, or even started, loading. To ensure robustness, you need to wait for the element(s) to exist in the page using Explicit and Implicit Waits.
As for your IndexError: you blindly assume that you'll get at least one URL (which means at least one iframe), which might not be the case for any of the reasons above (and a few others too). First make sure you properly handle all the corner cases, then fix your code so it doesn't assume it has at least one URL:
url = None
if len(urls) > 1:
    print "too many urls collected in iframes"
elif len(urls) == 1:
    url = urls[0]
else:
    print "no url found"
Also, if all you want is the first http://disqus URL you can find, there's no need to collect them all, filter them, and then return the first:
def get_url(webpage_url):
    chrome_driver_path = '/Applications/chromedriver'
    driver = webdriver.Chrome(chrome_driver_path)
    driver.get(webpage_url)
    iframes = driver.find_elements_by_tag_name("iframe")
    try:
        # check the url of each iframe and return the first Disqus one found
        for iframe in iframes:
            driver.switch_to_frame(iframe)
            time.sleep(3)
            if driver.current_url.startswith('http://disqus'):
                return driver.current_url
            driver.switch_to_default_content()
        return None  # nothing found
    finally:
        driver.quit()

itertools.count returning more than one item per count

Here is a pastebin of the full script: http://pastebin.com/TfAc8sYM
Unfortunately there's no way for others to test this in my specific use-case as the API is not public.
There is a series of URLs of the form http://example.com/api/users/get.ext?user=X
I want to go through these one at a time, counting up and executing code for each URL.
If I load the API in the browser, the XML looks just fine. Likewise, if I print the requests.text data in the terminal, everything works correctly.
However, when I run the script I get multiple outputs for the same API URL. I can see this in my database and printed on the command line. The number of repeated lines per user entry seems consistent each time the script runs, but it is inconsistent from user to user. The data within the repeated entries is identical.
Am I approaching the counting wrong?
Here's a sample of the XML. It's from the Vanilla Forums API: http://pastebin.com/aR51ShTM
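For reference, a bare-bones sketch of stepping through the user IDs one at a time, issuing exactly one request per count (the URL pattern is the one from the question; process_user and the stopping condition are assumptions, since the pastebin script isn't reproduced here):

import itertools
import requests

base_url = "http://example.com/api/users/get.ext?user={0}"

for user_id in itertools.count(1):
    response = requests.get(base_url.format(user_id))
    if response.status_code != 200:
        # Assumption: a non-200 response means there are no more users.
        break
    process_user(response.text)   # hypothetical handler for one user's XML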

Screen scraping with Python/Scrapy/Urllib2 seems to be blocked

To help me learn Python, I decided to screen-scrape the football commentaries from the ESPNFC website's 'live' page (such as here).
It was working up until a day ago, but having finally sorted some things out, I went to test it and the only piece of commentary I got back was [u'Commentary Not Available'].
Does anyone have any idea how they are doing this, and any easy and quick ways around it? I am using Scrapy/XPath and Urllib2.
Edit//
for game_id in processQueue:
    data_text = getInformation(game_id)
    clean_events_dict = getEvents(data_text)
    break
Doesn't work the same as
i = getInformation(369186)
j = getEvents(i)
In the first sample, processQueue is a list of game_ids. The first of these is given to the script to start scraping, and the loop breaks out before it has a chance to move on to another game_id.
In the second sample I use a single game id.
The first one fails and the second one works, and I have absolutely no idea why. Any ideas?
There are a few things you can try, assuming you can still access the data from your browser. Bear in mind, however, that website operators are generally within their rights to block you; this is why projects that rely on scraping a single site are a risky proposition. Here they are:
Delay a few seconds between each scrape
Delay a random number of seconds between each scrape
Accept cookies during your scraping session
Run JavaScript during your session (not possible with Scrapy as far as I know)
Share the scraping load between several IP ranges
There are other strategies which, I generally argue, are less ethical:
Modify your User Agent string to make your scraper look like a browser
I suggest in this answer that scrapers should be set up to obey robots.txt. However, if you program your scraper to be well-behaved, site operators will have fewer reasons to go to the trouble of blocking you. The most frequent errors I see in this Stack Overflow tag are simply that scrapers are being run far too fast, and they accidentally cause a (minor) denial of service. So, try slowing down your scrapes first and see if that helps.
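As a minimal sketch of slowing down between requests, a randomized pause looks something like this (the URLs and parse_page are placeholders, not part of the original code):

import random
import time
import urllib2

urls = ["http://example.com/live/%d" % n for n in range(1, 6)]   # placeholder URLs

for url in urls:
    html = urllib2.urlopen(url).read()
    parse_page(html)   # hypothetical parsing function
    # Pause a random 2-8 seconds so the requests don't hammer the server.
    time.sleep(random.uniform(2, 8))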
