Get last changes on site - python

I need to create software in Python that monitors sites and detects when changes have happened. At the moment I have a periodic task that compares the content of a site with the previous version. Is there an easier way to check whether the content of a site has changed, maybe the time of the last change, so I can avoid downloading the content every time?

You could use the HEAD HTTP method and look at the Last-Modified and ETag headers, etc. before actually downloading the full content again.
However, nothing guarantees that the server will actually update these headers when the entity's (URL's) content changes, or even respond properly to the HEAD method at all.
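For example, with the requests library you could do a HEAD request first and then a conditional GET using the validators the server returned. This is only a sketch, the URL is a placeholder, and as noted above many servers don't honour these headers:

import requests

URL = "http://example.com/page"  # placeholder URL for illustration

# First check: remember the validators the server sends back.
head = requests.head(URL, allow_redirects=True)
etag = head.headers.get("ETag")
last_modified = head.headers.get("Last-Modified")

# Later check: ask the server to return the body only if it changed.
headers = {}
if etag:
    headers["If-None-Match"] = etag
if last_modified:
    headers["If-Modified-Since"] = last_modified

resp = requests.get(URL, headers=headers)
if resp.status_code == 304:
    print("Not modified, nothing to download")
else:
    print("Content changed, %d bytes fetched" % len(resp.content))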

Although it doesn't answer your question, I think it's worth mentioning that you don't have to store the previous version of the website to look for changes. You can just compute its MD5 sum and store that, then compute the sum for the new version and check whether they are equal.
As for the question itself, AKX gave a great answer: just look at the Last-Modified header, but remember it is not guaranteed to work.
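A minimal sketch of the hashing idea with hashlib and requests (the URL is just a placeholder):

import hashlib
import requests

URL = "http://example.com/page"  # placeholder URL

def content_hash(url):
    """Download the page and return an MD5 digest of its body."""
    body = requests.get(url).content
    return hashlib.md5(body).hexdigest()

previous = content_hash(URL)
# ... later, in the periodic task ...
current = content_hash(URL)
if current != previous:
    print("Site changed")
    previous = current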


ElasticSearch doesn't return hits when making consecutive calls - Python

For example, when I have a list of ids and want to search for them one by one to see whether each document already exists, one of two things happens:
First -> the first search request returns the correct doc, and all the calls after that return the same doc as the first one, even though I was searching for different ids.
Second -> the first search request returns the correct doc, and all the calls after that return an empty hits array, even though I was searching for different ids. The search metadata does tell me that "total" was 1 for the request, but no actual hits are returned.
I have been facing this weird behaviour with ElasticSearch.py and with raw HTTP requests as well.
Could it be a firewall that is causing some sort of weird caching behaviour?
Is there any way to force fresh results?
Any ideas are welcome at this point.
Thanks in advance!
It was the firewall caching that was causing the havoc! Once the caching was disabled for certain endpoints, the issue resolved itself. Painful!

Requests body empty

Why does the following print None?
import requests
r = requests.request('http://cnn.com', data={"foo":"bar"})
print r.request.body
# None
If you change cnn.com to www.cnn.com, it prints the proper body. I noticed a redirect (there is a 301 in r.history). What's going on?
Your code as it stands doesn't actually work—it'll raise a TypeError right off the bat. But I think I can guess at what you're trying to do.
If you change that request to a post, it will indeed successfully return None.
Why? Because you're asking for the body of the redirect, not the body of the original request. For that, you want r.history[0].request.body.
Read Redirection and History for more info. Note that auto-redirecting isn't actually documented to work for POST requests, even though it often does anyway. Also note that in earlier versions of requests, history entries didn't have complete Request objects. (You'll have to look at the version history if you need to know when that changed. But it seems to be there in 1.2.0, and not in 0.14.2, and a lot of things that were added or changed in 1.0.0 aren't really documented, because it was a major rewrite.)
As a side note… why do you need this? If you really need to know what body you sent, why not do the two-step process of creating a request and sending it, so you can see the body beforehand? (Or, for that matter, just encode the data explicitly?)
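For what it's worth, here is a rough sketch of that two-step approach with the modern requests API, using httpbin.org as a stand-in endpoint; it also shows where the original request's body ends up after a redirect:

import requests

# Build the request explicitly, then prepare it so the encoded body is
# visible before anything goes over the wire. httpbin.org is just a
# stand-in endpoint for illustration.
session = requests.Session()
req = requests.Request('POST', 'http://httpbin.org/post', data={"foo": "bar"})
prepared = req.prepare()
print(prepared.body)        # foo=bar

resp = session.send(prepared, allow_redirects=True)

# After redirects, the body of the *original* request lives in history.
if resp.history:
    print(resp.history[0].request.body)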

github api - fetch all commits of all repos of an organisation based on date

Assembla provides a simple way to fetch all commits of an organisation using api.assembla.com/v1/activity.json, and it takes to and from parameters that let you get the commits for a selected date range from all the spaces (repos) the user is participating in.
Is there any similar way in GitHub?
I found these for GitHub:
/repos/:owner/:repo/commits
Accepts since and until parameters for getting commits in a selected date range. But since I want commits from all repos, I have to loop over all those repos and fetch the commits for each one (see the sketch after this list).
/users/:user/events
This shows the commits of a user. I don't have any problem looping over all the users in the org, but how can I filter them by a particular date?
/orgs/:org/events
This shows commits of all users across all repos, but I don't know how to fetch them for a particular date either.
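Something like this rough sketch is what that per-repo loop would look like with requests (the organisation name, token and dates are placeholders, and pagination is ignored for brevity):

import requests

ORG = "my-org"                      # placeholder organisation name
TOKEN = "..."                       # personal access token
HEADERS = {"Authorization": "token %s" % TOKEN}
SINCE = "2013-05-01T00:00:00Z"
UNTIL = "2013-05-02T00:00:00Z"

# List the organisation's repos, then ask each one for commits in the window.
repos = requests.get("https://api.github.com/orgs/%s/repos" % ORG,
                     headers=HEADERS).json()
for repo in repos:
    commits = requests.get(
        "https://api.github.com/repos/%s/commits" % repo["full_name"],
        params={"since": SINCE, "until": UNTIL},
        headers=HEADERS).json()
    for c in commits:
        print("%s %s" % (repo["name"], c["sha"]))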
The problem with using the /users/:user/events endpoint is that you don't just get PushEvents, so you would have to skip over the non-commit events and make more calls to the API. Assuming you're authenticated, you should be safe so long as your users aren't hyperactive.
For /orgs/:org/events I don't think they accept parameters for anything, but I can check with the API designers.
And just in case you aren't familiar, these are all paginated results, so you can go back to the beginning with the Link headers. My library (github3.py) provides iterators to do this for you automatically; you can also tell it how many events you'd like (same with commits, etc.). But yeah, I'll come back and edit after talking to the API guys at GitHub.
Edit: Conversation
You might want to check out the GitHub Archive project -- http://www.githubarchive.org/, and the ability to query the archive using Google's BigQuery. Sounds like it would be a perfect tool for the job -- I'm pretty sure you could get exactly what you want with a single query.
The other option is to call the GitHub API: iterate over all events for the organization and filter out the ones that don't satisfy your date-range and event-type (commit) criteria. Since you can't specify date ranges in the API call, you will probably make a lot of calls to get the events that interest you. Notice that you don't have to iterate over every page starting from 0 to find the page that contains the first result in the date range -- just do a (variation of a) binary search over page numbers to find any page that contains a commit in the date range, and then iterate in both directions until you leave the date range. That should reduce the number of API calls you make.
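A linear version of that walk over the org events, filtering PushEvents by date (the organisation name and dates are placeholders; authentication and the binary-search refinement are left out):

from datetime import datetime
import requests

ORG = "my-org"                              # placeholder organisation
START = datetime(2013, 5, 1)
END = datetime(2013, 5, 2)

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

page, done = 1, False
while not done:
    events = requests.get("https://api.github.com/orgs/%s/events" % ORG,
                          params={"page": page}).json()
    if not events:
        break
    for event in events:
        created = parse(event["created_at"])
        if created < START:      # events come newest first; we've walked past the range
            done = True
            break
        if event["type"] == "PushEvent" and created <= END:
            for commit in event["payload"]["commits"]:
                print("%s %s" % (commit["sha"], commit["message"]))
    page += 1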

Caching online prices fetched via API unless they change

I'm currently working on a site that makes several calls to big name online sellers like eBay and Amazon to fetch prices for certain items. The issue is, currently it takes a few seconds (as far as I can tell, this time is from making the calls) to load the results, which I'd like to be more instant (~10 seconds is too much in my opinion).
I've already cached other information that I need to fetch, but that information is static. Is there a way that I can cache the prices but update them only when needed? The code is in Python and I store the info in a MySQL database.
I was thinking of somehow using cron or something along those lines to update it every so often, but it would be nice if there were a simpler and less intensive approach to this problem.
Thanks!
You can use memcache to do the caching. The first request will be slow, but the remaining requests should be instant. You'll want a cron job to keep it up to date though. More caching info here: Good examples of python-memcache (memcached) being used in Python?
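A small sketch with the python-memcached client; fetch_price_from_api is a made-up placeholder for the eBay/Amazon calls:

import memcache

mc = memcache.Client(["127.0.0.1:11211"])
PRICE_TTL = 60 * 30   # seconds before a cached price expires and gets re-fetched

def fetch_price_from_api(item_id):
    """Placeholder for the real eBay/Amazon calls (not shown here)."""
    raise NotImplementedError

def get_price(item_id):
    key = "price:%s" % item_id
    price = mc.get(key)                  # cache hit: no call to the sellers at all
    if price is None:
        price = fetch_price_from_api(item_id)
        mc.set(key, price, time=PRICE_TTL)
    return price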
Have you thought about displaying the cached data, then updating the prices via an ajax callback? You could notify the user if the price changed with a SO type notification bar or similar.
This way the users get results immediately, and updated prices when they are available.
Edit
You can use jQuery.
Assume you have a script named getPrices.php that returns a JSON array of objects, each with the id of an item and its price.
No error handling etc. here, just to give you an idea:
My necklace: <div id='1'> $123.50 </div><br>
My bracelet: <div id='2'> $13.50 </div><br>
...
<script>
$(document).ready(function() {
  $.ajax({ url: "getPrices.php", dataType: "json", success: function(data) {
    // data is expected to be an array of {id: ..., price: ...} objects
    $.each(data, function(i, item) {
      $('#' + item.id).html(item.price);
    });
  }});
});
</script>
You need to handle the following in your application:
get the price
determine if the price has changed
cache the price information
For step 1, you need to consider how often the item prices will change. I would go with your instinct to set a Cron job for a process which will check for new prices on the items (or on sets of items) at set intervals. This is trivial at small scale, but when you have tens of thousands of items the architecture of this system will become very important.
To avoid delays in page load, try to get the information ahead of time as much as possible. I don't know your use case, but it'd be good to prefetch the information as much as possible and avoid having the user wait for an asynchronous JavaScript call to complete.
For step 2, if it's a new item or if the price has changed, update the information in your caching system.
For step 3, you need to determine the best caching system for your needs. As others have suggested, memcached is a possible solution. There are a variety of "NoSQL" databases you could check out, or even cache the results in MySQL.
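As one possible shape for those three steps, here is a sketch of a MySQL-backed cache where a price is only re-fetched once its cached copy is older than a threshold; the table layout and fetch_price_from_api are made up for illustration:

import time
import MySQLdb  # assumes the MySQL-Python driver

STALE_AFTER = 60 * 30   # seconds after which a cached price is re-checked

def fetch_price_from_api(item_id):
    """Placeholder for step 1, the real eBay/Amazon call."""
    raise NotImplementedError

def get_price(db, item_id):
    # Assumed table:
    #   CREATE TABLE price_cache (item_id VARCHAR(64) PRIMARY KEY,
    #                             price DECIMAL(10,2), fetched_at INT)
    cur = db.cursor()
    cur.execute("SELECT price, fetched_at FROM price_cache WHERE item_id = %s",
                (item_id,))
    row = cur.fetchone()
    if row and time.time() - row[1] < STALE_AFTER:
        return row[0]                         # step 3: fresh enough, serve the cache
    price = fetch_price_from_api(item_id)     # step 1: price may have changed
    cur.execute("REPLACE INTO price_cache (item_id, price, fetched_at) "
                "VALUES (%s, %s, %s)", (item_id, price, int(time.time())))
    db.commit()                               # step 2: record the (possibly new) price
    return price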
How are you getting the price? If you are scraping the data from the normal HTML page using a tool such as BeautifulSoup, that may be slowing down the round-trip time. In that case, it might help to compute a fast checksum (such as MD5) of the page to see whether it has changed before parsing it. If you are using an API which gives a short XML version of the price, this is probably not an issue.

Creating database schema for parsed feed

Additional questions regarding SilentGhost's initial answer to a problem I'm having parsing Twitter RSS feeds. See also partial code below.
First, could I insert tags[0], tags[1], etc., into the database, or is there a different/better way to do it?
Second, almost all of the entries have a url, but a few don't; likewise, many entries don't have the hashtags. So, would the thing to do be to create default values for url and tags? And if so, do you have any hints on how to do that? :)
Third, when you say the single-table db design is not optimal, do you mean I should create a separate table for tags? Right now, I have one table for the RSS feed urls and another table with all the RSS entry data (summary, date, etc.).
I've pasted in a modified version of the code you posted. I had some success in getting a "tinyurl" variable into the sqlite database, but now it isn't working. Not sure why.
Lastly, assuming I can get the whole thing up and running (smile), is there a central site where people might appreciate seeing my solution? Or should I just post something on my own blog?
Best,
Greg
I would suggest reading up on database normalisation, especially on the 1st and 2nd normal forms. Once you're done with it, I hope there won't be a need for default values, and your db schema will evolve into something more appropriate.
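As an illustration of where normalisation tends to lead for this kind of data, here is a sketch of a possible sqlite3 schema with entries, tags, and a link table (the table and column names are only suggestions):

import sqlite3

conn = sqlite3.connect("feeds.db")          # file name is just an example
conn.executescript("""
CREATE TABLE IF NOT EXISTS entries (
    id       INTEGER PRIMARY KEY,
    feed_url TEXT NOT NULL,
    summary  TEXT,
    url      TEXT,              -- NULL when the entry has no link
    date     TEXT
);

CREATE TABLE IF NOT EXISTS tags (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);

-- many-to-many link: an entry can have zero or more tags
CREATE TABLE IF NOT EXISTS entry_tags (
    entry_id INTEGER REFERENCES entries(id),
    tag_id   INTEGER REFERENCES tags(id),
    PRIMARY KEY (entry_id, tag_id)
);
""")
conn.commit()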
There are plenty of options for sharing your source code on the web; depending on what versioning system you're most comfortable with, you might have a look at such well-known sites as Google Code, Bitbucket, GitHub and many others.
