I am using pywikibot in Python to get all revisions of a Wikipedia page.
import pywikibot as pw
wikiPage='Narthaki'
page = pw.Page(pw.Site('en'), wikiPage)
revs = page.revisions(content=True)
How do I know which of the revisions were reverts? I see from https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Narthaki that the page has one revert edit. Not sure how to get more information about this from the revision object.
Request your help. Many thanks!
"Revert" is not a well-defined concept so it depends on how you define it. (See https://phabricator.wikimedia.org/T152434 for some relevant discussion.) The most capable revert detection tool today is probably mwrevert.
You can compare the text of revisions directly, or look for revisions that have the same sha1 hash:
>>> rev = next(revs)
>>> rev.sha1
'1b02fc4cbcfd1298770b16f85afe0224fad4b3ca'
If two revisions have the same text/hash, it means that the newer one is a revert to the older one. Of course, there are some special cases to consider, such as sha1hidden revisions or how to handle multiple reverts to the same revision.
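For example, a minimal sketch of that idea: walk the history from oldest to newest and flag any revision whose hash matches an earlier revision of the same page (the pairing of reverting and reverted-to revision ids is my own bookkeeping):

import pywikibot as pw

page = pw.Page(pw.Site('en'), 'Narthaki')

seen = {}        # sha1 -> revid of the first revision with that text
reverts = []     # (reverting revid, reverted-to revid)

# reverse=True yields revisions from oldest to newest
for rev in page.revisions(reverse=True):
    sha1 = getattr(rev, 'sha1', None)
    if sha1 is None:          # e.g. sha1hidden revisions
        continue
    if sha1 in seen:
        # same text as an earlier revision -> treat as a revert to it
        reverts.append((rev.revid, seen[sha1]))
    else:
        seen[sha1] = rev.revid

print(reverts)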
Related
I want to get the text of the edit made on a Wikipedia page before and after the edit. I have this url:
https://en.wikipedia.org/w/index.php?diff=328391582&oldid=328391343
But, I want the text in the json format so that I can directly use it in my program. Is there any API provided by MediaWiki that gives me the old and new text after an edit or do I have to parse the HTML page using a parser?
Try this: https://www.mediawiki.org/wiki/API:Revisions
There are a few options which may be of use, such as:
rvparse: Parse revision content. For performance reasons if this option is used, rvlimit is enforced to 1.
rvdifftotext: Text to diff each revision to.
If those fail, there's still
rvprop / ids: Get the revid and, from 1.16 onward, the parentid
Then once you get the parent ID, you can compare the text of the two.
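For example, a rough sketch with the requests library, using the revision id from the URL above; the JSON layout is the standard query/revisions response with rvslots=main:

import requests

API = "https://en.wikipedia.org/w/api.php"

def get_revision(revid):
    """Return (wikitext, parentid) for a given revision id."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": revid,
        "rvprop": "ids|content",
        "rvslots": "main",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    rev = page["revisions"][0]
    return rev["slots"]["main"]["*"], rev.get("parentid")

new_text, parent_id = get_revision(328391582)
old_text, _ = get_revision(parent_id)       # text before the edit
print(old_text == new_text)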
Leaving a note in JavaScript on how to query the Wikipedia API to get all the recent edits.
In some cases the article gets locked and the recent edits can't be seen:
🔐 This article is semi-protected due to vandalism
Querying the API as follows allows you to read all the edits.
fetch("https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=revisions&format=json&titles=Timeline_of_the_2020_United_States_presidential_election&rvslots=*&rvprop=timestamp|user|comment|content")
.then(v => v.json()).then((function(v){
main.innerHTML = JSON.stringify(v, null, 2)
})
)
<pre id="main" style="white-space: pre-wrap"></pre>
See also How to get Wikipedia content as text by API?
You can try WikiWho. It tracks every single token written in Wikipedia (with 95% accuracy). In a nutshell, it assigns IDs to every token, and it tracks them based on the context. You just need to check for the existence (or not) of the ID between two revisions (it works even if the revisions are not consecutive).
There is a wrapper and a tutorial. There is a bug in the tutorial because the name of the article changed (instead of "bioglass", you should look for "Bioglass_45S5").
You can (sometimes) access the tutorial online.
I'm using PRAW to create a Reddit bot that submits something once a day. After submitting I want to save the url of the submission and write it to a text file.
url = r.submit(subreddit, submission_title, text=submission_text)
The above returns a Submission object, but I want the actual url. Is there a way to get the url from a Submission object, or do I need to do something else to get the url?
submission.shortlink (previously .short_link) is what you're looking for, if submission.permalink wasn't good enough.
import praw

reddit = praw.Reddit("Amos")
submission = reddit.get_submission(submission_id="XYZ")
print(submission.permalink)
# www.reddit.com/r/subreddit/comments/XYZ
I see that TankorSmash has answered your question already, though I thought I might add some fundamental knowledge for future reference:
If you use "dir(object)," you'll be able to see both attributes and methods that pertain to the Reddit API (which you may use to test and see all properties that effect the given object being tested). You can ignore everything that starts with an underscore (most likely).
An example would be:
submissionURL = submission.url
Or you can go straight to the source where PRAW is getting its data. The variable names are not set by PRAW; they come from this JSON (linked above).
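For reference, on current PRAW versions (4 and later) the same lookup would look roughly like this; the credentials and submission id are placeholders:

import praw

# placeholder credentials for a script-type app
reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="Amos")

submission = reddit.submission(id="XYZ")   # "XYZ" is a placeholder id
print(submission.permalink)                # /r/subreddit/comments/XYZ/...
print(submission.shortlink)                # https://redd.it/XYZ
print(submission.url)                      # link target (or the permalink for self posts)

# dir() shows the attributes and methods PRAW exposes on the object
print([name for name in dir(submission) if not name.startswith("_")])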
I need to create software in Python which monitors sites for changes. At the moment I have a periodic task that compares the content of a site with the previous version. Is there an easier way to check whether the content of a site has changed, maybe the time of the last change, so I can avoid downloading the content every time?
You could use the HEAD HTTP method and look at the Last-Modified and ETag headers, etc. before actually downloading the full content again.
However, nothing guarantees that the server will actually update these headers when the entity's (URL's) content changes, or indeed respond properly to the HEAD method at all.
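A minimal sketch of that idea with the requests library; the URL and the stored previous values are placeholders:

import requests

url = "https://example.com/page"                       # placeholder URL
previous = {"etag": None, "last_modified": None}       # values from the last check

resp = requests.head(url, allow_redirects=True)
etag = resp.headers.get("ETag")
last_modified = resp.headers.get("Last-Modified")

if etag != previous["etag"] or last_modified != previous["last_modified"]:
    print("Headers changed; fetch the full content to confirm.")
else:
    print("Headers unchanged; probably no change.")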
Although it doesn't answer your question, I think it's worth mentioning that you don't have to store the previous version of the website to look for changes. You can just compute its MD5 sum and store that, then compute the sum for the new version and check whether they are equal.
And about the question itself, AKX gave a great answer - just look at the Last-Modified header, but remember it is not guaranteed to work.
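A minimal sketch of the hash-comparison idea; requests is assumed for fetching and the URL is a placeholder:

import hashlib
import requests

url = "https://example.com/page"      # placeholder URL

def content_hash(url):
    return hashlib.md5(requests.get(url).content).hexdigest()

old_hash = content_hash(url)
# ... on the next periodic check ...
if content_hash(url) != old_hash:
    print("Content changed.")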
One feature I would like to add to my django app is the ability for users to create some content (without signing up / creating an account), and then generating a content-specific link that the users can share with others. Clicking on the link would take the user back to the content they created.
Basically, I'd like the behavior to be similar to sites like pastebin - where users get a pastebin link they can share with other people (example: http://pastebin.com/XjEJvSJp)
I'm not sure what the best way is to generate these types of links - does anyone have any ideas?
Thanks!
You can create these links in any way you want, as long as each link is unique. For example, take the MD5 of the content and use the first 8 characters of the hex digest.
A simple model for that could be:
class Permalink(models.Model):
    key = models.CharField(primary_key=True, max_length=8)
    refersTo = models.ForeignKey(MyContentModel, unique=True, on_delete=models.CASCADE)
You could also make refersTo a property that automatically assigns a unique key (as described above).
And you need a matching URL:
url("^permalink/(?P<key>[a-f0-9]{8})$",
"view.that.redirects.to.permalink.refersTo"),
You get the idea...
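For concreteness, a rough sketch of both pieces; MyContentModel's text field and get_absolute_url() are assumptions about your content model:

import hashlib

from django.shortcuts import get_object_or_404, redirect

from .models import Permalink

def make_permalink(content):
    # key = first 8 hex characters of the MD5 of the content, as suggested above
    key = hashlib.md5(content.text.encode("utf-8")).hexdigest()[:8]
    link, _created = Permalink.objects.get_or_create(
        key=key, defaults={"refersTo": content})
    return link

def permalink_redirect(request, key):
    # the view the URL pattern above points at: look up the key and redirect
    link = get_object_or_404(Permalink, key=key)
    return redirect(link.refersTo.get_absolute_url())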
Usually all this amounts to is a (possibly random, possibly sequential) token plus the content, stored in a DB and then served up on demand.
If you don't mind that your URLs will get a bit longer you can have a look at the uuid module. This should guarantee unique IDs.
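For example:

import uuid

token = uuid.uuid4().hex        # random 128-bit id as a 32-character hex string
print(token)                    # e.g. '3f2b8c1e9a4d4e0bb1c5d6a7e8f90123'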
Basically you just need a view that stores data and a view that shows it.
e.g. Store with:
server.com/objects/save
And then, after storing the new model, it could be reached with
server.com/objects/[id]
Where [id] is the id of the model you created when you saved.
This doesn't require users to sign in - it could work for anonymous users as well.
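A rough sketch of those two views; the model, field, template, and URL names are all placeholders:

from django.shortcuts import get_object_or_404, redirect, render

from .models import MyContentModel

def save_object(request):
    obj = MyContentModel.objects.create(text=request.POST.get("text", ""))
    # assumes a URL pattern named "show_object" that takes the object's id
    return redirect("show_object", pk=obj.pk)

def show_object(request, pk):
    obj = get_object_or_404(MyContentModel, pk=pk)
    return render(request, "objects/show.html", {"object": obj})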
As a personal project I am trying to write a wiki with the help of Django. I'm a beginner when it comes to web development. I am at the (early) point where I need to decide how to store the wiki pages. I have three approaches in mind and would like to know your suggestions.
Flat files
I considered a flat-file approach with a version control system like Git or Mercurial. Firstly, I would have some example wikis to look at, like http://hatta.sheep.art.pl/. Secondly, the VCS would probably deal with editing conflicts and keep the edit history, so I would not have to reinvent the wheel. And thirdly, I could probably easily clone the wiki repository, so I (or for that matter others) could have an offline copy of the wiki.
On the other hand, as far as I know, I cannot use Django models with flat files. Then, if I wanted to add fields to a wiki page, like a category, I would need to somehow keep a reference to that flat file in order to associate the fields in the database with the flat file. Besides, I don't know if it is a good idea to have all the wiki pages in one repository; I imagine it is more natural to have something like a repository per wiki page or file. Last but not least, I'm not sure, but I think using flat files would limit my deployment options, because some web hosts don't allow creating files (I'm thinking, for example, of Google App Engine).
Storing in a database
By storing the wiki pages in the database I can utilize Django models and associate arbitrary fields with each wiki page. I would probably also have an easier life deploying the wiki. But I would not get VCS features like history and conflict resolution per se. I searched for Django extensions that could help me and found django-reversion. However, I do not fully understand whether it fits my needs: does it track model changes (for example, if I change the Django model file), or does it track the content of the model instances (which would fit my need)? Plus, I do not see whether django-reversion would help me with edit conflicts.
Storing a VCS repository in a database field
This would be my ideal solution. It would combine the advantages of both previous approaches without the disadvantages. That is, I would have VCS features, but I would save the wiki pages in a database. The problem is: I have no idea how feasible that is. I just imagine saving a wiki page/source together with a Git/Mercurial repository in a database field. Yet, I somehow doubt database fields work like that.
So, I'm open for any other approaches but this is what I came up with. Also, if you're interested, you can find the crappy early test I'm working on here http://github.com/eugenkiss/instantwiki-test
In none of your choices have you considered whether you wish to be able to search your wiki. If this is a consideration, having the 'live' copy of each page in a database with full text search would be hugely beneficial. For this reason, I would personally go with storing the pages in a database every time - otherwise you'll have to create your own index somewhere.
As far as version logging goes, you only need to store the live copy in an indexable format. You could automatically create a history item within your 'page' model when a changed page is written back to the database. You can cut down on the storage overhead of earlier page revisions by compressing the data, should this become necessary.
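A rough sketch of that idea: keep the live page in one table and, whenever the text changes, write a zlib-compressed copy of the previous text into a history table. All model and field names are placeholders:

import zlib

from django.db import models

class Page(models.Model):
    title = models.CharField(max_length=200, unique=True)
    text = models.TextField()

    def save(self, *args, **kwargs):
        if self.pk:
            old = Page.objects.get(pk=self.pk)
            if old.text != self.text:
                PageHistory.objects.create(
                    page=self,
                    compressed_text=zlib.compress(old.text.encode("utf-8")),
                )
        super().save(*args, **kwargs)

class PageHistory(models.Model):
    page = models.ForeignKey(Page, on_delete=models.CASCADE, related_name="history")
    compressed_text = models.BinaryField()
    saved_at = models.DateTimeField(auto_now_add=True)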
If you're expecting a massive amount of change logging, you might want to read this answer here:
How does one store history of edits effectively?
Creating a wiki is fun and rewarding, but there are a lot of prebuilt wiki software packages already. I suggest Wikipedia's List of wiki software. In particular, MoinMoin and Trac are good. Finally, John Sutherland has made a wiki using Django.