I want to get the text of a Wikipedia page as it was before and after a particular edit. I have this URL:
https://en.wikipedia.org/w/index.php?diff=328391582&oldid=328391343
But I want the text in JSON format so that I can use it directly in my program. Is there an API provided by MediaWiki that gives me the old and new text of an edit, or do I have to parse the HTML page with a parser?
Try this: https://www.mediawiki.org/wiki/API:Revisions
There are a few options which may be of use, such as:
rvparse: Parse revision content. For performance reasons, if this option is used, rvlimit is enforced to 1.
rvdifftotext: Text to diff each revision to.
If those fail, there's still:
rvprop / ids: Get the revid and, from 1.16 onward, the parentid
Then once you get the parent ID, you can compare the text of the two.
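A minimal sketch of that, using Python's requests and the two revision IDs from the diff URL in the question; formatversion=2 just gives a cleaner JSON layout:

import requests

params = {
    "action": "query",
    "prop": "revisions",
    "revids": "328391343|328391582",  # old and new revision from the URL
    "rvprop": "ids|content",
    "rvslots": "main",
    "format": "json",
    "formatversion": "2",
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()

# Both revisions belong to the same page, so they arrive under one page entry.
revisions = resp["query"]["pages"][0]["revisions"]
texts = {rev["revid"]: rev["slots"]["main"]["content"] for rev in revisions}
old_text, new_text = texts[328391343], texts[328391582]

From there you can diff old_text and new_text yourself, e.g. with Python's difflib.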
A note in JavaScript on how to query the Wikipedia API for the recent edits of a page.
In some cases an article gets locked and its recent edits can't be seen on the page itself:
🔐 This article is semi-protected due to vandalism
Querying the API as follows still lets you read the edits.
fetch("https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=revisions&format=json&titles=Timeline_of_the_2020_United_States_presidential_election&rvslots=*&rvprop=timestamp|user|comment|content")
.then(v => v.json()).then((function(v){
main.innerHTML = JSON.stringify(v, null, 2)
})
)
<pre id="main" style="white-space: pre-wrap"></pre>
See also How to get Wikipedia content as text by API?
You can try WikiWho. It tracks every single token written in Wikipedia (with ~95% accuracy). In a nutshell, it assigns an ID to every token and tracks each one based on its context; you just need to check for the existence (or absence) of the ID between two revisions (it works even if the revisions are not consecutive).
There is a wrapper and a tutorial. Note that there is a bug in the tutorial because the name of the article changed: instead of "bioglass", you should look for "Bioglass_45S5".
You can (sometimes) access the tutorial online.
Related
I am working on an application that sends logs to GCP Stackdriver. I want to put custom "tags" (or summary fields) natively on my log entries. I am looking for a solution that doesn't rely on defining custom summary fields in the console, as those are not permanent and not project-wide.
I noticed that some loggers have tags displayed. For example, GCF logs show their execution_id. Using the following snippet, I can verify that the displayed tags depend on the name of the logger:
from google.cloud import logging

client = logging.Client()
# An arbitrary logger name: the label is not displayed as a tag.
client.logger(name="custom").log_text("foobar", labels={"execution_id": "foo"})
# The Cloud Functions logger name: execution_id is displayed as a tag.
client.logger(name="cloudfunctions.googleapis.com%2Fcloud-functions").log_text("foobar", labels={"execution_id": "foo"})
If you filter your logs on "foobar", you will see that only the second entry has "foo" as a tag.
That tag matches the execution_id label specified in the code. The problem is that I cannot add custom labels: if I add another label that is not execution_id, it is not displayed as a tag (though it is still found in the log body).
It looks like each monitored resource has its own set of tags, e.g. BigQuery resources use protoPayload.authenticationInfo.principalEmail as a tag. But I cannot find a way to specify my own.
Does anybody have experience with this kind of issue?
Thanks in advance
The closest solution I found: in an expanded log entry, click on a field within the JSON representation, then in the resulting panel select "Add field to summary line".
For more information about this topic, please refer to this link.
Additionally, I found a feature request opened with the product team, where the user in that case wants to filter in Stackdriver by Dataflow jobs' custom labels; the reference might be useful for your use case. No ETA was shared, nor any guarantee of its implementation.
I've filed a feature request on your behalf with the product team; they'll evaluate the possibility of implementing functionality that fits your use case. You can follow up on this PIT [1], where you will be able to receive further updates from the team as well.
Keep in mind that there is no ETA, nor any guarantee that this will be implemented. Please feel free to ask for updates directly on the PIT, and I would appreciate it if you accepted my answer, if it was helpful to you.
[1] https://issuetracker.google.com/172667238
I am using pywikibot in python to get all revisions of a Wikipedia page.
import pywikibot as pw

wikiPage = 'Narthaki'
page = pw.Page(pw.Site('en'), wikiPage)
revs = page.revisions(content=True)
How do I know which of the revisions were reverts? I see from https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/Narthaki that the page has one revert edit, but I am not sure how to get that information from the revision objects.
Many thanks!
"Revert" is not a well-defined concept so it depends on how you define it. (See https://phabricator.wikimedia.org/T152434 for some relevant discussion.) The most capable revert detection tool today is probably mwrevert.
You can compare revision texts directly, or look for revisions that have the same SHA-1 hash:
>>> rev = next(revs)
>>> rev.sha1
'1b02fc4cbcfd1298770b16f85afe0224fad4b3ca'
If two revisions have the same text/hash, it means that the newer one is a revert to the older one. Of course there are some special cases to consider, such as sha1hidden revisions, or how to handle multiple reverts to the same revision.
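Building on that, here is a minimal sketch (not the mwreverts library) that flags a revision as a revert when its SHA-1 matches an earlier revision of the same page. It assumes pywikibot Revision objects expose revid, timestamp and sha1, with sha1 absent for sha1hidden revisions:

import pywikibot as pw

page = pw.Page(pw.Site('en'), 'Narthaki')
seen = {}  # sha1 -> revid of the earliest revision with that hash
for rev in sorted(page.revisions(), key=lambda r: r.timestamp):
    sha1 = getattr(rev, 'sha1', None)
    if sha1 is None:
        continue  # hash hidden or unavailable for this revision
    if sha1 in seen:
        print(f"revid {rev.revid} is a revert to revid {seen[sha1]}")
    else:
        seen[sha1] = rev.revid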
I'm using PRAW to create a Reddit bot that submits something once a day. After submitting I want to save the url of the submission and write it to a text file.
url = r.submit(subreddit, submission_title, text=submission_text)
The above returns a Submission object, but I want the actual URL. Is there a way to get the URL from a Submission object, or do I need to do something else to get it?
submission.shortlink (previously .short_link) is what you're looking for, if submission.permalink wasn't good enough.
import praw

reddit = praw.Reddit("Amos")
submission = reddit.get_submission(submission_id="XYZ")
print(submission.permalink)
# www.reddit.com/r/subreddit/comments/XYZ
I see that @TankorSmash has answered your question already, though I thought I might add some fundamental knowledge for future reference:
If you use dir(object), you'll be able to see both the attributes and the methods that pertain to the Reddit API (which you may use to inspect all the properties of the object being tested). You can most likely ignore everything that starts with an underscore.
An example would be:
submissionURL = submission.url
Or you can go straight to the source where PRAW gets its data. The variable names are not set by PRAW; they come from this JSON (linked above).
Assembla provides a simple way to fetch all commits of an organisation using api.assembla.com/v1/activity.json, which takes to and from parameters allowing you to get commits for a selected date range from all the spaces (repos) the user participates in.
Is there any similar way in GitHub?
I found these for Github:
/repos/:owner/:repo/commits
Accepts since and until parameters for getting commits in a selected date range. But since I want commits from all repos, I would have to loop over all those repos and fetch commits for each one (see the sketch after this list).
/users/:user/events
This shows the commits of a user (among other events). I don't have any problem looping over all the users in the org, but how can I get events for a particular date?
/orgs/:org/events
This shows the commits of all users across all repos, but I don't know how to fetch them for a particular date either.
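As a reference for the looping approach from the first endpoint above, here is a minimal sketch; ORG and TOKEN are placeholders, and the commits endpoint returns 409 for empty repositories:

import requests

ORG = "my-org"  # placeholder organisation name
TOKEN = "..."   # placeholder personal access token
HEADERS = {"Authorization": f"token {TOKEN}"}
SINCE, UNTIL = "2013-01-01T00:00:00Z", "2013-01-08T00:00:00Z"

repos = requests.get(f"https://api.github.com/orgs/{ORG}/repos",
                     headers=HEADERS).json()
for repo in repos:
    r = requests.get(
        f"https://api.github.com/repos/{ORG}/{repo['name']}/commits",
        params={"since": SINCE, "until": UNTIL}, headers=HEADERS)
    if r.status_code != 200:
        continue  # e.g. 409 for an empty repository
    for commit in r.json():
        print(repo["name"], commit["sha"], commit["commit"]["author"]["date"])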
The problem with using the /users/:user/events endpoint is that you don't just get the PushEvents; you have to skip over non-commit events and perform more calls to the API. Assuming you're authenticated, you should be safe so long as your users aren't hyperactive.
For /orgs/:org/events I don't think they accept parameters for anything, but I can check with the API designers.
And just in case you aren't familiar: these are all paginated results, so you can go back to the beginning with the Link headers. My library (github3.py) provides iterators to do this for you automatically, and you can also tell it how many events you'd like (same with commits, etc.). But yeah, I'll come back and edit after talking to the API guys at GitHub.
Edit: Conversation
You might want to check out the GitHub Archive project -- http://www.githubarchive.org/, and the ability to query the archive using Google's BigQuery. Sounds like it would be a perfect tool for the job -- I'm pretty sure you could get exactly what you want with a single query.
The other option is to call the GitHub API: iterate over all events for the organization and filter out the ones that don't satisfy your date range and event type (commit) criteria. But since you can't specify date ranges in the API call, you will probably make a lot of calls to get the events that interest you. Notice that you don't have to iterate over every page starting from 0 to find the page that contains the first result in the date range; just do a (variation of) binary search over page numbers to find any page that contains a commit in the date range, and then iterate in both directions until you break out of the date range. That should reduce the number of API calls you make. A sketch of that search follows.
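Here is a minimal sketch of that binary search, assuming /orgs/:org/events pages are ordered newest-first; org and token are placeholders. (The events API only retains a limited window of recent events, so max_page stays small.)

import requests
from datetime import datetime, timezone

def iso(ts):
    # GitHub timestamps look like "2013-01-01T12:00:00Z"
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def page_events(org, page, token):
    r = requests.get(f"https://api.github.com/orgs/{org}/events",
                     params={"page": page, "per_page": 100},
                     headers={"Authorization": f"token {token}"})
    r.raise_for_status()
    return r.json()

def find_page_in_range(org, token, start, end, max_page=10):
    # Binary search for any page whose events overlap [start, end];
    # from there, iterate in both directions to collect the full range.
    lo, hi = 1, max_page
    while lo <= hi:
        mid = (lo + hi) // 2
        events = page_events(org, mid, token)
        if not events:
            hi = mid - 1  # past the last available page
            continue
        newest, oldest = iso(events[0]["created_at"]), iso(events[-1]["created_at"])
        if oldest > end:
            lo = mid + 1  # whole page too recent: look further back
        elif newest < start:
            hi = mid - 1  # whole page too old: look at newer pages
        else:
            return mid    # this page overlaps the date range
    return None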
One feature I would like to add to my django app is the ability for users to create some content (without signing up / creating an account), and then generating a content-specific link that the users can share with others. Clicking on the link would take the user back to the content they created.
Basically, I'd like the behavior to be similar to sites like pastebin - where users get a pastebin link they can share with other people (example: http://pastebin.com/XjEJvSJp)
I'm not sure what the best way is to generate these types of links - does anyone have any ideas?
Thanks!
You can create these links in any way you want, as long as each link is unique. For example, take the MD5 of the content and use the first 8 characters of the hex digest (see the sketch at the end of this answer).
A simple model for that could be:
class Permalink(models.Model):
    key = models.CharField(primary_key=True, max_length=8)
    refersTo = models.ForeignKey(MyContentModel, unique=True)
You could also make refersTo a property that automatically assigns a unique key (as described above).
And you need a matching URL:
url("^permalink/(?P<key>[a-f0-9]{8})$",
"view.that.redirects.to.permalink.refersTo"),
You get the idea...
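As promised, a sketch of the MD5 idea; Permalink is the model above, and text and content_object are placeholder variables:

import hashlib

def make_key(content: str) -> str:
    # First 8 hex characters of the MD5 digest. Collisions are unlikely at
    # this scale, but real code should handle a duplicate key on save.
    return hashlib.md5(content.encode("utf-8")).hexdigest()[:8]

link = Permalink(key=make_key(text), refersTo=content_object)
link.save()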
Usually such a link is made up of a (possibly random, possibly sequential) token plus the content, stored in a DB and then served up on demand.
If you don't mind that your URLs will get a bit longer you can have a look at the uuid module. This should guarantee unique IDs.
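For instance, a one-liner with uuid4 (a random 128-bit ID rendered as 32 hex characters):

import uuid

key = uuid.uuid4().hex  # e.g. '3f4e...' (32 chars); longer, but effectively unique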
Basically you just need a view that stores data and a view that shows it.
e.g. Store with:
server.com/objects/save
And then, after storing the new model, it could be reached with
server.com/objects/[id]
Where [id] is the id of the model you created when you saved.
This doesn't require users to sign in - it could work for anonymous users as well.