Download entire history of a Wikipedia page - python

I'd like to download the entire revision history of a single article on Wikipedia, but am running into a roadblock.
It is very easy to download an entire Wikipedia article, or to grab pieces of its history using the Special:Export URL parameters:
curl -d "" 'https://en.wikipedia.org/w/index.php?title=Special:Export&pages=Stack_Overflow&limit=1000&offset=1' -o "StackOverflow.xml"
And of course I can download the entire site including all versions of every article from here, but that's many terabytes and way more data than I need.
Is there a pre-built method for doing this? (Seems like there must be.)

The example above only gets information about the revisions, not the actual contents themselves. Here's a short Python script that downloads the full content and metadata of every revision of a page into individual JSON files:
import mwclient
import json
import time
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Wikipedia']
for i, (info, content) in enumerate(zip(page.revisions(), page.revisions(prop='content'))):
    info['timestamp'] = time.strftime("%Y-%m-%dT%H:%M:%S", info['timestamp'])
    print(i, info['timestamp'])
    open("%s.json" % info['timestamp'], "w").write(json.dumps(
        {'info': info,
         'content': content}, indent=4))
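If you want to work with the saved files afterwards, reading one back is straightforward. A minimal sketch, assuming one of the timestamp-named files the script above produces (the filename here is only an example):

import json

# Load one of the per-revision JSON files written by the script above;
# the filename is just an example of its timestamp-based naming.
with open("2023-01-01T00:00:00.json") as f:
    revision = json.load(f)

print(revision['info']['timestamp'])
print(sorted(revision['content'].keys()))  # inspect what the content dict holds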

Wandering around aimlessly looking for clues to another question I have myself (my way of saying I know nothing substantial about this topic!), I just came upon this a moment after reading your question: http://mwclient.readthedocs.io/en/latest/reference/page.html. Have a look at the revisions method.
EDIT: I also see http://mwclient.readthedocs.io/en/latest/user/page-ops.html#listing-page-revisions.
Sample code using the mwclient module:
#!/usr/bin/env python3
import logging, mwclient, pickle, os
from mwclient import Site
from mwclient.page import Page
logging.root.setLevel(logging.DEBUG)
logging.debug('getting page...')
env_page = os.getenv("MEDIAWIKI_PAGE")
page_name = env_page if env_page is not None else 'Stack Overflow'
page_name = Page.normalize_title(page_name)
site = Site('en.wikipedia.org') # https by default. change w/`scheme=`
page = site.pages[page_name]
logging.debug('extracting revisions (may take a really long time, depending on the page)...')
revisions = []
for i, revision in enumerate(page.revisions()):
    revisions.append(revision)
logging.debug('saving to file...')
with open('{}Revisions.mediawiki.pkl'.format(page_name), 'wb+') as f:
    pickle.dump(revisions, f, protocol=0) # protocol allows backwards compatibility between machines
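To load the pickle back later, something along these lines should work; the filename is an assumption based on the default page name above ('Stack Overflow' normalizes to 'Stack_Overflow'), so adjust it to whatever the script actually wrote:

import pickle

# A minimal sketch for reading the pickled revision list back in;
# the filename assumes the default page name used above.
with open('Stack_OverflowRevisions.mediawiki.pkl', 'rb') as f:
    revisions = pickle.load(f)

print(len(revisions), 'revisions')
print(revisions[0])  # each item is a dict of revision metadata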

Related

Extract URLs recursively from website archives in scrapy

Hi, I want to crawl data from http://economictimes.indiatimes.com/archive.cms, where all the URLs are archived by date, month and year. To get the list of URLs I started from the code at https://github.com/FraPochetti/StocksProject/blob/master/financeCrawler/financeCrawler/spiders/urlGenerator.py and modified it for my website as follows:
import scrapy
import urllib
def etUrl():
    totalWeeks = []
    totalPosts = []
    url = 'http://economictimes.indiatimes.com/archive.cms'
    data = urllib.urlopen(url).read()
    hxs = scrapy.Selector(text=data)
    months = hxs.xpath('//ul/li/a').re('http://economictimes.indiatimes.com/archive.cms/\\d+-\\d+/news.cms')
    admittMonths = 12*(2013-2007) + 8
    months = months[:admittMonths]
    for month in months:
        data = urllib.urlopen(month).read()
        hxs = scrapy.Selector(text=data)
        weeks = hxs.xpath('//ul[@class="weeks"]/li/a').re('http://economictimes.indiatimes.com/archive.cms/\\d+-\\d+/news/day\\d+\.cms')
        totalWeeks += weeks
    for week in totalWeeks:
        data = urllib.urlopen(week).read()
        hxs = scrapy.Selector(text=data)
        posts = hxs.xpath('//ul[@class="archive"]/li/h1/a/@href').extract()
        totalPosts += posts
    with open("eturls.txt", "a") as myfile:
        for post in totalPosts:
            post = post + '\n'
            myfile.write(post)

etUrl()
I saved the file as urlGenerator.py and ran it with the command $ python urlGenerator.py.
I get no output. Could someone help me adapt this code to my website's use case, or suggest another solution?
Try stepping through your code one line at a time using pdb. Run python -m pdb urlGenerator.py and follow the instructions for using pdb in its documentation.
If you step through your code line by line you can immediately see that the line
data = urllib.urlopen(url).read()
is failing to return something useful:
(pdb) print(data)
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access "http://economictimes.indiatimes.com/archive.cms" on this server.<P>
Reference #18.6057c817.1508411706.1c3ffe4
</BODY>
</HTML>
It seems that they are not allowing access from Python's urllib. As pointed out in the comments, you really shouldn't be using urllib anyway; Scrapy is already adept at dealing with this, for example by sending a browser-like User-Agent, as sketched below.
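As an illustration only (the exact header value is an assumption, and some sites block scraping regardless, so check their terms of service), a browser-like User-Agent can be set in the project's settings.py or in the spider's custom_settings:

# settings.py: present a browser-like User-Agent so the site is less likely
# to reject the request outright. The exact string here is arbitrary.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0 Safari/537.36')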
A lot of the rest of your code is clearly broken as well. For example this line:
months = hxs.xpath('//ul/li/a').re('http://economictimes.indiatimes.com/archive.cms/\\d+-\\d+/news.cms')
returns an empty list even when given the real HTML from this site. If you look at the HTML, it's clearly a table, not unordered lists (<ul>). You also have the URL format wrong. Instead, something like this would work:
months = response.xpath('//table//tr//a/@href').re(r'/archive/year-\d+,month-\d+.cms')
If you want to build a web scraper, rather than starting from some code you found (that isn't even correct) and trying to blindly modify it, try following the official tutorial for Scrapy and start with some very simple examples, then build up from there. For example:
import scrapy
from scrapy.crawler import CrawlerProcess

class EtSpider(scrapy.Spider):
    name = 'et'
    start_urls = ["https://economictimes.indiatimes.com/archive.cms"]

    def parse(self, response):
        months = response.xpath('//table//tr//a/@href').re(r'/archive/year-\d+,month-\d+.cms')
        for month in months:
            self.logger.info(month)

process = CrawlerProcess()
process.crawl(EtSpider)
process.start()
This runs correctly, and you can clearly see it finding the correct URLs for the individual months, as printed to the log. From there you can use callbacks, as explained in the documentation, to make further requests, roughly as sketched below.
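As a rough sketch of that next step (the parse_month callback and its XPath are assumptions about the page structure, not something verified against the real site), the spider could follow each month link and log whatever it finds there; run it with the same CrawlerProcess boilerplate as above:

class EtSpider(scrapy.Spider):
    name = 'et'
    start_urls = ["https://economictimes.indiatimes.com/archive.cms"]

    def parse(self, response):
        # Same month extraction as above, then follow each month page.
        months = response.xpath('//table//tr//a/@href').re(r'/archive/year-\d+,month-\d+.cms')
        for month in months:
            yield response.follow(month, callback=self.parse_month)

    def parse_month(self, response):
        # Hypothetical: the XPath here is a guess and needs to be checked
        # against the actual HTML of the month pages.
        for href in response.xpath('//a/@href').extract():
            self.logger.info(href)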
In the end you'll save yourself a lot of time and hassle by reading the docs and getting some understanding of what you're doing rather than taking some dubious code off the internet and trying to shoehorn it into your problem.

How do I use python requests to download a processed file?

I'm using Django 1.8.1 with Python 3.4 and I'm trying to use requests to download a processed file. The following code works perfectly for a normal requests.get call that downloads the exact file at the server location, i.e. the unprocessed file.
The file needs to be processed based on the passed data (shown below as "data"). This data needs to be passed to the Django backend, which should use it to set variables, run an internal program on the server, and output a .gcode file instead of the .stl filetype.
Python file:
import requests, os, json
SERVER='http://localhost:8000'
authuser = 'admin@google.com'
authpass = 'passwords'
# data not implemented yet
##############################################
data = {'FirstName': 'Steve', 'Lastname': 'Escovar'}
############################################
category = requests.get(SERVER + '/media/uploads/9128342/141303729.stl', auth=(authuser, authpass))
#download to path file
path = "/home/bradman/Downloads/requestdata/newfile.stl"
if category.status_code == 200:
    with open(path, 'wb') as f:
        for chunk in category:
            f.write(chunk)
I'm very confused about this, but I think the best course of action is to pass the data along with requests.get and somehow write a function in my Django views.py to grab it. Does anyone have any ideas?
To send data with a request you can use
get(..., params=data)
(the data is appended to the URL as query parameters)
or
post(..., data=data)
(the data is sent in the request body, like an HTML form).
By the way, some APIs need both params= and data= in a single GET or POST request to send all the required information.
Read the requests documentation.
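For example, a rough sketch of your download script with the data passed as query parameters (whether the Django view reads them from request.GET or from the request body depends on how you implement it, so this is an assumption):

import requests

SERVER = 'http://localhost:8000'
authuser = 'admin@google.com'
authpass = 'passwords'

# Hypothetical: pass the processing parameters as query-string values;
# the backend view would read them from request.GET and return .gcode.
data = {'FirstName': 'Steve', 'Lastname': 'Escovar'}
category = requests.get(
    SERVER + '/media/uploads/9128342/141303729.stl',
    params=data,
    auth=(authuser, authpass),
)

path = "/home/bradman/Downloads/requestdata/newfile.gcode"
if category.status_code == 200:
    with open(path, 'wb') as f:
        for chunk in category.iter_content(chunk_size=8192):
            f.write(chunk)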

Download a Google Sites page Content Feed using gdata-python-client

My final goal is to import some data from Google Sites pages.
I'm trying to use gdata-python-client (v2.0.17) to download a specific Content Feed:
self.client = gdata.sites.client.SitesClient(source=SOURCE_APP_NAME)
self.client.client_login(USERNAME, PASSWORD, source=SOURCE_APP_NAME, service=self.client.auth_service)
self.client.site = SITE
self.client.domain = DOMAIN
uri = '%s?path=%s' % (self.client.MakeContentFeedUri(), '[PAGE PATH]')
feed = self.client.GetContentFeed(uri=uri)
entry = feed.entry[0]
...
The resulting entry.content holds the page content in XHTML format, but the tree doesn't contain any of the page's plain text, only the HTML structure and links.
For example, my test page has
<div>Some text</div>
but the ContentFeed entry only has the div node, with text=None.
I have debugged the gdata-python-client request/response and inspected the raw data returned by the server: there is no plain text in the content there either, so it looks like a Google API bug.
Maybe there is some workaround? Maybe there is some common request parameter I can use? What's going wrong here?
This code works for me against a Google Apps domain and gdata 2.0.17:
import atom.data
import gdata.sites.client
import gdata.sites.data
client = gdata.sites.client.SitesClient(source='yourCo-yourAppName-v1', site='examplesite', domain='example.com')
client.ClientLogin('admin@example.com', 'examplepassword', client.source)
uri = '%s?path=%s' % (client.MakeContentFeedUri(), '/home')
feed = client.GetContentFeed(uri=uri)
entry = feed.entry[0]
print entry
Granted, it's pretty much identical to yours, but it might help you prove or disprove something. Good luck!

Python script for "Google search by image"

I have checked Google's Search APIs and it seems they have not released any API for searching images. So I was wondering whether there is a Python script/library through which I can automate the "search by image" feature.
This was annoying enough to figure out that I thought I'd throw a comment on the first python-related stackoverflow result for "script google image search". The most annoying part of all this is setting up your proper application and custom search engine (CSE) in Google's web UI, but once you have your api key and CSE, define them in your environment and do something like:
#!/usr/bin/env python
# save top 10 google image search results to current directory
# https://developers.google.com/custom-search/json-api/v1/using_rest
import requests
import os
import sys
import re
import shutil
url = 'https://www.googleapis.com/customsearch/v1?key={}&cx={}&searchType=image&q={}'
apiKey = os.environ['GOOGLE_IMAGE_APIKEY']
cx = os.environ['GOOGLE_CSE_ID']
q = sys.argv[1]
i = 1
for result in requests.get(url.format(apiKey, cx, q)).json()['items']:
    link = result['link']
    image = requests.get(link, stream=True)
    if image.status_code == 200:
        m = re.search(r'[^\.]+$', link)
        filename = './{}-{}.{}'.format(q, i, m.group())
        with open(filename, 'wb') as f:
            image.raw.decode_content = True
            shutil.copyfileobj(image.raw, f)
    i += 1
There is no API available, but you can parse the page and imitate a browser. I don't know how much data you will be able to parse, though, because Google may limit or block access.
You can imitate a browser simply by using urllib and setting the correct headers. If parsing complex web pages from Python seems too difficult, you can instead drive a headless browser like PhantomJS; inside a browser it is trivial to get the right elements using JavaScript/DOM.
Note: before trying any of this, check Google's TOS.
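A minimal sketch of the urllib approach in Python 3 (the header values are placeholders, the reverse-image-search URL is an assumption that may change, and Google may still block or rate-limit automated requests):

from urllib.request import Request, urlopen

# Sketch: fetch a results page while presenting browser-like headers.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

# Hypothetical example URL for a reverse image search on a publicly hosted image.
req = Request('https://www.google.com/searchbyimage?image_url=https://example.com/cat.jpg',
              headers=headers)
with urlopen(req) as resp:
    html = resp.read().decode('utf-8', errors='replace')

print(html[:500])  # from here you would parse the HTML for result links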
You can try this:
https://developers.google.com/image-search/v1/jsondevguide#json_snippets_python
It's deprecated, but seems to work.

The JSON syntax vs html/xml tags

The JSON syntax definition says that HTML/XML tags (like the <script>...</script> part) are not part of valid JSON; see the description at http://json.org.
A number of browsers and tools silently ignore these things, but Python does not.
I'd like to insert the JavaScript code (Google Analytics) to get info about the users of this service (location, browser, OS, ...).
What do you suggest I do?
Should I solve the problem in the browser output [^1] or in the python script [^2]?
thanks,
Antonio
[^1]: Browser output
<script>...</script>
[{"key": "value"}]
[^2]: python script
#!/usr/bin/env python
import urllib2, urllib, json
url="http://.........."
params = {}
url = url + '?' + urllib.urlencode(params, doseq=True)
req = urllib2.Request(url)
headers = {'Accept':'application/json;text/json'}
for key, val in headers.items():
    req.add_header(key, val)
data = urllib2.urlopen(req)
print json.load(data)
These sound like two different kinds of services: one is a user-oriented web view of some data, with visualizations, formatting, etc., and one is a machine-oriented data service. I would keep these separate, and maybe build the user view as an extension of the data service, roughly as sketched below.
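A tiny illustration of that split, using Flask purely as an example (none of this is from your setup, and the route names are hypothetical): one endpoint serves plain JSON for scripts like yours, and a separate HTML page carries the analytics <script> tag for human visitors.

from flask import Flask, jsonify, render_template_string

app = Flask(__name__)

# Stand-in for whatever the service really returns.
DATA = {"items": [{"key": "value"}]}

@app.route('/api/data')
def api_data():
    # Machine-oriented endpoint: pure JSON, no <script> tags, safe for json.load().
    return jsonify(DATA)

@app.route('/')
def page():
    # User-oriented view: an HTML page that can include Google Analytics
    # (or any other JavaScript) without breaking JSON consumers.
    return render_template_string(
        "<script>/* analytics snippet goes here */</script>"
        "<pre>{{ data }}</pre>", data=DATA)

if __name__ == '__main__':
    app.run()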
