How to export all pages from MediaWiki into individual page files? - python

Related to
How to export text from all pages of a MediaWiki?, but I want the output to be individual text files, each named after its page title.
SELECT page_title, page_touched, old_text
FROM revision,page,text
WHERE revision.rev_id=page.page_latest
AND text.old_id=revision.rev_text_id;
works to dump everything to stdout, all pages in one go.
How can I split them and dump each page into its own file?
SOLVED
First, dump everything into a single file:
SELECT page_title, page_touched, old_text
FROM revision,page,text
WHERE revision.rev_id=page.page_latest
  AND text.old_id=revision.rev_text_id
  AND page_namespace!='6' AND page_namespace!='8' AND page_namespace!='12'
INTO OUTFILE '/tmp/wikipages.csv'
FIELDS TERMINATED BY '\n'
ESCAPED BY ''
LINES TERMINATED BY '\n######################################\n';
Then split it into individual files using Python:
# read the dump and split on the page separator written by the query above
with open('wikipages.csv', 'r') as f:
    alltxt = f.read().split('\n######################################\n')

for row in alltxt:
    one = row.split('\n')
    name = one[0].replace('/', '-')  # '/' is not allowed in file names
    try:
        del one[0]  # drop the page_title line
        del one[0]  # drop the page_touched line
    except IndexError:
        continue
    txt = '\n'.join(one)
    with open('/tmp/wikipages/' + name + '.txt', 'w') as of:
        of.write(txt)

In case you have some Python knowledge, you can use the mwclient library to achieve this:
install Python 2.7: sudo apt-get install python2.7 (see https://askubuntu.com/questions/101591/how-do-i-install-python-2-7-2-on-ubuntu in case of trouble)
install mwclient via pip install mwclient
run the Python script below
import mwclient

# replace the host below with your wiki's domain
wiki = mwclient.Site(('http', 'your-wiki-domain.com'), '/')
for page in wiki.allpages():  # allpages() iterates over every page in the main namespace
    name = page.page_title.replace('/', '-')  # '/' is not allowed in file names
    with open(name, 'w') as f:
        f.write(page.text())
see the mwclient page https://github.com/mwclient/mwclient for reference

Since MediaWiki version 1.35, the multi-content revision model has been in effect, so the original dump code above won't work correctly. Instead, you can use the following query:
SELECT page_title, page_touched, old_text
FROM revision,page,text,content,slots
WHERE page.page_latest=revision.rev_id
  AND revision.rev_id=slots.slot_revision_id
  AND slots.slot_content_id=convert(substring(content.content_address,4),int)
  AND convert(substring(content.content_address,4),int)=text.old_id
  AND page_namespace!='6' AND page_namespace!='8' AND page_namespace!='12'
INTO OUTFILE '/var/tmp/wikipages.csv'
FIELDS TERMINATED BY '\n'
ESCAPED BY ''
LINES TERMINATED BY '\n######################################\n';

Related

Setting HTML source for QtWebKit to a string value vs. file.read(), encoding issue?

I have a script that reads a bunch of JavaScript files into a variable, and then places the contents of those files into placeholders in a Python template. This results in the value of the variable src (described below) being a valid HTML document including scripts.
# Open the source HTML file to get the paths to the JavaScript files
f = open('srcfile.html', 'rU')
src = f.read()
f.close()
js_scripts = re.findall(r'script\ssrc="(.*)"', src)
# Put all of the scripts in a variable
js = ''
for script in js_scripts:
    f = open(script, 'rU')
    js = js + f.read() + '\n'
    f.close()
# Open/read the template
template = open('template.html')
templateSrc = Template(template.read())
# Substitute the scripts for the placeholder variable
src = str(templateSrc.safe_substitute(javascript_content=js))
# Write a Python file containing the string
with open('htmlSource.py', 'w') as f:
    f.write('#-*- coding: utf-8 -*-\n\nhtmlSrc = """' + src + '"""')
If I try to open it up via PyQt5/QtWebKit in Python...
from htmlSource import htmlSrc
webWidget.setHtml(htmlSrc)
...it doesn't load the JS files in the web widget. I just end up with a blank page.
But if I get rid of everything else and just write '"""' + src + '"""' to the file, it loads everything as expected when I open the file in Chrome. Likewise, it also loads correctly in the web widget if I read from the file itself:
f = open('htmlSource.py', 'r')
htmlSrc = f.read()
webWidget.setHtml(htmlSrc)
In other words, when I run this script, it produces the Python output file with the variable; I then import that variable and pass it to webWidget.setHtml(), but the page doesn't render. If I read the same content from the file with open() instead, it does.
I suspect there's an encoding issue going on here, but I've tried several variations of encode and decode without any luck. The scripts are all UTF-8.
Any suggestions? Many thanks!
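One thing worth checking (an assumption on my part, not a confirmed fix): write the generated module with an explicit UTF-8 encoding rather than the platform default, so the bytes on disk actually match the coding declaration. A minimal sketch:
import io

# Decode the assembled page to unicode first (assuming the JS files are UTF-8),
# then write the generated module with an explicit encoding.
if isinstance(src, bytes):
    src = src.decode('utf-8')
with io.open('htmlSource.py', 'w', encoding='utf-8') as f:
    f.write(u'# -*- coding: utf-8 -*-\n\nhtmlSrc = u"""' + src + u'"""')
The u"""...""" literal ensures that htmlSrc is a unicode string when imported, so setHtml() receives text rather than raw bytes.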

Python urlretrieve downloading corrupted images

I am downloading a list of images (all .jpg) from the web using this python script:
__author__ = 'alessio'

import urllib.request

fname = "inputs/skyscraper_light.txt"
with open(fname) as f:
    content = f.readlines()
for link in content:
    try:
        link_fname = link.split('/')[-1]
        urllib.request.urlretrieve(link, "outputs_new/" + link_fname)
        print("saved without errors " + link_fname)
    except:
        pass
In OS X Preview I see the images just fine, but I can't open them with any image editor (for example, Photoshop says "Could not complete your request because Photoshop does not recognize this type of file."), and when I try to attach them to a Word document, the files are not even shown as picture files in the browse dialog.
What am I doing wrong?
As J.F. Sebastian suggested in the comments, the issue was the trailing newline in the filename (readlines() keeps each line's newline).
To make my script work, you need to replace
link_fname = link.split('/')[-1]
with
link_fname = link.strip().split('/')[-1]
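For reference, here is the whole download loop with the fix applied as a sketch; stripping the entire link also removes the newline from the URL itself:
import urllib.request

fname = "inputs/skyscraper_light.txt"
with open(fname) as f:
    content = f.readlines()
for link in content:
    link = link.strip()  # remove the trailing newline that corrupted the saved files
    link_fname = link.split('/')[-1]
    try:
        urllib.request.urlretrieve(link, "outputs_new/" + link_fname)
        print("saved without errors " + link_fname)
    except OSError:
        pass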

Read file using urllib and write adding extra characters

I have a script that regularly reads a text file on a server and overwrites a local copy with its contents. The problem is that the process adds extra carriage returns, plus an extra invisible character after the last character. How do I make an identical copy of the server file?
I use the following to read the file
for link in links:
    try:
        f = urllib.urlopen(link)
        myfile = f.read()
    except IOError:
        pass
and to write it to the local file
f = open("C:\\localfile.txt", "w")
try:
f.write(myfile)
except NameError:
pass
finally:
f.close()
This is how the file looks on the server:
http://i.imgur.com/rAnUqmJ.jpg
and this is how it looks locally, with extra line breaks, plus an additional invisible character after the last 75:
http://i.imgur.com/xfs3E8D.jpg
I have seen quite a few similar questions, but I'm not sure how to make urllib read the file in binary.
Any solutions?
If you want to copy a remote file denoted by a URL to a local file, I would use urllib.urlretrieve:
import urllib
urllib.urlretrieve("http://anysite.co/foo.gz", "foo.gz")
urllib is already reading the file as binary; the extra carriage returns come from writing it back in text mode, where Windows translates each \n into \r\n.
Try changing
f = open("C:\\localfile.txt", "w")
to
f = open("C:\\localfile.txt", "wb")

Dynamically Export URL Document To Server Using Python

I'm writing a script in Python and I'm trying to wrap my head around a problem. I have a URL that, when opened, downloads a document. I'm trying to write a Python script that opens this https URL and automatically sends the downloaded document to a server I have opened using Python's pysftp module.
I can't wrap my head around how to do this... Do you think I'd be able to just do:
server.put(urllib.open('https://......./document'))
EDIT:
This is the code I've tried before; the one-liner above doesn't work...
download_file = urllib2.urlopen('https://somewebsite.com/file.csv')
file_contents = download_file.read().replace('"', '')
columns = [x.strip() for x in file_contents.split(',')]
# Write Downloaded File Contents To New CSV File
with open('file.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(columns)
# Upload New File To Server
srv.put('./file.csv', './SERVERFOLDER/file.csv')
ALSO:
How would I go about getting a file that is one day old from the server (examining the age of each file)... using paramiko?
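Neither part is from the original thread, but here is a hedged sketch of both: pysftp's put() expects a local file path, while Connection.putfo() (available in recent pysftp versions) accepts a file-like object, so the download can be streamed straight to the server; and listdir_attr() exposes each remote file's st_mtime, which gives its age. The host, credentials, and paths below are placeholders:
import time
import urllib2
import pysftp

with pysftp.Connection('sftp.example.com', username='user', password='secret') as srv:
    # Stream the downloaded document to the server without a local temp file.
    response = urllib2.urlopen('https://somewebsite.com/file.csv')
    srv.putfo(response, './SERVERFOLDER/file.csv')

    # Find files on the server that are at least one day old.
    day_ago = time.time() - 24 * 60 * 60
    for attr in srv.listdir_attr('./SERVERFOLDER'):
        if attr.st_mtime <= day_ago:
            print(attr.filename + ' is at least one day old')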

Saving a downloaded ZIP file w/Python

I'm working on a script that will automatically update an installed version of Calibre. Currently I have it downloading the latest portable version. I seem to be having trouble saving the zipfile. Currently my code is:
import urllib2
import re
import zipfile
#tell the user what is happening
print("Calibre is Updating")
#download the page
url = urllib2.urlopen ( "http://sourceforge.net/projects/calibre/files" ).read()
#determin current version
result = re.search('title="/[0-9.]*/([a-zA-Z\-]*-[0-9\.]*)', url).groups()[0][:-1]
#download file
download = "http://status.calibre-ebook.com/dist/portable/" + result
urllib2.urlopen( download )
#save
output = open('install.zip', 'w')
output.write(zipfile.ZipFile("install.zip", ""))
output.close()
You don't need to use zipfile.ZipFile for this (and the way you're using it has problems, as does your call to urllib2.urlopen). Instead, you need to save the urlopen result in a variable, then read it and write that output to a .zip file. Try this code:
#download file
download = "http://status.calibre-ebook.com/dist/portable/" + result
request = urllib2.urlopen( download )
#save
output = open("install.zip", "w")
output.write(request.read())
output.close()
There is also a one-liner:
open('install.zip', 'wb').write(urllib.urlopen('http://status.calibre-ebook.com/dist/portable/' + result).read())
which is not memory-efficient, but still works.
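A more memory-friendly variant streams the response to disk in chunks instead of reading it all at once, e.g. with shutil.copyfileobj (continuing from the download URL built above):
import shutil
import urllib2

request = urllib2.urlopen(download)
with open('install.zip', 'wb') as output:
    # copies in fixed-size chunks (16 KB by default) instead of one big read()
    shutil.copyfileobj(request, output)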
If you just want to download a file from the net, you can use urllib.urlretrieve:
"Copy a network object denoted by a URL to a local file ..."
Example using requests instead of urllib2:
import requests, re, urllib
print("Calibre is updating...")
content = requests.get("http://sourceforge.net/projects/calibre/files").content
# determine current version
v = re.search('title="/[0-9.]*/([a-zA-Z\-]*-[0-9\.]*)', content).groups()[0][:-1]
download_url = "http://status.calibre-ebook.com/dist/portable/{0}".format(v)
print("Downloading {0}".format(download_url))
urllib.urlretrieve(download_url, 'install.zip')
# file should be downloaded at this point
Have you tried
output = open('install.zip', 'wb')  # note the "b" flag, which means "binary file"
