I'm working on a Python script that transforms this:
foo
bar
Into this:
[[Component foo]]
[[bar]]
The script checks (per input line) if the page "Component foo" exists. If it exists then a link to that page is created, if it doesn't exist then a direct link is created.
The problem is that I need a quick and cheap way to check whether a lot of wiki pages exist. I don't want to (try to) download all the 'Component' pages.
I already figured out a fast way to do this by hand: edit a new wiki page, paste all the 'Component' links into the page, press preview, and then save the resulting preview HTML page. The resulting HTML file contains a different link for existing pages than for non-existing pages.
So to rephrase my question: How can I save a mediawiki preview page in Python?
(I don't have local access to the database.)
You can definitely use the API to check if a page exists:
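# assuming words is a list of page titles you wish to query for
import urllib.parse
import urllib.request
# replace en.wikipedia.org with the address of the wiki you want to access
# quote() escapes spaces, the "|" separator and other special characters in the titles
query = "http://en.wikipedia.org/w/api.php?action=query&format=xml&titles=" + urllib.parse.quote("|".join(words))
pages = urllib.request.urlopen(query).read()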
Now pages will contain XML like this:
<?xml version="1.0"?><api><query><pages>
<page ns="0" title="DOESNOTEXIST" missing="" />
<page pageid="600799" ns="0" title="FOO" />
<page pageid="11178" ns="0" title="Foobar" />
</pages></query></api>
Pages which don't exist will appear here, but they have the missing="" attribute set, as can be seen above. You can also check for the invalid attribute to be on the safe side.
Now you can use your favorite xml parser to check for these attributes and react accordingly.
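As a minimal sketch of that last step, the following uses the standard library's xml.etree.ElementTree to turn a response like the one above into a title-to-exists mapping. The function name page_existence is made up for illustration; the sample string is the response shown above.

```python
import xml.etree.ElementTree as ET


def page_existence(xml_text):
    """Map each page title in an API query response to True (exists) or False."""
    root = ET.fromstring(xml_text)
    result = {}
    for page in root.iter("page"):
        # Missing and invalid titles carry a "missing" or "invalid" attribute.
        absent = "missing" in page.attrib or "invalid" in page.attrib
        result[page.get("title")] = not absent
    return result


sample = """<?xml version="1.0"?><api><query><pages>
<page ns="0" title="DOESNOTEXIST" missing="" />
<page pageid="600799" ns="0" title="FOO" />
<page pageid="11178" ns="0" title="Foobar" />
</pages></query></api>"""

print(page_existence(sample))
```

Note that the titles in the response may be normalized by the wiki, so they may need to be matched back to the words you sent.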
See also: http://www.mediawiki.org/wiki/API:Query
Use Pywikibot to interact with the MediaWiki software. It's probably the most powerful bot framework available.
The Python Wikipediabot Framework (pywikipedia or PyWikipediaBot) is a
collection of tools that automate work on MediaWiki sites. Originally
designed for Wikipedia, it is now used throughout the Wikimedia
Foundation's projects and on many other MediaWiki wikis. It's written
in Python, which is a free, cross-platform programming language. This
page provides links to general information for people who want to use
the bot software.
If you have local access to the wiki database, it might be easiest to do a query against the database to see whether each page exists.
If you only have HTTP access, you might try the mechanize library which lets you programmatically automate tasks that would otherwise require a browser.
You should be able to use the MediaWiki API.
http://www.mediawiki.org/wiki/API (maybe under Queries or Creating/Editing)
I'm not too familiar with it, but for example, you could compare the output of an existing page with a nonexistent page.
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Bill_Gates&rvprop=timestamp
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=NONEXISTENT_PAGE&rvprop=timestamp
Related
I would like to create a tool with Python and the Twitter API to be able to create lists of tweets that match certain criteria like "contains the word Python" or "has at least 2 likes". Or simple stats like top posters, most liked, etc.
All of my searches pointed me to the Tweepy project. But for that I need OAuth tokens. So I applied for a developer account and was denied with the comment "we are unable to serve your use case".
Do I have any alternatives?
Well, as a general answer for these situations, you can always use a web-based automation tool: a library that drives a browser remotely and replicates what you would do by hand ("opening the website, logging in, etc."), after which you can parse all the data from the rendered elements.
Try looking at Selenium. I've used that library in the past to scrape Facebook directly and it worked flawlessly.
Edit: Note that this isn't a Twitter-specific library. You will have to find the HTML tags on the login page and use them to log in, and the same goes for parsing data, etc.
I am trying to scrape a web site using Python and Beautiful Soup. I found that on some sites the image links, although visible in the browser, cannot be seen in the source code. However, using Chrome Inspect or Fiddler, we can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content as well from Python? I am using the regular urllib in Python and I am able to get the source, but without the generated part.
I am not a web developer, hence I am not able to express the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The content of the website may be generated after load via JavaScript. In order to obtain the generated content via Python, refer to this answer.
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you need a headless browser that would also generate the DOM, and load and run the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some previously major products among those are now abandoned.
TRY THIS FIRST!
Perhaps the data technically could be in the javascript itself and all this javascript engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an AJAX request. If you can get your program to simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any/all XMLHttpRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
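A rough sketch of re-creating such a request with the standard library might look like this. The endpoint URL and the helper names build_request and fetch_json are hypothetical; the real URL has to be taken from the network log, and the response is assumed to be UTF-8 encoded JSON.

```python
import json
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Create a request that identifies itself with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_json(url):
    """Fetch the endpoint the page's JavaScript calls and decode its JSON body."""
    with urllib.request.urlopen(build_request(url)) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint; replace it with the URL seen in the network log.
# data = fetch_json("http://example.com/ajax/images?page=1")
```

If the server also expects cookies or a Referer header, those can be added to the headers dictionary in the same way.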
I need to create a web scraper for this website.
However, I need to get the links for the counties, which are stored in the interactive map.
Unfortunately, for some reason, their search engine doesn't provide all the results as the interactive map does.
My question:
Could anyone tell me how to get all the links for all the counties, without manually accessing them?
Thanks
Technically you can use a decompiler to do this job.
There are free (e.g.: ActionScript Extractor) and paid (e.g.: Sothink
SWF Decompiler) tools out there.
You can refer to this answer.
Edit:
Most SWF content gets external records from either a .xml or .json file.
Without decompiling, and just using the browser's Developer Tools, we can see that an XML file is indeed accessed (maybe it contains what you want):
http://www.allpetservices.co.uk/uk_ir_locator.xml.
Put view-source: in front of the link to read it (if there's an error message).
In that xml you want to extract the contents (the xyz) of each & every <link> xyz </link> tag. This will give you the links of every entry on the map.
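A minimal sketch of that extraction with xml.etree.ElementTree follows. The sample document is made up, since only the <link> element name is known from the description above; root.iter("link") finds the tag at any depth, so the surrounding structure should not matter.

```python
import xml.etree.ElementTree as ET


def extract_links(xml_text):
    """Return the stripped text of every <link> element, at any depth."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter("link") if el.text and el.text.strip()]


# Made-up sample; the real file's element names other than <link> are unknown here.
sample = """<map>
  <county><name>Kent</name><link> search_map.asp?ccounty=Kent </link></county>
  <county><name>Essex</name><link> search_map.asp?ccounty=Essex </link></county>
</map>"""

print(extract_links(sample))
```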
The short answer to your question: There's no way to get the links from the site.
The solution: The structure of the links you are trying to retrieve is very predictable. They all follow the same structure:
http://www.allpetservices.co.uk/search_map.asp?ccounty={COUNTY_NAME}
So, if you can use another site or data source to get the names of each of the counties, you can formulate each of the links that you need.
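As a small sketch of formulating those links, the snippet below fills the county name into the URL pattern above. The county names are placeholders, and encoding spaces as "+" (what urlencode does) is an assumption about what the site accepts.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.allpetservices.co.uk/search_map.asp"


def county_url(county_name):
    """Build the search URL for one county, escaping spaces and punctuation."""
    return BASE_URL + "?" + urlencode({"ccounty": county_name})


# Hypothetical county names; a real list has to come from another source.
for name in ["Kent", "Greater London"]:
    print(county_url(name))
```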
Say I look at the following Tumblr post: http://ronbarak.tumblr.com/post/40692813…
It (currently) has 292 notes.
I'd like to get all the above notes using a Python script (e.g., via urllib2, BeautifulSoup, simplejson, or the Tumblr API).
Some extensive Googling did not produce any items relating to notes' extraction in Tumblr.
Can anyone point me in the right direction on which tool will enable me to do that?
Unfortunately, it looks like the Tumblr API has some limitations (lack of meta information about reblogs, notes limited to 50), so you can't get all the notes.
It is also forbidden to do page scraping according to the Terms of Service.
"You may not do any of the following while accessing or using the Services: (...) scrape the Services, and particularly scrape Content (as defined below) from the Services, without Tumblr's express prior written consent;"
Source:
https://groups.google.com/forum/?fromgroups=#!topic/tumblr-api/ktfMIdJCOmc
Without JS you get separate pages that only contain the notes. For the mentioned blog post the first page would be:
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
Following pages are linked at the bottom, e.g.:
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358403506
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358383221
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358377013
…
(See my answer on how to find the next URL in a’s onclick attribute.)
Now you could use various tools to download/parse the data.
The following wget command should download all notes pages for that post:
wget --recursive --domains=ronbarak.tumblr.com --include-directories=notes http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
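If you would rather follow the pages from Python, one possible sketch is to search each downloaded notes page for the next path of the form /notes/<post id>/<key>?from_c=<number>, as seen in the URLs above. The HTML fragment below is made up and the real markup around the path may differ, so treat the pattern as an assumption.

```python
import re

NEXT_NOTES = re.compile(r"/notes/\d+/\w+\?from_c=\d+")


def next_notes_path(html):
    """Return the path of the next notes page found in the HTML, or None."""
    match = NEXT_NOTES.search(html)
    return match.group(0) if match else None


# Made-up fragment; the real markup around the path may differ.
sample = '<a onclick="load(\'/notes/40692813320/4Y70Zzacy?from_c=1358403506\');">more</a>'
print(next_notes_path(sample))
```

Calling this repeatedly on each fetched page until it returns None walks the whole chain of notes pages.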
Like Fabio implies, it is better to use the API.
If for whatever reason you cannot, then the tools you will use will depend on what you want to do with the data in the posts.
for a data dump: urllib will return a string of the page you want
looking for a specific section in the HTML: lxml is pretty good
looking for something in unruly HTML: definitely BeautifulSoup
looking for a specific item in a section: BeautifulSoup, lxml, or plain text parsing is what you need
need to put the data in a database/file: use Scrapy
The Tumblr URL scheme is simple: url/scheme/1, url/scheme/2, url/scheme/3, etc., until you get to the end of the posts and the server just does not return any data anymore.
So if you are going to brute-force your way to scraping, you can easily tell your script to dump all the data on your hard drive until, say, the contents tag is empty.
One last word of advice: please remember to put a small sleep in your script (about one second; note that Python's time.sleep() takes seconds, so that is time.sleep(1), not 1000), because you could put some stress on Tumblr's servers.
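The brute-force loop described above could be sketched roughly as follows, with the delay given in seconds. The fetch argument is a caller-supplied function (for example a thin wrapper around urllib) that returns the page body as a string; that parameter and the max_pages safety limit are assumptions added for illustration.

```python
import time


def crawl_pages(fetch, base_url, delay_seconds=1.0, max_pages=1000):
    """Fetch base_url/1, base_url/2, ... until a page comes back empty."""
    pages = []
    for number in range(1, max_pages + 1):
        body = fetch("%s/%d" % (base_url, number))
        if not body.strip():
            break  # the server has run out of posts
        pages.append(body)
        time.sleep(delay_seconds)  # be polite to the server
    return pages
```

Passing the fetch function in keeps the loop easy to try out against canned data before pointing it at a real site.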
how to load all notes on tumblr? also covers the topic, but unor's response (above) does it very well.
Hi guys: Is there a way to improve Trac wiki quality using a plugin that deals with artifacts like obsolete pages, pages that refer to code which doesn't exist anymore, pages that are unlinked, or pages which have a low update rate? I think there might be several heuristics which could be used to prevent wiki-rot:
Number of recent edits
Number of recent views
Whether or not a page links to a source file
Whether or not a wiki page's last update is < or > the source files it links to
Whether entire directories in the wiki have been used/edited/ignored over the last "n" days
etc. etc. etc.
If nothing else, just these metrics alone would be useful for each page and each directory from an administrative standpoint.
I don't know of an existing plugin that does this, but everything you mentioned certainly sounds do-able in one way or another.
You can use the trac-admin CLI command to get a list of wiki pages and to dump the contents of a particular wiki page (as plain text) to a file or stdout. Using this, you can write a script that reads in all of the wiki pages, parses the content for links, and generates a graph of which pages link to what. This should pinpoint "orphans" (pages that aren't linked to), pages that link to source files, and pages that link to external resources. Running external links through something like wget can help you identify broken links.
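Once the page texts have been dumped, the orphan check could be sketched like this. It assumes the pages are already loaded into a dictionary of name to text, and it only recognizes explicit wiki:PageName links (CamelCase links and other Trac link forms are ignored in this sketch). WikiStart is excluded because it is the entry page.

```python
import re

# Assumption: only explicit links of the form wiki:PageName are detected.
WIKI_LINK = re.compile(r"wiki:([A-Za-z0-9_/]+)")


def find_orphans(pages, start_page="WikiStart"):
    """Return the names of pages that no other page links to."""
    linked = set()
    for name, text in pages.items():
        for target in WIKI_LINK.findall(text):
            if target != name:  # a self-link does not count
                linked.add(target)
    return sorted(set(pages) - linked - {start_page})


pages = {
    "WikiStart": "See [wiki:Install] and [wiki:Usage].",
    "Install": "Back to wiki:WikiStart",
    "Usage": "Nothing here.",
    "OldNotes": "Refers to wiki:Install",
}
print(find_orphans(pages))
```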
To access last-edited dates, you'll want to query Trac's database. The query you'll need will depend on the particular database type that you're using. For playing with the database in a (relatively) safe and easy manner, I find the WikiTableMacro and TracSql plugins quite useful.
The hardest feature in your question to implement would be the one regarding page views. I don't think that Trac keeps track of page views, so you'll probably have to parse your web server's log for that sort of information.
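A rough sketch of such log parsing is shown below. It assumes access-log lines in the common Apache style and that wiki pages are served under a /wiki/ URL prefix; both are assumptions that depend on your server setup, and the sample lines are made up.

```python
import re
from collections import Counter

# Assumption: request lines look like "GET /wiki/PageName HTTP/1.1".
WIKI_HIT = re.compile(r'"GET /wiki/([^ ?"]+)')


def count_views(log_lines):
    """Count GET requests per wiki page name."""
    views = Counter()
    for line in log_lines:
        match = WIKI_HIT.search(line)
        if match:
            views[match.group(1)] += 1
    return views


sample = [
    '10.0.0.1 - - [01/Jan/2013:10:00:00 +0000] "GET /wiki/Install HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2013:10:01:00 +0000] "GET /wiki/Install?action=diff HTTP/1.1" 200 734',
    '10.0.0.1 - - [01/Jan/2013:10:02:00 +0000] "GET /chrome/site.css HTTP/1.1" 200 90',
]
print(count_views(sample))
```

Requests with a query string (such as diffs or history views) are counted against the same page here; filter them out if only plain page views matter.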
How about these:
BadLinksPlugin: This plugin logs bad local links found in wiki content.
It's quite a new one. As far as I can see from the source code, it deals not just with dangling links but with any bad links. This is at least one building block for the solution you asked for.
VisitCounterMacro: Macro displays how many times a wiki page was displayed.
This is a rather old one. You'll get just the statistic per page while an administrative view is missing, but this could be built rather easily, i.e. like a custom PageIndex.