Automatically preventing wiki-rot in Trac?

Automatically preventing wiki-rot in Trac? - python

Hi guys : Is there a way to improve trac wiki quality using a plugin that deals with artifacts like for obsolete pages, or pages that refer to code which doesn't exist anymore, pages that are unlinked, or pages which have a low update-rate ? I think there might be several heuristics which could be used to prevent wiki-rot :
Number of recent edits
Number of recent views
Wether or not a page links to a source file
Wether or not a wiki page's last update is < or > the source files it links to
Wether entire directories in the wiki have been used/edited/ignored over the last "n" days
etc. etc. etc.
If nothing else, just these metrics alone would be useful for each page and each directory from an administrative standpoint.

I don't know of an existing plugin that does this, but everything you mentioned certainly sounds do-able in one way or another.
You can use the trac-admin CLI command to get a list of wiki pages and to dump the contents of a particular wiki page (as plain text) to a file or stdout. Using this, you can write a script that reads in all of the wiki pages, parses the content for links, and generates a graph of which pages link to what. This should pinpoint "orphans" (pages that aren't linked to), pages that link to source files, and pages that link to external resources. Running external links through something like wget can help you identify broken links.
To access last-edited dates, you'll want to query Trac's database. The query you'll need will depend on the particular database type that you're using. For playing with the database in a (relatively) safe and easy manner, I find the WikiTableMacro and TracSql plugins quite useful.
The hardest feature in your question to implement would be the one regarding page views. I don't think that Trac keeps track of page views, you'll probably have to parse your web server's log for that sort of information.

How about these:
BadLinksPlugin: This plugin logs bad local links found in wiki content.
It's a quite new one, just deals with dangling links, but any bad links as I see from source code. This is at least one building block to your solution request.
VisitCounterMacro: Macro displays how many times was wiki page displayed.
This is a rather old one. You'll get just the statistic per page while an administrative view is missing, but this could be built rather easily, i.e. like a custom PageIndex.

Related

How to scrape a website and all its directories from the one link?

Sorry if this is not a valid question, i personally feel it kind of boarders on the edge.
Assuming the website involved has given full permission
How could I download the ENTIRE contents (html) of that website using a python data scraper. By entire contents I refer to not only the current page you are on, but any other directory that branches off of that main website. Eg.
Using the link:
https://www.dogs.com
could I pull info from:
https://www.dogs.com/about-us
and any other directory attached to the "https://www.dogs.com/"
(I have no idea is dogs.com is a real website or not, just an example)
I have already made a scraper that will pull info from a certain link (nothing further than that), but I want to further improve it so I dont have to have heaps of links. I understand I can use an API but if this is possible I would rather this. Cheers!

while there is scrapy to do it professionally, you can use requests to get the url data, and bs4 to parse the html and look into it. it's also easier to do for a beginner i guess.
anyhow you go, you need to have a starting point, then you just follow the link's in the page, and then link's within those pages.
you might need to check if the url is linking to another website or is still in the targeted website. find the pages one by one and scrape them.

How to read a HTML page that takes some time to load? [duplicate]

I am trying to scrape a web site using python and beautiful soup. I encountered that in some sites, the image links although seen on the browser is cannot be seen in the source code. However on using Chrome Inspect or Fiddler, we can see the the corresponding codes.
What I see in the source code is:
<div id="cntnt"></div>
But on Chrome Inspect, I can see a whole bunch of HTML\CSS code generated within this div class. Is there a way to load the generated content also within python? I am using the regular urllib in python and I am able to get the source but without the generated part.
I am not a web developer hence I am not able to express the behaviour in better terms. Please feel free to clarify if my question seems vague !

You need JavaScript Engine to parse and run JavaScript code inside the page.
There are a bunch of headless browsers that can help you
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/

The Content of the website may be generated after load via javascript, In order to obtain the generated script via python refer to this answer

A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you rather need a Headless browser that would also generate the DOM, load and run the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some previously major products of those are abandoned now.

TRY THIS FIRST!
Perhaps the data technically could be in the javascript itself and all this javascript engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an ajax request. If you can get your program simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention attention on any/all XmlHTTPRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)

Automatically resolving disambiguation pages

The problem
I'm using the Wikipedia API to get page HTML which I parse. I use queries like this one to get the HTML for the first section of a page.
The MediaWiki API provides a handy parameter, redirects, which will cause the API to automatically follow pages that redirect other pages. For example, if I search for 'Cats' with https://en.wikipedia.org/w/api.php?page=Cats&redirects, I will be shown the results for Cat because Cats redirects to Cat.
I'd like a similar function for disambiguation pages such as this, by which if I arrive at a disambiguation page, I am automatically redirected to the first link. For example, if I make a request to a page like Mercury, I'd automatically be redirected to Mercury (element), as it is the first link listed in the page.
The Python HTML parser BeautifulSoup is fairly slow on large documents. By only requesting the first section of articles (that's all I need for my use), using section=0, I can parse it quickly. This is perfect for most articles. But for disambiguation pages, the first section does not include any of the links to specific pages, making it a poor solution. But if I request more than the first section, the HTML loading slows down, which is unnecessary for most articles. See this query for an example of a disambiguation page in which links are not included in the first section.
What I have so far
As of right now, I've gotten as far as detecting when a disambiguation page is reached. I use code like
bs4.BeautifulSoup(page_html).find("p", recursive=false).get_text().endswith(("refer to:", "refers to:"))
I also spent a while trying to write code that automatically followed a link, before I realized that the links were not included in
My constraints
I'd prefer to keep the number of requests made to a minimum. I also need to be parsing as little HTML as possible, because speed is essential for my application.
Possible solutions (which I need help executing)
I could envision several solutions to this problem:
A way within the MediaWiki API to automatically follow the first link from disambiguation pages
A method within the Mediawiki API that allows it to return different amounts of HTML content based on a condition (like presence of a disambiguation template)
A way to dramatically improve the speed of bs4 so that it doesn't matter if I end up having to parse the entire page HTML

As Tgr and everybody said, no, such a feature doesn't exist because it doesn't make sense. The first link in a disambiguation page doesn't have any special status or meaning.
As for the existing API, see https://www.mediawiki.org/wiki/Extension:Disambiguator#API_usage
By the way, the "bot policy" you linked does not really apply to crawlers/scraper; the only relevant policy/guideline is the User-Agent policy.

Trying to automate downloading campaign disclosure reports from an ASP server but it uses an encrypted __VIEWSTATE to handle important data

In my line of work, I often need to look at campaign disclosure reports for my state from ethics.ga.gov. However, the state system is one of the shittiest webapps I've ever dealt with.
It only provides contribution data per report. There are six reports per election cycle. And to add insult to injury, the system is slow. Not only are you having to download a shit ton of files, you have to wait a good minute for the damn thing to generate.
This is like an obvious opportunity to automate the process. What I had planned on doing is writing a program where I can input a URL of the page that links to all disclosure reports, and it will download all the contribution reports.
For a given candidate, I would input a link to this page - http://media.ethics.ga.gov/Search/Campaign/Campaign_Name.aspx?NameID=5753&FilerID=C2009000086&Type=candidate (the view report links are in the dropdown list titled "campaign contribution reports"). I then plan on following each of those links to the report page, following that link to the contributions page, and downloading the csv file. Once I have the csv file, (I think) the project comes under the scope of my coding ability.
The problem I am stuck on right now is that I can't figure out how to follow the view report links. The system is written in ASP. The links call a javascript postback function with a call of the sort "View Report". ctl02 is the identifier of the control. It appears that the information to map that control identifier to the url I need (in this case http://media.ethics.ga.gov/search/Campaign/Campaign_ReportOptions.aspx?NameID=5753&FilerID=C2009000086&CDRID=85776) is embedded in an encrypted __VIEWSTATE field.
I installed the Firebug debugger to try and get data that way. While I am very new to Firebug, all I could find is that in the net tab it shows a GET request to the URL that I need.
Obviously, somehow my browser is getting the next page, which means it should be automatable, but I am now at a loss. I've been working this up in python because I'm really starting to like it, but everything's negotiable. I am doing this on a mac (with full gnu environment), and would prefer to keep working in the environment I am familiar with, but I do have a windows xp vm with visual c++ '10 if I have to go that route.
What do y'all think?

Turns out the data wasn't in the encrypted __VIEWSTATE at all. There was a POST operation that Firebug was clearing on a redirect (despite having it set not to clear things.) I ran it with the Chrome dev console, and I was able to capture the POST data and replicate the POST operation in my application. That got me the URL I was looking for.
Thanks to everyone that looked at this!

Web scraping - how to identify main content on a webpage

Given a news article webpage (from any major news source such as times or bloomberg), I want to identify the main article content on that page and throw out the other misc elements such as ads, menus, sidebars, user comments.
What's a generic way of doing this that will work on most major news sites?
What are some good tools or libraries for data mining? (preferably python based)

There are a number of ways to do it, but, none will always work. Here are the two easiest:
if it's a known finite set of websites: in your scraper convert each url from the normal url to the print url for a given site (cannot really be generalized across sites)
Use the arc90 readability algorithm (reference implementation is in javascript) http://code.google.com/p/arc90labs-readability/ . The short version of this algorithm is it looks for divs with p tags within them. It will not work for some websites but is generally pretty good.

A while ago I wrote a simple Python script for just this task. It uses a heuristic to group text blocks together based on their depth in the DOM. The group with the most text is then assumed to be the main content. It's not perfect, but works generally well for news sites, where the article is generally the biggest grouping of text, even if broken up into multiple div/p tags.
You'd use the script like: python webarticle2text.py <url>

There's no way to do this that's guaranteed to work, but one strategy you might use is to try to find the element with the most visible text inside of it.

Diffbot offers a free(10.000 urls) API to do that, don't know if that approach is what you are looking for, but it might help someone http://www.diffbot.com/

Check the following script. It is really amazing:
from newspaper import Article
URL = "https://www.ksat.com/money/philippines-stops-sending-workers-to-qatar"
article = Article(URL)
article.download()
print(article.html)
article.parse()
print(article.authors)
print(article.publish_date)
#print(article.text)
print(article.top_image)
print(article.movies)
article.nlp()
print(article.keywords)
print(article.summary)
More documentation can be found at http://newspaper.readthedocs.io/en/latest/ and https://github.com/codelucas/newspaper you should install it using:
pip3 install newspaper3k

For a solution in Java have a look at https://github.com/kohlschutter/boilerpipe :
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
But there is also a python wrapper around this available here:
https://github.com/misja/python-boilerpipe

It might be more useful to extract the RSS feeds (<link type="application/rss+xml" href="..."/>) on that page and parse the data in the feed to get the main content.

Another possibility of separating "real" content from noise is by measuring HTML density of the parts of a HTML page.
You will need a bit of experimentation with the thresholds to extract the "real" content, and I guess you could improve the algorithm by applying heuristics to specify the exact bounds of the HTML segment after having identified the interesting content.
Update: Just found out the URL above does not work right now; here is an alternative link to a cached version of archive.org.

There is a recent (early 2020) comparison of various methods of extracting article body, without and ads, menus, sidebars, user comments, etc. - see https://github.com/scrapinghub/article-extraction-benchmark. A report, data and evaluation scripts are available. It compares many options mentioned in the answers here, as well as some options which were not mentioned:
python-readability
boilerpipe
newspaper3k
dragnet
html-text
Diffbot
Scrapinghub AutoExtract
In short, "smart" open source libraries are adequate if you need to remove e.g. sidebar and menu, but they don't handle removal of unnecessary content inside articles, and are quite noisy overall; sometimes they remove an article itself and return nothing. Commercial services use Computer Vision and Machine Learning, which allows them to provide a much more precise output.
For some use cases simpler libraries like html-text are preferrable, both to commercial services and to "smart" open source libraries - they are fast, and ensure information is not missing (i.e. recall is high).
I would not recommend copy-pasting code snippets, as there are many edge cases even for a seemingly simple task of extracting text from HTML, and there are libraries available (like html-text or html2text) which should be handling these edge cases.
To use a commercial tool, in general one needs to get an API key, and then use a client library. For example, for AutoExtract by Scrapinghub (disclaimer: I work there) you would need to install pip install scrapinghub-autoextract. There is a Python API available - see https://github.com/scrapinghub/scrapinghub-autoextract README for details, but an easy way to get extractions is to create a .txt file with URLs to extract, and then run
python -m autoextract urls.txt --page-type article --api-key <API_KEY> --output res.jl

I wouldn't try to scrape it from the web page - too many things could mess it up - but instead see which web sites publish RSS feeds. For example, the Guardian's RSS feed has most of the text from their leading articles:
http://feeds.guardian.co.uk/theguardian/rss
I don't know if The Times (The London Times, not NY) has one because it's behind a paywall. Good luck with that...

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.