How to speed up Pywikibot? - python

I've built some report tools using Pywikibot. As things are growing it now takes up to 2 hours to finish the reports so I'm looking to speed things up. Main ideas:
Disable throttling, the script is read-only, so page.get(throttle=False) handles this
Cache
Direct database access
Unfortunately I can't find much documentation about caching and DB access. The only way seems to be diving into the code, and, well, there's limited information about database access in user-config.py. Where can I find good documentation about Pywikibot caching and direct DB access, if any exists?
And, are there other ways to speed things up?

Use PreloadingGenerator so that pages are loaded in batches, or MySQLPageGenerator if you use direct DB access.
See examples here.
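For instance, a minimal sketch (the site and category below are placeholders, not taken from the question) of wrapping a generator in PreloadingGenerator so page texts are fetched in batched API requests rather than one request per page:

import pywikibot
from pywikibot import pagegenerators

site = pywikibot.Site('en', 'wikipedia')
cat = pywikibot.Category(site, 'Category:Living people')  # placeholder category
gen = pagegenerators.PreloadingGenerator(cat.articles())
for page in gen:
    # page.text was already fetched in bulk, so this loop makes no extra API calls
    print(page.title(), len(page.text))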

I'm using "-pt:1" option in the command to make one edit per second.
I'm currently running the command
python pwb.py category add -pt:1 -file:WX350.txt -to:"Taken with Sony DSC-WX350"
https://www.mediawiki.org/wiki/Manual:Pywikibot/Global_Options
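If you prefer not to pass the flag every time, the equivalent setting can go into user-config.py (a one-line sketch):

# user-config.py: wait at most 1 second between page saves (same effect as -pt:1)
put_throttle = 1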

Looks like pagegenerators is indeed a good way to speed things up. The best documentation for that is directly in the source.
Even there it's not immediately clear where to put the MySQL connection details. (I will hopefully update this.)

Disable throttling, the script is read-only, so page.get(throttle=False) handles this
"throttle" parameter of Page.get() is not supported since Pywikibot 2.0 (formerly known as rewrite) and was removed in 5.0.0. Pywikibot 2.0+ has not activated a get throttle by default. Decreasing putthrottle is only for putting a page to the wiki and may be restricted by local policies. Never touch maxlag parameter which is server related.
If you are using multiple sites the first run needs a lot of time until all site objects are cached. PreloadingGenerator can be be used for bulk load of page contents but decreases speed if meta data are required only. In summary speeding up your script depends on you implementation and your need.

Using PreloadingGenerator from pagegenerators is the simplest way to speed up programs that need to read a lot from online wikis, as other answers have already pointed out.
Alternative ways are:
Download a dump of the wiki and read it locally. Wikimedia projects offer dumps updated about once a week.
Create an account on Wikimedia Labs and work from there, enjoying a faster connection to the Wikipedias and access to updated dumps.
Modifying the throttle might put you in danger of getting blocked if the target wiki has a policy against it - and I'm afraid Wikipedia has such a policy.

You can download all the data in advance as a dump file from this site:
http://dumps.wikimedia.org
You can then use two passes: the first pass reads the data from the local dump, and the second pass reads only the remote pages for which you found issues in the local dump.
Example:
import pywikibot
from pywikibot import pagegenerators
from pywikibot.xmlreader import XmlDump

site = pywikibot.Site('he', 'wiktionary')
dump_file = 'hewiktionary-latest-pages-articles.xml.bz2'
all_wiktionary = XmlDump(dump_file).parse()  # iterate over entries of the local dump
# report_problem() is your own check; only flagged pages are fetched online
gen = (pywikibot.Page(site, p.title) for p in all_wiktionary if report_problem(p))
gen = pagegenerators.PreloadingGenerator(gen)
for page in gen:
    report_problem(page)

Related

Backtrader, using mongodb as datafeed instead of CSV

I'm quite new to backtrader, and since I started I couldn't stop wondering why there's no database support for the data feed. I found a page on the official website where it's described how to implement a custom data feed. The implementation should be pretty easy, but on GitHub (or more generally on the web) I couldn't find a single implementation of a feed backed by MongoDB. I understand that CSVs are easier to manage and so on, but in some cases they can require a lot of RAM to store all the data in memory at once. On the other hand, a DB can be RAM friendly but will take longer during the backtesting process, even if the DB is a document-oriented one. Does anyone have experience with both of these approaches? And if so, is there some code I can take a look at?
Thanks!
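One hedged way to avoid writing a full custom data feed is to pull the bars out of MongoDB into a pandas DataFrame and hand that to backtrader's built-in PandasData feed; the database, collection and field names below are assumptions, not from the question:

import backtrader as bt
import pandas as pd
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
cursor = client['market']['ohlcv'].find().sort('datetime', 1)  # hypothetical collection
df = pd.DataFrame(list(cursor))
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')[['open', 'high', 'low', 'close', 'volume']]

cerebro = bt.Cerebro()
cerebro.adddata(bt.feeds.PandasData(dataname=df))  # PandasData maps the columns by name
cerebro.run()

This still loads the whole query result into RAM at once, so a truly streaming feed would need a custom subclass of bt.feed.DataBase with its own _load() method.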

Architecture question for app which serves data from SQLite in memory

I am building a dynamic <div> into my (very low-traffic) website, in which I want to display some data from an in-memory SQLite database running within a Python program on the server. Being a novice in web tech, I can't decide which technologies and principles should go into this project.
Right now, the only decided-upon technologies are Python and Apache. Python, at the very least, needs to be constantly running to fetch data from the external source, format it, and enter it into the database. Problem #1 is where this database should reside. Ideally, I would like it in RAM, since the database will update both often and around the clock. Then the question becomes, "How does one retrieve the data?". Note: the query will never change; I want the web page to receive the same JSON structure only with up-to-date values. From here, I see two options with the first, again, being ideal:
1) Perform some simple "hey someone wants the stuff" interaction with the Python program (remember that this program will be running) whenever someone loads the page, which is responded to with the JSON data. This should be fairly easy with WebSockets, but I understand they have fallen out of favor.
2) Have the Python program periodically create/update an HTML file, which the page loads with jQuery. I could do this with my current knowledge, but I find it inelegant and it would be accepting several compromises such as increased disk read/writes and possibly out-of-date data unless read/writes are increased even further, essentially rendering the benefits from a memory database useless.
So, is my ideal case feasible? Can I implement an API into my Python program to listen for requests? Would the request be made with jQuery? Node.js? PHP? Maybe even with Apache? Do I bypass Python by manipulating the VFS? The techs available feel overwhelming and most online resources only detail generating HTML with Python (Django, Flask, etc.).
Thank you!
WSGI is the tech I was looking for!
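For reference, a minimal WSGI sketch along those lines (the table and query are placeholders): the data-fetching code and the request handler share one in-memory SQLite connection, and every request returns the same fixed query as JSON. Under Apache this would typically be served through mod_wsgi.

import json
import sqlite3

db = sqlite3.connect(':memory:', check_same_thread=False)
db.execute('CREATE TABLE readings (name TEXT, value REAL)')  # hypothetical schema
# a background thread in the same process would INSERT/UPDATE rows here

def application(environ, start_response):
    rows = db.execute('SELECT name, value FROM readings').fetchall()
    body = json.dumps([{'name': n, 'value': v} for n, v in rows]).encode('utf-8')
    start_response('200 OK', [('Content-Type', 'application/json')])
    return [body]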

How can I retrieve Proxmox node notes?

I am using proxmoxer to manipulate machines on ProxMox (create, delete etc).
Every time I create a machine, I provide a description, which appears in the ProxMox UI in the "Notes" section.
I am wondering how can I retrieve that information?
It would be best if it could be done through ProxMox, but if there is no way to do it with that Python module, I will also be satisfied doing it with a plain ProxMox API call.
The description parameter is only a message shown in the ProxMox UI; it's not tied to any function.
You could use https://github.com/baseblack/Proxmoxia to get started. I asked this very same question on the forum, as I need to generate some reports from a legacy system with dozens of VMs (and descriptions).
Let me know if you still need this, perhaps we can collaborate on it.
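For the record, a hedged proxmoxer sketch (host, credentials and node name are placeholders): the description set at creation time ends up in the guest config, so it can be read back from the config endpoint, the same data the plain API call /nodes/{node}/qemu/{vmid}/config returns.

from proxmoxer import ProxmoxAPI

proxmox = ProxmoxAPI('proxmox.example.com', user='root@pam',
                     password='secret', verify_ssl=False)
node = 'node1'  # placeholder node name
for vm in proxmox.nodes(node).qemu.get():
    config = proxmox.nodes(node).qemu(vm['vmid']).config.get()
    print(vm['vmid'], config.get('description', ''))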

Best Python Web Framework for my API Server Needs

I am working on developing two systems:
A system that will constantly retrieve economic data from a 3rd party data feed and push it into a MySQL DB (using sqlalchemy)
A server that will allow anyone to query the data in the db over a JSON AJAX API (similar to Yelp or Yahoo API for example)
I have two main questions:
Which Python framework should I use in 2)? Pyramid is my first choice, but if you strongly suggest against it or in favor of something else like Django or Pylons, I am definitely willing to consider it.
Should I develop the two system separately? Or should 1) be a part of 2), running within the framework (using crontab or celery for example)?
Depending on what stage you are at, I would suggest developing two separate systems, because the load from pulling data from the 3rd party and the load from handling the API will be different. You can then scale them onto different types of nodes if you want.
Django-Tastypie (https://github.com/toastdriven/django-tastypie) is not bad; it supports JSON, XML and YAML, and you can add OAuth easily. Django itself may be a bit heavy for your needs at this time, though.
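As a rough sketch of what that looks like (the EconomicData model is a placeholder for whatever Django model ends up holding the feed data):

from tastypie.resources import ModelResource
from myapp.models import EconomicData  # hypothetical Django model

class EconomicDataResource(ModelResource):
    class Meta:
        queryset = EconomicData.objects.all()
        resource_name = 'economicdata'
        allowed_methods = ['get']  # read-only JSON/XML/YAML API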
You might want to check out web2py's new functionality for easily generating RESTful API's, particularly its parse_as_rest and smart_query functions. You might also consider using web2py's database abstraction layer to handle #1.
If you need any help, ask on the mailing list.
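A hedged controller sketch of the parse_as_rest pattern mentioned above, with the URL patterns left to web2py's 'auto' mode (request, db and HTTP are web2py's controller globals):

@request.restful()
def api():
    def GET(*args, **vars):
        parser = db.parse_as_rest('auto', args, vars)
        if parser.status == 200:
            return dict(content=parser.response)
        raise HTTP(parser.status, parser.error)
    return locals()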
I agree with Anthony, you should look at Web2Py. It is very easy to get started with, has a very low learning curve and is easy to deploy on many systems, including Linux, Windows and Amazon.
So far I have found nothing that Web2Py cannot do. But more importantly, it does things the way you would think they should be done, so if you are not sure, very often a guess is good enough and it just works. If you do get stuck, it has by far the best and most up-to-date documentation of any Python web framework.
Even with all its great features, ease of use and up-to-date documentation, you will also find that the web2py user group on Google is like having a paid help desk staffed 24 hours a day. Most questions are answered within a couple of minutes, and Massimo (the original creator of Web2Py) goes out of his way not only to help, but to implement new ideas, suggestions and bug fixes within days of them being raised in the group.

Consuming RSS in Django ( / Python)

For a site I'm working on I would like to import a lot of RSS feeds using Django. Since I need their content quickly, I will need to cache them locally (either in the database or in some other way).
Is there a standard app to do RSS consumption in Django, or is there a standard way to do this in Python?
Of course I could implement it myself, but I'd rather reuse a good piece of code (since there's a lot to consider, like what to do when an item updates, how long to wait before checking for updates, etc., and I'd rather reuse someone else's thinking about this).
(I did google django and rss, but everything that seems to popup is feed generation; surely there must be other sites out there using Django and consuming RSS?)
Check out http://feedparser.org/docs/ and http://code.google.com/p/feedparser/
It's one of the best Python libraries for parsing RSS and Atom feeds, although it seems like you want to do a bit more (caching, auto-refresh, etc.).
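A small sketch with feedparser (the feed URL is a placeholder); caching and refresh scheduling would still be up to your own code or a cron job:

import feedparser

feed = feedparser.parse('https://example.com/news.rss')  # placeholder URL
for entry in feed.entries:
    print(entry.title, entry.link, entry.get('published', ''))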
