I developed a statistics system in Python for researching user behavior on an online web service. It mostly relies on reading and analyzing logs from the production server. Currently I share the log folders internally over SMB for the routine analytics program to read, but I have two questions about this data access method:
Is there any other way to access the logs besides SMB, or a better strategy altogether?
I suspect heavy reads may saturate the production server's disk and interfere with normal log writing. Is there any way to avoid this?
I had hoped to come up with some real numbers, but I don't have any yet. Can anyone give me some guidance on doing this more gracefully?
If you are open to using a third party log aggregation tool, you have a couple of options:
http://graylog2.org/
http://www.logstash.net/
http://www.octopussy.pm/
https://github.com/facebook/scribe
In addition, if you are logging to syslog, many of the commonly used syslog daemons (e.g. syslog-ng) can be configured to forward logs from various applications to one or more of these aggregators. It is trivial to log to syslog from a Python application: there is a syslog module in the standard library.
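For example, a minimal sketch using that module (the program name and message are just placeholders; logging.handlers.SysLogHandler works equally well if you prefer the logging framework):
import syslog
# Tag our messages so the syslog daemon can route them to an aggregator
syslog.openlog("analytics-webapp")
syslog.syslog(syslog.LOG_INFO, "user logged in: alice")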
Well, if you have an HTTP server in between (IHS, OHS, and I guess Apache too) then you can expose your physical repositories via URLs: each of your files gets a URL of its own, and with code like the following you can download them quite easily:
import os
import urllib2
# url is the URL of the remote log file exposed by the HTTP server;
# download it and write it to a local file with the same base name.
f = urllib2.urlopen(url)
with open(os.path.basename(url), 'wb') as local_file:
    local_file.write(f.read())
I decided to try using fileconveyor in order to write a simple app that will be able to sync a directory (with very small word files) across all my computers.
In order to do that I also installed pyftpdlib so as to write a simple FTP server that fileconveyor will link to.
pyftpdlib comes with a number of examples, so I used one of them to run a server on 0.0.0.0:2121 and configured fileconveyor to connect to it, which it did, reporting back that it is
- Fully up and running now.
The ftp server also logged the connection as such
USER 'user' logged in.
FTP session closed (disconnect).
But I am not quite sure what to do now.
1. How can I make the FTP server save uploaded files to a directory of my choosing?
2. Will fileconveyor be able to sync the files both ways?
3. If yes, how is that possible, as it would have to track changes to the files on the remote machine?
4. Is what I am trying to do a good idea, or should I be using fileconveyor differently, possibly not with pyftpdlib but with some other service?
Answer to 1: You can configure a home directory per user and for the anonymous user; see the example (add_user/add_anonymous). A short sketch also follows after these answers.
Answers to points 2 and 3: I don't think it's possible to make a reliable implementation of a two-way sync application using only a standard FTP server on one of the sides. Such apps need more information than the FTP protocol provides.
Answer to 4: Why do you need pyftpdlib? I believe it's good for building a customized embedded FTP server. You can use any popular FTP server such as ProFTPD or FileZilla. They are well documented and you can find a lot of HOW-TOs.
BTW why don't you want to use Dropbox?
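For completeness, a minimal pyftpdlib sketch of answer 1 (the paths and credentials are placeholders, and the module layout may differ slightly between pyftpdlib versions):
from pyftpdlib.authorizers import DummyAuthorizer
from pyftpdlib.handlers import FTPHandler
from pyftpdlib.servers import FTPServer

authorizer = DummyAuthorizer()
# Files uploaded by 'user' land in /srv/ftp/user; anonymous users get their own directory
authorizer.add_user("user", "secret", "/srv/ftp/user", perm="elradfmw")
authorizer.add_anonymous("/srv/ftp/anon")

handler = FTPHandler
handler.authorizer = authorizer

# Same address/port that fileconveyor was configured to connect to
server = FTPServer(("0.0.0.0", 2121), handler)
server.serve_forever()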
I have an existing website deployed on Google App Engine for Python. Now I have set up the local development server on my system, but I don't know how to get the updated database from the live server. There is no export option in Google's developer console.
Also, I don't want to read the data from the production Datastore on every request; I want to set it up locally once. The Google manual says that the local datastore is stored in an sqlite file.
Any hint would be appreciated.
First, make sure your app.yaml enables the "remote" built-in, with a stanza such as:
builtins:
- remote_api: on
This app.yaml of course must be the one deployed to your appspot.com (or whatever) "production" GAE app.
Then, it's a job for /usr/local/google_appengine/bulkloader.py or wherever you may have installed the bulkloader component. Run it with -h to get a list of the many, many options you can pass.
You may need to generate an application-specific password for this use on your Google accounts page. Then, the general use will be something like:
/usr/local/google_appengine/bulkloader.py --dump --url=http://your_app.appspot.com/_ah/remote_api --filename=allkinds.sq3
You may not (yet) be able to use this "all kinds" query -- the server only generates the needed statistics for the all-kinds query "periodically", so you may get an error message including info such as:
[ERROR ] Unable to download kind stats for all-kinds download.
[ERROR ] Kind stats are generated periodically by the appserver
[ERROR ] Kind stats are not available on dev_appserver.
If that's the case, then you can still get things "one kind at a time" by adding the option --kind=EntityKind and running the bulkloader repeatedly (with separate sqlite3 result files) for each kind of entity.
Once you've dumped (kind by kind if you have to, all at once if you can) the production datastore, you can use the bulkloader again, this time with --restore and addressing your localhost dev_appserver instance, to rebuild the latter's datastore.
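For instance, a per-kind dump followed by a restore against the local dev server would look roughly like this (EntityKind is a placeholder for one of your kinds; depending on your SDK version you may also need to pass the dev server's app id, so check bulkloader.py -h for the exact flag):
/usr/local/google_appengine/bulkloader.py --dump --kind=EntityKind --url=http://your_app.appspot.com/_ah/remote_api --filename=EntityKind.sq3
/usr/local/google_appengine/bulkloader.py --restore --kind=EntityKind --url=http://localhost:8080/_ah/remote_api --filename=EntityKind.sq3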
It should be possible to explicitly list kinds in the --kind flag (by separating them with commas and putting them all in parentheses) but unfortunately I think I've found a bug stopping that from working -- I'll try to get it fixed but don't hold your breath. In any case, this feature is not documented (I just found it by studying the open-source release of bulkloader.py) so it may be best not to rely on it!-)
More info about the then-new bulkloader can be found in a blog post by Nick Johnson at http://blog.notdot.net/2010/04/Using-the-new-bulkloader (though it doesn't cover newer functionalities such as the sqlite3 format of results in the "zero configuration" approach I outlined above). There's also a demo, with plenty of links, at http://bulkloadersample.appspot.com/ (also a bit outdated, alas).
Check out the remote API. This will tunnel your database calls over HTTP to the production database.
Later note: the issues in the original posting below have been largely resolved.
Here's the background: For an introductory comp sci course, students develop html and server-side Python 2.7 scripts using a server provided by the instructors. That server is based on CGIHTTPRequestHandler, like the one at pointlessprogramming. When the students' html and scripts seem correct, they port those files to a remote, slow Apache server. Why support two servers? Well, the initial development using a local server has the benefit of reducing network issues and dependency on the remote, weak machine that is running Apache. Eventually porting to the Apache-running machine has the benefit of publishing their results for others to see.
For the local development to be most useful, the local server should closely resemble the Apache server. Currently there is an important difference: Apache requires that a script start its response with headers that include a content-type; if the script fails to provide such a header, Apache sends the client a 500 error ("Internal Server Error"), which is too generic to help the students, who cannot use the server logs. CGIHTTPRequestHandler imposes no similar requirement. So it is common for a student to write header-free scripts that work with the local server, but get the baffling 500 error after copying files to the Apache server. It would be helpful to have a version of the local server that checks for a content-type header and gives a good error if there is none.
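For reference, here is a minimal sketch of the kind of script Apache accepts (the body text is a placeholder); the Content-Type line and the blank line after it are exactly what the students tend to forget:
#!/usr/bin/env python
# The header block must end with a blank line before any body output.
print "Content-Type: text/html"
print
print "<html><body>Hello from CGI</body></html>"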
I seek advice about creating such a server. I am new to Python and to writing servers. Here are the issues that occur to me, but any helpful advice would be appreciated.
1. Is a content-type header required by the CGI standard? If so, other people might benefit from an answer to the main question here. Also, if so, I despair of finding a way to disable Apache's requirement. Maybe the relevant part of the CGI RFC is section 6.3.1 (CGI Response, Content-Type): "If an entity body is returned, the script MUST supply a Content-Type field in the response."
2. To make a local server that checks for the content-type header, perhaps I should sub-class CGIHTTPServer.CGIHTTPRequestHandler, to override run_cgi() with a version that issues an error for a missing header. I am looking at CGIHTTPServer.py __version__ = "0.4", which was installed with Python 2.7.3. But run_cgi() does a lot of processing, so it is a little unappealing to copy all its code just to add a couple of calls to a header-checking routine. Is there a better way?
3. If the answer to (2) is something like "No, overriding run_cgi() is recommended," I anticipate writing a version that invokes the desired script, then checks the script's output for headers before that output is sent to the client. There are apparently two places in the existing run_cgi() where the script is invoked:
3a. When run_cgi() is executed on a non-Unix system, the script is executed using Python's subprocess module. As a result, the standard output from the script will be available as an in-memory string, which I can presumably check for headers before the call to self.wfile.write. Does this sound right?
3b. But when run_cgi() is executed on a *nix system, the script is executed by a forked process. I think the child's stdout will write directly to self.wfile (I'm a little hazy on this), so I see no opportunity for the code in run_cgi() to check the output. Ugh. Any suggestions?
4. If analyzing the script's output is recommended, is email.parser the standard way to recognize whether there is a content-type header? Is another standard module recommended instead? (A rough check along these lines is sketched after this list.)
5. Is there a more appropriate forum for asking the main question ("How can a CGI server based on CGIHTTPRequestHandler require...")? It seems odd to ask if there is a better forum for asking programming questions than Stack Overflow, but I guess anything is possible.
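To make (4) concrete, here is a rough sketch of the check I have in mind, assuming the script's raw output is available as a string (the function and variable names are hypothetical):
from email.parser import Parser

def has_content_type(script_output):
    # A CGI response separates headers from the body with a blank line;
    # normalize CRLF so both line-ending styles are handled.
    head = script_output.replace('\r\n', '\n').split('\n\n', 1)[0]
    headers = Parser().parsestr(head, headersonly=True)
    return 'content-type' in headers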
Thanks for any help.
I'm searching for a good way to stress test a web application. Basically I'm looking for something like ab with a scriptable interface. Ideally I want to define some tasks that simulate different actions on the webapp (register an account, log in, search, etc.) and have the tool run a whole bunch of processes that execute these tasks*. As a result I would like something like "average request time", "slowest request (per URI)", etc.
*: To be independent of the client bandwidth I will run these tests from some EC2 instances, so in a perfect world the tool will already support this; otherwise I will script it using boto.
If you're familiar with the Python requests package, locust is very easy to write load tests in.
http://locust.io/
I've used it to write all of our perf tests.
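A minimal locustfile sketch (the URLs and task weights are placeholders; newer Locust releases replaced HttpLocust/TaskSet with HttpUser, so adjust to your version):
from locust import HttpLocust, TaskSet, task

class UserBehavior(TaskSet):
    @task(3)
    def index(self):
        self.client.get("/")

    @task(1)
    def search(self):
        self.client.get("/search?q=example")

class WebsiteUser(HttpLocust):
    task_set = UserBehavior
    min_wait = 1000  # milliseconds between tasks
    max_wait = 5000

Run it with something like locust -f locustfile.py --host=http://your-webapp.example.com and the web UI will report average and slowest request times per URL.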
You could also look at these tools:
palb (Python Apache-Like Benchmark Tool) - an HTTP benchmark tool whose command line interface resembles ab.
It lacks the advanced features of ab, but it supports multiple URLs (from arguments, files, stdin, and Python code).
Multi-Mechanize - Performance Test Framework in Python
Multi-Mechanize is an open source framework for performance and load testing.
Runs concurrent Python scripts to generate load (synthetic transactions) against a remote site or service.
Can be used to generate workload against any remote API accessible from Python.
Test output reports are saved as HTML or JMeter-compatible XML.
Pylot (Python Load Tester) - Web Performance Tool
Pylot is a free open source tool for testing performance and scalability of web services.
It runs HTTP load tests, which are useful for capacity planning, benchmarking, analysis, and system tuning.
Pylot generates concurrent load (HTTP Requests), verifies server responses, and produces reports with metrics.
Test suites are executed and monitored from a GUI or shell/console.
(Pylot on Google Code)
The Grinder
Default script language is Jython.
Pretty compact how-to guide.
Tsung
Maybe a bit unusual at first use, but really good for stress testing.
Step-by-step guide.
+1 for locust.io in the answer above.
I would recommend JMeter.
See: http://jmeter.apache.org/
You can set up JMeter as a proxy for your browser to record actions like logging in, and then stress test your web application. You can also write scripts for it.
Don't forget FunkLoad; it's very easy to use.
I'm maintaining an open-source document asset management application called NotreDAM, which is written in Django and runs on Apache and an instance of TwistedWeb.
Whenever any user downloads a file, the application hangs for all users for the entire duration of the download. I've tracked down the download command to this point in the code, but I'm not versed enough in Python/Django to know why this may be happening.
# Read the file contents and build the response with download headers.
response = HttpResponse(open(fullpath, 'rb').read(), mimetype=mimetype)
response["Last-Modified"] = http_date(statobj.st_mtime)
response["Content-Length"] = statobj.st_size
if encoding:
    response["Content-Encoding"] = encoding
return response
Do you know how I could fix the application hanging while a file downloads?
The code reads the whole file into memory instead of streaming it. It is not well-written code, but not a bug per se.
This blocks the pre-forked Apache worker for the duration of the whole file read. If I/O is slow and the file is large, it may take some time.
Usually you have several pre-forked Apache processes configured to handle this kind of request, but on a badly configured web server you may see this kind of problem, and it is not a Django issue. Your web server is probably running only one pre-forked process, potentially in debug mode.
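If you do want to keep serving the file from this view, a rough sketch of streaming instead of buffering, using the standard library's wsgiref.util.FileWrapper (and assuming an older Django where HttpResponse still takes the mimetype argument), would be:
from wsgiref.util import FileWrapper
from django.http import HttpResponse

# Wrap the file so it is read and sent in chunks instead of being
# loaded into memory all at once.
wrapper = FileWrapper(open(fullpath, 'rb'), blksize=8192)
response = HttpResponse(wrapper, mimetype=mimetype)
response["Last-Modified"] = http_date(statobj.st_mtime)
response["Content-Length"] = statobj.st_size
return response

Newer Django versions provide StreamingHttpResponse and FileResponse, which are the cleaner way to achieve the same thing.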
NotreDAM serves the asset files using the django.views.static.serve() view, about which the Django docs say: "Using this method is inefficient and insecure. Do not use this in a production setting. Use this only for development." So there we go. I have to use another approach.