Python software design

I am starting to use Python more. Is there a good way to keep Python's disk access to a minimum?
It seems to me that every time a *.py file runs, it hits the hard disk. Is there a way to avoid hitting the hard disk and keep the *.py file in memory, accessing it from there?
Would creating a small GUI using a wxPython frame keep the code in memory for reuse, or is it more pain than benefit?

If you run a .py file from the hard disk, the hard disk will be accessed.
In your GUI, just import your code and it will be loaded once and you can access it later.

Modern operating systems cache file access pretty efficiently, as long as there is enough spare RAM available. You most likely won't notice any difference if you're not loading thousands of Python files at once.
And as always, before trying to optimize one aspect, make sure that it is really the bottleneck. Chances are, your perceived slowness is not due to loading the .py files.

I think if you took the time to measure how much time it takes to load your python code from disk you would end up with a very, very tiny number unless you are doing something very wrong. And if you are doing something really wrong, solving that problem will be a better use of your time.
Using wxPython to create a GUI to work around what you perceive to be a problem wouldn't likely make any difference.
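If you do want to measure it, a check is quick to write; here is a minimal sketch, where "mymodule" is just a placeholder for your own code:

    import importlib
    import time

    start = time.perf_counter()
    importlib.import_module("mymodule")    # first import actually reads the file(s) from disk
    print("first import: %.6f s" % (time.perf_counter() - start))

    start = time.perf_counter()
    importlib.import_module("mymodule")    # already cached in sys.modules, near-instant
    print("cached import: %.6f s" % (time.perf_counter() - start))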


Working on many files with Python

The task:
I am working with 4 TB of data/files, stored on an external usb disk: images, html, videos, executables and so on.
I want to index all those files in a sqlite3 database with the following schema:
path TEXT, mimetype TEXT, filetype TEXT, size INT
So far:
I os.walk recursively through the mounted directory, run the Linux file command with Python's subprocess, and get the size with os.path.getsize(). Finally the results are written into the database, which is stored on my computer - the USB disk is mounted with -o ro, of course. No threading, by the way.
You can see the full code here http://hub.darcs.net/ampoffcom/smtid/browse/smtid.py
The problem:
The code is really slow. I realized that the deeper the directory structure, the slower the code. I suppose os.walk might be the problem.
The questions:
Is there a faster alternative to os.walk?
Would threading speed things up?
Is there a faster alternative to os.walk?
Yes. In fact, multiple.
scandir (which will be in the stdlib in 3.5) is significantly faster than walk; a sketch of using it follows below.
The C function fts is significantly faster than scandir. I'm pretty sure there are wrappers on PyPI, although I don't know one off-hand to recommend, and it's not that hard to use via ctypes or cffi if you know any C.
The find tool uses fts, and you can always subprocess to it if you can't use fts directly.
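As a rough illustration of the scandir option (os.scandir is in the stdlib from 3.5, and the scandir package on PyPI backports it; the mount point in the usage comment is a placeholder): a recursive walk that yields DirEntry objects, where is_dir() usually needs no extra stat() call per entry.

    import os

    def scantree(path):
        """Recursively yield DirEntry objects for every file under path."""
        for entry in os.scandir(path):
            if entry.is_dir(follow_symlinks=False):
                yield from scantree(entry.path)
            else:
                yield entry

    # e.g. sizes = ((e.path, e.stat().st_size) for e in scantree("/mnt/usb"))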
Would threading speed things up?
That depends on details of your system that we don't have, but… you're spending all of your time waiting on the filesystem. Unless you have multiple independent drives that are only bound together at user level (that is, not LVM or something below it like RAID) or not at all (e.g., one is just mounted under the other's filesystem), issuing multiple requests in parallel will probably not speed things up.
Still, this is pretty easy to test; why not try it and see?
One more idea: you may be spending a lot of time spawning and communicating with all those file processes. There are multiple Python libraries that use the same libmagic that file does. I don't want to recommend one in particular over the others, so search PyPI for libmagic bindings.
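For instance, python-magic is one such binding; a minimal sketch, assuming it and libmagic are installed (the exact API differs between bindings):

    import magic

    mime_detector = magic.Magic(mime=True)   # reuse one detector across all files
    type_detector = magic.Magic()            # human-readable description, like `file` prints

    def identify(path):
        return mime_detector.from_file(path), type_detector.from_file(path)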
As monkut suggests, make sure you're doing bulk commits, not autocommitting each insert with sqlite. As the FAQ explains, sqlite can do ~50000 inserts per second, but only a few dozen transactions per second.
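A minimal sketch of the bulk-commit idea, reusing the schema from the question (it assumes the rows have already been collected while walking the tree):

    import sqlite3

    conn = sqlite3.connect("index.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS files
                    (path TEXT, mimetype TEXT, filetype TEXT, size INT)""")

    rows = []  # fill with (path, mimetype, filetype, size) tuples during the walk
    conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", rows)
    conn.commit()  # one transaction for the whole batch instead of one per insert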
While we're at it, if you can put the sqlite file on a different filesystem than the one you're scanning (or keep it in memory until you're done, then write it to disk all at once), that might be worth trying.
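The keep-it-in-memory variant could look roughly like this (Connection.backup requires Python 3.7+; on older versions iterdump or simply re-running the inserts against a disk file works):

    import sqlite3

    mem = sqlite3.connect(":memory:")
    # ... create the table and do all inserts against `mem` here ...
    disk = sqlite3.connect("index.db")
    mem.backup(disk)   # one bulk write of the finished database to disk
    disk.close()
    mem.close()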
Finally, but most importantly:
Profile your code to see where the hotspots are, instead of guessing.
Create small data sets and benchmark different alternatives to see how much benefit you get.

In a Django web application, would large files or many unnecessary import statements slow down my server?

In my Django web app, I have pretty much one large file that contains all my views. This has a ton of imported python libraries that are only used for certain views.
Does this slow down my code? In Python, does importing things like the Natural Language Toolkit (nltk) and threading libraries slow down the code when they're not needed?
I know it's not great from a maintainability/style standpoint to have one big file like this, but I am asking purely from a performance standpoint.
No, code speed is not affected by the size of your modules.
Additional imports only affect the memory footprint (a little more memory is needed to hold the extra code objects) and startup speed (more files are loaded from disk when your Django server starts).
However, this doesn't really affect code running speeds; Python does not have to do extra work to run your code.
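If startup time or memory footprint ever does become a concern, one common workaround is to defer a heavy import into the single view that needs it. A rough sketch (the view and the nltk call are only illustrative):

    from django.http import JsonResponse

    def analyze(request):
        import nltk  # loaded on the first request to this view, then cached in sys.modules
        tokens = nltk.word_tokenize(request.GET.get("text", ""))  # needs nltk's punkt data
        return JsonResponse({"tokens": tokens})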
Views are loaded only once, when your code starts.

Persistent database state strategies

Due to several edits, this question might have become a bit incoherent. I apologize.
I'm currently writing a Python server. It will never see more than 4 active users, but I'm a computer science student, so I'm planning for it anyway.
Currently, I'm about to implement a function to save a backup of the current state of all relevant variables into CSV files. Of those I currently have 10, and they will never be really big, but... well, computer science student and so on.
So, I am currently thinking about two things:
When to run a backup?
What kind of backup?
When to run:
I can either run a backup every time a variable changes, which has the advantage of always having the current state in the backup, or something like once every minute, which has the advantage of not rewriting the file hundreds of times per minute if the server gets busy, but will create a lot of useless rewrites of the same data unless I detect which variables have changed since the last backup.
Directly related to that is the question what kind of backup I should do.
I can either do a full backup of all variables (which is pointless if I'm running a backup every time a variable changes, but might be good if I'm running a backup every X minutes), or a backup of a single variable (which would be better if I'm backing up each time a variable changes, but would involve either multiple backup functions or smart detection of which variable is currently being backed up), or I can try some sort of delta backup on the files (which would probably involve reading the current file and rewriting it with the changes, so it's probably pretty stupid, unless there is a trick for this in Python I don't know about).
I cannot use shelve because I want the data to be portable between different programming languages (Java, for example, probably cannot open Python shelve files), and I cannot use MySQL for various reasons, mainly that the machine that will run the server has no MySQL support and I don't want to use an external MySQL server, since I want the server to keep running when the internet connection drops.
I am also aware of the fact that there are several ways to do this with preimplemented functions of python and / or other software (sqlite, for example). I am just a big fan of building this stuff myself, not because I like to reinvent the wheel, but because I like to know how the things I use work. I'm building this server partly just for learning python, and although knowing how to use SQLite is something useful, I also enjoy doing the "dirty work" myself.
In my usage scenario of possibly a few requests per day I am tending towards the "backup on change" idea, but that would quickly fall apart if, for some reason, the server gets really, really busy.
So, my question basically boils down to this: Which backup method would be the most useful in this scenario, and have I possibly missed another backup strategy? How do you decide on which strategy to use in your applications?
Please note that I raise this question mostly out of a general curiosity for backup strategies and the thoughts behind them, and not because of problems in this special case.
Use sqlite. You're asking about building persistent storage using csv files, and about how to update the files as things change. What you're asking for is a lightweight, portable relational (as in, table based) database. Sqlite is perfect for this situation.
Python has had sqlite support in the standard library since version 2.5 with the sqlite3 module. Since a sqlite database is implemented as a single file, it's simple to move it across machines, and Java has a number of different ways to interact with sqlite.
I'm all for doing things for the sake of learning, but if you really want to learn about data persistence, I wouldn't get married to the idea of a "csv database". I would start by looking at the Wikipedia page for Persistence. What you're thinking about is basically a "system image" for your data. The Wikipedia article describes some of the same shortcomings of this approach that you've mentioned:
State changes made to a system after its last image was saved are lost in the case of a system failure or shutdown. Saving an image for every single change would be too time-consuming for most systems.
Rather than trying to update your state wholesale at every change, I think you'd be better off looking at some other form of persistence. For example, some sort of journal could work well. This makes it simple to just append any change to the end of a log-file, or some similar construct.
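A toy sketch of the journal idea (the file name and record shape here are made up): append each change as one JSON line, then replay the file at startup to rebuild the current state.

    import json

    def record_change(key, value, path="journal.log"):
        with open(path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")

    def rebuild_state(path="journal.log"):
        state = {}
        try:
            with open(path) as f:
                for line in f:
                    entry = json.loads(line)
                    state[entry["key"]] = entry["value"]
        except FileNotFoundError:
            pass
        return state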
However, if you end up with many concurrent users, with processes running on multiple threads, you'll run into concerns about whether or not your changes are atomic, or whether they conflict with one another. While operating systems generally have some ways of dealing with locking files for edits, you're opening up a can of worms trying to learn how that works and interacts with your system. At this point you're back to needing a database.
So sure, play around with a couple different approaches. But as soon as you're looking to just get it working in a clear and consistent manner, go with sqlite.
If your data is in CSV files, why not use a revision control system on those files? E.g. git would be pretty fast and give excellent history. The repository would be wholly contained in the directory where the files reside, so it's pretty easy to handle. You could also replicate that repository to other machines or directories easily.

python or database?

I am reading a CSV file into a list of lists in Python. It is around 100 MB right now; in a couple of years that file will grow to 2-5 GB. I am doing lots of log calculations on the data. The 100 MB file takes the script around 1 minute to process. After the script does a lot of fiddling with the data, it creates URLs that point to Google Charts and then downloads the charts locally.
Can I continue to use Python on a 2 GB file, or should I move the data into a database?
I don't know exactly what you are doing, but a database will just change how the data is stored. In fact it might take longer, since most reasonable databases may have constraints on columns and additional processing for the checks. In many cases, having the whole file locally and going through it doing calculations is going to be more efficient than querying and writing it back to the database (subject to disk speeds, network and database contention, etc...). But in some cases the database may speed things up, especially because if you use indexing it is easy to get subsets of the data.
Anyway, you mentioned logs, so before you go database crazy I have the following ideas for you to check out. I'm not sure whether you have to keep going through every log since the beginning of time to download charts, and whether you expect the file to reach 2 GB in total or eventually expect 2 GB of traffic per day/week.
ARCHIVING -- you can archive old logs, say every few months. Copy the production logs to an archive location and clear the live logs out. This will keep the file size reasonable. If you are wasting time accessing the file to find the small piece you need then this will solve your issue.
You might want to consider converting to Java or C. Especially on loops and calculations you might see a factor of 30 or more speedup. This will probably reduce the time immediately. But over time, as the data creeps up, some day this will slow down as well; if you have no bound on the amount of data, eventually even hand-optimized assembly by the world's greatest programmer will be too slow. But it might buy you 10x the time...
You also may want to think about figuring out the bottleneck (is it disk access, or is it CPU time) and, based on that, figuring out a scheme to do this task in parallel. If it is processing, look into multiple processes rather than threads (CPython's GIL keeps pure-Python calculations on one core), and eventually multiple computers; if it is disk access, consider splitting the file among multiple machines... It really depends on your situation. But I suspect archiving might eliminate the need here.
As was suggested, if you are doing the same calculations over and over again, then just store them. Whether you use a database or a file this will give you a huge speedup.
If you are downloading stuff and that is a bottleneck, look into conditional GETs using an If-Modified-Since request, and only download changed items. If you are only processing new charts, ignore this suggestion.
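A rough sketch of that (it assumes the requests package; how you store the Last-Modified value between runs is up to you):

    import requests

    def fetch_if_changed(url, last_modified=None):
        headers = {"If-Modified-Since": last_modified} if last_modified else {}
        resp = requests.get(url, headers=headers)
        if resp.status_code == 304:              # unchanged since the last download
            return None, last_modified
        return resp.content, resp.headers.get("Last-Modified")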
Oh, and if you are sequentially reading a giant log file, looking for a specific place in it line by line, just make another file storing the last file position you worked with, and seek to it on each run.
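Something like this minimal sketch (the offset file name is arbitrary):

    import os

    OFFSET_FILE = "log.offset"   # bookmark file holding the last position processed

    def read_new_lines(log_path):
        offset = 0
        if os.path.exists(OFFSET_FILE):
            with open(OFFSET_FILE) as f:
                offset = int(f.read().strip() or 0)
        with open(log_path) as log:
            log.seek(offset)             # jump straight past the already-processed part
            lines = log.readlines()
            new_offset = log.tell()
        with open(OFFSET_FILE, "w") as f:
            f.write(str(new_offset))
        return lines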
Before reaching for a full database server, you may want to think of SQLite.
Finally, a "couple of years" seems like a long time in programmer time. Even if it is just two, a lot can change. Maybe your department/division will be laid off. Maybe you and your boss will have moved on. Maybe the system will be replaced by something else. Maybe there will no longer be a need for what you are doing. If it were 6 months I'd say fix it, but for a couple of years, in most cases, I'd say just use the solution you have now, and once it gets too slow look at doing something else. You could make a comment in the code with your thoughts on the issue, and even send an e-mail to your boss so he knows about it as well. But as long as it works and will continue doing so for a reasonable amount of time, I would consider it "done" for now. No matter what solution you pick, if data grows unbounded you will need to reconsider it: adding more machines, more disk space, new algorithms/systems/developments. Solving it for a "couple of years" is probably pretty good.
If you need to go through all lines each time you perform the "fiddling", it wouldn't really make much difference, assuming the actual "fiddling" is what's eating your cycles.
Perhaps you could store the results of your calculations somehow, then a database would probably be nice. Also, databases have methods for ensuring data integrity and stuff like that, so a database is often a great place for storing large sets of data (duh! ;)).
I'd only put it into a relational database if:
The data is actually relational and expressing it that way helps shrink the size of the data set by normalizing it.
You can take advantage of triggers and stored procedures to offload some of the calculations that your Python code is performing now.
You can take advantage of queries to only perform calculations on data that's changed, cutting down on the amount of work done by Python.
If none of those things is true, I don't see much difference between a database and a file. Both ultimately have to be stored on the file system.
If Python has to process all of it, and getting it into memory means loading an entire data set, then there's no difference between a database and a flat file.
2GB of data in memory could mean page swapping and thrashing by your application. I would be careful and get some data before I blamed the problem on the file. Just because you access the data from a database won't solve a paging problem.
If your data's flat, I see less advantage in a database, unless "flat" == "highly denormalized".
I'd recommend some profiling to see what's consuming CPU and memory before I made a change. You're guessing about the root cause right now. Better to get some data so you know where the time is being spent.
I always reach for a database for larger datasets.
A database gives me some stuff for "free"; that is, I don't have to code it.
searching
sorting
indexing
language-independent connections
Something like SQLite might be the answer for you.
Also, you should investigate the "nosql" databases; it sounds like your problem might fit well into one of them.
At 2 GB, you may start running up against speed issues. I work with model simulations that read hundreds of CSV files, and it takes about an hour to go through 3 iterations, or about 20 minutes per loop.
This is a matter of personal preference, but I would go with something like PostgreSQL because it integrates the speed of Python with the capacity of a SQL-driven relational database. I encountered the same issue a couple of years ago when my Access DB was corrupting itself and crashing on a daily basis. It was either MySQL or Postgres, and I chose Postgres because of its Python friendliness. Not to say MySQL would not work with Python, because it does, which is why I say it's personal preference.
Hope that helps with your decision-making!

Scratch disks in Python?

I understood that certain Windows XP programs, like Photoshop, have something called "scratch disks". What I understood this to mean - and please correct me if I'm wrong - is that Photoshop manages its own virtual memory on the hard drive, instead of letting Windows manage it. I understood that the reason for this is some limitation by Windows XP on how much total memory a process can take, regardless of HD space. I think it's around 3 GB. Did I get it right so far?
I am making an application in Python for running simulations. It will take a lot of memory, and will run on Windows XP. Is it possible for it to use scratch disks? How?
Until you ACTUALLY run out of memory, thinking about this is a waste of time.
When you finally do run out of memory, you'll need to use a temporary file to store objects that your process needs, but can't fit into memory.
Use pickle or shelve (see Data Persistence) to store your objects in a file. If that file happens to be on a disk named "scratch", well, that's nice.
Sometimes you want your temporary files to be on a separate disk from your other working files for performance reasons. In some environments (SAN, NAS, storage arrays) your disks are virtual and looking for a "scratch" disk doesn't have any performance benefit. In other environments (i.e., you own all the hardware) you can put temporary files on some other drive, making that drive a "scratch" disk.
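For illustration, a small shelve sketch (shelve pickles values for you; the path is a hypothetical location on a dedicated scratch drive):

    import shelve

    with shelve.open("/scratch/simstate") as db:
        db["results"] = {"step": 1, "values": [0.1, 0.2, 0.3]}  # any picklable object
        restored = db["results"]   # read it back later instead of keeping it all in RAM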
I understood that the reason for this is some limitation by Windows XP on how much total memory a process can take, regardless of HD space. I think it's around 3 GB.
Just an FYI, this is more a limitation of a 32-bit OS rather than being a Windows XP problem. You'll have the same problem in 32-bit Vista, linux, bsd... you get the idea. If you go the 64-bit route, you don't have these problems.
For example, Windows XP x64 allows up to 8 terabytes of memory per process.
Scratch disks will benefit your application in the case that it works with very big files. Is that the case?
If not, then I don't think you will find anything in scratch disks that benefits your application.
Memory mapped files might be what you are looking for. Python's implementation lets you use a file like a mutable string in memory.
The Win32 API provides this directly as well.
You may be able to use these functions through PyWin32.
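A minimal mmap sketch (the scratch file name is a placeholder, and the file must already exist and be at least a few bytes long):

    import mmap

    with open("scratch.bin", "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:   # length 0 maps the whole file
            mm[0:5] = b"hello"                 # slice assignment writes through to the file
            print(mm[:5])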
You could combine S.Lott's answer about using pickle (you should use cPickle, though, for better performance) with SQLite.
sqlite is built into Python 2.5 and up, so all you'll need to do is import it :), then just store the pickled objects as strings in there and you'll have a nice, fast method of accessing the data (compared to building your own method) that will help keep you organized as well.
Note: cPickle is almost identical to pickle in use. The only difference is that it is written in C.
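A sketch of how that combination might look (the table and file names are made up; on Python 3 the pickled values are bytes and are stored here as BLOBs):

    import pickle
    import sqlite3

    conn = sqlite3.connect("scratch.db")
    conn.execute("CREATE TABLE IF NOT EXISTS scratch (key TEXT PRIMARY KEY, blob BLOB)")

    def put(key, obj):
        data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        conn.execute("INSERT OR REPLACE INTO scratch VALUES (?, ?)", (key, data))
        conn.commit()

    def get(key):
        row = conn.execute("SELECT blob FROM scratch WHERE key = ?", (key,)).fetchone()
        return pickle.loads(row[0]) if row else None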
Useful Python Docs:
sqlite3 module
pickle module
Edit: It may be a good idea to have a user-controlled memory usage limit. It would be a shame to be storing a bunch of data on disk and waiting on slow disk I/O when the user has 8 GB of RAM ;)
You are probably looking for something like ZODB. However, though ZODB tries hard to be transparent, no solution is going to be 100% free of artifacts. You have to write your code with an awareness that your objects primarily live in a database, but that there are multiple representations of your objects, there are caching/syncing issues, etc. Nothing is going to make this very difficult problem completely trivial for you.
