I understand that certain Windows XP programs, like Photoshop, have something called "scratch disks". What I understand this to mean, and please correct me if I'm wrong, is that Photoshop manages its own virtual memory on the hard drive instead of letting Windows manage it. I understand the reason for this is some limitation in Windows XP on how much total memory a process can take, regardless of HD space. I think it's around 3 GB. Did I get it right so far?
I am making an application in Python for running simulations. It will take a lot of memory, and will run on Windows XP. Is it possible for it to use scratch disks? How?
Until you ACTUALLY run out of memory, thinking about this is a waste of time.
When you finally do run out of memory, you'll need to use a temporary file to store objects that your process needs, but can't fit into memory.
Use pickle or shelve (see Data Persistence) to store your objects in a file. If that file happens to be on a disk named "scratch", well, that's nice.
Sometimes you want your temporary files to be on a separate disk from your other working files for performance reasons. In some environments (SAN, NAS, storage arrays) your disks are virtual and looking for a "scratch" disk doesn't have any performance benefit. In other environments (i.e., you own all the hardware) you can put temporary files on some other drive, making that drive a "scratch" disk.
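A rough sketch of the idea (the scratch directory path is made up; point it at whatever drive you have set aside for temporary files): objects you don't need right now get pickled out to that disk and loaded back on demand.

```python
import os
import pickle

# Hypothetical scratch location; use whatever drive you set aside for temp files.
SCRATCH_DIR = r"D:\scratch"

def spill_to_scratch(obj, name):
    """Pickle an object that no longer fits comfortably in memory."""
    path = os.path.join(SCRATCH_DIR, name + ".pkl")
    with open(path, "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
    return path

def load_from_scratch(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```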
I understood that the reason for this is some limitation by Windows XP on how much total memory a process can take, regardless of HD space. I think it's around 3 GB.
Just an FYI, this is more a limitation of a 32-bit OS than a Windows XP problem. You'll have the same problem in 32-bit Vista, Linux, BSD... you get the idea. If you go the 64-bit route, you don't have these problems.
For example, Windows XP x64 allows up to 8 terabytes of memory per process.
Scratch disks will benefit your application only if it works with very big files.
Is that the case?
If not, I don't think scratch disks will give your application any benefit.
Memory mapped files might be what you are looking for. Python's implementation lets you use a file like a mutable string in memory.
The Win32 API provides this as well (file mapping via CreateFileMapping/MapViewOfFile).
You may be able to use these functions through PyWin32.
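A minimal sketch with Python's mmap module (the file name is made up, and the file must already exist with a non-zero size):

```python
import mmap

with open("simulation.dat", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)     # map the whole file into memory
    header = mm[0:16]                 # slice it like a string
    mm[0:4] = b"\x00\x00\x00\x00"     # write back in place
    mm.flush()                        # push changes out to the file
    mm.close()
```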
You could combine S.Lott's answer about using pickle (though you should use cPickle for better performance) with SQLite.
sqlite is built into Python 2.5 and up, so all you'll need to do is import it :). Then just store the pickled objects as strings in there and you'll have a nice, fast method of accessing the data (compared to building your own) that will help keep you organized as well.
Note: cPickle is almost identical to pickle in use; the only difference is that it is written in C.
Useful Python Docs:
sqlite3 module
pickle module
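A rough sketch of that combination (the database file, table and key scheme are made up for illustration):

```python
import sqlite3
try:
    import cPickle as pickle  # Python 2; falls back to plain pickle on Python 3
except ImportError:
    import pickle

conn = sqlite3.connect("objects.db")
conn.execute("CREATE TABLE IF NOT EXISTS objects (key TEXT PRIMARY KEY, data BLOB)")

def save(key, obj):
    blob = sqlite3.Binary(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))
    conn.execute("INSERT OR REPLACE INTO objects VALUES (?, ?)", (key, blob))
    conn.commit()

def load(key):
    row = conn.execute("SELECT data FROM objects WHERE key = ?", (key,)).fetchone()
    return pickle.loads(bytes(row[0])) if row else None
```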
Edit: It may be a good idea to have a user-controlled memory usage limit. It would be a shame to be storing a bunch of data on disk and waiting on slow disk I/O when the user has 8 GB of RAM ;)
You are probably looking for something like ZODB. However, though ZODB tries hard to be transparent, no solution is going to be 100% free of artifacts. You have to write your code with an awareness that your objects primarily live in a database, but that there are multiple representations of your objects, there are caching/syncing issues, etc. Nothing is going to make this very difficult problem completely trivial for you.
Related
The task:
I am working with 4 TB of data/files stored on an external USB disk: images, HTML, videos, executables and so on.
I want to index all those files in a sqlite3 database with the following schema:
path TEXT, mimetype TEXT, filetype TEXT, size INT
So far:
I os.walk recursively through the mounted directory, run the Linux file command on each file with Python's subprocess, and get the size with os.path.getsize(). Finally the results are written into a database stored on my computer - the USB disk is mounted with -o ro, of course. No threading, by the way.
You can see the full code here http://hub.darcs.net/ampoffcom/smtid/browse/smtid.py
The problem:
The code is really slow. I realized that the deeper the directory structure, the slower the code. I suppose os.walk might be the problem.
The questions:
Is there a faster alternative to os.walk?
Would threading speed things up?
Is there a faster alternative to os.walk?
Yes. In fact, multiple.
scandir (which will be in the stdlib in 3.5) is significantly faster than walk.
The C function fts is significantly faster than scandir. I'm pretty sure there are wrappers on PyPI, although I don't know one off-hand to recommend, and it's not that hard to use via ctypes or cffi if you know any C.
The find tool uses fts internally, and you can always shell out to it with subprocess if you can't use fts directly.
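For example, a rough sketch of a scandir-based walk (assuming Python 3.5+, or the standalone scandir package with scandir.scandir in place of os.scandir):

```python
import os

def walk_files(top):
    """Yield (path, size) for regular files under top using os.scandir."""
    stack = [top]
    while stack:
        for entry in os.scandir(stack.pop()):
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
            elif entry.is_file(follow_symlinks=False):
                # the stat result is cached on the entry after the first call,
                # so the size usually comes without an extra system call
                yield entry.path, entry.stat(follow_symlinks=False).st_size
```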
Would threading speed things up?
That depends on details of your system that we don't have, but… you're spending all of your time waiting on the filesystem. Unless you have multiple independent drives that are only bound together at user level (that is, not LVM or something below it like RAID) or not at all (e.g., one is just mounted under the other's filesystem), issuing multiple requests in parallel probably won't speed things up.
Still, this is pretty easy to test; why not try it and see?
One more idea: you may be spending a lot of time spawning and communicating with those file processes. There are multiple Python libraries that wrap the same libmagic that file uses. I don't want to recommend one in particular over the others, but a quick search for "python libmagic" will turn up several options.
As monkut suggests, make sure you're doing bulk commits rather than autocommitting each insert with sqlite. As the FAQ explains, sqlite can do ~50,000 inserts per second, but only a few dozen transactions per second.
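A sketch of what batching might look like with the schema from the question (the batch size, table name and database file name are arbitrary choices for this example):

```python
import sqlite3

conn = sqlite3.connect("index.db")  # file name is made up
conn.execute("""CREATE TABLE IF NOT EXISTS files
                (path TEXT, mimetype TEXT, filetype TEXT, size INT)""")

BATCH = 10000
pending = []  # accumulate (path, mimetype, filetype, size) tuples here

def add_row(row):
    pending.append(row)
    if len(pending) >= BATCH:
        flush()

def flush():
    conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", pending)
    conn.commit()  # one transaction per batch instead of one per insert
    del pending[:]
```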
While we're at it, if you can put the sqlite file on a different filesystem than the one you're scanning (or keep it in memory until you're done, then write it to disk all at once), that might be worth trying.
Finally, but most importantly:
Profile your code to see where the hotspots are, instead of guessing.
Create small data sets and benchmark different alternatives to see how much benefit you get.
Due to several edits, this question might have become a bit incoherent. I apologize.
I'm currently writing a Python server. It will never see more than 4 active users, but I'm a computer science student, so I'm planning for it anyway.
Currently, I'm about to implement a function to save a backup of the current state of all relevant variables into CSV files. Of those I currently have 10, and they will never be really big, but... well, computer science student and so on.
So, I am currently thinking about two things:
When to run a backup?
What kind of backup?
When to run:
I can either run a backup every time a variable changes, which has the advantage of always having the current state in the backup, or run one every minute or so, which has the advantage of not rewriting the file hundreds of times per minute if the server gets busy, but will create a lot of useless rewrites of the same data unless I implement detection of which variables have changed since the last backup.
Directly related to that is the question of what kind of backup I should do.
I can either do a full backup of all variables, which is pointless if I'm backing up every time a variable changes but might be good if I'm backing up every X minutes; or a full backup of a single variable, which would be better if I'm backing up each time a variable changes but would involve either multiple backup functions or smart detection of which variable to back up; or I can try some sort of delta backup on the files, which would probably involve reading the current file and rewriting it with the changes, so it's probably pretty stupid, unless there is a trick for this in Python I don't know about.
I cannot use shelve because I want the data to be portable between different programming languages (Java, for example, probably cannot open Python shelve files), and I cannot use MySQL for different reasons, mainly that the machine that will run the server has no MySQL support and I don't want to use an external MySQL server, since I want the server to keep running when the internet connection drops.
I am also aware of the fact that there are several ways to do this with preimplemented functions of python and / or other software (sqlite, for example). I am just a big fan of building this stuff myself, not because I like to reinvent the wheel, but because I like to know how the things I use work. I'm building this server partly just for learning python, and although knowing how to use SQLite is something useful, I also enjoy doing the "dirty work" myself.
In my usage scenario of possibly a few requests per day I am tending towards the "backup on change" idea, but that would quickly fall apart if, for some reason, the server gets really, really busy.
So, my question basically boils down to this: Which backup method would be the most useful in this scenario, and have I possibly missed another backup strategy? How do you decide on which strategy to use in your applications?
Please note that I raise this question mostly out of a general curiosity for backup strategies and the thoughts behind them, and not because of problems in this special case.
Use sqlite. You're asking about building persistent storage using csv files, and about how to update the files as things change. What you're asking for is a lightweight, portable relational (as in, table based) database. Sqlite is perfect for this situation.
Python has had sqlite support in the standard library since version 2.5 with the sqlite3 module. Since a sqlite database is implemented as a single file, it's simple to move them across machines, and Java has a number of different ways to interact with sqlite.
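A minimal sketch of what the sqlite approach might look like (the file name, table name and the JSON encoding of values are choices made for this example; JSON keeps the stored values readable from Java or any other language):

```python
import json
import sqlite3

conn = sqlite3.connect("state.db")  # a single portable file
conn.execute("CREATE TABLE IF NOT EXISTS state (name TEXT PRIMARY KEY, value TEXT)")

def save_var(name, value):
    conn.execute("INSERT OR REPLACE INTO state VALUES (?, ?)",
                 (name, json.dumps(value)))
    conn.commit()

def load_var(name, default=None):
    row = conn.execute("SELECT value FROM state WHERE name = ?", (name,)).fetchone()
    return json.loads(row[0]) if row else default
```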
I'm all for doing things for the sake of learning, but if you really want to learn about data persistence, I wouldn't tie yourself to the idea of a "CSV database". I would start by looking at the Wikipedia page for Persistence. What you're thinking about is basically a "system image" for your data. The Wikipedia article describes some of the same shortcomings of this approach that you've mentioned:
State changes made to a system after its last image was saved are lost in the case of a system failure or shutdown. Saving an image for every single change would be too time-consuming for most systems.
Rather than trying to update your state wholesale at every change, I think you'd be better off looking at some other form of persistence. For example, some sort of journal could work well. This makes it simple to just append any change to the end of a log-file, or some similar construct.
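A minimal sketch of what such a journal could look like (the file name and the JSON-lines format are assumptions for illustration; replaying the log rebuilds the last known state):

```python
import json
import time

JOURNAL = "changes.log"  # made-up file name

def record_change(variable, new_value):
    """Append one change per line; replaying the file rebuilds the state."""
    entry = {"t": time.time(), "var": variable, "value": new_value}
    with open(JOURNAL, "a") as f:
        f.write(json.dumps(entry) + "\n")

def replay(initial_state=None):
    state = dict(initial_state or {})
    try:
        with open(JOURNAL) as f:
            for line in f:
                entry = json.loads(line)
                state[entry["var"]] = entry["value"]
    except IOError:  # no journal written yet
        pass
    return state
```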
However, if you end up with many concurrent users, with processes running on multiple threads, you'll run into concerns about whether or not your changes are atomic, or whether they conflict with one another. While operating systems generally have some ways of dealing with locking files for edits, you're opening up a can of worms trying to learn how that works and interacts with your system. At this point you're back to needing a database.
So sure, play around with a couple different approaches. But as soon as you're looking to just get it working in a clear and consistent manner, go with sqlite.
If your data is in CSV files, why not use a revision control system on those files? E.g. git would be pretty fast and give excellent history. The repository would be wholly contained in the directory where the files reside, so it's pretty easy to handle. You could also replicate that repository to other machines or directories easily.
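For example, a tiny sketch that snapshots the CSV directory with git via subprocess (the path is made up, and the directory is assumed to already be initialized as a git repository):

```python
import subprocess

DATA_DIR = "/path/to/csv/files"  # made-up path; must already be a git repo

def snapshot(message="automatic backup"):
    subprocess.check_call(["git", "add", "-A"], cwd=DATA_DIR)
    # git exits non-zero when there is nothing to commit, so don't treat that as fatal
    subprocess.call(["git", "commit", "-m", message], cwd=DATA_DIR)
```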
I want to learn about graphics libraries by myself and toy with them a bit. I built a small program that defines lines and shapes as lists of pixels, but I cannot find a way to access the screen directly so that I can display the points on the screen without any intermediary.
What I mean is that I do not want to use any prebuilt graphical library such as gnome, cocoa, etc. I usually use Python to code and my program uses Python, but I can also code and integrate C modules with it.
I am aware that accessing the screen hardware directly takes away the multiplatform side of Python, but I disregard it for the sake of learning. So: is there any way to access the hardware directly in Python, and if so, what is it?
No, Python isn't the best choice for this type of raw hardware access to the video card. I would recommend writing C in DOS. Well, actually, I don't recommend it. It's a horrible thing to do. But, it's how I learned to do it, and it's probably about as friendly as you are going to get for accessing hardware directly without any intermediate.
I say DOS rather than Linux or NT because neither of those will give you direct access to the video hardware without writing a driver. That means having to learn the whole driver API and invoking a lot of "magic" that won't be very obvious, because writing a video driver in Windows NT is fairly complicated.
I say C rather than Python because it gives you real pointers, and the ability to do stupid things with them. Under DOS, you can write to arbitrary physical memory addresses in C, which seems to be what you want. Trying to get Python working at all under an OS terrible enough to allow you direct hardware access would be a frustration in itself, even if you only wanted to do simple stuff that Python is good at.
And, as others have said, don't expect to use anything that you learn with this project in the real world. It may be interesting, but if you tried to write a real application in this way, you'd promptly be shot by whoever would have to maintain your code.
This seems like a great self-learning path, so I'll add my two cents' worth and suggest you consider looking at the GObject, Cairo and Pygame modules at some point.
The Python GObject module may be at a higher level than your current interest, but it enables pixel-level drawing with Cairo (see the home page) as well as providing a general base for portable GUI apps in Python.
Pygame also has pixel-level methods as well as access to the graphics drivers (at a higher level); a quick sketch of the idea is below.
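A minimal example of pixel-level drawing with Pygame (the window size and the list of points drawn are made up for illustration):

```python
import pygame

pygame.init()
screen = pygame.display.set_mode((320, 240))

# Draw a diagonal line one pixel at a time from a list of points
points = [(i, i) for i in range(240)]
for x, y in points:
    screen.set_at((x, y), (255, 255, 255))  # set a single pixel to white
pygame.display.flip()                        # push the frame to the screen

# Keep the window open until it is closed
running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
pygame.quit()
```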
This is rather an old thread now, but I stumbled upon it while musing the same question.
I used to program in assembly language. In my day, drawing on screen was simply(?) a matter of poking a value into a memory location. The value turned a pixel on or off and defined its colour.
The term 'poke' comes from Basic, by the way, not assembler. In assembler, you had to write a value into a data register, then tell the processor where to put the data using another command and an address register, usually in hexadecimal form! And each different processor had its own assembly language. But heck, was the code fast!
As hardware progressed, I found that graphics hardware programming became more and more complex. There's much more to it now than simply defining a pixel. The graphics subsystem has its own processor -- or processors -- and that's what you've got to learn to talk to. The processor doesn't just plonk stuff in memory locations. (I believe that what used to be the fastest supercomputer in the world for a while ran on graphics chips!) 'Plonk' is not a Basic command, by the way.
Sorry; I digress. In answer to the original poster's query, I believe that the goal of understanding the graphics-drawing process could best be achieved by experimenting with a Raspberry Pi. It's Python-compatible and hence perfect for the job. Its hardware is well documented and it's cheap and easy to use.
Hope this helps someone, Cheers, M
I am starting to use Python more. Is there a good way to keep Python's disk access to a minimum?
It seems to me that every time a *.py file runs, it hits the hard disk. Is there a way to avoid hitting the hard disk and keep the *.py file in memory and access it there?
Would creating a small GUI using a wx.Frame, keeping the code in memory, and reusing it work, or is it more pain than benefit?
If you run a .py file from the hard disk, the hard disk will be accessed.
In your GUI, just import your code and it will be loaded once and you can access it later.
Modern operating systems cache file access pretty efficiently, as long as there is enough spare RAM available. You most likely won't notice any difference if you're not loading thousands of Python files at once.
And as always, before trying to optimize one aspect, make sure that it really is the bottleneck. Chances are, your perceived slowness is not due to loading of the .py files.
I think if you took the time to measure how much time it takes to load your python code from disk you would end up with a very, very tiny number unless you are doing something very wrong. And if you are doing something really wrong, solving that problem will be a better use of your time.
Using wxPython to create a GUI to work around what you perceive to be a problem wouldn't likely make any difference.
I am an occasional Python programmer who has only worked so far with MySQL or SQLite databases. I am the computer person for everything in a small company, and I have just started a new project where I think it is about time to try new databases.
The sales department makes a CSV dump every week and I need to make a small scripting application that allows people from other departments to mix the information, mostly linking the records. I have all this solved; my problem is the speed. I am using just plain text files for all this and, unsurprisingly, it is very slow.
I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way of playing with big amounts of data in a decent time.
Update: I think I was not being very detailed about my database usage and thus explained my problem badly. I am reading all the data, ~900 MB or more, from a CSV into a Python dictionary and then working with it. My problem is storing and, mostly, reading the data quickly.
Many thanks!
Quick Summary
You need enough memory (RAM) to solve your problem efficiently, so I think you should upgrade your memory. When reading the excellent High Scalability blog you will notice that big sites, in order to solve their problems efficiently, store the complete problem set in memory.
You do need a central database solution. I don't think doing this by hand with Python dictionaries alone will get the job done.
How to solve "your problem" depends on your "query's". What I would try to do first is put your data in elastic-search(see below) and query the database(see how it performs). I think this is the easiest way to tackle your problem. But as you can read below there are a lot of ways to tackle your problem.
We know:
You use Python as your programming language.
Your database is ~900 MB (I think that's pretty large, but absolutely manageable).
You have loaded all the data into a Python dictionary. This is where I assume the problem lies. Python tries to store the dictionary (and Python dictionaries aren't the most memory-friendly structures) in RAM, but you don't have enough memory (how much do you have?). When that happens you end up using a lot of virtual memory, and when you read from the dictionary you are constantly swapping data between disk and memory. This swapping causes "thrashing". I am assuming that your computer does not have enough RAM; if so, I would first add at least 2 GB of extra RAM. When your problem set fits in memory, solving the problem is going to be a lot faster. My computer architecture book (the chapter on the memory hierarchy) says that main memory access time is about 40-80 ns, while disk access time is about 5 ms. That is a BIG difference.
Missing information
Do you have a central server? You should use/have a server.
What operating system does your server run? Linux/Unix/Windows/Mac OS X? In my opinion your server should run Linux, Unix or Mac OS X.
How much memory does your server have?
Could you describe your data set (the CSV) a little better?
What kind of data mining are you doing? Do you need full-text-search capabilities? I am assuming you are not doing any complicated (SQL-style) queries. Performing that task with only Python dictionaries would be a complicated problem. Could you formalize the queries that you would like to perform? For example:
"get all users who work for departement x"
"get all sales from user x"
Database needed
I am the computer person for everything in a small company and I have just started a new project where I think it is about time to try new databases.
You are certainly right that you need a database to solve your problem. Doing it yourself with only Python dictionaries is difficult, especially when your problem set can't fit in memory.
MySQL
I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way of playing with big amounts of data in a decent time.
A centralized (client-server) database is exactly what you need to solve your problem. Let all the users access the database from one PC which you manage. You can use MySQL to solve your problem.
Tokyo Tyrant
You could also use Tokyo Tyrant to store all your data. Tokyo Tyrant is pretty fast and the data does not have to fit in RAM. It handles retrieving data more efficiently than a Python dictionary. However, if your problem can fit completely in memory, I think you should have a look at Redis (below).
Redis:
You could, for example, use Redis (quick start in 5 minutes) to store all sales in memory. Redis is extremely fast and powerful and can do these kinds of queries insanely fast. The only problem with Redis is that the data set has to fit completely in RAM, but I believe the author is working on that (the nightly build already supports more). Also, as I said before, solving your problem set completely from memory is how big sites handle it in a timely manner.
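As an illustration of the kind of access this gives you, a small sketch with the redis-py client (the key names, fields and the local server are assumptions for this example):

```python
import json
import redis  # redis-py client; assumes a Redis server running on localhost

r = redis.Redis(host="localhost", port=6379, db=0)

# Store each sale as a JSON blob under a made-up key, and index it per user
sale = {"user": "alice", "amount": 250, "date": "2011-03-01"}
r.set("sale:1001", json.dumps(sale))
r.sadd("sales:by_user:alice", "sale:1001")

# "get all sales from user alice"
for key in r.smembers("sales:by_user:alice"):
    print(json.loads(r.get(key)))
```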
Document stores
This article tries to evaluate KV stores against document stores like CouchDB/Riak/MongoDB. These stores are better at searching (a little slower than KV stores), but aren't good at full-text search.
Full-text-search
If you want to do full-text-search queries, you could look at:
elasticsearch (videos): when I saw the video demonstration of Elasticsearch it looked pretty cool. You could try putting (POSTing simple JSON) your data into Elasticsearch and see how fast it is. I am following Elasticsearch on GitHub and the author is committing a lot of new code to it.
solr (tutorial): a lot of big companies are using Solr (GitHub, Digg) to power their search. They got a big boost going from MySQL full-text search to Solr.
You probably do need a full relational DBMS, if not right now, then very soon. If you start now, while your problems and data are simple and straightforward, then when they become complex and difficult you will have plenty of experience with at least one DBMS to help you. You probably don't need MySQL on all desktops; you might install it on a server, for example, and feed data out over your network, but you perhaps need to provide more information about your requirements, toolset and equipment to get better suggestions.
And, while the other DBMSes have their strengths and weaknesses too, there's nothing wrong with MySQL for large and complex databases. I don't know enough about SQLite to comment knowledgeably about it.
EDIT: @Eric, from your comments to my answer and the other answers, I hold even more strongly the view that it is time you moved to a database. I'm not surprised that trying to do database operations on a 900 MB Python dictionary is slow. I think you have to first convince yourself, then your management, that you have reached the limits of what your current toolset can cope with, and that future developments are threatened unless you rethink matters.
If your network really can't support a server-based database, then (a) you really need to make your network robust, reliable and performant enough for such a purpose, but (b) if that is not an option, or not an early option, you should be thinking along the lines of a central database server passing out digests/extracts/reports to other users, rather than a simultaneous, full RDBMS working in a client-server configuration.
The problems you are currently experiencing are problems of not having the right tools for the job. They are only going to get worse. I wish I could suggest a magic way in which this is not the case, but I can't and I don't think anyone else will.
Have you done any benchmarking to confirm that it is the text files that are slowing you down? If you haven't, there's a good chance that tweaking some other part of the code will speed things up enough.
It sounds like each department has their own feudal database, and this implies a lot of unnecessary redundancy and inefficiency.
Instead of transferring hundreds of megabytes to everyone across your network, why not keep your data in MySQL and have the departments upload their data to the database, where it can be normalized and accessible by everyone?
As your organization grows, having completely different departmental databases that are unaware of each other, and contain potentially redundant or conflicting data, is going to become very painful.
Does the machine this process runs on have sufficient memory and bandwidth to handle this efficiently? Putting MySQL on a slow machine and recoding the tool to use MySQL rather than text files could potentially be far more costly than simply adding memory or upgrading the machine.
Here is a performance benchmark of different database suites:
Database Speed Comparison
I'm not sure how objective the above comparison is, though, seeing as it's hosted on sqlite.org. SQLite only seems to be a bit slower when dropping tables; otherwise you shouldn't have any problems using it. Both SQLite and MySQL seem to have their own strengths and weaknesses - in some tests one is faster than the other, and in other tests the reverse is true.
If you've been experiencing lower than expected performance, perhaps it is not sqlite that is causing it. Have you done any profiling or otherwise made sure that nothing else is causing your program to misbehave?
EDIT: Updated with a link to a slightly more recent speed comparison.
It has been a couple of months since I posted this question and I wanted to let you all know how I solved the problem. I am using Berkeley DB via the bsddb module instead of loading all the data into a Python dictionary. I am not fully happy, but my users are.
My next step is trying to get a shared server with Redis, but unless users start complaining about speed, I doubt I will get it.
Many thanks everybody who helped here, and I hope this question and answers are useful to somebody else.
If you have that problem with a CSV file, maybe you can just pickle the dictionary and generate a "binary" pickle file using the pickle.HIGHEST_PROTOCOL option. It can be faster to read and you get a smaller file. You can load the CSV file once, then generate the pickled file, allowing faster loads on subsequent accesses.
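A small sketch of that one-off conversion (the file names and the "id" column are assumptions for illustration):

```python
import csv
import pickle

def csv_to_pickle(csv_path="sales.csv", pkl_path="sales.pkl"):
    """Read the weekly CSV dump once and write a fast-loading pickle file."""
    data = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            data[row["id"]] = row  # assumes an "id" column keys each record
    with open(pkl_path, "wb") as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

def load_pickle(pkl_path="sales.pkl"):
    with open(pkl_path, "rb") as f:
        return pickle.load(f)
```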
Anyway, with 900 MB of information, you're going to spend some time loading it into memory. Another approach is not to load it all in one step, but to load only the information when needed, perhaps splitting it into different files by date or some other category (company, type, etc.).
Take a look at mongodb.