First, sorry for my bad English.
I'm writing a Python script that compares the files in two different directories. For performance, I want to know whether the directories are on the same physical disk or not; if they are on different disks, I can read them simultaneously for a performance gain.
My current idea is to parse the output of the mount command, extract the /dev/sd* device paths, and use them to identify the disks. But sometimes an already mounted directory can be mounted somewhere else again (or something like that, I'm not so sure), so things get complicated.
Is there a better way to do that, like a library?
(If there is a cross-platform way, that would be even more appreciated, but it seems hard to find a cross-platform library for this.)
You are looking for the stat function from Linux, which Python also exposes (see http://docs.python.org/library/os.html#os.stat).
You will have to compare the st_dev field of the resulting structure; both files are on the same filesystem if the values match.
Using this function is as portable as you can get (more portable than mount or df).
Bonus: you do not have to run expensive exec calls or do error-prone text parsing.
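A minimal sketch of that comparison (the example paths are just placeholders):

import os

def same_filesystem(path_a, path_b):
    # Two paths are on the same filesystem if they report the same device id.
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

print(same_filesystem("/home", "/tmp"))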
An easier alternative to using mount might be to invoke df <directory>.
This prints out the filesystem the directory is on. Also, on my Ubuntu box, passing -P to df makes the output a little bit easier to parse.
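A rough sketch of that approach (assuming a POSIX-style df, where the second output line starts with the filesystem/device name):

import subprocess

def filesystem_of(path):
    # "df -P <path>" prints a header line plus one data line;
    # the first column of the data line names the filesystem/device.
    output = subprocess.check_output(["df", "-P", path], text=True)
    return output.splitlines()[1].split()[0]

print(filesystem_of("/tmp"))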
EDIT: I rewrote and shortened the question because it seemed unclear!
I am currently writing a Python program, and some of its functionality is improved if it has access to some specific non-Python files (I am talking about datasets here).
The problem is that I don't have the right to share those files myself, and they are also very big, which makes shipping them along with my library impractical.
The user has two possibilities:
Using the program stand-alone, with only basic features
Request the files from the owner(s), download them, then use the advanced features
In that second case, I would like to know the best practice for asking the user: "If you have these files on your machine, tell me where, so I can enable the advanced functionality from now on."
I can see several ways to achieve this, but none of them seems very clean or Pythonic (e.g. asking the user to copy the files to a specific place within the library, or asking the user to write the paths into a config file, etc.).
What is a good way of doing this? Or at least, is there a standard way?
Use pathlib to handle the different conventions and make paths OS-independent. Usually, I tend to keep my paths in a separate file, like config/paths.py. Always use relative paths.
from pathlib import Path

class Paths:
    A: Path = Path('data/foo.csv')
    B: Path = Path('data/foo2.csv')
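The rest of the code can then refer to those names, and the optional features can be switched on only when the files are actually present. A minimal sketch (the function name is just an illustration):

def advanced_features_available() -> bool:
    # Only enable the dataset-backed features if both files exist.
    return Paths.A.exists() and Paths.B.exists()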
The task:
I am working with 4 TB of data/files, stored on an external USB disk: images, HTML, videos, executables and so on.
I want to index all those files in a sqlite3 database with the following schema:
path TEXT, mimetype TEXT, filetype TEXT, size INT
So far:
I walk recursively through the mounted directory with os.walk, run the Linux file command via Python's subprocess, and get the size with os.path.getsize(). Finally, the results are written into the database, which is stored on my computer; the USB disk is mounted with -o ro, of course. No threading, by the way.
You can see the full code here http://hub.darcs.net/ampoffcom/smtid/browse/smtid.py
The problem:
The code is really slow. I realized that the deeper the directory structure, the slower the code. I suppose os.walk might be the problem.
The questions:
Is there a faster alternative to os.walk?
Would threading speed things up?
Is there a faster alternative to os.walk?
Yes. In fact, multiple.
scandir (which will be in the stdlib in 3.5) is significantly faster than walk; see the sketch below.
The C function fts is significantly faster than scandir. I'm pretty sure there are wrappers on PyPI, although I don't know of one off-hand to recommend, and it's not that hard to use via ctypes or cffi if you know any C.
The find tool uses fts, and you can always subprocess to it if you can't use fts directly.
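As a rough illustration of the scandir option, assuming Python 3.5+ where os.scandir is in the stdlib (on older versions the scandir package from PyPI offers the same API):

import os

def iter_files(root):
    # Yield (path, size) for every regular file below root; scandir gets the
    # entry type and size from the directory entry where possible, saving stat calls.
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            yield from iter_files(entry.path)
        elif entry.is_file(follow_symlinks=False):
            yield entry.path, entry.stat(follow_symlinks=False).st_size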
Would threading speed things up?
That depends on details of your system that we don't have, but… You're spending all of your time waiting on the filesystem. Unless you have multiple independent drives that are only bound together at user level (that is, not by LVM or something below it like RAID) or not at all (e.g., one is just mounted under the other's filesystem), issuing multiple requests in parallel will probably not speed things up.
Still, this is pretty easy to test; why not try it and see?
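A quick way to run that test, sketched with concurrent.futures around the same file call the script already makes (GNU file's --brief/--mime-type flags assumed):

import subprocess
from concurrent.futures import ThreadPoolExecutor

def mime_of(path):
    # One "file" call per path, exactly like the sequential version.
    return path, subprocess.check_output(
        ["file", "--brief", "--mime-type", path], text=True).strip()

def mime_all(paths, workers=4):
    # Compare the wall-clock time of this against a plain loop over mime_of.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(mime_of, paths))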
One more idea: you may be spending a lot of time spawning and communicating with those file processes. There are multiple Python libraries that use the same libmagic that file does. I don't want to recommend one in particular over the others, so here are the search results.
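For example, with the python-magic package (one of those libmagic bindings; the API differs between the similarly named packages, so treat this as a sketch):

import magic

mime_detector = magic.Magic(mime=True)  # one libmagic handle, reused for every file
desc_detector = magic.Magic()

def describe(path):
    # Returns (mimetype, human-readable filetype) without spawning a process.
    return mime_detector.from_file(path), desc_detector.from_file(path)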
As monkut suggests, make sure you're doing bulk commits, not autocommitting each insert with sqlite. As the FAQ explains, sqlite can do ~50000 inserts per second, but only a few dozen transactions per second.
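A rough sketch of batching the inserts into a single transaction (the table name and batch handling are just illustrative):

import sqlite3

conn = sqlite3.connect("index.db")
conn.execute("CREATE TABLE IF NOT EXISTS files "
             "(path TEXT, mimetype TEXT, filetype TEXT, size INT)")

rows = []  # accumulate (path, mimetype, filetype, size) tuples while walking
# ... fill rows ...
with conn:  # the whole batch is committed as one transaction
    conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", rows)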
While we're at it, if you can put the sqlite file on a different filesystem than the one you're scanning (or keep it in memory until you're done, then write it to disk all at once), that might be worth trying.
Finally, but most importantly:
Profile your code to see where the hotspots are, instead of guessing.
Create small data sets and benchmark different alternatives to see how much benefit you get.
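For the profiling step, cProfile from the stdlib is enough to see where the time goes (assuming the script exposes a main() entry point; the names here are hypothetical):

import cProfile
import pstats

cProfile.run("main()", "smtid.prof")  # hypothetical entry point of the script
pstats.Stats("smtid.prof").sort_stats("cumulative").print_stats(20)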
I just started using git to get the code I write for my Master's thesis more organized. I have divided the tasks into 4 sub-folders, each one containing data and the programs that work with that data. The 4 sub-projects do not necessarily need to be connected; none of the programs uses functions from the other sub-projects. However, the output files produced by the programs in one sub-folder are used by programs in another sub-folder.
In addition some programs are written in Bash and some in Python.
I use git in combination with Bitbucket. I am really new to the whole concept, so I wonder whether I should create one "Master-thesis" repository or rather one repository for each of the (so far) 4 sub-projects. Thank you for your help!
Well, as devnull says, answers would be highly opinion based, but given that I disagree that that's a bad thing, I'll go ahead and answer if I can type before someone closes the question. :)
I'm always inclined to treat git repositories as separate units of work or projects. If I'm likely to work on various parts of something as a single project or toward a common goal (e.g., Master's thesis), my tendency would be to treat it as a single repository.
And by the way, since the .git repository will be in the root of that single repository, if you need to spin off a piece of your work later and track it separately, you can always create a new repository if needed at that point. Meantime it seems "keep it simple" would mean one repo.
I recommend a single master repository for this problem. You mentioned that the output files of certain programs are used as input to the others. These programs may not have run-time dependencies on each other, but they do have dependencies. It sounds like they will not work without each other being present to create the data. Especially if file location (e.g. relative path) is important, then a single repository will help you keep them better organized.
Is it a good idea to use:
import os.path
os.path.exists(file_path)
to "protect" a program against copies?
For example, in our main application, we use:
import os.path
os.path.exists ("c:\windows\mifile.dll")
where mifile.dll can be anything, of course with another name like windriv.dll, and is really just a simple text file saved with Notepad.
If the file exists, the program works; if not, it displays a warning message that it's an illegal copy, or something similar.
When installing the program, I do the normal installation of the package or the portable folder and manually copy the file mifile.dll into c:\windows.
This isn't the best idea.
A lot of people (such as myself) and possibly antivirus programs watch the Windows directory and would delete something like this.
A check like this would be better off encrypted.
Catching an import error isn't the easiest thing.
If you are worried about illegal copies, it wouldn't be long until somebody figured this out, and then you'd have a file that can be copied and distributed easily.
Using an import and erroring out would be a huge red flag to a reverse engineer.
With UAC, this file might not be accessible without running the program as an administrator.
No.
Whichever solution you end up with, the general idea of a "secret handshake install technique" is basically sabotage. You are effectively preventing your customers from:
Upgrading the OS of the machine
Restoring their system from a backup
Moving your service to a new machine because of hardware failure
The customer will need to do at least one of these within the next few years. When they do, your program will break, and they will not know why or how to fix it. Assuming you are even still available to them at that point, think of how this makes you look when they contact you to fix the issue.
If I found out that a subcontractor had secretly introduced themselves as a single point of failure like this, I would be bloody furious.
Either trust your customers, get new customers that you can trust, or go for a fully professional non-secret DRM solution.
I have an application written in Python. I created a plugin system for the application that uses egg files. Egg files contain compiled Python files and can easily be decompiled and used to hack the application. Is there a way to secure this system? I'd like to use digital signatures for this: sign these egg files and check the signature before loading such an egg file. Is there a way to do this programmatically from Python? Maybe using the winapi?
Is there a way to secure this system?
The answer is "that depends".
The two questions you should ask are "what are people supposed to be able to do" and "what are people actually able to do (with a given implementation)". If there exists an implementation where the latter is a subset of the former, the system can be secured.
A friend of mine is working on a programming competition judge: a program which runs a user-submitted program on some test data and compares its output to a reference output. That's damn hard to secure: you want to run other people's code, but you don't want to let them run arbitrary code. Is your scenario somewhat similar to this? Then the answer is "it's difficult".
Do you want users to download untrustworthy code from the web and run it with some assurance that it won't hose their machine? Then look at various web languages. One solution is not offering access to system calls (JavaScript) or offering limited access to certain potentially dangerous calls (Java's SecurityManager). Neither of these can be done in Python as far as I'm aware, but you can always hack the interpreter and disallow the loading of external modules not on some whitelist. This is probably error-prone.
Do you want users to write plugins, and not be able to tinker with what the main body of code in your application does? Consider that users can decompile .pyc files and modify them. Assume that those running your code can always modify it, and consider the gold-farming bots for WoW.
One Linux-only solution, similar to the sandboxed web-ish model, is to use AppArmor, which limits which files your app can access and which system calls it can make. This might be a feasible solution, but I don't know much about it so I can't give you advice other than "investigate".
If all you worry about is evil people modifying code while it's in transit in the intertubes, standard cryptographic solutions exist (SSL). If you want to only load signed plugins (because you want to control what the users do?), signing code sounds like the right solution (but beware of crafty users or evil people who edit the .pyc files and disable the is-it-signed check).
Maybe some crypto library like this http://chandlerproject.org/Projects/MeTooCrypto helps to build an ad-hoc solution. Example usage: http://tdilshod.livejournal.com/38040.html
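A rough sketch of the verify-before-load idea using the cryptography package instead (key distribution and the actual egg loading are left out; the file names are illustrative):

from pathlib import Path
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

def egg_signature_is_valid(egg_path, sig_path, public_key_path):
    # Verify a detached RSA signature of the egg before importing anything from it.
    public_key = serialization.load_pem_public_key(Path(public_key_path).read_bytes())
    try:
        public_key.verify(
            Path(sig_path).read_bytes(),  # detached signature shipped next to the egg
            Path(egg_path).read_bytes(),  # the egg file itself
            padding.PKCS1v15(),
            hashes.SHA256(),
        )
        return True
    except InvalidSignature:
        return False

As noted above, a determined user can still patch the check itself out of the .pyc files, so this mainly protects against plugins being tampered with in transit.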