Automatically detect non-deterministic behaviour in Python - python

This may be impossible, but I am just wondering if there are any tools to help detect non-deterministic behaviour when I run a Python script. Some fancy options in a debugger perhaps? I guess I am imagining that theoretically it might be possible to compare the stack instruction-by-instruction or something between two subsequent runs of the same code, and thus pick out where any divergence begins.
I realise that a lot is going on under the hood though, so that this might be far too difficult to ask of a debugger or any tool...
Essentially my problem is that I have a test failing occasionally, almost certainly because somewhere the code relies accidentally on the ordering of output from iterating over a dictionary, or some such thing where the ordering isn't actually guaranteed. I'd just like a tool to help me locate the source of this sort of problem :).
The following question is similar, but there was not much suggestion of how to really deal with this in an automated or general way: Testing for non-deterministic behavior of python function

I'm not aware of a way to do this automatically, but what I would recommend is running the test automatically (overnight?) until you get a failure, and starting a debugger at that point. You can then examine the variables and see if anything stands out. If you're using pytest, running with the --pdb flag will start a debugger on failure.
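For example, a minimal sketch of that "rerun until it fails" loop, assuming pytest is installed and your tests live under tests/:

import subprocess

attempt = 0
while True:
    attempt += 1
    # -x stops at the first failure; --pdb drops you into the debugger on that failing run
    result = subprocess.run(["pytest", "-x", "--pdb", "tests/"])
    if result.returncode != 0:
        print("Reproduced a failure on attempt", attempt)
        break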
You might also consider using Hypothesis to run generative test cases.
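As a rough sketch of what a Hypothesis property test could look like here (summarise and mymodule are hypothetical stand-ins for the code under test):

from hypothesis import given, strategies as st
from mymodule import summarise  # hypothetical function that should not depend on dict order

@given(st.dictionaries(st.text(), st.integers()))
def test_summarise_ignores_dict_order(d):
    # Re-inserting the items in reverse order changes only the iteration order;
    # the result should be identical if the code is genuinely order-independent.
    reordered = dict(reversed(list(d.items())))
    assert summarise(d) == summarise(reordered)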
You might also consider running the tests over and over, collecting the output of each run (success or failure). When you have a representative sample, compare a passing run against a failing one, paying particular attention to the ordering of the tests that were run!
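A rough sketch of that comparison (purely illustrative):

import difflib
import subprocess

runs = []
for _ in range(20):
    r = subprocess.run(["pytest", "-v"], capture_output=True, text=True)
    runs.append((r.returncode, r.stdout))

passing = [out for code, out in runs if code == 0]
failing = [out for code, out in runs if code != 0]
if passing and failing:
    # Differences in the order of collected/executed tests often point at the culprit.
    diff = difflib.unified_diff(passing[0].splitlines(), failing[0].splitlines(), lineterm="")
    print("\n".join(diff))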

I'm fairly certain this is fundamentally impossible with our current understanding of computation and automata theory. I could be wrong though.
Anyone more directly knowledgeable (rigorous background) feel free to pipe in; most of what follows is self-taught and heavily based on professional observations over the last decade doing Systems Engineering/SRE/automation.
Modern day computers are an implementation of Automata Theory and Computation Theory. At the signal level, they require certain properties to do work.
Signal Step, Determinism and Time Invariance are a few of those required properties.
Deterministic behavior, and deterministic properties rely on there being a unique solution, or a 1:1 mapping of directed nodes (instructions & data context) on the state graph from the current state to the next state. There is only one unique path. Most of this is implemented and hidden away at low level abstractions (i.e. the Signal, Firmware, and kernel/shell level).
Non-deterministic behavior is the absence of the determinism signal property. This can be introduced into any interconnected system due to a large range of issues (e.g. hardware failures, strong EM fields, cosmic rays, and even poor programming between interfaces).
Anytime determinism breaks, computers are unable to do useful work; the scope may be limited depending on where it happens. Usually it will either be caught as an error and the program or shell will halt, or it may continue running indefinitely or produce bogus data, both because of the class of problem it turns into and because of the fundamental limitations on the types of problems Turing machines (i.e. computers) can solve.
Please bear in mind, I am not a Computer Science major, nor do I hold a degree in Computer Engineering or a related IT field. I'm self-taught, no degree.
Most of this explanation has been driven by years of automation, segmenting problem domains, design, and seeking a more generalized solution to many of the issues I ran into, mostly to make better use of my time (hence this non-rigorous explanation).
The class of non-deterministic behavior is the most costly type of error I've run into, because this behavior is the absence of the expected. There isn't a test for non-determinism as a set or group; you can only infer it by the absence of properties which you can test (at least interactively).
Normal computer behavior is emergent from the previously mentioned required signals-and-systems properties, and we see problems when they don't hold true; we can't quickly validate the system for non-determinism because of its nature.
Interestingly, testing for the presence of those properties interactively is a useful shortcut: if the properties are not present, the problem falls into this class of troubles, which we as human beings can solve but computers cannot. It can only effectively be done by humans, because you run into the halting problem and other more theoretical aspects which I didn't bother understanding during my independent studies.
Unfortunately, knowing how to test for these properties does often require a knowledgeable view of the systems and architecture being tested, spanning most abstraction layers (depending on where the problem originates).
More formal or rigorous material may cover this as NFAs versus DFAs, with more complex vocabularies: nondeterministic versus deterministic finite automata.
The difference is basically the presence or absence of that 1:1 state map/path, which is what defines determinism.
Where most people trip up with this property in programming is between interfaces, where the interface fails to preserve data (and this property by extension), such as accidentally using the empty or NULL state of an output field to mean more than one thing before it gets passed to another program.
A theoretical view of a shell program running a series of piped commands might look like this:
DFA -> OutInterface -> DFA -> OutInterface -> NFA -> crash/meaningless or unexpected data/infinite loop, etc. Depending on the code that comes after the NFA, the behavior varies unpredictably in indeterminable ways. (OutInterface being the pipe, '|', at the shell.)
For an actual example in the wild, ldd on recent versions of Linux had two such errors that injected non-determinism into the pipe. Trying to identify linked dependencies for an arbitrary binary, for use with a build system, was not possible using ldd because of this issue.
More specifically, the problems were in the in-memory structures, and then also in the flattening of the output fields in a non-deterministic way that varies across different binaries.
Most of the material mentioned above is normally covered in an undergraduate compiler design course; one can also find it in the Dragon compiler book, which is what I did instead. It does require a decent background in math fundamentals (e.g. Abstract Algebra/Linear Algebra) to grok the basis and examples, and the properties are best described in Oppenheim's Signals and Systems.
Without knowing how to test whether certain system properties hold true, you can easily waste months of labor trying to document and/or narrow the issue down. All you really have in those non-deterministic cases is a guess-and-check model/strategy, which becomes very expensive, especially if you don't realize it's an underlying systems-property issue.

Related

Is there any way to encrypt PYTHON output file like exe in C? [duplicate]

I've been contemplating how to protect my C/C++ code from disassembly and reverse engineering. Normally I would never condone this behavior myself in my code; however the current protocol I've been working on must not ever be inspected or understandable, for the security of various people.
Now this is a new subject to me, and the internet is not really resourceful for prevention against reverse engineering but rather depicts tons of information on how to reverse engineer
Some of the things I've thought of so far are:
Code injection (calling dummy functions before and after actual function calls)
Code obfuscation (mangles the disassembly of the binary)
Write my own startup routines (harder for debuggers to bind to)
#include <stdlib.h>

void startup();

int _start()
{
    startup();
    exit(0);
}

void startup()
{
    /* code here */
}
Runtime check for debuggers (and force exit if detected)
Function trampolines
void trampoline(void (*fnptr)(), bool ping = false)
{
    if (ping)
        fnptr();
    else
        trampoline(fnptr, true);
}
Pointless allocations and deallocations (stack changes a lot)
Pointless dummy calls and trampolines (tons of jumping in disassembly output)
Tons of casting (for obfuscated disassembly)
I mean, these are some of the things I've thought of, but they can all be worked around and/or figured out by code analysts given the right time frame. Is there any other alternative I have?
but they can all be worked around and/or figured out by code analysts given the right time frame.
If you give people a program that they are able to run, then they will also be able to reverse-engineer it given enough time. That is the nature of programs. As soon as the binary is available to someone who wants to decipher it, you cannot prevent eventual reverse-engineering. After all, the computer has to be able to decipher it in order to run it, and a human is simply a slower computer.
What Amber said is exactly right. You can make reverse engineering harder, but you can never prevent it. You should never trust "security" that relies on the prevention of reverse engineering.
That said, the best anti-reverse-engineering techniques that I've seen focused not on obfuscating the code, but instead on breaking the tools that people usually use to understand how code works. Finding creative ways to break disassemblers, debuggers, etc is both likely to be more effective and also more intellectually satisfying than just generating reams of horrible spaghetti code. This does nothing to block a determined attacker, but it does increase the likelihood that J Random Cracker will wander off and work on something easier instead.
SafeNet Sentinel (formerly Aladdin). Caveats though - their API sucks, documentation sucks, and both of those are great in comparison to their SDK tools.
I've used their hardware protection method (Sentinel HASP HL) for many years. It requires a proprietary USB key fob which acts as the 'license' for the software. Their SDK encrypts and obfuscates your executable & libraries, and allows you to tie different features in your application to features burned into the key. Without a USB key provided and activated by the licensor, the software can not decrypt and hence will not run. The Key even uses a customized USB communication protocol (outside my realm of knowledge, I'm not a device driver guy) to make it difficult to build a virtual key, or tamper with the communication between the runtime wrapper and key. Their SDK is not very developer friendly, and is quite painful to integrate adding protection with an automated build process (but possible).
Before we implemented the HASP HL protection, there were 7 known pirates who had stripped the dotfuscator 'protections' from the product. We added the HASP protection at the same time as a major update to the software, which performs some heavy calculation on video in real time. As best I can tell from profiling and benchmarking, the HASP HL protection only slowed the intensive calculations by about 3%. Since that software was released about 5 years ago, not one new pirate of the product has been found. The software which it protects is in high demand in its market segment, and the client is aware of several competitors actively trying to reverse engineer (without success so far). We know they have tried to solicit help from a few groups in Russia which advertise a service to break software protection, as numerous posts on various newsgroups and forums have included the newer versions of the protected product.
Recently we tried their software license solution (HASP SL) on a smaller project, which was straightforward enough to get working if you're already familiar with the HL product. It appears to work; there have been no reported piracy incidents, but this product is a lot lower in demand.
Of course, no protection can be perfect. If someone is sufficiently motivated and has serious cash to burn, I'm sure the protections afforded by HASP could be circumvented.
Making code difficult to reverse-engineer is called code obfuscation.
Most of the techniques you mention are fairly easy to work around. They center on adding some useless code. But useless code is easy to detect and remove, leaving you with a clean program.
For effective obfuscation, you need to make the behavior of your program dependent on the useless bits being executed. For example, rather than doing this:
a = useless_computation();
a = 42;
do this:
a = complicated_computation_that_uses_many_inputs_but_always_returns_42();
Or instead of doing this:
if (running_under_a_debugger()) abort();
a = 42;
Do this (where running_under_a_debugger should not be easily identifiable as a function that tests whether the code is running under a debugger — it should mix useful computations with debugger detection):
a = 42 - running_under_a_debugger();
Effective obfuscation isn't something you can do purely at the compilation stage. Whatever the compiler can do, a decompiler can do. Sure, you can increase the burden on the decompilers, but it's not going to go far. Effective obfuscation techniques, inasmuch as they exist, involve writing obfuscated source from day 1. Make your code self-modifying. Litter your code with computed jumps, derived from a large number of inputs. For example, instead of a simple call
some_function();
do this, where you happen to know the exact expected layout of the bits in some_data_structure:
goto (md5sum(&some_data_structure, 42) & 0xffffffff) + MAGIC_CONSTANT;
If you're serious about obfuscation, add several months to your planning; obfuscation doesn't come cheap. And do consider that by far the best way to avoid people reverse-engineering your code is to make it useless so that they don't bother. It's a simple economic consideration: they will reverse-engineer if the value to them is greater than the cost; but raising their cost also raises your cost a lot, so try lowering the value to them.
Now that I've told you that obfuscation is hard and expensive, I'm going to tell you it's not for you anyway. You write
current protocol I've been working on must not ever be inspected or understandable, for the security of various people
That raises a red flag. It's security by obscurity, which has a very poor record. If the security of the protocol depends on people not knowing the protocol, you've lost already.
Recommended reading:
The security bible: Security Engineering by Ross Anderson
The obfuscation bible: Surreptitious software by Christian Collberg and Jasvir Nagra
Take, for example, the AES algorithm. It's a very, very public algorithm, and it is VERY secure. Why? Two reasons: It's been reviewed by lots of smart people, and the "secret" part is not the algorithm itself - the secret part is the key which is one of the inputs to the algorithm. It's a much better approach to design your protocol with a generated "secret" that is outside your code, rather than to make the code itself secret. The code can always be interpreted no matter what you do, and (ideally) the generated secret can only be jeopardized by a massive brute force approach or through theft.
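A rough illustration of keeping the secret outside the code, using Python's cryptography package purely as an example (the environment variable name is made up; the key would be generated once with Fernet.generate_key() and distributed separately, never embedded in the shipped program):

import os
from cryptography.fernet import Fernet

# The algorithm (AES-based Fernet) is public; only the key is secret,
# and it lives outside the program.
key = os.environ["APP_SECRET_KEY"]
token = Fernet(key).encrypt(b"protocol message")
print(Fernet(key).decrypt(token))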
I think an interesting question is "Why do you want to obfuscate your code?" You want to make it hard for attackers to crack your algorithms? To make it harder for them to find exploitable bugs in your code? You wouldn't need to obfuscate code if the code were uncrackable in the first place. The root of the problem is crackable software. Fix the root of your problem, don't just obfuscate it.
Also, the more confusing you make your code, the harder it will be for YOU to find security bugs. Yes, it will be hard for hackers, but you need to find bugs too. Code should be easy to maintain years from now, and even well-written clear code can be difficult to maintain. Don't make it worse.
The best anti-disassembler tricks, in particular on variable-word-length instruction sets, are in assembler/machine code, not C. For example:
CLC
BCC over
.byte 0x09
over:
The disassembler has to resolve the problem that a branch destination is the second byte in a multi-byte instruction. An instruction set simulator will have no problem, though. Branching to computed addresses, which you can cause from C, also makes the disassembly difficult to impossible. An instruction set simulator will have no problem with it, and using a simulator to sort out branch destinations for you can aid the disassembly process. Compiled code is relatively clean and easy for a disassembler, so I think some assembly is required.
I think it was near the beginning of Michael Abrash's Zen of Assembly Language where he showed a simple anti-disassembler and anti-debugger trick. The 8088/86 had a prefetch queue; what you did was have an instruction modify the next instruction, or one a couple ahead. If single-stepping, you executed the modified instruction; likewise, if your instruction set simulator did not simulate the hardware completely, you executed the modified instruction. On real hardware running normally, the real instruction would already be in the queue, and the modified memory location wouldn't cause any damage so long as you didn't execute that string of instructions again. You could probably still use a trick like this today, as pipelined processors fetch the next instruction. Or, if you know that the hardware has separate instruction and data caches, you can modify a number of bytes ahead: if you align this code in the cache line properly, the modified byte will be written through the data cache but not the instruction cache, and an instruction set simulator without a proper cache simulator would fail to execute properly. I think software-only solutions are not going to get you very far.
The above are old and well known; I don't know enough about the current tools to know if they already work around such things. The self-modifying code can/will trip up the debugger, but the human can/will narrow in on the problem, see the self-modifying code, and work around it.
It used to be that the hackers would take about 18 months to work something out, DVDs for example. Now they are averaging around 2 days to 2 weeks (if motivated) (Blu-ray, iPhones, etc.). That means to me that if I spend more than a few days on security, I am likely wasting my time. The only real security you will get is through hardware (for example, your instructions are encrypted and only the processor core well inside the chip decrypts them just before execution, in a way that cannot expose the decrypted instructions). That might buy you months instead of days.
Also, read Kevin Mitnick's book The Art of Deception. A person like that could pick up a phone and have you or a coworker hand out the secrets to the system thinking it is a manager or another coworker or hardware engineer in another part of the company. And your security is blown. Security is not all about managing the technology, gotta manage the humans too.
Many times, the fear of your product getting reverse engineered is misplaced. Yes, it can get reverse engineered; but will it become so famous over a short period of time that hackers will find it worth reverse engineering? (This job is not a small-time activity for a substantial amount of code.)
If it really becomes a money earner, then you should have gathered enough money to protect it using legal means such as patents and/or copyrights.
IMHO, take the basic precautions you are going to take and release it. If it becomes a target of reverse engineering, that means you have done a really good job, and you yourself will find better ways to overcome it. Good luck.
Take a read of http://en.wikipedia.org/wiki/Security_by_obscurity#Arguments_against. I'm sure others could probably also give better sources of why security by obscurity is a bad thing.
It should be entirely possible, using modern cryptographic techniques, to have your system be open (I'm not saying it should be open, just that it could be), and still have total security, so long as the cryptographic algorithm doesn't have a hole in it (not likely if you choose a good one), your private keys/passwords remain private, and you don't have security holes in your code (this is what you should be worrying about).
Since July 2013, there has been renewed interest in cryptographically robust obfuscation (in the form of Indistinguishability Obfuscation), which seems to have been spurred by original research from Amit Sahai.
Sahai, Garg, Gentry, Halevi, Raykova, Waters, Candidate Indistinguishability Obfuscation and Functional Encryption for all circuits (July 21, 2013).
Sahai, Waters, How to Use Indistinguishability Obfuscation: Deniable Encryption, and More.
Sahai, Barak, Garg, Kalai, Paneth, Protecting Obfuscation Against Algebraic Attacks (February 4, 2014).
You can find some distilled information in this Quanta Magazine article and in that IEEE Spectrum article.
Currently the amount of resources required to make use of this technique make it impractical, but AFAICT the consensus is rather optimistic about the future.
I say this very casually, but to everyone who's used to instinctively dismissing obfuscation technology -- this is different. If it's proven to be truly working and made practical, this is major indeed, and not just for obfuscation.
To inform yourself, read the academic literature on code obfuscation. Christian Collberg of the University of Arizona is a reputable scholar in this field; Salil Vadhan of Harvard University has also done some good work.
I'm behind on this literature, but the essential idea I'm aware of is that you can't prevent an attacker from seeing the code that you will execute, but you can surround it with code that is not executed, and it costs an attacker exponential time (using best known techniques) to discover which fragments of your code are executed and which are not.
If someone wants to spend the time to reverse your binary then there is absolutely nothing you can do to stop them. You can make it moderately more difficult, but that's about it. If you really want to learn about this then get a copy of http://www.hex-rays.com/idapro/ and disassemble a few binaries.
The fact that the CPU needs to execute the code is your undoing. The CPU only executes machine code... and programmers can read machine code.
That being said... you likely have a different issue which can be solved another way. What are you trying to protect? Depending on your issue you can likely use encryption to protect your product.
To be able to select the right option, You should think of the following aspects:
Is it likely that "new users" do not want to pay but use Your software?
Is it likely that existing customers need more licences than they have?
How much are potential users willing to pay?
Do You want to give licences per user / concurrent users / workstation / company?
Does Your software need training / customization to be useful?
If the answer to question 5 is "yes", then do not worry about illegal copies. They wouldn't be useful anyway.
If the answer to question 1 is "yes", then first think about pricing (see question 3).
If You answer question 2 with "yes", then a "pay per use" model might be appropriate for You.
From my experience, pay per use + customization and training is the best protection for Your software, because:
New users are attracted by the pricing model (little use -> little pay)
There are almost no "anonymous users", because they need training and customization.
There are no software restrictions to scare potential customers away.
There is a continuous stream of money from existing customers.
You get valuable feedback for development from Your customers, because of a long-term business relationship.
Before You think of introducing DRM or obfuscation, You might think of these points and if they are applicable to Your software.
If you are really serious about protecting your application, there is a recent paper called "Program obfuscation and one-time programs". The paper in general gets around the theoretical impossibility results by the use of simple and universal hardware.
If you can't afford to require extra hardware, then there is also another paper that gives the theoretically best-possible obfuscation, "On best-possible obfuscation", amongst all programs with the same functionality and the same size. However, the paper shows that information-theoretic best-possible obfuscation implies a collapse of the polynomial hierarchy.
Those papers should at least give you sufficient bibliographical leads to walk into the related literature if these results do not work for your needs.
Update: A new notion of obfuscation, called indistinguishability obfuscation, can mitigate the impossibility result (paper).
Protected code in a virtual machine seemed impossible to reverse engineer at first (e.g. Themida Packer).
But it's not that secure anymore. No matter how you pack your code, you can always do a memory dump of any loaded executable and disassemble it with a disassembler like IDA Pro.
IDA Pro also comes with a nifty assembly-code-to-C-source-code transformer, although the generated code will look more like a pointer/address mathematical mess. If you compare it with the original, you can fix all errors and rip anything out.
No dice, you cannot protect your code from disassembly. What you can do is set up a server for the business logic and use a web service to provide it to your app. Of course, this scenario is not always possible.
To avoid reverse engineering, you must not give the code to users. That said, I recommend using an online application... however (since you gave no context) that could be pointless in your case.
Possibly your best alternative is still using virtualization, which introduces another level of indirection/obfuscation that needs to be bypassed, but as SSpoke said in his answer, this technique is also not 100% secure.
The point is you won't get ultimate protection, because there is no such thing, and if there ever were, it wouldn't last long, which means it wasn't ultimate protection in the first place.
Whatever man assembles can be disassembled.
It's usually true that (proper) disassembling is often a (somewhat or much) harder task, so your opponent must be more skilled, but you can assume that there is always someone of that quality, and it's a safe bet.
If you want to protect something against REs, you must know at least common techniques used by REs.
Thus the words
internet is not really resourceful for prevention against reverse engineering but rather depicts tons of information on how to reverse engineer
show a bad attitude on your part. I'm not saying that to use or embed protection you must know how to break it, but to use it wisely you should know its weaknesses and pitfalls. You should understand it.
(There are examples of software using protection in a wrong way, making such protection practically nonexistent. To avoid speaking vaguely I'll give you an example briefly described on the internet: Oxford English Dictionary Second Edition on CD-ROM v4. You can read about its failed use of SecuROM on the following page: Oxford English Dictionary (OED) on CD-ROM in a 16-, 32-, or 64-bit Windows environment: Hard-disk installation, bugs, word processing macros, networking, fonts, and so forth)
Everything takes time.
If you're new to the subject and don't have months or rather years to get properly into RE stuff, then go with available solutions made by others. The problem here is obvious: they are already out there, so you already know they're not 100% secure, but making your own new protection would give you only a false sense of being protected, unless you know the state of the art in reverse engineering and protection really well (but you don't, at least at this moment).
The point of software protection is to scare newbies, stall common REs, and put a smile on the face of seasoned RE after her/his (hopefully interesting) journey to the center of your application.
In business talk you may say it's all about delaying competition, as much as it is possible.
(Have a look at the nice presentation Silver Needle in the Skype by Philippe Biondi and Fabrice Desclaux, shown at Black Hat 2006.)
You're aware that there is a lot of stuff about RE out there, so start reading it. :)
I mentioned virtualization, so I'll give you a link to one exemplary thread from the EXETOOLS FORUM: Best software protector: Themida or Enigma Protector?. It may help you a bit in further searches.
Contrary to what most people say, based on their intuition and personal experience, I don't think cryptographically-safe program obfuscation is proven to be impossible in general.
This is one example of a perfectly obfuscated program statement to demonstrate my point:
printf("1677741794\n");
One can never guess that what it really does is
printf("%d\n", 0xBAADF00D ^ 0xDEADBEEF);
There is an interesting paper on this subject, which proves some impossibility results. It is called "On the (Im)possibility of Obfuscating Programs".
Although the paper does prove that the obfuscation making the program non-distinguishable from the function it implements is impossible, obfuscation defined in some weaker way may still be possible!
I do not think that any code is unhackable but the rewards need to be great for someone to want to attempt it.
Having said that, there are things you should do, such as:
Use the highest optimization level possible (reverse engineering is not only about getting the assembly sequence, it is also about understanding the code and porting it into a higher-level language such as C). Highly optimized code can be a b---h to follow.
Make structures dense by not having larger data types than necessary. Rearrange structure members between official code releases. Rearranged bit fields in structures are also something you can use.
You can check for the presence of certain values which shouldn't be changed (a copyright message is an example). If a byte vector contains "vwxyz" you can have another byte vector containing "abcde" and compare the differences. The function doing it should not be passed pointers to the vectors but use external pointers defined in other modules as (pseudo-C code) "char *p1=&string1[539];" and "char *p2=&string2[-11731];". That way there won't be any pointers pointing exactly at the two strings. In the comparison code you then compare for "*(p1-539+i) - *(p2+11731+i) == some value". The cracker will think it is safe to change string1 because no one appears to reference it. Bury the test in some unexpected place.
Try to hack the assembly code yourself to see what is easy and what is difficult to do. Ideas should pop up that you can experiment with to make the code more difficult to reverse engineer and to make debugging it more difficult.
As many already said: on a regular CPU you can't stop them from doing it, you can just delay them. As my old crypto teacher told me: you don't need perfect encryption; breaking the code must just be more expensive than the gain. The same holds for your obfuscation.
But 3 additional notes:
It is possible to make reverse engineering impossible, BUT (and this is a very, very big but), you can't do it on a conventional CPU. I have also done a lot of hardware development, and often FPGAs are used. E.g. the Virtex 5 FX has a PowerPC CPU on it, and you can use the APU to implement your own CPU opcodes in your hardware. You could use this facility to really decrypt instructions for the PowerPC in a way that is not accessible from the outside or by other software, or even execute the command in the hardware. As the FPGA has built-in AES encryption for its configuration bitstream, you could not reverse engineer it (except if someone manages to break AES, but then I guess we have other problems...). This is also how vendors of hardware IP protect their work.
You speak of a protocol. You don't say what kind of protocol it is, but when it is a network protocol you should at least protect it against network sniffing. This you can indeed do by encryption. But if you want to protect the en-/decryption from an owner of the software, you are back to obfuscation.
Do make your program undebuggable/unrunnable under a debugger. Try to use some kind of debugger detection and apply it, e.g., in some formula, or add a debug register's content to a magic constant. It is much harder if your program looks, in debug mode, as if it were running normally, but makes a completely wrong computation, operation, or something similar. E.g. I know of some eco games that had a really nasty copy protection (I know you don't want copy protection, but it is similar): the stolen version altered the mined resources after 30 minutes of game play, and suddenly you got just a single resource. The pirate just cracked it (i.e. reverse engineered it), checked if it ran, and voilà, released it. Such slight behaviour changes are very hard to detect, especially if they do not appear instantly, but only after a delay.
So finally I would suggest:
Estimate what the gain is for the people reverse engineering your software, translate this into an amount of time (e.g. by using the cheapest salary you can find), and make the reverse engineering so time-consuming that the cost is bigger than the gain.
Traditional reverse engineering techniques depend on the ability of a smart agent using a disassembler to answer questions about the code. If you want strong safety, you have to do things that provably prevent the agent from getting such answers.
You can do that by relying on the Halting Problem ("does program X halt?"), which in general cannot be solved. Adding programs that are difficult to reason about to your program makes your program difficult to reason about. It is easier to construct such programs than to tear them apart. You can also add code to the program that has varying degrees of difficulty for reasoning; a great candidate is the problem of reasoning about aliases ("pointers").
Collberg et al have a paper ("Manufacturing Cheap, Resilient and Stealthy Opaque Constructs") that discusses these topics and defines a variety of "opaque" predicates that can make it very difficult to reason about code:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.1946&rep=rep1&type=pdf
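To give a flavour of the idea, here is a toy opaque predicate sketched in Python (my own illustration, not taken from the paper):

def guarded(value, x=3, y=5):
    # 7*y*y == x*x has no integer solutions with y != 0 (sqrt(7) is irrational),
    # so the first branch is always taken -- but a tool has to prove that before
    # it can safely remove the apparently-live dead branch.
    if 7 * y * y != x * x:
        return value * 2      # the code that actually runs
    else:
        return value - 1      # dead code that looks live to an analyser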
I have not seen Collberg's specific methods applied to production code, especially not C or C++ source code.
The DashO Java obfuscator seems to use similar ideas.
http://www.cs.arizona.edu/~collberg/Teaching/620/2008/Assignments/tools/DashO/
Security through obscurity doesn't work, as has been demonstrated by people much cleverer than both of us. If you must protect the communication protocol of your customers, then you are morally obliged to use the best code that is in the open and fully scrutinized by experts.
This is for the situation where people can inspect the code. If your application is to run on an embedded microprocessor, you can choose one that has a sealing facility, which makes it impossible to inspect the code or observe more than trivial parameters like current usage while it runs. (That is, except by hardware-invasive techniques, where you carefully dismantle the chip and use advanced equipment to inspect currents on individual transistors.)
I'm the author of a reverse engineering assembler for the x86. If you're ready for a cold surprise, send me the result of your best efforts. (Contact me through my websites.)
Few of the things I have seen in the answers would present a substantial hurdle to me. If you want to see how sophisticated reverse engineering of code works, you should really study websites with reverse engineering challenges.
Your question could use some clarification. How do you expect to keep a protocol secret if the computer code is amenable to reverse engineering? If my protocol were to send an RSA-encrypted message (even with the public key), what would you gain by keeping the protocol secret? For all practical purposes an inspector would be confronted with a sequence of random bits.
Greetings, Albert
FIRST THING TO REMEMBER ABOUT HIDING YOUR CODE: Not all of your code needs to be hidden.
THE END GOAL: My end goal for most software programs is the ability to sell different licenses that will turn on and off specific features within my programs.
BEST TECHNIQUE: I find that building in a system of hooks and filters like WordPress offers, is the absolute best method when trying to confuse your opponents. This allows you to encrypt certain trigger associations without actually encrypting the code.
The reason that you do this, is because you'll want to encrypt the least amount of code possible.
KNOW YOUR CRACKERS: Know this: the main reason for cracking code is not malicious distribution of licensing; it's actually that crackers NEED to change your code, and they don't really NEED to distribute free copies.
GETTING STARTED: Set aside the small amount of code that you're going to encrypt; the rest of the code should be crammed into ONE file to increase complexity and hinder understanding.
PREPARING TO ENCRYPT: You're going to be encrypting in layers with my system; it's also going to be a very complex procedure, so build another program that will be responsible for the encryption process.
STEP ONE: Obfuscate using base64 names for everything. Once done, base64 the obfuscated code and save it into a temporary file that will later be used to decrypt and run this code. Make sense?
I'll repeat since you'll be doing this again and again. You're going to create a base64 string and save it into another file as a variable that will be decrypted and rendered.
STEP TWO: You're going to read in this temporary file as a string and obfuscate it, then base64 it and save it into a second temp file that will be used to decrypt and render it for the end user.
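As a very rough Python sketch of steps one and two (illustrative only -- base64 is an encoding, not encryption, so on its own this is trivially reversible):

import base64

inner = 'print("licensed feature")'        # the code you actually want to hide
layer = inner
for _ in range(2):                         # wrap twice; repeat for more layers
    payload = base64.b64encode(layer.encode()).decode()
    layer = 'import base64; exec(base64.b64decode("' + payload + '").decode())'

# "layer" is what you ship; running it unwraps each layer and executes the inner code.
exec(layer)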
STEP THREE: Repeat step two as many times as you would like. Once you have this working properly without decrypt errors, then you're going to want to start building in land mines for your opponents.
LAND MINE ONE: You're going to want to keep the fact that you're being notified an absolute secret. So build in a cracker attempt security warning mail system for layer 2. This will be fired letting you know the specifics about your opponent if anything is to go wrong.
LAND MINE TWO: Dependencies. You don't want your opponent to be able to run layer one, without layer 3 or 4 or 5, or even the actual program it was designed for. So make sure that within layer one you include some sort of kill script that will activate if the program isn't present, or the other layers.
I'm sure you can come up with your own landmines, have fun with it.
THING TO REMEMBER: You can actually encrypt your code instead of base64'ing it. That way a simple base64 decode won't reveal the program.
REWARD: Keep in mind that this can actually be a symbiotic relationship between you and your opponent. I always place a comment inside of layer one; the comment congratulates the cracker and gives them a promo code to use in order to receive a cash reward from you.
Make the cash reward significant with no prejudice involved. I normally say something like $500. If your guy is the first to crack the code, then pay him his money and become his friend. If he's a friend of yours he's not going to distribute your software. Ask him how he did it and how you can improve!
GOOD LUCK!
Has anyone tried CodeMorth: http://www.sourceformat.com/code-obfuscator.htm ?
Or Themida: http://www.oreans.com/themida_features.php ?
The latter one looks more promising.
One thing that has not been mentioned so far:
You could run parts of the code at your side (server side, e.g. called by a REST API). This way, the code is completely inaccessible to the reverse engineer.
Of course, this only applies, if
latency
traffic volume
compute and I/O power
privacy issues
are not preventing server-side execution of (parts of) your code.
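A minimal sketch of that server-side idea (Flask and the endpoint/function names are just illustrative assumptions, not a prescribed stack):

# server.py -- the sensitive logic never leaves your machine
from flask import Flask, request, jsonify

app = Flask(__name__)

def secret_business_logic(data):      # placeholder for the code you want to keep private
    return data["x"] * 2

@app.route("/compute", methods=["POST"])
def compute():
    return jsonify(result=secret_business_logic(request.get_json()))

if __name__ == "__main__":
    app.run()

# The shipped client only ever sees inputs and results, e.g.:
#   requests.post("http://your-server/compute", json={"x": 1}).json()["result"]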

Statistical profiler for PyPy

I would like to use statprof.py for profiling code in PyPy. Unfortunately, it does not seem to work, the line numbers it points to are off. Does anyone know how to make it work or know of an alternative?
It's likely that "the line numbers are off" because PyPy, in JITted code, will inline many functions and will only deliver signals (here from the timer) at the end of the loops. Compare this with CPython, which delivers the signals between two random bytecodes -- occasionally at the end of the loops too, but generally anywhere. So what you get on PyPy is the same as what you'd get on CPython if you constrained the signal handlers to run only at the "end of loop" bytecode.
This is why this kind of profiling will seem to always miss a lot of functions, like most functions with no loop in them.
You can try to use the built-in cProfile module. It comes of course with a bigger performance hit than statistical profiling, but try it anyway --- it doesn't prevent JITting, for example, so the performance hit should still be reasonable.
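For instance, a minimal cProfile invocation, assuming your entry point is a function main() defined in the same module:

import cProfile

# Profile the whole run and print functions sorted by cumulative time.
cProfile.run("main()", sort="cumulative")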
More generally, I don't see an easy way to implement the equivalent of statistical profiling in PyPy. It's quite hard to give it sense in the presence of functions that are inlined into each other and then optimized globally... I'd be interested if you can find that a tool actually exists, for some other high-level language, doing statistical profiling, on a VM with a tracing JIT.
We could record enough information to track each small group of assembler instructions back to the real Python function it comes from, and then use hacks to inspect the current Instruction Pointer (IP) at the machine level. Not impossible, but serious work :-)

Is there an open source tool that automatically generates test cases for legacy code?

I recently stumbled over this (aged) article:
http://imranontech.com/2007/01/04/unit-testing-the-final-frontier-legacy-code/
where the author allegedly wrote a perl script to automatically generate test cases.
His strategy went like this (cited):
Read in the header files I gave it.
Extracted the function prototypes.
Gave me the list of functions it found and let me pick which ones I wanted to create unit tests for.
It then created a dbx (Solaris debugger) script which would break-point every time the selected function was called, save the variables that were passed to it, and then continue until the function returned, at which point it would save the return value.
Ran the executable under the dbx script, at which point I proceeded to use the application as normal, and just ran through lots of use cases which I thought would go through the code in question, and especially cases where I thought it would hit edge cases in the functions I wanted to create unit tests for.
The perl script then took all of the example runs, stripped out duplicates, and then autogenerated a C file containing unit tests for each of the examples (i.e. pass in the input data and verify the return value is the same as in the example run).
Compiled/linked/ran the unit tests and threw away the ones which failed (i.e. got rid of inputs which cause the function to behave non-deterministically).
I have a lot of legacy code of all kinds in the languages Python and Fortran. The article is from 2007. Is there anything like this implemented in current Unit testing frameworks?
How would I go about writing such a script?
Very C-like. Also, OS dependent, I think (Solaris debugger)? I'd say you should look at "record/capture and playback" tools, though somehow I think the "generate" part never really took off.
Python's testing tools taxonomy would be a great place to start. I'd say you either record your way through the application using Selenium or Dogtail. The link takes you right to that section, Web testing tools, but check the others as well: fuzzy testing is a technique similar to Golden Master, which sometimes may help with legacy apps, and is a "record / playback" technique. Feathers calls such tests "characterization" tests, for they characterize a legacy system's behaviour.
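In Python you can get part of the way towards the dbx approach from the article with a small recording decorator -- a rough sketch, with illustrative names rather than a real framework:

import functools

RECORDS = []

def record(fn):
    # Log real inputs and outputs of a legacy function while you exercise the app.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        RECORDS.append({"func": fn.__name__, "args": args,
                        "kwargs": kwargs, "result": result})
        return result
    return wrapper

# After a manual session, dump RECORDS (e.g. with the json module) and turn each
# entry into a characterization test:
#     assert legacy_function(*args, **kwargs) == recorded_result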
A very good point in the article you cite:
Have a look at your own source code repository and see which functions/classes have had the most bugfix checkins applied; 80% of bugfixes tend to be made to about 20% of the code. There’s sound logic behind this – often that 20% of the code is poorly written with dozens or hundreds of “special case” hacks.
This is where I'd actually start. Have you got these parts identified? Simple Git/SVN log usage scripts and the coverage tools section from the taxonomy would come in handy with this.
Unfortunately, more than that I can't help you - my Python experience is limited and my Fortran non-existent.

TDD with large data in Python

I wonder if TDD could help my programming. However, I cannot use it simply as most of my functions take large network objects (many nodes and links) and do operations on them. Or I even read SQL tables.
Most of the time it's not really the logic that breaks (i.e. not semantic bugs), but rather some function calls after refactoring :)
Do you think I can use TDD with such kind of data? What do you suggest for that? (mock frameworks etc?)
Would I somehow take real data, process it with a function, validate the output, save input/output states to some kind of mock object, and then write a test on it? I mean just in case I cannot provide hand made input data.
I haven't started TDD yet, so references are welcome :)
You've pretty much got it. Database testing is done by starting with a clean, up-to-date schema and adding a small amount of known, fixed data into the database. You can then do operations on this controlled environment, knowing what results you expect to see.
Working with network objects is a bit more complex, but it normally involves stubbing them (i.e. removing the inner functionality entirely) or mocking them so that a fixed set of known data is returned.
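A tiny sketch with unittest.mock, assuming a hypothetical count_nodes(network) that only ever calls network.nodes():

from unittest import mock

def count_nodes(network):                          # stand-in for your real code
    return len(network.nodes())

def test_count_nodes_with_fake_network():
    network = mock.Mock()
    network.nodes.return_value = ["a", "b", "c"]   # small, fixed, known data
    assert count_nodes(network) == 3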
There is always a way to test your code. If it's proving difficult, it's normally the code design that needs some rethinking.
I don't know any Python specific TDD resources, but a great resource on TDD in general is "Test Driven Development: A Practical Guide" by Coad. It uses Java as the language, but the principles are the same.
most of my functions take large network objects
Without knowing anything about your code, it is hard to assess this claim, but you might want to redesign your code so it is easier to unit test, by decomposing it into smaller methods. Although some high-level methods might deal with those troublesome large objects, perhaps low-level methods do not. You can then unit test those low-level methods, relying on integration tests to test the high-level methods.
Edited:
Before getting to grips with TDD you might want to try just adding some unit tests.
Unit testing is about testing at the fine-grained level: see my answer to the question How do you unit test the real world.
You might have to introduce some indirection into your program, to isolate parts that are impossible to unit test.
You might find it useful to decompose your data into an assembly of smaller classes, which can be tested individually.

Differences between Smalltalk and python?

I'm studying Smalltalk right now. It looks very similar to Python (actually, the opposite: Python is very similar to Smalltalk), so I was wondering, as a Python enthusiast, if it's really worth it for me to study it.
Apart from message passing, what are other notable conceptual differences between Smalltalk and python which could allow me to see new programming horizons ?
In Python, the "basic" constructs such as if/else, short-circuiting boolean operators, and loops are part of the language itself. In Smalltalk, they are all just messages. In that sense, while both Python and Smalltalk agree that "everything is an object", Smalltalk goes further in that it also asserts that "everything is a message".
[EDIT] Some examples.
Conditional statement in Smalltalk:
((x > y) and: [x > z])
ifTrue: [ ... ]
ifFalse: [ ... ]
Note how and: is just a message on Boolean (itself produced as a result of passing message > to x), and the second argument of and: is not a plain expression, but a block, enabling lazy (i.e. short-circuiting) evaluation. This produces another Boolean object, which also supports the message ifTrue:ifFalse:, taking two more blocks (i.e. lambdas) as arguments, and running one or the other depending on the value of the Boolean.
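For comparison, the rough Python equivalent leans on built-in syntax (if/else and the short-circuiting and operator) rather than on messages and blocks:

if x > y and x > z:
    ...   # the "ifTrue:" branch
else:
    ...   # the "ifFalse:" branch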
As someone new to smalltalk, the two things that really strike me are the image-based system, and that reflection is everywhere. These two simple facts appear to give rise to everything else cool in the system:
The image means that you do everything by manipulating objects, including writing and compiling code
Reflection allows you to inspect the state of any object. Since classes are objects and their sources are objects, you can inspect and manipulate code
You have access to the current execution context, so you can have a look at the stack, and from there, compiled code and the source of that code and so on
The stack is an object, so you can save it away and then resume later. Bingo, continuations!
All of the above starts to come together in cool ways:
The browser lets you explore the source of literally everything, including the VM in Squeak
You can make changes that affect your live program, so there's no need to restart and navigate your way through to whatever you're working on
Even better, when your program throws an exception you can debug the live code. You fix the bug, update the state if it's become inconsistent and then have your program continue.
The browser will tell you if it thinks you've made a typo
It's absurdly easy to browse up and down the class hierarchy, or find out what messages an object responds to, or which code sends a given message, or which objects can receive a given message
You can inspect and manipulate the state of any object in the system
You can make any two objects literally switch places with become:, which lets you do crazy stuff like stub out any object and then lazily pull it in from elsewhere if it's sent a message.
The image system and reflection has made all of these perfectly natural and normal things for a smalltalker for about thirty years.
Smalltalk historically has had an amazing IDE built in. I have missed this IDE on many languages.
Smalltalk also has the lovely property that it is typically a living system. You start up clean and start modifying things. This is basically an object persistent storage system. That being said, this is both good and bad. What you run is part of your system and part of what you ship. The system can be set up quite nicely before being distributed. The downside is that the system has everything you run as part of what you ship. You need to be very careful packaging for redistribution.
Now, that being said, it has been a while since I have worked with Smalltalk (about 20 years). Yes, I know, fun times for those who do the math. Smalltalk is a nice language, fun to program in, fun to learn, but I have found it a little hard to ship things in.
Enjoy playing with it if you do. I have been playing with Python and loving it.
Jacob
The Smalltalk language itself is very important. It comprises a small set of powerful, orthogonal features that makes the language highly extensible. As Alan Lovejoy says:
"Smalltalk is also fun because defining and using domain specific languages isn’t an afterthought, it’s the only way Smalltalk works at all." The language notation is critically important because: "Differences in the expressive power of the programming notation used do matter." For more, read the full article here.
The language aspect often isn't that important, and many languages are quite samey.
From what I see, Python and Smalltalk share OOP ideals... but are very different in their implementation and in the power of the presented language interface.
The real value comes in what the subtle differences in the syntax allow in terms of implementation. Take a look at Self and other meta-heavy languages.
Look past the syntax and immediate semantics to what the subtle differences allow the implementation to do.
For example:
Everything in Smalltalk-80 is available for modification from within a running program
What differences between Python and Smalltalk allow deeper manipulation, if any? How does the language enable the implementation of the compiler/runtime?
