Problems with Collatz C++ code - Python

I am solving a problem, and although I have already solved it (after a long while) I wanted to find out what was wrong with my implementation.
I programmed my solution in both C++ and Python on Windows. I tried running my Python version in CodeSkulptor and it gave me a time-limit error. I switched to C++ and it gave me some weird errors, so I booted up my virtual machine to find out why my C++ code failed (I used BCC32 from Borland). I could see that the long numbers generated by the Collatz sequence made my program crash. Under Linux (using the g++ compiler) I got almost the same error, although there the program runs and otherwise handles long numbers well.
Working under Linux, the same Python program I developed for Windows worked straight away. I want to know why the C++ version fails on both Windows and Linux.
in Python:
def Collatz(num):
    temp = []
    temp.append(num)
    while num > 1:
        num = num % 2 == 0 and num / 2 or num * 3 + 1
        temp.append(num)
    return temp
in C++:
vector<unsigned long> collatz(int num)
{
    vector<unsigned long> intList;
    intList.push_back(num);
    while (num > 1)
    {
        if (num % 2 == 0) num /= 2;
        else num = num * 3 + 1;
        intList.push_back(num);
    }
    return intList;
}
These two pieces of code are just the functions.
The strange thing is that both versions work fine when calculating the sequence for 13 or 999999, but the C++ version fails, for example, to calculate the sequence for 837799. Maybe it has something to do with the vector container size?

It is because your num is an int: you get an overflow after the element 991661525 in the Collatz sequence for 837799 (all operations are done on the int, so the multiplication 991661525*3+1 in num = num*3+1; overflows). Change num to unsigned long in the function definition:
vector<unsigned long> collatz(unsigned long num)
and it will work!
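To see the overflow concretely, here is a quick check in Python (where integers don't overflow), using the value quoted above and assuming a 32-bit signed int in the C++ build:

INT32_MAX = 2**31 - 1        # 2147483647
peak = 991661525 * 3 + 1     # 2974984576, reached in the sequence for 837799
print(peak > INT32_MAX)      # True: num*3+1 no longer fits in a signed 32-bit int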

Related

Why is this memoized Euler 14 implementation so much slower in Raku than in Python?

I was recently playing with problem 14 of the Euler project: which number in the range 1..1_000_000 produces the longest Collatz sequence?
I'm aware of the issue of having to memoize to get reasonable times, and the following piece of Python code returns an answer relatively quickly using that technique (memoize to a dict):
#!/usr/bin/env python
L = 1_000_000
cllens = {1: 1}
cltz = lambda n: 3*n + 1 if n % 2 else n//2

def cllen(n):
    if n not in cllens: cllens[n] = cllen(cltz(n)) + 1
    return cllens[n]

maxn = 1
for i in range(1, L+1):
    ln = cllen(i)
    if (ln > cllens[maxn]): maxn = i
print(maxn)
(adapted from here; I prefer this version that doesn't use max, because I might want to fiddle with it to return the longest 10 sequences, etc.).
I have tried to translate it to Raku staying as semantically close as I could:
#!/usr/bin/env perl6
use v6;
my $L = 1_000_000;
my %cllens = (1 => 1);
sub cltz($n) { ($n %% 2) ?? ($n div 2) !! (3*$n+1) }
sub cllen($n) {
    (! %cllens{$n}) && (%cllens{$n} = 1 + cllen($n.&cltz));
    %cllens{$n};
}
my $maxn = 1;
for (1..$L) {
    my $ln = cllen($_);
    ($ln > %cllens{$maxn}) && ($maxn = $_)
}
say $maxn
Here are the sorts of times I am consistently getting running these:
$ time <python script>
837799
real 0m1.244s
user 0m1.179s
sys 0m0.064s
On the other hand, in Raku:
$ time <raku script>
837799
real 0m21.828s
user 0m21.677s
sys 0m0.228s
Question(s)
Am I mistranslating between the two, or is the difference an irreconcilable matter of starting up a VM, etc.?
Are there clever tweaks / idioms I can apply to the Raku code to speed it up considerably past this?
Aside
Naturally, it's not so much about this specific Euler project problem; I'm more broadly interested in whether there are any magical speedup arcana appropriate to Raku I am not aware of.
I think the majority of the extra time is because Raku has type checks, and they aren't getting removed by the runtime type specializer. Or if they are getting removed it is after a significant amount of time.
Generally the way to optimize Raku code is first to run it with the profiler:
$ raku --profile test.raku
Of course that fails with a Segfault with this code, so we can't use it.
My guess would be that much of the time is related to using the Hash.
If it was implemented, using native ints for the key and value might have helped:
my int %cllens{int} = (1 => 1);
Then declaring the functions as using native-sized ints could be a bigger win.
(Currently this is a minor improvement at best.)
sub cltz ( int $n --> int ) {…}
sub cllen( int $n --> int ) {…}
for (1..$L) -> int $_ {…}
Of course like I said native hashes aren't implemented, so that is pure speculation.
You could try to use the multi-process abilities of Raku, but there may be issues with the shared %cllens variable.
The problem could also be because of recursion. (Combined with the aforementioned type checks.)
If you rewrote cllen so that it used a loop instead of recursion that might help.
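For reference, a rough sketch of that loop-based rewrite, written against the Python version above (the same shape should carry over to Raku): walk down the chain until a cached length is found, then fill the cache back in on the way out.

cllens = {1: 1}

def cllen(n):
    path = []
    # Follow the chain until we reach a number whose length is already cached.
    while n not in cllens:
        path.append(n)
        n = 3*n + 1 if n % 2 else n//2
    length = cllens[n]
    # Walk back up the chain, caching the length of every number we visited.
    for m in reversed(path):
        length += 1
        cllens[m] = length
    return length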
Note:
The closest to n not in cllens is probably %cllens{$n}:!exists.
Though that might be slower than just checking that the value is not zero.
Also cllen is kinda terrible. I would have written it more like this:
sub cllen($n) {
    %cllens{$n} //= cllen(cltz($n)) + 1
}

Get Python to read a return code from C# in a reliable fashion

I've written a large-ish program in Python, which I need to talk to a smaller C# script. (I realise that getting Python and C# to talk to each other is not an ideal state of affairs, but I'm forced to do this by a piece of hardware, which demands a C# script.) What I want to achieve in particular - the motivation behind this question - is that I want to know when a specific caught exception occurs in the C# script.
I've been trying to achieve the above by getting my Python program to look at the C# script's return code. The problem I've been having is that, if I tell C# to give a return code x, my OS receives a return code y and Python receives a return code z. A given x always seems to correspond to a specific y and a specific z, but I'm having difficulty deciphering the relationship between the three, when ideally they would all be the same.
Here are the specifics of my setup:
My version of Python is Python 3.
My OS is Ubuntu 20.04.
I'm using Mono to compile and run my C# script.
And here's a minimal working example of the sort of thing I'm talking about:
This is a tiny C# script:
namespace ConsoleApplication1
{
    class Script
    {
        const int ramanujansNumber = 1729;

        bool Run()
        {
            return false;
        }

        static int Main(string[] args)
        {
            Script program = new Script();
            if (program.Run()) return 0;
            else return ramanujansNumber;
        }
    }
}
If I compile this using mcs Script.cs, run it using mono Script.exe and then run echo $?, it prints 193. If, on the other hand, I run this Python script:
import os
result = os.system("mono Script.exe")
print(result)
it prints 49408. What is the relationship between these three numbers: 1729, 193, 49408? Can I predict the return code that Python will receive if I know what the C# script will return?
Note: I've tried using Environment.Exit(code) in the C# script instead of having Main return an integer. I ran into exactly the same problem.
For os.system, the documentation explicitly states that the result is in the same format as for os.wait(), i.e.:
a 16-bit number, whose low byte is the signal number that killed the process, and whose high byte is the exit status (if the signal number is zero); the high bit of the low byte is set if a core file was produced.
So in your case it looks like:
>>> 193<<8
49408
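If you stay with os.system, the helpers in the os module will decode that 16-bit status for you; a small sketch:

import os

status = os.system("mono Script.exe")   # 49408 in the question
if os.WIFEXITED(status):                # the process exited normally
    print(os.WEXITSTATUS(status))       # high byte of the status: 193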
You might want to change that part to using subprocess, e.g. as in the answer to this question
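For example, a minimal subprocess-based sketch, assuming mono and Script.exe are reachable as in the question:

import subprocess

proc = subprocess.run(["mono", "Script.exe"])
print(proc.returncode)   # 193, i.e. 1729 & 255 (see the note below)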
UPD: As for the mono return code, it looks like only the lower byte of it is used (i.e. it is expected to be between 0 and 255). At least:
>>> 1729 & 255
193

Python C API: WindowsError after creating some number of PyObjects

I've been having an issue getting the Python C API to not give me errors.
Background: I've been using ctypes to run native code (C++) for a while, but until now I had never actually done anything specific with the Python C API. I had mostly just been passing in structs from Python and filling them from C++. The way I was using structs was becoming cumbersome, so I decided I would try creating Python objects directly in C++ and just pass them back to my Python script.
Code:
I have a DLL (Foo.dll) with only one function:
#include <Python.h>
#include <iostream>

#define N 223

__declspec(dllexport) void Bar(void)
{
    std::cout << "Bar" << std::endl;
    for (int i = 0; i < N; ++i)
    {
        auto list = PyList_New(0);
        std::cout << "Created: " << i << std::endl;
        //Py_DECREF(list);
    }
}
And then I have the Python script I'm running:
import ctypes as C
dll = r"C:\path\to\dll\Foo.dll"
Foo = C.CDLL(dll)
# lists = [[] for _ in range(0)]
Foo.Bar()
print "Done."
What happens: If I define N in the above DLL to be 222 or below, the code works fine (except for the memory leak, but that isn't the problem).
If I uncomment the //Py_DECREF(list) line, the code works fine.
However, with the above code, I get this:
Bar
Created: 0
Created: 1
Created: 2
...(omitted for your sake)
Created: 219
Created: 220
Created: 221
Traceback (most recent call last):
File "C:\path_to_script\script.py", line 9, in <module>
Foo.Bar()
WindowsError: exception: access violation reading 0x00000028
In fact, I get this same result with dictionaries, lists, tuples and so on. I get the same result if I create a list and then append empty sublists to that list.
What's weirder, every list that I make from within the actual Python script decreases the number of lists the DLL can make before getting this WindowsError.
Weirder still, if I make more than 222 lists in my Python script, then the DLL won't encounter this error until it has created something like 720 more lists.
Other details:
Windows 10
Using the Anaconda2 32-bit Python 2.7 distribution
(Using Python.h and python27.lib from that distribution)
python.exe --version: 2.7.13 :: Anaconda custom (32-bit)
Creating DLL with Visual Studio 2017
As long as I don't create many PyObjects from my C++ code, everything seems to work fine. I can pass PyObjects to and from the Python code and it works fine.. until I've created "too many" of the objects from within my C++ code.
What is going on?
From the documentation for CDLL:
The Python global interpreter lock is released before calling any function exported by these libraries, and reacquired afterwards.
This makes it unsafe to use Python C API code. Exactly how it fails is unpredictable, as you are finding. I'd guess it has to do with whether the allocation triggers a run of the garbage collector, but I don't think it's worth spending too much time trying to work out the exact cause.
There are (at least) two solutions to choose from:
Use ctypes.PyDLL (which the documentation notes is like CDLL except that it does not release the GIL) - there is a sketch after the GIL snippet below
Reacquire the GIL within your C++ code - an easy way to do this is:
auto state = PyGILState_Ensure();
// C++ code requiring GIL - probably your entire loop
PyGILState_Release(state);
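A minimal sketch of the first option, reusing the path from the question; PyDLL takes the same arguments as CDLL:

import ctypes as C

dll = r"C:\path\to\dll\Foo.dll"
Foo = C.PyDLL(dll)   # unlike CDLL, PyDLL does not release the GIL around calls
Foo.Bar()
print "Done."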

Cython for loop conversion

Using cython -a, I found that a for i in range(0, a, b) statement was compiled as a Python loop (a very yellow line in the cython -a HTML output). i, a and b were cdef-ed as int64_t.
Then I tried the 'old' syntax for i from 0 <= i < b by a. From the output of cython -a it seemed to compile to quite optimal C, as expected.
Is it expected behaviour that range(0, a, b) is not optimized here, or is this rather an implementation limitation?
Automatic range conversion is only applied when Cython can determine the sign of the step at compile time. As the step in this case is a signed type, it cannot, and so it falls back to the Python loop.
Note that currently, even when the type is unsigned, Cython still falls back to the Python loop; this is a (rather old) outstanding optimisation that the compiler could perform but doesn't. Have a look at this ticket for more information:
http://trac.cython.org/ticket/546

DLR & Performance

I'm intending to create a web service which performs a large number of manually-specified calculations as fast as possible, and I have been exploring the use of the DLR.
Sorry if this is long, but feel free to skim over it and get the general gist.
I've been using the IronPython library as it makes the calculations very easy to specify. My work laptop gives a performance of about 400,000 calculations per second doing the following:
ScriptEngine py = Python.CreateEngine();
ScriptScope pys = py.CreateScope();
ScriptSource src = py.CreateScriptSourceFromString(@"
def result():
    res = [None]*1000000
    for i in range(0, 1000000):
        res[i] = b.GetValue() + 1
    return res
result()
");
CompiledCode compiled = src.Compile();
pys.SetVariable("b", new DynamicValue());
long start = DateTime.Now.Ticks;
var res = compiled.Execute(pys);
long end = DateTime.Now.Ticks;
Console.WriteLine("...Finished. Sample data:");
for (int i = 0; i < 10; i++)
{
    Console.WriteLine(res[i]);
}
Console.WriteLine("Took " + (end - start) / 10000 + "ms to run 1000000 times.");
Where DynamicValue is a class that returns random numbers from a pre-built array (seeded and built at run time).
When I create a DLR class to do the same thing, I get much higher performance (~10,000,000 calculations per second). The class is as follows:
class DynamicCalc : IDynamicMetaObjectProvider
{
    DynamicMetaObject IDynamicMetaObjectProvider.GetMetaObject(Expression parameter)
    {
        return new DynamicCalcMetaObject(parameter, this);
    }

    private class DynamicCalcMetaObject : DynamicMetaObject
    {
        internal DynamicCalcMetaObject(Expression parameter, DynamicCalc value) : base(parameter, BindingRestrictions.Empty, value) { }

        public override DynamicMetaObject BindInvokeMember(InvokeMemberBinder binder, DynamicMetaObject[] args)
        {
            Expression Add = Expression.Convert(Expression.Add(args[0].Expression, args[1].Expression), typeof(System.Object));
            DynamicMetaObject methodInfo = new DynamicMetaObject(Expression.Block(Add), BindingRestrictions.GetTypeRestriction(Expression, LimitType));
            return methodInfo;
        }
    }
}
and is called/tested in the same way by doing the following:
dynamic obj = new DynamicCalc();
long t1 = DateTime.Now.Ticks;
for (int i = 0; i < 10000000; i++)
{
    results[i] = obj.Add(ar1[i], ar2[i]);
}
long t2 = DateTime.Now.Ticks;
Where ar1 and ar2 are pre-built, runtime seeded arrays of random numbers.
The speed is great this way, but it's not easy to specify the calculation. I'd basically be looking at creating my own lexer & parser, whereas IronPython has everything I need already there.
I'd have thought I could get much better performance from IronPython since it is implemented on top of the DLR, and I could do with better than what I'm getting.
Is my example making best use of the IronPython engine? Is it possible to get significantly better performance out of it?
(Edit) Same as the first example, but with the loop in C#, setting variables and calling the Python function:
ScriptSource src = py.CreateScriptSourceFromString(@"b + 1");
CompiledCode compiled = src.Compile();
double[] res = new double[1000000];
for (int i = 0; i < 1000000; i++)
{
    pys.SetVariable("b", args1[i]);
    res[i] = compiled.Execute(pys);
}
where pys is a ScriptScope from py, and args1 is a pre-built array of random doubles. This example executes slower than running the loop in the Python code and passing in the entire arrays.
delnan's comment points to some of the problems here, but I'll get specific about what the differences are. In the C# version you've cut out a significant number of the dynamic calls that you have in the Python version. For starters, your loop is typed to int and it sounds like ar1 and ar2 are strongly typed arrays. So in the C# version the only dynamic operations you have are the call to obj.Add (which is 1 operation in C#) and potentially the assignment to results if it's not typed to object, which seems unlikely. Also note that all of this code is lock-free.
In the Python version you first have the allocation of the list - this also appears to be inside your timer, whereas in C# it doesn't look like it is. Then you have the dynamic call to range; luckily that only happens once, but it again creates a gigantic list in memory - delnan's suggestion of xrange is an improvement here. Then you have the loop counter i, which is getting boxed to an object for every iteration through the loop. Then you have the call to b.GetValue(), which is actually 2 dynamic invocations - first a get-member to fetch the "GetValue" method and then an invoke on that bound method object. This again creates one new object for every iteration of the loop. Then you have the result of b.GetValue(), which may be yet another value that's boxed on every iteration. Then you add 1 to that result and you have another boxing operation on every iteration. Finally you store this into your list, which is yet another dynamic operation - I think this final operation needs to lock to ensure the list remains consistent (again, delnan's suggestion of using a list comprehension improves this).
So in summary during the loop we have:
                     C#   IronPython
Dynamic Operations    1            4
Allocations           1            4
Locks Acquired        0            1
So basically Python's dynamic behavior does come at a cost vs C#. If you want the best of both worlds you can try and balance what you do in C# vs what you do in Python. For example you could write the loop in C# and have it call a delegate which is a Python function (you can do scope.GetVariable<Func<object, object>>(...) to get a function out of the scope as a delegate). You could also consider allocating a .NET array for the results if you really need to get every last bit of performance, as it may reduce working set and GC copying by not keeping around a bunch of boxed values (there is a sketch of that idea after the delegate example below).
To do the delegate you could have the user write:
def computeValue(value):
    return value + 1
Then in the C# code you'd do:
CompiledCode compiled = src.Compile();
compiled.Execute(pys);
var computer = pys.GetVariable<Func<object,object>>("computeValue");
Now you can do:
for (int i = 0; i < 10000000; i++)
{
    results[i] = computer(i);
}
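As for the .NET-array suggestion above, here is a rough, untested sketch of that idea applied to the first hosted script (it assumes IronPython, with b set in the scope as before):

from System import Array, Double

def result():
    # Preallocate a strongly typed double[] instead of a Python list,
    # so the stored results are not kept as boxed objects.
    res = Array.CreateInstance(Double, 1000000)
    for i in range(1000000):
        res[i] = b.GetValue() + 1
    return res

result()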
If you are concerned about computation speed, wouldn't it be better to specify the computation at a lower level? Python and C# are high-level languages, and their runtimes can spend a lot of time on work under the covers.
Have a look at this LLVM wrapper library: http://www.llvmpy.org
Install it using: pip install llvmpy ply
or on Debian Linux: apt install python-llvmpy python-ply
You still need to write a tiny compiler (you can use the PLY library) and bind it to LLVM JIT calls (see the LLVM Execution Engine), but this approach can be more effective (the generated code is much closer to real CPU code) and more portable than being tied to .NET.
LLVM has ready-to-use optimizing compiler infrastructure, including many optimizer passes, and a big user and developer community.
Also look here: http://gmarkall.github.io/tutorials/llvm-cauldron-2016
PS: If you are interested, I can help you with a compiler, contributing to my project's manual in parallel. But it will not be a jumpstart; this topic is new to me too.
