I am implementing a protocol in an STM32F412 board. It's almost done, I just need to do a CRC check for the received data.
I tried using the internal CRC module for calculating the CRC but I could not match the result to any online CRC algorithm online, so I decided to do a simple implementation of the Ethernet CRC.
static const uint32_t crc32_tab[] =
{
0x00000000L, 0x77073096L, 0xee0e612cL, 0x990951baL, 0x076dc419L,
0x706af48fL, 0xe963a535L, 0x9e6495a3L, 0x0edb8832L, 0x79dcb8a4L,
0xe0d5e91eL, 0x97d2d988L, 0x09b64c2bL, 0x7eb17cbdL, 0xe7b82d07L,
0x90bf1d91L, 0x1db71064L, 0x6ab020f2L, 0xf3b97148L, 0x84be41deL,
0x1adad47dL, 0x6ddde4ebL, 0xf4d4b551L, 0x83d385c7L, 0x136c9856L,
0x646ba8c0L, 0xfd62f97aL, 0x8a65c9ecL, 0x14015c4fL, 0x63066cd9L,
0xfa0f3d63L, 0x8d080df5L, 0x3b6e20c8L, 0x4c69105eL, 0xd56041e4L,
0xa2677172L, 0x3c03e4d1L, 0x4b04d447L, 0xd20d85fdL, 0xa50ab56bL,
0x35b5a8faL, 0x42b2986cL, 0xdbbbc9d6L, 0xacbcf940L, 0x32d86ce3L,
0x45df5c75L, 0xdcd60dcfL, 0xabd13d59L, 0x26d930acL, 0x51de003aL,
0xc8d75180L, 0xbfd06116L, 0x21b4f4b5L, 0x56b3c423L, 0xcfba9599L,
0xb8bda50fL, 0x2802b89eL, 0x5f058808L, 0xc60cd9b2L, 0xb10be924L,
0x2f6f7c87L, 0x58684c11L, 0xc1611dabL, 0xb6662d3dL, 0x76dc4190L,
0x01db7106L, 0x98d220bcL, 0xefd5102aL, 0x71b18589L, 0x06b6b51fL,
0x9fbfe4a5L, 0xe8b8d433L, 0x7807c9a2L, 0x0f00f934L, 0x9609a88eL,
0xe10e9818L, 0x7f6a0dbbL, 0x086d3d2dL, 0x91646c97L, 0xe6635c01L,
0x6b6b51f4L, 0x1c6c6162L, 0x856530d8L, 0xf262004eL, 0x6c0695edL,
0x1b01a57bL, 0x8208f4c1L, 0xf50fc457L, 0x65b0d9c6L, 0x12b7e950L,
0x8bbeb8eaL, 0xfcb9887cL, 0x62dd1ddfL, 0x15da2d49L, 0x8cd37cf3L,
0xfbd44c65L, 0x4db26158L, 0x3ab551ceL, 0xa3bc0074L, 0xd4bb30e2L,
0x4adfa541L, 0x3dd895d7L, 0xa4d1c46dL, 0xd3d6f4fbL, 0x4369e96aL,
0x346ed9fcL, 0xad678846L, 0xda60b8d0L, 0x44042d73L, 0x33031de5L,
0xaa0a4c5fL, 0xdd0d7cc9L, 0x5005713cL, 0x270241aaL, 0xbe0b1010L,
0xc90c2086L, 0x5768b525L, 0x206f85b3L, 0xb966d409L, 0xce61e49fL,
0x5edef90eL, 0x29d9c998L, 0xb0d09822L, 0xc7d7a8b4L, 0x59b33d17L,
0x2eb40d81L, 0xb7bd5c3bL, 0xc0ba6cadL, 0xedb88320L, 0x9abfb3b6L,
0x03b6e20cL, 0x74b1d29aL, 0xead54739L, 0x9dd277afL, 0x04db2615L,
0x73dc1683L, 0xe3630b12L, 0x94643b84L, 0x0d6d6a3eL, 0x7a6a5aa8L,
0xe40ecf0bL, 0x9309ff9dL, 0x0a00ae27L, 0x7d079eb1L, 0xf00f9344L,
0x8708a3d2L, 0x1e01f268L, 0x6906c2feL, 0xf762575dL, 0x806567cbL,
0x196c3671L, 0x6e6b06e7L, 0xfed41b76L, 0x89d32be0L, 0x10da7a5aL,
0x67dd4accL, 0xf9b9df6fL, 0x8ebeeff9L, 0x17b7be43L, 0x60b08ed5L,
0xd6d6a3e8L, 0xa1d1937eL, 0x38d8c2c4L, 0x4fdff252L, 0xd1bb67f1L,
0xa6bc5767L, 0x3fb506ddL, 0x48b2364bL, 0xd80d2bdaL, 0xaf0a1b4cL,
0x36034af6L, 0x41047a60L, 0xdf60efc3L, 0xa867df55L, 0x316e8eefL,
0x4669be79L, 0xcb61b38cL, 0xbc66831aL, 0x256fd2a0L, 0x5268e236L,
0xcc0c7795L, 0xbb0b4703L, 0x220216b9L, 0x5505262fL, 0xc5ba3bbeL,
0xb2bd0b28L, 0x2bb45a92L, 0x5cb36a04L, 0xc2d7ffa7L, 0xb5d0cf31L,
0x2cd99e8bL, 0x5bdeae1dL, 0x9b64c2b0L, 0xec63f226L, 0x756aa39cL,
0x026d930aL, 0x9c0906a9L, 0xeb0e363fL, 0x72076785L, 0x05005713L,
0x95bf4a82L, 0xe2b87a14L, 0x7bb12baeL, 0x0cb61b38L, 0x92d28e9bL,
0xe5d5be0dL, 0x7cdcefb7L, 0x0bdbdf21L, 0x86d3d2d4L, 0xf1d4e242L,
0x68ddb3f8L, 0x1fda836eL, 0x81be16cdL, 0xf6b9265bL, 0x6fb077e1L,
0x18b74777L, 0x88085ae6L, 0xff0f6a70L, 0x66063bcaL, 0x11010b5cL,
0x8f659effL, 0xf862ae69L, 0x616bffd3L, 0x166ccf45L, 0xa00ae278L,
0xd70dd2eeL, 0x4e048354L, 0x3903b3c2L, 0xa7672661L, 0xd06016f7L,
0x4969474dL, 0x3e6e77dbL, 0xaed16a4aL, 0xd9d65adcL, 0x40df0b66L,
0x37d83bf0L, 0xa9bcae53L, 0xdebb9ec5L, 0x47b2cf7fL, 0x30b5ffe9L,
0xbdbdf21cL, 0xcabac28aL, 0x53b39330L, 0x24b4a3a6L, 0xbad03605L,
0xcdd70693L, 0x54de5729L, 0x23d967bfL, 0xb3667a2eL, 0xc4614ab8L,
0x5d681b02L, 0x2a6f2b94L, 0xb40bbe37L, 0xc30c8ea1L, 0x5a05df1bL,
0x2d02ef8dL
};
uint32_t calc_crc_calculate(uint8_t *pData, uint32_t uLen)
{
uint32_t val = 0xFFFFFFFFU;
int i;
for(i = 0; i < uLen; i++) {
val = crc32_tab[(val ^ pData[i]) & 0xFF] ^ ((val >> 8) & 0x00FFFFFF);
}
return val^0xFFFFFFFF;
}
I calculated the crc of 0x6F and compared the result to the online calculators and it apparently matches.
When I try to test the protocol with my python code I'm just unable to match the CRCs. On python I'm using the following code:
d = 0x6f
crc = zlib.crc32(bytes(d))&0xFFFFFFFF
I'm now unable to tell which is right. Apparently my algorithm is OK because it matches the online calculator. BUT those online calculators do not seem to be reliable sometimes and I doubt that python's zlib implementation is wrong .. I may be using it wrong at worst.
Actually you can compute the Ethernet CRC32 with the builtin module of the STM32. It took me quite a while to make it match up as well.
This code should match up for sizes divisible by 4 (I also used python zlib on the other end):
#include "stm32l4xx_hal.h"
uint32_t CRC32_Compute(const uint32_t *data, size_t sizeIn32BitWords)
{
CRC_HandleTypeDef hcrc = {
.Instance = CRC,
.Init.DefaultPolynomialUse = DEFAULT_POLYNOMIAL_ENABLE,
.Init.DefaultInitValueUse = DEFAULT_INIT_VALUE_ENABLE,
.Init.InputDataInversionMode = CRC_INPUTDATA_INVERSION_WORD,
.Init.OutputDataInversionMode = CRC_OUTPUTDATA_INVERSION_ENABLE,
.InputDataFormat = CRC_INPUTDATA_FORMAT_WORDS,
};
HAL_StatusTypeDef status = HAL_CRC_Init(&hcrc);
assert (status == HAL_OK)
uint32_t checksum = HAL_CRC_Calculate(&hcrc, data, sizeIn32BitWords);
uint32_t checksumInverted = ~checksum;
return checksumInverted;
}
The challenge with sizes not divisible by 4 is to get the "inversion/reversal" (changing the bit order) right. There is an example how the hardware handles this in the "RM0394 Reference manual STM32L43xxx STM32L44xxx STM32L45xxx STM32L46xxx advanced ARM®-based 32-bit MCUs Rev 3" on page 333.
The essence is that reversal reverses the bit order. For CRC32 this reversal must happen on the word level, i.e. over 32 bits.
Ok. It certainly was a bug on my part. But it was happening in my python code.
I suddenly realized that I was practically doing bytes(0x6F) which just creates an array with 111 positions.
What I actually needed to do was
import struct
d = pack('B', 0x6F)
crc = zlib.crc32(bytes(d))&0xFFFFFFFF
This question could have been avoided had I just done a little bit of rubber duck debugging. Hopefuly this will help someone else.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I’m reading a file in C++ and Python as a binary file. I need to divide the binary into blocks, each 6 bytes. For example, if my file is 600 bytes, the result should be 100 blocks, each 6 bytes.
I have tried struct (in C++ and Python) and array (Python). None of them divide the binary into blocks of 6 bytes. They can only divide the binary into blocks each power of two (1, 2, 4, 8, 16, etc.).
The array algorithm was very fast, reading 1 GB of binary data in less than a second as blocks of 4 bytes. In contrast, I used some other methods, but all of them are extremely slow, taking tens of minutes to do it for a few megabytes.
How can I read the binary as blocks of 6 bytes as fast as possible? Any help in either C++ or Python will be great. Thank you.
EDIT - The Code:
struct Block
{
char data[6];
};
class BinaryData
{
private:
char data[6];
public:
BinaryData() {};
~BinaryData() {};
void readBinaryFile(string strFile)
{
Block block;
ifstream binaryFile;
int size = 0;
binaryFile.open(strFile, ios::out | ios::binary);
binaryFile.seekg(0, ios::end);
size = (int)binaryFile.tellg();
binaryFile.seekg(0, ios::beg);
cout << size << endl;
while ( (int)binaryFile.tellg() < size )
{
cout << binaryFile.tellg() << " , " << size << " , " <<
size - (int)binaryFile.tellg() << endl;
binaryFile.read((char*)block.data,sizeof(block.data));
cout << block.data << endl;
//cin >> block.data;
if (size - (int)binaryFile.tellg() > size)
{
break;
}
}
binaryFile.close();
}
};
Notes :
in the file the numbers are in big endian ( remark )
the goal is to as fast as possible read them then sort them in ascending order ( remark )
Let's start simple, then optimize.
Simple Loop
uint8_t array1[6];
while (my_file.read((char *) &array1[0], 6))
{
Process_Block(&array1[0]);
}
The above code reads in a file, 6 bytes at a time and sends the block to a function.
Meets the requirements, not very optimal.
Reading Larger Blocks
Files are streaming devices. They have an overhead to start streaming, but are very efficient to keep streaming. In other words, we want to read as much data per transaction to reduce the overhead.
static const unsigned int CAPACITY = 6 * 1024;
uint8_t block1[CAPACITY];
while (my_file.read((char *) &block1[0], CAPACITY))
{
const size_t bytes_read = my_file.gcount();
const size_t blocks_read = bytes_read / 6;
uint8_t const * block_pointer = &block1[0];
while (blocks_read > 0)
{
Process_Block(block_pointer);
block_pointer += 6;
--blocks_read;
}
}
The above code reads up to 1024 blocks in one transaction. After reading, each block is sent to a function for processing.
This version is more efficient than the Simple Loop, as it reads more data per transaction. Adjust the CAPACITY to find the optimal size on your platform.
Loop Unrolling
The previous code reduces the first bottleneck of input transfer speed (although there is still room for optimization). Another technique is to reduce the overhead of the processing loop by performing more data processing inside the loop. This is called loop unrolling.
const size_t bytes_read = my_file.gcount();
const size_t blocks_read = bytes_read / 6;
uint8_t const * block_pointer = &block1[0];
while ((blocks_read / 4) != 0)
{
Process_Block(block_pointer);
block_pointer += 6;
Process_Block(block_pointer);
block_pointer += 6;
Process_Block(block_pointer);
block_pointer += 6;
Process_Block(block_pointer);
block_pointer += 6;
blocks_read -= 4;
}
while (blocks_read > 0)
{
Process_Block(block_pointer);
block_pointer += 6;
--blocks_read;
}
You can adjust the quantity of operations in the loop, to see how it affects your program's speed.
Multi-Threading & Multiple Buffers
Another two techniques for speeding up the reading of the data, are to use multiple threads and multiple buffers.
One thread, an input thread, reads the file into a buffer. After reading into the first buffer, the thread sets a semaphore indicating there is data to process. The input thread reads into the next buffer. This repeats until the data is all read. (For a challenge, figure out how to reuse the buffers and notify the other thread of which buffers are available).
The second thread is the processing thread. This processing thread is started first and waits for the first buffer to be completely read. After the buffer has the data, the processing thread starts processing the data. After the first buffer has been processed, the processing thread starts on the next buffer. This repeats until all the buffers have been processed.
The goal here is to use as many buffers as necessary to keep the processing thread running and not waiting.
Edit 1: Other techniques
Memory Mapped Files
Some operating systems support memory mapped files. The OS reads a portion of the file into memory. When a location outside the memory is accessed, the OS loads another portion into memory. Whether this technique improves performance needs to be measured (profiled).
Parallel Processing & Threading
Adding multiple threads may show negligible performance gain. Computers have a data bus (data highway) connecting many hardware devices, including memory, file I/O and the processor. Devices will be paused to let other devices use the data highway. With multiple cores or processors, one processor may have to wait while the other processor is using the data highway. This waiting may cause negligible performance gain when using multiple threads or parallel processing. Also, the operating system has overhead when constructing and maintaining threads.
Try that, the input file is received in argument of the program, as you said I suppose the the 6 bytes values in the file are written in the big endian order, but I do not make assumption for the program reading the file then sorting and it can be executed on both little and big endian (I check the case at the execution)
#include <iostream>
#include <fstream>
#include <vector>
#include <cstdint>
#include <algorithm>
#include <limits.h> // CHAR_BIT
using namespace std;
#if CHAR_BIT != 8
# error that code supposes a char has 8 bits
#endif
int main(int argc, char ** argv)
{
if (argc != 2)
cerr << "Usage: " << argv[1] << " <file>" << endl;
else {
ifstream in(argv[1], ios::binary);
if (!in.is_open())
cerr << "Cannot open " << argv[1] << endl;
else {
in.seekg(0, ios::end);
size_t n = (size_t) in.tellg() / 6;
vector<uint64_t> values(n);
uint64_t * p = values.data(); // for performance
uint64_t * psup = p + n;
in.seekg(0, ios::beg);
int i = 1;
if (*((char *) &i)) {
// little endian
unsigned char s[6];
uint64_t v = 0;
while (p != psup) {
if (!in.read((char *) s, 6))
return -1;
((char *) &v)[0] = s[5];
((char *) &v)[1] = s[4];
((char *) &v)[2] = s[3];
((char *) &v)[3] = s[2];
((char *) &v)[4] = s[1];
((char *) &v)[5] = s[0];
*p++ = v;
}
}
else {
// big endian
uint64_t v = 0;
while (p != psup) {
if (!in.read(((char *) &v) + 2, 6))
return -1;
*p++ = v;
}
}
cout << "file successfully read" << endl;
sort(values.begin(), values.end());
cout << "values sort" << endl;
// DEBUG, DO ON A SMALL FILE ;-)
for (auto v : values)
cout << v << endl;
}
}
}
I am using a large CUDA-matrix library developed within our organization. I need to save the state of a CUDA RNG to take a snapshop of a long-running simulation, and be able to restore it later. This is simple with, e.g., python+numpy:
state = numpy.random.get_state()
# state is a tuple with 5 fields which can be pickled, etc.
...
numpy.random.set_state(state)
I cannot seem to find equivalent functionality in the CUDA host api. You can set the seed and offset, but there is no way to retrieve it to save. The device API seems to offer something like this, but this library uses the host api, and it would be monsterous to change.
The hack-ey solution I am thinking about is to keep track of the number of calls to the RNG (reset when a seed is set), and simply call a RNG function repeatedly. However, I am not sure if the function parameters must be identical, e.g. matrix shapes, etc., to get it to the same state. Similarly, if the number of calls was equivalent to the offset parameter for initializing the RNG, this would work as well, i.e., if I call the RNG 200 times, I could set the offset to 200. However, in python, the offset in the state can increase by more than 1 with each call, so this is also potentially wrong.
Any insights into how to tackle this are appreciated!
For the CURAND Host API, I believe curandSetGeneratorOffset() can probably work for this.
Here's a modified example from the curand host API documentation:
$ cat t721.cu
/*
* This program uses the host CURAND API to generate 10
* pseudorandom floats. And then regenerate those same floats.
*/
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#define CUDA_CALL(x) do { if((x)!=cudaSuccess) { \
printf("Error at %s:%d\n",__FILE__,__LINE__);\
return EXIT_FAILURE;}} while(0)
#define CURAND_CALL(x) do { if((x)!=CURAND_STATUS_SUCCESS) { \
printf("Error at %s:%d\n",__FILE__,__LINE__);\
return EXIT_FAILURE;}} while(0)
int main(int argc, char *argv[])
{
size_t n = 10;
size_t i;
curandGenerator_t gen;
float *devData, *hostData;
/* Allocate n floats on host */
hostData = (float *)calloc(n, sizeof(float));
/* Allocate n floats on device */
CUDA_CALL(cudaMalloc((void **)&devData, n*sizeof(float)));
/* Create pseudo-random number generator */
CURAND_CALL(curandCreateGenerator(&gen,
CURAND_RNG_PSEUDO_DEFAULT));
/* Set seed */
CURAND_CALL(curandSetPseudoRandomGeneratorSeed(gen,
1234ULL));
// generator offset = 0
/* Generate n floats on device */
CURAND_CALL(curandGenerateUniform(gen, devData, n));
// generator offset = n
/* Generate n floats on device */
CURAND_CALL(curandGenerateUniform(gen, devData, n));
// generator offset = 2n
/* Copy device memory to host */
CUDA_CALL(cudaMemcpy(hostData, devData, n * sizeof(float),
cudaMemcpyDeviceToHost));
/* Show result */
for(i = 0; i < n; i++) {
printf("%1.4f ", hostData[i]);
}
printf("\n\n");
CURAND_CALL(curandSetGeneratorOffset(gen, n));
// generator offset = n
CURAND_CALL(curandGenerateUniform(gen, devData, n));
// generator offset = 2n
/* Copy device memory to host */
CUDA_CALL(cudaMemcpy(hostData, devData, n * sizeof(float),
cudaMemcpyDeviceToHost));
/* Show result */
for(i = 0; i < n; i++) {
printf("%1.4f ", hostData[i]);
}
printf("\n");
/* Cleanup */
CURAND_CALL(curandDestroyGenerator(gen));
CUDA_CALL(cudaFree(devData));
free(hostData);
return EXIT_SUCCESS;
}
$ nvcc -o t721 t721.cu -lcurand
$ ./t721
0.7816 0.2338 0.6791 0.2824 0.6299 0.1212 0.4333 0.3831 0.5136 0.2987
0.7816 0.2338 0.6791 0.2824 0.6299 0.1212 0.4333 0.3831 0.5136 0.2987
$
So you'll need to keep track of the quantity of random numbers generated (not the number of RNG function calls) up to the point when you do your checkpoint, and save that.
When you restart, initialize the generator in the same way:
/* Create pseudo-random number generator */
CURAND_CALL(curandCreateGenerator(&gen,
CURAND_RNG_PSEUDO_DEFAULT));
/* Set seed */
CURAND_CALL(curandSetPseudoRandomGeneratorSeed(gen,
1234ULL));
but then advance by the number of previously generated values (n):
CURAND_CALL(curandSetGeneratorOffset(gen, n));
So, it is possible to store and restore the state by tracking the number of 32-bit values generated using curandSetGeneratorOffset. The algorithm looks something like:
template<typename T> RNG(T* X, size_T N /*number of values*/){
...
if (sizeof(T) == 1)
offset += (N+4-1)/4;
else if (sizeof(T) == 2)
offset += (N+2-1)/4;
else if (sizeof(T) == 4 || USING_GENERATE_UNIFORM_DOUBLE)
offset += N;
else if (sizeof(T) == 8)
offset += 2*N;
}
For 8-bit values, advance the offset by the N * next highest multiple of 4, for N values generated. For 16, advance by N * the next multiple of 2. For 32 advance by the N, and for 64 advance by 2*N.
HOWEVER, if you use GenerateUniformDouble, you only need to advance by N, not 2*N. I'm not sure why.
Thanks for the help!
I wrote a simple Python extension module to simulate a 3-bit analog-to-digital converter. It is supposed to accept a floating-point array as its input to return the same size array of output. The output actually consists of quantized input numbers. Here is my (simplified) module:
static PyObject *adc3(PyObject *self, PyObject *args) {
PyArrayObject *inArray = NULL, *outArray = NULL;
double *pinp = NULL, *pout = NULL;
npy_intp nelem;
int dims[1], i, j;
/* Get arguments: */
if (!PyArg_ParseTuple(args, "O:adc3", &inArray))
return NULL;
nelem = PyArray_DIM(inArray,0); /* size of the input array */
pout = (double *) malloc(nelem*sizeof(double));
pinp = (double *) PyArray_DATA(inArray);
/* ADC action */
for (i = 0; i < nelem; i++) {
if (pinp[i] >= -0.5) {
if (pinp[i] < 0.5) pout[i] = 0;
else if (pinp[i] < 1.5) pout[i] = 1;
else if (pinp[i] < 2.5) pout[i] = 2;
else if (pinp[i] < 3.5) pout[i] = 3;
else pout[i] = 4;
}
else {
if (pinp[i] >= -1.5) pout[i] = -1;
else if (pinp[i] >= -2.5) pout[i] = -2;
else if (pinp[i] >= -3.5) pout[i] = -3;
else pout[i] = -4;
}
}
dims[0] = nelem;
outArray = (PyArrayObject *)
PyArray_SimpleNewFromData(1, dims, NPY_DOUBLE, pout);
//Py_INCREF(outArray);
return PyArray_Return(outArray);
}
/* ==== methods table ====================== */
static PyMethodDef mwa_methods[] = {
{"adc", adc, METH_VARARGS, "n-bit Analog-to-Digital Converter (ADC)"},
{NULL, NULL, 0, NULL}
};
/* ==== Initialize ====================== */
PyMODINIT_FUNC initmwa() {
Py_InitModule("mwa", mwa_methods);
import_array(); // for NumPy
}
I expected that if reference counts were processed correctly, the Python garbage collection would (frequently enough) release the memory used by the output array if it has the same name and is used repeatedly. So I tested it on some dummy (but voluminous) data with this code:
for i in xrange(200):
a = rand(1000000)
b = mwa.adc3(a)
print i
Here the array named "b" is reused many times and its memory, borrowed by adc3() from the heap, is expected to be returned to the system. I used the gnome-system-monitor to check. Contrary to my expectations, the memory owned by python grew rapidly and could only be released by quitting the program (I use IPython).
For comparison, I tried the same procedure with the standard NumPy functions, zeros() and copy():
for i in xrange(1000):
a = np.zeros(10000000)
b = np.copy(a)
print i
As you can see, the latter code does not make any memory build-up.
I read many texts in the standard documentation and on the web, tried to use Py_INCREF(outArray) and not to use it. All in vain: the problem persisted.
However, I found the solution in http://wiki.scipy.org/Cookbook/C_Extensions/NumPy_arrays.
The author provides an extension program matsq() that creates an array and returns it. When I tried to use the calls suggested by the author:
outArray = (PyArrayObject *) PyArray_FromDims(nd,dims,NPY_DOUBLE);
pout = (double *) outArray->data;
instead of my
pout = (double *) malloc(nelem*sizeof(double));
outArray = (PyArrayObject *)
PyArray_SimpleNewFromData(1, dims, NPY_DOUBLE, pout);
/* no matter with or without Py_INCREF(outArray)) */
the memory leak gone! The program works properly now.
A question: can anybody explain why PyArray_SimpleNewFromData() does not provide the correct reference counting, while PyArray_FromDims() does?
Thank you very much.
ADDITION. I probably exceeded the room/time in comments, so I add to my comment to Alex here.
I tried to set the OWNDATA flag this way:
outArray->flags |= OWNDATA;
but I got "error: ‘OWNDATA’ undeclared".
The rest is in the comment. Thank you in advance.
SOLVED: The correct setting of the flag is
outArray->flags |= NPY_ARRAY_OWNDATA;
Now it works.
Alex, sorry.
The problem is not with PyArray_SimpleNewFromData which produces a properly refcounted PyObject*. Rather, it's with your malloc, assigned to pout then never freed.
As the docs at http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html clearly state, documenting PyArray_SimpleNewFromData:
the ndarray will not own its data. When this ndarray is
deallocated, the pointer will not be freed.
...
If you want the
memory to be freed as soon as the ndarray is deallocated then simply
set the OWNDATA flag on the returned ndarray.
(my emphasis on the not). IOW, you're observing exactly the "will not be freed" behavior so clearly documented, and are not taking the step specifically recommended should you want to avoid said behavior.