How to do random.choice in C - python

As a Python intermediate learner, I made an 8-ball program in Python.
Now that I am starting to learn C, is there a way to simulate the way random.choice can select a string from a list of strings, but in C?

The closest thing to a "list of strings" in C is an array of string pointers; and the only standard library function that produces random numbers is rand(), defined in <stdlib.h>.
A simple example:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>   // needed for the usual srand() initialization

int main(void)
{
    const char *string_table[] = {   // array of pointers to constant strings
        "alpha",
        "beta",
        "gamma",
        "delta",
        "epsilon"
    };
    int table_size = 5;      // This must match the number of entries above

    srand(time(NULL));       // randomize the start value

    for (int i = 1; i <= 10; ++i)
    {
        const char *rand_string = string_table[rand() % table_size];
        printf("%2d. %s\n", i, rand_string);
    }
    return 0;
}
That will generate and print ten random choices from an array of five strings.
The string_table variable is an array of const char * pointers. You should always use a constant pointer to refer to a literal character string like "alpha". It keeps you from using that pointer in a context where the string contents might be changed.
The random numbers are what are called "pseudorandom"; statistically uncorrelated, but completely determined by a starting "seed" value. Using the statement srand(time(NULL)) takes the current time/date value (seconds since some starting date) and uses that as a seed that won't be repeated in any computer's lifetime. But you will get exactly the same "random" numbers if you manage to run the program twice in the same second. This is easy to do in a shell script, for example. A higher-resolution timestamp would be nice, but there isn't anything useful in the C standard library.
The rand() function returns a non-negative int value from 0 to some implementation-dependent maximum value. The symbolic constant RAND_MAX has that value. The expression rand() % N will return the remainder from dividing that value by N, which is a number from 0 to N-1.
Aconcagua has pointed out that this isn't ideal. If N doesn't evenly divide RAND_MAX + 1 (the number of possible rand() values), there will be a bias toward smaller numbers. It's okay for now, but plan to learn other methods later if you do serious simulation or statistical work; and if you get to that point, you probably won't use the built-in rand() function anyway.
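If you eventually want an index without that bias, one common approach (just a sketch, not the only option) is to reject the values of rand() that fall in the incomplete top range and draw again:
#include <stdlib.h>

/* Sketch of an unbiased index in [0, n-1]; assumes 0 < n <= RAND_MAX.
   Values of rand() in the incomplete top range are thrown away and redrawn,
   so every index is equally likely. */
int unbiased_index(int n)
{
    int limit = RAND_MAX - (RAND_MAX % n);  /* a multiple of n */
    int r;
    do {
        r = rand();
    } while (r >= limit);
    return r % n;
}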

You can write a function if you know the size of your array: use rand() % size to get a random index into your array, then return the value of arr[randidx].
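A rough sketch of such a helper, with purely illustrative names, might look like this:
#include <stdlib.h>

/* Sketch of a random.choice-style helper (names are illustrative, not standard).
   Assumes srand() has already been called once and size > 0. */
const char *random_choice(const char *arr[], int size)
{
    int randidx = rand() % size;   /* index in [0, size-1] */
    return arr[randidx];
}
It could then be called as random_choice(string_table, table_size) with the array from the earlier example.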

Related

Skip duplicates for EMR

I currently have an enormous number of medical records consisting of medical terms that need to be translated. For cost reasons, we don't want to translate every term in every record. For example, if we find that the terms in a record have already appeared frequently in previous records, which means these terms have probably already been translated, then we don't want to translate them again. I was asked to design a program to accomplish this goal. The hints I got are that I may need to break the records down to the alphabet level, and that a matrix may be needed to solve this problem. I am literally a beginner in programming, so I'm looking for help here. Brutal thoughts/suggestions are enough for now. Thanks.
[Edit by Spektre] moved from comments
My problem boils down to this:
Say there are two sentences A and B. A has m tokens (a1, a2, ..., am) and B has n tokens (b1, b2, ..., bn). A and B might have some tokens in common. I need a function to estimate the likelihood that the tokens in B are not covered by A.
The tokens are already stored in a dictionary.
How to implement this?
So if I understand it right, you want to know whether bi is in A or not.
I do not code in Python, but I see it like this (in C++-like languages):
bool untranslated(int j, int m, int n, string *a, string *b)
{
    // the dictionaries are: a[m], b[n]
    for (int i = 0; i < m; i++)   // inspect all tokens of A
        if (b[j] == a[i])         // if b[j] is present in A
            return false;
    return true;
}
Now, if the dictionaries are rather large, you need to change this linear search to a binary search. Also, to speed things up (if the words are long), you can use hashes (a hash map) for matching. Of course, depending on your natural language, you may not be able to compare words naively with ==; instead, implement some function that converts each word into its simplest grammatical form and store just that in the dictionary. That can be pretty complicated to implement.
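To illustrate the binary-search idea in plain C (only a sketch; the names are made up, and it assumes the dictionary A is an array of string pointers that has been sorted once with qsort):
#include <stdlib.h>
#include <string.h>

/* Comparison function for an array of `const char *` elements. */
static int cmp_str(const void *pa, const void *pb)
{
    const char *a = *(const char * const *)pa;
    const char *b = *(const char * const *)pb;
    return strcmp(a, b);
}

/* Sketch: sort dictionary a[] once, then each membership test is O(log m). */
int untranslated(const char *word, const char **a, size_t m)
{
    return bsearch(&word, a, m, sizeof *a, cmp_str) == NULL;
}

/* usage:
   qsort(a, m, sizeof *a, cmp_str);      // do this once
   if (untranslated(b[j], a, m)) ...     // then test each word of B
*/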
Now the probability for the whole sentence would be:
// your dictionaries:
const int m=?, n=?;
string A[m], B[n];
// code:
int j; float p;
for (p = 0.0, j = 0; j < n; j++)           // test all words of B
    if (untranslated(j, m, n, A, B)) p++;  // and count how many are untranslated
p /= float(n);   // normalize p to <0,1>; it is the probability that sentence B is not covered by A
The resulting probability p is in the range <0,1>, so if you want a percentage instead, just multiply it by 100.
[Edit1] occurrence of bi
That is an entirely different problem, but it is also relatively easy to solve. It is the same as computing a histogram, so:
add a counter for each word in the A dictionary
so each record of A will look like this:
struct A_record
{
    string word;
    int cnt;
};

int m = 0;               // number of records currently stored in a[]
A_record a[MAX_WORDS];   // dictionary storage; MAX_WORDS is a placeholder for whatever upper bound you need
process the B sentences
For each word bi, look it up in dictionary A. If it is not present there, add it to the dictionary and set its counter to 1. If it is present, just increment its counter by one instead.
const int n=?;        // input sentence word count
string b[n] = {...};  // input sentence words
int i, j;
for (i = 0; i < n; i++)          // process B
{
    for (j = 0; j < m; j++)      // search in A (should be a binary search or hash-map lookup)
        if (b[i] == a[j].word)
            { a[j].cnt++; break; }   // a[j].cnt is the occurrence count of b[i] you wanted; divide by the total word count for a probability in <0,1>
    if (j >= m)
        { a[m].word = b[i]; a[m].cnt = 1; m++; }   // no previous occurrence of b[i]: add it with count 1
}
Now, if you just want the previous occurrences of bi, look at the matched a[j].cnt during the search. If you want the occurrence count of any b[i] word in the whole text, look at the same counter after the whole text has been processed.

What numbers that I can put in numpy.random.seed()?

I have noticed that you can put various numbers inside of numpy.random.seed(), for example numpy.random.seed(1), numpy.random.seed(101). What do the different numbers mean? How do you choose the numbers?
Consider a very basic random number generator:
Z[i] = (a*Z[i-1] + c) % m
Here, Z[i] is the ith random number, a is the multiplier, c is the increment, and m is the modulus; for different a, c and m combinations you get different generators. This is known as the linear congruential generator introduced by Lehmer. The remainder of that division (the % operation) will generate a number between zero and m-1, and by setting U[i] = Z[i] / m you get random numbers between zero and one.
As you may have noticed, in order to start this generative process (in order to have a Z[1]) you need an initial value Z[0]. That initial value, which starts the process, is called the seed. Take a simple example: suppose the seed is set to 7. The value 7 is not itself returned as a random number; instead it is used as Z[0] to generate the first Z[1], and the sequence continues deterministically from there.
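A tiny sketch in C of such a generator may make this concrete; the constants a, c and m are arbitrary example values, chosen only to show that the same seed reproduces the same sequence:
#include <stdio.h>

/* A toy linear congruential generator, only to illustrate the role of the seed. */
int main(void)
{
    unsigned long long a = 1103515245ULL, c = 12345ULL, m = 2147483648ULL;
    unsigned long long z = 7;                 /* the seed: Z[0] = 7 */
    for (int i = 1; i <= 5; ++i) {
        z = (a * z + c) % m;                  /* Z[i] = (a*Z[i-1] + c) % m */
        printf("Z[%d] = %llu   U[%d] = %f\n", i, z, i, (double)z / m);
    }
    return 0;   /* running again with the same seed prints exactly the same numbers */
}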
The most important feature of a pseudo-random number generator is its unpredictability. Generally, as long as you don't share your seed, you are fine with any seed, since the generators in use today are much more complex than this one. However, as a further step you can also generate the seed itself randomly. Skipping the first n numbers is another alternative.
Main source: Law, A. M. (2007). Simulation modeling and analysis. Tata McGraw-Hill.
The short answer:
There are three ways to seed() a random number generator in numpy.random:
use no argument or use None -- the RNG initializes itself from the OS's random number generator (which generally is cryptographically random)
use some 32-bit integer N -- the RNG will use this to initialize its state based on a deterministic function (same seed → same state)
use an array-like sequence of 32-bit integers n0, n1, n2, etc. -- again, the RNG will use this to initialize its state based on a deterministic function (same values for seed → same state). This is intended to be done with a hash function of sorts, although there are magic numbers in the source code and it's not clear why they are doing what they're doing.
If you want to do something repeatable and simple, use a single integer.
If you want to do something repeatable but unlikely for a third party to guess, use a tuple or a list or a numpy array containing some sequence of 32-bit integers. You could, for example, use numpy.random with a seed of None to generate a bunch of 32-bit integers (say, 32 of them, which would generate a total of 1024 bits) from the OS's RNG, store in some seed S which you save in some secret place, then use that seed to generate whatever sequence R of pseudorandom numbers you wish. Then you can later recreate that sequence by re-seeding with S again, and as long as you keep the value of S secret (as well as the generated numbers R), no one would be able to reproduce that sequence R. If you just use a single integer, there's only 4 billion possibilities and someone could potentially try them all. That may be a bit on the paranoid side, but you could do it.
Longer answer
The numpy.random module uses the Mersenne Twister algorithm, which you can confirm yourself in one of two ways:
Either by looking at the documentation for numpy.random.RandomState, of which numpy.random uses an instance for the numpy.random.* functions (though you can also use an isolated, independent instance of it),
Or by looking at the source code in mtrand.pyx, which uses something called Pyrex to wrap a fast C implementation, along with randomkit.c and initarray.c.
In any case here's what the numpy.random.RandomState documentation says about seed():
Compatibility Guarantee A fixed seed and a fixed series of calls to RandomState methods using the same parameters will always produce the same results up to roundoff error except when the values were incorrect. Incorrect values will be fixed and the NumPy version in which the fix was made will be noted in the relevant docstring. Extension of existing parameter ranges and the addition of new parameters is allowed as long the previous behavior remains unchanged.
Parameters:
seed : {None, int, array_like}, optional
Random seed used to initialize the pseudo-random number generator. Can be any integer between 0 and 2**32 - 1 inclusive, an array (or other sequence) of such integers, or None (the default). If seed is None, then RandomState will try to read data from /dev/urandom (or the Windows analogue) if available or seed from the clock otherwise.
It doesn't say how the seed is used, but if you dig into the source code it refers to the init_by_array function: (docstring elided)
def seed(self, seed=None):
    cdef rk_error errcode
    cdef ndarray obj "arrayObject_obj"
    try:
        if seed is None:
            with self.lock:
                errcode = rk_randomseed(self.internal_state)
        else:
            idx = operator.index(seed)
            if idx > int(2**32 - 1) or idx < 0:
                raise ValueError("Seed must be between 0 and 2**32 - 1")
            with self.lock:
                rk_seed(idx, self.internal_state)
    except TypeError:
        obj = np.asarray(seed).astype(np.int64, casting='safe')
        if ((obj > int(2**32 - 1)) | (obj < 0)).any():
            raise ValueError("Seed must be between 0 and 2**32 - 1")
        obj = obj.astype('L', casting='unsafe')
        with self.lock:
            init_by_array(self.internal_state, <unsigned long *>PyArray_DATA(obj),
                          PyArray_DIM(obj, 0))
And here's what the init_by_array function looks like:
extern void
init_by_array(rk_state *self, unsigned long init_key[], npy_intp key_length)
{
    /* was signed in the original code. RDH 12/16/2002 */
    npy_intp i = 1;
    npy_intp j = 0;
    unsigned long *mt = self->key;
    npy_intp k;

    init_genrand(self, 19650218UL);
    k = (RK_STATE_LEN > key_length ? RK_STATE_LEN : key_length);
    for (; k; k--) {
        /* non linear */
        mt[i] = (mt[i] ^ ((mt[i - 1] ^ (mt[i - 1] >> 30)) * 1664525UL))
            + init_key[j] + j;
        /* for > 32 bit machines */
        mt[i] &= 0xffffffffUL;
        i++;
        j++;
        if (i >= RK_STATE_LEN) {
            mt[0] = mt[RK_STATE_LEN - 1];
            i = 1;
        }
        if (j >= key_length) {
            j = 0;
        }
    }
    for (k = RK_STATE_LEN - 1; k; k--) {
        mt[i] = (mt[i] ^ ((mt[i - 1] ^ (mt[i - 1] >> 30)) * 1566083941UL))
            - i; /* non linear */
        mt[i] &= 0xffffffffUL; /* for WORDSIZE > 32 machines */
        i++;
        if (i >= RK_STATE_LEN) {
            mt[0] = mt[RK_STATE_LEN - 1];
            i = 1;
        }
    }

    mt[0] = 0x80000000UL; /* MSB is 1; assuring non-zero initial array */
    self->gauss = 0;
    self->has_gauss = 0;
    self->has_binomial = 0;
}
This essentially "munges" the random number state in a nonlinear, hash-like method using each value within the provided sequence of seed values.
What is normally called a random number sequence in reality is a "pseudo-random" number sequence because the values are computed using a deterministic algorithm and probability plays no real role.
The "seed" is a starting point for the sequence and the guarantee is that if you start from the same seed you will get the same sequence of numbers. This is very useful for example for debugging (when you are looking for an error in a program you need to be able to reproduce the problem and study it, a non-deterministic program would be much harder to debug because every run would be different).
Basically the number guarantees the same 'randomness' every time.
More properly, the number is a seed, which can be an integer, an array (or other sequence) of integers of any length, or the default (None). If seed is None, then random will try to read data from /dev/urandom if available, or make a seed from the clock otherwise.
Edit: In all honesty, as long as your program isn't something that needs to be super secure, it shouldn't matter what you pick. If it does need to be secure, don't use these methods; use os.urandom() or SystemRandom if you require a cryptographically secure pseudo-random number generator.
The most important concept to understand here is that of pseudo-randomness. Once you understand this idea, you can determine if your program really needs a seed etc. I'd recommend reading here.
To understand the meaning of random seeds, you first need to understand that a "random" number sequence is really a "pseudo-random" sequence, because the values are computed using a deterministic algorithm.
So you can think of this number as a starting value used to calculate the next number you get from the random generator. Putting the same value here will make your program get the same "random" values every time, so your program becomes deterministic.
As said in this post
they (numpy.random and random.random) both use the Mersenne twister sequence to generate their random numbers, and they're both completely deterministic - that is, if you know a few key bits of information, it's possible to predict with absolute certainty what number will come next.
If you really care about randomness, ask the user to generate some noise (some arbitrary words) or just use the system time as the seed.
If your code runs on an Intel CPU (or a recent AMD chip), I also suggest you look at the RdRand package, which uses the CPU instruction rdrand to collect "true" (hardware) randomness.
Refs:
Random seed
What is a seed in terms of generating a random number
One very specific answer: np.random.seed can take values from 0 to 2**32 - 1, which interestingly differs from random.seed, which can take any hashable object.
A side comment: it is better to set your seed to a rather large number, but still within the generator's limit. Doing so gives the seed a good balance of 0 and 1 bits; avoid a seed with many 0 bits.
Reference: PyTorch documentation

Long numbers in C++

I have this program in Python:
# ...
print 2 ** (int(input())-1) % 1000000007
The problem is that this program takes a long time on big numbers. I rewrote my code in C++, but sometimes I get a wrong answer. For example, in the Python code for the number 12345678 I get 749037894, which is correct, but in C++ I get -291172004.
This is the C++ code:
#include <iostream>
#include <cmath>
using namespace std;

const int MOD = 1e9 + 7;

int main() {
    // ...
    long long x;
    cin >> x;
    long long a = pow(2, (x-1));
    cout << a % MOD;
}
As already mentioned, your problem is that for a large exponent you get integer overflow.
To overcome this, remember that modular multiplication has the property that:
(A * B) mod C = (A mod C * B mod C) mod C
You can then implement an 'e to the power p modulo m' function using a fast exponentiation scheme.
Assuming no negative powers:
long long powmod(long long e, long long p, long long m){
    if (p == 0){
        return 1;
    }
    long long a = 1;
    while (p > 1){
        if (p % 2 == 0){
            e = (e * e) % m;
            p /= 2;
        } else{
            a = (a * e) % m;
            e = (e * e) % m;
            p = (p - 1) / 2;
        }
    }
    return (a * e) % m;
}
Note that the remainder is taken after every multiplication, so no overflow can occur as long as a single multiplication doesn't overflow (and that holds for 1000000007 as m with long long).
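As a minimal sketch of how the original program might call it (assuming the powmod above is in scope; written with stdio so it compiles as either C or C++):
#include <stdio.h>

/* long long powmod(long long e, long long p, long long m);  -- defined above */

int main(void)
{
    long long x;
    if (scanf("%lld", &x) == 1)
        printf("%lld\n", powmod(2, x - 1, 1000000007LL));   /* e.g. x = 12345678 -> 749037894 */
    return 0;
}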
You seem to be dealing with positive numbers that are overflowing the number of bits you've allocated for their storage. Also keep in mind that there is a difference between Python and C/C++ in how the modulo of a negative value is computed. To get the same result as Python, you will need to add the modulus to the value so it's positive before you take the modulo, which is the way it works in Python:
cout << (a+MOD) % MOD;
You may have to add MOD n times till the temporary value is positive before taking its modulo.
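One compact way to do that adjustment (a sketch: take the remainder first, then add the modulus once if it came out negative) is:
/* Sketch: bring a possibly-negative value into [0, mod-1], matching Python's % behavior. */
long long positive_mod(long long a, long long mod)
{
    long long r = a % mod;
    return (r < 0) ? r + mod : r;
}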
As has been mentioned in many of the other answers, your problem lies in integer overflow.
You can do as deniss suggested and implement your own modmul() and modpow() functions.
If, however, this is part of a project that will need to do plenty of calculations with very large numbers, I would suggest using a "big number" library like GNU GMP or the mbedTLS Bignum library.
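For illustration, a rough GMP sketch of the same computation might look like this (mpz_powm does the modular exponentiation on arbitrary-precision integers, so overflow is not a concern):
#include <stdio.h>
#include <gmp.h>    /* GNU GMP; link with -lgmp */

int main(void)
{
    unsigned long x = 12345678UL;
    mpz_t base, exp, mod, result;

    mpz_init_set_ui(base, 2UL);
    mpz_init_set_ui(exp, x - 1);
    mpz_init_set_ui(mod, 1000000007UL);
    mpz_init(result);

    mpz_powm(result, base, exp, mod);    /* result = 2^(x-1) mod 1000000007 */
    gmp_printf("%Zd\n", result);

    mpz_clears(base, exp, mod, result, NULL);
    return 0;
}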
In C++ the various fundamental types have fixed sizes. For example, a long long is typically 64 bits wide, but the width varies with the system and other factors. As suggested above, you can check <climits> (or <limits.h>) for your particular environment's limits.
Raising 2 to the power 12345677 amounts to shifting a 1 bit left by 12345677 places, which won't fit in a 64-bit long long (and I suspect won't fit in any common long long implementation).
Another factor to consider is that pow returns a double (or long double, depending on the overload used). You don't say what compiler you are using, but most likely you got a warning about possible truncation or data loss when the result of calling pow is assigned to the long long variable a.
Finally, even with long double, the result of raising 2 to the power 12345677 is far too large to be represented, so pow is almost certainly returning positive infinity, which then gets converted to some bit pattern that will fit in a long long. You could check that by introducing an intermediate long double variable to receive the value of pow, which you could then examine in a debugger.
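A small sketch that demonstrates the problem (the conversion of the infinite value to an integer is commented out, because that conversion is undefined behavior):
#include <stdio.h>
#include <math.h>

int main(void)
{
    double d = pow(2.0, 12345677.0);    /* far outside the range of double */
    printf("pow result: %f\n", d);      /* typically prints "inf" */
    /* long long a = (long long)d;         <- undefined behavior: value not representable */
    return 0;
}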

Converting Algorithm from Python to C: Suggestions for Using bin() in C?

So essentially, I have a homework problem to write in C, and instead of taking the easy route, I thought I would implement a little algorithm for some coding practice and to impress my professor. The assignment is meant to help us pick up C (or review it; the former for me), and it asks us to return all of the integers that divide a given integer (such that there is no remainder).
What I did in Python was to create an is_prime() method, a pool_of_primes() method, and a combinations() method. So far, I have everything done in C up to the combinations() method. The problem I am running into now is some syntax errors (e.g. not being able to alter a string by assignment) and mainly the binary string I was using to track which elements are included in my list of combinations. Without being able to alter my string that way, the Python approach is kind of broken in C...
Here is the python code:
def combinations(aList):
    '''
    The idea is to provide a list of ints and combinations will provide
    all of the combinations of that list using binary.
    To track the combinations, we use a string representation of binary
    and count down from there. Each spot in the binary represents an
    on/off (included/excluded) indicator for the numbers.
    '''
    length = len(aList)  # Have this figured out
    s = ""
    canidates = 0
    nList = []
    if (length >= 21):
        print("\nToo many possible canidates for integers that divide our number.\n")
        return False
    for i in range(0, length):
        s += "1"
        canidates += pow(2, i)
    # We now have a string for on/off switch of the elements in our
    # new list. Canidates is the size of the new list.
    nList.append(1)
    while (canidates != 0):
        x = 1
        for i in range(0, length):
            if (int(s[i]) == 1):
                x = x * aList[i]
        nList.append(x)
        canidates -= 1
        s = ''
        temp = bin(canidates)
        for i in range(2, len(temp)):
            s = s + temp[i]
        if (len(s) != length):
            # This part is needed in cases of [1...000-1 = 0...111]
            while (len(s) != length):
                s = '0' + s
    return nList
Sorry if the entire code is too lengthy or not optimized to your liking. But it works, and it works well :)
Again, I currently have everything that aList would hold stored as a singly-linked list in C (which I am able to print/use). I also have a little macro I included in C to convert binary to an integer:
#define B(x) S_to_binary_(#x)

static inline unsigned long long S_to_binary_(const char *s)
{
    unsigned long long i = 0;
    while (*s) {
        i <<= 1;
        i += *s++ - '0';
    }
    return i;
}
This may be Coder's Block setting in, but I am not seeing how I can change the binary in the same way that I did in Python... Any help would be greatly appreciated! Also, as a note, what is typically the best way to return a finalized code in C?
EDIT:
Accidentally took credit for the macro above.
UPDATE
I just finished the code, and I uploaded it onto GitHub. I would like to thank @nneonneo for providing the step that I needed to finish it, with exemplary code. If anyone has any further suggestions about the code, I would be happy to see their ideas on GitHub!
Why use a string at all? Keep it simple: use an integer, and use bitwise math to work with the number. Then you don't have to do any conversions back and forth. It will also be loads faster.
You can use a uint32_t to store the "bits", which is enough to hold 32 bits (since you max out at 21, this should work great).
For example, you can loop over the bits that are set by using a loop like this:
uint32_t my_number = ...;

for(int i=0; i<32; i++) {
    if(my_number & (1<<i)) {
        /* bit i is set */
    }
}
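Building on that, the whole combinations() idea can be sketched with a bitmask counting down, much like the binary string in the Python version; the array contents and names here are only an example:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int aList[] = { 2, 3, 5 };                 /* example factors */
    int length = 3;                            /* must be <= 21, as in the Python version */
    uint32_t candidates = (1u << length) - 1;  /* all bits set: the starting "111" string */

    for (uint32_t mask = candidates; ; mask--) {   /* count down, including mask == 0 */
        long long x = 1;
        for (int i = 0; i < length; i++)
            if (mask & (1u << i))              /* bit i on means aList[i] is included */
                x *= aList[i];
        printf("%lld\n", x);                   /* one product per subset */
        if (mask == 0)
            break;                             /* mask 0 is the empty product (1), like nList.append(1) */
    }
    return 0;
}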

Testing wchar_t* for convertible characters

I'm working on talking to a library that handles strings as wchar_t arrays. I need to convert these to char arrays so that I can hand them over to Python (using SWIG and Python's PyString_FromString function). Obviously not all wide characters can be converted to chars. According to the documentation for wcstombs, I ought to be able to do something like
wcstombs(NULL, wideString, wcslen(wideString))
to test the string for unconvertible characters -- it's supposed to return -1 if there are any. However, in my test case it's always returning -1. Here's my test function:
void getString(wchar_t* target, int size) {
    int i;
    for(i = 0; i < size; ++i) {
        target[i] = L'a' + i;
    }
    printf("Generated %d characters, nominal length %d, compare %d\n", size,
           wcslen(target), wcstombs(NULL, target, size));
}
This is generating output like this:
Generated 32 characters, nominal length 39, compare -1
Generated 16 characters, nominal length 20, compare -1
Generated 4 characters, nominal length 6, compare -1
Any idea what I'm doing wrong?
On a related note, if you know of a way to convert directly from wchar_t*s to Python unicode strings, that'd be welcome. :) Thanks!
Clearly, as you found, it's essential to zero-terminate your input data.
Regarding the final paragraph, I would convert from wide to UTF-8 and call PyUnicode_FromString.
Note that I am assuming you are using Python 2.x, it's presumably all different in Python 3.x.
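As an alternative to the wide-to-UTF-8 route, CPython also exposes PyUnicode_FromWideChar, which builds a unicode object directly from a wchar_t buffer (I believe it is available in both 2.x and 3.x); a minimal, untested sketch:
#include <Python.h>
#include <wchar.h>

/* Sketch: wrap a properly terminated wchar_t string as a Python unicode object.
   Returns a new reference, or NULL with a Python exception set on failure. */
PyObject *wide_to_unicode(const wchar_t *w)
{
    return PyUnicode_FromWideChar(w, wcslen(w));
}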
