I'm new to this topic, so this question may be dumb. I did some experiments; the results and their occurrences are listed below. I need to convert these discrete numbers into a probability distribution and a cumulative distribution (the x-axis is the result and the y-axis is the probability).
import pandas as pd
data = {'Result': [1, 2, 4, 6],
        'Occurrence': [2, 3, 4, 1],
        'Probability': [0.2, 0.3, 0.4, 0.1]}
df = pd.DataFrame(data)
Then I need to find the x corresponding to different probability levels in the cumulative distribution, say 50%, 60%, 80%, etc.
I did some research but cannot find the right Python package or function to achieve this. A package or function name would be good, and an example would be great. Thanks.
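For what it's worth, here is a minimal sketch of one way to do this with pandas and scipy.stats.rv_discrete (the helper columns and the exact calls are my own illustration, not from the question):

import numpy as np
import pandas as pd
from scipy import stats

data = {'Result': [1, 2, 4, 6],
        'Occurrence': [2, 3, 4, 1]}
df = pd.DataFrame(data)

# probability mass and cumulative distribution from the occurrence counts
df['Probability'] = df['Occurrence'] / df['Occurrence'].sum()
df['Cumulative'] = df['Probability'].cumsum()
print(df)

# rv_discrete wraps the same numbers as a distribution object
dist = stats.rv_discrete(values=(df['Result'].values, df['Probability'].values))
print(dist.cdf(4))     # 0.9 - cumulative probability up to and including 4
print(dist.ppf(0.5))   # 2.0 - smallest x whose cumulative probability reaches 50%
print(dist.ppf(0.8))   # 4.0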
Working with conditional distributions, probability distributions, and cumulative distributions, you are going to end up using three different programming styles - lambda, functional, and object-oriented - to represent conditional, probability, and cumulative statistics.
Much like statistical maths, computational maths varies significantly in scope.
For cumulative distributions it seems natural to use lambda functions to represent the data.
Lambda functions let you write expressions for relationships: a function is just the sum of the steps that relate two objects, or the sum of the conditions that make up a codomain.
You can use lambda to create anonymous functions that represent possible relationships while you deal with x and y; here, lambda stands in for the function definition.
y = codomain
x = domain
f(x) = the lambda function; we don't need to fix the actual steps just yet.
anon_steps = 'unknown'

def myFunction(domain):
    # map a domain value to a codomain value
    if domain == anon_steps:
        return 1
    return 0

function_object = myFunction('unknown')
# this is a functional (named-function) relationship

same_relationship = lambda domain: 1 if domain == anon_steps else 0
# this is the same relationship expressed as a lambda
In the example above, lambda is the definition of the function, and the expression after the colon is the code the function evaluates.
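A small sketch tying this back to the question's data, with the probability rule written as a lambda (the variable names are mine):

occurrences = [2, 3, 4, 1]
total = sum(occurrences)

# probability of one result, expressed as a lambda
prob = lambda count: count / total

# cumulative probabilities built from that same lambda
cumulative = []
running = 0.0
for count in occurrences:
    running += prob(count)
    cumulative.append(running)

print(cumulative)   # approximately [0.2, 0.5, 0.9, 1.0]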
I'm trying to find an existing algorithm for the following problem:
For example, say we have 3 variables, x, y, z (all must be integers).
I want to find values for all variables that MUST satisfy some constraints, such as x+y<4, x<=50, z>x, etc.
In addition, there are extra POSSIBLE constraints, like y>=20, etc. (of the same form as before).
The objective function (whose value I'm interested in maximizing) is the number of EXTRA constraints that are met in the optimal solution (the "must" constraints, plus the fact that all values are integers, are mandatory; without them there's no valid solution).
If you use OR-Tools: since the model is integral, I would recommend CP-SAT, as it offers indicator constraints with a nice API.
The API would be:
b = model.NewBoolVar('indicator variable')
model.Add(x + 2 * y >= 5).OnlyEnforceIf(b)
...
model.Maximize(sum(indicator_variables))
To get maximal performance, I would recommend using parallelism.
solver = cp_model.CpSolver()
solver.parameters.log_search_progress = True
solver.parameters.num_search_workers = 8 # or more on a bigger computer
status = solver.Solve(model)
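Putting the pieces together, here is a self-contained sketch of the whole model using the question's variables (the variable domains and the second soft constraint are assumptions of mine, not part of the question):

from ortools.sat.python import cp_model

model = cp_model.CpModel()

# domains are an assumption; pick bounds wide enough for your real problem
x = model.NewIntVar(-100, 100, 'x')
y = model.NewIntVar(-100, 100, 'y')
z = model.NewIntVar(-100, 100, 'z')

# hard ("must") constraints: always enforced
model.Add(x + y <= 3)        # x + y < 4, written for integers
model.Add(x <= 50)
model.Add(z >= x + 1)        # z > x

# soft ("possible") constraints: each enforced only if its indicator is true
indicators = []

b1 = model.NewBoolVar('y_ge_20')
model.Add(y >= 20).OnlyEnforceIf(b1)
indicators.append(b1)

b2 = model.NewBoolVar('z_le_10')   # a second, made-up soft constraint
model.Add(z <= 10).OnlyEnforceIf(b2)
indicators.append(b2)

# maximize the number of soft constraints that hold
model.Maximize(sum(indicators))

solver = cp_model.CpSolver()
solver.parameters.num_search_workers = 8
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print(solver.Value(x), solver.Value(y), solver.Value(z),
          int(solver.ObjectiveValue()))

Each indicator that the solver can set to true without violating the hard constraints adds 1 to the objective, so the result tells you how many of the POSSIBLE constraints can be met, together with a witness assignment.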
I want to replace an existing random number based data generator (in Python) with a hash based one so that it no longer needs to generate everything in sequence, as inspired by this article.
I can create a float from 0 to 1 by taking the integer version of the hash and dividing it by the maximum value of a hash.
I can create a flat integer range by taking the float and multiplying by the flat range. I could probably use modulo and live with the bias, as the hash range is large and my flat ranges are small.
How could I use the hash to create a gaussian or normal distributed floating point value?
For all of these cases, would I be better off just using my hash as a seed for a new random.Random object and using the functions in that class to generate my numbers and rely on them to get the distribution characteristics right?
At the moment, my code is structured like this:
from random import randint, choice

num_people = randint(1, 100)
people = [dict() for x in range(num_people)]
for person in people:
    person['surname'] = choice(surname_list)
    person['forename'] = choice(forename_list)
The problem is that for a given seed to be consistent, I have to generate all the people in the same order, and I have to generate the surname then the forename. If I add a middle name in between the two then the generated forenames will change, as will all the names of all the subsequent people.
I want to structure the code like this:
h1_groupseed=1
h2_peoplecount=1
h2_people=2
h4_surname=1
h4_forename=2
num_people = pghash([h1_groupseed,h2_peoplecount]).hashint(1,100)
people = [dict() for x in range(num_people)]
for h3_index, person in enumerate(people,1):
person['surname'] = surname_list[pghash([h1_groupseed,h2_people,h3_index,h4_surname]).hashint(0, num_of_surnames - 1)]
person['forename'] = forename_list[pghash([h1_groupseed,h2_people,h3_index,h4_forename]).hashint(0, num_of_forenames - 1)]
This would use the values passed to pghash to generate a hash, and use that hash to somehow create the pseudorandom result.
First, a big caveat: DO NOT ROLL YOUR OWN CRYPTO.
If you're trying to do this for security purposes, DON'T.
Next, check out this question which lists several ways to do what you want, i.e. transform a random uniform variable into a normal one:
Converting a Uniform Distribution to a Normal Distribution
Unless you're doing this for your own amusement or as a learning exercise, my very strong advice is don't do this.
PRNGs have the same general structure, even if the details are wildly different. They map a seed value s into an initial state S_0 via some function f: S_0 ← f(s); they then iterate states via some transformation h: S_{i+1} ← h(S_i); and finally they map the state to an output U_i via some function g: U_i ← g(S_i). (For simple PRNGs, f() or g() are often identity functions. For more sophisticated generators such as Mersenne Twister, more is involved.)
The state transition function h() is designed to distribute new states uniformly across the state space. In other words, it's already a hash function, but with the added benefit that for any widely accepted generator it has been heavily vetted by experts to have good statistical behavior.
Mersenne Twister, Python's default PRNG, has been mathematically proven to have k-tuples be jointly uniformly distributed for all k ≤ 623. I'm guessing that whatever hash function you choose can't make such claims. Additionally, the collapsing function g() should preserve uniformity in the outcomes. You've proposed that you "can use the integer version of the hash to create a flat number range, just by taking the modulus." In general this will introduce modulo bias, so you won't end up with a uniformly distributed result.
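As a tiny illustration of that bias (with the ranges shrunk so the effect is obvious; with a 64-bit hash and a small target range the bias is small, but it never vanishes):

from collections import Counter

# map a uniform 3-bit value (0..7) onto 0..2 with modulo: 0 and 1 each occur
# three times, 2 only twice, so the result is no longer uniform
print(Counter(v % 3 for v in range(8)))   # Counter({0: 3, 1: 3, 2: 2})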
If you stick with the built-in PRNG, there's no reason not to use the built-in Gaussian generator. If you want to do it for your own amusement there are lots of resources that will tell you how to map uniforms to Gaussians. Well-known methods include the Box-Muller method, Marsaglia's polar method, and the ziggurat method.
UPDATE
Given the additional information you've provided in your question, I think the answer you want is contained in this section of Python's documentation for random:
The functions supplied by this module are actually bound methods of a
hidden instance of the random.Random class. You can instantiate your
own instances of Random to get generators that don’t share state. This
is especially useful for multi-threaded programs, creating a different
instance of Random for each thread, and using the jumpahead() method
to make it likely that the generated sequences seen by each thread
don’t overlap.
Sounds like you want separate instances of Random for each person, seeded independently of each other or with synchronized but widely separated states as described in the random.jumpahead() documentation. This is one of the approaches that simulation modelers have used since the early 1950s so they can maintain repeatability between configurations to make direct comparisons of two or more systems in a fair fashion. Check out the discussion of "synchronization" on the second page of this article, or starting on page 8 of this book chapter, or pick up any of the dozens of simulation textbooks available in most university libraries and read the sections on "common random numbers." (I'm not pointing you towards Wikipedia because it provides almost no details on this topic.)
Here's an explicit example showing creating multiple instances of Random:
import random as rnd
print("two PRNG instances with identical seeding produce identical results:")
r1 = rnd.Random(12345)
r2 = rnd.Random(12345)
for _ in range(5):
print([r1.normalvariate(0, 1), r2.normalvariate(0, 1)])
print("\ndifferent seeding yields distinct but reproducible results:")
r1 = rnd.Random(12345)
r2 = rnd.Random(67890)
for _ in range(3):
print([r1.normalvariate(0, 1), r2.normalvariate(0, 1)])
print("\nresetting, different order of operations")
r1 = rnd.Random(12345)
r2 = rnd.Random(67890)
print("r1: ", [r1.normalvariate(0, 1) for _ in range(3)])
print("r2: ", [r2.normalvariate(0, 1) for _ in range(3)])
I have gone ahead and created a simple hash-based replacement for some of the functions in the random.Random class:
from __future__ import division
import xxhash
from numpy import sqrt, log, sin, cos, pi

def gaussian(u1, u2):
    # Box-Muller transform: two uniforms (u1 > 0) -> two standard normals
    z1 = sqrt(-2*log(u1))*cos(2*pi*u2)
    z2 = sqrt(-2*log(u1))*sin(2*pi*u2)
    return z1, z2

class pghash:
    def __init__(self, parts, seed=0, sep=','):
        # join the key parts (str() so integer keys are accepted) and hash them
        self.hex = xxhash.xxh64(sep.join(str(p) for p in parts), seed=seed).hexdigest()

    def pgvalue(self):
        return int(self.hex, 16)

    def pghalves(self):
        return self.hex[:8], self.hex[8:]

    def pgvalues(self):
        return int(self.hex[:8], 16), int(self.hex[8:], 16)

    def random(self):
        # map the 64-bit hash onto [0, 1)
        return self.pgvalue() / 2**64

    def randint(self, lo, hi):
        # integer in [lo, hi] inclusive
        return lo + int(self.random() * (hi - lo + 1))

    def gauss(self, mu, sigma):
        # split the hash into two 32-bit uniforms and feed them to Box-Muller
        xx = self.pgvalues()
        uu = [xx[0]/2**32, xx[1]/2**32]
        return mu + sigma * gaussian(uu[0], uu[1])[0]
Next step is to go through my code and replace all the calls to random.Random methods with pghash objects.
I have made this into a module, which I hope to upload to pypi at some point:
https://github.com/UKHomeOffice/python-pghash
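For reference, a minimal usage sketch of the class above (the key values are made up; the xxhash package must be installed):

# the same key always produces the same values, independent of call order
key = [1, 2, 7, 1]          # e.g. [groupseed, people, person_index, surname]
h = pghash(key)

print(h.random())           # float in [0, 1)
print(h.randint(0, 9))      # integer from 0 to 9 inclusive
print(h.gauss(0, 1))        # normally distributed value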
I am solving a dispersion equation for a multilayer plate. Depending on the number of layers, I have to generate, let's say, the prescription of a matrix that contains two variables - frequency and velocity. So in the first step I generate the matrix, and in the second step the matrix is used for further calculations in loops over these two variables.
I will roughly demonstrate the problem with a function:
def function(a,b):
y=a*f+b*c
return y
(a and b will be defined in the input, but f and c will still be variables)
function(a,b) should return the prescription of a function of two variables - f and c.
Then I will use that prescription to calculate the function's value for different f and c values.
In my case I have to use this approach, because the shape of the matrix depends on the number of layers present. I thought I could use a symbolic toolbox, but I don't think that is the right way to solve the problem.
As far as I can tell, you need a factory - a function that returns another function:
def prescription_factory(a,b):
return lambda f,c: a*f+b*c
# Create a function:
prescription = prescription_factory(10,20)
# Use the new function
prescription(1,2)
# 50
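The same pattern extends to the matrix case in the question: once the number of layers is known, the factory can return a function that builds the matrix for any f and c. The entry formula below is a placeholder, not the real dispersion relation:

import numpy as np

def matrix_factory(num_layers, a, b):
    # returns a function of (f, c) that builds the matrix for this layer count
    def build(f, c):
        m = np.zeros((num_layers, num_layers))
        for i in range(num_layers):
            m[i, i] = a * f + b * c      # placeholder entries
        return m
    return build

build = matrix_factory(3, 10, 20)
print(build(1, 2))    # 3x3 matrix with 50 on the diagonal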
I have a problem that is equal parts trig and Python. I am plotting a cosine over time interval [0,t] whose frequency changes (slightly) according to another cosine function. So what I'd expect to see is a repeating pattern of higher-to-lower frequency that repeats over the duration of the window [0,t].
Instead what I'm seeing is that over time a low-freq motif emerges in the cosine plot and repeats over time, each time becoming lower and lower in freq until eventually the cosine doesn't even oscillate properly it just "wobbles", for lack of a better term.
I don't understand how this is emerging over the course of the window [0,t] because cosine is (obviously) periodic and the function modulating it is as well. So how can "new" behavior emerge?? The behavior should be identical across all periods of the modulatory cosine that tunes the freq of the base cosine, right?
As a note, I'm technically using a modified cosine: instead of cos(wt) I'm using e^(cos(wt)) [called the von Mises equation, or something similar].
Minimum needed Code:
cos_plot = []
for wind,pos_theta in zip(window,pos_theta_vec): #window is vec of time increments
# for ref: DBFT(pos_theta) = (1/(2*np.pi))*np.cos(np.radians(pos_theta - base_pos))
f = float(baserate+DBFT(pos_theta)) # DBFT() returns a val [-0.15,0.15] periodically depending on val of pos_theta
cos_plot.append(np.exp(np.cos(f*2*np.pi*wind)))
plt.plot(cos_plot)
plt.show()
What you are observing could be due to "aliasing", i.e. the emergence of low-frequency figures caused by sampling a high-frequency function with a step that is too big.
If the issue is NOT aliasing, consider that any function shape between -1 and 1 can be obtained with cos(f(x)*x) simply by choosing f(x) appropriately.
To see this, take any function -1 <= g(x) <= 1 and set f(x) = arccos(g(x))/x.
To look for the problem, try plotting your "frequency" and see if anything really strange is present in it. Maybe you have a bug in DBFT.
In the interest of posterity, in case anyone ever needs an answer to this question:
I wanted a cosine whose frequency was a time-varying function freq(t). My mistake was simply evaluating this function at each time t, like this: A*cos(2*pi*freq(t)*t). Instead you need to integrate freq(t) from 0 to t at each time point: y = cos(2*pi*integral(freq(t), 0, t) + 2*pi*f0*t + phase). The term for this procedure is a frequency sweep or chirp (not identical terms, but similar enough if you need to search Google/SO for answers).
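A minimal sketch of that fix in numpy (the modulation numbers below are made up, not the DBFT function from the question):

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 10, 5000)                       # time axis
dt = t[1] - t[0]
freq = 1.0 + 0.15 * np.cos(2 * np.pi * 0.2 * t)    # slowly modulated frequency

# accumulate the phase: 2*pi times the running integral of freq(t)
phase = 2 * np.pi * np.cumsum(freq) * dt
y = np.exp(np.cos(phase))                          # same exp(cos(.)) waveform as the question

plt.plot(t, y)
plt.show()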
Thanks to those who responded with help :)
SB
I am trying to port from LabVIEW to Python.
In LabVIEW there is a function "Integral x(t) VI" that takes a set of samples as input, performs a discrete integration of the samples, and returns a list of values (the areas under the curve) according to Simpson's rule.
I tried to find an equivalent function in scipy, e.g. scipy.integrate.simps, but those functions return the summed integral across the set of samples, as a float.
How do I get the list of integrated values as opposed to the sum of the integrated values?
Am I just looking at the problem the wrong way around?
I think you may be using scipy.integrate.simps slightly incorrectly. The area returned by scipy.integrate.simps is the total area under y (the first parameter passed). The second parameter is optional and gives the sample points for the x-axis (the actual x values for each of the y values), i.e.:
>>> import numpy as np
>>> import scipy.integrate
>>> a=np.array([1,1,1,1,1])
>>> scipy.integrate.simps(a)
4.0
>>> scipy.integrate.simps(a,np.array([0,10,20,30,40]))
40.0
I think you want to return the areas under the same curve between different limits? To do that you pass the part of the curve you want, like this:
>>> a=np.array([0,1,1,1,1,10,10,10,10,0])
>>> scipy.integrate.simps(a)
44.916666666666671
>>> scipy.integrate.simps(a[:5])
3.6666666666666665
>>> scipy.integrate.simps(a[5:])
36.666666666666664
There is only one function in SciPy that does cumulative integration: scipy.integrate.cumtrapz(), which does what you want as long as you don't specifically need Simpson's rule or another method. If you do, you can, as suggested, always write the loop yourself.
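For example, a short sketch on the same data as above (note that in newer SciPy releases this function is exposed as scipy.integrate.cumulative_trapezoid):

import numpy as np
from scipy import integrate

y = np.array([0, 1, 1, 1, 1, 10, 10, 10, 10, 0], dtype=float)

# one running-integral value per sample, via the trapezoidal rule;
# initial=0 keeps the output the same length as the input
running = integrate.cumtrapz(y, dx=1.0, initial=0)
print(running)        # running[-1] is the total area under the curve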