I am working with numpy and simpy on a simulation. I simulate over a 12-month period:
env.run(until=12.0)
I need to generate random demand values between 2 and 50, occurring at random moments within the 12-period length of the env.
d = np.random.randint(2, 50)  # generate a random demand value (randint's upper bound is exclusive)
The values are then passed at random intervals into the 12-month simpy environment:
0.2 40
0.65 21
0.67 3
1.01 4
1.1 19
...
11.4 49
11.9 21
What I am trying to achieve is to constrain the numpy generation so that the sum of the values generated in each period (0, 1, 2, ...) does not exceed 100.
To put it differently, I am trying to generate random quantities at random intervals along a 12-period axis, while making sure that the sum of the quantities within any one period does not exceed a given value.
I cannot find anything online about tweaking numpy's randint function to do that. Would someone have a hint?
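For reference, one possible sketch of such a constrained generator (the per-period cap of 100 and the [2, 50] range are taken from the question; the generator structure and names are assumptions):
import numpy as np

rng = np.random.default_rng()

def generate_demands(n_periods=12, cap=100, low=2, high=50):
    """Return (time, quantity) events whose quantities sum to at most `cap` per period."""
    events = []
    for period in range(n_periods):
        total = 0
        while True:
            q = rng.integers(low, high + 1)  # endpoint made inclusive
            if total + q > cap:
                break  # this demand would push the period's sum over the cap
            total += q
            events.append((period + rng.random(), q))  # random moment within the period
    return sorted(events)

for t, q in generate_demands():
    print(round(t, 2), q)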
I do not quite understand your question. If you want your simulation to give you an average of 100 per month, then the demand values should not be drawn from [2, 50], as the maximum possible average would be 50. I think you might be looking for this: https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
I won't go into the math, but the mean of random numbers drawn from a normal distribution approaches the mean of that distribution, which is a parameter you can set.
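If that is the intent, a minimal sketch of the suggestion (the loc and scale values are placeholders, not from the question):
import numpy as np

rng = np.random.default_rng()
monthly_demand = rng.normal(loc=100, scale=15, size=12)  # one draw per month
print(monthly_demand.mean())  # approaches loc as the sample grows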
I'm running a program that scores test responses. There are 23 questions, and each correct answer is worth +1. My code sums these scores over the 23 questions and stores the final score out of 23 in a separate column (totalCorrect). I have attached a screenshot of a portion of this totalCorrect column.
What I want to do now is assign a monetary incentive based on each performance. The incentive is $0.30 for every right answer. The issue is that every survey has 23 questions, but I only want to count at most 20 of them towards the incentive. So out of the score (out of 23), only min(20, score) responses should be considered.
How can I do this?
This is what I have so far:
df['numCorrect'] = min(20, df['totalNumCorrect'])  # pseudocode: min() does not work element-wise on a Series
df['earnedAmount'] = 0.3 * df['numCorrect']
where 'earnedAmount' should hold the final incentive amount and 'numCorrect' should cap the counted score at 20 out of a possible 23.
df['earnedAmount'] = (0.3 * df['totalNumCorrect']).clip(0, 6)
0.3 * df['totalNumCorrect'] simply calculates the full amount, which is a Series (or dataframe column).
.clip then limits the values to be between 0 and 6. 6 is of course 0.3 * 20, the maximum amount someone can earn.
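Equivalently, you could clip the count first and then multiply, which keeps the intermediate numCorrect column from the question (a minor variation, shown here on toy data):
import pandas as pd

df = pd.DataFrame({'totalNumCorrect': [23, 20, 17]})     # toy data for illustration
df['numCorrect'] = df['totalNumCorrect'].clip(upper=20)  # count at most 20 answers
df['earnedAmount'] = 0.3 * df['numCorrect']              # $0.30 per counted answer
print(df)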
I have a dataframe similar to the one shown below and was wondering how I can loop through and calculate fitting parameters every set number of days. For example, I would like to be able to input 30 days and get new constants for the first 30 days, then the first 60 days, and so on until the end of the date range.
ID date amount delta_t
1 2020/1/1 10.2 0
1 2020/1/2 11.2 1
2 2020/1/1 12.3 0
2 2020/1/2 13.3 1
I would like the parameters stored in another dataframe, which is what I am currently doing for the entire dataset, but that covers the whole time period rather than n-day blocks. Then, using the constants for each period, I will calculate the graph points and plot them.
Right now I am using groupby to group the wells by ID and then the apply method to calculate the constants for each ID. This works for the entire dataframe, but the constants will change if I only use 30-day periods.
I don't know if there is a way in the apply method to do this more easily and output the constants either to a new column or a separate dataframe with one row per ID. Any input is greatly appreciated.
from scipy.optimize import curve_fit
import pandas as pd

def parameters(x):
    variables, _ = curve_fit(expo, x['delta_t'], x['amount'])  # fit the model `expo` to one ID's data
    return pd.Series({'param1': variables[0], 'param2': variables[1], 'param3': variables[2]})

param_series = df_filt.groupby('ID').apply(parameters)
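One way to get constants per cumulative n-day window could look roughly like this (a sketch: `expo` is assumed to be a three-parameter exponential, and the column names follow the sample data):
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def expo(t, a, b, c):
    # assumed three-parameter model; substitute the real one
    return a * np.exp(-b * t) + c

def fit_block(g):
    params, _ = curve_fit(expo, g['delta_t'], g['amount'], maxfev=10000)
    return pd.Series({'param1': params[0], 'param2': params[1], 'param3': params[2]})

def fit_windows(df, window=30):
    results = []
    max_t = int(df['delta_t'].max())
    for end in range(window, max_t + window + 1, window):
        sub = df[df['delta_t'] < end]                # first 30 days, first 60 days, ...
        fitted = sub.groupby('ID').apply(fit_block)  # one row of constants per ID
        fitted['window_end'] = end
        results.append(fitted)
    return pd.concat(results).reset_index()

# param_table = fit_windows(df_filt, window=30)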
The dataset records occurrences of a particular insect at a location for a given year and month. This is available for about 30 years. Now, when I give an arbitrary location and a future year and month, I want the probability of finding that insect in that place based on the historic data.
I tried to frame it as a classification problem by labelling all available data as 1 and checking the probability of a new data point being label 1, but an error was thrown because at least two classes are needed to train.
The data looks like this (x and y are longitude and latitude):
x      y      year  month
17.01  22.87  2013  01
42.32  33.09  2015  12
Think about the problem as a map. You'll need a map for each time period you're interested in, so sum all the occurrences in each month and year for each location. Unless the locations are already binned, you'll need to use some binning, as otherwise it is pretty meaningless. So round the values in x and y to a reasonable precision level or use numpy to bin the data. Then you can create a map with the counts, or use a Markov model to predict the occurrence.
The reason you're not getting anywhere at the moment is that the chance of finding an insect at any random point is virtually 0.
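A rough sketch of the binning idea (the grid resolution, the pandas approach, and the column names taken from the sample data are all assumptions):
import pandas as pd

# df has columns x, y, year, month as in the sample
df = pd.DataFrame({'x': [17.01, 42.32], 'y': [22.87, 33.09],
                   'year': [2013, 2015], 'month': [1, 12]})

# bin coordinates to a coarse grid, e.g. 1-degree cells
df['x_bin'] = df['x'].round(0)
df['y_bin'] = df['y'].round(0)

# occurrences per cell per calendar month over the ~30 years give a
# crude relative-frequency score for that cell and month
counts = df.groupby(['x_bin', 'y_bin', 'month']).size()
print(counts)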
I have a 2D numpy array consisting of ca. 15'000'000 datapoints. Each datapoint has a timestamp and an integer value (between 40 and 200). I must create histograms of the datapoint distribution (16 bins: 40-49, 50-59, etc.), sorted by year, by month within the current year, by week within the current year, and by day within the current month.
Now, I wonder what might be the most efficient way to accomplish this. Given the size of the array, performance is an important consideration. I am considering nested "for" loops, breaking down the arrays by year, by month, etc. But I was reading that numpy arrays are highly memory-efficient and have all kinds of tricks up their sleeve for fast processing. So I was wondering if there is a faster way to do that. As you may have realized, I am an amateur programmer (a molecular biologist in "real life") and my questions are probably rather naïve.
First, fill in your 16 bins without considering date at all.
Then, sort the elements within each bin by date.
Now, you can use binary search to efficiently locate a given year/month/week within each bin.
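A sketch of that idea with numpy (np.digitize assigns the value bins, np.searchsorted does the date lookups; the toy data and function names are assumptions):
import numpy as np

rng = np.random.default_rng(0)
timestamps = rng.integers(0, 5 * 365 * 86400, size=1_000_000)  # seconds since some epoch
values = rng.integers(40, 201, size=1_000_000)

edges = np.arange(40, 210, 10)                            # 40, 50, ..., 200
bin_idx = np.clip(np.digitize(values, edges) - 1, 0, 15)  # 16 bins; 200 folded into the last

# sort each bin's timestamps once, then answer period queries by binary search
per_bin = [np.sort(timestamps[bin_idx == b]) for b in range(16)]

def count_in_period(b, t_start, t_end):
    """How many points of bin b fall in [t_start, t_end)."""
    ts = per_bin[b]
    return np.searchsorted(ts, t_end) - np.searchsorted(ts, t_start)

print(count_in_period(0, 0, 30 * 86400))  # bin 0, first 30 days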
There is a function in numpy for this, numpy.bincount, and it is blazingly fast. It is so fast that you can create a bin for every combination of integer value (161 bins) and day (maybe 30,000 different days?), resulting in a few million bins.
The procedure:
calculate an integer index for each bin (e.g. 17 x the number of days since the first day in the file + (integer - 40)//10)
run np.bincount
reshape to the correct shape (number of days, 17)
Now you have the binned data which can then be clumped into whatever bins are needed in the time dimension.
Without knowing the form of your input data, the integer bin calculation could look something like this:
import numpy as np

# let us assume we have the data as:
#  timestamps: 64-bit integers (seconds since some epoch)
#  values: 8-bit unsigned integers between 40 and 200

# find the first day in the sample (86400 seconds per day)
first_day = np.min(timestamps) // 86400

# we intend to do this, but fast: one flat index per (day, value-bin) pair
indices = (timestamps // 86400 - first_day) * 17 + (values - 40) // 10

# get the bincount vector
b = np.bincount(indices)

# calculate the number of days in the sample
no_days = (len(b) + 16) // 17

# reshape b to (number of days, 17); resize pads the last day with zeros if needed
b.resize((no_days, 17))
It should be noted that the first and last days in b depend on the data. In testing, most of the time is spent calculating the indices (around 400 ms with an i7 processor). If that needs to be reduced, it can be brought down to approximately 100 ms with the numexpr module. However, the actual implementation depends heavily on the form of the timestamps; some are faster to calculate than others.
That said, I doubt any other binning method will be faster if the data is needed down to the daily level.
I did not quite understand from your question whether you wanted separate views on the data (one by year, one by week, etc.) or some other binning. In any case, that boils down to summing the relevant rows together.
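For example, rolling the daily rows up into calendar months could look roughly like this (a sketch with toy stand-ins for b and first_day, and assuming the original timestamps were Unix seconds so day numbers count from 1970-01-01):
import numpy as np

# toy stand-ins: 90 days of data starting 2020-01-01, 17 value-bins per day
first_day = np.datetime64('2020-01-01', 'D').astype(np.int64)
b = np.random.default_rng(0).integers(0, 50, size=(90, 17))

day_numbers = first_day + np.arange(b.shape[0])
months = day_numbers.astype('datetime64[D]').astype('datetime64[M]')

unique_months = np.unique(months)
monthly = np.vstack([b[months == m].sum(axis=0) for m in unique_months])
print(monthly.shape)  # (number of months covered, 17)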
Here is a solution, employing the group_by functionality found in the link below:
http://pastebin.com/c5WLWPbp
import numpy as np
from grouping import group_by  # the module from the pastebin link above

dates = np.arange('2004-02', '2005-05', dtype='datetime64[D]')
np.random.shuffle(dates)
values = np.random.randint(40, 200, len(dates))

years = np.array(dates, dtype='datetime64[Y]')
months = np.array(dates, dtype='datetime64[M]')
weeks = np.array(dates, dtype='datetime64[W]')

bins = np.linspace(40, 200, 17)
for m, g in zip(*group_by(months)(values)):  # assuming group_by(keys)(values) returns (unique keys, groups)
    print(m)
    print(np.histogram(g, bins=bins)[0])
Alternatively, you could take a look at the pandas package, which probably has an elegant solution to this problem as well.
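A rough pandas version of the same grouping (value bins via pd.cut, time bins via pd.Grouper; the toy data and column names are made up):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.to_datetime(rng.integers(0, 5 * 365, size=100_000), unit='D', origin='2004-01-01')
values = rng.integers(40, 200, size=100_000)  # a value of exactly 200 would need folding into the last bin

df = pd.DataFrame({'date': dates, 'value': values})
df['bin'] = pd.cut(df['value'], bins=np.arange(40, 201, 10), right=False)  # 16 bins: [40,50), ..., [190,200)

# counts per value-bin per month; swap freq='M' for 'Y', 'W' or 'D' as needed
monthly = df.groupby([pd.Grouper(key='date', freq='M'), 'bin'], observed=False).size().unstack('bin')
print(monthly.head())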
I have ten (1000, 1000) numpy arrays. Each array element contains a float representing the hour of the day, e.g. 14.0 = 2:00 pm and 15.75 = 3:45 pm.
I want to find the maximum difference between these arrays. The result should be a single (1000, 1000) numpy array containing, for each element, the maximum difference across the ten arrays. At the moment I have the following, which seems to work fine:
import numpy as np

# element-wise max and min across the stack of arrays
max_arr = np.maximum.reduce([data1, data2, data3, data4, data5])
min_arr = np.minimum.reduce([data1, data2, data3, data4, data5])
diff = max_arr - min_arr
However, this gives a difference of 22 hours between 11pm and 1am, where I need it to be 2 hours. I imagine I need to use datetime.time somehow, but I don't know how to get datetime to play nicely with numpy arrays.
Edit: The times refer to the average time of day at which a certain event occurs, so they are not associated with a specific date. The difference between two times could therefore be correctly interpreted as either 22 hours or 2 hours; I always want the minimum of those two interpretations.
You can take the difference between two cyclic values by shifting one value to the center of the cycle (12.0), rotating the other values by the same amount to maintain their relative differences, and taking the modulus of the adjusted values by the cycle duration to keep everything within bounds. You now have times adjusted so that the maximum possible distance stays within +/- half the cycle duration (+/- 12 hours).
e.g.,
adjustment = arr1 - 12.0
arr2 = (arr2 - adjustment) % 24.0
diff = 12.0 - arr2 # or abs(12.0 - arr2) if you prefer
If you're not using the absolute value, you'll need to play with the sign depending on which time you want to be considered 'first'.
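For example, with arr1 = 23.0 (11pm) and arr2 = 1.0 (1am): adjustment = 23.0 - 12.0 = 11.0, arr2 becomes (1.0 - 11.0) % 24.0 = 14.0, and diff = 12.0 - 14.0 = -2.0, i.e. the two times are 2 hours apart.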
Let's say you have the times 11pm and 1am, and you want to find the minimum distance.
1am -> 1
11pm -> 23
Then you have either:
23 - 1 = 22
Or,
24 - (23 - 1) % 24 = 2
Then distance can be thought of as:
import numpy as np

def dist(x, y):
    return np.minimum(np.abs(x - y), 24 - np.abs(x - y) % 24)  # element-wise, so it works on whole arrays
Now we need to take dist and apply it to every combination. If I recall correctly, there is a more numpy/scipy-oriented function to do this, but the concept is more or less the same:
from itertools import combinations

data = [data1, data2, data3, data4, data5]
# pairwise cyclic distances for every combination of two arrays
dists = [dist(x, y) for x, y in combinations(data, 2)]
# element-wise maximum over all the pairwise distance arrays
max_dist = np.maximum.reduce(dists)
If you have an array diff of the time differences ranging between 0 and 24 hours you can make a correction to the wrongly calculated values as follows:
diff[diff > 12] = 24. - diff[diff > 12]
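Putting the question's max/min reduction together with that correction (a sketch; two random arrays stand in for the ten):
import numpy as np

rng = np.random.default_rng(0)
data = [rng.uniform(0, 24, size=(1000, 1000)) for _ in range(2)]  # stand-ins for the ten arrays

max_arr = np.maximum.reduce(data)
min_arr = np.minimum.reduce(data)
diff = max_arr - min_arr

# wrap differences over 12 hours around the clock: 22 hours becomes 2 hours
diff[diff > 12] = 24.0 - diff[diff > 12]
print(diff.max())  # never exceeds 12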