How do you pick a certain min number in Python?

I'm running a program which scores responses to a test. There are 23 questions and each correct answer is worth +1. My code sums these scores across the 23 questions and creates a separate column (totalCorrect) which holds the final score out of 23. I have attached a screenshot of a portion of this totalCorrect column.
What I want to do now is assign a monetary incentive based on each person's performance. The incentive is $0.30 for every right answer. The issue is that every survey has 23 questions, but I only want to count 20 of them toward the incentive. So out of the score (out of 23), at most 20 correct responses should be considered.
How can I do this?
This is what I have so far:
df['numCorrect'] = min{20, totalNumCorrect}
df['earnedAmount'] = 0.3 * df['numCorrect']
where 'earnedAmount' is meant to hold the final incentive amount and 'numCorrect' is meant to cap the score at 20 out of a possible 23.

df['earnedAmount'] = (0.3 * df['totalNumCorrect']).clip(0, 6)
0.3 * df['totalNumCorrect'] simply calculates the full amount, which is a Series (or dataframe column).
.clip then limits the values to be between 0 and 6. 6 is of course 0.3 * 20, the maximum amount someone can earn.
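If you prefer to keep the intermediate numCorrect column from the question, the same cap can be applied directly to the counts with clip(upper=20). A minimal sketch, assuming the total is stored in a column named totalNumCorrect:
import pandas as pd
# hypothetical scores out of 23, for illustration only
df = pd.DataFrame({'totalNumCorrect': [23, 21, 20, 17, 5]})
df['numCorrect'] = df['totalNumCorrect'].clip(upper=20)  # count at most 20 answers
df['earnedAmount'] = 0.3 * df['numCorrect']              # $0.30 per counted answer
print(df)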

Related

Conflicting results when grouping observations in Stata vs Python

I have a longitudinal dataset and I am trying to create two variables that correspond to two time periods based on specific date ranges (period_1 and period_2) to be able to analyze the effect of each of those time periods on my outcome.
My Stata code for grouping variables by ID is
gen period_1 = date_eval < mdy(5,4,2020)
preserve
collapse period_1=period_1
count if period_1
and it gives me a number of individuals during that period.
However, I get a different number if I use the SQL query in Python
evals_period_1 = ps.sqldf('SELECT id, COUNT(date_eval) FROM df WHERE strftime(date_eval) < strftime("%m/%d/%Y",{}) GROUP BY id'.format('5/4/2020'))
Am I grouping by ID differently in these two codes? Please let me know what you think.
I agree with Nick that a reproducible example would have been useful, or at least a description of the results and how they differ from what you expected. However, I can still say something about your Stata code. See the reproducible example below, and note how your code always results in the count 1, even though the example randomizes the data differently on each run.
* Create a data set with 50 rows where period_1 is dummy (0,1) randomized
* differently each run
clear
set obs 50
gen period_1 = (runiform() < .5)
* List the first 5 rows
list in 1/5
* This collapses all rows and what you are left with is one row where the value
* is the average of all rows
collapse period_1=period_1
* List the one remaining observation
list
* Here the Stata syntax is probably not what you are expecting: period_1 will
* be replaced with the value in the first row, i.e. the random mean around .5.
* (This is my understanding, assuming it follows what "display period_1" would do.)
count if period_1
* That is identical to count if .5, and Stata evaluates
* any number > 0 as "true", so the count of observations where
* this statement is true is 1. This will always be the case in this code
* unless the random number generator hits the corner case where all rows are 0.
count if .5
You probably want to drop the collapse line and change the last line to count if period_1 == 1. But how your data is formatted determines whether this is the solution to your original question.
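On the Python side, if the goal is simply the number of distinct individuals evaluated before the cutoff date, a minimal pandas sketch (an illustration only, assuming df has an id column and a date_eval column; this is not the asker's pandasql query):
import pandas as pd
df['date_eval'] = pd.to_datetime(df['date_eval'])
cutoff = pd.Timestamp(2020, 5, 4)
# number of distinct individuals with at least one evaluation before the cutoff
n_period_1 = df.loc[df['date_eval'] < cutoff, 'id'].nunique()
print(n_period_1)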

How to calculate relative frequency of an event from a dataframe?

I have a dataframe with temperature data for a certain period. With this data, I want to calculate the relative frequency of August being warmer than 20° and of January being colder than 2°. I have already managed to extract these two columns into a separate dataframe, to get the count of each temperature value, and to use the normalize option to get the frequency of each value in percent (see code).
df_temp1[df_temp1.aug >=20]
df_temp1[df_temp1.jan <= 2]
df_temp1['aug'].value_counts()
df_temp1['jan'].value_counts()
df_temp1['aug'].value_counts(normalize=True)*100
df_temp1['jan'].value_counts(normalize=True)*100
What I haven't managed is to calculate the relative frequency for aug>=20, jan<=2, as well as aug>=20 AND jan<=2 and aug>=20 OR jan<=2.
Maybe someone could help me with this problem. Thanks.
I would try something like this:
proportion_of_augusts_above_20 = (df_temp1['aug'] >= 20).mean()
proportion_of_januaries_below_2 = (df_temp1['jan'] <= 2).mean()
This calculates it in two steps. First, df_temp1['aug'] >= 20 creates a boolean Series, with True for rows at or above 20° and False for rows that are not.
Then, mean() treats True and False as 1 and 0, so the average is the proportion of rows that fulfill the criterion (i.e. the percentage divided by 100).
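The same pattern extends to the combined events asked about in the question, using & for AND and | for OR (multiply by 100 for percentages):
p_aug = (df_temp1['aug'] >= 20).mean()
p_jan = (df_temp1['jan'] <= 2).mean()
p_both = ((df_temp1['aug'] >= 20) & (df_temp1['jan'] <= 2)).mean()
p_either = ((df_temp1['aug'] >= 20) | (df_temp1['jan'] <= 2)).mean()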
As an aside, I would recommend posting your data in a question, which allows people answering to check whether their solution works.

Generate random integers with numpy with constraints

I am working with numpy and simpy on a simulation. I simulate over a 12-month period.
env.run(until=12.0)
I need to generate random demand values between 2 and 50, occurring at random moments within the 12-period length of the env.
d = np.random.randint(2,50) #generate random demand values
Now the values are passed at random intervals into the 12-month simpy environment:
0.2 40
0.65 21
0.67 03
1.01 4
1.1 19
...
11.4 49
11.9 21
What I am trying to achieve is to constrain the numpy generation to make sure that the sum of the values generated in each period (0, 1, 2, ...) does not exceed 100.
To put it in different words, I am trying to generate random quantities at random intervals along a 12-period axis, and I am trying to make sure that the sum of these quantities for any one period does not exceed a given value.
I cannot find anything about this online to tweak numpy's randint function to do that. Would someone have a hint?
I do not understand your question. If you are looking for your simulation to give you an average of 100 per month, then the demand values should not be in [2, 50], as the maximum possible average would be 50. I think you might be looking for this: https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
I won't go into the math, but drawing random numbers from a normal distribution and taking their mean will give (approximately) the mean of the normal distribution, which is a parameter you can set.
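For the per-period cap actually described in the question (the sum of demands in a period not exceeding 100), one simple option not covered above is to keep drawing integers and stop a period as soon as the next draw would push its total over the cap. A rough sketch, with the bounds and cap taken from the question:
import numpy as np

def demands_for_period(rng, cap=100, low=2, high=50):
    # draw demands in [low, high] until adding the next one would exceed the cap
    demands, total = [], 0
    while True:
        d = int(rng.integers(low, high + 1))
        if total + d > cap:
            return demands
        demands.append(d)
        total += d

rng = np.random.default_rng()
print(demands_for_period(rng))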

Median and quantile values in Pyspark

In my dataframe I have an age column. The total number of rows is approximately 77 billion. I want to calculate the quantile values of that column using PySpark. I have some code, but the computation time is huge (maybe my approach is very bad).
Is there any good way to improve this?
Dataframe example:
id age
1 18
2 32
3 54
4 63
5 42
6 23
What I have done so far:
#Summary stats
df.describe('age').show()
#For Quantile values
x5 = df.approxQuantile("age", [0.5], 0)
x25 = df.approxQuantile("age", [0.25], 0)
x75 = df.approxQuantile("age", [0.75], 0)
The first improvement would be to do all the quantile calculations at the same time:
quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0)
Also, note that you are using the exact calculation of the quantiles (relativeError set to 0). From the documentation:
relativeError – The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
Since you have a very large dataframe, I expect that some error is acceptable in these calculations; it will be a trade-off between speed and precision (though anything greater than 0 could give a significant speed improvement).
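For example, allowing a small relative error in a single pass (the 0.01 here is only an illustrative value, not one from the question):
quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.01)
x25, x50, x75 = quantiles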

Difference between large numpy arrays containing time values

I have ten (1000,1000) numpy arrays. Each array element contains a float representing the hour of the day, e.g. 14.0 = 2pm and 15.75 = 3:45pm.
I want to find the maximum difference between these arrays. The result should be a single (1000,1000) numpy array containing, for each element, the maximum difference across the arrays. At the moment I have the following, which seems to work fine:
import numpy as np
max=np.maximum.reduce([data1,data2,data3,data4,data5])
min=np.minimum.reduce([data1,data2,data3,data4,data5])
diff=max-min
However, this gives a difference of 22 hours between 11pm and 1am, when I need it to be 2 hours. I imagine I need to use datetime.time somehow, but I don't know how to get datetime to play nicely with numpy arrays.
Edit: The times refer to the average time of day at which a certain event occurs, so they are not associated with a specific date. The difference between two times could therefore be correctly interpreted as either 22 hours or 2 hours. However, I always want to take the minimum of those two possible interpretations.
You can take the difference between two cyclic values by centering one value on the middle of the cycle (12.0) and rotating the other values by the same amount, which preserves their relative differences. Take the adjusted values modulo the cycle duration to keep everything within bounds. The times are now adjusted so the maximum possible distance stays within +/- half the cycle duration (+/- 12 hours).
e.g.,
adjustment = arr1 - 12.0
arr2 = (arr2 - adjustment) % 24.0
diff = 12.0 - arr2 # or abs(12.0 - arr2) if you prefer
If you're not using the absolute value, you'll need to play with the sign depending on which time you want to be considered 'first'.
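As a quick check with the 11pm/1am case from the question, treating the arrays as single values:
arr1, arr2 = 23.0, 1.0
adjustment = arr1 - 12.0            # 11.0
arr2 = (arr2 - adjustment) % 24.0   # (1 - 11) % 24 = 14.0
diff = abs(12.0 - arr2)             # 2.0 hours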
Let's say you have the times 11pm and 1am, and you want to find the minimum distance.
1am -> 1
11pm -> 23
Then you have either:
23 - 1 = 22
Or,
24 - (23 - 1) % 24 = 2
Then distance can be thought of as:
def dist(x, y):
    # cyclic distance in hours; np.minimum/np.abs keep it working element-wise on arrays
    return np.minimum(np.abs(x - y), 24 - np.abs(x - y) % 24)
Now we need to take dist and apply it to every combination. If I recall correctly, there is a more numpy/scipy-oriented function to do this, but the concept is more or less the same:
from itertools import combinations
data = [data1, data2, data3, data4, data5]
# element-wise cyclic distance for every pair of arrays
dists = [dist(x, y) for x, y in combinations(data, 2)]
# element-wise maximum over all pairs
max_dist = np.maximum.reduce(dists)
If you have an array diff of time differences ranging between 0 and 24 hours, you can correct the wrongly calculated values as follows:
diff[diff > 12] = 24. - diff[diff > 12]
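For example, the 22-hour difference from the 11pm/1am case becomes 2 hours, while values already at or below 12 are left alone:
import numpy as np
diff = np.array([22.0, 2.0, 12.0])
diff[diff > 12] = 24. - diff[diff > 12]
print(diff)  # [ 2.  2. 12.]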
