Monte Carlo simulation in Python - problem with looping

I am running a simple Python script for a Monte Carlo simulation. It reads through every row in the dataframe, takes the min and max of the two variables, then runs the simulation 1000 times, selecting a random value between each min and max, computing the product, and writing the P50 value back to the data table.
Somehow the P50 output is the same for all rows. Any help on where I am going wrong?
import pandas as pd
import random
import numpy as np

data = [[0.075, 0.085, 120, 150], [0.055, 0.075, 150, 350], [0.045, 0.055, 175, 400]]
df = pd.DataFrame(data, columns=['P_min', 'P_max', 'H_min', 'H_max'])

NumSim = 1000
for index, row in df.iterrows():
    outdata = np.zeros(shape=(NumSim,), dtype=float)
    for k in range(NumSim):
        phi = (row['P_min'] + (row['P_max'] - row['P_min']) * random.uniform(0, 1))
        ht = (row['H_min'] + (row['H_max'] - row['H_min']) * random.uniform(0, 1))
        outdata[k] = phi * ht
    df['out_p50'] = np.percentile(outdata, 50)
print(df)

With df['out_p50'] = np.percentile(outdata, 50) you are saying that you want the whole column set to the given value, not a specific row of the column. The numbers are generated and saved, but they are saved to the whole column on every iteration, so in the end you see the last generated value in every row.
Instead, use df.loc[index, 'out_p50'] = np.percentile(outdata, 50) to target the specific row you want to set.
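For reference, a minimal sketch of the corrected loop (same imports and data as in the question):

for index, row in df.iterrows():
    outdata = np.zeros(shape=(NumSim,), dtype=float)
    for k in range(NumSim):
        phi = row['P_min'] + (row['P_max'] - row['P_min']) * random.uniform(0, 1)
        ht = row['H_min'] + (row['H_max'] - row['H_min']) * random.uniform(0, 1)
        outdata[k] = phi * ht
    # write this row's P50 into this row only, instead of overwriting the whole column
    df.loc[index, 'out_p50'] = np.percentile(outdata, 50)
print(df)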

Yup -- you're writing a scalar value to the entire column, and you overwrite that value on each iteration. For a quick fix, specify the row with df.loc as shown above. Also consider using np.median(outdata) instead of np.percentile(outdata, 50).
Perhaps the most important feature of pandas is its built-in support for vectorization: you work with entire columns of data rather than looping through the data frame, much like a list comprehension where you never spell out the for row in df part.
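For example, a minimal vectorized sketch (assuming the df and NumSim defined in the question) that draws every sample for every row at once and never loops in Python:

import numpy as np

rng = np.random.default_rng()
n_rows = len(df)

# shape (n_rows, NumSim): one row of simulated values per DataFrame row
phi = rng.uniform(df['P_min'].to_numpy()[:, None], df['P_max'].to_numpy()[:, None], size=(n_rows, NumSim))
ht = rng.uniform(df['H_min'].to_numpy()[:, None], df['H_max'].to_numpy()[:, None], size=(n_rows, NumSim))

# the median of the products along the simulation axis is the per-row P50
df['out_p50'] = np.median(phi * ht, axis=1)
print(df)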

Related

In Python, how do I ensure that the seed used by randint keeps changing when I am trying to pick a random number?

from random import randint

def claims(dataframe):
    dataframe.loc[(dataframe.severity == 1), 'claims_made'] = randint(200, 20000)
    return dataframe
Here 'severity' is an existing column and 'claims_made' is a new column. I want randint to keep picking different values to assign to the 'claims_made' column, because right now it picks a single random value out of the specified range and assigns that same value to every row that satisfies the condition.
Your code gets a single randint and applies that one value to the column you create. It's the same as if you had done
val = randint(200, 20000)
dataframe.loc[(dataframe.severity == 1), 'claims_made'] = val
Instead, you could get the index of the rows you want to assign, use it to create a series of random integers, and when you assign that back to the dataframe, the non-indexed rows become NaN.
import pandas as pd
import numpy as np

def claims(dataframe):
    wanted_index = dataframe[dataframe.severity == 1].index
    dataframe["claims_made"] = pd.Series(
        np.random.randint(200, 20000, size=len(wanted_index)),
        index=wanted_index)
    return dataframe

df = pd.DataFrame({"severity": [1, 1, 0, 8, -1, 99, 1]})
print(claims(df))
If you want to stick with your existing approach, you could do something like this:
from random import randint

def claims2(df):
    # count the matching rows (reading 'claims_made' before it exists would raise a KeyError)
    n_rows = len(df.loc[df.severity == 1])
    vals = [randint(200, 20000) for _ in range(n_rows)]
    df.loc[(df.severity == 1), 'claims_made'] = vals
    return df
p.s. I'd recommend accessing columns via df['severity'] instead of df.severity -- you can get into trouble using the . syntax if you have a dataset with spaces etc. in the column names.
I'll give you a broad hint; coding is up to you.
Form a series (a temporary column object) of random numbers in the desired range. Assign that series to your data frame column. You can find examples of this technique in any tutorial on data frames.
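For example, a minimal sketch of that technique (assuming the 'severity' and 'claims_made' names from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({"severity": [1, 1, 0, 8, -1, 99, 1]})
mask = df['severity'] == 1

# one random integer per matching row; note numpy's upper bound is exclusive
df.loc[mask, 'claims_made'] = np.random.randint(200, 20001, size=mask.sum())
print(df)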

Dataframe.sample - Weights - How to use it?

I have this situation:
I have a probability of 0.1348 calculated in a variable called treat_conv.
Now I am trying to create a dataframe from the original dataframe, using this probability for a specified column. Is that possible? I am trying to use weights, but with no success. Maybe I am using it wrong?
Follow my code:
weights = np.array(treat_conv) #creating a array with treat_conv
new_page_converted = df2.sample(n = treat_group.shape[0], weights=df2.converted(weights)) #creating new dataframe with the number of rows of treat_group and the column converted must have a 0.13 of chance to bring value 1
So, the code works if I use n alone: it creates a new dataframe with the correct number of rows. But I can't get the correct probability to bring a certain amount of value 1 in the converted column.
I hope my explanation is understandable.
Thank you!
You could do something like this
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.arange(0, 100, 1), columns=["SomeValue"])

selected = pd.DataFrame(data=np.random.choice(df["SomeValue"], int(len(df["SomeValue"]) * 0.13), replace=False),
                        columns=["SomeValue"])
selected["Trigger"] = 1

df = df.merge(selected, how="left", on="SomeValue")
df["Trigger"].fillna(0, inplace=True)
"df" is your original DataFrame. Then select a random 13% of its values and add a column indicating they've been selected. Finally, merge it all back onto your original DataFrame.

Trying to select data slices in pandas using time stamps as the locations

I start by creating a data frame from my input CSV file and I get the right format, but now I need to perform calculations on the V, I, and P columns. I want to split the data using the time stamps,
i.e. get the mean of V, I, and P for all the values between Test loop 0 and Test loop 1. I know I can do this using iloc, but I am trying to write a script that will work for different log files that might have a different number of entries.
Data frame output
Please let me know if you need any more information, any help/input is appreciated.
I think you need to extract the first 19 characters from column Time and aggregate by mean:
df = df.groupby(df['Time'].str[:19]).mean()
If you need to remove rows with NaNs first:
df = df.dropna()
df = df.groupby(df['Time'].str[:19]).mean()
If you want to have the mean for the lines between your 'Test loop' lines:
First, you need to extract the limits of your time windows:
serie_time_limits = df[df['Time'].str.contains('Test loop')]['Time'].str[:19]
df_data = df[~df['Time'].str.contains('Test loop')].copy()
df_data['Time'] = df_data['Time'].str[:19]
Then, you can get the mean for each test loop:
means = []
for i in range(len(serie_time_limits)):
    if i == len(serie_time_limits) - 1:
        # last window: everything from the final 'Test loop' time stamp onwards
        df_window = df_data[df_data['Time'] >= serie_time_limits.iloc[i]]
    else:
        df_window = df_data[(df_data['Time'] >= serie_time_limits.iloc[i]) &
                            (df_data['Time'] < serie_time_limits.iloc[i + 1])]
    means.append(df_window[['V', 'I', 'P']].mean())

Iterating over entire columns and storing the result into a list

I would like to know how I can iterate through each column of a dataframe, perform some calculations, and store the results in another dataframe.
df_empty = []
m = daily.ix[:, -1]        # columns = stocks & rows = daily returns
stocks = daily.ix[:, :-1]

for col in range(len(stocks.columns)):
    s = daily.ix[:, col]
    covmat = np.cov(s, m)
    beta = covmat[0, 1] / covmat[1, 1]
    return (beta)

print(beta)
In the above example, I first want to calculate a covariance matrix between "s" (the columns representing the stocks' daily returns, which I want to iterate through one by one) and "m" (the market daily return, which is my reference column/the last column of my dataframe). Then I want to calculate the beta for each stock/market covariance pair.
I'm not sure why return(beta) gives me a single numerical result for one stock while print(beta) prints the betas for all stocks.
I'd like to find a way to create a dataframe with all these betas.
beta_df = df_empty.append(beta)
I have tried the above code but it returns None, as if it could not append the outcome.
Thank you for your help
The return statement within your for-loop exits the enclosing function (and therefore the loop) the first time it is encountered. Moreover, you are not saving the beta values anywhere, because a for-loop itself does not return a value in Python (it only has side effects).
Apart from that, you may choose a more pandas-like approach using apply on the data frame, which basically iterates over the columns of the data frame, passes each column to a supplied function as its first parameter, and collects the results of the function calls. Here is a minimal working example with some dummy data:
import pandas as pd
import numpy as np

# create some dummy data
daily = pd.DataFrame(np.random.randint(100, size=(100, 5)))

# define reference column
cov_column = daily.iloc[:, -1]

# set up computation function
def compute(column):
    covmat = np.cov(column, cov_column)
    return covmat[0, 1] / covmat[1, 1]

# use apply to iterate over columns
result = daily.iloc[:, :-1].apply(compute)

# show output
print(result)
0 -0.125382
1 0.024777
2 0.011324
3 -0.017622
dtype: float64

Pandas - expanding inverse quantile function

I have a dataframe of values:
df = pd.DataFrame(np.random.uniform(0, 1, (500, 2)), columns=['a', 'b'])

>>> print df
            a         b
1    0.277438  0.042671
..        ...       ...
499  0.570952  0.865869

[500 rows x 2 columns]
I want to transform this by replacing the values with their percentile, where the percentile is taken over the distribution of all values in prior rows. i.e., if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex, and I'm asking to take the expanding percentile over the entire cross-sectional history.
So the goal is this guy:
      a   b
0    99  99
..   ..  ..
499  58  84
(Ideally I'd like to take the distribution of a value over the set of all values in all rows before and including that row, so not exactly an expanding percentile; but if we can't get that, that's fine.)
I have one really ugly way of doing this, where I transpose and unstack the dataframe, generate a percentile mask, and overlay that mask on the dataframe using a for loop to get the percentiles:
percentile_boundaries_over_time = pd.DataFrame({integer:
    pd.expanding_quantile(df.T.unstack(), integer / 100.0)
    for integer in range(0, 101, 1)})

percentile_mask = pd.Series(index=df.unstack().unstack().unstack().index)

for integer in range(0, 100, 1):
    percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
                    (df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer + 1])] = integer
I've been trying to get something faster to work, using scipy.stats.percentileofscore() and pd.expanding_apply(), but it's not giving the correct output and I'm driving myself insane trying to figure out why. This is what I've been playing with:
perc = pd.expanding_apply(df, lambda x: stats.percentileofscore(x, x[-1], kind='weak'))
Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any and all help much appreciated!
As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data each time. This will probably be the case for any current pre-packaged solution, including pd.DataFrame.rank or scipy.stats.percentileofscore. Repeatedly sorting is wasteful and computationally intensive, so we want a solution that minimizes that.
Taking a step back, finding the inverse quantile of a value relative to an existing data set is analogous to finding the position at which we would insert that value into the data set if it were sorted. The issue is that we also have an expanding set of data. Thankfully, some sorting algorithms are extremely fast when dealing with mostly sorted data (and inserting a small number of unsorted elements). Hence our strategy is to maintain our own array of sorted data and, with each row iteration, add the new values to the existing array and query their positions in the newly expanded sorted set. The latter operation is also fast given that the data is sorted.
I think insertion sort would be the fastest sort for this, but its performance will probably be slower in Python than any native NumPy sort. Merge sort seems to be the best of the available options in NumPy. An ideal solution would involve writing some Cython, but using our above strategy with NumPy gets us most of the way.
This is a hand-rolled solution:
def quantiles_by_row(df):
    """Reconstruct a DataFrame of expanding quantiles by row."""
    # Construct skeleton of DataFrame that we'll fill with quantile values
    quantile_df = pd.DataFrame(np.nan, index=df.index, columns=df.columns)

    # Pre-allocate numpy array. We only want to keep the non-NaN values from our DataFrame
    num_valid = np.sum(~np.isnan(df.values))
    sorted_array = np.empty(num_valid)

    # We want to maintain that sorted_array[:length] has data and is sorted
    length = 0

    # Iterate over ndarray rows
    for i, row_array in enumerate(df.values):
        # Extract non-NaN numpy array from row
        row_is_nan = np.isnan(row_array)
        add_array = row_array[~row_is_nan]

        # Add new data to our sorted_array and sort
        new_length = length + len(add_array)
        sorted_array[length:new_length] = add_array
        length = new_length
        sorted_array[:length].sort(kind="mergesort")

        # Query the relative positions, divide by length to get quantiles
        quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(float) / length

        # Insert values into quantile_df (positional row, boolean column mask)
        quantile_df.iloc[i, ~row_is_nan] = quantile_row

    return quantile_df
Based on the data that bhalperin provided (offline), this solution is up to 10x faster.
One final comment: np.searchsorted has options for 'left' and 'right' which determines whether you want your prospective inserted position to be the first or last suitable position possible. This matters if you have a lot of duplicates in your data. A more accurate version of the above solution will take the average of 'left' and 'right':
# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(float) / (length * 2)
It's not quite clear, but do you want a cumulative sum divided by the total?
norm = 100.0/df.a.sum()
df['cum_a'] = df.a.cumsum()
df['cum_a'] = df.cum_a * norm
ditto for b
Here's an attempt to implement your 'percentile over the set of all values in all rows before and including that row' requirement. stats.percentileofscore seems to act up when given 2D data, so squeezing seems to help in getting correct results:
from scipy import stats

a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)

for current_index in df.index:
    preceding_rows = df.loc[:current_index, :]
    # Combine values from all columns into a single 1D array
    # * 2 should be * N if you have N columns
    combined = preceding_rows.values.reshape((1, len(preceding_rows) * 2)).squeeze()
    a_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'a'],
        kind='weak'
    )
    b_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'b'],
        kind='weak'
    )
