I want to take the ratio of the target variable and distribute it to other non-zero variables in the same line, by their own weight. Can you help with this?
The rows currently sum to 100% including the target variable. I want to take the target's ratio and distribute it among the other variables so that the rates sum to 100% again (the target will be zero).
What you describe is just normalization of the rows:
no_target = df.columns != 'target'
norm = df.loc[:, no_target].sum(axis=1)  # row-wise sum of all values except target
df.loc[:, no_target] = df.loc[:, no_target].div(norm, axis=0) * 100
df['target'] = 0
I think you might be asking how to split the target value by the percentages in each column, replacing each percentage value x with target * x. You could do this by iterating over each percentage value and multiplying it by the target. Zero is not a special case, because 0 * target = 0. After every item in a row has been updated, set that row's target value to zero. If the percentages in a row sum to 1 before multiplying, the updated values in that row will sum to the former target.
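A minimal, vectorised sketch of that idea (the column names a and b and the numbers are made up; it assumes the non-target columns hold fractions that sum to 1 per row):
import pandas as pd

# hypothetical data: 'a' and 'b' hold fractions summing to 1 per row
df = pd.DataFrame({'a': [0.5, 0.2], 'b': [0.5, 0.8], 'target': [10.0, 40.0]})

other = df.columns != 'target'
# replace each fraction x with target * x, row by row
df.loc[:, other] = df.loc[:, other].mul(df['target'], axis=0)
df['target'] = 0
print(df)  # row 0 -> a=5.0, b=5.0; row 1 -> a=8.0, b=32.0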
If I'm not understanding your question, please post more details, including what you have tried so far.
I have a data frame with 1000 rows. I want to delete 500 observations based on a specific column Y, such that the bigger the value of Y, the higher the probability that the row is deleted.
One way to do that is to sort this column in ascending order. For i = 1,...,1000, draw a Bernoulli random variable with success probability p_i that depends on i, then delete all observations whose Bernoulli variable is 1.
So first I sort this column:
df_sorted = df.sort_values("mycolumn")
Next, I tried something like this:
p_i = np.linspace(0, 1, num=df_sorted.shape[0])
bernoulli = np.random.binomial(1, p_i)
delete_index = bernoulli == 1
I get delete_index as a boolean vector of True/False values, and the probability of getting True is higher at higher indices. However, I get more than 500 True values in it.
How do I get exactly 500 True values in this vector? And how do I use it to delete the corresponding rows of the data frame?
For example, if i = 1 in delete_index is False, the first row of the data frame won't be deleted; if it's True, it will be deleted.
I'm not sure why you are trying to limit the number of True values to exactly 500: because of the random binomial draw the count will be close to 500, but it will rarely be exactly 500. Here is one possible solution, though I don't know how useful it is for your purposes.
import numpy as np

p_i = np.linspace(0, 1, num=1000)

# Redraw until the number of ones is exactly 500
count = 0
while count != 500:
    bernoulli = np.random.binomial(1, p_i)
    count = np.sum(bernoulli)

# Use the draw as a positional boolean mask and keep the rows whose draw is 0
df_sorted = df_sorted[bernoulli == 0]
# This returns a new DataFrame with the 500 remaining rows
I hope this helps. I can't comment yet because of my reputation, which is why I posted this as an answer rather than a comment.
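If exactly 500 deletions are required, a different sketch (not part of the answer above, with purely illustrative weights) is weighted sampling without replacement via np.random.choice:
import numpy as np

n = len(df_sorted)  # assumes df_sorted is the full 1000-row sorted frame, before any deletion
weights = np.linspace(0, 1, num=n)
weights = weights / weights.sum()  # np.random.choice expects probabilities that sum to 1

# pick exactly 500 positional indices to delete, without replacement
delete_pos = np.random.choice(n, size=500, replace=False, p=weights)

keep = np.ones(n, dtype=bool)
keep[delete_pos] = False
df_kept = df_sorted[keep]  # the 500 rows that survive
The marginal deletion probabilities are not identical to the Bernoulli scheme above, but rows with larger Y remain more likely to be removed.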
I have successfully imported a temperature CSV file into a Python Pandas DataFrame. I have also found the mean value of a specific range:
df.loc[7623:23235, 'Temperature'].mean()
where 'Temperature' is Column title in DataFrame.
I would like to know if it is possible to change this function to find the average of the last 25% (or 1/4) of the input range (7623:23235).
Yes, you can use the quantile method to find the cutoff value for the top 25% of the values in the input range, and then use the mean method to average the values at or above that cutoff. Note that this selects the highest 25% by value, not the last 25% by position.
Here's how you can do it:
temps = df.loc[7623:23235, 'Temperature']
quantile = temps.quantile(0.75)
mean = temps[temps >= quantile].mean()
To find the average of the last 25% of the values in a specific range of a column in a Pandas DataFrame, you can use the iloc indexer along with slicing and the mean method.
For example, given a DataFrame df with a column 'Temperature', you can find the average of the last 25% of the values in the range 7623:23235 like this:
import math

# Slice the relevant range of values from the 'Temperature' column
temps = df['Temperature'].iloc[7623:23235]

# Find the length of the range
length = len(temps)

# Calculate the number of values to include in the average
n = math.ceil(length * 0.25)

# Calculate the position of the first value to include in the average
start_index = length - n

# Calculate the mean of the last 25% of those values
mean = temps.iloc[start_index:].mean()
print(mean)
This code first slices the relevant range of values from the 'Temperature' column, calculates the number of values that represent the last 25% of that range, and then uses the iloc indexer to select those values and average them with the mean method.
Note that this code assumes that the indices of the DataFrame are consecutive integers starting from 0. If the indices are not consecutive or do not start at 0, you may need to adjust the code accordingly.
I am trying to populate values in the column motooutstandingbalance by subtracting the previous row's actualmotordeductionfortheweek from the previous row's motooutstandingbalance. I am using the pandas shift command but am currently not getting the desired output, which should be a consistent reduction in motooutstandingbalance week by week.
Final result should look like this
Here is my code
x['motooutstandingbalance']=np.where(x.salesrepid == x.shift(1).salesrepid, x.shift(1).motooutstandingbalance - x.shift(1).actualmotordeductionfortheweek, x.motooutstandingbalance)
Any ideas on how to achieve this?
This works:
start_value = 468300.0
df['motooutstandingbalance'] = pd.concat(
    [(-df['actualmotordeductionfortheweek'][::-1]),
     pd.Series([start_value], index=[-1])]
)[::-1].cumsum().reset_index(drop=True)
Basically, what I'm doing is:
Taking the actualmotordeductionfortheweek column, negating it (all the values become negative), and reversing it
Adding the start value (which is positive, as opposed to all the other values which are negative) at index -1 (which is before 0, not at the very end as is usual in Python)
Reversing it back, so that the new -1 entry goes to the very beginning
Using cumsum() to add up all of the values of the column. This effectively subtracts all the deductions from the start value, because the first value is positive and the rest are negative (x + (-y) = x - y)
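For illustration, here is the same one-liner on a tiny hypothetical frame (the deduction figures are made up):
import pandas as pd

start_value = 468300.0
df = pd.DataFrame({'actualmotordeductionfortheweek': [1000.0, 1000.0, 2000.0]})

df['motooutstandingbalance'] = pd.concat(
    [(-df['actualmotordeductionfortheweek'][::-1]),
     pd.Series([start_value], index=[-1])]
)[::-1].cumsum().reset_index(drop=True)

print(df['motooutstandingbalance'].tolist())  # [468300.0, 467300.0, 466300.0]
Each row's balance is the previous row's balance minus the previous week's deduction, starting from start_value.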
I have a matrix where the values in each column need to be optimised such that the error with respect to a following equation is minimised.
Each number in the matrix must be a decimal between zero and one, inclusive.
Each column needs to sum up to 1, so each number represents a proportion of something.
For this, I'm using scipy.optimize.minimize() with bounds on all the values so that they fit these constraints. The matrix is flattened to be usable in the optimize function.
The resulting optimised matrix fits the bounds but fails to satisfy the constraint that each column sums to 1.
What can I do to make sure that each column sums to one with this function, or do you have a suggestion for a better optimizer?
Finished Minimisation:
Optimized Matrix:
Summed up columns (Should be 1. for every entry):
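A minimal sketch of one way to enforce the column sums, assuming the matrix is flattened row-major and using a placeholder objective (the shapes, objective, and names here are made up); SLSQP accepts equality constraints in addition to bounds:
import numpy as np
from scipy.optimize import minimize

n_rows, n_cols = 4, 3  # hypothetical matrix shape
x0 = np.full(n_rows * n_cols, 1.0 / n_rows)  # flattened start: every column already sums to 1

def objective(x_flat):
    M = x_flat.reshape(n_rows, n_cols)  # assumes row-major flattening
    return np.sum(M ** 2)  # placeholder for the real error function

# one equality constraint per column: its entries must sum to 1
constraints = [
    {'type': 'eq', 'fun': lambda x, j=j: x.reshape(n_rows, n_cols)[:, j].sum() - 1.0}
    for j in range(n_cols)
]

bounds = [(0.0, 1.0)] * (n_rows * n_cols)

result = minimize(objective, x0, method='SLSQP', bounds=bounds, constraints=constraints)
print(result.x.reshape(n_rows, n_cols).sum(axis=0))  # should be ~1.0 for every column
Bounds alone only keep each entry in [0, 1]; they cannot force a column to sum to 1, which is why explicit equality constraints (or a dedicated constrained optimizer) are needed.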
I am pretty new to programming, and I am sure many solutions exist, but for now mine does not seem to be working. I have a dataset with over 200 predictor variables, and the majority of them are binary (1 = event, 0 = no event). I want to filter out all variables whose event frequency is below a certain threshold, e.g. 100 occurrences.
I've tried something like this:
diag = luisa.T.reset_index().rename(columns={'index': 'diagnosis'})
frequency = pd.concat([diag.iloc[:, :1], pd.DataFrame(diag.sum(1))], axis=1).rename(columns={0: 'count'})
frequency.nlargest(150, 'count')
Please help!
You can take the column-wise sum and filter out the columns whose sums are below a certain value, keeping in mind that the sum represents the total number of events:
threshold = 100
col_sum = df.sum()
filtered_df = df[col_sum[col_sum > threshold].index]
This will store in filtered_df a subset of the original DataFrame without those columns.
If not all your columns are binary, then you need to include the additional step of performing this operation only on the binary columns, and then reversing the condition to find the columns which do not fulfil your criteria:
# identify the columns that contain only 0s and 1s
binary_columns = df.isin([0, 1]).all(axis=0)
binary_df = df.loc[:, binary_columns]
# sum the binary columns (the sum is the number of events) and drop those below the threshold
col_sum = binary_df.sum()
filtered_df = df.drop(columns=col_sum[col_sum < threshold].index)
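As a quick sanity check, here are the same steps on a hypothetical toy frame (column names, sizes, and probabilities are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'diag_a': np.random.binomial(1, 0.01, size=5000),  # rare event, roughly 50 occurrences expected
    'diag_b': np.random.binomial(1, 0.50, size=5000),  # frequent event
    'age': np.random.randint(18, 90, size=5000),  # not binary, left untouched
})

threshold = 100
binary_columns = df.isin([0, 1]).all(axis=0)
col_sum = df.loc[:, binary_columns].sum()
filtered_df = df.drop(columns=col_sum[col_sum < threshold].index)
print(filtered_df.columns.tolist())  # most likely ['diag_b', 'age']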