I have a spreadsheet with this formula. I understand the condition check, the calculation of ($R7/$O7), and the default value when the condition is not satisfied. What exactly happens inside (PRODUCT(1+($U7:Z7)))-1?
{=IF($T7>=AA$5,($R7/$O7)/(PRODUCT(1+($U7:Z7)))-1,"")}
Also, why do we have the {}? If I type the formula into a cell manually, it does not work.
I am trying to convert this formula to Python. This is the code I have:
df.loc[(df['T'] >= df['AA']), 'x'] = (df['R']/df['O'])/PRODUCT()-1
My question is how do I compute the PRODUCT part of this calculation?
The {} means the formula was entered as an array formula (with Ctrl+Shift+Enter in Excel); Excel adds the braces itself, which is why typing them manually does not work. As for the PRODUCT part: if you just want to calculate the product of an array where 1 is added to every value, with 1 subtracted from the result, that is easily done with numpy:
import numpy as np
arr = np.array([1,2,3,4,5])
product = np.prod(arr+1) - 1
print(product)
NumPy calculations are done element-wise, so adding 1 to every value is simply arr + 1.
Based on your updates in the comments, this is how it's done:
df.loc[df['T'] >= df['AA'], 'x'] = (df['R']/df['O']) / (df[['a', 'b']]+1).product(axis=1) - 1
where a and b are the column names (your $U:Z range). Note that the -1 is applied after the division, matching the precedence in the Excel formula, and that x is left as NaN in the rows where df['T'] >= df['AA'] is false (where Excel returns "").
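For a self-contained illustration (the column names T, AA, R, O, a and b are placeholders for your actual columns):
import pandas as pd
df = pd.DataFrame({
    'T': [5, 1], 'AA': [2, 3],          # condition columns
    'R': [10.0, 20.0], 'O': [2.0, 4.0], # numerator and denominator
    'a': [0.1, 0.2], 'b': [0.3, 0.4],   # the $U:Z growth rates
})
mask = df['T'] >= df['AA']
df.loc[mask, 'x'] = (df['R']/df['O']) / (df[['a', 'b']]+1).product(axis=1) - 1
print(df)  # row 0 gets (10/2)/(1.1*1.3)-1 ~ 2.4965, row 1 stays NaN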
Using Python, read a column 'H' from a dataframe and implement this function:
CDF = {SUM(H1:H$266)/G$14}*100
Where:
H$266 is the last element of the column, and
G$14 is the total sum of the column H.
In SUM(), the starting cell iterates (H1, H2, H3, ... H266) while the last cell stays fixed (H$266). So the first value of CDF is obviously 100, and it then decreases going down the column.
I want to implement this using dataframe.
As an example, you could do this:
from pandas import Series
s = Series([1, 2, 3]) # H1:H266 data
sum_of_s = s.sum() # G14
def calculus(subset, total_sum):
    return subset.sum() / total_sum * 100
result = Series([calculus(s.iloc[i:], sum_of_s) for i in range(len(s))])
print(result)
You should adapt it to your dataset, but that's the basic idea. Let me know if it works.
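If the data is already a dataframe column, the same result can be obtained without a Python loop via a reversed cumulative sum (a sketch, assuming the column is named 'H'):
import pandas as pd
df = pd.DataFrame({'H': [1, 2, 3]})
# cumulative sum from the bottom up, i.e. SUM(Hi:H$266) for every row i
df['CDF'] = df['H'][::-1].cumsum()[::-1] / df['H'].sum() * 100
print(df)  # CDF: 100.0, 83.33..., 50.0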
I have a data frame with 1 column.
- There are many NA values at the beginning and at the end that I would like to eliminate completely.
- At the same time, there are some NA values between two available values that I would like to fill with the mean of the two closest available values.
For illustration, I attached an image.
I cannot think of any solution. I just wonder if anyone can please help me with that.
Thank you for your help.
Try this; I have reproduced an example using random numbers:
import pandas as pd
import numpy as np
random_index = np.random.randint(0, 100, size=(5, 1))
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 1)), columns=list('A'))
df.loc[10:15, 'A'] = "#N/A"   # a contiguous block of missing values
for c in random_index:
    df.loc[c, "A"] = "#N/A"   # some isolated missing values
# replacing starts from here
df[df == "#N/A"] = np.nan
index = np.where(df['A'].isna())[0]
drops = []
for i in index:
    # fill a NaN with the mean of its neighbours when both are available
    if 0 < i < len(df) - 1 and not pd.isnull(df.loc[i - 1, "A"]) and not pd.isnull(df.loc[i + 1, "A"]):
        df.loc[i, "A"] = (df.loc[i - 1, "A"] + df.loc[i + 1, "A"]) / 2
    else:
        drops.append(i)
df = df.drop(df.index[drops]).reset_index(drop=True)
First, if the N/A values are strings, replace them with np.nan. The most straightforward way is to use isna on the given column, then extract the indices where it is True (for example with np.where, or by applying the mask to an np.arange array). From there you can either iterate over those indices with a for loop to check whether they are sequential, or calculate the distance between consecutive indices to find the ones not equal to 1.
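A minimal sketch of that approach (toy single-column data; only isolated interior NaNs get filled, everything else is dropped):
import numpy as np
import pandas as pd
s = pd.Series([np.nan, 5.0, np.nan, 7.0, 8.0, np.nan, np.nan])
idx = np.flatnonzero(s.isna())               # positions of the NaNs
inner = idx[(idx > 0) & (idx < len(s) - 1)]  # ignore the edges
for i in inner:
    left, right = s.iloc[i - 1], s.iloc[i + 1]
    if not (np.isnan(left) or np.isnan(right)):  # isolated gap
        s.iloc[i] = (left + right) / 2
s = s.dropna().reset_index(drop=True)        # drop whatever is still NaN
print(s)  # 5.0, 6.0, 7.0, 8.0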
FOUND SOLUTION: I needed to change the datatype of the columns in the dataframe. Before:
for p in periods:
    df['Probability{}'.format(p)] = 0
After:
for p in periods:
    df['Probability{}'.format(p)] = float(0)
Alternatively, do as in the accepted answer below.
I am assigning new float values to cells, but they are stored as integers, and I don't get why.
It is a part of a data mining project, which contains nested loops.
I am using Python 3.
I tried different ways of writing into a cell with pandas:
df.at[index, col] = float(val),
df.set_value(index, col, float(val)), and
df[col][index] = float(val), but none of them delivered a solution. The output I got was:
In: print(df[col][index])
Out: 0
In: print(val)
Out: 0.4774410939826658
Here is a simplified version of the loop:
periods = [7, 30, 90, 180]
for p in periods:
    df['Probability{}'.format(p)] = 0
for i in range(len(df.index)):
    for p in periods:
        if i >= p - 1:
            # Getting relevant data and computing the value
            vals = [df['Close'][j] for j in range(i - p, i)]
            probability = len([j for j in vals if j > 0]) / len(vals)
            # Assigning the value to a cell in the pd.DataFrame
            df.at[df.index[i], 'Probability{}'.format(p)] = float(probability)
I don't get why the pandas DataFrame changes floats to integers and rounds them up or down. When I assigned values to cells directly in the console, I did not experience any problems.
Are there any workarounds or solutions to this problem?
I had no problem before I nested a for loop over periods to avoid hard-coding a lot of trivial code.
NB: It also seems that if I scale the value, e.g. new_val = 100 * val, only the rounded number gets scaled. So 100 * val gives new_val = 0, because val has already been rounded down to 0.
I also tried to change the datatype of the dataframe:
df = df.apply(pd.to_numeric)
All the best.
Seems like a problem with incorrect data types in your dataframe. Your last attempt at converting the whole df was probably very close. Try using
df['Close'] = pd.to_numeric(df['Close'], downcast="float")
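To see why the dtype matters, here is a minimal reproduction (exact behaviour varies by pandas version: older versions silently truncate a float written into an int column, newer ones warn and/or upcast):
import pandas as pd
df = pd.DataFrame({'Close': [1.0, 2.0, 3.0]})
df['P_int'] = 0      # int64 column
df['P_float'] = 0.0  # float64 column
df.at[0, 'P_int'] = 0.477    # may be truncated to 0, or upcast with a warning
df.at[0, 'P_float'] = 0.477  # always keeps the decimals
print(df[['P_int', 'P_float']].dtypes)
print(df.loc[0, ['P_int', 'P_float']])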
Given a DF with 4 features and 1 index column:
df = pd.DataFrame(np.random.randint(0,100, size= (100,4)), columns=list('ABCD'))
df['index'] = range(1, len(df) + 1)
I want to calculate the Manhattan distance given input from a user. The user's inputs will be represented by a,b,c,d. The function is defined below.
def Manhattan_d(a, b, c, d):
    return (a - df['A']) + (b - df['B']) + (c - df['C']) + (d - df['D'])
When the answer is returned to me, it comes out as a list. Now I want to find the minimum value returned AND link it back to the index it came from.
If I do return min(formula), I get a single number as output and can't trace it back to the index it originally came from. If it helps, the index represents a category, so I need to find the category with the minimum output after the formula is applied.
Hope that's clear.
Perhaps a better approach is to apply Manhattan distance to each row of the dataframe. At that point, you can use .idxmin() to find the index of the point in the original dataframe which is most similar (has lowest Manhattan distance) to the point a,b,c,d that you fed the function.
def Manhattan_d(a, b, c, d, df):
    return df.apply(lambda row: abs(row['A'] - a) + abs(row['B'] - b) + abs(row['C'] - c) + abs(row['D'] - d), axis=1).idxmin()
Note: Manhattan distance requires the absolute value of the difference, which I have included.
Another note: it is generally good practice to pass all variables into a function, which is why I included df as an input to your function.
Another possibility is to use existing implementations, such as the DistanceMetric class from Scikit-learn.
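For reference, here is a vectorized sketch that skips the row-wise apply (assuming the same column names as above):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
def manhattan_idxmin(a, b, c, d, df):
    # broadcast the query point against every row, sum absolute differences per row
    return (df[['A', 'B', 'C', 'D']] - [a, b, c, d]).abs().sum(axis=1).idxmin()
print(manhattan_idxmin(10, 20, 30, 40, df))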
How do I null certain values in a numpy array based on a condition?
I don't understand why I end up with 0 instead of null or empty values where the condition is not met... b is a numpy array populated with 0 and 1 values, c is another fully populated numpy array. All arrays are 71x71x166.
import numpy as np

a = np.empty((71, 71, 166))
d = np.empty((71, 71, 166))
for indexes, value in np.ndenumerate(b):
    i, j, k = indexes
    a[i, j, k] = np.where(b[i, j, k] == 1, c[i, j, k], d[i, j, k])
I want to end up with an array which only has values where the condition is met and is empty everywhere else, without changing its shape.
FULL ISSUE FOR CLARIFICATION as asked for:
I start with a float populated array with shape (71,71,166)
I make an int array based on a cutoff applied to the float array, basically creating a number of bins, roughly marking out 10 areas within the array with 0 values in between.
What I want to end up with is an array with shape (71,71,166) which has the average values in a particular array direction (assuming vertical direction, if you think of a 3D array as a 3D cube) of a certain "bin"...
so I was trying to loop through the "bins" b == 1, b == 2 etc, sampling the float where that condition is met but being null elsewhere so I can take the average, and then recombine into one array at the end of the loop....
Not sure if I'm making myself understood. I'm using np.where with explicit indexing because I keep getting errors when I try to do it without, although it feels very inefficient.
Consider this example:
import numpy as np
data = np.random.random((4,3))
mask = np.random.randint(0, 2, (4, 3))
data[mask == 0] = np.nan
The data will be set to nan wherever the mask is 0. You can use any kind of condition you want, of course, or do something different for different values in b.
To erase everything except a specific bin, try the following:
c[b != 1] = np.nan
So, to make a copy of everything in a specific bin:
a = np.copy(c)
a[b != 1] = np.nan
To get the average of everything in a bin:
np.mean(c[b==1])
So perhaps this might do what you want (where bins is a list of bin values):
a = np.empty(c.shape)
a[b == 0] = np.nan
for bin_value in bins:
    a[b == bin_value] = np.mean(c[b == bin_value])
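A runnable toy version of that loop (small shapes and made-up bin labels, just to show the mechanics):
import numpy as np
c = np.random.random((4, 3))         # the float data
b = np.random.randint(0, 3, (4, 3))  # labels: 0 = background, 1 and 2 = bins
bins = [1, 2]
a = np.empty(c.shape)
a[b == 0] = np.nan
for bin_value in bins:
    a[b == bin_value] = np.mean(c[b == bin_value])
print(a)  # each bin replaced by its average, background is NaN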
np.empty sometimes fills the array with 0's; it's undefined what the contents of an empty() array is, so 0 is perfectly valid. For example, try this instead:
d = np.nan * np.empty((71, 71, 166))
But consider using numpy's strength, and don't iterate over the array:
a = np.where(b, c, d)
(since b is 0 or 1, I've excluded the explicit comparison b == 1.)
You may even want to consider using a masked array instead:
a = np.ma.masked_where(b == 0, c)
which seems to make more sense with respect to your question: "how do I null certain values in a numpy array based on a condition" (replace null with mask and you're done).
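A quick illustration of why the masked array fits the bin-averaging use case (toy arrays; reductions simply skip the masked entries):
import numpy as np
b = np.array([[1, 0], [0, 1]])
c = np.array([[1.0, 2.0], [3.0, 4.0]])
m = np.ma.masked_where(b == 0, c)  # keep c only where b == 1
print(m.mean())  # 2.5 -- the masked entries are ignored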