I am trying to compare values in a table. Some of them might be zero, so I get an error message saying that I cannot divide by 0.
Why isn't the script returning inf instead of an error?
When I test this script on a dataframe with one column it works, with more than one column it breaks with the Zero Division Error.
table[change] = ['{0}%'.format(str(round(100*x,2))) for x in \
    (table.ix[:,table.shape[1]-1] - table.ix[:,0]) / table.ix[:,0]]
table example:
             0      1      2      3      4      5      6
numbers    0.0  100.0  120.0  220.0  250.0  300.0  500.0
revenues  50.0  100.0  120.0  220.0  250.0  300.0  500.0
where table.ix[:,0] is 0.0.
Some of the values in table.ix[:,0] are zero and others are not. In my experience try/except will therefore not work, because the script breaks as soon as the divisor is equal to 0.
I tried two of the other methods and they did not work for me.
Can you be a little more descriptive in your answer? I am struggling to take the approach given.
I have another approach which I am trying and it is not working. Do not see yet what the problem is:
for index, row in table.iterrows():
    if row[0] == 0:
        table[change] = 'Nan'
    else:
        x = (row[-1] - row[0]) / row[0]
        table[change] = '{0} {1}%'.format( str(round(100 * x, 2)))
Every row of the 'change' column ends up with the same value (i.e. the result of the last comparison in the table).
Dividing by zero is usually a serious error; defaulting to infinity would not be appropriate for most situations.
Before attempting to calculate the value, check if the divisor (table.ix[:,0] in this case) is equal to zero. If it is, then skip the calculation and just assign whatever value you want.
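As a rough illustration of the "check the divisor first" idea, here is a minimal sketch. The sample data copies the table from the question, the "n/a" placeholder is an arbitrary choice, and .iloc is used because .ix was removed in pandas 1.0.

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the table in the question.
table = pd.DataFrame(
    [[0.0, 100.0, 120.0, 220.0, 250.0, 300.0, 500.0],
     [50.0, 100.0, 120.0, 220.0, 250.0, 300.0, 500.0]],
    index=["numbers", "revenues"],
)

first = table.iloc[:, 0]   # divisor column
last = table.iloc[:, -1]   # last column

# Only divide where the divisor is non-zero; the other rows stay NaN.
nonzero = first != 0
change = pd.Series(np.nan, index=table.index)
change[nonzero] = 100 * (last[nonzero] - first[nonzero]) / first[nonzero]

# Format as a percentage string, with a placeholder for the skipped rows.
table["change"] = [
    "n/a" if pd.isna(x) else "{0}%".format(round(x, 2)) for x in change
]
print(table["change"])
```

Swap "n/a" for whatever value you want to show when the first column is zero.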
Or you can wrap the division calculation in a try/except block, as suggested by @Andrew.
It looks like Python has a specific ZeroDivisionError, so you can use try/except to do something else in that case.
try:
    table[change] = ['{0}%'.format(str(round(100*x,2))) for x in \
        (table.ix[:,table.shape[1]-1] - table.ix[:,0]) / table.ix[:,0]]
except ZeroDivisionError:
    table[change] = float('inf')
In that case, you can just divide the whole Series, and Pandas will do the inf substitution for you. Something like:
if table.ndim == 1:
    table[change] = float('inf')
elif table.ndim > 1 and table.shape[0] > 1:
    table[change] = ['{0}%'.format(str(round(100*x,2))) for x in \
        (table.ix[:,table.shape[1]-1] - table.ix[:,0]) / table.ix[:,0]]
The fact that your original example only had one row seems to make Pandas fetch the value in that cell for the division. If you do the division with an array with more than one row, it has the behaviour that I think you were originally expecting.
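To make the difference concrete, here is a small sketch (the numbers are made up) contrasting element-wise division of two float Series with plain Python float division. It only covers float64 data; other dtypes may behave differently.

```python
import pandas as pd

num = pd.Series([500.0, 450.0])
den = pd.Series([0.0, 50.0])

# Element-wise division of float Series: a zero divisor yields inf, no exception.
ratio = num / den
print(ratio.tolist())

# Plain Python floats raise instead.
try:
    500.0 / 0.0
    raised = False
except ZeroDivisionError:
    raised = True
print(raised)
```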
EDIT:
I've just spotted the generator expression that I completely overlooked. This is much easier than I thought.
Perform your normalisation first; then, if your version of pandas is up to date, you can call round if you want.
table["change"] = 100 * ((table.iloc[:, -1] - table.iloc[:, 0]) / table.iloc[:, 0])
# And if you're running pandas v0.17.0 or later
table = table.round({"change": 2})
I'm trying to select a column(of floats) in a dataframe and round all the values to the nearest 5 but only if the value is not NaN, which got me to this
df["velocity"].apply(lambda x: 0 if x == None else 5 * round(float(x)/5))
However, upon execution, I get this error
ValueError: cannot convert float NaN to integer
Here is a vectorized pandas version, instead of the loop-based apply solution, using Series.astype and Series.round; the last step replaces None or missing values with 0:
(5 * (df["velocity"].astype(float) / 5).round()).fillna(0)
Or:
df["velocity"].astype(float).div(5).round().mul(5).fillna(0)
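For context, the original lambda fails because a missing value in a float column is NaN rather than None, so the x == None test never matches, and the built-in round() cannot turn NaN into an integer. A small sketch of the vectorized version on made-up data (the velocity values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical sample; None in a float column is stored as NaN.
df = pd.DataFrame({"velocity": [12.4, None, 7.6, np.nan, 2.4]})

# Divide by 5, round to the nearest whole number, scale back, then fill gaps with 0.
rounded = df["velocity"].astype(float).div(5).round().mul(5).fillna(0)
print(rounded.tolist())
```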
I wanna have a function that runs through every row of my data frame and calculates a value in a column by using the values of other columns.
this is the dataframe:
goal_value ordered_value proportion_of
0 20,67 5,34 0
1 30,38 3,78 0
For example this is the code i m using rn:
def calculate_value(row):
    value1 = row[0]
    value2 = row[1]
    row[2] = value2/value1 * 100

dataframe.apply(calculate_value, axis = 'columns')
right now i get a ZeroDivisionError: float division by zero
Can someone pls help me.
This is all the code you need, if I understand you correctly:
df["proportion_of"] = df["ordered_value"] / df["goal_value"] * 100
It is much faster than using apply, because this is a vectorized operation, whereas apply works row by row.
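A short sketch of how the vectorized division behaves when a goal value is zero. The third row is invented to trigger that case, and replacing inf with NaN is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the last row has a zero goal value.
df = pd.DataFrame({
    "goal_value": [20.67, 30.38, 0.0],
    "ordered_value": [5.34, 3.78, 1.50],
})

# Column-wise division gives inf for a zero divisor instead of raising.
ratio = df["ordered_value"] / df["goal_value"] * 100

# Treat "no goal" as missing rather than infinite (an assumption about intent).
df["proportion_of"] = ratio.replace([np.inf, -np.inf], np.nan)
print(df)
```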
I am trying to take the log of my data to create a model, but some rows contain a zero value, so the output of log(0) is -inf, which causes a problem.
Is there a better solution than deleting all the zero values?
my data set is :
ds y
0 2020-01-22 0
1 2020-01-23 0
2 2020-01-24 0
3 2020-01-25 0
4 2020-01-26 0
... ... ...
353 2021-01-09 148799
354 2021-01-10 149792
355 2021-01-11 150753
356 2021-01-12 151723
357 2021-01-13 152719
You can use list comprehension:
import pandas as pd
import numpy as np
a = pd.DataFrame({'y':[0,7,67656,15,0,2]})
print(a)
a["log"] = np.log10(a["y"])
a["log"] = [x[1] if x[0]!=0 else 0 for x in a.to_numpy()]
Or even easier:
import math
a["log"] = [math.log10(x) if x!=0 else 0 for x in a["y"]]
Put whatever you need after else.
Here is a simple workaround for your problem; there may be a more sophisticated way of doing it.
You could simply take log(x+a) instead of log(x), where a is a chosen constant (or a hyperparameter, in ML lingo). The value of a could be a very small number (for example 0.00001), depending on your aim and tolerance. You would then calculate log(0+0.00001) instead of log(0), which with a base-10 log gives -5 instead of blowing up.
As I said, there may be a smarter way of going about it, such as understanding the data thoroughly: why the zeros are there in the first place, what they represent, etc.
(Note: the value of a shouldn't be extremely small, however. All packages and software work with finite precision, so x+a may be rounded back to x, which effectively turns your a into zero.)
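A minimal sketch of the shifted-log idea, reusing the small example frame from the earlier answer. The constant eps is an arbitrary assumption, and np.log1p (natural log of 1 + x) is shown as the common special case a = 1.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"y": [0, 7, 67656, 15, 0, 2]})

eps = 1e-5  # assumed shift; choose it to suit your data and tolerance

# Base-10 log of the shifted values: zeros map to roughly -5 instead of -inf.
a["log_shifted"] = np.log10(a["y"] + eps)

# Natural log of (1 + y): zeros map to exactly 0.
a["log1p"] = np.log1p(a["y"])
print(a)
```

Keep in mind that the shift changes every value slightly, not only the zeros.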
You may like to use numpy.log with the "where" parameter:
df['y'] = np.log(df['y'], where = df['y'] != 0)
You can also do it the following way:
mask = df['y'] != 0
valid = df.loc[mask,'y']
df['y'] = 0.0
df.loc[mask, 'y'] = np.log(valid)
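One caveat with the where= form: NumPy only writes the positions where the condition is true, so without an explicit out= array the remaining positions are left uninitialized. A sketch that supplies a zero-filled output array (the sample values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"y": [0, 0, 148799, 149792]})

y = df["y"].to_numpy(dtype=float)

# Pre-fill the output with 0.0; np.log only overwrites entries where y != 0.
df["log_y"] = np.log(y, out=np.zeros_like(y), where=y != 0)
print(df)
```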
I have a spreadsheet with this formula. I am able to understand the condition checking part, the calculation of ($R7/$O7) and default value if condition does not satisfy. What exactly happens inside the PRODUCT(1+($U7:Z7)))-1 ?
{=IF($T7>=AA$5,($R7/$O7)/(PRODUCT(1+($U7:Z7)))-1,"")}
Also, why do we have {}? If I manually type the formula in some cell, it does not work.
I am trying to convert this formula to python. This is the code I have:
df.loc[(df['T'] >= df['AA']), 'x'] = (df['R']/df['O'])/PRODUCT()-1
My question is how do I compute the PRODUCT part of this calculation?
If you just want to know how to calculate the product of an array where 1 is added to every value, and 1 is subtracted from the result, it can easily be done with numpy:
import numpy as np
arr = np.array([1,2,3,4,5])
product = np.prod(arr+1) - 1
print(product)
Numpy calculations are done array-wise, so adding 1 to every value is simply array+1.
Based on your updates in the comments, this is how it's done:
df.loc[(df['T'] >= df['AA']), 'x'] = (df['R']/df['O']) / ((df[['a', 'b']]+1).product(axis=1) - 1)
Where a and b are the column names. Notice that this formula returns NaN when df['T'] >= df['AA'] is false.
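In Excel, division binds tighter than subtraction, so the spreadsheet formula reads as ((R/O) / PRODUCT(1 + U:Z)) - 1. In other words, the 1 is subtracted from the whole quotient rather than from the product. The braces mark an array formula, which is entered with Ctrl+Shift+Enter rather than typed by hand. A hedged sketch that follows that reading is below; the column names and sample values are hypothetical, and AA is treated as a column, as in the question's code.

```python
import pandas as pd

# Hypothetical frame: U and V stand in for the spreadsheet range U:Z.
df = pd.DataFrame({
    "R": [110.0, 50.0],
    "O": [100.0, 100.0],
    "T": [3, 1],
    "AA": [2, 2],
    "U": [0.10, 0.05],
    "V": [0.00, 0.02],
})
rate_cols = ["U", "V"]

# PRODUCT(1 + range): add 1 to every rate, then multiply across each row.
growth = (1 + df[rate_cols]).prod(axis=1)

# Apply the formula only where the IF condition holds; other rows stay NaN.
mask = df["T"] >= df["AA"]
df.loc[mask, "x"] = (df.loc[mask, "R"] / df.loc[mask, "O"]) / growth[mask] - 1
print(df)
```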
I use pandas frequently and often execute code comparable to the following:
df['var_rank'] = df['var'].rank(pct=True)
print( df.var_rank.max() )
And will often get values greater than 1. It still happens whether I keep or drop 'na' values. This is obviously easy to fix (just divide by the value with the largest rank), so I'm not asking for a work-around. I'm just curious why this happens and haven't found any clues online.
Anyone know why this happens?
Some very simple example data here (dropbox link - pickled pandas series).
I get a value of 1.0156 from df.rank(pct=True).max(). I've had other data with values as high as 4 or 5. I'm usually using pretty messy data.
You have bad data.
>>> s.rank(pct=True).max()
1.015625
>>> s.sort_values(inplace=True)
>>> s.tail(7)
8 202512882
6 253661077
102 -
101 -
99 -
58 -
116 -
Name: Total Assets, dtype: object
>>> s[s != u'-'].rank(pct=True).max()
1.0
In Pandas 0.18.0 (released last week), you can specify numeric only:
s.rank(pct=True, numeric_only=True)
I've tried the above in 0.18.0 and couldn't seem to get it to work, so you can also do this to rank all float and int values:
>>> s[s.apply(lambda x: isinstance(x, (int, float)))].rank(pct=True).max()
1.0
It creates a boolean mask making sure each value is an int or float, and then ranks the filtered result.
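Another option, not mentioned above, is to coerce the column to numbers before ranking, so that placeholders such as '-' become NaN and are left out of the ranking. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical series mixing numbers with '-' placeholders.
s = pd.Series([202512882, 253661077, "-", "-", 1500], name="Total Assets")

# Anything that cannot be parsed as a number becomes NaN.
numeric = pd.to_numeric(s, errors="coerce")

# NaN values keep a NaN rank and do not count towards the percentage.
ranks = numeric.rank(pct=True)
print(ranks.max())
```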