I'm trying to select a column (of floats) in a dataframe and round all the values to the nearest 5, but only if the value is not NaN, which got me to this:
df["velocity"].apply(lambda x: 0 if x == None else 5 * round(float(x)/5))
However, upon execution, I get this error
ValueError: cannot convert float NaN to integer
Here is a vectorized pandas solution instead of the loop-based apply, using Series.astype and Series.round; lastly, replace None or missing values with 0:
(5 * (df["velocity"].astype(float) / 5).round()).fillna(0)
Or:
df["velocity"].astype(float).div(5).round().mul(5).fillna(0)
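As a runnable sketch (the sample data here is assumed, not from the question), either form rounds to the nearest multiple of 5 and fills the missing values with 0:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"velocity": [3.0, 7.4, np.nan, 12.6]})

# Divide by 5, round to the nearest integer, multiply back, then fill NaN with 0
df["rounded"] = df["velocity"].astype(float).div(5).round().mul(5).fillna(0)
print(df["rounded"].tolist())  # [5.0, 5.0, 0.0, 15.0]
```

Because the whole chain stays in float, NaN simply propagates until the final `fillna`, avoiding the `cannot convert float NaN to integer` error.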
I use the following function to calculate the rolling mean over 4 rows and use the returned value in the denominator of some ratios (in my case, 4 rows represent 4 quarters of the year):
def mean_year(var):
    mean_year = var.rolling(min_periods=1, window=4).mean()
    return mean_year
However, I am finding something really odd: if the column has all zeros, the mean_year value returned is not 0. Instead, tiny non-zero (sometimes negative) values are returned, e.g. 3.41061e-13.
I created this df to test what was going on:
df = pd.DataFrame(np.zeros((20, 2)), columns=list('AB'))
df['C'] = mean_year(df['A'])
Column C returns all zeros, as you would expect. All the columns are float64 in my dataset too. Any ideas why it would not return 0 when all the values in the column are 0 would be awesome!
That is floating-point rounding error shown in scientific notation; a value like 3.41061e-13 is effectively zero. Try rounding the mean before applying it to your dataframe:
def mean_year(var):
    mean_year = var.rolling(min_periods=1, window=4).mean().round(2)
    return mean_year
You can also set the pandas display option to show floats with two decimal places:
pd.set_option('display.float_format', lambda x: '%.2f' % x)
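To illustrate the effect, here is a hypothetical series where one large value passes through the rolling window; the running sum can leave tiny residue after the large value drops out, and rounding cleans it up:

```python
import pandas as pd

# Hypothetical data: one large value followed by zeros
s = pd.Series([1e12, 0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

raw = s.rolling(min_periods=1, window=4).mean()
cleaned = raw.round(2)
print(cleaned.tail(3).tolist())
```

Once the window contains only zeros, any leftover value is pure floating-point residue, so `round(2)` maps it back to 0.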
I have a very large column of phone numbers in a pandas dataframe, and they're in float format: 3.52831E+11. There are also NaNs present.
I am trying to convert the numbers to int and it's throwing an error that NaNs can't be converted to int. Fair enough. But I can't seem to get around this.
Here's a sample:
df = pd.DataFrame({'number':['3.578724e+11','3.568376e+11','3.538884e+11',np.NaN]})
number
0 3.578724e+11
1 3.568376e+11
2 3.538884e+11
3 NaN
# My first attempt: here's where I try to convert them to int() however I get 'cannot convert float NaN to integer'.
df['number'] = [int(x) for x in df['number'] if isinstance(x, float)]
# I have also tried the below, but I get SyntaxError: invalid syntax.
df['number'] = [int(x) for x in df['number'] if x not None]
# and then this one, but the error is: TypeError: must be real number, not str
df['number'] = [int(x) for x in df['number'] if not math.isnan(x) and isinstance(x, float)]
I'd appreciate some pointers on this. I thought at least one of these would work.
Thanks folks
From pandas 0.24+, we have the Nullable Integer Type. The first step is to convert your strings (objects) to float, then to nullable int:
df.astype('float').astype(pd.Int64Dtype())
number
0 357872400000
1 356837600000
2 353888400000
3 NaN
As a shorthand, you may also do,
df.astype('float').astype('Int64')
number
0 357872400000
1 356837600000
2 353888400000
3 NaN
On older versions, your only option will be to drop NaNs and convert:
df.dropna(subset=['number']).astype({'number':float}).astype({'number':int})
number
0 357872400000
1 356837600000
2 353888400000
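A self-contained sketch of the nullable-integer conversion (sample strings assumed, not the asker's full data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'number': ['3.578724e+11', '3.568376e+11', np.nan]})

# object (strings) -> float64 -> nullable Int64; the missing value survives
out = df.astype('float').astype('Int64')
print(out)
```

Note that newer pandas versions display the missing entry as `<NA>` rather than `NaN`, but the dtype is `Int64` either way.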
I have a pandas dataframe that has a certain column that should have values of a length of four. If the length is three, I would like to add a '0' to the beginning of the value. For example:
a b c
1 2 0054
3 6 021
5 5 0098
8 2 012
So in column c I would like to change the second row to '0021' and last row to '0012.' The values are already strings. I've tried doing:
df.loc[len(df['c']) == 3, 'c'] = '0' + df['c']
but it's not working out. Thanks for any help!
If the type in c is int, you can do something like this:
df['c'].apply(lambda x: ('0' * (4 - len(str(x)))) + str(x) if len(str(x)) < 4 else str(x))
In the lambda function, I check whether the number of digits/characters in x is less than four. If yes, I add zeros in front, so that the number of digits/characters in x will be four (this is also known as padding). If not, I return the value as string.
In case your type is string, you can remove the str() function calls, but it will work either way.
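A shorter vectorized alternative, if the values are already strings, is pandas' built-in str.zfill, which left-pads with zeros to a given width:

```python
import pandas as pd

df = pd.DataFrame({'c': ['0054', '021', '0098', '012']})

# Left-pad every value with zeros to a width of 4
df['c'] = df['c'].str.zfill(4)
print(df['c'].tolist())  # ['0054', '0021', '0098', '0012']
```

Values already four characters long are left untouched, so no length check is needed.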
I am trying to compare values in a table, it so happens that some might be zero and I therefore get an error message that I cannot divide by 0.
Why isn't the script returning inf instead of an error?
When I test this script on a dataframe with one column it works, with more than one column it breaks with the Zero Division Error.
table[change] = ['{0}%'.format(str(round(100*x, 2))) for x in
                 (table.ix[:, table.shape[1]-1] - table.ix[:, 0]) / table.ix[:, 0]]
table example:
             0      1      2      3      4      5      6
numbers    0.0  100.0  120.0  220.0  250.0  300.0  500.0
revenues  50.0  100.0  120.0  220.0  250.0  300.0  500.0
where table.ix[:,0] is 0.0.
Some of the values at table.ix[:,0] are zero and others are not; hence, try/except in my experience will not work, because the script will break once the divisor is equal to 0.
I tried two of the other methods and they did not work for me.
Can you be a little more descriptive in your answer? I am struggling to take the approach given.
I have another approach which I am trying and it is not working. I do not yet see what the problem is:
for index, row in table.iterrows():
    if row[0] == 0:
        table[change] = 'NaN'
    else:
        x = (row[-1] - row[0]) / row[0]
        table[change] = '{0}%'.format(round(100 * x, 2))
The 'change' column contains the same value in every row (i.e. the last comparison of the table).
Dividing by zero is usually a serious error; defaulting to infinity would not be appropriate for most situations.
Before attempting to calculate the value, check if the divisor (table.ix[:,0] in this case) is equal to zero. If it is, then skip the calculation and just assign whatever value you want.
Or you can wrap the division calculation in a try/except block as suggested by @Andrew.
Python has a specific ZeroDivisionError; you can use try/except to do something else in that case.
try:
    table[change] = ['{0}%'.format(str(round(100*x, 2))) for x in
                     (table.ix[:, table.shape[1]-1] - table.ix[:, 0]) / table.ix[:, 0]]
except ZeroDivisionError:
    table[change] = np.inf
In that case, you can just divide the whole Series, and Pandas will do the inf substitution for you. Something like:
if df1.ndim == 1:
    table[change] = np.inf
elif df1.ndim > 1 and df1.shape[0] > 1:
    table[change] = ['{0}%'.format(str(round(100*x, 2))) for x in
                     (table.ix[:, table.shape[1]-1] - table.ix[:, 0]) / table.ix[:, 0]]
The fact that your original example only had one row seems to make Pandas fetch the value in that cell for the division. If you do the division with an array with more than one row, it has the behaviour that I think you were originally expecting.
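A minimal illustration of that behaviour (hypothetical series): dividing a whole float Series by zero yields inf rather than raising:

```python
import numpy as np
import pandas as pd

first = pd.Series([0.0, 50.0])
last = pd.Series([500.0, 500.0])

# Element-wise Series division: a zero divisor produces inf, not an exception
change = (last - first) / first
print(change.tolist())  # [inf, 9.0]
```

Only scalar Python division (e.g. pulling a single cell out first) raises ZeroDivisionError; the array path follows NumPy's semantics.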
EDIT:
I've just spotted the generator expression that I completely overlooked. This is much easier than I thought.
Perform your normalisation first; then, if your version of pandas is up to date, you can call round if you want.
table["change"] = 100 * ((table.iloc[:, -1] - table.iloc[:, 0])/ table.iloc[:, 0])
#And if you're running Pandas v 0.17.0
table.round({"change" : 2})
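Putting it together on the sample table from the question, with inf values swapped for NaN afterwards (the replace step is an assumption about the desired output, not part of the original code):

```python
import numpy as np
import pandas as pd

table = pd.DataFrame([[0.0, 100.0, 500.0],
                      [50.0, 100.0, 500.0]],
                     index=['numbers', 'revenues'])

table["change"] = 100 * ((table.iloc[:, -1] - table.iloc[:, 0]) / table.iloc[:, 0])
# A zero divisor yields inf; optionally swap it for NaN
table["change"] = table["change"].replace([np.inf, -np.inf], np.nan)
print(table["change"].tolist())  # [nan, 900.0]
```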
Is it possible to change a column in a data frame that is float64 and holds some null values to an integer dtype? I get the following error
raise ValueError('Cannot convert NA to integer')
It is not possible to keep NaN in an integer column, even with workarounds: NaN is itself a float value, which is why pandas uses a float dtype to represent missing data. People nonetheless try a workaround like the following; let's check what happens if we do.
Convert all NaN values to 0 (if your data does not contain this value); if 0 can occur in your data, use a very large negative or positive number instead, say 9999999999.
df['x'].dtype   # output: dtype('float64')
df.loc[df['x'].isnull(), 'x'] = 9999999999
# or
df.loc[df['x'].isnull(), 'x'] = 0
Convert the column to int:
df['x'] = df['x'].astype('int64')   # dtype is now int64
Put back your NaN values:
df.loc[df['x'] == 0, 'x'] = np.nan
df['x'].dtype   # output: dtype('float64')
The technique above can also be used to convert a float column that contains NaN (and would otherwise raise errors) into an integer column, but you will have to give up the NaNs while the column holds integers.
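A runnable sketch of the round trip described above (using `fillna` and `mask` as equivalents of the `.loc` assignments; the sentinel 0 is assumed not to occur in the real data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, np.nan]})

# Steps 1-2: replace NaN with a sentinel (0 here, assumed absent from real data),
# after which the cast to int64 succeeds
df['x'] = df['x'].fillna(0).astype('int64')
print(df['x'].dtype)  # int64

# Step 3: restore the NaNs; the column silently reverts to float64
df['x'] = df['x'].mask(df['x'] == 0)
print(df['x'].dtype)  # float64
```

The final dtype check confirms the answer's point: as soon as NaN comes back, the column can no longer stay int64.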