How to get "true" decimal place precision with pandas round? - python

What am I missing? I tried appending .round(3) to the end of the API call, but it doesn't work, and it also doesn't work as a separate call. The data type of every column is numpy.float32.
>>> summary_data = api._get_data(units=list(units.id),
...                              downsample=downsample,
...                              table='summary_tb',
...                              db=db).astype(np.float32)
>>> summary_data.head()
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.37945 70.399887 522.302124
1 20.0 1.0 1.0 1.0 3153.0 0.38449 70.575668 522.428162
2 30.0 1.0 1.0 1.0 3229.0 0.39079 70.575668 522.645020
3 40.0 1.0 1.0 1.0 3305.0 0.39438 70.575668 522.651184
4 50.0 1.0 1.0 1.0 3393.0 0.39690 70.663559 522.530090
>>> summary_data = summary_data.round(3)
>>> summary_data.head()
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.379 70.400002 522.302002
1 20.0 1.0 1.0 1.0 3153.0 0.384 70.575996 522.427979
2 30.0 1.0 1.0 1.0 3229.0 0.391 70.575996 522.645020
3 40.0 1.0 1.0 1.0 3305.0 0.394 70.575996 522.651001
4 50.0 1.0 1.0 1.0 3393.0 0.397 70.664001 522.530029
>>> print(type(summary_data))
pandas.core.frame.DataFrame
>>> print([type(summary_data[col][0]) for col in summary_data.columns])
[numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32]
It does in fact look like some form of rounding is taking place, but extra digits reappear in the result. Thanks in advance.
EDIT
The point of this is to use 32-bit floats, not 64-bit. I have since used pd.set_option('precision', 3), but according to the documentation this only affects the display, not the underlying values. As mentioned in a comment below, I am trying to minimize the number of atomic operations; calculations on 70.575996 vs 70.57600 are more expensive, and that is the issue I am trying to tackle. Thanks in advance.
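A quick check of that claim (my addition; recent pandas spells the option 'display.precision') shows the display changing while the stored float32 value does not:
>>> import numpy as np, pandas as pd
>>> s = pd.Series([70.575668], dtype=np.float32)
>>> pd.set_option('display.precision', 3)
>>> s                    # displayed with 3 decimals...
0    70.576
dtype: float32
>>> f"{s.iloc[0]:.6f}"   # ...but the stored value is unchanged
'70.575668'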

This is a floating-point precision issue: 70.576 has no exact binary representation, and float32 cannot get any closer to it than about 70.575996. You could cast the dtype to float (64-bit) before rounding:
>>> summary_data.astype(float).round(3)
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.379 70.400 522.302
1 20.0 1.0 1.0 1.0 3153.0 0.384 70.576 522.428
2 30.0 1.0 1.0 1.0 3229.0 0.391 70.576 522.645
3 40.0 1.0 1.0 1.0 3305.0 0.394 70.576 522.651
4 50.0 1.0 1.0 1.0 3393.0 0.397 70.664 522.530
If you change it back to np.float32 afterwards, it re-exhibits the issue:
>>> summary_data.astype(float).round(3).astype(np.float32)
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.379 70.400002 522.302002
1 20.0 1.0 1.0 1.0 3153.0 0.384 70.575996 522.427979
2 30.0 1.0 1.0 1.0 3229.0 0.391 70.575996 522.645020
3 40.0 1.0 1.0 1.0 3305.0 0.394 70.575996 522.651001
4 50.0 1.0 1.0 1.0 3393.0 0.397 70.664001 522.530029
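For a concrete illustration of the limitation (a quick check of my own, not from the original answer): the nearest float32 to 70.576 is noticeably further away than the nearest float64:
>>> import numpy as np
>>> f"{np.float32(70.576):.7f}"   # nearest float32 to 70.576
'70.5759964'
>>> f"{np.float64(70.576):.7f}"   # nearest float64 is far closer
'70.5760000'
So as long as the data stays in float32, round(3) can only store the nearest representable float32 to each rounded value.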

Related

Python Pandas groupby limited cumulative sum

This is my dataframe
import pandas as pd
import numpy as np
data = {'c1': [-1, -1, 1, 1, np.nan, 1, 1, 1, 1, 1, np.nan, -1],
        'c2': [1, 1, 1, -1, 1, 1, -1, -1, 1, -1, 1, np.nan]}
index = pd.date_range('2000-01-01','2000-03-20', freq='W')
df = pd.DataFrame(index=index, data=data)
>>> df
c1 c2
2000-01-02 -1.0 1.0
2000-01-09 -1.0 1.0
2000-01-16 1.0 1.0
2000-01-23 1.0 -1.0
2000-01-30 NaN 1.0
2000-02-06 1.0 1.0
2000-02-13 1.0 -1.0
2000-02-20 1.0 -1.0
2000-02-27 1.0 1.0
2000-03-05 1.0 -1.0
2000-03-12 NaN 1.0
2000-03-19 -1.0 NaN
and this is a cumulative sum by month
df2 = df.groupby(df.index.to_period('m')).cumsum()
>>> df2
c1 c2
2000-01-02 -1.0 1.0
2000-01-09 -2.0 2.0
2000-01-16 -1.0 3.0
2000-01-23 0.0 2.0
2000-01-30 NaN 3.0
2000-02-06 1.0 1.0
2000-02-13 2.0 0.0
2000-02-20 3.0 -1.0
2000-02-27 4.0 0.0
2000-03-05 1.0 -1.0
2000-03-12 NaN 0.0
2000-03-19 0.0 NaN
What I additionally need is to clamp the running sum so it never goes above 3 or below 0, something like this function:
def cumsum2(arr, low=-float('Inf'), high=float('Inf')):
    arr2 = np.copy(arr)
    sm = 0
    for index, elem in np.ndenumerate(arr):
        if not np.isnan(elem):
            sm += elem
        if sm > high:
            sm = high
        if sm < low:
            sm = low
        arr2[index] = sm
    return arr2
the desired result is
c1 c2
2000-01-02 0.0 1.0
2000-01-09 0.0 2.0
2000-01-16 1.0 3.0
2000-01-23 2.0 2.0
2000-01-30 2.0 3.0
2000-02-06 1.0 1.0
2000-02-13 2.0 0.0
2000-02-20 3.0 0.0
2000-02-27 3.0 1.0
2000-03-05 1.0 0.0
2000-03-12 1.0 1.0
2000-03-19 0.0 1.0
I tried to use apply and a lambda, but it doesn't work and it's slow for a large dataframe.
df.groupby(df.index.to_period('m')).apply(lambda x: cumsum2(x, 0, 3))
What's wrong? Is there a faster way?
You can use accumulate from itertools with a step function that clips the running total between 0 and 3, applied within each month via transform:
from itertools import accumulate

lb = 0  # lower bound
ub = 3  # upper bound

def cumsum2(dfm):
    def clip(bal, val):
        return np.clip(bal + val, lb, ub)
    return list(accumulate(dfm.to_numpy(), clip, initial=0))[1:]

out = df.fillna(0).groupby(df.index.to_period('m')).transform(cumsum2)
Output:
>>> out
c1 c2
2000-01-02 0.0 1.0
2000-01-09 0.0 2.0
2000-01-16 1.0 3.0
2000-01-23 2.0 2.0
2000-01-30 2.0 3.0
2000-02-06 1.0 1.0
2000-02-13 2.0 0.0
2000-02-20 3.0 0.0
2000-02-27 3.0 1.0
2000-03-05 1.0 0.0
2000-03-12 1.0 1.0
2000-03-19 0.0 1.0
In a case like this we can resort to pandas.Series.rolling with a window of size 2, piping each window to a custom function that keeps every interim accumulation within the given bounds:
def cumsum_tsh(x, low=-float('Inf'), high=float('Inf')):
    def f(w):
        w[-1] = min(high, max(low, w[0] if w.size == 1 else w[0] + w[1]))
        return w[-1]
    return x.apply(lambda s: s.rolling(2, min_periods=1).apply(f))

res = df.fillna(0).groupby(df.index.to_period('m'), group_keys=False)\
        .apply(lambda x: cumsum_tsh(x, 0, 3))
c1 c2
2000-01-02 0.0 1.0
2000-01-09 0.0 2.0
2000-01-16 1.0 3.0
2000-01-23 2.0 2.0
2000-01-30 2.0 3.0
2000-02-06 1.0 1.0
2000-02-13 2.0 0.0
2000-02-20 3.0 0.0
2000-02-27 3.0 1.0
2000-03-05 1.0 0.0
2000-03-12 1.0 1.0
2000-03-19 0.0 1.0
I've tried various solutions; for some reason the fastest is manipulating the individual columns of the frames created by groupby.
This is the code, in case it is useful to anyone:
def cumsum2(frame, low=-float('Inf'), high=float('Inf')):
    for col in frame.columns:
        sm = 0
        xs = []
        for e in frame[col]:
            sm += e
            if sm > high:
                sm = high
            if sm < low:
                sm = low
            xs.append(sm)
        frame[col] = xs
    return frame

res = df.fillna(0).groupby(df.index.to_period('m'), group_keys=False)\
        .apply(cumsum2, 0, 3)
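A note on the original attempt: cumsum2 misbehaves under apply because np.ndenumerate walks the whole 2-D group in row-major order, so both columns end up sharing a single running sum. If speed on the full ~500,000-row dataset is the main concern, a compiled loop is another option; below is a sketch assuming numba is available (an extra dependency not used in this thread), keeping one independent bounded sum per column:
import numba
import numpy as np
import pandas as pd

@numba.njit
def bounded_cumsum(values, low, high):
    # One independent running sum per column, clamped to [low, high] at each step.
    out = np.empty_like(values)
    for j in range(values.shape[1]):
        sm = 0.0
        for i in range(values.shape[0]):
            sm = min(high, max(low, sm + values[i, j]))
            out[i, j] = sm
    return out

filled = df.fillna(0)
res = pd.concat(
    pd.DataFrame(bounded_cumsum(g.to_numpy(), 0.0, 3.0), index=g.index, columns=g.columns)
    for _, g in filled.groupby(filled.index.to_period('m'))
)
Here the groupby only slices the frame per month; all of the arithmetic happens inside the compiled function.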

How to drop float feature where % NANs is higher than a certain number?

I'm trying to drop every float column whose share of missing values is higher than a certain threshold.
I've tried:
# Define threshold to 1/6
threshold = 0.1667
# Drop float > threshold
for f in data:
    if data[f].dtype==float & data[f].isnull().sum() / data.shape[0] > threshold: del data[f]
...which raises an error:
TypeError: unsupported operand type(s) for &: 'type' and
'numpy.float64'
Help would be appreciated.
Use DataFrame.select_dtypes to pick only the float columns, compute the fraction of missing values per column with mean (i.e. sum/count), add back the non-float columns with Series.reindex, and finally keep columns by the inverted condition (<= instead of >) with boolean indexing:
import numpy as np
import pandas as pd

np.random.seed(2019)
df = pd.DataFrame(np.random.choice([np.nan, 1], p=(0.2, 0.8), size=(10, 10))).assign(A='a')
print (df)
0 1 2 3 4 5 6 7 8 9 A
0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
1 1.0 1.0 NaN 1.0 NaN 1.0 NaN 1.0 1.0 1.0 a
2 1.0 1.0 1.0 1.0 1.0 NaN 1.0 NaN 1.0 1.0 a
3 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0 a
4 1.0 NaN 1.0 1.0 1.0 1.0 1.0 NaN 1.0 1.0 a
5 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0 a
6 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
7 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
8 1.0 NaN 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0 a
9 NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN a
threshold = 0.1667
df1 = df.select_dtypes(float).isnull().mean().reindex(df.columns, fill_value=False)
df = df.loc[:, df1 <= threshold]
print (df)
0 2 3 4 5 8 9 A
0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
1 1.0 NaN 1.0 NaN 1.0 1.0 1.0 a
2 1.0 1.0 1.0 1.0 NaN 1.0 1.0 a
3 1.0 1.0 1.0 1.0 1.0 NaN 1.0 a
4 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
6 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
7 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
8 1.0 1.0 1.0 1.0 1.0 1.0 1.0 a
9 NaN 1.0 1.0 1.0 1.0 1.0 NaN a
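As for the TypeError in the question (a note of mine, not part of the original answer): & binds more tightly than comparison operators, so the condition is parsed as data[f].dtype == (float & (data[f].isnull().sum() / data.shape[0])) > threshold, and evaluating float & numpy.float64 is what raises the error. With and instead of &, and collecting the columns before dropping them, the original loop idea works:
# Hypothetical fix of the loop from the question: use `and` (or parenthesize
# both comparisons when using `&`), and don't delete columns while iterating.
threshold = 0.1667

to_drop = [f for f in data.columns
           if data[f].dtype == float
           and data[f].isnull().sum() / data.shape[0] > threshold]
data = data.drop(columns=to_drop)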

Transposing in a specific order using Pandas

I have a df that looks like this:
EmpID Obs_ID Component Rating avg_2a avg_2d avg_3b avg_3c avg_3d
01301 13943 2a 3 3.0 3.0 1.0 1.33 1.0
01301 13944 2a 3 3.0 3.0 1.0 1.33 1.0
01301 13945 2a 3 3.0 3.0 1.0 1.33 1.0
01301 13945 2d 3 3.0 3.0 1.0 1.33 1.0
01301 13945 3b 3 3.0 3.0 1.0 1.33 1.0
And I need it to look like this:
EmpID comp_2a_obs_1 comp_2a_obs_2 comp_2a_obs_3 comp_2d_obs_1 ... ... comp_2a_avg comp_2d_avg comp_3b_avg comp_3c_avg comp_3d_avg
01301 3 3 3 3 ... ... 3.0 3.0 1.0 1.33 1.0
The observation order (obs_1, obs_2, obs_3) is based on Obs_ID (smallest to largest), and the values of the comp_2a_obs_1, etc. columns are drawn from Rating.
Can I do this with pd.Transpose or would I need a for loop or something else?
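No answer was recorded for this one; as a rough sketch (assuming the sample above lives in a DataFrame named df with exactly those columns), one way is to number the observations per component with groupby + cumcount and then pivot, rather than transposing:
import pandas as pd

# Number the observations within each (EmpID, Component) pair by ascending Obs_ID
df = df.sort_values('Obs_ID')
df['obs_n'] = df.groupby(['EmpID', 'Component']).cumcount() + 1

# Wide table of ratings: one column per Component/observation number
ratings = df.pivot_table(index='EmpID', columns=['Component', 'obs_n'], values='Rating')
ratings.columns = [f'comp_{c}_obs_{n}' for c, n in ratings.columns]

# The averages are constant per EmpID, so take the first value of each group
avg_cols = ['avg_2a', 'avg_2d', 'avg_3b', 'avg_3c', 'avg_3d']
avgs = df.groupby('EmpID')[avg_cols].first()
avgs.columns = [f'comp_{c.split("_")[1]}_avg' for c in avgs.columns]

out = ratings.join(avgs).reset_index()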

Pandas fillna() not working as expected

I'm trying to replace NaN values in my dataframe with means from the same row.
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({'A': [1.0, np.nan, 5.0],
                          'B': [1.0, 4.0, 5.0],
                          'C': [1.0, 1.0, 4.0],
                          'D': [6.0, 5.0, 5.0],
                          'E': [1.0, 1.0, 4.0],
                          'F': [1.0, np.nan, 4.0]})
sample_mean = sample_df.apply(lambda x: np.mean(x.dropna().values.tolist()), axis=1)
Produces:
0 1.833333
1 2.750000
2 4.500000
dtype: float64
But when I try to use fillna() to fill the missing dataframe values with values from the series, it doesn't seem to work.
sample_df.fillna(sample_mean, inplace=True)
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 NaN 4.0 1.0 5.0 1.0 NaN
2 5.0 5.0 4.0 5.0 4.0 4.0
What I expect is:
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.0 5.0 4.0 5.0 4.0 4.0
I've reviewed the other similar questions and can't seem to uncover the issue. Thanks in advance for your help.
By using pandas. Note that fillna with a Series aligns the Series index against the DataFrame's columns, so a row-wise mean indexed 0, 1, 2 never matches the columns A-F; transposing first turns the rows into columns so the alignment works, then transpose back:
sample_df.T.fillna(sample_df.T.mean()).T
Out[1284]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Here's one way -
sample_df[:] = np.where(np.isnan(sample_df), sample_df.mean(1)[:,None], sample_df)
Sample output -
sample_df
Out[61]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Another pandas way:
>>> sample_df.where(pd.notnull(sample_df), sample_df.mean(axis=1), axis='rows')
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
This works like an element-wise if/else: where pd.notnull(sample_df) is True, keep the corresponding element of sample_df; elsewhere take the value from sample_df.mean(axis=1), aligning that Series with the row index (axis='rows').
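One more option worth mentioning (my own addition, so treat it as a sketch): apply fillna row by row, which avoids the column alignment issue entirely at the cost of a Python-level loop over the rows:
>>> sample_df.apply(lambda row: row.fillna(row.mean()), axis=1)
      A    B    C    D    E     F
0  1.00  1.0  1.0  6.0  1.0  1.00
1  2.75  4.0  1.0  5.0  1.0  2.75
2  5.00  5.0  4.0  5.0  4.0  4.00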

Iterating and modifying dataframe using Pandas groupby

I am working with a large array of 1's and need to systematically zero out sections of it. The large array is made up of many smaller sub-arrays, and for each sub-array I need to replace its upper and lower triangles with 0's. For example, here is an array with 5 sub-arrays indicated by the index value (all sub-arrays have the same number of columns):
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
2 1.0 1.0 1.0
2 1.0 1.0 1.0
3 1.0 1.0 1.0
3 1.0 1.0 1.0
3 1.0 1.0 1.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
I want each group of rows to be modified in its upper and lower triangle such that the resulting matrix is:
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 0.0
1 0.0 1.0 1.0
2 1.0 0.0 0.0
2 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
3 1.0 1.0 0.0
3 0.0 1.0 1.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
4 1.0 1.0 0.0
4 1.0 1.0 1.0
4 0.0 1.0 1.0
4 0.0 0.0 1.0
At the moment I am using only numpy to build this mask, but I think I can speed it up using Pandas grouping. In reality my dataset is very large, almost 500,000 rows long. The numpy code is below:
import numpy as np

candidateLengths = np.array([1, 2, 3, 4, 5])
centroidLength = 3
smallPaths = [min(l, centroidLength) for l in candidateLengths]

# These are the k values of zeros to delete, to be used in np.tri
k_vals = list(map(lambda smallPath: centroidLength - smallPath, smallPaths))

maskArray = np.ones((np.sum(candidateLengths), centroidLength))
startPos = 0
endPos = 0
for canNo, canLen in enumerate(candidateLengths):
    a = np.ones((canLen, centroidLength))
    a *= np.tri(*a.shape, dtype=bool, k=k_vals[canNo])  # dtype=bool: np.bool was removed in recent NumPy
    b = np.fliplr(np.flipud(a))
    c = a * b
    endPos = startPos + canLen
    maskArray[startPos:endPos, :] = c
    startPos = endPos
print(maskArray)
When I run this on my real dataset it takes nearly 5-7 seconds to execute, which I think is down to this big for loop. How can I use pandas grouping to achieve higher speed? Thanks
New Answer
def tris(n, m):
    # Mask for an n x m block: the product of a triangular mask and its
    # 180-degree flip zeroes out the upper-right and lower-left corners.
    if n < m:
        a = np.tri(m, n, dtype=int).T
    else:
        a = np.tri(n, m, dtype=int)
    return a * a[::-1, ::-1]

idx = np.append(df.index.values, -1)
w = np.append(-1, np.flatnonzero(idx[:-1] != idx[1:]))  # last position of each run of index values
c = np.diff(w)                                          # length of each sub-array
df * np.vstack([tris(n, 3) for n in c])
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 0.0
1 0.0 1.0 1.0
2 1.0 0.0 0.0
2 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
3 1.0 1.0 0.0
3 0.0 1.0 1.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
4 1.0 1.0 0.0
4 1.0 1.0 1.0
4 0.0 1.0 1.0
4 0.0 0.0 1.0
Old Answer
I define some helper triangle functions
def tris(n, m):
    if n < m:
        a = np.tri(m, n, dtype=int).T
    else:
        a = np.tri(n, m, dtype=int)
    return a * a[::-1, ::-1]

def tris_df(df):
    n, m = df.shape
    return pd.DataFrame(tris(n, m), df.index, df.columns)
Then
df * df.groupby(level=0, group_keys=False).apply(tris_df)
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 0.0
1 0.0 1.0 1.0
2 1.0 0.0 0.0
2 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
3 1.0 1.0 0.0
3 0.0 1.0 1.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
4 1.0 1.0 0.0
4 1.0 1.0 1.0
4 0.0 1.0 1.0
4 0.0 0.0 1.0
