How to convert log2 transformed values back to normal scale in python
Any suggestions would be great
2**x (2 to the power of x) is the inverse of log2(x). If you have a column of data that has been transformed by log2(x), all you have to do is perform the inverse operation:
df['colname'] = [2**i for i in df['colname']]
As suggested in the comment below, it would be more efficient to do:
df['colname'] = df['colname'].rpow(2)
rpow is a Series method built into pandas. The first argument is the base to raise to the power of each element. You can also pass a fill_value argument, which fills missing (NaN) values before the computation instead of letting them propagate.
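For example, a minimal sketch of both behaviours (the values here are made up):
import numpy as np
import pandas as pd
s = pd.Series([3.0, np.nan, 10.0])
print(s.rpow(2))                # 8.0, NaN, 1024.0 -- the NaN propagates
print(s.rpow(2, fill_value=0))  # 8.0, 1.0, 1024.0 -- the NaN is filled with 0 first, so 2**0 == 1.0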
I have a column in a data frame called MARKET_VALUE that I need to pass to a downstream system in a defined format. MARKET_VALUE, a float, needs to be passed as two integer columns (a significand with no trailing zeros, and an exp) as follows:
MARKET VALUE SIGNIFICAND EXP
6.898806e+09 6898806 3
6.898806e+05 6898806 -1
6.898806e+03 6898806 -3
I contemplated using formatted strings but am convinced there must be a smarter solution. The data frame is large, containing millions of rows, so a solution that doesn't depend on apply would be preferable.
Generate a random pandas dataframe
I use a DataFrame consisting of 1e5 rows (you could try with more to test the bottleneck)
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.random((100000,2))**10, columns=['random1', 'random2'])
Use the .apply method
In this case I use standard Python string formatting.
.8E formats the number with 8 digits after the decimal point.
[:-4] removes the exponent part, keeping only the significand digits.
[-3:] keeps only the exponent with its sign, which is then converted to an int.
# get the significand
df.random1.apply(lambda x: f'{x:.8E}'[:-4].replace('.', ''))
# get the exp
df.random1.apply(lambda x: int(f'{x:.0E}'[-3:]))
On my laptop this takes less than 100 ms.
I am still thinking about a faster (vectorised) solution; a rough sketch of one idea is below, but for now I hope this helps.
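A rough vectorised sketch of that idea, assuming strictly positive values with at most seven significant digits (as in the MARKET_VALUE example); split_significand_exp is just a hypothetical helper name:
import numpy as np
import pandas as pd

def split_significand_exp(values, sig_digits=7):
    # vectorised split of positive floats into an integer significand and a power-of-ten exponent
    v = np.asarray(values, dtype=float)
    # decimal exponent of the leading digit, e.g. 6.898806e+09 -> 9
    lead = np.floor(np.log10(v)).astype(np.int64)
    # integer significand with sig_digits digits, e.g. 6898806
    sig = np.round(v / 10.0 ** (lead - sig_digits + 1)).astype(np.int64)
    exp = lead - sig_digits + 1
    # strip trailing zeros from the significand, bumping the exponent accordingly
    for _ in range(sig_digits - 1):
        trailing = (sig % 10 == 0) & (sig != 0)
        if not trailing.any():
            break
        sig = np.where(trailing, sig // 10, sig)
        exp = np.where(trailing, exp + 1, exp)
    return sig, exp

df['significand'], df['exp'] = split_significand_exp(df.random1)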
I'm working with time series data and have transformed numbers to logarithmic differences with numpy.
df['dlog']= np.log(df['columnx']).diff()
Then I made predictions with that transformation.
How can I return to normal numbers?
Reversing the transformation shouldn't be necessary, because columnx still exists in df
.diff() calculates the difference of a Series element compared with another element in the Series (by default, the element in the previous row).
The first row of dlog is therefore NaN. Without a "base" value (e.g. np.log(764677), the log of the first observation), there is no way to reverse the transformation.
df = pd.DataFrame({'columnx': [np.random.randint(1_000_000) for _ in range(100)]})
df['dlog'] = np.log(df.columnx).diff()
Output:
columnx dlog
764677 NaN
884574 0.145653
621005 -0.353767
408960 -0.417722
248456 -0.498352
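That said, if you keep the first value as a base, the log-differences can be undone with a cumulative sum followed by np.exp; a minimal sketch:
base = df.columnx.iloc[0]
df['reconstructed'] = base * np.exp(df.dlog.fillna(0).cumsum())
# df.reconstructed now matches df.columnx, up to floating-point error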
Undo np.log with np.exp
Use np.exp to transform from a logarithmic to linear scale.
df = pd.DataFrame({'columnx': [np.random.randint(1_000_000) for _ in range(100)]})
df['log'] = np.log(df.columnx)
df['linear'] = np.exp(df.log)
Output:
columnx log linear
412863 12.930871 412863.0
437565 12.988981 437565.0
690926 13.445788 690926.0
198166 12.196860 198166.0
427894 12.966631 427894.0
Further Notes:
Without a reproducible set, it's not possible to offer further solutions
You can include some data: How to make good reproducible pandas examples
Include the code used to transform the data: Minimal, Reproducible Example
Another option is to produce the predictions without taking np.log.
I'd like to do something like this:
df['A'] = max(0, min(df.B, df.C - df.D))
However, I get a ValueError ("the truth value of a Series is ambiguous"), which I guess means that the max and min functions are doing some boolean operations under the hood, and this doesn't distribute over the Series. I understand you can get the min/max of some set of columns by e.g.
df[['col1','col2','col3']].max(axis = 1)
and so I should be able to get my desired output by way of making some temporary columns with intermediate values, but I'd like a clean solution that does it directly. How does one do this without having to make extra columns for throwaway intermediate values?
max and min are built-in Python functions. They aren't designed for the element-wise, vectorised behaviour that comes with pandas / NumPy; comparing whole Series with them triggers the ambiguous truth-value error.
Instead, you can use np.maximum / np.minimum to perform element-wise calculations:
import numpy as np
df['A'] = np.maximum(0, np.minimum(df['B'], df['C'] - df['D']))
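For instance, with some made-up numbers to show the element-wise behaviour:
import numpy as np
import pandas as pd
df = pd.DataFrame({'B': [1.0, 5.0, -2.0], 'C': [3.0, 4.0, 1.0], 'D': [1.0, 6.0, 0.5]})
df['A'] = np.maximum(0, np.minimum(df['B'], df['C'] - df['D']))
print(df['A'])  # 1.0, 0.0, 0.0
The same result can be obtained with pandas alone via Series.clip (an upper clip for the min, a lower clip for the max), if you'd rather not reach for numpy.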
I can compare two Pandas series for exact equality using pandas.Series.equals. Is there a corresponding function or parameter that will check if the elements are equal to some ε of precision?
You can use numpy.allclose:
numpy.allclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False)
Returns True if two arrays are element-wise equal within a tolerance.
The tolerance values are positive, typically very small numbers. The
relative difference (rtol * abs(b)) and the absolute difference atol
are added together to compare against the absolute difference between
a and b.
numpy works well with pandas.Series objects, so if you have two of them, s1 and s2, you can simply do:
np.allclose(s1, s2, atol=...)
Where atol is your tolerance value.
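A minimal example of the difference from an exact comparison:
import numpy as np
import pandas as pd
s1 = pd.Series([1.0, 2.0, 3.0])
s2 = pd.Series([1.0, 2.0, 3.0 + 1e-9])
print(s1.equals(s2))                   # False -- exact comparison
print(np.allclose(s1, s2, atol=1e-6))  # True  -- equal within tolerance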
Numpy works well with pandas Series. However, one has to be careful with the order of the indices (or of the columns and indices for a pandas DataFrame).
For example
series_1 = pd.Series(data=[0,1], index=['a','b'])
series_2 = pd.Series(data=[1,0], index=['b','a'])
np.allclose(series_1,series_2)
will return False
A workaround is to use the index of one pandas series
np.allclose(series_1, series_2.loc[series_1.index])
If you want to avoid numpy, there is another way, use assert_series_equal
import pandas as pd
s1 = pd.Series([1.333333, 1.666666])
s2 = pd.Series([1.333, 1.666])
from pandas.testing import assert_series_equal
assert_series_equal(s1,s2)
raises an AssertionError. So use the check_less_precise flag:
assert_series_equal(s1, s2, check_less_precise=True)  # no AssertionError
This doesn't raise an AssertionError because check_less_precise=True only compares 3 digits after the decimal point.
See the docs here
Using asserts isn't ideal, but if you want to avoid numpy, this is one way.
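Note that in newer pandas versions (1.1 and later) check_less_precise is deprecated, and it was eventually removed, in favour of explicit tolerances; a sketch assuming a recent pandas:
from pandas.testing import assert_series_equal
# rtol/atol replace check_less_precise in pandas >= 1.1
assert_series_equal(s1, s2, check_exact=False, atol=1e-3)  # no AssertionError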
Note: I'm posting this mostly because I came to this thread via a Google search of something similar, and it seemed too long for a comment. It's not necessarily the best solution, nor strictly "ε of precision"-based, but it's an alternative using scaling and rounding if you want to compare vectors (i.e. rows) of a DataFrame (rather than scalars of a Series) without looping through explicitly:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# X and X2 are assumed to be the two DataFrames whose rows are being compared
Xcomb = pd.concat((X, X2), axis=0, ignore_index=True)
# scale
scaler = MinMaxScaler()
scaler.fit(Xcomb)
Xscl = scaler.transform(Xcomb)
# round
df_scl = pd.DataFrame(np.round(Xscl, decimals=8), columns=X.columns)
# post-processing
n_uniq = df_scl.drop_duplicates().shape[0]
n_dup = X.shape[0] + X2.shape[0] - n_uniq
print(f"Number of shared rows: {n_dup}")
I am working with the interpolate function in pandas. Here is a toy example to make an illustrative case:
df=pd.DataFrame({'Data':np.random.normal(size=200), 'Data2':np.random.normal(size=200)})
df.iloc[1, 0] = np.nan
print(df)
print(df.interpolate('nearest'))
My question: Does the interpolate function work over multiple columns? That is, does it use multivariate analysis to determine the value for a missing field? Or does it simply look at individual columns?
The docs reference the various available methods - most just rely on the index, possibly via the univariate scipy.interpolate.interp1d or other univariate scipy methods:
method : {‘linear’, ‘time’, ‘index’, ‘values’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘krogh’, ‘polynomial’, ‘spline’, ‘piecewise_polynomial’, ‘pchip’}, default ‘linear’
‘linear’: ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
‘time’: interpolation works on daily and higher resolution data to interpolate given length of interval
‘index’, ‘values’: use the actual numerical values of the index
‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘polynomial’ are passed to scipy.interpolate.interp1d. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method=’polynomial’, order=4). These use the actual numerical values of the index.
‘krogh’, ‘piecewise_polynomial’, ‘spline’, and ‘pchip’ are all wrappers around the scipy interpolation methods of similar names. These use the actual numerical values of the index.
Scipy docs and charts illustrating output here
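In other words, each column is interpolated on its own, using only that column's values and the index, not the other columns; a small sketch with made-up numbers to illustrate:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Data': [1.0, np.nan, 3.0], 'Data2': [100.0, 200.0, 300.0]})
# The NaN in 'Data' becomes 2.0 regardless of what 'Data2' contains
print(df.interpolate(method='linear'))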