I can compare two Pandas series for exact equality using pandas.Series.equals. Is there a corresponding function or parameter that will check if the elements are equal to some ε of precision?
You can use numpy.allclose:
numpy.allclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False)
Returns True if two arrays are element-wise equal within a tolerance.
The tolerance values are positive, typically very small numbers. The
relative difference (rtol * abs(b)) and the absolute difference atol
are added together to compare against the absolute difference between
a and b.
numpy works well with pandas.Series objects, so if you have two Series s1 and s2, you can simply do:
np.allclose(s1, s2, atol=...)
Where atol is your tolerance value.
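For example, a minimal sketch (the values below are made up for illustration):
import numpy as np
import pandas as pd

s1 = pd.Series([1.000001, 2.000001, 3.000001])
s2 = pd.Series([1.000002, 2.000002, 3.000002])

print(s1.equals(s2))                   # False: exact comparison
print(np.allclose(s1, s2, atol=1e-5))  # True: equal within tolerance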
Numpy works well with pandas Series. However, one has to be careful with the order of indices (or columns and indices for a pandas DataFrame).
For example:
import numpy as np
import pandas as pd

series_1 = pd.Series(data=[0, 1], index=['a', 'b'])
series_2 = pd.Series(data=[1, 0], index=['b', 'a'])
np.allclose(series_1, series_2)
will return False, because np.allclose compares the values positionally and ignores the index.
A workaround is to reindex one series using the other's index:
np.allclose(series_1, series_2.loc[series_1.index])
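Alternatively, both series can be sorted by index before comparing; a small sketch:
import numpy as np
import pandas as pd

series_1 = pd.Series(data=[0, 1], index=['a', 'b'])
series_2 = pd.Series(data=[1, 0], index=['b', 'a'])

# align both series on the same index order before comparing values
np.allclose(series_1.sort_index(), series_2.sort_index())  # True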
If you want to avoid numpy, there is another way: use pandas.testing.assert_series_equal.
import pandas as pd
from pandas.testing import assert_series_equal

s1 = pd.Series([1.333333, 1.666666])
s2 = pd.Series([1.333, 1.666])

assert_series_equal(s1, s2)
raises an AssertionError, so use the check_less_precise flag:
assert_series_equal(s1, s2, check_less_precise=True)  # no AssertionError
This doesn't raise an AssertionError, as check_less_precise only compares 3 digits after the decimal point.
See the assert_series_equal docs for details.
Using asserts isn't ideal, but if you want to avoid numpy, this is one way.
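Note that check_less_precise has since been deprecated; on a newer pandas (1.1+) the same comparison can be written with an explicit tolerance instead, for example:
import pandas as pd
from pandas.testing import assert_series_equal

s1 = pd.Series([1.333333, 1.666666])
s2 = pd.Series([1.333, 1.666])

# atol/rtol replace check_less_precise in pandas 1.1+
assert_series_equal(s1, s2, atol=1e-3)  # no AssertionError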
Note: I'm posting this mostly because I came to this thread via a Google search of something similar and it seemed too long for a comment. It's not necessarily the best solution, nor strictly "ε of precision"-based, but it is an alternative using scaling and rounding if you want to do this for vectors (i.e. rows) of a DataFrame (rather than scalars of a Series) without looping through explicitly:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# X and X2 are the two DataFrames being compared (same columns)
Xcomb = pd.concat((X, X2), axis=0, ignore_index=True)

# scale
scaler = MinMaxScaler()
scaler.fit(Xcomb)
Xscl = scaler.transform(Xcomb)

# round
df_scl = pd.DataFrame(np.round(Xscl, decimals=8), columns=X.columns)

# post-processing: rows appearing in both frames collapse under drop_duplicates
n_uniq = df_scl.drop_duplicates().shape[0]
n_dup = X.shape[0] + X2.shape[0] - n_uniq
print(f"Number of shared rows: {n_dup}")
Related
I have a DataFrame with several rows and columns and I have transformed it into a numpy array to speed up the calculations.
The first five columns of the Dataframe looked like this:
par1 par2 par3 par4 par5
1.502366 2.425301 0.990374 1.404174 1.929536
1.330468 1.460574 0.917349 1.172675 0.766603
1.212440 1.457865 0.947623 1.235930 0.890041
1.222362 1.348485 0.963692 1.241781 0.892205
...
These columns are now stored in a numpy array a = df.values
I need to check whether at least two of the five columns satisfy a condition (i.e., their value is larger than a certain threshold). Initially I wrote a function that performed the operation directly on the dataframe. However, because I have a very large amount of data and need to repeat the calculations over and over, I switched to numpy to take advantage of the vectorization.
To check the condition I was thinking to use
df['Result'] = np.where(condition_on_parameters > 2, True, False)
However, I cannot figure out how to write condition_on_parameters such that it returns True or False when at least 2 out of the 5 parameters are larger than the threshold. I thought of using the sum() function on condition_on_parameters, but I am not sure how to write such a condition.
EDIT
It is important to specify that the thresholds are different for each parameter. For example thr1=1.2, thr2=2.0, thr3=1.5, thr4=2.2, thr5=3.0. So I need to check that par1 > thr1, par2 > thr2, ..., par5 > thr5.
Assuming condition_on_parameters returns an array the same size as a with entries of True or False, you can use np.sum(condition_on_parameters, axis=1) to sum over the true values (True has a numerical value of 1) of each row. This gives a 1D array whose entries are the number of columns that meet the condition, which can then be used with where to flag the rows you are looking for.
df['result'] = np.where(np.sum(condition_on_parameters, axis=1) >= 2, True, False)
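To cover the edit about per-parameter thresholds, here is a minimal sketch (it reuses the thr1..thr5 values from the question and assumes the five parameter columns of df are named par1..par5):
import numpy as np

# thresholds from the question's edit: thr1..thr5
thresholds = np.array([1.2, 2.0, 1.5, 2.2, 3.0])

a = df[['par1', 'par2', 'par3', 'par4', 'par5']].values

# broadcasting compares each column against its own threshold
condition_on_parameters = a > thresholds

# True where at least 2 of the 5 parameters exceed their thresholds
df['result'] = np.sum(condition_on_parameters, axis=1) >= 2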
Can you exploit pandas functionalities? For example, you can efficiently check conditions on multiple rows/columns with .apply and then .sum(axis=1).
Here some sample code:
import pandas as pd

df = pd.DataFrame([[1.50, 2.42, 0.88], [0.98, 1.3, 0.56]], columns=['par1', 'par2', 'par3'])

# custom condition, e.g. value less than or equal to a threshold
def leq(x, t):
    return x <= t

condition = df.apply(lambda x: leq(x, 1)).sum(axis=1)

# filter
df.loc[condition >= 2]
I think this should be roughly equivalent to numpy in terms of efficiency, as pandas is ultimately built on top of it; however, I'm not entirely sure.
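Since the comparison itself is already vectorised, the same count can also be obtained without .apply, for example:
import pandas as pd

df = pd.DataFrame([[1.50, 2.42, 0.88], [0.98, 1.3, 0.56]], columns=['par1', 'par2', 'par3'])

# element-wise comparison gives a boolean DataFrame; summing counts True per row
condition = (df <= 1).sum(axis=1)
df.loc[condition >= 2]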
It seems you are looking for numpy.any
import numpy as np
import pandas as pd

a = np.array([[1.502366, 2.425301, 0.990374, 1.404174, 1.929536],
              [1.330468, 1.460574, 0.917349, 1.172675, 0.766603],
              [1.212440, 1.457865, 0.947623, 1.235930, 0.890041],
              [1.222362, 1.348485, 0.963692, 1.241781, 0.892205]])

df = pd.DataFrame(a, columns=[f'par{i}' for i in range(1, 6)])
df['Result'] = np.any(df > 1.46, axis=1)  # append the result column
This gives a dataframe where Result is True for the first two rows and False for the last two.
I am trying to apply the argrelextrema function to a dataframe df, but I am unable to apply it correctly. Below is my code:
import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

np.random.seed(42)

def maxloc(data):
    loc_opt_ind = argrelextrema(df.values, np.greater)
    loc_max = np.zeros(len(data))
    loc_max[loc_opt_ind] = 1
    data['loc_max'] = loc_max
    return data

values = np.random.rand(23000)
df = pd.DataFrame({'value': values})
np.all(maxloc(df).loc_max)
It gives me an error at the line loc_max[loc_opt_ind] = 1:
IndexError: too many indices for array
A Pandas dataframe is two-dimensional. That is, df.values is two-dimensional, even when it has only one column. As a result, loc_opt_ind will contain x and y indices (two tuples; just print loc_opt_ind to see), which can't be used to index loc_max. You probably want to use either df['value'].values (which gives you the 1-D Series values) or np.squeeze(df.values) as input. Note that argrelextrema still returns a tuple in that case, just a one-element one, so you may need loc_opt_ind[0] (np.where has similar behaviour).
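A minimal sketch of that fix, reusing the question's setup and keeping the column name value:
import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

def maxloc(data):
    # use the 1-D array of the column, not the 2-D frame
    values = data['value'].values
    loc_opt_ind = argrelextrema(values, np.greater)[0]  # unpack the one-element tuple
    loc_max = np.zeros(len(data))
    loc_max[loc_opt_ind] = 1
    data['loc_max'] = loc_max
    return data

np.random.seed(42)
df = pd.DataFrame({'value': np.random.rand(23000)})
df = maxloc(df)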
I'd like to do something like this:
df['A'] = max(0, min(df.B, df.C - df.D))
However, I get a ValueError ("the truth value of a Series is ambiguous"), which I guess means that the max and min functions are doing some boolean operations under the hood, and this doesn't distribute over the Series. I understand you can get the min/max of some set of columns by e.g.
df[['col1','col2','col3']].max(axis = 1)
and so I should be able to get my desired output by way of making some temporary columns with intermediate values, but I'd like a clean solution that does it directly. How does one do this without having to make extra columns for throwaway intermediate values?
max and min are built-in Python functions. They aren't designed for the vectorised functionality that comes with Pandas / NumPy.
Instead, you can use np.maximum / np.minimum to perform element-wise calculations:
import numpy as np
df['A'] = np.maximum(0, np.minimum(df['B'], df['C'] - df['D']))
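For instance, with some made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [3.0, -1.0, 5.0],
                   'C': [4.0,  2.0, 1.0],
                   'D': [1.0,  6.0, 2.0]})

# element-wise max(0, min(B, C - D)) for each row
df['A'] = np.maximum(0, np.minimum(df['B'], df['C'] - df['D']))
print(df['A'].tolist())  # [3.0, 0.0, 0.0]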
In Pandas, there is a method DataFrame.shift(n) which shifts the contents of an array by n rows, relative to the index, similarly to np.roll(a, n). I can't seem to find a way to get a similar behaviour working with Dask. I realise things like row-shifts may be difficult to manage with Dask's chunked system, but I don't know of a better way to compare each row with the subsequent one.
What I'd like to be able to do is this:
import numpy as np
import pandas as pd
import dask.dataframe as dd

with pd.HDFStore(path) as store:
    data = dd.from_hdf(store, 'sim')[col1]
    shifted = data.shift(1)
    idx = data.apply(np.sign) != shifted.apply(np.sign)
in order to create a boolean series indicating the locations of sign changes in the data. (I am aware that method would also catch changes from a signed value to zero)
I would then use the boolean series to index a different Dask dataframe for plotting.
Rolling functions
Currently dask.dataframe does not implement the shift operation. It could though if you raise an issue. In principle this is not so dissimilar from rolling operations that dask.dataframe does support, like rolling_mean, rolling_sum, etc..
Actually, if you were to create a Pandas function that adheres to the same API as these pandas.rolling_foo functions then you can use the dask.dataframe.rolling.wrap_rolling function to turn your pandas style rolling function into a dask.dataframe rolling function.
dask.dataframe.rolling_sum = wrap_rolling(pandas.rolling_sum)
The following code might help to shift down the series.
s = dd_df['column'].rolling(window=2).sum() - dd_df['column']
Edit (03/09/2019):
When you are rolling and finding the sum, for a particular row,
result[i] = row[i-1] + row[i]
Then by subtracting the old value of the column from the result, you are doing the following operation:
final_row[i] = result[i] - row[i]
Which equals:
final_row[i] = row[i-1] + row[i] - row[i]
Which ultimately results in the whole column getting shifted down once.
Tip:
If you want to shift it down multiple rows, you should actually execute the whole operation again that many times with the same window.
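A quick way to sanity-check that identity with plain pandas (dask.dataframe mirrors the same rolling API):
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# rolling sum over a window of 2, minus the column itself, reproduces shift(1)
shifted = s.rolling(window=2).sum() - s
print(shifted.tolist())     # [nan, 1.0, 2.0, 3.0]
print(s.shift(1).tolist())  # [nan, 1.0, 2.0, 3.0]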
This operation needs to be as fast as possible, as the actual arrays contain millions of elements. This is a simplified version of the problem.
So, I have a random array of unique integers (normally millions of elements).
totalIDs = [5,4,3,1,2,9,7,6,8 ...]
I have other arrays (normally tens of thousands of elements each) of unique integers with which I can create a mask.
subsampleIDs1 = [5,1,9]
subsampleIDs2 = [3,7,8]
subsampleIDs3 = [2,6,9]
...
I can use numpy to do
mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)
I can then extract the information I want from another array using the mask (say column 0 contains the values I want).
variable = allvariables[mask][:,0]
Now, given that the IDs are unique in both arrays, is there any way to speed this up significantly? It takes a long time to construct the mask for a few thousand points (subsampleIDs) matched against millions of IDs (totalIDs).
I thought of going through it once and writing out a binary file of an index (to speed up future searches).
for i in range(0, 3):
    mask = np.in1d(totalIDs, subsampleIDs, assume_unique=True)
    index[mask] = i
where X is in subsampleIDsX. Then I can just do:
for i in range(0, 3):
    if index[i] == i:
        rowmatch = i
        break

variable = allvariables[rowmatch:len(subsampleIDs), 0]
right? But this is also slow because there is a conditional in the loop to find when it first matches. Is there a faster way to find when a number first appears in an ordered array so the conditional doesn't slow the loop?
I suggest you use a DataFrame in Pandas. Set the index of the DataFrame to totalIDs, and you can then select subsampleIDs with df.ix[subsampleIDs].
Create some test data first:
import numpy as np
N = 2000000
M = 5000
totalIDs = np.random.randint(0, 10000000, N)
totalIDs = np.unique(totalIDs)
np.random.shuffle(totalIDs)
v1 = np.random.rand(len(totalIDs))
v2 = np.random.rand(len(totalIDs))
subsampleIDs = np.random.choice(totalIDs, M)
subsampleIDs = np.unique(subsampleIDs)
np.random.shuffle(subsampleIDs)
Then convert your data into a DataFrame:
import pandas as pd
df = pd.DataFrame(data = {"v1":v1, "v2":v2}, index=totalIDs)
df.ix[subsampleIDs]
DataFrame uses a hashtable to map the index to its location, so it's very fast.
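Note that .ix has since been removed from pandas; on a modern version the same lookup can be written with .loc or .reindex, continuing the example above:
df.loc[subsampleIDs]      # raises KeyError if any ID is missing from the index
df.reindex(subsampleIDs)  # missing IDs become rows of NaN instead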
Often this kind of indexing is best performed using a DB (with proper column-indexing).
Another idea is to sort totalIDs once, as a preprocessing stage, and implement your own version of in1d, which avoids sorting everything. The numpy implementation of in1d (at least in the version that I have installed) is fairly simple, and should be easy to copy and modify.
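For example, here is a sketch of that sort-once idea using np.searchsorted (reusing the question's totalIDs, subsampleIDs and allvariables names, and assuming subsampleIDs is a subset of totalIDs):
import numpy as np

totalIDs = np.asarray(totalIDs)

# one-off preprocessing: sort totalIDs and remember the original positions
order = np.argsort(totalIDs)
sorted_ids = totalIDs[order]

def membership_mask(query_ids):
    # binary-search each query ID in the sorted array
    query_ids = np.asarray(query_ids)
    pos = np.searchsorted(sorted_ids, query_ids)
    pos = np.clip(pos, 0, len(sorted_ids) - 1)
    found = sorted_ids[pos] == query_ids
    # boolean mask over the original (unsorted) totalIDs
    mask = np.zeros(len(totalIDs), dtype=bool)
    mask[order[pos[found]]] = True
    return mask

mask = membership_mask(subsampleIDs)
variable = allvariables[mask][:, 0]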
EDIT:
Or, even better, use bucket sort (or radix sort). That should give you O(N+M), N being the size of totalIDs, and M the size of sampleIDs (times a constant you can play with by changing the number of buckets). Here too, you can split totalIDs to buckets only once, which gives you a nifty O(N+M1+M2+...).
Unfortunately, I'm not aware of a numpy implementation, but I did find this: http://en.wikipedia.org/wiki/Radix_sort#Example_in_Python