Rowwise max/min in pandas - python

I'd like to do something like this:
df['A'] = max(0, min(df.B, df.C - df.D))
However, I get a ValueError ("the truth value of a Series is ambiguous"), which I guess means that the max and min functions are doing some boolean operations under the hood, and this doesn't distribute over the Series. I understand you can get the min/max of some set of columns by e.g.
df[['col1','col2','col3']].max(axis = 1)
and so I should be able to get my desired output by way of making some temporary columns with intermediate values, but I'd like a clean solution that does it directly. How does one do this without having to make extra columns for throwaway intermediate values?

max and min are built-in Python functions. They aren't designed for the vectorised operations that pandas / NumPy provide.
Instead, you can use np.maximum / np.minimum to perform element-wise calculations:
import numpy as np
df['A'] = np.maximum(0, np.minimum(df['B'], df['C'] - df['D']))
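If you'd prefer to stay inside pandas, the same row-wise logic can be written with min(axis=1) and clip; a minimal sketch, equivalent to the NumPy version above:
import pandas as pd
# element-wise minimum of B and (C - D) via a temporary two-column frame,
# then floor the result at 0 with clip (no throwaway columns are added to df)
df['A'] = pd.concat([df['B'], df['C'] - df['D']], axis=1).min(axis=1).clip(lower=0)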

Related

Repeat calculations for every row of dataframe

The following calculation is for the 1st row, i.e., train_df.y1[0]; I want to repeat it for all 400 rows of train_df:
squared_deviations_y1_0_train = ((ideal_df.loc[:0,"y1":"y50"] - train_df.y1[0]) ** 2).sum(axis=1)
The result is correct; I just need to repeat it.
Since your end result seems to be a scalar, you can convert both of these dataframes to Numpy and take advantage of broadcasting.
Something like this,
squared_deviations = ((ideal_df.to_numpy() - train_df.y1.to_numpy().reshape(-1,1)) ** 2).sum(axis=1)
would do pretty nicely. If you MUST stay within pandas, you could use the subtract() method to get the same outcome.
(train_df.y1.subtract(ideal_df.T) ** 2).sum(axis=0)
Note that train_df.y1 becomes a row vector of size (400,), so you need to make the row dimension 400 to do this subtraction (hence the transpose of ideal_df).
You can also use the apply() method as Barmar suggested. This will require you to define a function that calculates the row index so that you can subtract the appropriate value of train_df for every cell before you perform the square and sum operations. Something like this,
(ideal_df.apply(lambda cell: cell - train_df.y1[cell.index]) ** 2).sum(axis=1)
would also work. I highly recommend using Numpy for these tasks because Numpy was designed with broadcasting in mind, but as shown you can get away with doing it in Pandas.
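For illustration, here is a self-contained sketch with made-up shapes (50 "ideal" columns, 400 training rows) showing why the reshape(-1, 1) is needed for broadcasting:
import numpy as np
ideal = np.random.rand(400, 50)   # stands in for ideal_df.to_numpy()
train_y1 = np.random.rand(400)    # stands in for train_df.y1.to_numpy()
# reshape(-1, 1) turns the (400,) vector into a (400, 1) column,
# so it broadcasts across all 50 columns of `ideal`
squared_deviations = ((ideal - train_y1.reshape(-1, 1)) ** 2).sum(axis=1)
print(squared_deviations.shape)   # (400,)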

How to convert log2 scale to normal scale in pandas

How to convert log2 transformed values back to normal scale in python
Any suggestions would be great
log2(x) is the inverse of 2**x (2 to the power of x). If you have a column of data that has been transformed by log2(x), all you have to do is perform the inverse operation:
df['colname'] = [2**i for i in df['colname']]
As suggested in the comment below, it would be more efficient to do:
df['colname'] = df['colname'].rpow(2)
rpow is a built-in pandas Series method. The first argument is the base to raise to the power of each element (here 2). It also accepts a fill_value argument, which lets you substitute a value for missing (NaN) entries before the computation.
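As a small sketch with made-up values (assuming a column of log2-transformed data with one missing entry):
import numpy as np
import pandas as pd
s = pd.Series([0.0, 1.0, 3.321928, np.nan], name='colname')
s.rpow(2)                  # 2 ** s element-wise -> [1.0, 2.0, ~10.0, NaN]
s.rpow(2, fill_value=0)    # missing inputs are treated as 0 first, so the NaN becomes 2**0 = 1.0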

Comparing two pandas series for floating point near-equality?

I can compare two Pandas series for exact equality using pandas.Series.equals. Is there a corresponding function or parameter that will check if the elements are equal to some ε of precision?
You can use numpy.allclose:
numpy.allclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False)
Returns True if two arrays are element-wise equal within a tolerance.
The tolerance values are positive, typically very small numbers. The
relative difference (rtol * abs(b)) and the absolute difference atol
are added together to compare against the absolute difference between
a and b.
numpy works well with pandas.Series objects, so if you have two of them - s1 and s2, you can simply do:
np.allclose(s1, s2, atol=...)
Where atol is your tolerance value.
NumPy works well with pandas Series. However, one has to be careful about the order of the indices (or the columns and index for a pandas DataFrame).
For example
series_1 = pd.Series(data=[0,1], index=['a','b'])
series_2 = pd.Series(data=[1,0], index=['b','a'])
np.allclose(series_1,series_2)
will return False
A workaround is to use the index of one series to reorder the other:
np.allclose(series_1, series_2.loc[series_1.index])
If you want to avoid numpy, there is another way: use assert_series_equal.
import pandas as pd
s1 = pd.Series([1.333333, 1.666666])
s2 = pd.Series([1.333, 1.666])
from pandas.testing import assert_series_equal
assert_series_equal(s1,s2)
raises an AssertionError. So use the check_less_precise flag:
assert_series_equal(s1, s2, check_less_precise=True)  # no AssertionError
This doesn't raise an AssertionError, as check_less_precise only compares 3 digits after the decimal point.
See the docs here
Using asserts isn't ideal, but if you want to avoid numpy, this is one way.
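For what it's worth, on newer pandas versions (1.1 and later) check_less_precise is deprecated in favour of explicit tolerances; a sketch of the equivalent call:
from pandas.testing import assert_series_equal
import pandas as pd
s1 = pd.Series([1.333333, 1.666666])
s2 = pd.Series([1.333, 1.666])
# an absolute tolerance of 1e-3 accepts the differences in the 4th decimal place
assert_series_equal(s1, s2, atol=1e-3)   # no AssertionError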
Note: I'm posting this mostly because I came to this thread via a Google search of something similar, and it seemed too long for a comment. It's not necessarily the best solution, nor strictly "ε of precision"-based, but it's an alternative using scaling and rounding if you want to compare vectors (i.e. rows) rather than scalars, for a DataFrame rather than a Series, without looping through explicitly:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
Xcomb = pd.concat((X, X2), axis=0, ignore_index=True)
# scale
scaler = MinMaxScaler()
scaler.fit(Xcomb)
Xscl = scaler.transform(Xcomb)
# round
df_scl = pd.DataFrame(np.round(Xscl, decimals=8), columns=X.columns)
# post-processing
n_uniq = df_scl.drop_duplicates().shape[0]
n_dup = X.shape[0] + X2.shape[0] - n_uniq
print(f"Number of shared rows: {n_dup}")

Shifting all rows in dask dataframe

In Pandas, there is a method DataFrame.shift(n) which shifts the contents of an array by n rows, relative to the index, similarly to np.roll(a, n). I can't seem to find a way to get a similar behaviour working with Dask. I realise things like row-shifts may be difficult to manage with Dask's chunked system, but I don't know of a better way to compare each row with the subsequent one.
What I'd like to be able to do is this:
import numpy as np
import pandas as pd
import dask.dataframe as dd
with pd.HDFStore(path) as store:
    data = dd.read_hdf(store, 'sim')[col1]
    shifted = data.shift(1)
    idx = data.apply(np.sign) != shifted.apply(np.sign)
in order to create a boolean series indicating the locations of sign changes in the data. (I am aware that method would also catch changes from a signed value to zero)
I would then use the boolean series to index a different Dask dataframe for plotting.
Rolling functions
Currently dask.dataframe does not implement the shift operation. It could though if you raise an issue. In principle this is not so dissimilar from rolling operations that dask.dataframe does support, like rolling_mean, rolling_sum, etc..
Actually, if you were to create a Pandas function that adheres to the same API as these pandas.rolling_foo functions then you can use the dask.dataframe.rolling.wrap_rolling function to turn your pandas style rolling function into a dask.dataframe rolling function.
dask.dataframe.rolling_sum = wrap_rolling(pandas.rolling_sum)
The following code might help to shift down the series.
s = dd_df['column'].rolling(window=2).sum() - dd_df['column']
Edit (03/09/2019):
When you are rolling and finding the sum, for a particular row,
result[i] = row[i-1] + row[i]
Then by subtracting the old value of the column from the result, you are doing the following operation:
final_row[i] = result[i] - row[i]
Which equals:
final_row[i] = row[i-1] + row[i] - row[i]
Which ultimately results in the whole column getting shifted down once.
Tip:
If you want to shift it down multiple rows, you should actually execute the whole operation again that many times with the same window.
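A minimal sketch of the trick with made-up numbers (assuming a dask version where rolling is available):
import pandas as pd
import dask.dataframe as dd
pdf = pd.DataFrame({'column': [3.0, -1.0, 4.0, -1.0, 5.0]})
ddf = dd.from_pandas(pdf, npartitions=2)
# rolling(2).sum() gives row[i-1] + row[i]; subtracting row[i] leaves row[i-1],
# i.e. the column shifted down by one (the first row becomes NaN)
shifted = ddf['column'].rolling(window=2).sum() - ddf['column']
print(shifted.compute().tolist())   # [nan, 3.0, -1.0, 4.0, -1.0]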

Pandas how to apply multiple functions to dataframe

Is there a way to apply a list of functions to each column in a DataFrame like the DataFrameGroupBy.agg function does? I found an ugly way to do it like this:
df=pd.DataFrame(dict(one=np.random.uniform(0,10,100), two=np.random.uniform(0,10,100)))
df.groupby(np.ones(len(df))).agg(['mean','std'])
        one                 two
       mean       std      mean       std
1  4.802849  2.729528  5.487576  2.890371
For Pandas 0.20.0 or newer, use df.agg (thanks to ayhan for pointing this out):
In [11]: df.agg(['mean', 'std'])
Out[11]:
           one       two
mean  5.147471  4.964100
std   2.971106  2.753578
For older versions, you could use
In [61]: df.groupby(lambda idx: 0).agg(['mean','std'])
Out[61]:
        one                two
       mean       std     mean       std
0  5.147471  2.971106   4.9641  2.753578
Another way would be:
In [68]: pd.DataFrame({col: [getattr(df[col], func)() for func in ('mean', 'std')] for col in df}, index=('mean', 'std'))
Out[68]:
           one       two
mean  5.147471  4.964100
std   2.971106  2.753578
In the general case where you have arbitrary functions and column names, you could do this:
df.apply(lambda r: pd.Series({'mean': r.mean(), 'std': r.std()})).transpose()
         mean       std
one  5.366303  2.612738
two  4.858691  2.986567
I tried to apply three functions to a column and it works:
import re
# remove newline characters
rem_newline = lambda x: re.sub('\n', ' ', x).strip()
# lowercase and strip spaces
lower_strip = lambda x: x.lower().strip()
df = df['users_name'].apply(lower_strip).apply(rem_newline).str.split('(', n=1, expand=True)
I am using pandas to analyze Chilean legislation drafts. In my dataframe, the list of authors is stored as a string. The answer above did not work for me (using pandas 0.20.3), so I used my own logic and came up with this:
df.authors.apply(eval).apply(len).sum()
Concatenated applies! A pipeline!! The first apply transforms
"['Barros Montero: Ramón', 'Bellolio Avaria: Jaime', 'Gahona Salazar: Sergio']"
into the corresponding list; the second apply counts the number of lawmakers involved in the project. I want the size of every pair (lawmaker, project number), so I can presize an array in which I will study which parties work on what.
Interestingly, this works! Even more interestingly, that last call fails if one gets too ambitious and does this instead:
df.authors.apply(eval).apply(len).apply(sum)
with an error:
TypeError: 'int' object is not iterable
coming from deep within /site-packages/pandas/core/series.py in apply. The reason is that .apply(len) produces a Series of ints, and .apply(sum) then calls sum() on each int individually, which fails because an integer isn't iterable; the Series method .sum() aggregates, whereas .apply(sum) maps over the elements.
