Run function over dataframe with columns of differing length after dropna() - python

I am trying to apply the following function over each column in a dataframe:
from numpy import log, polyfit, sqrt, std, subtract

def hurst_lag(x):
    minlag = 200
    maxlag = 300
    lags = range(minlag, maxlag)
    x = x.dropna()  # drop this column's missing values once, up front
    tau = [sqrt(std(subtract(x[lag:], x[:-lag]))) for lag in lags]
    m = polyfit(log(lags), log(tau), 1)
    return m[0] * 2
The function only works on non-NA values. In my dataframe, the lengths of my columns differ after applying dropna(), e.g.
df = pd.DataFrame({
    'colA': [None, None, 1, 2],
    'colB': [None, 2, 6, 4],
    'colC': [None, None, 2, 8],
    'colD': [None, 2.0, 3.0, 4.0],
})
Any ideas how to run the function over each column individually, excluding the NA values for that specific column? Many thanks

Use apply to run it over the DataFrame; apply passes each column to hurst_lag as its own Series, so the dropna() inside only drops that column's NaNs:
df = df.apply(hurst_lag)
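If you prefer something more explicit, a per-column comprehension is equivalent: each df[col] is a Series, so the dropna() inside hurst_lag removes only that column's missing values. A minimal sketch using the names defined above (result is just an illustrative name):

import pandas as pd

# run hurst_lag column by column; NaNs are dropped per column inside the function
result = pd.Series({col: hurst_lag(df[col]) for col in df.columns})
print(result)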


How to do point biserial correlation for multiple columns in one iteration

I am trying to calculate a point biserial correlation for a set of columns in my dataset. I can do it for an individual variable, but when I try to calculate it for all the columns in one iteration I get an error.
Below is the code:
from scipy import stats

df = pd.DataFrame({'A': [1, 0, 1, 0, 1], 'B': [6, 7, 8, 9, 10], 'C': [9, 4, 6, 9, 10], 'D': [8, 9, 5, 7, 10]})
corr_list = {}
y = df['A'].astype(float)
for column in df:
    x = df[['B', 'C', 'D']].astype(float)
    corr = stats.pointbiserialr(x, y)
    corr_list[['B', 'C', 'D']] = corr
print(corr_list)
TypeError: No loop matching the specified signature and casting was found for ufunc add
x must be a column, not a DataFrame; if you pass the column instead of the DataFrame, it will work. You can try this:
from scipy import stats

df = pd.DataFrame({'A': [1, 0, 1, 0, 1], 'B': [6, 7, 8, 9, 10], 'C': [9, 4, 6, 9, 10], 'D': [8, 9, 5, 7, 10]})
print(df)

corr_list = []
y = df['A'].astype(float)
for column in ['B', 'C', 'D']:  # skip the target column 'A' itself
    x = df[column]
    corr = stats.pointbiserialr(list(x), list(y))
    corr_list.append(corr[0])
print(corr_list)
By the way, you can use print(df.corr()), which gives you the (Pearson) correlation matrix of the DataFrame.
You can use the pd.DataFrame.corrwith() function:
df[['B', 'C', 'D']].corrwith(df['A'].astype('float'), method=stats.pointbiserialr)
The output lists each column with its correlation (row 0) and p-value (row 1) against the target DataFrame or Series:
              B         C         D
0  4.547937e-18  0.400066 -0.094916
1  1.000000e+00  0.504554  0.879331
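If you would rather see labels than row numbers 0 and 1, you can rename the index of that result. A small sketch, assuming the corrwith call above returns the two-row frame shown (res is just an illustrative name):

# label row 0 as the correlation and row 1 as the p-value
res = df[['B', 'C', 'D']].corrwith(df['A'].astype(float), method=stats.pointbiserialr)
res.index = ['correlation', 'p-value']
print(res)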

pandas largest value per group with multi columns / why does it only work when flattening?

For a pandas dataframe of:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 1],
    'anomaly_score': [5, 10, 8, 100],
    'match_level_0': [np.nan, 1, 1, 1],
    'match_level_1': [np.nan, np.nan, 1, 1],
    'match_level_2': [np.nan, 1, 1, 1],
})
display(df)
df = df.groupby(['id', 'match_level_0']).agg(['mean', 'sum'])
I want to select the largest rows per group.
df.columns = ['__'.join(col).strip() for col in df.columns.values]
df.groupby(['id'])['anomaly_score__mean'].nlargest(2)
This works, but it requires flattening the MultiIndex columns.
Instead I want to use directly:
df.groupby(['id'])[('anomaly_score', 'mean')].nlargest(2)
But this fails with the key not being found.
Interestingly, it works just fine when not grouping:
df[('anomaly_score', 'mean')].nlargest(2)
Grouping the Series by the first level of the MultiIndex works for me, though it does seem like a bug that it does not work the way you wrote it:
print (df[('anomaly_score', 'mean')].groupby(level=0).nlargest(2))
id  match_level_0
1   1.0              55
2   1.0               8
Name: (anomaly_score, mean), dtype: int64
print (df[('anomaly_score', 'mean')].groupby(level='id').nlargest(2))
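An alternative that avoids selecting the column tuple inside the groupby is to sort by the ('anomaly_score', 'mean') column first and then take the top rows per id. A sketch based on the aggregated df above (top2 is just an illustrative name):

# sort descending by the aggregated mean, then keep the 2 largest rows per id
top2 = (df.sort_values(('anomaly_score', 'mean'), ascending=False)
          .groupby(level='id')
          .head(2))
print(top2)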

creating a column based on missing value in pandas

I have a DataFrame for which I want to create a column that represents the missing-value pattern of each row. For example, for the CSV file,
A,B,C,D
1,NaN,NaN,NaN
NaN,2,3,NaN
3,2,2,3
3,2,NaN,3
3,2,1,NaN
I want to create a column E whose value is assigned in the following way:
If A, B, C, D are all missing, E = 4;
if A, B, C, D are all present, E = 0;
if only A and B are missing, E = 1; and so on. The encoding of E need not be exactly as I described, just some indication of the pattern. How can I approach this problem in pandas?
Use isnull() in combination with sum(axis=1).
Example:
import pandas as pd

df = pd.DataFrame({'A': [1, None, 3, 3, 3],
                   'B': [None, None, 1, 1, 1]})
df['C'] = df.isnull().sum(axis=1)
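If you need to encode the exact pattern of missing columns rather than just the count, one option is to join the per-row missing indicators into a string and factorize it. A sketch, assuming the df from the example above; the column name 'E' follows the question:

# '01' means A present and B missing, etc.; factorize maps each distinct pattern to an integer code
pattern = df[['A', 'B']].isnull().astype(int).astype(str).apply(''.join, axis=1)
df['E'] = pd.factorize(pattern)[0]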

how to ignore index comparison for pandas assert frame equal

I am trying to compare the two DataFrames below with check_index_type set to False. According to the documentation, when it is set to False it shouldn't "check the Index class, dtype and inferred_type are identical". Did I misunderstand the documentation? How can I compare while ignoring the index and have the test below pass?
I know I can reset the index but prefer not to.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.testing.assert_frame_equal.html
from pandas.util.testing import assert_frame_equal
import pandas as pd
d1 = pd.DataFrame([[1, 2], [10, 20]], index=[0, 2])
d2 = pd.DataFrame([[1, 2], [10, 20]], index=[0, 1])
assert_frame_equal(d1, d2, check_index_type=False)
AssertionError: DataFrame.index are different
DataFrame.index values are different (50.0 %)
[left]: Int64Index([0, 2], dtype='int64')
[right]: Int64Index([0, 1], dtype='int64')
If you really don't care about the index being equal, you can drop the index as follows:
assert_frame_equal(d1.reset_index(drop=True), d2.reset_index(drop=True))
The index is part of the DataFrame; if the indexes are different we should say the DataFrames are different, even when their values are the same. So, if you only want to check the values, use array_equal from numpy:
import numpy as np

d1 = pd.DataFrame([[1, 2], [10, 20]], index=[0, 2])
d2 = pd.DataFrame([[1, 2], [10, 20]], index=[0, 1])
np.array_equal(d1.values, d2.values)
Out[759]: True
For more info, see the source of assert_frame_equal on GitHub.
For those who came to this question because they're interested in using pd.testing.assert_series_equal (which operates on pd.Series): pandas 1.1.0 introduced a check_index argument:
import pandas as pd
s1 = pd.Series({"a": 1})
s2 = pd.Series({"b": 1})
pd.testing.assert_series_equal(s1, s2, check_index=False)
This argument does not yet exist for pd.testing.assert_frame_equal.

Speeding up Pandas apply function

For a relatively big Pandas DataFrame (a few hundred thousand rows), I'd like to create a Series that is the result of an apply function. The problem is that the function is not very fast, and I was hoping it could be sped up somehow.
import math
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'value-1': [1, 2, 3, 4, 5],
    'value-2': [0.1, 0.2, 0.3, 0.4, 0.5],
    'value-3': somenumbers...,
    'value-4': more numbers...,
    'choice-index': [1, 1, np.nan, 2, 1]
})

def func(row):
    i = row['choice-index']
    return np.nan if math.isnan(i) else row['value-%d' % i]

df['value'] = df.apply(func, axis=1, reduce=True)
# expected value = [1, 2, np.nan, 0.4, 5]
Any suggestions are welcome.
Update
A very small speedup (~1.1x) can be achieved by pre-caching the selected column names. func would change to:
cached_columns = [None, 'value-1', 'value-2', 'value-3', 'value-4']

def func(row):
    i = row['choice-index']
    return np.nan if math.isnan(i) else row[cached_columns[int(i)]]
But I was hoping for greater speedups...
I think I got a good solution (speedup ~150x).
The trick is not to use apply, but to do smart selections.
choice_indices = [1, 2, 3, 4]
for idx in choice_indices:
    mask = df['choice-index'] == idx
    result_column = 'value-%d' % idx
    df.loc[mask, 'value'] = df.loc[mask, result_column]
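If even the loop over choice_indices matters, a fully vectorized NumPy version is possible by fancy-indexing the value columns. A sketch under the assumption that the columns are laid out as in the example above (values, idx, valid and out are just illustrative names):

import numpy as np

# pick value-i for each row in one shot; rows with a NaN choice stay NaN
values = df[['value-1', 'value-2', 'value-3', 'value-4']].to_numpy()
idx = df['choice-index'].to_numpy()
valid = ~np.isnan(idx)
out = np.full(len(df), np.nan)
out[valid] = values[np.flatnonzero(valid), idx[valid].astype(int) - 1]
df['value'] = out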
