Group by weighted mean, allowing for zero value weights - python

I want to take the weighted mean of a column in a group-by statement, like this
import pandas as pd
import numpy as np
df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'value': [0.4, 0.3, 0.2, 0.4, 0.3, 0.2],
                   'weight': [2, 2, 4, 3, 1, 2]})
df_grouped = df.groupby('group')[['value', 'weight']].apply(lambda x: sum(x['value']*x['weight'])/sum(x['weight']))
df_grouped
Out[17]:
group
A 0.275000
B 0.316667
dtype: float64
So far all is well. However, in some cases the weights sum to zero, for instance
df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'value': [0.4, 0.3, 0.2, 0.4, 0.3, 0.2],
                   'weight': [1, 2, 3, 0, 0, 0]})
In this case I want to take the simple mean. The above expression obviously fails because of a division by zero.
The method I currently use is to replace the weights with one wherever the group's weights sum to zero
df_temp = df.groupby('group')['weight'].transform('sum').reset_index()
df['new_weight'] = np.where(df_temp['weight']==0, 1, df['weight'])
df_grouped = df.groupby('group')[['value', 'new_weight']].apply(lambda x: sum(x['value']*x['new_weight'])/sum(x['new_weight']))
This is an OK solution, but can this be achieved with a one-liner? Some special function, for instance?

If you need it done in a one-liner, you can check whether the group's weight sum is zero using a ternary (conditional) expression inside the lambda, as follows. If the sum is zero, the regular mean is used instead.
df.groupby('group')[['value', 'weight']].apply(lambda x:sum(x['value'])/len(x['weight']) if (sum(x['weight'])) == 0 else sum(x['value']*x['weight'])/sum(x['weight']))
group
A 0.266667
B 0.300000
dtype: float64
The regular mean calculation in the above snippet can be shortened further as follows.
df.groupby('group')[['value', 'weight']].apply(lambda x:x['value'].mean() if (sum(x['weight'])) == 0 else sum(x['value']*x['weight'])/sum(x['weight']))
However, I think this kind of one-liner reduces the readability of the code.
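As a side note, np.average computes a weighted mean directly, so an equivalent one-liner (just a sketch of the same fallback idea, not a dedicated built-in for it) could be:
df.groupby('group').apply(lambda x: np.average(x['value'], weights=x['weight']) if x['weight'].sum() != 0 else x['value'].mean())
np.average raises a ZeroDivisionError when the weights sum to zero, which is why the explicit fallback to the plain mean is still needed.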

Related

Run function over dataframe with columns of differing length after dropna()

I am trying to apply the following function over each column in a dataframe:
from numpy import sqrt, std, subtract, log, polyfit

def hurst_lag(x):
    minlag = 200
    maxlag = 300
    lags = range(minlag, maxlag)
    tau = [sqrt(std(subtract(x.dropna()[lag:], x.dropna()[:-lag]))) for lag in lags]
    m = polyfit(log(lags), log(tau), 1)
    return m[0]*2
The function only works on non-NA values. In my dataframe, the lengths of my columns differ after applying dropna(), e.g.
df = pd.DataFrame({
    'colA': [None, None, 1, 2],
    'colB': [None, 2, 6, 4],
    'colC': [None, None, 2, 8],
    'colD': [None, 2.0, 3.0, 4.0],
})
Any ideas how to run the function over each column individually, excluding the NA values for that specific column? Many thanks
Use apply to run it on each column of the DataFrame:
df = df.apply(hurst_lag)
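Since hurst_lag already calls dropna() on its argument, apply handles each column's NaNs independently. If you'd rather drop the missing values once per column up front instead of inside the list comprehension, a small variation (just a sketch, same result) is:
result = df.apply(lambda col: hurst_lag(col.dropna()))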

Drop duplicate rows, but keep the union of their data

I have a data frame like this:
pd.DataFrame([
    [1, None, 'a'],
    [1, 3.3, None],
    [2, 1.7, 'c']
], columns=['unique_id', 'x', 'target'])
I want to drop one of the rows where unique_id is 1, but take the union of their values. That is, I want to produce this:
pd.DataFrame([
    [1, 3.3, 'a'],
    [2, 1.7, 'c']
], columns=['unique_id', 'x', 'target'])
Can this be done efficiently in Pandas?
Assume this data frame has between 10k and 100k rows, with maybe 10% being duplicates I want to eliminate. There will only be 2 or 3 duplicates of each unique_id.
Edit: when both rows have disagreeing entries, just taking the first one is fine in my case. But I'm open to solutions where, e.g., both values are collected in a list.
This gives the result for your example. It takes the first non-NaN value for each column in each group.
df.groupby("unique_id", as_index=False).first()
Use groupby and first:
df.groupby('unique_id').first()
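If you instead want the list-collecting behaviour mentioned in the question's edit, one possible sketch (rather than taking the first value) is to aggregate every column into the list of its non-null entries:
df.groupby('unique_id', as_index=False).agg(lambda s: s.dropna().tolist())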

How can I properly use a Pandas Dataframe with a multiindex that includes Intervals?

I'm trying to slice into a DataFrame that has a MultiIndex composed of an IntervalIndex and a regular Index. Example code:
import pandas as pd
from pandas import Interval as ntv

df = pd.DataFrame.from_records([
    {'id': 1, 'var1': 0.1, 'ntv': ntv(0, 10), 'E': 1},
    {'id': 2, 'var1': 0.5, 'ntv': ntv(0, 12), 'E': 0}
], index=('ntv', 'id'))
Looks like this:
            E  var1
ntv     id
(0, 10] 1   1   0.1
(0, 12] 2   0   0.5
What I would like to do is slice into the DataFrame at a specific value and return all rows whose interval contains that value. Ex:
df.loc[4]
should return (trivially)
    E  var1
id
1   1   0.1
2   0   0.5
The problem is I keep getting a TypeError about the index, and the docs show a similar operation (but on a single-level index) that does produce what I'm looking for.
TypeError: only integer scalar arrays can be converted to a scalar index
I've tried many things, nothing seems to work normally. I could include the id column inside the dataframe, but I'd rather keep my index unique, and I would constantly be calling set_index('id').
I feel like either a) I'm missing something about MultiIndexes or b) there is a bug / ambiguity with using an IntervalIndex in a MultiIndex.
Since we are dealing with intervals, there is a method called get_loc that finds the rows whose interval contains the value. To show what I mean:
from pandas import Interval as ntv
df = pd.DataFrame.from_records([
    {'id': 1, 'var1': 0.1, 'ntv': ntv(0, 10), 'E': 1},
    {'id': 2, 'var1': 0.5, 'ntv': ntv(0, 12), 'E': 0}
], index=('ntv', 'id'))
df.iloc[(df.index.get_level_values(0).get_loc(4))]
            E  var1
ntv     id
(0, 10] 1   1   0.1
(0, 12] 2   0   0.5
df.iloc[(df.index.get_level_values(0).get_loc(11))]
            E  var1
ntv     id
(0, 12] 2   0   0.5
This also works if you have multiple rows of data for one interval, i.e.
df = pd.DataFrame.from_records([
    {'id': 1, 'var1': 0.1, 'ntv': ntv(0, 10), 'E': 1},
    {'id': 3, 'var1': 0.1, 'ntv': ntv(0, 10), 'E': 1},
    {'id': 2, 'var1': 0.5, 'ntv': ntv(0, 12), 'E': 0}
], index=('ntv', 'id'))
df.iloc[(df.index.get_level_values(0).get_loc(4))]
            E  var1
ntv     id
(0, 10] 1   1   0.1
        3   1   0.1
(0, 12] 2   0   0.5
If you time this against a list comprehension, this approach is much faster for large DataFrames, i.e.
ndf = pd.concat([df]*10000)
%%timeit
ndf.iloc[ndf.index.get_level_values(0).get_loc(4)]
10 loops, best of 3: 32.8 ms per loop
%%timeit
intervals = ndf.index.get_level_values(0)
mask = [4 in i for i in intervals]
ndf.loc[mask]
1 loop, best of 3: 193 ms per loop
So I did a bit of digging to try and understand the problem. If I run your code, pandas ends up trying to index into the index labels with
"slice(array([0, 1], dtype=int64), array([1, 2], dtype=int64), None)"
(when I say index_type I mean the Pandas datatype)
An index_type's label is a list of indices that map to the index_type's levels array. Here is an example from the documentation.
>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
>>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
MultiIndex(levels=[[1, 2], ['blue', 'red']],
           labels=[[0, 0, 1, 1], [1, 0, 1, 0]],
           names=['number', 'color'])
Notice how the second list in labels connects to the order of levels: levels[1][1] is equal to 'red', and levels[1][0] is equal to 'blue'.
Anyhow, this is all to say that I don't believe IntervalIndex is meant to be used in an overlapping fashion. If you look at the original proposal for it,
https://github.com/pandas-dev/pandas/issues/7640
"A IntervalIndex would be a monotonic and non-overlapping one-dimensional array of intervals."
My suggestion is to move the interval into a column. You could probably write up a simple function with numba to test whether a number is in each interval. Do you mind explaining how you're benefiting from the interval?
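For what it's worth, here is a minimal sketch of the "interval as a column" idea (without numba), reusing the df from the question:
df2 = df.reset_index()                     # 'ntv' and 'id' become ordinary columns
df2[df2['ntv'].apply(lambda iv: 4 in iv)]  # keep rows whose interval contains 4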
Piggybacking off of @Dark's solution: Index.get_loc just calls Index.get_indexer under the hood, so it might be more efficient to call the underlying method when you don't need the additional parameters and red tape.
idx = df.index.get_level_values(0)
df.iloc[idx.get_indexer([4])]
My originally proposed solution:
intervals = df.index.get_level_values(0)
mask = [4 in i for i in intervals]
df.loc[mask]
Regardless, it's certainly strange that these return two different results, but it does look like it has to do with the index being unique, monotonic, or neither of the two:
df.reset_index(level=1, drop=True).loc[4] # good
df.loc[4] # TypeError
This is not really a solution and I don't fully understand it, but I think it may have to do with your interval index not being monotonic (in that you have overlapping intervals). I guess that could in a sense be considered monotonic, so perhaps alternatively you could say the overlap means the index is not unique?
Anyway, check out this github issue:
ENH: Implement MultiIndex.is_monotonic_decreasing #17455
And here's an example with your data, but changing the intervals to be non-overlapping (0,6) & (7,12):
df = pd.DataFrame.from_records([
    {'id': 1, 'var1': 0.1, 'ntv': ntv(0, 6), 'E': 1},
    {'id': 2, 'var1': 0.5, 'ntv': ntv(7, 12), 'E': 0}
], index=('ntv', 'id'))
Now, loc works OK:
df.loc[4]
    E  var1
id
1   1   0.1
def check_value(num):
    return df[[num in i for i in map(lambda x: x[0], df.index)]]
a = check_value(4)
a
>>
            E  var1
ntv     id
(0, 10] 1   1   0.1
(0, 12] 2   0   0.5
If you want to drop the index level, you can add
a.index = a.index.droplevel(0)

Speeding up Pandas apply function

For a relatively big Pandas DataFrame (a few 100k rows), I'd like to create a series that is a result of an apply function. The problem is that the function is not very fast and I was hoping that it can be sped up somehow.
import math
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'value-1': [1, 2, 3, 4, 5],
    'value-2': [0.1, 0.2, 0.3, 0.4, 0.5],
    'value-3': somenumbers...,
    'value-4': more numbers...,
    'choice-index': [1, 1, np.nan, 2, 1]
})
def func(row):
    i = row['choice-index']
    return np.nan if math.isnan(i) else row['value-%d' % i]

df['value'] = df.apply(func, axis=1, reduce=True)
# expected value = [1, 2, np.nan, 0.4, 5]
Any suggestions are welcome.
Update
A very small speedup (~1.1x) can be achieved by pre-caching the selected columns. func would change to:
cached_columns = [None, 'value-1', 'value-2', 'value-3', 'value-4']

def func(row):
    i = row['choice-index']
    # int() because 'choice-index' is a float column (it holds NaN)
    return np.nan if math.isnan(i) else row[cached_columns[int(i)]]
But I was hoping for greater speedups...
I think I got a good solution (speedup ~150).
The trick is not to use apply, but to do smart selections.
choice_indices = [1, 2, 3, 4]
for idx in choice_indices:
    mask = df['choice-index'] == idx
    result_column = 'value-%d' % (idx)
    df.loc[mask, 'value'] = df.loc[mask, result_column]
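If the loop over choice_indices ever becomes a bottleneck, another option is a fully vectorized lookup with NumPy fancy indexing. This is only a sketch; it assumes the columns are literally named value-1 through value-4 and that choice-index contains only 1-4 or NaN:
value_cols = ['value-%d' % i for i in range(1, 5)]
vals = df[value_cols].to_numpy(dtype=float)
idx = df['choice-index'].to_numpy(dtype=float)
valid = ~np.isnan(idx)

out = np.full(len(df), np.nan)
# for each valid row, pick the value in the column chosen by choice-index
out[valid] = vals[np.flatnonzero(valid), idx[valid].astype(int) - 1]
df['value'] = out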

Get the index of median value in array containing Nans

How can I get the index of the median value for an array which contains NaNs?
For example, I have the array of values [NaN, 2, 5, NaN, 4, NaN, 3, 1] with a corresponding array of errors on those values [np.nan, 0.1, 0.2, np.nan, 0.1, np.nan, 0.4, 0.3]. Then the median is 3, while the error is 0.4.
Is there a simple way to do this?
EDIT: I edited the error array to imply a more realistic situation. And yes, I am using NumPy.
It's not really clear how you intend to meaningfully extract the error from the median, but if you do happen to have an array such that the median is one of its entries, and the corresponding error array is defined at the corresponding index, and there aren't other entries with the same value as the median, and probably several other disclaimers, then you can do the following:
a = np.array([np.nan, 2, 5, np.nan, 4, np.nan, 3, 1])
aerr = np.array([np.nan, 0.1, 0.2, np.nan, 0.1, np.nan, 0.4, 0.3])
# median, ignoring NaNs
amedian = np.median(a[np.isfinite(a)])
# find the index of the closest value to the median in a
idx = np.nanargmin(np.abs(a-amedian))
# this is the corresponding "error"
aerr[idx]
EDIT: as #DSM points out, if you have NumPy 1.9 or above, you can simplify the calculation of amedian as amedian = np.nanmedian(a).
numpy has everything you need:
values = np.array([np.nan, 2, 5, np.nan, 4, np.nan, 3, 1])
errors = np.array([np.nan, 0.1, 0.2, np.nan, 0.1, np.nan, 0.4, 0.3])
# filter
filtered = values[~np.isnan(values)]
# find median
median = np.median(filtered)
# find indexes
indexes = np.where(values == median)[0]
# find errors
errors[indexes] # array([ 0.4])
Let's say you have your list named "a"; then you can use this code to build a masked array that hides the NaNs and take the median with np.ma.median():
import numpy as np
a = [np.nan, 2, 5, np.nan, 4, np.nan, 3, 1]
am = np.ma.masked_array(a, [np.isnan(x) for x in a])
np.ma.median(am)
You can do the same for the errors as well.
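For instance, masking the errors with the same NaN mask and picking the error at the position where the value equals the median (a sketch that assumes the median really is one of the entries):
a = np.array([np.nan, 2, 5, np.nan, 4, np.nan, 3, 1])
errors = np.array([np.nan, 0.1, 0.2, np.nan, 0.1, np.nan, 0.4, 0.3])
am = np.ma.masked_array(a, np.isnan(a))
med = np.ma.median(am)            # 3.0
errors[np.where(a == med)[0]]     # array([ 0.4])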
