I'd like to calculate the percentile rank of each value based on which Group it belongs to. I have written the following code and was able to calculate, say, the z-score, since that takes only one input. What should I do with a function that takes two arguments? Thanks.
import pandas as pd
import scipy.stats as stats
import numpy as np
funZScore = lambda x: (x - x.mean()) / x.std()
funPercentile = lambda x, y: stats.percentileofscore(x[~np.isnan(x)], y)
A = pd.DataFrame({'Group': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
                  'Value': [4, 7, None, 6, 2, 8, 1]})
# Compute the Z-score by group
A['Z'] = A.groupby('Group')['Value'].apply(funZScore)
print(A)
Group Value Z
0 A 4.0 -1.091089
1 A 7.0 0.872872
2 A NaN NaN
3 A 6.0 0.218218
4 B 2.0 -0.440225
5 B 8.0 1.144586
6 B 1.0 -0.704361
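As an aside (my addition, not part of the original question): transform is the idiomatic way to broadcast a group-wise computation back onto the original rows, and gives the same Z column here:
A['Z'] = A.groupby('Group')['Value'].transform(funZScore)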
# compute the percentile rank by group
# how to put two arguments into groupby apply?
# I hope to get something like below
Group Value Z P
0 A 4.0 -1.091089 33.33
1 A 7.0 0.872872 100
2 A NaN NaN NaN
3 A 6.0 0.218218 66.67
4 B 2.0 -0.440225 66.67
5 B 8.0 1.144586 100
6 B 1.0 -0.704361 33.33
I think you need:
d = A.groupby('Group')['Value'].apply(list).to_dict()
print (d)
{'A': [4.0, 7.0, nan, 6.0], 'B': [2.0, 8.0, 1.0]}
A['P'] = A.apply(lambda x: funPercentile(np.array(d[x['Group']]), x['Value']), axis=1)
print (A)
Group Value Z P
0 A 4.0 -1.091089 33.333333
1 A 7.0 0.872872 100.000000
2 A NaN NaN NaN
3 A 6.0 0.218218 66.666667
4 B 2.0 -0.440225 66.666667
5 B 8.0 1.144586 100.000000
6 B 1.0 -0.704361 33.333333
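A side note (my addition, not from the original answer): pandas can compute the same percentile rank directly, because percentileofscore with the default kind='rank' agrees with the group-wise average rank; NaNs are kept as NaN automatically:
A['P'] = A.groupby('Group')['Value'].rank(pct=True).mul(100)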
import pandas
import numpy
names = ['a', 'b', 'c']
df = pandas.DataFrame([1, 2, 3, numpy.nan, numpy.nan, 4, 5, 6, numpy.nan, numpy.nan, 7, 8, 9])
For the above, how would the condition change? Can someone please explain how I can get this:
df1 =
0
0 1.0
1 2.0
2 3.0
df2 =
0
4 4.0
5 5.0
6 6.0
df3 =
0
8 7.0
9 8.0
10 9.0
You can generate a temporary column, remove NaNs, and group by the temporary column:
cond = df.assign(cond=df.isna().cumsum()).dropna()['cond']
dataframes = {f'df{idx+1}': d
              for idx, (_, d) in enumerate(df.dropna().groupby(cond))}
Output:
>>> dataframes
{'df1': 0
0 1.0
1 2.0
2 3.0,
'df2': 0
5 4.0
6 5.0
7 6.0,
'df3': 0
10 7.0
11 8.0
12 9.0}
>>> dataframes['df1']
0
0 1.0
1 2.0
2 3.0
>>> dataframes['df2']
0
5 4.0
6 5.0
7 6.0
>>> dataframes['df3']
0
10 7.0
11 8.0
12 9.0
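To see why this works (my sketch, not part of the original answer): the cumulative sum of the NaN mask increments at every NaN, so each consecutive run of non-NaN values shares one label:
cond = df.isna().cumsum()[0]
print(cond.tolist())
# [0, 0, 0, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4]
# after dropna only the labels 0, 2 and 4 survive, giving df1, df2 and df3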
I have a pandas dataframe with two dimensions. I want to calculate the rolling standard deviation along axis 1 while also including datapoints in the rows above and below.
So say I have this df:
data = {'A': [1, 2, 3, 4],
        'B': [5, 6, 7, 8],
        'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want a rectangular window 3 rows high and 2 columns across, moving from left to right. So, for example,
std_df.loc[1, 'C']
would be equal to
np.std([1, 5, 9, 2, 6, 10, 3, 7, 11])
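(That call evaluates to about 3.3665: the mean is 6 and the population variance is 102/9.)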
But I have no idea how to achieve this without very slow iteration.
Looks like what you want is df.shift.
import pandas as pd
import numpy as np
data = {'A': [1,2,3,4], 'B': [5,6,7,8], 'C': [9,10,11,12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Shifting the dataframe you provided by 1 yields the row above
print(df.shift(1))
A B C
0 NaN NaN NaN
1 1.0 5.0 9.0
2 2.0 6.0 10.0
3 3.0 7.0 11.0
Similarly, shifting the dataframe you provided by -1 yields the row below
print(df.shift(-1))
A B C
0 2.0 6.0 10.0
1 3.0 7.0 11.0
2 4.0 8.0 12.0
3 NaN NaN NaN
So the code below should do what you're looking for (add_prefix makes the concatenated column names unique):
above_df = df.shift(1).add_prefix('above_')
below_df = df.shift(-1).add_prefix('below_')
lagged = pd.concat([df, above_df, below_df], axis=1)
lagged['std'] = lagged.apply(np.std, axis=1)
print(lagged)
A B C above_A above_B above_C below_A below_B below_C std
0 1 5 9 NaN NaN NaN 2.0 6.0 10.0 3.304038
1 2 6 10 1.0 5.0 9.0 3.0 7.0 11.0 3.366502
2 3 7 11 2.0 6.0 10.0 4.0 8.0 12.0 3.366502
3 4 8 12 3.0 7.0 11.0 NaN NaN NaN 3.304038
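If performance matters, here's a vectorized sketch (my addition, assuming NumPy >= 1.20 for sliding_window_view; it only covers the interior rows, so the edge rows still need the NaN-tolerant shift approach above):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
arr = df.to_numpy(dtype=float)
windows = sliding_window_view(arr, (3, arr.shape[1]))  # one 3-row, full-width window per interior row
interior_std = windows.reshape(windows.shape[0], -1).std(axis=1)
# interior_std[i] is the population std over rows i..i+2 and all columns,
# i.e. the values for rows 1 through len(df)-2 of the result above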
Given the following pandas DataFrame where some indices are NaN, how do I drop the third and eighth rows, since their index is NaN? Thanks
import pandas as pd
import numpy as np
data = list('abcdefghil')
indices = [0, 1, np.nan, 3, 4, 5, 6, np.nan, 8, 9]
df = pd.DataFrame(data, index=indices, columns=['data'])
You can call dropna on the index:
In[68]:
df.loc[df.index.dropna()]
Out[68]:
data
0.0 a
1.0 b
3.0 d
4.0 e
5.0 f
6.0 g
8.0 i
9.0 l
Note that the presence of NaN makes the index dtype float; to change it to int, cast the type:
In[70]:
df = df.loc[df.index.dropna()]
df.index = df.index.astype(int)
df
Out[70]:
data
0 a
1 b
3 d
4 e
5 f
6 g
8 i
9 l
Calling notnull on the index also works (somewhat undocumented):
In[71]:
df = df.loc[df.index.notnull()]
df.index = df.index.astype(int)
df
Out[71]:
data
0 a
1 b
3 d
4 e
5 f
6 g
8 i
9 l
There is also isna:
In[78]:
df.loc[~df.index.isna()]
Out[78]:
data
0.0 a
1.0 b
3.0 d
4.0 e
5.0 f
6.0 g
8.0 i
9.0 l
and its more readable inverse, notna:
In[79]:
df.loc[df.index.notna()]
Out[79]:
data
0.0 a
1.0 b
3.0 d
4.0 e
5.0 f
6.0 g
8.0 i
9.0 l
As commented by @jpp, you can also use the top-level notnull:
In[80]:
df.loc[pd.notnull(df.index)]
Out[80]:
data
0.0 a
1.0 b
3.0 d
4.0 e
5.0 f
6.0 g
8.0 i
9.0 l
There are also the top-level isna, notna, and isnull, but I'm not going to display those; you can check the docs.
You can use the following:
df = df[~df.index.isnull()]
You might want to reset the index after
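For example (a minimal sketch of my own, combining the filter and the reset):
df = df[~df.index.isnull()].reset_index(drop=True)  # keep rows with a valid index, renumber 0..n-1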
Using np.isnan and inverting the mask:
res = df[~np.isnan(df.index)]
print(res)
data
0.0 a
1.0 b
3.0 d
4.0 e
5.0 f
6.0 g
8.0 i
9.0 l
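One caveat worth noting (my addition): np.isnan only accepts numeric input, so it raises a TypeError on an object-dtype index, while pd.isna handles both:
res = df[~pd.isna(df.index)]  # works for numeric and object indexes alike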
I am looking for a function that achieves the following. It is best shown in an example. Consider:
pd.DataFrame([ [1, 2, 3 ], [4, 5, np.nan ]], columns=['x', 'y1', 'y2'])
which looks like:
x y1 y2
0 1 2 3
1 4 5 NaN
I would like to collapse the y1 and y2 columns, lengthening the DataFrame where necessary, so that the output is:
x y
0 1 2
1 1 3
2 4 5
That is, one row for each combination of x with either y1 or y2. I am looking for a function that does this relatively efficiently, as I have multiple y columns and many rows.
You can use stack to get this done, i.e.:
pd.DataFrame(df.set_index('x').stack().reset_index(level=0).values, columns=['x', 'y'])
x y
0 1.0 2.0
1 1.0 3.0
2 4.0 5.0
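Another option (my sketch, not from the original answers) is melt, which performs the wide-to-long reshape directly; dropping the NaNs and the helper column afterwards gives the same pairs:
out = (df.melt(id_vars='x', value_name='y')  # wide to long: one row per (x, y*) pair
         .dropna(subset=['y'])               # discard missing y values
         .drop(columns='variable')           # drop the y1/y2 label column
         .sort_values('x', kind='stable', ignore_index=True))
The stable sort only restores row order by x; it assumes the order within each x group does not matter.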
Repeat each item in the first column according to the count of non-null values in its row, then build the final dataframe from the remaining non-null values in the other columns. You can use the DataFrame.count() method to count non-null values and numpy.repeat() to repeat an array according to a matching count array.
>>> rest = df.loc[:,'y1':]
>>> pd.DataFrame({'x': np.repeat(df['x'], rest.count(1)).values,
...               'y': rest.values[rest.notna()]})
Demo:
>>> df
x y1 y2 y3 y4
0 1 2.0 3.0 NaN 6.0
1 4 5.0 NaN 9.0 3.0
2 10 NaN NaN NaN NaN
3 9 NaN NaN 6.0 NaN
4 7 6.0 NaN NaN NaN
>>> rest = df.loc[:,'y1':]
>>> pd.DataFrame({'x': np.repeat(df['x'], rest.count(1)).values,
...               'y': rest.values[rest.notna()]})
x y
0 1 2.0
1 1 3.0
2 1 6.0
3 4 5.0
4 4 9.0
5 4 3.0
6 9 6.0
7 7 6.0
Here's one based on NumPy, as you were looking for performance -
def gather_columns(df):
    col_mask = [i.startswith('y') for i in df.columns]  # select the y* columns
    ally_vals = df.iloc[:, col_mask].values             # their values as a 2D array
    y_valid_mask = ~np.isnan(ally_vals)                 # True where a y value is present
    reps = np.count_nonzero(y_valid_mask, axis=1)       # number of valid y values per row
    x_vals = np.repeat(df.x.values, reps)               # repeat each x once per valid y
    y_vals = ally_vals[y_valid_mask]                    # flatten the valid y values row by row
    return pd.DataFrame({'x': x_vals, 'y': y_vals})
Sample run -
In [78]: df #(added more cols for variety)
Out[78]:
x y1 y2 y5 y7
0 1 2 3.0 NaN NaN
1 4 5 NaN 6.0 7.0
In [79]: gather_columns(df)
Out[79]:
x y
0 1 2.0
1 1 3.0
2 4 5.0
3 4 6.0
4 4 7.0
If the y columns always start at the second column and run through to the end, we can simply slice the dataframe and get a further performance boost, like so -
def gather_columns_v2(df):
    ally_vals = df.iloc[:, 1:].values
    y_valid_mask = ~np.isnan(ally_vals)
    reps = np.count_nonzero(y_valid_mask, axis=1)
    x_vals = np.repeat(df.x.values, reps)
    y_vals = ally_vals[y_valid_mask]
    return pd.DataFrame({'x': x_vals, 'y': y_vals})
I have a DataFrame where the 'value' column has missing values. I'd like to fill the missing values with the weighted average within each 'name' group. There was a post on how to fill missing values with the simple average within each group, but not with a weighted average. Thanks a lot!
df = pd.DataFrame({'value': [1, np.nan, 3, 2, 3, 1, 3, np.nan, np.nan],
                   'weight': [3, 1, 1, 2, 1, 2, 2, 1, 1],
                   'name': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})
name value weight
0 A 1.0 3
1 A NaN 1
2 A 3.0 1
3 B 2.0 2
4 B 3.0 1
5 B 1.0 2
6 C 3.0 2
7 C NaN 1
8 C NaN 1
I'd like to fill each NaN with the weighted average of its "name" group, i.e.
name value weight
0 A 1.0 3
1 A 1.5 1
2 A 3.0 1
3 B 2.0 2
4 B 3.0 1
5 B 1.0 2
6 C 3.0 2
7 C 3.0 1
8 C 3.0 1
You can group the data frame by name and use the fillna method to fill the missing values with the weighted average, which can be calculated with np.average and its weights parameter:
df['value'] = (df.groupby('name', group_keys=False)
                 .apply(lambda g: g.value.fillna(np.average(g.dropna().value,
                                                            weights=g.dropna().weight))))
df
#name value weight
#0 A 1.0 3
#1 A 1.5 1
#2 A 3.0 1
#3 B 2.0 2
#4 B 3.0 1
#5 B 1.0 2
#6 C 3.0 2
#7 C 3.0 1
#8 C 3.0 1
To make this less convoluted, define a fillValue function:
import numpy as np
import pandas as pd
def fillValue(g):
    gNotNull = g.dropna()                                        # rows where value is present
    wtAvg = np.average(gNotNull.value, weights=gNotNull.weight)  # group weighted average
    return g.value.fillna(wtAvg)

df['value'] = df.groupby('name', group_keys=False).apply(fillValue)
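An equivalent formulation (my own sketch, assuming each group has at least one non-missing value) computes the per-group weighted averages once and maps them back onto the rows:
wavg = df.groupby('name').apply(
    lambda g: np.average(g['value'].dropna(),
                         weights=g.loc[g['value'].notna(), 'weight']))
df['value'] = df['value'].fillna(df['name'].map(wavg))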