Columns are attributes, rows are observations.
I would like to extract the rows where the sum of any two attributes exceeds a specified value (say 0.7). Then, in two new columns, I would like to list the column headers with the bigger and the smaller contribution to that sum.
I am new to Python, so I am stuck on how to proceed after generating my dataframe.
You can do this:
import pandas as pd
from itertools import combinations

THRESHOLD = 8.0

def valuation_formula(row):
    # collect each qualifying pair of values as [smaller, larger]
    l = [sorted(x) for x in combinations(row, r=2) if sum(x) > THRESHOLD]
    if len(l) == 0:
        row["smaller"], row["larger"] = None, None
    else:
        row["smaller"], row["larger"] = l[0]  # since not specified by OP, we take the first such pair
    return row

contribution_df = df.apply(lambda row: valuation_formula(row), axis=1)
So, if
df = pd.DataFrame({"a" : [1.0, 2.0, 4.0], "b" : [5.0, 6.0, 7.0]})
a b
0 1.0 5.0
1 2.0 6.0
2 4.0 7.0
then, contribution_df is
a b smaller larger
0 1.0 5.0 NaN NaN
1 2.0 6.0 NaN NaN
2 4.0 7.0 4.0 7.0
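Since the question asks for the column headers rather than the values, a possible variation along the same lines (header_formula and header_df are names introduced here for illustration; THRESHOLD and df are as above) could be:
def header_formula(row):
    # consider every pair of column names whose values sum above THRESHOLD
    pairs = [sorted(pair, key=lambda c: row[c])  # order each pair by its value
             for pair in combinations(row.index, r=2)
             if row[pair[0]] + row[pair[1]] > THRESHOLD]
    if not pairs:
        row["smaller"], row["larger"] = None, None
    else:
        row["smaller"], row["larger"] = pairs[0]  # header with the smaller value first, larger second
    return row

header_df = df.apply(header_formula, axis=1)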
HTH.
We wish to find the best matches between two dataframes using multiple keys. From the documentation, the method merge_asof seemed to be a reasonable choice. Testing it with one key column, it worked as expected.
import pandas as pd
import numpy as np
data_key_1 = np.arange(10).astype(float)
data_key_2 = np.arange(10).astype(float)
data_key_1_noisy = data_key_1-0.25*np.random.rand(10)
data_key_2_noisy = data_key_2-0.1*np.random.rand(10)
data_target = list('abcdefghij')
# one key approach
df_1 = pd.DataFrame(zip(data_key_1[::2], ), columns=['key_1',])
df_2 = pd.DataFrame(zip(data_key_1_noisy, data_target), columns=['key_1', 'target',])
df_result_1 = pd.merge_asof(df_1, df_2, on='key_1', direction='nearest', tolerance=0.5)
print(df_result_1)
With the following console output:
key_1 target
0 0.0 a
1 2.0 c
2 4.0 e
3 6.0 g
4 8.0 i
When trying to use two keys, it failed. We tried different combinations of keyword settings but didn't manage to get it running.
# two keys approach
df_1 = pd.DataFrame(zip(data_key_1[::2], data_key_2[::2]), columns=['key_1', 'key2'])
df_2 = pd.DataFrame(zip(data_key_1_noisy, data_key_2_noisy, data_target), columns=['key_1', 'key2', 'target'])
df_result_2 = pd.merge_asof(df_1, df_2, on=['key_1', 'key2'], direction='nearest', tolerance=0.5)
print(df_result_2)
It will throw an error:
pandas.errors.MergeError: can only asof on a key for left
The expected console output would have been:
key_1 key2 target
0 0.0 0.0 a
1 2.0 2.0 c
2 4.0 4.0 e
3 6.0 6.0 g
4 8.0 8.0 i
So we wondered whether we are trying to apply this method in an inappropriate context, or whether it is an applicable use case and we just messed up the keywords.
I believe you can also use a KDTree to do this:
from scipy.spatial import cKDTree
tree2 = cKDTree(df_2[["key_1", "key2"]])
# get distances and indices of nearest neighbors
dists, inds = tree2.query(df_1[['key_1', 'key2']], 1)
# assign target if nearest neighbor is within tol
df_1['target'] = [df_2.loc[i, 'target'] if d < 0.5 else np.nan for d,i in zip(dists, inds)]
key_1 key2 target
0 0.0 0.0 a
1 2.0 2.0 c
2 4.0 4.0 e
3 6.0 6.0 g
4 8.0 8.0 i
Previously I suggested query_ball_tree, but that method doesn't return nearest neighbors in any particular order, so one should use query to grab the nearest neighbor directly.
merge_asof always works on a single key (think about why it requires the key to be sorted). You can merge on one key and then filter on the other key.
(pd.merge_asof(df_1, df_2, on='key_1', direction='nearest', tolerance=0.5, suffixes=['','_'])
# may need to drop duplicate on `key_1, key2` here based on `abs` as well
.loc[lambda x: x['key2'].sub(x['key2_']).abs() < 0.5]
.drop(columns=['key2_'])
)
Output:
key_1 key2 target
0 0.0 0.0 a
1 2.0 2.0 c
2 4.0 4.0 e
3 6.0 6.0 g
4 8.0 8.0 i
I am trying to find a nice/smart way to fill my DataFrame with median values from groups.
I have 2 groups, "I" and "J", and 2 factors, "A" and "B". I want to replace my negative values with the median of the group to which the value belongs.
One constraint is that I don't want to replace the NaN values.
Here is the code to make my initial DataFrame:
import numpy as np
import pandas as pd

tuples = [('I', '0'), ('I', '1'), ('I', '2'), ('J', '3'), ('I', '4'), ('J', '5')]
index = pd.MultiIndex.from_tuples(tuples, names=["Id1", "Id2"])
df = pd.DataFrame(np.arange(12).reshape(-1, 2), columns=['A', 'B'], index=index)
df["A"].iloc[0] = -1
df["B"].iloc[-1:] = -1
df["B"].iloc[-2] = 18
df["B"].iloc[0] = np.NaN
df["B"].iloc[1] = np.NaN
which gives:
A B
Id1 Id2
I 0 -1 NaN
1 2 NaN
2 4 5.0
J 3 6 7.0
I 4 8 18.0
J 5 10 -1.0
Here is the way I solved it:
ind, col = np.where(df < 0)
nb_df_lt_0 = len(ind)
for ii in np.arange(nb_df_lt_0):
    df.iloc[ind[ii], col[ii]] = np.NaN
    xx, yy = ind[ii], col[ii]
    index_Id1 = df.index.get_level_values("Id1")[xx]
    df.iloc[xx, yy] = df.loc[index_Id1, :].iloc[:, yy].median()
df
This gives what I was looking for:
A B
Id1 Id2
I 0 4.0 NaN
1 2.0 NaN
2 4.0 5.0
J 3 6.0 7.0
I 4 8.0 18.0
J 5 10.0 7.0
It works, but it doesn't look nice, and it is surely not very efficient since I use a for loop.
I would be very pleased to see a solution using pandas or numpy functions that does the job.
Thanks in advance
You can do something like this:
df.mask(df<0, df.mask(df<0, np.nan).groupby(level=0).median())
Let's break that down. You need the median of the two groups "I" and "J", excluding the negative values:
median_df = df.mask(df<0, np.nan).groupby(level=0).median()
Then you want to substitute the median for the negative values in the original DataFrame:
df.mask(df<0, median_df)
You can use this:
It groups each column and then finds the median of the group (not including the -1 values):
for col in df.columns:
    df[col] = df.groupby('Id1')[col].apply(lambda x: (
        x.replace(-1, x.loc[x != -1].median())
    ))
Let's start from a small correction in the way you created the source DataFrame:
As each column can contain NaN, which is a special case of float,
create the temporary DataFrame with data type of float:
np.arange(12, dtype='float')
(no change in the rest of your code to create the DataFrame).
You will need the following group processing function:
def grpProc(grp):
    grp[grp == -1] = grp[grp != -1].median()
    return grp
It computes the median of the elements != -1 and stores it in the elements == -1, assuming that the source group (grp) is the part of the current column belonging to one Id1.
Then the changed group is returned.
And to get the result, apply it to each column of your DataFrame,
grouped by Id1 (level 0):
result = df.apply(lambda col: col.groupby(level=0).apply(grpProc))
No axis parameter has been passed, so this function is applied to
each column (axis == 0).
For your sample data the result is:
A B
Id1 Id2
I 0 4.0 NaN
1 2.0 NaN
2 4.0 5.0
J 3 6.0 7.0
I 4 8.0 18.0
J 5 10.0 7.0
I am working with a large panel dataset (longitudinal data) with 500k observations. Currently, I am trying to fill the missing data (at most 30% of observations) using the mean, up to time t, of each variable. (The reason why I do not fill the data with the overall mean is to avoid a forward-looking bias that arises from using data only available at a later point in time.)
I wrote the following function, which does the job but runs extremely slowly (5 hours for 500k rows!). In general, I find that filling missing data in pandas is a computationally tedious task. Please enlighten me on how you normally fill missing values, and how you make it run fast.
Function to fill with mean till time "t":
def meanTillTimeT(x, cols):
    start = time.time()
    print('Started')
    x.reset_index(inplace=True)
    for i in cols:
        l1 = []
        for j in range(x.shape[0]):
            if x.loc[j, i] != 0 and np.isnan(x.loc[j, i]) == False:
                l1.append(x.loc[j, i])
            elif np.isnan(x.loc[j, i]) == True:
                x.loc[j, i] = np.mean(l1)
    end = time.time()
    print("time elapsed:", end - start)
    return x
Let us build a DataFrame for illustration:
import pandas as pd
import numpy as np
df = pd.DataFrame({"value1": [1, 2, 1, 5, np.nan, np.nan, 8, 3],
"value2": [0, 8, 1, np.nan, np.nan, 8, 9, np.nan]})
Here is the DataFrame:
value1 value2
0 1.0 0.0
1 2.0 8.0
2 1.0 1.0
3 5.0 NaN
4 NaN NaN
5 NaN 8.0
6 8.0 9.0
7 3.0 NaN
Now, I suggest first computing the cumulative sums using pandas.DataFrame.cumsum, together with the cumulative count of non-NaN values, so as to compute the running means. After that, forward-fill those means and use them to fill the NaNs of the original DataFrame. Both fill steps use pandas.DataFrame.fillna, which is going to be much, much faster than Python loops:
df_mean = df.cumsum() / (~df.isna()).cumsum()
df_mean = df_mean.fillna(method = "ffill")
df = df.fillna(value = df_mean)
The result is:
value1 value2
0 1.00 0.0
1 2.00 8.0
2 1.00 1.0
3 5.00 3.0
4 2.25 3.0
5 2.25 8.0
6 8.00 9.0
7 3.00 5.2
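For reference, a shorter sketch that should give the same result uses pandas' built-in expanding window: the expanding mean at each row is the mean of all non-NaN values seen up to and including that row (df_filled is a name introduced here for illustration):
# expanding().mean() ignores NaNs, so filling the NaNs with it reproduces the result above
df_filled = df.fillna(df.expanding().mean())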
If there are duplicate values in a DataFrame, pandas already provides functions to replace or drop duplicates. In many experimental datasets, on the other hand, one might have 'near' duplicates.
How can one replace these near-duplicate values with, e.g., their mean?
The example data looks as follows:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
                   'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})
I tried to hack together something to bin near duplicates, but it uses for loops and seems like a hack against pandas:
def cluster_near_values(df, colname_to_cluster, bin_size=0.1):
    used_x = []  # list of values already grouped
    group_index = 0
    for search_value in df[colname_to_cluster]:
        if search_value in used_x:
            # value is already in a group, skip to next
            continue
        g_ix = df[abs(df[colname_to_cluster] - search_value) < bin_size].index
        used_x.extend(df.loc[g_ix, colname_to_cluster])
        df.loc[g_ix, 'cluster_group'] = group_index
        group_index += 1
    return df.groupby('cluster_group').mean()
Which does the grouping and averaging:
print(cluster_near_values(df, 'x', 0.1))
x y
cluster_group
0.0 1.000000 1.00
1.0 2.005000 2.10
2.0 3.000000 3.00
3.0 4.016667 4.17
4.0 5.000000 5.50
Is there a better way to achieve this?
Here's an example where you group items to one digit of precision. You can modify this as needed; you can also adapt it for binning values with a threshold over 1.
df.groupby(np.ceil(df['x'] * 10) // 10).mean()
x y
x
1.0 1.000000 1.00
2.0 2.005000 2.10
3.0 3.000000 3.00
4.0 4.016667 4.17
5.0 5.000000 5.50
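One caveat of fixed bins is that two values that are close but fall on either side of a bin boundary end up in different groups. If that matters, a proximity-based sketch, not part of the original answer (bin_size, s, cluster_id and result are names introduced here), is to sort the column and start a new group whenever the gap to the previous value exceeds the threshold:
# sort by 'x', then open a new cluster whenever the gap to the previous value exceeds bin_size
bin_size = 0.1
s = df['x'].sort_values()
cluster_id = s.diff().gt(bin_size).cumsum()
result = df.assign(cluster_group=cluster_id).groupby('cluster_group').mean()
On the example data this yields the same five groups as the question's cluster_near_values function.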
I'm posting this because the topic just got brought up in another question/answer and the behavior isn't very well documented.
Consider the dataframe df
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    A=list('xxxyyy'),
    B=[np.nan, 1, 2, 3, 4, np.nan]
))
A B
0 x NaN
1 x 1.0
2 x 2.0
3 y 3.0
4 y 4.0
5 y NaN
I wanted to get the first and last rows of each group defined by column 'A'.
I tried
df.groupby('A').B.agg(['first', 'last'])
first last
A
x 1.0 2.0
y 3.0 4.0
However, this doesn't give me the np.NaN values that I expected.
How do I get the actual first and last values in each group?
As noted here by @unutbu:
The groupby.first and groupby.last methods return the first and last non-null values respectively.
To get the actual first and last values, do:
def h(x):
    return x.values[0]

def t(x):
    return x.values[-1]

df.groupby('A').B.agg([h, t])
h t
A
x NaN 2.0
y 3.0 NaN
One option is to use the .nth method:
>>> gb = df.groupby('A')
>>> gb.nth(0)
B
A
x NaN
y 3.0
>>> gb.nth(-1)
B
A
x 2.0
y NaN
However, I haven't found a way to aggregate them neatly. Of course, one can always use a pd.DataFrame constructor:
>>> pd.DataFrame({'first':gb.B.nth(0), 'last':gb.B.nth(-1)})
first last
A
x NaN 2.0
y 3.0 NaN
Note: I explicitly used the gb.B attribute, or else you have to use .squeeze().
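For completeness, a sketch of the .squeeze() alternative mentioned in the note, assuming the pandas behaviour shown in the transcript above where gb.nth returns a one-column DataFrame indexed by the group key (newer pandas versions treat nth as a filter that returns the original rows, so gb.B.nth(...) is the safer spelling there):
# .squeeze() turns each one-column DataFrame returned by gb.nth into a Series,
# which the DataFrame constructor then aligns on the group index 'A'
pd.DataFrame({'first': gb.nth(0).squeeze(), 'last': gb.nth(-1).squeeze()})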