How to group near-duplicate values in a pandas dataframe?

Pandas already provides functions to replace or drop exact duplicate values in a DataFrame. In many experimental datasets, however, one might have 'near' duplicates.
How can one replace these near-duplicate values with, e.g., their mean?
The example data looks as follows:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
                   'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})
I hacked together something to bin near duplicates, but it relies on for loops and feels like it works against pandas:
def cluster_near_values(df, colname_to_cluster, bin_size=0.1):
    used_x = []  # list of values already grouped
    group_index = 0
    for search_value in df[colname_to_cluster]:
        if search_value in used_x:
            # value is already in a group, skip to next
            continue
        g_ix = df[abs(df[colname_to_cluster] - search_value) < bin_size].index
        used_x.extend(df.loc[g_ix, colname_to_cluster])
        df.loc[g_ix, 'cluster_group'] = group_index
        group_index += 1
    return df.groupby('cluster_group').mean()
Which does the grouping and averaging:
print(cluster_near_values(df, 'x', 0.1))
x y
cluster_group
0.0 1.000000 1.00
1.0 2.005000 2.10
2.0 3.000000 3.00
3.0 4.016667 4.17
4.0 5.000000 5.50
Is there a better way to achieve this?

Here's an example where items are grouped to one digit of precision. You can modify this as needed, including for binning with a threshold larger than 1.
df.groupby(np.ceil(df['x'] * 10) // 10).mean()
x y
x
1.0 1.000000 1.00
2.0 2.005000 2.10
3.0 3.000000 3.00
4.0 4.016667 4.17
5.0 5.000000 5.50
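If the grouping threshold does not line up with a decimal digit, a vectorized alternative (a sketch, not part of the answer above) is to sort the column and start a new cluster wherever the gap between consecutive values reaches bin_size; for the sample data this should reproduce the groups found by the original loop:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
                   'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})

bin_size = 0.1
x_sorted = df['x'].sort_values()
# start a new cluster whenever the gap to the previous (sorted) value reaches bin_size
cluster_id = (x_sorted.diff() >= bin_size).cumsum()
# groupby aligns the cluster labels on the index, so the original row order doesn't matter
print(df.groupby(cluster_id).mean())
Note that this clusters by gaps between neighbours rather than by distance to a seed value, so a chain of closely spaced values can end up in one cluster even if its extremes are more than bin_size apart.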

Related

Fill a DataFrame with medians of group only for cell containing specific value

I am trying to find a nice/smart way to fill my DataFrame with the median value of its group.
I have 2 groups, "I" and "J", and 2 factors, "A" and "B". I want to replace each negative value with the median of the group to which it belongs.
One constraint is that I don't want to replace the NaN values.
Here is the code to make my initial DataFrame
import numpy as np
import pandas as pd

tuples = [('I', '0'), ('I', '1'), ('I', '2'), ('J', '3'), ('I', '4'), ('J', '5')]
index = pd.MultiIndex.from_tuples(tuples, names=["Id1", "Id2"])
df = pd.DataFrame(np.arange(12).reshape(-1, 2), columns=['A', 'B'], index=index)
df["A"].iloc[0] = -1
df["B"].iloc[-1:] = -1
df["B"].iloc[-2] = 18
df["B"].iloc[0] = np.NaN
df["B"].iloc[1] = np.NaN
which gives:
          A     B
Id1 Id2
I   0    -1   NaN
    1     2   NaN
    2     4   5.0
J   3     6   7.0
I   4     8  18.0
J   5    10  -1.0
Here is the way I solved it:
ind, col = np.where(df < 0)
nb_df_lt_0 = len(ind)
for ii in np.arange(nb_df_lt_0):
    df.iloc[ind[ii], col[ii]] = np.NaN
    xx, yy = ind[ii], col[ii]
    index_Id1 = df.index.get_level_values("Id1")[xx]
    df.iloc[xx, yy] = df.loc[index_Id1, :].iloc[:, yy].median()
df
This gives what I was looking for:
            A     B
Id1 Id2
I   0     4.0   NaN
    1     2.0   NaN
    2     4.0   5.0
J   3     6.0   7.0
I   4     8.0  18.0
J   5    10.0   7.0
It works, but it doesn't look nice, and it is surely not very efficient since it uses a for loop.
I would be very pleased to see a solution using pandas or numpy functions that does the job.
Thanks in advance.
You can do something like this:
df.mask(df<0, df.mask(df<0, np.nan).groupby(level=0).median())
Let's break that down. You need the median of the two groups "I" and "J", excluding the negative values:
median_df = df.mask(df<0, np.nan).groupby(level=0).median()
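For the sample data, this intermediate frame works out to the following (a sketch of the expected medians, not output copied from the original post):
print(median_df)
       A     B
Id1
I    4.0  11.5
J    8.0   7.0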
Then you want to substitute the median for the negative values in the original DataFrame:
df.mask(df<0, median_df)
You can use this:
It groups each column and then replaces the -1 values with the median of the group (computed excluding the -1 values):
for col in df.columns:
    df[col] = df.groupby('Id1')[col].apply(lambda x: (
        x.replace(-1, x.loc[x != -1].median())
    ))
Let's start with a small correction to the way you created the source DataFrame:
since each column can contain NaN, which is a special case of float,
create the temporary array with a float data type:
np.arange(12, dtype='float')
(no change to the rest of your code that creates the DataFrame).
You will need the following group processing function:
def grpProc(grp):
    grp[grp == -1] = grp[grp != -1].median()
    return grp
It computes the median of the elements != -1 and writes it into the elements == -1,
assuming that the source group (grp) is the part of the current column for a single Id1 value.
The changed group is then returned.
To get the result, apply it to each column of your DataFrame, grouped by Id1 (level 0):
result = df.apply(lambda col: col.groupby(level=0).apply(grpProc))
No axis parameter has been passed, so this function is applied to each column (axis == 0).
For your sample data the result is:
            A     B
Id1 Id2
I   0     4.0   NaN
    1     2.0   NaN
    2     4.0   5.0
J   3     6.0   7.0
I   4     8.0  18.0
J   5    10.0   7.0

Filling missing data with historical mean fast and efficiently in pandas

I am working with a large panel dataset (longitudinal data) with 500k observations. Currently, I am trying to fill the missing data (at most 30% of observations) for each variable with its mean up to time t. (The reason I do not fill the data with the overall mean is to avoid the forward-looking bias that arises from using data only available at a later point in time.)
I wrote the following function, which does the job but runs extremely slowly (5 hours for 500k rows!). In general, I find that filling missing data in pandas is a computationally tedious task. Please enlighten me on how you normally fill missing values, and how you make it run fast.
Function to fill with mean till time "t":
import time

import numpy as np

def meanTillTimeT(x, cols):
    start = time.time()
    print('Started')
    x.reset_index(inplace=True)
    for i in cols:
        l1 = []  # running list of the valid values seen so far in column i
        for j in range(x.shape[0]):
            if x.loc[j, i] != 0 and np.isnan(x.loc[j, i]) == False:
                l1.append(x.loc[j, i])
            elif np.isnan(x.loc[j, i]) == True:
                x.loc[j, i] = np.mean(l1)
    end = time.time()
    print("time elapsed:", end - start)
    return x
Let us build a DataFrame for illustration:
import pandas as pd
import numpy as np
df = pd.DataFrame({"value1": [1, 2, 1, 5, np.nan, np.nan, 8, 3],
"value2": [0, 8, 1, np.nan, np.nan, 8, 9, np.nan]})
Here is the DataFrame:
value1 value2
0 1.0 0.0
1 2.0 8.0
2 1.0 1.0
3 5.0 NaN
4 NaN NaN
5 NaN 8.0
6 8.0 9.0
7 3.0 NaN
Now, I suggest first computing the cumulative sums with pandas.DataFrame.cumsum, along with the cumulative count of non-NaN values, so as to compute the running means. After that, it is enough to forward-fill those means and use them to fill the NaNs in the original DataFrame. Both of these steps use pandas.DataFrame.fillna, which is going to be much faster than Python loops:
df_mean = df.cumsum() / (~df.isna()).cumsum()
df_mean = df_mean.fillna(method = "ffill")
df = df.fillna(value = df_mean)
The result is:
value1 value2
0 1.00 0.0
1 2.00 8.0
2 1.00 1.0
3 5.00 3.0
4 2.25 3.0
5 2.25 8.0
6 8.00 9.0
7 3.00 5.2
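As a side note (not part of the original answer), pandas' expanding mean computes the same running mean over the non-NaN values seen so far, so a one-line sketch of the same idea would be:
# a sketch assuming the same fill rule: each NaN gets the mean of the valid values seen so far;
# expanding().mean() skips NaNs, so this should match the cumsum-based result above
df_filled = df.fillna(df.expanding().mean())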

pandas groupby dataframe and get mean and most common value per group

I have a dataframe with 2 columns.
df=pd.DataFrame({'values':arrays,'ii':lin_index})
I want to group the values by the lin_index and get the mean per group and the most common value per group
I tried this:
bii=df.groupby('ii').median()
bii2=df.groupby('ii').agg(lambda x:x.value_counts().index[0])
bii3=df.groupby('ii')['values'].agg(pd.Series.mode)
I wonder whether bii2 and bii3 return the same values.
Then I want to return the mean and most common value to the original array
bs=np.zeros((np.unique(array).shape[0],1))
bs[bii.index.values]=bii.values
Does this look good?
df looks like
values ii
0 1.0 10446786
1 1.0 11316289
2 1.0 16416704
3 1.0 12151686
4 1.0 30312736
... ...
93071038 3.0 28539525
93071039 3.0 19667948
93071040 3.0 22240849
93071041 3.0 22212513
93071042 3.0 41641943
[93071043 rows x 2 columns]
something like this maybe:
# get the mean
df.groupby(['ii']).mean()
# get the most frequent
df.groupby(['ii']).agg(pd.Series.mode)
your question seems similar to
GroupBy pandas DataFrame and select most common value
this link might also be useful https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats
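On whether bii2 and bii3 return the same values: they agree whenever a group has a single most common value, but they can differ on ties. A minimal sketch with hypothetical data:
import pandas as pd

s = pd.Series([1, 1, 2, 2, 3])
print(s.value_counts().index[0])  # a single value; which tied value comes first is not guaranteed
print(s.mode())                   # all tied modes as a Series: [1, 2]
So agg(pd.Series.mode) can return an array for groups with ties, while the value_counts() version always returns a single scalar.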

Find if sum of any two columns exceed X in pandas dataframe

Columns are attributes, rows are observations.
I would like to extract the rows where the sum of any two attributes exceeds a specified value (say 0.7). Then, in two new columns, list the column headers with the bigger and smaller contribution to that sum.
I am new to Python, so I am stuck on how to proceed after generating my dataframe.
You can do this:
import pandas as pd
from itertools import combinations
THRESHOLD = 8.0
def valuation_formula(row):
    l = [sorted(x) for x in combinations(row, r=2) if sum(x) > THRESHOLD]
    if len(l) == 0:
        row["smaller"], row["larger"] = None, None
    else:
        # since not specified by OP, we take the first such pair
        row["smaller"], row["larger"] = l[0]
    return row
contribution_df = df.apply(lambda row: valuation_formula(row), axis=1)
So that, if
df = pd.DataFrame({"a" : [1.0, 2.0, 4.0], "b" : [5.0, 6.0, 7.0]})
a b
0 1.0 5.0
1 2.0 6.0
2 4.0 7.0
then, contribution_df is
a b smaller larger
0 1.0 5.0 NaN NaN
1 2.0 6.0 NaN NaN
2 4.0 7.0 4.0 7.0
HTH.
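To then extract only the rows where some pair of attributes exceeds the threshold, as the question asks, one could filter on the helper columns afterwards (a small follow-up sketch using the contribution_df built above):
# keep only the rows for which at least one pair of attributes exceeded THRESHOLD
exceeding = contribution_df[contribution_df['smaller'].notna()]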

Why doesn't first and last in a groupby give me first and last

I'm posting this because the topic just got brought up in another question/answer and the behavior isn't very well documented.
Consider the dataframe df
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    A=list('xxxyyy'),
    B=[np.nan, 1, 2, 3, 4, np.nan]
))
A B
0 x NaN
1 x 1.0
2 x 2.0
3 y 3.0
4 y 4.0
5 y NaN
I wanted to get the first and last rows of each group defined by column 'A'.
I tried
df.groupby('A').B.agg(['first', 'last'])
first last
A
x 1.0 2.0
y 3.0 4.0
However, this doesn't give me the np.NaNs that I expected.
How do I get the actual first and last values in each group?
As noted here by @unutbu:
The groupby.first and groupby.last methods return the first and last non-null values respectively.
To get the actual first and last values, do:
def h(x):
    return x.values[0]

def t(x):
    return x.values[-1]

df.groupby('A').B.agg([h, t])
h t
A
x NaN 2.0
y 3.0 NaN
One option is to use the .nth method:
>>> gb = df.groupby('A')
>>> gb.nth(0)
B
A
x NaN
y 3.0
>>> gb.nth(-1)
B
A
x 2.0
y NaN
>>>
However, I haven't found a way to aggregate them neatly. Of course, one can always use a pd.DataFrame constructor:
>>> pd.DataFrame({'first':gb.B.nth(0), 'last':gb.B.nth(-1)})
first last
A
x NaN 2.0
y 3.0 NaN
Note: I explicitly used the gb.B attribute; otherwise you have to use .squeeze().
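For completeness, a sketch of the .squeeze() variant mentioned above (assuming a pandas version where nth returns a one-column frame indexed by the group key, as in the output shown):
# gb.nth(0) is a one-column DataFrame here, so squeeze it down to a Series
# before handing it to the DataFrame constructor
pd.DataFrame({'first': gb.nth(0).squeeze(), 'last': gb.nth(-1).squeeze()})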
