Pandas - Filter dataframe by group aggregates - python

The full version is that I'm trying to return a dataframe of rows where each row represents an outlier within its group. So ultimately I'm trying to filter on values that fall outside of two other values.
To simplify things here, though, I'll just use mean(), as it's the comparison that I'm struggling with.
Example:
df = pd.DataFrame({
    "Group": ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    "Sub": ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    "Values": [1, 2, 3, 10, 20, 10, 25, 100, 75, 1500, 1600, 1800]
})
Then I want to group by "Group" and "Sub" to find the mean of each group:
df.groupby(["Group", 'Sub']).mean()
Then I want to use these values to filter the original dataframe. So, for example, keep the rows where "Values" > the group's mean of "Values".
So in this example I'd be expecting to see something like this, as these are the only rows above their group mean:
   Group Sub  Values
3      A   A      10
4      A   B      20
5      A   C      10
9      B   A    1500
10     B   B    1600
11     B   C    1800
I've tried comparing them directly and I get:
ValueError: Can only compare identically-labeled DataFrame objects
So I tried .set_index(['Group', 'Sub']) and I get the same error, but as far as I can tell the labels are identical? At least they are when I check .index on both.
This seems like something that should be quite straight forward but I'm really struggling to work it out.

Applying the filtering condition:
res_df = df.groupby(["Group", 'Sub'])["Values"]\
           .apply(lambda x: x[x > x.mean()]).reset_index(level=[0,1])
   Group Sub  Values
3      A   A      10
4      A   B      20
5      A   C      10
9      B   A    1500
10     B   B    1600
11     B   C    1800

If you use transform you can then compare it against the original values:
df.loc[df['Values'].gt(df.groupby(['Group','Sub'])['Values'].transform('mean'))]
Output
   Group Sub  Values
3      A   A      10
4      A   B      20
5      A   C      10
9      B   A    1500
10     B   B    1600
11     B   C    1800
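The reason this works is that transform('mean') returns a Series aligned with the original rows, so the comparison is element-wise against each row's own group mean. A minimal sketch using the example data (the printed means are just worked out from the values above):

import pandas as pd

df = pd.DataFrame({
    "Group": ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    "Sub": ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    "Values": [1, 2, 3, 10, 20, 10, 25, 100, 75, 1500, 1600, 1800]
})

# transform broadcasts each group's mean back onto every row of that group,
# so the result has the same length and index as df
group_mean = df.groupby(['Group', 'Sub'])['Values'].transform('mean')
print(group_mean.iloc[:3].tolist())     # [5.5, 11.0, 6.5] for groups A/A, A/B, A/C
print(df[df['Values'].gt(group_mean)])  # the six rows above their group mean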

Related

pandas apply function on groups gives error when group has only one distinct value

I have extracted a group of data (grouped by col 'A') from a larger dataset and want to apply a function to the group in order to verify the results of the function. The problem is that when I apply the function to a group that has only one distinct value in the grouping column (df_false), pandas returns a
ValueError: Expected a 1D array, got an array with shape (6, 6)
When I apply the same function to a df that has more than one distinct value in the grouping column (df_ok), the error doesn't appear.
Does anyone know how to deal with that?
import pandas as pd

df_false = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'a'],
                         'B': [10, 10, 20, 20, 30, 10],
                         'C': [10, 10, 20, 30, 10, 5]})
df_ok = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'c'],
                      'B': [10, 10, 20, 20, 30, 10],
                      'C': [10, 10, 20, 30, 10, 5]})
display(df_false)

def myf(x):
    y = []
    for i in x.iterrows():
        y.append(len(x))
    return pd.Series(y)

df_false['result'] = df_false.groupby('A').apply(myf).reset_index(drop=True)
display(df_false)
The issue is that your code with df_false outputs a DataFrame (of a single row). You can force it into a Series with squeeze:
df_false['result'] = (df_false.groupby('A').apply(myf)
                      .reset_index(drop=True).squeeze()
                      )
That said, unless this was a dummy example, you should rather use vectorized code:
df_false['result'] = df_false.groupby('A')['A'].transform('size')
output:

   A   B   C  result
0  a  10  10       6
1  a  10  10       6
2  a  20  20       6
3  a  20  30       6
4  a  30  10       6
5  a  10   5       6
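The shape difference that triggers the error is easy to see by inspecting the raw apply result for both frames; a small sketch reusing the definitions above (the shapes assume pandas' usual rules for combining apply results):

# with a single group, the one returned Series becomes a row, so apply
# yields a 1 x 6 DataFrame -> the column assignment fails
print(df_false.groupby('A').apply(myf).shape)   # (1, 6)

# with two groups the returned Series have different lengths, so apply
# concatenates them into a plain Series of length 6 -> the assignment works
print(df_ok.groupby('A').apply(myf).shape)      # (6,)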

How to find similarities in two dataframes and extract its value [duplicate]

I have a dataframe df1 which looks like:
   c  k  l
0  A  1  a
1  A  2  b
2  B  2  a
3  C  2  a
4  C  2  d
and another called df2 like:
   c  l
0  A  b
1  C  a
I would like to filter df1, keeping only the rows whose values ARE NOT in df2. The values to filter out are the (A, b) and (C, a) tuples. So far I have tried to apply the isin method:
d = df1[~(df1['l'].isin(df2['l']) & df1['c'].isin(df2['c']))]
That seems too complicated to me, and it returns:
   c  k  l
2  B  2  a
4  C  2  d
but I'm expecting:
   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d
You can do this efficiently using isin on a multiindex constructed from the desired columns:
df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

keys = list(df2.columns.values)
i1 = df1.set_index(keys).index
i2 = df2.set_index(keys).index
df1[~i1.isin(i2)]
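On the example data this keeps rows 0, 2 and 4, which is exactly the output the question expects:

print(df1[~i1.isin(i2)])
#    c  k  l
# 0  A  1  a
# 2  B  2  a
# 4  C  2  d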
I think this improves on @IanS's similar solution because it doesn't assume any column type (i.e. it will work with numbers as well as strings).
(Above answer is an edit. Following was my initial answer)
Interesting! This is something I haven't come across before... I would probably solve it by merging the two dataframes, then dropping the rows where df2 is defined. Here is an example, which makes use of a temporary marker column:
df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# create a column marking df2 values
df2['marker'] = 1

# join the two, keeping all of df1's indices
joined = pd.merge(df1, df2, on=['c', 'l'], how='left')
joined

# extract desired columns where marker is NaN
joined[pd.isnull(joined['marker'])][df1.columns]
There may be a way to do this without the temporary column, but I can't think of one. As long as your data isn't huge, the above method should be a fast and sufficient answer.
This is pretty succinct and works well, provided both frames are indexed by the key columns:
df1 = df1[~df1.index.isin(df2.index)]
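For that one-liner to compare the (c, l) pairs rather than the default integer index, both frames presumably need to be indexed by those key columns first; a minimal sketch (variable names are just illustrative):

keyed1 = df1.set_index(['c', 'l'])
keyed2 = df2.set_index(['c', 'l'])

# keep only the rows of df1 whose (c, l) pair does not appear in df2
result = keyed1[~keyed1.index.isin(keyed2.index)].reset_index()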
Using DataFrame.merge & DataFrame.query:
A more elegant method would be to do a left join with the argument indicator=True, then filter all the rows which are left_only with query:
d = (
    df1.merge(df2,
              on=['c', 'l'],
              how='left',
              indicator=True)
       .query('_merge == "left_only"')
       .drop(columns='_merge')
)

print(d)
   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d
indicator=True returns a dataframe with an extra column _merge, which marks each row as left_only, both, or right_only:
df1.merge(df2, on=['c', 'l'], how='left', indicator=True)
   c  k  l     _merge
0  A  1  a  left_only
1  A  2  b       both
2  B  2  a  left_only
3  C  2  a       both
4  C  2  d  left_only
I think this is quite a simple approach when you want to filter a dataframe based on multiple columns from another dataframe, or even based on a custom list.
df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# values of df2 columns 'c' and 'l' that will be used to filter df1
idxs = list(zip(df2.c.values, df2.l.values))  # [('A', 'b'), ('C', 'a')]

# so df1 is filtered based on the values present in columns c and l of df2 (idxs)
df1 = df1[pd.Series(list(zip(df1.c, df1.l)), index=df1.index).isin(idxs)]
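Note that, as written, this keeps the rows whose (c, l) pair does appear in df2; to match the question's goal of keeping the rows that are not in df2, the mask would simply be negated with ~. A sketch, applied to the original (unfiltered) df1:

mask = pd.Series(list(zip(df1.c, df1.l)), index=df1.index).isin(idxs)
df1_not_in_df2 = df1[~mask]   # rows 0, 2 and 4 in the example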
How about:
df1['key'] = df1['c'] + df1['l']
d = df1[~df1['key'].isin(df2['c'] + df2['l'])].drop(['key'], axis=1)
Another option that avoids creating an extra column or doing a merge would be to do a groupby on df2 to get the distinct (c, l) pairs and then just filter df1 using that.
gb = df2.groupby(['c', 'l']).groups
df1[[p not in gb for p in zip(df1['c'], df1['l'])]]
For this small example, it actually seems to run a bit faster than the pandas-based approach (666 µs vs. 1.76 ms on my machine), but I suspect it could be slower on larger examples since it's dropping into pure Python.
You can concatenate both DataFrames and drop all duplicates:
df1.append(df2).drop_duplicates(subset=['c', 'l'], keep=False)
Output:
   c    k  l
0  A  1.0  a
2  B  2.0  a
4  C  2.0  d
This method doesn't work if df1 itself contains duplicate (c, l) pairs.
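Also note that DataFrame.append was removed in pandas 2.0, so on current versions the same idea would be written with pd.concat; a sketch:

out = (pd.concat([df1, df2])
         .drop_duplicates(subset=['c', 'l'], keep=False))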

Create new rows to pandas dataframe based on condition efficiently

I have two pandas dataframes: one with IDs and values, and another that maps IDs to other IDs. The objective is to create a new dataframe based on df1. The code loops through each SourceId in df1 and looks in df2, a mapping df, for matches on SourceId. If a match is found, a new row is created with the same value as in df1. So if multiple matches are found, the loop creates multiple rows (e.g. with ids A and C). If only one match is found (e.g. with id B), only one row is created.
The code below does exactly what I want, but it does it very slowly. In my original dataset df1 has 440K rows and df2 has mappings for thousands of different IDs - currently the code runs at 10-25 it/s, which is far too slow.
Is there a faster way to do this that would benefit from matrix calculations/other benefits of numpy/pandas?
import pandas as pd

df1 = pd.DataFrame({
    'SourceId': ['A', 'B', 'C', 'A', 'C', 'B'],
    'value': [1, 5, 12, 30, 32, 55],
    'time': [pd.to_datetime('2020-04-04 08:49:52.166498900+0000'),
             pd.to_datetime('2020-08-14 06:12:40.860460500+0000'),
             pd.to_datetime('2020-05-13 09:20:50.052688900+0000'),
             pd.to_datetime('2020-03-09 13:55:17.335340600+0000'),
             pd.to_datetime('2020-08-14 09:30:56.359635400+0000'),
             pd.to_datetime('2020-01-31 23:03:46.539892900+0000')],
    'otherInfo': ['0A10a', '055jA', 'boAqz', '0t,m5A', '09tjq1', 'akk_1!']})

df2 = pd.DataFrame({'SourceId': ['A', 'A', 'B', 'C', 'C', 'C'],
                    'TargetId': ['A', 'Q', 'B', 'C', 'B', 'X'],
                    'trueIfMatch': [1, 0, 1, 1, 0, 0]})
df3 = pd.DataFrame()

for r in df1.itertuples():
    SourceId = r.SourceId
    value = r.value
    time = r.time
    otherInfo = r.otherInfo
    if SourceId in df2.SourceId.unique():
        entries = df2.loc[df2.SourceId == SourceId].TargetId.tolist()
        for entry in entries:
            df3 = df3.append({
                'sourceId': SourceId,
                'targetId': entry,
                'value': value,
                'time': time,
                'otherInfo': otherInfo
            }, ignore_index=True)
display(df3)
Use df.merge with sort_values:
In [2293]: df3 = df1.merge(df2, on='SourceId').sort_values('value')
In [2294]: df3
Out[2294]:
   SourceId  value TargetId
0         A      1        A
1         A      1        Q
4         B      5        B
6         C     12        C
7         C     12        B
8         C     12        X
2         A     30        A
3         A     30        Q
9         C     32        C
10        C     32        B
11        C     32        X
5         B     55        B
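If the lowercase column names produced by the original loop (sourceId, targetId) are needed, a rename can be chained on; a small sketch:

df3 = (df1.merge(df2, on='SourceId')
          .sort_values('value')
          .rename(columns={'SourceId': 'sourceId', 'TargetId': 'targetId'}))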

Finding the Location of the Duplicate for Duplicated Columns in Pandas

I know I can find duplicate columns using:
df.T.duplicated()
What I'd like to know is which column a duplicate column is a duplicate of. For example, both C and D are duplicates of A below:
df = pd.DataFrame([[1, 0, 1, 1], [2, 0, 2, 2]], columns=['A', 'B', 'C', 'D'])

   A  B  C  D
0  1  0  1  1
1  2  0  2  2
I'd like something like:
duplicate_index = pd.Series([None, None, 'A', 'A'], ['A', 'B', 'C', 'D'])
I don't know if duplicated has an option to give information about the first occurrence of the same data. My idea is to use groupby and transform, such as:
arr_first = (df.T.reset_index().groupby([col for col in df.T.columns])['index']
               .transform(lambda x: x.iloc[0]).values)
With your example, arr_first is then equal to array(['A', 'B', 'A', 'A'], dtype=object), and because it has the same order as df.columns, to get the expected output you use np.where like:
import numpy as np

duplicate_index = pd.Series(np.where(arr_first != df.columns, arr_first, None), df.columns)
and the result for duplicate_index is
A    None
B    None
C       A
D       A
dtype: object
Another, more direct way to test whether two numeric columns are duplicates of each other is to look at the correlation matrix, which tests all pairs of columns. Here is the code:
import pandas as pd

df = pd.DataFrame([[1, 0, 1, 1], [2, 0, 2, 2]], columns=['A', 'B', 'C', 'D'])

# compute the correlation matrix
cm = df.corr()
cm
This shows a matrix of the correlation of every column with every other column (including itself). If a column is 1:1 with another column, the value is 1.0.
To find all columns that are duplicates of A:
cm['A']

A    1.0
B    NaN
C    1.0
D    1.0
If you have categorical (string) columns rather than numeric ones, you could make a cross correlation table instead.
Hope this helps!

How can I multiply two dataframes with different column labels in pandas?

I'm trying to multiply (add/divide/etc.) two dataframes that have different column labels.
I'm sure this is possible, but what's the best way to do it? I've tried using rename to change the columns on one df first, but (1) I'd rather not do that and (2) my real data has a multiindex on the columns (where only one layer of the multiindex is differently labeled), and rename seems tricky for that case...
So to try and generalize my question, how can I get df1 * df2 using map to define the columns to multiply together?
df1 = pd.DataFrame([[1, 2, 3]] * 3, index=['1', '2', '3'], columns=['a', 'b', 'c'])
df2 = pd.DataFrame([[4, 5, 6]] * 3, index=['1', '2', '3'], columns=['d', 'e', 'f'])
map = {'a': 'e', 'b': 'd', 'c': 'f'}
df1 * df2 = ?
I was also troubled by this problem.
It seems that pandas requires both dataframes to have the same column names for element-wise multiplication.
I searched a lot, and the examples I found (e.g. setting with enlargement) only cover adding a single column to a dataframe.
For your question,
rs = pd.np.multiply(ds2, ds1)
The rs will have the same column names as ds2.
Suppose we want to multiply several columns by several other columns in the same dataframe and append the results to the original dataframe.
For example, if ds1 and ds2 are in the same dataframe ds, we can
ds[['r1', 'r2', 'r3']] = pd.np.multiply(ds[['a', 'b', 'c']], ds[['d', 'e', 'f']])
I hope this helps.
Updated solution now that pd.np is being deprecated: df1.multiply(np.array(df2))
It will keep the column names of df1 and multiply them by the columns of df2 in order.
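A self-contained sketch of that approach. It is positional, so it assumes the columns of df2 are already in the order you want them matched against df1's columns (here a*d, b*e, c*f rather than the question's map):

import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3]] * 3, index=['1', '2', '3'], columns=['a', 'b', 'c'])
df2 = pd.DataFrame([[4, 5, 6]] * 3, index=['1', '2', '3'], columns=['d', 'e', 'f'])

# np.array strips df2's labels, so the multiplication is purely positional
# and the result keeps df1's index and column names
result = df1.multiply(np.array(df2))
print(result)   # columns a, b, c holding 4, 10, 18 in every row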
I just stumbled onto the same problem. It seems like pandas wants both the column and row index to be aligned to do the element-wise multiplication, so you can just rename with your mapping during the multiplication:
>>> df1 = pd.DataFrame([[1, 2, 3]] * 3, index=['1', '2', '3'], columns=['a', 'b', 'c'])
>>> df2 = pd.DataFrame([[4, 5, 6]] * 3, index=['1', '2', '3'], columns=['d', 'e', 'f'])
>>> df1
   a  b  c
1  1  2  3
2  1  2  3
3  1  2  3
>>> df2
   d  e  f
1  4  5  6
2  4  5  6
3  4  5  6
>>> mapping = {'a': 'e', 'b': 'd', 'c': 'f'}
>>> df1.rename(columns=mapping) * df2
   d  e   f
1  8  5  18
2  8  5  18
3  8  5  18
If you want the 'natural' order of columns, you can create a mapping on the fly like:
>>> df1 * df2.rename(columns=dict(zip(df2.columns, df1.columns)))
For example, to compute the "Frobenius inner product" of the two matrices, you could do:
>>> (df1 * df2.rename(columns=dict(zip(df2.columns, df1.columns)))).sum().sum()
96
This is a pretty old question, and as nnsk said, pd.np is being deprecated.
A nice looking solution is df1 * df2.values. This will produce the element-wise product of the two dataframes, and keep the column names of df1.
Assuming the index is already aligned, you probably just want to align the columns of both DataFrames in the right order and divide the .values of both.
Suppose mapping = {'a': 'e', 'b': 'd', 'c': 'f'}:
v1 = df1.reindex(columns=['a', 'b', 'c']).values
v2 = df2.reindex(columns=['e', 'd', 'f']).values
rs = pd.DataFrame(v1 / v2, index=df1.index, columns=['a', 'b', 'c'])
Another solution, assuming the indexes and columns are positionally aligned:
df_mul = pd.DataFrame(df1.values * df2.values, columns=df1.columns, index=df1.index)
