How can I multiply two dataframes with different column labels in pandas? - python

I'm trying to multiply (add/divide/etc.) two dataframes that have different column labels.
I'm sure this is possible, but what's the best way to do it? I've tried using rename to change the columns on one df first, but (1) I'd rather not do that and (2) my real data has a multiindex on the columns (where only one layer of the multiindex is differently labeled), and rename seems tricky for that case...
So to try and generalize my question, how can I get df1 * df2 using map to define the columns to multiply together?
df1 = pd.DataFrame([[1, 2, 3]] * 3, index=['1', '2', '3'], columns=['a', 'b', 'c'])
df2 = pd.DataFrame([[4, 5, 6]] * 3, index=['1', '2', '3'], columns=['d', 'e', 'f'])
map = {'a': 'e', 'b': 'd', 'c': 'f'}
df1 * df2 = ?

I was also troubled by this problem.
It seems that pandas requires both dataframes to have the same column names for element-wise multiplication.
I searched a lot, and the closest example I found, in the "setting with enlargement" docs, is about adding one column to the dataframe.
For your question,
rs = pd.np.multiply(ds2, ds1)
The result rs will have the same column names as ds2.
Suppose we want to multiply several columns by several other columns in the same dataframe and append the results to the original dataframe.
For example ds1,ds2 are in the same dataframe ds. We can
ds[['r1', 'r2', 'r3']] = pd.np.multiply(ds[['a', 'b', 'c']], ds[['d', 'e', 'f']])
I hope these will help.

Updated solution now that pd.np is being deprecated: df1.multiply(np.array(df2))
It will keep the column names of df1 and multiply them by the columns of df2 in order.
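For illustration, a minimal sketch of that positional multiplication (these are small throwaway frames built to match the question, not the asker's real data):
import numpy as np
import pandas as pd
df1 = pd.DataFrame([[1, 2, 3]] * 3, index=['1', '2', '3'], columns=['a', 'b', 'c'])
df2 = pd.DataFrame([[4, 5, 6]] * 3, index=['1', '2', '3'], columns=['d', 'e', 'f'])
# np.array(df2) strips df2's labels, so alignment is purely positional:
# 'a' is multiplied by 'd', 'b' by 'e', 'c' by 'f'
result = df1.multiply(np.array(df2))
print(result)
#    a   b   c
# 1  4  10  18
# 2  4  10  18
# 3  4  10  18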

I just stumbled onto the same problem. It seems like pandas wants both the column and row index to be aligned to do the element-wise multiplication, so you can just rename with your mapping during the multiplication:
>>> df1 = pd.DataFrame([[1, 2, 3]] * 3, index=['1', '2', '3'], columns=['a', 'b', 'c'])
>>> df2 = pd.DataFrame([[4, 5, 6]] * 3, index=['1', '2', '3'], columns=['d', 'e', 'f'])
>>> df1
a b c
1 1 2 3
2 1 2 3
3 1 2 3
>>> df2
d e f
1 4 5 6
2 4 5 6
3 4 5 6
>>> mapping = {'a' : 'e', 'b' : 'd', 'c' : 'f'}
>>> df1.rename(columns=mapping) * df2
d e f
1 8 5 18
2 8 5 18
3 8 5 18
If you want the 'natural' order of columns, you can create a mapping on the fly like:
>>> df1 * df2.rename(columns=dict(zip(df2.columns, df1.columns)))
For example, to compute the "Frobenius inner product" of the two matrices, you could do:
>>> (df1 * df2.rename(columns=dict(zip(df2.columns, df1.columns)))).sum().sum()
96

This is a pretty old question, and as nnsk said, pd.np is being deprecated.
A nice looking solution is df1 * df2.values. This will produce the element-wise product of the two dataframes, and keep the column names of df1.
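For reference, with the same throwaway 3x3 df1/df2 used in the earlier answer, this gives:
>>> df1 * df2.values
   a   b   c
1  4  10  18
2  4  10  18
3  4  10  18
.values drops df2's labels, so the multiplication is positional and df1's column names are kept.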

Assuming the index is already aligned, you probably just want to align the columns of both DataFrames in the right order and divide the .values of both.
Supposing mapping = {'a' : 'e', 'b' : 'd', 'c' : 'f'}:
v1 = df1.reindex(columns=['a', 'b', 'c']).values
v2 = df2.reindex(columns=['e', 'd', 'f']).values
rs = pd.DataFrame(v1 / v2, index=df1.index, columns=['a', 'b', 'c'])

Another solution, assuming indexes and columns are well positioned:
df_mul = pd.DataFrame(df1.values * df2.values, columns=df1.columns, index=df1.index)

Related

pandas apply function on groups gives error when group has only one distinct value

I have extracted a group of data (e.g. col 'A') from a larger dataset and wanted to apply a function to the group in order to verify the results of the function. The problem is, when I apply the function to a group that has only one distinct value (df_false), pandas returns a
ValueError: Expected a 1D array, got an array with shape (6, 6)
When I apply the same function to a df that has more than one distinct value in the grouping column (df_ok), the error doesn't appear.
Does anyone know how to deal with that?
import pandas as pd
df_false = pd.DataFrame({'A' : ['a', 'a', 'a', 'a', 'a', 'a'],
'B': [10,10,20,20,30,10],
'C': [10,10,20,30,10,5]})
df_ok = pd.DataFrame({'A' : ['a', 'a', 'a', 'a', 'a', 'c'],
'B': [10,10,20,20,30,10],
'C': [10,10,20,30,10,5]})
display(df_false)
def myf(x):
    y = []
    for i in x.iterrows():
        y.append(len(x))
    return pd.Series(y)
df_false['result'] = df_false.groupby('A').apply(myf).reset_index(drop=True)
display(df_false)
The issue is that your code with df_false outputs a DataFrame (of a single row). You can force it into a Series with squeeze:
df_false['result'] = (df_false.groupby('A').apply(myf)
.reset_index(drop=True).squeeze()
)
That said, unless this was a dummy example, you should rather use vectorized code:
df_false['result'] = df_false.groupby('A')['A'].transform('size')
output:
A B C result
0 a 10 10 6
1 a 10 10 6
2 a 20 20 6
3 a 20 30 6
4 a 30 10 6
5 a 10 5 6

How to find similarities in two dataframes and extract its value [duplicate]

I have a dataframe df1 which looks like:
c k l
0 A 1 a
1 A 2 b
2 B 2 a
3 C 2 a
4 C 2 d
and another called df2 like:
c l
0 A b
1 C a
I would like to filter df1, keeping only the values that ARE NOT in df2. The values to filter out are the (A,b) and (C,a) tuples. So far I have tried the isin method:
d = df1[~(df1['l'].isin(df2['l']) & df1['c'].isin(df2['c']))]
That seems too complicated to me, and it returns:
c k l
2 B 2 a
4 C 2 d
but I'm expecting:
c k l
0 A 1 a
2 B 2 a
4 C 2 d
You can do this efficiently using isin on a multiindex constructed from the desired columns:
df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
'k': [1, 2, 2, 2, 2],
'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
'l': ['b', 'a']})
keys = list(df2.columns.values)
i1 = df1.set_index(keys).index
i2 = df2.set_index(keys).index
df1[~i1.isin(i2)]
I think this improves on @IanS's similar solution because it doesn't assume any column type (i.e. it will work with numbers as well as strings).
(Above answer is an edit. Following was my initial answer)
Interesting! This is something I haven't come across before... I would probably solve it by merging the two arrays, then dropping rows where df2 is defined. Here is an example, which makes use of a temporary array:
df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
'k': [1, 2, 2, 2, 2],
'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
'l': ['b', 'a']})
# create a column marking df2 values
df2['marker'] = 1
# join the two, keeping all of df1's indices
joined = pd.merge(df1, df2, on=['c', 'l'], how='left')
joined
# extract desired columns where marker is NaN
joined[pd.isnull(joined['marker'])][df1.columns]
There may be a way to do this without using the temporary array, but I can't think of one. As long as your data isn't huge the above method should be a fast and sufficient answer.
This is pretty succinct and works well (assuming both indexes have already been set to the key columns, e.g. with set_index(keys) as in the answer above):
df1 = df1[~df1.index.isin(df2.index)]
Using DataFrame.merge & DataFrame.query:
A more elegant method would be to do left join with the argument indicator=True, then filter all the rows which are left_only with query:
d = (
df1.merge(df2,
on=['c', 'l'],
how='left',
indicator=True)
.query('_merge == "left_only"')
.drop(columns='_merge')
)
print(d)
c k l
0 A 1 a
2 B 2 a
4 C 2 d
indicator=True returns a dataframe with an extra column _merge which marks each row left_only, both, right_only:
df1.merge(df2, on=['c', 'l'], how='left', indicator=True)
c k l _merge
0 A 1 a left_only
1 A 2 b both
2 B 2 a left_only
3 C 2 a both
4 C 2 d left_only
I think this is quite a simple approach when you want to filter a dataframe based on multiple columns from another dataframe or even based on a custom list.
df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
'k': [1, 2, 2, 2, 2],
'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
'l': ['b', 'a']})
#values of df2 columns 'c' and 'l' that will be used to filter df1
idxs = list(zip(df2.c.values, df2.l.values)) #[('A', 'b'), ('C', 'a')]
#so df1 is filtered based on the values present in columns c and l of df2 (idxs)
df1 = df1[pd.Series(list(zip(df1.c, df1.l)), index=df1.index).isin(idxs)]
How about:
df1['key'] = df1['c'] + df1['l']
d = df1[~df1['key'].isin(df2['c'] + df2['l'])].drop(['key'], axis=1)
Another option that avoids creating an extra column or doing a merge would be to do a groupby on df2 to get the distinct (c, l) pairs and then just filter df1 using that.
gb = df2.groupby(["c", "l"]).groups
df1[[p not in gb for p in zip(df1['c'], df1['l'])]]
For this small example, it actually seems to run a bit faster than the pandas-based approach (666 µs vs. 1.76 ms on my machine), but I suspect it could be slower on larger examples since it's dropping into pure Python.
You can concatenate both DataFrames and drop all duplicates:
df1.append(df2).drop_duplicates(subset=['c', 'l'], keep=False)
Output:
c k l
0 A 1.0 a
2 B 2.0 a
4 C 2.0 d
This method doesn't work if df1 itself contains duplicate rows on subset=['c', 'l'].
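Note that DataFrame.append was deprecated and later removed (pandas 2.0), so a concat-based sketch of the same idea, using this question's df1/df2, would be:
pd.concat([df1, df2]).drop_duplicates(subset=['c', 'l'], keep=False)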

How to drop rows based on column value if column is not set as index in pandas?

I have a list and a dataframe which look like this:
list = ['a', 'b']
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]})
I would like to do something like this:
df1 = df.drop([x for x in list])
I am getting the following error message:
"KeyError: "['a' 'b'] not found in axis""
I know I can do the following:
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]}).set_index('A')
df1 = df.drop([x for x in list])
How can I drop the list values without having to set column 'A' as index? My dataframe has multiple columns.
Input:
A B
0 a 9
1 b 9
2 c 8
3 d 4
Code:
for i in list:
    ind = df[df['A'] == i].index.tolist()
    df = df.drop(ind)
df
Output:
A B
2 c 8
3 d 4
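If you'd rather avoid the loop, a boolean mask with isin does the same row-drop in one step (a sketch, assuming the values to remove live in column 'A'; vals stands in for the asker's list):
vals = ['a', 'b']
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': [9, 9, 8, 4]})
# keep only the rows whose 'A' value is not in vals
df1 = df[~df['A'].isin(vals)]
#    A  B
# 2  c  8
# 3  d  4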
You need to specify the correct axis, namely axis=1.
According to the docs:
axis{0 or ‘index’, 1 or ‘columns’}, default 0. Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
You also need to make sure you are using the correct column names (i.e. uppercase rather than lowercase). You should also avoid shadowing Python built-in names, so don't call your list list.
This should work:
myList = ['A', 'B']
df1 = df.drop([x for x in myList], axis=1)

Finding the Location of the Duplicate for Duplicated Columns in Pandas

I know I can find duplicate columns using:
df.T.duplicated()
What I'd like to know is which column a duplicate column is a duplicate of. For example, both C and D are duplicates of A below:
df = pd.DataFrame([[1,0,1,1], [2,0,2,2]], columns=['A', 'B', 'C', 'D'])
A B C D
0 1 0 1 1
1 2 0 2 2
I'd like something like:
duplicate_index = pd.Series([None, None, 'A', 'A'], ['A', 'B', 'C', 'D'])
I don't know if duplicated has an option to give information about the first occurrence with the same data. My idea is to use groupby and transform, such as:
arr_first = (df.T.reset_index().groupby([col for col in df.T.columns])['index']
.transform(lambda x: x.iloc[0]).values)
With your example, arr_first is then equal to array(['A', 'B', 'A', 'A'], dtype=object) and because they have the same order than df.columns, to get the expected output, you use np.where like:
duplicate_index = pd.Series(np.where(arr_first != df.columns, arr_first, None), df.columns)
and the result for duplicate_index is
A None
B None
C A
D A
dtype: object
Another, more direct way to test whether two numeric columns are duplicates of each other is to check the correlation matrix, which covers all pairs of columns. Here is the code:
import pandas as pd
df = pd.DataFrame([[1,0,1,1], [2,0,2,2]], columns=['A', 'B', 'C', 'D'])
# compute the correlation matrix
cm = df.corr()
cm
This shows a matrix of the correlation of every column with every other column (including itself). If a column is an exact duplicate of another, the corresponding value is 1.0 (note that any perfectly linearly related pair also gives 1.0, and a constant column such as B yields NaN).
To find all columns that are duplicates of A, then :
cm['A']
A 1.0
B NaN
C 1.0
D 1.0
If you have categorical (string objects) and not numeric, you could make a cross correlation table.
Hope this helps!
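As a further illustration (not taken from the answers above), here is a plain-Python sketch that maps each column to the first earlier column holding exactly the same values, which reproduces the duplicate_index asked for; it assumes the column values are hashable:
import pandas as pd
df = pd.DataFrame([[1, 0, 1, 1], [2, 0, 2, 2]], columns=['A', 'B', 'C', 'D'])
first_seen = {}   # maps a tuple of column values to the first column that had them
result = {}
for col in df.columns:
    key = tuple(df[col])
    result[col] = first_seen.get(key)   # None if no earlier column matches
    first_seen.setdefault(key, col)
duplicate_index = pd.Series(result)
# A    None
# B    None
# C       A
# D       A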

