I have a DataFrame in which each row contains exactly one True value (like the df below). How can I create a column holding the column index that is True for each row? In real life the matrix is big, so I want to avoid for loops and apply.
import pandas as pd

df = pd.DataFrame([[True, False, False],
                   [True, False, False],
                   [False, True, False],
                   [False, False, True],
                   [True, False, False]])
The answer should be something like:
df['TrueColIndex'] = [0, 0, 1, 2, 0]
Thanks!
This is idxmax:
df.idxmax(axis=1)
Out[444]:
0 0
1 0
2 1
3 2
4 0
dtype: int64
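Note that idxmax returns column labels, not positions; the two coincide here only because the columns are the integers 0, 1, 2. A minimal sketch (with made-up string column names) of how to get positional indices instead, using numpy's argmax, which returns the position of the first maximal value (True > False):

import numpy as np
import pandas as pd

df = pd.DataFrame([[True, False, False],
                   [False, True, False]],
                  columns=['a', 'b', 'c'])

print(df.idxmax(axis=1))                 # column labels: a, b
print(np.argmax(df.to_numpy(), axis=1))  # positions: [0 1]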
I am trying to multiply dataframe 1 column a by dataframe 2 column b.
combineQueryandBookFiltered['pnlValue'] = np.multiply(combineQueryandBookFiltered['pnlValue'], df_fxrate['fx_rate'])
The pnlValue column has many numbers and the fx_rate column is just one number.
The code executes, but the end result ends up with tons of NaN.
Any help would be appreciated.
It is probably due to the index of your dataframe. You need to use df_fxrate['fx_rate'].values:
combineQueryandBookFiltered['pnlValue'] = np.multiply(combineQueryandBookFiltered['pnlValue'], df_fxrate['fx_rate'].values)
or better:
combineQueryandBookFiltered['pnlValue'] = combineQueryandBookFiltered['pnlValue'] * df_fxrate['fx_rate'].values
Here is an example:
df1 = pd.DataFrame(index=[1, 2])
df2 = pd.DataFrame(index=[0])
df1['col1'] = [1, 1]
print(df1)
col1
1 1
2 1
df2['col1'] = [1]
print(df2)
col1
0 1
print(np.multiply(df1['col1'], df2['col1']))
0 NaN
1 NaN
2 NaN
As you can see, the multiplication is aligned on the index.
So you need something like this:
np.multiply(df1['col1'], df2['col1'].values)
or
df1['col1']*df2['col1'].values
Output:
1 1
2 1
Name: col1, dtype: int64
As you can see, now only the index of the df1['col1'] Series is used.
Hi excelguy,
Is there a reason why you can't use the simple column multiplication?
df['C'] = df['A'] * df['B']
As was pointed out, multiplication of two Series aligns on their indices, and it's likely that your fx_rate series does not have the same index as the pnlValue series.
But since your fx_rate is only one value, I suggest multiplying your dataframe with a scalar instead:
fx_rate = df_fxrate['fx_rate'].iloc[0]
combineQueryandBookFiltered['pnlValue'] = combineQueryandBookFiltered['pnlValue'] * fx_rate
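A minimal end-to-end sketch of the scalar approach, with made-up frames standing in for the real ones (the index values deliberately differ to show that no NaN appears):

import pandas as pd

pnl = pd.DataFrame({'pnlValue': [100.0, 250.0, -75.0]}, index=[5, 6, 7])
fx = pd.DataFrame({'fx_rate': [1.25]})  # single row, index 0

rate = fx['fx_rate'].iloc[0]  # a plain Python float, so nothing to align
pnl['pnlValue'] = pnl['pnlValue'] * rate
print(pnl)  # 125.0, 312.5, -93.75 -- no NaN despite the mismatched indices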
I have a pandas DataFrame whose index repeats values. I want to re-index it as a MultiIndex in which the repeated values are grouped.
The index contains repeated values such as 112335586, and I would like all the 112335586 rows to be grouped under the same outer index value.
I have looked at this question, Create pandas dataframe by repeating one row with new multiindex, but there the index values are pre-defined, which is not possible here as my dataframe is far too large to hard-code them.
I also looked at the MultiIndex documentation, but it likewise pre-defines the index values.
I believe you need:
s = pd.Series([1,2,3,4], index=[10,10,20,20])
s.index.name = 'EVENT_ID'
print (s)
EVENT_ID
10 1
10 2
20 3
20 4
dtype: int64
s1 = s.index.to_series()
s2 = s1.groupby(s1).cumcount()  # running counter within each repeated index value
s.index = [s.index, s2]  # build the MultiIndex (EVENT_ID, counter)
print (s)
EVENT_ID
10 0 1
1 2
20 0 3
1 4
dtype: int64
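Once the MultiIndex is in place, selecting every row for one EVENT_ID is a plain .loc lookup, for example:

print(s.loc[10])
0    1
1    2
dtype: int64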
Try this:
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID','sub_idx'], inplace=True)
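A quick check of this approach on a small made-up frame (assuming the index is named EVENT_ID, as above):

import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3, 4]},
                  index=pd.Index([10, 10, 20, 20], name='EVENT_ID'))
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID', 'sub_idx'], inplace=True)
print(df)
                  val
EVENT_ID sub_idx
10       0          1
         1          2
20       0          3
         1          4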
I had a problem and found a solution, but I feel it's the wrong way to do it. Maybe there is a more 'canonical' way.
I already had an answer for a really similar problem, but here the dataframes do not have the same number of rows. Sorry for the "double post", but the first one is still valid, so I think it's better to make a new one.
Problem
I have two dataframes that I would like to merge without adding extra columns and without erasing existing info. Example:
Existing dataframe (df)
A A2 B
0 1 4 0
1 2 5 1
2 2 5 1
Dataframe to merge (df2)
A A2 B
0 1 4 2
1 3 5 2
I would like to update df with df2 where columns 'A' and 'A2' correspond.
The result would be :
A A2 B
0 1 4 2 <= Update value ONLY
1 2 5 1
2 2 5 1
Here is my solution, but I think it's not a really good one.
import pandas as pd
df = pd.DataFrame([[1,4,0],[2,5,1],[2,5,1]],columns=['A','A2','B'])
df2 = pd.DataFrame([[1,4,2],[3,5,2]],columns=['A','A2','B'])
df = df.merge(df2,on=['A', 'A2'],how='left')
df['B_y'].fillna(0, inplace=True)
df['B'] = df['B_x']+df['B_y']
df = df.drop(['B_x','B_y'], axis=1)
print(df)
I tried this solution:
rows = (df[['A','A2']] == df2[['A','A2']]).all(axis=1)
df.loc[rows,'B'] = df2.loc[rows,'B']
But I get this error because of the mismatched number of rows:
ValueError: Can only compare identically-labeled DataFrame objects
Does anyone have a better way to do this?
Thanks!
I think you can use DataFrame.isin to check which rows are the same in both DataFrames (note that with a DataFrame argument, isin compares cells at matching row and column labels, not set membership). Then use the mask to set the matched rows' B to NaN, fill them from df2.B with combine_first, and last cast to int:
mask = df[['A', 'A2']].isin(df2[['A', 'A2']]).all(1)
print (mask)
0 True
1 False
2 False
dtype: bool
df.B = df.B.mask(mask).combine_first(df2.B).astype(int)
print (df)
A A2 B
0 1 4 2
1 2 5 1
2 2 5 1
With a minor tweak in the way the boolean mask gets created, you can get it to work; note this compares df2 against the first len(df2) rows of df, so it assumes the candidate rows line up positionally:
import numpy as np

cols = ['A', 'A2']
n = len(df2)
# Compare the first n rows of df with df2 elementwise
matched = (df[cols].values[:n] == df2[cols].values).all(1)
# Pad the mask to the full length of df before using it with .loc
rows = np.append(matched, np.zeros(len(df) - n, dtype=bool))
df.loc[rows, 'B'] = df2.loc[matched, 'B'].values
df
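For reference, one more sketch along the 'canonical' line the question asks about: key both frames by ('A', 'A2'), reindex df2's B onto df's keys, and fill the gaps from the original column (np.where sidesteps any duplicate-key alignment issues, since df's (2, 5) key repeats):

import numpy as np

keyed = df.set_index(['A', 'A2'])
upd = df2.set_index(['A', 'A2'])['B'].reindex(keyed.index)  # NaN where no match
keyed['B'] = np.where(upd.notna(), upd, keyed['B']).astype(int)
df = keyed.reset_index()
print(df)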
What is the best way to figure out how two dataframes differ based on a combination of multiple columns? So if I have the following:
df1:
A B C
0 1 2 3
1 3 4 2
df2:
A B C
0 1 2 3
1 3 5 2
I want to show all rows where there is a difference, such as (3,4,2) vs. (3,5,2) in the example above. I've tried pd.merge(), thinking that if I used all columns as the key in an outer join I would end up with a dataframe that would help me get what I want, but it doesn't turn out that way.
Thanks to EdChum I was able to use a mask from a boolean diff as below, but first had to make sure the indexes were comparable.
df1 = df1.set_index('A')
df2 = df2.set_index('A')  # this gave me a nice index using one of the keys
# if there are different rows then I would get nulls
df1 = df1.reindex_like(df2)
df1[~(df1 == df2).all(axis=1)]  # this gave me all rows that differed
We can use .all and pass axis=1 to perform row comparisons; we can then use this boolean index to show the rows that differ by negating it with ~:
In [43]:
df[~(df==df1).all(axis=1)]
Out[43]:
A B C
1 3 4 2
breaking this down:
In [44]:
df==df1
Out[44]:
A B C
0 True True True
1 True False True
In [45]:
(df==df1).all(axis=1)
Out[45]:
0 True
1 False
dtype: bool
We can then pass the above as a boolean index to df and invert it using ~.
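As a side note, on pandas 1.1 or newer, DataFrame.compare gives a built-in cell-level view of the same information, assuming both frames have identical shapes and labels; the output looks roughly like:

df.compare(df1)
      B
   self other
1     4     5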
I have a pandas DataFrame df with a list of unique ids id, and a DataFrame master_df with a master list of all known ids, master_df.id. I'm trying to figure out the best way to perform an isin that also returns the index where the value is located. So if
master_df was
index id
1 1
2 2
3 3
and df was
index id
1 3
2 4
3 1
I want something like (3, False, 1).
I'm currently doing an isin and then brute-forcing the lookup with a loop, but I'm sure there is a much better way to do it.
One way is to do a merge:
In [11]: df.merge(mdf, on='id', how='left')
Out[11]:
index_x id index_y
0 1 3 3
1 2 4 NaN
2 3 1 1
and column index_y is the desired result*:
In [12]: df.merge(mdf, on='id', how='left').index_y
Out[12]:
0 3
1 NaN
2 1
Name: index_y, dtype: float64
* Except for NaN vs. False, but I think NaN is what you really want here. As @DSM points out, in Python False == 0, so you may get into trouble using False to mean "missing" when an id could legitimately be found at index 0. (If you still want to do it, replace the NaN with 0 using .fillna(0).)
Note: it's possible it will be more efficient to just take the columns you care about:
df[['id']].merge(mdf[['id', 'index']], on='id', how='left')
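An alternative sketch that avoids the merge entirely, using Series.map with the master list as a lookup table (assuming the ids in mdf are unique):

df['id'].isin(mdf['id'])                    # the isin part: True, False, True
df['id'].map(mdf.set_index('id')['index'])  # matching index in mdf, NaN where absent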