I am trying to multiply dataframe 1 column a by dataframe 2 column b.
combineQueryandBookFiltered['pnlValue'] = np.multiply(combineQueryandBookFiltered['pnlValue'], df_fxrate['fx_rate'])
The pnlValue column has many numbers, and the fx_rate column is just one number.
The code executes, but my end result ends up with tons of NaN.
Any help would be appreciated.
It is probably due to the index of your dataframe. You need to use df_fxrate['fx_rate'].values:
combineQueryandBookFiltered['pnlValue'] = np.multiply(combineQueryandBookFiltered['pnlValue'], df_fxrate['fx_rate'].values)
or better:
combineQueryandBookFiltered['pnlValue'] = combineQueryandBookFiltered['pnlValue'] * df_fxrate['fx_rate'].values
I show you an example:
import numpy as np
import pandas as pd

df1=pd.DataFrame(index=[1, 2])
df2=pd.DataFrame(index=[0])
df1['col1']=[1,1]
print(df1)
col1
1 1
2 1
df2['col1']=[1]
print(df2)
col1
0 1
print(np.multiply(df1['col1'],df2['col1']))
0 NaN
1 NaN
2 NaN
As you can see, the multiplication is done according to the index.
So you need something like this:
np.multiply(df1['col1'],df2['col1'].values)
or
df1['col1']*df2['col1'].values
Output:
1 1
2 1
Name: col1, dtype: int64
As you can see, now only the df1['col1'] series index is used.
Hi excelguy,
Is there a reason why you can't use the simple column multiplication?
df['C'] = df['A'] * df['B']
As was pointed out, multiplications of two series are based on their indices and it's likely that your fx_rate series does not have the same indices as the pnlValue series.
But since your fx_rate is only one value, I suggest multiplying your dataframe with a scalar instead:
fx_rate = df_fxrate['fx_rate'].iloc[0]
combineQueryandBookFiltered['pnlValue'] = combineQueryandBookFiltered['pnlValue'] * fx_rate
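A quick sketch with toy numbers (the values and indexes here are made up for illustration) showing why the scalar route sidesteps index alignment entirely:
import pandas as pd

pnl = pd.DataFrame({'pnlValue': [100.0, 250.0, -40.0]}, index=[7, 8, 9])
fx = pd.DataFrame({'fx_rate': [1.25]})  # single-row frame, index [0]

rate = fx['fx_rate'].iloc[0]              # pull the lone value out as a scalar
pnl['pnlValue'] = pnl['pnlValue'] * rate  # scalar broadcast, no index matching
print(pnl)  # 125.0, 312.5, -50.0 -- no NaN despite the mismatched indexes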
I have a dataframe, and I would like to select a subset of the dataframe using both index and column values. I can do both of these separately, but cannot figure out the syntax to do them simultaneously. Example:
import pandas as pd
# sample dataframe:
cid=[1,2,3,4,5,6,17,18,91,104]
c1=[1,2,3,1,2,3,3,4,1,3]
c2=[0,0,0,0,1,1,1,1,0,1]
df=pd.DataFrame(list(zip(c1,c2)),columns=['col1','col2'],index=cid)
df
Returns:
col1 col2
1 1 0
2 2 0
3 3 0
4 1 0
5 2 1
6 3 1
17 3 1
18 4 1
91 1 0
104 3 1
Using .loc, I can collect by index:
rel_index=[5,6,17]
relc1=[2,3]
relc2=[1]
df.loc[rel_index]
Returns:
col1 col2
5 2 1
6 3 1
17 3 1
Or I can select by column values:
df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
Returning:
col1 col2
5 2 1
6 3 1
17 3 1
104 3 1
However, I cannot do both. When I try the following:
df.loc[rel_index,df['col1'].isin(relc1) & df['col2'].isin(relc2)]
Returns:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
I have tried a few other variations (such as "&" instead of the ","), but these return the same or other errors.
Once I collect this slice, I am hoping to reassign values on the main dataframe. I imagine this will be trivial once the above is done, but I note it here in case it is not. My goal is to assign something like df2 in the following:
c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)
to the slice referenced by index and multiple column conditions (overwriting what was in the original dataframe).
The reason for the IndexingError is that you're calling df.loc with indexers of two different sizes: rel_index has a length of 3, whereas df['col1'].isin(relc1) has a length of 10.
You need the index results to also have a length of 10. If you look at the output of df['col1'].isin(relc1), it is an array of booleans.
You can achieve a similar array with the proper length by replacing df.loc[rel_index] with df.index.isin([5,6,17])
so you end up with:
df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)]
which returns:
col1 col2
5 2 1
6 3 1
17 3 1
That said, I'm not sure why your index would ever look like this. Typically when slicing by index you would use df.iloc, and your index would match the 0, 1, 2, ... etc. format.
Alternatively, you could first search by value - then assign the resulting dataframe to a new variable df2
df2 = df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
then df2.loc[rel_index] would work without issue.
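A minimal sketch of that two-step route, using the question's data:
df2 = df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
print(df2.loc[rel_index])
#     col1  col2
# 5      2     1
# 6      3     1
# 17     3     1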
As for your overall goal, you can simply do the following:
c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)
df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)] = df2
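The assignment works because .loc aligns df2 on both index and columns, so only rows 5, 6 and 17 are overwritten; a quick check afterwards:
print(df.loc[rel_index])
#     col1  col2
# 5      1     5
# 6      2     6
# 17     3     7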
@Rexovas explains it quite well; this is an alternative where you can compute the filters on the index before assigning. It is a bit long and involves a MultiIndex, but once you get your head around MultiIndex it should be intuitive:
(df
 # move columns into the index
 .set_index(['col1', 'col2'], append=True)
 # filter based on the index
 .loc(axis=0)[rel_index, relc1, relc2]
 # return cols 1 and 2
 .reset_index(level=[-2, -1])
 # assign values
 .assign(col1=c3, col2=c4)
)
col1 col2
5 1 5
6 2 6
17 3 7
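Note that the chain returns a new frame rather than modifying df. If you want the changes in the original, one option (my sketch, not part of the answer above) is to capture the result and assign it back by label:
out = (df
       .set_index(['col1', 'col2'], append=True)
       .loc(axis=0)[rel_index, relc1, relc2]
       .reset_index(level=[-2, -1])
       .assign(col1=c3, col2=c4))
df.loc[out.index, ['col1', 'col2']] = out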
I have an index in a pandas dataframe which repeats the index value. I want to re-index it as a multi-index where the repeated index values are grouped.
The index looks like this (the question's screenshot is omitted here): a value such as 112335586 is repeated over several consecutive rows, and I would like all of the 112335586 rows to be grouped under the same index.
I have looked at this question, Create pandas dataframe by repeating one row with new multiindex, but there the index values can be pre-defined, which is not possible here as my dataframe is far too large to hard-code.
I also looked at the multi-index documentation, but this also pre-defines the values for the index.
I believe you need:
import pandas as pd

s = pd.Series([1,2,3,4], index=[10,10,20,20])
s.index.name = 'EVENT_ID'
print (s)
EVENT_ID
10 1
10 2
20 3
20 4
dtype: int64
s1 = s.index.to_series()
s2 = s1.groupby(s1).cumcount()
s.index = [s.index, s2]
print (s)
EVENT_ID
10 0 1
1 2
20 0 3
1 4
dtype: int64
Try this:
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID','sub_idx'], inplace=True)
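Applied to toy data like the answer above (a minimal sketch, with a one-column frame so there is a value column to carry along):
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3, 4]},
                  index=pd.Index([10, 10, 20, 20], name='EVENT_ID'))
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID', 'sub_idx'], inplace=True)
print(df)
#                   val
# EVENT_ID sub_idx
# 10       0          1
#          1          2
# 20       0          3
#          1          4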
I have a series and df
s = pd.Series([1,2,3,5])
df = pd.DataFrame()
When I add columns to df like this
df.loc[:, "0-2"] = s.iloc[0:3]
df.loc[:, "1-3"] = s.iloc[1:4]
I get df
0-2 1-3
0 1 NaN
1 2 2.0
2 3 3.0
Why am I getting NaN? I tried creating a new series with the correct indexes, but adding it to df still causes NaN.
What I want is
0-2 1-3
0 1 2
1 2 3
2 3 5
Try either of the following lines.
df.loc[:, "1-3"] = s.iloc[1:4].values
# -OR-
df.loc[:, "1-3"] = s.iloc[1:4].reset_index(drop=True)
Your original code is trying, unsuccessfully, to match the index of the dataframe df to the index of the subset series s.iloc[1:4]. When it can't find the index 0 in the series, it places a NaN value in df at that location. You can get around this either by keeping only the values, so it doesn't try to match on the index, or by resetting the index on the subset series.
>>> s.iloc[1:4]
1 2
2 3
3 5
dtype: int64
Notice the index values since the original, unsubset series is the following.
>>> s
0 1
1 2
2 3
3 5
dtype: int64
The index of the first row in df is 0. By dropping the indices with the values call, you bypass the index matching which is producing the NaN. By resetting the index in the second option, you make the indices the same.
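To make that concrete, here is what the second option produces on the slice:
>>> s.iloc[1:4].reset_index(drop=True)
0    2
1    3
2    5
dtype: int64
The values are the same, but the index now starts at 0 and lines up with df.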
I have created a function that replaces the NaNs in a Pandas dataframe with the means of the respective columns. I tested the function with a small dataframe and it worked. However, when I applied it to a much larger dataframe (30,000 rows, 9 columns), I got the error message: IndexError: index out of bounds
The function is the following:
# The 'update' function will replace all the NaNs in a dataframe with the mean of the respective columns
def update(df):  # the function takes one argument, the dataframe that will be updated
    ncol = df.shape[1]  # number of columns in the dataframe
    for i in range(0, ncol):  # loops over all the columns
        # subsets the df using the isnull() method, extracting the positions
        # in each column where the values are missing
        df.iloc[:,i][df.isnull().iloc[:, i]] = df.mean()[i]
    return df
The small dataframe I used to test the function is the following:
0 1 2 3
0 NaN NaN 3 4
1 NaN NaN 7 8
2 9.0 10.0 11 12
Could you explain the error? Your advice will be appreciated.
I would use the DataFrame.fillna() method in conjunction with the DataFrame.mean() method:
In [130]: df.fillna(df.mean())
Out[130]:
0 1 2 3
0 9.0 10.0 3 4
1 9.0 10.0 7 8
2 9.0 10.0 11 12
Mean values:
In [138]: df.mean()
Out[138]:
0 9.0
1 10.0
2 7.0
3 8.0
dtype: float64
The reason you are getting "index out of bounds" is that you are looking up df.mean()[i], where i is an ordinal position rather than a label. df.mean() is a Series whose indices are the columns of df, so df.mean()[something] implies that something had better be a column name. Your loop counters aren't column names, and that's why you get the error.
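A small sketch (the toy frame and column names are mine, just for illustration) showing that df.mean() is indexed by column labels, so positional access has to go through .iloc:
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [1.0, np.nan], 'qty': [3.0, 4.0]})
means = df.mean()
print(means.index.tolist())  # ['price', 'qty'] -- labels, not 0..n-1
print(means['price'])        # label lookup works: 1.0
print(means.iloc[0])         # positional lookup must use .iloc: 1.0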
Your code... fixed:
def update(df):  # the function takes one argument, the dataframe that will be updated
    ncol = df.shape[1]  # number of columns in the dataframe
    for i in range(0, ncol):  # loops over all the columns
        # subsets the df using the isnull() method, extracting the positions
        # in each column where the values are missing
        df.iloc[:,i][df.isnull().iloc[:, i]] = df.mean().iloc[i]
    return df
Also, your function is altering the df directly. You may want to be careful. I'm not sure that's what you intended.
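If mutating the input is not what you want, one simple guard (my suggestion, not part of the original function) is to hand the function a copy:
filled = update(df.copy())  # the original df is left untouched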
All that said, I'd recommend another approach:
def update(df):
    return df.where(df.notnull(), df.mean(), axis=1)
You could use any number of methods to fill missing values with the mean. I'd suggest using @MaxU's answer.
df.where
takes values from df where the first argument is True, otherwise from the second argument:
df.where(df.notnull(), df.mean(), axis=1)
df.combine_first with awkward pandas broadcasting
df.combine_first(pd.DataFrame([df.mean()], df.index))
np.where
pd.DataFrame(
    np.where(
        df.notnull(), df.values,
        np.nanmean(df.values, 0, keepdims=1)),
    df.index, df.columns)
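On the toy frame above, all three variants should agree with df.fillna(df.mean()); a quick sanity check:
a = df.fillna(df.mean())
b = df.where(df.notnull(), df.mean(), axis=1)
print(a.eq(b).all().all())  # True -- every element matches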
I had a problem and found a solution, but I feel it's the wrong way to do it. Maybe there is a more 'canonical' way.
I already had an answer for a really similar problem, but here the dataframes do not have the same number of rows. Sorry for the "double post", but the first one is still valid, so I think it's better to make a new one.
Problem
I have two dataframes that I would like to merge without adding extra columns and without erasing existing info. Example:
Existing dataframe (df)
A A2 B
0 1 4 0
1 2 5 1
2 2 5 1
Dataframe to merge (df2)
A A2 B
0 1 4 2
1 3 5 2
I would like to update df with df2 wherever columns 'A' and 'A2' correspond.
The result would be :
A A2 B
0 1 4 2 <= Update value ONLY
1 2 5 1
2 2 5 1
Here is my solution, but I think it's not a really good one.
import pandas as pd
df = pd.DataFrame([[1,4,0],[2,5,1],[2,5,1]],columns=['A','A2','B'])
df2 = pd.DataFrame([[1,4,2],[3,5,2]],columns=['A','A2','B'])
df = df.merge(df2,on=['A', 'A2'],how='left')
df['B_y'].fillna(0, inplace=True)
df['B'] = df['B_x']+df['B_y']
df = df.drop(['B_x','B_y'], axis=1)
print(df)
I tried this solution:
rows = (df[['A','A2']] == df2[['A','A2']]).all(axis=1)
df.loc[rows,'B'] = df2.loc[rows,'B']
But I get this error because of the differing number of rows:
ValueError: Can only compare identically-labeled DataFrame objects
Does anyone have a better way to do this?
Thanks!
I think you can use DataFrame.isin to check which rows are the same in both DataFrames. Then create NaN by the mask, fill them with combine_first, and last cast to int:
mask = df[['A', 'A2']].isin(df2[['A', 'A2']]).all(1)
print (mask)
0 True
1 False
2 False
dtype: bool
df.B = df.B.mask(mask).combine_first(df2.B).astype(int)
print (df)
A A2 B
0 1 4 2
1 2 5 1
2 2 5 1
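To see the intermediate step: before the assignment, masking turns the matched row into NaN, which combine_first then fills from df2 by index:
print(df.B.mask(mask))
# 0    NaN
# 1    1.0
# 2    1.0
# Name: B, dtype: float64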
With a minor tweak in the way in which the boolean mask gets created, you can get it to work. Note that the mask only covers the first df2.shape[0] rows of df, so it has to be mapped back onto df's index before it can be used with .loc:
cols = ['A', 'A2']
# Slice df to match the shape of the other dataframe and compare elementwise
rows = (df[cols].values[:df2.shape[0]] == df2[cols].values).all(1)
# 'rows' is only as long as df2, so pick the matching labels from df's index
df.loc[df.index[:df2.shape[0]][rows], 'B'] = df2.loc[rows, 'B'].values
df