Pandas Dataframe interpolating in sections delimited by indexes

My sample code is as follows:
import pandas as pd
dictx = {'col1':[1,'nan','nan','nan',5,'nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9]}
df = pd.DataFrame(dictx).astype(float)
I'm trying to interpolate various segments which contain the value 'nan'.
For context, I'm trying to track bus speeds using GPS data provided by the city (São Paulo, Brazil). The data is sparse and some parts are missing, as in the example. However, there are also segments where I know for a fact that the buses are stopped, such as at dawn, and that information comes as 'nan' as well.
What I need:
I've been experimenting with the dataframe.interpolate() parameters (limit and limit_direction) but came up short. If I set df.interpolate(limit=2), I interpolate not only the data that I need but also data that should stay missing. So I need to interpolate only within sections defined by a limit.
Desired output:
Out[7]:
col1 col2 col3
0 1.0 20.00 15.00
1 nan nan nan
2 nan nan nan
3 nan nan nan
4 5.0 22.00 10.00
5 6.0 23.50 12.00
6 7.0 25.00 14.00
7 8.0 27.50 13.50
8 9.0 30.00 13.00
9 nan nan nan
10 nan nan nan
11 nan nan nan
12 13.0 25.00 9.00
The logic that I've been trying to apply is basically to find the NaNs, calculate the difference between their indexes, create a new dataframe_temp to interpolate, and only then add it to another one, creating a new dataframe_final. But this has become hard to achieve because nan == nan returns False.

This is a hack but may still be useful. Likely Pandas 0.23 will have a better solution.
https://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#dataframe-interpolate-has-gained-the-limit-area-kwarg
df_fw = df.interpolate(limit=1)
df_bk = df.interpolate(limit=1, limit_direction='backward')
df_fw.where(df_bk.notna())
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
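To see why the forward/backward combination works, here is a minimal sketch on a single Series. The Series `s` and the names `fw`, `bk` and `combined` are made up for illustration. The forward pass fills only the first NaN of each gap and the backward pass only the last one, so a filled cell survives the mask only when both passes reached it, which happens only in gaps of length 1.

```python
import numpy as np
import pandas as pd

# One gap of three NaNs (should stay empty) and one gap of a single NaN
s = pd.Series([1, np.nan, np.nan, np.nan, 5, np.nan, 7])

fw = s.interpolate(limit=1)                               # fills the first NaN of each gap
bk = s.interpolate(limit=1, limit_direction="backward")   # fills the last NaN of each gap

# Keep forward-filled values only where the backward pass also produced a value
combined = fw.where(bk.notna())
print(combined.tolist())
```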
Not a Hack
More legitimate way of handling it.
Generalized to handle any limit.
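def interp(df, limit):
    # True where the window of limit + 1 rows ending here holds at least one value
    d = df.notna().rolling(limit + 1).agg(any).fillna(1)
    # a cell is kept only if every window that covers it holds a value
    d = pd.concat({
        i: d.shift(-i).fillna(1)
        for i in range(limit + 1)
    }).groupby(level=1).prod()  # replaces .prod(level=1), removed in pandas 2.0
    return df.interpolate(limit=limit).where(d.astype(bool))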
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
Can also handle variation in NaN from column to column. Consider a different df
dictx = {'col1':[1,'nan','nan','nan',5,'nan','nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan','nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9,'nan']}
df = pd.DataFrame(dictx).astype(float)
df
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN NaN NaN
6 NaN 25.0 14.0
7 7.0 NaN NaN
8 NaN NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 NaN
Then with limit=1
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN 23.5 12.0
6 NaN 25.0 14.0
7 7.0 NaN 13.5
8 8.0 NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 9.0
And with limit=2
df.pipe(interp, 2).round(2)
col1 col2 col3
0 1.00 20.00 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.00 22.00 10.0
5 5.67 23.50 12.0
6 6.33 25.00 14.0
7 7.00 26.67 13.5
8 8.00 28.33 13.0
9 9.00 30.00 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.00 25.00 9.0
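A different way to get a similar effect is to measure the length of every NaN run directly and interpolate only the runs that are short enough. The sketch below is not the method used above; the helper name `interpolate_short_gaps` and its `max_gap` parameter are hypothetical. It also assumes that leading and trailing NaNs should be left alone (via `limit_area="inside"`, available since pandas 0.23), so unlike the outputs above it will not fill a trailing NaN.

```python
import numpy as np
import pandas as pd

def interpolate_short_gaps(df, max_gap):
    """Linearly interpolate, per column, only NaN runs of at most max_gap rows."""
    def _one_column(s):
        is_na = s.isna()
        # Give every run of consecutive NaN / non-NaN cells its own id
        run_id = (is_na != is_na.shift()).cumsum()
        # Length of the run each cell belongs to
        run_len = is_na.groupby(run_id).transform("size")
        fillable = is_na & (run_len <= max_gap)
        # Interpolate every interior gap, then keep the result only where allowed
        filled = s.interpolate(limit_area="inside")
        return s.where(~fillable, filled)
    return df.apply(_one_column)

df = pd.DataFrame({"col1": [1, np.nan, np.nan, np.nan, 5, np.nan, 7, np.nan, 9]})
result = interpolate_short_gaps(df, max_gap=1)
print(result)
```

Working on run lengths keeps the rule explicit: a gap is either filled completely or not at all.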

Here is a way to selectively ignore rows which are consecutive runs of NaNs whose length is greater than a certain size (given by limit):
import numpy as np
import pandas as pd
dictx = {'col1':[1,'nan','nan','nan',5,'nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9]}
df = pd.DataFrame(dictx).astype(float)
limit = 2
notnull = pd.notnull(df).all(axis=1)
# assign group numbers to the rows of df. Each group starts with a non-null row,
# followed by null rows
group = notnull.cumsum()
# find the index of groups having length > limit
ignore = (df.groupby(group).filter(lambda grp: len(grp)>limit)).index
# only ignore rows which are null
ignore = df.loc[~notnull].index.intersection(ignore)
keep = df.index.difference(ignore)
# interpolate only the kept rows
df.loc[keep] = df.loc[keep].interpolate()
print(df)
prints
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
By changing the value of limit you can control how big the group has to be before it should be ignored.

This is a partial answer.
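for i in list(df):
    for x in range(len(df[i])):
        # NaN fails every comparison, so this catches NaN (and any value <= -100)
        if not df[i][x] > -100:
            df.loc[x, i] = 0  # .loc avoids chained assignment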
df
col1 col2 col3
0 1.0 20.0 15.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 5.0 22.0 10.0
5 0.0 0.0 0.0
6 7.0 25.0 14.0
7 0.0 0.0 0.0
8 9.0 30.0 13.0
9 0.0 0.0 0.0
10 0.0 0.0 0.0
11 0.0 0.0 0.0
12 13.0 25.0 9.0
Now,
df["col1"][1] == df["col2"][1]
True
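The loop above replaces NaN with 0 so that equality comparisons behave. A minimal sketch of the same idea with vectorised calls, assuming a small made-up Series `s`: since NaN never equals NaN, missing values are detected with `isna()` rather than `==`, and `fillna(0)` does the replacement without an explicit loop.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 5.0])

# NaN compares unequal to everything, including itself
nan_equals_nan = (np.nan == np.nan)

# Detect missing values with isna() instead of an equality test
mask = s.isna()

# Replace them in one call
zero_filled = s.fillna(0)
print(nan_equals_nan, mask.tolist(), zero_filled.tolist())
```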

Related

Is there a way to forward fill with ascending logic in pandas / numpy?

What is the most pandastic way to forward fill with ascending logic (without iterating over the rows)?
input:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['test'] = np.nan,np.nan,1,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,6,np.nan,np.nan
df['desired_output'] = np.nan,np.nan,1,1,1,3,3,3,3,3,6,6,6
print (df)
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
In the 'test' column, the number of consecutive NaN's is random.
In the 'desired_output' column, trying to forward fill with ascending values only. Also, when lower values are encountered (row 8, value = 2.0 above), they are overwritten with the current higher value.
Can anyone help? Thanks in advance.

You can combine cummax to select the cumulative maximum value and ffill to replace the NaNs:
df['desired_output'] = df['test'].cummax().ffill()
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
intermediate Series:
df['test'].cummax()
0 NaN
1 NaN
2 1.0
3 NaN
4 NaN
5 3.0
6 NaN
7 NaN
8 3.0
9 NaN
10 6.0
11 NaN
12 NaN
Name: test, dtype: float64
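Since the question also mentions numpy, here is a small sketch that checks the cummax/ffill result against the desired output and compares it with a NumPy-only variant. Using `np.fmax.accumulate` is my own suggestion, not part of the answer above: `np.fmax` returns the non-NaN operand when only one side is NaN, so accumulating it gives a running maximum that carries forward over gaps.

```python
import numpy as np
import pandas as pd

test = pd.Series([np.nan, np.nan, 1, np.nan, np.nan, 3, np.nan,
                  np.nan, 2, np.nan, 6, np.nan, np.nan])
expected = np.array([np.nan, np.nan, 1, 1, 1, 3, 3, 3, 3, 3, 6, 6, 6])

# pandas: running maximum (NaN positions stay NaN), then forward fill
with_pandas = test.cummax().ffill()

# NumPy: running maximum that ignores NaN once a number has been seen
with_numpy = np.fmax.accumulate(test.to_numpy())

print(np.allclose(with_pandas, expected, equal_nan=True))
print(np.allclose(with_numpy, expected, equal_nan=True))
```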

Setting Values with pandas DataFrame.loc

Consider I have a data frame :
>>> data
c0 c1 c2 _c1 _c2
0 0 1 2 18.0 19.0
1 3 4 5 NaN NaN
2 6 7 8 20.0 21.0
3 9 10 11 NaN NaN
4 12 13 14 NaN NaN
5 15 16 17 NaN NaN
I want to update the values in the c1 and c2 columns with the values in the _c1 and _c2 columns whenever those latter values are not NaN. Why won't the following work, and what is the correct way to do this?
>>> data.loc[~(data._c1.isna()),['c1','c2']]=data.loc[~(data._c1.isna()),['_c1','_c2']]
>>> data
c0 c1 c2 _c1 _c2
0 0 NaN NaN 18.0 19.0
1 3 4.0 5.0 NaN NaN
2 6 NaN NaN 20.0 21.0
3 9 10.0 11.0 NaN NaN
4 12 13.0 14.0 NaN NaN
5 15 16.0 17.0 NaN NaN
For completeness's sake I want the result to look like
>>> data.loc[~(data._c1.isna()),['c1','c2']]=data.loc[~(data._c1.isna()),['_c1','_c2']]
>>> data
c0 c1 c2 _c1 _c2
0 0 18.0 19.0 18.0 19.0
1 3 4.0 5.0 NaN NaN
2 6 20.0 21.0 20.0 21.0
3 9 10.0 11.0 NaN NaN
4 12 13.0 14.0 NaN NaN
5 15 16.0 17.0 NaN NaN

I recommend update after rename
df.update(df[['_c1','_c2']].rename(columns={'_c1':'c1','_c2':'c2'}))
df
Out[266]:
c0 c1 c2 _c1 _c2
0 0 18.0 19.0 18.0 19.0
1 3 4.0 5.0 NaN NaN
2 6 20.0 21.0 20.0 21.0
3 9 10.0 11.0 NaN NaN
4 12 13.0 14.0 NaN NaN
5 15 16.0 17.0 NaN NaN

You can use np.where:
df[['c1', 'c2']] = np.where(df[['_c1', '_c2']].notna(),
df[['_c1', '_c2']],
df[['c1', 'c2']])
print(df)
# Output:
c0 c1 c2 _c1 _c2
0 0 18.0 19.0 18.0 19.0
1 3 4.0 5.0 NaN NaN
2 6 20.0 21.0 20.0 21.0
3 9 10.0 11.0 NaN NaN
4 12 13.0 14.0 NaN NaN
5 15 16.0 17.0 NaN NaN
Update
Do you know by any chance WHY the above doesn't work the way I thought it would?
The column names on the left and right sides of your expression are different. Pandas aligns the right side to the target labels before assigning, so nothing matches and NaN is assigned, even though the shapes are the same.
# Left side of your expression
>>> data.loc[~(data._c1.isna()),['c1','c2']]
c1 c2 # <- note the column names
0 18.0 19.0
2 20.0 21.0
# Right side of your expression
>>> data.loc[~(data._c1.isna()),['_c1','_c2']]
_c1 _c2 # <- Your column names are different from the left side
0 18.0 19.0
2 20.0 21.0
How to solve it? Simply use .values on the right side. As the right side is then a plain array with no row/column index, Pandas uses the shape to set the values.
data.loc[~(data._c1.isna()),['c1','c2']] = \
data.loc[~(data._c1.isna()),['_c1','_c2']].values
print(data)
# Output:
c0 c1 c2 _c1 _c2
0 0 18.0 19.0 18.0 19.0
1 3 4.0 5.0 NaN NaN
2 6 20.0 21.0 20.0 21.0
3 9 10.0 11.0 NaN NaN
4 12 13.0 14.0 NaN NaN
5 15 16.0 17.0 NaN NaN
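The effect of label alignment can be reproduced on a small frame. This is a sketch with made-up data (three rows instead of six, and float columns to keep dtypes simple); `to_numpy()` is used in place of `.values`, which does the same job here.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "c0": [0, 3, 6],
    "c1": [1.0, 4.0, 7.0],
    "c2": [2.0, 5.0, 8.0],
    "_c1": [18.0, np.nan, 20.0],
    "_c2": [19.0, np.nan, 21.0],
})
mask = data["_c1"].notna()

# Stripping the labels with to_numpy() makes pandas assign by position
data.loc[mask, ["c1", "c2"]] = data.loc[mask, ["_c1", "_c2"]].to_numpy()
print(data)
```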

How to append a list each time to a new column of a CSV file?

I have a for loop which produces a python list in each of its iterations. I want to append the list to a new column in the CSV file in each iteration of the for loop. The CSV file should be created at the time of writing the first list to it.
The code producing the lists is similar to this code:
for a in range(1,10):
    b = list(range(1,a+1))
    print(b)
After the first iteration of the for loop, the CSV file should contain the first list and so on.
The CSV file after three iterations of the for loop should be similar to this.
col1 col2 col3
1 1 1
2 2 2
3 3 3
4 4
5
I don't necessarily want the headers for the columns.
Thank You All...

This might help you:
import pandas as pd
for a in range(1,10):
    b = list(range(1,a+1))
    if a==1:
        df = pd.DataFrame({a:b})
    else:
        df = df.merge(pd.DataFrame({a:b}), how='outer', left_index=True, right_index=True)
When you print the df you'll get this:
1 2 3 4 5 6 7 8 9
0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1
1 NaN 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2
2 NaN NaN 3.0 3.0 3.0 3.0 3.0 3.0 3
3 NaN NaN NaN 4.0 4.0 4.0 4.0 4.0 4
4 NaN NaN NaN NaN 5.0 5.0 5.0 5.0 5
5 NaN NaN NaN NaN NaN 6.0 6.0 6.0 6
6 NaN NaN NaN NaN NaN NaN 7.0 7.0 7
7 NaN NaN NaN NaN NaN NaN NaN 8.0 8
8 NaN NaN NaN NaN NaN NaN NaN NaN 9
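The merge above builds the DataFrame but does not write the file. Because CSV is row-oriented, a new column cannot simply be appended in place; one option, sketched here with only the standard library, is to keep the lists in memory and rewrite the whole file on each iteration. The helper name `write_columns` and the in-memory buffer standing in for a real file are assumptions for illustration.

```python
import csv
import io
from itertools import zip_longest

def write_columns(target, lists):
    """Write each list as one CSV column, padding short columns with blanks."""
    writer = csv.writer(target)
    # zip_longest turns the columns into rows and fills the gaps with ""
    writer.writerows(zip_longest(*lists, fillvalue=""))

lists = []
for a in range(1, 4):
    lists.append(list(range(1, a + 1)))
    # A fresh buffer per iteration mimics reopening the file in "w" mode,
    # e.g. open("out.csv", "w", newline="")
    buffer = io.StringIO()
    write_columns(buffer, lists)

print(buffer.getvalue())
```

Rewriting the file each time costs more I/O than appending rows would, but it keeps the file valid after every iteration and needs no headers.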

Indexing columns based on cell value in pandas

I have a dataframe of race results. I'd like to create a series that takes the last stage position and subtracts that by the average of all the stages before that. Here is a small slice for the df (could have more stages, countries and rows)
race_location stage1_position stage2_position stage3_position number_of_stages
AUS 2.0 2.0 NaN 2
AUS 1.0 5.0 NaN 2
AUS 3.0 4.0 NaN 2
AUS 4.0 8.0 NaN 2
AUS 10.0 6.0 NaN 2
AUS 9.0 7.0 NaN 2
FRA 23.0 1.0 10.0 3
FRA 6.0 12.0 24.0 3
FRA 14.0 11.0 14.0 3
FRA 18.0 10.0 1.0 3
FRA 15.0 14.0 4.0 3
USA 24.0 NaN NaN 1
USA 7.0 NaN NaN 1
USA 22.0 NaN NaN 1
USA 11.0 NaN NaN 1
USA 8.0 NaN NaN 1
USA 16.0 NaN NaN 1
USA 13.0 NaN NaN 1
USA 19.0 NaN NaN 1
USA 5.0 NaN NaN 1
USA 25.0 NaN NaN 1
The output would be
last_stage_minus_average
0
4
1
4
-4
-2
-2
15
1.5
-13
-10.5
0
0
0
0
0
0
0
0
0
0
0
This won't work, but I was thinking something like this:
new_series = []
for country in country_list:
    num_stages = df.loc[df['race_location'] == country, 'number_of_stages']
    difference = df.ix[df['race_location'] == country, num_stages] - \
        df.iloc[:, 0:num_stages-1].mean(axis=1)
    new_series.append(difference)
I'm not sure how to go about doing this. Any help or direction would be amazing!

#use pandas apply to take the mean for the first n-1 stages and subtract from last stage.
df.apply(lambda x: x.iloc[x.number_of_stages]-np.mean(x.iloc[1:x.number_of_stages]),axis=1).fillna(0)
Out[264]:
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64

I'd use filter to get just the stage columns, then stack and groupby
stages = df.filter(regex=r'^stage\d+.*')
stages.stack().groupby(level=0).apply(
    lambda x: x.iloc[-1] - x.iloc[:-1].mean()
).fillna(0)
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
how it works
stack will automatically drop the NaN values when converting to a series.
Now, position -1 is the last value within each group if we group by the first level of the new MultiIndex.
So, we use a lambda and calculate the mean of everything up to the last value with x.iloc[:-1].mean()
And subtract that from the last value x.iloc[-1]
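The steps above can be traced on a three-row frame. This is a sketch with made-up rows, one per country; the explicit `dropna()` after `stack()` is my addition so the example does not depend on whether the installed pandas version drops NaN while stacking.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "race_location": ["AUS", "FRA", "USA"],
    "stage1_position": [1.0, 23.0, 24.0],
    "stage2_position": [5.0, 1.0, np.nan],
    "stage3_position": [np.nan, 10.0, np.nan],
})

stages = df.filter(regex=r"^stage\d+")

# One row per (original row, stage) pair, with missing stages removed
long = stages.stack().dropna()

# Last stage minus the mean of the earlier stages; a racer with a single
# stage has no earlier stages, so the mean is NaN and fillna turns it into 0
result = long.groupby(level=0).apply(
    lambda x: x.iloc[-1] - x.iloc[:-1].mean()
).fillna(0)
print(result)
```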

subtracts that by the average of all the stages before that
It's not a big deal, but I'm curious: going by your description rather than your desired output, if a racer finished only one race, shouldn't their result be inf or NaN instead of 0? That would distinguish them from a racer who has already done 2 or 3 races but whose last result is exactly the same as the average of the earlier races (like racer #1 vs racers #11~20).
df_sp = df.filter(regex=r'^stage\d+.*')
df['last'] = df_sp.ffill(axis=1).iloc[:, -1]
df['mean'] = (df_sp.sum(axis=1) - df['last']) / (df['number_of_stages'] - 1)
print(df['last'] - df['mean'])
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN

What does the term "broadcasting" mean in Pandas documentation?

I'm reading through the Pandas documentation, and the term "broadcasting" is used extensively, but never really defined or explained.
What does it mean?

The term broadcasting comes from numpy. Simply put, it describes the rules for the output that results when you perform operations between n-dimensional arrays (could be panels, dataframes, series) or scalar values.
Broadcasting using a scalar value
So the simplest case is just multiplying by a scalar value:
In [4]:
s = pd.Series(np.arange(5))
s
Out[4]:
0 0
1 1
2 2
3 3
4 4
dtype: int32
In [5]:
s * 10
Out[5]:
0 0
1 10
2 20
3 30
4 40
dtype: int32
and we get the same expected results with a dataframe:
In [6]:
df = pd.DataFrame({'a':np.random.randn(4), 'b':np.random.randn(4)})
df
Out[6]:
a b
0 0.216920 0.652193
1 0.968969 0.033369
2 0.637784 0.856836
3 -2.303556 0.426238
In [7]:
df * 10
Out[7]:
a b
0 2.169204 6.521925
1 9.689690 0.333695
2 6.377839 8.568362
3 -23.035557 4.262381
So what is technically happening here is that the scalar value has been broadcast along the same dimensions of the Series and DataFrame above.
Broadcasting using a 1-D array
Say we have a 2-D dataframe of shape 4 x 3 (4 rows x 3 columns). We can perform an operation along the x-axis by using a 1-D Series whose length equals the row length (the number of columns):
In [8]:
df = pd.DataFrame({'a':np.random.randn(4), 'b':np.random.randn(4), 'c':np.random.randn(4)})
df
Out[8]:
a b c
0 0.122073 -1.178127 -1.531254
1 0.011346 -0.747583 -1.967079
2 -0.019716 -0.235676 1.419547
3 0.215847 1.112350 0.659432
In [26]:
df.iloc[0]
Out[26]:
a 0.122073
b -1.178127
c -1.531254
Name: 0, dtype: float64
In [27]:
df + df.iloc[0]
Out[27]:
a b c
0 0.244146 -2.356254 -3.062507
1 0.133419 -1.925710 -3.498333
2 0.102357 -1.413803 -0.111707
3 0.337920 -0.065777 -0.871822
The above looks funny at first until you understand what is happening: I took the first row of values and added it row-wise to the df. (The original answer illustrated this with a diagram sourced from scipy.)
The general rule is this:
In order to broadcast, the size of the trailing axes for both arrays
in an operation must either be the same size or one of them must be
one.
So if I try to add a 1-D array that doesn't match in length, say one with 4 elements, numpy would raise a ValueError, whereas in Pandas you get a df full of NaN values:
In [30]:
df + pd.Series(np.arange(4))
Out[30]:
a b c 0 1 2 3
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
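The trailing-axes rule quoted above can be checked directly in NumPy, where no label alignment is involved. This is a minimal sketch with made-up arrays; the variable names are only for illustration.

```python
import numpy as np

a = np.zeros((4, 3))

# Trailing axes match: (4, 3) + (3,) -> (4, 3)
same_trailing = a + np.arange(3)

# One of the axes has size 1: (4, 3) + (4, 1) -> (4, 3)
size_one_axis = a + np.arange(4).reshape(4, 1)

# Trailing axes differ (3 vs 4) and neither is 1: NumPy raises ValueError
try:
    a + np.arange(4)
    mismatch_raises = False
except ValueError:
    mismatch_raises = True

print(same_trailing.shape, size_one_axis.shape, mismatch_raises)
```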
One of the great things about pandas is that it will try to align using existing column names and row labels, but this can get in the way of trying to perform fancier broadcasting like this:
In [55]:
df[['a']] + df.iloc[0]
Out[55]:
a b c
0 0.244146 NaN NaN
1 0.133419 NaN NaN
2 0.102357 NaN NaN
3 0.337920 NaN NaN
In the above I use double subscripting to force the shape to be (4, 1), but we see a problem when trying to broadcast using the first row: the column alignment only aligns on the first column. To get the same form of broadcasting as numpy performs for shapes (4, 1) and (3,), we have to decompose to numpy arrays, which then become anonymous data:
In [56]:
df[['a']].values + df.iloc[0].values
Out[56]:
array([[ 0.24414608, -1.05605392, -1.4091805 ],
[ 0.13341899, -1.166781 , -1.51990758],
[ 0.10235701, -1.19784299, -1.55096957],
[ 0.33792013, -0.96227987, -1.31540645]])
It's also possible to broadcast in 3 dimensions. I don't go near that stuff often, but the numpy, scipy and pandas books have examples that show how that works.
Generally speaking, the thing to remember is that, aside from scalar values which are simple, for n-D arrays the minor/trailing axes lengths must match or one of them must be 1.
Update
It seems that the above now leads to ValueError: Unable to coerce to Series, length must be 1: given 3 in the latest version of pandas (0.20.2),
so you have to call .values on the df first:
In[42]:
df[['a']].values + df.iloc[0].values
Out[42]:
array([[ 0.244146, -1.056054, -1.409181],
[ 0.133419, -1.166781, -1.519908],
[ 0.102357, -1.197843, -1.55097 ],
[ 0.33792 , -0.96228 , -1.315407]])
To restore this back to the original df we can construct a df from the np array and pass the original columns in the args to the constructor:
In[43]:
pd.DataFrame(df[['a']].values + df.iloc[0].values, columns=df.columns)
Out[43]:
a b c
0 0.244146 -1.056054 -1.409181
1 0.133419 -1.166781 -1.519908
2 0.102357 -1.197843 -1.550970
3 0.337920 -0.962280 -1.315407
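A related case that needs no detour through numpy arrays is broadcasting a column down the rows. The arithmetic methods such as `DataFrame.add` accept an `axis` argument that chooses which labels the Series is aligned on. This is a sketch with made-up numbers and computes something different from the outer sum above: it adds column 'a' to every column.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [10.0, 20.0], "c": [100.0, 200.0]})

# Default: the Series index is aligned on the column labels (row-wise broadcast)
by_row = df + df.iloc[0]

# axis=0: the Series index is aligned on the row labels (column-wise broadcast)
by_col = df.add(df["a"], axis=0)

print(by_row)
print(by_col)
```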

Broadcasting on Pandas DataFrames with MultiIndex
Broadcasting is especially interesting with DataFrames that have a pandas.MultiIndex, as the following example shows.
Pandas makes it possible to broadcast over the dimensions added via a multidimensional and even hierarchical index, and this is very powerful if you know how to use it. You don't need to code your own loops and conditions; you can rely on what already works.
I filled two pandas.DataFrames, af and bf, each with a pandas.MultiIndex on the 0-axis (the index) and 10 columns labeled with integers, referring for example to scenario data from a Monte Carlo simulation.
The pandas.MultiIndexes of af and bf share some level names (I call them dimensions). Not all labels (newer pandas versions call them codes) need to be present in the matching dimensions. In the example, the dimensions 'a' and 'c' are shared. In both frames the 'a' dimension has the entries (labels) ['A', 'B'], whereas in the 'c' dimension the frames af and bf have the entries [0, 1, 2, 3] and [0, 1, 2] respectively.
Nonetheless, broadcasting works fine. In the following example, multiplying the two frames performs a group-wise multiplication for each group with matching entries in the matching dimensions.
The example shows broadcasting on multiplication, but it works for all binary operations between pandas.DataFrames on the left- and right-hand side.
Some observations
Note that both frames can have additional dimensions. It is not necessary that one set of names is a subset of the other. In the example we have ['a', 'b', 'c'] and ['a', 'c', 'd'] for the af and bf frames respectively.
The result spans the whole space, as expected: ['a', 'b', 'c', 'd'].
Since dimension 'c' does not have the entry (code) 3 in frame bf, whereas af does, the result fills the corresponding block with NaNs.
Note that pandas 1.0.3 was used here. Broadcasting with more than one overlapping dimension did not work with pandas version 0.23.4.
Broadcasting over the 0-axis and the 1-axis at the same time also works; see the last two examples, for instance multiplying af with only bf[0].to_frame(), the first scenario. It is applied only to the equally labeled columns (as broadcasting is intended).
Further Hints
If you want to multiply the af frame with a column vector (I need to apply some weights sometimes with additional dimensions), then you can easily implement it yourself. You can expand your dataframe to n = af.shape[1] columns and use then that one for multiplication. Have a look at numpy.tile on how to do it 'without' coding.
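The numpy.tile hint can be sketched as follows. The small frame `af`, the weight Series `w` and the name `tiled` are made up for illustration, and a flat index is used instead of a MultiIndex to keep the example short. For comparison, the last line uses `DataFrame.mul` with `axis=0`, which I believe gives the same result in this simple case without building the expanded frame.

```python
import numpy as np
import pandas as pd

af = pd.DataFrame(2.0, index=pd.Index(["A", "B"], name="a"), columns=range(3))
w = pd.Series([0.5, 2.0], index=af.index)   # one weight per row

# Expand the column vector to n = af.shape[1] identical columns
tiled = pd.DataFrame(
    np.tile(w.to_numpy()[:, None], (1, af.shape[1])),
    index=af.index,
    columns=af.columns,
)
result = af * tiled

# Same weights applied by aligning the Series on the row labels
same = af.mul(w, axis=0)
print(result)
```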
>>> af
Values 0 1 2 3 4 5 6 7 8 9
a b c
A a 0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
3 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
b 0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
3 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
c 0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
3 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
B a 0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
3 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
b 0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
3 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
c 0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
3 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
>>> bf
Values 0 1 2 3 4 5 6 7 8 9
a c d
A 0 * 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
# 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
1 * 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
# 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
2 * 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
# 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
B 0 * 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
# 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
1 * 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
# 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
2 * 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
# 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> af * bf
Values 0 1 2 3 4 5 6 7 8 9
a c b d
A 0 a * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
b * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
c * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
1 a * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
b * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
c * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
2 a * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
b * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
c * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
3 a NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
B 0 a * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
b * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
c * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
1 a * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
b * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
c * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
2 a * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
b * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
c * 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
# 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
3 a NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
>>> af * bf[0] # Raises Error: ValueError: cannot join with no overlapping index names
# Removed that part
>>> af * bf[0].to_frame() # works consistently
0 1 2 3 4 5 6 7 8 9
a c b d
A 0 a * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
b * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
c * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 a * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
b * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
c * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 a * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
b * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
c * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 a NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
B 0 a * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
b * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
c * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 a * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
b * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
c * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 a * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
b * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
c * 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 a NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
>>> cf = bf[0].to_frame()
>>> cf.columns = [3]
>>> af * cf # And as expected we can broadcast over the same column labels at the same time
0 1 2 3 4 5 6 7 8 9
a c b d
A 0 a * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
b * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
c * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
1 a * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
b * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
c * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
2 a * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
b * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
c * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
3 a NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
B 0 a * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
b * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
c * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
1 a * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
b * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
c * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
2 a * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
b * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
c * NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
# NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN
3 a NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
