I want to multiply 2 columns (A*B) in a DataFrame where columns are pd.MultiIndex.
I want to perform this multiplication for each DataX (Data1, Data2, ...) column in columns level=0.
df = pd.DataFrame(data= np.arange(32).reshape(8,4),
columns = pd.MultiIndex.from_product(iterables = [["Data1","Data2"],["A","B"]]))
Data1 Data2
A B A B
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
5 20 21 22 23
6 24 25 26 27
7 28 29 30 31
The result of the multiplication should also be a DataFrame with columns=pd.MultiIndex (see below).
Data1 Data2 Data1 Data2
A B A B A*B A*B
0 0 1 2 3 0 6
1 4 5 6 7 20 42
2 8 9 10 11 72 110
3 12 13 14 15 156 210
4 16 17 18 19 272 342
5 20 21 22 23 420 506
6 24 25 26 27 600 702
7 28 29 30 31 812 930
I managed to perform this multiplication by iterating over columns at level=0, but I'm looking for a better way to do it.
for _ in df.columns.get_level_values(level=0).unique().tolist()[:]:
df[(_, "A*B")] = df[(_, "A")] * df[(_, "B")]
Any suggestions or hints much appreciated!
Thanks
Here is another alternative using df.prod and df.join:
u = df.prod(axis=1, level=0)
u.columns = pd.MultiIndex.from_product((u.columns, ['*'.join(df.columns.levels[1])]))
out = df.join(u)
Data1 Data2 Data1 Data2
A B A B A*B A*B
0 0 1 2 3 0 6
1 4 5 6 7 20 42
2 8 9 10 11 72 110
3 12 13 14 15 156 210
4 16 17 18 19 272 342
5 20 21 22 23 420 506
6 24 25 26 27 600 702
7 28 29 30 31 812 930
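In newer pandas versions the level argument of DataFrame.prod is deprecated, so the same result can be sketched through the transpose (assuming the frame from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(32).reshape(8, 4),
                  columns=pd.MultiIndex.from_product([["Data1", "Data2"], ["A", "B"]]))

# Product within each level-0 group: transpose, group the row MultiIndex, transpose back
u = df.T.groupby(level=0).prod().T
u.columns = pd.MultiIndex.from_product([u.columns, ["A*B"]])
out = df.join(u)
```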
Slice out the 'A' and 'B' along the first level of the columns Index. Then you can multiply, which will align on the 0th level ('Data1', 'Data2'). We'll then re-create the MultiIndex on the columns and join back.
df1 = df.xs('A', axis=1, level=1).multiply(df.xs('B', axis=1, level=1))
df1.columns = pd.MultiIndex.from_product([df1.columns, ['A*B']])
df = pd.concat([df, df1], axis=1)
Here are some timings assuming you have 2 groups (Data1, Data2) and your DataFrame just gets longer. Turns out, the simple loop might be the fastest of them all. (I added some sorting and needed to copy them all so the output is the same).
import perfplot
import pandas as pd
import numpy as np
##Tom
def simple_loop(df):
for _ in df.columns.get_level_values(level=0).unique().tolist()[:]:
df[(_, "A*B")] = df[(_, "A")] * df[(_, "B")]
return df.sort_index(axis=1)
##Roy2012
def mul_with_stack(df):
df = df.stack(level=0)
df["A*B"] = df.A * df.B
return df.stack().swaplevel().unstack(level=[2,1]).sort_index(axis=1)
##Alollz
def xs_concat(df):
df1 = df.xs('A', axis=1, level=1).multiply(df.xs('B', axis=1, level=1))
df1.columns = pd.MultiIndex.from_product([df1.columns, ['A*B']])
return pd.concat([df, df1], axis=1).sort_index(axis=1)
##anky
def prod_join(df):
u = df.prod(axis=1,level=0)
u.columns=pd.MultiIndex.from_product((u.columns,['*'.join(df.columns.levels[1])]))
return df.join(u).sort_index(axis=1)
perfplot.show(
setup=lambda n: pd.DataFrame(data=np.arange(4*n).reshape(n, 4),
columns =pd.MultiIndex.from_product(iterables=[["Data1", "Data2"], ["A", "B"]])),
kernels=[
lambda df: simple_loop(df.copy()),
lambda df: mul_with_stack(df.copy()),
lambda df: xs_concat(df.copy()),
lambda df: prod_join(df.copy())
],
labels=['simple_loop', 'stack_and_multiply', 'xs_concat', 'prod_join'],
n_range=[2 ** k for k in range(3, 20)],
equality_check=np.allclose,
xlabel="len(df)"
)
Here's a way to do it with stack and unstack. The advantage: fully vectorized, no loops, no join operations.
t = df.stack(level=0)
t["A*B"] = t.A * t.B
t = t.stack().swaplevel().unstack(level=[2,1])
The output is:
Data1 Data2
A B A*B A B A*B
0 0 1 0 2 3 6
1 4 5 20 6 7 42
2 8 9 72 10 11 110
3 12 13 156 14 15 210
4 16 17 272 18 19 342
Another alternative here, using prod:
df[("Data1", "A*B")] = df.loc(axis=1)["Data1"].prod(axis=1)
df[("Data2", "A*B")] = df.loc(axis=1)["Data2"].prod(axis=1)
df
Data1 Data2 Data1 Data2
A B A B A*B A*B
0 0 1 2 3 0 6
1 4 5 6 7 20 42
2 8 9 10 11 72 110
3 12 13 14 15 156 210
4 16 17 18 19 272 342
5 20 21 22 23 420 506
6 24 25 26 27 600 702
7 28 29 30 31 812 930
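If there are more DataX groups than just two, the two hard-coded assignments above can be generalized with a short loop (a sketch over the example frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(32).reshape(8, 4),
                  columns=pd.MultiIndex.from_product([["Data1", "Data2"], ["A", "B"]]))

# One prod per level-0 group instead of hard-coding Data1 and Data2
for grp in df.columns.get_level_values(0).unique():
    df[(grp, "A*B")] = df.loc(axis=1)[grp].prod(axis=1)
```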
Related
Given a df
a
0 1
1 2
2 1
3 7
4 10
5 11
6 21
7 22
8 26
9 51
10 56
11 83
12 82
13 85
14 90
I would like to drop rows if the value in column a is not within these multiple ranges:
(10-15), (25-30), (50-55), (80-85). These ranges are built from `lbot` and `ltop`:
lbot =[10, 25, 50, 80]
ltop=[15, 30, 55, 85]
I am thinking this can be achieved via pandas isin:
df[df['a'].isin(list(zip(lbot,ltop)))]
But it returns an empty df instead.
The expected output is
a
10
11
26
51
83
82
85
You can use numpy broadcasting to create a boolean mask where, for each row, it returns True if the value is within any of the ranges, and filter df with it:
out = df[((df[['a']].to_numpy() >=lbot) & (df[['a']].to_numpy() <=ltop)).any(axis=1)]
Output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
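As a self-contained sketch of that broadcasting step — the column vector of shape (n, 1) is compared against the two length-4 bound lists at once:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 7, 10, 11, 21, 22, 26, 51, 56, 83, 82, 85, 90]})
lbot = [10, 25, 50, 80]
ltop = [15, 30, 55, 85]

vals = df[["a"]].to_numpy()                           # shape (n, 1)
mask = ((vals >= lbot) & (vals <= ltop)).any(axis=1)  # broadcasts to (n, 4), any() per row
out = df[mask]
```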
Create the values in a flattened list comprehension with range:
df = df[df['a'].isin([z for x, y in zip(lbot,ltop) for z in range(x, y+1)])]
print (df)
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
Or use np.concatenate for a flattened list of ranges:
df = df[df['a'].isin(np.concatenate([range(x, y+1) for x, y in zip(lbot,ltop)]))]
A method that uses between():
df[pd.concat([df['a'].between(x, y) for x,y in zip(lbot, ltop)], axis=1).any(axis=1)]
output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
If your values in the two lists are sorted, a method that doesn't require any loop would be to use pandas.cut and checking that you obtain the same group cutting on the two lists:
# group based on lower bound
id1 = pd.cut(df['a'], bins=lbot+[float('inf')], labels=range(len(lbot)),
right=False) # include lower bound
# group based on upper bound
id2 = pd.cut(df['a'], bins=[0]+ltop, labels=range(len(ltop)))
# ensure groups are identical
df[id1.eq(id2)]
output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
intermediate groups:
a id1 id2
0 1 NaN 0
1 2 NaN 0
2 1 NaN 0
3 7 NaN 0
4 10 0 0
5 11 0 0
6 21 0 1
7 22 0 1
8 26 1 1
9 51 2 2
10 56 2 3
11 83 3 3
12 82 3 3
13 85 3 3
14 90 3 NaN
Assume I have two dataframes:
df1: 4 columns, n lines
df2: 50 columns, n lines
What is the best way to calculate the difference of each column of df1 to all columns of df2?
My only idea up to now is to merge the tables and create 4*50 new columns with the differences, as a loop. But there has to be a better way, right?
Thanks already! Paul
For this I have created 2 fictive dataframes:
Input Dataframes
df1 = pd.DataFrame({"a":[1,1,1],
"b":[2,2,2],
})
df2 = pd.DataFrame({"aa":[10,10,10],
"bb":[20,20,20],
"cc":[30,30,30],
"dd":[40,40,40],
"ee":[50,50,50]
})
print(df1)
a b
0 1 2
1 1 2
2 1 2
print(df2)
aa bb cc dd ee
0 10 20 30 40 50
1 10 20 30 40 50
2 10 20 30 40 50
Solution
df = pd.concat([df2.sub(df1[i], axis=0) for i in df1.columns],axis =1)
df.columns= [i for i in range(df1.shape[1]*df2.shape[1])]
df
Result
0 1 2 3 4 5 6 7 8 9
0 9 19 29 39 49 8 18 28 38 48
1 9 19 29 39 49 8 18 28 38 48
2 9 19 29 39 49 8 18 28 38 48
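If you want to keep track of which df1 column produced each block of differences, one variant (a sketch on the same fictive frames) labels the result with a MultiIndex instead of renumbering the columns:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 1, 1], "b": [2, 2, 2]})
df2 = pd.DataFrame({"aa": [10, 10, 10], "bb": [20, 20, 20], "cc": [30, 30, 30],
                    "dd": [40, 40, 40], "ee": [50, 50, 50]})

# concat with a dict of frames puts the df1 column name on level 0 of the result
out = pd.concat({c: df2.sub(df1[c], axis=0) for c in df1.columns}, axis=1)
```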
I have a dataset in the following format. It has 48 columns and about 200000 rows.
slot1,slot2,slot3,slot4,slot5,slot6...,slot45,slot46,slot47,slot48
1,2,3,4,5,6,7,......,45,46,47,48
3.5,5.2,2,5.6,...............
I want to reshape this dataset to something as below, where N is less than 48 (maybe 24 or 12, etc.). Column headers don't matter.
when N = 4
slotNew1,slotNew2,slotNew3,slotNew4
1,2,3,4
5,6,7,8
......
45,46,47,48
3.5,5.2,2,5.6
............
I can read row by row, split each row, and append to a new dataframe, but that is very inefficient. Is there a more efficient and faster way to do this?
You may try this:
N = 4
df_new = pd.DataFrame(df_original.values.reshape(-1, N))
df_new.columns = ['slotNew{:}'.format(i + 1) for i in range(N)]
The code extracts the data into a numpy.ndarray, reshapes it, and creates a new dataset of the desired dimension.
Example:
import numpy as np
import pandas as pd
df0 = pd.DataFrame(np.arange(48 * 3).reshape(-1, 48))
df0.columns = ['slot{:}'.format(i + 1) for i in range(48)]
print(df0)
# slot1 slot2 slot3 slot4 ... slot45 slot46 slot47 slot48
# 0 0 1 2 3 ... 44 45 46 47
# 1 48 49 50 51 ... 92 93 94 95
# 2 96 97 98 99 ... 140 141 142 143
#
# [3 rows x 48 columns]
N = 4
df = pd.DataFrame(df0.values.reshape(-1, N))
df.columns = ['slotNew{:}'.format(i + 1) for i in range(N)]
print(df.head())
# slotNew1 slotNew2 slotNew3 slotNew4
# 0 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
# 3 12 13 14 15
# 4 16 17 18 19
Another approach
N = 4
df1 = df0.stack().reset_index()
df1['i'] = df1['level_1'].str.replace('slot', '').astype(int) // N
df1['j'] = df1['level_1'].str.replace('slot', '').astype(int) % N
df1['i'] -= (df1['j'] == 0) - df1['level_0'] * 48 / N
df1['j'] += (df1['j'] == 0) * N
df1['j'] = 'slotNew' + df1['j'].astype(str)
df1 = df1[['i', 'j', 0]]
df = df1.pivot(index='i', columns='j', values=0)
Use pandas.explode after making chunks. Given df:
import pandas as pd
df = pd.DataFrame([np.arange(1, 49)], columns=['slot%s' % i for i in range(1, 49)])
print(df)
slot1 slot2 slot3 slot4 slot5 slot6 slot7 slot8 slot9 slot10 ... \
0 1 2 3 4 5 6 7 8 9 10 ...
slot39 slot40 slot41 slot42 slot43 slot44 slot45 slot46 slot47 \
0 39 40 41 42 43 44 45 46 47
slot48
0 48
Using chunks to divide:
def chunks(l, n):
"""Yield successive n-sized chunks from l.
Source: https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
"""
n_items = len(l)
if n_items % n:
n_pads = n - n_items % n
else:
n_pads = 0
l = l + [np.nan for _ in range(n_pads)]
for i in range(0, len(l), n):
yield l[i:i + n]
N = 4
new_df = pd.DataFrame(list(df.apply(lambda x: list(chunks(list(x), N)), 1).explode()))
print(new_df)
Output:
0 1 2 3
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
4 17 18 19 20
...
The advantage of this approach over numpy.reshape is that it can handle the case where N is not a factor of the number of columns:
N = 7
new_df = pd.DataFrame(list(df.apply(lambda x: list(chunks(list(x), N)), 1).explode()))
print(new_df)
Output:
0 1 2 3 4 5 6
0 1 2 3 4 5 6 7.0
1 8 9 10 11 12 13 14.0
2 15 16 17 18 19 20 21.0
3 22 23 24 25 26 27 28.0
4 29 30 31 32 33 34 35.0
5 36 37 38 39 40 41 42.0
6 43 44 45 46 47 48 NaN
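A vectorized alternative for the non-factor case, sketched with numpy.pad to fill out the last row with NaN before reshaping (the names here follow the example above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([np.arange(1, 49)], columns=['slot%s' % i for i in range(1, 49)])
N = 7

arr = df.to_numpy().ravel().astype(float)
pad = (-arr.size) % N                       # NaNs needed to reach a multiple of N
arr = np.pad(arr, (0, pad), constant_values=np.nan)
new_df = pd.DataFrame(arr.reshape(-1, N))
```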
I have created a Pandas DataFrame. I need to create a RangeIndex for the DataFrame that corresponds to the frame -
RangeIndex(start=0, stop=x, step=y) - where x and y relate to my DataFrame.
I've not seen an example of how to do this - is there a method or syntax specific to this?
thanks
It seems you need RangeIndex constructor:
df = pd.DataFrame({'A' : range(1, 21)})
print (df)
A
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
13 14
14 15
15 16
16 17
17 18
18 19
19 20
print (df.index)
RangeIndex(start=0, stop=20, step=1)
df.index = pd.RangeIndex(start=0, stop=99, step=5)
print (df)
A
0 1
5 2
10 3
15 4
20 5
25 6
30 7
35 8
40 9
45 10
50 11
55 12
60 13
65 14
70 15
75 16
80 17
85 18
90 19
95 20
print (df.index)
RangeIndex(start=0, stop=99, step=5)
A more dynamic solution:
step = 10
df.index = pd.RangeIndex(start=0, stop=len(df.index) * step - 1, step=step)
print (df)
A
0 1
10 2
20 3
30 4
40 5
50 6
60 7
70 8
80 9
90 10
100 11
110 12
120 13
130 14
140 15
150 16
160 17
170 18
180 19
190 20
print (df.index)
RangeIndex(start=0, stop=199, step=10)
EDIT:
As @ZakS pointed out in the comments, it is better to use only the DataFrame constructor:
df = pd.DataFrame({'A' : range(1, 21)}, index=pd.RangeIndex(start=0, stop=99, step=5))
print (df)
A
0 1
5 2
10 3
15 4
20 5
25 6
30 7
35 8
40 9
45 10
50 11
55 12
60 13
65 14
70 15
75 16
80 17
85 18
90 19
95 20
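Since the stop of a RangeIndex is exclusive, a slightly simpler way to size the dynamic version is len(df) * step (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'A': range(1, 21)})
step = 5
# stop is exclusive, so len(df) * step yields exactly len(df) labels: 0, 5, ..., 95
df.index = pd.RangeIndex(start=0, stop=len(df) * step, step=step)
```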
If I have a dataframe that has columns that include the same name, is there a way to combine the columns that have the same name with some sort of function (i.e. sum)?
For instance with:
In [186]:
df["NY-WEB01"].head()
Out[186]:
NY-WEB01 NY-WEB01
DateTime
2012-10-18 16:00:00 5.6 2.8
2012-10-18 17:00:00 18.6 12.0
2012-10-18 18:00:00 18.4 12.0
2012-10-18 19:00:00 18.2 12.0
2012-10-18 20:00:00 19.2 12.0
How might I collapse the NY-WEB01 columns (there are a bunch of duplicate columns, not just NY-WEB01) by summing each row where the column name is the same?
I believe this does what you are after:
df.groupby(lambda x:x, axis=1).sum()
Alternatively, between 3% and 15% faster depending on the length of the df:
df.groupby(df.columns, axis=1).sum()
EDIT: To extend this beyond sums, use .agg() (short for .aggregate()):
df.groupby(df.columns, axis=1).agg(numpy.max)
pandas >= 0.20: df.groupby(level=0, axis=1)
You don't need a lambda here, nor do you explicitly have to query df.columns; groupby accepts a level argument you can specify in conjunction with the axis argument. This is cleaner, IMO.
# Setup
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
df
A A B B B
0 44 47 0 3 3
1 39 9 19 21 36
2 23 6 24 24 12
3 1 38 39 23 46
4 24 17 37 25 13
df.groupby(level=0, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
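Note that groupby(..., axis=1) is deprecated in newer pandas; assuming pandas 2.x, an equivalent sketch works through the transpose:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))

# Same result as df.groupby(level=0, axis=1).sum(), but via the transpose
out = df.T.groupby(level=0).sum().T
```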
Handling MultiIndex columns
Another case to consider is when dealing with MultiIndex columns. Consider
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
df
one two
A A B B B
0 44 47 0 3 3
1 39 9 19 21 36
2 23 6 24 24 12
3 1 38 39 23 46
4 24 17 37 25 13
To perform aggregation across the upper levels, use
df.groupby(level=1, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
or, if aggregating per upper level only, use
df.groupby(level=[0, 1], axis=1).sum()
one two
A B B
0 91 0 6
1 48 19 57
2 29 24 36
3 39 39 69
4 41 37 38
Alternate Interpretation: Dropping Duplicate Columns
If you came here looking to find out how to simply drop duplicate columns (without performing any aggregation), use Index.duplicated:
df.loc[:,~df.columns.duplicated()]
A B
0 44 0
1 39 19
2 23 24
3 1 39
4 24 37
Or, to keep the last ones, specify keep='last' (default is 'first'),
df.loc[:,~df.columns.duplicated(keep='last')]
A B
0 47 3
1 9 36
2 6 12
3 38 46
4 17 13
The groupby alternatives for the two solutions above are df.groupby(level=0, axis=1).first(), and ... .last(), respectively.
Here is a possibly simpler solution for common aggregation functions like sum, mean, median, max, min, std - just use the parameter axis=1 to work with columns, together with level:
#coldspeed samples
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
print (df)
print (df.sum(axis=1, level=0))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
print (df.sum(axis=1, level=1))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
print (df.sum(axis=1, level=[0,1]))
one two
A B B
0 91 0 6
1 48 19 57
2 29 24 36
3 39 39 69
4 41 37 38
Similarly it works for the index; then use axis=0 instead of axis=1:
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('ABCDE'), index=list('aabbc'))
print (df)
A B C D E
a 44 47 0 3 3
a 39 9 19 21 36
b 23 6 24 24 12
b 1 38 39 23 46
c 24 17 37 25 13
print (df.min(axis=0, level=0))
A B C D E
a 39 9 0 3 3
b 1 6 24 23 12
c 24 17 37 25 13
df.index = pd.MultiIndex.from_arrays([['bar']*3 + ['foo']*2, df.index])
print (df.mean(axis=0, level=1))
A B C D E
a 41.5 28.0 9.5 12.0 19.5
b 12.0 22.0 31.5 23.5 29.0
c 24.0 17.0 37.0 25.0 13.0
print (df.max(axis=0, level=[0,1]))
A B C D E
bar a 44 47 19 21 36
b 23 6 24 24 12
foo b 1 38 39 23 46
c 24 17 37 25 13
If you need other functions like first, last, size, or count, it is necessary to use coldspeed's answer.
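For example, collapsing duplicated index labels with first can be sketched via groupby on the index:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('ABCDE'), index=list('aabbc'))

# groupby on the index (level=0) supports first, last, size, count, ...
out = df.groupby(level=0).first()
```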